Our Android testing process

Jun 22, 2022

Our testing strategy, how we integrate it into our development workflow, and what the future holds for testing at Headspace.

By John Qualls, Senior Software Engineer, Greg Rami, Principal Software Engineer, and Anton Gerasimenko, Software Engineer

Introduction

When we took on the daunting task of rewriting our entire app at the end of 2019, one of our key focus areas was testability. At the time, less than 20% of our code was covered by unit tests, we had no integration or end-to-end (E2E) tests, and adding any tests to the codebase was a big effort. From the beginning, we agreed on the standard of at least 80% unit test coverage on all Pull Requests (PRs), E2E tests for critical flows, and an architecture focused on testability.

The result is a more robust codebase. Today, we catch critical bugs with our test suite before they reach our users, and, just as importantly, adding new tests is now trivial and fast.

Let’s detail our testing strategy, how we integrate it into our development workflow, and what the future holds for testing at Headspace.

Testing Strategy

Shortly after the rewrite, we had hundreds of unit tests and around twenty E2E tests. We wanted to limit E2E tests to the use cases where it makes sense to take on the additional time and cost in exchange for the reduced risk.

Test selection pyramid for cost and speed

Unit Tests

Our unit tests are straightforward. They test a very small unit of code by relying on JUnit 5 and different layers of mocking:

  • MockK to mock all the dependencies that are unrelated to the current test. This is made easier by following good architecture principles such as the Single Responsibility Principle (SRP) and Dependency Injection (DI).
  • The Java Faker library, which provides “fakes” for primitive values that do not need a specific value.
  • Similarly, our own “model Fakers” that provide domain objects. That way, when the underlying classes change, we only make modifications in the Fakers.
  • The AAA (Arrange, Act, Assert) pattern to keep our tests organized and consistent; a sketch of a typical test follows this list.
  • Robolectric for the handful of unit tests that go through some Android-specific code.
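
To make this concrete, here is a minimal sketch of what one of these unit tests can look like. The `UserRepository`, `GetUserNameUseCase`, and `FakeUsers` types are hypothetical stand-ins, defined inline so the example is self-contained.

```kotlin
import com.github.javafaker.Faker
import io.mockk.every
import io.mockk.mockk
import io.mockk.verify
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical production types, defined inline to keep the sketch self-contained.
data class User(val displayName: String)
interface UserRepository { fun getUser(): User }
class GetUserNameUseCase(private val repository: UserRepository) {
    operator fun invoke(): String = repository.getUser().displayName
}

// Hypothetical "model Faker" that builds domain objects for tests.
object FakeUsers {
    private val faker = Faker()
    fun user(displayName: String = faker.name().fullName()) = User(displayName)
}

class GetUserNameUseCaseTest {

    // Java Faker provides primitive values whose exact content is irrelevant to the test.
    private val faker = Faker()

    // MockK replaces the dependency that is unrelated to the behavior under test.
    private val repository: UserRepository = mockk()
    private val useCase = GetUserNameUseCase(repository)

    @Test
    fun `returns the user display name`() {
        // Arrange
        val expectedName = faker.name().fullName()
        every { repository.getUser() } returns FakeUsers.user(displayName = expectedName)

        // Act
        val result = useCase()

        // Assert
        assertEquals(expectedName, result)
        verify(exactly = 1) { repository.getUser() }
    }
}
```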

Our unit tests are very reliable and run in less than 5 minutes for more than 4,600 tests (at the time of writing).

E2E Tests

Our E2E tests validate the UI in our app by running instrumentation tests on physical devices in Firebase Test Lab (FTL). While this worked really well at catching UI regressions, we soon realized that this approach was not scalable as the number of tests increased. The reason is flakiness.

There is network flakiness, where requests to REST APIs fail (we run our E2E tests against our staging environment), and device flakiness, where the physical device gets into random, unexpected states outside of our control.

The more tests we added, the more chances for a CI pipeline failure to occur, which slows down developer productivity by blocking pull requests.

The knee-jerk reaction was to simply disable the PR check and resolve the failed tests later. The problem with this approach was that developers started to lose confidence in these tests and stopped paying attention to them when merging in new code changes, which defeated the purpose of writing tests in the first place.

Integration Tests

We realized this wasn’t working, so we decided to introduce UI integration tests. These solved our primary issue of network flakiness by mocking out the network completely with OkHttp’s MockWebServer, which serves fake JSON responses. We started to see fast and consistent test results that could reliably gate our PRs.
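
As a rough sketch of the idea, here is what such a test can look like. `HomeActivity` and the JSON payload are hypothetical, and the example assumes the app’s API base URL points at `http://localhost:8080` in the test build (for example through a test-only DI module).

```kotlin
import androidx.test.core.app.ActivityScenario
import okhttp3.mockwebserver.MockResponse
import okhttp3.mockwebserver.MockWebServer
import org.junit.After
import org.junit.Before
import org.junit.Test

// Minimal sketch of a UI integration test backed by MockWebServer.
class HomeScreenIntegrationTest {

    private val mockWebServer = MockWebServer()

    @Before
    fun setUp() {
        // The app's networking layer is assumed to target localhost:8080 in test builds.
        mockWebServer.start(8080)
    }

    @After
    fun tearDown() {
        mockWebServer.shutdown()
    }

    @Test
    fun greetingFromFakeJsonIsDisplayed() {
        // Serve a canned JSON response instead of hitting the staging API.
        mockWebServer.enqueue(
            MockResponse()
                .setResponseCode(200)
                .setBody("""{"greeting": "Good morning"}""")
        )

        // Launch the screen only after the fake response is queued.
        ActivityScenario.launch(HomeActivity::class.java).use {
            // Espresso assertions against the UI rendered from the fake response go here,
            // e.g. onView(withText("Good morning")).check(matches(isDisplayed()))
        }
    }
}
```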

CI Pipeline and Reporting

Overall Setup

We want our tests to add confidence without slowing us down. As such, we have different rules for when tests are run.

Our unit tests are simple: they run on every commit.

Our UI tests are heavier to run, so we do not run all of them for every commit. Instead, we break them down into categories based on how critical the part of the app they cover is. We have three categories:

  1. Smoke — runs on every developer’s commit in our CI pipeline.
  2. Regression — runs at 12pm every day as a separate CI job.
  3. Minimum Acceptance Tests (MAT) — runs at 4pm every day as a separate CI job.

In the app, we added these categories as annotations so we can annotate each test method and control when it is executed in our CI pipeline. Then we tell Firebase Test Lab (FTL) to filter by the test’s annotation via the CLI’s --test-targets argument, like so:
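
As a rough illustration of the idea (the annotation names, the sample test, and the package in the --test-targets comment are all hypothetical), the setup looks something like this:

```kotlin
import org.junit.Test

// Marker annotations for the test categories. RUNTIME retention lets the
// instrumentation runner filter on them.
@Target(AnnotationTarget.FUNCTION, AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Smoke

@Target(AnnotationTarget.FUNCTION, AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Regression

@Target(AnnotationTarget.FUNCTION, AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class MinimumAcceptanceTest

class LoginFlowTest {

    @Smoke
    @Test
    fun userCanLogIn() {
        // UI test body goes here.
    }
}

// In CI, FTL can then be asked to run a single category by its fully
// qualified annotation name, for example:
//   --test-targets "annotation com.example.test.Smoke"
```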

Beyond FTL

One of the difficult challenges that we have faced when maintaining our UI tests is keeping them in a green state.

FTL provides good enough test status reporting, logging, and video recording.

Parallel testing

We run our UI tests in parallel using Flank. This adds complexity and makes reading our test results in FTL a little more complicated.

For example, let’s say that we have three physical devices, and on each one, we run five shards with a set of tests. As a result, we have a nested structure with at least four levels: devices, shards, tests, and test details.

Test reporting

Another question is how to navigate from the CI pipeline to the FTL report. Of course, there is a way to open FTL results with a set of test matrices, but how do you determine which test run is yours if there are several?

Here’s how the links to the FTL test result matrices look from our CircleCI pipeline:

So the process to find the test of interest would be to open a CI job, copy the test matrix web link, open it on your browser, dig through the shard hierarchy, then rinse and repeat until you find the test (and hope it’s not a flaky test that just needs to be re-run). Sounds terrible, right?

It gets worse when it’s a scheduled job, because you first need to find the branch and job in your CI tool. That’s too many steps for something that should be surfaced and addressed easily, lest it slow down our developers (or worse, get ignored).

To improve this process, we started to look into the following:

  • The report should be easily available and accessible; for that, we publish reports to a dedicated Slack channel.
  • It should group all tests in a single list, whether they were run in parallel or not.
  • Bug tickets for test failures should be filed automatically in JIRA.

Here’s what our solution looks like:

CircleCI is where all our tests run. It calls FTL to execute the tests on physical devices. On completion, a Slack message displays the results, and those messages contain direct links to the Report Portal. Report Portal is an open source tool that provides extensive functionality such as displaying reports, combining them into analytic charts, and integrations with other tools (JIRA, for example).

Here’s a successful test run message in our Slack:

and a failed one:

On a failed test run, we typically follow this process:

  1. Open the detailed report:

  2. Three tests failed (one failed twice, on different devices). Let’s click on one to see what the detailed report looks like:

  3. The detailed report contains a link to FTL, which provides detailed information like logs, video, and errors:

Looking Into The Future

We are constantly focusing on ways to improve our testing strategy.

At the end of the day, we want to catch new issues and regressions as early as possible, to reduce the cost of fixing them. We try to do this without adding too much delay to our PRs and our software development life cycle (SDLC).

For example, we recently made our app accessible in compliance with WCAG, and we want to make sure no regressions are introduced so we keep our certification. One solution is screenshot testing, which would allow us to validate the UI pixel by pixel. This way, we could test many configurations quickly instead of writing many assertions with Espresso.

A few other things we’re doing improve the entire codebase ecosystem while also benefiting our testing framework:

  • We’re introducing Jetpack Compose, which will tie in nicely with screenshot testing. With Compose, you can easily isolate each composable in a UI test, provide its state with fake data, and call one line to take a screenshot (see the sketch after this list). More to come on this later on!
  • We’re also (finally) finishing our migration from Dagger to Hilt. This will allow us, among many other things, to better swap dependencies in our tests.
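
Here is a minimal sketch of that one-line capture, using captureToImage from androidx.compose.ui.test. The GreetingCard composable is a hypothetical stand-in, and the golden-image comparison itself is left out, since it depends on the screenshot library chosen (tools such as Shot or Paparazzi provide it).

```kotlin
import androidx.compose.material.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.test.captureToImage
import androidx.compose.ui.test.junit4.createComposeRule
import androidx.compose.ui.test.onRoot
import org.junit.Rule
import org.junit.Test

// Hypothetical composable under test, defined inline to keep the sketch self-contained.
@Composable
fun GreetingCard(name: String) {
    Text(text = "Good morning, $name")
}

class GreetingCardScreenshotTest {

    @get:Rule
    val composeRule = createComposeRule()

    @Test
    fun greetingCard_defaultState() {
        // Isolate a single composable and feed it fake state.
        composeRule.setContent {
            GreetingCard(name = "Andy")
        }

        // One line to capture the rendered pixels.
        val screenshot = composeRule.onRoot().captureToImage()

        // Trivial sanity check; a real screenshot test would compare `screenshot`
        // against a stored golden image using the chosen screenshot library.
        check(screenshot.width > 0 && screenshot.height > 0)
    }
}
```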

Conclusion

There’s no one-size-fits-all for testing. We believe it’s mostly about striking the right balance between having a stable app and moving with speed. However, there are ways to make your test suite easier to maintain, more insightful, and ultimately faster to work with. By keeping tests cheap (fast, low-maintenance, low-flakiness), we give our developers higher confidence in our code and let them move faster.

Or at least that’s what we strive for 😅

How about you? What’s your testing strategy? Please let us know in the comments below or join us and help us improve! https://jobs.lever.co/headspace
