A typical story of failing integration tests
Imagine this: you are a developer attempting to get a simple change from your development environment to production. You identify the problem, make the fix, create a PR, the tests run, and boom, some random, unrelated test fails.
...
You re-run the tests; it takes another 20 minutes, and ANOTHER ONE FAILS.
Why is flakiness a big problem?
Time, first of all. If your integration test suite takes 20 minutes to run, that is 20 minutes you are not spending on other problems. Could you work on another issue while you wait? It sounds excellent in theory, but in practice people are not good at juggling multiple tasks; context switches are difficult and highly draining.
If time were the only cost, it would not be so bad, but it is worse than that. As soon as you lose the flow state (whether to the 20-minute wait or to the context switch), you are far more likely to become distracted: checking emails, responding to meaningless Slack debates, and so on. I love developing because I love building things that solve real problems, and failing tests and waiting keep me from doing that.
So then, what is the solution?
First, let's agree on terminology. For this discussion, let's narrow our scope to API-based integration tests: tests that exercise the whole API stack by deploying a production-like environment and running API calls against it. Some might call these end-to-end tests, but that depends on what you are building; for a full-stack application they are integration tests, while for a standalone API they would be end-to-end tests.
The problem is that these tests are only reliable if they are hermetic, and by definition they are not: you cannot guarantee a consistent output for a given input when the network, a cloud vendor and other external dependencies are involved.
So the answer is clear, then? Remove all the integration tests and rely on unit tests instead.
Well, it is not that simple:
- Someone wrote these tests for a reason: most likely, specific integration scenarios are not covered by unit tests, which means we cannot simply delete them
- These integration tests are valuable. They allow us to verify that all the individual units compose and run together correctly. I like to think of these API-based tests as smoke tests; you need a few of them
- Sometimes, integration tests are more straightforward to create than unit tests. Developers are lazy creatures; they will take the path of least resistance
What do I do?
As mentioned above, integration tests use the network, so they will never be 100% reliable. We have to shift our thinking from 'I want to get these to 100% reliable' to 'I want them to be reasonably reliable'. There is no silver bullet here; we have to do several things.
These changes cannot be made alone; the team must recognise and prioritise the problem. The best way to get buy-in from the team is to use data to explain how bad the situation is (Step 1.1 below) and to show them the most critical issue that needs to be fixed.
We can broadly break down our actions into two categories: stabilising our existing test suite and reducing the need for it.
1 Stabilising/reliability
Stabilisation is about making the existing suite tolerable and using data to drive actionable improvements. Some tests will be flakier than others, and the right information will tell us where to focus.
1.1 Use data
Whilst 100% reliability is unobtainable, we should aim to address the tests that fail most often. The way to do this is to create failure categories. Two broad categories apply to all tests: infrastructure flake and code/test flake.
Infrastructure flake: anything that fails because of test set-up or because of an issue that can be attributed to the network. You should be able to extract this information from the test failure, e.g. an unexpected networking exception or error string. If it is bad enough, this flake can sometimes be fixed, perhaps by adding low-level retries to a shared network client library, but most of the time it cannot; it is simply the cost of using these sorts of tests. When you do add retries, keep them as low in the stack as possible; retrying at a high level in the code may trigger a lot of higher-level timeouts.
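As a rough illustration, here is a minimal sketch (in Python; the exception names, error strings and report shape are my own assumptions) of how a CI post-processing step could bucket failures into the two categories:

```python
# Minimal sketch of bucketing test failures by their error output.
# The marker strings and report format are assumptions, not a real tool.

INFRA_MARKERS = (
    "ConnectionResetError",
    "ReadTimeout",
    "503 Service Unavailable",
    "DNS resolution failed",
)

def categorise_failure(error_output: str) -> str:
    """Return 'infrastructure' or 'code/test' for a single failure."""
    if any(marker in error_output for marker in INFRA_MARKERS):
        return "infrastructure"
    return "code/test"

def summarise(failures: list[tuple[str, str]]) -> dict[str, int]:
    """failures is a list of (test_name, error_output) pairs from one CI run."""
    summary = {"infrastructure": 0, "code/test": 0}
    for _name, output in failures:
        summary[categorise_failure(output)] += 1
    return summary
```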
Code/test flake: any failure not attributed to infrastructure. It could be a problem with the integration test itself or with the underlying code; typically it is a race condition or a constraint the system is violating. These problems should at least be investigated, so it is essential to make researching them easy (see 1.3 below). You can gauge how important an issue is by how frequently the test fails or by correlating the failure reason across other tests. If a test fails more than, say, 1% of the time, it should be fixed.
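To make that 1% threshold concrete, a small sketch like the one below (assuming you can export per-test pass/fail history from your CI system) can rank the tests worth fixing first:

```python
# Sketch: rank tests by failure rate from exported CI history.
# `history` maps test name -> list of outcomes, True for pass, False for fail.
FLAKE_THRESHOLD = 0.01  # "fails more than 1% of the time"

def tests_to_fix(history: dict[str, list[bool]]) -> list[tuple[str, float]]:
    flaky = []
    for name, outcomes in history.items():
        failure_rate = outcomes.count(False) / len(outcomes)
        if failure_rate > FLAKE_THRESHOLD:
            flaky.append((name, failure_rate))
    # Worst offenders first, so the team can prioritise.
    return sorted(flaky, key=lambda pair: pair[1], reverse=True)
```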
1.2 Generate more data: periodically run your integration test suite against a stable commit
The problem with flake is that it is hard to detect: all our integration tests might pass on the merge to main even though that merge introduced a new race condition or a flaky test. When working on a distributed system, there is no way to be 100% certain that our tests or code are bug-free; as mentioned above, we have to fall back on statistics. The best way to find out what is failing is to run the test suite on a timer, perhaps once an hour, against the main branch. This gives us additional data to help prioritise fixes.
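How you schedule this depends on your CI system; a cron trigger pointing at a script like the following sketch is usually enough. The paths, commands and results file here are assumptions, not a prescribed layout:

```python
# Sketch of an hourly job: run the integration suite against main and record the result.
# Assumes a scratch checkout with git and pytest available, and tests under tests/integration.
import json
import subprocess
import time

def run_suite_against_main(results_file: str = "flake_history.jsonl") -> None:
    subprocess.run(["git", "fetch", "origin", "main"], check=True)
    subprocess.run(["git", "checkout", "origin/main"], check=True)
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

    # Run the suite; we only care about the exit code plus a machine-readable report.
    result = subprocess.run(["pytest", "tests/integration", "--junitxml=report.xml"])

    with open(results_file, "a") as f:
        f.write(json.dumps({
            "timestamp": time.time(),
            "commit": sha,
            "passed": result.returncode == 0,
        }) + "\n")

if __name__ == "__main__":
    run_suite_against_main()
```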
1.3 Make it easy to track down issues
If a random test failure takes 30+ minutes to track down, no one will do it. This is why it is essential to trace your integration tests. When you add tracing, you can see the sequencing of tests (whether they run in parallel or serially) and how long each one takes, and, if correctly set up, your trace will include all the spans from the backend API service. Distributed tracing helps tell the story and gives clues about what might be wrong.
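If your tests happen to be in Python, one way (of many) to get this is a pytest fixture that opens a span per test with OpenTelemetry; the exporter and tracer names below are placeholders for whatever your tracing backend expects:

```python
# Sketch: wrap every test in an OpenTelemetry span so the trace shows ordering and duration.
# The console exporter is a placeholder; in CI you would export to your tracing backend.
import pytest
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("integration-tests")

@pytest.fixture(autouse=True)
def trace_each_test(request):
    # One span per test; if the API service propagates trace context,
    # its backend spans will join the same trace.
    with tracer.start_as_current_span(request.node.name):
        yield
```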
2 Reducing the need for the suite
As previously mentioned, these tests likely use components that are not 100% reliable, and because of this, we want to reduce the number of tests that depend on those unreliable components.
2.1 System design
As with most things in software engineering, it comes back to sound system design. An over-reliance on integration tests might indicate that your system lacks clear boundaries. A typical example is authorization: it is a cross-cutting concern, validated across multiple layers. A poor design or implementation might lead an engineer to create an integration test for every possible scenario so it can be exercised across every layer. A better design has a single integration test that validates authorization is enforced across the layers, plus individual unit tests that validate each scenario.
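A sketch of what that split can look like, with a hypothetical policy function and roles: the rule lives in one place and is covered exhaustively by unit tests, while a single integration test elsewhere only confirms the layers are wired together.

```python
# Hypothetical authorization policy, isolated from HTTP handlers and the database.
def can_delete_project(role: str, is_owner: bool) -> bool:
    return role == "admin" or is_owner

# Unit tests cover every scenario without touching the network.
def test_admin_can_delete_any_project():
    assert can_delete_project("admin", is_owner=False)

def test_owner_can_delete_own_project():
    assert can_delete_project("member", is_owner=True)

def test_member_cannot_delete_others_project():
    assert not can_delete_project("member", is_owner=False)

# A single integration test (not shown) then only checks that a 403 comes back
# end-to-end when the policy denies the request.
```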
2.2 Make doing the right thing easy
Developers are inherently lazy (in a good way): if creating an integration test is easier than creating a unit test, that is what they will do. It is important to have common patterns and approaches for testing, with a standard set of 'good' examples that can be pointed to in PRs to guide people towards doing the right thing WITHOUT involving the network.
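For example, a shared in-memory fake that developers can reach for by default makes the unit-test path the path of least resistance; the service and client below are hypothetical:

```python
# Hypothetical example: the service takes its client as a dependency,
# so unit tests can inject an in-memory fake instead of a real HTTP client.
class FakeBillingClient:
    def __init__(self):
        self.invoices = {}

    def create_invoice(self, customer_id: str, amount: int) -> str:
        invoice_id = f"inv_{len(self.invoices) + 1}"
        self.invoices[invoice_id] = (customer_id, amount)
        return invoice_id

class InvoiceService:
    def __init__(self, billing_client):
        self.billing = billing_client

    def bill_customer(self, customer_id: str, amount: int) -> str:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return self.billing.create_invoice(customer_id, amount)

def test_bill_customer_creates_invoice():
    service = InvoiceService(FakeBillingClient())
    assert service.bill_customer("cust_1", 500) == "inv_1"
```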
2.3 Audit existing integration tests and move them to unit tests
Based on the results of periodically running our integration tests, we should be able to see which suites are failing and why. Using this data, we should look at what each failing integration test is attempting to verify and move those scenarios into unit tests. Thanks to 2.2, we should now have a better way to test the various components in isolation. This is a slow burn; it is unlikely we will be able to tackle them all, but it will probably follow the 80/20 rule, where 20% of our tests cause 80% of the failures.
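A quick way to check whether the 80/20 shape holds for your suite is to look at the cumulative share of failures per test, something like this sketch (assuming the same exported CI history as before):

```python
# Sketch: how concentrated are the failures? (the "80/20" view)
# `failure_counts` maps test name -> number of failures over the observation window.
def cumulative_failure_share(failure_counts: dict[str, int]) -> list[tuple[str, float]]:
    total = sum(failure_counts.values()) or 1
    ranked = sorted(failure_counts.items(), key=lambda item: item[1], reverse=True)
    shares, running = [], 0
    for name, count in ranked:
        running += count
        shares.append((name, running / total))
    # If ~20% of the entries already reach ~0.8, focus the audit on those tests.
    return shares
```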
What not to do?
Do not apply automatic test retries. It seems logical to retry a failing test, but doing so creates a few unintended problems:
- It masks genuine failures. People only pay attention to problems when they see a big red X or the build breaks. If we silently retry, these failing tests will never get the attention they deserve
- Tests that fail 100% of the time will still be retried, which increases the time it takes to get a failing result
Summary
Implementing these six steps can dramatically improve integration test reliability. The steps are challenging; they require buy-in and constant attention from the engineering team. I believe that once your team starts seeing far more successes than failures, people will once again start caring about the integration test suite.