Continuous Integration at Google

Software Engineering at Google chapter 23: Continuous Integration.
Written by Rachel Tannenbaum. Edited by Lisa Carey.

I recently started working on CI infra. This chapter from "Software Engineering at Google" is the best material I've read for learning both the fundamental concepts and various tips on scaling the system in a large organization. It is worth coming back to re-read these notes and compare them with other companies' CI infra at different scales. This will help me think about how to scale and grow the system and processes.

TL;DRs

A CI system decides what tests to use, and when.
CI systems become progressively more necessary as your codebase ages and grows in scale.
CI should optimize quicker, more reliable tests on pre-submit and slower, less deterministic tests on post-submit.
Accessible, actionable feedback allows a CI system to become more efficient.

Continuous Integration (CI) definitions

CI is generally defined as “a software development practice where members of a team integrate their work frequently […] Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible.”
- Simply put, the fundamental goal of CI is to automatically catch problematic changes as early as possible.
Better definition in today's world:
- Continuous Integration (2): the continuous assembling and testing of our entire complex and rapidly evolving ecosystem.
From a testing perspective, CI is a paradigm to inform the following:
- Which tests to run when in the development/release workflow, as code (and other) changes are continuously integrated into it
- How to compose the system under test (SUT) at each point, balancing concerns like fidelity and setup cost
- For example, which tests do we run on pre-submit, which do we save for post-submit, and which do we save even later until our staging deploy? Accordingly, how do we represent our SUT at each of these points?

CI Concepts

Fast Feedback Loops

To minimize the cost of bugs, CI encourages us to use fast feedback loops. A.K.A. shifting left.
Feedback in different forms
- The edit-compile-debug loop of local development
- An integration error between changes to two projects, detected after both are submitted and tested together (i.e., on post-submit)
- An incompatibility between our project and an upstream microservice dependency, detected by a QA tester in our staging environment, when the upstream service deploys its latest changes
- Bug reports by internal users who are opted in to a feature before external users
- Bug or outage reports by external users or the press

Accessible and actionable feedback

Open culture around test reporting: logs, detailed history
Flake classification (uses statistics to classify flakes at a Google-wide level)
By improving test output readability, you automate the understanding of feedback.
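
A minimal sketch of one statistical signal for flake classification. This is not Google's actual classifier, which these notes do not describe; the function name and the tuple layout are assumptions for illustration. The idea: if a test both passed and failed at the same revision, a code change cannot explain the difference, so the test is a flake candidate.

```python
from collections import defaultdict


def flaky_tests(results):
    """Return names of tests that both passed and failed at one revision.

    results: iterable of (test_name, revision, passed) tuples.
    """
    outcomes = defaultdict(set)  # (test, revision) -> set of observed outcomes
    for test, revision, passed in results:
        outcomes[(test, revision)].add(bool(passed))
    # Two distinct outcomes at one revision means the result is not deterministic.
    return {test for (test, _rev), seen in outcomes.items() if len(seen) == 2}
```

A test that fails at one revision and passes at the next is not flagged, because a code change could explain that.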

Automation

Continuous Build

The Continuous Build (CB) integrates the latest code changes at head. A.K.A., trunk-based development.
Engineers can choose to integrate with true head (the latest committed change) or green head (the latest change the CB has verified) during local development.
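
A small sketch of the true head versus green head distinction, assuming true head is the newest commit and green head is the newest commit the CB has verified. The status strings and function names are hypothetical.

```python
def true_head(commits):
    """commits: list of (commit_id, cb_status) tuples, oldest first."""
    return commits[-1][0] if commits else None


def green_head(commits):
    """Newest commit whose continuous build passed, or None if there is none."""
    for commit_id, status in reversed(commits):
        if status == "pass":
            return commit_id
    return None
```

Syncing to green head trades freshness for a baseline that is known to build and pass its tests.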

Continuous Delivery

Most teams cut release candidates (RCs) at green, as opposed to true, head.
We would recommend dynamic configuration, such as experiments or feature flags, for many scenarios.
At Google, static configuration is in version control along with the code.
- Version skew is often caught in this release-candidate-promotion process. This assumes, of course, that your static configuration is in version control.
Continuous Delivery (CD): a continuous assembling of release candidates, followed by the promotion and testing of those candidates throughout a series of environments—sometimes reaching production and sometimes not.
As an RC progresses through environments, its artifacts (e.g., binaries, containers) ideally should not be recompiled or rebuilt.
- Using containers such as Docker helps enforce consistency of an RC between environments,
- Using orchestration tools like Kubernetes helps enforce consistency between deployments.
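
One way to check the "do not rebuild the RC" rule is to compare artifact digests across environments. This is a sketch under my own assumptions (SHA-256 over raw artifact bytes, hypothetical function names), not something the chapter prescribes.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Content hash that identifies one exact build of an artifact."""
    return hashlib.sha256(data).hexdigest()


def mismatched_environments(rc_digest: str, deployed: dict) -> list:
    """Return environments whose deployed artifact differs from the RC.

    deployed: environment name -> artifact bytes running there.
    """
    return [
        env for env, data in deployed.items()
        if artifact_digest(data) != rc_digest
    ]
```

Container image digests serve the same purpose: promoting by digest rather than by a mutable tag pins every environment to the same bytes.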

Continuous Testing

One of CI's key objectives is determining what to test when in the progression from local development to production.
As we shift to the right, the code change is subjected to progressively larger-scoped automated tests.

Why pre-submit isn't enough

With the objective to catch problematic changes as soon as possible: why not just run all tests on pre-submit?

It's too expensive.
Waiting a long time to run every test pre-submit wastes engineer productivity.
Efficiency gains from selective testing.

Mid-air collision: it is possible for two changes that touch completely different files to cause a test to fail.

Note(thomas.wang): I like to call it logical conflicts in comparison to merge conflicts
It happens most days at Google scale.
CI systems for smaller repositories or projects can avoid this problem by serializing submits so that there is no difference between what is about to enter and what just did. Note(thomas.wang): this is generally called a merge queue.
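
A toy model of a mid-air collision, and of how serializing submits avoids it. Everything here is hypothetical: the "repo" is a dict of files, and the only "test" checks that every called name is defined somewhere. Change A renames a function in `lib.py`; change B adds `app.py`, which calls the old name. Each passes against the base, but the combination fails.

```python
def run_tests(repo):
    """Toy test suite: every called name must be defined somewhere."""
    defined = set().union(*(f["defines"] for f in repo.values()))
    called = set().union(*(f["calls"] for f in repo.values()))
    return called <= defined


BASE = {"lib.py": {"defines": {"old_name"}, "calls": set()}}


def change_a(repo):
    """Rename old_name to new_name in lib.py."""
    return {**repo, "lib.py": {"defines": {"new_name"}, "calls": set()}}


def change_b(repo):
    """Add app.py, which calls old_name."""
    return {**repo, "app.py": {"defines": set(), "calls": {"old_name"}}}


def presubmit_in_parallel(base, changes):
    """Test each change against the same base, then land all that passed."""
    accepted = [c for c in changes if run_tests(c(base))]
    head = base
    for change in accepted:
        head = change(head)
    return head, accepted


def merge_queue(base, changes):
    """Test each change against the head it will actually land on."""
    head, accepted = base, []
    for change in changes:
        candidate = change(head)
        if run_tests(candidate):
            head, accepted = candidate, accepted + [change]
    return head, accepted
```

With the parallel strategy both changes land and head is broken; with the queue, change B is rejected because it is tested on top of change A.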

Presubmit versus post-submit

Which tests should be run on pre-submit?

General rule of thumb is: only fast, reliable ones.
You can accept some loss of coverage on pre-submit, but that means you need to catch any issues that slip by on post-submit, and accept some number of rollbacks.

Ways to avoid waiting for slow builds:

Typically limit pre-submit tests to just those for the project where the change is happening.
Run tests concurrently.
Don't run unreliable tests on pre-submit.
1. Most teams at Google run their small tests (like unit tests) on pre-submit.
  1. Each team at Google configures a subset of its project’s tests to run on pre-submit (versus post-submit).
  2. In reality, our continuous build actually optimizes some pre-submit tests to be saved for post-submit, behind the scenes.
Whether and how to run larger-scoped tests on pre-submit is the more interesting question, and this varies by team.
1. Hermetic testing reduces inherent instability.
2. Allow large-scoped tests to be unreliable on pre-submit but disable them aggressively when they start failing.
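
The rules of thumb above (same project, fast, reliable) can be sketched as a filter. The thresholds, field names, and the `CiTest` type are assumptions for illustration; each team would tune its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiTest:
    name: str
    project: str
    typical_seconds: float  # e.g. median runtime from past runs
    flake_rate: float       # fraction of runs that failed nondeterministically


def select_presubmit(tests, project, max_seconds=60.0, max_flake_rate=0.01):
    """Keep only this project's fast, reliable tests for pre-submit."""
    return [
        t.name
        for t in tests
        if t.project == project
        and t.typical_seconds <= max_seconds
        and t.flake_rate <= max_flake_rate
    ]
```

Everything the filter drops still has to run on post-submit, otherwise the lost coverage is never recovered.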

Release candidate testing

After a code change has passed the CB, as CD builds RCs, it will run larger tests against the entire RC by promoting it through a series of test environments and testing it at each deployment.

This can include a combination of sandboxed, temporary environments and shared test environments (dev or staging).
It’s common to include some manual QA testing of the RC in shared environments, too.

Reasons to rerun, against the RC, the same test suite that the CB already ran:

As a sanity check
For auditability
To allow for cherry picks
For emergency pushes

Production testing

We should run the same suite of tests against production (sometimes called probers) that we did against the release candidate earlier on to verify:

the working state of production, according to our tests
the relevance of our tests, according to production.

CI Challenges

Disruption to engineer productivity from unstable, slow, conflicting, or simply too many tests at pre-submit.
Presubmit optimization, including which tests to run at pre-submit time, and how to run them.
Culprit finding and failure isolation: Which code or other change caused the problem, and which system did it happen in?
Resource constraints: Tests and the infrastructure both need resources.
Failure management: what to do when tests fail.
1. It’s extremely difficult to have a consistently green test suite when large end-to-end tests are involved.
2. A common technique at Google is to use bug “hotlists” filed by an on-call or release engineer (or even automatically) and triaged to the appropriate team.
  1. Any release-blocking bugs are fixed immediately.
  2. Nonrelease blockers should also be prioritized.
3. Often, the problems caught by end-to-end test failures are actually with tests rather than code.
Flaky tests.
1. Finding a change to roll back is often more difficult because the failure won’t happen all the time.
2. Some teams rely on a tool to remove such flaky tests from pre-submit temporarily while the flakiness is investigated and fixed.
Test instability.
1. Allow multiple attempts of the test to run.
2. Within test code, retries can be introduced at various points of specificity.
3. [[#Hermetic Testing]]
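
A minimal sketch of "allow multiple attempts" as a decorator, assuming a test signals failure by raising `AssertionError`. The decorator name and attempt count are hypothetical; real test runners usually offer their own retry options.

```python
import functools


def retry(attempts=3):
    """Re-run a test function until it passes or attempts are used up."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as err:
                    last_error = err  # remember the failure and try again
            raise last_error  # every attempt failed: report the last failure
        return wrapper
    return decorator
```

Retries trade signal for stability: a test that passes only on its third attempt is still flaky, so it is worth recording how many attempts were needed.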

Hermetic Testing

Testing Overview introduced the concept of hermetic tests

Hermetic tests: tests run against a test environment (i.e., application servers and resources) that is entirely self-contained (i.e., no external dependencies like production backends).

See hermetic backend design and usage in tests.

Hermetic tests properties:

greater determinism (i.e., stability).
isolation.

Ways to achieve a pre-submit-worthy integration test:

With a fully hermetic setup—that is, starting up the entire stack sandboxed (Google provides out-of-the-box sandbox configurations for popular components).
Record/replay (See Larger Testing). The downside is that it leads to brittle tests; it’s difficult to strike a balance between the following:
1. False positives: the test passes when it shouldn't because it hits the cache too much, missing problems that would surface when capturing a new response.
2. False negatives: the test fails when it shouldn't because it hits the cache too little. This requires responses to be updated, which can take a long time and lead to test failures, often submit-blocking.
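
A minimal record/replay sketch, to make the cache trade-off concrete. The class and its keying scheme (JSON-serialized request) are my assumptions, not the tooling the book refers to.

```python
import json


class RecordReplayBackend:
    """Serve recorded responses; optionally record from a live backend."""

    def __init__(self, live_call=None, recordings=None):
        self.live_call = live_call            # None means replay-only mode
        self.recordings = dict(recordings or {})

    @staticmethod
    def _key(request):
        # Stable key so that equal requests hit the same recording.
        return json.dumps(request, sort_keys=True)

    def call(self, request):
        key = self._key(request)
        if key in self.recordings:
            return self.recordings[key]       # cache hit: may be stale
        if self.live_call is None:
            raise KeyError(f"no recording for request: {key}")
        response = self.live_call(request)    # cache miss: record it
        self.recordings[key] = response
        return response
```

A cache hit can hide a backend change (the test passes on stale data), while a miss in replay-only mode fails the test until someone re-records.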

Case study - Google Assistant

Success story about making test suite fully hermetic on pre-submit:
- Cut the runtime by 14x.
- With no flakiness - failures tend to be fairly easy to find and roll back.
Non-hermetic tests have been pushed to post-submit, where debugging failing end-to-end tests is still difficult. Teams sometimes have to disable them, which can result in production failures.
Challenges:
1. Continue fine-tuning its caching mechanisms so that pre-submit can catch more types of issues.
2. Presubmit testing for the decentralized Assistant as components shift into their own microservices.
  - Post-submit failure-isolation strategy: hotswapping with production backends, so the cost is O(N) instead of O(N^2).
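
To see where O(N) versus O(N^2) comes from, here is a counting sketch. It assumes one post-submit environment per microservice, with that service built at head and the other N - 1 taken from production; the function names are hypothetical. Without sharing, each environment starts its own copies of the production services. With hotswapping, all environments reuse one already-running production set, so only the head servers are new.

```python
def plan_environments(services):
    """One environment per service: that service at head, the rest from prod."""
    return {
        under_test: {s: ("head" if s == under_test else "prod") for s in services}
        for under_test in services
    }


def servers_to_start(plan, share_prod=True):
    """Count servers that must be started for the whole plan."""
    versions = [v for env in plan.values() for v in env.values()]
    head = versions.count("head")
    prod = versions.count("prod")
    # Shared production backends are assumed to exist already, so they cost nothing extra.
    return head if share_prod else head + prod
```

For 4 services this is 4 servers with sharing versus 16 without.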

Hermetic Testing Conclusions

Hermetic testing can both reduce instability in larger-scoped tests and help isolate failures.
However, hermetic backends can also be more expensive because they use more resources and are slower to set up.
Many teams use combinations of hermetic and live backends in their test environments.

To minimize the time spent waiting, Google’s CB approach allows potentially breaking changes to land in the repository.
- Each team creates a fast subset of tests, often a project’s unit tests, that can be run before a change is submitted.
- Empirically, a change that passes the pre-submit has a very high likelihood (95%+) of passing the rest of the tests.
After a change has been submitted, we use TAP (the Test Automation Platform, Google's global continuous build) to asynchronously run all potentially affected tests, including larger and slower tests.
- Established a cultural norm that strongly discourages committing any new work on top of known failing tests, though flaky tests make this difficult.
- Build Cop’s responsibility is keeping all the tests passing in their particular project, regardless of who breaks them.
In practice the trade-off pays off. The average wait time to submit a change is around 11 minutes, and teams are able to efficiently detect and address breakages.

Culprit finding

TAP can no longer run every test on every change; instead, it falls back to batching related changes together.
To speed up failure identification, we use two different approaches.
1. TAP automatically splits a failing batch up into individual changes and reruns the tests against each change in isolation. (Slow to converge)
2. Created culprit finding tools for manual binary search.
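
The manual binary search can be sketched as follows. It assumes a single culprit in the batch, a deterministic test, and that the test keeps failing once the culprit is applied; a flaky test breaks these assumptions. `passes_at` is a hypothetical callback that runs the failing test with the first `i` changes of the batch applied.

```python
def find_culprit(changes, passes_at):
    """Binary-search an ordered batch for the change that broke a test.

    changes: ordered list of change ids in the failing batch.
    passes_at(i): True if the test passes with changes[:i] applied.
    Requires passes_at(0) to be True and passes_at(len(changes)) to be False.
    """
    lo, hi = 0, len(changes)  # invariant: passes at lo, fails at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_at(mid):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]  # first change whose inclusion makes the test fail
```

This needs about log2(n) test runs for a batch of n changes, compared with up to n runs when every change is rerun in isolation.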

Note(thomas.wang): when the scale is smaller (3~5 commits in a batch, 1~2 breakages per day), it's probably okay to ping all authors in the commits range and find the culprit commit together.

Failure management

Fixing a broken build is the responsibility of the Build Cop.
The most effective tool the Build Cop has is the rollback.
TAP has recently been upgraded to automatically roll back changes when it has high confidence that they are the culprit. (The SWE book was published in March 2020.)

Resource constraints

Most test executions happen in a distributed build-and-test system called Forge.
Ways to determine which tests should be run at which times to ensure that the minimal amount of resources is spent to validate a given change:
1. Forge and Blaze maintain a near-real-time version of the global dependency graph and make it available to TAP. As a result, TAP can quickly determine which tests are downstream from any change and run the minimal set to be sure the change is safe.
2. TAP's ability to run changes with fewer tests sooner encourages engineers to write small, focused changes.
  1. The difference in waiting time between a change that triggers 100 tests and one that triggers 1,000 can be tens of minutes on a busy day.
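
Finding the tests downstream of a change is a reverse-dependency traversal. This sketch works on a plain dict and is not how TAP, Forge, or Blaze implement it; the target labels are made up.

```python
from collections import deque


def affected_tests(deps, changed, tests):
    """Return the tests that transitively depend on any changed target.

    deps: target -> list of targets it depends on directly.
    changed: targets touched by the change.
    tests: set of targets that are tests.
    """
    # Invert the graph: dependency -> targets that depend on it.
    reverse = {}
    for target, direct in deps.items():
        for dep in direct:
            reverse.setdefault(dep, set()).add(target)

    seen = set(changed)
    queue = deque(changed)
    while queue:  # breadth-first walk over reverse edges
        node = queue.popleft()
        for dependent in reverse.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen & set(tests))
```

A change low in the graph (a widely used library) fans out to many tests, which is one reason small, focused changes are validated sooner.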

CI Case Study: Google Takeout

Prevent problems in nightly dev deploys:
- Check service health in pre-submit tests.
- Move end-to-end tests (which use test accounts) from nightly deploy to post-submit within 2 hours.
Accessible, actionable feedback from CI reduces test failures and improves productivity.
- Refactored the tests to report results in a friendlier UI.
- Improve failure debuggability, e.g., by displaying failure information, with links to logs, directly in the error message.
  - This reduces the Takeout team's involvement in debugging plug-in failures. Measured by "Mean # comments per bug by takeout team".
Running the same test suite against prod and a post-submit CI (with newly built binaries, but the same live backends) is a cheap way to isolate failures.
- Remaining challenge: Manual comparisons between this CI and prod are an expensive use of the Build Cop’s time, and the cost grows as Takeout integrates with more Google services.
- Future improvement: try hermetic testing with record/replay in Takeout’s post-submit CI. [[#Hermetic Testing]]
Plug-in end-to-end tests break for reasons the Takeout team has no control over.
1. Solution is to disable failing tests by tagging them with an associated bug and filing that off to the responsible team.
2. Uses feature flags for plug-in to choose features to enable.
3. Tests query the bug system API; if a tagged test passes, it prompts the engineer to clean up the tag and mark the bug fixed.
4. Together, these created a self-maintaining test suite. Measured by "Mean time to close bug, after fix submitted", A.K.A. "MTTCU: mean time to clean up". In Takeout's case, MTTCU dropped from 60 days on 2018-09-01 to under 10 days on 2018-12-01.
5. Future improvement: Automating the filing and tagging of bugs would be a helpful next step. This is still a manual and burdensome process. As mentioned earlier, some of our larger teams already do this.
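
The tag-and-clean-up loop can be sketched as a small decision function. The outcome names and the `is_bug_open` callback (standing in for a bug system API query) are hypothetical; the notes do not describe Takeout's actual tooling in this detail.

```python
def evaluate_tagged_test(test_fn, bug_id, is_bug_open):
    """Run a test that is tagged with a bug and decide what to report.

    Returns one of:
      "known_failure": fails and the bug is open, so it does not block.
      "cleanup":       passes while the bug is open; remove tag, close bug.
      "remove_tag":    passes and the bug is already closed; tag is stale.
      "fail":          fails although the bug is closed; a real failure.
    """
    try:
        test_fn()
        passed = True
    except AssertionError:
        passed = False

    bug_open = is_bug_open(bug_id)
    if passed:
        return "cleanup" if bug_open else "remove_tag"
    return "known_failure" if bug_open else "fail"
```

The "cleanup" outcome is what keeps the suite self-maintaining: a fixed plug-in announces itself instead of staying disabled until someone notices.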