Thomas Wang

From journeyman to master.

Continuous Integration at Google

Software Engineering at Google chapter 23: Continuous Integration.
Written by Rachel Tannenbaum. Edited by Lisa Carey.

I recently started working on CI infra. This chapter from "Software Engineering at Google" is the best material I've read for learning both the fundamental concepts and the practical tips for scaling a CI system in a large organization. It is worth re-reading these notes and comparing them with the CI infra of other companies that operate at different scales. That comparison will help me think about how to scale and grow the system and its processes.


Continuous Integration (CI) definitions

  1. CI is generally defined as “a software development practice where members of a team integrate their work frequently […] Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible.”
    • Simply put, the fundamental goal of CI is to automatically catch problematic changes as early as possible.
  2. Better definition in today's world:
    • Continuous Integration (2): the continuous assembling and testing of our entire complex and rapidly evolving ecosystem.

  3. From a testing perspective, CI is a paradigm to inform the following:
    • Which tests to run when in the development/release workflow, as code (and other) changes are continuously integrated into it
    • How to compose the system under test (SUT) at each point, balancing concerns like fidelity and setup cost
    • For example, which tests do we run on pre-submit, which do we save for post-submit, and which do we save even later until our staging deploy? Accordingly, how do we represent our SUT at each of these points?
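The "which tests run when" question above can be sketched as a small policy function. This is only an illustration under stated assumptions: the `Target` type, the size labels, and the rule itself (small tests on pre-submit, larger hermetic tests on post-submit, everything else at staging) are hypothetical and not taken from the chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PRESUBMIT = "pre-submit"
    POSTSUBMIT = "post-submit"
    STAGING = "staging"


@dataclass(frozen=True)
class Target:
    """A test target (hypothetical shape, for illustration only)."""
    name: str
    size: str        # "small", "medium", or "large" (assumed labels)
    hermetic: bool   # True if the SUT has no external dependencies


def assign_stage(test: Target) -> Stage:
    """Pick the earliest stage where the test is assumed cheap and stable enough."""
    if test.size == "small":
        return Stage.PRESUBMIT      # fast feedback, low setup cost
    if test.hermetic:
        return Stage.POSTSUBMIT     # larger, but self-contained SUT
    return Stage.STAGING            # needs a high-fidelity, deployed SUT


unit = Target("parser_test", "small", hermetic=True)
e2e = Target("checkout_flow_test", "large", hermetic=False)
print(assign_stage(unit).value, assign_stage(e2e).value)
```

A real policy would also weigh historical runtime and failure rate; the point here is only that each stage pairs a set of tests with a particular representation of the SUT.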

CI Concepts

Fast Feedback Loops

Accessible and actionable feedback


See also

Continuous Build

Continuous Delivery

Continuous Testing

Why pre-submit isn't enough

With the objective to catch problematic changes as soon as possible: why not just run all tests on pre-submit?

  1. It's too expensive.
  2. Waiting a long time to run every test pre-submit wastes engineer productivity.
  3. Efficiency gains from selective testing.

Mid-air collision: it is possible for two changes that touch completely different files to cause a test to fail.
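A toy model of a mid-air collision, assuming a "build" passes only when every called function is defined somewhere in the repository. The file names and the `build_passes` and `apply_change` helpers are made up for illustration. Change A renames a function and updates the existing caller; change B adds a new file that still calls the old name. The two changes touch disjoint files and each passes against the base, yet the combination fails.

```python
def build_passes(repo):
    """Return True if every called function is defined in some file."""
    defined = set().union(*(f["defines"] for f in repo.values()))
    called = set().union(*(f["calls"] for f in repo.values()))
    return called <= defined


def apply_change(repo, change):
    """Return a new repo state with the changed files overwritten or added."""
    merged = dict(repo)
    merged.update(change)
    return merged


base = {
    "lib.py": {"defines": {"fetch"}, "calls": set()},
    "app.py": {"defines": set(), "calls": {"fetch"}},
}

# Change A: rename fetch -> fetch_all and update the only existing caller.
change_a = {
    "lib.py": {"defines": {"fetch_all"}, "calls": set()},
    "app.py": {"defines": set(), "calls": {"fetch_all"}},
}

# Change B: add a new file that calls the old name.
change_b = {
    "report.py": {"defines": set(), "calls": {"fetch"}},
}

only_a = build_passes(apply_change(base, change_a))
only_b = build_passes(apply_change(base, change_b))
both = build_passes(apply_change(apply_change(base, change_a), change_b))
print(only_a, only_b, both)
```

Pre-submit tests each change against the base it was written on, so only a check that runs after both changes have landed can see the combined state.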

Presubmit versus post-submit

Which tests should be run on pre-submit?

Ways to avoid waiting for slow builds:

  1. Typically limit pre-submit tests to just those for the project where the change is happening.
  2. Run tests concurrently.
  3. Don't run unreliable tests on pre-submit.
    1. Most teams at Google run their small tests (like unit tests) on pre-submit.
      1. Each team at Google configures a subset of its project’s tests to run on pre-submit (versus post-submit).
      2. In reality, our continuous build actually optimizes some pre-submit tests to be saved for post-submit, behind the scenes.
  4. Whether and how to run larger-scoped tests on pre-submit is the more interesting question, and this varies by team.
    1. Hermetic testing reduces inherent instability.
    2. Allow large-scoped tests to be unreliable on pre-submit but disable them aggressively when they start failing.
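The first three tactics above (scope to the changed project, run concurrently, skip unreliable tests) can be sketched together. The `Case` record and both helper functions are hypothetical, and a thread pool stands in for whatever distributed executor a real CI system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Case(NamedTuple):
    """A runnable test with the metadata needed for selection (assumed shape)."""
    name: str
    project: str
    flaky: bool
    run: Callable[[], bool]


def select_presubmit(tests, changed_project):
    """Keep only reliable tests that belong to the project being changed."""
    return [t for t in tests if t.project == changed_project and not t.flaky]


def run_concurrently(tests, max_workers=4):
    """Run the selected tests in parallel and map each name to its pass/fail result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda t: t.run(), tests)
        return dict(zip((t.name for t in tests), results))


all_tests = [
    Case("takeout_unit", "takeout", False, lambda: True),
    Case("takeout_flaky_e2e", "takeout", True, lambda: False),
    Case("search_unit", "search", False, lambda: True),
]
selected = select_presubmit(all_tests, "takeout")
print(run_concurrently(selected))
```

The excluded tests are not dropped; in this model they would simply run later, on post-submit.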

Release candidate testing

After a code change has passed the CB, as CD builds RCs, it will run larger tests against the entire RC by promoting it through a series of test environments and testing it at each deployment.

Reasons to rerun the same test suite that CB already ran:

  1. As a sanity check
  2. For auditability
  3. To allow for cherry picks
  4. For emergency pushes

Production testing

We should run the same suite of tests against production (sometimes called probers) that we did against the release candidate earlier on to verify:

  1. the working state of production, according to our tests
  2. the relevance of our tests, according to production.

CI Challenges

  1. Disruption to engineer productivity from unstable, slow, conflicting, or simply too many tests at pre-submit.
  2. Presubmit optimization, including which tests to run at pre-submit time, and how to run them.
  3. Culprit finding and failure isolation: Which code or other change caused the problem, and which system did it happen in?
  4. Resource constraints: Tests and the infrastructure both need resources.
  5. Failure management: what to do when tests fail.
    1. It’s extremely difficult to have a consistently green test suite when large end-to-end tests are involved.
    2. A common technique at Google is to use bug “hotlists” filed by an on-call or release engineer (or even automatically) and triaged to the appropriate team.
      1. Any release-blocking bugs are fixed immediately.
      2. Nonrelease blockers should also be prioritized.
    3. Often, the problems caught by end-to-end test failures are actually with tests rather than code.
  6. Flaky tests.
    1. Finding a change to roll back is often more difficult because the failure won’t happen all the time.
    2. Some teams rely on a tool to remove such flaky tests from pre-submit temporarily while the flakiness is investigated and fixed.
  7. Test instability.
    1. Allow multiple attempts of the test to run.
    2. Within test code, retries can be introduced at various points of specificity.
    3. [[#Hermetic Testing]]
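A minimal sketch of the "allow multiple attempts" idea as a decorator. The `retry` helper is written here for illustration and is not an API from the chapter or from any test framework; it re-runs the wrapped function on the listed exception types and re-raises the last error if every attempt fails.

```python
import functools


def retry(attempts=3, exceptions=(AssertionError,)):
    """Re-run the wrapped callable up to `attempts` times on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions as err:
                    last_error = err   # remember the failure and try again
            raise last_error           # every attempt failed
        return wrapper
    return decorator


calls = {"count": 0}


@retry(attempts=3)
def sometimes_unstable_check():
    """Simulated unstable test: fails on the first two calls, then passes."""
    calls["count"] += 1
    assert calls["count"] >= 3, "backend not ready"
    return "ok"


print(sometimes_unstable_check(), calls["count"])
```

Retries at the whole-test level hide instability rather than remove it, which is why the list above also points to hermetic testing as the structural fix.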

Hermetic Testing

The Testing Overview chapter introduced the concept of hermetic tests.

Hermetic tests: tests run against a test environment (i.e., application servers and resources) that is entirely self-contained (i.e., no external dependencies like production backends).

See hermetic backend design and usage in tests.

Hermetic tests properties:

  1. greater determinism (i.e., stability).
  2. isolation.

Two ways to achieve a pre-submit-worthy integration test:

  1. With a fully hermetic setup—that is, starting up the entire stack sandboxed (Google provides out-of-the-box sandbox configurations for popular components).
  2. Record/replay (See Larger Testing). Downside is that it leads to brittle tests: it’s difficult to strike a balance between the following:
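    1. False positives: the test passes when it should not, because it hits the cache too much and misses problems that would surface when capturing a new response.
    2. False negatives: the test fails when it should not, because it hits the cache too little. Responses then need to be re-recorded, which can take a long time and lead to test failures that often block submission.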
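A minimal record/replay sketch, assuming requests are JSON-serializable dicts keyed by a hash of their contents. The `ReplayBackend` class and its `mode` values are hypothetical names for illustration, not Google's tooling. In "record" mode it calls the live backend and stores the response; in "replay" mode it serves only from the cache and never touches the live backend.

```python
import hashlib
import json


class ReplayBackend:
    """Wraps a live backend call with a record/replay cache (illustrative only)."""

    def __init__(self, live_call, cache=None, mode="replay"):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.live_call = live_call
        self.cache = {} if cache is None else cache
        self.mode = mode

    @staticmethod
    def _key(request):
        # Stable key: the same request content always maps to the same entry.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def call(self, request):
        key = self._key(request)
        if self.mode == "record":
            response = self.live_call(request)
            self.cache[key] = response
            return response
        if key not in self.cache:
            # A cache miss in replay mode means the recording is stale.
            raise KeyError(f"no recorded response for request {request!r}")
        return self.cache[key]


def live_backend(request):
    return {"user": request["user"], "items": 3}


def unreachable_backend(request):
    raise RuntimeError("replay mode must not reach the live backend")


recorder = ReplayBackend(live_backend, mode="record")
recorder.call({"user": "alice"})

replayer = ReplayBackend(unreachable_backend, cache=recorder.cache, mode="replay")
print(replayer.call({"user": "alice"}))
```

The brittleness described above shows up in the key function: a key that is too coarse serves stale responses, while one that is too fine turns every small request change into a cache miss that needs a fresh recording.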

Case study - Google Assistant

Hermetic Testing Conclusions

  1. Hermetic testing can both reduce instability in larger-scoped tests and help isolate failures.
  2. However, hermetic backends can also be more expensive because they use more resources and are slower to set up.
  3. Many teams use combinations of hermetic and live backends in their test environments.

CI at Google

Test Automation Platform (TAP)

Presubmit optimization

Culprit finding

Note: when the scale is smaller (3~5 commits in a batch, 1~2 breakages per day), it's probably okay to ping all authors in the commit range and find the culprit commit together.
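For larger batches, one common approach is to bisect the commit range. The sketch below is plain binary search, not a description of how TAP does it. It assumes a single culprit, a deterministic test, a green state before the first commit in the batch, and a red state after the last one. `find_culprit` and `passes_at` are hypothetical names.

```python
def find_culprit(commits, passes_at):
    """Return the first commit at which the test starts failing.

    commits:   list ordered oldest to newest; the test is assumed to fail
               once all of them are applied.
    passes_at: callable taking an index i and returning True if the test
               passes with commits[0..i] applied.
    """
    if not commits:
        raise ValueError("commit range must not be empty")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if passes_at(mid):
            lo = mid + 1   # still green here, so the culprit is later
        else:
            hi = mid       # already red here, so the culprit is mid or earlier
    return commits[lo]


batch = ["c1", "c2", "c3", "c4", "c5"]
culprit = find_culprit(batch, passes_at=lambda i: i < 3)  # c4 breaks the test
print(culprit)
```

With a flaky test the `passes_at` signal is noisy and the search can converge on the wrong commit, which is one reason flakiness makes culprit finding harder.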

Failure management

Resource constraints

CI Case Study: Google Takeout

  1. Prevent problems in nightly dev deploys:
    • Check service health in pre-submit tests.
    • Move end-to-end tests (which use test accounts) from nightly deploy to post-submit within 2 hours.
  2. Accessible, actionable feedback from CI reduces test failures and improves productivity.
    • Refactored the tests to report results in a friendlier UI.
    • Improve failure debuggability, e.g., by displaying failure information, with links to logs, directly in the error message.
      • This reduces the Takeout team's involvement in debugging plug-in failures, measured by "Mean # comments per bug by Takeout team".
  3. Running the same test suite against prod and a post-submit CI (with newly built binaries, but the same live backends) is a cheap way to isolate failures.
    • Remaining challenge: Manual comparisons between this CI and prod are an expensive use of the Build Cop’s time, and the cost grows as Takeout integrates with more Google services.
    • Future improvement: try hermetic testing with record/replay in Takeout’s post-submit CI. [[#Hermetic Testing]]
  4. Plug-in end-to-end tests break for reasons the Takeout team has no control over.
    1. Solution is to disable failing tests by tagging them with an associated bug and filing that off to the responsible team.
    2. Use feature flags so that each plug-in can declare which features are enabled.
    3. Tests query the bug system API; once a tagged test passes again, it prompts the owner to clean up the tag and mark the bug fixed.
    4. Together these created a self-maintaining test suite, measured by "Mean time to close bug, after fix submitted", a.k.a. "MTTCU: mean time to clean up". In Takeout's case, MTTCU dropped from 60 days on 2018-09-01 to under 10 days on 2018-12-01.
    5. Future improvement: Automating the filing and tagging of bugs would be a helpful next step. This is still a manual and burdensome process. As mentioned earlier, some of our larger teams already do this.
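The bug-tagging workflow in item 4 can be sketched as a small result classifier. Everything here is an assumption for illustration: the `evaluate` function, the `bug_tags` mapping from test name to bug ID, the `bug_is_open` lookup standing in for a bug system API, and the choice to treat a tagged failure as blocking again once its bug is closed.

```python
def evaluate(test_name, passed, bug_tags, bug_is_open):
    """Classify a test result, taking bug tags into account.

    bug_tags:    dict mapping a test name to the ID of its associated bug
                 (hypothetical tagging scheme).
    bug_is_open: callable taking a bug ID and returning True while the bug
                 is still open (stand-in for a real bug system API).
    """
    bug_id = bug_tags.get(test_name)
    if bug_id is None:
        return "pass" if passed else "fail"
    if passed:
        # The tagged test works again: ask for cleanup so the tag does not rot.
        return f"cleanup: remove tag {bug_id} from {test_name} and mark the bug fixed"
    if bug_is_open(bug_id):
        return "suppressed"   # known failure, owned by the responsible team
    return "fail"             # bug closed but test still failing (assumed policy)


tags = {"photos_plugin_e2e": "b/123"}   # made-up test name and bug ID
open_bugs = {"b/123"}


def is_open(bug_id):
    return bug_id in open_bugs


print(evaluate("photos_plugin_e2e", False, tags, is_open))
print(evaluate("photos_plugin_e2e", True, tags, is_open))
```

The cleanup prompt is what makes the suite self-maintaining in this model: a disabled test cannot stay disabled silently once it starts passing.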

Further Reading