The Gift That Keeps on Giving

veri-ivanova-p3Pj7jOYvnM-unsplash.jpg

In his book The Effective Engineer, Edmond Lau encourages people to focus on high-leverage activities: those with a high impact per time invested. One example that is often overlooked is shortening a workflow. Many engineers don't internalize the long-term benefits of speed improvements, which compound over time, and multiply with the number of people who use them. Reducing by 1 minute the build time of a test suite that is run 10 times per day by 10 engineers translates to over one person-day in time saved every week!

It might seem easy to dismiss the time savings based on the fact that engineers can continue working while waiting for a test suite to finish. However, this inevitably involves context switching, which leads to lower cognitive performance, as shown by empirical research.

When I first joined the Caseflow project, I was warned not to try running the test suite locally because of how slow it was. With parallelization in Circle CI, the tests ran in a reasonable amount of time, and people got used to that. Still, I knew there was room for improvement, and I made a mental note to look into the slowness.

Once I had time to devote to the problem, I hypothesized that most of the slowness was caused by a small portion of the test setup. This was based on the Pareto principle (also known as the 80/20 rule), which states that a small number of things account for the majority of the results. A common source of bloat in Rails tests comes from the rails_helper.rb file. I tested this theory by running a set of unit tests that didn't require loading Rails at all. They completed in 16.77 seconds, which was obviously too slow. I noticed that none of the tests explicitly requirerails_helper, so I looked in our .rspec file to see if we were automatically requiring rails_helper in all tests, and we were! When I removed the --require rails_helper flag, those same tests ran in 0.03745 seconds. That's 448x faster!

Having narrowed down one cause of slowness to the rails_helper file, I decided this was a good first step, and opened a PR that stops automatically requiring rails_helper, and explicitly required it in tests that needed it.

Next, I scanned our large rails_helper file for any obvious culprits, but nothing jumped out that would affect those simple unit tests. My next test was to comment out the line that loads all the Ruby files in the spec/support directory. That did it! The offender was database_cleaner.rb, which restores databases to a clean state to make tests idempotent.

Caseflow interacts with two databases: Postgres for the newer Veteran appeals, and Oracle for legacy appeals. Cleaning the Oracle DB with the deletion strategy is particularly slow: about 5 seconds. Although this was configured to only happen once before the test suite is run, it meant that running a single test would incur this cost every time. Running single tests is common when writing a new feature or refactoring. In addition, cleaning with the transaction strategy was configured to run before each test, whether or not it needed the DB.

The solution I came up with was to use RSpec's metadata feature to tag tests that require either Postgres or both databases with :postgres or :all_dbs. Supporting that required a few more changes:

  • Splitting up database_cleaner.rb into two separate files, one for Postgres cleaning and one for Oracle.

  • Excluding the DB cleaning files from being automatically required.

    Dir[Rails.root.join("spec/support/**/*.rb")].each do |file|
      excluded_files = %w[database_cleaner.rb vacols_database_cleaner.rb]
      require file unless excluded_files.include?(file.split("/").last)
    end
  • Explicitly requiring one or both DB cleaning files in each test that needed it.

In addition, I noticed that we had a before(:all) block in rails_helper that populated some tables in the Oracle DB. These tables are excluded from the DB cleaning, and are required to set up data for some tests that interact with Oracle. As with the before(:suite) database cleaning, the issue with having this in a before(:all) block is that this code gets run every time you run any individual test, unnecessarily slowing down those that don't need it. Since this was meant as a one-time setup step, I moved this into a separate rake task that gets run when setting up the local environment, and as a separate step before running the tests in Circle CI.

I tested the impact of these changes by running all tests except for feature tests, which are slow by nature:

bundle exec rspec --exclude-pattern "**/feature/**/*_spec.rb"

Before my changes, these tests ran in about 18 minutes. After my changes, they ran in about 6.5 minutes! I thought this made a big enough difference to outweigh the extra effort of adopting a new process for writing and maintaining tests. I announced the results and proposed the workflow change to all engineers on Slack, and also documented it in a new Wiki entry I created to define testing best practices based on some patterns I observed while working on speeding up the test suite.

Going through each test to see whether it needed Postgres or both Postgres and Oracle was very tedious. It was also error prone at first, as I and others missed or forgot to add the appropriate metadata to some tests, causing intermittent failures (affectionately called flakey tests on the Caseflow team). To help address this, I created a Danger rule to remind people to tag their tests. Over time, the new habit was adopted, and although I'm no longer assigned to the Caseflow project, I'm glad to see this process is still being used.

In Circle CI, where parallelization is used, the difference wasn't that drastic, but still significant enough. I measured this by taking the average of the longest RSpec runs for 5 builds before and after my changes. The difference was 0.65 minutes. The average number of daily builds over the past week was 50. The current contract is for a period of 3.5 years, but it took about 6 weeks to recoup the costs of the performance improvements, so let's round down to 3 years for a conservative estimate. Assuming the daily build average holds up, the total projected net time saved is 50 builds x 0.65 minutes saved per build x 5 days per week x 46 weeks per year x 3 years = 9 work weeks!

This doesn't take into account the time saved when running tests locally, nor the future potential savings from following best practices to speed up existing tests cases and writing fast new tests. The areas of improvement that were identified were a direct result of this work.

What makes this change especially attractive is that it is a gift that keeps on giving. Every new test that doesn't use the Oracle DB will run faster than it would have with the old setup.