I love reading books while paying attention to details, checking references to other books recommended by the author, and following the recommended reading links. The icing on the cake is the author’s blog, as it usually contains a much-expanded version of the content available in the book. This was precisely the case with Will Larson’s “An Elegant Puzzle: Systems of Engineering Management.”
There are a lot of interesting nuggets in his blog. What caught my attention today were two stories: how Will started the SRE organization at Uber and grew it to 40 people, and the Digg v4 launch, a complete rewrite marked by a series of problems.
These stories revolve around migrating a system to deal with technical debt that had grown into pain the whole team could feel. Engineers and managers often treat migrations as something that shouldn’t normally be necessary: “If only we’d considered covering this situation before, we wouldn’t have to spend six months migrating the data layer now.”
Migrating Service Provisioning at Uber
Around 2015, Uber was transitioning from a monolithic architecture to microservices. However, provisioning and deploying new services was incredibly complex and error-prone.
Will was running a small team tasked with streamlining this process. They started with a messy service cookbook that required coordinating config changes across multiple server tiers to launch a new service. It could take days of debugging to get a new service working correctly.
The team automated more and more of the provisioning steps, although the process was clumsy at first. Engineers requesting new services had to paste Puppet changes without understanding their effects. Service discovery was another challenge: ports were initially assigned manually from a wiki page, which led to outages when ports were reused.
Through an iterative migration process, the platform team scaled Uber from 15 to over 2000 services with a staff of only 4 people. Provisioning new services went from a week-long cross-team ordeal to something any new engineer could do on their first day.
Migrations As The Reality of Engineering
The reality of engineering is quite different from abstractions in our minds. Even if we tried to design systems for every possible future outcome, we’d create bloated, overengineered solutions that would take forever to implement and would still quickly become obsolete.
Instead, we need to keep the focus on the requirements we have right now and plan for migrating the systems to reflect the new, changed reality once in a while. This is exactly what Will talks about in his book. Here are a few of my notes from “Elegant Puzzle” on migrations.
Get used to migrations as your codebase ages, and the business grows
As an engineering organization matures, technical debt accumulates, and the codebase becomes more complex. This requires periodic migrations to improve the architecture. Without periodic migrations, the codebase becomes brittle and hard to change. Adding new features slows down. Bugs emerge that are hard to fix. Plan for migrations as a normal part of the engineering lifecycle. Allocate resources and schedule time for significant migrations. Celebrate migrations as progress.
Migrations reduce capacity today to have more capacity tomorrow
Performing a major migration requires engineers to focus on upgrading the architecture rather than building features. This reduces team velocity in the short term. Technical debt accrues interest. The longer you wait to pay it down, the more it costs. Taking the hit on velocity now avoids much greater slowdowns later. Get stakeholder buy-in on trading some immediate momentum for future speed. Explain how migrations reduce “drag” on the team over time.
Migrations are tricky to schedule
The bigger and older a codebase gets, the more costly and risky migrations become. It’s tempting to kick the can down the road yet again. Without scheduling time for migrations upfront, the code rots indefinitely. The migration only gets more complicated. Build in migration time upfront. Define targets for regular migrations, like every 2 years. Start planning the subsequent migration as soon as the current one finishes.
If something stops working, it’s a good sign
It’s a sign that the system wasn’t over-engineered. Remember the YAGNI principle: you ain’t gonna need it. Code that is simple and focused on the current requirements will naturally stop fitting as those requirements evolve. Overengineering for possible future needs adds complexity. Embrace lean thinking and YAGNI when writing code. Build only what you need today without speculation. Then refactor mercilessly.
The ability to migrate can be a defining constraint to your overall velocity
The speed at which you can perform migrations puts a hard limit on feature velocity, no matter how fast developers code. Technical debt must be paid down to move fast long-term. How quickly you can migrate defines this pace. Invest in migration tools and talent. Practice migrating. Pay down technical debt aggressively to prevent massive slowdowns.
Running to stand still
Googlers have the phrase “Running to stand still,” describing a team just upgrading dependencies and patterns and being unable to progress forward. When a team spends all its time upgrading and refactoring without new features, it feels like running in place. No business value gets delivered. Users don’t see progress. The team burns out keeping things working. Balance maintenance and migration with new development. Allocate some sprint capacity to cleanup while ensuring regular feature delivery.
Every midsize company has a long queue of migrations they can’t staff, extending to sunset
Most growing engineering teams accumulate more technical debt than they can pay down, building up a backlog of migrations. New features take priority over technical fixes. Paying down debt consistently takes a lot of work. Shortcuts pile up over time. Make technical health a priority. Allocate regular resources to migrations, not just ad hoc. Build a culture focused on architectural fitness.
Get effective at migrations
Without continual modernization, the codebase eventually becomes unmanageable and requires a complete rewrite. A rewrite throws away institutional knowledge and productive processes. It’s massively expensive and risky. Migrate frequently in small steps rather than in huge, scary jumps. Invest in automating migrations. Have a platform vision.
De-risk, enable, and finish the migration
Approach migrations in 3 phases — de-risk by piloting, enable by putting scaffolding in place, and finish by executing the migration. This breaks a big migration into smaller steps with built-in checkpoints. Each phase reduces risk and complexity. Don’t boil the ocean, decompose. Validate assumptions. Build a safety net and automate.
Start with the most challenging parts
Try to automate 90% of the actual migration. Tackle the most challenging parts of a migration first. Automate as much of the repetitive work as possible. The hardest problems reveal the biggest unknowns and risks early. Automation speeds execution and reduces human error. Identify tricky areas upfront. Invest in automation tools for migration. Do less by hand.
The best migration tools are incremental and reversible
Rely on migration tools that allow moving incrementally and rolling back if needed. Big bang migrations freeze progress and produce irreversible errors. Incremental allows continuous improvement. Reversible minimizes risk. Find or build tools that enable small, iterative steps forward. Design migrations to retreat safely. Refactor before rewriting.
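To make “incremental and reversible” concrete, here is a minimal sketch in Python. It is my own illustration, not a tool from the book or from Will’s blog: the `Step` and `Migration` names, and the dual-write example steps, are all hypothetical. The idea is simply that every step ships with its own undo, and the runner only ever moves one step forward or one step back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Step:
    """One small migration step paired with the action that undoes it."""
    name: str
    apply: Callable[[State], None]
    revert: Callable[[State], None]


class Migration:
    """Runs steps one at a time and can retreat one step at a time."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)
        self.applied: List[Step] = []  # stack of steps applied so far

    def advance(self, state: State) -> bool:
        """Apply the next pending step; return False if none is left."""
        if len(self.applied) == len(self.steps):
            return False
        step = self.steps[len(self.applied)]
        step.apply(state)
        self.applied.append(step)  # recorded only after apply succeeds
        return True

    def rollback(self, state: State) -> bool:
        """Revert the most recent step; return False if nothing is applied."""
        if not self.applied:
            return False
        self.applied.pop().revert(state)
        return True


# Hypothetical example: moving a service from a legacy store to a new one.
def enable_dual_write(s: State) -> None:
    s["writers"] = ["legacy", "new"]


def disable_dual_write(s: State) -> None:
    s["writers"] = ["legacy"]


def read_from_new(s: State) -> None:
    s["reader"] = "new"


def read_from_legacy(s: State) -> None:
    s["reader"] = "legacy"


def drop_legacy_writes(s: State) -> None:
    s["writers"] = ["new"]


STEPS = [
    Step("dual-write", enable_dual_write, disable_dual_write),
    Step("switch-reads", read_from_new, read_from_legacy),
    Step("drop-legacy-writes", drop_legacy_writes, enable_dual_write),
]
```

Calling `advance` once turns on dual writes, a second call switches reads, and a `rollback` at that point sends reads back to the legacy store while dual writes stay on. A real tool would also need to persist which steps have run, which this sketch leaves out.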
Stop the bleeding
Deprecate legacy systems as soon as possible and ensure new code uses the new approach. Make sure managers prioritize the migration. Slowly turn down legacy systems while directing new work to the new approach. Get leadership buy-in on sunsetting. Stop the legacy system from getting bigger while scaling the new. Prevent dragging out the migration. Route new features to the new architecture. Disable legacy endpoints over time. Set a deadline to focus on priorities.
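One way to keep the legacy system from growing is a “ratchet” check that runs in CI: existing users of the old module are put on an allowlist, and any new file that imports it fails the build. The sketch below is my own illustration under stated assumptions, not something taken from the book. The module name `legacy_db`, the function names, and the in-memory `sources` mapping are all hypothetical, and the regular expression only catches the simplest import forms.

```python
import re
from typing import Dict, List, Set

# Hypothetical legacy module name; matches "import legacy_db" and
# "from legacy_db... import ..." at the start of a line.
LEGACY_PATTERN = re.compile(r"^\s*(?:from|import)\s+legacy_db\b", re.MULTILINE)


def find_legacy_users(sources: Dict[str, str]) -> List[str]:
    """Return the paths whose source text imports the legacy module."""
    return sorted(
        path for path, text in sources.items() if LEGACY_PATTERN.search(text)
    )


def check_ratchet(sources: Dict[str, str], allowlist: Set[str]) -> List[str]:
    """Return legacy users that are NOT allowlisted (these fail the build)."""
    return [p for p in find_legacy_users(sources) if p not in allowlist]


def stale_allowlist_entries(
    sources: Dict[str, str], allowlist: Set[str]
) -> List[str]:
    """Return allowlisted paths that no longer use the legacy module,
    so the allowlist can shrink as files are migrated."""
    return sorted(allowlist - set(find_legacy_users(sources)))
```

The allowlist can only shrink: each migrated file shows up in `stale_allowlist_entries` and gets removed, and the migration is done when the list is empty.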
Celebrate the migrations after completion
Recognize the team’s effort only once a major migration is actually finished, not when it starts. Treat it as an achievement. Migrations are grueling marathons, and celebrating them energizes and motivates the team for the next one. Throw a migration party! Buy the team a dinner. Hand out certificates. Get leadership to call out the accomplishment.
Regular Migrations As a Solution
We’re often so busy with our day-to-day problems that it’s easy to miss the patterns that repeat over and over. Reading about someone else dealing with similar issues offers me a different perspective and feels like a breath of fresh air. If Will’s team could do it, we can do it too.
Just hearing the word “migrations” makes most engineers groan. They’re painful, time-consuming, and can put a real damper on feature development. But they are a necessary part of any growing product. As systems become more complex and technical debt piles up, periodic cleanups are critical for maintaining velocity.
If we start viewing migrations as planned regular investments in the future rather than inconvenient chores that result from oversight, they become something else entirely.
Allocate those resources deliberately. Automate aggressively. Execute quickly. Celebrate when it’s done. Then, take a breath, give your team a high five, and start planning the next migration.
Originally published on Medium.com