Table of contents
Seemed like a good idea at the time
Let's say you are writing an application that requires a database with schema. Whether the database has an actual schema (e.g. relational) or just the set of required attributes and their formats (e.g. DynamoDB) is not really important for our argument here: in either case, we are updating the database to match the needs of the application.
It is inevitable that, over time, this schema will change to match the needs of the software that uses it. There are software packages that let you do DB migrations, such as https://www.npmjs.com/package/migrations. These are handy because they keep track of the migrations and ensure that they are only executed once and in order. You write a series of migrations with file names that are typically a date and jira ticket number:
#
# File name 20230102-aaa-1234.js
#
exports.up = function(cb) {
dbquery(`alter table statuses add (lastUpdate timestamp)`);
cb(null, '20230102-aaa-1234: up');
}
exports.down = function(cb) {
dbquery(`alter table statuses drop column lastUpdate`);
cb(null, '20230102-aaa-1234: down');
}
I can easily specify both an "up" migration - to be executed when upgrading my software - and a "down" one that is used only if the application version needs to be rolled back.
I want to be able to roll back deployments. "Down" migrations seem like a neat solution to a common need. What could go wrong?
The problem
Well, there's no real problem with the "up" migration. They are usually well-thought-out, well-tested, and clearly necessary, and most applications have a good framework like the migrations
package above.
Itʻs the "down" migration where the problems happen. Here are some of the problems I have seen with "down" migrations.
Half the time - in my experience - people don't write them anyway. A lot of developers donʻt even recognize that they need to write them.
There are never any requirements given for them, and thus they are poorly thought through. Not only do product owners not recognize they are needed (nor could they clearly describe what needs to be done), developers donʻt really know either. Do you drop a table? Drop columns? Leave them there but modify them? It's up to the developer's best guess, and two developers are unlikely to guess the same way.
They are almost never tested rigorously: maybe once by the original developer and probably never again. But do teams write tests against migrations, including applying "downs" and rolling back deployments? If they do a "down" test, do they test arbitrary sets of "down" migrations?
Because of all these, they give a false sense of security. Unless the migrations are faithfully written and tested 100% of the time, they are probably worse than useless. We think "we can do immediate rollbacks if there are any problems with the deployment" when we really can't.
They introduce problems with rolling, canary or other zero-downtime deployment strategies. Having "down" migrations is basically an admission you don't have database compatibility between versions, and that breaks zero-downtime type deployments. In zero-downtime deployments, you will always have multiple app versions running simultaneously.
They mean the production database is not actually in sync with the
main
code branch. This obviously happens also when the pipeline is gated, such as only doing production deployment after approval. But this is an unexpected case, and any unexpected case introduces surprises for the development team.
Let's say we take care of all of these problems. Let's say we rigorously define and test everything. The big problem here is that, from the customer's point of view, this is the wrong thing to do. If the "down" migration were ever used in production, you would almost certainly lose data. Typical "down" migrations would involve dropping tables or columns; whatever customers used the system in the meantime could end up losing data. Customers don't care that we've gone back to a previous version: if we've done it right, most of them won't even know it's happened. But they DO know if their data is lost. And lost data is not good for a customer's view of our system's stability.
The solution
A lot of these problems above have potential solutions. For example, we could have product owners faithfully describe behavior, we could give teams rigorous rules for how they create their migrations, and we could implement extensive deploy/rollback end-to-end testing. But will we? And just as importantly, should we?
My contention is no, we shouldnʻt: that they are not actually necessary. Some of the most successful products I know of do not use them in practice.
The solution is that we should be building our database migrations so they are compatible in several ways:
New database changes must be written to work with older code. Sometimes this is automatic: if we add a new table in the "up" migration, it won't cause a problem. New tables with foreign keys to/from other new tables are also not a problem. But sometimes it does requires care. New tables that have foreign keys to existing tables can be an issue. Consider the following situation with two tables. Table_A is existing before the "up" migration; Table_B is being added. The newer code creates a record in both Table_A and one or more records in Table_B; the older code then deletes the record from just Table_A, not knowing that Table_B even exists. We can deal with this by using tools like cascading delete. But it is something that needs to considered.
New code must be written to work with old code. This one is harder, and isn't a database issue directly. But if you have two versions of code running simultaneously, as you will with zero-downtime deployment strategies, you'll need to deal with this.
Go back and look at my migration code example above. I (intentionally) violated one of these rules. Can you spot it? (Hint: It's on line 5.)
Unfortunately, a lot of times people don't take care of these compatibility issues. Looking at code with issues, like my example above, can easily escape people's attention if they are not thinking about compatibility. But that's a solvable problem, and in the long run, it is of more utility to our applications than using the quick fix of "down" migrations.
Much like the row of shot glasses at the top, they seemed like a good idea at the time...