Managing database schema changes without downtime (samsaffron.com)
267 points by jbaviat on March 24, 2018 | 51 comments



Having no downtime when doing migrations is not an easy topic.

> These defer drops will happen at least 30 minutes after the particular migration referenced ran (in the next migration cycle), giving us peace of mind that the new application code is in place.

Pretty much that's what I've seen before. You do a series of deployments + migration scripts.

Usually (sketched in SQL below):

- Migrate with backward-compatible modifications of the schema

- Release application logic that takes advantage of both migrated / to-be-deprecated tables

- Possible extra migration to sync the new / old tables.

- Release application logic that stops using the to-be-deprecated tables.

- Migrate with destructive modifications to remove the old table.
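
Concretely, the schema half of those steps might look something like this in SQL (hypothetical users/user_emails tables; steps 2 and 4 are application releases, not migrations):

    -- Step 1: backward-compatible change - create the new table alongside the old column
    CREATE TABLE user_emails (
      user_id bigint PRIMARY KEY REFERENCES users,
      email   text NOT NULL
    );

    -- (Step 2: release app code that writes to both users.email and user_emails)

    -- Step 3: sync rows written before the dual-writing code went live
    INSERT INTO user_emails (user_id, email)
    SELECT id, email FROM users
    ON CONFLICT (user_id) DO NOTHING;

    -- (Step 4: release app code that reads and writes only user_emails)

    -- Step 5: destructive cleanup once nothing references the old column
    ALTER TABLE users DROP COLUMN email;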

Usually that's a huge, complicated process which might otherwise be replaced with one migration and a few minutes of downtime. Sadly some companies can't afford the downtime, so they do such changes with weeks of planning.

I usually vote for having some planned non-working-hours downtime.


There's actually some precedent with very high volume services taking the minimal downtime approach to avoid the increased amount of development time. Take Airbnb for example [1]. They chose to take two weeks to migrate their Messaging database with a few minutes of downtime rather than take longer to do it with no downtime. In the article, they discuss how they thought about the tradeoffs.

[1] https://medium.com/airbnb-engineering/how-we-partitioned-air...


> Having no downtime when doing migrations is not an easy topic.

Depending on the nature of the migration, the type of data and service, and how coupled everything is, it can be impossible. Especially when you are dealing with financial data and transactions where you can't afford to screw up even a single one.

> I usually vote for having some planned non-working-hours downtime.

Yep. No other way around it. You tell your clients weeks/months ahead of the planned downtime. They'll understand, and you migrate in the middle of the night over a holiday weekend.

But this is in regard to major structural migrations, not simple schema updates or changes.


Are you saying no downtime is impossible or merely expensive?

Erlang/OTP with relups is an example of a system that provides online upgrade and rollback with zero downtime.


I'm talking about serious systematic migrations, not upgrades. In these instances, it is impossible to avoid downtime (whether it be a few seconds, minutes, or even hours). It's one of the reasons why SLAs do not offer 100% uptime.


I'd be interested in a concrete example of a migration that you see as impossible without downtime; I'm not aware of Amazon or Google taking their APIs down for maintenance, which suggests that you can solve these problems with enough engineering spend.

> depending on... how coupled everything is

If you're just saying here that some systems aren't currently able to migrate in a hitless way, rather than claiming that it would never be possible for some systems, then I fully agree. Hitless migrations require a lot of work, and probably aren't worth the cost for many applications/systems.


Neither Amazon nor Google runs off of an RDBMS.

Relational databases are awesome, but it's much harder to achieve 100% uptime compared to schemaless, distributed, replicated data stores. You can easily take down your system with one innocuous-looking DDL migration that works great in test but grinds the machine to a halt on a production dataset.
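
A classic example of that kind of innocuous-looking DDL (hypothetical table name; behavior as of PostgreSQL 10 and older):

    -- Instant on a small test dataset, but rewrites the whole table under an
    -- exclusive lock in production:
    ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

    -- A safer sequence: the nullable add is metadata-only, the default applies
    -- only to new rows, and the backfill runs in small batches
    ALTER TABLE orders ADD COLUMN status text;
    ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
    UPDATE orders SET status = 'pending' WHERE id BETWEEN 1 AND 10000;  -- repeat per batch
    -- ...and only then SET NOT NULL, which still needs a full-table scan to validate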



A good alternative to downtime: a read-only copy of the application.


As with most things, it depends, right? For one, you’d have to actually implement a read-only version of your application, including creating an understandable user experience for that scenario. That’s an additional task per feature, though, as opposed to an additional task per migration as described by the article.

Additionally, that could cause a lot of headaches for your users if you’re continuously deploying. Then again, I kind of question what you’re doing if you’re constantly running DB migrations for a user application.


Is it really much extra work to do the migration online instead of offline? It's just that second step where the application writes to both tables (and usually a step where reads switch to the new tables). Everything else is necessary for either method.

In many situations taking downtime for it is more expensive for development. When taking downtime, you often need to be really damn sure it works beforehand so you don't take unnecessary downtime. Having to roll back and try again later after fixing something (and eating more downtime) can be a bit embarrassing. When you are doing migrations without downtime you can take your time between each step and add additional validations that might be impractical when trying to minimize downtime.

If you don't need to minimize downtime then maybe it doesn't really matter but that situation seems rare from my perspective.


> In many situations taking downtime for it is more expensive for development.

This is absolutely true, and the main reason for this is that the tooling sucks. Actually testing that your migration will result in your intended database state is difficult and thus people simply don't do it.

I've attempted to remedy this situation by writing a diff tool for Postgres, so you can confirm your migrations will result in a matching schema, and autogenerate more correct migration scripts.

I also talked more about this topic at last year's PostgresOpen conference: https://www.youtube.com/watch?v=xr498W8oMRo


My teams have never had any difficulty with schema changes. To be honest, we don't even think of it as an issue.

1) No deleting or altering columns needed by any production code. Need something different? Create a new column. After weeks of the old column not being needed, we use 2) to delete it - but it isn't a priority.

2) Adding/modifying columns: just use pt-online-schema-change. I have seen this tool transform billion-row tables - it also supports master/slave replication.


This is the same story as with service dependencies. If you've doubled up on this, then you can actually hide your database dependency behind a service contract. If done perfectly (a high bar), you take full ownership of the migration and nobody external to your services should have any knowledge of, or care about, your database schema. Or your service implementation.

Nor is this where this trick ends. It is almost literally how Intel survived the supposedly superior instruction sets of RISC back in the RISC vs. CISC days. They turned their implementation contract of CISC into essentially a service contract of "the computer will execute this for you." Then they took advantage of the indirection and were able to get many of the RISC benefits in their computers.

Obviously, this can burn you at an extreme. That fun quote of another layer of abstraction, and all of that. I would be surprised if there is a prescriptive "this is how you do it" that is applicable to everyone. To note, if you only have your company as a client of either service or schema, it might be faster to just make the change in one shot.

Let that sink in. The 5-step process outlined above requires manpower to execute. Don't ignore that.


> That fun quote of another layer of abstraction...

Haven't heard, please elaborate.


Apologies, I used the wrong word. I should have said indirection. The David Wheeler quote on this page: https://en.wikipedia.org/wiki/Indirection "All problems in computer science can be solved by another level of indirection"


At GitLab we do the following:

We have two migration directories: db/migrate and db/post_migrate. If a migration adds something (e.g. a table or a column) or migrates data that existing code can deal with (e.g. populating a new table) then it goes in db/migrate. If a migration removes or updates something that first requires a code deploy, it goes in db/post_migrate. We combine this with a variety of helpers in various places to allow for zero downtime upgrades (if you're using PostgreSQL). For example, to remove a column we take these steps:

1. Deploy a new version of the code that ignores the column we will drop (this is a matter of adding `ignore_column :column_name` in the right class).

2. Run a post-deployment migration that removes said column.

3. In the next release we can remove the `ignore_column` line (we require that users upgrade at most one minor version for online upgrades).

The migrations in db/post_migrate are executed by default but you can opt-out by setting an environment variable. In case of GitLab.com this results in the following deployment procedure:

1. Deploy code with this variable set (so we don't run the post-deployment migrations)

2. Re-run `rake db:migrate` on a particular host without setting this environment variable.

For big data migrations we use Sidekiq to run them in the background. This removes the need for a deployment procedure taking hours, though it comes with some additional complexity.
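
In SQL terms (hypothetical table and column names, not GitLab's actual migrations), the split roughly means:

    -- db/migrate: runs before/alongside the deploy, so it must be additive -
    -- code from the previous release keeps working against it
    ALTER TABLE issues ADD COLUMN weight integer;

    -- db/post_migrate: runs only after the new code is live, so destructive
    -- changes are safe here
    ALTER TABLE issues DROP COLUMN old_weight;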

More information about this can be found at the following places:

1. https://docs.gitlab.com/ee/update/README.html#upgrading-with...

2. https://docs.gitlab.com/ee/development/what_requires_downtim...

3. https://docs.gitlab.com/ee/development/background_migrations...


It's worth noting that there are some edge cases where this could lead to problems. It's enough of an edge case that many apps can ignore the possibility. Take your example, with the added twist that the field you're removing has a unique database constraint. With your deploy process this would look something like this:

1. Run predeploy migration in db/migrate to remove the unique constraint

2. Deploy new version of code that ignores the column

3. Run post-deploy migration that removes the column

4. Remove the part of code that ignores the column

Failure to do step 1 would result in multiple rows being created with null values, which would cause errors for all but one (or zero) insert. The edge case, however, is a little more subtle. Between steps 1 and 2, there's a period where the database constraint doesn't exist, but the old code still expects the constraint. Validating uniqueness in the code can mitigate this, but doesn't ensure consistency.
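
In SQL terms (hypothetical table and constraint names), that ordering is roughly the following; the window between the first statement and the deploy of the column-ignoring code is the subtle part:

    -- Step 1 (pre-deploy, db/migrate): the constraint has to go before the app
    -- stops writing the column, otherwise inserts start putting NULLs into a
    -- column the constraint still covers
    ALTER TABLE users DROP CONSTRAINT users_legacy_token_key;

    -- Step 3 (post-deploy, db/post_migrate): drop the column once nothing reads or writes it
    ALTER TABLE users DROP COLUMN legacy_token;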


If you absolutely have to support this, a trigger-based approach is by far the best way to handle it: either make pervasive use of database views (with INSERT triggers to handle writes, so you can version the view and handle any extra validation logic during the migration at a temporary performance cost), or put a trigger on the old table that feeds the new schema and then swap the tables upon upgrading the software.
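
A minimal sketch of the second variant (trigger on the old table feeding the new one), assuming PostgreSQL, hypothetical table names, identical column layouts, and inserts only:

    -- New table with the target structure (here just a structural copy for illustration)
    CREATE TABLE new_orders (LIKE orders INCLUDING ALL);

    CREATE FUNCTION copy_to_new_orders() RETURNS trigger AS $$
    BEGIN
      INSERT INTO new_orders VALUES (NEW.*);  -- adapt the column mapping if the schemas differ
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_sync
    AFTER INSERT ON orders
    FOR EACH ROW EXECUTE PROCEDURE copy_to_new_orders();

    -- Backfill existing rows, then at cutover swap the names in one transaction:
    -- BEGIN;
    --   ALTER TABLE orders RENAME TO old_orders;
    --   ALTER TABLE new_orders RENAME TO orders;
    -- COMMIT;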

Both solutions suck in some way compared to schema changes that require downtime as far as simplicity is concerned, but at least they let you maintain consistency throughout the change and let you bail out midway if needed.

On that note, I'm glad all but a single system I maintain is only used by employees - just schedule a maintenance window after-hours and I'm free to take things down for a couple hours if needed.


This is not how we do it. Constraints, columns, etc. are removed using a post-deployment migration. This means you can't end up with NULL values and the like.


This helps ensure database consistency, but might not actually avoid errors. In this scenario, during the period between the app code being deployed and the post-deploy migration running, there could be application-level errors, because the app is now ignoring the deleted field while the database still expects unique values (or non-null, or whatever constraints you have).

You can solve these issues with some creative migration paths both at the app level and at the database level, but those changes aren't always trivial, and may not necessarily fit in the pre-deploy post-deploy migration strategy.

This may also not be an issue for you. You may be willing to accept a brief period of time where an error is unlikely, but technically possible. Personally, most of the time this is acceptable to me, but I always find it important to think about these scenarios in case something does happen.


Researcher/author of a tool [0,1] also attempting to tackle this problem here.

Unfortunately zero-downtime schema changes are even more complex than suggested here. Although the expand-contract method as described in the post is a good approach to tackling this problem, the mere act of altering a database table that is in active use is a dangerous one. I've already found that some trivial operations such as adding a new column to an existing table can block database clients from reading from that table through full table locks for the duration of the schema operation [2].

In many cases it's safer to create a new table, copy data over from the old table to the new table, and switch clients over. However this introduces a whole new set of problems: keeping data in sync between tables, "fixing" foreign key constraints, etc.

If there are others researching/building tooling for this problem, I'd love to hear from you.

[0] http://github.com/quantumdb/quantumdb

[1] https://speakerdeck.com/michaeldejong/icse-17-zero-downtime-...

[2] http://blog.minicom.nl/blog/2015/04/03/revisiting-profiling-...


I agree that this is a very complicated problem, and one that played a role in the rise of NoSQL. Granted, it isn't the only reason for the rise of NoSQL, but it contributed. This problem is worse when development teams and DBAs do not report to the same person. Good luck convincing a DBA about your change. Disclaimer: I have the utmost respect for DBAs.


I agree with that. If your database doesn't care about a "schema", there usually are no operations available which change the schema of your data. Unfortunately that also means that if your database doesn't consider the schema its problem, when you do finally want to change the implicit schema of your data, it becomes the dev's problem.


Make the changes on a non-active replica database (assuming your database supports replication), then promote the replica to be primary. If you don’t have replicas, the concept of “zero downtime” is really at risk as your primary database becomes a single point of failure.


That's not great either. Moving the "master" role from one server to a replica isn't instantaneous (meaning you have a period of read-only state), and while your replica is performing the schema operation you still need to keep the data in sync with changes made on the master server (which to the best of my knowledge is non-trivial).

There are tools that "replay" data changes though:

- FB's OSC (https://www.facebook.com/notes/mysql-at-facebook/online-sche...)

- GH's gh-ost (https://github.com/github/gh-ost)


This is a more or less solved problem in MySQL with pt-online-schema-change [1], which accomplishes “create a new table ...” with the “copying data over” done for you using triggers and the “switch clients over” done using a table rename, all within one tool, with the longest full table lock held during the rename step. Some of this appears to be available in MySQL itself as of 5.6, but regardless, all of this is fraught with edge cases.

[1] https://www.percona.com/doc/percona-toolkit/LATEST/pt-online...


Indeed. It seems there are many such tools available for MySQL: Openark kit [0] for instance, whose author also brought us gh-ost [1]. Unfortunately most tools focus on just MySQL, and none really have an answer when you want to use (and enforce) foreign key constraints. Although it's hard to say, I suspect the users of these tools have "given up" on foreign key constraints.

[0] http://code.openark.org/forge/openark-kit

[1] https://github.com/github/gh-ost


Significantly altering an active table without downtime is a problem that will not, and should not, ever be solved by a library. It's too fraught with peril and special case fun for any competent devops person to leave it up to external code.


The work that goes into "external code" is higher quality than most companies can achieve. The cost of developing a similar tool may also be outside what the company can afford to invest.

If your business is online schema changes, then inventing it yourself makes sense. Otherwise you are likely throwing money away to create an inferior product.


Agreed. But to be fair, I've not yet encountered such a library/tool. All the tools I've seen leave the active table alone, and simply create a new table, do all their work there, and finally somehow switch the two tables (either in naming or by routing queries). Some even perform the work on a different database server.


We're using FlywayDB [0] and many of the ideas in Martin Fowler's _Evolutionary DB_ article [1]. When database changes are breaking, we use a pre-deploy and post-deploy "pair" to create an intermediate DB state that both application versions will run against.

[0] https://flywaydb.org/documentation/migrations

[1] https://www.martinfowler.com/articles/evodb.html


Liquibase [0] is an alternative to FlywayDB. Unfortunately, FlywayDB changed their feature scope for the free version.

[0] https://www.liquibase.org/


I haven't run into a feature that I need from either of the paid plans. What features were removed that you find compelling?

As an aside, Liquibase also effectively has a paid tier - it's just that it's marketed on a different web-site than the OSS version (http://www.datical.com/liquibase/).


Dry-runs and rollbacks (Flyway's undos). Especially rollbacks are useful for development.


I don't remember roll-backs ever being in the OSS project either. We always do potentially destructive operations in two different migrations since you'll never get data back once you delete it. We tend to favor soft-deletes anyway which makes life a lot easier but can bloat your DB.


> Especially rollbacks are useful for development

I abuse Liquibase to generate sql for me by diffing a reference db to create the changeset.

Won't a rollback just compare your new schema against your old schema and then apply that changeset? Take into account that this might/will drop data.

But looking at the flywaydb undo docs [0] I don't see it address anything more than the schema?

[0]: https://flywaydb.org/documentation/command/undo

edit: for those interested I recently moved my shell script solution to python[1].

[1]: https://github.com/Morabaraba/python-liquibase


Liquibase is missing some of the automated aspects that this blog post talks about, but I know Nathan, the developer, is very amenable to people adding changes to the code base, since I did that in the past with Sybase and SQL file support. I think more needs to be done to make always-on migrations easier on developers, but it is doable.


There is also anchor modeling [1] with quite a few publications [2] and even an online modelling tool [3] to play around with. It is essentially a method that yields a highly normalized bitemporal database model so that every prior database state, schema and data, is a subset of the current database state.

I was personally not able to use it yet, but I really like the idea behind it. It is probably not the ideal choice in general, due to its append-only nature (the consumed storage space only ever increases) and its high degree of normalization, which may or may not have a negative impact on performance when implemented on top of a relational database, depending on the nature of your queries.

But if you do not have to deal with excessive amounts of data, if you expect a lot of schema evolution, or if you need traceability of all changes for auditing purposes or such, then using anchor modeling might be worth a shot.

[1] https://en.wikipedia.org/wiki/Anchor_modeling

[2] http://www.anchormodeling.com/?page_id=360

[3] http://www.anchormodeling.com/modeler/latest/
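
A heavily simplified sketch of the idea (my own naming, not the method's official 6NF conventions): identities live in anchor tables, each attribute gets its own append-only history table, and schema evolution mostly means adding new tables rather than altering existing ones.

    -- Anchor: identity only, never altered
    CREATE TABLE actor_anchor (
      actor_id bigserial PRIMARY KEY
    );

    -- One attribute, stored as append-only history (real anchor modeling is bitemporal)
    CREATE TABLE actor_name (
      actor_id   bigint NOT NULL REFERENCES actor_anchor,
      name       text NOT NULL,
      changed_at timestamptz NOT NULL DEFAULT now(),
      PRIMARY KEY (actor_id, changed_at)
    );

    -- Adding a "rating" later is a new table, not an ALTER on an existing one
    CREATE TABLE actor_rating (
      actor_id   bigint NOT NULL REFERENCES actor_anchor,
      rating     integer NOT NULL,
      changed_at timestamptz NOT NULL DEFAULT now(),
      PRIMARY KEY (actor_id, changed_at)
    );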


This looks like a helpful approach to a tricky topic! I'm curious how the post-deploy migrations work when you run all your migrations at once, from the beginning, e.g. in Circle CI or when onboarding a new developer?

Also, do you do anything to protect yourself against migrations written at time t_0 and then run much later at time t_n? I've seen a lot of problems there when migrations use application code (e.g. ActiveRecord models), and then that code changes. (My solution is to never call model code from migrations, but there are other ways.) This isn't really specific to your article I guess, but does your approach make managing that harder? Easier?

Does having that details table help at all when you have migrations on a long-lived branch that is merged after other migrations have been added? Or is the solution there just to rebase the branch and test before merging it in?

EDIT: I think this blog post by the Google SRE folks is great reading for people thinking about migrations and deployment:

https://cloudplatform.googleblog.com/2017/03/reliable-releas...

I've never been comfortable with rolling back migrations in production, but their plan changed my mind that it can be done safely. Is your approach compatible with theirs?


This is cool, but I've started reducing my need for migrations by serializing most of the structured pieces of the record into a text string (JSON, for example), and only having separate columns for things that need to be joined on (usually an id column) or selected on/indexed. I find this approach prevents me from needing to do migrations nearly as often. If you find that you really do need to index on a new field, adding a column is the easiest kind of migration to do.

Of course sometimes you still do need a migration, which is where strategies like this come in.
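
A rough sketch of that layout in PostgreSQL (hypothetical names; jsonb is just one way to store the serialized part):

    CREATE TABLE events (
      id         bigserial PRIMARY KEY,
      user_id    bigint NOT NULL,        -- joined on, so a real column
      created_at timestamptz NOT NULL,   -- filtered/sorted on, so a real column
      payload    jsonb NOT NULL          -- everything else; field changes need no migration
    );
    CREATE INDEX ON events (user_id, created_at);

    -- Promoting a payload field to a real, indexable column later is the easy kind of migration:
    ALTER TABLE events ADD COLUMN country text;
    UPDATE events SET country = payload->>'country';
    CREATE INDEX ON events (country);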


>Migrate database to new schema... spin up new instance

Sadly this is more of a pain in Django, since it does not write default values to the database. To be able to safely run the migration first, one must manually rewrite the SQL for the migration. [1]

[1]: http://pankrat.github.io/2015/django-migrations-without-down...
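
For context, this is roughly the difference (hypothetical table/column; the exact SQL Django generates varies by backend and version). Django adds the column with the default and then immediately drops the database-level default, so the linked approach hand-edits the migration SQL to keep it:

    -- Roughly what a Django AddField with a default produces:
    ALTER TABLE app_article ADD COLUMN archived boolean NOT NULL DEFAULT false;
    ALTER TABLE app_article ALTER COLUMN archived DROP DEFAULT;

    -- Hand-tweaked variant so the migration can safely run before the new code deploys:
    -- keep the default, so INSERTs from old code (which never mention the column) still succeed
    ALTER TABLE app_article ADD COLUMN archived boolean NOT NULL DEFAULT false;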


Sigh, wishing there were some magic sauce for this. But currently I will have to sit through 14 HOURS of downtime while my team upgrades our SonarQube DB to 6.7.


Just put your database schema in a VCS and get on with life: http://www.liquibase.org/


Is it really 0 downtime as the title suggests, or minimal downtime?


If you do it right, it's zero downtime (i.e. no DB locks held for long enough to cause timeouts to clients).


In most cases. However, some changes, such as adding a column and then needing to manipulate data (populating nulls, transforms, etc.), could still cause locking.

This really comes down to knowing the boundaries of what changes could cause timeouts and working around them as best you can. However, some edge cases could cause downtime.


I wonder how it is done in DynamoDB or Mongo.


I’d love to see an article about letting your users fucking deal with it, because no one really needs this much uptime anymore.


This is probably not going to get a lot of love but it's really not a terrible answer for many, many apps - especially anything business related where you can assume that most users work single-country office hours.

This exact same compatibility problem exists at other boundaries of your app - in particular, the web/backend interface.


The one thing that does make users angry is losing unsaved data. If that can be avoided then I don't think many would complain about a few minutes downtime once a month or even once a week.



