Last active December 30, 2017 15:12
Perfect, downtime-free table migrations in redshift, hive, and other columnar stores
BEGIN TRANSACTION; -- omit if not supported
-- may have many more columns below
CREATE TABLE "production"."events_new" ("referer" character varying(1024) ENCODE zstd, "subscriptions" character varying(1024) ENCODE zstd, "smtpid" character varying(1024) ENCODE zstd);
-- inserts segmented by time allow bringing data over incrementally
-- facilitates restarting and recovering from problems
-- also if transaction locking occurs, rows will be locked incrementally (assuming you have row-level locking)
INSERT INTO "production"."events_new" SELECT * FROM "production"."events" WHERE _metadata__timestamp::date = '2017-06-29';
-- many more time segments here if need be
-- Also, if you want to minimize the latency in the new table, you can add a final insert filtered on the full timestamp
-- ONLY do this if you have either a sortkey (Redshift) or partition/bucket (Hive, many other database systems) on the date
INSERT INTO "production"."events_new" SELECT * FROM "production"."events" WHERE
-- TODAY
_metadata__timestamp::date = '2017-07-29'
-- Static timestamp for now
AND _metadata__timestamp <= 1501329600; -- NOW
COMMIT;
BEGIN TRANSACTION;
ALTER TABLE "production"."events" RENAME TO "events_old";
ALTER TABLE "production"."events_new" RENAME TO "events";
COMMIT;
BEGIN TRANSACTION;
-- Bring over events for today and tomorrow - it's already tomorrow somewhere in the world.
-- You don't need tomorrow if all your data is in UTC and you're not close to midnight UTC.
INSERT INTO "production"."events" SELECT * FROM "production"."events_old" WHERE _metadata__timestamp::date = '2017-10-16'
-- If you did the copy-to-current-timestamp step above - note the sortkey/partition warning there about physical table layout
AND _metadata__timestamp > 1501329600;
INSERT INTO "production"."events" SELECT * FROM "production"."events_old" WHERE _metadata__timestamp::date = '2017-10-17';
COMMIT;
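The whole procedure can be sketched end to end in Python. This is a hypothetical illustration only: sqlite3 stands in for Redshift/Hive (it happens to support `ALTER TABLE ... RENAME TO`), and the table name, column names, and dates are invented stand-ins for the production schema above.

```python
# Sketch of the incremental copy + rename cut-over, using sqlite3 as a
# stand-in for a columnar store. All names here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Existing production table with some historical rows.
cur.execute("CREATE TABLE events (ts TEXT, payload TEXT)")
cur.executemany("INSERT INTO events VALUES (?, ?)",
                [("2017-06-29", "a"), ("2017-07-29", "b")])
conn.commit()

# 1. Create the new table with the desired schema.
cur.execute("CREATE TABLE events_new (ts TEXT, payload TEXT)")

# 2. Copy data over one date segment at a time, committing per segment,
#    so the job can be restarted from the last completed segment.
for day in ("2017-06-29", "2017-07-29"):
    cur.execute("INSERT INTO events_new SELECT * FROM events WHERE ts = ?",
                (day,))
    conn.commit()

# 3. Swap the tables by renaming; the old table is kept intact so the
#    prior state can always be recovered.
cur.execute("ALTER TABLE events RENAME TO events_old")
cur.execute("ALTER TABLE events_new RENAME TO events")
conn.commit()

# 4. Catch-up copy: anything written to the old table during the swap
#    window is brought across (empty in this toy run).
cur.execute("INSERT INTO events SELECT * FROM events_old "
            "WHERE ts > '2017-07-29'")
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 2
```

The per-segment commits are the point: on failure you re-run only the segments that didn't commit, and the rename swap itself touches no data.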
What makes this perfect is that there is no loss of production availability, either to read data or to write events to the correct table.
Even in a non-transactional environment, if any current events are sent to the old table during the swap, the final catch-up copy brings them over.
There is also little risk of data loss, because operating by renaming means the old table is preserved and you can always recover your prior state.
The only remaining risk of data loss is if your event streaming solution doesn't buffer events until they are finally committed to the database store. In that case, cutting over to the new table is risky when the schemas don't match; that said, if you can't buffer events somewhere, you will always lose events whenever you have to change your table schema.
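The buffering idea can be sketched as follows. This is a hypothetical illustration, not a real streaming client: the `BufferedWriter` class, the `sink` callable, and the simulated failure are all invented to show why events should leave the buffer only after the store accepts them.

```python
# Sketch: hold events in a buffer until the database confirms the write,
# so a transient failure (e.g. the table swap window) drops nothing.
from collections import deque

class BufferedWriter:
    def __init__(self, sink):
        self.sink = sink       # callable that persists one event; may raise
        self.buffer = deque()  # events held until successfully committed

    def write(self, event):
        self.buffer.append(event)
        self.flush()

    def flush(self):
        # An event is removed from the buffer only AFTER the sink accepts
        # it; a failed write stays queued and is retried on the next flush.
        while self.buffer:
            try:
                self.sink(self.buffer[0])
            except Exception:
                return  # keep buffered; retry later
            self.buffer.popleft()

# Usage: a sink that fails once (simulating the cut-over), then recovers.
stored, calls = [], [0]
def flaky_sink(event):
    calls[0] += 1
    if calls[0] == 1:
        raise RuntimeError("table being swapped")
    stored.append(event)

w = BufferedWriter(flaky_sink)
w.write("e1")   # first attempt fails; the event stays buffered
w.flush()       # retry succeeds
print(stored)   # ['e1']
```

Without such a buffer, any event emitted while the schemas disagree is simply gone, which is the failure mode the paragraph above warns about.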