Skip to content

Instantly share code, notes, and snippets.

@eevee
Last active July 26, 2018 09:18
Show Gist options
  • Save eevee/6a257a9d42400e2d03f9 to your computer and use it in GitHub Desktop.
Save eevee/6a257a9d42400e2d03f9 to your computer and use it in GitHub Desktop.
Pokédex schema versioning problem

Schema versioning

I have some problems related to how core data in the Pokémon games has changed over time. "Time" here is a discrete quantity, where each game is a point in time.

Well actually

I just said each game is a point in time and I am lying. Many of the games come in pairs, which come out simultaneously. So Red and Blue are distinct games, but they are almost identical, except when they aren't. The current schema adds a layer of indirection called a "version group", but for the most part it's just been cumbersome and confusing.

There are also side games that are somewhat compatible to the main series, like Stadium, an N64 game that allowed you to transfer your Pokémon from your game and battle with them. The game exists, and has a couple of differences from the main-series games of the time, but placing it in the main-series chronology as a point in time doesn't quite make sense.

As an additional wrinkle, there are games that are completely incompatible with the main series, but which bear the Pokémon name and borrow many of the mechanics — primarily Pokémon Mystery Dungeon, Pokémon Ranger, and Pokémon Conquest. Putting these in the chronology doesn't make a lot of sense, either.

Oh and there have been a couple very rare cases where very minor details changed in the same game between different language releases. But I'm okay ignoring those, especially since it looks like we'll be doing simultaneous identical worldwide releases from now on.

Core entities

These are the primary elements of the game: Pokémon, moves, types, abilities, and whatnot.

They have some number of properties, which is not a surprise. A subset of the Pokémon table might look like this:

identifier parent capture_rate
bulbasaur 45
ivysaur bulbasaur 45
venusaur ivysaur 45

But wait, oops! X and Y changed the capture rate for a bunch of Pokémon. So now I would need a table like this:

identifier parent
bulbasaur
ivysaur bulbasaur
venusaur ivysaur

And ALSO a table (pokemon_data?) like this:

pokemon game capture_rate
bulbasaur red 45
bulbasaur blue 45
bulbasaur yellow 45
bulbasaur gold 45
bulbasaur silver 45
bulbasaur crystal 45
bulbasaur ruby 45
bulbasaur sapphire 45
bulbasaur fire-red 45
bulbasaur leaf-green 45
bulbasaur emerald 45
bulbasaur diamond 45
bulbasaur pearl 45
bulbasaur platinum 45
bulbasaur heart-gold 45
bulbasaur soul-silver 45
bulbasaur black 45
bulbasaur white 45
bulbasaur black-2 45
bulbasaur white-2 45
bulbasaur x 40
bulbasaur y 40
bulbasaur omega-ruby 40
bulbasaur alpha-sapphire 40

I feel in my soul that this may be slightly ridiculous.


The only obvious alternative is to keep "current" data in the main table:

identifier parent capture_rate
bulbasaur 40
ivysaur bulbasaur 40
venusaur ivysaur 40

And then have a "history" table for changes:

pokemon last_version capture_rate
bulbasaur white-2 45
ivysaur white-2 45
venusaur white-2 45

This is certainly more compact. But it's also very awkward to query, especially if all you want to know is "what was Bulbasaur's capture rate in Leaf Green"? And it relies on ordering the games, when their identifiers aren't inherently ordered. It doesn't really make sense to say "up through" White 2 anyway — what if I said up through Black 2? They were released at the same time.


It's worth noting that Pokémon evolution (the "parent" field above) has also changed, rather frequently. But only in the form of adding entirely new Pokémon. So it's easy to tell that Pikachu doesn't evolve from Pichu in Pokémon Red, because Pichu didn't yet exist at all.


A similar problem happens with one-to-many data, like Pokémon types. A Pokémon may have one or two types, thus:

pokemon type order
bulbasaur grass 1
bulbasaur poison 2
ivysaur grass 1
ivysaur poison 2
venusaur grass 1
venusaur poison 2

But, alas, these too have changed. New types have been introduced twice, and both times, existing Pokémon gained a new type. So now I need to do this:

pokemon game type order
magnemite red electric 1
magnemite blue electric 1
magnemite yellow electric 1
magnemite gold electric 1
magnemite gold steel 2
magnemite silver electric 1
magnemite silver steel 2
magnemite crystal electric 1
magnemite crystal steel 2

...and so on.

Until now, I actually haven't been doing the above; there's only been a single set of types. The reason was that the Pokémon who changed (like Magnemite) only grew a new type, which hadn't existed previously. So much like the evolution case, it was easy to "correct" the data for older versions.

X and Y have (once again) added a huge wrinkle. Clefairy has historically been normal-type, but now is fairy-type. This is a change, not an addition, which has never happened before.

Type efficacy

Types in Pokémon primarily serve as damage modifiers. Type A attacking type B might be super effective (×2), not very effective (×½), or not at all effective (×0). This matrix of interactions is usually called the type chart.

Naturally I have a table as follows:

attacker_type target_type modifier
electric water 200
electric normal 100
electric dragon 50
electric ground 0

The problem, of course, is that the type chart has also changed twice. So this table, too, needs to be duplicated once for every game.

There is one alternative here that doesn't readily present itself anywhere else. Since the type chart as a whole has only changed twice, I could make "type chart" itself an entity. Then I'd only have to store the data three times, and associate each game with one of the three type charts as appropriate.

However, in the grand scheme of things this is probably the smallest table that would need duplicating, so it doesn't seem worth the pain of doing things differently.

Names

Names are localized, so we already have a composite key for them, e.g.:

pokemon in_language name
eevee en eevee
eevee ja イーブイ

But names have, on rare occasions, also changed. Before Black and White, most names were actually rendered in all caps throughout the games, even though we refer to them (and falsely store them) as Title Case. Then X and Y increased the maximum length of move names, so several moves that were crammed awkwardly into the old limit have now grown spaces. Thus the move Thunder Shock has actually had three different names: THUNDERSHOCK, ThunderShock, and Thunder Shock.

The obvious thing to do is to add to the composite key.

move in_language game name
thunder-shock en red THUNDERSHOCK
thunder-shock en black ThunderShock
thunder-shock en x Thunder Shock

But... that example doesn't really tell the whole story. The table would actually contain, just for this one move in one language:

move in_language game name
thunder-shock en red THUNDERSHOCK
thunder-shock en blue THUNDERSHOCK
thunder-shock en yellow THUNDERSHOCK
thunder-shock en gold THUNDERSHOCK
thunder-shock en silver THUNDERSHOCK
thunder-shock en crystal THUNDERSHOCK
thunder-shock en ruby THUNDERSHOCK
thunder-shock en sapphire THUNDERSHOCK
thunder-shock en fire-red THUNDERSHOCK
thunder-shock en leaf-green THUNDERSHOCK
thunder-shock en diamond THUNDERSHOCK
thunder-shock en pearl THUNDERSHOCK
thunder-shock en platinum THUNDERSHOCK
thunder-shock en heart-gold THUNDERSHOCK
thunder-shock en soul-silver THUNDERSHOCK
thunder-shock en black ThunderShock
thunder-shock en white ThunderShock
thunder-shock en black-2 ThunderShock
thunder-shock en white-2 ThunderShock
thunder-shock en x Thunder Shock
thunder-shock en y Thunder Shock
thunder-shock en omega-ruby Thunder Shock
thunder-shock en alpha-sapphire Thunder Shock

This seems, to me, a slightly excessive way to express "fixed the case, then added a space".

Again, the only alternative is a "history" table — which has the same drawbacks as before, plus the added complication of an existing composite key.

Effects

The database contains descriptions of move effects, item effects, and the like.

These, too, have changed over time. But I especially don't want to duplicate this prose, because:

  • Prose is something we write by hand, not something we can just extract from the games once and be done with. Even prose for very old games might be updated now, as our understanding improves. So if prose were duplicated per game, it would be very easy for it to get out of sync across games where it should be identical. (We have enough trouble already, keeping prose similar for similar effects.)

  • Ideally we'll have localized versions of prose too, and duplicates make that all the more difficult.

  • It's useful to see how effects changed over time, but not necessarily easy to make this happen with prose. Inline diffs are not particularly reliable things, so we'd likely end up explicitly writing prose that's diff-friendly, which is even more of a pain.

I only really have two ideas for addressing this.

  1. Have a single prose description, plus a history table that expresses changes in relative terms, e.g. "Prior to Gold and Silver, the user skipped its recharge turn if this move fainted the target." Advantage is that it's easier to understand; disadvantage is that it gets kind of hairy for moves that have changed a lot or that have changed very significantly.

  2. Have a single prose description, with some kind of invented inline markup for expressing changes — i.e., write and maintain our own diffs, manually. The disadvantage, of course, is tedium. But the advantage is a reduction in how much overall stuff there is to write and maintain.

  3. Possibly outlandishly impractical, but a third option is to get the prose out of the database entirely, and instead represent move effects in as data-oriented a way as possible. Which is easy for "30% chance to paralyze", but more difficult for "applies a curse if the user is a ghost otherwise applies these stat changes".

So

Are there any great solutions I've overlooked? For the moment I'm heavily leaning towards just adding a game column to almost every goddamn table and duplicating the hell out of everything. The db will balloon in size and there'll be a lot of UI work to make any sense of it, but it's the most flexible and the most correct. It's also nice to have more or less the same versioning story throughout, whereas right now it's kind of ad-hoc and just scoped to whatever we needed at the time.

The thing that most bothers me is knowing how much unnecessary duplication there'll be for games in a pair. But version groups just kinda suck.

@myahl
Copy link

myahl commented Jan 16, 2015

Are separate schemas out of the question? For instance, schema GEN1 contains data pertaining to Red, Blue, and Yellow. GEN2 contains data pertaining to Gold, Silver, and Crystal. GEN3 contains Ruby, Sapphire, Emerald, AND Fire Red and Leaf Green. So on and so forth. You could do schemas per game, but this seems a bit excessive. Of course, you are completely correct that this could cause significant data duplication amongst games that didn't change much from their predecessors. In cases where there were no changes, that data could live in another schema called COMMON or something like that. This schema would contain anything prone to excessive duplication, like the original 150 Pokemon. Sure, the types and moves and whatnot have changed, but I've never seen a Pokemon removed or renamed. If something does radically change in the future, move that data out of the COMMON schema and into wherever it needs to go. But how do we know if data in a particular field is common or unique? Flag it. If it's unique, the field will contain meaningful data. If it's common between other games/gens, the field would contain some sort of flag or reference to where the common data is located. This is just me brainstorming, so feel free to build on it, correct me, or tear it apart with constructive criticism. 😃

Edit: And maybe a separate schema for side games.

@encukou
Copy link

encukou commented Jan 19, 2015

Umm, what are the benefits of separate schemas vs. version columns? Seems to me the amount of data duplication would be the same. Querying across two schemas (GENx & COMMON) sounds painful, before I even consider having to handle pieces being moved out of COMMON when I upgrade pokedex.
Separate schemas would also make it harder to answer questions like "give me all moves whose Italian name changed in some way other than fixing the case".

@exectails
Copy link

exectails commented Jul 24, 2018

Late to this party, and it seems like you're working on a solution, but wanted to give my two cents. I think wanting to cram data for different game series into the same database is a mistake. While I'm sure there are cross-overs between them, some of them are most likely quite a bit different. There's also the matter of scope and interest. There'll surely be some people who're curious how some move behaves differently in Stadium, but is that what the majority uses this data for? And are there enough contributors who would even try to navigate the data to add information for this or other even less related spin-off games? Is this worth the complications?

If you took the spin-offs out of the equation, either by putting them into their own database or by removing them entirely, you could operate on "points in time" much better, because the games did come out in a chronological order, and as such, you could version them. Once every game and "interation group" has its own "version id" you can easily find the data for that or the next best version, list the changes, etc. Even changes in the same generation wouldn't be difficult to handle, the one that came later gets a higher version (for example, 010300 for Gen 1, iteration 3, for the western RBY releases) and to query the last best information you simply get the last possible data in the 01XXXX range. I imagine with that setup it would be difficult for Game Freak to do something that would throw the whole database into disarray, as even the changes you mentioned could've been handled with a single duplicate row with a new version number. In that system the moves table could presumably also be shrunken down quite a a bit, because the moves don't change for all pokemon in each version.

In my opinion that would be the most viable and flexible approach, as all other solutions, as you described, seem difficult to handle or create way too much duplication to feel comfortable with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment