GeoJSON is a widely used format for encoding geographic data. It's flexible and human-readable, and because it's just JSON it's easy to integrate into web applications.
But it has some real warts, and if we wanted to we could certainly come up with a better format. After tweeting about my frustrations, I was asked to elaborate. Here goes:
GeoJSON geometries can be one of seven types: Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon and GeometryCollection.
I've never seen a GeometryCollection in the wild, but let's be generous and assume they do exist. That leaves six, three of which are completely unnecessary: Point, LineString and Polygon.
They're unnecessary because these two things are functionally equivalent:
{
  type: 'Polygon',
  coordinates: [ outerRing, hole1, hole2, ... ]
}

{
  type: 'MultiPolygon',
  coordinates: [[ outerRing, hole1, hole2, ... ]]
}
The Polygon version is a few bytes shorter (a difference that in real-world applications will evaporate due to gzip), but apart from that it's just a special case of a MultiPolygon.
But because that special case exists, code like this exists in just about every application that works with GeoJSON directly:
if ( geometry.type === 'Polygon' ) {
  renderPolygon( geometry.coordinates );
} else {
  geometry.coordinates.forEach( renderPolygon );
}
Something like that – along with the constant mental context switching that goes along with it (wait, at this point in the code am I dealing with a coordinate pair? a ring? something else?) – has to exist every time you touch GeoJSON. The cost of that special case is astronomical in proportion to its benefit. The same goes for LineString and Point.
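One way to confine that branch to a single place is to normalize at the boundary, so the rest of the code only ever sees MultiPolygon-style coordinates. This is a minimal sketch; `toMultiPolygonCoordinates` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: returns MultiPolygon-style coordinates
// (an array of polygons) for either geometry type.
function toMultiPolygonCoordinates(geometry) {
  if (geometry.type === 'Polygon') {
    // A Polygon is a MultiPolygon with exactly one member, so wrap it.
    return [geometry.coordinates];
  }
  if (geometry.type === 'MultiPolygon') {
    return geometry.coordinates;
  }
  throw new TypeError(`Unsupported geometry type: ${geometry.type}`);
}

const square = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]];
const singlePolygon = { type: 'Polygon', coordinates: [square] };
const twoPolygons = { type: 'MultiPolygon', coordinates: [[square], [square]] };

// Both results can now be handled by the same loop over polygons.
const fromSingle = toMultiPolygonCoordinates(singlePolygon);
const fromMulti = toMultiPolygonCoordinates(twoPolygons);
```

With this in place, rendering code can always iterate over the result, at the cost of one extra wrapper array per Polygon.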
Right now I'm working with a clipping library (I won't name and shame) that can return either Polygon coordinates or MultiPolygon coordinates. And it doesn't tell you which! You have to figure out by yourself whether the second and third items are separate polygons, or holes in the first one. That sort of confusion is deeply harmful to productivity, and totally unnecessary.
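A workaround for that kind of ambiguity is to count how deeply the arrays are nested: Polygon coordinates are three levels deep (rings, points, numbers) and MultiPolygon coordinates are four. The sketch below assumes the result is non-empty and well-formed; `nestingDepth` and `normalizeClipResult` are hypothetical names, not functions from the library in question.

```javascript
// Counts how many array levels wrap the first number in the structure.
function nestingDepth(value) {
  let depth = 0;
  while (Array.isArray(value)) {
    depth += 1;
    value = value[0];
  }
  return depth;
}

// Hypothetical helper: always returns MultiPolygon-style coordinates.
function normalizeClipResult(coordinates) {
  const depth = nestingDepth(coordinates);
  if (depth === 3) return [coordinates]; // Polygon: rings > points > numbers
  if (depth === 4) return coordinates;   // MultiPolygon: polygons > rings > points > numbers
  throw new TypeError(`Unexpected nesting depth: ${depth}`);
}

const triangle = [[0, 0], [1, 0], [1, 1], [0, 0]];
const asPolygon = normalizeClipResult([triangle]);
const asMultiPolygon = normalizeClipResult([[triangle], [triangle]]);
```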
Another example of redundancy is the fact that a Polygon ring must end with a coordinate pair that matches the first one. Why? In many applications your code for handling polygons will share functions with your code for handling line strings, and I've had to write code like this more times than I can count:
// Skip the duplicated closing coordinate when the line is a polygon ring
const end = /Polygon/.test( type ) ? line.length - 1 : line.length;
for ( let i = 0; i < end; i += 1 ) {
  doSomethingWith( line[i] );
}
Of course there are some cases where it is easier to iterate over an array of coordinate pairs that ends where it started – but in my experience it's almost always easier to adapt code that expects a closed ring than code that expects a non-closed ring. Bonus: the file gets smaller.
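An alternative to scattering that conditional around is to strip the duplicated closing coordinate once, up front. A minimal sketch, assuming rings are arrays of numeric coordinate arrays; `openRing` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the ring without its duplicated closing
// coordinate. Rings that are already open are returned unchanged.
function openRing(ring) {
  if (ring.length < 2) return ring;
  const first = ring[0];
  const last = ring[ring.length - 1];
  const closed =
    first.length === last.length &&
    first.every((value, i) => value === last[i]);
  return closed ? ring.slice(0, -1) : ring;
}

const closedRing = [[0, 0], [1, 0], [1, 1], [0, 0]];
const opened = openRing(closedRing);
// opened holds the three distinct corners; closedRing is not modified.
```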
Every single point in a GeoJSON file gets its own array. That's terrible for performance, because allocating arrays isn't free, and garbage collecting them is liable to cause jank. Performant code relies on flat structures.
Instead of this...
[ [ x0, y0 ], [ x1, y1 ], [ x2, y2 ], ... ]
...we could do this:
[ x0, y0, x1, y1, x2, y2, ... ]
If you ever need to write any WebGL code, or find yourself triangulating your geometry, you'll quickly find that this is a more convenient way of working.
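Converting existing nested coordinates to the flat layout is a short loop. This sketch assumes two-dimensional points; `flattenRing` is a hypothetical helper name.

```javascript
// Hypothetical helper: turns [[x0, y0], [x1, y1], ...] into [x0, y0, x1, y1, ...].
// Assumes every point has exactly two components.
function flattenRing(ring) {
  const flat = new Array(ring.length * 2);
  for (let i = 0; i < ring.length; i += 1) {
    flat[i * 2] = ring[i][0];
    flat[i * 2 + 1] = ring[i][1];
  }
  return flat;
}

const nested = [[0, 0], [4, 0], [4, 3]];
const flattened = flattenRing(nested);
// flattened is [0, 0, 4, 0, 4, 3]
```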
Perhaps you're thinking that it'll make things harder, because instead of doing this...
ring.forEach( coords => {
  ctx.lineTo( coords[0], coords[1] );
});
...you'd have to do this...
for ( let i = 0; i < ring.length; i += 2 ) {
  ctx.lineTo( ring[i], ring[i+1] );
}
...but that's a good thing, because the second example will be much faster. The right data structure encourages the right programming habits.
As a bonus, it's very easy to convert those flat arrays to typed arrays, which have excellent performance characteristics (because the browser is able to make stronger guarantees about their behaviour). You can also do really cool things like instantly transferring the data to a web worker to do expensive computation off the main thread, without the cost of serialization/deserialization.
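As a rough illustration of that conversion, the sketch below copies a flat array into a Float64Array. The worker hand-off is shown only as a comment, because it assumes a browser context and an already-created Worker (called `worker` here, a hypothetical name).

```javascript
// A flat ring of four 2D points.
const flatRing = [0, 0, 10, 0, 10, 10, 0, 10];

// Copy the plain array into a typed array backed by a single ArrayBuffer.
const typedRing = Float64Array.from(flatRing);

// In a browser, the underlying buffer could then be transferred rather than copied:
//   worker.postMessage({ ring: typedRing }, [typedRing.buffer]);
// After the transfer, the buffer is detached on the sending side, so
// typedRing can no longer be read there.
```

Float64Array keeps full double precision; a Float32Array would halve the memory at the cost of precision, which may or may not matter for a given dataset.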
One thing to note: if you have a flat array, you can't detect the dimensionality of the data by querying the first point. But that's also a good thing – it forces you to be explicit.
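Being explicit could be as simple as storing the dimensionality next to the flat array and using it as the loop stride. This is only a sketch of one possible shape: the `dimensions` field and the `forEachPoint` helper are hypothetical, not part of GeoJSON.

```javascript
// Hypothetical geometry shape: a flat coordinate array plus an explicit stride.
const line3d = {
  dimensions: 3,
  coordinates: [0, 0, 5, 1, 1, 6, 2, 0, 7] // x, y, z triples
};

// Calls callback(coordinates, offset) once per point, where offset is the
// index of that point's first component. No per-point array is allocated.
function forEachPoint(geometry, callback) {
  const { dimensions, coordinates } = geometry;
  for (let i = 0; i < coordinates.length; i += dimensions) {
    callback(coordinates, i);
  }
}

const elevations = [];
forEachPoint(line3d, (coords, offset) => {
  elevations.push(coords[offset + 2]); // third component of each point
});
// elevations is [5, 6, 7]
```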
Each GeoJSON feature can have arbitrary properties attached to it. That's useful in many situations, but I've never once actually used it in an app because typically that data lives somewhere else so that it can be accessed by other parts of my app. All I want in my GeoJSON is geometry – the object's id field is enough. But if you don't include a properties member (even just an empty properties: {} or properties: null), it's not valid GeoJSON. We don't need it.
Is there a realistic possibility that we could displace GeoJSON with a superior (but still human-readable) format? I don't know. But if anyone is interested in making it happen then let me know – maybe we can do something.
Interesting ideas, Rich, and thanks for writing them down. MapML as I see it is oriented towards browser devs, who already deal with HTML parsing. I based the MapML feature model on the GeoJSON metamodel, so that it might one day be able to be serialized as GeoJSON by the browser via the <map> element API. So Web developers could have the convenience of working with features as JSON, without the pain of parsing and validation (done by the browser, silently). But that's not to say that a better GeoJSON wouldn't be possible, and I will follow this conversation with interest to see if that would be a better serialization target in the future.