I wanted to extract buildings from OpenStreetMap so I could display them on a map using Folium/Leaflet.
The first recommendation online was to use QGIS with the QuickOSM plugin.
While this was annoying to do via the QuickOSM GUI (I should've downloaded the files locally and used SQL), I was able to output a GeoJSON file of buildings.
Unfortunately, due to data quality issues and the format QGIS outputs, I had a ton of null feature properties that I wanted to get rid of.
- 1.8G -> 101M
- pretty print JSON using jq
- use sed to replace null values in place
- remove remaining null values using jq
- compact the JSON using jq
The first idea was to use jq to delete null values, and this seemed simple enough:
jq 'del(..|nulls)' buildings.geojson > buildings_nn.geojson
While I thought an AMD Ryzen 7 7840HS and 32GB of RAM would be plenty for a 1.8G JSON file, I turned out to be wrong: Linux terminated the process.
Apparently jq parses the whole document into memory, and I confirmed this via htop, which showed memory usage climbing to its max after a while.
After some searching, I found out jq can stream data but examples for this aren't great and some users mentioned speed issues. I was also lazy and didn't feel like trying at this too hard since I had a dumber solution.
We should be able to remove null values using regex.
By pure luck, the GeoJSON format does not have any "important" nulls.
I wanted to know what the GeoJSON data was so I ended up pretty printing it
jq . buildings.geojson > buildings_pp.geojson
Here we go from 1.8G to 2.5G
Now we can use sed to delete lines with nulls
sed -i -E '/\s+"[[:alnum:]_:]+": null,/d' buildings_pp.geojson
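We're using the d command to delete the whole line, which works because pretty printing puts each key on its own line. You may be able to get away without pretty printing, though the pattern would then need to substitute matches rather than delete lines.
You might also be able to remove the \s+ since we're deleting the whole line and the pattern isn't anchored, so matching on "key": null, is enough.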
This brings us from 2.5G to 291M.
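On the idea of skipping the pretty-printing step: on compact JSON a line delete would throw away far more than the null keys, so a substitution is the more likely fit. Below is a minimal sketch of that variant, assuming GNU sed and a made-up one-line sample (sample_c.geojson and sample_c_nn.geojson are hypothetical names, not part of the real data). How this behaves on a 1.8G file with very long lines is an open question.

```shell
# Hypothetical compact sample; the real file is buildings.geojson.
printf '%s\n' '{"properties":{"name":null,"building":"yes","addr:street":null,"height":null}}' > sample_c.geojson

# Substitute every `"key":null,` with nothing instead of deleting the line.
# The space after the colon is optional (` ?`) to cover both output styles.
sed -E 's/"[[:alnum:]_:]+": ?null,//g' sample_c.geojson > sample_c_nn.geojson

cat sample_c_nn.geojson
```

As with the line-based version, a null on the last key of an object has no trailing comma and is left alone, so the result stays valid JSON.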
However, our regex didn't remove cases where the last key in an object had a null value, since we were matching , and not ,?
I did the former because the latter could leave a trailing comma on the previous line and break the JSON, and I also didn't want to write a longer regex.
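The leftover case is easy to reproduce on a tiny file. This is a sketch on made-up data (sample_pp.geojson and sample_nn.geojson are hypothetical names), assuming GNU sed, with -i dropped so the input is left untouched.

```shell
# Hypothetical feature properties, laid out the way `jq .` pretty prints them.
cat > sample_pp.geojson <<'EOF'
{
  "properties": {
    "name": null,
    "building": "yes",
    "addr:street": null,
    "height": null
  }
}
EOF

# Same pattern as the real command, writing to a new file instead of editing in place.
sed -E '/\s+"[[:alnum:]_:]+": null,/d' sample_pp.geojson > sample_nn.geojson

cat sample_nn.geojson
```

The name and addr:street lines are deleted, while "height": null survives because it is the last key and has no trailing comma. The output is still valid JSON, which is the point of not matching ,? here.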
Now that the file is smaller (291M), let's use jq to actually remove the last null values that are at the end of each feature's properties.
jq 'del(..|nulls)' buildings_pp.geojson > buildings_ppnn.geojson
This brings us down to 284M
Finally we can compact the JSON and remove the pretty printing.
jq -c . buildings_ppnn.geojson > buildings_ppnnc.geojson
This results in a more manageable 101M GeoJSON file.
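A quick sanity check on the result could look like the sketch below. It assumes jq is installed and uses a made-up stand-in file (sample_final.geojson is a hypothetical name) rather than the real buildings_ppnnc.geojson; the same two jq filters should apply to the real file.

```shell
# Tiny stand-in for the final compact file (hypothetical data).
printf '%s\n' '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"building":"yes"},"geometry":{"type":"Point","coordinates":[0,0]}}]}' > sample_final.geojson

# Count every null anywhere in the document; 0 means the cleanup was complete.
remaining=$(jq '[.. | nulls] | length' sample_final.geojson)

# Count features, to compare against the same count taken on the original file.
features=$(jq '.features | length' sample_final.geojson)

echo "nulls: $remaining, features: $features"
```

If the feature count matches the original and the null count is 0, the sed step only removed what it was meant to. jq also exits with an error on malformed input, so a clean run doubles as a validity check.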