Skip to content

Instantly share code, notes, and snippets.

@iandees
Last active August 22, 2021 21:35
Show Gist options
  • Save iandees/a5ab382a13099bfad615d213ec588e6a to your computer and use it in GitHub Desktop.
Save iandees/a5ab382a13099bfad615d213ec588e6a to your computer and use it in GitHub Desktop.
Download and process Bing Buildings into Census tracts.

These are my notes for taking the Microsoft US Building Footprints and splitting them into more manageable chunks based on US Census Tracts.

All of this happened on an m5.xlarge in AWS and used up about ~300GB of EBS over the course of a few hours.

  1. Make a filesystem on the EBS volume and mount it:

    sudo mkfs.xfs /dev/nvme1n1
    mount /dev/nvme1n1 /mnt
    
  2. Make the directory to download into

    mkdir -p /mnt/source/buildings
    
  3. Run the download_buildings.sh script to grab the data to the EBS volume

  4. Download the US Census Tract shapefiles

    mkdir -p /mnt/source/tracts
    wget --recursive --continue --accept=*.zip \
         --no-parent --cut-dirs=3 --no-host-directories \
         --directory-prefix=/mnt/source/tracts \
         ftp://ftp2.census.gov/geo/tiger/TIGER2017/TRACT/
    
  5. Unzip the downloaded stuff

    sudo apt-get install -y unzip
    cd /mnt/source/tracts; mv TRACTS/*.zip .; for i in *.zip; do unzip $i; done
    cd /mnt/source/buildings; for i in *.zip; do unzip $i; done
    
  6. Dump the Features from these downloaded files into 1 million line chunks

    cat *.json | grep '"Feature",' | split -l 1000000 --additional-suffix .json
    
  7. Add missing commas from where one file ended and was concatenated with another file

    for i in x*.json; do echo $i; sed -i 's/}}[^,]/}},/' $i; done
    
  8. Add prefix and suffix for a GeoJSON FeatureCollection so ogr2ogr will read it properly.

    # Make it a FeatureCollection
    for i in x*.json; do sed -i '1s/^/{"type":"FeatureCollection","features":[/' $i; done
    # Chop off the comma on the last line
    for i in x*.json; do truncate $i --size=-3; done
    # Add the chars that close the FeatureCollection
    for i in x*.json; do sed -i -e '$a]}' $i; done
    
  9. Install and setup PostGIS

    sudo apt-get update
    sudo apt-get install -y postgresql postgis gdal-bin
    # Stop postgres to move the datadir
    sudo /etc/init.d/postgresql stop
    # Copy over the datadir
    sudo cp -pr /var/lib/postgresql/ /mnt/postgresql/
    # Modify the config file to change data_directory to '/mnt/postgresql/9.5/main'
    sudo vi /etc/postgresql/9.5/main/postgresql.conf
    # Start postgres
    sudo /etc/init.d/postgresql start
    # Create user
    sudo -u postgres createuser ubuntu -s
    # Create database
    createdb
    # Install postgis extension
    psql -c "CREATE EXTENSION postgis;"
    
  10. Import the buildings into PostGIS

    for i in x*.json; do echo $i; ogr2ogr -f "PostgreSQL" PG:"" $i -nln buildings -append; done
    
  11. Import the tracts into PostGIS

    cd /mnt/source/tracts
    shp2pgsql -p tl_2017_01_tract.shp tracts | psql
    for i in *.shp; do shp2pgsql -a $i tracts | psql; done
    psql -c "create index tracts_geo_idx on tracts using gist (geom);"
    psql -c "select UpdateGeometrySRID('tracts', 'geom', 4326);"
    
  12. Add a column to the buildings table for the tract ID and give it an index

    psql -c "alter table buildings add tract_geoid varchar(12);"
    psql -c "create index buildings_tract_idx on buildings(tract_geoid);"
    
  13. Add the tract IDs to the buildings

    # ST_Within is faster, so get the buildings that are completely contained in a tract first
    psql -c "update buildings b set tract_geoid=t.geoid from tracts t where st_within(b.wkb_geometry, t.geom);"
    # ... then go back and pick one of the tracts where they overlap
    psql -c "update buildings b set tract_geoid=t.geoid from tracts t where b.tract_geoid is null and st_intersects(b.wkb_geometry, t.geom);"
    
  14. Output the buildings (now with a tract ID) as CSV with WKT geometry

    mkdir -p /mnt/output
    ogr2ogr -f CSV /mnt/output/output.csv -lco GEOMETRY=AS_WKT "PG:" -sql "select tract_geoid, ogc_fid as id, wkb_geometry from buildings order by tract_geoid"
    
  15. Split up the CSV into per-tract GeoJSON feature collections using Python

    sudo apt-get install -y python-pip
    pip install shapely
    mkdir -p /mnt/output/bytract
    python split_buildings_csv.py
    
  16. Post the tracts to S3

    cd /mnt/output/bytract
    mkdir -p /mnt/output/bystate
    for i in *; do tar -zcf /mnt/output/bystate/$i.tar.gz $i; done
    sudo apt-get -y install awscli
    aws s3 sync --acl="public-read" /mnt/output/bystate/ s3://data.openstreetmap.us/bingbuildings/bystate/
    aws s3 sync --acl="public-read" --content-type="application/json" /mnt/output/bytract/ s3://data.openstreetmap.us/bingbuildings/bytract/
    
  17. Build Tippecanoe to generate an mbtiles of the dataset

    curl -L https://github.com/mapbox/tippecanoe/archive/master.tar.gz | tar -xz
    cd tippecanoe-master/
    sudo apt-get install -y build-essential libsqlite3-dev zlib1g-dev
    make -j
    sudo make install
    
  18. Generate the mbtiles of the dataset

    mkdir -p /mnt/tmp
    (find /mnt/output/bytract -type f -name '*.geojson' -exec cat {} \;) | \
      tippecanoe \
        --no-line-simplification \
        --buffer=0 \
        --read-parallel \
        --temporary-directory=/mnt/tmp \
        --base-zoom=12 \
        --maximum-zoom=12 \
        --minimum-zoom=12 \
        -o /mnt/output/bingbuildings.mbtiles
    
wget --continue -O /mnt/source/buildings/Alabama.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Alabama.zip
wget --continue -O /mnt/source/buildings/Alaska.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Alaska.zip
wget --continue -O /mnt/source/buildings/Arizona.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Arizona.zip
wget --continue -O /mnt/source/buildings/Arkansas.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Arkansas.zip
wget --continue -O /mnt/source/buildings/California.zip https://usbuildingdata.blob.core.windows.net/usbuildings/California.zip
wget --continue -O /mnt/source/buildings/Colorado.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Colorado.zip
wget --continue -O /mnt/source/buildings/Connecticut.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Connecticut.zip
wget --continue -O /mnt/source/buildings/Delaware.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Delaware.zip
wget --continue -O /mnt/source/buildings/DistrictofColumbia.zip https://usbuildingdata.blob.core.windows.net/usbuildings/DistrictofColumbia.zip
wget --continue -O /mnt/source/buildings/Florida.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Florida.zip
wget --continue -O /mnt/source/buildings/Georgia.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Georgia.zip
wget --continue -O /mnt/source/buildings/Hawaii.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Hawaii.zip
wget --continue -O /mnt/source/buildings/Idaho.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Idaho.zip
wget --continue -O /mnt/source/buildings/Illinois.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Illinois.zip
wget --continue -O /mnt/source/buildings/Indiana.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Indiana.zip
wget --continue -O /mnt/source/buildings/Iowa.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Iowa.zip
wget --continue -O /mnt/source/buildings/Kansas.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Kansas.zip
wget --continue -O /mnt/source/buildings/Kentucky.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Kentucky.zip
wget --continue -O /mnt/source/buildings/Louisiana.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Louisiana.zip
wget --continue -O /mnt/source/buildings/Maine.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Maine.zip
wget --continue -O /mnt/source/buildings/Maryland.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Maryland.zip
wget --continue -O /mnt/source/buildings/Massachusetts.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Massachusetts.zip
wget --continue -O /mnt/source/buildings/Michigan.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Michigan.zip
wget --continue -O /mnt/source/buildings/Minnesota.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Minnesota.zip
wget --continue -O /mnt/source/buildings/Mississippi.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Mississippi.zip
wget --continue -O /mnt/source/buildings/Missouri.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Missouri.zip
wget --continue -O /mnt/source/buildings/Montana.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Montana.zip
wget --continue -O /mnt/source/buildings/Nebraska.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Nebraska.zip
wget --continue -O /mnt/source/buildings/Nevada.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Nevada.zip
wget --continue -O /mnt/source/buildings/NewHampshire.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NewHampshire.zip
wget --continue -O /mnt/source/buildings/NewJersey.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NewJersey.zip
wget --continue -O /mnt/source/buildings/NewMexico.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NewMexico.zip
wget --continue -O /mnt/source/buildings/NewYork.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NewYork.zip
wget --continue -O /mnt/source/buildings/NorthCarolina.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NorthCarolina.zip
wget --continue -O /mnt/source/buildings/NorthDakota.zip https://usbuildingdata.blob.core.windows.net/usbuildings/NorthDakota.zip
wget --continue -O /mnt/source/buildings/Ohio.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Ohio.zip
wget --continue -O /mnt/source/buildings/Oklahoma.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Oklahoma.zip
wget --continue -O /mnt/source/buildings/Oregon.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Oregon.zip
wget --continue -O /mnt/source/buildings/Pennsylvania.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Pennsylvania.zip
wget --continue -O /mnt/source/buildings/RhodeIsland.zip https://usbuildingdata.blob.core.windows.net/usbuildings/RhodeIsland.zip
wget --continue -O /mnt/source/buildings/SouthCarolina.zip https://usbuildingdata.blob.core.windows.net/usbuildings/SouthCarolina.zip
wget --continue -O /mnt/source/buildings/SouthDakota.zip https://usbuildingdata.blob.core.windows.net/usbuildings/SouthDakota.zip
wget --continue -O /mnt/source/buildings/Tennessee.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Tennessee.zip
wget --continue -O /mnt/source/buildings/Texas.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Texas.zip
wget --continue -O /mnt/source/buildings/Utah.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Utah.zip
wget --continue -O /mnt/source/buildings/Vermont.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Vermont.zip
wget --continue -O /mnt/source/buildings/Virginia.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Virginia.zip
wget --continue -O /mnt/source/buildings/Washington.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Washington.zip
wget --continue -O /mnt/source/buildings/WestVirginia.zip https://usbuildingdata.blob.core.windows.net/usbuildings/WestVirginia.zip
wget --continue -O /mnt/source/buildings/Wisconsin.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Wisconsin.zip
wget --continue -O /mnt/source/buildings/Wyoming.zip https://usbuildingdata.blob.core.windows.net/usbuildings/Wyoming.zip
import csv
import shapely.wkt
import shapely.geometry
import json
import os
from itertools import groupby
json.encoder.FLOAT_REPR = lambda f: ("%0.7f" % f)
with open('/mnt/output/output.csv', 'r') as f:
c = csv.DictReader(f)
for tract_id, rows in groupby(c, lambda x: x['tract_geoid']):
print(tract_id)
d = '/mnt/output/bytract/%s' % tract_id[:2]
if not os.path.exists(d):
os.mkdir(d)
with open(os.path.join(d, '%s.geojson' % tract_id), 'w') as g:
fc = {'type': 'FeatureCollection', 'features': []}
for r in rows:
s = shapely.wkt.loads(r['WKT'])
f = {
'type': 'Feature',
'id': r['id'],
'properties': {'tract_geoid': r['tract_geoid']},
'geometry': shapely.geometry.mapping(s)
}
fc['features'].append(f)
json.dump(fc, g, separators=(',', ':'))
g.write('\n')
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment