@powersa
Created October 18, 2017 15:11
Notes on building a transportation Neo4j db with gtfs data.

Goal

Identify Seattle bus stops that are within 1 hour of the Food Bank on public transportation.

  • From where in the city is the Food Bank accessible within 1 hour?
  • Where in the city is accessible from the Food Bank within 1 hour?

Data

GTFS: https://developers.google.com/transit/gtfs/

Mostly interested in:

Approach

  • Get usable data as quickly as possible.
  • Tooling needs to be flexible enough to operate on GTFS data with reasonable constraints (i.e. a discrete period of time).
  • End to end, should be able to process and query a day of data in ~1 hour.

Tools

  • Python - easy for scripting and processing tabular data
  • Neo4j - because I thought this would be an interesting application for graph databases
  • QGIS - for visualization

Data Model

Route Two:                >(T)---->(D)--------->(A)
                         /
Route One: (D)--------->(A)-->(D)--------->(A)

Three types of events (nodes) in our system:

  • Departures - a route instance leaves a stop (1:1)
  • Arrivals - a route instance arrives at a stop (1:1)
  • Transfer - leave one route to go to another that connects at the same stop

Nodes have:

  • Type
  • Time
  • Location
  • Route
  • Stop Name

Initially, two types of relationship:

  • Within a route. (D)--->(A)------>(D)
  • Between routes. (A)>(T)---->(D)
    -->(D)

Notice the direction of these relationships? They all have a direction.

Relationships also have a magnitude: the number of seconds between the two events they connect.

Relationships have:

  • Node 0
  • Node 1
  • Magnitude
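
As a rough illustration, the node and relationship shapes listed above could be written as Python dataclasses. This is a minimal sketch, not the layout of the original scripts: the class names and field names (`event_id`, `timestamp`, `lat`, `lon`, `node0`, `node1`, `seconds`) are assumptions chosen for readability, and the example values are taken from the sample rows later in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One node: a departure, arrival, or transfer event."""
    event_id: str    # e.g. "32884061-1491443772-arrival"
    type: str        # "departure", "arrival", or "transfer"
    timestamp: int   # seconds since the Unix epoch
    lat: float       # location, latitude
    lon: float       # location, longitude
    route: str
    stop_name: str


@dataclass(frozen=True)
class Relationship:
    """One directed edge between two events."""
    node0: str       # id of the event the edge leaves
    node1: str       # id of the event the edge enters
    seconds: int     # magnitude: seconds between the two events


arrival = Event("32884061-1491443772-arrival", "arrival", 1491443772,
                47.655735, -122.143089, "B Line", "148th Ave NE & NE 51st St")
dwell = Relationship(arrival.event_id, "32884061-1491443772-departure", 0)
```

Freezing the dataclasses keeps events hashable, which is convenient if they are later used as dictionary keys or set members.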

Process Raw Data

Convert GTFS data from the raw feeds into something that looks like our data model.

We end up with our nodes and our relationships.

For convenience, we have:

  • arrivals.txt
  • departures.txt
  • route_relationships.txt
  • cross_route_relationships.txt

Sample Arrival:

32884061-1491443772-arrival 2017-04-05 18:56:12-07:00 1491443772 47.655735 -122.143089 B Line 148th Ave NE & NE 51st St

Sample Route Relationship:

32884061-1491443772-arrival 32884061-1491443772-departure 0
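
To make the within-route relationships concrete, here is a minimal sketch of how rows like the sample above could be derived for one trip. It assumes the id layout `<trip>-<epoch seconds>-<type>`, which is only inferred from the sample rows, and it simplifies by giving every stop both an arrival and a departure. The function names and the second stop's timestamp are made up for illustration.

```python
def event_id(trip_id, timestamp, kind):
    # Assumed id layout, inferred from the sample rows:
    # <trip>-<epoch seconds>-<type>
    return f"{trip_id}-{timestamp}-{kind}"


def route_relationships(trip_id, stop_times):
    """Build (node0, node1, seconds) rows for one trip.

    stop_times is a list of (arrival_ts, departure_ts) tuples in stop order.
    """
    rows = []
    prev_departure = None
    for arrival_ts, departure_ts in stop_times:
        if prev_departure is not None:
            # Travel between stops: previous departure -> this arrival
            rows.append((event_id(trip_id, prev_departure, "departure"),
                         event_id(trip_id, arrival_ts, "arrival"),
                         arrival_ts - prev_departure))
        # Dwell at the stop: arrival -> departure
        rows.append((event_id(trip_id, arrival_ts, "arrival"),
                     event_id(trip_id, departure_ts, "departure"),
                     departure_ts - arrival_ts))
        prev_departure = departure_ts
    return rows


# First stop matches the sample; the second stop's time is invented.
rows = route_relationships("32884061",
                           [(1491443772, 1491443772),
                            (1491443900, 1491443900)])
```

The first row produced here is the zero-second arrival-to-departure pair shown in the sample route relationship.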

Import Processed Data into Neo4j

Used the Neo4j Python library to import each data set into my local db with Cypher, Neo4j's graph query language.
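
One common way to do this kind of import is to send rows in batches through a parameterised `UNWIND` statement. The sketch below only builds the Cypher text and the batches, so it runs without a database; with the official `neo4j` Python driver each batch could then be passed along as `session.run(NODE_QUERY, rows=batch)`. The `Event` label, the `To` relationship type, and the batch size are assumptions. The property names `id`, `type`, `timeStamp`, `stopName`, and `seconds` are the ones used by the queries later in these notes.

```python
# Cypher for nodes: one MERGE per row, then copy all row fields onto the node.
NODE_QUERY = """
UNWIND $rows AS row
MERGE (n:Event {id: row.id})
SET n += row
"""

# Cypher for relationships: look up both ends by id, then connect them.
REL_QUERY = """
UNWIND $rows AS row
MATCH (a:Event {id: row.node0}), (b:Event {id: row.node1})
CREATE (a)-[:To {seconds: row.seconds}]->(b)
"""


def batches(rows, size):
    """Yield consecutive slices of `rows`, each at most `size` long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


node_rows = [
    {"id": "32884061-1491443772-arrival", "type": "arrival",
     "timeStamp": "1491443772", "stopName": "148th Ave NE & NE 51st St"},
    {"id": "32884061-1491443772-departure", "type": "departure",
     "timeStamp": "1491443772", "stopName": "148th Ave NE & NE 51st St"},
]
rel_rows = [
    {"node0": "32884061-1491443772-arrival",
     "node1": "32884061-1491443772-departure", "seconds": 0},
]

node_batches = list(batches(node_rows, 1000))
rel_batches = list(batches(rel_rows, 1000))
```

Batching keeps each transaction small, which tends to matter once a full day of stop events is loaded.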

Getting Data Back

Why do all this work? The whole point of loading the data into the db is so we can query for data dynamically.

Remember our initial question?

At a particular time, on a specific date, from where in the city can we reach the Food Bank within x amount of time?

Wrote another set of scripts for this. In the end we get a flat file that tells us about the origin, destination and the travel time between them.

Gotcha

You've gotta make the data structure work for you. Originally, I only created relationships in the "to" category. The db was very good at traversing the graph to get the next stop in the network, but very slow when asked to get stops upstream. The relationships are one direction, so from a given stop, it's really hard to find the stops that came before. To get around this, I added a relationship class "from" that mirrors "to" relationships but reverses their direction.

Fold those into our queries and we have a system that's just as good at traversing the network in reverse.
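
The mirroring step described above could look roughly like this in the processing scripts. It is a sketch under assumptions: relationships are plain `(node0, node1, seconds)` tuples, and the helper names `mirror` and `upstream_index` are hypothetical.

```python
from collections import defaultdict


def mirror(relationships):
    """Return a reversed copy of each (node0, node1, seconds) row.

    The input rows play the role of the "to" relationships; the output
    rows are the matching "from" relationships with the same magnitude.
    """
    return [(node1, node0, seconds) for node0, node1, seconds in relationships]


def upstream_index(from_relationships):
    """Group "from" rows by their starting node for quick upstream lookups."""
    index = defaultdict(list)
    for node0, node1, seconds in from_relationships:
        index[node0].append((node1, seconds))
    return dict(index)


to_rows = [("d1", "a2", 300), ("a2", "d2", 0)]
from_rows = mirror(to_rows)
upstream = upstream_index(from_rows)
```

With the reversed rows loaded as their own relationship type, a query that starts at an arrival can follow outgoing edges to reach everything that came before it.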

Sample Queries

We want to know "from where" for a given stop within a particular time window.

First we start by finding an arrival event at a particular stop. So let's find all arrivals at Martin L King Jr Way S & S Webster St between Saturday, May 6, 2017 9:30:00 AM GMT-07:00 DST and Saturday, May 6, 2017 2:00:00 PM GMT-07:00 DST.

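      // Note: this matches stopName with literal double quotes and compares
      // timeStamp as a string (fine while every epoch value has 10 digits).
      match (n)
      where n.stopName = '"Martin L King Jr Way S & S Webster St"' AND
      n.timeStamp >= "1494088200" AND
      n.timeStamp <= "1494104400" AND
      n.type = "arrival"
      return n;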
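
The two epoch bounds in the query above correspond to the wall-clock times in the text. A small sketch of the conversion, assuming a fixed UTC-07:00 offset (the helper name `to_epoch` is made up):

```python
from datetime import datetime, timedelta, timezone

# Fixed offset matching the "GMT-07:00" in the text (Pacific Daylight Time).
PDT = timezone(timedelta(hours=-7))


def to_epoch(year, month, day, hour, minute=0):
    """Convert a local wall-clock time at UTC-07:00 to Unix epoch seconds."""
    return int(datetime(year, month, day, hour, minute, tzinfo=PDT).timestamp())


start = to_epoch(2017, 5, 6, 9, 30)   # 1494088200
end = to_epoch(2017, 5, 6, 14, 0)     # 1494104400
```

A fixed offset is enough for a single day of data; a window that crosses a daylight-saving change would need a real time zone database.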

Let's isolate a single event:

      match (n)
      where n.id = "33363087-1494092722-arrival"
      return n;

Now let's find all the paths that reach a given arrival within 15 minutes.

      MATCH p = ({id:"33363087-1494092722-arrival"})-[:From *1..50]->({type:"departure"})
      WITH p,reduce(s = 0, r IN relationships(p) | s + r.seconds) AS dist
      where dist < 900
      RETURN p;

And for ease in the export, we can just get the nodes.

      MATCH p = ({id:"33363087-1494092722-arrival"})-[:From *1..50]->({type:"departure"})
      WITH p,reduce(s = 0, r IN relationships(p) | s + r.seconds) AS dist
      where dist < 900
      RETURN nodes(p)[-1] AS n, dist ORDER BY dist DESC;
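
For intuition, the same idea can be sketched in plain Python over an in-memory copy of the reversed edges: walk upstream from an arrival, add up the seconds, and keep the departures that stay under the budget. This is an illustration and not how Neo4j evaluates the query. It assumes non-negative edge weights and an acyclic event graph, it does not enforce the 50-hop limit, and the function name and toy graph are made up.

```python
def reachable_departures(from_edges, node_types, start, budget):
    """Return (node_id, seconds) for departures upstream of `start`.

    from_edges maps a node id to a list of (upstream_id, seconds) pairs.
    Only paths whose total cost is strictly below `budget` are kept.
    Results are sorted by total seconds, largest first.
    """
    results = []
    stack = [(start, 0)]
    while stack:
        node, dist = stack.pop()
        for upstream, seconds in from_edges.get(node, []):
            total = dist + seconds
            if total >= budget:
                # Weights are non-negative, so no longer path can recover.
                continue
            if node_types[upstream] == "departure":
                results.append((upstream, total))
            stack.append((upstream, total))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy graph: A2 <- D1 (300 s), D1 <- A1 (0 s), A1 <- D0 (700 s)
from_edges = {"A2": [("D1", 300)], "D1": [("A1", 0)], "A1": [("D0", 700)]}
node_types = {"A2": "arrival", "D1": "departure",
              "A1": "arrival", "D0": "departure"}

within_15_min = reachable_departures(from_edges, node_types, "A2", 900)
within_20_min = reachable_departures(from_edges, node_types, "A2", 1200)
```

With a 900 second budget only the nearest departure qualifies; raising the budget to 1200 seconds also brings in the departure one stop further upstream.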