We want to be able to take the data that people give us in CSV files and turn it into graphs.
Some approaches to doing this and my thoughts/feelings/opinions on each:
- DON'T
Just don't do it. Let them get the data into a graph format we know and love, like a graph on GES that we can access through an API. This is the extreme case, but I feel it merits thinking about. Processing data is hard to automate because data is generally noisy, and we shouldn't trust ourselves to handle every corner case and user input problem that might arise. So we can just say "Hey, give us some graph data" and let the data wranglers continue to collect their paychecks for the jobs that are hard to automate, like reshaping data from a SQL table into a graph database.
- SCHEMA
We basically use the same approach as GES and make users upload a schema that describes the layout of the nodes.csv and edges.csv files they are going to hand us. This is my second favorite method, after not worrying about it at all and making them put the info into a graph before they ever open up our software. Ultimately, it has a lot in common with the first approach, but gives the veneer that we've actually automated something when we really have not.
- AUTOMAGIC
Try to be clever, predict the structure of the CSV files they hand over, and fill in the graph automagically. I HATE THIS IDEA. There's no such thing as a universal translator, and there are far too many corner cases where improperly formatted data can sink this approach right from the start. That's why we make them fill out a schema for our "ingestion engine": so it knows which attributes go with which nodes instead of guessing where things are stored in the data we're being handed.
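
To make the corner-case worry concrete, here is a tiny sketch of a guesser going wrong. The `guess_type` helper and the sample column are hypothetical, made up purely for illustration; this is not how GES or our module does anything.

```python
def guess_type(values):
    """Naive inference (hypothetical): if every value parses as an int, treat the column as ints."""
    try:
        return [int(v) for v in values]
    except ValueError:
        # At least one value is not an integer, so leave the column as strings.
        return list(values)


# A column of ZIP codes as a CSV reader would hand it over (always strings).
zip_codes = ["02139", "10001", "94105"]

guessed = guess_type(zip_codes)

# The guess "succeeds", but the first value has silently lost its leading zero,
# so it no longer matches the identifier the user actually gave us.
print(guessed)
```

Nothing errors out here, which is the scary part: the data just comes out quietly wrong. A schema that says "this column is a string" makes the problem go away without any cleverness on our side.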
So we've laid out three options, and together they show that there is really only one practical way to handle the problem: make users who want our module to process their data give us a schema. We can provide lots of different schema templates, each with short edges.csv and nodes.csv files to go along with it, so users will have plenty of examples of successfully formatted input data and they'll be happy. There's no need to try to read their minds with a more involved process like the one option three proposes.
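
For a feel of what one of those templates could look like, here is a minimal sketch of schema-driven ingestion. Everything in it is an assumption for illustration: the schema keys (`id`, `source`, `target`, `attributes`), the column names, and the `load_graph` function are invented here and are not the GES schema format.

```python
import csv
import io

# Hypothetical schema a user would upload alongside their files.
# It names the ID/endpoint columns and says how to cast each attribute.
SCHEMA = {
    "nodes": {"id": "id", "attributes": {"name": str, "age": int}},
    "edges": {"source": "src", "target": "dst", "attributes": {"weight": float}},
}

# Short example files of the kind we could ship with a template.
NODES_CSV = """id,name,age
n1,Alice,34
n2,Bob,29
"""

EDGES_CSV = """src,dst,weight
n1,n2,0.5
"""


def load_graph(schema, nodes_file, edges_file):
    """Build (nodes, edges) by following the schema; nothing is guessed."""
    node_spec = schema["nodes"]
    edge_spec = schema["edges"]

    nodes = {}
    for row in csv.DictReader(nodes_file):
        node_id = row[node_spec["id"]]
        # Cast each declared attribute with the type the schema asks for.
        nodes[node_id] = {
            name: cast(row[name]) for name, cast in node_spec["attributes"].items()
        }

    edges = []
    for row in csv.DictReader(edges_file):
        src = row[edge_spec["source"]]
        dst = row[edge_spec["target"]]
        # Reject edges that point at nodes the user never declared.
        if src not in nodes or dst not in nodes:
            raise ValueError(f"edge {src} -> {dst} references an unknown node")
        attrs = {
            name: cast(row[name]) for name, cast in edge_spec["attributes"].items()
        }
        edges.append((src, dst, attrs))

    return nodes, edges


nodes, edges = load_graph(SCHEMA, io.StringIO(NODES_CSV), io.StringIO(EDGES_CSV))
print(nodes)
print(edges)
```

The point of the sketch is that a missing column or an unparseable value fails loudly (a `KeyError` or `ValueError`) and the fix is on the user's side of the fence, which is exactly where we want it.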