DATA INGESTION CONTINUED

I've been thinking about it, and if we're given CSV flat files with header rows, it shouldn't be that difficult to come up with an automated ingestion method that doesn't require a separate schema. If we're ingesting data from a SQL table, I can't really see a way around having the user define a schema, because the columns in the table aren't necessarily ordered in a way the user chose. I'm pretty sure the automated way of getting the columns from a SQL table returns the column names in alphabetical order, though this needs checking and likely depends on the database and the query used. So we would need users to name their columns something specific or, more simply, have them provide us a schema.
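As a rough illustration of the CSV case, here is a minimal sketch that treats the header row as the schema. It uses only Python's standard `csv` module; the function name `infer_schema` and the sample columns are made up for illustration and are not part of MLoG.

```python
import csv
import io


def infer_schema(csv_text):
    """Return (column_names, rows), using the CSV header row as the schema.

    `infer_schema` is a hypothetical helper name, not an existing MLoG function.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader reads the first row as the header; the order in the file is preserved.
    columns = list(reader.fieldnames or [])
    rows = [dict(row) for row in reader]
    return columns, rows


# Made-up sample data, just to show the shape of the result.
sample = "id,name,employer\n1,Ada,Acme\n2,Grace,Initech\n"
columns, rows = infer_schema(sample)
```

All values come back as strings, so type inference (ints, dates, and so on) would still be a separate step, or something a user-supplied schema could settle.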

As I see it, the MLoG module should make it easier for people in industry to convert their tabular data into a graph format and apply powerful graph machine learning methods to it. But ingestion isn't the only thing that matters; we aren't doing all of these computations for grins. We should anticipate that users will want to keep the output of their computations in a graph database, and we should be prepared to convert the data in the graphs back into CSV or SQL, since those are widely accepted industry formats.

So we want to help people go from CSV -> structured graph data -> numerical representations of the data -> ML computations (make embeddings, classify, etc.) -> convert the numerical array data back into an array of structures kept in a graph DB -> export a CSV or SQL dump, or connect to their SQL database and upload the results of their computations.
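To make the round trip above concrete, here is a toy sketch of the stages in plain Python. Everything in it is an assumption for illustration: the function names, the node/edge layout, and especially `neighbor_degree`, which stands in for the real ML step (embedding, classification) with a trivial count. A real version would write to an actual graph database rather than dicts.

```python
import csv
import io


def rows_to_graph(rows, key, edge_column):
    """Tabular rows -> graph: one node per row, plus one node per distinct
    value in `edge_column`, with an edge linking the two."""
    nodes, edges = {}, []
    for row in rows:
        node_id = (key, row[key])
        nodes[node_id] = {c: v for c, v in row.items() if c != edge_column}
        target = (edge_column, row[edge_column])
        nodes.setdefault(target, {edge_column: row[edge_column]})
        edges.append((node_id, edge_column, target))
    return nodes, edges


def neighbor_degree(edges):
    """Placeholder for the ML step: for each source node, count how many
    edges share its target (e.g. how many people share an employer)."""
    target_counts = {}
    for _, _, target in edges:
        target_counts[target] = target_counts.get(target, 0) + 1
    return {source: target_counts[target] for source, _, target in edges}


def attach_results(nodes, results, column):
    """Store computed values back on the graph nodes as a new property."""
    for node_id, value in results.items():
        nodes[node_id][column] = value


def graph_to_csv(nodes, key, columns):
    """Export the row-like nodes (those keyed by `key`) back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for (kind, _), props in nodes.items():
        if kind == key:
            writer.writerow({c: props.get(c, "") for c in columns})
    return out.getvalue()


# Made-up rows standing in for an ingested CSV.
rows = [
    {"id": "1", "name": "Ada", "employer": "Acme"},
    {"id": "2", "name": "Grace", "employer": "Acme"},
    {"id": "3", "name": "Linus", "employer": "Initech"},
]
nodes, edges = rows_to_graph(rows, key="id", edge_column="employer")
attach_results(nodes, neighbor_degree(edges), "employer_size")
exported = graph_to_csv(nodes, key="id", columns=["id", "name", "employer_size"])
```

The point of the sketch is the shape of the flow: the user ends up with their original table plus one new computed column, which is the kind of output the export step would hand back as CSV or load into their SQL database.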

People in industry don't care about the hows and whys of ML as much as they care about predicting missing values or adding valuable new columns to their existing database. That's what ML really represents to most people in industry, so let's give them what they want.

