polyflow 🔬

Description

polyflow is a semantic data mediator. Its main goal is to provide more accessible data models for users who need to run queries as part of their routine but aren't technology experts, such as researchers, PMs and marketing analysts. Consider the following scenario:

Imagine that two researchers, Bob and Alice, share the same research field and tackle their problems using computer simulations. Simulations are usually structured as workflows (Wfs): they take some inputs, run them through a sequence of programs that transform them, and generate some output. Researchers typically rely on a tool to orchestrate, automate and control their Wfs. Such tools are called Workflow Management Systems (WfMSs).

Each WfMS has its own storage mechanism and conceptual representation of Wfs, since it may have been built for a specific domain (e.g. terra.bio). These tools often export raw data in a structured way (such as logs or Relational Databases (RDbs)) so researchers can query and analyze their findings. But how can Bob and Alice compare their inputs, pipelines and results if they use different orchestrators for their Wfs?

In other words, data exported by different WfMSs may not be directly comparable, since the logical representations (logs vs. RDbs) or even the conceptual models may differ. polyflow was designed to solve this problem in a user-friendly way.

Imagine that Bob and Alice want to compare how much time each step of their pipelines takes. Let's assume Bob uses Kepler to orchestrate his workflow while Alice uses Swift/T. The image below shows the entities in each WfMS's data model that capture this information.

[Image: the Kepler and Swift/T data-model entities that record step execution times]

The two (logical) models are similar and semantically identical. Because of that, they can be represented in a canonical fashion. In other words, a single Canonical Conceptual Model (CCM) can be used to represent data described by both models.

[Image: the Canonical Conceptual Model (CCM) that represents both local models]

And that's where polyflow comes in. Database experts or tech-savvy users describe mapping strategies between the CCM and the local schemas using an object-like syntax, so that researchers can query a single, simplified data model.

This software is derived from my master's thesis; check the Publications section for additional information.

Requirements

All you need to run polyflow is Docker. To install it, just follow the installation guide for your OS:

Running the dockerized version

⚠️ This guide is valid only for Unix-based systems. If you are on Windows, some commands may be different.

First of all, open the Terminal application on your computer. Then clone or download the repository.

git clone https://github.com/yanmendes/polyflow

If you opted for the download, unzip the file and change into your downloads directory in the Terminal. Otherwise, just change into polyflow's folder:

cd polyflow

Then, you need to create a docker network to enable communication between containers. To do that, run:

docker network create polyflow

If you have containerized databases that you wish to connect to polyflow, make sure to add them to the network.
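For example, assuming your database already runs in a container named bobs-postgres (the name is just illustrative), you can attach it with:

docker network connect polyflow bobs-postgres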

To run the application, just run the command below. It launches two Docker containers: one containing the application and the other a database that serves as polyflow's catalog.

docker-compose up polyflow

To make sure your installation worked, open http://localhost:3050. There should be a blue screen with a big play button in the middle. This is a GraphQL playground, and it is where you'll be interacting with polyflow. If you don't know anything about GraphQL, a quick read on the basics (e.g. the official GraphQL introduction) gives a good overview. To check out polyflow's endpoints, click the Schema button on the right side of the screen. The playground also provides auto-completion for queries and mutations when you press ctrl + space.
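As a quick sanity check, you can hit the play button with a minimal query. For instance, listing the registered data sources (the dataSources query is covered later in this guide) should return an empty list on a fresh catalog:

query {
  dataSources {
    id
  }
}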

Understanding polyflow

This section provides an overview of polyflow's core concepts and how to interact with the software.

⚠️ This is NOT a working example; errors will be thrown if you try to execute these queries and mutations as-is. For working examples, check the Examples section.

There are three main concepts in polyflow: Data Sources, Mediators and Entities. In terms of our example, a data source is where the resources are located. We currently support PostgreSQL, MySQL and BigDAWG, so a data source is a PostgreSQL/MySQL URL or a BigDAWG endpoint, but you can think of it as a Uniform Resource Identifier (URI) for any resource across the web (e.g. databases, files) that will be mediated by polyflow.

To connect polyflow to Bob's database, we should use the addDataSource mutation.

mutation addDataSource {
  addDataSource(dataSource: { type: postgres, uri: "<bob's-db-URI>", slug: "bobs-db" }) {
    slug
  }
}

To check existing Data Sources, you should run the dataSources query:

query {
  dataSources {
    id
    uri
    type
  }
}

Mediators are “...a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications”. In simpler terms, mediators can be used to enrich data, leveraging domain-specific knowledge to provide simpler and/or more semantically rich data models. Recalling our scenario, the mediators would be the transformations needed to convert data described by the local schemas (Bob's and Alice's) into the global one (the CCM).

Since the target CCM may be composed of multiple entities, the transformations are described by a more granular abstraction. We can create a mediator for a data source using the addMediator mutation. Note that a data source can have multiple mediators. The mediator's slug determines which mediator handles a given query, as we'll see further ahead.

mutation {
  addMediator(mediator: { dataSourceSlug: "bobs-db", name: "Bob's mediator", slug: "bob" }) {
    id
  }
}

Entities are the most crucial piece of the puzzle. The entityMapper prop defines the transformation from an entity's local schema to its global representation. Since polyflow aims for a technology-agnostic approach, support for different data stores can be added by implementing new interfaces and query resolvers. With that in mind, note that the entityMapper's structure may vary depending on the data source being used.

For now, polyflow only ships a query resolver for SQL and can interface with PostgreSQL, MySQL and BigDAWG. The relational entityMapper structure can be found in the repository. It's composed of the table's name, an optional alias, the columns that will be projected and a where prop that applies filters.
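As a rough sketch (not meant to be executed; entity1, columns and projection follow the addEntity mutation shown further below, while the alias field name is an assumption based on the description above), a mapper for a single table without any aggregation could look like:

entityMapper: {
  entity1: { name: "Sales", alias: "s" }
  columns: [{ projection: "customer" }, { projection: "city" }]
}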

However, since the data in our local schema may be more granular than in our CCM, we may need to aggregate entities to get the desired outcome. For instance, in Bob's model, the price of a sale is not recorded in the Sales entity. If we were to write a plain SQL statement to retrieve the data in our CCM's format, it would result in something along the lines of

SELECT customer, city, value AS price FROM Sales s, Prices p WHERE s.id = p.saleId

Because of that, an entityMapper has the optional fields entity2, type and params, which allow the creation of complex entities. To create the only entity in Bob's CCM, Sales, we use the addEntity mutation below. Note that aggregations can be recursively defined, i.e. you can aggregate more than two entities; check the examples folder for that.

mutation addBobSaleEntity {
  addEntity(
    entity: {
      name: "Sales"
      slug: "sales"
      mediatorSlug: "bob"
      entityMapper: {
        entity1: { name: "Sales" }
        entity2: { name: "Prices" }
        columns: [
          { projection: "city" }
          { projection: "customer" }
          { projection: "value" }
        ]
        type: INNER
        params: ["s.id", "p.saleId"]
      }
    }
  ) {
    id
  }
}

Finally, after providing the proper mappings, Bob can query the data using the CCM via the query endpoint. As mentioned before, the mediator's slug gives polyflow the context it needs to transform incoming queries. Similarly, the entity's slug tells polyflow which entityMapper should be used.

For relational databases, the query syntax is SQL-like, with one difference: instead of querying regular tables in your schema, you target the entities registered in the catalog, wrapped in [ ] and preceded by the desired mediator's slug, as shown in the example below:

query {
  query(query: "SELECT * FROM bob[sales]")
}
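Since everything else is plain SQL, projections and filters should work as usual. A hypothetical example that filters Bob's sales by city (column names taken from the entityMapper defined earlier):

query {
  query(query: "SELECT customer, value FROM bob[sales] WHERE city = 'Rio de Janeiro'")
}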

Examples

The examples folder contains two implemented examples: one using PostgreSQL and MySQL databases and the other using BigDAWG, a polystore database system. Each directory has a README file with instructions on how to run it.

Cleaning up

To stop and remove the containers, run:


docker rm -f polyflow polyflow-catalog
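If you also want to remove the Docker network created earlier, run:

docker network rm polyflow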

Running locally

First, you need to set up your environment by installing Node.js, PostgreSQL and Yarn. If you're on macOS, I highly recommend using Homebrew to install and manage your packages.
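With Homebrew, that boils down to something like the following (formula names may change over time):

brew install node postgresql yarn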

Now, copy the .env.sample file into a .env file and write your local settings there.
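From the repository root, that's a single copy; then edit .env with your editor of choice:

cp .env.sample .env

Then install the dependencies, build and start the server: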

$ yarn
$ yarn run build
$ yarn run start:prod

Future work

  • Add support for GROUP BY and ORDER BY in the entity mapper

  • Export performance metrics from container

References

  • Theoretical fundamentals that guided this work - ÖZSU, M. Tamer; VALDURIEZ, Patrick. Principles of distributed database systems. Springer Science & Business Media, 2011.

  • Workflow definition - De Oliveira, Daniel CM, Ji Liu, and Esther Pacitti. "Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments." Synthesis Lectures on Data Management 14.4 (2019): 1-179.

  • Swift/T WFMS - Wozniak, Justin M., et al. "Swift/t: Large-scale application composition via distributed-memory dataflow processing." 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. IEEE, 2013.

  • Kepler WFMS - Altintas, Ilkay, et al. "Kepler: an extensible system for design and execution of scientific workflows." Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004. IEEE, 2004.

Publications
