Skip to content

Instantly share code, notes, and snippets.

@kinow
Last active February 2, 2020 06:11
Show Gist options
  • Save kinow/9013c8ad1d13d4f50d6e1df769bf49c7 to your computer and use it in GitHub Desktop.
Save kinow/9013c8ad1d13d4f50d6e1df769bf49c7 to your computer and use it in GitHub Desktop.
Running Cylc Flow on Postgres

Preparing the environment

The branch used for this test was this one.

After checking it out, and installing everything in a virtual environment, you will need pip and Docker (or have a running Postgres server).

# install Postgres Python driver/module
pip install psycopg2
# start a postgres database for the suite five (or any other suite you'd like to try)
docker run --name cylc-postgres -p 5432:5432 -e POSTGRES_PASSWORD=cylc -e POSTGRES_USER=cylc -e POSTGRES_DB=cylc-five -d postgres

Finally,

(venv) kinow@ranma:~/Development/python/workspace/cylc-flow$ cylc run --no-detach five
            ._.                                                       
            | |                 The Cylc Suite Engine [8.0a1]         
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY.
      .___! |              It is free software, you are welcome to    
      !_____!             redistribute it under certain conditions;   
                        see `COPYING' in the Cylc source distribution. 
                                                                       
2020-02-02T04:39:22Z INFO - Suite server: url=tcp://ranma:43028/ pid=21747
2020-02-02T04:39:22Z INFO - Suite publisher: url=tcp://ranma:43077
2020-02-02T04:39:22Z INFO - Run: (re)start=0 log=1
2020-02-02T04:39:22Z INFO - Cylc version: 8.0a1
2020-02-02T04:39:22Z INFO - Run mode: live
2020-02-02T04:39:22Z INFO - Initial point: 20130808T0000Z
2020-02-02T04:39:22Z INFO - Final point: None
2020-02-02T04:39:22Z INFO - Cold Start 20130808T0000Z
2020-02-02T04:39:22Z INFO - [prep.20130808T0000Z] -submit-num=01, owner@host=ranma
2020-02-02T04:39:22Z INFO - [prep.20130808T0000Z] -triggered off []
2020-02-02T04:39:23Z INFO - [prep.20130808T0000Z] status=ready: (internal)submitted at 2020-02-02T04:39:23Z for job(01)
2020-02-02T04:39:23Z INFO - [prep.20130808T0000Z] -health check settings: submission timeout=None
2020-02-02T04:39:24Z INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-02-02T04:39:24Z INFO - [prep.20130808T0000Z] status=submitted: (received)started at 2020-02-02T04:39:24Z for job(01)
2020-02-02T04:39:24Z INFO - [prep.20130808T0000Z] -health check settings: execution timeout=None

You can check the folders where you would normally find the db SQLite databases. They should both be

image

image

And you can also confirm that you have now everything stored in your remote/local Postgres database.

kinow@ranma:~$ docker exec -t -i cylc-postgres /bin/bash
root@8e89eba5cb0c:/# psql -h localhost -U cylc -W cylc-five
Password: 
psql (12.1 (Debian 12.1-1.pgdg100+1))
Type "help" for help.

cylc-five=# select * from suite_params;
     key      |                value                 
--------------+--------------------------------------
 uuid_str     | d8cab0e4-cb72-44ba-92ba-9a28bf8d061c
 cylc_version | 8.0a1
 UTC_mode     | true
(3 rows)

cylc-five=# select * from task_pool;
     cycle      | name | spawned | status  | is_held 
----------------+------+---------+---------+---------
 20130808T0000Z | prep |       1 | running | f
 20130808T0000Z | foo  |       0 | waiting | f
 20130808T0000Z | bar  |       0 | waiting | f
(3 rows)

cylc-five=#

image

That's it. Everything works! 🎉

ps: if you have already executed the suite, and would like to re-create the Postgres database, it takes 2-3 seconds:

$ docker stop cylc-postgres; docker rm cylc-postgres; docker run --name cylc-postgres -p 5432:5432 -e POSTGRES_PASSWORD=cylc -e POSTGRES_USER=cylc -e POSTGRES_DB=cylc-five -d postgres

More about what was done

There is a draft pull request that adds support to SQLAlchemy Core (not ORM) here.

All tests passed. It does add SQLAlchemy core, but alas it doesn't mean it's that simple to use Cylc with different databases. Here are the pending issues:

  • Cylc database is single-tenant. Schema changes would be required to allow multi-tenancy (regarding workflows).
  • The INSERT statements in Cylc are performed using SQLite INSERT OR REPLACE; that's problematic because not all databases support it. And the ones that do, offer different ways of doing it.
  • The schema uses INTEGER for some fields like is_held, and passes false values for example. That works with SQLite using sqlite3, and with SQLAlchemy when the database is SQLite. With Postgres, false can only be used for BOOLEAN types. I suspect the same is true for Oracle & MySQL. And I haven't tested, but I think SQLite should work with BOOLEAN too.
  • Cylc uses two databases, private and public. This obviously doesn't work with Postgres, Oracle, MySQL, etc.
  • We would need to extend the Cylc configuration file, to support passing the connection parameters for each implementation. If a site uses Postgres, then we would need user, port, database, etc. But also, the user could want to customize the connection string, to pass extra parameters, or use a different implementation (a different Python module).

So here's a summary of what changes were required, besides the work in the draft pull request.

Disable public database

Cylc Flow uses sqlite for its database. It keeps two databases, private & public. When the scheduler starts it creates a private database. This is the most important database.

Then it copies the private database as public database. And in some parts of the process, the scheduler and other components of the system will send queries to the private database, or to both.

Errors in the public database are logged. Errors in the private database result in system errors.

This commit contains a change disables the creation of the public database, and also comments some code so that it is not called causing runtime errors.

Hard-coded connection URL

The connection URL used by SQLAlchemy for SQLite is something like sqlite:///home/kinow/five/log/db. But for Postgres, that could be postgres://cylc:cylc@localhost:5432/cylc-five. Or it could have extra parameters like psycog2://cylc:cylc@localhost:5432/cylc-five?timeout=??? (if supported, haven't tested but quite sure it has this feature just like Hibernate).

So I've hardcoded the connection string to self.conn_url = "postgresql://cylc:cylc@localhost/cylc-five" in this commit.

Set is_held and is_manual_submit to BOOLEAN type

That was due to Postgres not liking to have a field INTEGER receiving a false or true as value. Same commit mentioned before.

Use Postgres UPSERT

Postgres supports UPSERT. It can update or ignore when there are duplicates. But you must specify the constraint name or the index used to identify duplicated values. And also need to specify what gets updated.

This is clearly more complex than INSERT OR REPLACE from SQLite. MySQL has a similar feature. Not quite sure if there is a similar feature in Oracle.

It would be much easier to support multiple databases if we used just INSERT. Not sure if doable, as that would mean either querying the database to confirm whether values exist or not, or keep some variable in memory or a cache in memory I think.

Done in the same commit as before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment