This page presents a project idea for a Python framework for job submission, with the intent to trigger collaboration on the topic.
- Kenneth Hoste, Stijn De Weirdt (HPC-UGent)
- job script: text file implementing a workload in a scripting language (e.g. bash, tclsh, Python, ...)
- has certain well-defined parts, e.g. shebang, RM header (e.g. `#PBS ...`), actual workload implementation, ...
- job: instance of a job script representing a workload, e.g. a simulation, scientific experiment, ...
- resource manager (RM): (remote) service where jobs are submitted to
- does not (have to) include job scheduler
- does not assume tracking of job states or job history (example: completed jobs are no longer available via the RM interface)
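The parts of a job script listed above (shebang, RM header, workload) can be illustrated with a small sketch. The function name `render_job_script` and its parameters are hypothetical and only serve the illustration; the `#PBS` prefix and the walltime directive are taken from the examples on this page.

```python
def render_job_script(workload, directives=(), shell='/bin/bash', prefix='#PBS'):
    """Assemble a job script from its parts (hypothetical helper).

    workload:   the actual commands to run, as one string
    directives: resource manager header entries, e.g. '-l walltime=10:00:00'
    shell:      interpreter used in the shebang line
    prefix:     header marker of the resource manager ('#PBS' for PBS/Torque)
    """
    lines = ['#!' + shell]                                    # shebang
    lines.extend('%s %s' % (prefix, d) for d in directives)   # RM header
    lines.append(workload)                                    # workload
    return '\n'.join(lines) + '\n'


script_text = render_job_script('echo hello', ['-l walltime=10:00:00'])
```

Keeping the header prefix as a parameter is one possible way to reuse the same helper for other resource managers; whether that is sufficient in practice is an open design question.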
support for (in order of preference):
- job submission (`qsub script.sh`)
- minimal job attributes
- walltime (e.g. `qsub -l walltime=10:00:00`)
- dependencies (e.g. `qsub -W depend=afterok:<jobid>`)
- job hold/release
- DAG job submission
- DAG job: set of jobs with interdependencies
- 'array' jobs (`qsub -t`)
- job querying (state) (e.g. `qstat`)
- job removal (e.g. `qdel`)
- advanced job attributes
- queue
- target partition/reservation
- node features (e.g. `qsub -W ...`)
- mail settings (e.g. `qsub -m abe`)
- memory requirements (e.g. `qsub -l vmem=10gb`)
- mapping of abstract node features to job attributes (e.g. a Harpertown-based GPGPU node => `qsub -q gpu_harpertown`, `qsub -l nodes=1:harpertown:gpu`)
- support for remote submission (cfr. Galaxy)
- e.g. via SSH tunnel to cluster login nodes, ...
- general interface to multiple (remote) systems (all around the world)
- API
- support for various resource managers, e.g. PBS, SLURM, PBSPro, OAR, LoadLeveler, MOAB, ...
- command line client (e.g. `mysub`)
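DAG job submission, as listed above, comes down to submitting each job only after the jobs it depends on, so that their job IDs can be passed along as dependencies. A minimal sketch follows; the names `submission_order` and `submit_dag`, and the `submit` callback, are hypothetical and not part of any agreed API.

```python
def submission_order(dependencies):
    """Return job names ordered so that every job follows its dependencies.

    `dependencies` maps a job name to the list of job names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    order = []
    state = {}  # job name -> 'visiting' or 'done'

    def visit(name):
        if state.get(name) == 'done':
            return
        if state.get(name) == 'visiting':
            raise ValueError("dependency cycle involving %r" % name)
        state[name] = 'visiting'
        for dep in dependencies.get(name, []):
            visit(dep)
        state[name] = 'done'
        order.append(name)

    # sorted() only makes the resulting order reproducible
    for name in sorted(dependencies):
        visit(name)
    return order


def submit_dag(dependencies, submit):
    """Submit all jobs in dependency order; return a {name: job id} dict.

    `submit(name, dep_ids)` is a caller-supplied function (an assumption of
    this sketch) that submits one job and returns its job ID.
    """
    jobids = {}
    for name in submission_order(dependencies):
        dep_ids = [jobids[dep] for dep in dependencies.get(name, [])]
        jobids[name] = submit(name, dep_ids)
    return jobids
```

With a resource manager that supports dependencies natively, `submit` could translate `dep_ids` into the corresponding dependency attribute; for one that does not, the framework would have to hold and release jobs itself.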
- source code repository: git
- collaborative framework: GitHub (github.com)
- documentation: GitHub wiki pages
- compatible with Python v2.6 and more recent Python v2.x
- compatibility with Python 3.x is definitely worth considering (and feasible alongside Python v2.6 support)
- references to guides on maintaining a Python2/3 compatible codebase?
- object-oriented design
- 'abstract' class `ResourceManager`:
    class ResourceManager(object):
        ...
- subclasses for specific resource managers:
    class Pbs(ResourceManager):
        ...
    class Slurm(ResourceManager):
        ...
- simple and clean API, e.g.:
    class Job(object):
        """Representation of a job."""
        def __init__(self, *args, **kwargs):
            self.name = None
            self.script = None
            self.dependencies = []
            self.jobid = None
            ...

    class GroupOfJobs(object):
        """Representation of a group of jobs (e.g. a DAG)."""
        def __init__(self, *args, **kwargs):
            self.jobs = []
            ...

    def create_job(job_script, jobs_specs=None):
        """Create a new Job instance."""
        ...

    class ResourceManager(object):
        """Abstract class representing a resource manager."""
        # job_attrs defaults to None rather than [] to avoid a shared mutable default
        def submit_job(self, job, job_attrs=None):
            """Submit a job."""
            raise NotImplementedError

        # UserHold: hold type constant, still to be defined by the framework
        def hold_job(self, job, hold_type=UserHold):
            """Set hold on a job."""
            raise NotImplementedError
        ...
- includes a suite of unit tests from the very start
- interfacing with specific resource managing software can be done via mocking
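To make the design points above more concrete, here is a minimal sketch of a `Pbs` subclass together with a test that mocks the call to `qsub`. Everything beyond the class names and `submit_job` is an assumption for illustration: the `build_command` method, the treatment of `job_attrs` as a list of `-l` resource strings, and the fake job ID. The sketch uses `unittest.mock` from the Python 3 standard library; on Python 2.6 a separate mocking package would be needed.

```python
import subprocess
from unittest import mock


class Job(object):
    """Stripped-down job representation for this sketch."""
    def __init__(self, script, name=None):
        self.script = script
        self.name = name
        self.jobid = None


class ResourceManager(object):
    """Abstract class representing a resource manager."""
    def submit_job(self, job, job_attrs=None):
        raise NotImplementedError


class Pbs(ResourceManager):
    """Sketch of a PBS/Torque backend that shells out to qsub."""

    def build_command(self, job, job_attrs=None):
        # Assumption: each attribute is a resource string passed via '-l'.
        cmd = ['qsub']
        for attr in job_attrs or []:
            cmd.extend(['-l', attr])
        cmd.append(job.script)
        return cmd

    def submit_job(self, job, job_attrs=None):
        # qsub prints the ID of the submitted job on stdout.
        output = subprocess.check_output(self.build_command(job, job_attrs))
        job.jobid = output.decode().strip()
        return job.jobid


def demo_submit_with_mocked_qsub():
    """Submit a job without a real cluster by mocking subprocess."""
    job = Job('script.sh')
    with mock.patch('subprocess.check_output',
                    return_value=b'123.master\n') as fake:
        jobid = Pbs().submit_job(job, ['walltime=10:00:00'])
    # Return the job ID and the command line that would have been run.
    return jobid, fake.call_args[0][0]
```

Separating command construction from execution keeps most of the backend testable without any mocking at all; only the thin `submit_job` wrapper needs the patched `subprocess` call.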
(preliminary draft, in no way is this final or approved yet!!!)
- pick a catchy name: joint effort
- agree on license: joint effort
- LGPL because of integration into other tools?
- make inventory of related/existing (Python) frameworks: joint effort
- includes looking into available Python APIs for resource managers, e.g. pbs_python, python-torque, ...
- design: joint effort
- framework implementation: NeSI (?)
- includes documenting via docstrings, unit tests, ...
- implementation for specific resource managers:
- Torque: HPC-UGent
- SLURM: NeSI, JSC (?)
- PBSPro: GMI/azet.org (?)
- OAR: Uni.lu (?)
- LoadLeveler: JSC (?)
cfr. EasyBuild structure
- release manager (NeSI?)
- only person with merging rights
- his/her institution hosts central git repository with `master` branch (e.g. https://github.com/nesi/)
- code reviewers
- developers
- testers
- backend for job submission in EasyBuild (`--job`)
- minimal requirement: DAG job submission
- backend for job submission in benchmarking/performance monitoring frameworks (e.g. JuBE)
- backend for user portal (e.g. in Django)
- e.g. Galaxy
- backend for Hanything-On-Demand (https://github.ugent.be/hpcugent/hanythingondemand/)
- PBS-specific:
- EasyBuild backend for `--job`: https://github.com/hpcugent/easybuild-framework/blob/master/easybuild/tools/pbs_job.py
- vsc-jobs: https://github.com/hpcugent/vsc-jobs
- specific to gathering job statistics
- `vsc.job` Python package (HPC-UGent internal for now, part of gsub)
- DRMAA
- current job submission backend in Galaxy?
- see https://bitbucket.org/galaxy/galaxy-dist/src/29ce93a13ac7c4a5d2b54e5e2c10960f30a350b3/lib/galaxy/jobs/runners
I think a lot of inspiration here can be taken from libcloud, which does something similar but for virtualization requisitioning. Have a poke through its code and take a look at how `get_driver` works by returning a class which can then be used to run the job.

Would testing require something to spawn a virtualized cluster? Otherwise, it's not clear how testing could be done.
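The `get_driver` idea mentioned in the comment above could be sketched as follows for resource managers. This is a loose imitation of the pattern, not libcloud's actual implementation; the `_DRIVERS` registry and the string keys are assumptions made for this example.

```python
class ResourceManager(object):
    """Abstract class representing a resource manager."""
    def submit_job(self, job, job_attrs=None):
        raise NotImplementedError


class Pbs(ResourceManager):
    """Placeholder for a PBS/Torque backend."""


class Slurm(ResourceManager):
    """Placeholder for a SLURM backend."""


# Hypothetical registry mapping a short name to a backend class.
_DRIVERS = {
    'pbs': Pbs,
    'slurm': Slurm,
}


def get_driver(name):
    """Return the ResourceManager subclass registered under `name`."""
    try:
        return _DRIVERS[name.lower()]
    except KeyError:
        raise ValueError("unknown resource manager: %r" % name)


# The caller gets a class back and instantiates it itself.
rm_class = get_driver('slurm')
rm = rm_class()
```

Returning the class rather than an instance leaves it to the caller to pass backend-specific settings (e.g. a remote host) to the constructor.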