Ian Rose ian-r-rose

Notes on High-Level-Graph-ication

The essential idea behind a high-level graph (HLG) is this: it's a lazy mapping that can produce low-level Dask task graphs on demand. Until these low-level tasks are produced (a step called "materialization"), HLGs offer several advantages:

They allow for higher-level reasoning about graph structure, including optimizations that would be challenging or impossible once the graph is represented by many low-level tasks.
They can be used to produce only the keys that a full computation actually needs. That is, later operations like slicing can feed back into earlier HLG layers so that those layers skip producing tasks that won't be needed (called HLG culling). This can save significant time and memory.
They can be much cheaper to serialize and communicate than low-level task graphs.
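
The lazy-mapping and culling ideas above can be illustrated with a small, Dask-free sketch. Everything in it is an assumption made for illustration: `LazyBlockLayer`, its `cull` method, and the `square` function are hypothetical names, and this is not Dask's actual `Layer` interface. It only shows a mapping that builds its low-level tasks on first access, and that can be narrowed to the needed keys before that happens.

```python
from collections.abc import Mapping


class LazyBlockLayer(Mapping):
    """Hypothetical toy layer: the task for key (name, i) is (func, i).

    This is NOT Dask's Layer class; it only mimics a lazy mapping.
    """

    def __init__(self, name, func, nblocks, keep=None):
        self.name = name
        self.func = func
        self.nblocks = nblocks
        # Block indices to produce; None means "all blocks", stored compactly.
        self.keep = None if keep is None else frozenset(keep)
        self.materialized = None  # low-level tasks, built on first access

    def cull(self, needed_keys):
        """Return a new, still-lazy layer restricted to the needed keys."""
        keep = {
            i
            for (name, i) in needed_keys
            if name == self.name
            and 0 <= i < self.nblocks
            and (self.keep is None or i in self.keep)
        }
        return LazyBlockLayer(self.name, self.func, self.nblocks, keep)

    def _materialize(self):
        # Build the low-level task dictionary once, only when asked for.
        if self.materialized is None:
            indices = range(self.nblocks) if self.keep is None else sorted(self.keep)
            self.materialized = {(self.name, i): (self.func, i) for i in indices}
        return self.materialized

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        # Answerable without materializing anything.
        return self.nblocks if self.keep is None else len(self.keep)


def square(i):
    return i * i


layer = LazyBlockLayer("x", square, nblocks=1_000_000)

# A later operation only needs blocks 3 and 4, so cull before materializing.
culled = layer.cull({("x", 3), ("x", 4)})

tasks = dict(culled)  # materialization happens here, for two keys only
results = {key: func(arg) for key, (func, arg) in tasks.items()}
```

In this sketch the million-block layer never builds its task dictionary; only the culled copy does, and only for the two keys that were requested.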

However, HLG layers have proven difficult to write. Broadly speaking, the difficulties fall into two categories: algorithmic (specifically regarding culling) and serializability.

	import time
	import subprocess

	import dask
	from dask.distributed import Client, wait

	# sp = subprocess.Popen(["viztracer", "-m", "distributed.cli.dask_scheduler"])
	sp = subprocess.Popen(
	    ["viztracer", "-m", "distributed.cli.dask_scheduler", "-o", "results.json"]
	)

	import io
	import os
	import re
	import string
	import zipfile

	import altair
	import altair_saver
	import junitparser
	import pandas

	import contextlib
	import os
	import subprocess
	import time

	import s3fs

	BUCKET = "ursa-labs-taxi-data"
	KEY = "2009/01/data.parquet"
	URL = f"s3://{BUCKET}/{KEY}"

	import os

	import dask
	import s3fs

	dask.config.set({"num_workers": 2})
	dask.config.set({"scheduler": "threads"})

	fs = s3fs.S3FileSystem(anon=True)
	paths = fs.glob("s3://ursa-labs-taxi-data/2009/**.parquet")

	import datetime

	import distributed
	from distributed.diagnostics import SchedulerPlugin
	from distributed.utils import key_split, key_split_group


	class TaskGroupStatistics(SchedulerPlugin):
	    def __init__(self):
	        """Initialize the plugin"""

Cu