
@dataders
Created September 15, 2021 20:49
Azure ML OutputFileDatasetConfig rant

Originally written for this SO question, before I realized they were hitting a bug.

TL;DR

The Dataset.Tabular.from_json_lines_files() method belongs to the Dataset.Tabular factory class and is used for creating Datasets from JSON Lines files that already exist in storage.
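To make the distinction concrete, here is a minimal sketch of creating a TabularDataset with the v1 SDK; the datastore name and path are hypothetical, and running it requires a real workspace with a `config.json`:

```python
from azureml.core import Workspace, Datastore, Dataset

# Connect to the workspace (assumes an `az ml`-generated config.json locally)
ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

# from_json_lines_files() *creates* a TabularDataset from JSON Lines
# files that already sit in a datastore -- it does not capture step output
ds = Dataset.Tabular.from_json_lines_files(
    path=(datastore, "raw/events/*.jsonl")  # hypothetical path
)
df = ds.to_pandas_dataframe()  # materialize locally for inspection
```

The key point: this is a read path over existing files, which is why it can't be used to describe where a pipeline step's output should land.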

OutputFileDatasetConfig is for capturing data coming out of a PipelineStep and passing it to a downstream step (or, optionally, registering it as a Dataset).

Deep Dive

There's a lot going on with Azure ML datasets; it can be hard to grok at first.

A few years ago, PipelineData was the way you passed data between steps of an AzureML Pipeline. This was amazingly effective -- especially when coupled with PythonScriptStep's allow_reuse parameter.
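A rough sketch of that pattern under the v1 SDK, with made-up script and compute names; `allow_reuse=True` is what lets Azure ML skip a step whose inputs and code haven't changed:

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Intermediate data handed from step 1 to step 2
processed = PipelineData("processed", datastore=datastore)

step1 = PythonScriptStep(
    script_name="prep.py",          # hypothetical script
    arguments=["--out", processed],
    outputs=[processed],
    compute_target="cpu-cluster",   # hypothetical compute target
    allow_reuse=True,               # reuse cached output if nothing changed
)
step2 = PythonScriptStep(
    script_name="train.py",         # hypothetical script
    arguments=["--in", processed],
    inputs=[processed],
    compute_target="cpu-cluster",
)
pipeline = Pipeline(ws, steps=[step1, step2])
```

Note that `processed` lands in an Azure ML-managed location inside the datastore; you don't control its path or filename, which is exactly the limitation described next.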

This worked for the most common use case, where users cared only about the data at the end of the pipeline and less about the data passed between steps. The intermediary data could be treated as ephemeral, so it didn't need to be stored somewhere for easy access later, as long as Azure ML could get to it when determining whether a step should be re-run.

However, to get the final dataset where you wanted it, you'd need a DataTransferStep to persist it into your output datastore with your desired filename and folder hierarchy. While this also works, it requires an Azure Data Factory to accomplish, which is unnecessary complexity.
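For reference, the extra hop looked roughly like this (a sketch with hypothetical names; `processed` is a PipelineData from an upstream step, and the compute target must be an attached Data Factory):

```python
from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import DataTransferStep

ws = Workspace.from_config()
blob_store = Datastore.get(ws, "output_blob")   # hypothetical datastore

# Where the final data should actually live
dest = DataReference(
    datastore=blob_store,
    data_reference_name="final_output",
    path_on_datastore="curated/my_table",       # hypothetical folder
)
transfer = DataTransferStep(
    name="land_final_data",
    source_data_reference=processed,            # PipelineData from an upstream step
    destination_data_reference=dest,
    compute_target=ws.compute_targets["adf-compute"],  # attached Data Factory required
)
```

One whole extra step, plus a Data Factory resource, just to control where a file ends up.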

Enter OutputFileDatasetConfig: the evolution of PipelineData. Now you can materialize the data coming out of a step wherever you like, yet still easily pass it to downstream steps.
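The same pattern collapses into a single step definition; a sketch with hypothetical datastore, folder, and script names:

```python
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = Datastore.get(ws, "output_blob")    # hypothetical datastore

# Land the step's output exactly where you want it, and (optionally)
# register it as a Dataset when the run completes
output = OutputFileDatasetConfig(
    name="cleaned",
    destination=(datastore, "curated/{run-id}/cleaned"),  # hypothetical folder
).read_delimited_files().register_on_complete(name="cleaned_dataset")

step = PythonScriptStep(
    script_name="clean.py",        # hypothetical script
    arguments=["--output", output],
    compute_target="cpu-cluster",  # hypothetical compute target
)
# a downstream step can consume it via output.as_input("cleaned")
```

No DataTransferStep, no Data Factory: the destination, the downstream hand-off, and the Dataset registration are all declared on the output itself.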
