@stsievert
Last active January 27, 2020 14:16
Criteo dataset example
@stsievert
Author

Is Dask Distributed free?

Yes. Free as in beer (i.e., it doesn't cost money) and free as in speech (the source is freely available).

And will it read data in libsvm format?

Yes. Dask-ML is a wrapper around scikit-learn, which has a function for reading libsvm files, load_svmlight_file (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html). It'd be pretty simple to wrap that function with Dask:

from sklearn.datasets import load_svmlight_file

def read_chunk(filename):
    X, y = load_svmlight_file(filename)
    return X, y  # scipy.sparse matrix, raw ndarray

from distributed import Client
client = Client()

filenames = ["criteo-day-1.svmlight", ...]
Xs_ys = client.map(read_chunk, filenames)
# Xs_ys is a list of futures: the reads are submitted to the cluster and run in the background

# continue with rest of notebook

This code is untested.
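
Once the chunks are read, they still have to be combined into one dataset. Here is a minimal sketch of that step, not tried on the real Criteo files. It runs without a cluster by calling read_chunk directly on two toy files; with the futures above, client.gather(Xs_ys) should give the same list of (X, y) tuples. The toy file names, N_FEATURES, and combine are hypothetical names made up for illustration. Passing n_features and zero_based explicitly is an assumption about what the data needs: it keeps every chunk at the same width so the chunks can be stacked.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

N_FEATURES = 5  # hypothetical width; the real value depends on the dataset


def read_chunk(filename):
    # Fixing n_features gives every chunk the same number of columns, even
    # if one file never mentions the highest feature index.
    X, y = load_svmlight_file(filename, n_features=N_FEATURES, zero_based=True)
    return X, y  # scipy.sparse CSR matrix, NumPy ndarray


def combine(chunks):
    # chunks is a list of (X, y) tuples, one per file
    Xs, ys = zip(*chunks)
    return sp.vstack(Xs, format="csr"), np.concatenate(ys)


# Write two toy files so the sketch is self-contained (stand-ins for real data).
tmpdir = tempfile.mkdtemp()
base = np.array(
    [
        [1.0, 0.0, 0.0, 2.0, 0.0],
        [0.0, 0.0, 3.0, 0.0, 0.0],
        [0.0, 4.0, 0.0, 0.0, 0.0],
    ]
)
labels = np.array([0, 1, 0])
filenames = []
for day in (1, 2):
    path = os.path.join(tmpdir, f"toy-day-{day}.svmlight")
    dump_svmlight_file(base * day, labels, path, zero_based=True)
    filenames.append(path)

# Local stand-in for client.map(read_chunk, filenames) followed by a gather
chunks = [read_chunk(f) for f in filenames]
X, y = combine(chunks)
print(X.shape, y.shape)
```

The stacked X stays sparse, which matters for data as wide as Criteo's.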

@Sandy4321

Great code, thanks.
So Dask can help in both cases: reading the original RTB Criteo file or the libsvm format.
Only one short question:
In your code above, does load_svmlight_file read any libsvm format, or only the specific svmlight format?
Again, thanks a lot for taking care of this.
