stsievert/Criteo.ipynb

Last active January 27, 2020 14:16


Criteo dataset example


Sandy4321 commented Jan 26, 2020

Has anybody tried to run this locally?
Like
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark

Author

stsievert commented Jan 27, 2020 • edited

It looks like I ran this notebook locally (see dask/dask-ml#295 (comment)), because it only uses a very small subset of the dataset. I don't recall ever using the complete Criteo dataset, or even a significant fraction of it.

Sandy4321 commented Jan 27, 2020

But you use dask.distributed.
Also, the data is only 370 GB as a zipped file at this link:
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#criteo_tb

Author

stsievert commented Jan 27, 2020 • edited

I mostly used Distributed for its useful dashboard (for debugging, profiling, etc.). I didn't focus on actually scaling to the entire Criteo dataset; IIRC this simple use case illustrated some problems in Dask-ML.

My metric for "big data" is any data that's too large to fit in RAM. 370GB is certainly more than the 16GB of RAM my local machine has.

Sandy4321 commented Jan 27, 2020

I see.
Is Dask Distributed free?
And will it read data in libsvm format?

Author

stsievert commented Jan 27, 2020

Is Dask Distributed free?

Yes. Free as in beer (i.e., it doesn't cost money) and free as in speech (the source is freely available).

And will it read data in libsvm format?

Yes. Dask-ML is a wrapper around scikit-learn, which has a function for reading libsvm files: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html. It'd be pretty simple to wrap that function with Dask:

from distributed import Client
from sklearn.datasets import load_svmlight_file

def read_chunk(filename):
    # X is a scipy.sparse CSR matrix, y is a NumPy ndarray
    X, y = load_svmlight_file(filename)
    return X, y

client = Client()  # starts a local cluster when no address is given

filenames = ["criteo-day-1.svmlight", ...]
Xs_ys = client.map(read_chunk, filenames)
# Xs_ys is a list of futures; the cluster reads the files in the background

# continue with rest of notebook

This code is untested.
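
One detail worth watching when reading per-file chunks like this: load_svmlight_file infers the number of columns from each file unless n_features is passed, so two chunks can come back with different widths. The sketch below is a minimal illustration under assumptions, not part of the notebook. The value of N_FEATURES, the two tiny temporary files, and the sequential loop (standing in for client.map) are all made up for the example.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

N_FEATURES = 5  # assumption: the total feature count is known ahead of time


def read_chunk(filename, n_features=N_FEATURES):
    # Passing n_features gives every chunk the same number of columns,
    # even if a file never mentions the highest feature index.
    X, y = load_svmlight_file(filename, n_features=n_features, zero_based=True)
    return X, y


# Write two tiny files (hypothetical stand-ins for the real per-day chunks).
tmpdir = tempfile.mkdtemp()
chunks = [
    (np.array([[1.0, 0.0, 2.0, 0.0, 0.0]]), np.array([1.0])),
    (np.array([[0.0, 3.0, 0.0, 0.0, 4.0],
               [5.0, 0.0, 0.0, 0.0, 0.0]]), np.array([0.0, 1.0])),
]
filenames = []
for i, (X_dense, y_chunk) in enumerate(chunks):
    path = os.path.join(tmpdir, f"chunk-{i}.svmlight")
    dump_svmlight_file(X_dense, y_chunk, path, zero_based=True)
    filenames.append(path)

# Read sequentially here; with Dask, read_chunk could be handed to client.map.
results = [read_chunk(f) for f in filenames]

# Stack the per-file pieces into one sparse matrix and one label vector.
X = sp.vstack([X_part for X_part, _ in results]).tocsr()
y = np.concatenate([y_part for _, y_part in results])
```

Without n_features, the first file above would load with only three columns, and stacking it with the second would fail.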

Sandy4321 commented Jan 27, 2020

Great code, thanks.
So Dask can help in both cases: reading the original RTB Criteo file or the libsvm format.
Only a short question:
In your code above, does load_svmlight_file read any libsvm format, or only the specific svmlight format?
Again, thanks a lot for taking care.
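
For context on that question: the scikit-learn documentation describes load_svmlight_file as loading the "svmlight / libsvm" format, meaning one text layout with a sample per line, written as a label followed by index:value pairs. The parser below is only a rough sketch of that layout for illustration. parse_libsvm_line is a hypothetical helper, not a scikit-learn function, and it ignores details such as multilabel rows.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    # Drop an optional trailing comment, which starts with '#'.
    line = line.split("#", 1)[0].strip()
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        if index == "qid":
            # Optional query id used in ranking data; skipped in this sketch.
            continue
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 3:0.5 7:2 # example row")
print(label, features)
```

Only the nonzero features appear on each line, which is why the loader returns a sparse matrix.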
