- Xavier Dupré (Microsoft): keep .fit in the API, but X could be a stream, from Spark for example, transparently for the user. Gaël: X indexable and len(X) == n_samples, is that good enough? Answer: X accessible through a sequential iterator.
- Jean-François Puget (IBM): IBM is betting on Spark at the scale of the company. Most machine-learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark get better? How can scikit-learn be used in a distributed environment? Not all algorithms can work out of core; some need a distributed algorithm (a partial_fit sketch follows these notes).
- Jean-Paul Smet (Nexedi): Nexedi is an example company. Wendelin.core helps us remove the overhead and enables out-of-core computing. Next step of the story in a year.
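For context on the out-of-core question above: scikit-learn already supports incremental learning through partial_fit on some estimators. Below is a minimal, illustrative sketch; the chunk generator and all its parameters are made up, standing in for a sequential stream coming from Spark or from disk.

import numpy as np
from sklearn.linear_model import SGDClassifier


def chunk_stream(n_chunks=100, chunk_size=1000, n_features=20, seed=0):
    # Synthetic stand-in for a sequential iterator over out-of-core data
    rng = np.random.RandomState(seed)
    w = rng.randn(n_features)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X.dot(w) > 0).astype(int)
        yield X, y


clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first call
for X_chunk, y_chunk in chunk_stream():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)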
# How to use: df should be the DataFrame restricted to the categorical columns to impact-encode,
# target should be the pd.Series of target values.
# Use fit, transform, etc.
# Three types: binary, multiple, continuous.
# For now m is a parameter <===== but what should we put here? Probably some function of the total shape.
# That is, what value of m would we want for 0.5?
# (A sketch of such an encoder follows the imports below.)
import pandas as pd
import numpy as np
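
# (Illustrative sketch only, not the original gist code: a minimal fit/transform
# impact (target) encoder with m as the smoothing pseudo-count. The class and
# parameter names below are made up for illustration.)
class ImpactEncoder:
    """Encode categorical columns by a smoothed mean of the target.

    m acts as a pseudo-count: category means are shrunk towards the global
    mean, with weight n / (n + m) given to the observed category mean.
    """

    def __init__(self, m=10.0):
        self.m = m

    def fit(self, df, target):
        self.global_mean_ = target.mean()
        self.mappings_ = {}
        for col in df.columns:
            stats = target.groupby(df[col]).agg(['mean', 'count'])
            smoothed = ((stats['count'] * stats['mean']
                         + self.m * self.global_mean_)
                        / (stats['count'] + self.m))
            self.mappings_[col] = smoothed
        return self

    def transform(self, df):
        out = df.copy()
        for col in df.columns:
            out[col] = df[col].map(self.mappings_[col]).fillna(self.global_mean_)
        return out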
""" | |
A scikit-learn like transformer to remove a confounding effect on X. | |
""" | |
from sklearn.base import BaseEstimator, TransformerMixin, clone | |
from sklearn.linear_model import LinearRegression | |
import numpy as np | |
class DeConfounder(BaseEstimator, TransformerMixin): | |
""" A transformer removing the effect of y on X. |
""" | |
Fast counting of 3-grams for short strings. | |
Quick benchmarking seems to show that pure Python code is faster when | |
for strings less that 1000 characters, and numpy versions is faster for | |
longer strings. | |
Very long strings would benefit from probabilistic counting (bloom | |
filter, count min sketch) as implemented eg in the "bounter" module. |
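"""

# (Illustrative sketch, not the gist's code: the kind of pure-Python variant
# the docstring describes, counting overlapping 3-character substrings.)
from collections import Counter


def count_3grams(string):
    """Return a Counter mapping each 3-gram of `string` to its count."""
    return Counter(string[i:i + 3] for i in range(len(string) - 2))

# Example: count_3grams("scikit") == Counter({'sci': 1, 'cik': 1, 'iki': 1, 'kit': 1})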
import numpy as np
import pylab as pl
import matplotlib.transforms as mtransforms

################################################################################
# Display correlation matrices

def fit_axes(ax):
    """ Resize the given axes so that its labels fit.
    """
"""Persistence strategies comparison script. | |
This script compute the speed, memory used and disk space used when dumping and | |
loading arbitrary data. The data are taken among: | |
- scikit-learn Labeled Faces in the Wild dataset (LFW) | |
- a fully random numpy array with 10000x10000 shape | |
- a dictionary with 1M random keys/values | |
- a list containing 10M random value | |
The compared persistence strategies are: |
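The list of strategies is cut off above. Without guessing its exact contents, here is a minimal sketch of the kind of dump/load benchmark such a comparison involves, using pickle and joblib (with and without compression) as illustrative strategies; data size and file paths are made up.

import os
import pickle
import time

import numpy as np
import joblib

data = np.random.random_sample((2000, 2000))  # small stand-in for the datasets above


def pickle_dump(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def bench(name, dump, load, path):
    t0 = time.time()
    dump(data, path)
    dump_time = time.time() - t0
    t0 = time.time()
    load(path)
    load_time = time.time() - t0
    size_mb = os.path.getsize(path) / 1e6
    print("%s: dump %.2fs, load %.2fs, %.1f MB"
          % (name, dump_time, load_time, size_mb))


bench("pickle", pickle_dump, pickle_load, "/tmp/data.pkl")
bench("joblib", joblib.dump, joblib.load, "/tmp/data.joblib")
bench("joblib compress=3",
      lambda obj, path: joblib.dump(obj, path, compress=3),
      joblib.load, "/tmp/data.joblib.z")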
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.utils import check_random_state
from sklearn import datasets
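
# (The rest of the script is not shown above; the following is an illustrative
# sketch of the kind of cross-validated grid search these imports set up, with
# made-up data sizes and alpha grid, not the original experiment.)
rng = check_random_state(0)
X, y = datasets.make_regression(n_samples=200, n_features=50,
                                noise=10.0, random_state=rng)

cv = ShuffleSplit(n_splits=20, test_size=0.25, random_state=rng)
alphas = np.logspace(-3, 2, 10)
for model in (Ridge(), Lasso()):
    search = GridSearchCV(model, {"alpha": alphas}, cv=cv)
    search.fit(X, y)
    print(model.__class__.__name__, search.best_params_, search.best_score_)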
'''
Non-parametric computation of entropy and mutual information.

Adapted by G Varoquaux from code created by R Brette, itself adapted
from several papers (see references in the code).

This code is maintained at https://github.com/mutualinfo/mutual_info
Please download the latest code there to get improvements and bug fixes.
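'''

# (Illustrative sketch, not the module's actual code: the classic
# Kozachenko-Leonenko k-nearest-neighbour estimator, the kind of
# non-parametric entropy estimate the docstring refers to.)
import numpy as np
from scipy.special import gamma, psi
from sklearn.neighbors import NearestNeighbors


def knn_entropy(X, k=3):
    """Differential entropy of samples X (n_samples, n_features), in nats."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # distance of every point to its k-th nearest neighbour (excluding itself)
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    eps = 2 * dist[:, -1]
    # volume of the d-dimensional unit ball
    c_d = np.pi ** (d / 2.) / gamma(d / 2. + 1)
    return (psi(n) - psi(k) + np.log(c_d)
            + d * np.mean(np.log(eps + np.finfo(float).eps)))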
### Keybase proof | |
I hereby claim: | |
* I am GaelVaroquaux on github. | |
* I am gaelvaroquaux (https://keybase.io/gaelvaroquaux) on keybase. | |
* I have a public key whose fingerprint is 44B8 B843 6321 47EB 59A9 8992 6C52 6A43 ABE0 36FC | |
To claim this, I am signing this object: |