@pierdom
Created December 5, 2017 13:58
[A full data-preparation pipeline in Scikit-learn] #python #datascience #machinelearning #scikit
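from sklearn.pipeline import Pipeline, FeatureUnion
# Note: Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# newer releases provide sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, Imputer, LabelBinarizer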
# We will use two separate Pipelines for numerical and categorical attributes
num_attribs = list(housing_num) # list of numerical attributes
cat_attribs = ["ocean_proximity"] # list of categorical attributes
# Define the Pipeline for numerical attributes as a list of (name, transformer) steps; the names are arbitrary
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()), # note: this is a custom transformer (not defined in this gist)
    ('std_scaler', StandardScaler()),
])
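# Define the Pipeline for categorical attributes
# Note: in scikit-learn 0.19 and later, LabelBinarizer raises a TypeError inside a
# Pipeline because its fit_transform accepts only y; OneHotEncoder is the usual
# replacement for encoding feature columns.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])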
# Full pipeline: concatenate the outputs of both pipelines column-wise
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)
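# Define the custom DataFrameSelector transformer used above. It is not part of
# scikit-learn, so this definition must run before the pipelines are built.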
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
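
The pipeline above targets the scikit-learn API of late 2017. On scikit-learn 0.20 or later, roughly the same preparation can be sketched with `ColumnTransformer`, which selects DataFrame columns by name and so makes a custom `DataFrameSelector` unnecessary. This is a minimal sketch, not the author's code: the toy `housing` DataFrame and its column names are hypothetical stand-ins, the `CombinedAttributesAdder` step is omitted because it is not defined in this gist, and `OneHotEncoder` is swapped in for `LabelBinarizer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the `housing` DataFrame (assumption).
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})

num_attribs = ["total_rooms", "median_income"]  # numerical columns
cat_attribs = ["ocean_proximity"]               # categorical columns

# Numerical steps: fill missing values with the column median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer routes each column list to its transformer and
# concatenates the results; sparse_threshold=0 forces a dense array.
full_pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
    ],
    sparse_threshold=0,
)

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

The result has one row per input row, with the scaled numerical columns first and the one-hot columns after them, in the order the transformers are listed.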

elseagle commented Jun 4, 2018

how do you import DataFrameSelector?
