from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.pipeline import Pipeline

# Let's generate a transformer to make both imputations we specified earlier
imputer = hdf_fenced.transformers.imputer()
# And a transformer to fence outliers as well
fencer = hdf_fenced.transformers.fencer()
# We choose only four numeric features (so we don't need to encode categorical features)
assem = VectorAssembler(inputCols=['Fare', 'Age', 'SibSp', 'Parch'], outputCol='features')
# Then we build a simple RF classifier with 20 trees
rf = RandomForestClassifier(featuresCol='features', labelCol='Survived', numTrees=20)
# And put all four stages into a nice pipeline
pipeline = Pipeline(stages=[imputer, fencer, assem, rf])
# Now we fit the model and use transform to get the predictions
# Thanks to handyspark, stratified imputation and outlier fencing are
# also part of the pipeline! :-)
model = pipeline.fit(sdf)
predictions = model.transform(sdf)
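The snippet above assumes two names defined earlier: `sdf`, a Spark DataFrame (the Titanic training set), and `hdf_fenced`, a handyspark HandyFrame on which the imputations and fences were specified. A minimal sketch of how they might be prepared — the file path, the stratification columns, and the fill/fence choices here are illustrative assumptions, not part of this gist:

```python
# Sketch only: assumes a running SparkSession (`spark`) and a local
# Titanic CSV; handyspark monkey-patches Spark DataFrames with toHandy().
from handyspark import *

# Illustrative path — replace with your own copy of the dataset
sdf = spark.read.csv('train.csv', header=True, inferSchema=True)

# Wrap the Spark DataFrame into a HandyFrame
hdf = sdf.toHandy()
# Stratified imputation: fill Age per Pclass/Sex group (assumed columns)
hdf_filled = hdf.stratify(['Pclass', 'Sex']).fill('Age', strategy='median')
# Record outlier fences for Fare; fencer() will later clip to these bounds
hdf_fenced = hdf_filled.fence(['Fare'])
```

Because `imputer()` and `fencer()` only read the statistics recorded on `hdf_fenced`, the resulting transformers are ordinary pipeline stages: the fill values and fence bounds learned on the training data are reapplied as-is when the fitted model transforms new data.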