This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| from sklearn.preprocessing import StandardScaler, MinMaxScaler | |
| standardise_age = StandardScaler() | |
| rescale_fare = MinMaxScaler() | |
| standardise_age.fit(train[['Age']]) | |
| rescale_fare.fit(train[['Fare']]) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| transformed_age = \ | |
| pd.DataFrame(imputer_age.transform(train[['Age']]), | |
| columns=['Age', 'Age_missing'], | |
| index=train.index) # the most important line: do not forget the index | |
| train = train.drop(columns=['Age']).join(transformed_age) | |
| transformed_age = \ | |
| pd.DataFrame(imputer_age.transform(validation[['Age']]), | |
| columns=['Age', 'Age_missing'], |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| pd.DataFrame(imputer_age.transform(train[['Age']]), | |
| columns=['Age', 'Age_missing']) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| from sklearn.impute import SimpleImputer | |
| imputer_age = SimpleImputer(strategy='median', | |
| add_indicator=True) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| from sklearn.ensemble import IsolationForest | |
| outlier_detection = IsolationForest(random_state=1, behaviour="new") | |
| outlier_detection.fit(titanic[['Fare', 'SibSp', 'Parch', 'Age']].dropna()) | |
| data = titanic[['Fare', 'SibSp', 'Parch', 'Age']].dropna() | |
| data['anomaly_score'] = outlier_detection.score_samples(data) | |
| data.sort_values('anomaly_score') |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| for column_to_delete in ['Ticket', 'Cabin', 'Name']: | |
| del titanic[column_to_delete] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| profile = ProfileReport(data, progress_bar=True, minimal=True) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| from sklearn.model_selection import train_test_split | |
| target = 'Survived' | |
| intermediate_sample, holdout = train_test_split(titanic, | |
| test_size=.2, | |
| random_state=2020, | |
| stratify=titanic[target]) | |
| train, validation = train_test_split(intermediate_sample, | |
| test_size=.2, | |
| random_state=2020, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # The following code removes the data where 'column_with_outliers' is more than 10 times its average | |
| data = data.loc[data['column_with_outliers'] < data['column_with_outliers'].mean()*10] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| data = data.loc[~(data['column_class'] == 'imbalanced_class')].reset_index(drop=True) |