@zouzias
Created December 15, 2017 10:57
One-hot encoding issue with Sklearn and Pipeline (solved)
See comment https://github.com/jpmml/jpmml-sklearn/issues/38 (vruusmann commented on Apr 19)
If you want to apply one-hot-encoding to string columns, then you should simply use the sklearn.preprocessing.LabelBinarizer transformer class for that. It has exactly the same effect as a sequence of LabelEncoder followed by OneHotEncoder.
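from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# Binarize the string column in a single step
mapper = DataFrameMapper([
    ("country_name", LabelBinarizer())
])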
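To check the equivalence claimed above, here is a minimal sketch using plain scikit-learn, without DataFrameMapper. The sample values are made up for illustration. Note one caveat: when a column has exactly two distinct values, LabelBinarizer emits a single 0/1 column, whereas OneHotEncoder emits two, so the outputs only match for three or more classes.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

countries = np.array(["FR", "DE", "US", "DE", "FR"])  # made-up sample values

# One step: LabelBinarizer accepts the 1-D string column directly.
binarized = LabelBinarizer().fit_transform(countries)

# Two steps: LabelEncoder returns a 1-D integer array, so it is reshaped
# into a single-column 2-D array before being passed to OneHotEncoder.
codes = LabelEncoder().fit_transform(countries)
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# Both routes yield the same 5 x 3 indicator matrix for this input.
same = np.array_equal(binarized, one_hot)
```

The explicit `reshape(-1, 1)` is the step that a list of transformers inside DataFrameMapper has no place to perform.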
The OneHotEncoder transformation makes sense if your input data contains categorical integer columns.
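For contrast, a small sketch of the integer case, where OneHotEncoder can be applied directly. The column name and values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one integer-coded categorical column, already 2-D
# with shape (n_samples, 1).
region_codes = np.array([[0], [2], [1], [2]])

# fit_transform returns a sparse matrix by default; densify it for display.
encoded = OneHotEncoder().fit_transform(region_codes).toarray()
print(encoded)
```

Each row contains a single 1 in the position of its category code.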
Currently, sklearn_pandas.DataFrameMapper is unable to apply [LabelEncoder(), OneHotEncoder()] to a string column because of the "matrix transpose" problem discussed in that issue: LabelEncoder returns a 1-D array, while OneHotEncoder expects a 2-D array with one column per feature. You could additionally open an issue with the sklearn_pandas project and ask for their opinion about it.
It would be possible to make [LabelEncoder(), OneHotEncoder()] work by developing a custom Scikit-Learn transformer that handles the "matrix transpose". For example, [LabelEncoder(), MatrixTransposer(), OneHotEncoder()]. This MatrixTransposer operation would be a no-op from the PMML perspective.
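A rough sketch of what such a transformer might look like follows. MatrixTransposer is a hypothetical class, not part of Scikit-Learn, sklearn_pandas, or jpmml-sklearn, and its exact behavior here (1-D input becomes a column, 2-D input is transposed) is an assumption. The steps are chained by hand rather than through DataFrameMapper, and PMML export of the custom step is not covered.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class MatrixTransposer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: turn a 1-D array or a single-row matrix into a column."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X)
        if X.ndim == 1:
            # (n_samples,) -> (n_samples, 1)
            return X.reshape(-1, 1)
        # (1, n_samples) -> (n_samples, 1)
        return X.T


countries = np.array(["FR", "DE", "US", "DE", "FR"])  # made-up sample values

codes = LabelEncoder().fit_transform(countries)             # shape (5,)
column = MatrixTransposer().fit_transform(codes)            # shape (5, 1)
one_hot = OneHotEncoder().fit_transform(column).toarray()   # shape (5, 3)
```

Because the transformer only changes the array's shape and never its values, it carries no information of its own, which is consistent with it being a no-op on the PMML side.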