Created
December 15, 2017 10:57
One hot encoding issue Sklearn and Pipeline (solved)
See comment https://github.com/jpmml/jpmml-sklearn/issues/38 (vruusmann commented on Apr 19)
If you want to apply one-hot-encoding to string columns, then you should simply use the sklearn.preprocessing.LabelBinarizer transformer class for that. It has exactly the same effect as a sequence of LabelEncoder followed by OneHotEncoder.
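from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# Map the string column "country_name" to one indicator column per distinct value.
mapper = DataFrameMapper([
    ("country_name", LabelBinarizer())
])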
The OneHotEncoder transformation makes sense if your input data contains categorical integer columns.
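To make the claimed equivalence concrete, here is a minimal sketch that encodes the same string column both ways and compares the results. The sample country names are made up for illustration, and the explicit `reshape(-1, 1)` before `OneHotEncoder` is done by hand here, outside of any `DataFrameMapper`.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Illustrative data only; not taken from the original issue.
countries = np.array(["Greece", "Estonia", "Switzerland", "Greece"])

# Route 1: LabelBinarizer works directly on the string column.
direct = LabelBinarizer().fit_transform(countries)

# Route 2: LabelEncoder yields a 1-D integer array, which is reshaped
# into a single column before being passed to OneHotEncoder.
codes = LabelEncoder().fit_transform(countries)
two_step = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# Both routes order the indicator columns by sorted class label.
same = np.array_equal(direct, two_step)
print(direct)
print(same)
```

One caveat: when a column has only two distinct values, LabelBinarizer emits a single 0/1 column rather than two indicator columns, so the two routes agree only for three or more distinct values.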
Currently, sklearn_pandas.DataFrameMapper is unable to apply [LabelEncoder(), OneHotEncoder()] on a string column due to the above "matrix transpose" problem. You could additionally open an issue with the sklearn_pandas project, and ask for their opinion about it.
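The "matrix transpose" problem is not spelled out in this note. A plausible reading, stated here as an assumption, is a shape mismatch: a label encoder returns a 1-D array of n codes, while a one-hot encoder expects a 2-D array of n rows and one column. The NumPy-only sketch below shows the two possible 2-D layouts of the same codes.

```python
import numpy as np

# Integer codes such as a label encoder might return for four rows (made-up values).
codes = np.array([1, 0, 2, 1])

# Layout A: one sample with four features (the unwanted interpretation).
as_row = codes.reshape(1, -1)

# Layout B: four samples with one feature (what a one-hot encoder needs).
as_column = codes.reshape(-1, 1)

print(as_row.shape, as_column.shape)
```

The two layouts are transposes of each other, which would explain the name given to the problem.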
It would be possible to make [LabelEncoder(), OneHotEncoder()] work by developing a custom Scikit-Learn transformer that handles "matrix transpose". For example, [LabelEncoder(), MatrixTransposer(), OneHotEncoder()]. This MatrixTransposer operation would be no-op from the PMML perspective.
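A rough sketch of what such a transformer could look like follows. `MatrixTransposer` is only the name suggested in the comment, not an existing class, and the assumption here is that its job is to turn the 1-D output of `LabelEncoder` into a single column. Whether sklearn_pandas and the JPMML converter would accept this class has not been verified.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class MatrixTransposer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: reshape a 1-D array of n values into an (n, 1) column."""

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data.
        return self

    def transform(self, X):
        return np.asarray(X).reshape(-1, 1)


# Chain the three steps by hand on made-up data.
codes = LabelEncoder().fit_transform(["Greece", "Estonia", "Greece"])
column = MatrixTransposer().fit_transform(codes)
encoded = OneHotEncoder().fit_transform(column).toarray()
print(encoded)
```

Because the transformer only changes the array layout and not the values, it matches the comment's point that the step would be a no-op from the PMML perspective.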