two_hot_encoder_for_categorical_data.md

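There are three ways to convert a categorical feature to numerical values. Take a time-of-day feature with the four values "morning", "afternoon", "evening" and "night" as an example: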

1. Use OneHotEncoder. The categorical feature becomes four new columns, and each row has exactly one 1 with 0 everywhere else. The problem is that the distance between "morning" and "afternoon" is the same as the distance between "morning" and "evening", so the order of the categories is lost.
2. Use OrdinalEncoder. The categorical feature becomes a single column: "morning" maps to 1, "afternoon" to 2, and so on. The distance between "morning" and "afternoon" is now smaller than the distance between "morning" and "evening", which is good. However, the distance between "morning" and "night" is the largest of all, which might not be what you want, because night is followed directly by morning.
3. Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that each row has two 1s. The distance between "morning" and "afternoon" is the same as the distance between "morning" and "night", and both are smaller than the distance between "morning" and "evening". I think this is the best solution. Check the code.
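To make the first two options concrete, here is a minimal sketch that builds both encodings by hand with NumPy and prints the pairwise Euclidean distances. It does not call scikit-learn's encoders; the identity matrix and the 1..4 column simply mirror what the list above describes, and the helper name pairwise_euclidean is an assumption made for this illustration.

```python
import numpy as np

categories = ["morning", "afternoon", "evening", "night"]

# One-hot: one row per category, a single 1 in each row.
one_hot = np.eye(len(categories), dtype=int)

# Ordinal: a single column holding 1, 2, 3, 4.
ordinal = np.arange(1, len(categories) + 1).reshape(-1, 1)


def pairwise_euclidean(m):
    # Hypothetical helper: distance between every pair of rows via broadcasting.
    diff = m[:, None, :] - m[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


one_hot_dist = pairwise_euclidean(one_hot)
ordinal_dist = pairwise_euclidean(ordinal)

print(one_hot_dist.round(3))  # every off-diagonal entry is identical
print(ordinal_dist)           # "morning" vs "night" is the largest entry
```

With one-hot encoding every pair of distinct categories is equally far apart, while with ordinal encoding "morning" and "night" end up furthest apart even though they are neighbours on the clock.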

Code:

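```python
import numpy as np


def two_hot(x):
    # x is a column vector of shape (n, 1). Each output column flags two
    # neighbouring categories, so adjacent times of day share one active bit.
    return np.concatenate(
        [
            (x == "morning") | (x == "afternoon"),
            (x == "afternoon") | (x == "evening"),
            (x == "evening") | (x == "night"),
            (x == "night") | (x == "morning"),
        ],
        axis=1,
    ).astype(int)


x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)
```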

Output:

```
[['morning']
 ['afternoon']
 ['evening']
 ['night']]
[[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]
```

Then we can measure the distances:

```python
from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances(x)
```

Output:

```
array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])
```
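Neighbouring categories share one active column and so differ in two positions (distance √2), while opposite categories share none and differ in four (distance 2). The same idea can be sketched for any ordered, cyclic list of categories. The function name two_hot_cyclic and its signature are assumptions made for this illustration, not part of the original code; for the four times of day it should give the same matrix as two_hot.

```python
import numpy as np


def two_hot_cyclic(values, categories):
    # Hypothetical generalisation of two_hot: column j is 1 for categories[j]
    # and for its cyclic successor categories[(j + 1) % n].
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    out = np.zeros((len(values), n), dtype=int)
    for row, v in enumerate(values):
        i = index[v]
        out[row, i] = 1            # column where v is the first category
        out[row, (i - 1) % n] = 1  # column where v is the successor
    return out


times = ["morning", "afternoon", "evening", "night"]
encoded = two_hot_cyclic(times, times)
print(encoded)
```

Note that with more than four categories this encoding only separates neighbours (distance √2) from non-neighbours (distance 2); it does not rank how far apart two non-neighbouring categories are.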


duttashi commented Jun 20, 2019