@birkin
Created June 15, 2023 18:56

As of my last knowledge update in September 2021, TensorFlow has no direct function for one-hot encoding a list of keywords stored in a DataFrame column. You can, however, preprocess the data with pandas before feeding it into a TensorFlow model. Here's a general approach that uses pandas and scikit-learn to one-hot encode a column of keyword lists:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample DataFrame
data = {'keywords': [['keyword_a', 'keyword_c'], ['keyword_c', 'keyword_d']]}
df = pd.DataFrame(data)

# MultiLabelBinarizer is used for encoding multiple labels per instance
mlb = MultiLabelBinarizer()

# Fit and transform the keywords
one_hot = mlb.fit_transform(df['keywords'])

# Create a DataFrame from the one-hot encoded data
one_hot_df = pd.DataFrame(one_hot, columns=mlb.classes_, index=df.index)

# Concatenate the one-hot encoded features to the original DataFrame
result = pd.concat([df, one_hot_df], axis=1)

print(result)

This code snippet builds a new DataFrame, result, that keeps the original 'keywords' column and adds one 0/1 column for each unique keyword found in it. Because a row can contain several keywords, a row can have several 1s (strictly speaking, this is a multi-hot encoding). Note that MultiLabelBinarizer from sklearn.preprocessing encodes each list independently of the order of the keywords inside it, and the new columns follow the sorted order of mlb.classes_.
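To make the encoding itself concrete, here is a minimal pure-Python sketch of the same computation on the sample data. It is an illustration only, not scikit-learn's implementation: the helper name multi_hot is hypothetical, and it assumes the vocabulary is simply the sorted set of keywords seen in the input.

```python
def multi_hot(rows):
    """Return (classes, matrix) for a list of keyword lists.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    # Sorted vocabulary of every distinct keyword in the input.
    classes = sorted({kw for row in rows for kw in row})
    # Map each keyword to its column position.
    index = {kw: i for i, kw in enumerate(classes)}

    matrix = []
    for row in rows:
        vec = [0] * len(classes)
        for kw in row:
            # Setting a flag is idempotent, so keyword order and
            # duplicates within a row do not change the result.
            vec[index[kw]] = 1
        matrix.append(vec)
    return classes, matrix


rows = [['keyword_a', 'keyword_c'], ['keyword_c', 'keyword_d']]
classes, matrix = multi_hot(rows)
print(classes)  # ['keyword_a', 'keyword_c', 'keyword_d']
print(matrix)   # [[1, 1, 0], [0, 1, 1]]
```

Each inner list corresponds to one DataFrame row, and each position to one keyword column, which is why reordering the keywords inside a row leaves its encoding unchanged.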
