@audhiaprilliant
Last active April 25, 2022 14:46
How to Automatically Build Stopwords
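# Assumes `df` is a pandas DataFrame built earlier with the columns
# 'rank', 'actual_freq', and 'zipf_freq' (one row per word)
import numpy as np
import plotnine
from plotnine import (
    aes, geom_abline, geom_line, ggplot, labs, theme_minimal, xlab, ylab
)
from sklearn.linear_model import LinearRegression

# Linear regression for ideal Zipf's line (fit in log-log space)
linear = LinearRegression()
linear.fit(
    X = np.log(np.array(df['rank'])).reshape(-1, 1),
    y = np.log(df['zipf_freq'])
)
# Print slope and intercept
print('Intercept: {intercept}\nSlope: {slope}'.format(
    intercept = linear.intercept_,
    slope = linear.coef_[0]
))
# Data viz: actual frequencies (black line) against the fitted Zipf line (red)
plotnine.options.figure_size = (10, 4.8)
(
    ggplot(data = df)
    + geom_line(
        aes(
            x = np.log(df['rank']),
            y = np.log(df['actual_freq']),
            group = 1
        ),
        size = 1.5
    )
    + geom_abline(
        intercept = linear.intercept_,
        slope = linear.coef_[0],
        size = 1,
        color = '#981220'
    )
    + labs(title = 'Zipf Distribution in English Literature')
    # Axis labels are passed positionally so the calls do not depend on the keyword's name
    + xlab('Rank (log)')
    + ylab('Frequency of words (log)')
    + theme_minimal()
)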
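The snippet above relies on a `df` that is not defined in this file. As a rough sketch of how such a table could be produced, the following builds `rank`, `actual_freq`, and `zipf_freq` from a list of tokens using only the standard library. The helper names `zipf_table` and `loglog_fit` are hypothetical, and the definition of `zipf_freq` as the top frequency divided by the rank is an assumption about the ideal Zipf curve, not something this gist confirms.

```python
from collections import Counter
import math


def zipf_table(tokens):
    """Return one row per distinct token, ordered by descending frequency."""
    counts = Counter(tokens).most_common()
    if not counts:
        return []
    top_freq = counts[0][1]
    rows = []
    for rank, (word, freq) in enumerate(counts, start=1):
        rows.append({
            'word': word,
            'rank': rank,
            'actual_freq': freq,
            # Assumption: the ideal Zipf frequency decays as 1 / rank
            'zipf_freq': top_freq / rank,
        })
    return rows


def loglog_fit(ranks, freqs):
    """Least-squares slope and intercept of log(freq) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy corpus, for illustration only
tokens = 'the cat and the dog and the bird'.split()
rows = zipf_table(tokens)
slope, intercept = loglog_fit(
    [row['rank'] for row in rows],
    [row['zipf_freq'] for row in rows],
)
print('Intercept: {}\nSlope: {}'.format(intercept, slope))
```

Under that assumption the fitted slope is -1 and the intercept is the log of the top frequency, which is what makes the red line an "ideal" reference. Passing `rows` to `pd.DataFrame` should give a table with the column names the plotting code expects.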