import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
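A minimal sketch of how a `Config` object like the one above might be passed to `Article`. The helper name `fetch_article` and its `article_cls` parameter are hypothetical, added here only so the function can be exercised without network access; they are not part of the original snippet.

```python
def fetch_article(url, config, article_cls=None):
    """Download and parse one article, returning (title, text).

    `article_cls` is a hypothetical hook for substituting a stand-in class;
    by default the real newspaper.Article is used.
    """
    if article_cls is None:
        # Imported lazily so this sketch can be loaded without newspaper installed.
        from newspaper import Article as article_cls

    article = article_cls(url, config=config)
    article.download()  # fetches the page using the settings carried by `config`
    article.parse()     # extracts the title and body text from the downloaded HTML
    return article.title, article.text
```

With the `config` defined above, a call would look like `title, text = fetch_article(some_url, config)`, where `some_url` is whatever page you want to scrape.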
from functools import reduce

def align_gensim_models(models, words=None):
    """
    Returns the aligned/intersected models from a list of gensim word2vec models.
    Generalized from the original two-way intersection as seen above.
    Also updated to work with the most recent version of gensim.
    Requires reduce from functools.
    In order to run this, make sure you run 'model.init_sims()' for each model before you input them for alignment.
    """
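The intersection step described in the docstring can be illustrated without gensim. The sketch below is not the gist's implementation: it assumes each "model" is a plain dict mapping words to vectors, and the names `shared_vocabulary` and `restrict_to_shared` are hypothetical.

```python
from functools import reduce


def shared_vocabulary(vocabs):
    """Return the words present in every vocabulary, sorted for determinism."""
    # reduce folds the two-way set intersection across any number of vocabularies.
    return sorted(reduce(lambda a, b: a & b, (set(v) for v in vocabs)))


def restrict_to_shared(models):
    """Keep only the shared words in each model.

    Assumption: `models` is a non-empty list of dicts (word -> vector),
    standing in for gensim word2vec models.
    """
    shared = shared_vocabulary(models)
    return [{word: model[word] for word in shared} for model in models]
```

Sorting the shared vocabulary gives every restricted model the same word order, which is what makes row-by-row comparison of the vectors meaningful afterwards.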
"""
Code to make a network out of the shortest N cosine distances (or, equivalently, the strongest N associations)
between a set of words in a gensim word2vec model.
To use:
Set the filenames for the word2vec model.
Set `my_words` to be a list of your own choosing.
Set `num_top_dists` to be a number or a factor of the length of `my_words`.
Choose between the two methods below to produce distances, and comment out the other one.
"""
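As a rough illustration of the idea in the docstring, the following sketch picks the N closest word pairs by cosine distance and returns them as an edge list. It assumes word vectors are available as a plain dict of lists rather than a loaded gensim model, and the function names `cosine_distance` and `top_edges` are hypothetical; only `my_words` and `num_top_dists` come from the original.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_edges(vectors, my_words, num_top_dists):
    """Return the num_top_dists closest word pairs as (word_a, word_b, distance)."""
    # Score every unordered pair of words once.
    scored = [
        (cosine_distance(vectors[a], vectors[b]), a, b)
        for a, b in combinations(my_words, 2)
    ]
    scored.sort()  # smallest distance (strongest association) first
    return [(a, b, dist) for dist, a, b in scored[:num_top_dists]]
```

The resulting tuples can be fed to any graph library as weighted edges; which one the original gist uses is not shown in this excerpt.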