import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
def align_gensim_models(models, words=None):
    """
    Returns the aligned/intersected models from a list of gensim word2vec models.
    Generalized from original two-way intersection as seen above.
    Also updated to work with the most recent version of gensim.
    Requires reduce from functools.
    In order to run this, make sure you run 'model.init_sims()' for each model before you input them for alignment.
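The intersection step described here can be sketched without gensim. This is only a minimal sketch, assuming each model is stood in for by a plain dict mapping a word to its vector. The name `intersect_vocabularies` is hypothetical, and the alphabetical ordering of the shared vocabulary is an assumption, not necessarily the ordering the original function uses.

```python
from functools import reduce


def intersect_vocabularies(models, words=None):
    """Restrict every model to the vocabulary shared by all of them.

    `models` is a non-empty list of dicts (word -> vector), a stand-in
    for gensim models. `words`, if given, narrows the shared vocabulary
    further. Returns new dicts that all have the same keys in the same order.
    """
    # Fold set intersection across all vocabularies (hence functools.reduce).
    common = reduce(lambda a, b: a & b, (set(m) for m in models))
    if words is not None:
        common &= set(words)
    # Assumption: alphabetical order, so every model is indexed identically.
    ordered = sorted(common)
    return [{w: m[w] for w in ordered} for m in models]


m1 = {"cat": [1.0], "dog": [2.0], "fish": [3.0]}
m2 = {"dog": [0.5], "cat": [0.1], "bird": [0.9]}
m3 = {"fish": [7.0], "cat": [4.0], "dog": [5.0]}
aligned = intersect_vocabularies([m1, m2, m3])
```

After this call, each dict in `aligned` contains only the words present in all three inputs, in the same order, so row `i` refers to the same word in every model.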
""" | |
Code to make a network out of the shortest N cosine-distances (or, equivalently, the strongest N associations) | |
between a set of words in a gensim word2vec model. | |
To use: | |
Set the filenames for the word2vec model. | |
Set `my_words` to be a list of your own choosing. | |
Set `num_top_dists` to be a number or a factor of the length of `my_words.` | |
Choose between the two methods below to produce distances, and comment-out the other one. | |
""" |
def measure_semantic_shift_by_neighborhood(model1, model2, word, k=25, verbose=False):
    """
    Basic implementation of William Hamilton (@williamleif) et al.'s measure of semantic change
    proposed in their paper "Cultural Shift or Linguistic Drift?" (https://arxiv.org/abs/1606.02821),
    which they call the "local neighborhood measure." They find this measure better suited to capturing
    the semantic change of nouns owing to "cultural shift," or changes in meaning "local" to that word,
    rather than global changes in language use ("linguistic drift"), which are better suited to a
    Procrustes-alignment method (also described in the same paper).
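The local neighborhood measure can be illustrated on toy data. This is a minimal sketch, not the implementation in this function: it assumes each model is a plain dict of word to vector, the helper names are hypothetical, and neighbors missing from either model are simply dropped. For a word, it takes the union of its k nearest neighbors in both models, builds one vector of similarities to those neighbors per model, and returns the cosine distance between the two similarity vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, word, k):
    """The k words most similar to `word`, excluding the word itself."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine_similarity(vectors[word], vectors[w]), reverse=True)
    return others[:k]


def local_neighborhood_distance(vectors1, vectors2, word, k=25):
    """Cosine distance between the word's second-order similarity vectors."""
    neighbors = set(nearest_neighbors(vectors1, word, k))
    neighbors |= set(nearest_neighbors(vectors2, word, k))
    # Assumption: skip neighbors that one of the models does not contain.
    shared = sorted(n for n in neighbors if n in vectors1 and n in vectors2)
    sims1 = [cosine_similarity(vectors1[word], vectors1[n]) for n in shared]
    sims2 = [cosine_similarity(vectors2[word], vectors2[n]) for n in shared]
    return 1.0 - cosine_similarity(sims1, sims2)


# Toy "models", made up purely for illustration.
early = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
late = {"a": [0.0, 1.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
unchanged = local_neighborhood_distance(early, early, "a", k=2)
shifted = local_neighborhood_distance(early, late, "a", k=2)
```

Comparing a model with itself gives a distance of approximately zero, while moving the word relative to its neighbors gives a clearly positive distance.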
    Arguments are: