@Felflare
Last active August 11, 2022 15:41
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
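"!git clone git@github.com:huggingface/transformers.git\n",
"# `!cd` runs in a throwaway subshell, so point pip at the clone directly\n",
"!pip install ./transformers\n",
"# The cells below also import sentence_transformers and scipy\n",
"!pip install sentence-transformers scipy"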
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import scipy.spatial.distance\n",
"from sentence_transformers import SentenceTransformer\n",
"model = SentenceTransformer('bert-base-nli-mean-tokens')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Get a sample corpus to search over\n",
"_c=\"\"\"\n",
"Coronavirus:\n",
"White House organizing program to slash development time for coronavirus vaccine by as much as eight months (Bloomberg)\n",
"Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)\n",
"AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)\n",
"Reopening:\n",
"Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico)\n",
"White House risks backlash with coronavirus optimism if cases flare up again (The Hill)\n",
"Florida plans to start reopening on Monday with restaurants and retail in most areas allowed to resume business in most areas (Bloomberg)\n",
"California Governor Newsom plans to order closure of all state beaches and parks starting Friday due to concerns about overcrowding (CNN)\n",
"Japan preparing to extend coronavirus state of emergency, which is scheduled to end 6-May, by about another month (Reuters)\n",
"Policy/Stimulus:\n",
"Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico)\n",
"Global economy:\n",
"China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg)\n",
"China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg)\n",
"Japan's March factory output fell at the fastest pace in five months, while retail sales also dropped (Reuters)\n",
"Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT)\n",
"US-China:\n",
"Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)\n",
"Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ)\n",
"Oil:\n",
"Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)\n",
"Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)\n",
"Norway, Europe's biggest oil producer, joins international efforts to cut supply for first time in almost two decades (Bloomberg)\n",
"IEA says coronavirus could drive 6% decline in global energy demand in 2020 (FT)\n",
"Corporate:\n",
"Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ)\n",
"Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg)\n",
"Tesla posts third straight quarterly profit while Musk rants on call about need for lockdowns to be lifted (Bloomberg)\n",
"eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ)\n",
"Royal Dutch Shell cuts dividend for first time since World War II and also suspends next tranche of buyback program (Reuters)\n",
"Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)\n",
"Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)\n",
"Trump contradicts US intel, says Covid-19 started in Wuhan lab.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
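"# Convert the corpus into a list of headlines, dropping blank lines\n",
"# and section labels (anything with fewer than 4 space-separated words)\n",
"corpus = [line for line in _c.split('\\n') if line != '' and len(line.split(' ')) >= 4]"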
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Get a vector for each headline (sentence) in the corpus\n",
"corpus_embeddings = model.encode(corpus)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Define search queries and embed them to vectors as well\n",
"queries = [\n",
"    'The economy is more resilient and improving.',\n",
"    'The economy is in a lot of trouble.',\n",
"    'Trump is hurting his own reelection chances.',\n",
"]\n",
"query_embeddings = model.encode(queries)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"======================\n",
"\n",
"\n",
"Query: The economy is more resilient and improving.\n",
"\n",
"Top 5 most similar sentences in corpus:\n",
"Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ) (Score: 0.5362)\n",
"Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg) (Score: 0.4632)\n",
"Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ) (Score: 0.3558)\n",
"Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico) (Score: 0.3052)\n",
"White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.2885)\n",
"\n",
"\n",
"======================\n",
"\n",
"\n",
"Query: The economy is in a lot of trouble.\n",
"\n",
"Top 5 most similar sentences in corpus:\n",
"Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.4667)\n",
"eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ) (Score: 0.4338)\n",
"China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg) (Score: 0.4283)\n",
"Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT) (Score: 0.4252)\n",
"China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg) (Score: 0.4052)\n",
"\n",
"\n",
"======================\n",
"\n",
"\n",
"Query: Trump is hurting his own reelection chances.\n",
"\n",
"Top 5 most similar sentences in corpus:\n",
"Trump contradicts US intel, says Covid-19 started in Wuhan lab. (Score: 0.7472)\n",
"Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ) (Score: 0.7408)\n",
"Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters) (Score: 0.7111)\n",
"Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.6213)\n",
"White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.6181)\n"
]
}
],
"source": [
"# For each search query, print the 5 closest sentences by cosine similarity\n",
"closest_n = 5\n",
"for query, query_embedding in zip(queries, query_embeddings):\n",
"    # cdist returns cosine *distance* (1 - similarity), so smaller is closer\n",
"    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, \"cosine\")[0]\n",
"\n",
"    # Pair each corpus index with its distance and sort ascending\n",
"    results = sorted(enumerate(distances), key=lambda x: x[1])\n",
"\n",
"    print(\"\\n\\n======================\\n\\n\")\n",
"    print(\"Query:\", query)\n",
"    print(\"\\nTop %d most similar sentences in corpus:\" % closest_n)\n",
"\n",
"    for idx, distance in results[0:closest_n]:\n",
"        print(corpus[idx].strip(), \"(Score: %.4f)\" % (1-distance))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "transformers",
"language": "python",
"name": "transformers"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
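The ranking step in the last notebook cell relies on cosine distance from `scipy.spatial.distance.cdist`, and the printed score is `1 - distance`, i.e. the cosine similarity. As a rough illustration of what that score measures, here is a minimal standard-library sketch. The helper names `cosine_similarity` and `top_n` and the toy 2-D vectors are assumptions made for this example and are not part of the notebook; real sentence embeddings have hundreds of dimensions, and the sketch assumes no vector is all zeros.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec: Sequence[float],
          corpus_vecs: Sequence[Sequence[float]],
          n: int = 5) -> List[Tuple[int, float]]:
    """Return (corpus index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy stand-ins for embeddings (hypothetical values, not model output)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

ranking = top_n(toy_query, toy_corpus, n=2)
for idx, score in ranking:
    print(idx, "(Score: %.4f)" % score)
```

Sorting by similarity in descending order gives the same ordering as sorting by cosine distance in ascending order, which is what the notebook does.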