Last active
May 25, 2023 21:25
subparagraphs.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "subparagraphs.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyPPFlpnRjBayY9yB+TN3dl6", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/avidale/e4450da902d36bb14c595987943120dc/subparagraphs.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "DXxaAlxZy8tA", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"The goal is to split a text into meaningful subparagraphs - see https://stackoverflow.com/questions/62164280.\n", | |
"\n", | |
"\"Meaningfulness\" is measured by the cosine similarity of consecutive sentence vectors: neighboring sentences within the same subparagraph should be similar.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "46T3HNB7k310", | |
"colab_type": "code", | |
"outputId": "dc301953-e6b8-4bd3-ca6e-e7799d8cc2a3", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 51 | |
} | |
}, | |
"source": [ | |
"from sklearn.datasets import fetch_20newsgroups\n", | |
"twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)" | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Downloading 20news dataset. This may take a few minutes.\n", | |
"Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n" | |
], | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "zB6ngeWYnGah", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"!python -m spacy download en_core_web_sm" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Q5dDyE8clGPH", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import spacy\n", | |
"import numpy as np\n", | |
"nlp = spacy.load('en_core_web_sm')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "BkIo9Celygia", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"text = twenty_train.data[1]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "PnDEuQcImj_l", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"doc = nlp(text)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "pxTZmweynRpz", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"sents = list(doc.sents)\n", | |
"vecs = np.stack([sent.vector / sent.vector_norm for sent in sents])" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "umXXkkrtzqTE", | |
"colab_type": "text" | |
}, | |
"source": [ | |
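"The similarity threshold below should be tuned to make the segmentation as meaningful as possible: a new subparagraph starts whenever the cosine similarity of two consecutive sentences falls below it, so a higher value produces more, shorter subparagraphs." | |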
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "yPQgOj1un-eB", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"threshold = 0.5" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "r29io_MipgQT", | |
"colab_type": "code", | |
"outputId": "fe588920-f7b0-44db-8d15-403ff6ce628f", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"clusters = [[0]]\n", | |
"for i in range(1, len(sents)):\n", | |
"    if np.dot(vecs[i], vecs[i-1]) < threshold:\n", | |
"        # Here we use only the similarity between neighboring pairs of sentences.\n", | |
"        # Instead, we could use a \"weakest link\" or \"strongest link\" approach,\n", | |
"        # which could potentially improve the quality of the clustering.\n", | |
"        clusters.append([])\n", | |
"    clusters[-1].append(i)\n", | |
"print(clusters)" | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[[0], [1], [2], [3], [4], [5], [6, 7, 8], [9], [10], [11, 12], [13], [14], [15, 16], [17], [18]]\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "DjnHMZGrwMyV", | |
"colab_type": "code", | |
"outputId": "2b847ccc-554e-4d78-b7bc-904315b56782", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 867 | |
} | |
}, | |
"source": [ | |
"for cluster in clusters:\n", | |
"    print(' '.join([sents[i].text for i in cluster]))\n", | |
"    print('---------------------------------------')" | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"From: [email protected]\n", | |
"---------------------------------------\n", | |
"(Guy Kuo)\n", | |
"\n", | |
"---------------------------------------\n", | |
"Subject:\n", | |
"---------------------------------------\n", | |
"SI Clock Poll - Final Call\n", | |
"\n", | |
"---------------------------------------\n", | |
"Summary:\n", | |
"---------------------------------------\n", | |
"Final call for SI clock reports\n", | |
"\n", | |
"---------------------------------------\n", | |
"Keywords: SI,acceleration,clock,upgrade\n", | |
" Article-I.D.: shelley.1qvfo9INNc3s\n", | |
"Organization: University of Washington\n", | |
"Lines: 11\n", | |
"\n", | |
"---------------------------------------\n", | |
"NNTP-Posting-Host:\n", | |
"---------------------------------------\n", | |
"carson.u.washington.edu\n", | |
"\n", | |
"\n", | |
"---------------------------------------\n", | |
"A fair number of brave souls who upgraded their SI clock oscillator have\n", | |
"shared their experiences for this poll. Please send a brief message detailing\n", | |
"your experiences with the procedure.\n", | |
"---------------------------------------\n", | |
"Top speed attained, CPU rated speed,\n", | |
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n", | |
"functionality with 800 and 1.4\n", | |
"---------------------------------------\n", | |
"m floppies are especially requested.\n", | |
"\n", | |
"\n", | |
"---------------------------------------\n", | |
"I will be summarizing in the next two days, so please add to the network\n", | |
"knowledge base if you have done the clock upgrade and haven't answered this\n", | |
"poll.\n", | |
"---------------------------------------\n", | |
"Thanks.\n", | |
"\n", | |
"\n", | |
"---------------------------------------\n", | |
"Guy Kuo <[email protected]>\n", | |
"\n", | |
"---------------------------------------\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "bn5W9KBSwuGw", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |
] | |
} |
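
The clustering cell in the notebook mentions "weakest link" and "strongest link" alternatives in a comment without spelling them out. The following is a minimal sketch of one possible reading of those terms, not the notebook author's implementation: the function name `segment` and the `linkage` parameter are hypothetical, and the input is assumed to be an array of unit-length sentence vectors like `vecs` above, so that a dot product equals cosine similarity.

```python
import numpy as np


def segment(vecs, threshold=0.5, linkage="neighbor"):
    """Greedily group consecutive unit vectors into clusters of indices.

    linkage (hypothetical parameter, introduced for illustration):
      "neighbor"  - compare with the previous sentence only (as in the notebook)
      "weakest"   - compare with the least similar sentence of the current cluster
      "strongest" - compare with the most similar sentence of the current cluster
    """
    vecs = np.asarray(vecs, dtype=float)
    if len(vecs) == 0:
        return []
    clusters = [[0]]
    for i in range(1, len(vecs)):
        # Cosine similarities between sentence i and every member of the open cluster.
        sims = vecs[clusters[-1]] @ vecs[i]
        if linkage == "neighbor":
            score = sims[-1]  # last member of the open cluster is sentence i-1
        elif linkage == "weakest":
            score = sims.min()
        elif linkage == "strongest":
            score = sims.max()
        else:
            raise ValueError(f"unknown linkage: {linkage!r}")
        if score < threshold:
            clusters.append([])  # too dissimilar: start a new subparagraph
        clusters[-1].append(i)
    return clusters


# Toy example: 2-D unit vectors at 0, 40, 80 and 200 degrees.
angles = np.deg2rad([0, 40, 80, 200])
toy_vecs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

print(segment(toy_vecs, 0.5, "neighbor"))   # [[0, 1, 2], [3]]
print(segment(toy_vecs, 0.5, "weakest"))    # [[0, 1], [2], [3]]
print(segment(toy_vecs, 0.5, "strongest"))  # [[0, 1, 2], [3]]
```

In this reading, "weakest link" yields tighter clusters because a sentence must resemble every sentence already in the subparagraph, while "strongest link" is more permissive because resembling any one of them is enough. With the notebook's data the call would be `segment(vecs, threshold, "weakest")`.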