Created
June 10, 2020 16:26
Companion Notebook for the Lincs Seminar
{ | |
"cells": [ | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# GISMO: Notebook for the LINCS Seminar" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Gismo input" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:36:47.082943Z", | |
"end_time": "2020-06-10T12:36:52.499931Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from gismo.common import toy_source_text\nprint(toy_source_text)", | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Gismo uses sources (list-like object)" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Build corpus" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:37:03.945842Z", | |
"end_time": "2020-06-10T12:37:03.951526Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from gismo.corpus import Corpus\ncorpus = Corpus(toy_source_text)", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "A Gismo Corpus is mostly a source with a `to_text` method." | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Build Embedding" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:37:14.760315Z", | |
"end_time": "2020-06-10T12:37:19.309403Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from gismo.embedding import Embedding\nfrom sklearn.feature_extraction.text import CountVectorizer\n\nvectorizer = CountVectorizer(dtype=float)\nembedding = Embedding(vectorizer=vectorizer)\nembedding.fit_transform(corpus)", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- The embedding builds on `sklearn` Countvectorizer and provides a similar `fit`/`transform` approach\n- Gismo will provide a default vectorizer, but learning to shape your own vectorizer is recommended\n- You can embed a `spacy` preprocessing in the vectorizer if you want to lemmatize transparently\n- Methods to re-use pre-computed embeddings are provided" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Build Gismo" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:37:34.098556Z", | |
"end_time": "2020-06-10T12:37:34.103586Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from gismo.gismo import Gismo\ngismo = Gismo(corpus, embedding)", | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Creation of a gismo object is instant." | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Rank Gismo" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:37:42.157914Z", | |
"end_time": "2020-06-10T12:37:46.935223Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.rank(\"Gizmo\")\ngismo.get_ranked_documents(3)", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"execution_count": 5, | |
"data": { | |
"text/plain": "['Gizmo is a Mogwaï.',\n 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',\n 'In chinese folklore, a Mogwaï is a demon.']" | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Rank is the base operation after a gismo object is built\n- It performs a diffusion algorithm that *tunes* the dataset to the query\n- The `get_ranked...` methods select top objects" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Importance/relevance trade-off" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:38:13.567305Z", | |
"end_time": "2020-06-10T12:38:13.578874Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.diteration.alpha=.8\ngismo.rank(\"Gizmo\")\ngismo.get_ranked_documents(3)", | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"execution_count": 6, | |
"data": { | |
"text/plain": "['Gizmo is a Mogwaï.',\n 'In chinese folklore, a Mogwaï is a demon.',\n 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']" | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Changing $\\alpha$ changes the ranking" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Clustering" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:42:46.868180Z", | |
"end_time": "2020-06-10T12:42:46.873649Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from gismo.post_processing import print_document_cluster\ngismo.post_document_cluster = print_document_cluster", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Gismo uses `post_...` methods to shape the output\n- A few basic options are provided\n- Users can define their own post-processors (cf Xplorer example later)" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Resolution" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:43:31.440957Z", | |
"end_time": "2020-06-10T12:43:31.485854Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.get_clustered_ranked_documents(5, resolution=.01)", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " F: 0.04. R: 1.85. S: 0.99.\n- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- This is a sentence about Blade. (R: 0.04; S: 0.17)\n- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Low resolution: flat tree" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Resolution" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:44:19.238196Z", | |
"end_time": "2020-06-10T12:44:19.274563Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.get_clustered_ranked_documents(5, resolution=.9)", | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " F: 0.05. R: 1.85. S: 0.99.\n- F: 0.58. R: 1.77. S: 0.98.\n-- F: 0.69. R: 1.51. S: 0.98.\n--- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n--- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- F: 0.70. R: 0.08. S: 0.19.\n-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "High resolution: binary tree" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Resolution" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:44:59.548224Z", | |
"end_time": "2020-06-10T12:44:59.582960Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.get_clustered_ranked_documents(5, resolution=.5)", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " F: 0.05. R: 1.85. S: 0.99.\n- F: 0.68. R: 1.77. S: 0.98.\n-- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- F: 0.70. R: 0.08. S: 0.19.\n-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Medium resolution: intermediate" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Query-based distortion" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2020-06-10T12:45:49.120423Z", | |
"end_time": "2020-06-10T12:45:49.165597Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "gismo.query_distortion = False\ngismo.get_clustered_ranked_documents(5)", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " F: 0.04. R: 1.85. S: 0.72.\n- F: 0.22. R: 1.51. S: 0.92.\n-- Gizmo is a Mogwaï. (R: 1.23; S: 0.99)\n-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.36)\n- F: 0.08. R: 0.34. S: 0.11.\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.15)\n-- F: 0.23. R: 0.08. S: 0.07.\n--- This is a sentence about Blade. (R: 0.04; S: 0.06)\n--- This is another sentence about Shadoks. (R: 0.04; S: 0.05)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Without query-distortion, the long sentence is grouped with the other objects that contain the word sentence." | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"toc": { | |
"nav_menu": {}, | |
"number_sections": true, | |
"sideBar": true, | |
"skip_h1_title": true, | |
"base_numbering": 1, | |
"title_cell": "Table of Contents", | |
"title_sidebar": "Contents", | |
"toc_cell": false, | |
"toc_position": {}, | |
"toc_section_display": true, | |
"toc_window_display": false | |
}, | |
"language_info": { | |
"name": "python", | |
"version": "3.7.7", | |
"mimetype": "text/x-python", | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"pygments_lexer": "ipython3", | |
"nbconvert_exporter": "python", | |
"file_extension": ".py" | |
}, | |
"celltoolbar": "Slideshow", | |
"gist": { | |
"id": "", | |
"data": { | |
"description": "Companion Notebook for the Lincs Seminar", | |
"public": true | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
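The "Build Embedding" step relies on a count-based vectorizer. As a rough illustration of what such a vectorizer computes on the toy source, here is a minimal plain-Python sketch. It is not Gismo's or scikit-learn's implementation: the helper names `tokenize` and `count_matrix` are hypothetical, and the only thing borrowed from `CountVectorizer` is its default tokenization (lowercased tokens of at least two word characters).

```python
import re

# Same five sentences as the toy_source_text printed in the notebook.
toy_source = [
    "Gizmo is a Mogwaï.",
    "This is a sentence about Blade.",
    "This is another sentence about Shadoks.",
    "This very long sentence, with a lot of stuff about Star Wars inside, "
    "makes at some point a side reference to the Gremlins movie "
    "by comparing Gizmo and Yoda.",
    "In chinese folklore, a Mogwaï is a demon.",
]

# Mirrors CountVectorizer's default token pattern: tokens of at least two
# word characters, so one-letter words such as "a" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocabulary)  # floats, in the spirit of dtype=float
        for token in tokens:
            row[index[token]] += 1.0
        rows.append(row)
    return vocabulary, rows


vocabulary, rows = count_matrix(toy_source)
gizmo = vocabulary.index("gizmo")
print([row[gizmo] for row in rows])  # [1.0, 0.0, 0.0, 1.0, 0.0]
```

The `gizmo` column is non-zero only for the first and fourth sentences, the two documents that mention Gizmo directly. The third document returned by `rank` contains no such word, so it can only be reached through words shared with other documents, which is where the diffusion comes in.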