@balouf
Created June 10, 2020 16:26
Companion Notebook for the Lincs Seminar
{
"cells": [
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# GISMO: Notebook for the LINCS Seminar"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Gismo input"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:36:47.082943Z",
"end_time": "2020-06-10T12:36:52.499931Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from gismo.common import toy_source_text\nprint(toy_source_text)",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Gismo uses sources (list-like object)"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Build corpus"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:37:03.945842Z",
"end_time": "2020-06-10T12:37:03.951526Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from gismo.corpus import Corpus\ncorpus = Corpus(toy_source_text)",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "A Gismo Corpus is mostly a source with a `to_text` method."
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Build Embedding"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:37:14.760315Z",
"end_time": "2020-06-10T12:37:19.309403Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from gismo.embedding import Embedding\nfrom sklearn.feature_extraction.text import CountVectorizer\n\nvectorizer = CountVectorizer(dtype=float)\nembedding = Embedding(vectorizer=vectorizer)\nembedding.fit_transform(corpus)",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- The embedding builds on `sklearn` Countvectorizer and provides a similar `fit`/`transform` approach\n- Gismo will provide a default vectorizer, but learning to shape your own vectorizer is recommended\n- You can embed a `spacy` preprocessing in the vectorizer if you want to lemmatize transparently\n- Methods to re-use pre-computed embeddings are provided"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Build Gismo"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:37:34.098556Z",
"end_time": "2020-06-10T12:37:34.103586Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from gismo.gismo import Gismo\ngismo = Gismo(corpus, embedding)",
"execution_count": 4,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Creation of a gismo object is instant."
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Rank Gismo"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:37:42.157914Z",
"end_time": "2020-06-10T12:37:46.935223Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.rank(\"Gizmo\")\ngismo.get_ranked_documents(3)",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": "['Gizmo is a Mogwaï.',\n 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',\n 'In chinese folklore, a Mogwaï is a demon.']"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Rank is the base operation after a gismo object is built\n- It performs a diffusion algorithm that *tunes* the dataset to the query\n- The `get_ranked...` methods select top objects"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Importance/relevance trade-off"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:38:13.567305Z",
"end_time": "2020-06-10T12:38:13.578874Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.diteration.alpha=.8\ngismo.rank(\"Gizmo\")\ngismo.get_ranked_documents(3)",
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 6,
"data": {
"text/plain": "['Gizmo is a Mogwaï.',\n 'In chinese folklore, a Mogwaï is a demon.',\n 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Changing $\\alpha$ changes the ranking"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Clustering"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:42:46.868180Z",
"end_time": "2020-06-10T12:42:46.873649Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from gismo.post_processing import print_document_cluster\ngismo.post_document_cluster = print_document_cluster",
"execution_count": 7,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Gismo uses `post_...` methods to shape the output\n- A few basic options are provided\n- Users can define their own post-processors (cf Xplorer example later)"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Resolution"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:43:31.440957Z",
"end_time": "2020-06-10T12:43:31.485854Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.get_clustered_ranked_documents(5, resolution=.01)",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": " F: 0.04. R: 1.85. S: 0.99.\n- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- This is a sentence about Blade. (R: 0.04; S: 0.17)\n- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Low resolution: flat tree"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Resolution"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:44:19.238196Z",
"end_time": "2020-06-10T12:44:19.274563Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.get_clustered_ranked_documents(5, resolution=.9)",
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": " F: 0.05. R: 1.85. S: 0.99.\n- F: 0.58. R: 1.77. S: 0.98.\n-- F: 0.69. R: 1.51. S: 0.98.\n--- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n--- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- F: 0.70. R: 0.08. S: 0.19.\n-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "High resolution: binary tree"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Resolution"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:44:59.548224Z",
"end_time": "2020-06-10T12:44:59.582960Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.get_clustered_ranked_documents(5, resolution=.5)",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": " F: 0.05. R: 1.85. S: 0.99.\n- F: 0.68. R: 1.77. S: 0.98.\n-- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n- F: 0.70. R: 0.08. S: 0.19.\n-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Medium resolution: intermediate"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Query-based distortion"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2020-06-10T12:45:49.120423Z",
"end_time": "2020-06-10T12:45:49.165597Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gismo.query_distortion = False\ngismo.get_clustered_ranked_documents(5)",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": " F: 0.04. R: 1.85. S: 0.72.\n- F: 0.22. R: 1.51. S: 0.92.\n-- Gizmo is a Mogwaï. (R: 1.23; S: 0.99)\n-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.36)\n- F: 0.08. R: 0.34. S: 0.11.\n-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.15)\n-- F: 0.23. R: 0.08. S: 0.07.\n--- This is a sentence about Blade. (R: 0.04; S: 0.06)\n--- This is another sentence about Shadoks. (R: 0.04; S: 0.05)\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Without query-distortion, the long sentence is grouped with the other objects that contain the word sentence."
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"language_info": {
"name": "python",
"version": "3.7.7",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"celltoolbar": "Slideshow",
"gist": {
"id": "",
"data": {
"description": "Companion Notebook for the Lincs Seminar",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}
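
The notebook's remark that a Gismo Corpus is mostly a source with a `to_text` method can be sketched in plain Python. This is only an illustration of the idea, not Gismo's `Corpus` class: `ToyCorpus`, its `iterate_text` helper, and the `title`/`content` fields are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Sequence


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus: a source plus a text extractor."""

    source: Sequence[Any]
    # By default, items are assumed to be text already (or convertible with str).
    to_text: Callable[[Any], str] = str

    def iterate_text(self) -> Iterator[str]:
        # Yield the text view of each item, in source order.
        for item in self.source:
            yield self.to_text(item)

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> Any:
        # Indexing returns the original item, not its text view.
        return self.source[index]


# Items can carry more than text; to_text selects what would be embedded.
# The titles below are made up; the sentences come from the toy source.
articles = [
    {"title": "First", "content": "Gizmo is a Mogwaï."},
    {"title": "Second", "content": "In chinese folklore, a Mogwaï is a demon."},
]
corpus = ToyCorpus(articles, to_text=lambda item: item["content"])
texts = list(corpus.iterate_text())
```

Keeping the source and its text view separate means a ranking step could work on `texts` while results are still reported as the original items, for example `corpus[0]["title"]`.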