Created
April 12, 2019 19:51
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Auto-tagging Resource Links\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## The Problem\n", | |
"\n", | |
"*\"Given a set of URLs (over 100 thousand), categorize them based one or more definative characteristics: language, type of site, stage of learning etc.\"*\n", | |
"\n", | |
"To see how they look, here's a small sampling of links:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"urls = [\n", | |
" \"https://medium.com/@pratu16x7\",\n", | |
" \"https://signalvnoise.com/posts/3250-clarity-over-brevity-in-variable-and-method-names\",\n", | |
" \"http://flask.pocoo.org/\",\"http://docs.python-requests.org/en/master/\",\n", | |
" \"http://docs.python-guide.org/en/latest\",\n", | |
" \"http://www.writethedocs.org/guide/writing/beginners-guide-to-docs/#what-to-write\",\n", | |
" \"https://getbootstrap.com/docs/4.1/content/reboot/\",\n", | |
" \"https://vuejs.org/v2/guide/\",\n", | |
" \"https://yarnpkg.com/en/docs\",\n", | |
" \"http://blissfuljs.com/docs.html\",\n", | |
" \"https://www.kennethreitz.org/documentation-is-king/\",\n", | |
" \"https://signalvnoise.com/posts/454-why-most-copywriting-on-the-web-sucks\",\n", | |
" \"https://jgthms.com/web-design-in-4-minutes/\",\n", | |
" \"https://vuejs.org/v2/guide/components.html\",\n", | |
" \"https://vuejs.org/v2/guide/\",\n", | |
" \"https://vuejs.org/v2/api/#vm-mount\",\n", | |
" \"https://news.ycombinator.com/item?id=11164013\",\n", | |
" \"https://docs.gitbook.com/\",\n", | |
" \"https://docsify.js.org/#/\",\n", | |
" \"https://docsify.js.org/#/plugins?id=list-of-plugins\",\n", | |
" \"https://vuepress.vuejs.org/guide/#how-it-works\",\n", | |
" \"https://vuepress.vuejs.org/guide/#why-not\",\n", | |
" \"https://frappe.io/charts\",\n", | |
" \"https://www.youtube.com/watch?v=ahXIMUkSXX0\",\n", | |
" \"https://gist.github.com/kenneth-reitz/973705\",\n", | |
" \n", | |
" \"Stack overflow 3\",\n", | |
" \"\"\n", | |
"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Making the program aware\n", | |
"\n", | |
"Before parsing any data, it makes sense to define certain features that the program is looking for in the links. Let's look at three forms of categorisation, or *tags*:\n", | |
"\n", | |
"1. #### Which technology?\n", | |
" Programmining languages or software tools, widely used. \n", | |
" Let's get them over from [StackOverflow's most popular tags](https://data.stackexchange.com/stackoverflow/query/172362/get-all-tags) which they are primarily technology focussed:\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Collecting stackapi\n", | |
" Downloading https://files.pythonhosted.org/packages/69/6e/e7212377e14435a33075011c6a3a6eb20707e62bff9b0e7f2330a3a100f5/StackAPI-0.1.12.tar.gz\n", | |
"Collecting requests (from stackapi)\n", | |
" Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl\n", | |
"Collecting certifi>=2017.4.17 (from requests->stackapi)\n", | |
" Using cached https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl\n", | |
"Collecting idna<2.9,>=2.5 (from requests->stackapi)\n", | |
" Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl\n", | |
"Collecting chardet<3.1.0,>=3.0.2 (from requests->stackapi)\n", | |
" Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl\n", | |
"Collecting urllib3<1.25,>=1.21.1 (from requests->stackapi)\n", | |
" Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl\n", | |
"Building wheels for collected packages: stackapi\n", | |
" Building wheel for stackapi (setup.py) ... \u001b[?25ldone\n", | |
"\u001b[?25h Stored in directory: /Users/pratu/Library/Caches/pip/wheels/16/bf/a5/56362daf788c5ce88244796b26d3e7d6fec649f5e77aff694c\n", | |
"Successfully built stackapi\n", | |
"Installing collected packages: certifi, idna, chardet, urllib3, requests, stackapi\n", | |
"Successfully installed certifi-2019.3.9 chardet-3.0.4 idna-2.8 requests-2.21.0 stackapi-0.1.12 urllib3-1.24.1\n", | |
"Note: you may need to restart the kernel to use updated packages.\n" | |
] | |
} | |
], | |
"source": [ | |
"pip install stackapi" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['javascript', 'java', 'c#', 'php', 'android', 'python', 'jquery', 'html', 'c++', 'ios', 'css', 'mysql', 'sql', 'asp.net', 'ruby-on-rails', 'c', 'arrays', 'objective-c', 'r', '.net', 'node.js', 'json', 'angularjs', 'sql-server', 'swift', 'iphone', 'regex', 'ruby', 'ajax', 'django', 'excel', 'xml', 'asp.net-mvc', 'linux', 'angular', 'database', 'python-3.x', 'spring', 'wpf', 'wordpress', 'vba', 'string', 'reactjs', 'xcode', 'windows', 'vb.net', 'html5', 'eclipse', 'multithreading', 'laravel']\n", | |
"CPU times: user 150 ms, sys: 15.5 ms, total: 165 ms\n", | |
"Wall time: 12.5 s\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"from stackapi import StackAPI\n", | |
"SITE = StackAPI('stackoverflow')\n", | |
"tags = SITE.fetch('tags')\n", | |
"print([t['name'] for t in tags['items']][:50])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" These can now be used for fuzzing matching content later on to decide what technology a link belongs to.\n", | |
"\n", | |
"\n", | |
"2. #### Which type of resource? \n", | |
" There are usually 4-5 kinds that a typical developer browses to gain information:\n", | |
"\n", | |
" - [docs] **Official/Widely popular tech docs and guides**\n", | |
" Eg: Python, yarn, hithikers, Vuejs, Flask, MDN\n", | |
" - [code] **Code blocks and code bases**\n", | |
" Eg: Github, Github Gist, Sourceforge, Gitlab\n", | |
" - [forum] **Forums and communties**\n", | |
" Eg: StackOverflow, AskUbuntu, Reddit\n", | |
" - [tuts] **Tutorials, blogs, teching sites**\n", | |
" Eg: Medium, html5rocks, web design in 4 minutes\n", | |
" - [skills] **Pre-skills**\n", | |
" Eg: TypingClub, cli crash course \n", | |
"\n", | |
" There are a few examples of these in our list of links.\n", | |
"\n", | |
" Curation will mostly be from the first three kinds, [docs], [code] and [forum]. Hence it makes sense to make a map of all the well-known resources according to these tags, so that it becomes easier to assign them to a large percent of links.\n", | |
"\n", | |
"\n", | |
"3. #### What stage of learning?\n", | |
"\n", | |
" This is somewhat tricky and will depend on the kind of course/training, but we can come up with a basic scale that can apply to most:\n", | |
"\n", | |
" - **[Level 0]**: Pre requisites and skills\n", | |
" - **[Level 1]**: Basic tech, Hello world project \n", | |
" - **[Level 2]**: Roadblocks to Intermediate\n", | |
" - **[Level 3]**: Advanced concepts\n", | |
" - **[Level 4]**: Case studies / Main Project\n", | |
"\n", | |
"\n", | |
"\n", | |
"Now that we know what tags we have to categorize by, let's take a closer look at our data.\n" | |
] | |
}, | |
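A minimal sketch of the fuzzy matching step mentioned above, using `difflib` from the standard library. The short `TECH_TAGS` list stands in for the tags fetched through StackAPI, and the function name `fuzzy_tech_tags` and the `cutoff` of 0.8 are assumptions made for illustration, not part of the original notebook.

```python
import difflib

# A few of the popular tags printed earlier, standing in for the full list.
TECH_TAGS = ['javascript', 'java', 'python', 'jquery', 'django', 'reactjs', 'node.js']


def fuzzy_tech_tags(words, tags=TECH_TAGS, cutoff=0.8):
    """Return the known tags that closely match any of the given words.

    `cutoff` is an assumed similarity threshold (0 to 1); a lower value
    tolerates more typos but also lets in more false matches.
    """
    found = []
    for word in words:
        # Keep only the single best candidate for each word.
        matches = difflib.get_close_matches(word.lower(), tags, n=1, cutoff=cutoff)
        for match in matches:
            if match not in found:
                found.append(match)
    return found


print(fuzzy_tech_tags(['Python', 'Djanga', 'cooking']))
```

The words passed in could come from a page title or from the URL path. Exact lookups in a set would be faster, so fuzzy matching is probably best kept for words that fail the exact check.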
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"\n", | |
"## Parsing the data\n", | |
"\n", | |
"### Stage 1: The URL strings\n", | |
"\n", | |
"The URLs on their own can also . Not all links will present useful information, but many will. It makes sense to perform all the parsing that we'd do for the link page data on the link URL itself, to see if we can get down some of the categorisations. \n", | |
"\n", | |
"The most we can infer is the tech if referred in the URL. And if the domain matches one of the links in our knowledge base of links in diffferent resource types\n", | |
"\n", | |
"Assigning an internal domain tag will also be useful for specific API handling (discussed later in stage 3).\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Stage 2: The link page data\n", | |
"\n", | |
"While somewhat expensive for a large number of links, \n", | |
"\n", | |
"\n", | |
"\n", | |
"### Stage 3: Specific handling based on domain\n", | |
"\n", | |
"Why is this worth investing in? Because as we have seen, most links will be from official docs, forums like stackexchange and code hosting services, most of which have well maintained developer APIs. By writing modules to handle just 3-4 of these services, we can acquire highly accurate data than page parsing for possibly 30-50% of the links.\n", | |
"\n", | |
"[Photo]\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Stage 4: NLP (Opt)\n", | |
"\n", | |
"In order to derive information from the bare scratch, using TF-IDF can lead to interesting results.\n", | |
"\n" | |
] | |
} | |
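Stage 1 can be sketched with the standard library alone. The example below assumes a small hand-written knowledge base (`KNOWN_DOMAINS`) and a handful of technology tags (`TECH_TAGS`). Both are hypothetical placeholders for the real map of well-known resources and the fetched tag list, and `tag_url` is an illustrative name rather than anything defined in the notebook.

```python
import re
from urllib.parse import urlparse

# Hypothetical knowledge base: domain -> resource-type tag (assumed entries).
KNOWN_DOMAINS = {
    'github.com': 'code',
    'gist.github.com': 'code',
    'stackoverflow.com': 'forum',
    'news.ycombinator.com': 'forum',
    'medium.com': 'tuts',
    'vuejs.org': 'docs',
}

# A handful of technology tags, standing in for the fetched list.
TECH_TAGS = {'python', 'javascript', 'flask', 'django', 'css'}


def tag_url(url):
    """Infer what we can from the URL string alone, without fetching the page."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if domain.startswith('www.'):
        domain = domain[4:]
    # Split the domain and path into lowercase alphanumeric tokens.
    tokens = set(re.findall(r'[a-z0-9]+', (domain + parsed.path).lower()))
    return {
        'domain': domain or None,  # None for entries that are not URLs
        'resource_type': KNOWN_DOMAINS.get(domain),
        'tech': sorted(tokens & TECH_TAGS),
    }


for url in ['http://docs.python-guide.org/en/latest',
            'https://gist.github.com/kenneth-reitz/973705',
            'Stack overflow 3']:
    print(url, tag_url(url))
```

The returned `domain` doubles as the internal domain tag that Stage 3 needs for choosing an API handler. Entries that are not URLs at all, such as the placeholder strings in the sample list, come back with `domain` set to `None` and can be filtered out early.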
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |