Created
May 28, 2013 21:28
{
"metadata": {
"name": "PlayingWithTheTdIdf"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wikipedia's [page](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [Levente's tutorial p14](https://drive.google.com/a/we7.com/?tab=mo#folders/0BwsmmX4SxUXiS1EycEY3a3VZcU0):\n",
"\n",
"TF-IDF: term frequency\u2013inverse document frequency\n",
"\n",
"$ \\text{tf-idf}_ {ij} = \\text{tf}_ {ij} \\times \\text{idf}_i $\n",
"\n",
"$ \\text{tf}_ {ij} = n_ {ij} / \\sum_k n_ {kj} $, where $ n_ {ij} $ is the number of occurrences of the $i$th term $ t_i $ in the $j$th document $ d_j $.\n",
"NB: the sum over $k$ runs over all terms of $ d_j $, so it normalizes for the length of the document.\n",
"\n",
"$$ \\text{idf}_i = \\frac{\\log |D|}{1 + | \\{d:t_i \\in d\\} | }$$\n",
"\n",
"where $ |D| $ is the number of documents in the corpus and the denominator is one plus the number of documents in which the term $ t_i $ appears.\n",
"NB: this is a variant; the more common definition takes the logarithm of the whole ratio, $ \\log \\left( |D| / (1 + | \\{d:t_i \\in d\\} |) \\right) $.\n",
"\n",
"LinkedIn _skills and expertise_ of [Levente](http://hu.linkedin.com/in/toroklev), [Krishna](http://uk.linkedin.com/in/krishnajrao), [Barak](http://uk.linkedin.com/in/barakschiller), [me](http://www.linkedin.com/pub/allain-guillaume/2/233/5ba) and [Miklos](http://uk.linkedin.com/in/miklosparrag), as of 28 May 2013:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"people_to_skills = documents_to_terms = {\n",
"    'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms', 'Recommender Systems', 'Octave', 'Java'],\n",
"    'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering', 'Agile', 'TDD',\n",
"                'Object Oriented Design', 'Software Development', 'SQL'],\n",
"    'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems', 'Software Engineering', 'SQL',\n",
"              'Agile'],\n",
"    'Guillaume': ['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms', 'Python', 'Applied Mathematics'],\n",
"    'Miklos': ['Agile', 'Software Development', 'Software Engineering', 'Object Oriented Design', 'Scrum',\n",
"               'XML', 'Python', 'Java']\n",
"}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import itertools\n",
"\n",
"# Documents are people; terms are their skills.\n",
"documents = sorted(documents_to_terms)\n",
"n_D = len(documents)\n",
"# Flatten every skill list into one list, duplicates included.\n",
"all_terms = list(itertools.chain.from_iterable(documents_to_terms.values()))\n",
"terms = sorted(set(all_terms))\n",
"n_T = len(terms)\n",
"\n",
"# Density = share of non-zero cells in the n_T x n_D term-document matrix.\n",
"print('{} unique terms from {} documents with a total of {} terms (density = {}%)'.format(\n",
"    n_T, n_D, len(all_terms), 100 * len(all_terms) // (n_D * n_T)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"24 unique terms from 5 documents with a total of 42 terms (density = 35%)\n"
]
}
],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"# Binary term-document matrix: tf[i, j] is True when term i appears in document j.\n",
"tf = np.array([[t in documents_to_terms[d] for d in documents] for t in terms])\n",
"tf[0:4, :]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 44,
"text": [
"array([[ True, False,  True, False,  True],\n",
"       [False,  True, False,  True, False],\n",
"       [False,  True, False, False, False],\n",
"       [False,  True,  True, False, False]], dtype=bool)"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from operator import itemgetter\n",
"\n",
"# idf_i = log|D| / (1 + number of documents containing term i), as defined above.\n",
"idf = np.log(n_D) / (1 + tf.sum(axis=1))\n",
"print(sorted(zip(terms, idf), key=itemgetter(1)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[('Java', 0.32188758248682003), ('Python', 0.32188758248682003), ('Agile', 0.40235947810852507), ('Software Engineering', 0.40235947810852507), ('Algorithms', 0.53647930414470013), ('C#', 0.53647930414470013), ('Data Mining', 0.53647930414470013), ('Machine Learning', 0.53647930414470013), ('Object Oriented Design', 0.53647930414470013), ('SQL', 0.53647930414470013), ('Software Development', 0.53647930414470013), ('XML', 0.53647930414470013), ('Applied Mathematics', 0.80471895621705014), ('C++', 0.80471895621705014), ('Eclipse', 0.80471895621705014), ('Embedded Systems', 0.80471895621705014), ('Hibernate', 0.80471895621705014), ('Multithreading', 0.80471895621705014), ('OOP', 0.80471895621705014), ('Octave', 0.80471895621705014), ('Recommender Systems', 0.80471895621705014), ('Scrum', 0.80471895621705014), ('Statistics', 0.80471895621705014), ('TDD', 0.80471895621705014)]\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"documents"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 46,
"text": [
"['Barak', 'Guillaume', 'Krishna', 'Levente', 'Miklos']"
]
}
],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"terms"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 34,
"text": [
"['Agile',\n",
" 'Algorithms',\n",
" 'Applied Mathematics',\n",
" 'C#',\n",
" 'C++',\n",
" 'Data Mining',\n",
" 'Eclipse',\n",
" 'Embedded Systems',\n",
" 'Hibernate',\n",
" 'Java',\n",
" 'Machine Learning',\n",
" 'Multithreading',\n",
" 'OOP',\n",
" 'Object Oriented Design',\n",
" 'Octave',\n",
" 'Python',\n",
" 'Recommender Systems',\n",
" 'SQL',\n",
" 'Scrum',\n",
" 'Software Development',\n",
" 'Software Engineering',\n",
" 'Statistics',\n",
" 'TDD',\n",
" 'XML']"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
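
The notebook computes idf as log|D| divided by one plus the document frequency, which is not the same as taking the logarithm of the whole ratio. As a rough, standalone comparison, the sketch below recomputes both forms in plain Python 3 on the same skills data. The helper names `document_frequency`, `idf_notebook` and `idf_log_ratio` are made up for this illustration and do not appear in the notebook; this is one possible way to check the numbers, not the author's method.

```python
import math

# Same data as the notebook's documents_to_terms mapping.
documents_to_terms = {
    'Levente': ['Machine Learning', 'Data Mining', 'C++', 'Algorithms',
                'Recommender Systems', 'Octave', 'Java'],
    'Krishna': ['Java', 'Python', 'C#', 'Hibernate', 'XML', 'Software Engineering',
                'Agile', 'TDD', 'Object Oriented Design', 'Software Development', 'SQL'],
    'Barak': ['Java', 'OOP', 'Eclipse', 'Python', 'Multithreading', 'Embedded Systems',
              'Software Engineering', 'SQL', 'Agile'],
    'Guillaume': ['Statistics', 'C#', 'Data Mining', 'Machine Learning', 'Algorithms',
                  'Python', 'Applied Mathematics'],
    'Miklos': ['Agile', 'Software Development', 'Software Engineering',
               'Object Oriented Design', 'Scrum', 'XML', 'Python', 'Java'],
}


def document_frequency(docs):
    """Map each term to the number of documents that contain it (hypothetical helper)."""
    df = {}
    for doc_terms in docs.values():
        for term in set(doc_terms):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df


def idf_notebook(n_docs, df):
    """Variant used in the notebook: log|D| / (1 + df)."""
    return math.log(n_docs) / (1 + df)


def idf_log_ratio(n_docs, df):
    """More common form: log(|D| / (1 + df))."""
    return math.log(n_docs / (1 + df))


n_docs = len(documents_to_terms)
df = document_frequency(documents_to_terms)

# Print one term for each document frequency present in the data (4, 3, 2, 1).
for term in ('Java', 'Agile', 'SQL', 'Scrum'):
    print(term, df[term],
          round(idf_notebook(n_docs, df[term]), 4),
          round(idf_log_ratio(n_docs, df[term]), 4))
```

Both forms rank terms in the same order on this data, but the log-ratio form gives a weight of exactly zero to Java and Python, which appear in 4 of the 5 documents (so 1 + 4 = 5 and log 1 = 0), whereas the notebook's variant keeps every weight strictly positive.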