@ricalanis
Created November 28, 2017 00:40
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract text"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
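"import nltk\n",
"from boilerpipe.extract import Extractor\n",
"\n",
"def extract_text(url):\n",
"    # Fetch the page and keep only the main article text (boilerpipe's ArticleExtractor).\n",
"    extractor = Extractor(extractor='ArticleExtractor', url=url)\n",
"    return extractor.getText()"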
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Registro de matrimonio entre mexicanos en el extranjero\\nAa+\\nSecretaría de Relaciones Exteriores\\nRegistro de matrimonio entre mexicanos en el extranjero\\n¿Son mexicanos, desean casarse y se encuentran en el extranjero? Pueden casarse en una Oficina Consular de México en el exterior, presentándose y comprobando su identidad como mujer y hombre, ambos de nacionalidad mexicana, mayores de edad o menores de edad (éstos últimos deberán contar con el consentimiento de sus padres, tutores o quienes ejerzan su patria potestad).\\xa0\\nDocumentos necesarios\\nDocumento requerido\\nPresentación\\nCopia certificada del acta de nacimiento expedida por la Oficina del Registro Civil mexicano u Oficina Consular mexicana de ambos contrayentes\\nOriginal\\nIdentificación oficial vigente con fotografía de ambos contrayentes\\nOriginal\\nCertificado médico de ambos contrayentes\\nOriginal\\nNotas:\\nOtras opciones que tienes para comprobar tu nacionalidad mexicana son: Pasaporte mexicano vigente, Carta de Naturalización, Certificado de Nacionalidad Mexicana, Declaratoria de Nacionalidad, entre otros que puedes consultar directamente con la Oficina Consular de México en el exterior.\\nDeben acreditar su identidad mediante la presentación de una identificación oficial vigente en original. Por ejemplo: Pasaporte vigente, Credencial INE o IFE, Matrícula consular, Cartilla del Servicio Militar Nacional, entre otros que puedes consultar directamente con la Oficina Consular de México en el exterior.\\nEn el caso de los menores hombres que tengan 16 años y las menores mujeres que tengan 14 años, sus padres, tutores o quienes ejercen su patria potestad deberán dar su autorización por escrito.\\nEn el caso de estar divorciado(a), deberán presentar Copia certificada del Acta de divorcio o de la Sentencia de disolución de matrimonio, o bien Copia certificada del acta de defunción del cónyuge fallecido si alguno o ambos son viudos.\\nTienen la opción de presentar dos testigos por cada uno (hombre y mujer), quienes pueden ser familiares o amigos sin importar su nacionalidad y deben presentar en original una identificación oficial vigente con fotografía.\\nEs requisito indispensable que presenten el comprobante del pago de derechos correspondiente, consulten con la Oficina Consular de México en el exterior para conocer dónde efectuarlo.\\nCostos\\n'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"extract_text(\"https://www.bing.com/cr?IG=BC18C63722A7417FB9B5A7E74AB85F17&CID=123D90FB1D1962583B589A741C1F63A1&rd=1&h=vDI7n5W61XkL3Ds4N6ZVq12CLw6jRV4lEGfOLZp8lcM&v=1&r=https%3a%2f%2fwww.gob.mx%2ftramites%2fficha%2fregistro-de-matrimonio-entre-mexicanos-en-el-extranjero%2fSRE93&p=DevEx,5095.1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Text extractor: boilerpipe"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from boilerpipe.extract import Extractor"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = [\"https://www.ecobici.cdmx.gob.mx/\",\"http://www.buenosaires.gob.ar/ecobici/sistema-ecobici/mapa-bicis\",\"http://www.buenosaires.gob.ar/ecobici\",\"https://www.facebook.com/ecobici/\",\"http://mxcity.mx/2014/07/las-dudas-mas-frecuentes-del-servicio-ecobici/\",\"https://www.facebook.com/ecobici/\",\"http://mxcity.mx/2014/07/las-dudas-mas-frecuentes-del-servicio-ecobici/\",\"http://www.atraccion360.com/como-convertirte-en-usuario-de-ecobici\",\"https://twitter.com/ecobici?lang=en\"]"
]
},
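{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list above repeats two URLs (the Facebook page and the mxcity.mx article), so those pages would count twice when the TF-IDF weights are averaged later. A minimal sketch of order-preserving deduplication, assuming one copy of each page is wanted. `unique_urls` is a name introduced here for illustration; the rest of the notebook keeps using `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep the first occurrence of each URL and preserve the original order.\n",
"seen = set()\n",
"unique_urls = []\n",
"for url in data:\n",
"    if url not in seen:\n",
"        seen.add(url)\n",
"        unique_urls.append(url)\n",
"\n",
"# Compare the list sizes before and after deduplication.\n",
"len(data), len(unique_urls)"
]
},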
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def tokenizer(text):\n",
"    # Split raw text into word and punctuation tokens with NLTK.\n",
"    return nltk.word_tokenize(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vectorizer"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from nltk.corpus import stopwords"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Spanish stop-word list; note that this rebinds the imported `stopwords` name.\n",
"stopwords = stopwords.words('spanish')"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"313"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(stopwords)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Configure the vectorizer"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"td = TfidfVectorizer(min_df=1,tokenizer=tokenizer, stop_words=stopwords)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Extract the main text of every URL in `data`.\n",
"output = [extract_text(element) for element in data]"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"a = td.fit_transform(output)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"feature_names = td.get_feature_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert to a dense matrix to look up the weights."
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dense = a.todense()"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"out_data = {}"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"# Average each term's TF-IDF weight over all documents.\n",
"for key, column in td.vocabulary_.items():\n",
"    out_data[key] = np.mean(dense[:, column])"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1322"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(out_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract the weight for each key"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import operator"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sorted_x = sorted(out_data.items(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the most relevant words"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('distrito', 0.028662482964515474),\n",
" ('federal', 0.028662482964515474),\n",
" ('lunes', 0.029026773232012724),\n",
" ('»»', 0.029075324785924877),\n",
" ('tiempo', 0.030001295979925815),\n",
" ('24', 0.030042234209256904),\n",
" ('días', 0.030042234209256904),\n",
" ('movilidad', 0.031731486098307914),\n",
" ('00:30', 0.03307596552919851),\n",
" ('domingo', 0.03307596552919851),\n",
" ('públicas', 0.03307596552919851),\n",
" ('público', 0.033824838859946943),\n",
" ('transporte', 0.03433780767404964),\n",
" ('control', 0.037684098880024011),\n",
" ('servicio', 0.037874619880711946),\n",
" ('disfrutá', 0.038627169460603052),\n",
" ('gratis', 0.038627169460603052),\n",
" ('año', 0.038638018305341595),\n",
" ('¿qué', 0.038889761219260501),\n",
" ('sistema', 0.040910469876461235),\n",
" ('méxico', 0.042824002859355378),\n",
" ('seguridad', 0.047484048051650574),\n",
" ('this', 0.047484048051650574),\n",
" ('is', 0.047484048051650574),\n",
" ('standard', 0.047484048051650574),\n",
" ('security', 0.047484048051650574),\n",
" ('test', 0.047484048051650574),\n",
" ('that', 0.047484048051650574),\n",
" ('we', 0.047484048051650574),\n",
" ('use', 0.047484048051650574),\n",
" ('to', 0.047484048051650574),\n",
" ('prevent', 0.047484048051650574),\n",
" ('spammers', 0.047484048051650574),\n",
" ('from', 0.047484048051650574),\n",
" ('creating', 0.047484048051650574),\n",
" ('fake', 0.047484048051650574),\n",
" ('accounts', 0.047484048051650574),\n",
" ('and', 0.047484048051650574),\n",
" ('spamming', 0.047484048051650574),\n",
" ('users', 0.047484048051650574),\n",
" ('enviar', 0.047484048051650574),\n",
" ('horas', 0.04829653038098574),\n",
" (':', 0.050362404334044847),\n",
" ('disponibilidad', 0.051222207623732433),\n",
" ('?', 0.054206380158623647),\n",
" ('bicicletas', 0.059232044787027829),\n",
" ('ciudad', 0.068929086604456444),\n",
" ('ecobici', 0.089478520068313944),\n",
" ('.', 0.22332475526022286)]"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c = td.vocabulary_[\"ecobici\"]\n",
"sorted_x[-50:-1]"
]
},
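{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking above includes punctuation tokens such as `.`, `:` and `?`, and the slice `sorted_x[-50:-1]` leaves out the last entry, which is the highest-weighted one. A sketch of an alternative ranking, assuming punctuation-only tokens are not of interest. `has_alnum` and `top_terms` are hypothetical names introduced here, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def has_alnum(token):\n",
"    # True when the token contains at least one letter or digit.\n",
"    return any(ch.isalnum() for ch in token)\n",
"\n",
"# Rank terms by mean TF-IDF weight, highest first, skipping punctuation-only tokens.\n",
"top_terms = sorted(\n",
"    ((term, weight) for term, weight in out_data.items() if has_alnum(term)),\n",
"    key=lambda pair: pair[1],\n",
"    reverse=True,\n",
")\n",
"top_terms[:50]"
]
},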
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'neighbors' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m-----------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-138-c56b125f8aeb>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mneighbors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'superb'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmat\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mww\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrownames\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mww\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdistfunc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0meuclidean\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'neighbors' is not defined"
]
}
],
"source": [
"neighbors(word='superb', mat=ww[0], rownames=ww[1], distfunc=euclidean)[: 5]"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[ 0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dense[0,:]"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import scipy.spatial.distance"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cosine(u, v):\n",
"    \"\"\"Cosine distance between 1d np.arrays `u` and `v`, which must have\n",
"    the same dimensionality. Returns a float.\"\"\"\n",
"    # Use scipy's method:\n",
"    return scipy.spatial.distance.cosine(u, v)\n",
"    # Or define it yourself:\n",
"    # return 1.0 - (np.dot(u, v) / (vector_length(u) * vector_length(v)))"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from operator import itemgetter\n",
"\n",
"def neighbors(word, mat, rownames, distfunc=cosine):\n",
"    \"\"\"Return (name, distance) pairs for every row of `mat`, nearest to `word` first.\"\"\"\n",
"    if word not in rownames:\n",
"        raise ValueError('%s is not in this VSM' % word)\n",
"    w = mat[rownames.index(word)]\n",
"    dists = [(rownames[i], distfunc(w, mat[i])) for i in range(len(mat))]\n",
"    return sorted(dists, key=itemgetter(1), reverse=False)"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def euclidean(u, v):\n",
"    \"\"\"Euclidean distance between 1d np.arrays `u` and `v`, which must\n",
"    have the same dimensionality. Returns a float.\"\"\"\n",
"    # Use scipy's method:\n",
"    return scipy.spatial.distance.euclidean(u, v)\n",
"    # Or define it yourself:\n",
"    # return vector_length(u - v)"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from operator import itemgetter"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('bici', 0.0),\n",
" ('ecobici', 0.8882674236733964),\n",
" ('toma', 0.8882674236733964),\n",
" ('23°c', 1.1063743354021291),\n",
" ('general', 1.1153198243266322),\n",
" ('cualquier', 1.3235682467659207),\n",
" ('información', 1.3676585294592691),\n",
" ('tarjeta', 1.3941330047212954),\n",
" (',', 1.3941330047212954)]"
]
},
"execution_count": 168,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neighbors(word='bici', mat=dense, rownames=list(td.vocabulary_), distfunc=euclidean)[: 10]"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('bici', -6.6613381477509392e-16),\n",
" ('ecobici', 0.39450950797968631),\n",
" ('toma', 0.39450950797968631),\n",
" ('23°c', 0.6120320850182509),\n",
" ('general', 0.62196915526799457),\n",
" ('cualquier', 0.87591645192350609),\n",
" ('información', 0.93524492660134462),\n",
" ('tarjeta', 0.97180341742661325),\n",
" (',', 0.97180341742661325)]"
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neighbors(word='bici', mat=dense, rownames=list(td.vocabulary_), distfunc=cosine)[: 10]"
]
},
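{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the two calls above, `dense` has one row per document and `list(td.vocabulary_)` is not ordered by column index, so the rows being compared are documents rather than terms. A sketch of term-level neighbors, assuming that comparing words was the intent: the matrix is transposed so each row is a term, and the terms are ordered by their column index. `term_matrix`, `terms` and `term_neighbors` are hypothetical names introduced here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.spatial.distance import cosine as cosine_distance\n",
"\n",
"# `a` is documents x terms; transpose it so that each row is one term's vector.\n",
"term_matrix = a.T.toarray()\n",
"# Order the terms by column index so that term_matrix[i] belongs to terms[i].\n",
"terms = sorted(td.vocabulary_, key=td.vocabulary_.get)\n",
"\n",
"def term_neighbors(word, mat=term_matrix, names=terms, distfunc=cosine_distance):\n",
"    # Distance from `word` to every term, nearest first.\n",
"    if word not in names:\n",
"        raise ValueError('%s is not in this VSM' % word)\n",
"    w = mat[names.index(word)]\n",
"    dists = [(names[i], distfunc(w, mat[i])) for i in range(len(mat))]\n",
"    return sorted(dists, key=lambda pair: pair[1])\n",
"\n",
"term_neighbors('bici')[:10]"
]
},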
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}