Álvaro Justen turicas

@turicas
turicas / balneabilidade_sc.py
Created November 22, 2018 23:53
Extractor for the Santa Catarina (SC) bathing water quality (Balneabilidade) PDF
# Dependencies:
# - Python 3.6+
# - pip install pymupdf git+https://github.com/turicas/rows.git@develop#egg=rows
# Usage:
# - python balneabilidade_sc.py doc.pdf doc.csv
import re
import fitz
import rows
# pip install requests splinter
# TODO: add argparse
import shlex
import subprocess
import requests
import splinter
def get_ips(device):
@turicas
turicas / etnia.py
Created August 13, 2018 17:23
Detects ethnicity using the Clarifai API
import json
import os
import click
from clarifai.rest import ClarifaiApp
def extract_concepts(concepts):
    # Map each concept name to its confidence value
    return {concept['name']: concept['value'] for concept in concepts}
@turicas
turicas / links-pizza-de-dados-10.md
Created July 17, 2018 03:22
Links I mentioned during Pizza de Dados 10
import io
import re
import requests
import rows
def extrai_tabela(url):
    response = requests.get(url)
    return rows.import_from_pdf(
@turicas
turicas / description.md
Last active January 22, 2018 15:16
lxml deletes data from malformed HTML

I'm extracting data from a website and tested some XPath expressions in Chrome Developer Tools (using $x(...) in the console). Once I had the expressions I needed, I automated the process in Python, using lxml to extract the data. Problem: the number of results in lxml differs from the number I got in Developer Tools. lxml seems to delete some data and append many </table> tags at the end: loading the HTML into an lxml.html.Element and then serializing it with lxml.html.tostring produces completely different HTML, with most of the data removed. The HTML is attached to this gist (e-SIC.html) and the XPath is //table[@class="padrao"]. I tested the XPath in Developer Tools by running $x('//table[@class="padrao"]').length in the console; it returns 2496.
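One way to get an independent baseline for the count above is to count the matching start tags in the raw markup, without building (and repairing) a tree. The sketch below does not use lxml; it swaps in the standard library's `html.parser` tokenizer. The names `TableCounter`, `count_tables`, and the `sample` string are hypothetical, and the sample is a made-up stand-in for e-SIC.html.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags whose class attribute equals a given value."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; compare the class attribute
        # exactly, mirroring the XPath predicate @class="padrao".
        if tag == "table" and dict(attrs).get("class") == self.css_class:
            self.count += 1


def count_tables(html, css_class="padrao"):
    """Return how many matching <table> start tags appear in the raw markup."""
    parser = TableCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical malformed sample: the first table is never closed.
sample = (
    '<table class="padrao"><tr><td>a</td></tr>'
    '<table class="padrao"><tr><td>b</td></tr></table>'
)
print(count_tables(sample))
```

Because the tokenizer never restructures the document, its count reflects what is literally in the file, which gives a reference point when a tree-building parser reports a different number.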

import csv
import openpyxl # pip install openpyxl
filename = '/home/turicas/Downloads/planilha-municipios-2017.xlsx'
book = openpyxl.load_workbook(filename)
sheet = book[book.sheetnames[0]]  # first sheet; get_sheet_by_name/get_sheet_names are deprecated
state, city = None, None
with open('planilha-municipios-2017.csv', mode='w', encoding='utf8') as fobj:
    writer = csv.writer(fobj)
    writer.writerow(['uf', 'municipio', 'empresa'])

Rows Plugins

Format plugins (input and/or output)

  • Split the repositories (pip install rows rows-html rows-pdf)
  • Detect installed plugins (ideally without loading their imports)
    • rows print arquivo.html
  • Plugin metadata:
  • URIs (regexp): rows print postgresql://asdfafasdf/
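
The ideas in the list above (plugin metadata, URI matching by regexp, and not importing a plugin until it is needed) could be sketched roughly as follows. This is not the rows API: `PLUGINS`, `find_plugin`, and `load_plugin` are hypothetical names, and standard library modules stand in for real plugin packages so the example runs on its own.

```python
import importlib
import re

# Hypothetical registry: each entry is plugin metadata that can be read
# without importing the plugin itself. Stdlib modules stand in for plugins.
PLUGINS = [
    {"name": "csv", "module": "csv", "uri_pattern": re.compile(r"\.csv$")},
    {"name": "json", "module": "json", "uri_pattern": re.compile(r"\.json$")},
    {"name": "sqlite", "module": "sqlite3", "uri_pattern": re.compile(r"^sqlite://")},
]


def find_plugin(uri):
    """Return the metadata of the first plugin whose pattern matches the URI."""
    for plugin in PLUGINS:
        if plugin["uri_pattern"].search(uri):
            return plugin
    return None


def load_plugin(uri):
    """Import the matching plugin module only when it is actually needed."""
    plugin = find_plugin(uri)
    if plugin is None:
        raise ValueError(f"No plugin registered for {uri!r}")
    return importlib.import_module(plugin["module"])
```

With this shape, a command such as `rows print` would only need the registry to decide which plugin handles a given file name or URI, and the import cost is paid for that one plugin alone.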