Álvaro Justen turicas

Links I mentioned during Pizza de Dados 10:

Turicas on Twitter and GitHub: https://twitter.com/turicas https://github.com/turicas
Brasil.IO: https://brasil.io/
Brasil.IO - Manifesto: https://brasil.io/manifesto
Brasil.IO - How to contribute: join the chat https://chat.brasil.io/ and see the repository https://github.com/turicas/brasil.io/
Brasil.IO - Donations are welcome: https://apoia.se/brasilio
Brasil.IO - API: https://brasil.io/api/
Brasil.IO - Dataset with more than 100k Brazilian names classified by gender and grouped: https://brasil.io/dataset/genero-nomes/nomes
PythonBrasil event: http://pythonbrasil.com.br/

Slides:

Tweets:

Repositories:

I'm extracting data from a website and tested some XPath expressions in Chrome Developer Tools (using $x(...) in the console). Once the expressions worked, I automated the extraction in Python with lxml. Problem: lxml returns a different number of results than Developer Tools! It seems lxml deletes some data and adds a lot of </table> at the end: loading the HTML into an lxml.html.Element and serializing it back with lxml.html.tostring yields completely different HTML, with the majority of the data removed. The HTML is attached to this gist (e-SIC.html) and the XPath is //table[@class="padrao"]. In Developer Tools, running $x('//table[@class="padrao"]').length in the console returns 2496.
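
The comparison described above could be reproduced with a minimal sketch like the one below. The helper names `count_matches` and `roundtrip` and the tiny `SAMPLE` document are made up for illustration; the real input would be the e-SIC.html file. On well-formed HTML like this sample the counts agree, so a mismatch on the real file may point to the parser recovering from malformed markup differently than the browser does (an assumption, not something verified against that file).

```python
import lxml.html  # pip install lxml

XPATH = '//table[@class="padrao"]'

# Made-up, well-formed sample; replace with the contents of e-SIC.html.
SAMPLE = (
    "<html><body>"
    '<table class="padrao"><tr><td>a</td></tr></table>'
    '<table class="padrao"><tr><td>b</td></tr></table>'
    '<table class="other"><tr><td>c</td></tr></table>'
    "</body></html>"
)


def count_matches(html_text, xpath=XPATH):
    """Parse the HTML with lxml and count the elements matching the XPath."""
    tree = lxml.html.fromstring(html_text)
    return len(tree.xpath(xpath))


def roundtrip(html_text):
    """Parse and serialize again, to inspect what lxml kept or rewrote."""
    tree = lxml.html.fromstring(html_text)
    return lxml.html.tostring(tree, encoding="unicode")


print(count_matches(SAMPLE))
print(roundtrip(SAMPLE).count("</table>"))
```

Comparing the printed count and the serialized output against what the browser reports narrows down where the two parsers start to disagree.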

Rows Plugins

Format plugins (input and/or output)

Split the repositories (pip install rows rows-html rows-pdf)
Detect installed plugins (ideally without loading the imports)
- rows print arquivo.html
Plugin metadata:
URIs (regexp): rows print postgresql://asdfafasdf/
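
One way the two ideas in these notes (detecting installed plugins without importing them, and matching sources by regexp) might fit together is sketched below. This is not how rows works today: the module names `rows_html`, `rows_pdf` and `rows_postgresql`, the registry and the helper functions are all hypothetical. The sketch relies on `importlib.util.find_spec`, which locates a top-level module without executing it.

```python
import importlib.util
import re

# Hypothetical registry: plugin module name -> regexp for the sources it handles.
PLUGIN_PATTERNS = {
    "rows_html": re.compile(r".*\.html?$", re.IGNORECASE),
    "rows_pdf": re.compile(r".*\.pdf$", re.IGNORECASE),
    "rows_postgresql": re.compile(r"^postgres(ql)?://"),
}


def is_installed(module_name):
    """Check whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def plugins_for(source, patterns=PLUGIN_PATTERNS):
    """Return the names of the plugins whose regexp matches the source."""
    return [name for name, pattern in patterns.items() if pattern.match(source)]


def installed_plugins(patterns=PLUGIN_PATTERNS):
    """Return the registered plugins that are actually installed."""
    return [name for name in patterns if is_installed(name)]


print(plugins_for("arquivo.html"))
print(plugins_for("postgresql://asdfafasdf/"))
print(installed_plugins())
```

Keeping the regexps in metadata separate from the plugin code means the CLI could pick a plugin for a given file or URI and only then import it.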

	# Dependencies:
	# - Python 3.6+
	# - pip install pymupdf git+https://github.com/turicas/rows.git@develop#egg=rows
	# Usage:
	# - python balneabilidade_sc.py doc.pdf doc.csv

	import re

	import fitz
	import rows

	# pip install requests splinter
	# TODO: add argparse
	import shlex
	import subprocess

	import requests
	import splinter


	def get_ips(device):

	import json
	import os

	import click
	from clarifai.rest import ClarifaiApp


	def extract_concepts(concepts):
	    return {concept['name']: concept['value'] for concept in concepts}
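
A short usage sketch for `extract_concepts`, assuming its input is a list of dicts with `name` and `value` keys, as the function body implies. The sample data is made up and is not real Clarifai output.

```python
def extract_concepts(concepts):
    # Map each concept name to its value (presumably a confidence score).
    return {concept["name"]: concept["value"] for concept in concepts}


# Made-up sample in the shape the function expects.
sample_concepts = [
    {"name": "beach", "value": 0.98},
    {"name": "sea", "value": 0.95},
]

result = extract_concepts(sample_concepts)
print(result)
```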

	import io
	import re

	import requests
	import rows


	def extrai_tabela(url):
	    response = requests.get(url)
	    return rows.import_from_pdf(

	import csv
	import openpyxl  # pip install openpyxl

	filename = '/home/turicas/Downloads/planilha-municipios-2017.xlsx'
	book = openpyxl.load_workbook(filename)
	# Open the first worksheet (get_sheet_by_name/get_sheet_names are deprecated)
	sheet = book[book.sheetnames[0]]
	state, city = None, None
	with open('planilha-municipios-2017.csv', mode='w', encoding='utf8') as fobj:
	    writer = csv.writer(fobj)
	    # Header row: state (uf), municipality, company
	    writer.writerow(['uf', 'municipio', 'empresa'])