@sergiolucero
Created March 3, 2020 18:18
Processing the SERVEL electoral roll (padrón) with Dask
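import csv
import glob

import fitz  # PyMuPDF
from dask.distributed import Client


def pdf2csv(fn):
    """Convert one electoral-roll PDF into a CSV file with the same base name."""
    doc = fitz.open(fn)
    with open(fn.replace('.pdf', '.csv'), 'w', newline='') as fw:
        writer = csv.writer(fw)
        writer.writerow(['nombre', 'rut', 'genero', 'direccion'])
        for page in doc:
            # older PyMuPDF releases spell this call page.getText()
            # str() of the encoded bytes leaves literal "\n" sequences in the text
            t = str(page.get_text().encode('latin-1'))
            # start at the first 'de' and drop the 14 header lines that follow
            data = t[t.find('de'):].split('\\n')[14:]
            # each record spans five lines: name, RUT, gender + address, ...
            nombres = [n for n in data[::5] if len(n) > 1]
            ruts = data[1::5]
            genero_direccion = [gd.split(' ') for gd in data[2::5]]
            generos = [gd[0] for gd in genero_direccion]
            direcciones = [' '.join(gd[1:]) if len(gd) > 1 else 'N/A'
                           for gd in genero_direccion]
            for row in zip(nombres, ruts, generos, direcciones):  # iterator
                writer.writerow(row)
    doc.close()
    return fn


if __name__ == '__main__':
    client = Client()
    files = glob.glob('A*.pdf')
    # the original submitted an undefined name `process`; pdf2csv is the function above
    futures = [client.submit(pdf2csv, fn) for fn in files]
    client.gather(futures)  # wait for every conversion to finish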
@sergiolucero (author) commented:

Needs a correction for entries with a missing electoral address (dirección electoral).
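
A minimal sketch of how the missing-address case could be isolated and tested, not the author's actual fix. It assumes the same layout the script relies on (five lines per record, with the third line holding the gender followed by an optional address). The helper names `split_gender_address` and `parse_records` and the sample values are hypothetical.

```python
def split_gender_address(field):
    """Split a 'GENDER ADDRESS ...' field; return 'N/A' for any missing part."""
    parts = field.split()  # splits on any whitespace (the script uses split(' '))
    if not parts:
        return 'N/A', 'N/A'
    gender = parts[0]
    address = ' '.join(parts[1:]) if len(parts) > 1 else 'N/A'
    return gender, address


def parse_records(lines, stride=5):
    """Group a flat list of text lines into (name, rut, gender, address) rows.

    Assumes each record occupies `stride` consecutive lines, with the name,
    RUT, and gender/address field in the first three positions.
    """
    rows = []
    for i in range(0, len(lines) - 2, stride):
        name, rut, field = lines[i], lines[i + 1], lines[i + 2]
        if len(name) <= 1:
            # skip the whole record so later columns stay aligned
            continue
        gender, address = split_gender_address(field)
        rows.append((name, rut, gender, address))
    return rows


# Made-up sample lines, for illustration only (not real SERVEL data).
sample = [
    'PEREZ JUAN', '1-9', 'VAR CALLE UNO 123', 'x', 'y',
    'SOTO ANA', '2-7', 'MUJ', 'x', 'y',
]
rows = parse_records(sample)
```

Unlike the script, which filters short names before zipping the columns together, this version drops the entire record when the name is empty, so a skipped name cannot shift the remaining columns out of step.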
