
@jtemporal
Created January 30, 2017 17:37
Investigate parser
In [6]: from serenata_toolbox.xml2csv import convert_xml_to_csv
In [7]: convert_xml_to_csv('data/AnoAtual.xml', 'data/AnoAtual.csv')
2017-01-30 17:28:26 Creating the CSV file
2017-01-30 17:28:26 Reading the XML file
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-7-87ccd4d5ef66> in <module>()
----> 1 convert_xml_to_csv('data/AnoAtual.xml', 'data/AnoAtual.csv')
/root/anaconda3/envs/serenata_rosie/lib/python3.6/site-packages/serenata_toolbox/xml2csv.py in convert_xml_to_csv(xml_file_path, csv_file_path)
75 output('Writing record #{:,} to the CSV'.format(count), end='\r')
76 with open(csv_file_path, 'a') as csv_file:
---> 77 print(csv_io.getvalue(), file=csv_file)
78
79 json_io.close()
UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position 51: ordinal not in range(128)
In [8]: convert_xml_to_csv('data/AnoAnterior.xml', 'data/AnoAnterior.csv')
2017-01-30 17:29:01 Creating the CSV file
2017-01-30 17:29:01 Reading the XML file
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-8-7cf7cedcb958> in <module>()
----> 1 convert_xml_to_csv('data/AnoAnterior.xml', 'data/AnoAnterior.csv')
/root/anaconda3/envs/serenata_rosie/lib/python3.6/site-packages/serenata_toolbox/xml2csv.py in convert_xml_to_csv(xml_file_path, csv_file_path)
75 output('Writing record #{:,} to the CSV'.format(count), end='\r')
76 with open(csv_file_path, 'a') as csv_file:
---> 77 print(csv_io.getvalue(), file=csv_file)
78
79 json_io.close()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 59-60: ordinal not in range(128)
In [9]: convert_xml_to_csv('data/AnosAnteriores.xml', 'data/AnosAnteriores.xml')
2017-01-30 17:30:23 Creating the CSV file
2017-01-30 17:30:24 Reading the XML file
File "data/AnosAnteriores.xml", line 1
idedocumento,txnomeparlamentar,idecadastro,nucarteiraparlamentar,nulegislatura,sguf,sgpartido,codlegislatura,numsubcota,txtdescricao,numespecificacaosubcota,txtdescricaoespecificacao,txtfornecedor,txtcnpjcpf,txtnumero,indtipodocumento,datemissao,vlrdocumento,vlrglosa,vlrliquido,nummes,numano,numparcela,txtpassageiro,txttrecho,numlote,numressarcimento,vlrrestituicao,nudeputadoid
^
XMLSyntaxError: Document is empty, line 1, column 1

jtemporal commented Jan 30, 2017

After running everything above, I tried changing the encoding used when reading the file, calling iterparse(open(xml_path, encoding='utf-16'), tag=tag). Still no luck: *** UnicodeEncodeError: 'ascii' codec can't encode characters in position 224-225: ordinal not in range(128)
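
Both tracebacks point at the write on line 77, where the CSV is opened with open(csv_file_path, 'a') and no explicit encoding, and the error is an encode error rather than a decode error. That suggests the failure is on the output side, so changing the encoding used to read the XML would not be expected to help. Below is a minimal sketch of one possible workaround, assuming the environment's default text encoding resolves to ASCII (which the 'ascii' codec message implies). The helper name append_text and the sample row are made up for illustration and are not part of serenata_toolbox.

```python
import os
import tempfile


def append_text(csv_file_path, text, encoding='utf-8'):
    """Append text to a file with an explicit encoding (hypothetical helper)."""
    # Passing encoding= avoids relying on the locale's default codec.
    with open(csv_file_path, 'a', encoding=encoding) as csv_file:
        print(text, file=csv_file)


# Made-up sample row containing '\xc7' (Ç), the character named in the first error.
sample = 'sample,\xc7'
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

# Writing as UTF-8 succeeds regardless of the locale.
append_text(path, sample)

# Forcing ASCII reproduces the same class of error seen in the traceback.
ascii_error = None
try:
    append_text(path, sample, encoding='ascii')
except UnicodeEncodeError as error:
    ascii_error = error

with open(path, encoding='utf-8') as csv_file:
    written = csv_file.read()
```

If the default codec really is the cause, setting a UTF-8 locale in the environment might be another option, but that is untested here.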

I also used xmllint to check whether the XML files were corrupted. Here is the output:

$ xmllint --noout data/AnoAtual.xml
$ echo $?
0
$ xmllint --noout data/AnoAnterior.xml
$ echo $?
0
$ xmllint --noout data/AnosAnteriores.xml
Killed
$ echo $?
137
$ xmllint data/AnosAnteriores.xml 
Killed
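
A note on reading the last two results: by shell convention, an exit status above 128 means the process was terminated by a signal, and the signal number is the status minus 128. For 137 that gives 9, which is SIGKILL on POSIX systems. A common cause on Linux is the kernel's out-of-memory killer stopping a process that tries to load a very large file, but that is an assumption; nothing in the output above confirms it. The helper name describe_exit_status is made up for this sketch, which assumes a POSIX platform.

```python
import signal


def describe_exit_status(status):
    """Describe a shell exit status (hypothetical helper, POSIX only)."""
    # Shells report "killed by signal N" as exit status 128 + N.
    if status > 128:
        return 'killed by ' + signal.Signals(status - 128).name
    return 'exited with status {}'.format(status)


xmllint_large_file = describe_exit_status(137)
xmllint_small_file = describe_exit_status(0)
print(xmllint_large_file)
print(xmllint_small_file)
```

So the first two files parsed cleanly (status 0), while the check on AnosAnteriores.xml never finished, which says nothing yet about whether that file is well-formed.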
