Created January 15, 2017 11:55
Read Avro file from Pandas
import pandas
import fastavro


def avro_df(filepath, encoding):
    # Open file stream. Note: the second argument of open() is the file
    # mode, so despite its name `encoding` should be 'rb' for Avro files.
    with open(filepath, encoding) as fp:
        # Configure Avro reader
        reader = fastavro.reader(fp)
        # Load records in memory
        records = [r for r in reader]
        # Populate pandas.DataFrame with records
        df = pandas.DataFrame.from_records(records)
        # Return created DataFrame
        return df
I had an error `'utf8' codec can't decode byte 0x83`. This was solved by using `open(filepath, 'rb')`, where the `b` means to read the file in binary format. As already mentioned, this argument is the "mode", not "encoding".
It looks like you can save a line of code and avoid temporarily duplicating the data in memory by passing the reader iterable directly to `from_records` rather than loading it into a list first.
Hey @Neetu2407,
I don't know much about avro files, but you can check if the avro file schema is parsed correctly with `print(reader.schema)`. Also, instead of printing df, `df.head()` and `df.dtypes` may help.
Is the dataset public? Can we at least see your schema and corresponding `df.columns`? If you want me to take a look, drop me an email.