@LouisAmon
Created January 15, 2017 11:55
Read Avro file from Pandas
import pandas
import fastavro


def avro_df(filepath, encoding):
    # NOTE: `encoding` is passed to open() as the file mode; Avro files are binary, so pass 'rb'
    # Open file stream
    with open(filepath, encoding) as fp:
        # Configure Avro reader
        reader = fastavro.reader(fp)
        # Load records in memory
        records = [r for r in reader]
        # Populate pandas.DataFrame with records
        df = pandas.DataFrame.from_records(records)
        # Return created DataFrame
        return df
@wrzasa

wrzasa commented Dec 20, 2019

👍

@jeet143

jeet143 commented Mar 11, 2020

What will be the encoding?

@Takaklas

Takaklas commented Apr 7, 2020

> What will be the encoding?

"Encoding" (which actually refers to the Python mode used to open the file) should be 'rb'.
This is a typo: the parameter should be named "mode" instead of "encoding".

@pjk645-zz

Thanks, very useful

@Neetu2407

Hi, thanks!

But when we print df, the data is displayed in dictionary format. Is there any way to convert it to a tabular format?

I am getting data like the example below:

IPython 7.8.0 -- An enhanced Interactive Python.
<bound method DataFrame.count of MbrData ... Identifiers
0 {'SourceName': 'AFFINITY', 'SourceId': '159234... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
1 {'SourceName': 'AFFINITY', 'SourceId': '595713... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
2 {'SourceName': 'AFFINITY', 'SourceId': '168155... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
3 {'SourceName': 'AFFINITY', 'SourceId': '725398... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
4 {'SourceName': 'AFFINITY', 'SourceId': '727384... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...

@Takaklas

Hey @Neetu2407,
I don't know much about Avro files, but you can check whether the Avro file schema is parsed correctly with print(reader.schema)

You can find more at https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/

Also, instead of printing df, df.head() and df.dtypes may help.
Is the dataset public? Can we at least see your schema and corresponding df.columns? If you want me to take a look, drop me an email.
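
One possible way to get a flat table out of nested records is pandas.json_normalize. The sketch below is only an illustration: the records and every field name in it (`member`, `source`, `id`, `tags`, `kind`, `value`) are hypothetical stand-ins, since the real schema is not shown in this thread.

```python
import pandas

# Hypothetical records: one nested dict field and one list-of-dicts field,
# loosely shaped like the output posted above.
records = [
    {
        "member": {"source": "AFFINITY", "id": "1"},
        "tags": [{"kind": "A", "value": "x"}, {"kind": "B", "value": "y"}],
    },
    {
        "member": {"source": "AFFINITY", "id": "2"},
        "tags": [{"kind": "A", "value": "z"}],
    },
]

# Nested dicts become dotted columns such as "member.source";
# the list-of-dicts column "tags" is left as a column of lists.
flat = pandas.json_normalize(records)

# Expand the list column into one row per list element,
# carrying the nested member id along as "member.id".
tags = pandas.json_normalize(
    records,
    record_path="tags",
    meta=[["member", "id"]],
)
```

With a real file, `records` would be the list read from fastavro.reader instead of the literal above.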

@WASDi

WASDi commented Apr 28, 2021

I had an error 'utf8' codec can't decode byte 0x83. This was solved by using open(filepath, 'rb'), where the b means to read the file in binary format. As already mentioned, this argument is the "mode", not "encoding".

@nat-n

nat-n commented Nov 13, 2021

It looks like you can save a line of code and avoid temporarily duplicating the data in memory by passing the reader iterable directly to from_records rather than loading it into a list first.
