import pandas
import fastavro

# NOTE: `encoding` is passed to open() as the file *mode*; pass 'rb' (see the comments below)
def avro_df(filepath, encoding):
    # Open file stream
    with open(filepath, encoding) as fp:
        # Configure Avro reader
        reader = fastavro.reader(fp)
        # Load records in memory
        records = [r for r in reader]
        # Populate pandas.DataFrame with records
        df = pandas.DataFrame.from_records(records)
        # Return created DataFrame
        return df
What will be the encoding?
"Encoding" (actually the Python mode used to open the file) should be 'rb'.
This is a typo, should be named "mode" instead of "encoding".
Thanks, very useful
Hi, thanks!
But when we print df, the data is displayed in dictionary format. Is there any way to convert it to a tabular format?
Currently I am getting data like the example below:
IPython 7.8.0 -- An enhanced Interactive Python.
<bound method DataFrame.count of MbrData ... Identifiers
0 {'SourceName': 'AFFINITY', 'SourceId': '159234... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
1 {'SourceName': 'AFFINITY', 'SourceId': '595713... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
2 {'SourceName': 'AFFINITY', 'SourceId': '168155... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
3 {'SourceName': 'AFFINITY', 'SourceId': '725398... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
4 {'SourceName': 'AFFINITY', 'SourceId': '727384... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
Hey @Neetu2407,
I don't know much about Avro files, but you can check whether the Avro file schema is parsed correctly with print(reader.schema).
You can find more at https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/
Also, instead of printing df, df.head() and df.dtypes may help.
Is the dataset public? Can we at least see your schema and the corresponding df.columns? If you want me to take a look, drop me an email.
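For nested columns like the ones in the output above, one option that may help is pandas.json_normalize, which expands nested dicts into dotted column names. This is only a sketch: the records below are made up to mimic the shape of the printed output (the "Value" key and the ID strings are hypothetical, since the real field names are truncated), so adjust the keys to your actual schema.

```python
import pandas

# Hypothetical records shaped like the output above; "Value", "A1", "X1" etc. are invented
records = [
    {"MbrData": {"SourceName": "AFFINITY", "SourceId": "A1"},
     "Identifiers": [{"Identifiers": "DUNSNumber", "Value": "X1"}]},
    {"MbrData": {"SourceName": "AFFINITY", "SourceId": "A2"},
     "Identifiers": [{"Identifiers": "DUNSNumber", "Value": "X2"}]},
]

# Nested dicts become columns such as "MbrData.SourceName";
# list-valued fields (here "Identifiers") are left as lists
flat = pandas.json_normalize(records)

# To get one row per entry of the "Identifiers" list instead,
# keeping the parent's SourceId alongside each entry
ids = pandas.json_normalize(
    records,
    record_path="Identifiers",
    meta=[["MbrData", "SourceId"]],
)
```

If you already have the DataFrame from avro_df, something like pandas.json_normalize(df["MbrData"].tolist()) should flatten a single dict column in the same way.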
I had an error: 'utf8' codec can't decode byte 0x83. This was solved by using open(filepath, 'rb'), where the b means to read the file in binary format. As already mentioned, this argument is the "mode", not "encoding".
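To illustrate why text mode fails, here is a small standalone sketch that does not need fastavro. It writes the 4-byte Avro magic header followed by one arbitrary non-UTF-8 byte to a temporary file. That payload is an assumption made purely for the demo and is not a valid Avro file. Opening it in text mode raises a decode error, while 'rb' returns the raw bytes untouched.

```python
import os
import tempfile

# Avro container files begin with the magic bytes b"Obj\x01";
# the trailing 0x83 is an arbitrary byte chosen because it is not valid UTF-8
payload = b"Obj\x01\x83"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Text mode tries to decode the bytes and fails
try:
    with open(path, "r", encoding="utf-8") as fp:
        fp.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode hands back the bytes unchanged, which is what an Avro reader needs
with open(path, "rb") as fp:
    raw = fp.read()

os.remove(path)
```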
It looks like you can save a line of code and avoid temporarily duplicating the data in memory by passing the reader iterable directly to from_records rather than loading it into a list first.