import pandas
import fastavro


def avro_df(filepath, encoding):
    # Open file stream
    with open(filepath, encoding) as fp:
        # Configure Avro reader
        reader = fastavro.reader(fp)
        # Load records in memory
        records = [r for r in reader]
        # Populate pandas.DataFrame with records
        df = pandas.DataFrame.from_records(records)
        # Return created DataFrame
        return df
Hi, thanks!
But when we print df, the data is displayed in dictionary format. Is there any way to convert it to a tabular format?
I am getting data like the example below:
IPython 7.8.0 -- An enhanced Interactive Python.
<bound method DataFrame.count of MbrData ... Identifiers
0 {'SourceName': 'AFFINITY', 'SourceId': '159234... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
1 {'SourceName': 'AFFINITY', 'SourceId': '595713... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
2 {'SourceName': 'AFFINITY', 'SourceId': '168155... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
3 {'SourceName': 'AFFINITY', 'SourceId': '725398... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
4 {'SourceName': 'AFFINITY', 'SourceId': '727384... ... [{'Identifiers': 'DUNSNumber', 'Identifier_Val...
Hey @Neetu2407,
I don't know much about Avro files, but you can check whether the file's schema is parsed correctly with print(reader.schema).
You can find more at https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/
Also, instead of printing df, df.head() and df.dtypes may help.
Is the dataset public? Can we at least see your schema and corresponding df.columns? If you want me to take a look, drop me an email.
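The dictionary-looking cells in the output above suggest that the Avro records contain nested records (MbrData) and arrays of records (Identifiers). One possible way to get flat columns is pandas.json_normalize. The sketch below is only an illustration: the records are hypothetical stand-ins shaped like the output above, the SourceId values are made up, and the real schema may contain more fields.

```python
import pandas

# Hypothetical records shaped like the output above; the values are made up
# and the real Avro schema may contain additional fields.
records = [
    {
        "MbrData": {"SourceName": "AFFINITY", "SourceId": "id-1"},
        "Identifiers": [{"Identifiers": "DUNSNumber"}],
    },
    {
        "MbrData": {"SourceName": "AFFINITY", "SourceId": "id-2"},
        "Identifiers": [{"Identifiers": "DUNSNumber"}],
    },
]

# Nested dicts become dotted columns such as "MbrData.SourceName".
# Lists of dicts (the "Identifiers" column) are left as lists.
flat = pandas.json_normalize(records)

# To get one row per entry of the "Identifiers" list instead, point
# record_path at the list and carry parent fields along with meta.
per_identifier = pandas.json_normalize(
    records,
    record_path="Identifiers",
    meta=[["MbrData", "SourceId"]],
)

print(flat.columns.tolist())
print(per_identifier)
```

With a real file, the list of records read from fastavro.reader could be passed to json_normalize in place of the hypothetical records list.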
I had an error 'utf8' codec can't decode byte 0x83. This was solved by using open(filepath, 'rb'), where the b means to read the file in binary format. As already mentioned, this argument is the "mode", not "encoding".
It looks like you can save a line of code and avoid temporarily duplicating the data in memory by passing the reader iterable directly to from_records rather than loading it into a list first.
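Putting the last two comments together, a revised version of the function might look like the sketch below. It opens the file in binary mode and hands the reader to from_records without building a list first. The helper name records_to_df is hypothetical and exists only for illustration. Note that pandas may still buffer the rows internally while it builds the frame, so the memory saving is likely limited to not holding a second list in your own code.

```python
import pandas


def records_to_df(records):
    """Build a DataFrame from any iterable of dict records.

    Hypothetical helper for illustration; a fastavro reader is one such
    iterable. iter() makes sure from_records receives an iterator.
    """
    return pandas.DataFrame.from_records(iter(records))


def avro_df(filepath):
    # Third-party dependency, imported here so that records_to_df can be
    # used on its own without fastavro installed.
    import fastavro

    # Avro container files are binary, so open with mode "rb".
    with open(filepath, "rb") as fp:
        # The DataFrame must be built while the file is still open,
        # because the reader pulls records from fp lazily.
        return records_to_df(fastavro.reader(fp))
```

Usage would be df = avro_df(path) with a path to an Avro container file, followed by df.head() and df.dtypes to inspect the result.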
Thanks, very useful