Last active
July 6, 2020 11:44
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This notebook parses a folder of JSON files (one object per file) into a pandas DataFrame." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import glob\n", | |
"import json" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Collect the paths of all files in the folder into a list (point the pattern at the folder that holds the files). The files are not opened yet. Then check how long the list is." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"files = glob.glob('tweet/*')\n", | |
"#files = glob.glob('user/*')\n", | |
"len(files)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Create an empty list. Then, for each path in the file list: open the file in read mode, read the JSON as a string, convert the JSON string to a dictionary, and append the dictionary to the list." | |
] | |
}, | |
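{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"dictlist = []\n", | |
"\n", | |
"for file in files:\n", | |
"    # the with block closes the file again, even if reading fails\n", | |
"    with open(file, 'r', encoding='utf-8') as f:  # assumes UTF-8 encoded JSON\n", | |
"        json_string = f.read()\n", | |
"    json_dict = json.loads(json_string)  # JSON string -> dictionary\n", | |
"    dictlist.append(json_dict)" | |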
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We now have a list of dictionaries. Load it into a DataFrame, then replace line breaks, tabs, and carriage returns in the values with spaces." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df = pd.DataFrame(dictlist)\n", | |
"\n", | |
"# replace every line break, tab and carriage return with a space\n", | |
"df = df.replace({r'[\\n\\t\\r]': ' '}, regex=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Export to csv\n", | |
"df.to_csv(\"data.csv\")" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
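The loading loop in the notebook stops at the first file that is not valid JSON. A minimal sketch of a more forgiving variant follows, assuming it is acceptable to skip unreadable files and report them afterwards. The helper name `load_json_folder` and its return shape are hypothetical and not part of the original notebook.

```python
import glob
import json
import os


def load_json_folder(folder):
    """Parse every file in `folder` as one JSON object.

    Returns (records, skipped): the parsed objects, and the paths of
    the files that could not be decoded as JSON.
    """
    records, skipped = [], []
    # sorted() makes the record order reproducible; glob order is arbitrary
    for path in sorted(glob.glob(os.path.join(folder, '*'))):
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        try:
            # assumption: the files are UTF-8 encoded
            with open(path, 'r', encoding='utf-8') as f:
                records.append(json.load(f))
        except (json.JSONDecodeError, UnicodeDecodeError):
            skipped.append(path)  # remember the file instead of crashing
    return records, skipped
```

With this helper, `dictlist, skipped = load_json_folder('tweet')` would take the place of the glob and loop cells, and `pd.DataFrame(dictlist)` works as before. Inspecting `skipped` shows which files were left out.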