Forked from simonlindgren/chopped-json-parser.ipynb
Created November 8, 2018 01:38
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook parses a folder of JSON files (one object per file) into a pandas DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import glob\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collect the paths of all files into a list (point the glob pattern at the folder that holds the files), then check how many files were found."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "files = glob.glob('tweet/*')\n",
    "# files = glob.glob('user/*')  # alternative input folder\n",
    "len(files)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an empty list. For each file in the file list: open it in read mode, read the JSON as a string, convert the JSON string to a dictionary, and append the dictionary to the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dictlist = []\n",
    "\n",
    "for file in files:\n",
    "    # A context manager closes each file handle after reading;\n",
    "    # UTF-8 is assumed here because JSON is normally UTF-8 encoded\n",
    "    with open(file, 'r', encoding='utf-8') as f:\n",
    "        json_string = f.read()\n",
    "    json_dict = json.loads(json_string)\n",
    "    dictlist.append(json_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a list of dictionaries, which we load into a DataFrame. Line breaks, tabs, and carriage returns in string values are then replaced with spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(dictlist)\n",
    "\n",
    "# Replace line breaks, tabs, and carriage returns with spaces\n",
    "df = df.replace({r'[\\n\\t\\r]': ' '}, regex=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export to CSV\n",
    "df.to_csv(\"data.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
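The notebook's loading loop stops at the first file that is not valid JSON. A minimal sketch of a more forgiving variant is shown below, using only the standard library. The helper names `load_json_folder` and `flatten_whitespace` are hypothetical and not part of the notebook, and the sketch assumes UTF-8 files that each hold one JSON object.

```python
import json
import re
from pathlib import Path

# Matches the characters the notebook replaces: newline, tab, carriage return
_WHITESPACE = re.compile(r"[\n\t\r]")


def load_json_folder(folder):
    """Return (records, skipped): parsed objects and the paths that failed to parse.

    Hypothetical helper; assumes UTF-8 encoded files, one JSON object per file.
    """
    records, skipped = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            with path.open("r", encoding="utf-8") as f:
                records.append(json.load(f))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Remember the bad file instead of aborting the whole run
            skipped.append(path)
    return records, skipped


def flatten_whitespace(record):
    """Replace line breaks, tabs, and carriage returns in top-level string values."""
    return {
        key: _WHITESPACE.sub(" ", value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```

With these helpers, `records, skipped = load_json_folder('tweet')` followed by `pd.DataFrame([flatten_whitespace(r) for r in records])` would build a similar DataFrame, and `skipped` lists the files that need a closer look. Unlike the notebook's `df.replace`, this sketch only cleans top-level string values.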