Last active
March 31, 2023 15:09
-
-
Save byteshiva/6045974ed2eea89076b8542b95c8bc61 to your computer and use it in GitHub Desktop.
Data_Cleaning_Customer_Data_Analysis_CSV_Parquet.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"authorship_tag": "ABX9TyNPS6kVEu83As1YwftRxh//", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
}, | |
"gpuClass": "standard" | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/byteshiva/6045974ed2eea89076b8542b95c8bc61/data_cleaning_customer_data_analysis_csv_parquet.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# The code performs the following tasks:\n", | |
">\n", | |
"This Python code converts a CSV file to Parquet format and removes either rows or columns with NaN values based on the specified drop_option parameter. Specifically, the code:\n", | |
"\n", | |
"* reads a CSV file into a Pandas DataFrame using the specified input file path\n", | |
"* removes rows or columns with NaN values based on the specified drop_option parameter\n", | |
"* converts the Pandas DataFrame to a PyArrow Table\n", | |
"* writes the PyArrow Table to a Parquet file using the specified output file path\n", | |
"* reads the Parquet file into a PyArrow Table\n", | |
"* converts the PyArrow Table back to a Pandas DataFrame\n", | |
"* prints the first 100 rows of the resulting Pandas DataFrame.\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "8rVrenQJVFBr" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"\n", | |
"\n", | |
"---\n", | |
"\n", | |
"\n", | |
"> ### !apt update is a command that updates the package lists on a Debian-based Linux distribution. It is important to run periodically to keep your system up-to-date.\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "mlr8hh9-Lt6t" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!apt update " | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "1BHJmN1xWzxf", | |
"outputId": "7f1bb3c8-6333-43ce-dbc8-e6e32a760286" | |
}, | |
"execution_count": 37, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"\u001b[33m\r0% [Working]\u001b[0m\r \rHit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease\n", | |
"\u001b[33m\r0% [Connecting to archive.ubuntu.com] [Connecting to security.ubuntu.com (185.1\u001b[0m\r \rHit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease\n", | |
"\u001b[33m\r0% [Connecting to archive.ubuntu.com (91.189.91.39)] [Connecting to security.ub\u001b[0m\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:3 http://archive.ubuntu.com/ubuntu focal InRelease\n", | |
"\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:4 http://security.ubuntu.com/ubuntu focal-security InRelease\n", | |
"Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease\n", | |
"Hit:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease\n", | |
"Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease\n", | |
"Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease\n", | |
"Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease\n", | |
"Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease\n", | |
"Hit:11 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease\n", | |
"Reading package lists... Done\n", | |
"Building dependency tree \n", | |
"Reading state information... Done\n", | |
"4 packages can be upgraded. Run 'apt list --upgradable' to see them.\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### !apt upgrade -y is a command that upgrades all packages to their latest versions without prompting for confirmation. It is recommended to run periodically to ensure system security and performance.\n", | |
"\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "Js_5KnOmMQ-X" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "U5IyR7d2K3hv", | |
"outputId": "62124ffb-138d-4a4b-c0c4-9520700df740" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Reading package lists... Done\n", | |
"Building dependency tree \n", | |
"Reading state information... Done\n", | |
"Calculating upgrade... Done\n", | |
"The following packages have been kept back:\n", | |
" libcudnn8 libcudnn8-dev libnccl-dev libnccl2\n", | |
"0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.\n" | |
] | |
} | |
], | |
"source": [ | |
"!apt upgrade -y" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"cd /content/" | |
], | |
"metadata": { | |
"id": "eDWweiikOnv_", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "a9536448-fa63-42e9-f3a8-7ef183a3d05b" | |
}, | |
"execution_count": 39, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"/content\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"![ -d csv_parq ] || mkdir csv_parq" | |
], | |
"metadata": { | |
"id": "R-8kovF_RvPa" | |
}, | |
"execution_count": 40, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"cd csv_parq" | |
], | |
"metadata": { | |
"id": "448SpnL3Oszg", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "d38c1ac9-a086-4b8b-a107-746545fae04f" | |
}, | |
"execution_count": 41, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"/content/csv_parq\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pwd" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "jnVRdkxAOzu4", | |
"outputId": "f9d7067a-f228-49ca-dbc7-3e7a9ab8f25f" | |
}, | |
"execution_count": 42, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"/content/csv_parq\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"cd /content/csv_parq" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "JRqunWhkO0tZ", | |
"outputId": "da5c5c22-95f6-4d07-e18e-a50c17141b8d" | |
}, | |
"execution_count": 43, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"/content/csv_parq\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### This shell command installs pandas and pyarrow Python packages using pip.\n", | |
"\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "r82UoG6YTsiL" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install pandas pyarrow" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "Rj97AsIQJc9q", | |
"outputId": "107520b0-9459-464f-e542-241b919b9f04" | |
}, | |
"execution_count": 44, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", | |
"Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)\n", | |
"Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)\n", | |
"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)\n", | |
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)\n", | |
"Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)\n", | |
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### This code writes a CSV file named \"input.csv\" in the current working directory with a predefined set of rows and columns, where some values are NaN, and then prints a message confirming the completion of writing the file.\n", | |
"\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "vgF_w5SGSOBh" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Import the necessary modules\n", | |
"import os\n", | |
"\n", | |
"# Define the content to be written\n", | |
"content = \"\"\"\n", | |
"Name,Age,Country,Salary\n", | |
"John,30,USA,50000\n", | |
"Jane,35,UK,NaN\n", | |
"Bob,25,Canada,40000\n", | |
"Alice,NaN,USA,60000\n", | |
"Mike,40,USA,75000\n", | |
"Emma,27,UK,NaN\n", | |
"Dave,33,Australia,NaN\n", | |
"Mary,29,USA,55000\n", | |
"Mark,45,Canada,NaN\n", | |
"Sara,NaN,USA,90000\n", | |
"Chris,31,UK,38000\n", | |
"Steph,38,USA,NaN\n", | |
"Kevin,26,USA,NaN\n", | |
"Julia,37,Canada,NaN\n", | |
"Tom,NaN,Australia,65000\n", | |
"Amy,24,USA,30000\n", | |
"Ben,36,UK,70000\n", | |
"Laura,NaN,Canada,50000\n", | |
"Matt,42,USA,80000\n", | |
"Sophie,28,UK,NaN\n", | |
"Lisa,34,USA,55000\n", | |
"Max,39,Australia,60000\n", | |
"Jenny,30,USA,NaN\n", | |
"Greg,29,UK,NaN\n", | |
"Emily,32,Canada,47000\n", | |
"Ryan,27,USA,NaN\n", | |
"Olivia,NaN,Australia,72000\n", | |
"Sam,35,USA,95000\n", | |
"Lucy,23,UK,40000\n", | |
"Adam,26,USA,30000\n", | |
"Katie,37,USA,NaN\n", | |
"Nate,31,Canada,NaN\n", | |
"Beth,33,USA,60000\n", | |
"Scott,29,USA,50000\n", | |
"Kelly,43,Australia,NaN\n", | |
"Derek,27,USA,NaN\n", | |
"Tina,38,USA,80000\n", | |
"George,NaN,Canada,67000\n", | |
"Jill,24,USA,45000\n", | |
"Oscar,42,UK,NaN\n", | |
"Maggie,NaN,USA,55000\n", | |
"David,29,Australia,NaN\n", | |
"Lena,36,USA,NaN\n", | |
"Joe,33,Canada,48000\n", | |
"Helen,40,USA,90000\n", | |
"Fred,26,USA,NaN\n", | |
"Kate,35,Australia,NaN\n", | |
"\"\"\"\n", | |
"\n", | |
"# Define the file name and path\n", | |
"file_name = \"input.csv\"\n", | |
"file_path = os.path.join(os.getcwd(), file_name)\n", | |
"\n", | |
"# Open the file in write mode and write the content\n", | |
"with open(file_path, \"w\") as f:\n", | |
" f.write(content)\n", | |
" \n", | |
"print(f\"Content written to {file_name}\")\n" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "r-RqWtJgSNZJ", | |
"outputId": "55a0ec2d-ce7e-446c-faca-284b2c8edc01" | |
}, | |
"execution_count": 45, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Content written to input.csv\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### This code creates a Python script that converts a CSV file to a Parquet file using Pandas and PyArrow, and saves it to disk.\n", | |
"\n", | |
"\n", | |
"\n", | |
".\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "AsSNFdikT5U8" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Import the necessary modules\n", | |
"import os\n", | |
"\n", | |
"# Define the content to be written\n", | |
"content = \"\"\"\n", | |
"import pandas as pd\n", | |
"import pyarrow as pa\n", | |
"import pyarrow.parquet as pq\n", | |
"\n", | |
"\n", | |
"def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):\n", | |
" # Read CSV file into a Pandas DataFrame\n", | |
" df = pd.read_csv(input_file_path)\n", | |
"\n", | |
" # Remove rows or columns with NaN fields based on the drop_option argument\n", | |
" if drop_option == 'row':\n", | |
" df = df.dropna()\n", | |
" elif drop_option == 'column':\n", | |
" df = df.dropna(axis=1)\n", | |
"\n", | |
" # Convert Pandas DataFrame to PyArrow Table\n", | |
" table = pa.Table.from_pandas(df)\n", | |
"\n", | |
" # Write PyArrow Table to Parquet file\n", | |
" pq.write_table(table, output_file_path)\n", | |
"\n", | |
" # Open the parquet file\n", | |
" table = pq.read_table(output_file_path)\n", | |
"\n", | |
" # Convert the table to a pandas dataframe\n", | |
" df = table.to_pandas()\n", | |
"\n", | |
" # Print the dataframe\n", | |
" # print(df.head(100))\n", | |
" print(df.head(100).to_string(index=False))\n", | |
"\n", | |
"\n", | |
"input_file_path = 'input.csv'\n", | |
"output_file_path = 'output.parquet'\n", | |
"drop_option = 'row' # options: 'row' or 'column'\n", | |
"\n", | |
"convert_csv_to_parquet(input_file_path, output_file_path, drop_option)\n", | |
"\n", | |
"\"\"\"\n", | |
"\n", | |
"# Define the file name and path\n", | |
"file_name = \"main.py\"\n", | |
"file_path = os.path.join(os.getcwd(), file_name)\n", | |
"\n", | |
"# Open the file in write mode and write the content\n", | |
"with open(file_path, \"w\") as f:\n", | |
" f.write(content)\n", | |
" \n", | |
"print(f\"Content written to {file_name}\")\n" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "joAhGR77wDgb", | |
"outputId": "8eed13ff-87be-48cb-d8d2-7859e713c9d9" | |
}, | |
"execution_count": 46, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Content written to main.py\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### This command runs the Python script \"main.py\".\n", | |
"\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "1TCAvOLIXM85" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!python main.py " | |
], | |
"metadata": { | |
"id": "zKgfs49CO9SL", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "80aec2e4-59ad-44ac-9017-29217c2bd4b9" | |
}, | |
"execution_count": 47, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
" Name Age Country Salary\n", | |
" John 30.0 USA 50000.0\n", | |
" Bob 25.0 Canada 40000.0\n", | |
" Mike 40.0 USA 75000.0\n", | |
" Mary 29.0 USA 55000.0\n", | |
"Chris 31.0 UK 38000.0\n", | |
" Amy 24.0 USA 30000.0\n", | |
" Ben 36.0 UK 70000.0\n", | |
" Matt 42.0 USA 80000.0\n", | |
" Lisa 34.0 USA 55000.0\n", | |
" Max 39.0 Australia 60000.0\n", | |
"Emily 32.0 Canada 47000.0\n", | |
" Sam 35.0 USA 95000.0\n", | |
" Lucy 23.0 UK 40000.0\n", | |
" Adam 26.0 USA 30000.0\n", | |
" Beth 33.0 USA 60000.0\n", | |
"Scott 29.0 USA 50000.0\n", | |
" Tina 38.0 USA 80000.0\n", | |
" Jill 24.0 USA 45000.0\n", | |
" Joe 33.0 Canada 48000.0\n", | |
"Helen 40.0 USA 90000.0\n" | |
] | |
} | |
] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment