Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save byteshiva/6045974ed2eea89076b8542b95c8bc61 to your computer and use it in GitHub Desktop.
Save byteshiva/6045974ed2eea89076b8542b95c8bc61 to your computer and use it in GitHub Desktop.
Data_Cleaning_Customer_Data_Analysis_CSV_Parquet.ipynb
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNPS6kVEu83As1YwftRxh//",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/byteshiva/6045974ed2eea89076b8542b95c8bc61/data_cleaning_customer_data_analysis_csv_parquet.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# The code performs the following tasks:\n",
">\n",
"This Python code converts a CSV file to Parquet format and removes either rows or columns with NaN values based on the specified drop_option parameter. Specifically, the code:\n",
"\n",
"* reads a CSV file into a Pandas DataFrame using the specified input file path\n",
"* removes rows or columns with NaN values based on the specified drop_option parameter\n",
"* converts the Pandas DataFrame to a PyArrow Table\n",
"* writes the PyArrow Table to a Parquet file using the specified output file path\n",
"* reads the Parquet file into a PyArrow Table\n",
"* converts the PyArrow Table back to a Pandas DataFrame\n",
"* prints the first 100 rows of the resulting Pandas DataFrame.\n",
"\n",
"\n"
],
"metadata": {
"id": "8rVrenQJVFBr"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n",
"\n",
"> ### !apt update is a command that updates the package lists on a Debian-based Linux distribution. It is important to run periodically to keep your system up-to-date.\n",
"\n",
"\n"
],
"metadata": {
"id": "mlr8hh9-Lt6t"
}
},
{
"cell_type": "code",
"source": [
"!apt update "
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1BHJmN1xWzxf",
"outputId": "7f1bb3c8-6333-43ce-dbc8-e6e32a760286"
},
"execution_count": 37,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\u001b[33m\r0% [Working]\u001b[0m\r \rHit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease\n",
"\u001b[33m\r0% [Connecting to archive.ubuntu.com] [Connecting to security.ubuntu.com (185.1\u001b[0m\r \rHit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease\n",
"\u001b[33m\r0% [Connecting to archive.ubuntu.com (91.189.91.39)] [Connecting to security.ub\u001b[0m\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:3 http://archive.ubuntu.com/ubuntu focal InRelease\n",
"\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:4 http://security.ubuntu.com/ubuntu focal-security InRelease\n",
"Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease\n",
"Hit:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease\n",
"Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease\n",
"Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease\n",
"Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease\n",
"Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease\n",
"Hit:11 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease\n",
"Reading package lists... Done\n",
"Building dependency tree \n",
"Reading state information... Done\n",
"4 packages can be upgraded. Run 'apt list --upgradable' to see them.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### !apt upgrade -y is a command that upgrades all packages to their latest versions without prompting for confirmation. It is recommended to run periodically to ensure system security and performance.\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "Js_5KnOmMQ-X"
}
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "U5IyR7d2K3hv",
"outputId": "62124ffb-138d-4a4b-c0c4-9520700df740"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Reading package lists... Done\n",
"Building dependency tree \n",
"Reading state information... Done\n",
"Calculating upgrade... Done\n",
"The following packages have been kept back:\n",
" libcudnn8 libcudnn8-dev libnccl-dev libnccl2\n",
"0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.\n"
]
}
],
"source": [
"!apt upgrade -y"
]
},
{
"cell_type": "code",
"source": [
"cd /content/"
],
"metadata": {
"id": "eDWweiikOnv_",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a9536448-fa63-42e9-f3a8-7ef183a3d05b"
},
"execution_count": 39,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"![ -d csv_parq ] || mkdir csv_parq"
],
"metadata": {
"id": "R-8kovF_RvPa"
},
"execution_count": 40,
"outputs": []
},
{
"cell_type": "code",
"source": [
"cd csv_parq"
],
"metadata": {
"id": "448SpnL3Oszg",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d38c1ac9-a086-4b8b-a107-746545fae04f"
},
"execution_count": 41,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content/csv_parq\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pwd"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jnVRdkxAOzu4",
"outputId": "f9d7067a-f228-49ca-dbc7-3e7a9ab8f25f"
},
"execution_count": 42,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content/csv_parq\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"cd /content/csv_parq"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JRqunWhkO0tZ",
"outputId": "da5c5c22-95f6-4d07-e18e-a50c17141b8d"
},
"execution_count": 43,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content/csv_parq\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### This shell command installs pandas and pyarrow Python packages using pip.\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "r82UoG6YTsiL"
}
},
{
"cell_type": "code",
"source": [
"!pip install pandas pyarrow"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Rj97AsIQJc9q",
"outputId": "107520b0-9459-464f-e542-241b919b9f04"
},
"execution_count": 44,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)\n",
"Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)\n",
"Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### This code writes a CSV file named \"input.csv\" in the current working directory with a predefined set of rows and columns, where some values are NaN, and then prints a message confirming the completion of writing the file.\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "vgF_w5SGSOBh"
}
},
{
"cell_type": "code",
"source": [
"# Import the necessary modules\n",
"import os\n",
"\n",
"# Define the content to be written\n",
"content = \"\"\"\n",
"Name,Age,Country,Salary\n",
"John,30,USA,50000\n",
"Jane,35,UK,NaN\n",
"Bob,25,Canada,40000\n",
"Alice,NaN,USA,60000\n",
"Mike,40,USA,75000\n",
"Emma,27,UK,NaN\n",
"Dave,33,Australia,NaN\n",
"Mary,29,USA,55000\n",
"Mark,45,Canada,NaN\n",
"Sara,NaN,USA,90000\n",
"Chris,31,UK,38000\n",
"Steph,38,USA,NaN\n",
"Kevin,26,USA,NaN\n",
"Julia,37,Canada,NaN\n",
"Tom,NaN,Australia,65000\n",
"Amy,24,USA,30000\n",
"Ben,36,UK,70000\n",
"Laura,NaN,Canada,50000\n",
"Matt,42,USA,80000\n",
"Sophie,28,UK,NaN\n",
"Lisa,34,USA,55000\n",
"Max,39,Australia,60000\n",
"Jenny,30,USA,NaN\n",
"Greg,29,UK,NaN\n",
"Emily,32,Canada,47000\n",
"Ryan,27,USA,NaN\n",
"Olivia,NaN,Australia,72000\n",
"Sam,35,USA,95000\n",
"Lucy,23,UK,40000\n",
"Adam,26,USA,30000\n",
"Katie,37,USA,NaN\n",
"Nate,31,Canada,NaN\n",
"Beth,33,USA,60000\n",
"Scott,29,USA,50000\n",
"Kelly,43,Australia,NaN\n",
"Derek,27,USA,NaN\n",
"Tina,38,USA,80000\n",
"George,NaN,Canada,67000\n",
"Jill,24,USA,45000\n",
"Oscar,42,UK,NaN\n",
"Maggie,NaN,USA,55000\n",
"David,29,Australia,NaN\n",
"Lena,36,USA,NaN\n",
"Joe,33,Canada,48000\n",
"Helen,40,USA,90000\n",
"Fred,26,USA,NaN\n",
"Kate,35,Australia,NaN\n",
"\"\"\"\n",
"\n",
"# Define the file name and path\n",
"file_name = \"input.csv\"\n",
"file_path = os.path.join(os.getcwd(), file_name)\n",
"\n",
"# Open the file in write mode and write the content\n",
"with open(file_path, \"w\") as f:\n",
" f.write(content)\n",
" \n",
"print(f\"Content written to {file_name}\")\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "r-RqWtJgSNZJ",
"outputId": "55a0ec2d-ce7e-446c-faca-284b2c8edc01"
},
"execution_count": 45,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Content written to input.csv\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### This code creates a Python script that converts a CSV file to a Parquet file using Pandas and PyArrow, and saves it to disk.\n",
"\n",
"\n",
"\n",
".\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "AsSNFdikT5U8"
}
},
{
"cell_type": "code",
"source": [
"# Import the necessary modules\n",
"import os\n",
"\n",
"# Define the content to be written\n",
"content = \"\"\"\n",
"import pandas as pd\n",
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"\n",
"\n",
"def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):\n",
" # Read CSV file into a Pandas DataFrame\n",
" df = pd.read_csv(input_file_path)\n",
"\n",
" # Remove rows or columns with NaN fields based on the drop_option argument\n",
" if drop_option == 'row':\n",
" df = df.dropna()\n",
" elif drop_option == 'column':\n",
" df = df.dropna(axis=1)\n",
"\n",
" # Convert Pandas DataFrame to PyArrow Table\n",
" table = pa.Table.from_pandas(df)\n",
"\n",
" # Write PyArrow Table to Parquet file\n",
" pq.write_table(table, output_file_path)\n",
"\n",
" # Open the parquet file\n",
" table = pq.read_table(output_file_path)\n",
"\n",
" # Convert the table to a pandas dataframe\n",
" df = table.to_pandas()\n",
"\n",
" # Print the dataframe\n",
" # print(df.head(100))\n",
" print(df.head(100).to_string(index=False))\n",
"\n",
"\n",
"input_file_path = 'input.csv'\n",
"output_file_path = 'output.parquet'\n",
"drop_option = 'row' # options: 'row' or 'column'\n",
"\n",
"convert_csv_to_parquet(input_file_path, output_file_path, drop_option)\n",
"\n",
"\"\"\"\n",
"\n",
"# Define the file name and path\n",
"file_name = \"main.py\"\n",
"file_path = os.path.join(os.getcwd(), file_name)\n",
"\n",
"# Open the file in write mode and write the content\n",
"with open(file_path, \"w\") as f:\n",
" f.write(content)\n",
" \n",
"print(f\"Content written to {file_name}\")\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "joAhGR77wDgb",
"outputId": "8eed13ff-87be-48cb-d8d2-7859e713c9d9"
},
"execution_count": 46,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Content written to main.py\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### This command runs the Python script \"main.py\".\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "1TCAvOLIXM85"
}
},
{
"cell_type": "code",
"source": [
"!python main.py "
],
"metadata": {
"id": "zKgfs49CO9SL",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "80aec2e4-59ad-44ac-9017-29217c2bd4b9"
},
"execution_count": 47,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Name Age Country Salary\n",
" John 30.0 USA 50000.0\n",
" Bob 25.0 Canada 40000.0\n",
" Mike 40.0 USA 75000.0\n",
" Mary 29.0 USA 55000.0\n",
"Chris 31.0 UK 38000.0\n",
" Amy 24.0 USA 30000.0\n",
" Ben 36.0 UK 70000.0\n",
" Matt 42.0 USA 80000.0\n",
" Lisa 34.0 USA 55000.0\n",
" Max 39.0 Australia 60000.0\n",
"Emily 32.0 Canada 47000.0\n",
" Sam 35.0 USA 95000.0\n",
" Lucy 23.0 UK 40000.0\n",
" Adam 26.0 USA 30000.0\n",
" Beth 33.0 USA 60000.0\n",
"Scott 29.0 USA 50000.0\n",
" Tina 38.0 USA 80000.0\n",
" Jill 24.0 USA 45000.0\n",
" Joe 33.0 Canada 48000.0\n",
"Helen 40.0 USA 90000.0\n"
]
}
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment