byteshiva · March 31, 2023 15:09
diff --git a/data_cleaning_customer_data_analysis_csv_parquet.ipynb b/data_cleaning_customer_data_analysis_csv_parquet.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyNPS6kVEu83As1YwftRxh//",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/byteshiva/6045974ed2eea89076b8542b95c8bc61/data_cleaning_customer_data_analysis_csv_parquet.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# The code performs the following tasks:\n",
        ">\n",
        "This Python code converts a CSV file to Parquet format and removes either rows or columns with NaN values based on the specified drop_option parameter. Specifically, the code:\n",
        "\n",
        "* reads a CSV file into a Pandas DataFrame using the specified input file path\n",
        "* removes rows or columns with NaN values based on the specified drop_option parameter\n",
        "* converts the Pandas DataFrame to a PyArrow Table\n",
        "* writes the PyArrow Table to a Parquet file using the specified output file path\n",
        "* reads the Parquet file into a PyArrow Table\n",
        "* converts the PyArrow Table back to a Pandas DataFrame\n",
        "* prints the first 100 rows of the resulting Pandas DataFrame.\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "8rVrenQJVFBr"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "> ### !apt update is a command that updates the package lists on a Debian-based Linux distribution. It is important to run periodically to keep your system up-to-date.\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "mlr8hh9-Lt6t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!apt update "
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1BHJmN1xWzxf",
        "outputId": "7f1bb3c8-6333-43ce-dbc8-e6e32a760286"
      },
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[33m\r0% [Working]\u001b[0m\r            \rHit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease\n",
            "\u001b[33m\r0% [Connecting to archive.ubuntu.com] [Connecting to security.ubuntu.com (185.1\u001b[0m\r                                                                               \rHit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease\n",
            "\u001b[33m\r0% [Connecting to archive.ubuntu.com (91.189.91.39)] [Connecting to security.ub\u001b[0m\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r                                                                               \rHit:3 http://archive.ubuntu.com/ubuntu focal InRelease\n",
            "\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r                                                                               \rHit:4 http://security.ubuntu.com/ubuntu focal-security InRelease\n",
            "Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease\n",
            "Hit:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease\n",
            "Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease\n",
            "Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease\n",
            "Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease\n",
            "Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease\n",
            "Hit:11 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease\n",
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "4 packages can be upgraded. Run 'apt list --upgradable' to see them.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### !apt upgrade -y is a command that upgrades all packages to their latest versions without prompting for confirmation. It is recommended to run periodically to ensure system security and performance.\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "Js_5KnOmMQ-X"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 38,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U5IyR7d2K3hv",
        "outputId": "62124ffb-138d-4a4b-c0c4-9520700df740"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "Calculating upgrade... Done\n",
            "The following packages have been kept back:\n",
            "  libcudnn8 libcudnn8-dev libnccl-dev libnccl2\n",
            "0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.\n"
          ]
        }
      ],
      "source": [
        "!apt upgrade -y"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "cd /content/"
      ],
      "metadata": {
        "id": "eDWweiikOnv_",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "a9536448-fa63-42e9-f3a8-7ef183a3d05b"
      },
      "execution_count": 39,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "/content\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "![ -d csv_parq ] || mkdir csv_parq"
      ],
      "metadata": {
        "id": "R-8kovF_RvPa"
      },
      "execution_count": 40,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "cd csv_parq"
      ],
      "metadata": {
        "id": "448SpnL3Oszg",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "d38c1ac9-a086-4b8b-a107-746545fae04f"
      },
      "execution_count": 41,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "/content/csv_parq\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pwd"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jnVRdkxAOzu4",
        "outputId": "f9d7067a-f228-49ca-dbc7-3e7a9ab8f25f"
      },
      "execution_count": 42,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "/content/csv_parq\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "cd /content/csv_parq"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JRqunWhkO0tZ",
        "outputId": "da5c5c22-95f6-4d07-e18e-a50c17141b8d"
      },
      "execution_count": 43,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "/content/csv_parq\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### This shell command installs pandas and pyarrow Python packages using pip.\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "r82UoG6YTsiL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install pandas pyarrow"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Rj97AsIQJc9q",
        "outputId": "107520b0-9459-464f-e542-241b919b9f04"
      },
      "execution_count": 44,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)\n",
            "Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)\n",
            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)\n",
            "Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### This code writes a CSV file named \"input.csv\" in the current working directory with a predefined set of rows and columns, where some values are NaN, and then prints a message confirming the completion of writing the file.\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "vgF_w5SGSOBh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import the necessary modules\n",
        "import os\n",
        "\n",
        "# Define the content to be written\n",
        "content = \"\"\"\n",
        "Name,Age,Country,Salary\n",
        "John,30,USA,50000\n",
        "Jane,35,UK,NaN\n",
        "Bob,25,Canada,40000\n",
        "Alice,NaN,USA,60000\n",
        "Mike,40,USA,75000\n",
        "Emma,27,UK,NaN\n",
        "Dave,33,Australia,NaN\n",
        "Mary,29,USA,55000\n",
        "Mark,45,Canada,NaN\n",
        "Sara,NaN,USA,90000\n",
        "Chris,31,UK,38000\n",
        "Steph,38,USA,NaN\n",
        "Kevin,26,USA,NaN\n",
        "Julia,37,Canada,NaN\n",
        "Tom,NaN,Australia,65000\n",
        "Amy,24,USA,30000\n",
        "Ben,36,UK,70000\n",
        "Laura,NaN,Canada,50000\n",
        "Matt,42,USA,80000\n",
        "Sophie,28,UK,NaN\n",
        "Lisa,34,USA,55000\n",
        "Max,39,Australia,60000\n",
        "Jenny,30,USA,NaN\n",
        "Greg,29,UK,NaN\n",
        "Emily,32,Canada,47000\n",
        "Ryan,27,USA,NaN\n",
        "Olivia,NaN,Australia,72000\n",
        "Sam,35,USA,95000\n",
        "Lucy,23,UK,40000\n",
        "Adam,26,USA,30000\n",
        "Katie,37,USA,NaN\n",
        "Nate,31,Canada,NaN\n",
        "Beth,33,USA,60000\n",
        "Scott,29,USA,50000\n",
        "Kelly,43,Australia,NaN\n",
        "Derek,27,USA,NaN\n",
        "Tina,38,USA,80000\n",
        "George,NaN,Canada,67000\n",
        "Jill,24,USA,45000\n",
        "Oscar,42,UK,NaN\n",
        "Maggie,NaN,USA,55000\n",
        "David,29,Australia,NaN\n",
        "Lena,36,USA,NaN\n",
        "Joe,33,Canada,48000\n",
        "Helen,40,USA,90000\n",
        "Fred,26,USA,NaN\n",
        "Kate,35,Australia,NaN\n",
        "\"\"\"\n",
        "\n",
        "# Define the file name and path\n",
        "file_name = \"input.csv\"\n",
        "file_path = os.path.join(os.getcwd(), file_name)\n",
        "\n",
        "# Open the file in write mode and write the content\n",
        "with open(file_path, \"w\") as f:\n",
        "    f.write(content)\n",
        "    \n",
        "print(f\"Content written to {file_name}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "r-RqWtJgSNZJ",
        "outputId": "55a0ec2d-ce7e-446c-faca-284b2c8edc01"
      },
      "execution_count": 45,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Content written to input.csv\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### This code creates a Python script that converts a CSV file to a Parquet file using Pandas and PyArrow, and saves it to disk.\n",
        "\n",
        "\n",
        "\n",
        ".\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "AsSNFdikT5U8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import the necessary modules\n",
        "import os\n",
        "\n",
        "# Define the content to be written\n",
        "content = \"\"\"\n",
        "import pandas as pd\n",
        "import pyarrow as pa\n",
        "import pyarrow.parquet as pq\n",
        "\n",
        "\n",
        "def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):\n",
        "    # Read CSV file into a Pandas DataFrame\n",
        "    df = pd.read_csv(input_file_path)\n",
        "\n",
        "    # Remove rows or columns with NaN fields based on the drop_option argument\n",
        "    if drop_option == 'row':\n",
        "        df = df.dropna()\n",
        "    elif drop_option == 'column':\n",
        "        df = df.dropna(axis=1)\n",
        "\n",
        "    # Convert Pandas DataFrame to PyArrow Table\n",
        "    table = pa.Table.from_pandas(df)\n",
        "\n",
        "    # Write PyArrow Table to Parquet file\n",
        "    pq.write_table(table, output_file_path)\n",
        "\n",
        "    # Open the parquet file\n",
        "    table = pq.read_table(output_file_path)\n",
        "\n",
        "    # Convert the table to a pandas dataframe\n",
        "    df = table.to_pandas()\n",
        "\n",
        "    # Print the dataframe\n",
        "    # print(df.head(100))\n",
        "    print(df.head(100).to_string(index=False))\n",
        "\n",
        "\n",
        "input_file_path = 'input.csv'\n",
        "output_file_path = 'output.parquet'\n",
        "drop_option = 'row'  # options: 'row' or 'column'\n",
        "\n",
        "convert_csv_to_parquet(input_file_path, output_file_path, drop_option)\n",
        "\n",
        "\"\"\"\n",
        "\n",
        "# Define the file name and path\n",
        "file_name = \"main.py\"\n",
        "file_path = os.path.join(os.getcwd(), file_name)\n",
        "\n",
        "# Open the file in write mode and write the content\n",
        "with open(file_path, \"w\") as f:\n",
        "    f.write(content)\n",
        "    \n",
        "print(f\"Content written to {file_name}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "joAhGR77wDgb",
        "outputId": "8eed13ff-87be-48cb-d8d2-7859e713c9d9"
      },
      "execution_count": 46,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Content written to main.py\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### This command runs the Python script \"main.py\".\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "1TCAvOLIXM85"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!python main.py "
      ],
      "metadata": {
        "id": "zKgfs49CO9SL",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "80aec2e4-59ad-44ac-9017-29217c2bd4b9"
      },
      "execution_count": 47,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            " Name  Age   Country  Salary\n",
            " John 30.0       USA 50000.0\n",
            "  Bob 25.0    Canada 40000.0\n",
            " Mike 40.0       USA 75000.0\n",
            " Mary 29.0       USA 55000.0\n",
            "Chris 31.0        UK 38000.0\n",
            "  Amy 24.0       USA 30000.0\n",
            "  Ben 36.0        UK 70000.0\n",
            " Matt 42.0       USA 80000.0\n",
            " Lisa 34.0       USA 55000.0\n",
            "  Max 39.0 Australia 60000.0\n",
            "Emily 32.0    Canada 47000.0\n",
            "  Sam 35.0       USA 95000.0\n",
            " Lucy 23.0        UK 40000.0\n",
            " Adam 26.0       USA 30000.0\n",
            " Beth 33.0       USA 60000.0\n",
            "Scott 29.0       USA 50000.0\n",
            " Tina 38.0       USA 80000.0\n",
            " Jill 24.0       USA 45000.0\n",
            "  Joe 33.0    Canada 48000.0\n",
            "Helen 40.0       USA 90000.0\n"
          ]
        }
      ]
    }
  ]
 }
	{
	"nbformat": 4,
	"nbformat_minor": 0,
	"metadata": {
	"colab": {
	"provenance": [],
	"authorship_tag": "ABX9TyNPS6kVEu83As1YwftRxh//",
	"include_colab_link": true
	},
	"kernelspec": {
	"name": "python3",
	"display_name": "Python 3"
	},
	"language_info": {
	"name": "python"
	},
	"gpuClass": "standard"
	},
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "view-in-github",
	"colab_type": "text"
	},
	"source": [
	"<a href=\"https://colab.research.google.com/gist/byteshiva/6045974ed2eea89076b8542b95c8bc61/data_cleaning_customer_data_analysis_csv_parquet.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"# The code performs the following tasks:\n",
	">\n",
	"This Python code converts a CSV file to Parquet format and removes either rows or columns with NaN values based on the specified drop_option parameter. Specifically, the code:\n",
	"\n",
	"* reads a CSV file into a Pandas DataFrame using the specified input file path\n",
	"* removes rows or columns with NaN values based on the specified drop_option parameter\n",
	"* converts the Pandas DataFrame to a PyArrow Table\n",
	"* writes the PyArrow Table to a Parquet file using the specified output file path\n",
	"* reads the Parquet file into a PyArrow Table\n",
	"* converts the PyArrow Table back to a Pandas DataFrame\n",
	"* prints the first 100 rows of the resulting Pandas DataFrame.\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "8rVrenQJVFBr"
	}
	},
	{
	"cell_type": "markdown",
	"source": [
	"\n",
	"\n",
	"---\n",
	"\n",
	"\n",
	"> ### !apt update is a command that updates the package lists on a Debian-based Linux distribution. It is important to run periodically to keep your system up-to-date.\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "mlr8hh9-Lt6t"
	}
	},
	{
	"cell_type": "code",
	"source": [
	"!apt update "
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "1BHJmN1xWzxf",
	"outputId": "7f1bb3c8-6333-43ce-dbc8-e6e32a760286"
	},
	"execution_count": 37,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"\u001b[33m\r0% [Working]\u001b[0m\r \rHit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease\n",
	"\u001b[33m\r0% [Connecting to archive.ubuntu.com] [Connecting to security.ubuntu.com (185.1\u001b[0m\r \rHit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease\n",
	"\u001b[33m\r0% [Connecting to archive.ubuntu.com (91.189.91.39)] [Connecting to security.ub\u001b[0m\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:3 http://archive.ubuntu.com/ubuntu focal InRelease\n",
	"\u001b[33m\r0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpad.net\u001b[0m\r \rHit:4 http://security.ubuntu.com/ubuntu focal-security InRelease\n",
	"Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease\n",
	"Hit:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease\n",
	"Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease\n",
	"Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease\n",
	"Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease\n",
	"Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease\n",
	"Hit:11 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease\n",
	"Reading package lists... Done\n",
	"Building dependency tree \n",
	"Reading state information... Done\n",
	"4 packages can be upgraded. Run 'apt list --upgradable' to see them.\n"
	]
	}
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"### !apt upgrade -y is a command that upgrades all packages to their latest versions without prompting for confirmation. It is recommended to run periodically to ensure system security and performance.\n",
	"\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "Js_5KnOmMQ-X"
	}
	},
	{
	"cell_type": "code",
	"execution_count": 38,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "U5IyR7d2K3hv",
	"outputId": "62124ffb-138d-4a4b-c0c4-9520700df740"
	},
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"Reading package lists... Done\n",
	"Building dependency tree \n",
	"Reading state information... Done\n",
	"Calculating upgrade... Done\n",
	"The following packages have been kept back:\n",
	" libcudnn8 libcudnn8-dev libnccl-dev libnccl2\n",
	"0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.\n"
	]
	}
	],
	"source": [
	"!apt upgrade -y"
	]
	},
	{
	"cell_type": "code",
	"source": [
	"cd /content/"
	],
	"metadata": {
	"id": "eDWweiikOnv_",
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"outputId": "a9536448-fa63-42e9-f3a8-7ef183a3d05b"
	},
	"execution_count": 39,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"/content\n"
	]
	}
	]
	},
	{
	"cell_type": "code",
	"source": [
	"![ -d csv_parq ] \|\| mkdir csv_parq"
	],
	"metadata": {
	"id": "R-8kovF_RvPa"
	},
	"execution_count": 40,
	"outputs": []
	},
	{
	"cell_type": "code",
	"source": [
	"cd csv_parq"
	],
	"metadata": {
	"id": "448SpnL3Oszg",
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"outputId": "d38c1ac9-a086-4b8b-a107-746545fae04f"
	},
	"execution_count": 41,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"/content/csv_parq\n"
	]
	}
	]
	},
	{
	"cell_type": "code",
	"source": [
	"!pwd"
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "jnVRdkxAOzu4",
	"outputId": "f9d7067a-f228-49ca-dbc7-3e7a9ab8f25f"
	},
	"execution_count": 42,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"/content/csv_parq\n"
	]
	}
	]
	},
	{
	"cell_type": "code",
	"source": [
	"cd /content/csv_parq"
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "JRqunWhkO0tZ",
	"outputId": "da5c5c22-95f6-4d07-e18e-a50c17141b8d"
	},
	"execution_count": 43,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"/content/csv_parq\n"
	]
	}
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"### This shell command installs pandas and pyarrow Python packages using pip.\n",
	"\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "r82UoG6YTsiL"
	}
	},
	{
	"cell_type": "code",
	"source": [
	"!pip install pandas pyarrow"
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "Rj97AsIQJc9q",
	"outputId": "107520b0-9459-464f-e542-241b919b9f04"
	},
	"execution_count": 44,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
	"Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)\n",
	"Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)\n",
	"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)\n",
	"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)\n",
	"Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)\n",
	"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n"
	]
	}
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"### This code writes a CSV file named \"input.csv\" in the current working directory with a predefined set of rows and columns, where some values are NaN, and then prints a message confirming the completion of writing the file.\n",
	"\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "vgF_w5SGSOBh"
	}
	},
	{
	"cell_type": "code",
	"source": [
	"# Import the necessary modules\n",
	"import os\n",
	"\n",
	"# Define the content to be written\n",
	"content = \"\"\"\n",
	"Name,Age,Country,Salary\n",
	"John,30,USA,50000\n",
	"Jane,35,UK,NaN\n",
	"Bob,25,Canada,40000\n",
	"Alice,NaN,USA,60000\n",
	"Mike,40,USA,75000\n",
	"Emma,27,UK,NaN\n",
	"Dave,33,Australia,NaN\n",
	"Mary,29,USA,55000\n",
	"Mark,45,Canada,NaN\n",
	"Sara,NaN,USA,90000\n",
	"Chris,31,UK,38000\n",
	"Steph,38,USA,NaN\n",
	"Kevin,26,USA,NaN\n",
	"Julia,37,Canada,NaN\n",
	"Tom,NaN,Australia,65000\n",
	"Amy,24,USA,30000\n",
	"Ben,36,UK,70000\n",
	"Laura,NaN,Canada,50000\n",
	"Matt,42,USA,80000\n",
	"Sophie,28,UK,NaN\n",
	"Lisa,34,USA,55000\n",
	"Max,39,Australia,60000\n",
	"Jenny,30,USA,NaN\n",
	"Greg,29,UK,NaN\n",
	"Emily,32,Canada,47000\n",
	"Ryan,27,USA,NaN\n",
	"Olivia,NaN,Australia,72000\n",
	"Sam,35,USA,95000\n",
	"Lucy,23,UK,40000\n",
	"Adam,26,USA,30000\n",
	"Katie,37,USA,NaN\n",
	"Nate,31,Canada,NaN\n",
	"Beth,33,USA,60000\n",
	"Scott,29,USA,50000\n",
	"Kelly,43,Australia,NaN\n",
	"Derek,27,USA,NaN\n",
	"Tina,38,USA,80000\n",
	"George,NaN,Canada,67000\n",
	"Jill,24,USA,45000\n",
	"Oscar,42,UK,NaN\n",
	"Maggie,NaN,USA,55000\n",
	"David,29,Australia,NaN\n",
	"Lena,36,USA,NaN\n",
	"Joe,33,Canada,48000\n",
	"Helen,40,USA,90000\n",
	"Fred,26,USA,NaN\n",
	"Kate,35,Australia,NaN\n",
	"\"\"\"\n",
	"\n",
	"# Define the file name and path\n",
	"file_name = \"input.csv\"\n",
	"file_path = os.path.join(os.getcwd(), file_name)\n",
	"\n",
	"# Open the file in write mode and write the content\n",
	"with open(file_path, \"w\") as f:\n",
	" f.write(content)\n",
	" \n",
	"print(f\"Content written to {file_name}\")\n"
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "r-RqWtJgSNZJ",
	"outputId": "55a0ec2d-ce7e-446c-faca-284b2c8edc01"
	},
	"execution_count": 45,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"Content written to input.csv\n"
	]
	}
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"### This code creates a Python script that converts a CSV file to a Parquet file using Pandas and PyArrow, and saves it to disk.\n",
	"\n",
	"\n",
	"\n",
	".\n",
	"\n",
	"\n",
	"\n",
	"\n",
	"\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "AsSNFdikT5U8"
	}
	},
	{
	"cell_type": "code",
	"source": [
	"# Import the necessary modules\n",
	"import os\n",
	"\n",
	"# Define the content to be written\n",
	"content = \"\"\"\n",
	"import pandas as pd\n",
	"import pyarrow as pa\n",
	"import pyarrow.parquet as pq\n",
	"\n",
	"\n",
	"def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):\n",
	" # Read CSV file into a Pandas DataFrame\n",
	" df = pd.read_csv(input_file_path)\n",
	"\n",
	" # Remove rows or columns with NaN fields based on the drop_option argument\n",
	" if drop_option == 'row':\n",
	" df = df.dropna()\n",
	" elif drop_option == 'column':\n",
	" df = df.dropna(axis=1)\n",
	"\n",
	" # Convert Pandas DataFrame to PyArrow Table\n",
	" table = pa.Table.from_pandas(df)\n",
	"\n",
	" # Write PyArrow Table to Parquet file\n",
	" pq.write_table(table, output_file_path)\n",
	"\n",
	" # Open the parquet file\n",
	" table = pq.read_table(output_file_path)\n",
	"\n",
	" # Convert the table to a pandas dataframe\n",
	" df = table.to_pandas()\n",
	"\n",
	" # Print the dataframe\n",
	" # print(df.head(100))\n",
	" print(df.head(100).to_string(index=False))\n",
	"\n",
	"\n",
	"input_file_path = 'input.csv'\n",
	"output_file_path = 'output.parquet'\n",
	"drop_option = 'row' # options: 'row' or 'column'\n",
	"\n",
	"convert_csv_to_parquet(input_file_path, output_file_path, drop_option)\n",
	"\n",
	"\"\"\"\n",
	"\n",
	"# Define the file name and path\n",
	"file_name = \"main.py\"\n",
	"file_path = os.path.join(os.getcwd(), file_name)\n",
	"\n",
	"# Open the file in write mode and write the content\n",
	"with open(file_path, \"w\") as f:\n",
	" f.write(content)\n",
	" \n",
	"print(f\"Content written to {file_name}\")\n"
	],
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "joAhGR77wDgb",
	"outputId": "8eed13ff-87be-48cb-d8d2-7859e713c9d9"
	},
	"execution_count": 46,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	"Content written to main.py\n"
	]
	}
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"### This command runs the Python script \"main.py\".\n",
	"\n",
	"\n",
	"\n"
	],
	"metadata": {
	"id": "1TCAvOLIXM85"
	}
	},
	{
	"cell_type": "code",
	"source": [
	"!python main.py "
	],
	"metadata": {
	"id": "zKgfs49CO9SL",
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"outputId": "80aec2e4-59ad-44ac-9017-29217c2bd4b9"
	},
	"execution_count": 47,
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": [
	" Name Age Country Salary\n",
	" John 30.0 USA 50000.0\n",
	" Bob 25.0 Canada 40000.0\n",
	" Mike 40.0 USA 75000.0\n",
	" Mary 29.0 USA 55000.0\n",
	"Chris 31.0 UK 38000.0\n",
	" Amy 24.0 USA 30000.0\n",
	" Ben 36.0 UK 70000.0\n",
	" Matt 42.0 USA 80000.0\n",
	" Lisa 34.0 USA 55000.0\n",
	" Max 39.0 Australia 60000.0\n",
	"Emily 32.0 Canada 47000.0\n",
	" Sam 35.0 USA 95000.0\n",
	" Lucy 23.0 UK 40000.0\n",
	" Adam 26.0 USA 30000.0\n",
	" Beth 33.0 USA 60000.0\n",
	"Scott 29.0 USA 50000.0\n",
	" Tina 38.0 USA 80000.0\n",
	" Jill 24.0 USA 45000.0\n",
	" Joe 33.0 Canada 48000.0\n",
	"Helen 40.0 USA 90000.0\n"
	]
	}
	]
	}
	]
	}