dgadiraju · December 17, 2020 12:11
diff --git a/06_csv_to_pandas_data_frame.ipynb b/06_csv_to_pandas_data_frame.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CSV to Pandas Data Frame\n",
    "\n",
    "Let us see how we can create **Pandas Data Frames** using data from files. `read_csv` is the most popular API to create a Data Frame by reading data from files.\n",
    "* Here are some of the important options.\n",
    "  * sep or delimiter\n",
    "  * header or names\n",
    "  * index_col\n",
    "  * dtype\n",
    "  * and many more\n",
    "* We have several other APIs which will facilitate us to create Data Frame\n",
    "  * read_fwf\n",
    "  * read_table\n",
    "  * pandas.io.json\n",
    "  * and more\n",
    "* Here is how we can create a Data Frame for orders dataset.\n",
    "  * Delimiter in our data is **,** which is default for Pandas `read_csv`.\n",
    "  * There is no Header and hence we have to set keyword argument `header` to None.\n",
    "  * We can pass the column names as a list using keyword argument `columns`.\n",
    "  * Data types of each column are typically inferred based on the data, however we can explicitly specify Data Types using `dtype`.\n",
    "  \n",
    "```{note}\n",
    "We will be running this notebook from other notebooks to create orders and order_items data frames while exploring Pandas libraries. \n",
    "\n",
    "Make sure you comment out all the informational lines, so that output is not printed when we invoke this notebook from other notebooks.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pd.read_csv?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# ls -ltr /data/retail_db/orders/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# tail /data/retail_db/orders/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# head /data/retail_db/orders/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders_path = \"/data/retail_db/orders/part-00000\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders_schema = [\n",
    "    \"order_id\",\n",
    "    \"order_date\",\n",
    "    \"order_customer_id\",\n",
    "    \"order_status\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders = pd.read_csv(orders_path,\n",
    "                     delimiter=',',\n",
    "                     header=None,\n",
    "                     names=orders_schema\n",
    "                    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# orders"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# orders.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "order_items_path = \"/data/retail_db/order_items/part-00000\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# ls -ltr /data/retail_db/order_items/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# tail /data/retail_db/order_items/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "# head /data/retail_db/order_items/part-00000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "order_items_schema = [\n",
    "    \"order_item_id\",\n",
    "    \"order_item_order_id\",\n",
    "    \"order_item_product_id\",\n",
    "    \"order_item_quantity\",\n",
    "    \"order_item_subtotal\",\n",
    "    \"order_item_product_price\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "order_items = pd.read_csv(order_items_path,\n",
    "                     delimiter=',',\n",
    "                     header=None,\n",
    "                     names=order_items_schema\n",
    "                    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# order_items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# order_items.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## CSV to Pandas Data Frame\n",
	"\n",
	"Let us see how we can create Pandas Data Frames using data from files. `read_csv` is the most popular API to create a Data Frame by reading data from files.\n",
	"* Here are some of the important options.\n",
	" * sep or delimiter\n",
	" * header or names\n",
	" * index_col\n",
	" * dtype\n",
	" * and many more\n",
	"* We have several other APIs which will facilitate us to create Data Frame\n",
	" * read_fwf\n",
	" * read_table\n",
	" * pandas.io.json\n",
	" * and more\n",
	"* Here is how we can create a Data Frame for orders dataset.\n",
	" * Delimiter in our data is , which is default for Pandas `read_csv`.\n",
	" * There is no Header and hence we have to set keyword argument `header` to None.\n",
	" * We can pass the column names as a list using keyword argument `columns`.\n",
	" * Data types of each column are typically inferred based on the data, however we can explicitly specify Data Types using `dtype`.\n",
	" \n",
	"```{note}\n",
	"We will be running this notebook from other notebooks to create orders and order_items data frames while exploring Pandas libraries. \n",
	"\n",
	"Make sure you comment out all the informational lines, so that output is not printed when we invoke this notebook from other notebooks.\n",
	"```"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import pandas as pd"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# pd.read_csv?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# ls -ltr /data/retail_db/orders/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# tail /data/retail_db/orders/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# head /data/retail_db/orders/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"orders_path = \"/data/retail_db/orders/part-00000\""
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"orders_schema = [\n",
	" \"order_id\",\n",
	" \"order_date\",\n",
	" \"order_customer_id\",\n",
	" \"order_status\"\n",
	"]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"orders = pd.read_csv(orders_path,\n",
	" delimiter=',',\n",
	" header=None,\n",
	" names=orders_schema\n",
	" )"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# orders"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# orders.head(10)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"order_items_path = \"/data/retail_db/order_items/part-00000\""
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# ls -ltr /data/retail_db/order_items/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# tail /data/retail_db/order_items/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%sh\n",
	"\n",
	"# head /data/retail_db/order_items/part-00000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"order_items_schema = [\n",
	" \"order_item_id\",\n",
	" \"order_item_order_id\",\n",
	" \"order_item_product_id\",\n",
	" \"order_item_quantity\",\n",
	" \"order_item_subtotal\",\n",
	" \"order_item_product_price\"\n",
	"]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"order_items = pd.read_csv(order_items_path,\n",
	" delimiter=',',\n",
	" header=None,\n",
	" names=order_items_schema\n",
	" )"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# order_items"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# order_items.head(10)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}