schocco · January 15, 2018 19:20
diff --git a/concat-stock-csvs.ipynb b/concat-stock-csvs.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing CSVs for import to Kafka\n",
    "The [kafka-connect-spooldir](https://github.com/jcustenborder/kafka-connect-spooldir) plugin can import the rows of a CSV file into a target topic.\n",
    "The Stock Market Dataset with historical daily prices can be obtained from [Kaggle](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)\n",
    "and contains one CSV file for each symbol.\n",
    "To simplify the import, we concatenate the per-symbol CSVs into a single file and add the symbol as a column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import glob\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "etf_files = glob.glob(\"ETFs/*.txt\")\n",
    "stock_files = glob.glob(\"Stocks/*.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
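    "# Matches paths of the form <folder>/<symbol>.<suffix>.txt; the dots are escaped so they match literally.\n",
    "SYMBOL_PATTERN = re.compile(r\"(?P<folder>\\w+)/(?P<symbol>[\\w-]+)\\.\\w+\\.txt\")\n",
    "\n",
    "def read_csv(path):\n",
    "    \"\"\"Read one per-symbol file and add the symbol parsed from its path as a column.\"\"\"\n",
    "    try:\n",
    "        df = pd.read_csv(path)\n",
    "        df['Symbol'] = SYMBOL_PATTERN.match(path).group(\"symbol\")\n",
    "        df['Date'] = pd.to_datetime(df['Date'])\n",
    "        return df\n",
    "    except pd.errors.EmptyDataError:\n",
    "        # Empty files contribute no rows.\n",
    "        return pd.DataFrame()\n",
    "\n",
    "def export_csv(files, name):\n",
    "    \"\"\"Concatenate all files, sort the rows by date, and write them to a single CSV.\"\"\"\n",
    "    dfs = (read_csv(f) for f in files)\n",
    "    dfs_concat = pd.concat(dfs, ignore_index=True).sort_values(by=['Date'])\n",
    "    dfs_concat.to_csv(name, date_format=\"%Y-%m-%d %H:%M:%S\", index=False, float_format=\"%.2f\")"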
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exporting dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_csv(etf_files, \"us-etfs.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_csv(stock_files, \"us-stocks.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
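The concatenation step in the notebook above can be tried without the Kaggle download. The following is a minimal sketch under stated assumptions, not the notebook's own code: the helper names (`symbol_from_path`, `read_symbol_csv`, `concat_symbol_csvs`), the file names, and the prices are all hypothetical, and the file-name layout `<symbol>.<suffix>.txt` is assumed from the regular expression used in the notebook.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

# Assumed file-name layout: <symbol>.<suffix>.txt, with literal dots.
SYMBOL_RE = re.compile(r"(?P<symbol>[\w-]+)\.\w+\.txt")


def symbol_from_path(path):
    """Return the symbol encoded in the file name, or None if the name does not match."""
    match = SYMBOL_RE.fullmatch(Path(path).name)
    return match.group("symbol") if match else None


def read_symbol_csv(path):
    """Read one per-symbol CSV and tag every row with its symbol."""
    try:
        df = pd.read_csv(path, parse_dates=["Date"])
    except pd.errors.EmptyDataError:
        # Zero-byte file: contribute no rows.
        return pd.DataFrame()
    df["Symbol"] = symbol_from_path(path)
    return df


def concat_symbol_csvs(paths):
    """Concatenate the per-symbol frames and order the rows by date."""
    frames = [read_symbol_csv(p) for p in paths]
    frames = [f for f in frames if not f.empty]
    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values(by="Date").reset_index(drop=True)


# Made-up sample files standing in for the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Stocks"
    folder.mkdir()
    (folder / "aaa.us.txt").write_text("Date,Close\n2018-01-02,10.5\n2018-01-04,11.0\n")
    (folder / "bbb.us.txt").write_text("Date,Close\n2018-01-03,20.0\n")
    (folder / "ccc.us.txt").write_text("")  # empty file, skipped
    combined = concat_symbol_csvs(sorted(folder.glob("*.txt")))

print(combined)
```

Unlike the notebook, this sketch matches against the file name only, so it does not depend on the path separator, and it drops empty frames before calling `pd.concat`.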