romicofre · February 1, 2021 01:24
diff --git a/Carga_datos_bigquery_Colab.ipynb b/Carga_datos_bigquery_Colab.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HeDxXFS8s5oq"
   },
   "source": [
    "# Cargando datos desde Google Colab a Bigquery\n",
    "\n",
    "Trabajar en Google Colab, puede ahorrarnos el tener que instalar Jupyter en nuestro computador, podemos compartir el trabajo facilmente y ahorramos recursos propios para usar lo que generosamente Google nos comparte.\n",
    "\n",
    "Pero cómo llevamos los datos a un entorno de producción como BigQuery, si tenemos que autenticarnos por ejemplo. Gracias (o no?) a que estamos usando todo el ecosistema de Google esto está muy fácil. \n",
    "\n",
    "Tenemos el siguiente ejemplo, que utiliza una base de datos de COVID-19 de la universidad de Johns Hopkins.\n",
    "\n",
    "https://github.com/CSSEGISandData/COVID-19\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qUKe5xA_vJsz"
   },
   "outputs": [],
   "source": [
    "# Utilizamos pandas para trabajar con los datos en CSV\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aAkj0w9h0-Oi"
   },
   "source": [
    "## Obtención de datos"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MNO-L9dmvZ08"
   },
   "source": [
    "Podemos utilizar el menú de la izquierda para cargar archivos, van a funcionar de la misma manera. Incluso podemos ejecutar un código que abrirá una ventana emergente para que quien lo ejecute realice la carga desde su computadora. Esta vez usaremos el truco de utilizar comandos de terminal con **!**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "HkEJT9ojxfdK",
    "outputId": "7ac5c0a0-9243-4a30-e171-9bd4fd3fd74c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'COVID-19'...\n",
      "remote: Enumerating objects: 6, done.\u001b[K\n",
      "remote: Counting objects: 100% (6/6), done.\u001b[K\n",
      "remote: Compressing objects: 100% (6/6), done.\u001b[K\n",
      "remote: Total 90072 (delta 0), reused 2 (delta 0), pack-reused 90066\u001b[K\n",
      "Receiving objects: 100% (90072/90072), 744.75 MiB | 26.76 MiB/s, done.\n",
      "Resolving deltas: 100% (61931/61931), done.\n",
      "Checking out files: 100% (898/898), done.\n"
     ]
    }
   ],
   "source": [
    "# Estamos utilizando un comando como si fuera la terminal y descargando tooodo el repositorio\n",
    "!git clone https://github.com/CSSEGISandData/COVID-19.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DnJpx2ncvait"
   },
   "outputs": [],
   "source": [
    "# Podemos leer sólo los archivos que nos interesan\n",
    "confirmed = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')\n",
    "deaths = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')\n",
    "recovered = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rYEFooxoye9B",
    "outputId": "ab031bd5-4ff6-4f0a-efc4-af8448ec8e14"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
      "       '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
      "       ...\n",
      "       '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
      "       '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
      "      dtype='object', length=379)\n",
      "(273, 379)\n"
     ]
    }
   ],
   "source": [
    "print(confirmed.columns)\n",
    "print(confirmed.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tqPVY3J5ywyi",
    "outputId": "e7f8d1aa-ff43-41bd-cd2e-87640d9942d8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
      "       '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
      "       ...\n",
      "       '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
      "       '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
      "      dtype='object', length=379)\n",
      "(273, 379)\n"
     ]
    }
   ],
   "source": [
    "print(deaths.columns)\n",
    "print(deaths.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "UUu8uj34y1FT",
    "outputId": "ac719ee1-0efc-42f9-bc31-da0ea7f5ae33"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
      "       '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
      "       ...\n",
      "       '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
      "       '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
      "      dtype='object', length=379)\n",
      "(258, 379)\n"
     ]
    }
   ],
   "source": [
    "print(recovered.columns)\n",
    "print(recovered.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "CSZDtQbE_UW9",
    "outputId": "e8d9c775-8e62-4056-9988-05998eea188c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method NDFrame.head of     Province/State      Country/Region        Lat  ...  1/28/21  1/29/21  1/30/21\n",
       "0              NaN         Afghanistan  33.939110  ...    54891    54939    55008\n",
       "1              NaN             Albania  41.153300  ...    75454    76350    77251\n",
       "2              NaN             Algeria  28.033900  ...   106610   106887   107122\n",
       "3              NaN             Andorra  42.506300  ...     9779     9837     9885\n",
       "4              NaN              Angola -11.202700  ...    19672    19723    19782\n",
       "..             ...                 ...        ...  ...      ...      ...      ...\n",
       "268            NaN             Vietnam  14.058324  ...     1651     1657     1767\n",
       "269            NaN  West Bank and Gaza  31.952200  ...   157593   158168   158559\n",
       "270            NaN               Yemen  15.552727  ...     2120     2120     2120\n",
       "271            NaN              Zambia -13.133897  ...    50319    51624    53352\n",
       "272            NaN            Zimbabwe -19.015438  ...    32646    32952    33273\n",
       "\n",
       "[273 rows x 379 columns]>"
      ]
     },
     "execution_count": 7,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "confirmed.head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ejLLuVLp0tDa"
   },
   "source": [
    "# Preparación de los datos\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dBfaj5ocwDEX"
   },
   "source": [
    "Tenemos archivos separados, en BigQuery por lo general vamos a querer tener la mayor información de un modelo en sólo una tabla, sin preocuparnos de la redundancia, o la cantidad de datos. Lo importante es que responda a lo que se quiere analizar, ya que su objetivo es analítico y basado en columnas.\n",
    "\n",
    "Para este caso, vamos a unir los datos que están separados en archivos, pero que se pueden asociar por fecha y lugar. Además de cambiar el tipo de algunas columnas para que podamos subirlos a BigQuery."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WDFnMagHy5Dg"
   },
   "outputs": [],
   "source": [
    "# Unpivot de las columnas fechas\n",
    "confirmed = pd.melt(confirmed, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_confirmed')\n",
    "deaths = pd.melt(deaths, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_deaths')\n",
    "recovered = pd.melt(recovered, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_recovered')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "asOfm1QZ8ojA",
    "outputId": "13e7c8db-c0ee-49bc-e835-c04ed0557f69"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method NDFrame.head of        Province/State      Country/Region  ...     date  qty_confirmed\n",
       "0                 NaN         Afghanistan  ...  1/22/20              0\n",
       "1                 NaN             Albania  ...  1/22/20              0\n",
       "2                 NaN             Algeria  ...  1/22/20              0\n",
       "3                 NaN             Andorra  ...  1/22/20              0\n",
       "4                 NaN              Angola  ...  1/22/20              0\n",
       "...               ...                 ...  ...      ...            ...\n",
       "102370            NaN             Vietnam  ...  1/30/21           1767\n",
       "102371            NaN  West Bank and Gaza  ...  1/30/21         158559\n",
       "102372            NaN               Yemen  ...  1/30/21           2120\n",
       "102373            NaN              Zambia  ...  1/30/21          53352\n",
       "102374            NaN            Zimbabwe  ...  1/30/21          33273\n",
       "\n",
       "[102375 rows x 6 columns]>"
      ]
     },
     "execution_count": 9,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "confirmed.head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0szUqIzWzWW2"
   },
   "outputs": [],
   "source": [
    "# Unir dataframes, para adjuntar columnas de cantidades\n",
    "new_df = confirmed.merge(deaths, how='outer', on=['Province/State', 'Country/Region', 'Lat', 'Long', 'date'])\n",
    "new_df = new_df.merge(recovered, how='outer', on=['Province/State', 'Country/Region', 'Lat', 'Long', 'date'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nVrp0Hue9uoq"
   },
   "outputs": [],
   "source": [
    "# Renombrar columnas, por ejemplo, en BigQuery no podemos usar \"/\" en el nombre de una tabla\n",
    "new_df = new_df.rename(columns = {'Province/State': 'province_state', 'Country/Region': 'country_region',\n",
    "                                  'Lat': 'lat', 'Long': 'long'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "PDY_Qzvb-MHi",
    "outputId": "e6e26f0a-41fc-4eea-c300-298b790e2dcd"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method NDFrame.sample of        province_state country_region  ...  qty_deaths  qty_recovered\n",
       "0                 NaN    Afghanistan  ...         0.0            0.0\n",
       "1                 NaN        Albania  ...         0.0            0.0\n",
       "2                 NaN        Algeria  ...         0.0            0.0\n",
       "3                 NaN        Andorra  ...         0.0            0.0\n",
       "4                 NaN         Angola  ...         0.0            0.0\n",
       "...               ...            ...  ...         ...            ...\n",
       "104620          Hebei          China  ...         NaN          868.0\n",
       "104621          Henan          China  ...         NaN         1278.0\n",
       "104622            NaN     Mozambique  ...         NaN        23945.0\n",
       "104623            NaN          Syria  ...         NaN         7481.0\n",
       "104624            NaN    Timor-Leste  ...         NaN           55.0\n",
       "\n",
       "[104625 rows x 8 columns]>"
      ]
     },
     "execution_count": 12,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_df.sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "TyPxSBI7Ghi6",
    "outputId": "a3d08d0c-0be1-4d3d-e525-de45c28e9192"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "province_state     object\n",
       "country_region     object\n",
       "lat               float64\n",
       "long              float64\n",
       "date               object\n",
       "qty_confirmed     float64\n",
       "qty_deaths        float64\n",
       "qty_recovered     float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tvEivVDtapmY"
   },
   "outputs": [],
   "source": [
    "# Formatear fecha, en BigQuery sólo es capaz de entender formatos de fecha separadas por \"-\"\n",
    "new_df['date'] = pd.to_datetime(new_df['date'], format='%m/%d/%y')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "g3KpcBErb5oD",
    "outputId": "09f99bf6-9482-4cae-a30d-ab3ddf1da396"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        2020-01-22\n",
       "1        2020-01-22\n",
       "2        2020-01-22\n",
       "3        2020-01-22\n",
       "4        2020-01-22\n",
       "            ...    \n",
       "104620   2021-01-30\n",
       "104621   2021-01-30\n",
       "104622   2021-01-30\n",
       "104623   2021-01-30\n",
       "104624   2021-01-30\n",
       "Name: date, Length: 104625, dtype: datetime64[ns]"
      ]
     },
     "execution_count": 45,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_df['date']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NVqO8WNb1OdA"
   },
   "source": [
    "# Carga a BigQuery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Al5nbfMd6cdv"
   },
   "source": [
    "Este entorno tiene todas las librerías necesarias para poder autentificarse con los servicios de Google. A pesar de ello, la versión de google.auth dio algunos problemas e instalaremos la versión 1.7.2, esto es necesario solo si tienen error en la librería."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "HEUpfuKQDLtY",
    "outputId": "4c595112-142e-477f-eea8-4ea0d164b36a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting google.auth==1.7.2\n",
      "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/ec/11/1d90cbfa72a084b08498e8cea1fee199bc965cdac391d241f5ae6257073e/google_auth-1.7.2-py2.py3-none-any.whl (74kB)\n",
      "\r",
      "\u001b[K     |████▍                           | 10kB 16.2MB/s eta 0:00:01\r",
      "\u001b[K     |████████▊                       | 20kB 12.7MB/s eta 0:00:01\r",
      "\u001b[K     |█████████████▏                  | 30kB 10.0MB/s eta 0:00:01\r",
      "\u001b[K     |█████████████████▌              | 40kB 7.6MB/s eta 0:00:01\r",
      "\u001b[K     |█████████████████████▉          | 51kB 4.4MB/s eta 0:00:01\r",
      "\u001b[K     |██████████████████████████▎     | 61kB 4.9MB/s eta 0:00:01\r",
      "\u001b[K     |██████████████████████████████▋ | 71kB 5.0MB/s eta 0:00:01\r",
      "\u001b[K     |████████████████████████████████| 81kB 3.7MB/s \n",
      "\u001b[?25hRequirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (0.2.8)\n",
      "Collecting rsa<4.1,>=3.1.4\n",
      "  Downloading https://files.pythonhosted.org/packages/02/e5/38518af393f7c214357079ce67a317307936896e961e35450b70fad2a9cf/rsa-4.0-py2.py3-none-any.whl\n",
      "Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (1.15.0)\n",
      "Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (51.3.3)\n",
      "Collecting cachetools<3.2,>=2.0.0\n",
      "  Downloading https://files.pythonhosted.org/packages/2f/a6/30b0a0bef12283e83e58c1d6e7b5aabc7acfc4110df81a4471655d33e704/cachetools-3.1.1-py2.py3-none-any.whl\n",
      "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google.auth==1.7.2) (0.4.8)\n",
      "\u001b[31mERROR: google-colab 1.0.0 has requirement google-auth~=1.17.2, but you'll have google-auth 1.7.2 which is incompatible.\u001b[0m\n",
      "Installing collected packages: rsa, cachetools, google.auth\n",
      "  Found existing installation: rsa 4.7\n",
      "    Uninstalling rsa-4.7:\n",
      "      Successfully uninstalled rsa-4.7\n",
      "  Found existing installation: cachetools 4.2.1\n",
      "    Uninstalling cachetools-4.2.1:\n",
      "      Successfully uninstalled cachetools-4.2.1\n",
      "Successfully installed cachetools-3.1.1 google.auth rsa-4.0\n"
     ]
    }
   ],
   "source": [
    "!pip install google.auth==1.7.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "QgbR1pnf_Gc8"
   },
   "outputs": [],
   "source": [
    "from google.colab import auth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9geIg2lwDHeQ"
   },
   "outputs": [],
   "source": [
    "auth.authenticate_user()\n",
    "print('Authenticated')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Em7iEz6I1Tpy"
   },
   "outputs": [],
   "source": [
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fqcgVA950Y7a"
   },
   "outputs": [],
   "source": [
    "# Construir el cliente de BigQuery.\n",
    "project_id = 'tareas-303400'  # cambiar\n",
    "client = bigquery.Client(project = project_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jPs8IRkLDz1x"
   },
   "outputs": [],
   "source": [
    "# Nombre tabla, utiliza un dataset que esté creado en tu proyecto.\n",
    "table_id = \"tareas-303400.dataset_test.covid_19\"\n",
    "\n",
    "# Esquema tabla\n",
    "table_schema = [\n",
    "                    bigquery.SchemaField(\"province_state\", \"STRING\"),\n",
    "                    bigquery.SchemaField(\"country_region\", \"STRING\"),\n",
    "                    bigquery.SchemaField(\"lat\", \"FLOAT\"),\n",
    "                    bigquery.SchemaField(\"long\", \"FLOAT\"),\n",
    "                    bigquery.SchemaField(\"date\", \"DATE\"),\n",
    "                    bigquery.SchemaField(\"qty_confirmed\", \"FLOAT\"),\n",
    "                    bigquery.SchemaField(\"qty_deaths\", \"FLOAT\"),\n",
    "                    bigquery.SchemaField(\"qty_recovered\", \"FLOAT\"),\n",
    "                ]\n",
    "\n",
    "# Trabajo de carga\n",
    "job_config = bigquery.LoadJobConfig(\n",
    "    schema=table_schema,\n",
    "    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE, # Reemplazara la tabla\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4YDBlXu_FmDw"
   },
   "outputs": [],
   "source": [
    "# Carga\n",
    "\n",
    "load_job = client.load_table_from_dataframe(\n",
    "    new_df, table_id, job_config=job_config\n",
    ") \n",
    "\n",
    "load_job.result()  # Espera que la carga termine\n",
    "destination_table = client.get_table(table_id)  # Revisando si la tabla existe\n",
    "print(\"Loaded {} rows.\".format(destination_table.num_rows))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fpRx7ShTZAy7"
   },
   "outputs": [],
   "source": [
    "# If errors :\n",
    "# load_job.errors"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Carga_datos_bigquery_Colab.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "HeDxXFS8s5oq"
	},
	"source": [
	"# Cargando datos desde Google Colab a Bigquery\n",
	"\n",
	"Trabajar en Google Colab, puede ahorrarnos el tener que instalar Jupyter en nuestro computador, podemos compartir el trabajo facilmente y ahorramos recursos propios para usar lo que generosamente Google nos comparte.\n",
	"\n",
	"Pero cómo llevamos los datos a un entorno de producción como BigQuery, si tenemos que autenticarnos por ejemplo. Gracias (o no?) a que estamos usando todo el ecosistema de Google esto está muy fácil. \n",
	"\n",
	"Tenemos el siguiente ejemplo, que utiliza una base de datos de COVID-19 de la universidad de Johns Hopkins.\n",
	"\n",
	"https://github.com/CSSEGISandData/COVID-19\n",
	"\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "qUKe5xA_vJsz"
	},
	"outputs": [],
	"source": [
	"# Utilizamos pandas para trabajar con los datos en CSV\n",
	"import pandas as pd"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "aAkj0w9h0-Oi"
	},
	"source": [
	"## Obtención de datos"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "MNO-L9dmvZ08"
	},
	"source": [
	"Podemos utilizar el menú de la izquierda para cargar archivos, van a funcionar de la misma manera. Incluso podemos ejecutar un código que abrirá una ventana emergente para que quien lo ejecute realice la carga desde su computadora. Esta vez usaremos el truco de utilizar comandos de terminal con !."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "HkEJT9ojxfdK",
	"outputId": "7ac5c0a0-9243-4a30-e171-9bd4fd3fd74c"
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Cloning into 'COVID-19'...\n",
	"remote: Enumerating objects: 6, done.\u001b[K\n",
	"remote: Counting objects: 100% (6/6), done.\u001b[K\n",
	"remote: Compressing objects: 100% (6/6), done.\u001b[K\n",
	"remote: Total 90072 (delta 0), reused 2 (delta 0), pack-reused 90066\u001b[K\n",
	"Receiving objects: 100% (90072/90072), 744.75 MiB \| 26.76 MiB/s, done.\n",
	"Resolving deltas: 100% (61931/61931), done.\n",
	"Checking out files: 100% (898/898), done.\n"
	]
	}
	],
	"source": [
	"# Estamos utilizando un comando como si fuera la terminal y descargando tooodo el repositorio\n",
	"!git clone https://github.com/CSSEGISandData/COVID-19.git"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "DnJpx2ncvait"
	},
	"outputs": [],
	"source": [
	"# Podemos leer sólo los archivos que nos interesan\n",
	"confirmed = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')\n",
	"deaths = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')\n",
	"recovered = pd.read_csv('/content/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "rYEFooxoye9B",
	"outputId": "ab031bd5-4ff6-4f0a-efc4-af8448ec8e14"
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
	" '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
	" ...\n",
	" '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
	" '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
	" dtype='object', length=379)\n",
	"(273, 379)\n"
	]
	}
	],
	"source": [
	"print(confirmed.columns)\n",
	"print(confirmed.shape)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "tqPVY3J5ywyi",
	"outputId": "e7f8d1aa-ff43-41bd-cd2e-87640d9942d8"
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
	" '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
	" ...\n",
	" '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
	" '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
	" dtype='object', length=379)\n",
	"(273, 379)\n"
	]
	}
	],
	"source": [
	"print(deaths.columns)\n",
	"print(deaths.shape)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "UUu8uj34y1FT",
	"outputId": "ac719ee1-0efc-42f9-bc31-da0ea7f5ae33"
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',\n",
	" '1/24/20', '1/25/20', '1/26/20', '1/27/20',\n",
	" ...\n",
	" '1/21/21', '1/22/21', '1/23/21', '1/24/21', '1/25/21', '1/26/21',\n",
	" '1/27/21', '1/28/21', '1/29/21', '1/30/21'],\n",
	" dtype='object', length=379)\n",
	"(258, 379)\n"
	]
	}
	],
	"source": [
	"print(recovered.columns)\n",
	"print(recovered.shape)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "CSZDtQbE_UW9",
	"outputId": "e8d9c775-8e62-4056-9988-05998eea188c"
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<bound method NDFrame.head of Province/State Country/Region Lat ... 1/28/21 1/29/21 1/30/21\n",
	"0 NaN Afghanistan 33.939110 ... 54891 54939 55008\n",
	"1 NaN Albania 41.153300 ... 75454 76350 77251\n",
	"2 NaN Algeria 28.033900 ... 106610 106887 107122\n",
	"3 NaN Andorra 42.506300 ... 9779 9837 9885\n",
	"4 NaN Angola -11.202700 ... 19672 19723 19782\n",
	".. ... ... ... ... ... ... ...\n",
	"268 NaN Vietnam 14.058324 ... 1651 1657 1767\n",
	"269 NaN West Bank and Gaza 31.952200 ... 157593 158168 158559\n",
	"270 NaN Yemen 15.552727 ... 2120 2120 2120\n",
	"271 NaN Zambia -13.133897 ... 50319 51624 53352\n",
	"272 NaN Zimbabwe -19.015438 ... 32646 32952 33273\n",
	"\n",
	"[273 rows x 379 columns]>"
	]
	},
	"execution_count": 7,
	"metadata": {
	"tags": []
	},
	"output_type": "execute_result"
	}
	],
	"source": [
	"confirmed.head"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "ejLLuVLp0tDa"
	},
	"source": [
	"# Preparación de los datos\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "dBfaj5ocwDEX"
	},
	"source": [
	"Tenemos archivos separados, en BigQuery por lo general vamos a querer tener la mayor información de un modelo en sólo una tabla, sin preocuparnos de la redundancia, o la cantidad de datos. Lo importante es que responda a lo que se quiere analizar, ya que su objetivo es analítico y basado en columnas.\n",
	"\n",
	"Para este caso, vamos a unir los datos que están separados en archivos, pero que se pueden asociar por fecha y lugar. Además de cambiar el tipo de algunas columnas para que podamos subirlos a BigQuery."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "WDFnMagHy5Dg"
	},
	"outputs": [],
	"source": [
	"# Unpivot de las columnas fechas\n",
	"confirmed = pd.melt(confirmed, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_confirmed')\n",
	"deaths = pd.melt(deaths, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_deaths')\n",
	"recovered = pd.melt(recovered, id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='date', value_name='qty_recovered')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "asOfm1QZ8ojA",
	"outputId": "13e7c8db-c0ee-49bc-e835-c04ed0557f69"
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<bound method NDFrame.head of Province/State Country/Region ... date qty_confirmed\n",
	"0 NaN Afghanistan ... 1/22/20 0\n",
	"1 NaN Albania ... 1/22/20 0\n",
	"2 NaN Algeria ... 1/22/20 0\n",
	"3 NaN Andorra ... 1/22/20 0\n",
	"4 NaN Angola ... 1/22/20 0\n",
	"... ... ... ... ... ...\n",
	"102370 NaN Vietnam ... 1/30/21 1767\n",
	"102371 NaN West Bank and Gaza ... 1/30/21 158559\n",
	"102372 NaN Yemen ... 1/30/21 2120\n",
	"102373 NaN Zambia ... 1/30/21 53352\n",
	"102374 NaN Zimbabwe ... 1/30/21 33273\n",
	"\n",
	"[102375 rows x 6 columns]>"
	]
	},
	"execution_count": 9,
	"metadata": {
	"tags": []
	},
	"output_type": "execute_result"
	}
	],
	"source": [
	"confirmed.head"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "0szUqIzWzWW2"
	},
	"outputs": [],
	"source": [
	"# Unir dataframes, para adjuntar columnas de cantidades\n",
	"new_df = confirmed.merge(deaths, how='outer', on=['Province/State', 'Country/Region', 'Lat', 'Long', 'date'])\n",
	"new_df = new_df.merge(recovered, how='outer', on=['Province/State', 'Country/Region', 'Lat', 'Long', 'date'])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "nVrp0Hue9uoq"
	},
	"outputs": [],
	"source": [
	"# Renombrar columnas, por ejemplo, en BigQuery no podemos usar \"/\" en el nombre de una tabla\n",
	"new_df = new_df.rename(columns = {'Province/State': 'province_state', 'Country/Region': 'country_region',\n",
	" 'Lat': 'lat', 'Long': 'long'})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "PDY_Qzvb-MHi",
	"outputId": "e6e26f0a-41fc-4eea-c300-298b790e2dcd"
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<bound method NDFrame.sample of province_state country_region ... qty_deaths qty_recovered\n",
	"0 NaN Afghanistan ... 0.0 0.0\n",
	"1 NaN Albania ... 0.0 0.0\n",
	"2 NaN Algeria ... 0.0 0.0\n",
	"3 NaN Andorra ... 0.0 0.0\n",
	"4 NaN Angola ... 0.0 0.0\n",
	"... ... ... ... ... ...\n",
	"104620 Hebei China ... NaN 868.0\n",
	"104621 Henan China ... NaN 1278.0\n",
	"104622 NaN Mozambique ... NaN 23945.0\n",
	"104623 NaN Syria ... NaN 7481.0\n",
	"104624 NaN Timor-Leste ... NaN 55.0\n",
	"\n",
	"[104625 rows x 8 columns]>"
	]
	},
	"execution_count": 12,
	"metadata": {
	"tags": []
	},
	"output_type": "execute_result"
	}
	],
	"source": [
	"new_df.sample"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "TyPxSBI7Ghi6",
	"outputId": "a3d08d0c-0be1-4d3d-e525-de45c28e9192"
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"province_state object\n",
	"country_region object\n",
	"lat float64\n",
	"long float64\n",
	"date object\n",
	"qty_confirmed float64\n",
	"qty_deaths float64\n",
	"qty_recovered float64\n",
	"dtype: object"
	]
	},
	"execution_count": 13,
	"metadata": {
	"tags": []
	},
	"output_type": "execute_result"
	}
	],
	"source": [
	"new_df.dtypes"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "tvEivVDtapmY"
	},
	"outputs": [],
	"source": [
	"# Formatear fecha, en BigQuery sólo es capaz de entender formatos de fecha separadas por \"-\"\n",
	"new_df['date'] = pd.to_datetime(new_df['date'], format='%m/%d/%y')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "g3KpcBErb5oD",
	"outputId": "09f99bf6-9482-4cae-a30d-ab3ddf1da396"
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0 2020-01-22\n",
	"1 2020-01-22\n",
	"2 2020-01-22\n",
	"3 2020-01-22\n",
	"4 2020-01-22\n",
	" ... \n",
	"104620 2021-01-30\n",
	"104621 2021-01-30\n",
	"104622 2021-01-30\n",
	"104623 2021-01-30\n",
	"104624 2021-01-30\n",
	"Name: date, Length: 104625, dtype: datetime64[ns]"
	]
	},
	"execution_count": 45,
	"metadata": {
	"tags": []
	},
	"output_type": "execute_result"
	}
	],
	"source": [
	"new_df['date']"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "NVqO8WNb1OdA"
	},
	"source": [
	"# Carga a BigQuery"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "Al5nbfMd6cdv"
	},
	"source": [
	"Este entorno tiene todas las librerías necesarias para poder autentificarse con los servicios de Google. A pesar de ello, la versión de google.auth dio algunos problemas e instalaremos la versión 1.7.2, esto es necesario solo si tienen error en la librería."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"colab": {
	"base_uri": "https://localhost:8080/"
	},
	"id": "HEUpfuKQDLtY",
	"outputId": "4c595112-142e-477f-eea8-4ea0d164b36a"
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Collecting google.auth==1.7.2\n",
	"\u001b[?25l Downloading https://files.pythonhosted.org/packages/ec/11/1d90cbfa72a084b08498e8cea1fee199bc965cdac391d241f5ae6257073e/google_auth-1.7.2-py2.py3-none-any.whl (74kB)\n",
	"\r",
	"\u001b[K \|████▍ \| 10kB 16.2MB/s eta 0:00:01\r",
	"\u001b[K \|████████▊ \| 20kB 12.7MB/s eta 0:00:01\r",
	"\u001b[K \|█████████████▏ \| 30kB 10.0MB/s eta 0:00:01\r",
	"\u001b[K \|█████████████████▌ \| 40kB 7.6MB/s eta 0:00:01\r",
	"\u001b[K \|█████████████████████▉ \| 51kB 4.4MB/s eta 0:00:01\r",
	"\u001b[K \|██████████████████████████▎ \| 61kB 4.9MB/s eta 0:00:01\r",
	"\u001b[K \|██████████████████████████████▋ \| 71kB 5.0MB/s eta 0:00:01\r",
	"\u001b[K \|████████████████████████████████\| 81kB 3.7MB/s \n",
	"\u001b[?25hRequirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (0.2.8)\n",
	"Collecting rsa<4.1,>=3.1.4\n",
	" Downloading https://files.pythonhosted.org/packages/02/e5/38518af393f7c214357079ce67a317307936896e961e35450b70fad2a9cf/rsa-4.0-py2.py3-none-any.whl\n",
	"Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (1.15.0)\n",
	"Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.6/dist-packages (from google.auth==1.7.2) (51.3.3)\n",
	"Collecting cachetools<3.2,>=2.0.0\n",
	" Downloading https://files.pythonhosted.org/packages/2f/a6/30b0a0bef12283e83e58c1d6e7b5aabc7acfc4110df81a4471655d33e704/cachetools-3.1.1-py2.py3-none-any.whl\n",
	"Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google.auth==1.7.2) (0.4.8)\n",
	"\u001b[31mERROR: google-colab 1.0.0 has requirement google-auth~=1.17.2, but you'll have google-auth 1.7.2 which is incompatible.\u001b[0m\n",
	"Installing collected packages: rsa, cachetools, google.auth\n",
	" Found existing installation: rsa 4.7\n",
	" Uninstalling rsa-4.7:\n",
	" Successfully uninstalled rsa-4.7\n",
	" Found existing installation: cachetools 4.2.1\n",
	" Uninstalling cachetools-4.2.1:\n",
	" Successfully uninstalled cachetools-4.2.1\n",
	"Successfully installed cachetools-3.1.1 google.auth rsa-4.0\n"
	]
	}
	],
	"source": [
	"!pip install google.auth==1.7.2"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {
	"id": "QgbR1pnf_Gc8"
	},
	"outputs": [],
	"source": [
	"from google.colab import auth"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "9geIg2lwDHeQ"
	},
	"outputs": [],
	"source": [
	"auth.authenticate_user()\n",
	"print('Authenticated')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "Em7iEz6I1Tpy"
	},
	"outputs": [],
	"source": [
	"from google.cloud import bigquery"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "fqcgVA950Y7a"
	},
	"outputs": [],
	"source": [
	"# Construir el cliente de BigQuery.\n",
	"project_id = 'tareas-303400' # cambiar\n",
	"client = bigquery.Client(project = project_id)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "jPs8IRkLDz1x"
	},
	"outputs": [],
	"source": [
	"# Nombre tabla, utiliza un dataset que esté creado en tu proyecto.\n",
	"table_id = \"tareas-303400.dataset_test.covid_19\"\n",
	"\n",
	"# Esquema tabla\n",
	"table_schema = [\n",
	" bigquery.SchemaField(\"province_state\", \"STRING\"),\n",
	" bigquery.SchemaField(\"country_region\", \"STRING\"),\n",
	" bigquery.SchemaField(\"lat\", \"FLOAT\"),\n",
	" bigquery.SchemaField(\"long\", \"FLOAT\"),\n",
	" bigquery.SchemaField(\"date\", \"DATE\"),\n",
	" bigquery.SchemaField(\"qty_confirmed\", \"FLOAT\"),\n",
	" bigquery.SchemaField(\"qty_deaths\", \"FLOAT\"),\n",
	" bigquery.SchemaField(\"qty_recovered\", \"FLOAT\"),\n",
	" ]\n",
	"\n",
	"# Trabajo de carga\n",
	"job_config = bigquery.LoadJobConfig(\n",
	" schema=table_schema,\n",
	" write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE, # Reemplazara la tabla\n",
	")\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "4YDBlXu_FmDw"
	},
	"outputs": [],
	"source": [
	"# Carga\n",
	"\n",
	"load_job = client.load_table_from_dataframe(\n",
	" new_df, table_id, job_config=job_config\n",
	") \n",
	"\n",
	"load_job.result() # Espera que la carga termine\n",
	"destination_table = client.get_table(table_id) # Revisando si la tabla existe\n",
	"print(\"Loaded {} rows.\".format(destination_table.num_rows))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"id": "fpRx7ShTZAy7"
	},
	"outputs": [],
	"source": [
	"# If errors :\n",
	"# load_job.errors"
	]
	}
	],
	"metadata": {
	"colab": {
	"collapsed_sections": [],
	"name": "Carga_datos_bigquery_Colab.ipynb",
	"provenance": []
	},
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 1
	}