Skip to content

Instantly share code, notes, and snippets.

@smola
Last active May 29, 2018 15:16
Show Gist options
  • Save smola/c5a190856eaa580bfcde372aeff2cb7f to your computer and use it in GitHub Desktop.
Save smola/c5a190856eaa580bfcde372aeff2cb7f to your computer and use it in GitHub Desktop.
Gitbase reference (DRAFT)
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up\n",
"\n",
"You can get this notebook up and running with Docker:\n",
"\n",
"```\n",
"$ docker run -d -v /path/to/your/repositories:/opt/repos --name gitbase srcd/gitbase:latest\n",
"$ docker run -it -p 8888:8888 --link gitbase jupyter/scipy-notebook\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting pymysql\n",
" Downloading PyMySQL-0.8.0-py2.py3-none-any.whl (83kB)\n",
"\u001b[K 100% |████████████████████████████████| 92kB 2.4MB/s ta 0:00:011\n",
"\u001b[?25hInstalling collected packages: pymysql\n",
"Successfully installed pymysql-0.8.0\n",
"\u001b[33mYou are using pip version 9.0.1, however version 9.0.3 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install pymysql"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pymysql\n",
"\n",
"conn = pymysql.connect(host='gitbase',\n",
" user='gitbase',\n",
" password='',\n",
" autocommit=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Show tables"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>table</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>blobs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>commits</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>refs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>remotes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>repositories</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>tree_entries</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" table\n",
"0 blobs\n",
"1 commits\n",
"2 refs\n",
"3 remotes\n",
"4 repositories\n",
"5 tree_entries"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SHOW TABLES;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Describe table"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>hash</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>author_name</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>author_email</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>author_when</td>\n",
" <td>TIMESTAMP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>committer_name</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>committer_email</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>committer_when</td>\n",
" <td>TIMESTAMP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>message</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>tree_hash</td>\n",
" <td>TEXT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name type\n",
"0 hash TEXT\n",
"1 author_name TEXT\n",
"2 author_email TEXT\n",
"3 author_when TIMESTAMP\n",
"4 committer_name TEXT\n",
"5 committer_email TEXT\n",
"6 committer_when TIMESTAMP\n",
"7 message TEXT\n",
"8 tree_hash TEXT"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"DESCRIBE TABLE commits;\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q1: Count repositories"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>repository_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>13</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" repository_count\n",
"0 13"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" COUNT(DISTINCT id) AS repository_count\n",
"FROM repositories;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q2: Last commit messages in HEAD for every repository"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>update engine-jupyter chart\\n\\n* renamed from ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>\\nMerge pull request #60 from dpordomingo/repl...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>\\nMerge pull request #67 from src-d/change-jwt...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>fix EXPOSE port\\n\\nSigned-off-by: Santiago M. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>engineering: add coding guides\\n\\nAdded guides...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>\\nMerge pull request #204 from src-d/2018-04-0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>\\nMerge pull request #204 from src-d/2018-04-0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>exclude generated\\n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>update queries\\n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>\\nMerge pull request #241 from jfontan/fix/can...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>\\nMerge pull request #16 from warenlg/master\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>\\nMerge pull request #201 from r0mainK/feature...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>\\nMerge pull request #201 from r0mainK/feature...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Merge remote-tracking branch 'origin/pr/221' i...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>scalastyle applied\\n</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" message\n",
"0 update engine-jupyter chart\\n\\n* renamed from ...\n",
"1 \\nMerge pull request #60 from dpordomingo/repl...\n",
"2 \\nMerge pull request #67 from src-d/change-jwt...\n",
"3 fix EXPOSE port\\n\\nSigned-off-by: Santiago M. ...\n",
"4 engineering: add coding guides\\n\\nAdded guides...\n",
"5 \\nMerge pull request #204 from src-d/2018-04-0...\n",
"6 \\nMerge pull request #204 from src-d/2018-04-0...\n",
"7 exclude generated\\n\n",
"8 update queries\\n\n",
"9 \\nMerge pull request #241 from jfontan/fix/can...\n",
"10 \\nMerge pull request #16 from warenlg/master\\n...\n",
"11 \\nMerge pull request #201 from r0mainK/feature...\n",
"12 \\nMerge pull request #201 from r0mainK/feature...\n",
"13 Merge remote-tracking branch 'origin/pr/221' i...\n",
"14 scalastyle applied\\n"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SELECT c.message\n",
"FROM\n",
" refs r\n",
" JOIN commits c ON r.hash = c.hash\n",
"WHERE\n",
" r.name = 'HEAD';\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q3: All commit messages in HEAD history for every repository"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>update engine-jupyter chart\\n\\n* renamed from ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>engineering: add coding guides\\n\\nAdded guides...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>\\nMerge pull request #67 from src-d/change-jwt...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>exclude generated\\n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>\\nMerge pull request #60 from dpordomingo/repl...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>update queries\\n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>fix EXPOSE port\\n\\nSigned-off-by: Santiago M. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>\\nMerge pull request #241 from jfontan/fix/can...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>\\nMerge pull request #201 from r0mainK/feature...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>\\nMerge pull request #204 from src-d/2018-04-0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>\\nMerge pull request #201 from r0mainK/feature...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>\\nMerge pull request #204 from src-d/2018-04-0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>\\nMerge pull request #16 from warenlg/master\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Merge remote-tracking branch 'origin/pr/221' i...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>scalastyle applied\\n</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" message\n",
"0 update engine-jupyter chart\\n\\n* renamed from ...\n",
"1 engineering: add coding guides\\n\\nAdded guides...\n",
"2 \\nMerge pull request #67 from src-d/change-jwt...\n",
"3 exclude generated\\n\n",
"4 \\nMerge pull request #60 from dpordomingo/repl...\n",
"5 update queries\\n\n",
"6 fix EXPOSE port\\n\\nSigned-off-by: Santiago M. ...\n",
"7 \\nMerge pull request #241 from jfontan/fix/can...\n",
"8 \\nMerge pull request #201 from r0mainK/feature...\n",
"9 \\nMerge pull request #204 from src-d/2018-04-0...\n",
"10 \\nMerge pull request #201 from r0mainK/feature...\n",
"11 \\nMerge pull request #204 from src-d/2018-04-0...\n",
"12 \\nMerge pull request #16 from warenlg/master\\n...\n",
"13 Merge remote-tracking branch 'origin/pr/221' i...\n",
"14 scalastyle applied\\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SELECT c.message\n",
"FROM\n",
" commits c\n",
" JOIN refs r ON r.hash = c.hash\n",
"WHERE\n",
" r.name = 'HEAD' AND\n",
" history_idx(r.hash, c.hash) >= 0;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q4: Top 10 repositories by commit count in HEAD"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>repository_id</th>\n",
" <th>commit_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>/opt/repos/ml</td>\n",
" <td>1228</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>/opt/repos/minutes</td>\n",
" <td>894</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>/opt/repos/guide</td>\n",
" <td>613</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>/opt/repos/engine</td>\n",
" <td>556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>/opt/repos/borges</td>\n",
" <td>353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>/opt/repos/gitbase</td>\n",
" <td>305</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>/opt/repos/gemini</td>\n",
" <td>167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>/opt/repos/infrastructure</td>\n",
" <td>141</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>/opt/repos/ci</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>/opt/repos/datasets</td>\n",
" <td>88</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" repository_id commit_count\n",
"0 /opt/repos/ml 1228\n",
"1 /opt/repos/minutes 894\n",
"2 /opt/repos/guide 613\n",
"3 /opt/repos/engine 556\n",
"4 /opt/repos/borges 353\n",
"5 /opt/repos/gitbase 305\n",
"6 /opt/repos/gemini 167\n",
"7 /opt/repos/infrastructure 141\n",
"8 /opt/repos/ci 95\n",
"9 /opt/repos/datasets 88"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" repository_id,\n",
" commit_count\n",
"FROM (\n",
" SELECT\n",
" r.repository_id,\n",
" count(*) AS commit_count\n",
" FROM\n",
" refs r\n",
" JOIN commits c ON history_idx(r.hash, c.hash) >= 0\n",
" WHERE\n",
" r.name = 'HEAD'\n",
" GROUP BY r.repository_id\n",
") AS q\n",
"ORDER BY commit_count DESC\n",
"LIMIT 10;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q5: Fork detection (approximate)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q6: Count repository HEAD"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" head_count\n",
"0 15"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" COUNT(DISTINCT r.repository_id) AS head_count\n",
"FROM\n",
" refs r\n",
"WHERE name = 'HEAD';\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q7: Count repository HEAD excluding forks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO\n",
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" COUNT(DISTINCT r.repository_id) AS head_count\n",
"FROM\n",
" refs r\n",
"WHERE name = 'HEAD';\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q8: Repository count by language presence (HEAD, no forks)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>language</th>\n",
" <th>repository_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Markdown</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td></td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>YAML</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Makefile</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Text</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Dockerfile</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Shell</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>INI</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>XML</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Python</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Go</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>JSON</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Protocol Buffer</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Jupyter Notebook</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Perl</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>reStructuredText</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Unix Assembly</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>HTML</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>Scala</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>SQL</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Smarty</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>TOML</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>C</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>CSV</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Java</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>JavaScript</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>CSS</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>SVG</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>Batchfile</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>CMake</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>HCL</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>DNS Zone</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>Emacs Lisp</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>Vim script</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>Yacc</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>ActionScript</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>RPM Spec</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>TypeScript</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Erlang</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>Gnuplot</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>Smalltalk</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>Gradle</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>Thrift</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>Ant Build System</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>SQLPL</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>D</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>OCaml</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>Pascal</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>TeX</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>Objective-C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>C++</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>PLSQL</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>Ruby</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>PowerShell</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>Maven POM</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>Haxe</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>Haskell</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>Lex</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>Swift</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>Lua</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>65 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" language repository_count\n",
"0 Markdown 12\n",
"1 11\n",
"2 YAML 10\n",
"3 Makefile 10\n",
"4 Text 9\n",
"5 Dockerfile 9\n",
"6 Shell 7\n",
"7 INI 6\n",
"8 XML 5\n",
"9 Python 5\n",
"10 Go 5\n",
"11 JSON 4\n",
"12 Protocol Buffer 4\n",
"13 Jupyter Notebook 3\n",
"14 Perl 3\n",
"15 reStructuredText 3\n",
"16 Unix Assembly 3\n",
"17 HTML 3\n",
"18 Scala 3\n",
"19 SQL 3\n",
"20 Smarty 3\n",
"21 TOML 3\n",
"22 C 3\n",
"23 CSV 2\n",
"24 Java 2\n",
"25 JavaScript 2\n",
"26 CSS 2\n",
"27 SVG 2\n",
"28 Batchfile 2\n",
"29 CMake 1\n",
".. ... ...\n",
"35 HCL 1\n",
"36 DNS Zone 1\n",
"37 Emacs Lisp 1\n",
"38 Vim script 1\n",
"39 Yacc 1\n",
"40 ActionScript 1\n",
"41 RPM Spec 1\n",
"42 TypeScript 1\n",
"43 Erlang 1\n",
"44 Gnuplot 1\n",
"45 Smalltalk 1\n",
"46 Gradle 1\n",
"47 Thrift 1\n",
"48 Ant Build System 1\n",
"49 SQLPL 1\n",
"50 D 1\n",
"51 OCaml 1\n",
"52 Pascal 1\n",
"53 TeX 1\n",
"54 Objective-C 1\n",
"55 C++ 1\n",
"56 PLSQL 1\n",
"57 Ruby 1\n",
"58 PowerShell 1\n",
"59 Maven POM 1\n",
"60 Haxe 1\n",
"61 Haskell 1\n",
"62 Lex 1\n",
"63 Swift 1\n",
"64 Lua 1\n",
"\n",
"[65 rows x 2 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TODO: add fork filtering\n",
"pd.read_sql(\"\"\"\n",
"SELECT *\n",
"FROM (\n",
" SELECT\n",
" language,\n",
" COUNT(repository_id) AS repository_count\n",
" FROM (\n",
" SELECT DISTINCT\n",
" r.repository_id AS repository_id,\n",
" language(t.name, b.content) AS language\n",
" FROM\n",
" refs r\n",
" JOIN commits c ON r.hash = c.hash\n",
" JOIN tree_entries t ON commit_has_tree(c.hash, t.tree_hash)\n",
" JOIN blobs b ON t.entry_hash = b.hash\n",
" WHERE\n",
" r.name = 'HEAD'\n",
" ) AS q1\n",
" GROUP BY language\n",
") AS q2\n",
"ORDER BY repository_count DESC;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q9: Top 50 tokens in files by repository count (HEAD, no forks)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: need EXPLODE and SPLIT\n",
"# TODO: add fork filtering\n",
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" token,\n",
" COUNT(DISTINCT repository_id) AS repository_count\n",
"FROM (\n",
" SELECT\n",
" b.repository_id,\n",
" EXPLODE(SPLIT(b.content, '[^_A-Za-z0-9]+')) AS token\n",
" FROM\n",
" blobs b\n",
" JOIN repository_ids_no_forks r ON\n",
" b.repository_id = r.repository_id\n",
" WHERE\n",
" reference_name = 'refs/heads/HEAD' AND\n",
" is_binary = false AND\n",
" LENGTH(content) < 524288\n",
") AS q\n",
"GROUP BY token\n",
"ORDER BY repository_count DESC\n",
"LIMIT 50;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q10: Top 50 identifiers in files by repository count (HEAD, no forks)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: need EXPLODE and SPLIT\n",
"# TODO: add fork filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q11: Top 10 repositories by contributor count (all branches)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>repository_id</th>\n",
" <th>contributor_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>/opt/repos/engine</td>\n",
" <td>90420</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>/opt/repos/minutes</td>\n",
" <td>33243</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>/opt/repos/gitbase</td>\n",
" <td>23645</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>/opt/repos/guide</td>\n",
" <td>20520</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>/opt/repos/ml</td>\n",
" <td>11330</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>/opt/repos/borges</td>\n",
" <td>9359</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>/opt/repos/gemini</td>\n",
" <td>820</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>/opt/repos/charts</td>\n",
" <td>617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>/opt/repos/infrastructure</td>\n",
" <td>572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>/opt/repos/datasets</td>\n",
" <td>492</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" repository_id contributor_count\n",
"0 /opt/repos/engine 90420\n",
"1 /opt/repos/minutes 33243\n",
"2 /opt/repos/gitbase 23645\n",
"3 /opt/repos/guide 20520\n",
"4 /opt/repos/ml 11330\n",
"5 /opt/repos/borges 9359\n",
"6 /opt/repos/gemini 820\n",
"7 /opt/repos/charts 617\n",
"8 /opt/repos/infrastructure 572\n",
"9 /opt/repos/datasets 492"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TODO: incorrect results\n",
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" repository_id,\n",
" contributor_count\n",
"FROM (\n",
" SELECT\n",
" repository_id,\n",
" COUNT(DISTINCT c.author_email) AS contributor_count\n",
" FROM\n",
" refs r\n",
" JOIN commits c ON history_idx(r.hash, c.hash) >= 0\n",
" GROUP BY repository_id\n",
") AS q\n",
"ORDER BY contributor_count DESC\n",
"LIMIT 10;\n",
"\"\"\", con=conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q12: Created projects per year"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#TODO: get first commit correctly\n",
"pd.read_sql(\"\"\"\n",
"SELECT\n",
" year,\n",
" COUNT(DISTINCT hash) AS project_count\n",
"FROM (\n",
" SELECT\n",
" hash,\n",
" YEAR(author_date) AS year\n",
" FROM\n",
" refs r\n",
" JOIN commits c ON history_idx\n",
" WHERE\n",
" index = 0 AND\n",
" reference_name = 'refs/heads/HEAD'\n",
") AS q\n",
"GROUP BY year\n",
"ORDER BY year DESC;\n",
"\"\"\", con=conn)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment