Last active
January 18, 2020 04:20
Analyze block lists changes
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Analyze lists changes from [PyFunceble](https://github.com/funilrys/PyFunceble)'s repo's git history" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"> ### Jupyter help:\n", | |
"> \n", | |
"> Outline of some basics:\n", | |
"> \n", | |
"> * [Notebook Basics](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Notebook%20Basics.ipynb)\n", | |
"> * [IPython - beyond plain python](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Beyond%20Plain%20Python.ipynb)\n", | |
"> * [Markdown Cells](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Working%20With%20Markdown%20Cells.ipynb)\n", | |
"> * [Rich Display System](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Rich%20Output.ipynb)\n", | |
"> * [Custom Display logic](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb)\n", | |
"> * [Running a Secure Public Notebook Server](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Running%20the%20Notebook%20Server.ipynb#Securing-the-notebook-server)\n", | |
"> * [How Jupyter works](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Multiple%20Languages%2C%20Frontends.ipynb) to run code in different languages." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Sources:\n", | |
"### [Ultimate Hosts Blacklist](https://github.com/Ultimate-Hosts-Blacklist)\n", | |
"Repositories for testing lists\n", | |
"\n", | |
"### [Dead Host](https://github.com/dead-hosts)\n", | |
"Repositories for testing [PyFunceble](https://github.com/funilrys/PyFunceble)\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Define whitch list (repo) will be analyzed (e.g. [Ads_Disconnect.me](https://github.com/Ultimate-Hosts-Blacklist/Ads_Disconnect.me))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"repo_url = \"https://github.com/Ultimate-Hosts-Blacklist/yoyo.org_domains\"\n", | |
"repo_dir = repo_url.split('/')[-1]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Download the latest version" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Fetching origin\n", | |
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\n" | |
] | |
} | |
], | |
"source": [ | |
"import os\n", | |
"if not os.path.exists(repo_dir):\n", | |
" !git clone $repo_url $repo_dir\n", | |
" os.chdir(repo_dir)\n", | |
"else:\n", | |
" os.chdir(repo_dir)\n", | |
" #!git pull origin master\n", | |
" !git fetch --all\n", | |
" !git reset --hard origin/master" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Get the interesting commits\n", | |
"(Not only final results commited, the state is autosaved periodically (every 15 minutes), see: [PyFunceble/auto_save.py](https://github.com/funilrys/PyFunceble/blob/master/PyFunceble/auto_save.py))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"git_log_format_param = \"--format=format:\\\"%H %T %at\\\"\"\n", | |
"def get_git_commits(git_log_command):\n", | |
" !git reset --hard origin/master\n", | |
" git_log = !{git_log_command}\n", | |
" return [{'commit_hash':commit[0],'tree_hash':commit[1],'timestamp':commit[2]} for commit in [ line.split(' ') for line in git_log] ]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\r\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"190" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"result_commits = get_git_commits(\"git log --grep=\\\" \\\\[ci skip\\\\]\\\" \"+git_log_format_param)\n", | |
"len(result_commits)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Go back to the interesting commits, and get the status there" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check if the process finished \n", | |
"[continue.json](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/output/continue.json)\n", | |
"[info.json](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/info.json)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import json\n", | |
"\n", | |
"def check_not_in_continue():\n", | |
" with open('output/continue.json') as fd:\n", | |
" for list_file,status in json.load(fd).items():\n", | |
" if sum([count for s,count in status.items()]) != 0:\n", | |
" print('Warning: continue status not 0!',list_file,status)\n", | |
" !git log -n 1\n", | |
"\n", | |
"def check_test_finished():\n", | |
" with open('info.json') as fd:\n", | |
" for key,value in json.load(fd).items():\n", | |
" if key == \"currently_under_test\":\n", | |
" if value=='0':\n", | |
" return True\n", | |
" else:\n", | |
" print(\"Warning! Test not finished! \"+value)\n", | |
" !git log -n 1\n", | |
" return False\n", | |
" print(\"Warning! 'currently_under_test' not found in 'info.json'\")\n", | |
" return False" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"helper for parse [percentage.txt](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/output/logs/percentage/percentage.txt)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import re\n", | |
"def parse_percentage():\n", | |
" percentage = {}\n", | |
" with open('output/logs/percentage/percentage.txt') as fd:\n", | |
" for line in fd:\n", | |
" m = re.search(r'(?P<status>ACTIVE|INACTIVE|INVALID)\\s*(?P<percentage>\\d*)%\\s*(?P<numbers>\\d*)',line)\n", | |
" if not m:\n", | |
" continue\n", | |
" #percentage[ m.group('status') ] = m.groupdict()\n", | |
" percentage[ m.group('status') ] = {}\n", | |
" percentage[ m.group('status') ]['percentage'] = int(m.group('percentage'))\n", | |
" percentage[ m.group('status') ]['numbers'] = int(m.group('numbers'))\n", | |
" percentage['SUM_NUMBERS'] = sum([values['numbers'] for status,values in percentage.items()])\n", | |
" return percentage\n", | |
"# parse_percentage()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"loop throught the commits" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
".............................................................................................................................................................................................." | |
] | |
} | |
], | |
"source": [ | |
"percentages = {}\n", | |
"for commit in result_commits:\n", | |
" print('.', end='', flush=True)\n", | |
" !git checkout -q {commit['commit_hash']} # bring back the repo to that commit\n", | |
" #check_not_in_continue() # safety, but how cares\n", | |
" #check_test_finished() # safety, but how cares\n", | |
" percentages[ commit['timestamp'] ] = parse_percentage()\n", | |
"# print(percentages)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'ACTIVE': {'percentage': 86, 'numbers': 4267},\n", | |
" 'INACTIVE': {'percentage': 13, 'numbers': 642},\n", | |
" 'INVALID': {'percentage': 0, 'numbers': 19},\n", | |
" 'SUM_NUMBERS': 4928}" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"percentages[list(percentages.keys())[0]]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"---\n", | |
"#### Get the original lists changes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\r\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"80" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"domains_change_commits = get_git_commits(\"git log \"+git_log_format_param+\" domains.list\")\n", | |
"len(domains_change_commits)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"................................................................................" | |
] | |
} | |
], | |
"source": [ | |
"domains_counts = {}\n", | |
"for commit in domains_change_commits:\n", | |
" print('.', end='', flush=True)\n", | |
" !git checkout -q {commit['commit_hash']} # bring back the repo to that commit\n", | |
" domains_counts[ commit['timestamp'] ] = sum(1 for line in open('domains.list'))\n", | |
"# print(domains_counts)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"---\n", | |
"#### Prepare the datas for plotting" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import datetime\n", | |
"percentage_dates = [datetime.datetime.fromtimestamp(int(timestamp)) for timestamp,values in percentages.items() ]\n", | |
"active_percentage = [ value['ACTIVE']['percentage'] for timestamp,value in percentages.items() ]\n", | |
"active_numbers = [ value['ACTIVE']['numbers'] for timestamp,value in percentages.items() ]\n", | |
"inactive_percentage = [ value['INACTIVE']['percentage'] for timestamp,value in percentages.items() ]\n", | |
"inactive_numbers = [ value['INACTIVE']['numbers'] for timestamp,value in percentages.items() ]\n", | |
"invalid_percentage = [ value['INVALID']['percentage'] for timestamp,value in percentages.items() ]\n", | |
"invalid_numbers = [ value['INVALID']['numbers'] for timestamp,value in percentages.items() ]\n", | |
"sum_numbers = [ value['SUM_NUMBERS'] for timestamp,value in percentages.items() ]\n", | |
"\n", | |
"domain_count_dates = [datetime.datetime.fromtimestamp(int(timestamp)) for timestamp,value in domains_counts.items() ]\n", | |
"domain_count_values = [value for timestamp,value in domains_counts.items() ]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"try:\n", | |
" import matplotlib.pyplot as plt\n", | |
"except ImportError:\n", | |
" !pip install matplotlib\n", | |
" import matplotlib.pyplot as plt\n", | |
"import numpy as np\n", | |
"import matplotlib.dates as mdates\n", | |
"import matplotlib.ticker as ticker" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1332x756 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"%matplotlib inline\n", | |
"# or interactive:\n", | |
"# %matplotlib notebook\n", | |
"\n", | |
"# https://matplotlib.org/api/markers_api.html\n", | |
"# https://matplotlib.org/gallery/lines_bars_and_markers/line_styles_reference.html\n", | |
"\n", | |
"plt.plot( percentage_dates, active_numbers, label=\"active\" )\n", | |
"plt.plot( percentage_dates, inactive_numbers, label=\"inactive\" )\n", | |
"plt.plot( percentage_dates, invalid_numbers, label=\"invalid\" )\n", | |
"plt.plot( percentage_dates, sum_numbers, label=\"sum\" )\n", | |
"\n", | |
"plt.plot( domain_count_dates, domain_count_values, label=\"domains number (changes)\", linestyle=':', marker='s' )\n", | |
"\n", | |
"plt.title(repo_dir)\n", | |
"\n", | |
"plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y.%m.%d'))\n", | |
"plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(20))\n", | |
"plt.xticks( rotation=45 )\n", | |
"\n", | |
"# plt.xlabel(\"Date\")\n", | |
"plt.ylabel(\"Count\")\n", | |
"\n", | |
"plt.legend()\n", | |
"\n", | |
"plt.gcf().set_size_inches(18.5, 10.5, forward=True)\n", | |
"plt.gcf().savefig('../'+repo_dir+'.png', dpi=100)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
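A note on the `parse_percentage` helper in the notebook above: its regular expression can be tried without checking out any commit. The sketch below is a minimal standalone variant that parses a string instead of a file. The sample text is hypothetical (shaped only to satisfy the notebook's regex, reusing the counts shown in the notebook's example output), and the names `parse_percentage_text`, `LINE_RE` and `SAMPLE` are assumptions for illustration, not part of the notebook.

```python
import re

# Same alternation as the notebook's regex, but with \d+ so that an
# empty match can never be passed to int().
LINE_RE = re.compile(
    r'(?P<status>ACTIVE|INACTIVE|INVALID)\s*(?P<percentage>\d+)%\s*(?P<numbers>\d+)'
)


def parse_percentage_text(text):
    """Parse status lines from a percentage.txt-like string (hypothetical helper)."""
    percentage = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip headers, separators and blank lines
        percentage[m.group('status')] = {
            'percentage': int(m.group('percentage')),
            'numbers': int(m.group('numbers')),
        }
    # The sum is fully evaluated before the SUM_NUMBERS key is added.
    percentage['SUM_NUMBERS'] = sum(v['numbers'] for v in percentage.values())
    return percentage


# Hypothetical sample, not copied from a real percentage.txt
SAMPLE = """\
Status      Percentage   Numbers
ACTIVE      86%          4267
INACTIVE    13%          642
INVALID     0%           19
"""

result = parse_percentage_text(SAMPLE)
print(result)
```

Working on a string keeps the parsing logic separate from the git checkout loop, so the regex can be checked against new report layouts before running it over the full history.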