url_has_any_extension benchmark
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "fd8192cf", | |
"metadata": {}, | |
"source": [ | |
"Two implementations of url_has_any_extension:\n", | |
"\n", | |
"* The one merged in https://github.com/scrapy/scrapy/pull/5450 (url_has_any_extension_27)\n", | |
"* The version used in Scrapy 2.6 (url_has_any_extension_26)\n", | |
"\n", | |
"The new implementation is more correct because it handles multi-part extensions like .tar.gz, which posixpath.splitext reduces to .gz." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "8f646b9b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import posixpath\n", | |
"\n", | |
"from scrapy.utils.url import parse_url\n", | |
"\n", | |
"def url_has_any_extension_27(url, extensions):\n", | |
" \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n", | |
" lowercase_path = parse_url(url).path.lower()\n", | |
" return any(lowercase_path.endswith(ext) for ext in extensions)\n", | |
"\n", | |
"\n", | |
"def url_has_any_extension_26(url, extensions):\n", | |
" \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n", | |
" return posixpath.splitext(parse_url(url).path)[1].lower() in extensions" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ccf338bd", | |
"metadata": {}, | |
"source": [ | |
"Let's use the extension list from Scrapy's link extractors; it's going to be by far the most common list of extensions used." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"id": "3fbcbd78", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from scrapy.linkextractors import IGNORED_EXTENSIONS\n", | |
"\n", | |
"# Extensions must start with \".\"; FilteringLinkExtractor does the same.\n", | |
"# We're using a set because LinkExtractor uses a set for deny_extensions.\n", | |
"extensions = {'.' + e for e in IGNORED_EXTENSIONS}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "b1b93a99", | |
"metadata": {}, | |
"source": [ | |
"Case 1: a URL whose extension is in the list.\n", | |
"\n", | |
"Note that Scrapy's link extractor calls parse_url before passing the URL to url_has_any_extension; we'll do the same." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"id": "305279c3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(True, True)" | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"url = parse_url(\"http://example.com/files/1ch124h1/video.mp4?hello\")\n", | |
"url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"id": "ebd8a37b", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1.14 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit url_has_any_extension_26(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"id": "60af1899", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"7.44 µs ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit url_has_any_extension_27(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "5ce4aa3d", | |
"metadata": {}, | |
"source": [ | |
"The new version is slower (7.44 µs vs 1.14 µs per call), but both are very fast. There is probably nothing to worry about.\n", | |
"\n", | |
"Case 2: a URL whose extension is not in the list." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"id": "f56ff91c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(False, False)" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"url = parse_url(\"http://example.com/files/1ch124h1/page.html?hello\")\n", | |
"url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"id": "9d16127d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1.12 µs ± 7.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit url_has_any_extension_26(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"id": "ad968d6d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"10.4 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit url_has_any_extension_27(url, extensions)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "43a62e75", | |
"metadata": {}, | |
"source": [ | |
"Again, the new version is slower (10.4 µs vs 1.12 µs per call), since any() has to test every extension before returning False, but both are very fast. There is probably nothing to worry about." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
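The notebook says the new implementation is more correct for extensions like .tar.gz, but it only benchmarks URLs where both versions agree. The sketch below shows a URL where they differ. It is an illustration, not the Scrapy code: it uses the standard library's `urlparse` as a stand-in for `scrapy.utils.url.parse_url`, and the function names `has_ext_splitext` and `has_ext_endswith` and the small extension set are made up for this example.

```python
import posixpath
from urllib.parse import urlparse  # stand-in for scrapy.utils.url.parse_url


def has_ext_splitext(url, extensions):
    """Scrapy 2.6 style: compare only the last suffix of the path."""
    return posixpath.splitext(urlparse(url).path)[1].lower() in extensions


def has_ext_endswith(url, extensions):
    """PR 5450 style: test the lowercased path against every extension."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in extensions)


# Hypothetical extension set; the notebook uses IGNORED_EXTENSIONS instead.
extensions = {".tar.gz", ".mp4"}
url = "http://example.com/files/archive.tar.gz?hello"

# splitext yields ".gz", which is not in the set, so the old check misses it.
print(has_ext_splitext(url, extensions))  # False
# endswith sees the full ".tar.gz" suffix.
print(has_ext_endswith(url, extensions))  # True
```

The same structure also suggests where the extra time goes: the old check is a single set lookup, while the new one calls `endswith` once per extension until one matches.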