from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml.html import fromstring
import re
import requests
import time

def Timer():
    # Generator: each next() yields the seconds elapsed since the previous next().
    a = time.time()
    while True:
        c = time.time()
        yield time.time()-a
        a = c

timer = Timer()
url = "http://www.python.org/"
html = requests.get(url).text
num = 100000

print('\n==== Total trials: %s =====' % num)
next(timer)  # discard the setup time so the first measurement starts here

# BeautifulSoup (lxml backend); parse time is included in the measurement
soup = bs(html, 'lxml')
for x in range(num):
    paragraphs = soup.findAll('p')
t = next(timer)
print('bs4 total time: %.1f' % t)

# PyQuery
d = pq(html)
for x in range(num):
    paragraphs = d('p')
t = next(timer)
print('pq total time: %.1f' % t)

# lxml with a CSS selector
tree = fromstring(html)
for x in range(num):
    paragraphs = tree.cssselect('p')
t = next(timer)
print('lxml (cssselect) total time: %.1f' % t)

# lxml with XPath
tree = fromstring(html)
for x in range(num):
    paragraphs = tree.xpath('.//p')
t = next(timer)
print('lxml (xpath) total time: %.1f' % t)

# Regex: only matches a bare <p> tag, so paragraphs with attributes are missed
for x in range(num):
    paragraphs = re.findall('<[p ]>.*?</p>', html)
t = next(timer)
print("regex total time: %.1f (doesn't find all p)\n" % t)
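The regex line is labelled "doesn't find all p" without saying why. A small sketch, using a made-up HTML string rather than the python.org page, suggests the cause: `<[p ]>` only matches a `<` followed by exactly one `p` or space and then `>`, so any `<p ...>` tag carrying attributes is skipped. The `looser` pattern is an illustrative assumption, not part of the original benchmark, and is still no substitute for a real parser.

```python
import re

# Hypothetical sample markup, not taken from python.org.
sample = '<p>plain</p><p class="intro">with attribute</p><p >trailing space</p>'

# Pattern from the benchmark: "<", one character that is "p" or " ", then ">".
original = re.findall('<[p ]>.*?</p>', sample)

# Assumed alternative that tolerates attributes and whitespace inside the tag.
looser = re.findall(r'<p\b[^>]*>.*?</p>', sample, flags=re.DOTALL)

print(len(original), len(looser))
```

On this sample the benchmark pattern finds only the bare `<p>` paragraph, while the looser one finds all three, so the regex timing is measuring less work than the parser-based rows.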
Results using Python 3.6.5
==== Total trials: 100000 =====
bs4 total time: 325.9 (Not sure why this happened, but it's a record in slowness)
pq total time: 8.9
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 3.5
regex total time: 8.5 (doesn't find all p)
==== Total trials: 100000 =====
bs4 total time: 93.2
pq total time: 7.4
lxml (cssselect) total time: 7.7
lxml (xpath) total time: 5.5
regex total time: 16.4 (doesn't find all p)
In Python 3.7.1:
==== Total trials: 100000 =====
bs4 total time: 69.6
pq total time: 10.1
lxml (cssselect) total time: 9.6
lxml (xpath) total time: 6.3
regex total time: 13.6 (doesn't find all p)
Results using python 3.7.3
==== Total trials: 100000 =====
bs4 total time: 94.1
pq total time: 9.5
lxml (cssselect) total time: 8.6
lxml (xpath) total time: 5.9
regex total time: 12.9 (doesn't find all p)
I tried selectolax, and in this case it is about 2 times faster than lxml (xpath):
https://rushter.com/blog/python-fast-html-parser/
from selectolax.parser import HTMLParser
tree = HTMLParser(html)
for x in range(num):
    paragraphs = tree.css('p')
t = next(timer)
print('selectolax total time: %.1f' % t)
==== Total trials: 100000 =====
bs4 total time: 95.4
pq total time: 10.9
lxml (cssselect) total time: 10.0
lxml (xpath) total time: 6.4
regex total time: 14.4 (doesn't find all p)
selectolax total time: 3.4
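To make the "2 times faster" comparison concrete, here is a rough sketch that divides each timing from the run above by the selectolax timing. The numbers are copied from that results block; the variable names are just for illustration, and a single run on one machine is not a rigorous benchmark.

```python
# Timings in seconds copied from the results block above (100000 trials each).
timings = {
    "bs4": 95.4,
    "pq": 10.9,
    "lxml (cssselect)": 10.0,
    "lxml (xpath)": 6.4,
    "regex": 14.4,
    "selectolax": 3.4,
}

baseline = timings["selectolax"]

# How many times slower each approach was than selectolax in this run.
ratios = {name: round(t / baseline, 1) for name, t in timings.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio}x")
```

In this run the gap to lxml's xpath is a little under 2x, and closer to 3x against lxml's cssselect.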
Python 3.9.2
==== Total trials: 100000 =====
bs4 total time: 31.9
pq total time: 4.9
lxml (cssselect) total time: 4.4
lxml (xpath) total time: 3.1
regex total time: 8.5 (doesn't find all p)
Python 3.10.4
==== Total trials: 100000 =====
bs4 total time: 30.1
pq total time: 2.8
lxml (cssselect) total time: 2.6
lxml (xpath) total time: 2.0
regex total time: 6.3 (doesn't find all p)
Python 3.10.1
==== Total trials: 100000 =====
bs4 total time: 45.9
pq total time: 4.6
lxml (cssselect) total time: 4.3
lxml (xpath) total time: 3.3
regex total time: 8.4 (doesn't find all p)
Python 3.11.2
==== Total trials: 100000 =====
bs4 total time: 18.1
pq total time: 2.2
lxml (cssselect) total time: 2.2
lxml (xpath) total time: 1.7
regex total time: 5.2 (doesn't find all p)
Results using Python 3.7