@MercuryRising
Created November 12, 2012 19:29
Pyquery, lxml, BeautifulSoup comparison
from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml.html import fromstring
import re
import requests
import time


def Timer():
    """Generator that yields the seconds elapsed since the previous next() call."""
    a = time.time()
    while True:
        c = time.time()
        yield time.time() - a
        a = c


timer = Timer()

url = "http://www.python.org/"
html = requests.get(url).text

num = 100000

print('\n==== Total trials: %s =====' % num)

# Every parser timing below covers one parse of the page plus num queries.
next(timer)  # start the clock
soup = bs(html, 'lxml')
for x in range(num):
    paragraphs = soup.find_all('p')
t = next(timer)
print('bs4 total time: %.1f' % t)

d = pq(html)
for x in range(num):
    paragraphs = d('p')
t = next(timer)
print('pq total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.cssselect('p')  # needs the separate cssselect package
t = next(timer)
print('lxml (cssselect) total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.xpath('.//p')
t = next(timer)
print('lxml (xpath) total time: %.1f' % t)

# The pattern only matches a bare <p> tag whose content stays on one line,
# so paragraphs with attributes or line breaks are missed.
for x in range(num):
    paragraphs = re.findall('<[p ]>.*?</p>', html)
t = next(timer)
print('regex total time: %.1f (doesn\'t find all p)\n' % t)
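The regex timing carries the note "doesn't find all p". A minimal sketch of why, using only the standard library: the sample markup and the looser pattern below are assumptions for illustration, not part of the original benchmark, and a regex is still not a reliable HTML parser.

```python
import re

# Hypothetical sample markup (not taken from python.org).
SAMPLE = (
    '<p>plain</p>'
    '<p class="intro">with attribute</p>'
    '<p>\nmultiline\n</p>'
)

GIST_PATTERN = r'<[p ]>.*?</p>'        # pattern used in the benchmark
LOOSER_PATTERN = r'<p\b[^>]*>.*?</p>'  # assumption: also allow attributes on the tag


def find_paragraphs(pattern, html, flags=0):
    """Return every substring of html matched by pattern."""
    return re.findall(pattern, html, flags)


# Without re.DOTALL, '.' stops at newlines, and '<[p ]>' rejects '<p class=...>'.
gist_hits = find_paragraphs(GIST_PATTERN, SAMPLE)

# re.DOTALL lets '.' cross line breaks; '\b' keeps tags such as <pre> from matching.
looser_hits = find_paragraphs(LOOSER_PATTERN, SAMPLE, re.DOTALL | re.IGNORECASE)

print(len(gist_hits), len(looser_hits))
```

On this sample the benchmark pattern finds only the first paragraph, while the looser pattern finds all three, so the regex row is doing less work than the parser rows.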
@valgur

valgur commented Dec 30, 2015

Thanks for the Gist! For anyone else curious about the results (used Python 3.5):

==== Total trials: 100000 =====
bs4 total time: 74.1
pq total time: 13.9
lxml (cssselect) total time: 13.6
lxml (xpath) total time: 8.6
regex total time: 17.2 (doesn't find all p)

@alaakh42

alaakh42 commented Feb 20, 2018

Results using Python 2.7


==== Total trials: 100000 =====
bs4 total time: 38.0
pq total time: 5.2
lxml (cssselect) total time: 5.1
lxml (xpath) total time: 3.0
regex total time: 8.4 (doesn't find all p)

@guptarohit

Results using Python 3.6

==== Total trials: 100000 =====
bs4 total time: 52.6
pq total time: 7.5
lxml (cssselect) total time: 6.8
lxml (xpath) total time: 4.5
regex total time: 11.2 (doesn't find all p)

@p3nj

p3nj commented Aug 20, 2018

Results using Python 3.7

==== Total trials: 100000 =====
bs4 total time: 63.2
pq total time: 8.4
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 5.6
regex total time: 9.6 (doesn't find all p)

@kwuite

kwuite commented Sep 14, 2018

Results using Python 3.6.5

==== Total trials: 100000 =====
bs4 total time: 325.9 (Not sure why this happened, but it's a record in slowness)
pq total time: 8.9
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 3.5
regex total time: 8.5 (doesn't find all p)

@ghid4ds

ghid4ds commented Dec 17, 2018

==== Total trials: 100000 =====
bs4 total time: 93.2
pq total time: 7.4
lxml (cssselect) total time: 7.7
lxml (xpath) total time: 5.5
regex total time: 16.4 (doesn't find all p)

@Fischmax

Fischmax commented Jan 7, 2019

In Python 3.7.1:
==== Total trials: 100000 =====
bs4 total time: 69.6
pq total time: 10.1
lxml (cssselect) total time: 9.6
lxml (xpath) total time: 6.3
regex total time: 13.6 (doesn't find all p)

@guptarohit

Results using Python 3.7.3

==== Total trials: 100000 =====
bs4 total time: 94.1
pq total time: 9.5
lxml (cssselect) total time: 8.6
lxml (xpath) total time: 5.9
regex total time: 12.9 (doesn't find all p)

@andriyor

I tried selectolax; in this benchmark it is roughly twice as fast as lxml's XPath query (3.4 s vs. 6.4 s) and about three times as fast as lxml's cssselect (10.0 s).
https://rushter.com/blog/python-fast-html-parser/

from selectolax.parser import HTMLParser

tree = HTMLParser(html)
for x in range(num):
    paragraphs = tree.css('p')
t = next(timer)
print('selectolax total time: %.1f' % t)

==== Total trials: 100000 =====
bs4 total time: 95.4
pq total time: 10.9
lxml (cssselect) total time: 10.0
lxml (xpath) total time: 6.4
regex total time: 14.4 (doesn't find all p)
selectolax total time: 3.4
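Adding another library to the comparison means repeating the same parse, loop, and print steps. One way to factor that out is sketched below with the standard `timeit` module; the `bench` helper and its return format are hypothetical, and a compiled regex stands in for a parser so the block runs without third-party packages.

```python
import re
import timeit


def bench(name, setup, query, trials=1000):
    """Time setup() once and query(doc) `trials` times, reported separately.

    `bench` is a hypothetical helper, not part of the original gist.
    """
    start = timeit.default_timer()
    doc = setup()
    setup_s = timeit.default_timer() - start
    # timeit.timeit accepts a callable and returns total seconds for `number` calls.
    query_s = timeit.timeit(lambda: query(doc), number=trials)
    return {"name": name, "setup_s": setup_s, "query_s": query_s}


# Stand-in document and "parser" (assumptions for illustration only).
html = "<p>a</p><div><p>b</p></div>" * 50

result = bench(
    "regex",
    setup=lambda: re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL),
    query=lambda rx: rx.findall(html),
    trials=200,
)
print("%(name)s: setup %(setup_s).4f s, queries %(query_s).4f s" % result)
```

With selectolax installed, the same helper could presumably be called with `setup=lambda: HTMLParser(html)` and `query=lambda tree: tree.css('p')`, which keeps the one-off parse cost out of the query figure.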

@deedy5

deedy5 commented Apr 24, 2021

Python 3.9.2

==== Total trials: 100000 =====
bs4 total time: 31.9
pq total time: 4.9
lxml (cssselect) total time: 4.4
lxml (xpath) total time: 3.1
regex total time: 8.5 (doesn't find all p)

@hokwanhung

Python 3.10.4

==== Total trials: 100000 =====
bs4 total time: 30.1
pq total time: 2.8
lxml (cssselect) total time: 2.6
lxml (xpath) total time: 2.0
regex total time: 6.3 (doesn't find all p)

@xavierskip

Python 3.10.1

==== Total trials: 100000 =====
bs4 total time: 45.9
pq total time: 4.6
lxml (cssselect) total time: 4.3
lxml (xpath) total time: 3.3
regex total time: 8.4 (doesn't find all p)

@p3nj

p3nj commented Apr 30, 2023

Python 3.11.2

==== Total trials: 100000 =====
bs4 total time: 18.1
pq total time: 2.2
lxml (cssselect) total time: 2.2
lxml (xpath) total time: 1.7
regex total time: 5.2 (doesn't find all p)
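The absolute times in this thread come from different machines, so they are hard to compare directly. A rough way to compare them is to divide each row by the lxml XPath time from the same run. The sketch below does that for three of the runs posted above; the numbers are copied from those comments, and the `relative_to` helper is a hypothetical name.

```python
# Seconds for 100000 trials, copied from three of the comments above.
RESULTS = {
    "3.5": {"bs4": 74.1, "pq": 13.9, "lxml (cssselect)": 13.6,
            "lxml (xpath)": 8.6, "regex": 17.2},
    "3.9.2": {"bs4": 31.9, "pq": 4.9, "lxml (cssselect)": 4.4,
              "lxml (xpath)": 3.1, "regex": 8.5},
    "3.11.2": {"bs4": 18.1, "pq": 2.2, "lxml (cssselect)": 2.2,
               "lxml (xpath)": 1.7, "regex": 5.2},
}


def relative_to(results, baseline="lxml (xpath)"):
    """Express every timing as a multiple of the baseline from the same run."""
    return {
        version: {
            name: round(seconds / timings[baseline], 1)
            for name, seconds in timings.items()
        }
        for version, timings in results.items()
    }


ratios = relative_to(RESULTS)
for version, row in ratios.items():
    print("Python %s: %s" % (version, row))
```

On these three runs bs4 comes out at roughly 9 to 11 times the XPath time, which suggests the ordering is more stable across setups than the raw seconds are.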
