@MercuryRising
Created November 12, 2012 19:29
Pyquery, lxml, BeautifulSoup comparison
from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml.html import fromstring
import re
import requests
import time


def Timer():
    # Generator: each next() yields the seconds elapsed since the previous
    # next() (or since creation, for the first call).
    a = time.time()
    while True:
        c = time.time()
        yield time.time() - a
        a = c


timer = Timer()

url = "http://www.python.org/"
html = requests.get(url).text

num = 100000
print('\n==== Total trials: %s =====' % num)

# Reset the timer; each section below includes its one-off parse step.
next(timer)
soup = bs(html, 'lxml')
for x in range(num):
    paragraphs = soup.find_all('p')
t = next(timer)
print('bs4 total time: %.1f' % t)

d = pq(html)
for x in range(num):
    paragraphs = d('p')
t = next(timer)
print('pq total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.cssselect('p')
t = next(timer)
print('lxml (cssselect) total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.xpath('.//p')
t = next(timer)
print('lxml (xpath) total time: %.1f' % t)

# The pattern only matches a bare "<p>" tag, so tags with attributes are missed.
for x in range(num):
    paragraphs = re.findall(r'<[p ]>.*?</p>', html)
t = next(timer)
print('regex total time: %.1f (doesn\'t find all p)\n' % t)
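
The "(doesn't find all p)" note follows from the pattern itself: the character class `[p ]` matches exactly one character, so only a bare `<p>` opening tag is found. Here is a minimal sketch of that behaviour. The sample markup and the looser pattern are illustrative assumptions, not part of the original gist, and a regex like this is still fragile on real HTML.

```python
import re

# Hypothetical sample markup (an assumption, not taken from python.org).
sample_html = (
    '<p>plain paragraph</p>'
    '<p class="intro">paragraph with an attribute</p>'
    '<p\nid="x">opening tag split across lines</p>'
)

# Pattern from the benchmark: [p ] matches one character ("p" or a space),
# so "<p class=...>" and "<p\nid=...>" are never matched.
original = re.findall(r'<[p ]>.*?</p>', sample_html)

# Looser illustrative pattern: "<p", optionally whitespace plus attributes,
# then ">". re.DOTALL lets "." span newlines; re.IGNORECASE accepts "<P>".
looser = re.findall(
    r'<p(?:\s[^>]*)?>.*?</p>',
    sample_html,
    flags=re.DOTALL | re.IGNORECASE,
)

print(len(original), len(looser))
```

Because the regex timing covers a search that returns fewer elements than the parser-based searches, it is not directly comparable to the other rows.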
@valgur

valgur commented Dec 30, 2015

Thanks for the Gist! For anyone else curious about the results (used Python 3.5):

==== Total trials: 100000 =====
bs4 total time: 74.1
pq total time: 13.9
lxml (cssselect) total time: 13.6
lxml (xpath) total time: 8.6
regex total time: 17.2 (doesn't find all p)

@alaakh42

alaakh42 commented Feb 20, 2018

Results using Python 2.7


==== Total trials: 100000 =====
bs4 total time: 38.0
pq total time: 5.2
lxml (cssselect) total time: 5.1
lxml (xpath) total time: 3.0
regex total time: 8.4 (doesn't find all p)

@guptarohit

Results using Python 3.6

==== Total trials: 100000 =====
bs4 total time: 52.6
pq total time: 7.5
lxml (cssselect) total time: 6.8
lxml (xpath) total time: 4.5
regex total time: 11.2 (doesn't find all p)

@p3nj

p3nj commented Aug 20, 2018

Results using Python 3.7

==== Total trials: 100000 =====
bs4 total time: 63.2
pq total time: 8.4
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 5.6
regex total time: 9.6 (doesn't find all p)

@kwuite

kwuite commented Sep 14, 2018

Results using Python 3.6.5

==== Total trials: 100000 =====
bs4 total time: 325.9 (Not sure why this happened, but it's a record in slowness)
pq total time: 8.9
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 3.5
regex total time: 8.5 (doesn't find all p)

@ghid4ds

ghid4ds commented Dec 17, 2018

==== Total trials: 100000 =====
bs4 total time: 93.2
pq total time: 7.4
lxml (cssselect) total time: 7.7
lxml (xpath) total time: 5.5
regex total time: 16.4 (doesn't find all p)

@Fischmax

Fischmax commented Jan 7, 2019

In Python 3.7.1:
==== Total trials: 100000 =====
bs4 total time: 69.6
pq total time: 10.1
lxml (cssselect) total time: 9.6
lxml (xpath) total time: 6.3
regex total time: 13.6 (doesn't find all p)

@guptarohit

Results using Python 3.7.3

==== Total trials: 100000 =====
bs4 total time: 94.1
pq total time: 9.5
lxml (cssselect) total time: 8.6
lxml (xpath) total time: 5.9
regex total time: 12.9 (doesn't find all p)

@andriyor

I tried selectolax; in this case it is about 2 times faster than lxml (xpath):
https://rushter.com/blog/python-fast-html-parser/

from selectolax.parser import HTMLParser

tree = HTMLParser(html)
for x in range(num):
    paragraphs = tree.css('p')
t = next(timer)
print('selectolax total time: %.1f' % t)
==== Total trials: 100000 =====
bs4 total time: 95.4
pq total time: 10.9
lxml (cssselect) total time: 10.0
lxml (xpath) total time: 6.4
regex total time: 14.4 (doesn't find all p)
selectolax total time: 3.4

@deedy5

deedy5 commented Apr 24, 2021

Python 3.9.2

==== Total trials: 100000 =====
bs4 total time: 31.9
pq total time: 4.9
lxml (cssselect) total time: 4.4
lxml (xpath) total time: 3.1
regex total time: 8.5 (doesn't find all p)

@hokwanhung

Python 3.10.4

==== Total trials: 100000 =====
bs4 total time: 30.1
pq total time: 2.8
lxml (cssselect) total time: 2.6
lxml (xpath) total time: 2.0
regex total time: 6.3 (doesn't find all p)

@xavierskip

Python 3.10.1

==== Total trials: 100000 =====
bs4 total time: 45.9
pq total time: 4.6
lxml (cssselect) total time: 4.3
lxml (xpath) total time: 3.3
regex total time: 8.4 (doesn't find all p)

@p3nj

p3nj commented Apr 30, 2023

Python 3.11.2

==== Total trials: 100000 =====
bs4 total time: 18.1
pq total time: 2.2
lxml (cssselect) total time: 2.2
lxml (xpath) total time: 1.7
regex total time: 5.2 (doesn't find all p)
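
Since every run uses 100000 trials, the totals can be converted into per-call times and ratios, which makes runs on different machines easier to compare. This is a rough sketch using the Python 3.11.2 totals above. The names `totals_seconds`, `per_call_microseconds` and `summary` are illustrative and not from the gist.

```python
NUM_TRIALS = 100000

# Totals in seconds, copied from the Python 3.11.2 results above.
totals_seconds = {
    "bs4": 18.1,
    "pq": 2.2,
    "lxml (cssselect)": 2.2,
    "lxml (xpath)": 1.7,
    "regex": 5.2,
}


def per_call_microseconds(total_seconds, trials=NUM_TRIALS):
    """Average time of a single search, in microseconds."""
    return total_seconds / trials * 1e6


# Use the fastest entry in this run (lxml xpath) as the baseline.
baseline = totals_seconds["lxml (xpath)"]
summary = {
    name: (per_call_microseconds(total), total / baseline)
    for name, total in totals_seconds.items()
}

for name, (micros, ratio) in summary.items():
    print("%-18s %7.1f us/call  %5.1fx xpath" % (name, micros, ratio))
```

In that run a single bs4 search averages roughly 181 microseconds, a bit over ten times the lxml xpath figure.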
