@knoguchi
Last active August 29, 2015 14:14
Scraping with Python + PyQuery. ref: http://qiita.com/knoguchi/items/cd51a00de026ef92080a
<ul>
<li x="123" class="red"> item 1 </li>
<li x="123" class="red"> item 2 </li>
<li x="123" class="red"> item 3 </li>
</ul>
set(['http://i.yimg.jp/images/sicons/box16.gif',
'http://k.yimg.jp/images/clear.gif',
'http://k.yimg.jp/images/common/tv.gif',
'http://k.yimg.jp/images/icon/photo.gif',
'http://k.yimg.jp/images/new2.gif',
'http://k.yimg.jp/images/sicons/ybm161.gif',
'http://k.yimg.jp/images/top/sp/cgrade/iconMail.gif',
'http://k.yimg.jp/images/top/sp/cgrade/icon_point.gif',
'http://k.yimg.jp/images/top/sp/cgrade/info_btn-140325.gif',
'http://k.yimg.jp/images/top/sp/cgrade/logo7.gif',
'http://lpt.c.yimg.jp/im_sigg6mIfJALB8FuA5LAzp6.HPA---x120-y120/amd/20150208-00010001-dtohoku-000-view.jpg'])
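#!/usr/bin/env python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2
from pprint import pprint

from pyquery import PyQuery as pq

url = 'http://www.yahoo.co.jp'
dom = pq(url)  # fetch the page and parse it
result = set()
for img in dom('img').items():
    img_url = img.attr['src']
    if not img_url:
        continue  # skip <img> tags that have no src attribute
    # urljoin returns absolute URLs unchanged and resolves relative ones against url
    result.add(urljoin(url, img_url))
pprint(result)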
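The script above relies on `urljoin` to turn `src` values into absolute URLs. As a minimal sketch of how `urljoin` treats the different forms a `src` can take, the snippet below resolves four sample values against the same base URL. The sample paths are hypothetical and chosen for illustration only; they are not taken from the scraped page.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

base = 'http://www.yahoo.co.jp'

# Hypothetical src values, for illustration only.
samples = [
    'images/logo.gif',                    # relative path
    '/images/logo.gif',                   # root-relative path
    '//k.yimg.jp/images/clear.gif',       # protocol-relative URL
    'http://k.yimg.jp/images/clear.gif',  # already absolute
]

# Resolve every sample against the base URL.
resolved = [urljoin(base, src) for src in samples]

for src, full in zip(samples, resolved):
    print('%-36s -> %s' % (src, full))
```

Because an absolute URL passes through `urljoin` unchanged, a separate `startswith('http')` check before calling it is not strictly needed.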
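from pyquery import PyQuery as pq

html = '''
<ul>
<li> item 1 </li>
<li> item 2 </li>
<li> item 3 </li>
</ul>
'''
dom = pq(html)
# Set class="red" and x="123" on every <li>. PyQuery accepts class_ as the
# keyword for the class attribute, since class is a reserved word in Python.
dom('li').each(lambda index, node: pq(node).attr(class_='red', x='123'))
print(dom)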