Web Scraping with Common Lisp

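We live in the age of big data analysis

The problem is how to collect the data
- Gathering your own data is hard work
- Build a robot that semi-automatically collects data published on the Web
Under Japanese copyright law, copying for the purpose of data analysis is permitted
Follow the robots.txt file directly under the site root
Keep the access frequency within reasonable limits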
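
A minimal sketch of a robots.txt check in plain Common Lisp. It assumes a simplified reading of the format: only groups introduced by `User-agent: *` are considered, only `Disallow` lines are honoured, and comments, `Allow` lines, and wildcards are ignored. The function names and the sample file are hypothetical.

```lisp
(defun field-value (line name)
  "Return the trimmed value if LINE starts with NAME and a colon
(case-insensitive), otherwise NIL."
  (let ((prefix-length (1+ (length name))))
    (when (and (>= (length line) prefix-length)
               (string-equal (concatenate 'string name ":") line
                             :end2 prefix-length))
      (string-trim '(#\Space #\Tab #\Return) (subseq line prefix-length)))))

(defun disallowed-prefixes (robots-txt)
  "Collect the Disallow paths of the groups that apply to every agent."
  (with-input-from-string (in robots-txt)
    (loop with applies = nil
          with paths = '()
          for raw = (read-line in nil)
          while raw
          do (let* ((line (string-trim '(#\Space #\Tab #\Return) raw))
                    (agent (field-value line "User-agent"))
                    (path (field-value line "Disallow")))
               ;; A User-agent line switches the current group on or off.
               (when agent
                 (setf applies (string= agent "*")))
               ;; An empty Disallow value means "nothing is disallowed".
               (when (and applies path (plusp (length path)))
                 (push path paths)))
          finally (return (nreverse paths)))))

(defun allowed-p (path robots-txt)
  "True unless PATH starts with one of the disallowed prefixes."
  (notany (lambda (prefix)
            (and (<= (length prefix) (length path))
                 (string= prefix path :end2 (length prefix))))
          (disallowed-prefixes robots-txt)))

;; Hypothetical robots.txt content, standing in for a downloaded file.
(defparameter *sample-robots*
  (format nil "User-agent: *~%Disallow: /private/~%"))

(print (allowed-p "/markets/kabu/" *sample-robots*))
(print (allowed-p "/private/a.html" *sample-robots*))
```

For the access frequency, calling the standard `(sleep 1)` between requests is a simple way to throttle a crawl loop.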

What is Web scraping?

Collecting data by crawling the Web and extracting the information you need
If an API is provided, just use it
- Twitter, Facebook, and the like require OAuth authentication

What you need for Web scraping

HTTP client
- dexador, drakma
HTML/XML parser
- plump
CSS selectors
- clss
OAuth authentication (when using APIs such as Twitter's)
- cl-oauth
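
A minimal sketch of how the parser and the selector library fit together, run on an inline HTML string so that no network access is needed. The sample markup and the helper name `select-texts` are hypothetical; the HTTP client would normally supply the HTML string.

```lisp
(ql:quickload '(:plump :clss))

;; Hypothetical markup standing in for a downloaded page.
(defparameter *sample-html*
  "<html><body><ul><li class=\"item\">alpha</li><li class=\"item\">beta</li></ul></body></html>")

(defun select-texts (selector html)
  "Parse HTML, run the CSS SELECTOR against the tree, and return the text
of every matching node as a list."
  (map 'list #'plump:text (clss:select selector (plump:parse html))))

(print (select-texts "li.item" *sample-html*))
```

`clss:select` returns a vector of nodes, which is why the slides index into the result with `aref`.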

Getting the current Nikkei Stock Average from the Nikkei newspaper site

(ql:quickload :dexador)
(ql:quickload :plump)
(ql:quickload :clss)
(ql:quickload :cl-ppcre)

;; Fetch the page and parse it into a DOM tree
(defparameter article-html (dex:get "http://www.nikkei.com/markets/kabu/"))
(defparameter parse-tree (plump:parse article-html))
;; First SPAN element with class "mkc-stock_prices"
(defparameter sub-tree (aref (clss:select "span.mkc-stock_prices" parse-tree) 0))

;; Print the text of its first child node
(print (plump:text (aref (plump:children sub-tree) 0)))
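
The `aref` on the selection result signals an error when the selector matches nothing, for example after a page redesign. A minimal sketch of a guarded variant follows. The helper names are hypothetical, and the inline markup with its made-up price stands in for the live page.

```lisp
(ql:quickload '(:plump :clss))

(defun first-match-text (selector root)
  "Return the text of the first node matching SELECTOR under ROOT,
or NIL if nothing matches."
  (let ((matches (clss:select selector root)))
    (when (plusp (length matches))
      (plump:text (aref matches 0)))))

(defun strip-commas (text)
  "Remove thousands separators from a price string."
  (remove #\, text))

;; Sample markup standing in for the live page; the price is made up.
(defparameter *sample-root*
  (plump:parse "<div><span class=\"mkc-stock_prices\">17,621.40</span></div>"))

(let ((text (first-match-text "span.mkc-stock_prices" *sample-root*)))
  (if text
      (print (strip-commas text))
      (warn "Selector matched nothing; the page layout may have changed.")))
```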

Doing the same thing in Python

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.nikkei.com/markets/kabu/"
response = urllib.request.urlopen(url)
data = response.read()
soup = BeautifulSoup(data, "html.parser")
span = soup.find_all("span")
nikkei_heikin = ""
for tag in span:
    try:
        # tag.get("class") is None for a span without a class attribute
        string_ = tag.get("class")[0]
        # Compare for equality; `in` would also match substrings
        if string_ == "mkc-stock_prices":
            nikkei_heikin = tag.string
            break
    except (TypeError, IndexError):
        pass
print(nikkei_heikin)

Extracting the body text from a Reuters article

(defparameter article-html (dex:get "http://jp.reuters.com/article/idJPL3N0U325520141219"))
;; The body element's class name has a generated suffix, so look it up with a regex first
(defparameter body-class
  (aref (nth-value 1 (ppcre:scan-to-strings "(ArticleBody_body_.*?)\"" article-html)) 0))
(defparameter parse-tree (plump:parse article-html))
(defparameter sub-tree (aref (clss:select (format nil ".~A" body-class) parse-tree) 0))

;; Concatenate the text of every text node under NODE, in document order
(defun node-text (node)
  (flet ((cat (strs) (reduce (lambda (s1 s2) (concatenate 'string s1 s2)) strs
                             :initial-value "")))
    (let ((text-list nil))
      (plump:traverse node
                      (lambda (node) (push (plump:text node) text-list))
                      :test #'plump:text-node-p)
      (cat (nreverse text-list)))))

(print (node-text sub-tree))
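
The regex step can be tried in isolation on a literal string. The sketch below is a slightly stricter variant that stops at whitespace as well as at the closing quote, so it still works when the element carries several classes, and it returns NIL instead of signalling an error when nothing matches. The function name and the sample class suffix are hypothetical.

```lisp
(ql:quickload :cl-ppcre)

(defun find-hashed-class (html prefix)
  "Return the first class name in HTML that starts with PREFIX,
or NIL if there is none."
  ;; QUOTE-META-CHARS keeps PREFIX from being read as regex syntax.
  ;; The primary value of SCAN-TO-STRINGS is the whole match, or NIL.
  (values
   (ppcre:scan-to-strings
    (format nil "~A[^\"\\s]*" (ppcre:quote-meta-chars prefix))
    html)))

;; Made-up markup; the suffix "2ECha" is only an example.
(print (find-hashed-class "<div class=\"ArticleBody_body_2ECha extra\">"
                          "ArticleBody_body_"))
```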

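Downloading non-sequentially named images in page order

If the file names are sequential, this is enough
- wget http://example.com/H1000{00..99}.JPG
If they are not sequential and the images sit in a particular part of the page, parsing is required
- Example: http://logofaves.com/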
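
In the wget command, the `{00..99}` part is expanded by the shell (bash brace expansion) before wget runs. For comparison, a minimal sketch that builds the same list of sequential URLs in Common Lisp; the function name and its parameters are hypothetical.

```lisp
(defun sequential-urls (prefix suffix start end &optional (width 2))
  "Build URLs PREFIX<n>SUFFIX for n from START to END inclusive,
zero-padding n to WIDTH digits."
  (loop for i from start to end
        collect (format nil "~A~V,'0D~A" prefix width i suffix)))

(defparameter *urls*
  (sequential-urls "http://example.com/H1000" ".JPG" 0 99))

(print (first *urls*))
(print (length *urls*))
```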

Downloading non-sequentially named images in page order (2)

Take the subtree with class .boxes, find the IMG tags inside it, and filter them by URL

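(defparameter article-html (dex:get "http://logofaves.com/"))
(defparameter parse-tree (plump:parse article-html))
;; All IMG nodes under the first element with class "boxes"
(defparameter sub-trees (clss:select "img" (aref (clss:select ".boxes" parse-tree) 0)))
;; Keep only the src values that point into the uploads directory
(defparameter urls
  (remove-if-not
   (lambda (url)
     ;; url is NIL for an IMG without a src attribute
     (and url (cl-ppcre:scan "^http://logofaves\\.com/wp-content/uploads/" url)))
   (map 'list (lambda (node)
                (gethash "src" (plump:attributes node)))
        sub-trees)))

;; Save the images in page order as /tmp/logo-000.jpg, /tmp/logo-001.jpg, ...
(loop for i from 0
      for url in urls
      do (dex:fetch url (format nil "/tmp/logo-~3,'0d.jpg" i)))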
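
The loop above saves every file with a `.jpg` extension, whatever the source format is. A minimal sketch of deriving the local file name from the URL instead, in plain Common Lisp. The function names are hypothetical, and the helper assumes the URL has a path component after the host name.

```lisp
(defun url-extension (url &optional (default "jpg"))
  "Return the lower-cased file extension of URL, ignoring any query string
or fragment, or DEFAULT when the last path segment has no extension."
  (let* ((end (or (position-if (lambda (c) (member c '(#\? #\#))) url)
                  (length url)))
         (slash (position #\/ url :end end :from-end t))
         (dot (position #\. url :end end :from-end t)))
    (if (and dot
             (or (null slash) (> dot slash))
             (< (1+ dot) end))
        (string-downcase (subseq url (1+ dot) end))
        default)))

(defun local-path (index url &optional (directory "/tmp/"))
  "Build a zero-padded local file name that keeps the extension of URL."
  (format nil "~Alogo-~3,'0d.~A" directory index (url-extension url)))

;; Made-up URL under the uploads directory used on the slide.
(print (local-path 7 "http://logofaves.com/wp-content/uploads/sample.PNG?x=1"))
```

The result of `local-path` can be passed as the second argument of `dex:fetch` in place of the fixed `format` call.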

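Collecting data from an API

Use cl-oauth to fetch data from the Twitter API

OAuth 1.0 authentication with cl-oauth

The authentication flow
- Create a consumer token
- Send it to the authorization server to obtain a request token
- Open the authorization URL and log in with the user account
- The browser is redirected to the callback URL with GET parameters attached
- Create an access token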
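
In OAuth 1.0a, the GET parameters attached to the callback are `oauth_token` and `oauth_verifier`, and the verifier is what allows the request token to be exchanged for an access token. A minimal sketch of pulling them out of the callback query string in plain Common Lisp, independent of cl-oauth. The function names and the sample values are hypothetical, and percent-decoding is left out on the assumption that the values contain no escaped characters.

```lisp
(defun split-on (char string)
  "Split STRING at every occurrence of CHAR."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun callback-params (query)
  "Turn a query string such as \"a=1&b=2\" into an alist of (name . value).
Pairs without an equals sign are skipped."
  (loop for pair in (split-on #\& query)
        for sep = (position #\= pair)
        when sep
          collect (cons (subseq pair 0 sep) (subseq pair (1+ sep)))))

(defun param (name params)
  "Look up NAME in the alist returned by CALLBACK-PARAMS."
  (cdr (assoc name params :test #'string=)))

;; Made-up query string as it would arrive at the callback URL.
(defparameter *callback-query* "oauth_token=abc123&oauth_verifier=xyz789")

(print (param "oauth_verifier" (callback-params *callback-query*)))
```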

masatoi/lispmeetup56-slide.org
