cheerioお試しメモ.md

Scraping with Cheerio

Last time I fetched HTML with request. This time I tried extracting words from that HTML.

The library I used this time is cheerio: https://github.com/cheeriojs/cheerio
Install it with npm:

$ npm install cheerio --save

Adding --save records the dependency in package.json (npm 5 and later do this by default).

client.js

var request = require('request');
// Route requests through a proxy; replace the placeholders with real values.
var r = request.defaults({'proxy':'<proxy url>:<port>'});

var cheerio = require('cheerio');

var url = 'http://news.google.co.jp/';

r(url, function (error, response, body) {
  if (!error && response.statusCode === 200) {
    // Create cheerio instance (declared with var to avoid an implicit global)
    var $ = cheerio.load(body);

    var resurl = response.request.href;  // URL that was actually requested
    var title = $("title").text();       // text of the <title> element

    console.log(resurl);
    console.log(title);
  }
});

Result

http://news.google.co.jp/
Google ニュース

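Confirmed that a result comes back.
Character encoding is not taken into account, so if the target web page uses something like Shift_JIS, the text comes out garbled. (iconv would probably solve this, but I am leaving it as is for now.)
That's it for now.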
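
As a rough sketch of how the garbled-text problem could be handled, the snippet below decodes raw bytes using the charset named in a Content-Type header. It uses Node's built-in TextDecoder rather than iconv, which is a swapped-in technique and not what this memo tested. The helper names charsetFromContentType and decodeBody are hypothetical, and it assumes a Node.js build with full ICU so that the shift_jis label is available.

```javascript
// Sketch only: charsetFromContentType and decodeBody are hypothetical helpers,
// not part of request or cheerio.

// Pull the charset out of a Content-Type header, defaulting to UTF-8.
function charsetFromContentType(contentType) {
  var match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Decode raw response bytes using the declared charset.
// Assumes a Node.js build with full ICU so that 'shift_jis' is supported.
function decodeBody(buffer, contentType) {
  var decoder = new TextDecoder(charsetFromContentType(contentType));
  return decoder.decode(buffer);
}

// The three characters 日本語 encoded as Shift_JIS bytes.
var sjisBytes = Buffer.from([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);
console.log(decodeBody(sjisBytes, 'text/html; charset=Shift_JIS'));
```

To use this with request, the body has to arrive as a Buffer, which request does when the encoding option is set to null. The decoded string can then be passed to cheerio.load. Pages that declare their charset only in a meta tag are not covered by this sketch.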

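Next actions

Get a better understanding of the internals of request and cheerio (code reading)
Use the extracted words for a follow-up action (a Google search, for example)
Run "fetch HTML with request → extract words with cheerio → follow-up action" at a fixed interval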
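
One possible shape for the last two items is sketched below. Nothing here has been tried in this memo: buildSearchUrl, runOnce and startPolling are hypothetical names, and the Google search URL format is an assumption. The fetch, extract and follow-up steps are passed in as functions, so the loop itself does not depend on request or cheerio.

```javascript
// Sketch only: buildSearchUrl, runOnce and startPolling are hypothetical names.

// Turn extracted words into a Google search URL (the URL format is an assumption).
function buildSearchUrl(words) {
  return 'https://www.google.com/search?q=' + encodeURIComponent(words.join(' '));
}

// One pass of the pipeline: fetch HTML, extract words, then act on them.
// fetchHtml takes a Node-style callback(error, html).
function runOnce(fetchHtml, extractWords, nextAction) {
  fetchHtml(function (error, html) {
    if (error) {
      console.error(error);
      return;
    }
    nextAction(extractWords(html));
  });
}

// Run the pipeline immediately, then repeat it every intervalMs milliseconds.
// Returns the timer so the caller can stop it with clearInterval.
function startPolling(intervalMs, fetchHtml, extractWords, nextAction) {
  runOnce(fetchHtml, extractWords, nextAction);
  return setInterval(function () {
    runOnce(fetchHtml, extractWords, nextAction);
  }, intervalMs);
}
```

In client.js, fetchHtml would presumably wrap the r(url, ...) call and extractWords would wrap cheerio.load plus a selector, while nextAction could log or request the URL from buildSearchUrl. Calling startPolling(60000, ...) would then repeat the pipeline once a minute.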
