This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# -*- coding:utf-8 -*- | |
import re | |
import urllib2 | |
from lib.BeautifulSoup import BeautifulSoup | |
agent="""Sosospider+(+http://help.soso.com/webspider.htm)""" | |
blog_url = 'http://blog.sina.com.cn/s/articlelist_1517582220_0_1.html' | |
spider_handle = urllib2.urlopen(blog_url) | |
blog_content = spider_handle.read() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import gevent | |
from gevent import monkey, queue | |
monkey.patch_all() | |
import urllib2 | |
from time import sleep | |
import traceback | |
import logging |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from gcrawler import GCrawler, Downloader | |
import unittest | |
import urllib2 | |
import logging | |
import traceback | |
from datetime import datetime | |
import re | |
logging.basicConfig(level=logging.DEBUG) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import os | |
import shutil | |
import subprocess | |
import sys | |
import tarfile | |
import urllib2 | |
LIBXML2_PREFIX = "libxml2" | |
LIBXSLT_PREFIX = "libxslt" | |
LIBXML2_FTPURL = "ftp://xmlsoft.org/libxml2/" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env python | |
# | |
# Copyright 2009 Facebook | |
# | |
# Licensed under the Apache License, Version 2.0 (the "License"); you may | |
# not use this file except in compliance with the License. You may obtain | |
# a copy of the License at | |
# | |
# http://www.apache.org/licenses/LICENSE-2.0 | |
# |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from scrapy import log | |
from scrapy.item import Item | |
from scrapy.http import Request | |
from scrapy.contrib.spiders import XMLFeedSpider | |
def NextURL(): | |
""" | |
Generate a list of URLs to crawl. You can query a database or come up with some other means | |
Note that if you generate URLs to crawl from a scraped URL then you're better of using a |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
This is a simple example of WebSocket + Tornado + Redis Pub/Sub usage. | |
Do not forget to replace YOURSERVER by the correct value. | |
Keep in mind that you need the *very latest* version of your web browser. | |
You also need to add Jacob Kristhammar's websocket implementation to Tornado: | |
Grab it here: | |
http://gist.github.com/526746 | |
Or clone my fork of Tornado with websocket included: | |
http://github.com/pelletier/tornado | |
Oh and the Pub/Sub protocol is only available in Redis 2.0.0: |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
select fans_id from messages where content in ('形勢', '形式', '形势') and mp_id=35164 and created_time >= '2014-07-24 06:00:00' and created_time < '2014-07-24 07:00:00'group by fans_id |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
回民吃猪肉 | |
习近平 | |
TMD | |
毛民进党 | |
妹妹淫水 流 | |
机吧 | |
联国 | |
1989六四 | |
性爱电影 | |
李红智 |