; Sample supervisor config file.

[unix_http_server]
file=/tmp/supervisor.sock   ; the path to the socket file
;chmod=0700                 ; socket file mode (default 0700)
;chown=nobody:nogroup       ; socket file uid:gid owner
;username=user              ; default is no username (open server)
;password=123               ; default is no password (open server)

;[inet_http_server]         ; inet (TCP) server disabled by default
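A config like this is usually followed by one `[program:x]` section per process that supervisor should manage. A hypothetical example (the program name and command below are illustrative, not from the original file):

```ini
[supervisord]
logfile=/tmp/supervisord.log        ; main log file

[program:myapp]
command=/usr/bin/myapp --port 8080  ; the command to run (hypothetical)
autostart=true                      ; start when supervisord launches
autorestart=true                    ; restart on unexpected exit
```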
# When crawling, we often run into problems where content rendered on the page is
# generated with JavaScript, so Scrapy cannot crawl it (e.g. AJAX requests, jQuery
# craziness). However, if you use Scrapy together with the web testing framework
# Selenium, you can crawl anything that is displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work, and
# Selenium must be set up properly. This is also just a template crawler: you could
# get much more advanced, but I just wanted to show the basic idea. As the code
# stands now, you will make two requests for any given URL: one by Scrapy and one
# by Selenium. There are probably ways around this so that Selenium makes the one
# and only request, but I did not bother to implement that, and by making two
# requests you get to crawl the page with Scrapy too.
#
# This is quite powerful.
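The two-requests idea above can be sketched roughly as follows. This is a hypothetical outline, not the gist's actual spider: the class and method names are illustrative, and it uses the modern Selenium WebDriver API rather than the Selenium RC API the note refers to.

```python
# Hypothetical sketch of the Scrapy + Selenium pattern described above.
class SeleniumSpider:
    name = "selenium_spider"
    start_urls = ["http://example.com"]  # illustrative URL

    def parse(self, response):
        # First request was made by Scrapy; now load the same URL in a
        # real browser so JavaScript-generated content gets rendered.
        from selenium import webdriver  # deferred import: requires selenium
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            html = driver.page_source  # the fully rendered DOM
        finally:
            driver.quit()
        return self.parse_rendered(html)

    def parse_rendered(self, html):
        # Scrape the rendered HTML here (e.g. with lxml or selectors);
        # returning its length is just a placeholder.
        return {"length": len(html)}
```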
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.exceptions import DropItem
from scrapy.utils.serialize import ScrapyJSONEncoder
from carrot.connection import BrokerConnection
from carrot.messaging import Publisher
from twisted.internet.threads import deferToThread
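The imports suggest an item pipeline that serializes scraped items and publishes them to a message broker via carrot. A minimal sketch of that idea, with the broker dependency injected so the logic stands alone (the class name and `.send()` interface are assumptions; a real version would build the publisher from a carrot `BrokerConnection` and defer the blocking send with `deferToThread`):

```python
import json


class MessageQueuePipeline:
    """Hypothetical sketch: publish each scraped item to a message queue."""

    def __init__(self, publisher):
        # `publisher` is anything with a .send(body) method (e.g. a
        # carrot Publisher in the real pipeline).
        self.publisher = publisher

    def process_item(self, item, spider):
        # Serialize the item to JSON. ScrapyJSONEncoder would also handle
        # Scrapy-specific types; plain json is enough for a sketch.
        self.publisher.send(json.dumps(dict(item)))
        return item  # pipelines must return the item for later stages
```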
# You can use this middleware to have a random user agent for every request the
# spider makes. Define a USER_AGENT_LIST in your settings and the spider will
# choose a random user agent from that list every time.
#
# You will have to disable the default user agent middleware and add this to your
# settings file:
#
# DOWNLOADER_MIDDLEWARES = {
#     'scraper.random_user_agent.RandomUserAgentMiddleware': 400,
#     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }
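The core of such a middleware is small. A minimal sketch of the idea (a real Scrapy middleware would read USER_AGENT_LIST from the crawler settings via `from_crawler`; here the list is passed in directly so the logic is self-contained):

```python
import random


class RandomUserAgentMiddleware:
    """Sketch: set a randomly chosen User-Agent header on every request."""

    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    def process_request(self, request, spider):
        # Pick a fresh user agent for each outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agent_list)
```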
# This middleware can be used to avoid re-visiting already visited items, which
# can speed up scraping for projects with immutable items, i.e. items that,
# once scraped, don't change.
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
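The dedup idea behind this middleware can be sketched as follows. Note that Scrapy's `request_fingerprint` hashes the request method, body and (optionally) headers as well; hashing the URL alone, as below, is a simplification to keep the sketch dependency-free.

```python
import hashlib


class VisitedItemsFilter:
    """Hypothetical sketch of the dedup logic: remember a fingerprint of
    every request whose item was already scraped, and skip requests that
    match a remembered fingerprint."""

    def __init__(self):
        self.visited = set()

    def fingerprint(self, url):
        # Stand-in for scrapy.utils.request.request_fingerprint.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def is_visited(self, url):
        return self.fingerprint(url) in self.visited

    def mark_visited(self, url):
        self.visited.add(self.fingerprint(url))
```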
# This script shows how to crawl a site without setting up a complete project.
#
# Note: `crawler.start()` can't be called more than once due to a limitation of
# Twisted's reactor.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: Rolando Espinoza La fuente
#
# Changelog:
#   24/07/2011 - updated to work with scrapy 13.0dev
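On a modern Scrapy the same "crawl without a project" idea looks roughly like this; the 2011-era API (manual `Crawler` and dispatcher signals) differed, so treat this as a sketch, not the gist's code:

```python
def run_spider(spider_cls):
    """Run one spider class outside a Scrapy project.

    CrawlerProcess starts Twisted's reactor internally, which is why
    start() cannot be called more than once per process."""
    from scrapy.crawler import CrawlerProcess  # deferred: requires Scrapy
    process = CrawlerProcess()
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes; call only once
```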
# A downloader middleware that automatically redirects pages containing a
# rel=canonical link in their contents to the canonical URL (if the page itself
# is not the canonical one).
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.url import url_is_from_spider
from scrapy.http import HtmlResponse
from scrapy import log


class RelCanonicalMiddleware(object):

    _extractor = SgmlLinkExtractor(restrict_xpaths=['//head/link[@rel="canonical"]'],
                                   tags=['link'], attrs=['href'])
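The extraction-and-compare step at the heart of this middleware can be sketched without Scrapy. The middleware itself uses SgmlLinkExtractor with an XPath; a simple regex scan is used below only to keep the sketch dependency-free (real HTML is better handled by a proper parser):

```python
import re


def canonical_url(html, page_url):
    """Return the page's declared canonical URL if it differs from the
    page's own URL, else None (meaning no redirect is needed)."""
    for tag in re.findall(r"<link[^>]*>", html, re.IGNORECASE):
        if re.search(r'rel=["\']canonical["\']', tag, re.IGNORECASE):
            m = re.search(r'href=["\']([^"\']+)["\']', tag, re.IGNORECASE)
            if m and m.group(1) != page_url:
                return m.group(1)
    return None
```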
import json
from functools import partial
from collections import OrderedDict

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from lxml.cssselect import css_to_xpath
using System;
using System.Net;
using System.IO;
using System.Web;
using System.Text.RegularExpressions;

namespace FundaAPI
{
    class MainClass
    {