@bcambel
bcambel / supervisor.conf
Created December 6, 2012 17:07 — forked from tsabat/supervisor.conf
Sample supervisor config file
; Sample supervisor config file.
[unix_http_server]
file=/tmp/supervisor.sock ; (the path to the socket file)
;chmod=0700 ; socket file mode (default 0700)
;chown=nobody:nogroup ; socket file uid:gid owner
;username=user ; (default is no username (open server))
;password=123 ; (default is no password (open server))
;[inet_http_server] ; inet (TCP) server disabled by default
@bcambel
bcambel / supervisord-example.conf
Created December 7, 2012 02:35 — forked from didip/supervisord-example.conf
Example for supervisord conf file
; Sample supervisor config file.
[unix_http_server]
file=/tmp/supervisor.sock ; (the path to the socket file)
;chmod=0700 ; socket file mode (default 0700)
;chown=nobody:nogroup ; socket file uid:gid owner
;username=user ; (default is no username (open server))
;password=123 ; (default is no password (open server))
;[inet_http_server] ; inet (TCP) server disabled by default
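A config like this usually goes on to define one or more [program:x] sections telling supervisord what to run; a minimal illustrative stanza (the program name, command, and log paths below are placeholders, not part of either gist):
[program:myapp]
command=/usr/bin/python /srv/myapp/run.py ; the command supervisord runs and watches
autostart=true                            ; start when supervisord starts
autorestart=true                          ; restart the process if it exits unexpectedly
stdout_logfile=/var/log/myapp.out.log
stderr_logfile=/var/log/myapp.err.log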
# Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, so Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness). However, if you use Scrapy along with the web testing framework Selenium, you can crawl anything displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work, and you must have Selenium set up properly. Also, this is just a template crawler; you could get much crazier and more advanced, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL: one request is made by Scrapy and the other by Selenium. I am sure there are ways around this so that Selenium makes the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.
#
# This is quite powerful
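# A minimal sketch of the idea, using the modern Selenium WebDriver API
# instead of the Selenium RC mentioned above; the spider name, start URL,
# and extracted field are illustrative, not the gist's actual code.
import scrapy
from selenium import webdriver

class JsPageSpider(scrapy.Spider):
    name = "js_page"
    start_urls = ["http://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy has already fetched the page once; Selenium fetches it
        # again so the browser can execute the page's JavaScript.
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        for href in rendered.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}

    def closed(self, reason):
        # Called when the spider finishes; shut the browser down.
        self.driver.quit()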
# (The imports below preview a separate gist: presumably a pipeline that publishes scraped items to a RabbitMQ broker via carrot.)
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.exceptions import DropItem
from scrapy.utils.serialize import ScrapyJSONEncoder
from carrot.connection import BrokerConnection
from carrot.messaging import Publisher
from twisted.internet.threads import deferToThread
# You can use this middleware to have a random user agent for every request the spider makes.
# You can define a USER_AGENT_LIST in your settings and the spider will choose a random user agent from that list every time.
#
# You will have to disable the default user agent middleware and add this to your settings file.
#
# DOWNLOADER_MIDDLEWARES = {
# 'scraper.random_user_agent.RandomUserAgentMiddleware': 400,
# 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }
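# A minimal sketch of such a middleware, assuming the USER_AGENT_LIST
# setting holds the candidate strings; the class and setting names here
# are illustrative, not necessarily the gist's exact code.
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list of user agents from the project settings.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)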
@bcambel
bcambel / scrapy_ignore_visited_items.py
Created December 23, 2012 18:45
This middleware can be used to avoid re-visiting already visited items, which can be useful for speeding up the scraping for projects with immutable items, i.e. items that, once scraped, don't change.
# This middleware can be used to avoid re-visiting already visited items, which can be useful for speeding up the scraping for projects with immutable items, i.e. items that, once scraped, don't change.
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
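    # The preview cuts off at the class definition; a rough sketch of the
    # rest, assuming a process_spider_output spider-middleware hook. This
    # is an illustration, not the gist's exact code.
    def __init__(self):
        self.visited = set()

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request) and request_fingerprint(x) in self.visited:
                log.msg("Ignoring already visited: %s" % x.url, level=log.INFO)
                continue
            if isinstance(x, BaseItem):
                # Remember the page this item came from so further
                # requests for it are skipped.
                self.visited.add(request_fingerprint(response.request))
            yield x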
@bcambel
bcambel / scrapy_in_container.py
Created December 23, 2012 18:48
This script shows how to crawl a site without setting up a complete project.
# This script shows how to crawl a site without setting up a complete project.
#
# Note: `crawler.start()` can't be called more than once due to a limitation of Twisted's reactor.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: Rolando Espinoza La fuente
#
# Changelog:
# 24/07/2011 - updated to work with scrapy 13.0dev
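# A minimal sketch of the same idea against the current Scrapy API
# (CrawlerProcess replaced the old in-script Crawler setup); the spider
# and URL here are illustrative, not the gist's code.
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(QuotesSpider)
    # Blocks until the crawl finishes; as noted above, it cannot be
    # called a second time because Twisted's reactor is not restartable.
    process.start()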
@bcambel
bcambel / scrapy_download_middleware_canonical_links.py
Created December 23, 2012 18:50
A downloader middleware to automatically redirect pages containing a rel=canonical link in their contents to the canonical URL (if the page itself is not the canonical one).
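A rough sketch of how such a middleware might look (the class name and selector are assumptions, not the gist's code):
from scrapy.http import HtmlResponse

class CanonicalRedirectMiddleware:
    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            canonical = response.css('link[rel="canonical"]::attr(href)').get()
            # Redirect only when the page declares a different canonical URL.
            if canonical and response.urljoin(canonical) != response.url:
                return request.replace(url=response.urljoin(canonical))
        return response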
@bcambel
bcambel / scrapy_example_coctail_extract.py
Created December 23, 2012 18:52
Scrapy example to extract cocktails from seriouseats.com
import json
from functools import partial
from collections import OrderedDict
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from lxml.cssselect import css_to_xpath
@bcambel
bcambel / funda.cs
Created January 19, 2013 12:25
Funda API Test
using System;
using System.Net;
using System.IO;
using System.Web;
using System.Text.RegularExpressions;
namespace FundaAPI
{
class MainClass
{