@bcambel
bcambel / supervisor.conf
Created December 6, 2012 17:07 — forked from tsabat/supervisor.conf
Sample supervisor config file
; Sample supervisor config file.
[unix_http_server]
file=/tmp/supervisor.sock ; (the path to the socket file)
;chmod=0700 ; socket file mode (default 0700)
;chown=nobody:nogroup ; socket file uid:gid owner
;username=user ; (default is no username (open server))
;password=123 ; (default is no password (open server))
;[inet_http_server] ; inet (TCP) server disabled by default
@bcambel
bcambel / supervisord-example.conf
Created December 7, 2012 02:35 — forked from didip/supervisord-example.conf
Example for supervisord conf file
; Sample supervisor config file.
[unix_http_server]
file=/tmp/supervisor.sock ; (the path to the socket file)
;chmod=0700 ; socket file mode (default 0700)
;chown=nobody:nogroup ; socket file uid:gid owner
;username=user ; (default is no username (open server))
;password=123 ; (default is no password (open server))
;[inet_http_server] ; inet (TCP) server disabled by default
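A config like this usually goes on to define one or more [program:x] sections telling supervisord what to run; a minimal illustrative stanza (the program name, command, and log paths below are placeholders, not part of either gist):
[program:myapp]
command=/usr/bin/python /srv/myapp/run.py ; the command supervisord runs and watches
autostart=true                            ; start when supervisord starts
autorestart=true                          ; restart the process if it exits unexpectedly
stdout_logfile=/var/log/myapp.out.log
stderr_logfile=/var/log/myapp.err.log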
# Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, so Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness). However, if you use Scrapy along with the web testing framework Selenium, you can crawl anything displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work, and you must have Selenium set up properly. Also, this is just a template crawler; you could get much crazier and more advanced, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL: one request is made by Scrapy and the other by Selenium. I am sure there are ways around this so that Selenium makes the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.
#
# This is quite powerful
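# A minimal sketch of the idea, using the modern Selenium WebDriver API
# instead of the Selenium RC mentioned above; the spider name, start URL,
# and extracted field are illustrative, not the gist's actual code.
import scrapy
from selenium import webdriver

class JsPageSpider(scrapy.Spider):
    name = "js_page"
    start_urls = ["http://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy has already fetched the page once; Selenium fetches it
        # again so the browser can execute the page's JavaScript.
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        for href in rendered.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}

    def closed(self, reason):
        # Called when the spider finishes; shut the browser down.
        self.driver.quit()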
# (The imports below preview a separate gist: presumably a pipeline that publishes scraped items to a RabbitMQ broker via carrot.)
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.exceptions import DropItem
from scrapy.utils.serialize import ScrapyJSONEncoder
from carrot.connection import BrokerConnection
from carrot.messaging import Publisher
from twisted.internet.threads import deferToThread
# You can use this middleware to have a random user agent for every request the spider makes.
# You can define a USER_AGENT_LIST in your settings and the spider will choose a random user agent from that list every time.
#
# You will have to disable the default user agent middleware and add this to your settings file.
#
# DOWNLOADER_MIDDLEWARES = {
# 'scraper.random_user_agent.RandomUserAgentMiddleware': 400,
# 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }
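# A minimal sketch of such a middleware, assuming the USER_AGENT_LIST
# setting holds the candidate strings; the class and setting names here
# are illustrative, not necessarily the gist's exact code.
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list of user agents from the project settings.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)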
@bcambel
bcambel / scrapy_ignore_visited_items.py
Created December 23, 2012 18:45
This middleware can be used to avoid re-visiting already visited items, which can be useful for speeding up the scraping for projects with immutable items, i.e. items that, once scraped, don't change.
# This middleware can be used to avoid re-visiting already visited items, which can be useful for speeding up the scraping for projects with immutable items, i.e. items that, once scraped, don't change.
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
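    # The preview cuts off at the class definition; a rough sketch of the
    # rest, assuming a process_spider_output spider-middleware hook. This
    # is an illustration, not the gist's exact code.
    def __init__(self):
        self.visited = set()

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request) and request_fingerprint(x) in self.visited:
                log.msg("Ignoring already visited: %s" % x.url, level=log.INFO)
                continue
            if isinstance(x, BaseItem):
                # Remember the page this item came from so further
                # requests for it are skipped.
                self.visited.add(request_fingerprint(response.request))
            yield x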
@bcambel
bcambel / scrapy_in_container.py
Created December 23, 2012 18:48
This script shows how to crawl a site without setting up a complete project.
# This script shows how to crawl a site without setting up a complete project.
#
# Note: `crawler.start()` can't be called more than once due to a limitation of Twisted's reactor.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: Rolando Espinoza La fuente
#
# Changelog:
# 24/07/2011 - updated to work with scrapy 13.0dev
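# A minimal sketch of the same idea against the current Scrapy API
# (CrawlerProcess replaced the old in-script Crawler setup); the spider
# and URL here are illustrative, not the gist's code.
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(QuotesSpider)
    # Blocks until the crawl finishes; as noted above, it cannot be
    # called a second time because Twisted's reactor is not restartable.
    process.start()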
@bcambel
bcambel / scrapy_download_middleware_canonical_links.py
Created December 23, 2012 18:50
A downloader middleware to automatically redirect pages containing a rel=canonical link in their contents to the canonical URL (if the page itself is not the canonical one).
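A rough sketch of how such a middleware might look (the class name and selector are assumptions, not the gist's code):
from scrapy.http import HtmlResponse

class CanonicalRedirectMiddleware:
    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            canonical = response.css('link[rel="canonical"]::attr(href)').get()
            # Redirect only when the page declares a different canonical URL.
            if canonical and response.urljoin(canonical) != response.url:
                return request.replace(url=response.urljoin(canonical))
        return response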
@bcambel
bcambel / scrapy_example_coctail_extract.py
Created December 23, 2012 18:52
Scrapy example to extract cocktails from seriouseats.com
import json
from functools import partial
from collections import OrderedDict
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from lxml.cssselect import css_to_xpath
@bcambel
bcambel / funda.cs
Created January 19, 2013 12:25
Funda API Test
using System;
using System.Net;
using System.IO;
using System.Web;
using System.Text.RegularExpressions;
namespace FundaAPI
{
class MainClass
{