wid widnyana
widnyana / imdb_next_page_spider.py
Created January 6, 2016 07:23 — forked from premit/imdb_next_page_spider.py
Scrapy reference: crawling the next page of paginated results
'''
Spider for IMDb
- Retrieve most popular movies & TV series with rating of 8.0 and above
- Crawl next pages recursively
'''
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
widnyana / pipelines.py
Last active November 23, 2015 06:57 — forked from tzermias/pipelines.py
Scrapy MySQL pipeline. Just a mirror of the asynchronous MySQL pipeline. Copy-paste it directly into pipelines.py. Database credentials are stored in settings.py. Based on http://snipplr.com/view/66986/
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy import log
SETTINGS = get_project_settings()
"""
Celery base task aimed at longish-running jobs that return a result.
``AwesomeResultTask`` adds thundering herd avoidance, result caching, progress
reporting, error fallback and JSON encoding of results.
"""
from __future__ import division
import logging
import simplejson
#!/usr/bin/env python
"""
Regex for URIs
These regexes are directly derived from the collected ABNF in RFC3986
(except for DIGIT, ALPHA and HEXDIG, defined by RFC2234).
They should be processed with re.VERBOSE.
"""
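
The docstring above describes building URI regexes from the RFC3986 ABNF and compiling them with re.VERBOSE. The gist body is not shown in this preview, so the following is only a minimal sketch of that approach for two rules (scheme and IPv4address). The names SCHEME, DEC_OCTET, IPV4ADDRESS, is_scheme and is_ipv4 are illustrative assumptions, not identifiers taken from the gist.

```python
import re

# Core ABNF rules (RFC2234), written as regex fragments.
DIGIT = r"[0-9]"
ALPHA = r"[A-Za-z]"

# RFC3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = rf"{ALPHA} (?: {ALPHA} | {DIGIT} | [+.-] )*"

# RFC3986: dec-octet, one alternative per line of the ABNF rule.
DEC_OCTET = rf"""(?:
      25 [0-5]            # "25" %x30-35
    | 2 [0-4] {DIGIT}     # "2" %x30-34 DIGIT
    | 1 {DIGIT}{{2}}      # "1" 2DIGIT
    | [1-9] {DIGIT}       # %x31-39 DIGIT
    | {DIGIT}             # DIGIT
)"""

# RFC3986: IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = rf"{DEC_OCTET} (?: \. {DEC_OCTET} ){{3}}"

# re.VERBOSE ignores the layout whitespace and the inline comments above.
scheme_re = re.compile(SCHEME, re.VERBOSE)
ipv4_re = re.compile(IPV4ADDRESS, re.VERBOSE)


def is_scheme(text):
    """Return True if the whole string matches the scheme rule."""
    return scheme_re.fullmatch(text) is not None


def is_ipv4(text):
    """Return True if the whole string matches the IPv4address rule."""
    return ipv4_re.fullmatch(text) is not None
```

Keeping one ABNF alternative per line makes the pattern easy to check against the RFC text. Note that re.VERBOSE does not ignore whitespace inside a character class, and a literal "#" outside a class has to be escaped, which matters for rules such as fragment.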