@ad-m
Last active November 6, 2015 15:36
2015-11-06 15:37:02 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-11-06 15:37:02 [scrapy] INFO: Optional features available: ssl, http11
2015-11-06 15:37:02 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2015-11-06 15:37:03 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-06 15:37:03 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-06 15:37:03 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-06 15:37:03 [scrapy] INFO: Enabled item pipelines:
2015-11-06 15:37:03 [scrapy] INFO: Spider opened
2015-11-06 15:37:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-06 15:37:03 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-06 15:37:03 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019> (referer: None)
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=C33D09A9784C6784C1257DFF002E05D9> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=730A65DB8748E322C1257E28002C40A7> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=757BC9F7C7AD2614C1257E150034312A> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=C33D09A9784C6784C1257DFF002E05D9>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC04/1.livx?startTime=447761356000&stopTime=447768480000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=B322CFBE797B4E0BC1257DF200427187> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=CB8681A4ACCD7571C1257DE900503C0D> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=2472539D1B106B0BC1257E3D003F5412> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=730A65DB8748E322C1257E28002C40A7>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=451310897000&stopTime=451317515000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=757BC9F7C7AD2614C1257E150034312A>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=450187882000&stopTime=450193642000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=D73A74DC86C64249C1257E06003B56D1> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=B322CFBE797B4E0BC1257DF200427187>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC12/1.livx?startTime=447167264000&stopTime=447173818000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=CB8681A4ACCD7571C1257DE900503C0D>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/senat/ENC06/channel.livx?startTime=445943327000&stopTime=445945493000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=2472539D1B106B0BC1257E3D003F5412>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=453138322000&stopTime=453143544000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=D73A74DC86C64249C1257E06003B56D1>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=448464926000&stopTime=448469528000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=8392B36ABFA0F1F5C1257E7C003110FC> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=8392B36ABFA0F1F5C1257E7C003110FC>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC16/1.livx?startTime=458122115000&stopTime=458128995000&nolimit=1'}
2015-11-06 15:37:04 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=95C775175B62B28FC1257E58003CAC36> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:04 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=95C775175B62B28FC1257E58003CAC36>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=455553554000&stopTime=455561183000&nolimit=1'}
2015-11-06 15:37:05 [scrapy] DEBUG: Crawled (200) <GET http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=BBC081BEA85C71A8C1257E4D002C57A1> (referer: http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019)
2015-11-06 15:37:05 [scrapy] DEBUG: Scraped from <200 http://www.sejm.gov.pl/Sejm7.nsf/transmisje_arch.xsp?unid=BBC081BEA85C71A8C1257E4D002C57A1>
{'url': 'http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=454255473000&stopTime=454268371000&nolimit=1'}
import json

import scrapy
from scrapy.crawler import CrawlerProcess

# Listing page with links to the archived session broadcasts.
START_URL = 'http://www.sejm.gov.pl/SQL2.nsf/poskomprocall?OpenAgent&7&3019'


class VideoSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = [START_URL]

    def parse(self, response):
        # Follow every "Retransmisja z posiedzenia" (session rebroadcast) link.
        for url in response.xpath('/html/body/center/table//a[@title="Retransmisja z posiedzenia"]/@href'):
            yield scrapy.Request(response.urljoin(url.extract()), self.parse_videos)

    def parse_videos(self, response):
        # Take the first line starting with 'var params ', keep the text after ' = ',
        # and drop the last character before parsing the rest as JSON.
        params_text = [x for x in response.body_as_unicode().split("\n") if x.startswith('var params ')][0].split(' = ', 1)[1][:-1]
        data = json.loads(params_text)
        params = data['params']
        yield {'url': "{url}?startTime={start}&stopTime={stop}&nolimit=1".format(url=params['file'],
                                                                                 start=params['ATMStart'],
                                                                                 stop=params['ATMStop'])}


# Run the spider directly from this script, without a Scrapy project.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(VideoSpider)
process.start()
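
The extraction step in parse_videos can be tried without Scrapy or network access. The following is a minimal sketch, not part of the original gist: the function name extract_stream_url and the SAMPLE page fragment are hypothetical, and the layout of the real page (a single line of the form `var params = {...};`) is assumed from the spider code rather than confirmed. Unlike the spider, the sketch strips surrounding whitespace from each line, removes a trailing semicolon with rstrip(';') instead of slicing off the last character, and returns None when no matching line is found.

```python
import json


def extract_stream_url(page_text):
    """Build the stream URL from the `var params = {...};` line of a page."""
    for line in page_text.splitlines():
        line = line.strip()
        if line.startswith('var params '):
            # Keep the text after ' = ' and drop the trailing semicolon.
            raw = line.split(' = ', 1)[1].rstrip(';')
            params = json.loads(raw)['params']
            return '{0}?startTime={1}&stopTime={2}&nolimit=1'.format(
                params['file'], params['ATMStart'], params['ATMStop'])
    return None  # no matching line found


# Hypothetical page fragment; the real page layout is assumed, not confirmed.
SAMPLE = '''<script>
var params = {"params": {"file": "http://example.invalid/1.livx", "ATMStart": "100", "ATMStop": "200"}};
</script>'''

print(extract_stream_url(SAMPLE))
```

Returning None rather than raising IndexError lets a caller skip pages that have no params line instead of aborting on them.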
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/06a104cff8ae3853da1611fc559ddcfd.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/108c91e1744e2ef8900ff0dd08f6202c.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/222fb585a1efba8d69c76b8d9182eb66.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/3193f1ca5ec92b2ecb61842d6a6689bb.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/5a3868a2141c5e8d3b360302b4037664.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/65a8740f346fef80782c4d0ba55a9039.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/6b4e77bcbafbd392a756a527e44d4eac.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/cbddf4dc78ff4a6dba1a2baf192f6b9f.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/f6a57d9b4a9bb8150c1f4748b3d2e136.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/fcfa436a7133acc1a0d9a017c5c1b4a1.flv
http://cdn.files.jawne.info.pl.e24files.com/public_html/2015/11/sejm-stowarzyszenia/url.txt
http://redir.atmcdn.pl/nvr/o2/sejm/ENC04/1.livx?startTime=447761356000&stopTime=447768480000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=451310897000&stopTime=451317515000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=450187882000&stopTime=450193642000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC12/1.livx?startTime=447167264000&stopTime=447173818000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/senat/ENC06/channel.livx?startTime=445943327000&stopTime=445945493000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=453138322000&stopTime=453143544000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=448464926000&stopTime=448469528000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC16/1.livx?startTime=458122115000&stopTime=458128995000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=455553554000&stopTime=455561183000&nolimit=1
http://redir.atmcdn.pl/nvr/o2/sejm/ENC18/1.livx?startTime=454255473000&stopTime=454268371000&nolimit=1