@SRatna
Last active August 16, 2017 11:39
Set up for Scrapy
I used Anaconda as my Python distribution.
So first install Anaconda.
Then set the PATH in .profile in your home directory:
--> PATH="$HOME/bin:$HOME/.local/bin:$HOME/anaconda3/bin:$PATH"
Then create a virtual environment using conda:
: conda create --name your_name_for_env
This creates a named env directory inside anaconda3/envs.
Activate it with: source activate your_name_for_env
Then install Scrapy: pip install scrapy
scrapy startproject project_name
cd project_name
scrapy genspider spider_name website_address
scrapy list -- to see the list of defined spiders
-- to test for errors before crawling pages:
scrapy crawl name_of_spider
-- solved an error: in settings.py, change ROBOTSTXT_OBEY from True to False
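For reference, the robots.txt switch lives in the project's generated settings.py. A minimal sketch of the relevant fragment (the project name is whatever you passed to startproject):

```python
# project_name/settings.py -- generated by `scrapy startproject`.
# Scrapy obeys robots.txt by default; setting this to False makes the
# spider crawl pages the site's robots.txt would otherwise exclude,
# so flip it only when you understand the implications for the target site.
ROBOTSTXT_OBEY = False
```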
scrapy shell ----
fetch('url_of_website')
response.body --- to see the raw response
view(response) --- to view the response in a browser
We can use both CSS and XPath to extract any node.
Examples:
response.css('h1')
response.xpath('//h1')
response.xpath('//h1/a/text()').extract()
response.xpath('//h1/a/text()').extract_first()
response.xpath('//*[@class="tag-item"]')
response.css('.tag-item')
response.xpath('//*[@class="tag-item"]/a/text()').extract()
Note: double slashes '//' mean search everywhere in the document (not just direct children) and return all matching instances.
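The descendant-search behaviour of '//' can be demonstrated without a live site. As a rough stdlib analogy (Python's xml.etree supports a limited XPath subset, where './/' plays the role of scrapy's '//'), using a made-up HTML fragment shaped like the quotes site above:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a fetched page.
html = """
<div>
  <h1><a>Quotes to Scrape</a></h1>
  <span class="tag-item"><a>love</a></span>
  <span class="tag-item"><a>books</a></span>
</div>
"""
root = ET.fromstring(html)

# './/h1/a' searches all levels beneath root, like scrapy's '//h1/a';
# .text plays the role of text()...extract().
texts = [a.text for a in root.findall('.//h1/a')]
print(texts)  # ['Quotes to Scrape']

# Attribute predicate, like //*[@class="tag-item"]/a/text() -- it finds
# every matching element anywhere in the tree, not just top-level ones.
tags = [el.find('a').text for el in root.findall('.//*[@class="tag-item"]')]
print(tags)  # ['love', 'books']
```

In scrapy proper, the equivalent calls are response.xpath('//h1/a/text()').extract() and response.xpath('//*[@class="tag-item"]/a/text()').extract() as shown above.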