Set up for Scrapy
I used Anaconda as my Python distribution.
So first install Anaconda.
Then set the PATH in .profile in your home directory:
--> PATH="$HOME/bin:$HOME/.local/bin:$HOME/anaconda3/bin:$PATH"
Then create a virtual environment using conda:
    conda create --name your_name_for_env
This creates a named env directory inside anaconda3/envs.
Activate it with: source activate your_name_for_env
Then install Scrapy inside the environment: pip install scrapy
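To confirm the install worked in the activated environment, a quick check from Python (just a sanity check, not part of the original steps):

    import scrapy                 # fails with ImportError if the install did not work
    print(scrapy.__version__)     # prints whichever Scrapy version pip installed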
scrapy startproject project_name
cd project_name
scrapy genspider spider_name website_address
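For reference, genspider writes a skeleton module under project_name/spiders/. With the placeholder names above it looks roughly like this (a sketch of Scrapy's default template; the actual class and file names follow whatever you passed to genspider):

    import scrapy

    class SpiderNameSpider(scrapy.Spider):
        name = 'spider_name'                      # the name used by scrapy list / scrapy crawl
        allowed_domains = ['website_address']     # requests outside this domain are filtered out
        start_urls = ['http://website_address/']  # first page(s) to download

        def parse(self, response):
            # callback that receives each downloaded response; extraction logic goes here
            pass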
scrapy list  -- to see the list of defined spiders and to catch errors before crawling any pages
scrapy crawl name_of_spider
-- if the crawl fails because of robots.txt, open settings.py and change ROBOTSTXT_OBEY from True to False (see the snippet below)
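The relevant setting lives in project_name/settings.py. New Scrapy projects obey robots.txt by default, which typically surfaces as a "Forbidden by robots.txt" message in the crawl log; flipping the flag fixes it (only do this for sites you are allowed to crawl):

    # project_name/settings.py
    ROBOTSTXT_OBEY = False   # default is True in generated projects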
To experiment interactively, start the Scrapy shell: scrapy shell
fetch('name_of_website')  -- download a page into the shell
response.body             -- to see the raw response
view(response)            -- to view the response in a browser
We can use both CSS and XPath selectors to extract any node, for example:
response.css('h1')
response.xpath('//h1')
response.xpath('//h1/a/text()').extract()
response.xpath('//h1/a/text()').extract_first()
response.xpath('//*[@class="tag-item"]')
response.css('.tag-item')
response.xpath('//*[@class="tag-item"]/a/text()').extract()
Note: the double slash '//' means search everywhere in the document and return all matching instances.
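Putting the selectors together, the parse method of the generated spider could yield items instead of printing them. A hypothetical sketch, assuming a page with the h1/a heading and .tag-item elements used in the examples above (field names are made up):

    import scrapy

    class SpiderNameSpider(scrapy.Spider):
        name = 'spider_name'
        start_urls = ['http://website_address/']  # placeholder from the steps above

        def parse(self, response):
            # same expressions as in the shell, collected into a single item
            yield {
                'title': response.xpath('//h1/a/text()').extract_first(),
                'tags': response.xpath('//*[@class="tag-item"]/a/text()').extract(),
            }

Run it with scrapy crawl spider_name -o items.json to write the yielded items to a JSON file.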