<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Xpath Syntax</title>
</head>

<div>
    <a href='www.example.com'>Link</a>
</div>

<p class='someClass'>Paragraph 1</p>
<p id='someId'>Paragraph 2</p>

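As a rough illustration of how XPath attribute predicates pick out the two paragraphs above, the sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset. It is a stand-in chosen for illustration, not Scrapy's selector API, and the wrapping `<body>` element is an assumption added so the fragment parses as a single tree.

```python
import xml.etree.ElementTree as ET

# Assumption: the two paragraphs are wrapped in a <body> element so the
# fragment has a single root and can be parsed as XML.
html = """
<body>
  <p class='someClass'>Paragraph 1</p>
  <p id='someId'>Paragraph 2</p>
</body>
"""

root = ET.fromstring(html)

# [@attr='value'] keeps only the elements whose attribute matches.
by_class = root.findall(".//p[@class='someClass']")
by_id = root.findall(".//p[@id='someId']")

class_texts = [p.text for p in by_class]
id_texts = [p.text for p in by_id]
print(class_texts)
print(id_texts)
```

Each query returns a list of matching elements, so a predicate that matches nothing yields an empty list rather than an error.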
import scrapy


class JokesSpider(scrapy.Spider):
    name = 'jokes'
    allowed_domains = ['www.laughfactory.com']
    start_urls = [
        'http://www.laughfactory.com/jokes/family-jokes'
    ]

    def parse(self, response):

    def parse(self, response):
        for joke in response.xpath("//div[@class='jokes']"):
            yield {
                'joke_text': joke.xpath(".//div[@class='joke-text']/p").extract_first()
            }

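The loop above runs a second, relative XPath query inside each matched `<div>`. The sketch below imitates that pattern with the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors. The markup is hypothetical and only mirrors the class names used in the spider. Note that in Scrapy the query selects the `<p>` element itself, so the extracted string includes the tags, whereas this sketch reads only the element's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mirrors the class names used in the spider.
page = """
<html>
  <body>
    <div class='jokes'>
      <div class='joke-text'><p>First joke</p></div>
    </div>
    <div class='jokes'>
      <div class='joke-text'><p>Second joke</p></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)

jokes = []
for joke in root.findall(".//div[@class='jokes']"):
    # The leading "." scopes the search to the current <div>, not the whole page.
    paragraph = joke.find(".//div[@class='joke-text']/p")
    if paragraph is not None:
        jokes.append({'joke_text': paragraph.text})

print(jokes)
```

Dropping the leading `.` in the inner query would search from the document root in Scrapy, so every iteration would return the first joke on the page.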
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)

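`response.urljoin` resolves a possibly relative link against the URL of the current response. The sketch below shows the same kind of resolution with the standard library's `urllib.parse.urljoin`. The `next_page` value is a made-up example, not the site's actual markup.

```python
from urllib.parse import urljoin

# Assumption: the page being parsed and a relative href found in its "next" link.
current_url = 'http://www.laughfactory.com/jokes/family-jokes'
next_page = '/jokes/family-jokes/2'

# A relative href is resolved against the current page's URL.
next_page_link = urljoin(current_url, next_page)
print(next_page_link)

# An href that is already absolute is returned unchanged.
absolute_link = urljoin(current_url, 'http://www.laughfactory.com/jokes/family-jokes/2')
print(absolute_link)
```

Joining first means the request works whether the site emits relative or absolute pagination links.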
class JokeItem(scrapy.Item):
    joke_text = scrapy.Field()

import scrapy
from demo_project.items import JokeItem
from scrapy.loader import ItemLoader


class JokesSpider(scrapy.Spider):
    name = 'jokes'
    allowed_domains = ['www.laughfactory.com']
    start_urls = [

import scrapy
# Newer Scrapy releases provide these processors in itemloaders.processors.
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


def remove_whitespace(value):
    # Strip leading and trailing whitespace from an extracted string.
    return value.strip()


class JokeItem(scrapy.Item):
    joke_text = scrapy.Field(
        input_processor=MapCompose(remove_tags, remove_whitespace),

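To make the effect of the input processor above concrete, here is a simplified, standard-library-only approximation of what `MapCompose(remove_tags, remove_whitespace)` does to extracted values. `map_compose_sketch` and `remove_tags_sketch` are hypothetical helpers written for this illustration. The real `MapCompose` and `w3lib.html.remove_tags` handle more cases, such as dropping `None` results and flattening returned lists.

```python
import re


def remove_tags_sketch(value):
    # Crude stand-in for w3lib.html.remove_tags: drop anything that looks like a tag.
    return re.sub(r'<[^>]+>', '', value)


def remove_whitespace(value):
    return value.strip()


def map_compose_sketch(*functions):
    # Returns a processor that runs every value through each function in order.
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
        return values
    return process


processor = map_compose_sketch(remove_tags_sketch, remove_whitespace)
cleaned = processor(['<p>  Why did the chicken cross the road?  </p>'])
print(cleaned)
```

The order of the functions matters: tags are removed first, so whitespace that sat just inside the tags is stripped afterwards.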
__pycache__/
.vscode/
build/
dbs/
eggs/
project.egg-info/
*.json
*.csv