Last active
September 10, 2017 22:17
Times of India Web Scraping
| # Created By Shakti | |
| # This script scrapes all links from the given URL and generates a list of URLs in a text file, which can be run automatically to gather points on Indiatimes. | |
| # Basically a scraping learning experience. | |
| # Written for Python 2: urllib2 does not exist in Python 3. | |
| import re | |
| import urllib2 | |
| from bs4 import BeautifulSoup | |
| url = urllib2.urlopen("http://timesofindia.indiatimes.com/entertainment/hindi/") | |
| content = url.read() | |
| soup = BeautifulSoup(content, "html.parser") | |
| # re.compile("$") matches any attribute value, so this selects every tag that has both an 'hid' attribute and an href. | |
| myfile = open('d:/xyz.txt', 'w') | |
| for tag in soup.findAll(attrs={'hid': re.compile("$")}, href=True): | |
|     myfile.write("%s\n" % tag['href']) | |
| myfile.close() | |
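The filter `attrs={'hid': re.compile("$")}` relies on the pattern `$` matching at the end of every string, so it accepts any tag that carries an `hid` attribute at all, whatever its value. As a rough illustration only, the sketch below reproduces that selection with the Python 3 standard library instead of Beautiful Soup. The class name `HidLinkParser` and the sample HTML are hypothetical and are not taken from the Times of India page.

```python
import re
from html.parser import HTMLParser

# "$" matches at the end of any string, including the empty string.
ANY_VALUE = re.compile("$")


class HidLinkParser(HTMLParser):
    """Collect href values from tags that also carry an 'hid' attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        hid = attr_map.get("hid")
        href = attr_map.get("href")
        # Keep the tag only when both attributes are present with a value.
        if hid is not None and href is not None and ANY_VALUE.search(hid):
            self.links.append(href)


# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<a hid="101" href="/entertainment/hindi/story-one">One</a>
<a href="/no-hid-attribute">Skipped</a>
<a hid="" href="http://example.com/story-two">Two</a>
<span hid="7">No href, skipped</span>
"""

parser = HidLinkParser()
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

Only the first and third anchors are collected: the second has no `hid` attribute and the `span` has no `href`.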
| # Created By Shakti | |
| # This script checks the text file generated by the scraper and makes a list of proper URLs, which can then be used in iMacros or similar software or add-ons under Firefox. | |
| # Basically a scraping learning experience. | |
| infile = open('d:/input.txt', 'r') | |
| output = open('d:/outfile.txt', 'w') | |
| for line in infile: | |
|     # Site-relative links start with '/'; prefix them with the site root. | |
|     if line.startswith('/'): | |
|         line = 'http://timesofindia.indiatimes.com' + line | |
|     output.write(line) | |
| infile.close() | |
| output.close() | |
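The prefixing step in the script above can also be expressed with `urllib.parse.urljoin`, which resolves a relative link against a base URL and leaves absolute links untouched. The following is a minimal Python 3 sketch, not the author's method. The helper name `absolutize` and the sample lines are hypothetical, and unlike the original script it strips trailing newlines and skips blank lines.

```python
from urllib.parse import urljoin

# Site root used by the script above.
BASE_URL = "http://timesofindia.indiatimes.com"


def absolutize(lines, base=BASE_URL):
    """Return absolute URLs for non-empty lines; absolute inputs pass through."""
    result = []
    for line in lines:
        link = line.strip()
        if link:
            # urljoin resolves site-relative paths and keeps full URLs as they are.
            result.append(urljoin(base + "/", link))
    return result


# Hypothetical input lines, as they might appear in the scraped text file.
sample = ["/entertainment/hindi/story-one\n", "http://example.com/story-two\n", "\n"]
urls = absolutize(sample)
print(urls)
```

Using `urljoin` also handles links such as `story.html` or `../other/` that do not begin with `/`, which the simple prefix check passes through unchanged.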