IDsearch = re.compile(r'id=(\d+)')  # This searches for anything that starts with 'id=' and ends with a string of numbers, capturing the string of numbers
threadIDs = IDsearch.findall(str(cleanpagedata))  # We need to convert the BeautifulSoup output to a string in order to search with regex
IDsearch = re.compile(r'vote\?id=(\d+)&')  # Don't forget the \ before the ?. Characters such as ? are special in regex, so they must be escaped to be matched literally
threadIDs = IDsearch.findall(str(cleanpagedata))
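To see why the narrower pattern and the escaped ? matter, here is a small sketch run against a made-up fragment of listing markup. The `sample` string is an assumption for illustration and the real page markup may differ; only the two patterns come from the snippets above.

```python
import re

# Hypothetical fragment imitating one row of the listing page (not real markup).
sample = (
    '<tr class="athing" id="101">'
    '<a id="up_101" href="vote?id=101&amp;how=up&amp;goto=news"></a>'
    '<a href="item?id=101">12 comments</a></tr>'
)

broad = re.compile(r'id=(\d+)')             # first attempt: any id= followed by digits
narrow = re.compile(r'vote\?id=(\d+)&')     # second attempt: only the vote link
unescaped = re.compile(r'vote?id=(\d+)&')   # same pattern with the \ forgotten

broad_ids = broad.findall(sample)
narrow_ids = narrow.findall(sample)
unescaped_ids = unescaped.findall(sample)

print(broad_ids)      # ['101', '101'] - the same thread is captured twice
print(narrow_ids)     # ['101']
print(unescaped_ids)  # [] - 'e?' means an optional 'e', so the literal '?' never matches
```

In this sample the broad pattern picks up the ID once from the vote link and once from the comments link, which is why anchoring on `vote\?id=` gives one ID per thread.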
commentlinks = []
for i in range(len(threadIDs)):
    commentlinks.append("https://news.ycombinator.com/item?id=" + threadIDs[i])
thread = requests.get(commentlinks[0])
cleanthread = bs4.BeautifulSoup(thread.text, 'html.parser')
<a class="storylink" href="https://babluboy.github.io/bookworm/">Bookworm: A Simple, Focused eBook Reader</a>
singlethreadlinksearch = re.compile(r'<a class="storylink" href="(.+?)">')  # Unlike ?, the characters < and > are not special in regex, so they need no escape \
singlethreadlink = singlethreadlinksearch.findall(str(cleanthread))
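As a quick offline check, the pattern can be run against the storylink tag shown earlier instead of a live page. This is only a sketch: the `html` variable is a stand-in for `str(cleanthread)`, and the backslashes before < and > are left out here because Python's `re` matches those characters literally either way.

```python
import re

# The storylink anchor from the example thread, used in place of a fetched page.
html = ('<a class="storylink" href="https://babluboy.github.io/bookworm/">'
        'Bookworm: A Simple, Focused eBook Reader</a>')

singlethreadlinksearch = re.compile(r'<a class="storylink" href="(.+?)">')
singlethreadlink = singlethreadlinksearch.findall(html)

print(singlethreadlink)  # ['https://babluboy.github.io/bookworm/']
```

Note that `findall` returns a list even when there is a single match, which matters later when the result is combined with other values.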
commenterIDsearch = re.compile(r'user\?id=(.+?)"')
commenterIDs = commenterIDsearch.findall(str(cleanthread))
firstcommenter = commenterIDs[1]  # Remember that Python lists start with 0, so index 1 is the second match (index 0 is presumably the story's submitter)
def scrapethread(cleanthread):  # We need to feed the thread data into the function
    singlethreadlinksearch = re.compile(r'<a class="storylink" href="(.+?)">')
    singlethreadlink = singlethreadlinksearch.findall(str(cleanthread))
    commenterIDsearch = re.compile(r'user\?id=(.+?)"')
    commenterIDs = commenterIDsearch.findall(str(cleanthread))
    try:
        firstcommenter = commenterIDs[1]  # If there are no commenters this will fail, so we wrap it in a try/except just in case
    except IndexError:  # Catch only the missing-index error rather than every exception
        firstcommenter = "No commenters"
    return singlethreadlink, firstcommenter  # Return the variables
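The function can be exercised without any network access by passing it plain strings, since it calls `str()` on its argument anyway. The sketch below repeats the function so it runs by itself; the two HTML strings and the user names in them are made up for illustration, and catching `IndexError` specifically is an assumption about the intended failure case.

```python
import re

def scrapethread(cleanthread):
    # Same logic as the function above, repeated so this example is self-contained.
    singlethreadlink = re.findall(r'<a class="storylink" href="(.+?)">', str(cleanthread))
    commenterIDs = re.findall(r'user\?id=(.+?)"', str(cleanthread))
    try:
        firstcommenter = commenterIDs[1]
    except IndexError:
        firstcommenter = "No commenters"
    return singlethreadlink, firstcommenter

# Hypothetical thread with a submitter (alice) and one commenter (bob).
with_comment = (
    '<a class="storylink" href="https://example.com/post">Title</a>'
    '<a href="user?id=alice" class="hnuser">alice</a>'
    '<a href="user?id=bob" class="hnuser">bob</a>'
)
# Hypothetical thread with a submitter only.
without_comment = (
    '<a class="storylink" href="https://example.com/post">Title</a>'
    '<a href="user?id=alice" class="hnuser">alice</a>'
)

result_with = scrapethread(with_comment)
result_without = scrapethread(without_comment)

print(result_with)     # (['https://example.com/post'], 'bob')
print(result_without)  # (['https://example.com/post'], 'No commenters')
```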
results = []  # We want our results to come back as a list
for i in range(len(commentlinks)):
    thread = requests.get(commentlinks[i])  # Go to each link
    cleanthread = bs4.BeautifulSoup(thread.text, 'html.parser')
    link, commenter = scrapethread(cleanthread)  # Scrape the data and return them to these variables
    results.append(link + [commenter])  # Append the results - note that the link actually returns as a list, rather than a string
    time.sleep(30)  # Wait 30 seconds between requests so we don't overload the server