@suranands
Created October 2, 2016 17:12
"""
Following Links in Python

In this assignment you will write a Python program that expands on
http://www.pythonlearn.com/code/urllinks.py. The program will use urllib to
read the HTML from the data files below, extract the href= values from the
anchor tags, scan for a tag that is in a particular position relative to the
first name in the list, follow that link, repeat the process a number of
times, and report the last name you find.

We provide two files for this assignment. One is a sample file where we give
you the name for your testing, and the other is the actual data you need to
process for the assignment.

- Sample problem: Start at http://python-data.dr-chuck.net/known_by_Fikret.html
  Find the link at position 3 (the first name is 1). Follow that link. Repeat
  this process 4 times. The answer is the last name that you retrieve.
  Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
  Last name in sequence: Anayah
- Actual problem: Start at http://python-data.dr-chuck.net/known_by_Inaara.html
  Find the link at position 18 (the first name is 1). Follow that link. Repeat
  this process 7 times. The answer is the last name that you retrieve.
  Hint: The first character of the name of the last page that you will load is: R

Strategy

The web pages tweak the height between the links and hide the page after a
few seconds to make it difficult for you to do the assignment without writing
a Python program. But frankly, with a little effort and patience you can
overcome these attempts to make it a little harder to complete the assignment
without writing a Python program. But that is not the point. The point is to
write a clever Python program to solve the problem.
"""
import re, urllib
from BeautifulSoup import *

all_links = []
all_names = []
url_first_part = 'http://python-data.dr-chuck.net/known_by_'
url_last_part = '.html'
first_entry = 'Inaara'

for i in range(7):
    url = url_first_part + first_entry + url_last_part
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    #def get_next_name(url)
    tags = soup('a')
    links = []
    for tag in tags:
        links.append(tag.get('href', None))
    url = links[17]  # position 18, counting from 1
    print url
    name = url[41:]  # drop the 41-character url_first_part
    next_entry = name[:-5]  # drop the trailing '.html'
    all_names.append(next_entry)
    first_entry = next_entry
    url = url_first_part + first_entry + url_last_part
    all_links.append(url)

print all_names[-1]
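
The slices `url[41:]` and `name[:-5]` above depend on the exact length of the URL prefix and of the `.html` suffix. Below is a minimal Python 3 sketch of a less brittle way to pull the name out of a link, assuming every link has the form `.../known_by_<Name>.html`. The helper name `name_from_url` is hypothetical and not part of the gist.

```python
import posixpath
from urllib.parse import urlparse


def name_from_url(url):
    """Return the <Name> part of a .../known_by_<Name>.html URL."""
    # Take the last path component, e.g. 'known_by_Anayah.html'.
    filename = posixpath.basename(urlparse(url).path)
    # Split off the extension, leaving e.g. 'known_by_Anayah'.
    stem, _ext = posixpath.splitext(filename)
    prefix = 'known_by_'
    if not stem.startswith(prefix):
        raise ValueError('unexpected link format: %r' % url)
    return stem[len(prefix):]


print(name_from_url('http://python-data.dr-chuck.net/known_by_Anayah.html'))
# prints: Anayah
```

This keeps working if the host name changes (for example to the py4e-data host used later in the thread), because it never counts characters in the prefix.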
@fushuai1229

For Python 3, here is new code that works for the problem:

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
coun=input('Enter your count:')
pos=input('Enter your position:')
print(url)

for i in range(int(coun)):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href',None)
    print (url)

@yts61

yts61 commented May 17, 2018

Would anyone help me with this question?

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = ('http://py4e-data.dr-chuck.net/known_by_Nabeel.html')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

post = int(input("Enter position: ")) -1 #The position of link relative to first link
count = int(input("Enter count: ")) #The number of times to be repeated

# Build a tag list
tags = soup('a')

# Check the list
#print (tags)
#retrieve all the links and put into dictionary
for tag in tags:
    #retrieve the url every 18
    url = tag.get('href',None)
    for i in range(count):
        ans=url[post]
        print (ans)

What is wrong with my code?

@Kajol-Kumari

In your code, url holds a single link as a string, so url[post] returns the character at the 18th position of that link rather than the 18th link in your tags. You also never open the link you picked, so nothing is followed.
Try the code below and compare it with yours, and you will see where it goes wrong:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL : ')
count = input('Enter count : ')
position = input('Enter position : ')

for i in range(int(count)):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    url = tags[int(position)-1].get('href', None)
    print(url)
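
To make the difference between indexing a link string and indexing the list of links concrete, here is a small offline sketch. The `example.com` links are made-up placeholders standing in for the href values that `soup('a')` would yield; they are not the assignment data.

```python
# Hypothetical href values, in the order they appear on a page.
links = [
    'http://example.com/known_by_A.html',
    'http://example.com/known_by_B.html',
    'http://example.com/known_by_C.html',
]

position = 3          # 1-based, as stated in the assignment
index = position - 1  # 0-based index used by Python

# Mistake: `url` is one link (a string), so indexing it yields one character.
url = links[0]
wrong = url[index]

# Intended: index the list of links to get a whole URL.
right = links[index]

print(repr(wrong))  # prints: 't'
print(right)        # prints: http://example.com/known_by_C.html
```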

@vedant-milind

I'm still not getting the answer. What's the name?

@snehasrinija2000

I am getting an error that the BeautifulSoup module is not available, even though I have downloaded bs4.

@vedant-milind

Unzip the bs4 archive and then copy the bs4 folder inside it to the desktop. Then use the line

from bs4 import BeautifulSoup

It'll work.

@tejas22198

What is the name? I can't find it.

@tejas22198

> I'm still not getting the answer. Whats the name ?

Did you get it?

@MohammedBasheerUddin

for i in range(int(coun)):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html,"html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href',None)
    print (url)

(First of all, great code. Could you explain why and how the variable "pos" is used? I cannot understand its use in the question either. It would be helpful if you explained this code. Thank you in advance.)

@snehasrinija2000

Thanks everyone for the support... The mistake was that I didn't unzip bs4... Thanks and sorry for the inconvenience

@BuvanasriAK

BuvanasriAK commented Jun 23, 2020

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
c = input('Enter count: ')
pos = input('Enter position: ')
print(url)

for i in range(int(c)):
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href',None)
    print(url)

# You get a URL and print it. Then you look for the anchor tag at index pos - 1
# and read the URL stored in its href attribute.
# Then you open that URL, print it, and repeat the same steps for the remaining
# number of times (count times in total).
# At the end, the last URL printed contains the answer.
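
The role of `pos` can be checked without network access. The sketch below follows links through three made-up in-memory pages; `PAGES`, `AnchorCollector`, and `follow_links` are hypothetical names introduced only for illustration. It uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without third-party packages. That is a substitution for the demo, not what the code above does.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Mirrors tag.get('href', None): None when href is absent.
            self.hrefs.append(dict(attrs).get('href'))


def follow_links(fetch, start_url, position, count):
    """Follow the link at 1-based `position`, `count` times.

    `fetch` is any callable that maps a URL to its HTML text.
    Returns the list of visited URLs, starting with `start_url`.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        parser = AnchorCollector()
        parser.feed(fetch(url))
        url = parser.hrefs[position - 1]  # 1-based position -> 0-based index
        visited.append(url)
    return visited


# Made-up pages standing in for the assignment's data files.
PAGES = {
    'a.html': '<a href="b.html">B</a><a href="c.html">C</a>',
    'b.html': '<a href="c.html">C</a><a href="a.html">A</a>',
    'c.html': '<a href="a.html">A</a><a href="b.html">B</a>',
}

path = follow_links(PAGES.__getitem__, 'a.html', position=2, count=3)
print(path)  # prints: ['a.html', 'c.html', 'b.html', 'a.html']
```

Here `position` plays the part of `pos`: the assignment counts links from 1, Python lists count from 0, hence the `position - 1`. Passing `fetch` as a parameter means the same loop could be pointed at a real downloader later.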
