Michelle Dalal Jian MichelleDalalJian
@MichelleDalalJian
MichelleDalalJian / py4e_ex_13
Created November 24, 2017 16:40
Extracting Data from XML: The program will prompt for a URL, read the XML data from that URL using urllib, parse the XML to extract the comment counts, and compute the sum of those numbers.
from urllib import request
import xml.etree.ElementTree as ET
url = 'http://python-data.dr-chuck.net/comments_24966.xml'
print("Retrieving", url)
html = request.urlopen(url)
data = html.read()
print("Retrieved",len(data),"characters")
tree = ET.fromstring(data)
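The preview above stops right after parsing. A minimal sketch of the remaining step, summing the counts, might look like the following. The element name `count`, the function name `sum_comment_counts`, and the sample XML are assumptions for illustration and are not taken from the gist.

```python
import xml.etree.ElementTree as ET


def sum_comment_counts(xml_text):
    """Sum the integer text of every <count> element (assumed tag name)."""
    tree = ET.fromstring(xml_text)
    # './/count' matches <count> elements at any depth below the root.
    return sum(int(node.text) for node in tree.findall('.//count'))


# Made-up sample data, standing in for the downloaded document.
SAMPLE = """<commentinfo>
  <comments>
    <comment><name>A</name><count>97</count></comment>
    <comment><name>B</name><count>3</count></comment>
  </comments>
</commentinfo>"""

print(sum_comment_counts(SAMPLE))
```

Passing the bytes returned by `urlopen(...).read()` instead of a string also works, since `ET.fromstring` accepts both.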
MichelleDalalJian / py4e_ex_12_02
Created November 24, 2017 16:04
Following Links in Python: The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, and scan for the tag at a particular position in the list. It then follows that link, repeats the process a number of times, and reports the last name found.
from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "http://py4e-data.dr-chuck.net/known_by_Bryce.html"
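The link-following loop described above can be sketched without network access. This version swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages. The names `AnchorCollector` and `follow_links`, the `fetch` callable, and the three tiny pages are all hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def follow_links(start_url, position, repeats, fetch):
    """Follow the link at a 1-based position, `repeats` times in a row.

    `fetch` is any callable mapping a URL to that page's HTML text.
    """
    url = start_url
    for _ in range(repeats):
        parser = AnchorCollector()
        parser.feed(fetch(url))
        url = parser.hrefs[position - 1]
    return url


# Made-up pages standing in for the course data files.
PAGES = {
    'a.html': '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html': '<a href="c.html">C</a> <a href="a.html">A</a>',
    'c.html': '<a href="a.html">A</a> <a href="b.html">B</a>',
}

print(follow_links('a.html', position=2, repeats=2, fetch=PAGES.__getitem__))
```

Against a live site, `fetch` could wrap `urllib.request.urlopen(url).read().decode()`. Passing it in as a parameter keeps the loop itself easy to test.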
MichelleDalalJian / py4e_ex_12_01
Last active August 21, 2024 03:16
Scraping Numbers from HTML using BeautifulSoup. The program will use urllib to read the HTML from the data files below, and parse the data, extracting numbers and compute the sum of the numbers in the file.
#Actual data: http://py4e-data.dr-chuck.net/comments_24964.html (Sum ends with 73)
from urllib import request
from bs4 import BeautifulSoup
html = request.urlopen('http://python-data.dr-chuck.net/comments_24964.html').read()
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
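For readers without BeautifulSoup installed, the same idea can be sketched with the standard library alone. This is a substitute technique, not the gist's method. The class name `SpanNumberSummer` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class SpanNumberSummer(HTMLParser):
    """Add up the integers that appear as text inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Ignore whitespace and any non-numeric text inside a span.
        if self.in_span and data.strip().isdigit():
            self.total += int(data)


# Made-up markup; the real data files are not reproduced here.
SAMPLE = (
    '<tr><td>X</td><td><span class="comments">10</span></td></tr>'
    '<tr><td>Y</td><td><span class="comments">32</span></td></tr>'
)

summer = SpanNumberSummer()
summer.feed(SAMPLE)
print(summer.total)
```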
MichelleDalalJian / py4e_ex_12
Created October 7, 2017 14:53
Exploring the Hypertext Transfer Protocol: You are to retrieve the document at http://data.pr4e.org/intro-short.txt using HTTP in a way that lets you examine the HTTP response headers. There are three ways that you might retrieve this web page and look at the response headers: Preferred: Modify the socket1.py program to retrie…
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
    data = mysock.recv(512)
    if (len(data) < 1):
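Once the socket loop has collected the whole response, the headers still have to be separated from the body. A minimal sketch of that step follows, run on a made-up response rather than the real server's output. The function name `split_response` is hypothetical.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status line, headers, body)."""
    # A blank line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up response bytes, standing in for the chunks joined from recv().
RAW = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain\r\n'
       b'Content-Length: 5\r\n'
       b'\r\n'
       b'hello')

status, headers, body = split_response(RAW)
print(status)
for name, value in headers.items():
    print(name, '=', value)
```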
MichelleDalalJian / py4e_ex_11
Created October 7, 2017 14:48
Extracting Data With Regular Expressions: Finding Numbers in a Haystack. In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers. Data files: We provide two files for this assignment. One is a sample file where we give you the sum for your testi…
import re
hand = open("regex_sum_24962.txt")
x = list()
for line in hand:
    y = re.findall(r'[0-9]+', line)
    x = x + y
sum = 0
for z in x:
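The preview cuts off inside the summing loop. One way to finish the idea is sketched below on made-up lines instead of the assignment file. The function name `sum_numbers` is hypothetical.

```python
import re


def sum_numbers(lines):
    """Add up every run of digits found in an iterable of text lines."""
    total = 0
    for line in lines:
        # findall returns the digit runs as strings, e.g. ['3', '40'].
        for digits in re.findall(r'[0-9]+', line):
            total += int(digits)
    return total


# Made-up text; an open file object can be passed the same way.
SAMPLE_LINES = ["Why 3 and 40 matter", "no numbers here", "7 up"]
print(sum_numbers(SAMPLE_LINES))
```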
MichelleDalalJian / py4e_ex_10_02
Created October 7, 2017 14:44
10.2 Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon. From [email protected] Sat Jan 5 09:14:16 2008 Once you have accumulated the c…
name = input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
hand = open(name)
hours = dict()
for line in hand:
    if line.startswith("From "):
        hour = line.split()[5].split(':')[0]
        hours[hour] = hours.get(hour, 0) + 1
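The counting step above can be checked on a few lines without the mbox file. In this sketch the function name `hour_histogram` and the example.com addresses are made up. Only the line layout follows the 'From ' format quoted in the exercise.

```python
def hour_histogram(lines):
    """Count 'From ' lines per hour; return (hour, count) pairs sorted by hour."""
    hours = {}
    for line in lines:
        if line.startswith('From '):
            # Field 5 is the time, e.g. '09:14:16'; keep the part before the first colon.
            hour = line.split()[5].split(':')[0]
            hours[hour] = hours.get(hour, 0) + 1
    return sorted(hours.items())


# Made-up messages in the mbox 'From ' layout.
SAMPLE_LINES = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From carol@example.com Fri Jan  4 09:02:13 2008",
]

for hour, count in hour_histogram(SAMPLE_LINES):
    print(hour, count)
```

The "From:" header line is skipped because `startswith('From ')` requires the trailing space.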
MichelleDalalJian / py4e_ex_09_04
Created October 7, 2017 14:43
9.4 Write a program to read through the mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times th…
fname = input("Enter file:")
if len(fname) < 1 : fname = "mbox-short.txt"
hand = open(fname)
lst = list()
for line in hand:
    # The exercise asks for 'From ' lines (trailing space), not 'From:' headers
    if not line.startswith("From "): continue
    line = line.split()
    lst.append(line[1])
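The dictionary-and-maximum step that the exercise describes could be sketched as follows. The function name `top_sender` and the example.com addresses are assumptions for illustration.

```python
def top_sender(lines):
    """Return (address, count) for the most frequent 'From ' sender, or None."""
    counts = {}
    for line in lines:
        if not line.startswith('From '):
            continue
        sender = line.split()[1]
        counts[sender] = counts.get(sender, 0) + 1
    if not counts:
        return None
    # max() over the keys, ranked by each key's count.
    best = max(counts, key=counts.get)
    return best, counts[best]


# Made-up messages in the mbox 'From ' layout.
SAMPLE_LINES = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From alice@example.com Fri Jan  4 09:02:13 2008",
    "From: bob@example.com",
]

print(top_sender(SAMPLE_LINES))
```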
MichelleDalalJian / py4e_ex_08_05
Created October 7, 2017 12:54
8.5 Open the file mbox-short.txt and read it line by line. When you find a line that starts with 'From ' like the following line: From [email protected] Sat Jan 5 09:14:16 2008 You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out…
fhand = open("mbox-short.txt")
count = 0
for line in fhand:
    line = line.rstrip()
    if line == "": continue
    words = line.split()
    if words[0] != "From": continue
    print(words[1])
MichelleDalalJian / py4e_ex_08_04
Created October 7, 2017 12:53
8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in…
fhand = open("romeo.txt")
lst = list()
for line in fhand:
    line = line.rstrip()
    line = line.split()
    for i in line:
        if i not in lst:
            lst.append(i)
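The preview ends before the sort. A minimal sketch of the full build-then-sort procedure follows, run on made-up lines rather than romeo.txt. The function name `unique_sorted_words` is hypothetical.

```python
def unique_sorted_words(lines):
    """Collect each distinct word once, then return the words sorted."""
    words = []
    for line in lines:
        for word in line.split():
            if word not in words:
                words.append(word)
    words.sort()
    return words


# Made-up text standing in for the lines of the input file.
SAMPLE_LINES = ["the sun is up", "the moon is down"]
print(unique_sorted_words(SAMPLE_LINES))
```

A `set` would do the deduplication faster. The list version is kept here because the exercise asks for an explicit membership check before appending.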
MichelleDalalJian / py4e_ex_07_02
Created October 7, 2017 12:52
7.2 Write a program that prompts for a file name, then opens that file and reads through the file, looking for lines of the form: X-DSPAM-Confidence: 0.8475 Count these lines and extract the floating point values from each of the lines and compute the average of those values and produce an output as shown below. Do not use the sum() function or …
# Use the file name mbox-short.txt as the file name
fname = input("Enter file name: ")
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith("X-DSPAM-Confidence:") :
        count = count + 1
total = 0
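Counting and totalling can also happen in a single pass, which avoids reading the file twice. The sketch below makes that assumption and uses made-up values. The function name `average_confidence` is hypothetical. In line with the exercise, it does not call `sum()`.

```python
def average_confidence(lines):
    """Average the values on 'X-DSPAM-Confidence:' lines, or None if there are none."""
    prefix = 'X-DSPAM-Confidence:'
    count = 0
    total = 0.0
    for line in lines:
        if line.startswith(prefix):
            count = count + 1
            # float() tolerates the leading space and trailing newline.
            total = total + float(line[len(prefix):])
    if count == 0:
        return None
    return total / count


# Made-up lines in the header format quoted in the exercise.
SAMPLE_LINES = [
    "X-DSPAM-Confidence: 0.75\n",
    "Subject: unrelated header\n",
    "X-DSPAM-Confidence: 0.25\n",
]

print("Average spam confidence:", average_confidence(SAMPLE_LINES))
```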