These are the course notes for the HTTP & Web Servers Udacity course. Some parts of the notes are taken from the book HTTP: The Definitive Guide.
You'll be using the command line a lot in this course: many of the instructions will ask you to run commands in the terminal on your computer. You can use any common terminal program —
- On Windows 10, you can use the bash shell in Windows Subsystem for Linux.
- On earlier versions of Windows, you can use the Git Bash terminal program from Git.
- On Mac OS, you can use the built-in Terminal program, or another such as iTerm.
- On Linux, you can use any common terminal program such as gnome-terminal or xterm.
This course will not use a VM (virtual machine). Instead, you will be running code directly on your computer. This means you will need to have Python installed on your computer. The code in this course is built for Python 3, and will not all work in Python 2.
- Windows and Mac: Install it from python.org: https://www.python.org/downloads/
- Mac (with Homebrew): In the terminal, run
brew install python3
- Debian/Ubuntu/Mint: In the terminal, run
sudo apt-get install python3
To check if you already have Python 3.x installed, try:
$ python --version
or,
$ python3 --version
You'll also need to install `ncat`, which is part of the Nmap network testing toolkit. We'll be using `ncat` to investigate how web servers and browsers talk to each other.
- Windows: Download and run https://nmap.org/dist/nmap-7.30-setup.exe
- Mac (with Homebrew): In the terminal, run
brew install nmap
- Mac (without Homebrew): Download and install https://nmap.org/dist/nmap-7.30.dmg
- Debian/Ubuntu/Mint: In the terminal, run
sudo apt-get install nmap
To check whether `ncat` is installed and working, open up two terminals. In one of them, run `ncat -l 9999`; then in the other, run `ncat localhost 9999`. Then type something into each terminal and press Enter. You should see the message on the opposite terminal.
An HTTP transaction always involves a client and a server. You're using an HTTP client right now — your web browser. Your browser sends HTTP requests to web servers, and servers send responses back to your browser. Displaying a simple web page can involve dozens of requests — for the HTML page itself, for images or other media, and for additional data that the page needs.
HTTP was originally created to serve hypertext documents, but today is used for much more. As a user of the web, you're using HTTP all the time.
A server is just a program that accepts connections from other programs on the network.
When you start a server program, it waits for clients to connect to it — like the demo server waiting for your web browser to ask it for a page. Then when a connection comes in, the server runs a piece of code — like calling a function — to handle each incoming connection. A connection in this sense is like a phone call: it's a channel through which the client and server can talk to each other. Web clients send requests over these connections, and servers send responses back.
To run an HTTP server using Python, you can use the `http.server` module. (We'll refer to this server as the demo server.)

python3 -m http.server [port]

- `-m`: This flag tells Python to run the module as a script.
- `[port]`: The port number the server will be listening on. More on ports later.

When running the server in a directory containing files, you can view the files in that folder from your browser at the URL `http://localhost:[port]`.
The files served via the web server are called web resources.
A web resource is the source of web content. The simplest kind of web resource is a static file on the web server’s filesystem. These files can contain anything: they might be text files, HTML files, Microsoft Word files, Adobe Acrobat files, JPEG image files, AVI movie files, or any other format you can think of. — HTTP: The Definitive Guide
Once you view your files, the terminal where you ran the server will start logging some information:
Serving HTTP on 0.0.0.0 port 8000 ...
127.0.0.1 - - [11/Nov/2018 21:46:26] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [11/Nov/2018 21:46:26] "GET /css/normalize.min.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Nov/2018 21:46:26] "GET /css/app.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Nov/2018 21:46:27] "GET /img/something-to-remember.jpg HTTP/1.1" 200 -
127.0.0.1 - - [11/Nov/2018 21:46:27] "GET /img/village-in-the-valley.jpg HTTP/1.1" 200 -
127.0.0.1 - - [11/Nov/2018 21:46:27] "GET /js/app.js HTTP/1.1" 200 -
Each line of this log corresponds to an HTTP request.
Each web server resource has a name, so clients can point out what resources they are interested in. The server resource name is called a uniform resource identifier, or URI. URIs are like the postal addresses of the Internet, uniquely identifying and locating information resources around the world. — HTTP: The Definitive Guide
The uniform resource locator (URL) is the most common form of resource identifier. URLs describe the specific location of a resource on a particular server. They tell you exactly how to fetch a resource from a precise, fixed location. — HTTP: The Definitive Guide
URIs are made out of several different parts, each of which has its own syntax. Many of these parts are optional, which is why URIs for different services look so different from one another.
Let's take this URI as an example: `https://www.example.com/path/to/file`.

This URI has three visible parts, separated by a little bit of punctuation:

- `https` is the scheme
- `www.example.com` is the hostname
- `/path/to/file` is the path
The first part of a URI is the scheme, which tells the client how to go about accessing the resource. Some URI schemes you've seen before include http, https, and file. File URIs tell the client to access a file on the local filesystem. HTTP and HTTPS URIs point to resources served by a web server.
HTTP and HTTPS URIs look almost the same. The difference is that when a client goes to access a resource with an HTTPS URI, it will use an encrypted connection to do it. Encrypted Web connections were originally used to protect passwords and credit-card transactions, but today many sites use them to help protect users' privacy.
There are many other URI schemes out there, though. You can take a look at the official URI Scheme list!
In an HTTP URI, the next thing that appears after the scheme is a hostname — something like `www.udacity.com` or `localhost`. This tells the client which server to connect to.
You'll often see web addresses written as just a hostname in print. But in the HTML code of a web page, you can't write `<a href="www.google.com">this</a>` and get a working link to Google. A hostname can only appear after a URI scheme that supports it, such as http or https. In these URIs, there will always be a `://` between the scheme and hostname.
By the way, not every URI has a hostname. For instance, a `mailto` URI just has an email address: `mailto:[email protected]` is a well-formed `mailto` URI. This also reveals a bit more about the punctuation in URIs: the `:` goes after the scheme, but the `//` goes before the hostname. Mailto links don't have a hostname part, so they don't have a `//`.
In an HTTP URI (and many others), the next thing that appears is the path, which identifies a particular resource on a server. A server can have many resources on it — such as different web pages, videos, or APIs. The path tells the server which resource the client is looking for.
On the demo server, the paths you see will correspond to files on your filesystem. But that's just the demo server. In the real world, URI paths don't necessarily equate to specific filenames. For instance, if you do a Google search, you'll see a URI path such as `/search?q=ponies`. This doesn't mean that there's literally a file on a server at Google with a filename of `search?q=ponies`. The server interprets the path to figure out what resource to send. In the case of a search query, it sends back a search result page that maybe never existed before.
When you write a URI without a path, such as `http://udacity.com`, the browser fills in the default path, which is written with a single slash. That's why `http://udacity.com` is the same as `http://udacity.com/` (with a slash on the end).
The path written with just a single slash is also called the root. When you look at the root URI of the demo server — `http://localhost:8000/` — you're not looking at the root of your computer's whole filesystem. It's just the root of the resources served by the web server. The demo server won't let a web browser access files outside the directory that it's running in.
Take a look at the HTML source for the demo server's root page. Find one of the `<a>` tags that links to a file. In mine, I have a file called `cliffsofinsanity.png`, so there's an `<a>` tag that looks like this:
<a href="cliffsofinsanity.png">cliffsofinsanity.png</a>
URIs like this one don't have a scheme, or a hostname — just a path. This is a relative URI reference. It's "relative" to the context in which it appears — specifically, the page it's on. This URI doesn't include the hostname or port of the server it's on, but the browser can figure that out from context. If you click on one of those links, the browser knows from context that it needs to fetch it from the same server that it got the original page from.
There are many other parts that can occur in a URI. Consider the difference between these two Wikipedia URIs:
If you follow these links in your browser, it will fetch the same page from Wikipedia's web server. But the second one displays the page scrolled to the section about the discovery of oxygen. The part of the URI after the `#` sign is called a fragment. The browser doesn't even send it to the web server. It lets a link point to a specific named part of a resource; in HTML pages, it links to an element by `id`.
In contrast, consider this Google Search URI:
The `?q=fish` part is the query part of the URI. This does get sent to the server.
ℹ️ Read more about URIs here: URI - Generic Syntax.
A full HTTP or HTTPS URI includes the hostname of the web server, like `www.udacity.com` or `www.un.int` or `www.cheeseboardcollective.coop`. A hostname in a URI can also be an IP address: for instance, if you put `http://216.58.194.174/` in your browser, you'll end up at Google.
Why is it called a hostname? In network terminology, a host is a computer on the network; one that could host services.
The Internet tells computers apart by their IP addresses; every piece of network traffic on the Internet is labeled with the IP addresses of the sending and receiving computers. In order to connect to a web server such as `www.udacity.com`, a client needs to translate the hostname into an IP address. Your operating system's network configuration uses the Domain Name System (DNS) — a set of servers maintained by Internet Service Providers (ISPs) and other network users — to look up hostnames and get back IP addresses.
IP addresses come in two different varieties: the older IPv4 and the newer IPv6. When you see an address like `127.0.0.1` or `216.58.194.164`, those are IPv4 addresses. IPv6 addresses are much longer, such as `2607:f8b0:4005:804::2004`, although they can also be abbreviated.
The IPv4 address `127.0.0.1` and the IPv6 address `::1` are special addresses that mean "this computer itself" — for when a client (like your browser) is accessing a server on your own computer. The hostname `localhost` refers to these special addresses.
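You can check this resolution from Python with the standard `socket` module; a tiny sketch:

```python
import socket

# Resolve the hostname 'localhost' to an IPv4 address.
ip = socket.gethostbyname('localhost')
print(ip)   # 127.0.0.1 on a typical system
```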
When you run the demo server, it prints a message saying that it's listening on `0.0.0.0`. This is not a regular IP address. Instead, it's a special code for "every IPv4 address on this computer". That includes the `localhost` address, but it also includes your computer's regular IP address.
When you told your browser to connect to the demo server, you gave it the URI `http://localhost:8000/`. This URI has a port number of `8000`. But most of the web addresses you see in the wild don't have a port number on them. This is because the client usually figures out the port number from the URI scheme.

For instance, HTTP URIs imply a port number of `80`, whereas HTTPS URIs imply a port number of `443`. Your Python demo web server is running on port `8000`. Since this isn't the default port, you have to write the port number in URIs for it.
What's a port number, anyway? To get into that, we need to talk about how the Internet works. All of the network traffic that computers send and receive — everything from web requests, to login sessions, to file sharing — is split up into messages called packets. Each packet has the IP addresses of the computer that sent it, and the computer that receives it. And (with the exception of some low-level packets, such as ping) it also has the port number for the sender and recipient. IP addresses distinguish computers; port numbers distinguish programs on those computers.
We say that a server "listens on" a port, such as `80` or `8000`. "Listening" means that when the server starts up, it tells its operating system that it wants to receive connections from clients on a particular port number. When a client (such as a web browser) "connects to" that port and sends a request, the operating system knows to forward that request to the server that's listening on that port.
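Here's a minimal sketch of the same idea at the socket level, using Python's standard `socket` module (port 0 asks the OS to pick any free port):

```python
import socket

# Create a TCP socket and tell the OS we want to receive connections.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('localhost', 0))   # port 0: let the OS assign a free port
listener.listen(1)                # this is the "listening" step
port = listener.getsockname()[1]
print("Listening on port", port)

# A client connects to that host and port...
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('localhost', port))

# ...and the OS hands the connection to the listening program.
conn, addr = listener.accept()
client.sendall(b'hello')
received = conn.recv(5)
print(received)   # b'hello'

client.close(); conn.close(); listener.close()
```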
Why do we use port `8000` instead of `80` for the demo server? For historical reasons, operating systems only allow the administrator (or root) account to listen on ports below 1024. This is fine for production web servers, but it's not convenient for learning.
Take a look back at the server logs on your terminal, where the demo server is running. When you request a page from the demo server, an entry appears in the logs with a message like this:
127.0.0.1 - - [03/Oct/2016 15:45:50] "GET /readme.png HTTP/1.1" 200 -
Take a look at the part right after the date and time. Here, it says `"GET /readme.png HTTP/1.1"`. This is the text of the request line that the browser sent to the server. This log entry is the server telling you that it received a request that said, literally, `GET /readme.png HTTP/1.1`.
This request has three parts.
The word `GET` is the method or HTTP verb being used; this says what kind of request is being made. `GET` is the verb that clients use when they want a server to send a resource, such as a web page or image. Later, we'll see other verbs that are used when a client wants to do other things, such as submit a form or make changes to a resource.

`/readme.png` is the path of the resource being requested. Notice that the client does not send the whole URI of the resource here. It doesn't say `http://localhost:8000/readme.png`. It just sends the path.

Finally, `HTTP/1.1` is the protocol of the request. Over the years, there have been several changes to the way HTTP works. Clients have to tell servers which dialect of HTTP they're speaking. HTTP/1.1 is the most common version today.
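Since the request line is just three words separated by spaces, it's easy to pick apart in Python; a small sketch:

```python
request_line = "GET /readme.png HTTP/1.1"

# Split the request line into its three parts.
method, path, protocol = request_line.split()
print(method)    # GET
print(path)      # /readme.png
print(protocol)  # HTTP/1.1
```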
There are tools that allow you to send requests to a web server just like a client (browser) does. Ncat is one of those tools.
You can use the `ncat` command to connect to the demo server and send it an HTTP request by hand. (Make sure the demo server is still running!)

Use `ncat 127.0.0.1 8000` to connect your terminal to the demo server.
Then type these two lines:
GET / HTTP/1.1
Host: localhost
After the second line, press Enter twice. As soon as you do, the response from the server will be displayed on your terminal. Depending on the size of your terminal, and the number of files the web server sees, you will probably need to scroll up to see the beginning of the response!
Take another look at what you got back from the web server when you sent that GET request using `ncat`.
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.5.2
Date: Mon, 12 Nov 2018 00:53:01 GMT
Content-type: text/html
Content-Length: 4377
Last-Modified: Thu, 08 Nov 2018 23:12:00 GMT
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
## ...
This is an HTTP Response. One of these exchanges — a request and response — is happening every time your browser asks a server for a page, an image, or anything else.
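You can reproduce the `ncat` exercise from Python, too. This sketch starts the demo server (the same `http.server` module) in a background thread on a spare port, sends the same two request lines over a raw socket, and prints the status line it gets back:

```python
import socket
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Start the demo server on port 0 so the OS picks a free port.
server = HTTPServer(('localhost', 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Connect and send the same two lines we typed into ncat,
# followed by the blank line that ends the request.
sock = socket.create_connection(('localhost', server.server_port))
sock.sendall(b'GET / HTTP/1.1\r\nHost: localhost\r\n\r\n')

reply = sock.recv(4096).decode()
status_line = reply.splitlines()[0]
print(status_line)   # HTTP/1.0 200 OK

sock.close()
server.shutdown()
```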
Make another GET request for `/`, but this time to `google.com` on port `80`.
$ ncat google.com 80
and then
GET / HTTP/1.1
Host: google.com
Make sure to send `Host: google.com` exactly ... don't slip a `www` in there. These are actually different hostnames, and we want to take a look at the difference between them. Then press Enter twice!

The response you'll receive will look somewhat like this, with certain details changed, such as the date:
# => this is the status line
HTTP/1.1 301 Moved Permanently
# => These are the headers
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Mon, 12 Nov 2018 01:17:08 GMT
Expires: Wed, 12 Dec 2018 01:17:08 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
# => This is the response body
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
The HTTP response is made up of three parts: the status line, some headers, and a response body.
The status line is the first line of text that the server sends back. The headers are the other lines up until the first blank line. The response body is the rest — in this case, it's a piece of HTML.
Note: I have added comments showing which is which, but in reality the comments won't be there.
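Since the headers end at the first blank line, you can split a raw response into its three parts with ordinary string handling. A sketch, using a shortened version of the response above:

```python
# A shortened version of the Google response, as one Python string.
raw_response = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: http://www.google.com/\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<HTML><BODY>The document has moved</BODY></HTML>"
)

# The headers end at the first blank line; the rest is the body.
head, body = raw_response.split('\r\n\r\n', 1)
lines = head.split('\r\n')
status_line = lines[0]
headers = dict(line.split(': ', 1) for line in lines[1:])

print(status_line)          # HTTP/1.1 301 Moved Permanently
print(headers['Location'])  # http://www.google.com/
print(body)
```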
In the response we got from the demo server, the status line said `HTTP/1.0 200 OK`. In the one from Google, it says `HTTP/1.1 301 Moved Permanently`.
. The status line tells the client whether the server understood the request, whether the server has the resource the client asked for, and how to proceed next. It also tells the client which dialect of HTTP the server is speaking.
The numbers `200` and `301` here are HTTP status codes. There are dozens of different status codes. The first digit of the status code indicates the general success of the request. As a shorthand, web developers describe all of the codes starting with 2 as "2xx" codes, for instance — the x's mean "any digit".
- 1xx — Informational. The request is in progress or there's another step to take.
- 2xx — Success! The request succeeded. The server is sending the data the client asked for.
- 3xx — Redirection. The server is telling the client about a different URI it should redirect to. The headers will usually contain a `Location` header with the updated URI. Different codes tell the client whether a redirect is permanent or temporary.
- 4xx — Client error. The server didn't understand the client's request, or can't or won't fill it. Different codes tell the client whether it was a bad URI, a permissions problem, or another sort of error.
- 5xx — Server error. Something went wrong on the server side.
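The "2xx" shorthand maps directly onto integer division; here's a tiny sketch with a made-up helper name:

```python
# Hypothetical helper: classify a status code by its first digit.
def status_class(code):
    return '%dxx' % (code // 100)

print(status_class(200))  # 2xx
print(status_class(301))  # 3xx
print(status_class(404))  # 4xx
```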
You can find out much more about HTTP status codes in these two resources:
An HTTP response can include many headers. Each header is a line that starts with a keyword, such as `Location` or `Content-type`, followed by a colon and a value. Headers are a sort of metadata for the response. They aren't displayed by browsers or other clients; instead, they tell the client various information about the response.
Many, many features of the Web are implemented using headers. For instance, cookies are a Web feature that lets servers store data on the browser, for instance to keep a user logged in. To set a cookie, the server sends the `Set-Cookie` header. The browser will then send the cookie data back in a `Cookie` header on subsequent requests. You'll see more about cookies later in this course.
A `Content-type` header indicates the kind of data that the server is sending. It includes a general category of content as well as the specific format. For instance, a PNG image file will come with the `Content-type` value `image/png`. If the content is text (including HTML), the server will also tell what encoding it's written in. UTF-8 is a very common choice here, and it's the default for Python text anyway.
Very often, the headers will contain more metadata about the response body. For instance, both the demo server and Google also send a `Content-Length` header, which tells the client how long (in bytes) the response body will be. If the server sends this, then the client can reuse the connection to send another request after it's read the first response. Browsers use this so they can fetch multiple pieces of data (such as images on a web page) without having to reconnect to the server.
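Note that `Content-Length` counts bytes, not characters, so the body has to be encoded before measuring it; a quick sketch:

```python
body = "Hello, HTTP!\n"

# Content-Length is the length of the *encoded* body in bytes.
content_length = len(body.encode('utf-8'))
print(content_length)  # 13

# With non-ASCII text, bytes and characters differ:
snowman = "☃"
print(len(snowman))                  # 1 character
print(len(snowman.encode('utf-8')))  # 3 bytes
```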
The headers end with a blank line. Everything after that blank line is part of the response body. If the request was successful (a `200 OK` status, for instance), this is a copy of whatever resource the client asked for — such as a web page, image, or other piece of data.
But in the case of an error, the response body is where the error message goes! If you request a page that doesn't exist, and you get a `404 Not Found` error, the actual error message shows up in the response body.
In the last lesson, you used the built-in demo web server from the Python `http.server` module. But the demo server is just that — a demonstration of the module's abilities. Just serving static files out of a directory is hardly the only thing you can do with HTTP. In this lesson, you'll build a few different web services using `http.server`, and learn more about HTTP at the same time. You'll also use another module, `requests`, to write code that acts as an HTTP client.
Web servers using `http.server` are made of two parts: the `HTTPServer` class, and a request handler class. The first part, the `HTTPServer` class, is built in to the module and is the same for every web service. It knows how to listen on a port and accept HTTP requests from clients. Whenever it receives a request, it hands that request off to the second part — a request handler — which is different for every web service.
Here's what your Python code will need to do in order to run a web service:
- Import `http.server`, or at least the pieces of it that you need.
- Create a subclass of `http.server.BaseHTTPRequestHandler`. This is your handler class.
- Define a method on the handler class for each HTTP verb you want to handle. (The only HTTP verb you've seen yet in this course is `GET`, but that will be changing soon.)
  - The method for GET requests has to be called `do_GET`.
  - Inside the method, call built-in methods of the handler class to read the HTTP request and write the response.
- Create an instance of `http.server.HTTPServer`, giving it your handler class and server information — particularly, the port number.
- Call the `HTTPServer` instance's `serve_forever` method. Once you call it, the server does just that — it runs forever, until stopped. Just as in the last lesson, if you have a Python server running and you want to stop it, type `Ctrl-C` into the terminal where it's running. (You may need to type it two or three times.)
Following the instructions in the previous section, we will end up with code like this:
#!/usr/bin/env python3
#
# The *hello server* is an HTTP server that responds to a GET request by
# sending back a friendly greeting. Run this program in your terminal and
# access the server at http://localhost:8000 in your browser.

from http.server import HTTPServer, BaseHTTPRequestHandler

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # First, send a 200 OK response.
        self.send_response(200)

        # Then send headers.
        self.send_header('Content-type', 'text/plain; charset=utf-8')
        self.end_headers()

        # Now, write the response body.
        self.wfile.write("Hello, HTTP!\n".encode())

if __name__ == '__main__':
    server_address = ('', 8000)  # Serve on all addresses, port 8000.
    httpd = HTTPServer(server_address, HelloHandler)
    httpd.serve_forever()
- In the first line of code, we import the parts we need from the `http.server` module — in this case, `HTTPServer` and `BaseHTTPRequestHandler`:
from http.server import HTTPServer, BaseHTTPRequestHandler
- After that, we create a handler class:
class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
The handler class `HelloHandler` inherits from the `BaseHTTPRequestHandler` parent class, which is defined in `http.server`. We've defined one method, `do_GET`, which handles HTTP GET requests. When the web server receives a GET request, it will call this method to respond to it.
As we've seen in the previous section about HTTP responses, there are three things the server needs to send in an HTTP response: a status code, some headers, and the response body. The handler parent class has methods for doing each of these things. Inside `do_GET`, we call them in order.
- The first thing the server needs to do is send a 200 OK status code; the `send_response` method does this.
# First, send a 200 OK response.
self.send_response(200)
- The next thing the server needs to do is send HTTP headers. The parent class supplies the `send_header` and `end_headers` methods for doing this.
# Then send headers.
self.send_header('Content-type', 'text/plain; charset=utf-8')
self.end_headers()
For now, we make the server send a single header line — the `Content-type` header, telling the client that the response body will be in UTF-8 plain text.
- The last part of the `do_GET` method writes the response body.
# Now, write the response body.
self.wfile.write("Hello, HTTP!\n".encode())
The parent class gives us a variable called `self.wfile`, which is used to send the response. The name `wfile` stands for writeable file. Python, like many other programming languages, makes an analogy between network connections and open files: they're things you can read and write data to. Some file objects are read-only; some are write-only; and some are read/write.
`self.wfile` represents the connection from the server to the client; and it is write-only, hence the name. Any binary data written to it with its `write` method gets sent to the client as part of the response. Here, I'm writing a friendly hello message.
The `.encode()` method is discussed below.
- This code will run when we run this module as a Python program, rather than importing it.
if __name__ == '__main__':
    server_address = ('', 8000)  # Serve on all addresses, port 8000.
    httpd = HTTPServer(server_address, HelloHandler)
    httpd.serve_forever()
The `HTTPServer` constructor needs to know what address and port to listen on; it takes these as a tuple that we are calling `server_address`. We also give it the `HelloHandler` class, which it will use to handle each incoming client request.
At the very end of the file, I call `serve_forever` on the `HTTPServer`, telling it to start handling HTTP requests. And that starts the web server running.
An HTTP response could contain any kind of data, not only text. And so the `self.wfile.write` method in the handler class expects to be given a `bytes` object — a piece of arbitrary binary data — which it writes over the network in the HTTP response body.
If you want to send a string over the HTTP connection, you have to encode the string into a `bytes` object. The `encode` method on strings translates the string into a `bytes` object, which is suitable for sending over the network. There is, similarly, a `decode` method for turning `bytes` objects into strings.
For more details, The Python Unicode HOWTO is a definitive guide to the history of string encodings in Python.
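A quick sketch of the round trip:

```python
# Encode a string into bytes for sending over the network...
data = "Hello, HTTP!\n".encode('utf-8')
print(type(data))   # <class 'bytes'>
print(data)         # b'Hello, HTTP!\n'

# ...and decode received bytes back into a string.
text = data.decode('utf-8')
print(text == "Hello, HTTP!\n")  # True
```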
`BaseHTTPRequestHandler` provides a number of class and instance variables, and methods, for use by subclasses. One of those instance variables is `path`, which contains the request path. So you'll need to access `self.path` to get the request path.
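For instance, here's a sketch of a handler that just echoes the request path back to the client; the handler name and the test request at the bottom are made up for illustration:

```python
import threading
import http.client
from http.server import HTTPServer, BaseHTTPRequestHandler

class EchoPathHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path holds the path from the request line, e.g. '/anything'.
        self.send_response(200)
        self.send_header('Content-type', 'text/plain; charset=utf-8')
        self.end_headers()
        self.wfile.write(self.path.encode())

# Run the server in a background thread on a spare port, then request a path.
server = HTTPServer(('localhost', 0), EchoPathHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('localhost', server.server_port)
conn.request('GET', '/anything/at/all')
body = conn.getresponse().read().decode()
print(body)   # /anything/at/all
server.shutdown()
```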
When you take a look at a URI for a major web service, you'll often see several query parameters, which are a sort of variable assignment that occurs after a `?` in the URI. For instance, here's a Google Image Search URI:
This will be sent to the web server as this HTTP request:
GET /search?q=gray+squirrel&tbm=isch HTTP/1.1
Host: www.google.com
The query part of the URI is the part after the `?` mark. Conventionally, query parameters are written as `key=value` and separated by `&` signs. So the above URI has two query parameters, `q` and `tbm`, with the values `gray+squirrel` and `isch`.
There is a Python library called `urllib.parse` that knows how to unpack query parameters and other parts of an HTTP URL. (The library doesn't work on all URIs, only on some URLs.)

Take a look at the `urllib.parse` documentation here.
We'll first be looking at two of the functions the library exposes. The `urlparse` function, as the name suggests, parses a URL into six components, returning a 6-tuple. The `parse_qs` function parses a query string, returning a dictionary that maps each key to a list of all the values associated with it.

Putting that into practice:
>>> from urllib.parse import urlparse, parse_qs
>>> address = 'https://www.google.com/search?q=gray+squirrel&tbm=isch'
>>> parts = urlparse(address)
>>> print(parts)
ParseResult(scheme='https', netloc='www.google.com', path='/search', params='', query='q=gray+squirrel&tbm=isch', fragment='')
>>> print(parts.query)
q=gray+squirrel&tbm=isch
>>> query = parse_qs(parts.query)
>>> query
{'q': ['gray squirrel'], 'tbm': ['isch']}
Did you notice that `'gray+squirrel'` in the query string became `'gray squirrel'` in the output of `parse_qs`? HTTP URLs aren't allowed to contain spaces or certain other characters. So if you want to send these characters in an HTTP request, they have to be translated into a "URL-safe" or "URL-quoted" format.
"Quoting" in this sense doesn't have to do with quotation marks, the kind you find around Python strings. It means translating a string into a form that doesn't have any special characters in it, but in a way that can be reversed (unquoted) later.
(And if that isn't confusing enough, it's sometimes also referred to as URL-encoding or URL-escaping).
One of the features of the URL-quoted format is that spaces are sometimes translated into plus signs. Other special characters are translated into hexadecimal codes that begin with the percent sign.
Take a look at the documentation for `urllib.parse.quote` and related functions. Later in the course, when you want to construct a URI in your code, you'll need to use appropriate quoting. More generally, whenever you're working on a web application and you find spaces or percent signs in places you don't expect them to be, it means that something needs to be quoted or unquoted.
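A quick sketch of quoting and unquoting with `urllib.parse`:

```python
from urllib.parse import quote, unquote, quote_plus, unquote_plus

# quote_plus uses + for spaces, as in query strings.
print(quote_plus('gray squirrel'))     # gray+squirrel
print(unquote_plus('gray+squirrel'))   # gray squirrel

# quote uses %-codes for spaces and other special characters.
print(quote('/path with spaces/50%'))  # /path%20with%20spaces/50%25
print(unquote('/path%20with%20spaces/50%25'))  # /path with spaces/50%
```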
This is just a quick refresher. Let's consider the following form:
<!DOCTYPE html>
<title>Login Page</title>
<form action="http://localhost:8000/" method="GET">
<label>Username:
<input type="text" name="username">
</label>
<br>
<label>Password:
<input type="password" name="pw">
</label>
<br>
<button type="submit">Log in!</button>
</form>
The `action` attribute defines where the data gets sent. Its value must be a valid URL, in this case `http://localhost:8000/`. If this attribute isn't provided, the data will be sent to the URL of the page containing the form.
The `method` attribute (obviously) specifies the HTTP verb the browser should use to submit the form, in this case a GET request.
If you open up this form in your browser, fill it in, and submit it, whatever server is listening on port `8000` on your local machine will receive a GET request for the root path, with a query string containing the username and password. More on this in the next section.
When a browser submits a form via `GET`, it puts all of the form fields into the URI that it sends to the server. These are sent as a query, in the request path — just like search engines do. They're all jammed together into a single line. Since they're in the URI, the user can bookmark the resulting page, reload it, and so forth.
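The browser builds that query string the same way `urllib.parse.urlencode` does; a sketch with made-up form values:

```python
from urllib.parse import urlencode

# Form fields become a single key=value&key=value query string.
fields = {'username': 'alice', 'pw': 'swordfish'}
query = urlencode(fields)
print(query)  # username=alice&pw=swordfish

# The browser appends it to the form's action URL after a '?'.
print('http://localhost:8000/?' + query)
```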
This is fine for search engine queries, but it's not quite what we would want for (say) a form that adds an item to your shopping cart on an e-commerce site, or posts a new message on a comments board. `GET` methods are good for search forms and other actions that are intended to look something up or ask the server for a copy of some resource. But `GET` is not recommended for actions that are intended to alter or create a resource. For this sort of action, HTTP has a different verb, `POST`.
Vocabulary word of the day: idempotent. An action is idempotent if doing it twice (or more) produces the same result as doing it once. "Show me the search results for 'polar bear'" is an idempotent action, because doing it a second time just shows you the same results. "Add a polar bear to my shopping cart" is not, because if you do it twice, you end up with two polar bears.
POST requests are not idempotent. If you've ever seen a warning from your browser asking you if you really mean to resubmit a form, what it's really asking is if you want to do a non-idempotent action a second time.
Important note if you're ever asked about this in a job interview: idempotent is pronounced like "eye-dem-poe-tent", or rhyming with "Hide 'em, Joe Tent" — not like "id impotent".
Let's take this form:
<!DOCTYPE html>
<title>Testing POST requests</title>
<form action="http://localhost:9999/" method="GET">
<label>Magic input:
<input type="text" name="magic" value="mystery">
</label>
<br>
<label>Secret input:
<input type="text" name="secret" value="spooky">
</label>
<br>
<button type="submit">Do a thing!</button>
</form>
If you use ncat -l 9999 to listen on port 9999 and submit the form above, you'll receive a request like this in your terminal:
GET /?magic=mystery&secret=spooky HTTP/1.1
Host: localhost:9999
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3608.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Now if you take the same form and change the form method from GET to POST, like so:
<!DOCTYPE html>
<title>Testing POST requests</title>
<form action="http://localhost:9999/" method="POST">
<!-- the rest of the form goes here -->
and re-submit the form, you'll receive a request like this one:
POST / HTTP/1.1
Host: localhost:9999
Connection: keep-alive
Content-Length: 27
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
Origin: null
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3608.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
magic=mystery&secret=spooky
Do you see anything different between the two requests?
In the GET request, the form data is in the URI path, while in the POST request the form data is in the body of the request.
When a browser submits a form as a POST request, it doesn't encode the form data in the URI path, the way it does with a GET request. Instead, it sends the form data in the request body, underneath the headers. The request also includes Content-Type and Content-Length headers, which we've previously only seen on HTTP responses.
By the way, the names of HTTP headers are case-insensitive. So there's no difference between writing Content-Length or content-length or even cOnTent-LeNgTh … except, of course, that humans will read your code and be confused by that last one.
Let's say we want to build a messageboard server. When a user goes to the main page in their browser, it'll display a form for writing messages, as well as a list of the previously written messages. Submitting the form will send a request to the server, which stores the submitted message and then re-displays the main page.
- Q: Which HTTP methods do you think this server will need to use?
- TL;DR: GET for viewing messages, and POST for submitting them.
Why don't we want to use GET for submitting the form? Imagine if a user did this. They write a message and press the submit button … and the message text shows up in their URL bar. If they press reload, it sends the message again. If they bookmark that URL and go back to it, it sends the message again. This doesn't seem like such a great experience. So we'll use POST for message submission, and GET to display the main page.
Here's the form that we'll be submitting the message with:
<!DOCTYPE html>
<title>Message Board</title>
<form method="POST" action="http://localhost:8000/">
<textarea name="message"></textarea>
<br>
<button type="submit">Post it!</button>
</form>
Previously we've written handler classes that have just a single method, do_GET. But a handler class can have a do_POST method as well, to support both GET and POST requests. This is exactly how the messageboard server will work. When a GET request comes in, the server will send the HTML form and the current messages. When a POST request comes in with a new message, the server will store the message in a list, and then return all the messages it's seen so far.
The code for a do_POST method will need to do some pretty different things from a do_GET method. When we're handling a GET request, all the user data in the request is in the URI path. But in a POST request, it's in the request body. Inside do_POST, our code can read the request body by calling the self.rfile.read method. self.rfile is a file object, like the self.wfile we saw earlier — but rfile is for reading the request, rather than writing the response.
However, self.rfile.read needs to be told how many bytes to read … in other words, how long the request body is. For that, we'll use the Content-Length header the client sends.
The handler class gives us access to the HTTP headers as the instance variable self.headers, an object that acts a lot like a Python dictionary. The keys of this dictionary are the header names, and they're case-insensitive: it doesn't matter if you look up 'content-length' or 'Content-Length'. The values in this dictionary are strings: if the request body is 140 bytes long, the value of the Content-Length entry will be the string "140". We need to call self.rfile.read(140) to read 140 bytes, so once we read the header, we'll need to convert its value to an integer.
But in an HTTP request, it's also possible that the body will be empty, in which case the browser might not send a Content-Length header at all. This means we have to be a little careful when accessing the headers from the self.headers object. If we do self.headers['content-length'] and there's no such header, our code will crash with a KeyError. Instead, we'll use the .get dictionary method to look the header value up safely.
So here's a little bit of code that can go in the do_POST handler to find the length of the request body and read it:
length = int(self.headers.get('Content-length', 0))
data = self.rfile.read(length).decode()
Once we've read the message body, we can use urllib.parse.parse_qs to extract the POST parameters from it.
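For instance, given the form body from the example above, parse_qs gives back a dictionary mapping each field name to a list of values (a list, because HTML forms can repeat a field name):

```python
from urllib.parse import parse_qs

# A URL-encoded form body, as a browser would send it in a POST request.
body = "magic=mystery&secret=spooky"
params = parse_qs(body)
print(params)              # {'magic': ['mystery'], 'secret': ['spooky']}
print(params["magic"][0])  # mystery
```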
With that, we can now build a do_POST method!
#!/usr/bin/env python3
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import parse_qs

class MessageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # How long was the message?
        length = int(self.headers.get('Content-length', 0))

        # Read the correct amount of data from the request.
        data = self.rfile.read(length).decode()

        # Extract the "message" field from the request data.
        message = parse_qs(data)["message"][0]

        # Send the "message" field back as the response.
        self.send_response(200)
        self.send_header('Content-type', 'text/plain; charset=utf-8')
        self.end_headers()
        self.wfile.write(message.encode())

if __name__ == '__main__':
    server_address = ('', 8000)
    httpd = HTTPServer(server_address, MessageHandler)
    httpd.serve_forever()
So far, this server only handles POST requests. To submit the form to it, we have to load up the form in the browser as a separate HTML file. It would be much more useful if the server could serve the form itself.
This is pretty straightforward to do. We can put the form in a variable as a Python string (in triple quotes), and then write a do_GET method that sends the form.
#!/usr/bin/env python3
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import parse_qs

form = '''<!DOCTYPE html>
<title>Message Board</title>
<form method="POST" action="http://localhost:8000/">
<textarea name="message"></textarea>
<br>
<button type="submit">Post it!</button>
</form>
'''

class MessageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # How long was the message?
        length = int(self.headers.get('Content-length', 0))

        # Read the correct amount of data from the request.
        data = self.rfile.read(length).decode()

        # Extract the "message" field from the request data.
        message = parse_qs(data)["message"][0]

        # Send the "message" field back as the response.
        self.send_response(200)
        self.send_header('Content-type', 'text/plain; charset=utf-8')
        self.end_headers()
        self.wfile.write(message.encode())

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(form.encode())

if __name__ == '__main__':
    server_address = ('', 8000)
    httpd = HTTPServer(server_address, MessageHandler)
    httpd.serve_forever()
There's a very common design pattern for interactive HTTP applications and APIs, called the PRG or Post-Redirect-Get pattern. A client POSTs to a server to create or update a resource; on success, the server replies not with a 200 OK but with a 303 redirect. The redirect causes the client to GET the created or updated resource.
This is just one of many, many ways to architect a web application, but it's one that makes good use of HTTP methods to accomplish specific goals. For instance, wiki sites such as Wikipedia often use Post-Redirect-Get when you edit a page.
For the messageboard server, Post-Redirect-Get means:
- You go to http://localhost:8000/ in your browser. Your browser sends a GET request to the server, which replies with a 200 OK and a piece of HTML. You see a form for posting comments, and a list of the existing comments. (But at the beginning, there are no comments posted yet.)
- You write a comment in the form and submit it. Your browser sends it via POST to the server.
- The server updates the list of comments, adding your comment to the list. Then it replies with a 303 redirect, setting the Location: / header to tell the browser to request the main page via GET.
- The redirect response causes your browser to go back to the same page you started with, sending a GET request, which replies with a 200 OK and a piece of HTML …
One big advantage of Post-Redirect-Get is that as a user, every page you actually see is the result of a GET request, which means you can bookmark it, reload it, and so forth — without ever accidentally resubmitting a form.
Putting everything together, we'll end up with something like this:
#!/usr/bin/env python3
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import parse_qs

memory = []

form = '''<!DOCTYPE html>
<title>Message Board</title>
<form method="POST">
<textarea name="message"></textarea>
<br>
<button type="submit">Post it!</button>
</form>
<pre>
{}
</pre>
'''

class MessageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # How long was the message?
        length = int(self.headers.get('Content-length', 0))

        # Read and parse the message.
        data = self.rfile.read(length).decode()
        message = parse_qs(data)["message"][0]

        # Escape HTML tags in the message so users can't break world+dog.
        message = message.replace("<", "&lt;")

        # Store it in memory.
        memory.append(message)

        # Send a 303 back to the root page.
        self.send_response(303)  # redirect via GET
        self.send_header('Location', '/')
        self.end_headers()

    def do_GET(self):
        # First, send a 200 OK response.
        self.send_response(200)

        # Then send headers.
        self.send_header('Content-type', 'text/html; charset=utf-8')
        self.end_headers()

        # Send the form with the messages in it.
        mesg = form.format("\n".join(memory))
        self.wfile.write(mesg.encode())

if __name__ == '__main__':
    server_address = ('', 8000)
    httpd = HTTPServer(server_address, MessageHandler)
    httpd.serve_forever()
Now let's turn from writing web servers to writing web clients. The requests library is a Python library for sending requests to web servers and interpreting the responses. It's not included in the Python standard library, though; you'll need to install it. In your terminal, run pip3 install requests to fetch and install the requests library.
Then take a look at the quickstart documentation for requests and try it out.
When you send a request, you get back a Response object. Try it in your Python interpreter:
>>> import requests
>>> a = requests.get('http://www.udacity.com')
>>> a
<Response [200]>
>>> type(a)
<class 'requests.models.Response'>
If you want to access the response body of the response object a, you can use either a.content or a.text, but they're different. a.content is a bytes object representing the literal binary data that the server sent. a.text is the same data, but interpreted as a str object — a Unicode string.
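The distinction is the same as between bytes and str in Python 3 generally. A small sketch of what the two accessors amount to, using a hand-built byte string rather than a live response:

```python
# Bytes a server might send for a page containing the non-ASCII word "café".
raw = "café".encode("utf-8")

# .content gives you the raw bytes, exactly as they came off the network:
print(raw)                  # b'caf\xc3\xa9'
# .text decodes those bytes into a str, using the response's character encoding:
print(raw.decode("utf-8"))  # café
```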
Try fetching some different URIs with the requests module in your Python interpreter. More specifically, try some that don't work. Try some sites that don't exist, like http://bad.example.com/, but also try some pages that don't exist on sites that do, like http://google.com/ThisDoesNotExist.
Send a request to a site that doesn't exist:
uri = "http://bad.example.com/"
r = requests.get(uri)
Now try the same thing, but with a nonexistent page on a site that does exist:
uri = "http://google.com/ThisDoesNotExist"
r = requests.get(uri)
Q: What do you notice about the responses that you get back?
If the requests.get call can reach an HTTP server at all, it will give you a Response object. Whether or not the request succeeded, the Response object has a status_code data member — 200, or 404, or some other code.
But if requests wasn't able to reach an HTTP server at all, for instance because the site doesn't exist, then requests.get will raise an exception.
However: Some Internet service providers will try to redirect browsers to an advertising site if you try to access a site that doesn't exist. This is called DNS hijacking, and it's nonstandard behavior, but some do it anyway. If your ISP hijacks DNS, you won't get exceptions when you try to access nonexistent sites. Standards-compliant DNS services such as Google Public DNS don't hijack.
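Putting the two failure modes together, a client might handle them like this (fetch_status is a made-up helper name, not part of requests):

```python
import requests

def fetch_status(uri):
    """Return the HTTP status code for uri, or None if no server was reachable."""
    try:
        r = requests.get(uri, timeout=5)
        return r.status_code
    except requests.RequestException:
        # Raised when no HTTP server could be reached at all, e.g. a bad
        # hostname, a refused connection, or a timeout.
        return None
```

A nonexistent page on a real site would return 404 here, while a nonexistent site would return None (unless your ISP hijacks DNS, as described above).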
JSON is a data format based on the syntax of JavaScript, often used for web-based APIs. There are a lot of services that let you send HTTP queries and get back structured data in JSON format. You can read more about the JSON format at http://www.json.org/.
Python has a built-in json module; and as it happens, the requests module makes use of it. A Response object has a .json method; if the response data is JSON, you can call this method to translate the JSON data into a Python dictionary.
Try it out! Here, I'm using it to access the Star Wars API, a classic JSON demonstration that contains information about characters and settings in the Star Wars movies:
>>> a = requests.get('http://swapi.co/api/people/1/')
>>> a.json()['name']
'Luke Skywalker'
Now, if we call .json on a Response that isn't made of JSON data, it raises a json.decoder.JSONDecodeError exception. If you want to catch this exception with a try block, you'll need to import it from the json module.
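The same pattern works anywhere you might be handed text that may or may not be JSON; this sketch uses the json module directly, which is what the .json method relies on:

```python
import json

def parse_maybe_json(text):
    """Parse text as JSON, returning None instead of raising if it isn't JSON."""
    try:
        return json.loads(text)
    except json.decoder.JSONDecodeError:
        return None

print(parse_maybe_json('{"name": "Luke Skywalker"}'))  # {'name': 'Luke Skywalker'}
print(parse_maybe_json('<html>not json</html>'))       # None
```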
For this section we'll be hosting this Bookmark server. Here is the code:
#!/usr/bin/env python3
#
# A *bookmark server* or URI shortener.

import http.server
import os
import requests
from urllib.parse import unquote, parse_qs

memory = {}

form = '''<!DOCTYPE html>
<title>Bookmark Server</title>
<form method="POST">
<label>Long URI:
<input name="longuri">
</label>
<br>
<label>Short name:
<input name="shortname">
</label>
<br>
<button type="submit">Save it!</button>
</form>
<p>URIs I know about:
<pre>
{}
</pre>
'''

def CheckURI(uri, timeout=5):
    '''Check whether this URI is reachable, i.e. does it return a 200 OK?

    This function returns True if a GET request to uri returns a 200 OK, and
    False if that GET request returns any other response, or doesn't return
    (i.e. times out).
    '''
    try:
        r = requests.get(uri, timeout=timeout)
        # If the GET request returns, was it a 200 OK?
        return r.status_code == 200
    except requests.RequestException:
        # If the GET request raised an exception, it's not OK.
        return False

class Shortener(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A GET request will either be for / (the root path) or for /some-name.
        # Strip off the / and we have either empty string or a name.
        name = unquote(self.path[1:])

        if name:
            if name in memory:
                # We know that name! Send a redirect to it.
                self.send_response(303)
                self.send_header('Location', memory[name])
                self.end_headers()
            else:
                # We don't know that name! Send a 404 error.
                self.send_response(404)
                self.send_header('Content-type', 'text/plain; charset=utf-8')
                self.end_headers()
                self.wfile.write("I don't know '{}'.".format(name).encode())
        else:
            # Root path. Send the form.
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            # List the known associations in the form.
            known = "\n".join("{} : {}".format(key, memory[key])
                              for key in sorted(memory.keys()))
            self.wfile.write(form.format(known).encode())

    def do_POST(self):
        # Decode the form data.
        length = int(self.headers.get('Content-length', 0))
        body = self.rfile.read(length).decode()
        params = parse_qs(body)

        # Check that the user submitted the form fields.
        if "longuri" not in params or "shortname" not in params:
            self.send_response(400)
            self.send_header('Content-type', 'text/plain; charset=utf-8')
            self.end_headers()
            self.wfile.write("Missing form fields!".encode())
            return

        longuri = params["longuri"][0]
        shortname = params["shortname"][0]

        if CheckURI(longuri):
            # This URI is good! Remember it under the specified name.
            memory[shortname] = longuri

            # Serve a redirect to the form.
            self.send_response(303)
            self.send_header('Location', '/')
            self.end_headers()
        else:
            # Didn't successfully fetch the long URI.
            self.send_response(404)
            self.send_header('Content-type', 'text/plain; charset=utf-8')
            self.end_headers()
            self.wfile.write(
                "Couldn't fetch URI '{}'. Sorry!".format(longuri).encode())

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8000))  # Use PORT if it's there.
    server_address = ('', port)
    httpd = http.server.HTTPServer(server_address, Shortener)
    httpd.serve_forever()
Here's an overview of the steps we'll need to complete. We'll be going over each one in more detail.
- Check your server code into a new local Git repository.
- Sign up for a free Heroku account.
- Download the Heroku command-line interface (CLI).
- Authenticate the Heroku CLI with your account: heroku login
- Create configuration files Procfile, requirements.txt, and runtime.txt, and check them into your Git repository.
- Modify your server to listen on a configurable port.
- Create your Heroku app: heroku create your-app-name
- Push your code to Heroku with Git: git push heroku master
Heroku (and many other web hosting services) works closely with Git: you can deploy a particular version of your code to Heroku by pushing it with the git push command. So in order to deploy your code, it first needs to be checked into a local Git repository.
cd into your project directory, then set the directory up as a Git repository:
git init
git add .
# Choose a more accurate commit message
git commit -m "Checking in my app!"
First, follow the instructions on the Heroku web site to sign up for a free account.
Make sure to write down your username and password!
You'll need the Heroku command-line interface (CLI) tool to set up and configure your app. Download and install it now. Once you have it installed, the heroku command will be available in your shell.
One easy way to install the Heroku CLI is through npm (you'll need both npm and node installed). If you have them, run:
sudo npm install -g heroku
From the command line, use heroku login to authenticate to Heroku. It will prompt you for your username and password; use the ones you just set up when you created your account. This command saves your authentication information in a hidden file (.netrc) so you will not need to enter your password again on the same computer.
There are a few configuration files that Heroku requires for deployment, to tell its servers how to run your application. For the bookmark server, I'll just give you the required content for these files. They are plain text files and can be created in your favorite text editor.
runtime.txt tells Heroku what version of Python you want to run. Check the currently supported runtimes in the Heroku documentation; this will change over time! At the time of writing, the supported version of Python 3 is python-3.7.0, so this file just needs to contain the text python-3.7.0.
requirements.txt is used by Heroku (through pip) to install dependencies of your application that aren't in the Python standard library. The bookmark server has one of these: the requests module. We'd like a recent version of that, so this file can contain the text requests>=2.12. This will install version 2.12, or a later version if one has been released.
Procfile is used by Heroku to specify the command line for running your application. It can support running multiple servers, but in this case we're only going to run a web server. Check the Heroku documentation about process types for more details. If your bookmark server is in BookmarkServer.py, then the contents of Procfile should be web: python BookmarkServer.py.
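Putting the three configuration files side by side, the deployment setup for the bookmark server would look like this (the Python version shown was current when these notes were written; check Heroku's documentation for the version to use today):

```
# runtime.txt
python-3.7.0

# requirements.txt
requests>=2.12

# Procfile
web: python BookmarkServer.py
```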
Create each of these files in the same directory as your code, and commit them all to your Git repository. Like so:
git add runtime.txt requirements.txt Procfile
git commit -m "Add heroku config files"
Heroku runs many users' processes on the same computer, and multiple processes can't (normally) listen on the same port. So Heroku needs to be able to tell your server what port to listen on.
The way it does this is through an environment variable — a configuration variable that is passed to your server from the program that starts it, usually the shell. Python code can access environment variables in the os.environ dictionary. The names of environment variables are usually capitalized; and the one we need here is called, unsurprisingly, PORT.
The port your server listens on is configured when it creates the HTTPServer instance, near the bottom of the server code. We can handle it in a way that works with or without the PORT environment variable:
if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8000))  # Use PORT if it's there.
    server_address = ('', port)
    httpd = http.server.HTTPServer(server_address, Shortener)
    httpd.serve_forever()
To access os.environ, you will also need to import os at the top of the file.
Don't forget to test, then commit your changes to the Git repository with git add and git commit as before.
Before you can put your service on the web, you have to give it a name. You can call it whatever you want, as long as the name is not already taken by another user! Your app's name will appear in the URI of your deployed service. For instance, if you name your app silly-pony, it will appear on the web at https://silly-pony.herokuapp.com/.
Use heroku create your-app-name to tell Heroku about your app and give it a name. Again, you can choose any name you like, but it has to be unique; the service will tell you if you're choosing a name that someone else has already claimed.
Finally, use git push heroku master to deploy your app!
If all goes well, your app will now be accessible on the web! The URI appears in the output from the git command.
If your app doesn't work quite right as deployed, one resource that can be very helpful is the server log. Since your service isn't running on your own local machine any more, those logs aren't going to show up in your terminal! Instead, they're available from the Heroku dashboard.
Take a look at https://dashboard.heroku.com/apps/little-bookmarks/logs, replacing "little-bookmarks" with your own app's name, or use heroku logs --tail.
If you try creating a link in the bookmark server where the target URI is the bookmark server's own URI, the app gives an error saying it can't fetch that web page.
That's because the basic, built-in http.server.HTTPServer class can only handle a single request at once. The bookmark server tries to fetch every URI that we give it, while it's in the middle of handling the form submission.
It's like an old-school telephone that can only have one call at once. Because the server can only handle one request at a time, it can't "pick up" the second request until it's done with the first … but in order to answer the first request, it needs the response to the second.
Being able to handle two ongoing tasks at the same time is called concurrency, and the basic http.server.HTTPServer doesn't have it. It's pretty straightforward to plug concurrency support into an HTTPServer, though. The Python standard library supports doing this by adding a mixin to the HTTPServer class. A mixin is a sort of helper class, one that adds extra behavior the original class did not have. To do this, add the following code to your bookmark server:
import threading
from socketserver import ThreadingMixIn

class ThreadHTTPServer(ThreadingMixIn, http.server.HTTPServer):
    "This is an HTTPServer that supports thread-based concurrency."
Then look at the bottom of your bookmark server code, where it creates an HTTPServer. Have it create a ThreadHTTPServer instead:
if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8000))
    server_address = ('', port)
    httpd = ThreadHTTPServer(server_address, Shortener)
    httpd.serve_forever()
Commit this change to your Git repository, and push it to Heroku. Now when you test it out, you should be able to add an entry that points to the service itself.
The Web was originally designed to serve documents, not to deliver applications. Even today, a large amount of the data presented on any web site is static content — images, HTML files, videos, downloadable files, and other media stored on disk.
Specialized web server programs — like Apache, Nginx, or Microsoft IIS — can serve static content from disk storage very quickly and efficiently. They can also provide access control, allowing only authenticated users to download particular static content.
Some web applications have several different server components, each running as a separate process. One thing a specialized web server can do is dispatch requests to the particular backend servers that need to handle each request. There are a lot of names for this, including request routing and reverse proxying.
Some web applications need to do a lot of work on the server side for each request, and need many servers to handle the load. Splitting requests up among several servers is called load balancing.
Load balancing also helps handle conditions where one server becomes unavailable, allowing other servers to pick up the slack. A reverse proxy can health check the backend servers, only sending requests to the ones that are currently up and running. This also makes it possible to do updates to the backend servers without having an outage.
Handling a large number of network connections at once turns out to be complicated — even more so than plugging concurrency support into your Python web service.
As you may have noticed in your own use of the web, it takes time for a server to respond to a request. The server has to receive and parse the request, come up with the data that it needs to respond, and transmit the response back to the client. The network itself is not instantaneous; it takes time for data to travel from the client to the server.
In addition, a browser is totally allowed to open up multiple connections to the same server, for instance to request resources such as images, or to perform API queries.
All of this means that if a server is handling many requests per second, there will be many requests in progress at once — literally, at any instant in time. We sometimes refer to these as in-flight requests, meaning that the request has "taken off" from the client, but the response has not "landed" again back at the client. A web service can't just handle one request at a time and then go on to the next one; it has to be able to handle many at once.
Imagine a web service that does a lot of complicated processing for each request — something like calculating the best route for a trip between two cities on a map. Pretty often, users make the same request repeatedly: imagine if you load up that map, and then you reload the page — or if someone else loads the same map. It's useful if the service can avoid recalculating something it just figured out a second ago. It's also useful if the service can avoid re-sending a large object (such as an image) if it doesn't have to.
One way that web services avoid this is by making use of a cache, a temporary storage for resources that are likely to be reused. Web systems can perform caching in a number of places — but all of them are under control of the server that serves up a particular resource. That server can set HTTP headers indicating that a particular resource is not intended to change quickly, and can safely be cached.
There are a few places that caching usually can happen. Every user's browser maintains a browser cache of cacheable resources — such as images from recently-viewed web pages. The browser can also be configured to pass requests through a web proxy, which can perform caching on behalf of many users. Finally, a web site can use a reverse proxy to cache results so they don't need to be recomputed by a slower application server or database.
All HTTP caching is supposed to be governed by cache control headers set by the server. You can read a lot more about HTTP cache in this article by Google engineer Ilya Grigorik.
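As a sketch of what that looks like from a server built on the handlers in these notes, the server just sets a Cache-Control header on the response (the helper function and the max-age value here are our own illustrative choices, not part of the standard library):

```python
def cache_control(cacheable, max_age=3600):
    """Build a Cache-Control header value: cache for max_age seconds, or never."""
    return "max-age={}".format(max_age) if cacheable else "no-store"

# In a BaseHTTPRequestHandler's do_GET, after send_response(200), you could do:
#   self.send_header('Cache-Control', cache_control(True))
print(cache_control(True))   # max-age=3600
print(cache_control(False))  # no-store
```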
Why serve static requests out of cache (or a static web server) rather than out of your application server? Python code is totally capable of sending images or video via HTTP, after all. The reason is that — all else being equal — handling a request faster provides a better user experience, but also makes it possible for your service to support more requests.
If your web service becomes popular, you don't want it to bog down under the strain of more traffic. So it helps to handle different kinds of request with software that can perform that function quickly and efficiently.
Cookies are a way that a server can ask a browser to retain a piece of information, and send it back to the server when the browser makes subsequent requests. Every cookie has a name and a value, much like a variable in your code; it also has rules that specify when the cookie should be sent back.
What are cookies for? A few different things. If the server sends each client a unique cookie value, it can use these to tell clients apart. This can be used to implement higher-level concepts on top of HTTP requests and responses — things like sessions and login. Cookies are used by analytics and advertising systems to track user activity from site to site. Cookies are also sometimes used to store user preferences for a site.
The first time the client makes a request to the server, the server sends back the response with a Set-Cookie header. This header contains three things: a cookie name, a value, and some attributes. Every subsequent time the browser makes a request to the server, it will send that cookie back. The server can update cookies, or ask the browser to expire them.
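As a sketch of the server side of this exchange, Python's standard http.cookies module can build and parse these headers (the cookie name and value here are made up):

```python
from http.cookies import SimpleCookie

# Building a Set-Cookie header value for a response:
c = SimpleCookie()
c['session'] = 'abc123'
c['session']['domain'] = 'localhost'
c['session']['max-age'] = 600  # ask the browser to expire it after ten minutes
# In a handler, you'd pass this value to send_header('Set-Cookie', ...):
print(c['session'].OutputString())
# e.g. session=abc123; Domain=localhost; Max-Age=600

# Parsing a Cookie header from a later request:
incoming = SimpleCookie('session=abc123')
print(incoming['session'].value)  # abc123
```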
Browsers don't make it easy to find cookies that have been set, because removing or altering cookies can affect the expected behavior of web services you use. However, it is possible to inspect cookies from sites you use in every major browser. Do some research on your own to find out how to view the cookies that your browser is storing.
Here's a cookie that I found in my Chrome browser, from a web site I visited:
What are all these pieces of data in my cookie? There are eight different fields there!
The first two, the cookie's name and content, are also called its key and value. They're analogous to a dictionary key and value in Python — or a variable's name and value for that matter. They will both be sent back to the server. There are some syntactic rules for which characters are allowed in a cookie name; for instance, they can't have spaces in them. The value of the cookie is where the "real data" of the cookie goes — for instance, a unique token representing a logged-in user's session.
The next two fields, Domain and Path, describe the scope of the cookie — that is to say, which queries will include it. By default, the domain of a cookie is the hostname from the URI of the response that set the cookie. But a server can also set a cookie on a broader domain, within limits. For instance, a response from www.udacity.com can set a cookie for udacity.com, but not for com.
The fields that Chrome describes as "Send for" and "Accessible to script" are internally called Secure and HttpOnly, and they are boolean flags (true or false values). The internal names are a little bit misleading. If the Secure flag is set, then the cookie will only be sent over HTTPS (encrypted) connections, not plain HTTP. If the HttpOnly flag is set, then the cookie will not be accessible to JavaScript code running on the page.
Finally, the last two fields deal with the lifetime of the cookie — how long it should last. The creation time is just the time of the response that set the cookie. The expiration time is when the server wants the browser to stop saving the cookie. There are two different ways a server can set this: it can set an Expires field with a specific date and time, or a Max-Age field with a number of seconds. If no expiration field is set, then a cookie is expired when the browser closes.
To set a cookie from a Python HTTP server, all you need to do is set the Set-Cookie header on an HTTP response. Similarly, to read a cookie in an incoming request, you read the Cookie header. However, the format of these headers is a little bit tricky; I don't recommend formatting them by hand. Python's http.cookies module provides handy utilities for doing so.
To create a cookie on a Python server, use the SimpleCookie class. This class is based on a dictionary, but has some special behavior once you create a key within it:
from http.cookies import SimpleCookie, CookieError
out_cookie = SimpleCookie()
out_cookie["bearname"] = "Smokey Bear"      # cookie name and value
out_cookie["bearname"]["max-age"] = 600     # expire after 600 seconds
out_cookie["bearname"]["httponly"] = True   # hide from in-page JavaScript
Then you can send the cookie as a header from your request handler:
self.send_header("Set-Cookie", out_cookie["bearname"].OutputString())
To read incoming cookies, create a SimpleCookie from the Cookie header:
in_cookie = SimpleCookie(self.headers["Cookie"])
in_data = in_cookie["bearname"].value
Be aware that a request might not have a cookie on it, in which case looking up a cookie name like in_cookie["bearname"] will raise a KeyError exception; or the cookie might not be valid, in which case the SimpleCookie constructor will raise http.cookies.CookieError.
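Putting those failure cases together, a defensive way to read a cookie might look like the following sketch (read_cookie_value is a hypothetical helper, not part of the course code):

```python
from http.cookies import SimpleCookie, CookieError

def read_cookie_value(headers, name):
    """Return the named cookie's value from a headers mapping, or None.

    Handles a missing Cookie header, a malformed cookie string, and a
    cookie string that doesn't contain the name we're looking for.
    """
    header = headers.get("Cookie")
    if header is None:
        return None  # the request carried no cookies at all
    try:
        cookies = SimpleCookie(header)
    except CookieError:
        return None  # the cookie string was malformed
    morsel = cookies.get(name)
    return morsel.value if morsel is not None else None
```

In a BaseHTTPRequestHandler, you would call it as read_cookie_value(self.headers, "bearname").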
Important safety tip: Even though browsers make it difficult for users to modify cookies, it's possible for a user to modify a cookie value. Higher-level web toolkits, such as Flask (in Python) or Rails (in Ruby), will cryptographically sign your cookies so that they won't be accepted if they are modified. Quite often, high-security web applications use a cookie just to store a session ID, which is a key to a server-side database containing user information.
Another important safety tip: If you're displaying the cookie data as HTML, you need to be careful to escape any HTML special characters that might be in it. An easy way to do this in Python is to use the html.escape function, from the built-in html module!
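For example, html.escape turns HTML special characters into entities, so the browser displays a malicious cookie value as plain text instead of interpreting it:

```python
import html

# Cookie values can contain anything the client chose to send, including
# markup. Escape them before interpolating into a page, or a crafted cookie
# could inject script (cross-site scripting).
unsafe = '<script>alert("pwned")</script>'
safe = html.escape(unsafe)
print(safe)  # &lt;script&gt;alert(&quot;pwned&quot;)&lt;/script&gt;
```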
For a lot more information on cookie handling in Python, see the documentation for the http.cookies module.
Domain names play a few other roles in HTTP besides just being easier to remember than IP addresses. A DNS domain links a particular hostname to a computer's IP address. But it also indicates that the owner of that domain intends for that computer to be treated as part of that domain.
Imagine what a bad guy could do if they could convince your browser that their server evilbox was part of (say) Facebook, and get you to request a Facebook URL from evilbox instead of from Facebook's real servers. Your browser would send your facebook.com cookies to evilbox along with that request. But these cookies are what prove your identity to Facebook … so then the bad guy could use those cookies to access your Facebook account and send spam messages to all your friends.
In the immortal words of Dr. Egon Spengler: It would be bad.
This is just one reason that DNS is essential to web security. If a bad guy can take control of your site's DNS domain, they can send all your web traffic to their evil server … and if the bad guy can fool users' browsers into sending that traffic their way, they can steal the users' cookies and reuse them to break into those users' accounts on your site.
When a browser and a server speak HTTPS, they're just speaking HTTP, but over an encrypted connection. The encryption follows a standard protocol called Transport Layer Security, or TLS for short. TLS provides some important guarantees for web security:
- It keeps the connection private by encrypting everything sent over it. Only the server and browser should be able to read what's being sent.
- It lets the browser authenticate the server. For instance, when a user accesses https://www.udacity.com/, they can be sure that the response they're seeing is really from Udacity's servers and not from an impostor.
- It helps protect the integrity of the data sent over that connection — checking that it has not been (accidentally or deliberately) modified or replaced.
Note: TLS is also very often referred to by the older name SSL (Secure Sockets Layer). Technically, SSL is an older version of the encryption protocol. This course talks about TLS because that's the current standard.
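You can see these guarantees reflected in the defaults of Python's built-in ssl module: a default client-side context insists on a valid certificate that matches the hostname you connect to.

```python
import ssl

# The default client context enables certificate validation and hostname
# checking, matching the TLS guarantees described above.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True: cert must validate
print(context.check_hostname)                    # True: cert must match host
```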
If you deployed a web service on Heroku earlier in this lesson, then HTTPS should already be set up. The URI that Heroku assigned to your app was something like https://yourappname.herokuapp.com/.
From there, you can use your browser to see more information about the HTTPS setup for this site. However, the specifics of where to find this information will depend on your browser. You can experiment to find it, or you can check the documentation: Chrome, Firefox, Safari.
Most browsers have a lock icon next to the URI when you're viewing an HTTPS web site. Clicking on the lock is how you start exploring the details of the HTTPS connection. Here, I've clicked on the lock on my bookmark server deployed on Heroku.
Well, there are a lot of locks in these pictures. Those are how the browser indicates to the user that their connection is being protected by TLS. However, these dialogs also show a little about the server's TLS setup.
The server-side configuration for TLS includes two important pieces of data: a private key and a public certificate. The private key is secret; it's held on the server and never leaves there. The certificate is sent to every browser that connects to that server via TLS. These two pieces of data are mathematically related to each other in a way that makes the encryption of TLS possible.
The server's certificate is issued by an organization called a certificate authority (CA). The certificate authority's job is to make sure that the server really is who it says it is — for instance, that a certificate issued in the name of Heroku is actually being used by the Heroku organization and not by someone else.
The role of a certificate authority is kind of like getting a document notarized. A notary public checks your ID and witnesses you sign a document, and puts their stamp on it to indicate that they did so.
The data in the TLS certificate and the server's private key are mathematically related to each other through a system called public-key cryptography. The details of how this works are way beyond the scope of this course. The important part is that the two endpoints (the browser and server) can securely agree on a shared secret which allows them to scramble the data sent between them so that only the other endpoint — and not any eavesdropper — can unscramble it.
A server certificate indicates that an encryption key belongs to a particular organization responsible for that service. It's the job of a certificate authority to make sure that they don't issue a cert for (say) udacity.com to someone other than the company who actually runs that domain.
But the cert also contains metadata that says what DNS domain the certificate is good for. The cert in the picture above is only good for sites in the .herokuapp.com domain. When the browser connects to a particular server, if the TLS domain metadata doesn't match the DNS domain, the browser will reject the certificate and put up a big scary warning to tell the user that something fishy is going on.
Every request and response sent over a TLS connection is sent with a message authentication code (MAC) that the other end of the connection can verify to make sure that the message hasn't been altered or damaged in transit.
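TLS's actual MAC construction is negotiated per connection (modern TLS uses authenticated encryption), but the underlying idea can be sketched with Python's hmac module; the key and messages here are made up for illustration:

```python
import hmac
import hashlib

# A secret shared by the two endpoints (TLS negotiates this during the
# handshake; here it's hard-coded for the sketch).
key = b"shared-secret"
message = b"GET /index.html HTTP/1.1"

# The sender attaches a MAC computed over the message.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# The receiver recomputes the MAC and compares in constant time.
expected = hmac.new(key, message, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, expected)

# A tampered message produces a different MAC, so tampering is detected.
tampered = hmac.new(key, b"GET /evil.html HTTP/1.1", hashlib.sha256).hexdigest()
assert not hmac.compare_digest(tag, tampered)
```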
The different HTTP methods each stand for different actions that a client might need to perform upon a server-hosted resource. Unlike GET and POST, their usage isn't built into the normal operation of web browsers; following a link is always going to be a GET request, and the default action for submitting an HTML form will always be a GET or POST request.
However, other methods are available for web APIs to use, for instance from client code in JavaScript. If you want to use other methods in your own full-stack applications, you'll have to write both server-side code to accept them, and client-side JavaScript code to make use of them.
The HTTP PUT method can be used for creating a new resource. The client sends the URI path that it wants to create, and a piece of data in the request body. A server could implement PUT in a number of different ways — such as storing a file on disk, or adding records to a database. A server should respond to a PUT request with a 201 Created status code, if the PUT action completed successfully. After a successful PUT, a GET request to the same URI should return the newly created resource.
The destructive counterpart to PUT is DELETE, for removing a resource from the server. After a DELETE has happened successfully, further GET requests for that resource will yield 404 Not Found ... unless, of course, a new resource is later created with the same name!
The PATCH method is a relatively new addition to HTTP. It expresses the idea of patching a resource, or changing it in some well-defined way. (If you've used Git, you can think of patching as what applying a Git commit does to the files in a repository.)
However, just as HTTP doesn't specify what format a resource has to be in, it also doesn't specify what format a patch can be in: how it should represent the changes that are intended to be applied. That's up to the application to decide. An application could send diffs over HTTP PATCH requests, for instance. One standardized format for PATCH requests is the JSON Patch format, which expresses changes to a piece of JSON data. A different one is JSON Merge Patch.
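As a concrete example, JSON Merge Patch (RFC 7396) has semantics simple enough to sketch in a few lines: keys in the patch overwrite keys in the target, and a null (None in Python) value deletes a key. This is an illustrative implementation, not a library the course uses:

```python
def json_merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396) to target, returning the result.

    Object keys in the patch are merged recursively; a None value deletes
    the corresponding key; any non-object patch simply replaces the target.
    """
    if not isinstance(patch, dict):
        return patch  # a non-object patch replaces the target wholesale
    if not isinstance(target, dict):
        target = {}  # patching a non-object starts from an empty object
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "delete this key"
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result
```

For instance, patching {"title": "Hello", "author": {"name": "A", "email": "a@example.com"}} with {"title": "Hi", "author": {"email": None}} changes the title and removes the email while keeping the author's name.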
There are a number of additional methods that HTTP supports for various sorts of debugging and examining servers.
- HEAD works just like GET, except the server doesn't return any content — just headers.
- OPTIONS can be used to find out what features the server supports.
- TRACE echoes back what the server received from the client — but is often disabled for security reasons.
HTTP can't prevent a service from using methods to mean something different from what they're intended to mean, but this can have some surprising effects. For instance, you could create a service that used a GET request to delete content. However, web clients don't expect GET requests to have side-effects like that. In one famous case from 2006, an organization put up a web site where "edit" and "delete" actions happened through GET requests, and the result was that the next search-engine web crawler to come along deleted the whole site.
For much more about HTTP methods, consult the HTTP standards documents.
The new version of HTTP is called HTTP/2. It's based on earlier protocol work done at Google, under the name SPDY (pronounced "speedy").
Unfortunately, we can't show you very much about HTTP/2 in Python, because the libraries for it are not very mature yet (as of early 2017). We'll still take a look at the motivations for the changes that HTTP/2 brings, though.
Some other languages are a little bit more up to the minute; one of the best demonstrations of HTTP/2's advantages is the Gophertiles demo from the makers of the Go programming language. In order to see the effects, you'll need to be using a browser that supports HTTP/2. Visit CanIUse.com to check that your browser does!
This demo lets you load the same web page over HTTP/1.1 and HTTP/2. It also lets you add extra latency (delay) to each request, simulating what happens when you access a server that's far away or when you're on a slow network. The latency options are zero (no extra latency), 30 milliseconds, 200 milliseconds, and one second. Try it out!
But if you're requesting hundreds of different tiny files from the server — as in this demo or the Gophertiles demo — it's kind of limiting to only be able to fetch six at a time. (Browsers typically open no more than about six parallel HTTP/1.1 connections to a single server.) This is particularly true when the latency (delay) between the server and browser gets high. The browser can't start fetching the seventh image until it's fully loaded the first six. The greater the latency, the worse this affects the user experience.
HTTP/2 changes this around by multiplexing requests and responses over a single connection. The browser can send several requests all at once, and the server can send responses as quickly as it can get to them. There's no limit on how many can be in flight at once.
And that's why the Gophertiles demo loads much more quickly over HTTP/2 than over HTTP/1.
When you load a web page, your browser first fetches the HTML, and then it goes back and fetches other resources such as stylesheets or images. But if the server already knows that you will want these other resources, why should it wait for your browser to ask for them in a separate request? HTTP/2 has a feature called server push which allows the server to say, effectively, "If you're asking for index.html, I know you're going to ask for style.css too, so I'm going to send it along as well."
The HTTP/2 protocol was being designed around the same time that web engineers were getting even more interested in encrypting all traffic on the web for privacy reasons. Early drafts of HTTP/2 proposed that encryption should be required for sites to use the new protocol. This ended up being removed from the official standard … but most of the browsers did it anyway! Chrome, Firefox, and other browsers will only attempt HTTP/2 with a site that is using TLS encryption.
Now you have a sense of where HTTP development has been going in the past few years. You can read much more about HTTP/2 in the HTTP/2 FAQ.
Here are some handy resources for learning more about HTTP:
- Mozilla Developer Network's HTTP index page contains a variety of tutorial and reference materials on every aspect of HTTP.
- The standards documents for HTTP/1.1 start at RFC 7230. The language of Internet standards tends to be a little difficult, but these are the official description of how it's supposed to work.
- The standards documents for HTTP/2 are at https://http2.github.io/.
- If you already run your own web site, Let's Encrypt is a great site to learn about HTTPS in a hands-on way, by creating your own HTTPS certificates and installing them on your site.
- HTTP Spy is a neat little Chrome extension that will show you the headers and request information for every request your browser makes.