Changelog:

  • 31 Oct 2024: add requirement for HEAD + multiple request per connection support; remove requirement for particular error documents.
  • 31 Oct 2024: add suggestions for testing with curl
  • 1 Nov 2024: add suggestions for testing with nc
  • 6 Nov 2024: add note about testing with web browsers
  • 11 Nov 2024: note when talking about testing with curl, that the command PowerShell provides called curl is not what we mean
  • 13 Nov 2024: consistently use --request for curl option and not its alias -X; correct typo in getsockname() instructions
  • 13 Nov 2024: point to RFC 9112 as a more specific HTTP/1.1 reference
  • 15 Nov 2024: correct hex/decimal confusion in CRLF expansion footnote; (and 10:45pm) actually correct it to have the right order

1 Your Task

  1. Using the standard Python socket library, and any data structure and text encoding-related standard Python libraries of your choice[1], create a webserver. Your webserver must:

    • set the SO_REUSEADDR socket option on its server socket
    • be startable by running python3 webserver.py 127.0.0.1 PORT
    • listen on IP address 127.0.0.1, port number PORT, and look for files in the webroot directory inside the directory from which it is run
    • implement HTTP/1.1, where
      • only the GET and HEAD methods are supported; requests using any other method always return a 405 Method Not Allowed error.
      • GET requests for the path /FOO are handled as follows:
        • if FOO contains any /s or the file FOO does not exist in webroot, then the webserver returns a 404 Not Found response whose body contains text of your choice.
        • if FOO is redirect-example, the webserver returns a 301 Moved Permanently response with a Location header specifying /redirect-target.html and text of your choice as the response body.
        • if FOO exists in webroot but is not readable, the webserver returns a 403 Forbidden response whose body contains text of your choice.
        • otherwise, the webserver returns a 200 OK response whose body is the contents of the file FOO in webroot.
      • HEAD requests for the path /FOO are handled like GET requests, except no response body is returned.
      • When returning a file from a GET request, if its extension is html or htm, the Content-Type header is set to text/html; if its extension is txt, the Content-Type header is set to text/plain. Otherwise, the Content-Type header may be omitted or set as you choose.
      • When returning a response with a message body, either include a correct Content-Length or use the chunked transfer encoding for the message body.
      • Your server supports multiple requests in the same connection, as long as all of those requests are GET and HEAD requests.

    Your webserver only needs to handle one connection at a time.

  2. Test your server, probably by using a utility like curl.

  3. Submit your webserver.py to the submission site.
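
    For the response-format requirements in step 1 (status line, Content-Type chosen by extension, a correct Content-Length), one possible shape is sketched below. The names content_type_for and build_response are hypothetical helpers, not something the assignment requires, and the sketch assumes you pick the Content-Length option rather than chunked transfer encoding.

```python
import os


def content_type_for(filename):
    """Return the Content-Type from the task description, or None to omit it."""
    extension = os.path.splitext(filename)[1]  # e.g. ".html" for "foo.html"
    if extension in (".html", ".htm"):
        return "text/html"
    if extension == ".txt":
        return "text/plain"
    return None  # any other extension: the header may be omitted


def build_response(status, reason, body, content_type=None,
                   extra_headers=(), head_only=False):
    """Assemble a complete HTTP/1.1 response as bytes.

    body must already be bytes; extra_headers is a sequence of
    (name, value) string pairs, such as a Location header.
    """
    lines = [f"HTTP/1.1 {status} {reason}"]
    if content_type is not None:
        lines.append(f"Content-Type: {content_type}")
    for name, value in extra_headers:
        lines.append(f"{name}: {value}")
    # Content-Length counts bytes, so it is computed from the bytes body.
    lines.append(f"Content-Length: {len(body)}")
    head = ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
    # For a HEAD request, the headers are sent but the body is left out.
    return head if head_only else head + body
```

    With helpers like these, a 200 OK response for a text file could be produced with build_response(200, "OK", data, content_type_for("foo.txt")), where data is the file's contents read in binary mode.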

2 References

  1. RFC 9112 is the official specification for HTTP/1.1 (and RFC 9110 is an official specification for HTTP more generally). You can find a friendlier introduction in Section 9.1.2 of Computer Networks: A Systems Approach.

  2. The reference documentation for the Python socket library is the socket module page of the official Python documentation.

    There is a friendlier introduction in the official Python socket programming HOWTO. If you remember the socket chat lab from CSO1, you may notice that since the Python socket library wraps the C API you used in CSO1, it follows the same structure.

    The socket API shows creating server sockets using socket.socket and bind; you might find it easier to use the create_server utility function as more readable shorthand.
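
    As a rough sketch of the two approaches, the code below creates a listening socket first with socket.socket and bind, then with create_server. The function names are illustrative only. To my knowledge, CPython's create_server sets SO_REUSEADDR itself on POSIX systems before binding, but that is worth confirming with getsockopt on your own platform rather than assuming.

```python
import socket


def make_server_socket_manual(host, port):
    """Create a listening TCP socket step by step."""
    ssock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR must be set before bind() to affect that bind.
    ssock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    ssock.bind((host, port))
    ssock.listen()
    return ssock


def make_server_socket_short(host, port):
    """Create a listening TCP socket with the create_server shorthand."""
    return socket.create_server((host, port))


def has_reuseaddr(ssock):
    """Report whether SO_REUSEADDR is currently set on a socket."""
    return ssock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
```

    Either function returns a socket on which you can call accept() to wait for a connection.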

3 Testing your server

  1. If you are testing on a shared server and aren’t sure what ports are free, you can bind to port 0 to ask the OS to select a port. Then you can use ssock.getsockname() (where ssock is the name of your server socket) to find out which port the OS selected.
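
    A minimal sketch of the port 0 technique, assuming an IPv4 socket (open_on_free_port is a made-up name for illustration):

```python
import socket


def open_on_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS picks a free port; return (socket, port)."""
    ssock = socket.create_server((host, 0))
    # For IPv4 sockets, getsockname() returns a (host, port) pair.
    port = ssock.getsockname()[1]
    return ssock, port
```

    Printing the returned port when the server starts tells you which port number to give to curl or nc.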

3.1 Using curl

  1. You can use the Linux curl utility to make requests to your server. ([added 11 Nov 2024]: Confusingly, Windows PowerShell provides a command called curl which is missing many features we use in the commands below, and so will not work.)

    For example:

    curl http://127.0.0.1:12345/foo.html

    will make a GET request for foo.html.

    You can also:

    • get more information about what’s sent over the connection with the --verbose option:

       curl --verbose http://127.0.0.1:12345/foo.html
    • change the method of the request to HEAD with the --head option:

       curl --head http://127.0.0.1:12345/foo.html
    • make multiple requests by including multiple URLs on the command line:

       curl --verbose http://127.0.0.1:12345/foo.html http://127.0.0.1:12345/bar.html

      and curl’s output will indicate whether it reused the same connection or reconnected.

    • change the method to something other than GET or HEAD with the --request option:

       curl --request DELETE --verbose http://127.0.0.1:12345/someplace

      and pass a request body while doing so with the --data option:

       curl --request POST --data 'This is the request body' --verbose http://127.0.0.1:12345/someplace

3.2 Using nc

  1. You can use the nc utility to connect to your server and send arbitrary data.

    If you run nc -C 127.0.0.1 PORTNUMBER (where PORTNUMBER is the port your server is running on), you can type a request like:

    GET / HTTP/1.1<enter>
    Host: 127.0.0.1<enter>
    <enter>

    and nc will send it, including CRLFs, and show you any response. On a Linux machine, you can type control-D to close the connection.

    You could also enter your request(s) into a text file and run a command like

    nc -C 127.0.0.1 PORTNUMBER < some-text-file.txt

    to send some-text-file.txt to the server, followed by closing the connection.
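
    If you would rather script this kind of test, the sketch below does roughly what the nc command above does: it sends raw bytes, closes the sending side, and collects whatever comes back. The names exchange and send_raw are illustrative. Unlike nc -C, nothing converts line endings here, so the request bytes must spell out \r\n explicitly.

```python
import socket


def exchange(sock, request_bytes):
    """Send request_bytes on a connected socket and return the full reply."""
    sock.sendall(request_bytes)
    # Half-close: tell the peer we are done sending, but keep reading.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty bytes means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def send_raw(host, port, request_bytes):
    """Connect to host:port, send request_bytes, and return the reply."""
    with socket.create_connection((host, port)) as sock:
        return exchange(sock, request_bytes)
```

    For example, send_raw("127.0.0.1", 12345, b"GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n") would return the raw response bytes from a server listening on port 12345. Concatenating two requests into one bytes object is one way to exercise multiple requests on a single connection.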

3.3 Using a web browser

  1. Provided that your web browser is running on the same machine as your server, you should be able to go to http://127.0.0.1:PORTNUMBER/name.html (where PORTNUMBER is the port your server is running on) and make a GET request for /name.html.

    In most web browsers, you can use developer tools to see which HTTP requests are being made, what HTTP responses were received, and other details about them. Usually you can access these tools by using the menu, then going to More tools, then to an item labeled Developer tools or Web Developer Tools. After this, there will usually be a Network tab in the developer tools that shows the relevant information (it fills in as you visit pages; it won’t show requests/responses retroactively).

4 Hints

4.1 Bytes in Python

  1. The socket functions in Python return and expect bytes, not str (strings). (bytes are composed of 8-bit bytes, but strs are composed of Unicode characters.)

    To get bytes instead of strs:

    • open files in binary mode (for example open('webroot/404.html', 'rb') instead of open('webroot/404.html', 'r'))
    • write constants like b'foo' instead of 'foo'
    • given a string s, use something like s.encode('UTF-8') to convert it to a bytes object

    If you need to convert from a bytes object b to str, you can do something like b.decode('UTF-8', errors='replace').
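
    The conversions above can be sketched as follows, using a made-up request line as the input:

```python
# Bytes as they might arrive from recv().
raw = b"GET /foo.html HTTP/1.1\r\n"

# bytes -> str, replacing any invalid UTF-8 instead of raising an error.
text = raw.decode("UTF-8", errors="replace")
method, target, version = text.split()  # split() also drops the trailing CRLF

# str -> bytes, for example when building a header line to send.
body = b"hello"
length_header = f"Content-Length: {len(body)}\r\n".encode("UTF-8")

# Invalid UTF-8 bytes become U+FFFD replacement characters.
garbled = b"\xff\xfe".decode("UTF-8", errors="replace")
```

    Keeping file contents and response bodies as bytes throughout, and decoding only the request line and headers, avoids corrupting files that are not valid UTF-8.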

4.2 Reading requests

  1. When you call recv, it will read whatever bytes are available; this may or may not be a full request. You may need to call recv multiple times to read enough of a request to figure out what to do.

    Since a request’s headers are always terminated by two CRLFs[2], I would recommend calling recv in a loop, accumulating the bytes received into a buffer, until the buffer contains a doubled CRLF. At that point, you would have a full set of request headers.

  2. When dealing with multiple requests, note that it’s possible that you can read parts of multiple requests in a single recv call.

    In my implementation, I dealt with this by adding the result of the recv calls to the end of a buffer. Then, I would check if that buffer contained a full request. If it did, I would remove that request from the beginning of the buffer, but keep the buffer around for the next request.
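
    One way the buffering approach from both hints could look is sketched below. This is not the reference implementation; pop_request and read_request are hypothetical names, and the sketch assumes requests have no body (as with GET and HEAD), so a request ends at the first doubled CRLF.

```python
import socket

HEADER_END = b"\r\n\r\n"  # two CRLFs in a row end the request headers


def pop_request(buffer):
    """Split one complete request off the front of buffer.

    Returns (request, rest), or (None, buffer) if no full request is there yet.
    """
    end = buffer.find(HEADER_END)
    if end == -1:
        return None, buffer
    end += len(HEADER_END)
    return buffer[:end], buffer[end:]


def read_request(conn, buffer=b""):
    """Call recv until buffer holds a full request; return (request, rest).

    request is None if the peer closed the connection first. Pass the
    returned rest back in as buffer when reading the next request.
    """
    while True:
        request, buffer = pop_request(buffer)
        if request is not None:
            return request, buffer
        chunk = conn.recv(4096)
        if not chunk:  # connection closed before a full request arrived
            return None, buffer
        buffer += chunk
```

    A connection-handling loop would call read_request repeatedly, carrying the leftover bytes from one call into the next, until it returns None.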


  1. You may not use other Python libraries for handling URIs or HTTP.

  2. CRLF is a 0x0d (decimal 13) byte followed by a 0x0a (decimal 10) byte.