PyCon 2016 in Portland, Or

Tuesday 12:10 p.m.–12:55 p.m.

HTTP Can Do That?!

Sumana Harihareswara

Audience level:
Novice
Category:
Other

Description

Learn how to get more performance, testability, and flexibility out of your web apps, using features already built into HTTP. I'll walk you through case studies exploring good (and bad) ideas, using Python, your browser, netcat, and other common tools.

Abstract

Web developers who only know about GET and POST and use the most popular headers and response codes are missing out! Underappreciated verbs, headers, and response codes can boost your web application's performance, flexibility, and testability, and help you better appreciate the structure of the web.

Introduction
------------

The version of the Hypertext Transfer Protocol you will deal with most is 1.1. As a quick refresher: clients and servers talk to each other via HTTP messages (requests and responses), which are clear text comprising start-lines, headers, and bodies.

Methods
-------

GET ("gimme") and POST ("here you go") are overwhelmingly popular, the Dave Matthews Band of methods. To illustrate their importance: you can create an API that allows the user to POST but not GET, but that would be a terrible idea. [https://github.com/brainwane/secureapi][1] demonstrates this with Python 2 code using the `BaseHTTPServer` standard library module.

Using POST to mean "Create resource", "Update resource", and "Delete resource" is inelegant! So why do we overload POST, and what are the alternatives? PUT, meaning "create resource," is implemented throughout the HTTP 1.1 ecology and is unambiguously great; be more careful with DELETE, which deletes a resource (as demonstrated with Python 2 code using the `BaseHTTPServer` standard library module and the requests library). It's also worth looking into PATCH and OPTIONS for specialized use cases.

An exciting alternative to GET is HEAD, which requests only the metadata about a resource. If the client only needs to know whether it could GET a resource, or wants a resource's size, last-modified timestamp, or other information available in its headers, using HEAD instead of GET can speed performance by more than 50%. I demonstrate this using the requests library and the `%timeit` functionality in IPython.

Also, why am I discussing both good and bad ideas throughout this talk, and how can you tell the difference?
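
As a concrete aside on the HEAD method described under Methods, here is a minimal sketch of the difference between HEAD and GET. The talk itself uses the requests library and `%timeit`; this sketch swaps in Python 3's standard library and a throwaway local server so that it runs offline, and it compares bytes received rather than timings. The names `DemoHandler`, `fetch`, `run_demo`, and the 100,000-byte `BODY` are assumptions made for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"x" * 100_000  # hypothetical resource payload


class DemoHandler(BaseHTTPRequestHandler):
    """Serves the same headers for GET and HEAD; only GET sends a body."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        self._send_headers()  # metadata only, no body

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch(method, port):
    """Return (status, Content-Length header, body bytes actually received)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, "/")
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    return resp.status, resp.getheader("Content-Length"), len(body)


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("HEAD", port), fetch("GET", port)
    finally:
        server.shutdown()
        server.server_close()


head_result, get_result = run_demo()
print("HEAD:", head_result)
print("GET: ", get_result)
```

Both responses report the same `Content-Length`, but only the GET transfers the body, so a client that just needs the size or existence of a resource can skip the download.
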
Sometimes bad ideas are easy ways to understand edge cases (also, they're funny). The "horror world-to-whiteboard scale" gives you my take on whether or not you should try out what I'm describing.

Headers
-------

Call-and-response header pairs such as Last-Modified and If-Modified-Since/If-Unmodified-Since allow the client to conditionally specify its preferences; you can save client-side processing time, and test your application more thoroughly, by knowing and using the right headers. For instance, check for cache problems by using Cache-Control and ETag. (But not all headers are useful; for instance, the From header is basically obsoleted by more advanced analytics and by the User-Agent header.)

HTTP 1.1 requires that clients send a Host header with all requests; Host works with the path specified in the start-line, the two together forming the full address of the resource. Sometimes the host is merely the domain name of the server, but you can't depend on the assumption that the host will be obvious to all the systems between the client and the server. The client might send a request to an IP address, or to one of several virtual hosts that act as subdomains on one host. This level of redundancy can lead to unintended consequences; for instance, by intentionally malforming the Host headers of GET requests, spammers can leave links to their own sites in your access logs.

You can define your own header when sending requests or responses, and many organizations do this; the convention is to prepend "X-" to bespoke headers. It's easy to do this when hand-writing requests, and I'll also demonstrate how to do this in a Python web framework.

Response codes
--------------

Response codes (a.k.a. status codes) have well-specified semantics.
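
Response codes and the conditional headers from the Headers section work together: a server that honors If-Modified-Since can answer 304 Not Modified instead of resending the body. The following is a minimal sketch of that exchange using only Python 3's standard library and a local throwaway server. The handler, the helper names (`not_modified`, `get`, `run_demo`), the body, and the timestamp are all assumptions made for illustration, not code from the talk.

```python
import http.client
import threading
from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical resource: a fixed body with a fixed modification time.
BODY = b"hello, PyCon\n"
LAST_MODIFIED = formatdate(1464000000, usegmt=True)


def not_modified(if_modified_since):
    """True when the client's cached copy is at least as new as ours."""
    if not if_modified_since:
        return False
    try:
        client_time = parsedate_to_datetime(if_modified_since)
        return client_time >= parsedate_to_datetime(LAST_MODIFIED)
    except (TypeError, ValueError):
        return False  # unparseable date: fall back to a full response


class ConditionalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not_modified(self.headers.get("If-Modified-Since")):
            self.send_response(304)  # headers only, no body
            self.send_header("Last-Modified", LAST_MODIFIED)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def get(port, headers=None):
    """Return (status, Last-Modified header, body) for GET /."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/", headers=headers or {})
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    return resp.status, resp.getheader("Last-Modified"), body


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), ConditionalHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        first = get(port)  # unconditional: full response
        second = get(port, {"If-Modified-Since": first[1]})  # conditional
        return first, second
    finally:
        server.shutdown()
        server.server_close()


first, second = run_demo()
print(first[0], len(first[2]), "bytes")
print(second[0], len(second[2]), "bytes")
```

The client reuses the `Last-Modified` value from the first response as its `If-Modified-Since` value on the second, which is the call-and-response pairing described above.
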
For instance, response codes come in five numbered classes, and the three-digit integer should be sufficient to explain the response; the "reason-phrase" (the English explanation) should not be a necessary data point for the client to use when debugging. As several responses sent by real, working web servers demonstrate, if you don't respect this principle, the results can be hilariously confusing.

HTTP includes useful response codes that mean more specific things than "OK" or "nope"; 410 Gone, 304 Not Modified, and 451 Unavailable for Legal Reasons help you and your users move faster, debug, test, and recover from unavailable content.

I demonstrate how to alter the reason-phrases in your web application's response codes, using the `http` standard library in Python 3: [https://gitlab.com/brainwane/http-can-do-that/][2]

Conclusion
----------

From "don't cache this" instructions to look-before-you-leap requests to using the "Content-Disposition" header to tell clients that a resource should be treated as an attachment, HTTP already contains an embarrassment of riches. Reading up on it gives you both a feeling of power, of increased capability, and a sense of wonder, in discovering a new way to look at the world. What might the web have been? What might it still be?

To learn more, read IETF RFCs 7230-7235, use the `http` standard library in Python 3 and the requests library, find netcat, wget, netstat, and telnet on your system, and check out [https://gitlab.com/brainwane/http-can-do-that/][3].

[1]: https://github.com/brainwane/secureapi
[2]: https://gitlab.com/brainwane/http-can-do-that/
[3]: https://gitlab.com/brainwane/http-can-do-that/
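
The reason-phrase point from the Response codes section can be sketched with the `http` standard library in Python 3. This is not the code from the linked repository; it is a minimal illustration under stated assumptions, in which `GoneHandler`, `run_demo`, and the phrase "Gone Fishing" are invented for the example. It relies on `BaseHTTPRequestHandler.send_response`, which accepts an optional message that replaces the default reason-phrase.

```python
import http.client
import threading
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer


class GoneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The second argument overrides the default reason-phrase ("Gone").
        self.send_response(410, "Gone Fishing")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging


def run_demo():
    """Return (status, reason-phrase) as seen by a client of GoneHandler."""
    server = HTTPServer(("127.0.0.1", 0), GoneHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()
        conn.close()
        return resp.status, resp.reason
    finally:
        server.shutdown()
        server.server_close()


status, reason = run_demo()
print("status:", status)                      # what a client should rely on
print("reason-phrase on the wire:", reason)   # free text chosen by the server
print("standard phrase:", HTTPStatus(status).phrase)
```

The integer 410 carries the meaning; the phrase is whatever text the server chose to send, which is why a client should not depend on it.
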