Chapter 2 : Understanding HTTP Requests
HTTP (Hypertext Transfer Protocol) is an application-layer protocol used for communication between distributed, multi-layered systems on the web. The web as we know it today (the World Wide Web) uses HTTP as its main data communication protocol.
HTTP is a request-response protocol: one end issues a "request", and the other end receives it and replies with a "response". This is a generic, high-level explanation, but keep in mind that the response body can be pretty much anything that is parseable (a JSON document, XML, HTML, an integer, a URL... you name it).
The classic example of how this works is browsing the web. Usually there is a server (or several) hosting the website you are accessing, in charge of receiving the requests and responding with the website's pages (serving them to you). In this case your browser is known as the "client" and the server, well, is known as the "server" :)
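To make the request-response idea concrete, here is a minimal sketch that takes the raw text of a response and splits it into its parts. The response text is a made-up sample (not captured from a real server), and `parse_response` is a hypothetical helper name used only for illustration.

```python
import json

# Hypothetical raw response, as a server might send it over the wire.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Content-Length: 16\r\n"
    "\r\n"
    '{"status": "ok"}'
)

def parse_response(raw: str):
    """Split a raw HTTP response into status code, reason, headers and body."""
    # A blank line separates the head (status line + headers) from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body

code, reason, headers, body = parse_response(RAW_RESPONSE)
print(code, reason)              # 200 OK
print(headers["Content-Type"])   # application/json
print(json.loads(body))          # {'status': 'ok'}
```

In real code an HTTP library does this parsing for you; the point here is only that a response is structured text the client can take apart.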
Every HTTP request needs at least an address to be sent to: the address of the server on the other side, which will receive the request and respond to it. The address can be either an IP address (something like 192.0.2.10; less common) or a domain name (something like google.com; this is what we use on a day-to-day basis), which DNS maps to an IP address anyway.
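As a small illustration of what "the address" contains, the sketch below breaks a URL into the pieces a client needs before it can connect. The URL and the helper name `describe_address` are assumptions made for the example; turning the host name into an IP address is a separate DNS lookup (for instance via `socket.gethostbyname`), which is left out here so the snippet runs offline.

```python
from urllib.parse import urlsplit

# Ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_address(url: str) -> dict:
    """Return the scheme, host, port and path a client would use for this URL."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path or "/",
    }

print(describe_address("http://example.com:8080/index.html"))
print(describe_address("https://example.com"))
```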
Other than the address, each HTTP request also has its own headers, which are one way of passing information to the target server. The full list of defined headers can be found here, but keep in mind that headers are not restricted to this set: pretty much anything in the "Name: value" format can be used as a header. It is up to the server to handle the headers it receives and act accordingly. (E.g., if we pass a header such as "Name: SlackBot" and the server does not parse it, or does not know how to handle it, chances are that nothing will happen and the header will be ignored.)
Some of the most commonly used request headers are Accept-Encoding, Accept-Charset, Accept-Language, Host and User-Agent.
The ones we will be tinkering with the most are "Host" and "User-Agent". The Host header specifies the host name (and port, if needed) of the server from which a resource is being requested. The User-Agent header tells the server what is accessing it, which can be used for statistical purposes, tracing of protocol violations and automated recognition of user agents (for the sake of tailoring responses to a particular client, if needed). By default, each browser sends its own User-Agent, and so do crawlers, email clients, custom libraries etc. (here are some examples).
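The sketch below shows one way to attach headers to a request with Python's standard `urllib.request` module. Nothing is sent over the network: the request object is only built and inspected. The URL, the "MyCrawler/1.0" User-Agent string and the helper name `build_request` are assumptions for illustration.

```python
from urllib.request import Request

def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a request carrying a few headers."""
    request = Request(url)
    request.add_header("User-Agent", user_agent)
    request.add_header("Accept-Language", "en-US")
    # A made-up header: a server that does not know it will simply ignore it.
    request.add_header("Name", "SlackBot")
    return request

request = build_request("http://example.com:8080/index.html", "MyCrawler/1.0")

# urllib derives the Host header value from the URL when the request is sent.
print(request.host)                      # example.com:8080
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))  # MyCrawler/1.0
print(request.header_items())
```

Changing the User-Agent like this is usually the first thing a crawler does, since many servers tailor (or refuse) responses based on it.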
The HTTP specification evolves over time, and one of the things that can change with each new version is the set of "HTTP Methods". HTTP/1.0 defined only 3 methods (GET, POST and HEAD); HTTP/1.1 added 5 more (OPTIONS, PUT, DELETE, TRACE and CONNECT).
For the sake of brevity we are going to focus on the main methods for a web crawler (GET, POST and HEAD), but we will try to scratch the surface of the other ones as well.
Each HTTP method (or verb) designates an action that, coupled with the request's headers, tells the server how to respond. Let's get to them:
-
HTTP GET: Requests a representation of the specified resource (Document, HTML Page, Picture, JSON, XML...). Using the GET method should only retrieve data and should have no other effect.
-
HTTP POST: Requests that the server accept the data enclosed in its request body as a new resource to be persisted on its end. The data POSTed might be, for instance, a new user who registered on your website, a new message in an instant messaging app, a comment on a thread etc. Usually it is something that the server will store on its end for later consumption, processing or usage.
-
HTTP HEAD: Identical to the GET request, but instead of receiving the full payload of the response, the client receives only the meta-information about it (the status line and the response headers). This is useful for understanding what is running on the server (or how it reacts to different requests) without having to transport the entire content of a standard response.
-
HTTP PUT: Similar to the POST request, but this one supplies a URI (identifier) that the server should use to persist the object transported by the PUT request. The catch here is that if an object with the same URI already exists on the server side, it should be overwritten by the one received (this is also known as an UPSERT or Merge operation: if the record does not exist it will be inserted, otherwise it will be updated).
-
HTTP DELETE: Requests the deletion of the specified resource.
-
HTTP OPTIONS: Requests the HTTP methods and actions supported by the server for one specific URL.
-
HTTP TRACE: Bounces the issued request to the server and back again. This is useful for understanding whether any intermediate servers made any changes to the request you issued, before it reached the target.
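As a rough sketch of how the methods above look in code, the snippet below builds one request per method with `urllib.request.Request` and prints what would be sent. The requests are only constructed, never sent, and the URLs and JSON bodies are hypothetical.

```python
from typing import Optional
from urllib.request import Request

USERS_URL = "http://example.com/users"     # hypothetical collection
USER_URL = "http://example.com/users/42"   # hypothetical single resource

def make_request(method: str, url: str, body: Optional[bytes] = None) -> Request:
    """Build (but do not send) a request using the given HTTP method."""
    return Request(url, data=body, method=method)

requests_by_method = {
    "GET": make_request("GET", USER_URL),
    "HEAD": make_request("HEAD", USER_URL),
    "POST": make_request("POST", USERS_URL, b'{"name": "Ada"}'),
    "PUT": make_request("PUT", USER_URL, b'{"name": "Ada"}'),
    "DELETE": make_request("DELETE", USER_URL),
    "OPTIONS": make_request("OPTIONS", USER_URL),
}

for request in requests_by_method.values():
    print(request.get_method(), request.full_url, request.data)
```

If no method is given, `urllib` picks GET for a request without a body and POST for one with a body, which matches the way the two methods are described above.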
The Status Codes are the way the server tells the client what happened with the request it issued. Have you ever tried to access a site and seen the classic "404 - Not Found" screen? Well, it turns out that "404" is the Status Code that represents the Not Found status.
Each status is represented by its own three-digit integer and falls into one of five categories:
-
1XX - Informational (E.g: 100 - Continue)
-
2XX - Success (E.g: 200 - OK ; 201 - Created ; 204 - No Content)
-
3XX - Redirection (E.g: 301 - Moved Permanently)
-
4XX - Client Error (E.g: 400 - Bad Request ; 401 - Unauthorized ; 404 - Not Found)
-
5XX - Server Error (E.g: 500 - Internal Server Error ; 501 - Not Implemented)
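The categories above depend only on the first digit of the code, which is easy to express in code. Below is a minimal sketch using the standard library's `http.HTTPStatus` for the reason phrases; `describe_status` is a hypothetical helper name.

```python
from http import HTTPStatus

# First digit of the status code -> category, as listed above.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe_status(code: int) -> str:
    """Return a readable description such as '404 Not Found (Client Error)'."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the statuses known to the standard library.
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"

for code in (100, 200, 301, 404, 500):
    print(describe_status(code))
```

A crawler typically branches on the category first (retry on 5XX, follow on 3XX, give up on 4XX) and only then looks at the exact code.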
For a full list of status codes you can try this link or if you are a cat lover you can try this visual representation of status codes as cats.
- HTTP Headers
- HTTP Request Methods
- HTTP Status Codes
Check the sub-project of this project called "Chapter Two" for a simple demonstration of those concepts. At this point, you can go through the code as you wish, reading the comments and running it.
The example code of this chapter shows simple "GET" requests for the home page of imdb.com, teaches you how to "simulate" a search on imdb.com by understanding and playing with the URL's construction, and also shows how to log in to the "Pocket" website.
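"Playing with the URL's construction" mostly means building the query string yourself. The sketch below is not the chapter's example code; it is a Python illustration, and the `/find` path and the `q` parameter are assumptions about how the search URL is shaped, so check the address bar of a real search before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_search_url(base_url: str, **params: str) -> str:
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"

# Assumed search endpoint and parameter name (see the note above).
url = build_search_url("https://www.imdb.com/find", q="the matrix")
print(url)  # https://www.imdb.com/find?q=the+matrix

# Reading the parameters back shows the encoding is reversible.
print(parse_qs(urlsplit(url).query))  # {'q': ['the matrix']}
```

Issuing a GET request to a URL built this way is what "simulating" a search amounts to: the server cannot tell whether a form or your code produced it.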
Pocket was the first example that came to my mind, but the same techniques can be applied to other websites, and I encourage you to try. Once you read the code, you will see that logging in there is not as simple as issuing a single request: it needs a parameter that is hidden within the HTML, so in order to perform the login we had to parse this hidden field out of the HTML first, then assemble the login POST data before we could try to log in.
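The two steps described above (pull the hidden field out of the HTML, then assemble the POST data) can be sketched as follows. This is not the chapter's example code: the sample form, the field names (`form_check`, `username`, `password`) and the helper names are all hypothetical, and a real login page will use its own names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical login form, standing in for the HTML fetched with a GET request.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="form_check" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""

def build_login_body(html: str, username: str, password: str) -> bytes:
    """Combine the hidden fields with the credentials into a POST body."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)
    data.update({"username": username, "password": password})
    return urlencode(data).encode("ascii")

print(build_login_body(SAMPLE_HTML, "alice", "s3cret"))
# b'form_check=abc123&username=alice&password=s3cret'
```

The resulting bytes are what you would send as the body of the login POST request, usually together with the cookies received when fetching the form.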
I strongly recommend you modify the code to your needs, try new requests and play around with it.