Up until now, you learned about the basic concepts and some of the architectural aspects of HTTP. This leads us to the next important subject to the HTTP: client identification.
In this article, you’ll learn why client identification is important and how can Web servers identify you (your Web client). You will also get to see how that information is used and stored.
This is the third part of the HTTP Series.
In this article, you will learn more about:
- Client Identification and Why It’s Extremely Important
- Different Ways to Identify a Client
- HTTP Request Headers Used for Identification
- IP Address
- Long (Fat) URLs
First, let’s see why would websites need to identify you.
As you are most definitely aware, every website, or at least those that care enough about you and your actions, include some form of content personalization.
What do I mean by that?
Well, that includes suggested items if you visit an e-commerce website, or “the people you might know/want to connect with” on social networks, recommended videos, ads that almost spookily know what you need, news articles that are relevant to you and so on.
This effect feels like a double-edged sword. On one hand, it’s pretty nifty having personalized, custom content delivered to you. On the other hand, it can lead to the Confirmation bias that can result in all kinds of stereotypes and prejudice. There is an excellent Dilbert comic that touches upon Confirmation bias.
Yet, how can we live without knowing how our favorite team scored last night, or what celebrities did last night?
Either way, content personalization has become part of our daily lives we can’t and we probably don’t even want to do anything about it.
Let’s see how the Web servers can identify you to achieve this effect.
There are several ways that a Web server can identify you:
- HTTP request headers
- IP address
- Long URLs
- Login information (authentication)
Web servers have a few ways to extract information about you directly from the HTTP request headers.
Those headers are:
- From – contains users email address, if it’s provided
- User-Agent – contains the information about Web client
- Referer – contains the source user came from
- Authorization – contains username and password
- Client-IP – contains the users IP address
- X-Forwarded-For – contains user’s IP address (when going through the proxy server)
- Cookie – contains server-generated ID label
In theory, the From header would be ideal to uniquely identify the user, but in practice, this header is rarely used due to the security concerns of email collection.
The user-agent header contains information like the browser version, operating system. While this is important for customizing content, it doesn’t identify the user in a more relevant way.
The Referer header tells the server where the user is coming from. This information is used to improve the understanding of user behavior, but less so to identify it.
While these headers provide some useful information about the client, it is not enough to personalize content in a meaningful way.
The remaining headers offer more precise mechanisms of identification.
The method of client identification by IP address has been used more in the past when IP addresses weren’t so easily faked/swapped. Although it can be used as an additional security check, it just isn’t reliable enough to be used on its own.
Here are some of the reasons why:
- It describes the machine, not the user
- NAT firewalls – many ISPs (Internet service providers) use NAT firewalls to enhance security and deal with IP address shortage
- Dynamic IP addresses – users often get the dynamic IP address from the ISP
- HTTP proxies and gateways – these can hide the original IP address. Some proxies use Client-IP or X-Forwarded-For to preserve the original IP address
It is not that uncommon to see websites utilize URLs to improve the user experience. They add more information as the user browses the website until URLs look complicated and illegible.
You can see what the long URL looks like by browsing the Amazon store.
There are several problems when using this approach.
- It’s ugly
- Not shareable
- Breaks caching
- It’s limited to that session
- Increases the load on the server
The best client identification method up to date excluding the authentication. Developed by Netscape, but now every browser supports them.
There are two types of cookies: session cookies and persistent cookies. A session cookie is deleted upon leaving the browser, and persistent cookies are saved on disk and can last longer. For the session cookie to be treated as the persistent cookie, Max-Age or Expiry property needs to be set.
Modern browsers like Chrome and Firefox can keep background processes working when you shut them down so you can resume where you left off. This can result in the preservation of the session cookies, so be careful.
So how do the cookies work?
Cookies contain a list of name=value pairs that server sets using Set-Cookie or Set-Cookie2 response header. Usually, the information stored in a cookie is some kind of client id, but some websites store other information as well.
The browser stores this information in its cookie database and returns it when the user visits the page/website next time. The browser can handle thousands of different cookies and it knows when to serve each one.
Here is an example flow.
1. User Agent -> Server
POST /acme/login HTTP/1.1 [form data]
The user identifies itself via form input
2. Server -> User Agent
HTTP/1.1 200 OK Set-Cookie2: Customer="WILE_E_COYOTE"; Version="1"; Path="/acme"
The server sends the Set-Cookie2 response header to instruct the User Agent (browser) to set the information about the user in a cookie.
3. User Agent -> Server
POST /acme/pickitem HTTP/1.1 Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme" [form data]
The user selects the item to the shop basket.
4. Server -> User Agent
HTTP/1.1 200 OK Set-Cookie2: Part_Number="Rocket_Launcher_0001"; Version="1"; Path="/acme"
Shopping basket contains an item.
5. User Agent -> Server
POST /acme/shipping HTTP/1.1 Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme"; Part_Number="Rocket_Launcher_0001"; [form data]
The user selects the shipping method.
6. Server -> User Agent
HTTP/1.1 200 OK Set-Cookie2: Shipping="FedEx"; Version="1"; Path="/acme"
The new cookie reflects the shipping method.
7. User Agent -> Server
POST /acme/process HTTP/1.1 Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme"; Part_Number="Rocket_Launcher_0001"; $Path="/acme"; Shipping="FedEx"; $Path="/acme" [form data]
There is one more thing I want you to be aware of. The cookies are not perfect either. Besides security concerns, there is also a problem with cookies colliding with the REST architectural style. (The section about misusing cookies).
You can learn more about cookies in the RFC 2965.
This wraps it up for this part of the HTTP series.
You have learned about the strengths of content personalization as well as it’s potential pitfalls. You are also aware of the different ways that servers can use to identify you. In part 4 of the series, we will talk about the most important type of client identification: authentication.