Disclaimer: while the views expressed in this article are mine and mine alone, the project introduced here was part of engineering efforts carried out with Withings.

Proxies are a very common piece of web infrastructure which I find is not discussed nearly enough given how much we use them. At their core, their job is to take in an HTTP request, figure out which server can serve it and ship that server's response back to the original client. They tend to be used for a variety of reasons, including:

Caching: a lot of requests out there are static, meaning they always get the same response. CDNs make heavy use of caching proxies to reduce the load on the backend web servers. If the proxy has the response, it can serve it without sending the request any further.
Reverse proxying: most commonly used for load balancing, these proxies take in requests and distribute them fairly across a large fleet of backend servers.
Access control: combined with firewalls, NAT and routing rules, proxies can act as gateways in and out of restricted subnetworks.

When working with plain HTTP, proxies have full access to requests and responses, allowing them to apply complex access and routing policies. They are also free to intercept and modify HTTP exchanges as they please.

In this article, I will be focusing on forward HTTP(s) proxies. Reverse proxying and other protocols such as SOCKS come with additional specificities. To the curious, I warmly recommend Web Proxy Servers by Ari Luotonen.

Plain HTTP

We tend to distinguish between two main setups: proxy-aware and transparent. In the former, the client itself is aware of the existence of the proxy and explicitly communicates with it to reach out to other servers. In the latter, clients are completely unaware and network configuration forces the traffic in and out of proxies transparently.

Proxy-aware clients

Say the client wants to connect to example.tld and is aware it needs to go through proxy.lan first. It will open a TCP connection to proxy.lan and may, for example, send the following request:

GET http://example.tld/index.html HTTP/1.1
Host: example.tld

When talking to a proxy, clients will typically send a full URL in the first line, instead of just a path. The proxy will then open a connection to example.tld, sending something along the lines of:

GET /index.html HTTP/1.1
Host: example.tld

Then when example.tld replies, the proxy will ship the response back to the client.

Flow for proxy aware clients

Note that throughout the exchange, the proxy is free to read and modify both the request and the response. It may decide, for example, that example.tld is not allowed and should be blocked. It could also strip the response of any pesky JS scripts. Anything goes, and the client will be none the wiser.
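Going back to the request rewriting described above, the step where the proxy turns the full URL into a path and a Host value can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular proxy implements it: the function name `to_origin_form` is made up for the example, and it deliberately ignores other headers and the request body.

```python
from urllib.parse import urlsplit


def to_origin_form(request_line: str) -> tuple[str, str]:
    """Turn a proxy-style request line into (origin-form line, Host value).

    Hypothetical helper for illustration only.
    """
    method, target, version = request_line.split(" ", 2)
    url = urlsplit(target)
    if not url.scheme or not url.netloc:
        raise ValueError("expected a full URL in the request line")
    # Keep the path and query string; an empty path becomes "/".
    path = url.path or "/"
    if url.query:
        path += "?" + url.query
    return f"{method} {path} {version}", url.netloc
```

Calling `to_origin_form("GET http://example.tld/index.html HTTP/1.1")` returns the request line `GET /index.html HTTP/1.1` along with the host `example.tld`, which is where the proxy would open its outgoing connection.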

Transparent proxies

When working with transparent proxies the idea is similar. The client will send a normal request, thinking it's going straight to example.tld:

GET /index.html HTTP/1.1
Host: example.tld

A router along the way will be configured to reroute HTTP traffic at layer 4 and the proxy will receive that request as-is. If it is configured as a transparent proxy, it won't get confused by the missing full URL and will use the Host header to figure out where the request needs to go. Connection tracking will then make sure the response gets back to the client as if nothing happened.

TCP interception by router
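To make the Host-based routing a little more concrete, here is a minimal Python sketch of how a transparent proxy might work out the destination of an intercepted request. The function name `destination_from_host` and the default port are assumptions made for this example, and the parsing is intentionally naive (it does not handle IPv6 literals in the Host header, for instance).

```python
def destination_from_host(raw_request: bytes, default_port: int = 80) -> tuple[str, int]:
    """Return (host, port) taken from the Host header of a raw HTTP request.

    Hypothetical helper: a real proxy would use a proper HTTP parser.
    """
    # Only look at the header section, before the blank line.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for header in head.split("\r\n")[1:]:  # skip the request line
        name, _, value = header.partition(":")
        if name.strip().lower() == "host":
            host, _, port = value.strip().partition(":")
            return host, (int(port) if port else default_port)
    raise ValueError("no Host header found")
```

With the intercepted request shown earlier, this yields `example.tld` on port 80, which is all the proxy needs to open its own connection to the remote.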

HTTPS and proxy-aware clients

With HTTPS, a lot of what's been described above becomes tricky, although there are a couple of ways to achieve it when working with proxy-aware clients.

Plain proxy connection

A rather dirty way to address the issue is to leave HTTPS up to the proxy. The client continues to send plain HTTP requests and the proxy communicates with the remote server over HTTPS, serving responses back onto the plain client connection. With this setup, you lose the benefits of end-to-end encryption and the communication between the client and the proxy becomes vulnerable to network intrusion.

HTTPS to the proxy

To solve the above issue, you can decide to secure the connection between the client and the proxy. The proxy will need to be provided with a TLS key and certificate which the clients trust. This is more or less feasible if you have a valid PKI set up across your network, allowing internal nodes to have secure, authenticated exchanges. In this case, the proxy terminates TLS for the client, initiates it for the remote and may freely inspect/modify the traffic as a middle man.

TLS sessions between client, proxy and remote

Note that this is only possible because the client is aware it is talking to a proxy. The TLS handshake it performs verifies that it is talking to the right proxy, but gives the client no guarantee that the response was actually issued by the right server at the other end. This setup therefore still does not provide end-to-end encryption.

HTTP CONNECT

To ensure end-to-end encryption, we need to let the client complete a TLS handshake with the remote without interfering. To allow HTTP proxies to act as layer 4 tunnels, RFC 9110 provides a lesser-known HTTP method: CONNECT. The request is typically much simpler than with the standard verbs (GET, POST, ...) as only a subset of HTTP semantics apply to it. It looks something like this:

CONNECT example.tld:443 HTTP/1.1
Host: example.tld

Here, the proxy-aware client is asking the proxy to open a TCP connection to example.tld:443. If it agrees, the proxy will reply once with an HTTP response:

HTTP/1.1 200 Connection established

Once this response has been processed by the client, any traffic (that is, any byte) sent on the connection will be forwarded as-is to the remote. The HTTP proxy effectively downgrades itself to operate at layer 4.

HTTP CONNECT method

At this point, the client is free to perform a TLS handshake over the TCP connection and establish end-to-end encryption with the remote. The proxy, however, loses the ability to alter the traffic. The only data it has is the initial CONNECT line, which it can use to filter destinations.
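Since the CONNECT line is the only thing the proxy gets to see, destination filtering boils down to parsing it. The Python sketch below shows one way this could look from the proxy's side. The names `parse_connect_target` and `is_allowed`, as well as the idea of a plain set of allowed host names, are assumptions made for the example rather than the behaviour of any specific proxy.

```python
def parse_connect_target(request_line: str) -> tuple[str, int]:
    """Extract (host, port) from a line such as 'CONNECT example.tld:443 HTTP/1.1'."""
    parts = request_line.strip().split(" ")
    if len(parts) != 3 or parts[0] != "CONNECT":
        raise ValueError("not a CONNECT request line")
    # The target of a CONNECT request is always host:port.
    host, sep, port = parts[1].rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("CONNECT target must be host:port")
    return host.lower(), int(port)


def is_allowed(request_line: str, allowed_hosts: set[str]) -> bool:
    """Hypothetical policy: only tunnel to hosts found in an allowlist."""
    host, _port = parse_connect_target(request_line)
    return host in allowed_hosts
```

For instance, `is_allowed("CONNECT example.tld:443 HTTP/1.1", {"example.tld"})` is true, while a request for any other host would be refused before a single byte is tunnelled.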

Side note: by design the CONNECT method is not restricted to HTTPS traffic! You can use it to forward any TCP-based protocol (SSH/SFTP comes to mind).

Transparent HTTPS proxying

When intercepting HTTPS traffic, you are effectively sending a TLS ClientHello message to an HTTP proxy which is expecting an HTTP request instead. Most software out there simply will not understand what's happening, and it's often very much a design choice.

HTTPS proxying fail through a regular proxy

In fact, if you look for transparent HTTPS proxies out there, you may find that very few projects take on the challenge of providing such a feature. In my opinion, Squid offers the best set of features for this. However, it also suffers from a number of open vulnerabilities and there just aren't enough developers on the project to address them all. To the best of my knowledge, Apache Traffic Server is the only other project with similar support for the feature, although I may have missed others.

SNI peeking

I believe the main reason behind the lack of options out there is that a lot of people expect to be able to mess with the traffic when proxying. You want to be able to analyze full requests, alter responses, cache them, and so on. For this, the proxy needs to be able to generate certificates on the fly to impersonate the target server using a trusted local CA. This is exactly the sort of thing TLS was designed to prevent, so it comes with its fair share of trouble for developers.

In many scenarios however, you only want to filter destinations to decide where your clients are and aren't allowed to go, or how they should be routed. As it so happens, there is no need to interfere with the handshake to achieve this. Most clients today support TLS Server Name Indication (SNI) whereby the first TLS record sent by the client will contain an extension indicating the name of the server it is trying to handshake with. This is an essential feature when talking to load balancers for example, since they may need to present different certificates based on which name is being requested. The following proxy flow could therefore be considered:

The client initiates a TCP handshake with example.tld.
A router intercepts and reroutes it such that the client actually connects to a proxy instead.
The proxy detects that the first packet sent is a TLS ClientHello message and peeks at the SNI.
The proxy behaves as though the client had sent a CONNECT request to that name and becomes a layer 4 relay, forwarding the peeked ClientHello to the remote.
Client and remote proceed with the handshake and start exchanging data.
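The peeking step in the flow above can be sketched as a small ClientHello parser. The Python below is a simplified illustration, not the code used by Squid, Apache Traffic Server or tls2httpconnect. It assumes the whole ClientHello arrives in a single TLS record that has already been read into `data`, and the function name `peek_sni` is made up for the example.

```python
import struct
from typing import Optional


def peek_sni(data: bytes) -> Optional[str]:
    """Return the SNI host name found in a TLS ClientHello, or None.

    Assumes `data` holds one complete TLS record carrying the ClientHello.
    """
    try:
        # TLS record header: type (1 byte), version (2), length (2).
        if data[0] != 0x16:  # 0x16 = handshake record
            return None
        pos = 5
        # Handshake header: message type (1 byte), length (3).
        if data[pos] != 0x01:  # 0x01 = ClientHello
            return None
        pos += 4
        pos += 2 + 32  # client version and random
        pos += 1 + data[pos]  # session id
        (suites_len,) = struct.unpack_from("!H", data, pos)
        pos += 2 + suites_len  # cipher suites
        pos += 1 + data[pos]  # compression methods
        (ext_total,) = struct.unpack_from("!H", data, pos)
        pos += 2
        end = pos + ext_total
        # Walk the extensions until server_name (type 0) shows up.
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", data, pos)
            pos += 4
            if ext_type == 0:
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = data[pos + 2]
                (name_len,) = struct.unpack_from("!H", data, pos + 3)
                name = data[pos + 5 : pos + 5 + name_len]
                if name_type != 0 or len(name) != name_len:
                    return None
                return name.decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        # Truncated or malformed input: treat it as "no SNI available".
        return None
    return None
```

Returning `None` for anything that does not look like a ClientHello leaves the decision to the caller, which could then drop the connection or fall back to treating the bytes as plain HTTP.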

This is more or less what the projects I mentioned earlier do, although they are also able to perform more advanced man-in-the-middle operations.

tls2httpconnect

For this simpler scenario where traffic remains secret and unaltered, I got to work on a little project that performs SNI peeking ahead of an HTTP proxy: tls2httpconnect. The goal here is to capture incoming TLS handshakes, peek at the SNI, then send a corresponding HTTP CONNECT request to an upstream proxy. tls2httpconnect therefore simply acts as a little addon that sits in front of your proxy and handles the peeking. Filtering and other such decisions remain up to the upstream proxy which receives the HTTP CONNECT request.

tls2httpconnect flow
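To give a rough idea of what happens on the upstream side once the SNI is known, here is a Python sketch of the exchange described above: send a CONNECT request to the upstream proxy, wait for its reply, then replay the ClientHello that was peeked at. This is not the actual tls2httpconnect code, only an approximation under simple assumptions: blocking sockets, a hypothetical `open_tunnel` function, and an already-connected `upstream` socket.

```python
import socket


def open_tunnel(upstream: socket.socket, server_name: str,
                client_hello: bytes, port: int = 443) -> None:
    """Ask an upstream proxy for a tunnel, then replay the peeked ClientHello."""
    request = f"CONNECT {server_name}:{port} HTTP/1.1\r\nHost: {server_name}\r\n\r\n"
    upstream.sendall(request.encode("ascii"))

    # Read the reply one byte at a time, stopping at the blank line that
    # ends the headers so that no tunnelled data is consumed by accident.
    reply = b""
    while b"\r\n\r\n" not in reply:
        chunk = upstream.recv(1)
        if not chunk:
            raise ConnectionError("upstream proxy closed the connection")
        reply += chunk

    # Any 2xx status means the proxy has switched to tunnel mode.
    parts = reply.split(b" ", 2)
    if len(parts) < 2 or not parts[1].startswith(b"2"):
        raise ConnectionError("upstream proxy refused the tunnel")

    # The tunnel is open: forward the bytes that were peeked at, unmodified.
    upstream.sendall(client_hello)
```

After this call returns, the remaining work is a plain bidirectional byte relay between the client and the upstream socket, since the client and the remote carry on with the handshake by themselves.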

For more information about this tool, feel free to have a look at the repo!

Final notes

Whatever solution you pick to handle transparent HTTPS proxying, there are a couple of things to be noted in such setups.

First, the remote server will be resolved twice. The client will go through the DNS first and open a connection to the remote's IP. The proxy will then peek at the SNI and resolve it once more to connect, possibly getting a different IP if the remote is behind round-robin or GeoDNS. The discrepancy should not matter too much though since the TLS handshake will make the proper assertions to authenticate the remote no matter the IP.

Squid (and I assume others) also takes extra precautions to prevent SNI forgery when intercepting ClientHello messages. This makes such proxies more difficult to use when reaching out to remotes behind round-robin DNS (that is, large networks like AWS). Following a security advisory on the matter, Squid wrote up some interesting documentation outlining the issue in greater detail. tls2httpconnect does not take that risk into consideration and assumes that the client will verify, through the certificate presented during the handshake, that the server it ends up talking to matches the name from its initial ClientHello.
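One way to picture this kind of forgery check, and why round-robin DNS gets in its way, is to compare the IP the client originally connected to against what the proxy gets when resolving the SNI itself. The Python below is a simplified sketch of that general idea and should not be read as Squid's actual implementation. Both function names are hypothetical.

```python
import socket


def resolve_all(name: str, port: int = 443) -> set[str]:
    """Return every IP address the local resolver gives for `name`."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    # Each entry ends with a socket address whose first item is the IP.
    return {info[4][0] for info in infos}


def sni_matches_destination(original_dst_ip: str, resolved_ips: set[str]) -> bool:
    """True if the IP the client connected to is among the proxy's own results."""
    return original_dst_ip in resolved_ips
```

A proxy applying such a rule would call `sni_matches_destination(client_dst_ip, resolve_all(sni))` and reject mismatches. With round-robin or GeoDNS, the client and the proxy may legitimately receive different sets of addresses for the same name, so an honest client can end up failing the check.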

Finally, it should be noted that when running tls2httpconnect, the proxy no longer has access to the client's IP, making proxy access control more difficult. Extra care should therefore be taken to ensure you do not run an open proxy through tls2httpconnect (proxy authentication, firewalling, ...).

The hassles of proxying HTTPS transparently

jjpk.me