Monday, October 30, 2023

Apache as a Proxy

I have a number of different devices and services running on my home network with web interfaces, and I would like to be able to access them all from anywhere.  Some of these are smart devices, some are software packages, but it doesn't really matter.  I want to be able to access them all by connecting to a web page on the server running on my firewall (or wherever port 443 is forwarded).

To start with, I did an inventory of all the devices and servers on my network.  The tool 'nmap' is ideal for this.  Something like this works:

nmap 192.168.0.1-254

My network is a little more complicated, but I just run the equivalent command on each subnet.  For any device showing something on port 80 or 443, I connect with a web browser and see what's there.  For other ports that might be web pages, I can construct a URL by adding 'http://' or 'https://' to the front and adding ':8443' or whatever the port is.  If the web browser brings up anything interesting, I note that, too.
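
The URL construction above is simple enough to script.  Here's a minimal Python sketch; the function name and the example host and port are just placeholders I made up for illustration:

```python
def candidate_urls(host, port):
    """Return the http:// and https:// URLs worth trying for one open port."""
    urls = []
    for scheme, default_port in (("http", 80), ("https", 443)):
        # Leave the port off when it is the default for the scheme.
        suffix = "" if port == default_port else f":{port}"
        urls.append(f"{scheme}://{host}{suffix}/")
    return urls


for url in candidate_urls("192.168.0.5", 8443):
    print(url)
```

Paste each printed URL into a browser and note which ones bring up something interesting.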

For the first pass, I create a web page that simply has a link to each service of interest on my network.  Of course, this will only work from inside my network, but at least I now have a page listing everything that I want to connect to.  Some of these will be http, others https, often with bad certificates (like my printer).  That's fine, I'll eventually deal with that using my proxy.

There are two categories of applications that I want to proxy.

The simple case is a web application that runs on a different web server (either the same system on a different port, or a different system; it doesn't matter) but is already served from a subdirectory.  I just want to forward anything going to that subdirectory on the main web server to the application.

The complicated case is typically a device where it assumes it's its own web server, and everything is relative to the root of the server.  I want to proxy it from a subdirectory on my main server, and this means modifying all the links that are passed through.  Sometimes this is easy, but sometimes it's quite difficult.

For the simple case, I'll use an application called "mythweb" as an example.  This is running on an internal server at http://192.168.0.5/mythweb/.  I've set it up in my /etc/hosts file as myth.mydomain.com, so I don't have to reference the IP address.  My Apache installation has my server defined in /etc/apache2/vhosts.d/mydomain.conf, and I'll be adding entries to that file.  So for mythweb, it's really quite simple:

  <Location /mythweb/>
        ProxyPass http://myth.mydomain.com/mythweb/
        ProxyPassReverse http://myth.mydomain.com/mythweb/
        # mythweb sets a global cookie; change path from / to /mythweb/
        ProxyPassReverseCookiePath "/" "/mythweb/"
  </Location>

That's it.  Since I'm going to be defining multiple proxies as virtual subdirectories, I'm defining each as a location.  The "ProxyPass" line says where to send requests for anything in this directory.  The "ProxyPassReverse" is used for rewriting any redirects that the internal server may send back to the location.  The final note is that this application was setting a global cookie, so the "ProxyPassReverseCookiePath" option restricts any cookies it sets to only be for the subdirectory so they don't get seen by any other programs.
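
To make the cookie rewrite concrete, here's a rough Python model of what "ProxyPassReverseCookiePath" does to a Set-Cookie header.  This is a simplified illustration, not mod_proxy's actual code, and the cookie name and value are made up:

```python
def rewrite_cookie_path(set_cookie, internal_path, public_path):
    """Rewrite the Path attribute of one Set-Cookie header value."""
    rewritten = []
    for part in (p.strip() for p in set_cookie.split(";")):
        name, sep, value = part.partition("=")
        if name.lower() == "path" and value.startswith(internal_path):
            # Swap the internal prefix for the public one.
            value = public_path + value[len(internal_path):]
            part = f"{name}{sep}{value}"
        rewritten.append(part)
    return "; ".join(rewritten)


# Hypothetical cookie as the backend might send it
print(rewrite_cookie_path("session=abc123; Path=/; HttpOnly", "/", "/mythweb/"))
```

The browser then only sends that cookie back for requests under /mythweb/, so the other proxied applications never see it.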

Doing a proxy for something that isn't already packaged in a subdirectory can get much more complicated, as you have to tell Apache how to modify the files being served as well as the URLs being requested.  This potentially means modifying .js, .css, and .html files, and what has to be done is different for each application.

Perhaps my simplest example is my NAS management.  I'm running XigmaNAS.

    <Location /nas/>
        ProxyPass http://nas.mydomain.com/
        ProxyPassReverse http://nas.mydomain.com/
        ProxyPassReverse /
        ProxyHTMLEnable On
        ProxyHTMLExtended On
        RequestHeader    unset  Accept-Encoding
        ProxyHTMLURLMap ^/ /nas/ R
        ProxyHTMLURLMap http://[w.]*mydomain[^/]*/ /nas/ R
    </Location>

This starts out the same, but there's a good bit more.  Now we need another ProxyPassReverse line for '/' so that redirects to the root are sent to the location (/nas/).  Since we're modifying file contents, we unset the "Accept-Encoding" request header so the backend doesn't compress its responses.  I use ProxyHTMLURLMap to do regular expression substitutions on links in html files.  The "extended" option also tries to modify Javascript and CSS within html files.  The two ProxyHTMLURLMap lines move any top-level links into the subdirectory, and do the same for any full links.  Fortunately this program doesn't use any CSS or JS files with links in them that need to be modified.
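
To see what the two ProxyHTMLURLMap rules do to individual links, here's a small Python sketch that applies the same regular expressions.  It's only a model of the behavior as I understand it (for HTML links, the first rule that matches wins), and the sample links in the usage lines are made up:

```python
import re

# The same two patterns as the ProxyHTMLURLMap lines, in the same order
URL_MAPS = [
    (re.compile(r"^/"), "/nas/"),
    (re.compile(r"http://[w.]*mydomain[^/]*/"), "/nas/"),
]


def map_url(url):
    """Apply the first matching rule to one link value."""
    for pattern, replacement in URL_MAPS:
        new_url, count = pattern.subn(replacement, url, count=1)
        if count:
            return new_url
    return url  # relative links and foreign sites pass through untouched


for link in ("/images/logo.png", "http://www.mydomain.com/status.php", "style.css"):
    print(link, "->", map_url(link))
```

Relative links like "style.css" need no rewriting, since the browser already resolves them against /nas/.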

However, there's some setup for the ProxyHTMLURLMap command to define what gets treated as a URL.  I've seen some references to including "proxy_html.conf," but my Apache didn't have that file, so I put these lines in directly to my config file.  Note that these are not specific to a location, so I put them before the location tags:

    ProxyHTMLEvents onclick ondblclick onmousedown onmouseup onmouseover onmousemove onmouseout onkeypress onkeydown onkeyup onfocus onblur onload onunload onsubmit onreset onselect onchange
    ProxyHTMLLinks  a          href
    ProxyHTMLLinks  area       href
    ProxyHTMLLinks  link       href
    ProxyHTMLLinks  img        src longdesc usemap
    ProxyHTMLLinks  object     classid codebase data usemap
    ProxyHTMLLinks  q          cite
    ProxyHTMLLinks  blockquote cite
    ProxyHTMLLinks  ins        cite
    ProxyHTMLLinks  del        cite
    ProxyHTMLLinks  form       action
    ProxyHTMLLinks  input      src usemap
    ProxyHTMLLinks  head       profile
    ProxyHTMLLinks  base       href
    ProxyHTMLLinks  script     src for
    ProxyHTMLLinks  iframe     src

For my managed network switch, I had to also do some additional modifications to the html files to get it to work.  In the process, I double-modified some URLs, so I had to undo them:

        AddOutputFilterByType SUBSTITUTE text/html
        Substitute s|action="/|action="/switch/|n
        Substitute s|="/"|="/switch/"|n
        Substitute s|location.href="/|location.href="/switch/|n
        Substitute s|"/switch/switch/|"/switch/|n
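
The double-modification is easier to see by running the rules by hand.  Here's a Python sketch that applies the same four fixed-string substitutions in order; it only models the text replacement, not Apache's filter chain, and the sample lines are made up:

```python
# The four Substitute rules above, as (search, replacement) pairs
RULES = [
    ('action="/', 'action="/switch/'),
    ('="/"', '="/switch/"'),
    ('location.href="/', 'location.href="/switch/'),
    ('"/switch/switch/', '"/switch/'),
]


def substitute(line, rules=RULES):
    """Apply each fixed-string rule in order, like the 'n' flag does."""
    for old, new in rules:
        line = line.replace(old, new)
    return line


print(substitute('<form action="/login.cgi">'))
# This one is hit by rule 2 and then again by rule 3; rule 4 cleans it up.
print(substitute('location.href="/";'))
```

Without the last rule, the second example would end up pointing at /switch/switch/.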

So how did I figure that out?  It's an iterative process of loading it through the proxy, looking at what files are being requested, and figuring out where the browser got the wrong information when a link wasn't translated correctly.  This isn't too hard if you open up the browser's developer tools (Control-Shift-I in both Chrome and Firefox) and look at the network tab.  There you can see every request and response as seen by the web browser.  You can compare connecting directly and through the proxy.

From here it keeps getting more complicated, but it boils down to creating Substitute rules for other content types.  I had to watch carefully in the developer tools, as in one application I had to modify application/javascript, while in another it was application/x-javascript.

Another issue that I've encountered is proxying an internal device that uses https with a bad certificate.  I have my own legitimate certificate, and I just want to ignore the internal one, and it turns out that's easy to do.  I just put the following into my config file (before the <location> tags):

    # Enable SSL proxy and ignore local certificate errors
    SSLProxyEngine On
    SSLProxyVerify none
    # Ignore certificate errors (Apache comments must be on their own line)
    SSLProxyCheckPeerCN Off
    SSLProxyCheckPeerName off
    SSLProxyCheckPeerExpire off

With that I can set a proxy to https:// instead of http:// and everything else just works.

What about security?  Many of the things I'm proxying already have password protection, but some don't.  Fortunately it's easy to add a password inside any <location> block:

        # Authentication
        AuthType Basic
        AuthName "Password"
        AuthUserFile /var/www/localhost/accounts/webfrontend
        require valid-user

That's just the same as you would do for a <directory> if you weren't doing a proxy.  Note that my server only runs on https, or I wouldn't use "basic" authentication.
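
If you're wondering why "basic" authentication needs https, this Python sketch shows how little protection the header has on its own: it's just base64, which anyone watching the connection can reverse.  The function names are mine, and the credentials are the well-known example from the HTTP Basic authentication spec:

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value a browser sends for Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header_value):
    """Recover the username and password from the header value."""
    _scheme, _, token = header_value.partition(" ")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```

Over https the header is encrypted along with the rest of the request, which is what makes this acceptable.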

Another problem is doing a proxy for something that uses websockets.  I hit this with my security camera, and the following does the trick:

        # See: https://httpd.apache.org/docs/2.4/mod/mod_proxy_wstunnel.html
        RewriteEngine on
        RewriteCond %{HTTP:Upgrade} websocket [NC]
        RewriteCond %{HTTP:Connection} upgrade [NC]
        RewriteRule ^/?(.*) "ws://camera.mydomain.com/$1" [P,L]
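
As a rough model of what those rewrite lines check, here's a Python sketch.  It mimics the two conditions (unanchored, case-insensitive matches on the request headers) and the rule's capture group; it is not how mod_rewrite is implemented, and the sample path is made up:

```python
import re

BACKEND = "ws://camera.mydomain.com/"


def is_websocket_upgrade(headers):
    """Mimic the two RewriteCond lines against a dict of request headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    upgrade = lowered.get("upgrade", "")
    connection = lowered.get("connection", "")
    return bool(re.search("websocket", upgrade, re.IGNORECASE)
                and re.search("upgrade", connection, re.IGNORECASE))


def backend_url(path):
    """Mimic the RewriteRule: capture everything after one optional leading slash."""
    match = re.match(r"^/?(.*)", path)
    return BACKEND + match.group(1)


request = {"Upgrade": "websocket", "Connection": "keep-alive, Upgrade"}
if is_websocket_upgrade(request):
    print(backend_url("/live/stream"))
```

Ordinary page requests fail the conditions and fall through to whatever ProxyPass line covers the location.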

Unfortunately I haven't succeeded in doing a proxy of everything.  One device fails to load all the pages even though it appears Apache is modifying all the links correctly.  What I have been able to do in that case is have Apache listen on a separate port with a new entry in /etc/apache2/vhosts.d/, and that vhost is a straight proxy for the device without any link rewriting to push it into a subdirectory.  If you're having trouble getting something to work with the rewrites, that's a good first step.

The Apache proxy feature is very powerful.  It's great to be able to take all my different devices and put them in a single interface and appear as if they are all on the same server.  Unfortunately this can be very complicated in some cases.  If developers would avoid absolute links, especially in CSS and JavaScript, it would make proxying much easier.  It would also be nice if there were some community database of proxy recipes for devices and web applications.  You would think there would be a wiki for this with pages for each device or application that someone had done a proxy for.

Transparent Mode Proxies

I have a fairly complicated home network setup, which should come as a surprise to absolutely nobody.  I recently dealt with an issue that had been bugging me for ages, but first some background.  I want to be able to connect to my home systems from pretty much anywhere, and ssh is the obvious tool for that.  Occasionally I've been somewhere where they've blocked outgoing connections to port 22 (the ssh port), but I can instead connect to port 443 (the https port).  So if I instead run ssh on port 443, that gets around the problem.  But I also want to have a web server on port 443.  Fortunately there's this neat little tool called 'sslh' that can sit and listen for connections on a port, and when something connects and sends a message to the server, it determines what protocol the client is using and forwards the connection to the appropriate program.  So now I have multiple services running on the one port that should never be blocked.

But there's a problem.

The server logs for ssh and apache show the source of the connections as being from the local system, which is technically correct since they are coming from sslh on my local system.  I could merge the logs from sslh into the logs for apache and ssh, but that would be a pain.  What I really want is to have the original source IP show up in the logs for the applications as if there wasn't a proxy in the middle.

Wishful thinking, right?  But apparently some really smart people wished for it, so they made it happen.

There are instructions for how to make this work for sslh, and if you follow them exactly, and if you're lucky, it does work.  I say you have to be lucky because there are some subtle issues that you'll hit if you think you're smarter than the instructions or try to do things a little differently.  Which is exactly the sort of thing I'm obviously going to do.

The way sslh works is that it accepts a connection and then looks at the first message the client sends, which it uses to determine what application to forward the connection to.  Fortunately network protocols tend to expect the client to send the first message, so for things like ssh, ssl, and http, sslh will know what to do.  (However, some protocols like imap have the server send the first message upon establishing a connection, so sslh can only service one such protocol by defaulting to it using a timeout.)
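
To give a feel for how that first message gives the protocol away, here's a deliberately simplified Python sketch.  These are not sslh's actual probes, just the obvious tell-tale bytes: an ssh client opens with a version banner, a TLS client opens with a handshake record, and an http client opens with a request method.

```python
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ",
                b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ")


def guess_protocol(first_bytes):
    """Guess the protocol from the first bytes a client sends."""
    if first_bytes.startswith(b"SSH-"):
        return "ssh"
    # TLS record header: content type 0x16 (handshake), major version 0x03
    if len(first_bytes) >= 2 and first_bytes[0] == 0x16 and first_bytes[1] == 0x03:
        return "tls"
    if first_bytes.startswith(HTTP_METHODS):
        return "http"
    return "unknown"


for sample in (b"SSH-2.0-OpenSSH_9.3\r\n", b"\x16\x03\x01\x02\x00\x01", b"GET / HTTP/1.1\r\n"):
    print(guess_protocol(sample))
```

A client that stays silent never produces any bytes to inspect, which is why a server-speaks-first protocol can only be handled through the timeout.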

So my real setup is more complicated than what I described (which should come as no surprise).  The issue I hit was with connections using ssl (or really tls as the newer versions have been renamed).  If I am going to have multiple services using ssl encryption, then I need to decrypt the incoming connection and then use sslh to multiplex to different applications.  This is done by using the program stunnel.  And it also has transparent proxy support.

So an incoming connection comes in on port 443.  First sslh gets it and sees what protocol it's using.  If it's SSL/TLS, it sends it to stunnel, which then sends it back to sslh, which finally sends it on to apache, ssh, imap, or whatever else I have hidden behind that port.

Support for transparent proxying is included with both sslh and stunnel, so I'm good, right?

Nope.

I can get it working with one of them.  I can get it working with sslh going to stunnel and on to apache.  But if I have stunnel going to sslh, it breaks badly.

Why is that?

Well here's the problem.  A quick search brings up instructions on how to make the transparent proxy work, but while they give you the formula, they don't explain how it actually works.  And without understanding the reasoning behind the instructions, you're stuck if something goes wrong or if you want to try some creative variation on the same concept.

So I decided to figure out what's actually happening, and here's the technical meat of my post:

There are two parts to making this work.  The first is that the transparent proxy has to send outgoing packets that appear to be from the original host, not from itself.  The second is that the network layer of the operating system has to know to route the return packets back to the proxy application, even though they'll be addressed to some other system.

To send packets with the original source IP, the transparent proxy has to do two things between creating the socket and connecting to the target.  First, it has to enable transparent mode, which requires either running as root or having the cap_net_raw capability.  This is done with code like:

  /* fd is the new outgoing socket to the target */
  int transparent=1;
  res = setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &transparent, sizeof(transparent));

And then it has to say where the packet is to appear to originate from:

  /* fd_from is the accepted client connection; its peer address becomes our source */
  getpeername(fd_from, from.ai_addr, &from.ai_addrlen);
  res = bind(fd, from.ai_addr, from.ai_addrlen);

Normally you only bind a socket to an address like that when listening for incoming connections, but that's how you tell the kernel what address you're sending from in transparent mode.

The other part of this is to have the operating system route packets for your application back to you.  I won't go into the details here, but there are two approaches I've seen.  One is to run your proxy as a specific user and have a firewall rule that marks all packets from that user, with further rules that send the return packets to a different routing table that tells them to go to the local machine.  The other, and the one I prefer, is to connect to a local IP address other than 127.0.0.1, such as 127.255.0.1.  Really, anything in 127.x.x.x works except all zeros (127.0.0.0) or all 255s (127.255.255.255).  (I have a separate post about localhost being 127.0.0.1/8 instead of 127.0.0.1/32, giving you a ton of addresses to use for things like this.)
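
If you want to check whether an address falls in that range, Python's ipaddress module makes it a one-liner.  This is just a sketch of the rule stated above, and the helper name is my own:

```python
import ipaddress

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")


def usable_loopback(addr):
    """True for any 127.x.x.x address except the all-zeros and all-255 ones."""
    ip = ipaddress.ip_address(addr)
    return ip in LOOPBACK and ip not in (LOOPBACK.network_address,
                                         LOOPBACK.broadcast_address)


print(LOOPBACK.num_addresses)          # size of the whole 127.0.0.0/8 block
print(usable_loopback("127.255.0.1"))  # the example address used above
print(usable_loopback("192.168.0.5"))  # an ordinary LAN address
```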

Note that the above assumes that the target of the transparent proxy is on the same system.  It's probably possible to create firewall rules that will make this work with proxying for a separate internal server, but I haven't explored that, as I haven't needed it yet.

Of course, what would be really nice is to just tell the networking layer to check all local bindings before forwarding packets, but there's no option for that.  I think that would make a nasty mix of code layers in the routing code, so I can understand why the fancy firewall rules are required.


So now that I understand how it's supposed to work, why didn't it work for me with my complicated setup?

The problem was the bind() call.  That's not just binding an IP address, it's also binding a port.  What does that mean?  Every network socket is bound to a combination of an IP address and a port.  This is done explicitly for any service listening for incoming connections.  For outgoing connections, it's done implicitly with the local IP and some high numbered port that is automatically assigned.  But in transparent proxy mode, that outgoing connection binding is controlled directly by the program.  And if you have multiple layers of transparent proxies, you get into trouble.  You simply can't have two connections bound to the same IP address and port.  In transparent mode, you're using the original IP and port, so a second hop tries to bind an address that the first hop already holds, and fails.
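
You can see the same conflict without any transparent proxying at all.  This Python sketch binds two ordinary sockets to the same loopback address and port and reports the error from the second bind.  It uses 127.0.0.1 so it runs without special privileges; it is only an illustration of the bind() behavior, not code from sslh or stunnel:

```python
import errno
import socket


def second_bind_error():
    """Bind two sockets to the same address and port; return the second bind's errno."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))    # port 0: let the kernel pick a free port
        address = first.getsockname()   # the (ip, port) pair that was assigned
        try:
            second.bind(address)        # same ip and port as the first socket
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


print(errno.errorcode.get(second_bind_error()))
```

On my understanding of Linux this reports EADDRINUSE, which is the same failure the second transparent hop runs into.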

Except it does if stunnel is the second hop.  How does it do this?  It gets a bind error, but then retries with a new port.  This means the end application sees a different origin port number, but the IP address is right.  And in most cases, that port number isn't logged or meaningful, so it doesn't matter.  It does break the 'ident' protocol, but I don't think anyone uses that anymore; certainly not for connections from outside a firewall.
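
The retry idea can be sketched in a few lines of Python.  This is my own illustration of the approach, not stunnel's code: it runs on plain loopback sockets, whereas the real thing would also need the transparent socket option described earlier.

```python
import errno
import socket


def bind_with_retry(sock, ip, port):
    """Try the requested port; if it is taken, keep the IP and take any free port."""
    try:
        sock.bind((ip, port))
    except OSError as exc:
        if exc.errno != errno.EADDRINUSE:
            raise
        sock.bind((ip, 0))  # port 0 asks the kernel for a new port
    return sock.getsockname()


# Stand-in for the first hop, which already holds the original address and port
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
ip, port = holder.getsockname()

# Stand-in for the second hop, which asks for the same address and port
second_hop = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print((ip, port), "->", bind_with_retry(second_hop, ip, port))

holder.close()
second_hop.close()
```

The IP address survives the retry and only the port changes, which is exactly the trade-off described above.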

So to make it work for me, once I understood what was really happening and why it was failing, I put the same retry on bind failures into sslh, using a new port.  And since it's open source, I sent my patch to the developer of the program, and it will be in the next release.  That's the power of open source.