published on in web tech

Using WWWOFFLE to save a modern webpage for later

Every so often, when you want to archive a webpage, you notice it’s full of dynamic content and JavaScript that won’t easily be archived. I was recently looking to archive a Matterport 3D image. This is a typical website that won’t save easily with normal web archivers, as it relies on JavaScript to dynamically fetch images as you move through the 3D space.

One generic solution to capture something like this is to use a proxy in the web browser and save everything that passes through it. Most proxies, though, only cache things for a limited time and respect headers like no-cache[1]. If the proxy instead ignored those headers and stored every request that flows through it indefinitely, you could create a “snapshot” of a website by browsing it through this archiving proxy.

Turns out I am not the first one to come up with this idea; there are at least two tools out there which do this. The first one I tried was Proxy Offline Browser, a Java GUI application. It worked quite well, but the free version does not do TLS/HTTPS. The Pro version is only 30 euro, but I was curious to see if there was an open-source solution that could do the same.

Turns out there is: it’s called WWWOFFLE, and it has a lovely compatible webpage. After some trying I got it working, and I’ll describe the rough outline of how to do so here. Note though, if you value your time or don’t feel like fiddling around in the terminal, I do recommend just paying 30 euro for Proxy Offline Browser and being done with it.

Steps for getting it working on OS X

First you need to download the wwwoffle source code and make sure you have the GnuTLS headers and libraries, so you can use it for HTTPS.
Then compile it with

./configure --prefix=/usr/local/Cellar/wwwoffle/2.9j/ --with-gnutls=/usr/local --with-spooldir=/usr/local/var/run/wwwoffle --with-confdir=/usr/local/etc/
make
make install

Then run it

wwwoffled -c /usr/local/etc/wwwoffle.conf -d
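
As far as I know, -d keeps wwwoffled in the foreground, so a quick check from a second terminal can confirm it is answering before you touch the browser. A minimal sketch, assuming curl is installed and the daemon listens on port 8080 as in the rest of this post; wwwoffle_alive is a made-up helper name, not part of WWWOFFLE.

```shell
# Hypothetical helper (not part of WWWOFFLE): succeeds if an HTTP server
# answers on the given local port (default 8080, the port used in this post).
wwwoffle_alive() {
  port="${1:-8080}"
  # --noproxy '*' talks to the daemon directly even if a proxy is set in the
  # environment; --max-time keeps the check from hanging.
  curl --silent --output /dev/null --max-time 5 --noproxy '*' \
    "http://localhost:${port}/"
}

# Usage: wwwoffle_alive && echo "wwwoffled is up" || echo "no answer on 8080"
```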

Now there are a few more steps before you can start archiving.

First reconfigure your browser[2] to use wwwoffle as its proxy. Then visit http://localhost:8080 in the browser to get to the wwwoffle page. Using this page, you can control wwwoffle and see what it has cached.

First, you will need to get the CA certificate, so you won’t get SSL warnings all the time. Go to http://localhost:8080/certificates/root, download and install it.

Then you need to put wwwoffled into online mode, which you can do at http://localhost:8080/control/
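
The same switch can also be flipped from the terminal with the wwwoffle client program that gets installed alongside the daemon. The wrapper below is only a sketch: wwwoffle_mode is a hypothetical name, and the config path is the one assumed from the configure line earlier in this post.

```shell
# Config path assumed from the --with-confdir used at compile time.
WWWOFFLE_CONF="${WWWOFFLE_CONF:-/usr/local/etc/wwwoffle.conf}"

# Hypothetical wrapper: switch the running daemon between modes.
wwwoffle_mode() {
  case "$1" in
    online|offline|autodial)
      # e.g. "wwwoffle -c /usr/local/etc/wwwoffle.conf -online"
      wwwoffle -c "$WWWOFFLE_CONF" "-$1"
      ;;
    *)
      echo "usage: wwwoffle_mode online|offline|autodial" >&2
      return 2
      ;;
  esac
}

# Usage: wwwoffle_mode online
```

Having it as a command is mostly handy later on, when you flip between online and offline a few times while testing the cache.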

Then configure wwwoffled itself, which you can do using the built-in web-based configuration tool.

The settings to change are

http://localhost:8080/configuration/SSLOptions/enable-caching to yes

and

http://localhost:8080/configuration/SSLOptions/allow-cache to allow-cache = *:443
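
If you would rather edit wwwoffle.conf by hand than click through the web tool, the two settings above should correspond to something like the fragment below. This is a sketch that assumes the option names from the URLs above and the usual section syntax of the config file; your real SSLOptions section will likely contain other options that should be left in place, and the daemon has to re-read the file before the change takes effect.

```
SSLOptions
{
 enable-caching = yes
 allow-cache = *:443
}
```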

That should hopefully be enough. Now try browsing some website. Then go to the control page and put wwwoffled into offline mode. Hopefully, you should still be able to browse the same page, using the cache.
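
To check the cache without clicking around in the browser, you can ask for a page through the proxy from the terminal while wwwoffled is offline. A rough sketch, assuming curl is available; cached_status and the WWWOFFLE_PROXY and WWWOFFLE_CA variables are made up for this example, and the CA file is wherever you saved the root certificate.

```shell
# Proxy address used throughout this post; override via the environment.
WWWOFFLE_PROXY="${WWWOFFLE_PROXY:-http://localhost:8080}"

# Hypothetical helper: print the HTTP status code the proxy returns for a URL.
# Set WWWOFFLE_CA to the saved root certificate to check https:// URLs.
cached_status() {
  url="$1"
  set -- --silent --output /dev/null --max-time 10 --proxy "$WWWOFFLE_PROXY"
  if [ -n "${WWWOFFLE_CA:-}" ]; then
    set -- "$@" --cacert "$WWWOFFLE_CA"
  fi
  curl "$@" --write-out '%{http_code}\n' "$url"
}

# Usage: cached_status http://example.com/
```

A page you browsed while online should come back with the status it had at the time (typically 200). Comparing that with a URL you never visited gives a quick idea of whether the cache is really being used.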

Additionally, I had to add

CensorHeader
{
 Access-Control-Allow-Origin = *
}

to http://localhost:8080/configuration/CensorHeader/no-name to ensure AJAX[3] requests worked in some cases.

If you run into other issues, you can either start debugging or go back and cough up the money :-)


  1. which seems to be standard practice nowadays even for things that should definitely be cached
  2. I recommend using a browser other than your main one for this, to keep things separated. On OS X I’d recommend Firefox, as it keeps its trusted CAs separate from the OS’s, so you won’t need to have your whole computer trust the newly minted CA certificate.
  3. yeah I’m old