Every so often when you want to archive a webpage, you notice it’s full of dynamic content and JavaScript that won’t archive easily. I was recently looking to archive a Matterport 3D image. This is a typical example of a site that normal web archivers can’t easily save, as it relies on JavaScript to dynamically fetch images as you move through the 3D space.
One generic solution to capture something like this is to use a proxy in the web browser and save everything that passes through it. But most proxies only cache things for a limited time and respect headers like no-cache1. If the proxy instead ignored that and stored all requests that flow through it indefinitely, you could create a “snapshot” of a website just by browsing it through this archiving proxy.
Turns out I am not the first one to come up with this idea; there are at least two tools out there that do this. The first one I tried was Proxy Offline Browser, a Java GUI application. It worked quite well, but the free version does not do TLS/HTTPS. The Pro version is only 30 euro, but I was curious to see whether there was an open-source solution that could do the same.
Turns out there is: it’s called WWWOFFLE and it has a lovely compatible webpage. After some trying I got it working, and I’ll describe the rough outline of how to set it up here. Note though, if you value your time or don’t feel like fiddling around in the terminal, I do recommend just paying the 30 euro for Proxy Offline Browser and being done with it.
Steps for getting it working on OS X
First you need to download the wwwoffle source code and make sure you have the GnuTLS headers and libraries, so you can use it for HTTPS.
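On OS X the GnuTLS dependency is easiest to get via Homebrew. The sketch below assumes the Homebrew gnutls formula and an illustrative download URL and version (2.9j, matching the prefix used in the configure line below); check the wwwoffle homepage for the actual current release.
# GnuTLS headers and libraries via Homebrew
brew install gnutls
# download and unpack the wwwoffle source (URL and version assumed)
curl -LO https://www.gedanken.org.uk/software/wwwoffle/download/wwwoffle-2.9j.tgz
tar xzf wwwoffle-2.9j.tgz
cd wwwoffle-2.9j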
Then compile it with
./configure --prefix=/usr/local/Cellar/wwwoffle/2.9j/ --with-gnutls=/usr/local --with-spooldir=/usr/local/var/run/wwwoffle --with-confdir=/usr/local/etc/
make
make install
Then run it
wwwoffled -c /usr/local/etc/wwwoffle.conf -d
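To check that the proxy is actually up and answering, you can point curl at it; plain HTTP is enough for a first test, and example.com here is just a stand-in for any site:
# fetch headers for a page through the proxy as a quick sanity check
curl -I -x http://localhost:8080 http://example.com/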
Now there are a few more steps before you can start archiving.
First reconfigure your browser2 to use wwwoffle as a proxy. Then visit http://localhost:8080 in the browser to get to the wwwoffle page. Using this page, you can control wwwoffle and see what it has cached.
Next, you will need to get the CA certificate so you won’t get SSL warnings all the time. Go to http://localhost:8080/certificates/root, then download and install it.
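If you prefer the command line, the certificate can also be fetched with curl and then imported into the browser manually. I’m assuming here that the /certificates/root page serves the certificate directly; if it turns out to be an HTML page, use the download link on it instead:
# grab the wwwoffle root CA (output filename is arbitrary)
curl -o wwwoffle-root-ca.pem http://localhost:8080/certificates/root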
Then you need to put wwwoffled into online mode, which you can do at http://localhost:8080/control/.
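wwwoffle also ships a command-line client that can switch modes. I believe the relevant switches are -online and -offline, with -p pointing it at the daemon, but treat the exact flags as an assumption and double-check the man page:
# assumed invocation of the wwwoffle control client
wwwoffle -online -p localhost:8080
# and later, to switch back:
wwwoffle -offline -p localhost:8080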
Then configure wwwoffled itself, which you can do using the built-in web-based configuration tool.
The settings to change are
http://localhost:8080/configuration/SSLOptions/enable-caching to yes
and
http://localhost:8080/configuration/SSLOptions/allow-cache to allow-cache = *:443
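If you’d rather edit /usr/local/etc/wwwoffle.conf directly, the corresponding section should look roughly like the sketch below. The section and option names are taken from the configuration URLs above, so compare with the comments in the shipped config file before trusting the exact syntax.
SSLOptions
{
 enable-caching = yes
 allow-cache = *:443
}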
That should hopefully be enough. Now try browsing some website, then go to the control page and put wwwoffled into offline mode. You should still be able to browse the same page, served from the cache.
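You can do the same check without a browser: with wwwoffled in offline mode, a request through the proxy for a page you already visited should come straight out of the cache (example.com again standing in for whatever you browsed):
# should return the cached copy even though wwwoffled is offline
curl -s -x http://localhost:8080 http://example.com/ | head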
Additionally, I had to add
CensorHeader
{
Access-Control-Allow-Origin = *
}
to http://localhost:8080/configuration/CensorHeader/no-name to ensure AJAX3 requests worked in some cases.
If you run into other issues, you can either start debugging or go back and cough up the money :-)
- which seems to be standard practice nowadays even for things that should definitely be cached [return]
- I recommend using a browser other than your main one for this, to keep things separated. On OS X I’d recommend Firefox as it keeps its trusted CAs separate from the OS’s, so you won’t need to have your whole computer trust the newly minted CA certificate. [return]
- yeah I’m old [return]