Fixes and a software sketch for caching a legacy web application.
If one visits the Harvard Book Store’s events webpage, it presents this gray, spinning wheel for several seconds before the events come into place. On a good day, it might be two or three seconds; on a bad day, it might take more than 10. Or 13.
Clicking on an event, and then navigating back… the same thing happens, with the spinning wheel occupying the space for several seconds. It should remember some content…
Peeking under the hood, a giant scab of an ajax script emerges from the browser’s timings chart:
What. the cluck. If Ann Patchett is coming to town, we should all find out sooner. As soon as possible.
What to do, what to do
Emailing with the bookstore’s IT person, it sounded like there weren’t great configuration options from within the software system; the system is old. Really old, and doesn’t have a manual around.
Fixing it would likely involve writing some php for an ancient application. The code would have to follow the logic that the events page doesn’t need to be updated until the soonest event occurs, or until the page’s content is changed. For client-side caching, the server should also keep the page on a relatively short leash, in case of updates; maybe 15 minutes. Perhaps aligned to the hour, because all their events are on the half or quarter-hour.
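The expiry rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `cache_lifetime`, its parameters, and the idea of passing event start times as a list are hypothetical, and nothing here comes from the bookstore's application.

```python
from datetime import datetime, timedelta, timezone

def cache_lifetime(now, event_starts, leash=timedelta(minutes=15)):
    """Seconds a cached copy of the events page may be reused.

    The copy expires at the next boundary aligned to the leash
    (:00, :15, :30, :45 for 15 minutes) or at the start of the soonest
    upcoming event, whichever comes first. All datetimes must be
    timezone-aware and in the same zone.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = now - midnight
    # Whole leash periods since midnight, plus one, gives the next boundary.
    next_boundary = midnight + (elapsed // leash + 1) * leash

    # Events that have already started no longer constrain the cache.
    upcoming = [start for start in event_starts if start > now]
    expiry = min([next_boundary] + upcoming)
    return int((expiry - now).total_seconds())
```

The returned value could feed a `max-age` directive; a content change would still need a separate invalidation path.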
This script is definitely not doing any caching, or anything smart: from a data-rate perspective, the content is only 7 KB, and takes a second and a half to finish. That’s crawling along at under 5 KB a second, compared to 2 MB in 4.5 seconds for the whole page, which is over 400 KB a second on average. (And much more in practice, given the portion of time taken up by latency and server-side processing.)
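As a quick check of that arithmetic, here is a small calculation, taking 2 MB as 2000 KB (an assumption; the helper name is made up for illustration):

```python
def rate_kb_per_s(size_kb, seconds):
    """Average transfer rate in KB per second."""
    return size_kb / seconds

script_rate = rate_kb_per_s(7, 1.5)     # the ajax script
page_rate = rate_kb_per_s(2000, 4.5)    # the whole page, 2 MB taken as 2000 KB

# Script rate, page rate, and how many times faster the page is on average.
print(round(script_rate, 1), round(page_rate), round(page_rate / script_rate))
# prints: 4.7 444 95
```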
Better yet, the timings chart for the script helpfully informs us that the request spent all but 1 millisecond waiting for a response from the server:
So there’s no server caching. And the headers don’t allow for client-side caching, either:
- Expires has a 1997 date
- Cache-Control is set with a max-age of 1 (second), with a must-revalidate directive
- Last-Modified is always the server time, current at the time of the request.
On refresh, if the browser goes to look for the script contents in its cache, it will find that it’s already expired or too old. Even if the client sends an If-Modified-Since using the last Last-Modified time (and it does), the server doesn’t issue a 304 Not Modified, presumably because it doesn’t have an internal concept of when the output changed. So it will run the script again and send the contents. This looks like a cheap way of getting a calendar where everyone always has a consistent, up-to-date view.
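For contrast, here is a minimal sketch of what a server could do if it did track when its output last changed. The function `conditional_response` and its arguments are hypothetical; the only assumption is that the change time is available as a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(if_modified_since, content_changed_at):
    """Return (status, headers) for a GET.

    if_modified_since is the raw request header value (or None);
    content_changed_at must be a timezone-aware UTC datetime.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    changed = content_changed_at.replace(microsecond=0)
    headers = [('Last-Modified', format_datetime(changed, usegmt=True))]

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send a full response
    if since is not None and since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)

    if since is not None and changed <= since:
        return '304 Not Modified', headers
    return '200 OK', headers
```

With a stable change time, a repeat request carrying the last Last-Modified value would get the 304 instead of a full re-run of the script.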
Looking at the script’s content, as one might expect, it returns the html for the structured data embedded in the events calendar: event times and places, book details, and book/author image links. Right after the request returns, the browser issues a flurry of requests for all of the linked pictures–the author pictures from the same server, and the book pictures from an amazon site. Once these come in, the page is rendered. This script holds up everything.
The server does do the right thing with image headers. On a new request, the Cache-Control header is set to public, with a large max-age. Last-Modified has a permanent date. On refresh, the server issues a 304 and the browser loads the image from cache.
(Curiously, the requests for the author pictures from the bookstore site are “blocked” for upwards of hundreds of milliseconds; even when the image is revalidated, it still takes the server 50+ milliseconds to decide this is the case, and the requests back up in the browser. The aws pictures are snappy and don’t back up. Some server…)
Adding some apache header directives
Since changing the source code isn’t viable, mucking with the server configuration might improve some things. Apache’s header directives can drop/set the right headers for getting some caching, somewhere.
For request headers, apache should drop any client cookies: once caching is on, only requests that miss the cache make it through to the application, so any cookie-based count on the server side would misrepresent and undercount the visitor population. There’s also already a cookie on the base events page, and a tidy correspondence between loads of /events and /ajax/events/upcoming.
For response headers, apache should drop or set the expires, last-modified, and cache-control headers noted above to get some server-side caching in. Any headers for setting the cookies should also be dropped, as these might be cached and re-served to multiple clients; these may be re-used in later browsing, polluting the visitor analytics.
A test setup
Without the ability to login to someone’s server, it can be hard to test changes. Luckily, the events page can be easily mocked out with a python script that emulates the server’s behavior; only the headers matter in this scenario. And the configuration changes should only apply to the ajax script anyway, so they are limited in scope.
A python script that returns the current time as the contents (this makes it easy to see if the content has changed), with mimicked headers, is straightforward (footnote 1):
import datetime
import time

from dateutil import tz

def application(environ, start_response):
    status = '200 OK'
    # the current time as the body makes a changed response easy to spot
    html = str(time.time()) + '\n'
    http_date = '%a, %d %b %Y %H:%M:%S GMT'
    start_response(status, [
        ('Content-Type', 'text/plain'),
        ('Cache-Control', 'private, must-revalidate, max-age=1'),
        ('Pragma', 'no-cache'),
        ('Last-Modified', datetime.datetime.now(tz=tz.gettz('GMT')).strftime(http_date)),
        ('Expires', datetime.datetime(year=1997, month=7, day=5, hour=12, tzinfo=tz.gettz('GMT')).strftime(http_date)),
        ('Set-Cookie', 'random_test_cookie=' + str(time.time())),
    ])
    # WSGI bodies must be bytes under python 3
    return [html.encode('utf-8')]
Linking the script into apache, a test request shows matching headers coming back.
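The headers can also be inspected without apache by calling a WSGI application in-process. The sketch below is not a substitute for a real request through the server; `call_wsgi` and `mock_app` are hypothetical names, and `mock_app` is a cut-down stand-in for the mock script rather than the script itself.

```python
from wsgiref.util import setup_testing_defaults

def call_wsgi(app, path='/'):
    """Invoke a WSGI app directly; return (status, headers dict, body bytes)."""
    environ = {'PATH_INFO': path, 'REQUEST_METHOD': 'GET'}
    setup_testing_defaults(environ)  # fills in the remaining required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = dict(headers)

    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body

def mock_app(environ, start_response):
    # Stand-in for the mock script, reduced to two of its headers.
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Cache-Control', 'private, must-revalidate, max-age=1'),
    ])
    return [b'hello\n']

status, headers, body = call_wsgi(mock_app, '/cacheme')
print(status, headers['Cache-Control'])
```

This only shows what the application emits; anything apache adds or removes afterwards has to be checked with a request through the server.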
Apache header directives
To remove and set the headers as described in this emulated environment, directives from mod_headers fix up the /cacheme path (footnote 2):
<Location "/cacheme">
    # ... boilerplate setup ...
    # remove client cookies
    RequestHeader unset Cookie
    # remove server cookies
    Header unset Set-Cookie
    # set/unset server -> client caching headers
    Header set Cache-Control "public, max-age=60"
    Header unset Last-Modified
    Header unset Pragma
    Header unset Expires
</Location>
Any requests under /cacheme now get this modified response:
With these changes, server-side disk caching now works; adding the CacheHeader and CacheDetailHeader directives adds hit/miss debug headers that show up on the client side.
(Trying to configure a memory cache via apache’s mod_socache did not work with the header modifications. Disabling gzipping entirely worked, and reordering the cache/deflate filters worked for server-side caching, but did not work with these mocked-out headers; the header directives were then applied after caching, and the cache refused the content because of the expired header. See footnote 3.)
Client-side caching
Server-side caching is an improvement, but this didn’t enable client-side caching. The mozilla web docs are a great resource for cache-control directives. But in practice, browser caching didn’t work as described. Even with “only-if-cached” set with a positive max-age, the browser wouldn’t cache any requests under /cacheme.
The mocked-out page did need to be changed, so that a cached page was embedded in another page; when a site is refreshed, a browser refetches the current page with max-age=0 set. Embedding the /cacheme path within another also gets closer to the events page layout, in how it embeds a cacheable page’s content (footnote 4):
def application(environ, start_response):
    status = '200 OK'
    request_url = environ['REQUEST_URI']
    html = "<p>" + str(time.time()) + "</p>"
    if not request_url.startswith('/cacheme'):
        # outer pages link to each other and embed the cacheable resource
        html += '<p><a href="/b">stuff link</a></p>'
        html += "<embed src=/cacheme/c>"
    start_response(status, [
        ('Content-Type', 'text/html'),  # changed to html
        # ...
Including a link to a page that also embeds the same, (hopefully) cacheable content (...<a href="/b">...) is a convenient way of seeing how the browser will handle this content without the browser issuing a refresh.
This still didn’t work. Client-side caching worked well if Expires was set to some date in the future. But without this, the Cache-Control: max-age=60 header set by apache was ignored, until the Last-Modified header was set in the server response; if this timestamp was set to the current time, it wasn’t cached. Trailing by ten seconds, it was cached sporadically. Trailing by half an hour, it was cached, and for much longer than the max-age specified by the server (footnote 5). Curious.
Both of these changes had to be applied from the python script; these headers can’t be changed in this way from the apache configuration, as mod_headers can’t set Last-Modified to a date relative to the current time, and mod_expires can’t override an existing Expires header (even if it has been removed by mod_headers).
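A rough sketch of how an application could generate both headers itself follows. The function name and defaults are assumptions for illustration (the half-hour lag simply mirrors the experiment above), and this is not the code from the linked repository.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def browser_cache_headers(now, modified_lag=timedelta(minutes=30),
                          lifetime=timedelta(seconds=60)):
    """Build caching headers relative to `now` (a timezone-aware UTC datetime)."""
    # Last-Modified trails the current time; Expires sits in the future.
    last_modified = (now - modified_lag).replace(microsecond=0)
    expires = (now + lifetime).replace(microsecond=0)
    return [
        ('Cache-Control', 'public, max-age=%d' % int(lifetime.total_seconds())),
        ('Last-Modified', format_datetime(last_modified, usegmt=True)),
        ('Expires', format_datetime(expires, usegmt=True)),
    ]
```

The list can be passed straight to `start_response` in a WSGI application, alongside the Content-Type header.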
Legacy caching
This falls into an odd corner of web software. Most server-side caching plugins are designed to work with the software–not bludgeon it into something workable. Content delivery networks can overwrite and modify headers on the fly, according to a separate configuration.
But there could be software that lives on, and works with the server, providing an efficient and easy-to-configure way to update the headers and cache content. A small, simple system that lives between the web server and a legacy system might allow the server and client to keep up with newer, more efficient techniques, without the need to modify a legacy system.
For the events page, the server could set the Last-Modified date to match when the content has actually changed; this would enable efficient client-side checks to see if it needs the page again. The server could also issue an Expires date that lets the client hold onto the page for longer. The images could be using the newer Cache-Control: immutable directive, and not need requests from the browser to double-check that everything is OK; if these requests did come in, the server should be able to issue a 304 status much faster.
legacy-webcache
The webcache.wsgi file contains the result of a sketch for this software.
In an apache configuration, the webcache is mounted as an overlay of cacheable resources. Inbound requests for these resources are rerouted to the webcache script; internally-issued requests for the same resource are passed through to the original application.
State is held by url, and stored in memcached. Requests for a resource with no cache entry are sourced from a request to origin, and written into the cache on the way out. Requests for a resource with a valid cache entry can be sourced from the cache, by issuing a response with full content (drawn from the last successful origin response), or by issuing a 304 status. If the cache entry is not valid, it can be replaced or merged with a new origin response: if the origin’s content has changed, the entry is wiped; if it has not changed, the Last-Modified timestamp is preserved.
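The lookup and merge rules can be illustrated with a toy version. This is a sketch under several assumptions, not the contents of webcache.wsgi: a plain dict stands in for memcached, change detection uses a SHA-256 digest of the body, timestamps are epoch seconds, and `serve` and `fetch_origin` are hypothetical names.

```python
import hashlib
import time

cache = {}  # stand-in for memcached: url -> entry dict

def serve(url, fetch_origin, if_modified_since=None, ttl=60, now=None):
    """Return (status, body) for url; fetch_origin(url) returns body bytes."""
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is None or now >= entry['expires']:
        body = fetch_origin(url)
        digest = hashlib.sha256(body).hexdigest()
        if entry is not None and entry['digest'] == digest:
            # Unchanged content: keep Last-Modified, only extend validity.
            entry = dict(entry, expires=now + ttl)
        else:
            # New or changed content: replace the entry wholesale.
            entry = {'body': body, 'digest': digest,
                     'last_modified': int(now), 'expires': now + ttl}
        cache[url] = entry
    if if_modified_since is not None and entry['last_modified'] <= if_modified_since:
        return '304 Not Modified', b''
    return '200 OK', entry['body']
```

Because the Last-Modified value survives a renewal with unchanged content, a client that revalidates after the entry expires can still be answered with a 304.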
A backoff/contest algorithm handles races for the same resource from multiple threads. On a per-url basis, multiple requests are linearized with atomic cache updates, so that each thread attempting to update the cache with a request to origin is assigned a token unique to both the existing cache entry and the thread’s request to renew the resource. In a contest to update a cache entry, only one thread “wins,” and can make an origin request immediately. All threads that lose wait additional time, if necessary, before making an origin request, where this additional time is governed by the count of other, known threads participating in the contest.
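One way such a contest could work is sketched below. It is a guess at the general shape, not the algorithm in webcache.wsgi: `FakeCache` imitates memcached's atomic add and incr commands in-process, and the key layout, `enter_contest`, and the linear delay rule are all assumptions.

```python
import threading

class FakeCache:
    """In-process stand-in for memcached's atomic add/incr commands."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, value):
        # Like memcached "add": store only if the key is absent.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def incr(self, key):
        # Unlike real memcached, a missing key starts at zero here.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + 1
            return self._data[key]

def enter_contest(cache, url, entry_version, request_id, base_delay=0.05):
    """Return (won, delay_seconds) for one thread trying to renew url."""
    # The contest is scoped to one url and one version of its cache entry.
    contest = 'contest:%s:%s' % (url, entry_version)
    if cache.add(contest + ':winner', request_id):
        return True, 0.0  # the winner goes to origin immediately
    # Losers wait longer the more losing threads are already known.
    losers = cache.incr(contest + ':losers')
    return False, base_delay * losers
```

A losing thread would sleep for its delay and then check whether the winner has already refreshed the entry before going to origin itself.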
next steps
A few things remain to be done:
- add tests to ensure continued correctness, and to exercise consistency logic
- run some performance tests, to gauge the cache’s overhead against the content size and the server response time, when:
  - a request can be served from the cache
  - there is no entry in the cache
  - there is an entry, but it is invalid
  - a contending thread loses the update contest
- switch to a more feature-rich way of coordinating between threads; losing threads poll memcached at some period, building in unnecessary delay
- fix some minor exceptions and logging issues
- add ability to configure on a per-url basis
Footnotes
1: mod_wsgi python script for setting headers and time: https://github.com/simzes/legacy-webcache/blob/main/basic_headers_rewrite/mock_headers.wsgi
2: full apache configuration for disk caching: https://github.com/simzes/legacy-webcache/blob/main/basic_headers_rewrite/disk_cache_access.conf
3: https://stackoverflow.com/questions/15323619/django-response-always-chunked-with-text-html-cannot-set-content-length/53642877#53642877
4: mod_wsgi python script for setting headers, time, and embedding a resource under /cacheme: https://github.com/simzes/legacy-webcache/blob/main/linked_content_wsgi/linked_content_mockout.wsgi
5: mod_wsgi python script for setting headers for browser caching/memcache experiments: https://github.com/simzes/legacy-webcache/blob/main/linked_content_options_wsgi/linked_content_options.wsgi