Fun with Timing Attacks

Let’s say you’re writing a function that takes user input and checks if it matches some secret.

You’ll be exposing this checkSecret function to external users so you want to make sure it’s safe to use without leaking the secret. As long as your secret is long enough, it’s unlikely to be brute-forced. You’re feeling pretty confident that this simple function that does nothing but check equality doesn’t have any glaring security flaws.

Code for checkSecret

const SUPER_SECRET_VALUE = "hunter2";
function checkSecret(guess: string) {
  return guess.startsWith(SUPER_SECRET_VALUE);
}
> checkSecret("hunter2")
< true
> checkSecret("nothunter2")
< false

Anyway, an adversary who can call this function repeatedly can derive a 10-character secret in just a few thousand calls to checkSecret.

How?

Builtins checking equality are implemented in native code that may differ per runtime, but it’s straightforward to imagine how anybody would implement it. === has some details regarding string interning that make analysis a lot more complicated, so we’ll use startsWith for this post. Ignoring details, startsWith might look something like this:

function startsWith(target: string, searchString: string) {
  for (let i = 0; i < searchString.length; i++) {
    if (a[i] !== b[i]) {
      return false;
    }
  }
  return true;
}

Just iterate through the input and check if the character at each position matches. If not, break early and return false. If so, return true. While the spec doesn’t technically require early exit, most implementations use something like the above. (See V8’s implementation here.)

Break early is the important part. On average, guesses whose prefixes partially match the secret’s are going to take a bit longer to return than guesses that are totally wrong, and the longer the prefix match, the longer the function is going to take. We can measure this directly in your browser:

Running startsWith a few thousand times...

If your browser supports accurate-ish timers, you’ll see that the distributions overlap but the blue plot is shifted to the right, because, on average over many calls, "password".startsWith("pxxxxxxx"); takes longer than "password".startsWith("xxxxxxxx");.

We can use this characteristic to guess the secret by timing lots of calls to checkSecret. Using information like time or memory usage is known as a side-channel attack.

Remote timing attacks were generally considered to be impractical until the estimable David Brumley and Dan Boneh (for whom I happened to TA back in grad school) showed a practical attack on OpenSSL in 2003. Since then, side-channel attacks have had a bit of a renaissance, with Spectre and Meltdown the most famous examples.

Let’s see if we can run a timing attack on this page! This is a toy example, but the ideas apply to real-world systems.

Demo

I’m not paying for a server for this post; Cloudflare¹ is kindly hosting this site for free. Instead, we’re going to simulate the server by running the checkSecret function in a web worker. This is a separate thread that can’t access the DOM but is still running in your browser. We’ll measure the time it takes to run checkSecret and use that to guess the secret².

We need to check a few things before we can run the demo. First, let’s make sure that your browser supports web workers:

Checking for web worker support...

We also need to ensure that it supports cross-origin isolation. Browser timers are coarsened to 100μs precision unless isolated to prevent attacks just like this. It’ll still work with the imprecise timers, but the demo will take much (much) longer.

Testing cross-origin isolation...

Note that this example is fastest on desktop Chrome and Chromium-based browsers, which support 5μs precision, matching the spec. Firefox, Safari, and mobile WebKit-based browsers are stuck at 20μs because they were scared off by Spectre in 2018 and haven’t come up with a better solution yet.

Since the browser is rudely not giving us the time resolution we need³, we need to repeat each trial enough times to actually register on the clock. Below, we’re running a benchmark to see how many times we need to call checkSecret in order to spread the result across a couple of buckets. Otherwise, all times will be 0 and it’ll be hard to do any statistics.

Benchmarking page to choose good parameters...

Running this outside a browser context or anywhere with more precise timings, you could stick with a much smaller number of iterations.

Let’s guess your password! Enter a lowercase 8-character password below (I promise the web worker isn’t cheating, but this post comes with source maps so you can check for yourself):

How’d we do? There are ~209 billion 8-character lowercase passwords. We guessed yours in

not guessed yet...

guesses. Since this approach scales linearly(ish) with the length of the password, rather than exponentially, longer passwords aren’t enough to mitigate the issue.

Optimizing the guessing technique

What is the guesser above actually doing? The obvious attack approach here is to try to derive one character at a time in the prefix. Make lots of guesses, measuring the time per-prefix. Once it’s clear that one consistently takes longer than the others, fix it and repeat with the next character.

The problem with this approach is that computers are noisy and – especially over a network boundary or with limited timing capabilities, it might take a lot of samples to come to that clarity. We want to minimize the number of guesses we expect to need to make so this attack is feasible in a reasonably short amount of time.

First, let’s formalize the problem a bit. Assume for simplicity that the secret length \(l\) and the permissible character set \(C\) are fixed⁴. We have a list of zero or more (guess, time(checkSecretNTimes(guess))) measurements that we’ve already taken that we’ll call \(M\). We want to land on some strategy \(makeGuess(l, C, M) \rightarrow guess\) that emits guesses that yield the most expected additional information.

Thompson Sampling

A simple but effective approach is Thompson Sampling. If we build up our guesses one character at a time, we can reframe this problem similar to a stochastic multi-armed bandit problem – at each step, we have a set of possible arms (characters by which we could extend the prefix) and we want to pick the best one to make the most effective guess.

Since we have the list of measurements \(M\), we can always estimate the distribution \(D_{\text{prefix}}\) by filtering to all of the data that begins with that prefix and updating our prior distributions given that data. Thompson Sampling is simply sampling a single value from each distribution \(D_{\text{prefix} + c}\) for \(c \in C\) and choosing the \(c\) with the largest value. In pseudocode, \(makeGuess(l, C, M)\) looks like:

Set prefix to the empty string
While length(prefix) \(< l\):
- Set bestSample to \(-\infty\) and bestGuess to nil
- For each character \(c\) in \(C\):
  - Fit a distribution \(D_{\text{prefix} + c}\) to \(M\)
  - Take a single sample from \(D_{\text{prefix} + c}\)
  - If the sample is greater than bestSample, set bestSample to the sample and bestGuess to \(c\)
- Set the prefix to prefix + bestGuess
Return prefix

Intuitively, this means that the more confident we are in a prefix, the more likely it is that we’ll choose it, and the more data we have the more we can build our confidence levels to make our next guess even better.

But what does “fit a distribution” mean?

We have no idea what the distribution of time(checkSecret(guess)) might be. Luckily, due to the Central Limit Theorem, we can assume⁵ that samples from time(checkSecretNTimes(guess)) are approximately normal with mean \(N \cdot \mu\) as long as N is sufficiently large. Because of the timing constraints outlined above, we need to repeat our guesses many times anyway, so our samples are going to look approximately normal no matter what the real distribution of time(checkSecret) on your computer might be.

Assuming normality is especially convenient because the algorithm described above would be devastatingly slow if we had to iterate through all of \(M\) every time we wanted to estimate some distribution’s parameters. Instead, we can just keep track of the sample mean and variance of each distribution and update them as we go. We’re going to make a lot of guesses, so we need online updates to make guesses in \(O(l * \vert C\vert)\) time rather than \(O(\vert M\vert)\). We can use the Welford algorithm to update the mean and variance of each distribution in constant time as we gather more samples.

Storing lots of distributions

To make a guess, we need to quickly sample from the normal distributions that we’re storing per-prefix. We can use a Trie where each node stores the parameters of the normal distribution that we believe describes the prefix defined by that node and where the edges are the elements of \(C\). This way, we can quickly update the mean and variance of each distribution and quickly sample from them.

Whenever we get a new measurement, we can update the mean and variance of each node on the path as we walk the trie.

Example: updating the trie

As we take measurements, we can quickly update the trie to reflect our new data.

Expand to see example.

First, we guess `abc` and measure time `(0.4)`. We update the trie:

'' (prefix='', mean=0.4, variance=0, count=1)
  'a' (prefix='a', mean=0.4, variance=0, count=1)
    'b' (prefix='ab', mean=0.4, variance=0, count=1)
      'c' (prefix='abc', mean=0.4, variance=0, count=1)

Next, we guess `abd` and time `(0.5)`. We update the trie again:

'' (prefix='', mean=0.45, variance=0.005, count=2)
  'a' (prefix='a', mean=0.45, variance=0.005, count=2)
    'b' (prefix='ab', mean=0.45, variance=0.005, count=2)
      'c' (prefix='abc', mean=0.4, variance=0, count=1)
      'd' (prefix='abd', mean=0.5, variance=0, count=1)

And finally, we guess `abc` and time `(0.4)` again. We update the trie once more:

'' (prefix='', mean=0.433, variance=0.0033, count=3)
  'a' (prefix='a', mean=0.433, variance=0.0033, count=3)
    'b' (prefix='ab', mean=0.433, variance=0.0033, count=3)
      'c' (prefix='abc', mean=0.4, variance=0, count=2)
      'd' (prefix='abd', mean=0.5, variance=0, count=1)

Updating the trie takes only \(O(l)\) time, because we only have to update the mean and variance of the nodes on the path to the leaf, and each online update is constant time.

If \(M\) is large, we might have a lot of nodes in the trie. We could prune the trie by removing nodes that have too few measurements – however, in practice, we can usually just keep the whole trie in memory since the search usually completes within a few thousand guesses.

Sampling from the trie

We walk the trie, taking a sample from each node’s distribution, and choose the character that leads to the largest sample. However, we may not have samples from all nodes, so with some probability we choose a random character to ensure we continue to explore the space so we don’t get stuck.

Our guessing algorithm is similar to that described above but takes two important parameters:

NOISE_PER_STEP: the probability that we choose a uniformly random character out of \(C\) instead of sampling from the trie.
MIN_SAMPLES: the minimum number of samples we need to have from a node before we consider it. With only a few samples, the parameters of the normal distribution are not very reliable and we’re likely to make a bad guess. There are more sophisticated statistical techniques to express certainty here, but just not considering nodes with count < 3 works fine in practice. This way, we don’t need to build a robust prior distribution instead just ignore any prefixes with too few measurements until they have > MIN_SAMPLES measurements in \(M\).

So, extending the algorithm from above, we update \(makeGuess(l, C, M)\) to:

Set the prefix to the empty string
While length(prefix) \(< l\):
- Look up the prefix in the trie
  - If the prefix isn’t in the trie, return the prefix + random characters up to length \(l\)
- Set bestSample to \(-\infty\) and bestGuess to nil
- For each character \(c\) that are children of the current node and have count > MIN_SAMPLES:
  - Look up the distribution \(D_{\text{prefix} + c}\) in the trie
  - Take a single sample from \(D_{\text{prefix} + c}\) using the Box-Muller transform
  - If the sample is greater than bestSample, set bestSample to the sample and bestGuess to \(c\)
- Set the prefix to prefix + bestGuess
Return the prefix

We repeat this many times, adding the (guess, time) data to the trie describing \(M\) to inform our guesses, until hopefully we guess the right secret.

Using this approach over the network

There may be situations in which you’re trying to exfiltrate a secret locally – however, the majority of the time you’ll be trying to guess a secret over a network boundary. In this case, you’ll need to take into account the noise of the network.

This is a bit trickier because the network noise is probably greater than the time it takes to run checkPassword, and other traffic may impact timing information, so each measurement is generally going to yield a lot less information than the relatively direct measurements we’re taking here. However, if you’re allowed enough trials, the same approach works with no code changes, since we didn’t make any assumptions about the underlying distribution and the network jitter just increases the mean and variance of the distributions that we’re measuring.

Running this algorithm against a much noisier distribution is much slower so isn’t as conducive to an inline demo but if there’s interest, I may host an endpoint with a vulnerable checkSecret to see who can break it first!

Finally

When you’re writing software that compares user-provided data against sensitive values:

Don’t. You should almost never be comparing directly against secret values. Use hashes, pdkf2, or whatever is appropriate for your situation. But even these algorithms may perform differently with different inputs, so be careful!
If you must use a secret value as a function input, do so very carefully! Network noise doesn’t necessarily save you. Use a vetted library that does sensitive operations for you instead of trying to implement it yourself, and always ensure solid rate limits are in place.

While I used startsWith in this post, almost any non-constant-time comparison is susceptible to the same attack. Even the === operator is likely to be vulnerable given enough trials if you’re careful about avoiding string interning that leads to constant-time comparison (I couldn’t get this working, but I’d be curious to see if anyone can.)

Footnotes

I had to move this damn site from Github Pages to Cloudflare because Github Pages doesn’t support custom headers, which are needed for cross-origin isolation. Cloudflare has been great so far but I refuse to be trapped into blogging about setting up a blog. ↩
To make the demo run snappily, we aren’t sending messages back and forth between the web worker and the main thread. Instead, we’re just running the checkSecret function in the web worker and measuring the time it takes to run. This is a bit of a cheat, but the same code would work to guess secrets over an actual noisy network boundary – it might just take a bunch more guesses. ↩
I originally wrote this code in Node and was flummoxed when all timings were exactly 0.0000000000000ms when running code in the browser that I knew took at least 30 or 40 μs in Node. ↩
In the real world, you will often know the length of the secret (e.g. an API key) but even if you don’t, this same technique works – just run the same algorithm repeatedly, fixing the length each time. Similarly, in the real world, “printable ascii characters” is usually a pretty good bet for an allowable character set. ↩
Assuming these measurements are independent – which they might not be, since they’re taken back-to-back on a machine, but close enough. ↩