I played ångstromCTF 2021 for @blackb6a to spend my Easter holiday. I solved most of the reverse and cryptography challenges alone. In particular, Cache Money is one of the harder crypto challenges that I spent more than one day dealing with. It is very rewarding, and eventually four teams ended up solving it.

Challenge Summary

This challenge reimplements the Advanced Encryption Standard (AES) on 128, 192 and 256-bit keys. The encryptor is equipped with caches and we are given a service to encrypt (or decrypt) our messages. In short, there are four oracles provided by the service ($k_0$ is the fixed secret key and $b \in \{128, 192, 256\}$).

  • $\mathcal{E}(m, k)$ encrypts a message $m$ with $k$,
  • $\mathcal{D}(c, k)$ decrypts a ciphertext $c$ with $k$,
  • $\mathcal{E}_0(m, b)$ encrypts a message $m$ with the first $b$ bits of $k_0$, are
  • $\mathcal{D}_0(c, b)$ decrypts a ciphertext $c$ with the first $b$ bits of $k_0$.

Apart from the ciphertext (resp. plaintext), each oracle returns also the amount of time, in nanoseconds, used for encryption (resp. decryption).

Additionally, there is an flag $m_0$ encrypted with the $k_0$ under AES-256. The objective is of course decrypt and retrieve the flag. Needless to mention, they had measures to avoid the flag being retrieved by calling $\mathcal{D}_0(c_0, 256)$ directly. At least I am not awared of a way to solve this without recovering the key.

Solution

Part I: What is that cache?

The caching strategy

From the challenge, there is a Rijndael class that encrypts and decrypts messages. All of the functions inside the class are decorated with @instance_method_cache and they will call the cache method defined below:

def instance_method_cache(f):
    # lru_cache uses the same cache for every instance which kinda sucks
    # so i made my own cache implementation
    def wrapper(self, *args):
        key = (f.__name__, *args)
        try:
            return self.cache[key][0]
        except TypeError:
            # deal with unhashable arguments
            return f(self, *args)
        except KeyError:
            # remove the least recently used element
            # dont want the cache blowing up
            while len(self.cache.keys()) >= 300:
                mintimestamp = timestamp()
                keys = list(self.cache.keys())
                lowest = -1
                for i in range(len(keys)):
                    if self.cache[keys[i]][1] < mintimestamp:
                        mintimestamp = self.cache[keys[i]][1]
                        lowest = i
                    else:
                        continue
                del self.cache[keys[lowest]]
            ret = f(self, *args)
            self.cache[key] = (ret, timestamp())
            return ret

    return wrapper

Suppose we want to call $\mathcal{F}(x)$. This is how the caching works:

  1. If $(\mathcal{F}, x)$ is cached, return the cached value of $\mathcal{F}(x)$.
  2. Drop the oldest entry from the cache until there are less than 300 entries.
  3. Compute $y := \mathcal{F}(x)$, cache $y$ as the value to $(\mathcal{F}, x)$ and return it.

Caching in action: Encrypting a dummy flag with a dummy key

I added a hook to log the entries and see how the function calls are triggered:

-   self.cache[key] = (ret, timestamp())
+   t2 = timestamp()
+   val = (ret, t2)
+   self.cache[key] = val
+   print(f'[+] self.cache[{key}] = {val} (time spent: {t2 - t1})')

This is a transcript that encrypts actf{this_is_a_test_flag} with the key being Tes3UeYGqoq6vatPmlM6uQrTA7OTLNP8. There are two parts:

  1. (Lines 1-369) Initializes the Rijndael class with the key.
  2. (Lines 375-5690) Encrypts the message.

For the oracles, only second part will be timed.

Part II: Obviously a timing attack, but how?

It takes around 0.7 seconds to encrypt a random block of message under AES-256. A costy operation would be MixColumns - it is called 13 times and each contributes ~6% to the total encryption time (hence around 85% in total). Unfortunately, it is not feasible to read cached MixColumns for the timing attack. Here's why:

Don’t believe that 0.7s. The time is measured with there is verbose logging, and I/O definitely takes a considerable amount of time. It takes only 0.2s without logging (which is still pretty slow by the way).
  1. Suppose that $m_1, m_2, ..., m_{13}$ are the inputs for MixColumns. It is not likely to find $m_i = m_j$ for some distinct pairs $(i, j)$.
  2. The time fluctuates a lot and it is not reliable measuring a 6% difference.

Well, I think timing attack is difficult solely with the second point. There are many function calls triggered by a single encrypt_block, including:

  • ~1.8K Add calls,
    • 15 of them add two 4×4 matrices (~5ms each),
    • 60 of them add two array of 4 entries (~1ms each),
    • the remainder adds two numbers (~0.2ms each).
  • ~380 Multiply calls (~1ms each),
  • ~350 Double calls (~0.2ms each),
  • 56 SubWord calls (~0.2ms each),
  • 14 ShiftRows calls (~0.2ms each), and
  • 13 MixColumns calls (~40ms each).

That is a lot of uncertainty. It would be much better if we are able to determine whether a single Add operation is cached... Is it even possible?

I had no idea. The challenge author released a hint few hours later:

Hint 1: This is also partially a web challenge...look for something you can exploit to make things more consistent.

There must be something fishy being a web challenge. Let's look at the source code for the encryption endpoint:

@instance_method_cache
def encrypt(self, m):
    try:
        return b"".join(
            self.encrypt_block(m[i:][:16]) for i in range(0, len(m), 16)
        )
    except:
        return b"Error"
        
@app.route("/api/enc", methods=["POST"])
def encrypt():
    m = tuple(request.json["a"])
    # Does not pad if the length of message is a multiple of 16
    if len(m) % 16 != 0:
        m = m + (0,) * (16 - (len(m) % 16))
    if request.json["secret"] not in ["secret128", "secret192", "secret256"]:
        k = tuple(request.json["k"])
    else:
        # Takes the first n bytes of the secret if we are using k0 for encryption.
        k = key[
            : {"secret128": 16, "secret192": 24, "secret256": 32}[
                request.json["secret"]
            ]
        ]
    try:
        cipher = Rijndael(k)
    except ValueError:
        return jsonify({"result": "Invalid key", "time": 0})
    start = timer()
    enc = cipher.encrypt(m)
    end = timer()
    del cipher
    return jsonify({"result": enc.hex(), "time": end - start})

Note that if we are encrypting the message block actf{this_is_a_t with $k_0$ under AES-256, then this is the JSON object we send:

// Request
{
    "a": [97, 99, 116, 102, 123, 116, 104, 105, 115, 95, 105, 115, 95, 97, 95, 116],
    "secret": "secret256"
}

// Response
{
    "result": "83696d00a4e4def3a9ca1c3e61b899cb",
    "time": 144329411
}

What if we send a malformed a? The exception is raised in a lower level function and is handled by encrypt(self, m). It is good to see the time measured.

{"a": [[]], "secret": "secret256"} // Request
{"result": "4572726f72", "time": 45599} // Response ("4572726f72" is Error hexed)

Currently we are measuring nothing, since exception is raised before the first function call is completed. Let's throw after the first function call:

{"a": [1337, []], "secret": "secret256"} // Request
{"result": "4572726f72", "time": 189082} // Response

By throwing errors in an early stage, the time differences are more significant than the fluctuations now. Nice!

Part III: Leaking one eighth of the key

A drop in response time

As mentioned above, the first 369 lines from my transcript is where the Rijndael instance being initialized. Lines 1-55 is the log for computing the round constants rcon. Lines 56-369 represents how the round keys are computed, and this is part of the transcript:

56:[+] self.cache[('RotWord', (76, 78, 80, 56))] = ((78, 80, 56, 76), 505349375647612, 8617) (time spent: 8617)
57:[+] self.cache[('SubWord', (78, 80, 56, 76))] = ((47, 83, 7, 41), 505349375732640, 14356) (time spent: 14356)
58:[+] self.cache[('Add', 47, 1)] = (46, 505349375825251, 9545) (time spent: 9545)
59:[+] self.cache[('Add', 83, 0)] = (83, 505349375902264, 9931) (time spent: 9931)
60:[+] self.cache[('Add', 7, 0)] = (7, 505349375978725, 9218) (time spent: 9218)
61:[+] self.cache[('Add', 41, 0)] = (41, 505349376055553, 9776) (time spent: 9776)
62:[+] self.cache[('Add', (47, 83, 7, 41), (1, 0, 0, 0))] = ((46, 83, 7, 41), 505349376123267, 320106) (time spent: 320106)
63:[+] self.cache[('Add', 84, 46)] = (122, 505349376211788, 7671) (time spent: 7671)
64:[+] self.cache[('Add', 101, 83)] = (54, 505349376306808, 11562) (time spent: 11562)
65:[+] self.cache[('Add', 115, 7)] = (116, 505349376404211, 12700) (time spent: 12700)
66:[+] self.cache[('Add', 51, 41)] = (26, 505349376486636, 9646) (time spent: 9646)
67:[+] self.cache[('Add', (84, 101, 115, 51), (46, 83, 7, 41))] = ((122, 54, 116, 26), 505349376555581, 361347) (time spent: 361347)
68:[+] self.cache[('Add', 85, 122)] = (47, 505349376655242, 16317) (time spent: 16317)
69:[+] self.cache[('Add', 101, 54)] = (83, 505349376735919, 10524) (time spent: 10524)
   ...

AES-128 and AES-192 have less than 300 function calls computing the round keys, thus every single step taken for round keys is cached. This is not the case for AES-256. Therefore, we should attempt to leak something from $k_0$ via $\mathcal{E}_0(m, 128)$ or $\mathcal{E}_0(m, 192)$. To achieve this, let's try to make use of the error-throwing strategy we discussed in the last part.

{"a": [1337, []], "secret": "secret128"} // Request
{"result": "4572726f72", "time": 159767} // Response

Iterating 0 <= a[0] < 256, it is observed the request a[0] = 40 is particularly faster than the others.

{"a": [40, []], "secret": "secret128"} // Request
{"result": "4572726f72", "time": 40721} // Response

What is this? Why is it faster?

🤨 Confusion alert. I know it is getting more confusing. Please bear with me…

In short, Add(???, 40) is cached... But what is that? To understand why, look at the constructor and the encrypt_block method:

def encrypt_block(self, block):
    m = tuple(tuple(block[4 * i :][:4]) for i in range(4))
    m = self.Add(self.roundkeys[:4], m)
    # ... omitted ...

This is how the methods called during encrypt_block, in order:

encrypt_block(block)
->  [...  Computes m <- block  ...]
->  Add(roundkeys[:4], m)
    ->  Add(roundkeys[0], m[0])
        ->  Add(roundkeys[0][0], m[0][0]) [... 💰 Uses the cache ...]
        ->  Add(roundkeys[0][1], m[0][1]) [... 🔥 Throws an error ... ]

            [... The subsequent functions are not called ...]

        ->  Add(roundkeys[0][2], m[0][2])
        ->  Add(roundkeys[0][3], m[0][3])
    ->  Add(roundkeys[1], m[1])
    ->  ...

Therefore, that Add(???, 40) cached is actually Add(roundkeys[0][0], m[0][0]). With the same strategy, we can find a[1], a[2] and a[3] that the API calls has a smaller response time: 67, 82, 56.

{"a": [40, 67, 82, 56, []], "secret": "secret128"} // Request
{"result": "4572726f72", "time": 40665} // Response

What does (40, 67, 82, 56) mean? From the cache's perspective, the function call Add(roundkeys[0], (40, 67, 82, 56)) is cached. But why is it cached? Looking back at the constructor, this is executed when the round keys are computed:

Add(w[0], Add(SubWord(RotWord(w[3])), rcon[1]))

Hereby w[0] and w[3] are respectively roundkeys[0] and roundkeys[3], while rcon[1] == (1, 0, 0, 0). That said, it is very likely that (40, 67, 82, 56) is being the second argument, i.e.:

Add(SubWord(RotWord(roundkeys[3])), rcon[1]) == (40, 67, 82, 56)

# eqivalently...

sbox(roundkeys[3][1]) ^ 1 == 40
sbox(roundkeys[3][2])     == 67
sbox(roundkeys[3][3])     == 82
sbox(roundkeys[3][0])     == 56

# and equivalently...

roundkeys[3][1] ==  76
roundkeys[3][2] == 100
roundkeys[3][3] ==  72
roundkeys[3][0] == 118

On top of that, roundkeys[3] == key[12:16]. Therefore we have recovered four bytes of $k_0$. We can further recover the first eight bytes of $k_0$ by testing a[4], a[5], ..., a[11]. Since we have 12 bytes of the key, we can exhaust to recover the secret key used for AES-128 (i.e., key[0:16])

The idea can be used when dealing with AES-192. We are able to recover key[20:24] by finding an appropriate input for a[0], a[1], a[2] and a[3], and eventually key[0:24] by brute-forcing the remaining bytes.

Part IV: The cache is not in reach for AES-256!

Unfortunately, we cannot generalize the idea to recover keys under AES-256 because there are more than 300 function calls for computing round keys. Hence, the cache for Add(roundkeys[0][0], m[0][0]) and others have already expired prior than they are used. Only the round keys for the later rounds are cached. With that said, we can use $\mathcal{D}_0(c, 256)$ instead of $\mathcal{E}_0(m, 256)$ and recover the round keys generated near-end. Specifically they are roundkeys[49] == a[0:4], roundkeys[50] == a[4:8] and roundkeys[51] == a[8:12].

Why? The proof is left as readers' exercise (I apologize being lazy and I have actually lose patient spending five hours compiling this writeup during CTF).

Since we have roundkeys[0], roundkeys[1], ..., roundkeys[5], we can retrieve key[24:32] using Z3. Eventually we found key == b'fGsOY9Xb4Eq5vLdHzDGbJ4QvQqOCQcYm' and could decrypt the flag: actf{if_you_cache_my_drift}. First 🩸!

🔑 Solution script! The full solution script is available here (and its output). Hope this explains how my attack works.

Part V: Attacks without time-reading

Note. This is not my idea. I found this pretty mind-blowing so I shamelessly document this as well. Kudos to UnblvR!

In short, one can send a float to determine if an item is cached. One could check if Add(roundkeys[0][0], 1337) is cached by sending the below JSON object:

{"a": [1337.0] "secret": "secret128"} // Request

It happens that an error is thrown if and only if Add(roundkeys[0][0], 1337) is not cached. The cached output will be returned if Add(roundkeys[0][0], 1337) is cached, since ("Add", roundkeys[0][0], 1337) == ("Add", roundkeys[0][0], 1337.0). However, on the other hand, it will attempt to compute roundkeys[0][0] ^ 1337.0. This throws an error since xorring an integer with a float is not a defined behaviour.