AES-128 cryptographic engineering lab

C implementation of AES-128 internals, GF(2^8) arithmetic, 3-round distinguisher via lambda sets, and the Square/integral attack recovering 4.5-round keys.

This lab is about building AES-128 from scratch in C and then breaking a reduced version of it. Building the cipher first matters: attacks against black-box ciphers are exercises in observation, but an attack you coded yourself against a cipher you coded yourself leaves no room to misattribute what is happening. Every step from plaintext to ciphertext is accountable.

The lab works in two parts. Exercise 1 is about the arithmetic substrate and a structural property of the cipher. Exercise 2 is a working key recovery attack against 4.5-round AES.

GF(2^8) and the xtime function

AES operates over the finite field GF(2^8). Elements of the field are 8-bit values. Addition in this field is XOR. Multiplication by the generator element 2 is defined as a left shift with a conditional reduction modulo the irreducible polynomial x^8 + x^4 + x^3 + x + 1, which in binary is 0x11B. Because the result is 8 bits and the polynomial is 9 bits, the carry bit is dropped and 0x1B is XORed in when the high bit was set.

The naive implementation uses a branch:

if (p & 0x80) return (p << 1) ^ 0x1B;
else          return (p << 1);

The production implementation in aes-128_enc.c eliminates the branch entirely:

uint8_t xtime(uint8_t p)
{
    uint8_t m = p >> 7;
    m ^= 1;
    m -= 1;
    m &= 0x1B;
    return ((p << 1) ^ m);
}

p >> 7 extracts bit 7 into position 0. If the high bit was 1, m = 1. If it was 0, m = 0.

m ^= 1 inverts that: m becomes 0 when the high bit was 1, and 1 when it was 0.

m -= 1 wraps: 0 becomes 0xFF (all bits set), 1 becomes 0. This is an 8-bit underflow.

m &= 0x1B masks: 0xFF becomes 0x1B (the reduction polynomial), 0 stays 0.

The result is XORed with the left-shifted byte. When the high bit was set, XOR with 0x1B applies the modular reduction. When it was not set, XOR with 0 is a no-op.

This is branchless conditional logic. The data itself drives which constant gets applied, with no conditional jump in the generated code. On hardware, this is also side-channel cleaner: execution time does not depend on the value being processed.

All multi-byte GF(2^8) multiplication — used in MixColumns — builds on xtime. Multiplying by 3 is xtime(p) ^ p. Multiplying by 0x0E is three nested xtime calls XORed together. The efficiency of every field operation in the cipher depends on this one function.

The fused round function

A textbook AES round is four sequential operations: SubBytes, ShiftRows, MixColumns, AddRoundKey. The implementation fuses SubBytes and ShiftRows into a single pass, then does MixColumns and AddRoundKey.

void aes_round(uint8_t block[AES_BLOCK_SIZE], uint8_t round_key[AES_BLOCK_SIZE], int lastround)
{
    int i;
    uint8_t tmp;

    block[ 0] = S[block[ 0]];
    block[ 4] = S[block[ 4]];
    block[ 8] = S[block[ 8]];
    block[12] = S[block[12]];
    tmp = block[1];
    block[ 1] = S[block[ 5]];
    block[ 5] = S[block[ 9]];
    block[ 9] = S[block[13]];
    block[13] = S[tmp];
    tmp = block[2];
    block[ 2] = S[block[10]];
    block[10] = S[tmp];
    tmp = block[6];
    block[ 6] = S[block[14]];
    block[14] = S[tmp];
    tmp = block[15];
    block[15] = S[block[11]];
    block[11] = S[block[ 7]];
    block[ 7] = S[block[ 3]];
    block[ 3] = S[tmp];

    for (i = lastround; i < 16; i += 4)
    {
        uint8_t *column = block + i;
        uint8_t tmp2 = column[0];
        tmp = column[0] ^ column[1] ^ column[2] ^ column[3];

        column[0] ^= tmp ^ xtime(column[0] ^ column[1]);
        column[1] ^= tmp ^ xtime(column[1] ^ column[2]);
        column[2] ^= tmp ^ xtime(column[2] ^ column[3]);
        column[3] ^= tmp ^ xtime(column[3] ^ tmp2);
    }

    for (i = 0; i < 16; i++)
    {
        block[i] ^= round_key[i];
    }
}

AES stores its state in column-major order. The 16 bytes are indexed as a 4x4 matrix where bytes 0, 4, 8, 12 are the first column; 1, 5, 9, 13 are the second; and so on. SubBytes applies the S-Box to each byte in place. ShiftRows rotates row 0 by 0 positions, row 1 by 1, row 2 by 2, row 3 by 3. Because the state is column-major, the row-shift translates to specific index permutations in the flat byte array.

Row 0 (bytes 0, 4, 8, 12) does not shift — S-Box applied in place. Row 1 (bytes 1, 5, 9, 13) shifts left by one: byte 1 goes to where byte 5 was. The code handles this with a saved tmp, applying S-Box to the source position and writing to the destination in one motion: block[1] = S[block[5]]. The rotation for row 2 is a swap of opposite pairs. Row 3 shifts left by 3 (or right by 1): block[15] = S[block[11]], block[11] = S[block[7]], block[7] = S[block[3]], block[3] = S[tmp].

The result is SubBytes and ShiftRows done simultaneously without a second scratch buffer.

MixColumns operates column by column. For each 4-byte column, the formula for each output byte is:

out[0] = 2*in[0] ^ 3*in[1] ^ in[2] ^ in[3]
out[1] = in[0] ^ 2*in[1] ^ 3*in[2] ^ in[3]
out[2] = in[0] ^ in[1] ^ 2*in[2] ^ 3*in[3]
out[3] = 3*in[0] ^ in[1] ^ in[2] ^ 2*in[3]

The implementation factors out the common XOR sum: tmp = a ^ b ^ c ^ d. Then each byte is updated as column[k] ^= tmp ^ xtime(column[k] ^ column[(k+1)%4]). The xtime call computes the 2× term. The tmp provides the constant term. The result is the same as the matrix multiplication, computed in five xtime calls per column instead of eight.

The lastround parameter controls whether MixColumns runs. The loop starts at lastround, which is 0 normally and 16 for the final AES round. Starting at 16 means the loop body never executes: i = 16 < 16 is false on the first check. This skips MixColumns for the last round, which is correct per the AES spec.

Key schedule: forward and backward

next_aes128_round_key derives round key n+1 from round key n. The first word of the new key mixes in an S-Box lookup of the rotated last word of the previous key plus the round constant:

void next_aes128_round_key(const uint8_t prev_key[16], uint8_t next_key[16], int round)
{
    next_key[0] = prev_key[0] ^ S[prev_key[13]] ^ RC[round];
    next_key[1] = prev_key[1] ^ S[prev_key[14]];
    next_key[2] = prev_key[2] ^ S[prev_key[15]];
    next_key[3] = prev_key[3] ^ S[prev_key[12]];

    for (i = 4; i < 16; i++)
    {
        next_key[i] = prev_key[i] ^ next_key[i - 4];
    }
}

The indices 13, 14, 15, 12 apply the one-byte left rotation to the last column (bytes 12–15 become 13, 14, 15, 12). The subsequent bytes are simple XOR chains: each is the previous key byte XOR the new key byte four positions back.

The backward function prev_aes128_round_key inverts this. Because the XOR chain can be inverted (knowing next_key[i] and next_key[i-4] gives prev_key[i]), the function first recovers bytes 4–15, then uses those recovered bytes to undo the first-word computation:

void prev_aes128_round_key(const uint8_t next_key[16], uint8_t prev_key[16], int round)
{
    for (i = 4; i < 16; i++)
    {
        prev_key[i] = next_key[i] ^ next_key[i - 4];
    }
    
    prev_key[0] = next_key[0] ^ S[prev_key[13]] ^ RC[round];
    prev_key[1] = next_key[1] ^ S[prev_key[14]];
    prev_key[2] = next_key[2] ^ S[prev_key[15]];
    prev_key[3] = next_key[3] ^ S[prev_key[12]];
}

The backward function is necessary for the Square attack. After recovering the last round key from ciphertexts, you need to walk back through the key schedule four times to reach the master key. Each step applies prev_aes128_round_key with the decreasing round index.

The aes128_enc function avoids storing all 11 round keys. It uses a 32-byte buffer (ekey[32]) as a ping-pong: the current and next key alternate between the two halves using pk = (pk + 16) & 0x10 and nk = (nk + 16) & 0x10, which flip between 0 and 16. After applying a round, the current key slot is reused for computing the next-next key. Only two round keys exist in memory at any time.

Exercise 1: what happens when the polynomial is wrong

Exercise 1, question 1 constructs a variant of xtime using 0x73 instead of 0x1B:

uint8_t xtime_variant(uint8_t p)
{
    uint8_t m = p >> 7;
    m ^= 1;
    m -= 1;
    m &= 0x73;      // 0x73 instead of 0x1B
    return ((p << 1) ^ m);
}

In binary: 0x1B = 0001 1011 corresponds to the polynomial x^4 + x^3 + x + 1. Including the implicit x^8 bit from the shift, the full irreducible polynomial is x^8 + x^4 + x^3 + x + 1. The byte 0x73 = 0111 0011 corresponds to x^6 + x^5 + x^4 + x + 1. That is not an irreducible polynomial over GF(2). Using a reducible polynomial means multiplication can map non-zero elements to zero, collapsing the field structure.

The practical consequence: MixColumns implemented with xtime_variant no longer computes an invertible linear map over GF(2^8). It is still linear, but elements that should be distinct may collide. The cipher can still produce output, but it no longer has the security properties of the real AES.

The test cases in exercise1_q1.c demonstrate this concretely by computing 0x02 * 0x8D and 0x03 * 0xF6 both ways. The original yields the results required by FIPS 197 test vectors. The variant does not.

Exercise 1 Q3: the 3-round distinguisher

The lambda set (or delta set) is the algebraic structure at the heart of the Square attack. A lambda set in byte position b is a collection of 256 plaintexts where every element takes a distinct value in position b — the set covers all 256 possible byte values exactly once — while every other byte position is fixed to the same constant value across all 256 plaintexts.

The claim is: if you encrypt a lambda set through 3 full rounds of AES, the XOR of all 256 ciphertexts is zero, at every byte position.

void test_3round_aes_distinguisher()
{
    uint8_t key[AES_128_KEY_SIZE];
    // random key...
    
    uint8_t lambda_set[256][AES_BLOCK_SIZE];
    generate_lambda_set(0, constant_bytes, lambda_set);
    
    uint8_t xor_sum[AES_BLOCK_SIZE] = {0};
    for (int i = 0; i < 256; i++) {
        uint8_t ciphertext[AES_BLOCK_SIZE];
        memcpy(ciphertext, lambda_set[i], AES_BLOCK_SIZE);
        aes128_enc(ciphertext, key, 3, 1);
        for (int j = 0; j < AES_BLOCK_SIZE; j++) {
            xor_sum[j] ^= ciphertext[j];
        }
    }
    
    printf("AES 3-round: %s\n", is_zero_block(xor_sum) ? "PASS" : "FAIL");
}

The test always prints PASS regardless of key or constants. This is not a coincidence or lucky choice of parameters. It is provable.

Why it holds: after AddRoundKey, the active byte remains a permutation of all 256 values (XOR with a constant shifts the values but they are still all distinct). The S-Box is a bijection, so SubBytes maps the 256-value active set to another 256-value active set — still all values present. ShiftRows is a byte permutation, which moves the active positions around but preserves the “all 256 values” property in the bytes it moves. MixColumns mixes active and constant bytes together, but over three rounds, the linear propagation causes all 16 byte positions to become active (all values appearing). Once all positions are active, the XOR of all 256 values in any byte position is 0, because XORing the complete set {0, 1, …, 255} gives 0.

The XOR-zero property is invariant under the linear parts of AES (ShiftRows, MixColumns, AddRoundKey). S-Box is nonlinear but preserves the “complete set” property because it is a bijection. The combination of these two facts gives the distinguisher.

The PRF that fails

Exercise 1 Q3 also introduces a construction intended to be a PRF: XOR two independent 3-round AES encryptions of the same input with different keys.

void prf_construction(uint8_t output[AES_BLOCK_SIZE], 
                      const uint8_t k1[AES_128_KEY_SIZE],
                      const uint8_t k2[AES_128_KEY_SIZE],
                      const uint8_t input[AES_BLOCK_SIZE])
{
    uint8_t enc1[AES_BLOCK_SIZE];
    uint8_t enc2[AES_BLOCK_SIZE];
    
    memcpy(enc1, input, AES_BLOCK_SIZE);
    memcpy(enc2, input, AES_BLOCK_SIZE);
    
    aes128_enc(enc1, k1, 3, 1);
    aes128_enc(enc2, k2, 3, 1);
    
    for (int i = 0; i < AES_BLOCK_SIZE; i++) {
        output[i] = enc1[i] ^ enc2[i];
    }
}

The test applies the distinguisher to this PRF:

AES 3-round:   PASS   (XOR sum = 0 as expected)
PRF 3-round:   FAIL   (XOR sum ≠ 0)

The distinguisher detects the PRF as not random. Here is why: both AES instances individually satisfy the lambda set property — their XOR sums over 256 encryptions are each zero. XOR of two zero values is zero. So the PRF should also give XOR sum zero. But it does not.

The issue is the interaction. When you apply a lambda set and encrypt both copies through different round-key sequences, the specific ciphertext values from enc1 and enc2 are key-dependent in ways that do not cancel when XORed together. The XOR of the two outputs is not simply the XOR of two independent lambda set results: the lambda set structure of enc1 does not distribute through the XOR with enc2 in a way that preserves the zero-sum property.

The test also includes the trivial case where k1 = k2. When both keys are identical, enc1 and enc2 are always equal, so the XOR output is always zero for any input — a constant function, which is the most distinguishable thing possible.

The distinguisher working on the PRF is significant: the construction looks like it should combine two secure primitives into a stronger one, but the algebraic structure of each component leaks through the XOR, and an attacker can detect it with only 256 chosen plaintexts.

Exercise 2 Q1: the Square/integral attack

The 3-round distinguisher extends into a key recovery attack against 4.5-round AES. The attack targets the bytes of the last round key and recovers the master key by working backward through the key schedule.

Lambda sets and 4.5-round encryption

The attack uses aes128_enc(block, key, 4, 0) — four full rounds plus a fifth round that stops before MixColumns:

static void encrypt_lambda_set(uint8_t ciphertexts[256][16],
                        uint8_t lambda[256][16],
                        const uint8_t secret_key[16])
{
    for (int i = 0; i < 256; i++) {
        memcpy(ciphertexts[i], lambda[i], 16);
        aes128_enc(ciphertexts[i], secret_key, 4, 0);
    }
}

The lastfull = 0 argument causes aes_round to be called with lastround = 16, skipping MixColumns in the final round. The last operation the cipher performs is SubBytes + ShiftRows + AddRoundKey (with round key 4).

Candidate filtering

For each byte position 0–15, the attack tries all 256 possible values for that byte of the last round key:

static void filter_candidates_with_delta_set(candidate_list_t lists[16],
                                      uint8_t ciphertexts[256][16])
{
    for (int pos = 0; pos < 16; pos++) {
        bool valid_guess[MAX_CANDIDATES] = {false};
        for (int guess = 0; guess < MAX_CANDIDATES; guess++) {
            uint8_t xor_sum = 0;
            for (int i = 0; i < 256; i++) {
                uint8_t val = (uint8_t)(ciphertexts[i][pos] ^ guess);
                val = Sinv[val];
                xor_sum ^= val;
            }
            if (xor_sum == 0) {
                valid_guess[guess] = true;
            }
        }
        // intersect with existing candidate list...
    }
}

For each guess, the attack undoes the last round key byte (XOR with guess) and the last round’s S-Box (inverse S-Box lookup), then XORs all 256 resulting values. If the guess is correct, the undone bytes represent values that passed through exactly 3.5 rounds of AES — specifically, 3 full rounds plus ShiftRows. The 3-round distinguisher guarantees that XOR of these values over the full lambda set is zero. If the guess is wrong, the inverse S-Box output is an arbitrary permutation of the 256 ciphertext values, and the XOR sum will generally be nonzero.

The candidate list for each position starts with all 256 values. After one lambda set, most wrong guesses are eliminated. After a second or third lambda set, the intersection typically collapses to a single value. The convergence check:

static int all_singleton(const candidate_list_t lists[16])
{
    for (int pos = 0; pos < 16; pos++) {
        if (lists[pos].count != 1) return 0;
    }
    return 1;
}

Master key recovery

Once all 16 positions of the last round key are recovered with certainty, the backward key schedule runs four times:

static void recover_master_key_from_round4(const uint8_t round4[16], uint8_t master_key[16])
{
    uint8_t current[16];
    uint8_t previous[16];
    memcpy(current, round4, 16);
    for (int round = 3; round >= 0; round--) {
        prev_aes128_round_key(current, previous, round);
        memcpy(current, previous, 16);
    }
    memcpy(master_key, current, 16);
}

Each call to prev_aes128_round_key is the inverse of next_aes128_round_key. After four steps, current holds the original master key. The attack verifies with memcmp against the known secret. In practice, it converges after a small number of lambda sets — typically 2–4 — and the recovered master key matches.

The attack’s complexity

The number of oracle queries is small: 256 plaintexts per lambda set, and usually 2–4 lambda sets. That is at most 1024 chosen plaintexts to recover a full 128-bit AES key from 4.5 rounds. The computational cost is 256 × 256 = 65,536 S-Box lookups per candidate per position, times 16 positions: around 16 million cheap table lookups per lambda set. This is negligible on any modern machine. The attack is orders of magnitude cheaper than exhaustive search over 2^128 keys.

The attack does not extend directly to full 10-round AES. Adding more rounds destroys the lambda set property before the final round; additional algebraic structure is needed, which is what the multiset and integral attacks in later literature address.

Exercise 2 Q2: what drives the distinguisher?

The natural question after seeing the 3-round distinguisher is whether the S-Box or the field arithmetic is the essential ingredient. The exercise parameterizes both and tests all combinations.

void aes_round_param(uint8_t block[16], uint8_t round_key[16], int lastround,
                     const uint8_t sbox[256], uint8_t (*xtime_func)(uint8_t))

The key schedule is also parameterized to use the same S-Box as the round function. Four configurations are tested:

Configuration	S-Box	xtime	Result
Original AES	AES S-Box	0x1B polynomial	PASS
Alt Field	AES S-Box	0x73 polynomial	PASS
Alt S-box	random permutation	0x1B polynomial	PASS
Both Alt	random permutation	0x73 polynomial	PASS

All four configurations pass the distinguisher test. The XOR sum is zero in every case.

The result says something important about where the security property comes from. The distinguisher does not depend on the specific S-Box — any bijection preserves the “all 256 values present” property because bijections map complete sets to complete sets. The distinguisher does not depend on the field polynomial — the MixColumns linear mixing step is invertible over any field structure, and invertibility is the only property the argument needs. The XOR-zero property after three rounds is a consequence of the round function being composed of bijections with specific propagation behavior, not of the particular tables used.

This has a direct implication for cipher design. A cipher that replaces the AES S-Box with a different bijection, or replaces GF(2^8) arithmetic with a different field, is still susceptible to the same structural attack if it keeps the same round structure. The distinguisher property is built into the architecture.

What the implementation teaches

Writing the cipher before attacking it produces knowledge that a purely analysis-based approach does not. The MixColumns formula is a matrix multiplication, but understanding why the implementation computes xtime(a ^ b) and XORs in the sum requires tracing the factorization manually. The ShiftRows fusion is elegant but requires working through the column-major index arithmetic to see why the specific indices 5, 9, 13 appear for row 1. The ping-pong key buffer requires understanding that only two consecutive round keys are ever needed simultaneously.

The backward key schedule is the piece that most implementations omit, because normal encryption never needs it. It exists here because the attack needs it: the observable data from the oracle gives you the last round key, and the goal is the first. The path from last to first requires prev_aes128_round_key, and that function is only straightforward to write after you understand why the forward schedule’s XOR chain is invertible.

The distinguisher connecting exercises 1 and 2 runs through the same mathematical story at two different abstraction levels. In exercise 1, it is a structural observation: XOR of all elements is zero. In exercise 2, it is a key-recovery mechanism: peel off one round and test whether the structural observation holds. The connection is clean, and the code makes it concrete.