RSA and certificate authority from scratch

A Python project for understanding RSA, PEM keys, certificate signing, verification, and revocation.

Most cryptography courses leave RSA as a mathematical object. You learn that it works, that the security comes from the difficulty of factoring large numbers, and that you should use it via a library and never implement it yourself. All of that is correct. But I wanted to understand what a certificate actually contains before accepting that understanding at face value. So I implemented it: not to use in production, but to make every claim about it traceable to running code.

The repository has two layers. crsa.py handles RSA primitives: key generation, PKCS#1 v1.5 padding, and the modular arithmetic. cert_auth.py builds a PKI workflow on top of it: CSR generation, certificate signing, and CRL-based revocation.

The random number generator

The first decision was about randomness. RSA security depends on the unpredictability of the primes, and the primes come from a random number generator. I implemented the Mersenne Twister, but with a deliberate change to the standard parameters.

The standard MT19937 uses a state array of 624 32-bit words. My implementation uses n=5000:

class MersenneTwister:
    def __init__(self, seed=None, w=32):
        global size
        self.w = w
        self.n = size  # 5000, not 624
        self.m = 397
        ...
        self.state = [0] * self.n

The seed defaults to secrets.randbits(1024) — 1024 bits of OS randomness — so the non-standard state size is not about security in this context. It was an intentional deviation to make the generator visibly different from a copy of a standard implementation. The tempering function is standard MT19937 tempering (four XOR operations using the constants u, d, s, b, t, c, l), and _twist() is called whenever self.index >= self.n, regenerating the full state.

This is not a secure PRNG for production use. MT is not cryptographically secure — its output can be predicted from a few hundred consecutive values. The note here is awareness: this code is education, not production, and the separation matters.

Key generation

RSA.keygen(bitlen=512) generates a full key pair including PEM output. The steps are:

Prime generation. generate_prime_bits(nbits) creates candidates as:

p = (mt.extract_number() % (2 ** nbits)) | (1 << (nbits - 1)) | 1

Setting bit nbits-1 ensures the number is at least 2^(nbits-1), giving it the full bit length. Setting bit 0 ensures it is odd. Miller-Rabin then tests the candidate. If it fails, the loop tries again.

Miller-Rabin. The primality test uses prime_rounds=32 iterations, which bounds the probability of a composite passing as prime to 4^(-32). The test decomposes n-1 = 2^r * d, picks random witnesses from the MT, and uses Montgomery modular exponentiation for the repeated squaring.

Public exponent. generate_e(phi) returns 65537 whenever phi > 65537, which is the common practice — 65537 is prime and has a low Hamming weight (it is 2^16 + 1), making exponentiation fast. Only if phi is less than 65537 does the code generate a random prime exponent.

Private exponent. d = Modmath.modinv(e, phi) using the extended Euclidean algorithm. The algorithm is implemented as an iterative loop that tracks Bezout coefficients, returning x % n to guarantee a positive result.

PEM serialization. The private key is output via RSA.construct_private_pem, which computes the CRT components (dmp1, dmq1, iqmp) and delegates to the cryptography library’s serialization machinery. This produces a standard PKCS8 PEM that any RSA implementation can read.

Montgomery reduction

The mathematical core of the implementation is the Modmath class, which handles modular arithmetic using Montgomery reduction. Without this, every modular exponentiation would require a bignum division — slow. Montgomery reduction converts numbers into a different representation where reduction costs only a multiplication and a shift.

The setup:

def __init__(self, mod: int):
    self.m = mod
    self.R = 2 ** ceil(log2(mod))        # smallest power of 2 above mod
    self.mm = self.R - self.modinv(mod, self.R)  # m' such that R*R^-1 - m*m' = 1
    self.R2m = (self.R ** 2) % mod       # R^2 mod m, for converting to Montgomery form
    self.invR = self.modinv(self.R, mod) # R^-1 mod m, for converting back

mod_red(x) computes x * R^-1 mod m without a full division:

def mod_red(self, x) -> int:
    qm = (x % self.R) * (self.mm % self.R) % self.R
    ym = ((x + (qm * self.m)) // self.R)
    y = ym if ym < self.m else ym - self.m
    return self._convert(y)

_convert(u) transforms the result back out of Montgomery space:

def _convert(self, u):
    z = u * self.R2m
    return (z * self.invR) % self.m

mod_exp(x, k, mont) uses the binary method (square-and-multiply), calling mod_red for each squaring and conditional multiplication. Every exponentiation in key generation, encryption, and decryption goes through this.

PKCS#1 v1.5 padding

RSA encryption is not applied directly to the message. The message is first embedded in a padded format (PKCS#1 v1.5, type 2):

EM = 0x00 || 0x02 || PS || 0x00 || M

PS is a padding string of random non-zero bytes whose length is k - mLen - 3, where k is the key byte length and mLen is the message length. This ensures EM fits in k bytes and that PS is at least 8 bytes (required by the spec). The non-zero constraint is important: the separator at the end of PS is a 0x00 byte, so any 0x00 in PS would create ambiguity about where PS ends and M begins.

The padding bytes come from the Mersenne Twister, filtered for non-zero:

ps = []
while len(ps) != self.k - mLen - 3:
    new_byte = (self.mt.extract_number() % 256).to_bytes(1, 'big')
    if new_byte[0] == 0x00:
        continue
    ps.append(new_byte)

Decryption reverses this: compute m = c^d mod n, convert to bytes, check byte 0 is 0x02, find the 0x00 separator, check PS length >= 8, return the message.

The check decrypted[0] != 0x02 in decryption raises an assertion on failure rather than returning a decryption error. In a production implementation this is dangerous — timing differences between valid and invalid padding allow Bleichenbacher’s attack. For a learning implementation, the abort makes the format visible rather than hiding it behind a generic error.

The PKI layer

cert_auth.py models a small trust chain. A CertClient holds a key pair and can generate a CSR. A CertAuth (which inherits from CertClient) can sign CSRs and issue certificates.

Certificate signing request

The CSR is a JSON dictionary wrapped in PEM headers:

csr = {
    "subject": {
        "common_name": self.common_name,
        "public_key": self._key_pair["public_key"]
    },
    "signature_algorithm": "SHA256withRSA"
}
csr["signature"] = RSAsign.encryptrsa(
    sha256.generate_hash(str(csr).encode()),
    self._key_pair["private_key"]
)

The RSAsign.encryptrsa call uses reverse=True when constructing the RSAcrypt object, which swaps d and e:

return RSAcrypt(d=e, n=n, e=d)  # when reverse=True

This makes the object treat the private exponent as the encryption exponent, which is how RSA “signing” works at the mathematical level: signing is encryption with the private key. The hash function comes from sha256.py, a pure Python SHA-256 implementation included in the repository. That implementation follows the NIST FIPS 180-4 specification: 8 initial hash values h0–h7 derived from fractional parts of the square roots of the first eight primes, 64 round constants K[0..63] from fractional parts of cube roots of the first 64 primes, a 512-bit block message schedule that expands 16 words to 64 using the σ0 and σ1 rotation functions (rotate_right by 7, 18, 3 and by 17, 19, 10 respectively), and a 64-round compression function using Σ0, Σ1, Ch, and Maj functions on the 8-word state. The final hash is the concatenation of the 8 updated 32-bit state words.

The README notes that the SHA-256 and AES implementations in the repository are not original work — they were included as supporting components rather than written from scratch. The core project contributions are the Mersenne Twister, the RSA primitives including Montgomery reduction, the PKCS#1 padding, and the full PKI layer in cert_auth.py.

The PEM encoding wraps the base64-encoded JSON in standard -----BEGIN CERTIFICATE REQUEST----- headers with 64-character line breaks. Real X.509 CSRs use DER-encoded ASN.1, not JSON, but this custom format carries all the same fields: subject identity and public key.

Certificate issuance

CertAuth.generate_certificate(pem_csr) parses the CSR from PEM, extracts the subject information, and builds a certificate dictionary:

certificate = {
    "version": 3,
    "serial_number": generate_serial_number(),  # timestamp + 5-digit random
    "signature_algorithm": "SHA256withRSA",
    "issuer": self.certificate["subject"],
    "validity": {
        "not_before": datetime.datetime.utcnow(),
        "not_after": datetime.datetime.utcnow() + datetime.timedelta(days=365)
    },
    "subject": {
        "common_name": csr["subject"]["common_name"],
        "public_key": csr["subject"]["public_key"]
    },
    "extensions": {
        "key_usage": "digital_signature",
        "basic_constraints": {"is_ca": False}
    }
}
certificate["signature"] = RSAsign.encryptrsa(
    sha256.generate_hash(str(certificate).encode()),
    self._key_pair['private_key']
)

The serial number combines a UTC timestamp (20251001123456) with a 5-digit random number, making it unique even under rapid consecutive issuance. The CA signs the whole certificate dictionary excluding the signature field itself, exactly as real X.509 signing works.

Self-signed CA certificate

The CertAuth constructor creates its own certificate at instantiation:

self.certificate = {
    "version": 3,
    "serial_number": 1,
    ...
    "extensions": {
        "key_usage": "key_cert_sign,crl_sign",
        "basic_constraints": {"is_ca": True}
    }
}
self.certificate["signature"] = RSAsign.encryptrsa(
    sha256.generate_hash(str(self.certificate).encode('utf-8')),
    self._key_pair["private_key"]
)

The CA signs its own certificate with its own private key. This is what makes it a root CA: the trust anchor is not a signature from an upstream authority but the act of accepting the CA’s certificate as trusted out of band.

Revocation

The CRL is stored in an AES-encrypted file on disk. When a certificate is revoked, the file is decrypted, the revocation entry is appended, and the file is re-encrypted:

def revoke_certificate(self, certificate_serial_number):
    AESdecrypt.decrypt_file(self._crl_file_path, self._crl_file_path + '.d', self._crl_pass_path)
    with open(self._crl_file_path + '.d', 'a') as f:
        f.write(pem_revoked_cert)
    AESencrypt.encrypt_file(self._crl_file_path + '.d', self._crl_file_path, self._crl_pass_path)
    os.remove(self._crl_file_path + '.d')

The passphrase is generated at CA initialization from a large random number appended to the CA’s name, stored in a file named <ca_name><bignum>.pass. It is not human-readable and is only ever read by the code. The decrypted .d file is removed after re-encryption.

is_cert_revoked() uses a try/finally to ensure the decrypted file is always re-encrypted and removed, even if the search raises an exception.

Verification

CertClient.verify_certificate() checks four conditions:

The signature verifies against the CA’s public key (uses reverse=False, i.e., decrypts with the public exponent)
not_before <= utcnow
utcnow < not_after
ca.is_cert_revoked(serial_number) returns False

All four must hold for the certificate to be considered valid. The behavior when a certificate is revoked changes the final result from True to False without error — revocation is just another validity condition.

Why this project

The value is not in having written a new RSA implementation. It is in understanding what happens inside the opaque functions I use everywhere. A CSR is not a magical token: it is a JSON object with a hash and a signature. A certificate is not a cryptographic primitive: it is a signed assertion with an expiration timestamp and a revocability mechanism that only works if someone actually checks. Writing these things from scratch makes the vocabulary concrete and the failure modes visible.

The failure modes matter most. This implementation does not resist side-channel attacks on the timing of PKCS#1 padding validation. It uses a non-cryptographic PRNG for prime candidates. Its serial numbers are time-based, which leaks issuance timing. Knowing exactly what it does wrong, and why those things are wrong, is what makes me able to evaluate the real implementations when I encounter them.