experience / July 2024

Cloud KMS internship at CERIST

M2 internship at CERIST: building Vision, a Vault-based secrets management SaaS on Kubernetes over OpenStack, with automated unsealing, PKI, SSH certificates, dynamic DB credentials, and a Flask web GUI that hides the entire stack from users.

KMS, HashiCorp Vault, Kubernetes, Terraform, Cloud security, PKI, SSH

This was a final-year M2 thesis project at USTHB, conducted as an internship at CERIST in Algiers and defended on July 1, 2024. The full title is “Conception et réalisation d’un système de gestion des secrets dans le cloud.” The project built a complete Secrets Management System (SMS) delivered as a cloud service, running on CERIST’s OpenStack infrastructure with Kubernetes clusters managed by Kubespray, Vault deployed by Helm, and a Flask web application called Vision that presents the entire system through a simple form-based interface. The engineering question at every stage was the same: how do you make serious infrastructure usable for people who should not need to understand it?

Why this exists

Organizations mismanage secrets in predictable ways: credentials hardcoded into source repositories, API keys shared over messaging apps, encryption keys that never rotate because the rotation process is manual and error-prone, SSH keys that accumulate across servers with no audit trail. Each of those problems has a known solution, but the solutions usually require dedicated infrastructure and expertise most teams do not have.

The thesis started with a comparative analysis of existing KMS products: AWS Secrets Manager, HashiCorp Vault, Google Cloud KMS, Azure Key Vault, IBM Security Key Lifecycle Manager, and several others. The analysis evaluated them on seven criteria including REST API, certificate-based authentication, KMIP compliance, auditing, and on-premises key storage. HashiCorp Vault scored well across the board: REST API, certificate auth, auditing, secure communication, and policy-based access control all checked. That is what the thesis chose as the backend engine.

The goal was not to wrap Vault in a thin shell. It was to build a proper multi-tenant cloud service where a customer subscribes, gets a private Kubernetes cluster with a Vault instance, and can use it through a web interface that abstracts away every detail of the underlying stack.

The architecture

The system has two logical sides separated by a VPN tunnel.

On the cloud provider side (CERIST’s OpenStack deployment) there are three components in the backend. A Central SMS Server stores admin credentials and holds the unseal keys for all tenant Vault instances. A Central Internal Certificate Authority signs intermediate CA certificates for tenants who want their own PKI. And Client Clusters contain the per-tenant Vault deployments, one Kubernetes cluster per customer, each transparent to the customer.

On the client side, a frontend web application (Vision) is the single point of entry. Users reach it over a VPN tunnel using OpenVPN. They never talk to Vault directly. They never see Kubernetes. Every interaction goes through Vision, which translates form submissions into Vault API calls via the hvac Python client, manages K8s resources via kr8s, and deploys Helm charts via pyhelm3.

The VPN requirement is not incidental. It puts the client’s traffic inside a private network before it even reaches the frontend, which is one layer of isolation on top of the web application’s own authentication.

Three tiers of users

The design distinguishes three categories of actors with different scopes of control.

Instances’ Manager is the organizational account holder. This role creates and destroys Vault instances, manages snapshots, configures deployment mode (dev server, standalone, or high-availability), seals and unseals instances, and can download all secrets if they want to unsubscribe from the service. The instances’ manager is the one who interacts with the Central SMS Server at account creation time to get the initial unseal keys and root token.

Administrator operates within an existing Vault instance. Admins create user accounts, assign policies, enable features (SSH key generation, dynamic database credentials, password storage), and deploy Certificate Authorities. An admin does not need to know what vault secrets enable pki does. They fill out a form.

Normal User consumes the secrets management capabilities: generating X.509 certificates, getting SSH certificates or one-time passwords, storing and retrieving key-value secrets, and generating dynamic database credentials with a TTL.

The thesis added a further separation within the admin role. A User Management Admin can create accounts but cannot configure security features. A Security Configuration Admin can create CAs, enable SSH certificate generation, enable dynamic DB credentials, and assign policies, but cannot create user accounts. The rationale is that if a User Management Admin account is compromised, the attacker can create accounts but cannot escalate into changing security policies, because policy assignment belongs to a different admin. New users are required to change their password and revoke their initial access token on first login, which cuts the User Management Admin out of their account from that point on. This is a direct application of least privilege: segment the administration surface so that no single compromised credential grants full control.

Design: use cases and the class model

The class diagram captures the key relationships. Administrator and Standard user are the two concrete user types, both with username, password, and token. Administrator has methods for Create_instance(), Deploy_CA(), Create_users(), and Assign_policy(). Standard user has Store_secret(), Generate_SSH_key(), Generate_dynamic_db_credentials(), and Ask_for_certificate().

Each client cluster has one or more Vault instances, and each instance runs in one of three deployment modes: Development (single node, no persistence), Standalone (single node, persistent storage), or High Availability (three replicas with Raft-based consensus). Secrets have owner and type metadata. Secret types branch into four subtypes: Passwords, API keys, SSH keys, and DB credentials. Certificates are tracked separately with issuer, TTL, and owner.

The high availability design replicates data across multiple storage nodes so that if a primary storage system fails, secondary nodes continue serving. Kubernetes is the enabler here: it handles load balancing, auto-scaling based on CPU and RAM metrics, self-repair, and pod replacement automatically.

First implementation: virtual machines

Before moving to Kubernetes, the first working version ran on virtual machines in OpenStack. This was the right initial choice. VMs offer strong isolation between tenants, they are familiar, and they made it possible to test the core Vault workflows without the additional complexity of container orchestration.

The key question for the VM approach was how to automate the repetitive work: provisioning a VM when a client requests a new SMS instance, installing and configuring Vault, initializing it, and extracting the unseal keys. Ansible became the automation layer.

For VM provisioning, an Ansible playbook used the OpenStack Cloud module to launch a compute instance:

- name: Launch a compute instance
  hosts: localhost
  gather_facts: false
  tasks:
    # Create the VM that will host the tenant's Vault instance.
    - name: Create a new instance and attach it to a network
      openstack.cloud.server:
        state: present
        auth:
          auth_url: "{{ auth_url }}"
          user_id: "{{ User_id }}"
          password: "{{ User_Password }}"
          project_name: default
        name: VM_to_be_created
        auto_ip: false
        image: 9590eecc-bf35-48dd-b8e2-67350fa416eb  # pre-built Ubuntu image with Vault
        key_name: main_key
        timeout: 200
        flavor: 3
        nics:
          - net-name: demo-net
        boot_from_volume: true
        volume_size: 40  # GB
        security_groups:
          - SSH_main
          - default

When a client requested a new SMS instance, this playbook ran and allocated the VM. The image UUID (9590eecc-...) pointed to a pre-configured Ubuntu image with Vault already installed. The volume_size: 40 setting (in GB) gave the VM a persistent boot volume, so its data survived restarts. The SSH_main and default security groups controlled network access.

Solving unsealing automatically

Vault’s security model means the instance starts sealed after every restart. Normally an operator provides unseal keys manually, one by one, until enough shares are combined to reconstruct the root key. For a platform where each client has their own Vault instance, manual unsealing is completely unworkable.

The solution worked in two steps. When a new Vault instance was initialized, the initialization response was a JSON document containing the root token and the five unseal key shares. An Ansible playbook captured this JSON output, extracted the unseal keys programmatically, and stored them in the Central SMS Server (itself a Vault instance running on dedicated infrastructure). From that point, whenever a tenant’s Vault instance restarted, the application queried the Central SMS Server for that tenant’s unseal keys and submitted them via the Vault API automatically. The sequence from the user’s perspective is: authenticate to Vision, Vision authenticates to the Central Vault, Central Vault returns the unseal keys, Vision unseals the tenant instance, access is granted.
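The capture-and-extract step can be sketched in Python. This is a minimal illustration, not the thesis's Ansible playbook. It assumes the JSON came from `vault operator init -format=json`, whose output carries `unseal_keys_b64`, `unseal_threshold`, and `root_token`. The function name, the `InitResult` type, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InitResult:
    root_token: str
    unseal_keys: list
    threshold: int


def parse_init_output(raw_json: str) -> InitResult:
    """Extract the root token and unseal key shares from init output.

    Assumes the field names of `vault operator init -format=json`.
    """
    data = json.loads(raw_json)
    keys = list(data["unseal_keys_b64"])
    threshold = int(data["unseal_threshold"])
    if len(keys) < threshold:
        raise ValueError("fewer key shares than the unseal threshold")
    return InitResult(data["root_token"], keys, threshold)


# Hypothetical sample: shaped like real output, filled with dummy values.
sample = json.dumps({
    "unseal_keys_b64": ["k1", "k2", "k3", "k4", "k5"],
    "unseal_shares": 5,
    "unseal_threshold": 3,
    "root_token": "hvs.example",
})
result = parse_init_output(sample)

# Only `threshold` shares are needed to unseal; any subset of that size works.
keys_to_submit = result.unseal_keys[: result.threshold]
```

In the design described here, every share (not just the first three) would then be written to the Central SMS Server, so a later unseal does not depend on which subset was used the first time.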

The Central SMS Server itself has a different unsealing procedure. It is started manually by infrastructure administrators after a restart and kept continuously running. There is a hierarchy here: the central server is the operational root of trust, and its availability determines the availability of all tenant instances.

Certificate authority automation with Consul templates

For CA renewal, manually running certificate renewal commands on a schedule is error-prone and typically leads teams to issue certificates with multi-year validity to avoid the process, which is exactly the wrong outcome. The thesis implemented automatic renewal using Consul Template, HashiCorp's consul-template daemon.

Consul Template watches Vault for certificate expiry and regenerates the certificate files automatically before they lapse. A configuration file specifies the Vault address, authentication token, retry policy, and output templates:

{
  "vault": {
    "address": "{{SMS_server_address}}",
    "token": "{{authentication_token}}",
    "retry": {
      "enabled": true,
      "attempts": 3,
      "backoff": "250ms"
    }
  },
  "template1": {
    "source": "/etc/consul-template.d/cert.tpl",
    "destination": "/home/ubuntu/cert_python.pem",
    "perms": "0600"
  },
  "template2": {
    "source": "/etc/consul-template.d/key.tpl",
    "destination": "/home/ubuntu/key_python.pem",
    "perms": "0600"
  }
}

The template files (cert.tpl, key.tpl) contain Consul Template syntax that queries Vault’s PKI engine and writes the certificate and key to the destination paths with 0600 permissions whenever the current ones are about to expire. Certificate renewal, which typically takes a week when done manually, becomes entirely automatic. The thesis cited this as meaningful for compliance reasons: organizations often issue long-lived certificates to avoid renewal friction, which increases exposure if a key is compromised.

Moving to Kubernetes

Virtual machines validated the core design. But VMs have a real cost: each one carries a complete OS and its associated overhead. For a multi-tenant platform where each client gets their own instance, the resource consumption multiplies with every client. Kubernetes on top of OpenStack was the answer.

The production stack is what the thesis calls the LOKI stack: Linux, OpenStack, Kubernetes infrastructure. OpenStack manages the bare metal through virtualization, Kubernetes runs on top, and application containers run inside Kubernetes. The combination gives OpenStack’s multi-tenant isolation, Kubernetes’s horizontal scaling and self-healing, and container density without the per-VM overhead.

The specific OpenStack components used were: Nova for compute resources, Neutron for networking, Cinder for persistent volumes (the same Cinder volumes back the Kubernetes PersistentVolumes that Vault uses for data and audit storage), Keystone for identity and access management, Horizon for the dashboard, and Barbican for secret storage at the infrastructure level.

Kubespray cluster provisioning

Kubespray is an Ansible-based Kubernetes provisioner. Given an inventory file listing node IP addresses and roles, it installs and configures a full production-grade Kubernetes cluster: etcd, kube-apiserver, kubelet, container runtime, network plugins, and ingress controllers.

Terraform produced the OpenStack resources (nodes, networks, security groups), and Kubespray consumed the resulting infrastructure. The Terraform variables for one deployment, in the format used by Kubespray's OpenStack Terraform module, looked like this:

cluster_name = "clust"
az_list = ["nova"]
public_key_path = "/home/ubuntu/kubespray-2.24.1/inventory/clust/node_pub.key"
image = "Ubuntu-Jammy-22.04"
ssh_user = "ubuntu"
number_of_k8s_masters_no_floating_ip = 1
number_of_k8s_nodes_no_floating_ip = 2
flavor_k8s_master = "3"
flavor_k8s_node = "3"
network_name = "kube-net"
use_existing_network = "true"
external_net = "c17fd38e-3e28-4ed8-8948-19bcffd72ecf"
subnet_cidr = "20.0.0.0/24"
floatingip_pool = "PrivateNetwork"

This defines a cluster with one master node and two worker nodes, using Ubuntu 22.04 as the base image, in the nova availability zone, attached to the kube-net network with subnet 20.0.0.0/24. Both flavor_k8s_master and flavor_k8s_node are set to 3, an OpenStack flavor ID (a flavor is a named resource specification for CPU, RAM, and disk). No floating IPs are assigned to the masters or nodes since they are accessed through the VPN.

Persistent storage for Vault used a Kubernetes PersistentVolume backed by OpenStack Cinder:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: <VOLUME_NAME>
spec:
  capacity:
    storage: <VOLUME_SIZE>
  accessModes:
    - ReadWriteOnce
  cinder:
    volumeID: <OPENSTACK_VOLUME_ID>
    fsType: ext4
  persistentVolumeReclaimPolicy: Retain

The Retain reclaim policy is the right choice here: if the PersistentVolumeClaim is deleted (for example, because someone deleted the Vault deployment accidentally), the underlying Cinder volume and its data are preserved rather than wiped.

Deploying Vault with Helm

Helm is the Kubernetes package manager. It installs applications using charts, which are collections of Kubernetes manifest templates with configurable values. The Vault Helm chart produced by HashiCorp handles the StatefulSet, Service, ConfigMap, ServiceAccount, and RBAC resources needed for a production Vault deployment.

The Helm values file for a high-availability Vault deployment:

global:
  enabled: true
  namespace: "vault-deployment"
  tlsDisable: false
server:
  enabled: true
  image:
    repository: "hashicorp/vault"
    tag: "1.16.1"
  service:
    enabled: true
    active:
      enabled: true
    standby:
      enabled: true
    port: 8200
    targetPort: 8200
  dataStorage:
    enabled: true
    size: 10Gi
    mountPath: "/vault/data"
    accessMode: ReadWriteOnce
  auditStorage:
    enabled: true
    size: 10Gi
    mountPath: "/vault/audit"
    accessMode: ReadWriteOnce
  ha:
    enabled: true
    replicas: 3
ui:
  enabled: true
  externalPort: 8200
  targetPort: 8200

Several decisions are embedded here. tlsDisable: false means TLS is on, which is required for a production deployment. Separate PVCs for data and audit storage mean that high-volume audit logging cannot fill the data volume. Three replicas with ha.enabled: true run an HA cluster where one replica is active and two are standby, with automatic failover. For the Raft-based integrated storage the design calls for, the chart also expects server.ha.raft.enabled: true, which this excerpt does not show; without it, the chart's HA mode defaults to a Consul storage backend. The vault-deployment namespace isolates this from other workloads in the cluster.
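Since Vision passes Helm values as a Python dict, the three deployment modes can be pictured as small variations on one base dict. The sketch below is an assumption about how such a mapping might look, not the project's code. The function name and `BASE_VALUES` are hypothetical, while the keys (`server.dev.enabled`, `server.standalone.enabled`, `server.ha.enabled`, `server.ha.raft.enabled`) are taken from the HashiCorp Vault Helm chart.

```python
from copy import deepcopy

# Hypothetical base shared by every mode; values mirror the excerpt above.
BASE_VALUES = {
    "global": {"enabled": True, "tlsDisable": False},
    "server": {"image": {"repository": "hashicorp/vault", "tag": "1.16.1"}},
    "ui": {"enabled": True},
}


def values_for_mode(mode: str) -> dict:
    """Return Helm values for one of: 'dev', 'standalone', 'ha'."""
    values = deepcopy(BASE_VALUES)  # never mutate the shared base
    server = values["server"]
    if mode == "dev":
        # The dev server keeps data in memory and listens over plain HTTP.
        server["dev"] = {"enabled": True}
        values["global"]["tlsDisable"] = True
    elif mode == "standalone":
        server["standalone"] = {"enabled": True}
        server["dataStorage"] = {"enabled": True, "size": "10Gi"}
    elif mode == "ha":
        # raft.enabled selects integrated storage instead of the Consul default.
        server["ha"] = {"enabled": True, "replicas": 3, "raft": {"enabled": True}}
        server["dataStorage"] = {"enabled": True, "size": "10Gi"}
    else:
        raise ValueError(f"unknown deployment mode: {mode!r}")
    return values


ha_values = values_for_mode("ha")
```

Copying the base before modifying it keeps one tenant's mode choice from leaking into the next request handled by the same process.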

Each tenant gets their own namespace. Network policies and RBAC roles apply per-namespace, so a misconfiguration in one tenant’s namespace cannot reach another tenant’s Vault. If the threat model requires stronger isolation (a compromised pod actively attempting lateral movement), scheduling each tenant’s Vault pods on a dedicated node pool adds node-level isolation on top of the Kubernetes-level controls.

Vault internals that the application relies on

The thesis covers the Vault concepts that the application actually uses. Understanding them is necessary to understand what the Flask application is doing when it makes API calls.

Tokens are Vault’s primary access mechanism. Vault defines two token types: Service tokens (standard, persisted in storage) and Batch tokens (lightweight, not persisted, for high-volume short-lived use). A service token can additionally be Periodic (it stays valid as long as it is renewed within its period) or Orphan (not tied to a parent token’s lifecycle). The application creates Userpass accounts for users, which generate Service tokens on authentication. Those tokens have configurable TTL, max TTL, max number of uses, and bound CIDRs. The user creation form in Vision exposes all of these: TTL=720 (hours), MaxTTL=8760, Max uses=700, Period=168, bound CIDRs like 192.168.1.0/24,198.51.100.0/24.
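As an illustration of how those form fields could become a userpass request, here is a minimal sketch. The function name is hypothetical and the project's real handler is not shown in the source. The parameter names (`token_ttl`, `token_max_ttl`, `token_num_uses`, `token_period`, `token_bound_cidrs`, `token_policies`) are the ones Vault's userpass auth method accepts, and treating every duration as hours is an assumption based on the "TTL=720 (hours)" hint.

```python
import ipaddress


def build_userpass_params(password, ttl_hours, max_ttl_hours, max_uses,
                          period_hours, bound_cidrs, policies=()):
    """Translate user-creation form fields into userpass parameters."""
    cidrs = [c.strip() for c in bound_cidrs.split(",") if c.strip()]
    for cidr in cidrs:
        ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
    if ttl_hours > max_ttl_hours:
        raise ValueError("TTL cannot exceed MaxTTL")
    return {
        "password": password,
        "token_ttl": f"{ttl_hours}h",
        "token_max_ttl": f"{max_ttl_hours}h",
        "token_num_uses": max_uses,
        "token_period": f"{period_hours}h",
        "token_bound_cidrs": cidrs,
        "token_policies": list(policies),
    }


# Values from the form example above; the password is a placeholder.
params = build_userpass_params(
    "change-me", 720, 8760, 700, 168, "192.168.1.0/24,198.51.100.0/24"
)
```

Validating the CIDRs before calling Vault lets the form report a typo directly instead of surfacing an API error.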

Policies are documents, written in HCL or JSON, that specify what paths a token holder can interact with and what capabilities they have (create, read, update, delete, list, sudo). A “privileged-access” policy that grants full control over secrets looks like:

{
  "name": "privileged-access",
  "rules": [
    {
      "path": "secret/data/*",
      "capabilities": ["create", "read", "update", "delete", "list"]
    },
    {
      "path": "auth/token/*",
      "capabilities": ["create", "read", "update", "delete", "list"]
    }
  ]
}

The application creates policies when an admin configures features, then assigns them to users via the Policy Assignment page. This is the mechanism that implements the principle of least privilege: a user with only the ssh-key policy can request SSH certificates but cannot read other users’ secrets.
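Vault itself most commonly takes policies as HCL `path` blocks, so a name/rules document like the one above has to be rendered before it is uploaded. The converter below is a hedged sketch of that step. It assumes the name/rules shape is an application-level representation, and `policy_to_hcl` is a hypothetical helper, not a function from the project.

```python
import json


def policy_to_hcl(policy: dict) -> str:
    """Render a {"name", "rules": [{"path", "capabilities"}]} dict as HCL."""
    blocks = []
    for rule in policy["rules"]:
        # json.dumps gives correctly quoted and escaped string literals.
        caps = ", ".join(json.dumps(c) for c in rule["capabilities"])
        path = json.dumps(rule["path"])
        blocks.append(f"path {path} {{\n  capabilities = [{caps}]\n}}")
    return "\n\n".join(blocks)


ssh_only = {
    "name": "ssh-key",  # policy name mentioned in the text; rule is illustrative
    "rules": [{"path": "ssh/sign/*", "capabilities": ["create", "update"]}],
}
hcl = policy_to_hcl(ssh_only)
```

The resulting string is what a policy-write call would carry as the policy body, with the name passed separately.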

Secret engines are modular backends. Each is mounted at a path. The KV engine stores arbitrary key-value pairs (passwords, API keys, tokens). The PKI engine generates X.509 certificates and manages a Certificate Authority. The SSH engine generates SSH credentials. The Database engine creates dynamic, TTL-bound credentials for PostgreSQL, MySQL, MongoDB, and others. Disabling a secret engine permanently destroys all secrets it holds, so that operation has to be guarded inside Vault. Data loss at the Kubernetes layer is a separate risk, and it is the reason the PersistentVolume's persistentVolumeReclaimPolicy: Retain matters so much: the data must survive even if the Kubernetes objects are accidentally deleted.

Leases are time-bound access contracts. Every dynamic secret (database credentials, SSH keys) is associated with a lease that has a TTL. When the TTL expires, Vault automatically revokes the credential and the database user disappears. This is what makes the dynamic DB credentials feature useful: a developer requests temporary access to a database, uses it for the duration of the task, and the credential is automatically cleaned up without anyone needing to remember to do it.

Secret engines in the application

The PKI engine handles X.509 certificate issuance. The admin deploys a CA through Vision’s “Create Self-Signed Certificate Authority” page. The form fields map directly to what Vault’s PKI engine needs: Common Name (company1.org), certificate authority type (internal, external, or exported), TTL (87600h, which is ten years for a root CA), subject alternative names, key type (rsa, ed25519, ec), key bits (4096 for the CA key), organizational unit, organization, country (DZ), locality (Alger), province (Bab Ezzouar). Vision takes those form values and calls the Vault PKI API, then displays the resulting CA certificate. Users see a working CA; they do not see a single Vault command.

When a user then requests a certificate, the form takes: certificate role name (web-server-role), issuing CA (company1.org), common name (web-server1.com), SANs (web-server1.org,web-server1.dz), IP SANs (198.51.100.2), revocation date (2024-12-31T23:59:59Z), and TTL (72h). Vision calls the PKI issue endpoint and returns the signed certificate.
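The translation from form fields to a PKI request can be sketched as follows. This is an assumption about the shape of the code, not an excerpt from it. `build_issue_request` is a hypothetical name, while `common_name`, `alt_names`, `ip_sans`, and `ttl` are parameters of Vault's PKI issue endpoint (`<mount>/issue/<role>`).

```python
import ipaddress


def build_issue_request(role, common_name, sans="", ip_sans="", ttl="72h"):
    """Return (relative_path, payload) for a PKI certificate request."""
    alt_names = [s.strip() for s in sans.split(",") if s.strip()]
    ips = [s.strip() for s in ip_sans.split(",") if s.strip()]
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    payload = {"common_name": common_name, "ttl": ttl}
    if alt_names:
        payload["alt_names"] = ",".join(alt_names)  # comma-separated string
    if ips:
        payload["ip_sans"] = ",".join(ips)
    return f"issue/{role}", payload


# Values from the form example above.
path, payload = build_issue_request(
    "web-server-role",
    "web-server1.com",
    sans="web-server1.org,web-server1.dz",
    ip_sans="198.51.100.2",
    ttl="72h",
)
```

With hvac, a payload like this would typically be passed through `client.secrets.pki.generate_certificate(name=..., common_name=..., extra_params=...)`; whether Vision uses that call or a raw write is not stated in the source.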

The SSH engine operates differently from the PKI engine. Two approaches are supported. Signed SSH certificates: Vault acts as a CA that signs the user’s public key for a specified TTL. The signed certificate is presented to the target server, which trusts the CA. The certificate expires automatically; there is no need to manage authorized_keys files. One-time passwords: Vault generates an OTP valid for exactly one SSH session. The client uses it, and it is immediately invalidated.

The Database engine connects to a configured database and generates a new username and password when a user requests credentials. The credential request form in Vision takes a Vault role name (which has a configured database role and TTL) and a mount point (pgsql/ for PostgreSQL). Vault creates the database user with the appropriate permissions, returns the credentials, and schedules automatic revocation when the TTL expires.

The Vision interface

The frontend is a Flask application with Bootstrap styling. The product name visible in the GUI is “Vision.” Authentication supports two methods: userpass (username and password) and token (leave the username field blank, enter the token in the password field).

The admin interface manages instances. The dashboard shows each running Vault instance (customer10 accessible at port 60687, customer11 at port 51053) with buttons for Delete, View Secrets, Seal, Access, and Snapshots. Instance creation offers four options from a dropdown: Dev server, High availability server, Standalone server, or Custom. Custom mode exposes a YAML text area pre-filled with the Vault Helm chart values, letting an advanced user modify the deployment configuration before submitting it. A single click on “Create Instance” runs the entire provisioning sequence: Helm chart deployment, Vault initialization, key extraction, storage in the Central SMS Server, and unsealing. The “Initial Unseal Keys Page” then displays the root token and all five unseal key shares so the admin can record them offline.

The user interface shows KMS Operations and Admin Operations sections depending on the user’s policy. A user with full permissions sees: Get New SSH Certificate, Get New Certificate, Store New Secret, Get Secrets, Generate New Database Key (KMS), and Create User, Assign Policy, Create Self-Signed Certificate Authority, Create Intermediate Certificate Authority (Admin). The dashboard header shows the Vault instance UUID the user is connected to: dc3f0ac7-61bc-211e-dad1-9042f1a6f522.

How instance creation works in code

The “Create Instance” button in Vision runs a sequence across three Python modules. Understanding the mechanics requires reading them together.

helm_integration.py is 16 lines. It wraps pyhelm3, the Python Helm client, with two functions. At module load time, the HashiCorp chart is fetched from https://helm.releases.hashicorp.com:

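import asyncio

import pyhelm3

# Shared Helm client (pyhelm3 shells out to the helm CLI).
hClient = pyhelm3.Client()
# Resolve the Vault chart once, at import time.
vchart = asyncio.run(hClient.get_chart("vault", repo="https://helm.releases.hashicorp.com"))


def install_vault(appname, namespace, values):
    """Install or upgrade a Vault release and return the resulting revision."""
    release_revision = asyncio.run(
        hClient.install_or_upgrade_release(appname, vchart, values,
                                           namespace=namespace, create_namespace=True)
    )
    return release_revision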

install_or_upgrade_release is idempotent: it installs the release if it does not exist, or upgrades it if it does. create_namespace=True creates the Kubernetes namespace if absent. The values dict passed in comes directly from the form the user submitted (or from the pre-filled YAML in custom mode). Each tenant instance gets its own namespace named after the instance.

instance_handler.py orchestrates the creation sequence inside create_vault_instance:

vname = vclient.get("vhandler").username + str(len(vclient.get("vhandler").get_instances()))
result = helm_integration.install_vault(vname, vname, instance_data)

The instance name is the username concatenated with the count of existing instances — customer10, customer11, etc. The Helm namespace and release name are both set to this name, which makes everything per-tenant and easy to locate.

After Helm completes, the code polls Kubernetes until all pods in the namespace report Running phase, up to 30 seconds:

while not all_Running and counter < 30:
    all_Running = all(pod.status.phase == "Running" for pod in pods)
    ...
    sleep(1)
    counter += 1
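The same wait can be written as a small reusable helper. This is a sketch rather than the project's code: `wait_until` is a hypothetical name, and the predicate is expected to re-fetch the pod list on every call (how the original refreshes `pods` is elided in the excerpt above). The fake phase sequence stands in for successive Kubernetes queries.

```python
import time


def wait_until(predicate, timeout_s=30.0, interval_s=1.0):
    """Call predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for two successive pod listings: one pod still Pending, then all up.
phases = iter([["Pending", "Running"], ["Running", "Running"]])


def all_running():
    # In the real handler this would query Kubernetes again on each call.
    return all(phase == "Running" for phase in next(phases))


ready = wait_until(all_running, timeout_s=5, interval_s=0.01)
```

Using a monotonic deadline instead of a counter keeps the 30-second budget accurate even when each status query is itself slow.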

Once the pods are up, a free local port is allocated by binding a socket to port 0 and reading getsockname()[1] (an unbound socket reports port 0, so the bind has to come first), and a kr8s.portforward.PortForward is started, tunneling the pod’s port 8200 to 0.0.0.0:<free_port> on the Vision server. This is how Vision can reach a Vault pod that has no external IP: it port-forwards through the Kubernetes API.
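The usual way to ask the operating system for an unused port is to bind to port 0 and read back what was assigned. A minimal sketch, with `find_free_port` as a hypothetical helper name:

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


port = find_free_port()
```

The socket is closed before the port is handed to the port-forward, so another process could in principle claim it in between. For a single-host frontend that window is small, but it is a known limitation of this pattern.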

Then VaultHandler.init_and_unseal() calls sys.initialize(secret_shares=5, secret_threshold=3), extracts the root token and unseal keys from the response, and submits three unseal keys via sys.submit_unseal_key() in a loop until the instance reports unsealed.
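The initialize-then-unseal loop can be sketched against any object shaped like hvac's `client.sys`. The function below is an illustration under that assumption and not the project's `VaultHandler` method. `FakeSys` is a hypothetical stand-in so the flow can be followed without a running Vault; it mimics the `keys`, `root_token`, and `sealed` fields of the real responses.

```python
def init_and_unseal(sys_api, shares=5, threshold=3):
    """Initialize Vault, then submit key shares until it reports unsealed."""
    init = sys_api.initialize(secret_shares=shares, secret_threshold=threshold)
    root_token, keys = init["root_token"], init["keys"]
    for key in keys[:threshold]:
        status = sys_api.submit_unseal_key(key=key)
        if not status["sealed"]:
            break
    else:
        # Loop ended without a break: threshold shares were not enough.
        raise RuntimeError("instance still sealed after submitting key shares")
    return root_token, keys


class FakeSys:
    """Illustrative stand-in for hvac's client.sys."""

    def __init__(self):
        self.sealed = True
        self.submitted = 0
        self.threshold = None

    def initialize(self, secret_shares, secret_threshold):
        self.threshold = secret_threshold
        keys = [f"share-{i}" for i in range(secret_shares)]
        return {"root_token": "hvs.fake", "keys": keys}

    def submit_unseal_key(self, key):
        self.submitted += 1
        if self.submitted >= self.threshold:
            self.sealed = False
        return {"sealed": self.sealed, "progress": self.submitted}


fake = FakeSys()
root_token, unseal_keys = init_and_unseal(fake)
```

All five shares are returned even though only three are submitted, which matches the design of storing the full set in the Central Vault.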

The final step stores everything in the Central Vault:

vcl.append_secret(mount='customers/', path=vcl.username + "/" + vname,
                  key="root-token", secret=root_token)
for i, key in enumerate(unseal_keys, start=0):
    vcl.append_secret(mount='customers/', path=vcl.username + "/" + vname,
                      key="unseal-key-" + str(i), secret=key)

The storage path is customers/<username>/<instance_name>. The Central Vault holds one KV secret per instance, containing the root token and all five unseal key shares. When an instance needs to be unsealed later (e.g., after a pod restart), the same path is read back and the unseal sequence is replayed. When the instance is deleted, the KV secret at that path is also deleted via remove_secret, so the Central Vault stays clean.

vault_intgre.py wraps the hvac Python client. Most methods are thin wrappers around hvac calls: login_vault calls client.auth.userpass.login, seal calls client.sys.seal, get_pki_mount_points calls client.sys.list_mounted_secrets_engines and filters for type == 'pki'. The append_secret method implements a read-modify-write cycle: it reads the existing KV secret at the path, merges the new key-value pair into the dict, then writes the whole dict back. This avoids overwriting existing fields when adding a new key to an existing secret.
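The read-modify-write cycle is easy to picture with an in-memory stand-in for the KV engine. This sketch is not the project's `append_secret` (which goes through hvac and takes a mount argument). `InMemoryKV` and the simplified signature are hypothetical and exist only to show the merge behavior.

```python
class InMemoryKV:
    """Toy key-value store: each path holds one dict, replaced on write."""

    def __init__(self):
        self._store = {}

    def read(self, path):
        return dict(self._store.get(path, {}))  # copy; empty if absent

    def write(self, path, data):
        self._store[path] = dict(data)  # a write replaces the whole secret


def append_secret(kv, path, key, secret):
    """Add one field to the secret at path without dropping the others."""
    current = kv.read(path)   # read
    current[key] = secret     # modify
    kv.write(path, current)   # write the merged dict back
    return current


kv = InMemoryKV()
append_secret(kv, "customer1/customer10", "root-token", "hvs.fake")
for i, share in enumerate(["s0", "s1"]):
    append_secret(kv, "customer1/customer10", f"unseal-key-{i}", share)
stored = kv.read("customer1/customer10")
```

The cycle is not atomic, so two concurrent writers to the same path could lose an update. Vault's KV version 2 offers a check-and-set (`cas`) parameter for that case; whether the project uses it is not stated in the source.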

End-to-end test: SSH certificate authentication

The thesis closes with a working demonstration of the SSH certificate workflow. The test demonstrates what the whole stack was built for.

The user visits the SSH certificate page, enters the SSH engine mount path (ssh), remote username (ubuntu), and key size (4096). Vision generates an RSA key pair, sends the public key to Vault’s SSH engine for signing, and returns the signed certificate for download.

To use it: download the CA public key from Vision, save it on the server, and point the TrustedUserCAKeys directive in sshd_config at that file. On the client side, present the signed certificate alongside the private key when connecting. The server validates the certificate against the trusted CA and grants access without checking authorized_keys at all. The thesis shows a successful connection screenshot, verifying that the certificate produced by the web interface works against a real SSH server.

The security properties of this approach are worth stating explicitly. The certificate has a TTL baked in, so when it expires the SSH access automatically stops working. There is no need to update authorized_keys on every server when an employee leaves; revoking their Vault credentials stops them from getting new certificates. The audit log in Vault records every certificate issuance with the user’s identity. This is a qualitatively different security model from static SSH key management.

What the project taught

The hard problem in key management is not the cryptography. Vault’s AES-256 key wrapping, Shamir secret sharing for unseal keys, and TLS for data in transit are all solved problems. The hard problems are operational and human.

Who holds the unseal keys? If the answer is “whoever initializes Vault and copies down the output,” those keys will end up in a spreadsheet or a messaging thread. The design solution was to store them in the Central SMS Server automatically, so the answer is “the infrastructure.”

How does a CA get deployed without a week of manual work? The answer is a form that submits to Vault’s PKI API. How do database credentials get cleaned up when a developer finishes testing? The answer is dynamic secrets with a TTL enforced by Vault’s lease system.

The recurring pattern is that every time a security control requires human attention to apply consistently, it will sometimes not be applied. The interesting engineering work is finding ways to make the secure path the automatic path. The auto-unsealing, the certificate renewal via Consul templates, and the dynamic credential expiry all reflect that principle: remove the human from the loop on anything that has a defined correct answer.

The VM-to-Kubernetes migration reflects a second pattern: prototype on the simplest infrastructure that validates the design, then switch to the infrastructure that scales. VMs confirmed that Vault could be provisioned and unsealed automatically. Kubernetes confirmed that it could be done at tenant scale without proportional resource growth. The automation carried over from a single VM to many Kubernetes pods because it was written against APIs (OpenStack, Kubernetes, Vault) rather than against a particular machine.