project / March 2026

LLM smart home assistant

Python backend with Groq API, five-tier semantic memory system, intent routing, voice I/O, and Home Assistant integration designed for elderly users.

LLM · Python · Groq · Semantic memory · Home Assistant

This project started from a practical constraint: build a voice-first conversational assistant for elderly users in French, running on a Home Assistant instance, where the user has no keyboard and the assistant needs to remember context across conversations that happen over weeks. The immediate engineering challenge was not “how do we call an LLM” — the Groq API handles that — but “how do we build a memory system that behaves like a person who remembers you, rather than a stateless chatbot that forgets everything between sessions.”

The dispatch layer

The backend (app.py) handles most requests through the LLM, which returns a structured command field in its response. That field tells the backend which handler to invoke next. The LLM is the router: it reads the transcribed input, decides what kind of request it is, and encodes the decision in its output.

One exception bypasses this entirely. Alarm requests are detected with a direct string check before anything else: if text.startswith("Alarm : "). Time-based requests have a precise enough syntax that string matching beats LLM classification for them. Everything else goes through the model first.

The post-LLM handlers are:

  • News handler: fetches and formats current news
  • Service handler: issues Home Assistant API calls to control devices and services
  • Weather handler: queries current conditions and formats the response
  • Conversational fallback: returns the LLM response text as-is when no specific handler matches

The design reflects a practical constraint. News, weather, and Home Assistant control all require calling external APIs with specific parameters. The LLM identifies the intent and can extract relevant parameters (which device, which action), but the actual API call is deterministic code, not the model. This splits the work correctly: the model handles ambiguous natural language, the handler code handles reliable I/O.
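The flow described above can be sketched as follows. This is an illustration, not the code in app.py: the handler functions, the JSON response shape with `command` and `text` keys, and the `call_llm` callable are assumptions. Only the `"Alarm : "` prefix check and the four handler categories come from the description.

```python
import json
from typing import Callable, Dict


# Hypothetical handlers: the real ones call news, weather, and Home Assistant APIs.
def handle_alarm(text: str) -> str:
    return "alarm:" + text[len("Alarm : "):]


def handle_news(reply: dict) -> str:
    return "news"


def handle_service(reply: dict) -> str:
    return "service:" + reply.get("entity", "")


def handle_weather(reply: dict) -> str:
    return "weather"


HANDLERS: Dict[str, Callable[[dict], str]] = {
    "news": handle_news,
    "service": handle_service,
    "weather": handle_weather,
}


def dispatch(text: str, call_llm: Callable[[str], str]) -> str:
    # Fixed-syntax alarm requests skip the model entirely.
    if text.startswith("Alarm : "):
        return handle_alarm(text)
    # Everything else is classified by the LLM (assumed here to return JSON).
    reply = json.loads(call_llm(text))
    handler = HANDLERS.get(reply.get("command"))
    if handler is None:
        # Conversational fallback: return the model's text unchanged.
        return reply.get("text", "")
    return handler(reply)
```

Passing `call_llm` as a parameter keeps the dispatcher testable with a fake model, which is one practical benefit of keeping the I/O in deterministic code.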

Voice input uses Groq’s speech transcription. Voice output goes through speak_text_async, an asynchronous TTS wrapper. The async design matters for responsiveness: because speech output does not block the backend, the next query can be processed while the previous answer is still being spoken.
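A minimal sketch of the non-blocking pattern, assuming an asyncio-based wrapper. The body of `speak_text_async` here is a stand-in (a short sleep instead of audio playback), and `handle_turn` and `demo` are hypothetical names. The real wrapper may use threads or a different mechanism.

```python
import asyncio
from typing import List


async def speak_text_async(text: str, log: List[str]) -> None:
    # Stand-in for the real TTS wrapper: the sleep simulates audio playback.
    await asyncio.sleep(0.05)
    log.append("spoke: " + text)


async def handle_turn(reply: str, log: List[str]) -> asyncio.Task:
    # Start speech without awaiting it, so the caller can accept the next query.
    task = asyncio.create_task(speak_text_async(reply, log))
    log.append("ready for next query")
    return task


async def demo() -> List[str]:
    log: List[str] = []
    task = await handle_turn("Bonjour", log)
    await task  # awaited only so the demo finishes cleanly
    return log
```

In the demo, the "ready" entry is logged before the speech entry, which is the ordering that keeps the assistant responsive.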

The memory system

The most substantial engineering work is in semantic_memory.py. The system has five distinct memory store types, each handling a different kind of persistent information.

ShortTermMemory: the immediate context buffer

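import threading
from collections import deque

class ShortTermMemory(MemoryStore):
    def __init__(self, max_turns: int = 20):
        self.buffer = deque(maxlen=max_turns)  # oldest turn is dropped when full
        self.lock = threading.Lock()  # guards every read and write of the buffer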

A deque with maxlen=20 acts as a circular buffer. When the 21st conversation turn arrives, the oldest is evicted automatically. Every read and write goes through a threading.Lock() because the background decay tasks and the main request-processing thread both access these stores.

Each turn is stored as a ConversationTurn object: user_input, llm_response, timestamp, and a turn_id UUID. When composing prompts for the LLM, the last N turns from this buffer provide the immediate conversational context.
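The eviction behaviour can be checked with a small stand-alone sketch. The four `ConversationTurn` fields come from the description above. The `ShortTermBuffer` class and its `add` and `last` methods are hypothetical stand-ins that drop the `MemoryStore` base class so the example runs on its own.

```python
import threading
import uuid
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ConversationTurn:
    user_input: str
    llm_response: str
    timestamp: datetime = field(default_factory=datetime.now)
    turn_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ShortTermBuffer:
    """Illustrative stand-in for ShortTermMemory (no MemoryStore base class)."""

    def __init__(self, max_turns: int = 20):
        self.buffer = deque(maxlen=max_turns)
        self.lock = threading.Lock()

    def add(self, turn: ConversationTurn) -> None:
        with self.lock:
            self.buffer.append(turn)  # deque drops the oldest turn when full

    def last(self, n: int) -> List[ConversationTurn]:
        # Return the n most recent turns, oldest first, for prompt composition.
        with self.lock:
            return list(self.buffer)[-n:] if n > 0 else []
```

After 21 calls to `add`, the buffer still holds 20 turns and the first one is gone, with no explicit eviction code.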

EpisodicMemory: conversation sessions over time

Episodic memory stores complete conversation episodes rather than individual turns. An episode is a temporally coherent interaction segment — the conversation about setting up a doctor appointment, or the exchange about a family visit. Episodes have: episode_id, start_time, end_time, summary, key_interactions (the notable turns), topics, and emotional_tone.

Episode boundaries are detected in two ways:

self.episode_gap_minutes = 30

def _explicit_episode_boundary(self, user_input: str) -> bool:
    cues = ["new topic", "let's change the subject", "start over", "switch topic"]
    return any(cue in user_input.lower() for cue in cues)

A gap of more than 30 minutes between turns triggers a new episode. So does an explicit phrase like “let’s change the subject.” Either condition closes the current episode (recording its summary and highlights) and opens a new one.
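Combining the two conditions might look like the sketch below. The 30-minute gap and the cue list are taken from the excerpt. The `starts_new_episode` function, its parameters, and the choice to treat the very first turn as a boundary are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

EPISODE_GAP_MINUTES = 30
CUES = ["new topic", "let's change the subject", "start over", "switch topic"]


def explicit_episode_boundary(user_input: str) -> bool:
    lowered = user_input.lower()
    return any(cue in lowered for cue in CUES)


def starts_new_episode(
    user_input: str, now: datetime, last_turn_time: Optional[datetime]
) -> bool:
    # Assumption: the very first turn always opens an episode.
    if last_turn_time is None:
        return True
    # "More than 30 minutes" is read as a strict comparison.
    gap_exceeded = now - last_turn_time > timedelta(minutes=EPISODE_GAP_MINUTES)
    return gap_exceeded or explicit_episode_boundary(user_input)
```

Substring matching is cheap but brittle: a transcription that renders the apostrophe in “let’s” as a typographic character will not match the straight-apostrophe cue, so normalizing the input first may be worthwhile.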

This matters for context composition. When the user starts talking, the system can retrieve past episodes that match the current topic by calling retrieve_episodes_by_topic. A conversation about medications today can be connected to the episode from last week where the user mentioned a prescription, without surfacing the entire chat history.
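One simple way topic retrieval could work is an exact match on the episode's topic labels, newest first. The function name `retrieve_episodes_by_topic` is from the project, but its signature, the `limit` parameter, and the matching strategy here are assumptions. The actual implementation may use semantic similarity instead of label equality.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Episode:
    episode_id: str
    start_time: datetime
    end_time: datetime
    summary: str
    topics: List[str] = field(default_factory=list)


def retrieve_episodes_by_topic(
    episodes: List[Episode], topic: str, limit: int = 3
) -> List[Episode]:
    wanted = topic.lower()
    # Keep episodes tagged with the topic (case-insensitive label match).
    matches = [e for e in episodes if wanted in (t.lower() for t in e.topics)]
    # Most recently finished episodes first.
    matches.sort(key=lambda e: e.end_time, reverse=True)
    return matches[:limit]
```

Returning only a few summaries keeps the prompt short while still linking today's conversation to the relevant past one.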

ProceduralMemory, FactualMemory, AffectiveMemory

ProceduralMemory stores structured step-by-step procedures: multi-stage tasks the assistant has learned or been told, each with a name, steps list, prerequisites, difficulty, and success_criteria. Retrieving by name allows the assistant to guide the user through recurring tasks.

FactualMemory stores explicit knowledge items — the user’s preferences, names of family members, the doctor’s phone number. Each fact has metadata including importance score and access count.

AffectiveMemory stores sentiment-tagged interactions: a sentiment score (numeric), a list of emotions, the context of the interaction, and what triggers were detected. The goal is for the assistant to recognize when a topic has historically produced distress and adjust its approach.
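As an illustration of how affective records could drive behaviour, the sketch below flags a trigger as sensitive when its average sentiment is clearly negative. The record fields follow the description above. The sentiment range of -1.0 to 1.0, the `is_sensitive_topic` helper, and its threshold and sample-count defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AffectiveRecord:
    sentiment: float  # assumed range: -1.0 (negative) to 1.0 (positive)
    emotions: List[str]
    context: str
    triggers: List[str] = field(default_factory=list)


def is_sensitive_topic(
    records: List[AffectiveRecord],
    trigger: str,
    threshold: float = -0.3,
    min_samples: int = 2,
) -> bool:
    scores = [r.sentiment for r in records if trigger in r.triggers]
    # Require a few observations so one bad day does not label a topic.
    if len(scores) < min_samples:
        return False
    return sum(scores) / len(scores) < threshold
```

A caller could use the result to soften the prompt or avoid raising the topic unprompted.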

Memory scoring and decay

The most interesting part of the memory system is the scoring function that governs what survives:

import math
from datetime import datetime

def calculate_memory_score(memory: Memory) -> float:
    meta = memory.metadata  # assumption: the excerpt does not show where meta comes from
    now = datetime.now()    # assumption: the excerpt does not show how now is set
    age_in_days = (now - meta["timestamp"]).total_seconds() / 86400
    recency_score = math.exp(-age_in_days * 0.1)  # exponential decay with age
    frequency_score = math.log(meta.get("access_count", 0) + 1)  # log-damped access count
    feedback_multiplier = 1.0
    if meta.get("user_feedback") == "good":
        feedback_multiplier = 2.0
    elif meta.get("user_feedback") == "bad":
        feedback_multiplier = 0.5
    importance_score = meta.get("importance_score", 0.5)
    return 0.6 * recency_score + 0.2 * frequency_score + 0.1 * feedback_multiplier + 0.1 * importance_score

The weights: recency carries a weight of 0.6 and decays exponentially with a half-life of roughly 7 days (exp(-0.1 * days) reaches 0.5 at about 6.9 days). Access frequency carries 0.2 on a log scale, so heavily accessed memories score well but with diminishing returns; because the log is unbounded, this term alone can exceed 0.2 and the total can exceed 1. User feedback carries 0.1. Despite its name, feedback_multiplier is added rather than multiplied: good feedback contributes 0.2, no feedback 0.1, and bad feedback 0.05. The remaining 0.1 applies to the importance score assigned when the memory was created.
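Plugging numbers into the formula as written shows how the terms combine. The sketch re-implements the same arithmetic on plain arguments. The function name `score` and its keyword arguments are illustrative and not part of semantic_memory.py.

```python
import math


def score(age_days, access_count=0, feedback=None, importance=0.5):
    recency = math.exp(-age_days * 0.1)
    frequency = math.log(access_count + 1)
    # The feedback value enters the sum as an additive term.
    feedback_term = {"good": 2.0, "bad": 0.5}.get(feedback, 1.0)
    return 0.6 * recency + 0.2 * frequency + 0.1 * feedback_term + 0.1 * importance


half_life_days = math.log(2) / 0.1           # about 6.93 days
fresh = score(0)                             # 0.6 + 0.0 + 0.1 + 0.05 = 0.75
fresh_good = score(0, feedback="good")       # 0.6 + 0.0 + 0.2 + 0.05 = 0.85
old_but_used = score(60, access_count=100)   # frequency term dominates, total above 1
```

The last case shows that a 60-day-old memory with 100 accesses outscores a brand-new one, because the access count never decays.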

A factual memory with a score below 0.3 is deleted. Episodic memories older than 30 days get summarized (the full turn list is replaced with a summary). Affective memories are deleted after 60 days if not accessed.
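Solving the scoring formula for age gives a rough idea of how long a factual memory survives the 0.3 threshold. This is a sketch under the assumption that the threshold is applied directly to the output of calculate_memory_score. The helper `days_until_below` is hypothetical.

```python
import math


def days_until_below(threshold, access_count=0, feedback_term=1.0, importance=0.5):
    # Part of the score that does not depend on age.
    rest = 0.2 * math.log(access_count + 1) + 0.1 * feedback_term + 0.1 * importance
    if rest >= threshold:
        return math.inf  # the non-recency terms alone keep the memory alive
    # Solve 0.6 * exp(-0.1 * d) + rest = threshold for d.
    return max(0.0, -math.log((threshold - rest) / 0.6) / 0.1)


never_accessed = days_until_below(0.3)                 # about 13.9 days
accessed_twice = days_until_below(0.3, access_count=2)  # math.inf
```

Under these assumptions, a fact with default importance that is never accessed is deleted after about two weeks, while a fact accessed twice is never deleted.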

Background timers run these maintenance tasks automatically:

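def _start_background_tasks(self):
    def schedule():
        timer = threading.Timer(86400, daily)  # 86400 s = 24 h
        timer.daemon = True  # do not keep the process alive at shutdown
        timer.start()
    def daily():
        try:
            self.score_and_decay()
        finally:
            schedule()  # reschedule even if this run raised
    schedule()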

The daily timer reschedules itself after each run, so the maintenance cycle is self-sustaining as long as the process is running.

The engineering trade-offs

The memory system is in-process, not persistent. Everything lives in Python dictionaries and deques for the lifetime of the backend process. A restart loses all memory. For a prototype this is acceptable; a production version would serialize the memory stores to a database on shutdown and restore them on startup.
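A minimal sketch of what that persistence could look like for one store, using SQLite from the standard library. The `facts` table, the `save_facts` and `load_facts` functions, and the dictionary layout are assumptions. Timestamps would need converting to strings (for example ISO 8601) before JSON serialization.

```python
import json
import sqlite3
from typing import Dict

SCHEMA = "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, payload TEXT NOT NULL)"


def save_facts(conn: sqlite3.Connection, facts: Dict[str, dict]) -> None:
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM facts")
        conn.executemany(
            "INSERT INTO facts (key, payload) VALUES (?, ?)",
            [(key, json.dumps(value)) for key, value in facts.items()],
        )


def load_facts(conn: sqlite3.Connection) -> Dict[str, dict]:
    conn.execute(SCHEMA)
    rows = conn.execute("SELECT key, payload FROM facts").fetchall()
    return {key: json.loads(payload) for key, payload in rows}
```

Saving only on shutdown would still lose data on a crash, so calling `save_facts` after each maintenance run may be a safer default.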

The threading design is conservative — a single threading.Lock() per store serializes all access. This avoids races but creates contention if the background decay tasks and the main request thread both try to access a large store at the same time. A read-write lock (multiple concurrent readers, exclusive writer) would be a meaningful improvement for the access pattern here.
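Python's standard library has no read-write lock, so one would have to be written or imported. The sketch below is one simple way to build it on `threading.Condition`. The `ReadWriteLock` class is hypothetical, and this version favours readers, so a steady stream of reads can starve a writer.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (reader-preferring)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()  # wake a waiting writer

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()  # wake waiting readers and writers
```

Request handling would wrap lookups in `with lock.read():`, while the decay task would wrap deletions in `with lock.write():`.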

The scoring formula is a heuristic. The 60/20/10/10 weight split was a design decision without empirical validation against real user behavior. Whether these weights produce a memory system that feels natural to an elderly French user requires actual usage data to tune.

Why the safety angle matters

Once an assistant can trigger Home Assistant services — turn off lights, lock doors, set alarms, send messages — the input validation boundary becomes a security boundary. An LLM that can be prompted to take actions in the physical world is a different class of risk from one that only generates text.

Keeping service execution in deterministic handlers, with the LLM limited to choosing the command and its parameters, is partly an engineering efficiency decision and partly a safety decision. Deterministic service handlers are predictable in a way that free-form LLM tool calls are not. A service handler for “turn off the lights” can be audited, tested, and rate-limited. An LLM that interprets an ambiguous request and calls multiple Home Assistant APIs on its own cannot be constrained the same way.
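A sketch of what "audited, tested, and rate-limited" can mean in practice: an allowlist of domain and service pairs plus a sliding-window rate limit in front of the actual call. The `GuardedServiceHandler` class, the allowlist contents, and the limits are assumptions. `call_service` stands for whatever function issues the Home Assistant request.

```python
import time
from collections import deque
from typing import Callable, Deque, Set, Tuple

# Only these (domain, service) pairs may be triggered by model output.
ALLOWED_SERVICES: Set[Tuple[str, str]] = {
    ("light", "turn_on"),
    ("light", "turn_off"),
}


class GuardedServiceHandler:
    def __init__(
        self,
        call_service: Callable[[str, str, str], None],
        max_calls: int = 5,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.call_service = call_service
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self.recent: Deque[float] = deque()

    def handle(self, domain: str, service: str, entity_id: str) -> bool:
        # Reject anything outside the allowlist, whatever the model asked for.
        if (domain, service) not in ALLOWED_SERVICES:
            return False
        now = self.clock()
        # Forget calls that have left the rate-limit window.
        while self.recent and now - self.recent[0] >= self.window_s:
            self.recent.popleft()
        if len(self.recent) >= self.max_calls:
            return False
        self.recent.append(now)
        self.call_service(domain, service, entity_id)
        return True
```

Because the check is plain code, a unit test can assert that a request such as unlocking a door is refused regardless of how the prompt was phrased.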

The project demonstrates that these problems are real and tractable, not hypothetical. The memory system, the routing layer, and the async voice I/O are all working code that could be extended into a deployable product. The security design decisions are already embedded in the architecture rather than being left for later.