How I built a storytelling LLM engine that actually responds like a game master and doesn't stall like a blog-post generator.
In my previous adventures I found that most local LLM storytelling engines stall, not because the model is slow, but because everything around it is.
I learned this the hard way building Noir Adventures, an AI Dungeon-style system where every player action feeds a cinematic AI narrator and rolls for events and interactions. Think Dungeons & Dragons meets AI storytelling in real time.
🔧 My Fix: I switched from string truncation to token-aware trimming using `llama_cpp` and `transformers` tokenizers. Memory now clips the oldest turns based on actual token count, not guesswork.
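Roughly, the trimming looks like this. A minimal sketch, assuming the history is a plain list of turn strings; the gpt2 tokenizer and the 1,500-token budget are stand-ins for illustration, not what the engine actually ships with:

```python
from transformers import AutoTokenizer

# Stand-in tokenizer; in practice you'd load the one matching your model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def trim_history(turns: list[str], max_tokens: int = 1500) -> list[str]:
    """Drop the oldest turns until the history fits the token budget."""
    counts = [len(tokenizer.encode(t)) for t in turns]
    total = sum(counts)
    start = 0
    # Clip from the front (oldest turns first) using real token counts,
    # not character-length guesswork. Always keep the latest turn.
    while total > max_tokens and start < len(turns) - 1:
        total -= counts[start]
        start += 1
    return turns[start:]
```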
🧵 My Fix: Replaced `ThreadPoolExecutor` with `asyncio.to_thread()` for async model calls. I also added a graceful timeout fallback that retries with a reduced `max_tokens`. I have no idea if this is the best approach, but it works very well so far.
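Here's a minimal sketch of that pattern. The `generate()` helper, the 64-token floor, and the halving schedule are assumptions for illustration, not the actual engine code:

```python
import asyncio

def generate(prompt: str, max_tokens: int) -> str:
    # Stand-in for the blocking model call, e.g. with llama_cpp:
    #   llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
    raise NotImplementedError

async def narrate(prompt: str, max_tokens: int = 512, timeout: float = 20.0) -> str:
    """Run the blocking call off the event loop; on timeout, retry with a
    halved max_tokens instead of hanging the game loop."""
    while max_tokens >= 64:
        try:
            return await asyncio.wait_for(
                asyncio.to_thread(generate, prompt, max_tokens),
                timeout=timeout,
            )
        except asyncio.TimeoutError:
            # Note: wait_for stops waiting, but the worker thread itself
            # keeps running to completion in the background.
            max_tokens //= 2
    return "(the narrator trails off...)"
```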
Instead of dumping the whole conversation each turn, I send the narrator a scoped slice: a short summary of older events plus only the most recent turns.
🧠 Result: The AI sounds consistent and aware—without burning 2,000 tokens to remember “you walked into a bar.”
Here's an example.
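A minimal sketch of what that scoped prompt assembly might look like; the field labels, summary source, and six-turn window are all assumptions:

```python
def build_prompt(summary: str, recent_turns: list[str], player_action: str) -> str:
    """Assemble a scoped prompt: a rolling summary of older events plus
    only the most recent turns, instead of the full transcript."""
    recent = "\n".join(recent_turns[-6:])  # keep just the last few turns verbatim
    return (
        "You are a noir game master.\n"
        f"Story so far (summary): {summary}\n"
        f"Recent scene:\n{recent}\n"
        f"Player: {player_action}\n"
        "Narrator:"
    )
```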
The biggest takeaway?
Narrative AI isn’t about bigger models—it’s about treating story flow like an API call: fast, contextual, and scoped.
By optimizing the architecture around the LLM—not just the prompt—we made a local RPG engine feel reactive, immersive, and alive.
Noir Adventures is an experimental project—part of an ongoing investigation into local language models and responsive narrative engines.
Due to current legal constraints in the United States, along with other personal legal agreements, I am not releasing, monetizing, or distributing this system publicly.
This work is shared strictly for research, prototyping, and collaboration among developers and creative engineers exploring the edge of narrative AI.
If you're working on similar systems—or just curious about how to build them—I’d love to trade notes.
Author: Will Blew, SARVIS AI