Hurry Up and Wait: UX Challenges for LLM-Powered Games
The UX Challenges of LLM Applications
LLMs generate text one token at a time, and they don't know how long their response will be until they're done writing it. This means:
- Full-response latency is high. Even a "fast" LLM takes a noticeable amount of time to finish a response.
- More capable models are slower. More complex prompts are also slower. You can't have it both ways.
- The output is non-deterministic. You can't predict the exact response, only steer it.
The first massively successful LLM application — ChatGPT — dealt with this by accident. A chat interface is uniquely forgiving of latency. The moment the first words appear on screen, the user is reading. The full response can take many seconds, sometimes even minutes, but that's fine, because it feels like chatting. Streaming didn't solve the latency problem; it just made it invisible.
Other successful types of LLM applications hide these challenges in other ways. Automation tools run in the background asynchronously, so the user isn't waiting at all. Productivity tools (e.g., AI coding tools) offer enough perceived value that users are willing to sit through a pause.
Games fit none of these patterns.
What This Means for Our Game
Here's the core flow for a single player turn in our game:
- Player writes their action
- LLM call with a fairly complex prompt — it needs to return both the GM's narration and structured data to move the game state forward (damage dealt, enemy status, etc.)
- Wait for the full response — streaming won't help because the TTS API needs complete text, and the game state update needs the complete structured data
- Send narration to TTS API
- Play the audio
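To make the additive cost concrete, here's a minimal sketch of that sequential flow. The LLM and TTS calls are stand-in stubs with illustrative (scaled-down) latencies, not real API calls; the point is that in a sequential pipeline, every stage's delay stacks before the player hears anything.

```python
import asyncio

# Illustrative stub latencies, scaled down for the demo. In production the
# full LLM call is over a second and TTS startup adds more on top.
LLM_LATENCY = 0.3   # full LLM call
TTS_LATENCY = 0.1   # time until TTS audio starts

async def call_llm(action: str) -> dict:
    """Hypothetical full GM call: narration plus structured state update."""
    await asyncio.sleep(LLM_LATENCY)
    return {"narration": f"You {action}. The orc staggers back.",
            "state": {"enemy_hp": 4}}

async def start_tts(text: str) -> None:
    """Hypothetical TTS call: returns once audio playback begins."""
    await asyncio.sleep(TTS_LATENCY)

async def naive_turn(action: str) -> float:
    """Sequential flow: the player hears nothing until every stage is done."""
    loop = asyncio.get_event_loop()
    start = loop.time()
    result = await call_llm(action)        # wait for the full response...
    await start_tts(result["narration"])   # ...then wait for TTS to kick in
    return loop.time() - start             # total silence before first audio

silence = asyncio.run(naive_turn("slash at the orc"))
# silence is at least LLM_LATENCY + TTS_LATENCY: the delays are additive.
```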
The LLM call can take over a second, and then there's additional latency before the TTS audio starts playing. Added together, the wait is too long. You've just written your action and hit enter, and now you're staring at a loading state. The immersion evaporates immediately.
A spinner doesn't fix this. A progress bar doesn't fix this. The player knows they're waiting for the AI to think, and it feels slow because it is slow.
Here's what that looked like:
[A basic attack that lands seconds later]
The Fix: Distraction Is a Design Pattern
My wife is a seasoned product designer, and she suggested a counterintuitive fix: distract the player.
It's the same principle as a skeleton screen on a slow page load: if something is happening, the wait feels shorter.
So here's what we did:
Two calls instead of one. The moment the player submits their action, we fire two LLM requests simultaneously:
- First call: A lightweight prompt to a smaller, faster model. Its only job is to generate a brief "reaction" to the player's action — something the GM says in the immediate moment, like "Oh, that's an interesting choice..." This call completes in under 500ms.
- Second call: The full, complex prompt that generates the complete narration, resolves combat, and returns the structured game state update. This takes over a second.
The first call's response goes straight to TTS and starts playing audio. There is still some latency, but far less. While the player listens to the GM's quip, the second call finishes in the background. The full narration plays next, seamlessly.
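The two-call pattern can be sketched like this. Both requests are fired as concurrent tasks the moment the player submits; the model calls and TTS here are hypothetical stubs (with latencies scaled down), and `events` stands in for what the player actually hears, in order.

```python
import asyncio

events: list[str] = []  # what the player hears, in order

async def fast_reaction(action: str) -> str:
    """Lightweight model: a quick GM quip (hypothetical stub)."""
    await asyncio.sleep(0.05)  # sub-500ms in production, scaled down here
    return "Oh, that's an interesting choice..."

async def full_turn(action: str) -> dict:
    """Heavyweight model: full narration plus state update (stub)."""
    await asyncio.sleep(0.2)   # over a second in production
    return {"narration": f"You {action}. The orc staggers back.",
            "state": {"enemy_hp": 4}}

async def play_tts(text: str) -> None:
    """Stand-in for sending text to TTS and playing the audio."""
    events.append(text)

async def take_turn(action: str) -> dict:
    # Fire both calls simultaneously the moment the player submits.
    fast_task = asyncio.create_task(fast_reaction(action))
    full_task = asyncio.create_task(full_turn(action))
    # The quip finishes first and goes straight to TTS...
    await play_tts(await fast_task)
    # ...covering the wait while the full call completes in the background.
    result = await full_task
    await play_tts(result["narration"])
    return result

result = asyncio.run(take_turn("slash at the orc"))
# The player hears the quip first, then the full narration.
```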
This only works because the first call is genuinely fast — and that's where Cerebras comes in. Their inference hardware runs significantly faster than standard GPU infrastructure, which is what made sub-500ms responses on the lightweight call achievable. Without that speed margin, the distraction call would itself feel slow, and the whole trick falls apart.
The insight is simple but worth stating clearly: we're not somehow making the AI faster — we're designing around the wait. The second call still takes over a second. The player just doesn't notice.
[The warrior attempts to do a whirlwind attack]
The Hidden Benefit of Waiting for the Full Response
One of the most satisfying things in the game: the battle animations sync to the narration audio.
When the TTS reads "you slash at the orc", the warrior's attack animation plays at exactly that moment. When "the orc staggers back", the enemy flinch animation fires. It makes an enormous difference to how the game feels.
This was only possible because we wait for the full response with structured data before starting audio playback. The complete narration text is available before the first word is spoken, so we can parse it, identify the key action words, calculate their timestamps in the audio, and queue the animations accordingly.
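As a rough sketch of that syncing step: assuming the TTS API can return word-level timing metadata (many TTS services offer some form of this), we can walk the timings, match our trigger words, and queue each animation at the moment its word is spoken. The timings and the keyword-to-animation map below are made up for illustration.

```python
# Hypothetical word timings as a TTS API might return them:
# (word, start time in seconds within the generated audio).
WORD_TIMINGS = [
    ("you", 0.00), ("slash", 0.35), ("at", 0.80), ("the", 0.95),
    ("orc", 1.10), ("the", 2.00), ("orc", 2.15), ("staggers", 2.40),
    ("back", 2.95),
]

# Our own mapping from key action words to animation names (illustrative).
ANIMATIONS = {"slash": "warrior_attack", "staggers": "enemy_flinch"}

def schedule_animations(timings, animations):
    """Walk the narration's word timings and queue an animation
    at the timestamp its trigger word is spoken."""
    schedule = []
    for word, start in timings:
        anim = animations.get(word.lower().strip(".,!?"))
        if anim:
            schedule.append((start, anim))
    return schedule

queue = schedule_animations(WORD_TIMINGS, ANIMATIONS)
# queue == [(0.35, "warrior_attack"), (2.4, "enemy_flinch")]
```

Because the complete narration and audio exist before playback starts, this schedule can be built once per turn and simply consumed as the audio plays.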
The Problem of Freedom
Latency isn't the only thing that makes LLMs unusual to design around.
In the first post, I described how the absence of a fixed action menu is the game's biggest strength — players can do anything.
But this is also a problem. Players will inevitably type things that have nothing to do with the game. Maybe they're testing the system, maybe they're just curious: "What's the weather in Seoul?" "Write me a poem."
We need the AI to recognize when an input isn't what a tabletop player would actually say, and redirect gracefully. The solution is straightforward in concept: instruct the LLM to validate the input and decline off-topic requests.
But the implementation is less clean. If we include the validation instruction in the first call, the lightweight model now has to do more work — and that first call's speed is the whole trick. If we put it in the second call, the first call runs without any guardrails, potentially generating and voicing a reaction to something we shouldn't be responding to at all.
We currently handle it in the second call, which means the first call is unguarded. It works most of the time. It's not elegant. This is an open problem we're still thinking through.
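A minimal sketch of the second-call handling, under the assumption that the prompt instructs the model to set an `off_topic` flag in its structured output (the flag name, redirect line, and response shape here are hypothetical): when the flag is set, we swap in a graceful redirect and leave the game state untouched.

```python
import json

# Hypothetical canned redirect the GM voices for off-topic inputs.
REDIRECT = ("The GM smiles. \"Let's stay in the dungeon, shall we? "
            "The orc is still waiting on your move.\"")

def resolve_turn(raw_response: str) -> dict:
    """Parse the second call's structured output. The model has been
    instructed (hypothetically) to set off_topic for inputs a tabletop
    player wouldn't actually say; those get a redirect and no state change."""
    response = json.loads(raw_response)
    if response.get("off_topic"):
        return {"narration": REDIRECT, "state_update": None}
    return {"narration": response["narration"],
            "state_update": response.get("state_update")}

# "What's the weather in Seoul?" gets flagged and redirected:
handled = resolve_turn('{"off_topic": true}')
# A real action passes through unchanged:
played = resolve_turn(
    '{"off_topic": false, "narration": "You slash at the orc.",'
    ' "state_update": {"enemy_hp": 4}}')
```

The gap the post describes remains: by the time this check runs, the unguarded first call may already have voiced a reaction.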
What We Learned
Working with LLMs in this setting introduced a genuinely new set of UX challenges.
The good news is that these challenges are designable. ChatGPT didn't solve them with a breakthrough; it avoided them with a chat UI. We didn't build a faster model; we designed around the challenges with a distraction pattern and a little creative thinking. Not every problem has a clean solution — our input validation issue still doesn't — but knowing the shape of the problem is most of the work.
If you're fighting LLM latency, the first question worth asking is whether your users actually need to wait for the full response — or just for something to start happening. What already-running things in your flow could buy you the time you need?