How We Learned to Think of LLMs as Compute Engines
Building Yohanun taught us that the biggest architectural mistake teams make with AI isn't technical—it's conceptual.
Anton Mannering
Founder & Chief Architect
When we started building AI applications, we made the same mistake everyone makes: we treated LLMs like they were complete AI systems. Load up a prompt with context, business rules, conversation history, and instructions, then hope the model figures it out.
It worked... sort of. For demos. For simple use cases. But every time we tried to build something production-ready, we hit the same walls.
The Problems We Kept Running Into
Context Limits: We'd carefully craft the perfect prompt with all the relevant history, then hit token limits. Conversations would lose their thread. Important context would get truncated.
Inconsistent Behavior: The same input would produce different outputs. Business rules would be followed sometimes, ignored other times. We'd spend hours prompt engineering for consistency that never quite arrived.
Expensive Everything: Every interaction required a full LLM call with maximum context. Costs scaled terribly. Simple rule checks were burning through tokens.
Testing Nightmares: How do you write unit tests for "send everything to GPT and see what happens"? How do you debug probabilistic systems? How do you audit AI decisions for compliance?
Memory Problems: Each conversation started from scratch. Our AI assistants had goldfish memory. Users would repeat themselves constantly because the system couldn't maintain true continuity.
We kept thinking: "If we just get better at prompt engineering..." But prompt engineering was treating the symptom, not the disease.
The Breakthrough: LLMs Are Not AI Systems
The mental shift came when we stopped thinking of LLMs as complete AI systems and started thinking of them as specialized compute engines.
Just like we don't expect a database to handle business logic, or a GPU to manage memory allocation, we realized we shouldn't expect LLMs to handle memory, rules, context management, and language generation all at once.
LLMs are really good at one thing: transforming text input into contextually appropriate text output. They're incredible at language understanding, generation, and reasoning in the moment. But they're terrible at persistence, consistency, and state management.
What Changed When We Made This Shift
1. Separation of Concerns
Instead of cramming everything into prompts, we separated responsibilities (sketched in code after this list):
- Memory Layer: Stores and retrieves relevant context
- Rules Engine: Handles business logic and compliance
- Context Manager: Decides what information matters for this interaction
- LLM: Generates natural language responses based on processed input
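A minimal sketch of what that separation can look like in code. The class and method names (MemoryLayer, RulesEngine, ContextManager) are illustrative placeholders, not Yohanun's actual API, and the retrieval here is deliberately naive:

```python
from dataclasses import dataclass, field


@dataclass
class MemoryLayer:
    """Stores and retrieves relevant context (could be backed by a dict or a vector DB)."""
    store: dict = field(default_factory=dict)

    def remember(self, user_id: str, fact: str) -> None:
        self.store.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str, query: str) -> list[str]:
        # Naive retrieval: return everything for the user. A real system ranks by relevance.
        return self.store.get(user_id, [])


class RulesEngine:
    """Deterministic business logic; no LLM involved."""

    def check(self, user_tier: str, issue_type: str) -> dict:
        return {
            "escalate": user_tier == "enterprise" and issue_type == "billing",
            "allowed_actions": ["refund", "credit"] if issue_type == "billing" else ["apologize"],
        }


class ContextManager:
    """Decides what information matters for this interaction."""

    def assemble(self, memories: list[str], rules: dict, message: str) -> str:
        return (
            f"Relevant history: {memories}\n"
            f"Applicable rules: {rules}\n"
            f"User message: {message}"
        )
```

The LLM itself is the fourth piece: it only ever sees the compact, structured output of the other three.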
2. Deterministic + Probabilistic
We realized we needed both:
- Deterministic systems for rules, memory, and context (must be reliable)
- Probabilistic systems for language generation (can be creative)
Mixing them was the mistake. Now business rules run deterministically first, and only then do we invoke the LLM for language generation.
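One way to make the split concrete: the rule check is a pure, unit-testable function, and only the last step is a probabilistic model call. `llm_generate` here is a stand-in for whatever model client you use, not a specific SDK:

```python
def refund_allowed(amount: float, account_age_days: int) -> bool:
    """Deterministic business rule: same input, same output, trivially unit-testable."""
    return amount <= 100 or account_age_days > 365


def answer_refund_request(amount: float, account_age_days: int, llm_generate) -> str:
    # 1. Deterministic decision first.
    allowed = refund_allowed(amount, account_age_days)

    # 2. Probabilistic step only for phrasing the answer, never for the decision itself.
    decision = "approved" if allowed else "denied"
    return llm_generate(
        f"Write a short, polite message telling the customer their refund was {decision}."
    )
```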
3. Cost Optimization
With the compute model, we only call LLMs when we actually need language generation:
- Rule evaluation: Local computation (cheap)
- Memory retrieval: Vector search (cheap)
- Context assembly: Data processing (cheap)
- Language generation: LLM call (expensive)
Our token usage dropped by 60-80% while performance improved.
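In practice the saving comes from routing: the cheap local steps run on every request, and the expensive LLM call happens only when we actually need prose. A rough sketch, with `vector_search` and `llm_generate` standing in for your own retrieval and model clients:

```python
def handle_request(message: str, user: dict, llm_generate, vector_search) -> str:
    # Cheap: deterministic rule evaluation, no tokens spent.
    if user.get("blocked"):
        return "Your account is currently suspended. Please contact support."

    # Cheap: vector search over stored context.
    context = vector_search(message, top_k=3)

    # Cheap: assemble only the relevant pieces into a compact prompt.
    prompt = f"Context: {context}\nUser: {message}\nRespond helpfully."

    # Expensive: the single LLM call, made only when language generation is required.
    return llm_generate(prompt)
```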
4. Model Agnostic Architecture
When LLMs are just compute engines, they become swappable:
- Memory and rules persist across model changes
- We can A/B test different LLMs for the same application
- Local models, cloud models, or specialized models: the application layer doesn't care (see the sketch below)
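Swappability falls out of putting the model behind a narrow interface. A sketch using a Python Protocol; the two backends are stubs, not real SDK calls:

```python
import random
from typing import Protocol


class TextGenerator(Protocol):
    """The only thing the application layer knows about an LLM."""
    def generate(self, prompt: str) -> str: ...


class CloudModel:
    def generate(self, prompt: str) -> str:
        # Call your hosted provider's SDK here.
        raise NotImplementedError


class LocalModel:
    def generate(self, prompt: str) -> str:
        # Call a locally served model here.
        raise NotImplementedError


def pick_model(ab_test_ratio: float = 0.1) -> TextGenerator:
    """A/B test two backends without touching memory, rules, or context code."""
    return LocalModel() if random.random() < ab_test_ratio else CloudModel()
```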
The GPU Parallel
This reminded us of the GPU revolution in graphics and machine learning.
Before GPU compute: CPUs handled everything, badly. Graphics were slow. Large-scale ML training was impractical.
After GPU compute: CPUs handle coordination and logic. GPUs handle parallel computation. Massive performance gains.
AI equivalent:
- Before: LLMs handle everything, inconsistently
- After: Infrastructure handles memory/rules/context. LLMs handle language. Better results, lower costs.
What This Looks Like in Practice
Here's a customer service AI interaction in both models:
The "LLM as System" Approach
User: "I'm still having the billing issue we discussed yesterday"
Prompt: [Conversation history] + [Company policies] + [User profile] +
[Billing rules] + [Escalation procedures] +
"User says: I'm still having the billing issue we discussed yesterday"
LLM: [Generates response, may or may not follow all rules consistently]
The "LLM as Compute" Approach
User: "I'm still having the billing issue we discussed yesterday"
1. Memory Layer: Retrieves yesterday's conversation context
2. Rules Engine: Checks user tier, escalation rules, billing policies
3. Context Manager: Assembles relevant information for this specific query
4. LLM: Generates response based on processed, structured input
Result: Consistent rule following + natural language + full context
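Sketched in code, those four steps might compose like this, reusing the hypothetical components from the separation-of-concerns sketch above. Every name is illustrative; a production orchestrator would do more:

```python
def handle_billing_followup(user_id: str, message: str,
                            memory, rules, context, llm_generate) -> str:
    # 1. Memory layer: pull yesterday's billing conversation.
    history = memory.recall(user_id, query="billing issue")

    # 2. Rules engine: deterministic policy checks.
    policy = rules.check(user_tier="enterprise", issue_type="billing")

    # 3. Context manager: assemble a compact, structured prompt.
    prompt = context.assemble(history, policy, message)

    # 4. LLM: language generation over processed, structured input.
    return llm_generate(prompt)
```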
Why This Matters Beyond Yohanun
We're not the only team that's figured this out, but we think more teams should adopt this mental model because:
1. The Economics Are Shifting
LLM costs are dropping, but infrastructure costs for memory and rules remain constant. The compute model optimizes for this economic reality.
2. Compliance and Auditability
Regulated industries need deterministic rule following and audit trails. You can't audit a prompt, but you can audit a rules engine.
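This is also what makes audit trails tractable: every deterministic decision can be logged with its inputs before any model is invoked. A minimal sketch, assuming a simple in-memory log:

```python
import json
import time


def audited_rule_check(rule_name: str, inputs: dict, decision: bool, audit_log: list) -> bool:
    """Record every rule evaluation so compliance can replay what was decided and why."""
    audit_log.append({
        "timestamp": time.time(),
        "rule": rule_name,
        "inputs": inputs,
        "decision": decision,
    })
    return decision


# The decision is reproducible, and the log entry is a plain, queryable record.
log: list = []
approved = audited_rule_check("refund_under_limit", {"amount": 42.0}, 42.0 <= 100, log)
print(json.dumps(log, indent=2))
```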
3. Enterprise Requirements
Real applications need consistency, persistence, and integration with existing systems. The compute model enables this.
4. Developer Experience
It's much easier to reason about, test, and debug systems with clear separation of concerns.
The Infrastructure Implications
Thinking of LLMs as compute leads to different infrastructure choices:
Traditional AI Stack:
Application → LLM → Response
Compute-Oriented AI Stack:
Application → Memory Layer (Vector DB) → Rules Engine (Business Logic) → Context Manager (Semantic Router) → LLM → Response
This is why we built Yohanun. Not because we wanted to build infrastructure, but because we kept rebuilding the same memory, rules, and context management pieces for every AI application.
What We Got Wrong Initially
We thought: LLMs are so smart, they should handle everything
Reality: LLMs are powerful but specialized tools
We thought: More context in prompts = better results
Reality: Relevant context + clear responsibilities = better results
We thought: Prompt engineering would solve consistency problems
Reality: Architectural patterns solve consistency problems
Looking Forward
The AI infrastructure landscape is still evolving, but we believe the compute model is the right direction because:
- It matches how LLMs actually work (stateless functions)
- It optimizes for cost and performance
- It enables enterprise requirements (auditability, consistency)
- It's composable and maintainable
The teams building the most sophisticated AI applications are already thinking this way. They're treating LLMs as powerful but specialized compute resources, and building proper infrastructure around them.
We think this shift from "LLM as AI system" to "LLM as compute engine" is as important as the shift from "server as pet" to "server as cattle" was for cloud computing.
The question isn't whether this will happen—it's whether teams will build this infrastructure themselves or use purpose-built solutions. We built Yohanun because we got tired of rebuilding the same cognitive infrastructure over and over.
But the important thing isn't which solution you choose. The important thing is making the mental shift: LLMs are compute engines, not complete AI systems. Everything else follows from there.
Want to see how this works in practice?
We're happy to show you how Yohanun implements the compute model, or discuss how to apply these patterns in your own AI infrastructure.