How We Learned to Think of LLMs as Compute Engines
Building Yohanun taught us that the biggest architectural mistake teams make with AI isn't technical—it's conceptual.
Anton Mannering
Founder & Chief Architect
When we started building AI applications, we made the same mistake everyone makes: we treated LLMs like they were complete AI systems. Load up a prompt with context, business rules, conversation history, and instructions, then hope the model figures it out.
It worked... sort of. For demos. For simple use cases. But every time we tried to build something production-ready, we hit the same walls.
The Problems We Kept Running Into
Context Limits: We'd carefully craft the perfect prompt with all the relevant history, then hit token limits. Conversations would lose their thread. Important context would get truncated.
Inconsistent Behavior: The same input would produce different outputs. Business rules would be followed sometimes, ignored other times. We'd spend hours prompt engineering for consistency that never quite arrived.
Expensive Everything: Every interaction required a full LLM call with maximum context. Costs scaled terribly. Simple rule checks were burning through tokens.
Testing Nightmares: How do you write unit tests for "send everything to GPT and see what happens"? How do you debug probabilistic systems? How do you audit AI decisions for compliance?
Memory Problems: Each conversation started from scratch. Our AI assistants had goldfish memory. Users would repeat themselves constantly because the system couldn't maintain true continuity.
We kept thinking: "If we just get better at prompt engineering..." But prompt engineering was treating the symptom, not the disease.
The Breakthrough: LLMs Are Not AI Systems
The mental shift came when we stopped thinking of LLMs as complete AI systems and started thinking of them as specialized compute engines.
Just like we don't expect a database to handle business logic, or a GPU to manage memory allocation, we realized we shouldn't expect LLMs to handle memory, rules, context management, and language generation all at once.
LLMs are really good at one thing: transforming text input into contextually appropriate text output. They're incredible at language understanding, generation, and reasoning in the moment. But they're terrible at persistence, consistency, and state management.
What Changed When We Made This Shift
1. Separation of Concerns
Instead of cramming everything into prompts, we separated responsibilities (sketched in code after this list):
- Memory Layer: Stores and retrieves relevant context
- Rules Engine: Handles business logic and compliance
- Context Manager: Decides what information matters for this interaction
- LLM: Generates natural language responses based on processed input
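A minimal sketch of what that separation can look like in code. The class and method names (MemoryLayer, RulesEngine, ContextManager) are illustrative placeholders, not Yohanun's actual API, and the retrieval here is deliberately naive:

```python
from dataclasses import dataclass, field


@dataclass
class MemoryLayer:
    """Stores and retrieves relevant context (could be backed by a dict or a vector DB)."""
    store: dict = field(default_factory=dict)

    def remember(self, user_id: str, fact: str) -> None:
        self.store.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str, query: str) -> list[str]:
        # Naive retrieval: return everything for the user. A real system ranks by relevance.
        return self.store.get(user_id, [])


class RulesEngine:
    """Deterministic business logic; no LLM involved."""

    def check(self, user_tier: str, issue_type: str) -> dict:
        return {
            "escalate": user_tier == "enterprise" and issue_type == "billing",
            "allowed_actions": ["refund", "credit"] if issue_type == "billing" else ["apologize"],
        }


class ContextManager:
    """Decides what information matters for this interaction."""

    def assemble(self, memories: list[str], rules: dict, message: str) -> str:
        return (
            f"Relevant history: {memories}\n"
            f"Applicable rules: {rules}\n"
            f"User message: {message}"
        )
```

The LLM itself is the fourth piece: it only ever sees the compact, structured output of the other three.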
2. Deterministic + Probabilistic
We realized we needed both:
- Deterministic systems for rules, memory, and context (must be reliable)
- Probabilistic systems for language generation (can be creative)
Mixing them was the mistake. Now business rules run deterministically first, and only then do we invoke the LLM for language generation.
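One way to make the split concrete: the rule check is a pure, unit-testable function, and only the last step is a probabilistic model call. `llm_generate` here is a stand-in for whatever model client you use, not a specific SDK:

```python
def refund_allowed(amount: float, account_age_days: int) -> bool:
    """Deterministic business rule: same input, same output, trivially unit-testable."""
    return amount <= 100 or account_age_days > 365


def answer_refund_request(amount: float, account_age_days: int, llm_generate) -> str:
    # 1. Deterministic decision first.
    allowed = refund_allowed(amount, account_age_days)

    # 2. Probabilistic step only for phrasing the answer, never for the decision itself.
    decision = "approved" if allowed else "denied"
    return llm_generate(
        f"Write a short, polite message telling the customer their refund was {decision}."
    )
```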
3. Cost Optimization
With the compute model, we only call LLMs when we actually need language generation:
- Rule evaluation: Local computation (cheap)
- Memory retrieval: Vector search (cheap)
- Context assembly: Data processing (cheap)
- Language generation: LLM call (expensive)
Our token usage dropped by 60-80% while performance improved.
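In practice the saving comes from routing: the cheap local steps run on every request, and the expensive LLM call happens only when we actually need prose. A rough sketch, with `vector_search` and `llm_generate` standing in for your own retrieval and model clients:

```python
def handle_request(message: str, user: dict, llm_generate, vector_search) -> str:
    # Cheap: deterministic rule evaluation, no tokens spent.
    if user.get("blocked"):
        return "Your account is currently suspended. Please contact support."

    # Cheap: vector search over stored context.
    context = vector_search(message, top_k=3)

    # Cheap: assemble only the relevant pieces into a compact prompt.
    prompt = f"Context: {context}\nUser: {message}\nRespond helpfully."

    # Expensive: the single LLM call, made only when language generation is required.
    return llm_generate(prompt)
```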
4. Model Agnostic Architecture
When LLMs are just compute engines, they become swappable:
- Memory and rules persist across model changes
- We can A/B test different LLMs for the same application
- Local models, cloud models, or specialized models: the application layer doesn't care (see the sketch below)
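Swappability falls out of putting the model behind a narrow interface. A sketch using a Python Protocol; the two backends are stubs, not real SDK calls:

```python
import random
from typing import Protocol


class TextGenerator(Protocol):
    """The only thing the application layer knows about an LLM."""
    def generate(self, prompt: str) -> str: ...


class CloudModel:
    def generate(self, prompt: str) -> str:
        # Call your hosted provider's SDK here.
        raise NotImplementedError


class LocalModel:
    def generate(self, prompt: str) -> str:
        # Call a locally served model here.
        raise NotImplementedError


def pick_model(ab_test_ratio: float = 0.1) -> TextGenerator:
    """A/B test two backends without touching memory, rules, or context code."""
    return LocalModel() if random.random() < ab_test_ratio else CloudModel()
```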
The GPU Parallel
This reminded us of the GPU revolution in graphics and machine learning.
Before GPU compute: CPUs handled everything, badly. Graphics were slow. Large-scale ML training was impractical.
After GPU compute: CPUs handle coordination and logic. GPUs handle parallel computation. Massive performance gains.
AI equivalent:
- Before: LLMs handle everything, inconsistently
- After: Infrastructure handles memory/rules/context. LLMs handle language. Better results, lower costs.
What This Looks Like in Practice
Here's a customer service AI interaction in both models:
The "LLM as System" Approach
User: "I'm still having the billing issue we discussed yesterday"
Prompt: [Conversation history] + [Company policies] + [User profile] +
[Billing rules] + [Escalation procedures] +
"User says: I'm still having the billing issue we discussed yesterday"
LLM: [Generates response, may or may not follow all rules consistently]
The "LLM as Compute" Approach
User: "I'm still having the billing issue we discussed yesterday"
1. Memory Layer: Retrieves yesterday's conversation context
2. Rules Engine: Checks user tier, escalation rules, billing policies
3. Context Manager: Assembles relevant information for this specific query
4. LLM: Generates response based on processed, structured input
Result: Consistent rule following + natural language + full context
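Sketched in code, those four steps might compose like this, reusing the hypothetical components from the separation-of-concerns sketch above. Every name is illustrative; a production orchestrator would do more:

```python
def handle_billing_followup(user_id: str, message: str,
                            memory, rules, context, llm_generate) -> str:
    # 1. Memory layer: pull yesterday's billing conversation.
    history = memory.recall(user_id, query="billing issue")

    # 2. Rules engine: deterministic policy checks.
    policy = rules.check(user_tier="enterprise", issue_type="billing")

    # 3. Context manager: assemble a compact, structured prompt.
    prompt = context.assemble(history, policy, message)

    # 4. LLM: language generation over processed, structured input.
    return llm_generate(prompt)
```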
Why This Matters Beyond Yohanun
We're not the only team that's figured this out, but we think more teams should adopt this mental model because:
1. The Economics Are Shifting
LLM costs are dropping, but infrastructure costs for memory and rules remain constant. The compute model optimizes for this economic reality.
2. Compliance and Auditability
Regulated industries need deterministic rule following and audit trails. You can't audit a prompt, but you can audit a rules engine.
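This is also what makes audit trails tractable: every deterministic decision can be logged with its inputs before any model is invoked. A minimal sketch, assuming a simple in-memory log:

```python
import json
import time


def audited_rule_check(rule_name: str, inputs: dict, decision: bool, audit_log: list) -> bool:
    """Record every rule evaluation so compliance can replay what was decided and why."""
    audit_log.append({
        "timestamp": time.time(),
        "rule": rule_name,
        "inputs": inputs,
        "decision": decision,
    })
    return decision


# The decision is reproducible, and the log entry is a plain, queryable record.
log: list = []
approved = audited_rule_check("refund_under_limit", {"amount": 42.0}, 42.0 <= 100, log)
print(json.dumps(log, indent=2))
```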
3. Enterprise Requirements
Real applications need consistency, persistence, and integration with existing systems. The compute model enables this.
4. Developer Experience
It's much easier to reason about, test, and debug systems with clear separation of concerns.
The Infrastructure Implications
Thinking of LLMs as compute leads to different infrastructure choices:
Traditional AI Stack:
Application → LLM → Response
Compute-Oriented AI Stack:
Application → Memory Layer (Vector DB) → Rules Engine (Business Logic) → Context Manager (Semantic Router) → LLM → Response
This is why we built Yohanun. Not because we wanted to build infrastructure, but because we kept rebuilding the same memory, rules, and context management pieces for every AI application.
What We Got Wrong Initially
We thought: LLMs are so smart, they should handle everything
Reality: LLMs are powerful but specialized tools
We thought: More context in prompts = better results
Reality: Relevant context + clear responsibilities = better results
We thought: Prompt engineering would solve consistency problems
Reality: Architectural patterns solve consistency problems
Looking Forward
The AI infrastructure landscape is still evolving, but we believe the compute model is the right direction because:
- It matches how LLMs actually work (stateless functions)
- It optimizes for cost and performance
- It enables enterprise requirements (auditability, consistency)
- It's composable and maintainable
The teams building the most sophisticated AI applications are already thinking this way. They're treating LLMs as powerful but specialized compute resources, and building proper infrastructure around them.
We think this shift from "LLM as AI system" to "LLM as compute engine" is as important as the shift from "server as pet" to "server as cattle" was for cloud computing.
The question isn't whether this will happen—it's whether teams will build this infrastructure themselves or use purpose-built solutions. We built Yohanun because we got tired of rebuilding the same cognitive infrastructure over and over.
But the important thing isn't which solution you choose. The important thing is making the mental shift: LLMs are compute engines, not complete AI systems. Everything else follows from there.
Want to see how this works in practice?
We're happy to show you how Yohanun implements the compute model, or discuss how to apply these patterns in your own AI infrastructure.