// LEARN · STATE SPACE MODELS

Mamba inference, explained.
Why state-space matters.

Mamba is the linear-time alternative to transformer attention, introduced in late 2023 as the first serious challenge to the transformer's architectural dominance. Better at long context. Lower memory. Faster on edge. EdgeFlow runs Mamba, Mamba-2, and Jamba natively across NVIDIA, AMD, Intel, ARM, and Qualcomm. SandLogic's Samba-ASR, a Mamba-based speech model, delivers a 51% average WER reduction vs Whisper-large-v3.

Mamba complexity
O(n)
Transformer complexity
O(n²)
Samba-ASR vs Whisper
−51% WER
Krsna SSM support
Production
// THE SHORT ANSWER

Mamba is a State Space Model (SSM) architecture that scales linearly with sequence length: O(n) compute and constant memory per step, versus transformer attention's O(n²). Practically: Mamba handles 100k+ token contexts without memory blow-up, runs efficiently on edge silicon where attention is memory-bound, and powers SandLogic's Samba-ASR (51% average WER reduction vs Whisper-large-v3, per arXiv 2501.02832). EdgeFlow inside EdgeMatrix runs Mamba, Mamba-2, and Jamba natively across NVIDIA, AMD, Intel, ARM, and Qualcomm: first-class architecture support, not a workaround that emulates SSMs through attention-first plumbing.

Linear-time sequence modeling. No attention required.

Mamba was introduced by Gu & Dao in late 2023 as a selective State Space Model — a different paradigm for sequence modeling than transformer attention. The core insight: replace the pairwise-attention bottleneck (every token attending to every other) with a selective state-space mechanism that processes sequences via efficient scan operations. The model maintains an internal hidden state that gets updated per token; the state itself does the work that attention was doing.
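
The recurrence is easier to see in code. Below is a minimal single-channel sketch of the selective scan in NumPy; the function and variable names are ours, and the real implementation fuses this loop into a hardware-optimized kernel and learns B, C, and the step size from the input.

  # Minimal selective-SSM recurrence (illustrative, not the reference kernel).
  import numpy as np

  def selective_ssm_step(h, x_t, A, B_t, C_t, delta_t):
      # h: (d_state,) hidden state carried across tokens
      # A: (d_state,) diagonal state-transition parameters (negative, stable)
      # B_t, C_t, delta_t: input-dependent ("selective") per-token values
      A_bar = np.exp(delta_t * A)            # zero-order-hold discretization
      B_bar = (A_bar - 1.0) / A * B_t
      h = A_bar * h + B_bar * x_t            # state update: O(d_state)
      y_t = np.dot(C_t, h)                   # output readout for this token
      return h, y_t

  rng = np.random.default_rng(0)
  d_state, n = 16, 1000
  A = -np.exp(rng.standard_normal(d_state))
  h = np.zeros(d_state)                      # the ONLY memory carried forward
  for t in range(n):                         # O(n) total compute
      x_t = rng.standard_normal()
      # In Mamba, B_t, C_t, delta_t come from learned projections of x_t;
      # random values stand in for those projections here.
      B_t, C_t = rng.standard_normal(d_state), rng.standard_normal(d_state)
      delta_t = 0.1 * np.exp(rng.standard_normal())
      h, y_t = selective_ssm_step(h, x_t, A, B_t, C_t, delta_t)

Note what is absent: there is no growing cache of past tokens. The hidden state h is the entire memory, which is where the constant-memory-per-step property comes from.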

The architectural family has evolved rapidly: Mamba-2 (improved training-time parallelism), Jamba (hybrid Mamba + Transformer for selective short-context attention plus long-context efficiency), and a growing landscape of SSM variants. They all share the linear-time signature.

// COMPLEXITY COMPARISON
SEQUENCE LENGTH
n

Token count in the context.

TRANSFORMER ATTENTION
O(n²)

Compute + memory scale quadratically. Doubling context = 4× cost.

MAMBA / SSM
O(n)

Compute scales linearly; constant memory per inference step.
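
To put rough numbers on the comparison above, a back-of-envelope sketch (constants omitted; only the growth rates matter):

  for n in (10_000, 100_000):
      attention_ops = n * n       # every token attends to every other token
      ssm_ops = n                 # one constant-size state update per token
      print(f"n={n:>7,}: attention ~{attention_ops:.0e} ops "
            f"({attention_ops // n:,} per token), SSM ~{ssm_ops:.0e} ops "
            f"(1 per token)")
  # 10x the context: 100x attention compute, but only 10x SSM compute.

The per-token attention cost grows with n, while the SSM's stays constant. That is the whole complexity story.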

// WHY IT MATTERS FOR INFERENCE

Mamba is edge-native.

The architecture matters most where transformer attention hurts most: long context, memory-constrained silicon, and sustained streaming workloads.

// USE CASE 01

Long-context inference (100k+ tokens)

Document understanding, long meeting transcripts, RAG over book-length context. Transformer attention cost grows quadratically, and latency and memory degrade sharply past 32k tokens in practice. Mamba's cost is linear in sequence length: as the scaling sketch above shows, 100k tokens costs the same per token as 10k tokens.

// USE CASE 02

Edge silicon (memory-constrained)

Smartwatches, smart speakers, AI note-takers, smart glasses — total RAM measured in gigabytes. Constant-memory Mamba inference fits where O(n²) attention can't. Krsna SoC supports SSM as one of four production model families specifically for this.

// USE CASE 03

Streaming speech (continuous audio)

Long-form speech recognition: 1-hour meetings, podcast transcription, continuous voice capture. Samba-ASR (Mamba-based) handles streaming audio with constant memory per audio second. Sruthi, the SandLogic ASR product, is built on this advantage, delivering a 51% average WER reduction vs Whisper-large-v3.
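
A sketch of why streaming memory stays flat, assuming a hypothetical chunked encoder (the function below is illustrative, not the Samba-ASR API):

  import numpy as np

  def encode_chunk(chunk, state):
      # Stand-in for one SSM pass over a chunk of audio frames: each frame
      # folds into the fixed-size state; nothing else is retained.
      for frame in chunk:
          state = 0.9 * state + 0.1 * frame  # toy state update
      return state

  d_state = 64
  state = np.zeros(d_state)                  # the only carried memory
  for second in range(3600):                 # one hour of audio, 1 s chunks
      chunk = np.random.randn(100, d_state)  # stand-in for 100 feature frames
      state = encode_chunk(chunk, state)
  # Memory is O(d_state) whether the stream lasts one minute or one hour;
  # an attention encoder's cache would grow with every second of audio.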

// USE CASE 04

Multi-user agentic workloads

Each user maintains a long conversational state. Per-user transformer KV-cache balloons; per-user Mamba SSM hidden state stays small. Higher concurrency on the same silicon. The economic argument compounds.
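
The memory arithmetic below uses illustrative parameters (a 32-layer model at 100k-token context, fp16); exact figures vary by model, but the seq_len factor in the KV term is the point:

  # Per-user inference memory: transformer KV-cache vs Mamba SSM state.
  layers, heads, head_dim, seq_len, bytes_fp16 = 32, 32, 128, 100_000, 2
  kv_cache = 2 * layers * heads * head_dim * seq_len * bytes_fp16  # K and V
  d_inner, d_state = 4096, 16
  ssm_state = layers * d_inner * d_state * bytes_fp16  # no seq_len factor
  print(f"KV-cache per user : {kv_cache / 2**30:.1f} GiB")   # ~48.8 GiB
  print(f"SSM state per user: {ssm_state / 2**20:.1f} MiB")  # ~4.0 MiB

Roughly four orders of magnitude per user, under these assumptions, is what "higher concurrency on the same silicon" cashes out to.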

First-class state space. Not an afterthought.

Most inference frameworks were built for transformers, with state-space support bolted on as a special case. EdgeFlow treats SSMs as a peer architecture family — same dispatch infrastructure, same kernel pipeline, same hardware coverage. Four engineered layers drive the acceleration.

// LAYER 01

Architecture recognition

CORE recognizes Mamba / Mamba-2 / Jamba at compile time via the model graph structure (selective state-space layers, scan operations, the absence of attention blocks). It selects the SSM-appropriate kernel sequence — not a transformer kernel sequence.
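
In spirit, the recognition step is a graph-structure check. A hypothetical sketch (the op names and classification logic are ours, not EdgeFlow's internal IR):

  SSM_OPS = {"selective_scan", "ssm_conv1d"}          # hypothetical op names
  ATTENTION_OPS = {"softmax_attention", "flash_attention"}

  def classify_architecture(op_types: set) -> str:
      has_ssm = bool(op_types & SSM_OPS)
      has_attn = bool(op_types & ATTENTION_OPS)
      if has_ssm and has_attn:
          return "jamba-hybrid"   # interleaved SSM + attention blocks
      if has_ssm:
          return "mamba"          # pure SSM: select the scan kernel pipeline
      return "transformer"        # fall back to the attention pipeline

  assert classify_architecture({"selective_scan", "ssm_conv1d"}) == "mamba"
  assert classify_architecture({"selective_scan", "flash_attention"}) == "jamba-hybrid"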

// LAYER 02

Scan-kernel selection

The hardware-aware compiler picks the right scan implementation for each silicon target. NVIDIA: parallel scan on CUDA. AMD: ROCm-tuned scan. ARM: NEON-vectorized scan. Krsna: native scan primitives in the spatial array. Same model, different kernel per silicon, all driven by the compiler.
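
Conceptually this is a per-target dispatch table. A sketch with made-up kernel names (the real compiler also weighs sequence length, batch size, and memory layout):

  SCAN_KERNELS = {                           # kernel names are illustrative
      "nvidia": "cuda_parallel_scan",        # Blelloch-style scan on CUDA
      "amd":    "rocm_parallel_scan",
      "arm":    "neon_vectorized_scan",
      "krsna":  "spatial_array_scan",        # native scan primitive on-chip
  }

  def select_scan_kernel(target: str) -> str:
      if target not in SCAN_KERNELS:
          raise ValueError(f"no scan kernel registered for {target!r}")
      return SCAN_KERNELS[target]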

// LAYER 03

Memory management

Mamba's constant-memory-per-step property requires careful state caching. EdgeFlow's hybrid cache architecture extends from KV-cache (transformer-style) to SSM hidden state (Mamba-style) — the same memory management framework handles both architectures.
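
One way to picture the hybrid cache, as a sketch with illustrative names: the transformer-style cache appends per token, while the SSM-style state is fixed-size and overwritten in place.

  from dataclasses import dataclass, field

  @dataclass
  class KVCache:                        # transformer-style: grows per token
      keys: list = field(default_factory=list)
      values: list = field(default_factory=list)
      def append(self, k, v):
          self.keys.append(k)           # memory grows with sequence length
          self.values.append(v)

  @dataclass
  class SSMStateCache:                  # Mamba-style: fixed-size, in place
      h: object
      def update(self, new_h):
          self.h = new_h                # overwrite; no growth with tokens

A Jamba layer stack needs both at once, which is why the two live behind one cache framework.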

// LAYER 04

Throughput optimization

For long-context Mamba inference, the bottleneck is often the state-update step. EdgeFlow batches state updates across requests where the cache layout allows, driving throughput up on multi-user agentic workloads where each user has a long conversational context.
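
A sketch of the batching idea (shapes and names are illustrative): stack per-user states into one tensor and apply the update as a single fused operation rather than one small launch per user.

  import numpy as np

  def batched_state_update(H, A_bar, Bx):
      # H: (users, d_state) stacked per-user hidden states.
      # One elementwise op updates every user's state in a single launch.
      return A_bar * H + Bx

  users, d_state = 64, 16
  H = np.zeros((users, d_state))
  A_bar = np.full((users, d_state), 0.95)
  Bx = np.random.randn(users, d_state)
  H = batched_state_update(H, A_bar, Bx)   # amortizes per-request overhead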

Samba-ASR: −51% WER vs Whisper.

The most direct evidence that Mamba inference works at production quality on the SandLogic stack is Samba-ASR — the underlying model of the Sruthi ASR product. Published as arXiv 2501.02832, Samba-ASR achieves a 51% average word error rate reduction versus Whisper-large-v3, OpenAI's open-source ASR baseline.

// ARCHITECTURE

Mamba (SSM)

Selective state-space scan. Linear-time processing of audio streams. Constant memory per audio second.

// PERFORMANCE

−51% WER

Average word error rate reduction vs Whisper-large-v3. Stronger advantage on long-form audio, accented English, code-switched Indic speech.

// DEPLOYMENT

22 Indic languages

Production deployment as Sruthi inside the Lingo platform. 21+ enterprises running today. On-prem or cloud.

// GO DEEPER

Continue exploring state-space and the runtime.

  • /sruthi — the production ASR product built on Samba-ASR. −51% WER, 22 Indic languages, on-prem deployment.
  • /edgeflow — the inference engine that runs Mamba natively across all silicon families.
  • /any-ai — the chip-level Any AI claim: Krsna supports SSM as one of four production model families.
  • /research — the arXiv 2501.02832 paper and the rest of the SandLogic research output.
  • /compare/vllm-vs-edgeflow — why EdgeFlow handles Mamba natively where vLLM has limited support.
// LET'S BUILD

Mamba in production. Talk to engineering.