// LEARN · STATE SPACE MODELS

Mamba inference, explained.
Why state-space matters.

Mamba is the linear-time alternative to transformer attention, introduced in late 2023 as the first serious challenge to the transformer's architectural dominance. Better at long context. Lower memory. Faster on edge. EdgeFlow runs Mamba, Mamba-2, and Jamba natively across NVIDIA, AMD, Intel, ARM, and Qualcomm. SandLogic's Samba-ASR, a Mamba-based speech model, delivers a 51% average WER reduction vs Whisper-large-v3.

Mamba complexity
O(n)
Transformer complexity
O(n²)
Samba-ASR vs Whisper
−51% WER
Krsna SSM support
Production
// THE SHORT ANSWER

Mamba is a State Space Model (SSM) architecture that scales linearly with sequence length: O(n) compute and constant memory per step, versus transformer attention's O(n²). Practically: Mamba handles 100k+ token contexts without memory blow-up, runs efficiently on edge silicon where attention is memory-bound, and powers SandLogic's Samba-ASR (51% average WER reduction vs Whisper-large-v3, per arXiv 2501.02832). EdgeFlow inside EdgeMatrix runs Mamba, Mamba-2, and Jamba natively across NVIDIA, AMD, Intel, ARM, and Qualcomm: first-class architecture support, not a workaround that emulates SSMs through attention-first plumbing.

Linear-time sequence modeling. No attention required.

Mamba was introduced by Gu & Dao in late 2023 as a selective State Space Model — a different paradigm for sequence modeling than transformer attention. The core insight: replace the pairwise-attention bottleneck (every token attending to every other) with a selective state-space mechanism that processes sequences via efficient scan operations. The model maintains an internal hidden state that gets updated per token; the state itself does the work that attention was doing.
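
The recurrence is easier to see in code. Below is a minimal single-channel sketch of the selective scan in NumPy; the function and variable names are ours, and the real implementation fuses this loop into a hardware-optimized kernel and learns B, C, and the step size from the input.

  # Minimal selective-SSM recurrence (illustrative, not the reference kernel).
  import numpy as np

  def selective_ssm_step(h, x_t, A, B_t, C_t, delta_t):
      # h: (d_state,) hidden state carried across tokens
      # A: (d_state,) diagonal state-transition parameters (negative, stable)
      # B_t, C_t, delta_t: input-dependent ("selective") per-token values
      A_bar = np.exp(delta_t * A)            # zero-order-hold discretization
      B_bar = (A_bar - 1.0) / A * B_t
      h = A_bar * h + B_bar * x_t            # state update: O(d_state)
      y_t = np.dot(C_t, h)                   # output readout for this token
      return h, y_t

  rng = np.random.default_rng(0)
  d_state, n = 16, 1000
  A = -np.exp(rng.standard_normal(d_state))
  h = np.zeros(d_state)                      # the ONLY memory carried forward
  for t in range(n):                         # O(n) total compute
      x_t = rng.standard_normal()
      # In Mamba, B_t, C_t, delta_t come from learned projections of x_t;
      # random values stand in for those projections here.
      B_t, C_t = rng.standard_normal(d_state), rng.standard_normal(d_state)
      delta_t = 0.1 * np.exp(rng.standard_normal())
      h, y_t = selective_ssm_step(h, x_t, A, B_t, C_t, delta_t)

Note what is absent: there is no growing cache of past tokens. The hidden state h is the entire memory, which is where the constant-memory-per-step property comes from.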

The architectural family has evolved rapidly: Mamba-2 (improved training-time parallelism), Jamba (hybrid Mamba + Transformer for selective short-context attention plus long-context efficiency), and a growing landscape of SSM variants. They all share the linear-time signature.

// COMPLEXITY COMPARISON
SEQUENCE LENGTH
n

Token count in the context.

TRANSFORMER ATTENTION
O(n²)

Compute + memory scale quadratically. Doubling context = 4× cost.

MAMBA / SSM
O(n)

Compute scales linearly; constant memory per inference step.
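
To put rough numbers on the comparison above, a back-of-envelope sketch (constants omitted; only the growth rates matter):

  for n in (10_000, 100_000):
      attention_ops = n * n       # every token attends to every other token
      ssm_ops = n                 # one constant-size state update per token
      print(f"n={n:>7,}: attention ~{attention_ops:.0e} ops "
            f"({attention_ops // n:,} per token), SSM ~{ssm_ops:.0e} ops "
            f"(1 per token)")
  # 10x the context: 100x attention compute, but only 10x SSM compute.

The per-token attention cost grows with n, while the SSM's stays constant. That is the whole complexity story.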

// WHY IT MATTERS FOR INFERENCE

Mamba is edge-native.

The architecture matters most where transformer attention hurts most: long context, memory-constrained silicon, and sustained streaming workloads.

// USE CASE 01

Long-context inference (100k+ tokens)

Document understanding, long meeting transcripts, RAG over book-length context. Transformer attention cost grows quadratically, and latency and memory degrade sharply past 32k tokens in practice. Mamba's cost is linear in sequence length: as the scaling sketch above shows, 100k tokens costs the same per token as 10k tokens.

// USE CASE 02

Edge silicon (memory-constrained)

Smartwatches, smart speakers, AI note-takers, smart glasses — total RAM measured in gigabytes. Constant-memory Mamba inference fits where O(n²) attention can't. Krsna SoC supports SSM as one of four production model families specifically for this.

// USE CASE 03

Streaming speech (continuous audio)

Long-form speech recognition: 1-hour meetings, podcast transcription, continuous voice capture. Samba-ASR (Mamba-based) handles streaming audio with constant memory per audio second. Sruthi, the SandLogic ASR product, is built on this advantage, delivering a 51% average WER reduction vs Whisper-large-v3.
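
A sketch of why streaming memory stays flat, assuming a hypothetical chunked encoder (the function below is illustrative, not the Samba-ASR API):

  import numpy as np

  def encode_chunk(chunk, state):
      # Stand-in for one SSM pass over a chunk of audio frames: each frame
      # folds into the fixed-size state; nothing else is retained.
      for frame in chunk:
          state = 0.9 * state + 0.1 * frame  # toy state update
      return state

  d_state = 64
  state = np.zeros(d_state)                  # the only carried memory
  for second in range(3600):                 # one hour of audio, 1 s chunks
      chunk = np.random.randn(100, d_state)  # stand-in for 100 feature frames
      state = encode_chunk(chunk, state)
  # Memory is O(d_state) whether the stream lasts one minute or one hour;
  # an attention encoder's cache would grow with every second of audio.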

// USE CASE 04

Multi-user agentic workloads

Each user maintains a long conversational state. Per-user transformer KV-cache balloons; per-user Mamba SSM hidden state stays small. Higher concurrency on the same silicon. The economic argument compounds.
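
The memory arithmetic below uses illustrative parameters (a 32-layer model at 100k-token context, fp16); exact figures vary by model, but the seq_len factor in the KV term is the point:

  # Per-user inference memory: transformer KV-cache vs Mamba SSM state.
  layers, heads, head_dim, seq_len, bytes_fp16 = 32, 32, 128, 100_000, 2
  kv_cache = 2 * layers * heads * head_dim * seq_len * bytes_fp16  # K and V
  d_inner, d_state = 4096, 16
  ssm_state = layers * d_inner * d_state * bytes_fp16  # no seq_len factor
  print(f"KV-cache per user : {kv_cache / 2**30:.1f} GiB")   # ~48.8 GiB
  print(f"SSM state per user: {ssm_state / 2**20:.1f} MiB")  # ~4.0 MiB

Roughly four orders of magnitude per user, under these assumptions, is what "higher concurrency on the same silicon" cashes out to.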

First-class state space. Not an afterthought.

Most inference frameworks were built for transformers, with state-space support bolted on as a special case. EdgeFlow treats SSMs as a peer architecture family — same dispatch infrastructure, same kernel pipeline, same hardware coverage. Four engineered layers drive the acceleration.

// LAYER 01

Architecture recognition

CORE recognizes Mamba / Mamba-2 / Jamba at compile time via the model graph structure (selective state-space layers, scan operations, the absence of attention blocks). It selects the SSM-appropriate kernel sequence — not a transformer kernel sequence.
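
In spirit, the recognition step is a graph-structure check. A hypothetical sketch (the op names and classification logic are ours, not EdgeFlow's internal IR):

  SSM_OPS = {"selective_scan", "ssm_conv1d"}          # hypothetical op names
  ATTENTION_OPS = {"softmax_attention", "flash_attention"}

  def classify_architecture(op_types: set) -> str:
      has_ssm = bool(op_types & SSM_OPS)
      has_attn = bool(op_types & ATTENTION_OPS)
      if has_ssm and has_attn:
          return "jamba-hybrid"   # interleaved SSM + attention blocks
      if has_ssm:
          return "mamba"          # pure SSM: select the scan kernel pipeline
      return "transformer"        # fall back to the attention pipeline

  assert classify_architecture({"selective_scan", "ssm_conv1d"}) == "mamba"
  assert classify_architecture({"selective_scan", "flash_attention"}) == "jamba-hybrid"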

// LAYER 02

Scan-kernel selection

The hardware-aware compiler picks the right scan implementation for each silicon target. NVIDIA: parallel scan on CUDA. AMD: ROCm-tuned scan. ARM: NEON-vectorized scan. Krsna: native scan primitives in the spatial array. Same model, different kernel per silicon, all driven by the compiler.
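
Conceptually this is a per-target dispatch table. A sketch with made-up kernel names (the real compiler also weighs sequence length, batch size, and memory layout):

  SCAN_KERNELS = {                           # kernel names are illustrative
      "nvidia": "cuda_parallel_scan",        # Blelloch-style scan on CUDA
      "amd":    "rocm_parallel_scan",
      "arm":    "neon_vectorized_scan",
      "krsna":  "spatial_array_scan",        # native scan primitive on-chip
  }

  def select_scan_kernel(target: str) -> str:
      if target not in SCAN_KERNELS:
          raise ValueError(f"no scan kernel registered for {target!r}")
      return SCAN_KERNELS[target]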

// LAYER 03

Memory management

Mamba's constant-memory-per-step property requires careful state caching. EdgeFlow's hybrid cache architecture extends from KV-cache (transformer-style) to SSM hidden state (Mamba-style) — the same memory management framework handles both architectures.
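
One way to picture the hybrid cache, as a sketch with illustrative names: the transformer-style cache appends per token, while the SSM-style state is fixed-size and overwritten in place.

  from dataclasses import dataclass, field

  @dataclass
  class KVCache:                        # transformer-style: grows per token
      keys: list = field(default_factory=list)
      values: list = field(default_factory=list)
      def append(self, k, v):
          self.keys.append(k)           # memory grows with sequence length
          self.values.append(v)

  @dataclass
  class SSMStateCache:                  # Mamba-style: fixed-size, in place
      h: object
      def update(self, new_h):
          self.h = new_h                # overwrite; no growth with tokens

A Jamba layer stack needs both at once, which is why the two live behind one cache framework.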

// LAYER 04

Throughput optimization

For long-context Mamba inference, the bottleneck is often the state-update step. EdgeFlow batches state updates across requests where the cache layout allows, driving throughput up on multi-user agentic workloads where each user has a long conversational context.
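
A sketch of the batching idea (shapes and names are illustrative): stack per-user states into one tensor and apply the update as a single fused operation rather than one small launch per user.

  import numpy as np

  def batched_state_update(H, A_bar, Bx):
      # H: (users, d_state) stacked per-user hidden states.
      # One elementwise op updates every user's state in a single launch.
      return A_bar * H + Bx

  users, d_state = 64, 16
  H = np.zeros((users, d_state))
  A_bar = np.full((users, d_state), 0.95)
  Bx = np.random.randn(users, d_state)
  H = batched_state_update(H, A_bar, Bx)   # amortizes per-request overhead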

Samba-ASR: −51% WER vs Whisper.

The most direct evidence that Mamba inference works at production quality on the SandLogic stack is Samba-ASR — the underlying model of the Sruthi ASR product. Published as arXiv 2501.02832, Samba-ASR achieves a 51% average word error rate reduction versus Whisper-large-v3, OpenAI's open-source ASR baseline.

// ARCHITECTURE

Mamba (SSM)

Selective state-space scan. Linear-time processing of audio streams. Constant memory per audio second.

// PERFORMANCE

−51% WER

Average word error rate reduction vs Whisper-large-v3. Stronger advantage on long-form audio, accented English, code-switched Indic speech.

// DEPLOYMENT

22 Indic languages

Production deployment as Sruthi inside the Lingo platform. 21+ enterprises running today. On-prem or cloud.

// GO DEEPER

Continue exploring state-space and the runtime.

  • /sruthi — the production ASR product built on Samba-ASR. −51% WER, 22 Indic languages, on-prem deployment.
  • /edgeflow — the inference engine that runs Mamba natively across all silicon families.
  • /any-ai — the chip-level Any AI claim: Krsna supports SSM as one of four production model families.
  • /research — the arXiv 2501.02832 paper and the rest of the SandLogic research output.
  • /compare/vllm-vs-edgeflow — why EdgeFlow handles Mamba natively where vLLM has limited support.
// LET'S BUILD

Mamba in production. Talk to engineering.