// L02 — RUNTIME · THE CUDA OF EDGE

EdgeMatrix.
The CUDA of edge AI.

EdgeMatrix has two layers: CORE (compiler + runtime engine) and EdgeFlow (the inference engine where models actually execute). Together they make any model run on any silicon. It's to edge AI what CUDA is to NVIDIA, but hardware-agnostic by design. +73% throughput vs vLLM on L40s. 193 model architectures pre-tuned. Runs across five third-party silicon platforms — plus Krsna, our co-designed in-house SoC.

Throughput vs vLLM · +73%
Cost saving · up to 40%
Models supported · 193
Hardware targets · 6

CUDA, but unlocked.

CUDA earned its name by doing five things — and by locking developers into one silicon vendor as the price of getting them. EdgeMatrix does the same five things, without the lock. The comparison is earned, point by point. Here's what each claim means.

// CLAIM 01 · UNIFIED PROGRAMMING MODEL

Like CUDA: one programming abstraction across the silicon.

CUDA gave developers one mental model that compiled down to every NVIDIA GPU generation. EdgeMatrix gives developers one programming abstraction that compiles down to NVIDIA, AMD, Intel, ARM, and Qualcomm — plus Krsna, the co-designed in-house SoC that targets a deliberate model-family subset.

Proof: 193 models · 5 third-party silicon · 1 in-house · one binary
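
What that could look like in code: a minimal sketch. The edgematrix module, its functions, and the device names below are illustrative assumptions, not the published API.

    # Hypothetical bindings, for illustration only; not the real EdgeMatrix API.
    import edgematrix as emx

    model = emx.load("llama-3.1-8b")      # one model definition
    binary = emx.compile(model)           # one hardware-agnostic binary
    for device in emx.devices():          # NVIDIA, AMD, Intel, ARM, Qualcomm, Krsna
        out = binary.run("Hello, edge.", device=device)
        print(device.name, out.text)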

// CLAIM 02 · COMPILER + RUNTIME + LIBS

Like CUDA: the full toolchain, not just a library.

CUDA is compiler (nvcc) + runtime (driver) + libraries (cuBLAS, cuDNN). EdgeMatrix is CORE (compiler + runtime engine) + EdgeFlow (inference engine) + op libraries. Same architectural shape, deliberately.

Proof: L03 platform layer · L01 silicon co-design
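
A sketch of how the three layers could compose. Every name below (emx_core, edgeflow, Session, Engine) is an assumption made for illustration, not EdgeMatrix's actual module layout.

    # Hypothetical layering, for illustration; names are assumptions.
    from emx_core import compile_model      # CORE: compiler
    from emx_core.runtime import Session    # CORE: runtime engine
    import edgeflow                         # EdgeFlow: inference engine

    artifact = compile_model("qwen2.5-7b")  # compiler lowers the model graph
    session = Session(artifact)             # runtime engine owns device state
    engine = edgeflow.Engine(session)       # inference engine calls the tuned
    print(engine.generate("ping").text)     # op libraries underneath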

// CLAIM 03 · KERNEL OPTIMIZATION

Like CUDA: hand-tuned kernels for the silicon.

CUDA's value isn't the language — it's the decade of hand-tuned kernels in cuBLAS, cuDNN, cuSPARSE. EdgeMatrix has the equivalent for each silicon target: hybrid KV-cache, cache-aware scheduling, dynamic dispatch. The +73% vs vLLM is the receipt.

Proof: +73% on L40s · +29% on A100 · vs vLLM 0.10.2

// CLAIM 04 · MODEL ZOO BREADTH

Like CUDA: ships with the model architectures already supported.

CUDA shipped with the operators you needed. EdgeMatrix ships with 193 model architectures pre-tuned — across Transformers, SSMs (Mamba), RWKV, LFMs, CNNs, MoE, VLMs, and diffusion. Bring your own model or pick from the zoo.

Proof: CORE dispatches 8 architecture families × 5 third-party silicon platforms
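
A minimal sketch of what family-level dispatch could look like, assuming a simple registry. The family keys echo the list above; everything else is illustrative.

    # Illustrative registry dispatch across architecture families.
    FAMILIES = {}

    def register(family: str):
        def wrap(runner):
            FAMILIES[family] = runner
            return runner
        return wrap

    @register("transformer")
    def run_transformer(model, tokens): ...

    @register("ssm")        # Mamba-style state-space models
    def run_ssm(model, tokens): ...

    @register("moe")        # mixture-of-experts
    def run_moe(model, tokens): ...

    def dispatch(model, tokens):
        return FAMILIES[model.family](model, tokens)   # route by family tag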

// CLAIM 05 · THE DIFFERENCE THAT MATTERS

Unlike CUDA: hardware-agnostic by design.

CUDA's biggest feature is its biggest constraint — it only runs on NVIDIA. That's NVIDIA's moat, and the developer's lock. EdgeMatrix optimizes natively for NVIDIA *and* AMD *and* Intel *and* ARM *and* Qualcomm — five third-party silicon platforms — plus Krsna, our co-designed in-house SoC. When the workload moves, the runtime moves with it. That's the only respect in which we don't want to be CUDA.

Proof: the compatibility matrix on /any-ai

// THROUGHPUT BENCHMARKS

Faster than every open framework.

Benchmarks run on NVIDIA A100 (80 GB) and L40s. EdgeMatrix v0.0.4 vs vLLM v0.10.2 vs TensorRT-LLM v1.0.0. Numbers vary by model family and workload — see research for full disclosure.
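
For reference, the headline percentages are relative lifts over the vLLM baseline. A minimal sketch of the arithmetic, with placeholder numbers rather than measured data:

    # Relative lift over a baseline; both tokens/sec values are placeholders.
    vllm_tps = 1000.0           # baseline throughput (placeholder)
    edgematrix_tps = 1730.0     # candidate throughput (placeholder)
    lift = edgematrix_tps / vllm_tps - 1.0
    print(f"{lift:+.0%}")       # -> +73%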

+29% · avg. throughput on A100
+73% · avg. throughput on L40s
23-40% · cost savings range

NVIDIA L40s · tokens/sec

EdgeMatrix vs leading runtimes

EdgeMatrix v0.0.4 · hybrid KV-cache reuse · +73%
TensorRT-LLM v1.0.0 · NVIDIA reference · +8%
SGLang · open-source · +4%
vLLM v0.10.2 · baseline · 0%
Relative throughput vs vLLM 0.10.2 baseline. Top-25 enterprise SLMs, NVIDIA L40s, 3-run avg, FP16.
NVIDIA A100 80GB · tokens/sec

Same engine on enterprise hardware

EdgeMatrix v0.0.4 · cache-aware scheduling · +29%
TensorRT-LLM v1.0.0 · +10%
SGLang · +5%
vLLM v0.10.2 · baseline · 0%
Relative throughput vs vLLM 0.10.2 baseline on NVIDIA A100 80GB.

Write once. Run anywhere.

EdgeMatrix is the connective tissue. Any of the 193 supported model architectures — Llama, Qwen, Mistral, Shakti, Phi, DeepSeek, Gemma, and more — runs through one binary on any target: NVIDIA, AMD, Intel, ARM, NPUs, or Krsna SoC. No re-quantization. No vendor lock-in. No per-target kernel team.

ANY MODEL
Shakti family · 6 in-house · 2 in flight
Llama · 12 architectures
Qwen · 18 architectures · incl. VL
Mistral · Mixtral · 9 architectures
DeepSeek · 11 architectures · V2/V3/R1
Phi · Gemma · more · +137 across the long tail
  ↓
EdgeMatrix · hardware-agnostic runtime · ONE BINARY
  ↓
ANY HARDWARE
NVIDIA · A100 · H100 · L40s · L4
AMD · MI300 · ROCm stack
Intel · Gaudi · Xeon · Arc
ARM CPUs · server + edge devices
NPUs · Qualcomm · Apple · custom
Krsna SoC · in-house silicon
193 model architectures across 6 hardware targets (5 third-party silicon platforms plus the in-house Krsna SoC), through one binary. CORE dispatches the architectures; EdgeFlow accelerates the inference. Krsna implements a deliberately scoped subset — see /krsna for chip-specific coverage.

Built for the production reality.

Hybrid KV cache reuse

Combines prefix-level and entity-level cache reuse to cut recomputation. Lifts tokens/sec by 29-73% over the vLLM 0.10.2 baseline across the Top-25 enterprise SLMs, and clears TensorRT-LLM 1.0.0 on the same runs.

Dynamic compiler optimization

Adapts to model and hardware at runtime. No static configuration. Just-in-time kernel selection based on batch shape, sequence length, and target device.
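
A minimal sketch of what just-in-time kernel selection could look like, keyed on device and shape bucket. The device strings and kernel names are illustrative assumptions, not EdgeMatrix internals.

    # Illustrative JIT kernel selection, cached per (device, shape bucket).
    from functools import lru_cache

    def bucket(n: int) -> int:
        return 1 << max(0, n - 1).bit_length()   # round up to a power of two

    @lru_cache(maxsize=None)
    def select_kernel(device: str, batch_bucket: int, seq_bucket: int) -> str:
        # A real runtime would consult per-device tuned tables; this just
        # sketches the shape of the decision.
        if device.startswith("nvidia") and seq_bucket >= 4096:
            return "long_context_attention"
        if batch_bucket == 1:
            return "latency_optimized_decode"
        return "throughput_batched_decode"

    kernel = select_kernel("nvidia-l40s", bucket(8), bucket(3000))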

Hardware-agnostic acceleration

Optimized for NVIDIA, AMD, Intel, ARM, and Qualcomm. Ready for new NPUs, GPUs, and FPGAs without re-architecting. Modular runtime — extend to new hardware in days, not quarters.

Quantization without quality loss

INT8 and INT4 quantization across model architectures with no noticeable accuracy degradation. Fits 70B-class models into 24GB device footprints.
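
The basic mechanics, shown as textbook symmetric INT8 quantization in Python. This is a generic illustration, not a claim about EdgeMatrix's exact scheme.

    # Generic symmetric per-tensor INT8 quantization, for illustration.
    import numpy as np

    def quantize_int8(w: np.ndarray):
        scale = float(np.abs(w).max()) / 127.0          # max magnitude -> 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, s = quantize_int8(w)                             # 4x smaller than FP32
    max_err = float(np.abs(dequantize(q, s) - w).max()) # bounded by ~scale/2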

Cache-aware scheduling

Maximizes GPU/NPU utilization by routing requests based on cache locality. Higher concurrency, lower latency, no engineering effort from the model team.
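
A minimal sketch of cache-locality routing, assuming requests are keyed by a hash of their prompt prefix and pinned to the replica that already holds it. The scheduler shape here is an assumption.

    # Illustrative cache-affinity router: same prefix -> same replica.
    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Replica:
        name: str
        load: int = 0

    class Router:
        def __init__(self, replicas: list[Replica]):
            self.replicas = replicas
            self.affinity: dict[str, int] = {}     # prefix hash -> replica index

        def route(self, prompt: str, prefix_len: int = 256) -> Replica:
            key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
            if key not in self.affinity:           # first sighting: least loaded
                self.affinity[key] = min(
                    range(len(self.replicas)), key=lambda i: self.replicas[i].load
                )
            r = self.replicas[self.affinity[key]]
            r.load += 1                            # repeat prefixes hit warm caches
            return r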

Native VLM, MoE, multi-modal

Where vLLM and TensorRT-LLM still leave gaps in VLM coverage, EdgeMatrix runs Shakti-VLM, Qwen-VL, and frontier multi-modal models out of the box.

// HYBRID KV CACHE

How the +73% lift actually works.

INCOMING · request · prompt + context
  ↓
CACHE TIER 01 — PREFIX · prefix-level KV cache · shared prompt prefixes · system messages · hit = skip prefill of that prefix entirely
  ↓
CACHE TIER 02 — ENTITY · entity-level KV cache · named entities · retrieved chunks · tool outputs · hit = stitch entity KVs without re-encoding
  ↓
MODEL · Shakti or any LLM · decode only on miss · populates both caches on the way through
  ↓
OUTPUT · response · tokens
HIT — skip model · MISS — decode
Two cache tiers sit between every request and the model. Prefix-level for shared prompts, entity-level for retrieved chunks. Cache hits skip the model entirely; cache misses populate both tiers on the way through.
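
A minimal sketch of the two-tier lookup in Python. The cache keys, and the model's prefill/encode/stitch/decode hooks, are illustrative assumptions rather than engine internals.

    # Illustrative two-tier KV cache: prefix tier, then entity tier, then decode.
    import hashlib

    def h(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    prefix_cache: dict[str, object] = {}    # prefix hash -> prefix KV blocks
    entity_cache: dict[str, object] = {}    # entity hash -> entity KV blocks

    def run(prefix: str, entities: list[str], model):
        kv = prefix_cache.get(h(prefix))
        if kv is None:                       # tier-1 miss: prefill the prefix
            kv = model.prefill(prefix)
            prefix_cache[h(prefix)] = kv
        for ent in entities:                 # tier 2: stitch entity KVs
            ekv = entity_cache.get(h(ent))
            if ekv is None:                  # miss: encode once, reuse after
                ekv = model.encode(ent)
                entity_cache[h(ent)] = ekv
            kv = model.stitch(kv, ekv)
        return model.decode(kv)              # decode only what the caches missed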

193 models and counting.

EdgeMatrix's modular runtime ships pre-tuned for the 193 most-used model architectures in enterprise agentic AI — text, VLM, MoE, and multi-modal. New models added in less than a week.

Shakti · 6 · Shakti-2.5B, Shakti-VLM-1B, Shakti-VLM-4B
Llama · 12 · Llama 3, Llama 3.1, Llama 4
Qwen · 18 · Qwen 2, Qwen 2.5, Qwen2-VL
Mistral · 9 · Mistral 7B, Mixtral 8x7B, Codestral
Phi · 6 · Phi-3, Phi-3-Vision, Phi-3.5
DeepSeek · 11 · DeepSeek V2, V3, R1, OCR
Gemma · 7 · Gemma 2, Gemma 3
And many more · 124 · Granite, Cohere, Yi, GLM, Falcon...

// LET'S BUILD

Replace your inference stack — in a week.