// LEARN · SHAKTI ARCHITECTURE

How Shakti is different.
Architecture, not scale.

Most language and vision models evolve through scale-driven iteration — abundant compute, large memory, cloud-centric inference. Shakti was architected under the opposite assumptions: compute is constrained, latency matters, models must adapt to devices. The objective — maximize intelligence per parameter while preserving predictable behavior under real-world constraints. Here is how each architectural choice shapes model behavior.

Design objective: Intelligence / param
Shared philosophy: Text + Vision
Punches above weight: 2–3×
arXiv papers: 3
// THE SHORT ANSWER

Shakti is architecturally different because it optimizes for intelligence per parameter, not scale. The Text SLMs use Variable Grouped Query Attention (VGQA), sliding-window attention with high-theta RoPE (θ = 500,000), and Pre-Normalization with SwiGLU. The Vision-Language Models use a deep 48-layer ViT encoder, QK-Normalization with RMSNorm, dynamic patch sizing (14×14 to 32×32), and 2D RoPE with absolute positional bias. Every choice follows from one assumption set — compute is constrained, latency matters, models must adapt to devices. The result: Shakti models punch 2–3× above their parameter weight, with lower hallucination, more faithful instruction adherence, and consistent latency.

From scale-first to intelligence-first.

The current generation of language and vision models has largely evolved through scale-driven iteration. This has produced impressive benchmark results, but also structural limitations in efficiency, stability, controllability, and deployment feasibility, especially outside hyperscale cloud environments. Shakti starts from a different assumption set.

// SCALE-FIRST (MOST MODELS)

Assumes abundance

  • Abundant compute
  • Large memory budgets
  • High-bandwidth interconnects
  • Cloud-centric inference
  • Intelligence emerges from brute force
// INTELLIGENCE-FIRST (SHAKTI)

Assumes constraint

  • Compute is constrained
  • Latency matters
  • Models adapt to devices, not the reverse
  • Edge-aware memory and compute budgets
  • Intelligence emerges from architectural efficiency

"Constraints are not a limitation — they are an architectural clarifier. When compute and memory are treated as scarce from day one, inefficient design choices surface immediately."

// SHAKTI TEXT SLM

Architecture shaped by long-horizon reasoning.

Three architectural choices define the Shakti Text SLMs. Each was selected for behavioral impact, not benchmark optics.

// TEXT SLM ARCHITECTURE — AT A GLANCE
Mechanism | What it does | Behavioral impact
Variable Grouped Query Attention (VGQA) | Query heads dynamically share key/value representations; fewer K/V heads than query heads; K/V projections reused across multiple query heads | Long-horizon reasoning coherence; predictable latency as context grows
Sliding Window Attention + KV Caching + high-theta RoPE (θ = 500,000) | Context treated as temporal memory — recent tokens get higher fidelity, relevance decays gracefully | Reduced hallucination drift in long conversations; better instruction and constraint retention
Pre-Normalization + SwiGLU | Pre-Norm applied consistently across layers; SwiGLU activations for representational capacity without widening layers | Gradient stability under aggressive fine-tuning; faster convergence; lower prompt-phrasing sensitivity
01

Variable Grouped Query Attention (VGQA)

Rethinking attention economics

Instead of treating attention heads uniformly, VGQA has multiple query heads dynamically share key/value representations — with deliberately fewer key/value heads than query heads, and key/value projections reused across multiple query heads to cut redundant computation. Architecturally: reduced memory-bandwidth pressure, improved cache locality, efficient sliding-window attention without sacrificing global coherence. Behaviorally: more stable reasoning over long contexts, lower coherence degradation as sequence length increases, predictable latency even as context grows. Critical for enterprise workloads — logs, transcripts, long documents, conversational histories.
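Shakti's exact VGQA grouping schedule is not published, so the sketch below shows the fixed-group baseline it generalizes: grouped-query attention in PyTorch, with fewer K/V heads than query heads and each K/V projection reused by a group of query heads. All names and dimensions are illustrative.

```python
# Minimal grouped-query attention in PyTorch. Illustrative only: Shakti's VGQA
# varies the grouping dynamically; this baseline fixes the group size.
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Fewer K/V heads than query heads: smaller projections, smaller KV cache.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves n_heads // n_kv_heads query heads.
        group = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))

attn = GroupedQueryAttention(d_model=512, n_heads=8, n_kv_heads=2)
y = attn(torch.randn(1, 16, 512))  # -> (1, 16, 512), with a 4x smaller KV cache
```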

Attention design directly shapes reasoning behavior. VGQA does not merely optimize compute — it fundamentally improves long-horizon coherence.

02

Sliding Window Attention + KV Caching

Sustained context without explosion

Rather than pushing context length purely via static positional scaling, Shakti combines sliding-window attention, aggressive key-value caching, and high-theta RoPE (θ = 500,000). The architectural implication: context is treated as temporal memory, not a flat sequence — recent tokens get higher fidelity, long-context inference is supported with reduced memory and compute overhead versus full attention. Model behavior: reduced hallucination drift in long conversations, better retention of instructions and constraints, improved summarization fidelity over evolving documents.
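To make the two mechanics concrete, here is a minimal PyTorch sketch of RoPE with a large theta and a sliding-window mask. Only θ = 500,000 comes from the text; the window size and helper names are assumptions for illustration.

```python
# Sketch of high-theta RoPE and a sliding-window mask, in isolation.
import torch

def rope_tables(head_dim: int, max_pos: int, theta: float = 500_000.0):
    # A larger theta slows the per-pair rotation, keeping relative phases
    # distinguishable over much longer token spans.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_pos).float(), inv_freq)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: (..., seq_len, head_dim); rotate each (even, odd) coordinate pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal, and at most `window` tokens back,
    # so per-token cost stays flat as the context grows.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)

cos, sin = rope_tables(head_dim=64, max_pos=128)
q = apply_rope(torch.randn(1, 8, 128, 64), cos, sin)  # (batch, heads, seq, dim)
mask = sliding_window_mask(seq_len=128, window=32)    # pass to attention as a bool mask
```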

Long context fails when everything is treated as equally important. Treating context as temporal memory — where relevance decays gracefully — proved more effective.

03

Pre-Normalization + SwiGLU

Stability as a first-class citizen

Shakti adopts Pre-Normalization consistently across layers and uses SwiGLU activations. Pre-Norm ensures gradient stability even under aggressive fine-tuning; SwiGLU improves representational capacity without widening layers. Resulting behavior: faster convergence during domain adaptation, more predictable fine-tuning outcomes, less sensitivity to prompt phrasing.
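Both pieces are standard, so they are easy to show concretely. Below is a minimal sketch of a pre-normalized block with a SwiGLU feed-forward; sizes and names are illustrative, not Shakti's actual configuration (nn.RMSNorm requires PyTorch ≥ 2.4).

```python
# Minimal pre-normalized transformer block with a SwiGLU feed-forward.
import torch
import torch.nn.functional as F
from torch import nn

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit: more representational capacity per parameter
        # than a plain MLP of the same width.
        return self.down(F.silu(self.gate(x)) * self.up(x))

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, d_hidden: int):
        super().__init__()
        self.norm1, self.norm2 = nn.RMSNorm(d_model), nn.RMSNorm(d_model)
        self.attn, self.ffn = attn, SwiGLU(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-Norm: normalize *before* each sublayer, then add the residual.
        # The residual path stays unnormalized, which keeps gradients stable.
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))
```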

Fine-tuning is the most honest stress test of architecture. Architectures that appear strong during pretraining often fail during repeated domain adaptation.

// SHAKTI VLM

Vision as a first-class reasoning modality.

Many VLMs treat vision as a front-end encoder feeding a language backbone. Shakti VLM treats vision tokens as reasoning tokens. Four architectural choices follow from that decision.

// VLM ARCHITECTURE — AT A GLANCE
Component | Spec / approach | Why it matters
Vision encoder | Deep ViT-style — 48 layers, 1920 hidden dim | Preserves fine-grained spatial structure deep into the network; avoids aggressive early pooling
Attention normalization | QK-Normalization with RMSNorm, inside vision attention | Prevents attention-score explosion on high-resolution images; stabilizes deep-stack gradients
Patch sizing | Dynamic — 14×14 (low-res) up to 32×32 (high-res) | Adapts to image resolution; efficient token use; no accuracy collapse on mixed-resolution data
Positional encoding | 2D RoPE extended into spatial domains + absolute positional bias | Preserves relative spatial relationships; drift-free absolute grounding in large layouts

Deep ViT encoder — 48 layers, 1920 hidden dim

The depth is not there to chase accuracy benchmarks. It preserves fine-grained spatial semantics, avoids aggressive early pooling, and maintains object-level separability deep into the network. Visual reasoning — documents, charts, layouts — requires spatial continuity; shallow encoders collapse structure too early.

Vision models fail not because they lack accuracy, but because they discard structure too early.

QK-Normalization with RMSNorm

Applied directly inside the vision attention mechanism. Prevents attention-score explosion in high-resolution images; stabilizes gradients across deep vision stacks. Behavioral impact: consistent attention maps across resolutions, reduced sensitivity to lighting/noise/scaling, improved object localization without explicit supervision.
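A minimal sketch of the idea, assuming the common placement: RMSNorm applied to queries and keys per head, before the dot product. Shakti's exact implementation is not public; names and sizes here are illustrative (nn.RMSNorm requires PyTorch ≥ 2.4).

```python
# Sketch of QK-Normalization inside a vision attention layer.
import torch
import torch.nn.functional as F
from torch import nn

class QKNormAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Normalizing Q and K bounds the attention logits, so scores cannot
        # explode as token count grows with image resolution.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v)  # bidirectional: vision tokens
        return self.out(y.transpose(1, 2).reshape(b, t, d))
```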

Attention stability is foundational in vision, not optional.

Dynamic patch sizing — 14×14 to 32×32

Instead of fixed patch sizes, Shakti VLM adapts: 14×14 for low-resolution inputs, up to 32×32 for high-resolution. Real-world images are not uniformly sized; fixed patches either over-fragment or under-represent. Result: efficient token utilization, better performance on mixed-resolution datasets, lower inference cost without accuracy collapse.
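A toy sketch of what resolution-dependent patch selection looks like. Only the 14×14 and 32×32 endpoints come from the text; the intermediate thresholds and helper names are invented for illustration.

```python
# Toy sketch: pick a patch size from image resolution, then count tokens.
def pick_patch_size(height: int, width: int) -> int:
    longest = max(height, width)
    if longest <= 448:
        return 14      # low resolution: small patches preserve fine detail
    if longest <= 896:
        return 16
    if longest <= 1344:
        return 28
    return 32          # high resolution: larger patches cap the token budget

def token_count(height: int, width: int) -> int:
    p = pick_patch_size(height, width)
    return (height // p) * (width // p)

print(token_count(336, 336))    # 576 tokens at patch 14
print(token_count(2048, 1536))  # 3072 tokens at patch 32, vs ~15,900 at a fixed 14
```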

Architectures must adapt to data variability instead of forcing uniformity.

2D RoPE with absolute bias

Shakti extends RoPE into 2D spatial domains, augmented with absolute positional bias. Relative spatial relationships are preserved; absolute positioning prevents drift in large layouts. Practical outcome: stronger grounding in document intelligence, improved multi-object reasoning, reliable spatial referencing in multimodal conversations.
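One common 2D RoPE construction splits each head's channels between row and column rotations; the sketch below shows that construction, with the absolute bias indicated only as a comment. How Shakti combines the two is not specified here, so treat this as an assumption-laden illustration.

```python
# Sketch: 2D RoPE via split channels, one half per spatial axis.
import torch

def rope_1d(pos: torch.Tensor, dim: int, theta: float = 10_000.0):
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float().unsqueeze(-1) * inv_freq        # (n_tokens, dim/2)
    return torch.cos(angles), torch.sin(angles)

def rotate_pairs(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor):
    # x: (n_tokens, head_dim), head_dim divisible by 4. The first half of the
    # channels rotates by row index, the second by column index, so relative
    # offsets along both axes survive into the attention scores.
    half = x.shape[-1] // 2
    cos_r, sin_r = rope_1d(rows, half)
    cos_c, sin_c = rope_1d(cols, half)
    return torch.cat([rotate_pairs(x[..., :half], cos_r, sin_r),
                      rotate_pairs(x[..., half:], cos_c, sin_c)], dim=-1)

# Absolute grounding would add a learned per-position bias to the attention
# logits, anchoring tokens that relative rotation alone cannot disambiguate.
rows = torch.tensor([0, 0, 1, 1])   # a 2x2 patch grid
cols = torch.tensor([0, 1, 0, 1])
q = apply_2d_rope(torch.randn(4, 64), rows, cols)
```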

Relative position alone is insufficient for spatial reasoning at scale.

// SHAKTI IN CONTEXT

Proven foundations. Deliberate divergence.

Shakti is not built on exotic, unproven components. It uses the same battle-tested techniques the frontier open models use — Pre-Normalization, RMSNorm, SwiGLU, RoPE, grouped-query attention, KV caching. We are not reinventing the basics.

Shakti diverges precisely where the constraint demands it: a different design objective, a different long-context strategy, a different vision-integration philosophy, and a coherence across modalities that most stacks do not have. The table below is honest about both — what is shared with the field, and what genuinely diverges.

Architecture dimension | Prevailing approach (frontier open models) | Shakti's choice
Primary objective | Scale-first — capability through parameter count and data volume | Intelligence-per-parameter — capability under edge, latency, and memory constraints
Attention scheme | Multi-Head or fixed Grouped Query Attention | Variable Grouped Query Attention (VGQA) — query heads dynamically share key/value representations
Long-context strategy | Large static context windows; positional scaling | Sliding window + aggressive KV caching + high-theta RoPE (θ = 500,000) — context as temporal memory
Normalization & activations | Pre-Normalization, RMSNorm, SwiGLU | Pre-Normalization, RMSNorm, SwiGLU — the same proven foundations
Vision integration (VLM) | Vision encoder as a front-end feature extractor feeding the language backbone | Vision tokens treated as reasoning tokens; deep 48-layer encoder; dynamic patch sizing; 2D RoPE
Cross-modal design | Text and vision architectures evolved as separate efforts | One unified set of design principles applied identically across text and vision
Deployment assumption | Cloud-centric inference — abundant compute and memory | Edge + on-prem — models adapt to devices, not the reverse
// HOW TO READ THIS TABLE

Some rows mark a genuine divergence from the prevailing approach; others mark a shared, proven best practice. We mark the shared rows honestly — because that honesty is what makes the divergences credible.

// THE POINT

Shakti's differentiation is not a secret component. It is the objective function the architecture is optimized for, the combination of choices, and the coherence across text and vision. Different where it counts — proven everywhere else.

Note: "prevailing approach" describes the common pattern across frontier open models (Llama, Qwen, Mistral, Phi class). Specific implementations vary by model and version; the column reflects the dominant industry pattern, not any single model's exact configuration.

Text and vision share the same principles.

Multimodal systems benefit more from architectural coherence than architectural diversity. Shared principles across text and vision significantly reduced alignment fragility. Five principles run identically through both families.

Principle | Manifestation | Effect
Efficiency-first | VGQA, dynamic patch sizing, KV caching | Reduced compute load
Stability-first | Pre-Norm, QK-Norm, RMSNorm | Predictable behavior
Long-horizon reasoning | High-theta RoPE, sliding windows | Extended coherence
Deployment realism | Edge-aware memory and compute budgets | Real-world viability
Behavioral predictability | Reduced variance under fine-tuning | Reliable outcomes
// WHAT CHANGES IN PRACTICE

Architectural choices, architectural consequences.

Because of these architectural choices, Shakti models exhibit a specific behavioral signature. These are not accidental outcomes — they are architectural consequences.

// 01

Lower hallucination propensity in long contexts

// 02

More faithful instruction adherence

// 03

Better spatial grounding in vision tasks

// 04

Consistent latency profiles

// 05

Higher intelligence-per-parameter ratio

// THE CONCLUSION

"Architecture encodes values. What a model optimizes for — efficiency, stability, adaptability — shows up inevitably in how it behaves in the real world."

— Kamalakar Devaki, Founder & CEO, SandLogic Technologies

From architecture to deployment.

  • /shakti — the full Shakti family: six released models (100M to 4B), benchmarks vs Llama / Phi / Qwen, the Lexicons + Nexons continuum.
  • /research — the three arXiv papers behind Shakti and Shakti-VLM, plus the patent portfolio.
  • /learn/mamba-inference-explained — the state-space architecture that powers Samba-ASR, the Shakti-family speech model.
  • /any-ai — how the Krsna SoC + CORE run Shakti and other architectures natively.
  • /founders-desk — the full Shakti article series and more writing by Kamalakar Devaki.
// LET'S BUILD

Architecture you can deploy.