Variable Grouped Query Attention (VGQA)
Rethinking attention economics
Instead of treating attention heads uniformly, VGQA lets multiple query heads dynamically share key/value representations: the model keeps deliberately fewer key/value heads than query heads, and each key/value projection is reused across a group of query heads, cutting redundant computation. Architecturally, this means reduced memory-bandwidth pressure, better cache locality, and efficient sliding-window attention without sacrificing global coherence. Behaviorally, it yields more stable reasoning over long contexts, less coherence degradation as sequence length increases, and predictable latency even as context grows. That matters for enterprise workloads: logs, transcripts, long documents, and conversational histories.
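The key/value sharing described above can be sketched in a few lines of NumPy. This is a minimal, static illustration of grouped key/value heads (each K/V head serves a fixed group of query heads); the "variable" or dynamic group assignment that VGQA adds is not shown, and all names and shapes here are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Attention with fewer K/V heads than query heads.

    Each K/V head is shared by (n_q_heads // n_kv_heads) query heads,
    so the K/V projections and cache are a fraction of the usual size.
    Static grouping only; a simplification of the dynamic sharing in VGQA.
    """
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per K/V head

    # Queries get the full head count; keys/values get the reduced count.
    q = (x @ wq).reshape(seq, n_q_heads, d_head)   # (S, Hq,  Dh)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # (S, Hkv, Dh)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)  # (S, Hkv, Dh)

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # index of the shared K/V head for this query head
        scores = q[:, h, :] @ k[:, kv, :].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h, :] = w @ v[:, kv, :]
    return out.reshape(seq, d_model)

# Toy shapes: 8 query heads share 2 K/V heads (group size 4).
rng = np.random.default_rng(0)
S, D, HQ, HKV = 16, 64, 8, 2
x = rng.standard_normal((S, D))
wq = rng.standard_normal((D, D)) * 0.05
wk = rng.standard_normal((D, (D // HQ) * HKV)) * 0.05  # reduced K projection
wv = rng.standard_normal((D, (D // HQ) * HKV)) * 0.05  # reduced V projection
y = grouped_query_attention(x, wq, wk, wv, HQ, HKV)
print(y.shape)  # (16, 64)
```

Note how the K/V projection matrices are a quarter the width of the query projection: that shrinkage is where the memory-bandwidth and KV-cache savings come from, since at inference time only the reduced K/V tensors need to be cached per token.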
Attention design directly shapes reasoning behavior. VGQA does not merely optimize compute — it fundamentally improves long-horizon coherence.