Kimi K2.5
by Moonshot · Released January 27, 2026
Moonshot AI's flagship model. A 1 trillion parameter MoE model with 32B active parameters per token. Native multimodal, trained on 15 trillion mixed visual and text tokens. Supports Thinking, Instant, Agent, and Agent Swarm (100 parallel agents) modes.
Powered by Moonshot · Sparse Mixture-of-Experts (1T total / 32B active)
| Spec | Value |
|---|---|
| Context Window | 128K |
| Parameters | 1T total / 32B active (MoE) |
| Max Output | 16K |
| Category | LLM Chat |
Overview
Kimi K2.5 is Moonshot AI's flagship model, built on a 1 trillion total parameter Mixture-of-Experts architecture with 61 layers: 1 dense layer followed by 60 MoE layers, each containing 384 expert networks. The router activates the top 8 experts plus 1 shared expert per token, so only about 3.2% of the total parameters (roughly 32B) are active for any given token. The model was pretrained on 15 trillion combined vision and text tokens, making it natively multimodal from the ground up rather than having vision bolted on afterward.
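The routing described above can be illustrated with a small sketch. It uses only the counts given in the text (384 experts, top 8 routed, 1 shared, 1T total and 32B active parameters). The gating function is an assumption: a softmax over the selected logits is a common choice, but the source does not say what Kimi K2.5 actually uses, and `route_token` is a hypothetical helper name.

```python
import math
import random

NUM_EXPERTS = 384      # routed experts per MoE layer (from the text)
TOP_K = 8              # routed experts activated per token (from the text)
TOTAL_PARAMS = 1e12    # ~1T total parameters (from the text)
ACTIVE_PARAMS = 32e9   # ~32B active parameters per token (from the text)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts and softmax-normalize their logits.

    Assumption: softmax over the selected logits is a common gating
    choice, not necessarily the one this model uses.
    """
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Random logits stand in for a real router's output for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(len(selected), "routed experts + 1 shared expert")
print(f"active fraction: {active_fraction:.1%}")
```

The shared expert is not part of the selection because it runs for every token; only the 8 routed experts vary.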
The architecture incorporates MoonViT, a 400-million-parameter vision encoder embedded directly into the model architecture, and Multi-Head Latent Attention (MLA) that compresses KV pairs into a shared latent representation, reducing the KV cache by approximately 10x compared to standard multi-head attention. This cache reduction is what makes the 256K token context window practical despite the model's massive scale.
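To see why a roughly 10x KV cache reduction matters at long context, here is a back-of-the-envelope sketch. Only the layer count (61), the 256K context, and the ~10x factor come from the text. The head count and head dimension are hypothetical placeholders chosen for illustration, not published values for this model, and 256K is taken to mean 256 × 1024 tokens.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Cache size for standard multi-head attention, in GiB.

    Stores one key and one value vector per token, layer, and head.
    """
    total_bytes = 2 * tokens * layers * kv_heads * head_dim * bytes_per_value
    return total_bytes / 2**30


LAYERS = 61                   # from the text
CONTEXT_TOKENS = 256 * 1024   # "256K" read as 256 * 1024 tokens (assumption)
KV_HEADS = 64                 # hypothetical, for illustration only
HEAD_DIM = 128                # hypothetical, for illustration only
MLA_REDUCTION = 10            # "approximately 10x" from the text

mha_gib = kv_cache_gib(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
mla_gib = mha_gib / MLA_REDUCTION

print(f"standard MHA cache: {mha_gib:.1f} GiB per sequence")
print(f"with ~10x MLA compression: {mla_gib:.1f} GiB per sequence")
```

Under these placeholder dimensions, a single full-length sequence would need hundreds of GiB of cache without compression, which is the scale of problem the text attributes to MLA solving.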
Kimi K2.5 supports four distinct operating modes, each optimized for different use cases. Instant mode suppresses reasoning traces and uses approximately 75% fewer tokens, ideal for fast direct responses. Thinking mode enables full reasoning traces and achieves 96.1% on AIME 2025. Agent mode handles 200-300 tool calls without losing track of the task, though it has an approximately 12% failure rate on tool calls. Agent Swarm mode orchestrates up to 100 sub-agents working in parallel, delivering a 4.5x speedup on complex tasks and boosting BrowseComp from 60.6% to 78.4%.
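The ~12% tool-call failure rate has practical consequences for long Agent-mode runs, which the following sketch quantifies. It assumes failures are independent and that a failed call can simply be retried. Neither assumption is stated in the source, and the function names are illustrative.

```python
FAILURE_RATE = 0.12  # approximate per-call failure rate (from the text)


def expected_failures(calls, p=FAILURE_RATE):
    """Expected number of failed tool calls in a run with no retries."""
    return calls * p


def clean_run_probability(calls, retries=0, p=FAILURE_RATE):
    """Probability that every call eventually succeeds.

    Each call gets `retries` extra attempts; failures are assumed
    independent, which real agent runs may not satisfy.
    """
    per_call_success = 1 - p ** (retries + 1)
    return per_call_success ** calls


for calls in (200, 300):
    print(
        f"{calls} calls: ~{expected_failures(calls):.0f} expected failures, "
        f"P(all succeed, no retries) = {clean_run_probability(calls):.2e}, "
        f"P(all succeed, 2 retries) = {clean_run_probability(calls, retries=2):.2f}"
    )
```

Under these assumptions, a 200-call run almost never completes without at least one failed call, so an agent loop built on this mode benefits from retry or recovery handling.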
On benchmarks, Kimi K2.5 scores 96.1% on AIME 2025, 95.4% on HMMT 2025, 87.1% on MMLU-Pro, 76.8% on SWE-bench Verified, 85.0% on LiveCodeBench v6, and 60.6% on BrowseComp (rising to 78.4% with Agent Swarm). For multimodal tasks, it scores 78.5% on MMMU-Pro, 84.2% on MathVision, and 86.6% on VideoMMMU.
The model is released under a modified MIT license: it is open-weight but not fully open-source, since the training data and training code remain proprietary. For self-hosting, reported memory footprints are approximately 2TB at FP16, 630GB at INT4 (fitting on 8x A100/H100/H200 GPUs), 375GB at 2-bit, and 240GB at 1.58-bit, with the 1.58-bit variant running at only 1-2 tokens per second. Attention layers remain in BF16 even under INT4 quantization, and the actual VRAM requirement in that configuration is reported as approximately 549GB. Hobbyist reports indicate approximately 15 tokens per second on dual M3 Ultra Macs with extreme quantization.
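The self-hosting figures above can be sanity-checked against a simple lower bound: total parameters times bits per weight. The sketch below computes that bound and prints it next to the figures reported in the text. It counts raw weight storage only, using decimal gigabytes, so it ignores KV cache, activations, quantization metadata, and any layers kept at higher precision.

```python
TOTAL_PARAMS = 1e12  # ~1T total parameters (from the text)

# Approximate footprints reported in the text, in GB, keyed by bits per weight.
REPORTED_GB = {16: 2000, 4: 630, 2: 375, 1.58: 240}


def weight_memory_gb(bits_per_weight, params=TOTAL_PARAMS):
    """Lower bound on memory: raw weight storage only, in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


for bits, reported in REPORTED_GB.items():
    naive = weight_memory_gb(bits)
    print(f"{bits:>5} bits: naive {naive:7.1f} GB, reported ~{reported} GB")
```

The FP16 figure matches the naive bound, while the quantized figures sit above it. That gap is consistent with some tensors staying at higher precision, as the text notes for attention layers, though the source does not break the overhead down.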
API pricing is approximately $0.60 per million input tokens and $3.00 per million output tokens, making it one of the most affordable trillion-parameter models available. The model is available on HuggingFace for download and self-hosting.
Known caveats include verbose outputs that can inflate token costs, routing randomness causing some inconsistency between runs, undisclosed training data composition, and no SOC 2 or ISO compliance certifications. Despite these limitations, the combination of massive scale, native multimodality, flexible operating modes, open-weight availability, and strong benchmark performance makes Kimi K2.5 one of the most capable open models for complex coding tasks, multi-agent workflows, mathematical reasoning, and visual understanding.
Pricing
| Metric | Price |
|---|---|
| Input /1M tokens | ₹52.00 |
| Output /1M tokens | ₹230.00 |
1 credit = ₹1 = $0.01 USD. Prices are based on the provider's rates; CallMissed passes them through with a ~35% markup.
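As a rough guide to what a single request costs, the per-million-token prices in the table can be applied directly. This sketch assumes the table prices are what is actually charged and that billing is linear in token count. The token counts in the example are made up for illustration.

```python
INPUT_INR_PER_M = 52.0    # ₹ per 1M input tokens (from the pricing table)
OUTPUT_INR_PER_M = 230.0  # ₹ per 1M output tokens (from the pricing table)


def request_cost_inr(input_tokens, output_tokens):
    """Estimated cost of one request in rupees (1 credit = ₹1)."""
    return (
        input_tokens * INPUT_INR_PER_M + output_tokens * OUTPUT_INR_PER_M
    ) / 1_000_000


# Hypothetical request: 20K tokens of context in, 4K tokens generated.
cost = request_cost_inr(input_tokens=20_000, output_tokens=4_000)
print(f"estimated cost: ₹{cost:.2f} ({cost:.2f} credits)")
```

Output tokens cost more than four times as much as input tokens, so the verbose outputs mentioned among the caveats tend to dominate the bill.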
Key Highlights
- 1 trillion total parameters, 32B active per token
- Agent Swarm mode: up to 100 parallel sub-agents, 4.5x speedup on complex tasks
- Native multimodal: trained on 15T mixed visual+text tokens
- Open-weight model available on HuggingFace
Benchmarks
| Benchmark | Score |
|---|---|
| AIME 2025 | 96.1% |
| HMMT 2025 | 95.4% |
| MMLU-Pro | 87.1% |
| SWE-bench Verified | 76.8% |
| LiveCodeBench v6 | 85.0% |
| BrowseComp | 60.6% |
| BrowseComp (Agent Swarm) | 78.4% |
| MMMU-Pro | 78.5% |
| MathVision | 84.2% |
| VideoMMMU | 86.6% |
Technical Details
- Architecture: 1T total params, 61 layers (1 dense + 60 MoE), 384 experts per MoE layer
- Router activates top 8 experts + 1 shared expert per token (~3.2% of params active)
- Pretrained on 15 trillion combined vision and text tokens
- MoonViT: 400M-parameter vision encoder embedded in architecture
- Multi-Head Latent Attention (MLA): reduces KV cache by ~10x, enables 256K context
- Four modes: Instant, Thinking, Agent, Agent Swarm (100 sub-agents, 4.5x speedup)
- ~12% tool-call failure rate
- Modified MIT license (open-weight, not fully open-source)
- INT4 quantization: ~630 GB, needs 8x A100/H100/H200 GPUs
- API pricing: ~$0.60/M input, ~$3.00/M output
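One way to read the Agent Swarm figures in the list above (up to 100 sub-agents, 4.5x speedup) is through Amdahl's law. The sketch below computes what fraction of the work would have to be serial for 100 workers to yield a 4.5x speedup. Applying Amdahl's law here is an assumption made for illustration. The source does not explain how the speedup was measured or where the overhead comes from.

```python
def amdahl_speedup(serial_fraction, workers):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def implied_serial_fraction(speedup, workers):
    """Invert Amdahl's law: the serial fraction that yields `speedup`."""
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)


# Figures from the text: up to 100 sub-agents, 4.5x speedup.
serial = implied_serial_fraction(speedup=4.5, workers=100)
print(f"implied serial fraction: {serial:.1%}")
print(f"round-trip check: {amdahl_speedup(serial, 100):.2f}x")
```

Under this reading, roughly a fifth of a task stays sequential (planning, coordination, merging results), which is why 100 agents do not produce anything close to a 100x speedup.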
Strengths
- 1T parameter MoE — one of the largest open-weight models available
- Agent Swarm mode enables 100 parallel agents for 4.5x speedup
- Native multimodal trained on 15T mixed tokens — not bolted-on vision
- Open-weight on HuggingFace — can be self-hosted and fine-tuned
- Strong coding performance: 76.8% SWE-bench Verified, 85.0% LiveCodeBench v6
Limitations
- The listed 128K context window (below the 256K the architecture supports) is smaller than 1M-context competitors
- 1T total parameters require significant infrastructure for self-hosting
- Newer model with less production track record than OpenAI/Anthropic
- Agent Swarm mode may not be available through all API providers
API Example
curl https://api.callmissed.com/v1/chat/completions \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "kimi-k2.5", "messages": [{"role": "user", "content": "Build a React dashboard component"}]}'
Endpoint: POST /v1/chat/completions · Model ID: kimi-k2.5
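The same request can be built from Python using only the standard library. This is a minimal sketch: the URL, model ID, header, and message shape are taken from the curl example above, while `build_request` and `send` are hypothetical helper names. The response format is not documented in the source, so `send` returns the decoded JSON without assuming any particular fields.

```python
import json
import urllib.request

API_URL = "https://api.callmissed.com/v1/chat/completions"  # from the curl example


def build_request(api_key, prompt, model="kimi-k2.5"):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request makes no network call; call send(req) to execute it.
req = build_request("cm_YOUR_KEY", "Build a React dashboard component")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any credits are spent.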