How much does Kimi K2.5 cost?

Kimi K2.5 costs $0.81/1M tokens for input and $4.05/1M tokens for output on CallMissed. 1 credit = ₹1 = $0.01 USD.

How do I use Kimi K2.5 via API?

Send a POST request to /v1/chat/completions with model "kimi-k2.5" and your API key in the Authorization header. CallMissed uses the OpenAI-compatible format, so existing OpenAI client code only needs its base URL and model field changed.

What is the context window of Kimi K2.5?

Kimi K2.5 supports a 128K token context window with up to 16K output tokens.

LLM Chat

Kimi K2.5

by Moonshot · Released January 27, 2026

Moonshot AI's flagship model. A 1 trillion parameter MoE model with 32B active parameters per token. Native multimodal, trained on 15 trillion mixed visual and text tokens. Supports Thinking, Instant, Agent, and Agent Swarm (100 parallel agents) modes.

Context Window

128K

Parameters

1T total / 32B active (MoE)

Max Output

16K

Overview

Kimi K2.5 is Moonshot AI's flagship model, featuring a massive 1 trillion total parameter Mixture-of-Experts architecture with 61 layers — 1 dense layer followed by 60 MoE layers, each containing 384 expert networks. The router activates the top 8 experts plus 1 shared expert per token, meaning only approximately 3.2% of the total parameters are active at any given time (roughly 32B active parameters). The model was pretrained on 15 trillion combined vision and text tokens, making it natively multimodal from the ground up rather than having vision bolted on as an afterthought.
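The routing step described above can be sketched in a few lines. This is a toy illustration, not Moonshot's implementation: the expert count (384) and top-8 selection come from the text, while the function name `route_token`, the random logits, and the softmax-over-selected-experts gating are assumptions made for the example. The shared expert is not routed at all; it would simply be applied to every token alongside the eight selected experts.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts per MoE layer (from the text)
TOP_K = 8          # routed experts activated per token (from the text)

def route_token(router_logits, top_k=TOP_K):
    """Select the top_k experts for one token and normalise their gate weights.

    Softmax over the selected logits is an assumed gating function; the page
    does not say how Kimi K2.5 normalises router scores.
    """
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax restricted to the chosen experts (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
weights = route_token(logits)

# Active share quoted in the text: 32B active out of 1T total parameters.
active_fraction = 32e9 / 1e12
print(f"experts used: {len(weights)}, active share: {active_fraction:.1%}")
```

Only the eight selected experts (plus the always-on shared expert) run for a given token, which is why the active parameter count stays near 32B even though the full model holds 1T parameters.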

The architecture incorporates MoonViT, a 400-million-parameter vision encoder built directly into the model, and Multi-Head Latent Attention (MLA), which compresses KV pairs into a shared latent representation and reduces the KV cache by approximately 10x compared to standard multi-head attention. This cache reduction is what makes a 256K-token context window practical at this scale. Note that the context window listed for CallMissed elsewhere on this page is 128K.

Kimi K2.5 supports four distinct operating modes, each optimized for different use cases. Instant mode suppresses reasoning traces and uses approximately 75% fewer tokens, ideal for fast direct responses. Thinking mode enables full reasoning traces and achieves 96.1% on AIME 2025. Agent mode handles 200-300 tool calls without losing track of the task, though it has an approximately 12% failure rate on tool calls. Agent Swarm mode orchestrates up to 100 sub-agents working in parallel, delivering a 4.5x speedup on complex tasks and boosting BrowseComp from 60.6% to 78.4%.
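To put the Agent mode figures in perspective, the arithmetic below estimates how many tool calls would fail in a 200-300 call session at a ~12% failure rate, and how quickly retries shrink the residual failure chance. It is a back-of-the-envelope sketch: the function names are hypothetical, and treating failures as independent of each other is an assumption the page does not confirm.

```python
FAILURE_RATE = 0.12  # approximate tool-call failure rate quoted in the text

def expected_failures(num_calls, failure_rate=FAILURE_RATE):
    """Expected number of failed tool calls in a session of num_calls calls."""
    return num_calls * failure_rate

def failure_after_retries(retries, failure_rate=FAILURE_RATE):
    """Chance a call still fails after the first attempt plus `retries` retries.

    Assumes each attempt fails independently, which may not hold in practice.
    """
    return failure_rate ** (retries + 1)

low = expected_failures(200)
high = expected_failures(300)
print(f"expected failed calls per session: {low:.0f} to {high:.0f}")
for retries in (0, 1, 2):
    print(f"{retries} retries -> residual failure chance {failure_after_retries(retries):.4f}")
```

Under these assumptions a long agent session should expect a couple of dozen failed calls, so client code that drives Agent mode would likely want some form of retry or fallback handling.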

On benchmarks, Kimi K2.5 delivers strong results across the board: AIME 2025 at 96.1%, HMMT 2025 at 95.4%, MMLU-Pro at 87.1%, SWE-Bench Verified at 76.8%, LiveCodeBench v6 at 85%, and BrowseComp at 60.6% (rising to 78.4% with Agent Swarm). For multimodal tasks, it scores 78.5% on MMMU-Pro, 84.2% on MathVision, and 86.6% on VideoMMMU.

The model is released under a modified MIT license — open-weight but not fully open-source, as training data and training code remain proprietary. For self-hosting, FP16 requires approximately 2TB of memory, INT4 quantization brings this down to approximately 630GB (fitting on 8x A100/H100/H200 GPUs), 2-bit quantization needs approximately 375GB, and 1.58-bit extreme quantization fits in approximately 240GB but runs at only 1-2 tokens per second. Attention layers remain in BF16 even with INT4 quantization, bringing the actual VRAM requirement to approximately 549GB. Hobbyist reports indicate approximately 15 tokens per second on dual M3 Ultra Macs with extreme quantization.
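A quick sanity check on the self-hosting figures: multiplying the parameter count by the bits per parameter gives a weights-only lower bound on memory. The sketch below compares that bound with the sizes quoted above. The helper name is hypothetical, and the bound ignores KV cache, activations, and any layers kept at higher precision, which is presumably why the quoted sizes for quantized variants sit above it.

```python
TOTAL_PARAMS = 1e12  # 1 trillion parameters (from the text)

def weights_only_gb(bits_per_param, total_params=TOTAL_PARAMS):
    """Lower bound on memory: parameters x bits, converted to gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

# Sizes quoted on this page, in GB, keyed by bits per parameter.
quoted_gb = {16: 2000, 4: 630, 2: 375, 1.58: 240}

for bits, quoted in quoted_gb.items():
    floor = weights_only_gb(bits)
    print(f"{bits:>5}-bit: weights-only bound ~{floor:.0f} GB, page quotes ~{quoted} GB")
```

The FP16 figure matches the bound (1T parameters at 2 bytes each is 2 TB), while the quantized figures exceed it, consistent with some tensors staying in a wider format.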

Provider API pricing is approximately $0.60 per million input tokens and $3.00 per million output tokens, making it one of the most affordable trillion-parameter models available. CallMissed's own rates, which include its markup, are listed under Pricing. The model weights are available on HuggingFace for download and self-hosting.

Known caveats include verbose outputs that can inflate token costs, routing randomness causing some inconsistency between runs, undisclosed training data composition, and no SOC 2 or ISO compliance certifications. Despite these limitations, the combination of massive scale, native multimodality, flexible operating modes, open-weight availability, and strong benchmark performance makes Kimi K2.5 one of the most capable open models for complex coding tasks, multi-agent workflows, mathematical reasoning, and visual understanding.

Pricing

Metric	Price
Input /1M tokens	₹81.0000
Output /1M tokens	₹405.0000

1 credit = ₹1 = $0.01 USD. Prices shown are the provider's rates with CallMissed's ~35% markup applied (approximately $0.60 and $3.00 per 1M tokens at the provider become $0.81 and $4.05 here).
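The per-request cost follows directly from the table. The sketch below converts token counts to credits and US dollars using the rates on this page; the function names and the sample token counts are illustrative assumptions, and actual billing may round differently.

```python
INPUT_CREDITS_PER_M = 81.0    # credits (₹) per 1M input tokens, from the table
OUTPUT_CREDITS_PER_M = 405.0  # credits (₹) per 1M output tokens, from the table
USD_PER_CREDIT = 0.01         # 1 credit = $0.01 USD, per this page

def request_cost_credits(input_tokens, output_tokens):
    """Cost of one request in credits, before any rounding the billing system applies."""
    return (input_tokens * INPUT_CREDITS_PER_M
            + output_tokens * OUTPUT_CREDITS_PER_M) / 1_000_000

def credits_to_usd(credits):
    """Convert credits to US dollars at the page's stated rate."""
    return credits * USD_PER_CREDIT

# Hypothetical request: 2,000 prompt tokens and 500 completion tokens.
cost = request_cost_credits(2_000, 500)
print(f"{cost:.4f} credits (${credits_to_usd(cost):.6f})")

# Cross-check the ~35% markup against the provider prices quoted on this page.
markup_input_usd = 0.60 * 1.35
markup_output_usd = 3.00 * 1.35
print(f"marked-up provider rates: ${markup_input_usd:.2f} in, ${markup_output_usd:.2f} out per 1M tokens")
```

Because output tokens cost five times as much as input tokens, verbose responses dominate the bill, so capping output length is the most direct cost lever.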

Key Highlights

1 trillion total parameters, 32B active per token
Agent Swarm mode: 100 parallel AI agents, 4.5x faster
Native multimodal: trained on 15T mixed visual+text tokens
Open-weight model available on HuggingFace

Benchmarks

Benchmark	Score	Notes
AIME 2025	96.1%	Competition mathematics
HMMT 2025	95.4%	Competition mathematics
MMLU-Pro	87.1%	Professional knowledge
SWE-bench Verified	76.8%	Real-world software engineering
LiveCodeBench v6	85%	Live competitive programming
BrowseComp	78.4%	60.6% standard, 78.4% with Agent Swarm
MMMU-Pro	78.5%	Multimodal understanding
MathVision	84.2%	Visual math reasoning
VideoMMMU	86.6%	Video understanding

Technical Details

Architecture: 1T total params, 61 layers (1 dense + 60 MoE), 384 experts per MoE layer
Router activates top 8 experts + 1 shared expert per token (~3.2% of params active)
Pretrained on 15 trillion combined vision and text tokens
MoonViT: 400M-parameter vision encoder embedded in architecture
Multi-Head Latent Attention (MLA): reduces KV cache by ~10x, enables 256K context
Four modes: Instant, Thinking, Agent, Agent Swarm (100 sub-agents, 4.5x speedup)
~12% tool-call failure rate
Modified MIT license (open-weight, not fully open-source)
INT4 quantization: ~630 GB, needs 8x A100/H100/H200 GPUs
API pricing: ~$0.60/M input, ~$3.00/M output

Strengths

1T parameter MoE — one of the largest open-weight models available
Agent Swarm mode enables 100 parallel agents for 4.5x speedup
Native multimodal trained on 15T mixed tokens — not bolted-on vision
Open-weight on HuggingFace — can be self-hosted and fine-tuned
Strong coding performance: 76.8% SWE-bench, 85.0% LiveCodeBench

Limitations

128K context is smaller than 1M-context competitors
The 1T total parameter count requires significant infrastructure for self-hosting
Newer model with less production track record than OpenAI/Anthropic
Agent Swarm mode may not be available through all API providers

Use Cases

Complex coding tasks
Multi-agent workflows
Visual understanding
Long-form content generation

API Example

curl https://api.callmissed.com/v1/chat/completions \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "kimi-k2.5", "messages": [{"role": "user", "content": "Build a React dashboard component"}]}'

Endpoint: POST /v1/chat/completions · Model ID: kimi-k2.5
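The same request can be built from Python using only the standard library. This is a minimal sketch, assuming the endpoint accepts a JSON body with a bearer token as in the curl example. The helper names `build_chat_request` and `send` are hypothetical, and the response shape is assumed to follow the OpenAI chat-completions format because the page describes the API as OpenAI-compatible.

```python
import json
import urllib.request

BASE_URL = "https://api.callmissed.com/v1"  # base URL taken from the curl example

def build_chat_request(api_key, prompt, model="kimi-k2.5"):
    """Build (but do not send) a chat-completions request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def send(request, timeout=60):
    """Send the request and decode the JSON reply (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Building the request is side-effect free; nothing is sent until send() is called.
request = build_chat_request("cm_YOUR_KEY", "Build a React dashboard component")
print(request.get_method(), request.full_url)
```

With a real key, `send(request)["choices"][0]["message"]["content"]` would return the reply text, assuming the response follows the OpenAI schema.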

Try Kimi K2.5 now

Get 1000 free API credits on signup. No credit card required.
