Update Note: This is a follow-up to my previous article “Finding the Perfect Partner for Your AI Agents - oh-my-opencode Model Selection Guide”. Back then, I configured the system based on 5 models: GLM-4.7, GLM-5, MiniMax-M2.5, DeepSeek-V3.2, and Kimi-K2.5. Recently, I added 4 Qwen series models (qwen3.5-plus, qwen3-max, qwen3-coder-next, qwen3-coder-plus), prompting a fresh re-evaluation of the optimal model for each Agent.
Why Re-evaluate?

After publishing the previous article, my model subscription changed: Alibaba Cloud Bailian added the Qwen series. The benchmark highlights for the new models are impressive:
- qwen3-max: GPQA 86.1% (scientific reasoning ceiling), LiveCodeBench 91.4% (coding reasoning leader)
- qwen3-coder-next: SWE-bench 70.6% + $0.12/M (value champion)
- qwen3.5-plus: 1M context + multimodal (long document powerhouse)
This data compelled me to re-examine my previous configuration. After deep research and benchmark analysis, I arrived at this new configuration scheme.
Why “Role-Based Allocation”?
oh-my-opencode has an interesting architecture that splits workflows into specialized Agents:
- Sisyphus: The conductor, orchestrates tasks and delegates work
- Hephaestus: Deep worker, executes tasks end-to-end
- Oracle: Complex debugging, architecture consultant
- Librarian: Documentation retrieval, external library queries
- Explore: Codebase search specialist
- Metis: Pre-planning consultant, identifies implicit intents
- Momus: Plan reviewer
- Prometheus: Strategic planner
- Multimodal-looker: Image/video analysis
- Atlas: Main model for UI interactions
Each Agent has different responsibilities and requires different model capabilities. Just as in a team, where some excel at design, others at coding, and others at documentation, models should be assigned to the roles they fit best.
What Changed This Time?
Compared to the previous configuration, here are the core changes:
| Scenario | Previous Config | New Config | Reason |
|---|---|---|---|
| Deep Reasoning (Oracle, Prometheus, Ultrabrain) | GLM-5 / MiniMax-M2.5 | qwen3-max | GPQA 86.1% + LiveCodeBench 91.4% is the current ceiling |
| High-frequency Coding (Explore, Quick, Hephaestus) | DeepSeek-V3.2 / GLM-4.7 / MiniMax-M2.5 | qwen3-coder-next | $0.12/M + 151.5 tok/s blazing fast |
| Multimodal/Long Docs (Librarian, Metis, Atlas) | DeepSeek-V3.2 / Kimi-K2.5 | qwen3.5-plus | 1M context + native multimodal |
| Conductor (Sisyphus) | GLM-5 | Keep GLM-5 | Low hallucination is still most important |
| Plan Review (Momus) | MiniMax-M2.5 | Keep MiniMax-M2.5 | SWE-bench 80.2% still highest |
| Video Analysis (Multimodal-looker) | Kimi-K2.5 | Keep Kimi-K2.5 | Video understanding is irreplaceable |
In short: Reasoning scenarios upgrade to qwen3-max, coding scenarios switch to qwen3-coder-next, long document scenarios use qwen3.5-plus, while three specialists remain unchanged.
My Complete Model Lineup

Before diving into allocation, let me show you all the “players” I have:
| Model | Context | Multimodal | Pricing ($/1M in/out) | Key Strengths |
|---|---|---|---|---|
| GLM-4.7 | 202K | No | $0.60/$2.20 | Math 92%, Coding 84.9%, Balanced |
| GLM-5 | 202K | No | $1.00/$3.20 | Low hallucination, Agent SOTA, Complex reasoning |
| DeepSeek-V3.2 | 262K | No | $0.28/$0.42 | Ultra cheap, Math 94.17%, Deep reasoning |
| MiniMax-M2.5 | 196K | No | $0.30/$1.20 | SWE-bench 80.2%, Fast |
| Kimi-K2.5 | 262K | Yes | N/A | Strongest multimodal, Video understanding |
| qwen3.5-plus | 1M | Yes | $0.12-0.26/$0.29-1.56 | 1M context, Multimodal, Great value |
| qwen3-max | 262K | No | $0.96-2.40/$4.80-12.00 | GPQA 86.1%, LiveCodeBench 91.4% |
| qwen3-coder-next | 256K | No | $0.12-0.14/$0.30-0.42 | SWE-bench 70.6%, Blazing fast |
| qwen3-coder-plus | 1M | No | $0.65-1.00/$3.25-5.00 | SWE-bench 69.6%, 1M context |
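For readers who want to slice this table programmatically, here is the lineup as a small Python structure with a helper that filters by context window and multimodality. Prices use the low end of each published range, Kimi-K2.5 is omitted because its pricing isn't listed, and the helper itself is purely illustrative:

```python
# Model lineup from the table above; prices are the low end of each range, in $/1M tokens.
LINEUP = {
    "GLM-4.7":          {"context_k": 202,  "multimodal": False, "in_price": 0.60, "out_price": 2.20},
    "GLM-5":            {"context_k": 202,  "multimodal": False, "in_price": 1.00, "out_price": 3.20},
    "DeepSeek-V3.2":    {"context_k": 262,  "multimodal": False, "in_price": 0.28, "out_price": 0.42},
    "MiniMax-M2.5":     {"context_k": 196,  "multimodal": False, "in_price": 0.30, "out_price": 1.20},
    "qwen3.5-plus":     {"context_k": 1000, "multimodal": True,  "in_price": 0.12, "out_price": 0.29},
    "qwen3-max":        {"context_k": 262,  "multimodal": False, "in_price": 0.96, "out_price": 4.80},
    "qwen3-coder-next": {"context_k": 256,  "multimodal": False, "in_price": 0.12, "out_price": 0.30},
    "qwen3-coder-plus": {"context_k": 1000, "multimodal": False, "in_price": 0.65, "out_price": 3.25},
}

def candidates(min_context_k=0, multimodal=None):
    """Models meeting the constraints, cheapest input price first."""
    picks = [
        name for name, m in LINEUP.items()
        if m["context_k"] >= min_context_k
        and (multimodal is None or m["multimodal"] == multimodal)
    ]
    return sorted(picks, key=lambda n: LINEUP[n]["in_price"])

print(candidates(min_context_k=500))  # long-document candidates
print(candidates(multimodal=True))    # multimodal candidates (Kimi-K2.5 excluded, no pricing)
```

Running it surfaces the same conclusions the rest of this article reaches by hand: the 1M-context candidates are qwen3.5-plus and qwen3-coder-plus, with qwen3.5-plus the cheaper of the two.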
Key Discoveries: The New Contenders
qwen3-max: The New “Reasoning King”
This was my biggest discovery from this research:
- GPQA 86.1% → strongest scientific reasoning in publicly available data
- LiveCodeBench v6 91.4% → coding reasoning ceiling
- Test-time scaling + early-stop detection → automatically determines when to stop thinking, no wasted compute
What does this mean? If you need deep reasoning, architecture analysis, or complex debugging, qwen3-max is currently the best choice.
qwen3-coder-next: The Value Champion
- SWE-bench Verified 70.6% → within ten points of MiniMax-M2.5's 80.2%
- Output speed 151.5 tokens/sec → 2nd in its class
- Response time 11.68 seconds → vs MiniMax-M2.5's 43.03 seconds
- Price $0.12/M input → cheaper than any competitor in this lineup
What does this mean? If you need high-frequency calls and fast responses (code exploration, quick fixes), qwen3-coder-next is the obvious choice.
qwen3.5-plus: Multimodal + Long Context New Option
- 1M context → largest context window in this lineup
- Native multimodal → supports text and image input (video is not yet supported)
- Apache 2.0 open source → self-deployable
- Competitive pricing → $0.12-0.26/M input
What does this mean? For scenarios requiring long documents or multimodal content (document retrieval, UI interaction, writing), qwen3.5-plus is an ideal choice.
Core Configuration Strategy
After careful consideration, I established these configuration principles:
1. The Conductor Needs Low Hallucination
Sisyphus → GLM-5
Why not qwen3-max? Because for the conductor, reliability trumps reasoning depth. GLM-5's hallucination rate is 56% lower than GLM-4.7's, making it more dependable for task orchestration.
2. Deep Reasoning Scenarios Use the Strongest Reasoning
Prometheus, Oracle, Ultrabrain → qwen3-max
These scenarios require deep reasoning: strategic planning, architecture consulting, complex logic analysis. qwen3-max’s GPQA 86.1% and LiveCodeBench 91.4% are currently the ceiling.
3. High-frequency Coding Scenarios Use Speed Models
Explore, Quick, Deep, Hephaestus → qwen3-coder-next
These scenarios have high call frequency and need fast responses. qwen3-coder-next’s 151.5 tokens/sec and $0.12/M price make it the best choice.
4. Multimodal and Long Context Use the All-rounder
Librarian, Metis, Atlas, Visual-engineering, Artistry, Writing → qwen3.5-plus
These scenarios require processing long documents or multimodal content. qwen3.5-plus’s 1M context and native multimodal support are key.
5. Special Scenarios Retain Specialists
Momus → MiniMax-M2.5 (SWE-bench 80.2% is still the highest score here, and plan review needs coding accuracy)
Multimodal-looker → Kimi-K2.5 (video understanding; qwen3.5-plus doesn't support video yet)
Final Configuration
Agents Configuration
| Agent | Model | Core Capability | Role |
|---|---|---|---|
| sisyphus | GLM-5 | Low hallucination, Agent SOTA | Conductor (high-reliability orchestration) |
| prometheus | qwen3-max | GPQA 86.1%, Deep thinking | Strategic planning |
| oracle | qwen3-max | LiveCodeBench 91.4% | Architecture consulting, complex debugging |
| metis | qwen3.5-plus | 1M context, Multimodal | Intent analysis |
| momus | MiniMax-M2.5 | SWE-bench 80.2% | Plan review |
| hephaestus | qwen3-coder-next | 70.6% SWE-bench, Fast | Deep work |
| librarian | qwen3.5-plus | 1M context | Document retrieval |
| explore | qwen3-coder-next | 151.5 tok/s | Code exploration |
| atlas | qwen3.5-plus | Multimodal | UI interaction |
| multimodal-looker | Kimi-K2.5 | Video understanding | Image/video analysis |
Categories Configuration
| Category | Model | Scenario |
|---|---|---|
| ultrabrain | qwen3-max | Complex logic tasks |
| unspecified-high | qwen3-max | Complex tasks |
| deep | qwen3-coder-next | Deep autonomous work |
| quick | qwen3-coder-next | Quick modifications |
| unspecified-low | qwen3-coder-next | Simple tasks |
| visual-engineering | qwen3.5-plus | Frontend UI |
| artistry | qwen3.5-plus | Creative tasks |
| writing | qwen3.5-plus | Long document writing |
Configuration Code
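For reference, the allocation above can be written down as a single config fragment. The exact file location and schema depend on your oh-my-opencode version, and the provider prefixes below (`zhipu/`, `qwen/`, `minimax/`, `moonshot/`) are placeholders, so check the project's documentation for the real model identifiers on your subscription:

```json
{
  "agents": {
    "sisyphus":          { "model": "zhipu/glm-5" },
    "prometheus":        { "model": "qwen/qwen3-max" },
    "oracle":            { "model": "qwen/qwen3-max" },
    "metis":             { "model": "qwen/qwen3.5-plus" },
    "momus":             { "model": "minimax/minimax-m2.5" },
    "hephaestus":        { "model": "qwen/qwen3-coder-next" },
    "librarian":         { "model": "qwen/qwen3.5-plus" },
    "explore":           { "model": "qwen/qwen3-coder-next" },
    "atlas":             { "model": "qwen/qwen3.5-plus" },
    "multimodal-looker": { "model": "moonshot/kimi-k2.5" }
  },
  "categories": {
    "ultrabrain":         { "model": "qwen/qwen3-max" },
    "unspecified-high":   { "model": "qwen/qwen3-max" },
    "deep":               { "model": "qwen/qwen3-coder-next" },
    "quick":              { "model": "qwen/qwen3-coder-next" },
    "unspecified-low":    { "model": "qwen/qwen3-coder-next" },
    "visual-engineering": { "model": "qwen/qwen3.5-plus" },
    "artistry":           { "model": "qwen/qwen3.5-plus" },
    "writing":            { "model": "qwen/qwen3.5-plus" }
  }
}
```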
Model Distribution Visualization
Across the 18 assignments (10 Agents + 8 categories), the distribution works out to:
- qwen3.5-plus: 6 (metis, librarian, atlas, visual-engineering, artistry, writing)
- qwen3-coder-next: 5 (hephaestus, explore, deep, quick, unspecified-low)
- qwen3-max: 4 (prometheus, oracle, ultrabrain, unspecified-high)
- GLM-5: 1 (sisyphus)
- MiniMax-M2.5: 1 (momus)
- Kimi-K2.5: 1 (multimodal-looker)
Lessons Learned
1. Don’t Blindly Chase “Latest and Greatest”
qwen3-max’s GPQA 86.1% is impressive, but GLM-5’s low hallucination characteristic is more important for orchestration scenarios. Choose based on actual needs, not just benchmark scores.
2. Optimize High-frequency Scenarios Separately
Explore and Quick are high-frequency Agents. Using expensive models would quickly deplete quotas. qwen3-coder-next’s $0.12/M eliminates this concern entirely.
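To make the cost argument concrete, here is a back-of-the-envelope comparison using the prices from the lineup table. The workload itself (1,000 calls averaging 2K input and 1K output tokens) is a hypothetical exploration session, not a measurement:

```python
def cost_usd(calls, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Total cost in USD for `calls` requests, given per-million-token prices."""
    return calls * (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Hypothetical high-frequency workload: 1,000 calls, ~2K tokens in / ~1K tokens out each.
workload = (1_000, 2_000, 1_000)

# Prices from the lineup table ($/1M input, $/1M output).
coder_next = cost_usd(*workload, 0.12, 0.30)  # qwen3-coder-next
glm_5      = cost_usd(*workload, 1.00, 3.20)  # GLM-5, for comparison

print(f"qwen3-coder-next: ${coder_next:.2f}")  # $0.54
print(f"GLM-5:            ${glm_5:.2f}")       # $5.20
```

Under these assumptions the same session costs roughly ten times less on qwen3-coder-next, which is why the high-frequency Agents get the cheap, fast model rather than the smartest one.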
3. Special Capabilities Need Specialists
Video understanding is currently only supported by Kimi-K2.5, and plan review needs MiniMax-M2.5’s SWE-bench highest score. These special scenarios can’t be replaced by “general-purpose strong models.”
4. Testing Beats Theory
After configuration, I recommend testing these typical scenarios:
- Code search (triggers Explore)
- Document retrieval (triggers Librarian)
- Visual analysis (triggers Multimodal-looker)
- Complex architecture design (triggers Oracle or Ultrabrain)
Conclusion
This adjustment is a comprehensive upgrade to my previous configuration. "There's no best model, only the most suitable model" still holds true, but as new models join the lineup, the answer to "most suitable" changes.
| Scenario | Best Model | Key Advantage |
|---|---|---|
| Conductor orchestration | GLM-5 | Low hallucination, high reliability |
| Deep reasoning | qwen3-max | GPQA 86.1%, LiveCodeBench 91.4% |
| High-frequency coding | qwen3-coder-next | $0.12/M, 151.5 tok/s |
| Multimodal/Long docs | qwen3.5-plus | 1M context, native multimodal |
| Plan review | MiniMax-M2.5 | SWE-bench 80.2% highest |
| Video analysis | Kimi-K2.5 | Video understanding capability |
After completing this configuration, the system’s efficiency and effectiveness improved noticeably. Each Agent is doing what it excels at, and the collaboration is smoother.
If you’re also using oh-my-opencode, I recommend adjusting your configuration based on your use cases and available models. After all, finding the right partner is what doubles your productivity.