Update Note: This is a follow-up to my previous article “Finding the Perfect Partner for Your AI Agents - oh-my-opencode Model Selection Guide”. Back then, I configured the system based on 5 models: GLM-4.7, GLM-5, MiniMax-M2.5, DeepSeek-V3.2, and Kimi-K2.5. Recently, I added 4 Qwen series models (qwen3.5-plus, qwen3-max, qwen3-coder-next, qwen3-coder-plus), prompting a fresh re-evaluation of the optimal model for each Agent.
Why Re-evaluate?

After publishing the previous article, my model subscription changed: Alibaba Cloud Bailian added the Qwen series. The benchmark highlights for the new models are impressive:
- qwen3-max: GPQA 86.1% (scientific reasoning ceiling), LiveCodeBench 91.4% (coding reasoning leader)
- qwen3-coder-next: SWE-bench 70.6% + $0.12/M (value champion)
- qwen3.5-plus: 1M context + multimodal (long document powerhouse)
This data compelled me to re-examine my previous configuration. After deep research and benchmark analysis, I arrived at this new configuration scheme.
Why “Role-Based Allocation”?
oh-my-opencode has an interesting architecture that splits workflows into specialized Agents:
- Sisyphus: The conductor, orchestrates tasks and delegates work
- Hephaestus: Deep worker, executes tasks end-to-end
- Oracle: Complex debugging, architecture consultant
- Librarian: Documentation retrieval, external library queries
- Explore: Codebase search specialist
- Metis: Pre-planning consultant, identifies implicit intents
- Momus: Plan reviewer
- Prometheus: Strategic planner
- Multimodal-looker: Image/video analysis
- Atlas: Main model for UI interactions
Each Agent has different responsibilities and requires different model capabilities. Just as in a team, where some excel at design, others at coding, and others at documentation, models should be assigned to the roles they fit best.
What Changed This Time?
Compared to the previous configuration, here are the core changes:
| Scenario | Previous Config | New Config | Reason |
|---|---|---|---|
| Deep Reasoning (Oracle, Prometheus, Ultrabrain) | GLM-5 / MiniMax-M2.5 | qwen3-max | GPQA 86.1% + LiveCodeBench 91.4% is the current ceiling |
| High-frequency Coding (Explore, Quick, Hephaestus) | DeepSeek-V3.2 / GLM-4.7 / MiniMax-M2.5 | qwen3-coder-next | $0.12/M + 151.5 tok/s blazing fast |
| Multimodal/Long Docs (Librarian, Metis, Atlas) | DeepSeek-V3.2 / Kimi-K2.5 | qwen3.5-plus | 1M context + native multimodal |
| Conductor (Sisyphus) | GLM-5 | Keep GLM-5 | Low hallucination is still most important |
| Plan Review (Momus) | MiniMax-M2.5 | Keep MiniMax-M2.5 | SWE-bench 80.2% still highest |
| Video Analysis (Multimodal-looker) | Kimi-K2.5 | Keep Kimi-K2.5 | Video understanding is irreplaceable |
In short: Reasoning scenarios upgrade to qwen3-max, coding scenarios switch to qwen3-coder-next, long document scenarios use qwen3.5-plus, while three specialists remain unchanged.
My Complete Model Lineup

Before diving into allocation, let me show you all the “players” I have:
| Model | Context | Multimodal | Pricing ($/1M in/out) | Key Strengths |
|---|---|---|---|---|
| GLM-4.7 | 202K | No | $0.60/$2.20 | Math 92%, Coding 84.9%, Balanced |
| GLM-5 | 202K | No | $1.00/$3.20 | Low hallucination, Agent SOTA, Complex reasoning |
| DeepSeek-V3.2 | 262K | No | $0.28/$0.42 | Ultra cheap, Math 94.17%, Deep reasoning |
| MiniMax-M2.5 | 196K | No | $0.30/$1.20 | SWE-bench 80.2%, Fast |
| Kimi-K2.5 | 262K | Yes | N/A | Strongest multimodal, Video understanding |
| qwen3.5-plus | 1M | Yes | $0.12-0.26/$0.29-1.56 | 1M context, Multimodal, Great value |
| qwen3-max | 262K | No | $0.96-2.40/$4.80-12.00 | GPQA 86.1%, LiveCodeBench 91.4% |
| qwen3-coder-next | 256K | No | $0.12-0.14/$0.30-0.42 | SWE-bench 70.6%, Blazing fast |
| qwen3-coder-plus | 1M | No | $0.65-1.00/$3.25-5.00 | SWE-bench 69.6%, 1M context |
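For readers who want to slice this table programmatically, here is the lineup as a small Python structure with a helper that filters by context window and multimodality. Prices use the low end of each published range, Kimi-K2.5 is omitted because its pricing isn't listed, and the helper itself is purely illustrative:

```python
# Model lineup from the table above; prices are the low end of each range, in $/1M tokens.
LINEUP = {
    "GLM-4.7":          {"context_k": 202,  "multimodal": False, "in_price": 0.60, "out_price": 2.20},
    "GLM-5":            {"context_k": 202,  "multimodal": False, "in_price": 1.00, "out_price": 3.20},
    "DeepSeek-V3.2":    {"context_k": 262,  "multimodal": False, "in_price": 0.28, "out_price": 0.42},
    "MiniMax-M2.5":     {"context_k": 196,  "multimodal": False, "in_price": 0.30, "out_price": 1.20},
    "qwen3.5-plus":     {"context_k": 1000, "multimodal": True,  "in_price": 0.12, "out_price": 0.29},
    "qwen3-max":        {"context_k": 262,  "multimodal": False, "in_price": 0.96, "out_price": 4.80},
    "qwen3-coder-next": {"context_k": 256,  "multimodal": False, "in_price": 0.12, "out_price": 0.30},
    "qwen3-coder-plus": {"context_k": 1000, "multimodal": False, "in_price": 0.65, "out_price": 3.25},
}

def candidates(min_context_k=0, multimodal=None):
    """Models meeting the constraints, cheapest input price first."""
    picks = [
        name for name, m in LINEUP.items()
        if m["context_k"] >= min_context_k
        and (multimodal is None or m["multimodal"] == multimodal)
    ]
    return sorted(picks, key=lambda n: LINEUP[n]["in_price"])

print(candidates(min_context_k=500))  # long-document candidates
print(candidates(multimodal=True))    # multimodal candidates (Kimi-K2.5 excluded, no pricing)
```

Running it surfaces the same conclusions the rest of this article reaches by hand: the 1M-context candidates are qwen3.5-plus and qwen3-coder-plus, with qwen3.5-plus the cheaper of the two.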
Key Discoveries: The New Contenders
qwen3-max: The New “Reasoning King”
This was my biggest discovery from this research:
- GPQA 86.1% → strongest scientific reasoning in publicly available data
- LiveCodeBench v6 91.4% → coding reasoning ceiling
- Test-time scaling + early-stop detection → automatically determines when to stop thinking, no wasted compute
What does this mean? If you need deep reasoning, architecture analysis, or complex debugging, qwen3-max is currently the best choice.
qwen3-coder-next: The Value Champion
- SWE-bench Verified 70.6% → within ten points of MiniMax-M2.5's 80.2%
- Output speed 151.5 tokens/sec → 2nd in its class
- Response time 11.68 seconds → vs MiniMax-M2.5's 43.03 seconds
- Price $0.12/M input → cheaper than any competitor in this lineup
What does this mean? If you need high-frequency calls and fast responses (code exploration, quick fixes), qwen3-coder-next is the obvious choice.
qwen3.5-plus: Multimodal + Long Context New Option
- 1M context → largest context window in this lineup
- Native multimodal → supports text and image input (video is not yet supported)
- Apache 2.0 open source → self-deployable
- Competitive pricing → $0.12-0.26/M input
What does this mean? For scenarios requiring long documents or multimodal content (document retrieval, UI interaction, writing), qwen3.5-plus is an ideal choice.
Core Configuration Strategy
After careful consideration, I established these configuration principles:
1. The Conductor Needs Low Hallucination
Sisyphus → GLM-5
Why not qwen3-max? Because for the conductor, reliability trumps reasoning depth. GLM-5's hallucination rate is 56% lower than GLM-4.7's, making it more dependable for task orchestration.
2. Deep Reasoning Scenarios Use the Strongest Reasoning
Prometheus, Oracle, Ultrabrain → qwen3-max
These scenarios require deep reasoning: strategic planning, architecture consulting, complex logic analysis. qwen3-max’s GPQA 86.1% and LiveCodeBench 91.4% are currently the ceiling.
3. High-frequency Coding Scenarios Use Speed Models
Explore, Quick, Deep, Hephaestus → qwen3-coder-next
These scenarios have high call frequency and need fast responses. qwen3-coder-next’s 151.5 tokens/sec and $0.12/M price make it the best choice.
4. Multimodal and Long Context Use the All-rounder
Librarian, Metis, Atlas, Visual-engineering, Artistry, Writing → qwen3.5-plus
These scenarios require processing long documents or multimodal content. qwen3.5-plus’s 1M context and native multimodal support are key.
5. Special Scenarios Retain Specialists
Momus → MiniMax-M2.5 (SWE-bench 80.2% is still the highest score here, and plan review needs coding accuracy)
Multimodal-looker → Kimi-K2.5 (video understanding; qwen3.5-plus doesn't support video yet)
Final Configuration
Agents Configuration
| Agent | Model | Core Capability | Role |
|---|---|---|---|
| sisyphus | GLM-5 | Low hallucination, Agent SOTA | Conductor (high-reliability orchestration) |
| prometheus | qwen3-max | GPQA 86.1%, Deep thinking | Strategic planning |
| oracle | qwen3-max | LiveCodeBench 91.4% | Architecture consulting, complex debugging |
| metis | qwen3.5-plus | 1M context, Multimodal | Intent analysis |
| momus | MiniMax-M2.5 | SWE-bench 80.2% | Plan review |
| hephaestus | qwen3-coder-next | 70.6% SWE-bench, Fast | Deep work |
| librarian | qwen3.5-plus | 1M context | Document retrieval |
| explore | qwen3-coder-next | 151.5 tok/s | Code exploration |
| atlas | qwen3.5-plus | Multimodal | UI interaction |
| multimodal-looker | Kimi-K2.5 | Video understanding | Image/video analysis |
Categories Configuration
| Category | Model | Scenario |
|---|---|---|
| ultrabrain | qwen3-max | Complex logic tasks |
| unspecified-high | qwen3-max | Complex tasks |
| deep | qwen3-coder-next | Deep autonomous work |
| quick | qwen3-coder-next | Quick modifications |
| unspecified-low | qwen3-coder-next | Simple tasks |
| visual-engineering | qwen3.5-plus | Frontend UI |
| artistry | qwen3.5-plus | Creative tasks |
| writing | qwen3.5-plus | Long document writing |
Configuration Code
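For reference, the allocation above can be written down as a single config fragment. The exact file location and schema depend on your oh-my-opencode version, and the provider prefixes below (`zhipu/`, `qwen/`, `minimax/`, `moonshot/`) are placeholders, so check the project's documentation for the real model identifiers on your subscription:

```json
{
  "agents": {
    "sisyphus":          { "model": "zhipu/glm-5" },
    "prometheus":        { "model": "qwen/qwen3-max" },
    "oracle":            { "model": "qwen/qwen3-max" },
    "metis":             { "model": "qwen/qwen3.5-plus" },
    "momus":             { "model": "minimax/minimax-m2.5" },
    "hephaestus":        { "model": "qwen/qwen3-coder-next" },
    "librarian":         { "model": "qwen/qwen3.5-plus" },
    "explore":           { "model": "qwen/qwen3-coder-next" },
    "atlas":             { "model": "qwen/qwen3.5-plus" },
    "multimodal-looker": { "model": "moonshot/kimi-k2.5" }
  },
  "categories": {
    "ultrabrain":         { "model": "qwen/qwen3-max" },
    "unspecified-high":   { "model": "qwen/qwen3-max" },
    "deep":               { "model": "qwen/qwen3-coder-next" },
    "quick":              { "model": "qwen/qwen3-coder-next" },
    "unspecified-low":    { "model": "qwen/qwen3-coder-next" },
    "visual-engineering": { "model": "qwen/qwen3.5-plus" },
    "artistry":           { "model": "qwen/qwen3.5-plus" },
    "writing":            { "model": "qwen/qwen3.5-plus" }
  }
}
```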
Model Distribution Visualization
Across the 18 assignments (10 Agents + 8 categories), the distribution works out to:
- qwen3.5-plus: 6 (metis, librarian, atlas, visual-engineering, artistry, writing)
- qwen3-coder-next: 5 (hephaestus, explore, deep, quick, unspecified-low)
- qwen3-max: 4 (prometheus, oracle, ultrabrain, unspecified-high)
- GLM-5: 1 (sisyphus)
- MiniMax-M2.5: 1 (momus)
- Kimi-K2.5: 1 (multimodal-looker)
Lessons Learned
1. Don’t Blindly Chase “Latest and Greatest”
qwen3-max’s GPQA 86.1% is impressive, but GLM-5’s low hallucination characteristic is more important for orchestration scenarios. Choose based on actual needs, not just benchmark scores.
2. Optimize High-frequency Scenarios Separately
Explore and Quick are high-frequency Agents. Using expensive models would quickly deplete quotas. qwen3-coder-next’s $0.12/M eliminates this concern entirely.
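To make the cost argument concrete, here is a back-of-the-envelope comparison using the prices from the lineup table. The workload itself (1,000 calls averaging 2K input and 1K output tokens) is a hypothetical exploration session, not a measurement:

```python
def cost_usd(calls, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Total cost in USD for `calls` requests, given per-million-token prices."""
    return calls * (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Hypothetical high-frequency workload: 1,000 calls, ~2K tokens in / ~1K tokens out each.
workload = (1_000, 2_000, 1_000)

# Prices from the lineup table ($/1M input, $/1M output).
coder_next = cost_usd(*workload, 0.12, 0.30)  # qwen3-coder-next
glm_5      = cost_usd(*workload, 1.00, 3.20)  # GLM-5, for comparison

print(f"qwen3-coder-next: ${coder_next:.2f}")  # $0.54
print(f"GLM-5:            ${glm_5:.2f}")       # $5.20
```

Under these assumptions the same session costs roughly ten times less on qwen3-coder-next, which is why the high-frequency Agents get the cheap, fast model rather than the smartest one.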
3. Special Capabilities Need Specialists
Video understanding is currently only supported by Kimi-K2.5, and plan review needs MiniMax-M2.5’s SWE-bench highest score. These special scenarios can’t be replaced by “general-purpose strong models.”
4. Testing Beats Theory
After configuration, I recommend testing these typical scenarios:
- Code search (triggers Explore)
- Document retrieval (triggers Librarian)
- Visual analysis (triggers Multimodal-looker)
- Complex architecture design (triggers Oracle or Ultrabrain)
Conclusion
This adjustment is a comprehensive upgrade to my previous configuration. "There's no best model, only the most suitable model" still holds true, but as new models join the lineup, the answer to "most suitable" changes.
| Scenario | Best Model | Key Advantage |
|---|---|---|
| Conductor orchestration | GLM-5 | Low hallucination, high reliability |
| Deep reasoning | qwen3-max | GPQA 86.1%, LiveCodeBench 91.4% |
| High-frequency coding | qwen3-coder-next | $0.12/M, 151.5 tok/s |
| Multimodal/Long docs | qwen3.5-plus | 1M context, native multimodal |
| Plan review | MiniMax-M2.5 | SWE-bench 80.2% highest |
| Video analysis | Kimi-K2.5 | Video understanding capability |
After completing this configuration, the system’s efficiency and effectiveness improved noticeably. Each Agent is doing what it excels at, and the collaboration is smoother.
If you’re also using oh-my-opencode, I recommend adjusting your configuration based on your use cases and available models. After all, finding the right partner is what doubles your productivity.