Re-optimizing oh-my-opencode Model Configuration - Re-evaluation After Adding Qwen Series

After adding the Qwen series models to my subscription, I re-evaluated the optimal model selection for each Agent. qwen3-max, with GPQA 86.1% and LiveCodeBench 91.4%, becomes the top choice for deep reasoning, while qwen3-coder-next at $0.12/M offers unbeatable value for high-frequency coding scenarios.

Update Note: This is a follow-up to my previous article “Finding the Perfect Partner for Your AI Agents - oh-my-opencode Model Selection Guide”. Back then, I configured the system based on 5 models: GLM-4.7, GLM-5, MiniMax-M2.5, DeepSeek-V3.2, and Kimi-K2.5. Recently, I added 4 Qwen series models (qwen3.5-plus, qwen3-max, qwen3-coder-next, qwen3-coder-plus), prompting a fresh re-evaluation of the optimal model for each Agent.

Why Re-evaluate?

AI Model Selection

After publishing the previous article, my model subscription changed: Alibaba Cloud Bailian added Qwen series models. The benchmark data for these 4 new models is impressive:

  • qwen3-max: GPQA 86.1% (scientific reasoning ceiling), LiveCodeBench 91.4% (coding reasoning leader)
  • qwen3-coder-next: SWE-bench 70.6% + $0.12/M (value champion)
  • qwen3.5-plus: 1M context + multimodal (long document powerhouse)

This data compelled me to re-examine my previous configuration. After deep research and benchmark analysis, I arrived at this new configuration scheme.

Why “Role-Based Allocation”?

oh-my-opencode has an interesting architecture that splits workflows into specialized Agents:

  • Sisyphus: The conductor, orchestrates tasks and delegates work
  • Hephaestus: Deep worker, executes tasks end-to-end
  • Oracle: Complex debugging, architecture consultant
  • Librarian: Documentation retrieval, external library queries
  • Explore: Codebase search specialist
  • Metis: Pre-planning consultant, identifies implicit intents
  • Momus: Plan reviewer
  • Prometheus: Strategic planner
  • Multimodal-looker: Image/video analysis
  • Atlas: Main model for UI interactions

Each Agent has different responsibilities and requires different model capabilities. Just as in a team, where some people excel at design, others at coding, and others at documentation, models should be assigned to the roles they fit best.

What Changed This Time?

Compared to the previous configuration, here are the core changes:

| Scenario | Previous Config | New Config | Reason |
| --- | --- | --- | --- |
| Deep reasoning (Oracle, Prometheus, Ultrabrain) | GLM-5 / MiniMax-M2.5 | qwen3-max | GPQA 86.1% + LiveCodeBench 91.4% is the current ceiling |
| High-frequency coding (Explore, Quick, Hephaestus) | DeepSeek-V3.2 / GLM-4.7 / MiniMax-M2.5 | qwen3-coder-next | $0.12/M + 151.5 tok/s, blazing fast |
| Multimodal / long docs (Librarian, Metis, Atlas) | DeepSeek-V3.2 / Kimi-K2.5 | qwen3.5-plus | 1M context + native multimodal |
| Conductor (Sisyphus) | GLM-5 | GLM-5 (kept) | Low hallucination is still what matters most |
| Plan review (Momus) | MiniMax-M2.5 | MiniMax-M2.5 (kept) | SWE-bench 80.2% is still the highest |
| Video analysis (Multimodal-looker) | Kimi-K2.5 | Kimi-K2.5 (kept) | Video understanding is irreplaceable |

In short: Reasoning scenarios upgrade to qwen3-max, coding scenarios switch to qwen3-coder-next, long document scenarios use qwen3.5-plus, while three specialists remain unchanged.

My Complete Model Lineup

Model Comparison

Before diving into allocation, let me show you all the “players” I have:

| Model | Context | Multimodal | Pricing ($/1M in/out) | Key Strengths |
| --- | --- | --- | --- | --- |
| GLM-4.7 | 202K | No | $0.60 / $2.20 | Math 92%, Coding 84.9%, balanced |
| GLM-5 | 202K | No | $1.00 / $3.20 | Low hallucination, Agent SOTA, complex reasoning |
| DeepSeek-V3.2 | 262K | No | $0.28 / $0.42 | Ultra cheap, Math 94.17%, deep reasoning |
| MiniMax-M2.5 | 196K | No | $0.30 / $1.20 | SWE-bench 80.2%, fast |
| Kimi-K2.5 | 262K | Yes | — | Strongest multimodal, video understanding |
| qwen3.5-plus | 1M | Yes | $0.12–0.26 / $0.29–1.56 | 1M context, multimodal, great value |
| qwen3-max | 262K | No | $0.96–2.40 / $4.80–12.00 | GPQA 86.1%, LiveCodeBench 91.4% |
| qwen3-coder-next | 256K | No | $0.12–0.14 / $0.30–0.42 | SWE-bench 70.6%, blazing fast |
| qwen3-coder-plus | 1M | No | $0.65–1.00 / $3.25–5.00 | SWE-bench 69.6%, 1M context |
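
To put the pricing columns in perspective, here is a rough back-of-the-envelope cost sketch. The workload (50M input / 10M output tokens per month) is a made-up example, and the prices are the lower bounds of the tiers listed above, so treat the results as illustrative only:

```python
# Hypothetical monthly workload: 50M input + 10M output tokens.
# Prices are the LOWER bounds of the $/1M-token tiers from the table above;
# actual tiered billing on Bailian may differ.
PRICING = {  # model: (input $/1M tokens, output $/1M tokens)
    "DeepSeek-V3.2": (0.28, 0.42),
    "MiniMax-M2.5": (0.30, 1.20),
    "qwen3.5-plus": (0.12, 0.29),
    "qwen3-max": (0.96, 4.80),
    "qwen3-coder-next": (0.12, 0.30),
}

INPUT_M, OUTPUT_M = 50, 10  # millions of tokens per month

def monthly_cost(model: str) -> float:
    """Estimated monthly bill in dollars for the workload above."""
    price_in, price_out = PRICING[model]
    return INPUT_M * price_in + OUTPUT_M * price_out

# Print the lineup from cheapest to most expensive for this workload.
for model in sorted(PRICING, key=monthly_cost):
    print(f"{model:<18} ${monthly_cost(model):6.2f}/month")
```

Even on this crude estimate the gap is stark: the same workload costs roughly ten times more on qwen3-max than on qwen3-coder-next, which is exactly why the high-frequency slots get the cheap, fast model.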

Key Discoveries: The New Contenders

qwen3-max: The New “Reasoning King”

This was my biggest discovery from this research:

  • GPQA 86.1% — the strongest scientific reasoning score in publicly available data
  • LiveCodeBench v6 91.4% — the coding-reasoning ceiling
  • Test-time Scaling + Early Stop Detection — automatically determines when to stop thinking, so no compute is wasted

What does this mean? If you need deep reasoning, architecture analysis, or complex debugging, qwen3-max is currently the best choice.

qwen3-coder-next: The Value Champion

  • SWE-bench Verified 70.6% — within ten points of MiniMax-M2.5's 80.2%
  • Output speed 151.5 tokens/sec — 2nd in its class
  • Response time 11.68 seconds — vs. MiniMax-M2.5's 43.03 seconds
  • Price $0.12/M input — cheaper than any competitor

What does this mean? If you need high-frequency calls and fast responses (code exploration, quick fixes), qwen3-coder-next is the obvious choice.

qwen3.5-plus: Multimodal + Long Context New Option

  • 1M context — among the largest context windows currently available
  • Native multimodal — supports text and images
  • Apache 2.0 open source — self-deployable
  • Competitive pricing — $0.12–0.26/M input

What does this mean? For scenarios requiring long documents or multimodal content (document retrieval, UI interaction, writing), qwen3.5-plus is an ideal choice.

Core Configuration Strategy

After careful consideration, I established these configuration principles:

1. The Conductor Needs Low Hallucination

Sisyphus → GLM-5

Why not qwen3-max? Because for the conductor, reliability trumps reasoning depth. GLM-5's hallucination rate is 56% lower than GLM-4.7's, making it more dependable for task orchestration.

2. Deep Reasoning Scenarios Use the Strongest Reasoning

Prometheus, Oracle, Ultrabrain → qwen3-max

These scenarios require deep reasoning: strategic planning, architecture consulting, complex logic analysis. qwen3-max’s GPQA 86.1% and LiveCodeBench 91.4% are currently the ceiling.

3. High-frequency Coding Scenarios Use Speed Models

Explore, Quick, Deep, Hephaestus → qwen3-coder-next

These scenarios have high call frequency and need fast responses. qwen3-coder-next’s 151.5 tokens/sec and $0.12/M price make it the best choice.

4. Multimodal and Long Context Use the All-rounder

Librarian, Metis, Atlas, Visual-engineering, Artistry, Writing → qwen3.5-plus

These scenarios require processing long documents or multimodal content. qwen3.5-plus’s 1M context and native multimodal support are key.

5. Special Scenarios Retain Specialists

Momus → MiniMax-M2.5 (highest SWE-bench score at 80.2%; plan review demands coding accuracy)

Multimodal-looker → Kimi-K2.5 (video understanding; qwen3.5-plus doesn't support video yet)

Final Configuration

Agents Configuration

| Agent | Model | Core Capability | Role |
| --- | --- | --- | --- |
| sisyphus | GLM-5 | Low hallucination, Agent SOTA | Conductor (high-reliability orchestration) |
| prometheus | qwen3-max | GPQA 86.1%, deep thinking | Strategic planning |
| oracle | qwen3-max | LiveCodeBench 91.4% | Architecture consulting, complex debugging |
| metis | qwen3.5-plus | 1M context, multimodal | Intent analysis |
| momus | MiniMax-M2.5 | SWE-bench 80.2% | Plan review |
| hephaestus | qwen3-coder-next | SWE-bench 70.6%, fast | Deep work |
| librarian | qwen3.5-plus | 1M context | Document retrieval |
| explore | qwen3-coder-next | 151.5 tok/s | Code exploration |
| atlas | qwen3.5-plus | Multimodal | UI interaction |
| multimodal-looker | Kimi-K2.5 | Video understanding | Image/video analysis |

Categories Configuration

| Category | Model | Scenario |
| --- | --- | --- |
| ultrabrain | qwen3-max | Complex logic tasks |
| unspecified-high | qwen3-max | Complex tasks |
| deep | qwen3-coder-next | Deep autonomous work |
| quick | qwen3-coder-next | Quick modifications |
| unspecified-low | qwen3-coder-next | Simple tasks |
| visual-engineering | qwen3.5-plus | Frontend UI |
| artistry | qwen3.5-plus | Creative tasks |
| writing | qwen3.5-plus | Long-document writing |

Configuration Code

{
  "$schema": "https://raw.githubusercontent.com/code-yeongyu/oh-my-opencode/dev/assets/oh-my-opencode.schema.json",
  "agents": {
    "hephaestus": {
      "model": "bailian-coding-plan/qwen3-coder-next"
    },
    "oracle": {
      "model": "bailian-coding-plan/qwen3-max-2026-01-23"
    },
    "librarian": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    },
    "explore": {
      "model": "bailian-coding-plan/qwen3-coder-next"
    },
    "multimodal-looker": {
      "model": "volcengine-coding/kimi-k2.5"
    },
    "prometheus": {
      "model": "bailian-coding-plan/qwen3-max-2026-01-23"
    },
    "metis": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    },
    "momus": {
      "model": "volcengine-coding/minimax-m2.5"
    },
    "atlas": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    },
    "sisyphus": {
      "model": "bailian-coding-plan/glm-5"
    }
  },
  "categories": {
    "visual-engineering": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    },
    "ultrabrain": {
      "model": "bailian-coding-plan/qwen3-max-2026-01-23"
    },
    "deep": {
      "model": "bailian-coding-plan/qwen3-coder-next"
    },
    "artistry": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    },
    "quick": {
      "model": "bailian-coding-plan/qwen3-coder-next"
    },
    "unspecified-low": {
      "model": "bailian-coding-plan/qwen3-coder-next"
    },
    "unspecified-high": {
      "model": "bailian-coding-plan/qwen3-max-2026-01-23"
    },
    "writing": {
      "model": "bailian-coding-plan/qwen3.5-plus"
    }
  }
}
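
If you want to double-check how a config like this distributes models across slots, a few lines of Python can tally it. The inline `config` here is a trimmed stand-in for the real file; in practice you would `json.load` your actual oh-my-opencode JSON instead:

```python
import json
from collections import Counter

def model_distribution(config: dict) -> Counter:
    """Count how many agent and category slots each model occupies."""
    tally: Counter = Counter()
    for section in ("agents", "categories"):
        for entry in config.get(section, {}).values():
            # Drop the provider prefix: "bailian-coding-plan/glm-5" -> "glm-5"
            tally[entry["model"].split("/", 1)[-1]] += 1
    return tally

# Trimmed stand-in for the full config; in real use, do e.g.:
#   with open("oh-my-opencode.json") as f: config = json.load(f)
config = json.loads("""{
    "agents": {
        "hephaestus": {"model": "bailian-coding-plan/qwen3-coder-next"},
        "explore": {"model": "bailian-coding-plan/qwen3-coder-next"},
        "sisyphus": {"model": "bailian-coding-plan/glm-5"}
    },
    "categories": {
        "quick": {"model": "bailian-coding-plan/qwen3-coder-next"}
    }
}""")

# Print a simple bar chart of slots per model.
for model, count in model_distribution(config).most_common():
    print(f"{model:<20} {'█' * count} {count}")
```

Run against the full config above, this reproduces the distribution shown in the next section.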

Model Distribution Visualization

qwen3-coder-next  ██████████ 5 positions (hephaestus, explore, deep, quick, unspecified-low)
qwen3.5-plus      ████████████ 6 positions (librarian, metis, atlas, visual-engineering, artistry, writing)
qwen3-max         ████████ 4 positions (prometheus, oracle, ultrabrain, unspecified-high)
GLM-5             ██ 1 position (sisyphus)
MiniMax-M2.5      ██ 1 position (momus)
Kimi-K2.5         ██ 1 position (multimodal-looker)

Lessons Learned

1. Don’t Blindly Chase “Latest and Greatest”

qwen3-max’s GPQA 86.1% is impressive, but GLM-5’s low hallucination characteristic is more important for orchestration scenarios. Choose based on actual needs, not just benchmark scores.

2. Optimize High-frequency Scenarios Separately

Explore and Quick are high-frequency Agents. Using expensive models would quickly deplete quotas. qwen3-coder-next’s $0.12/M eliminates this concern entirely.

3. Special Capabilities Need Specialists

In my lineup, only Kimi-K2.5 supports video understanding, and plan review needs MiniMax-M2.5's top SWE-bench score (80.2%). These special scenarios can't be covered by "general-purpose strong models."

4. Testing Beats Theory

After configuration, I recommend testing these typical scenarios:

  • Code search (triggers Explore)
  • Document retrieval (triggers Librarian)
  • Visual analysis (triggers Multimodal-looker)
  • Complex architecture design (triggers Oracle or Ultrabrain)

Conclusion

This adjustment is a comprehensive upgrade to my previous configuration. "There's no best model, only the most suitable model" still holds true, but as new models are added, the answer to "most suitable" changes.

| Scenario | Best Model | Key Advantage |
| --- | --- | --- |
| Conductor orchestration | GLM-5 | Low hallucination, high reliability |
| Deep reasoning | qwen3-max | GPQA 86.1%, LiveCodeBench 91.4% |
| High-frequency coding | qwen3-coder-next | $0.12/M, 151.5 tok/s |
| Multimodal / long docs | qwen3.5-plus | 1M context, native multimodal |
| Plan review | MiniMax-M2.5 | SWE-bench 80.2%, the highest |
| Video analysis | Kimi-K2.5 | Video understanding capability |

After completing this configuration, the system’s efficiency and effectiveness improved noticeably. Each Agent is doing what it excels at, and the collaboration is smoother.

If you’re also using oh-my-opencode, I recommend adjusting your configuration based on your use cases and available models. After all, finding the right partner is what doubles your productivity.