r/LocalLLaMA · June 23, 2026

Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I want to train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully.

A few questions:

SFT → RL or RL-only?

- Is it still recommended to do supervised fine-tuning first (on tool-calling traces, reasoning trajectories, etc.) and then apply RL?

- Or are people seeing good results with RL-based training directly for tool-use tasks?

Reward design

- How do you design reward functions for tool-use agents?

Parallel tool execution

- One complication in my workflow:

  - Tool A returns N items

  - The agent must call Tool B N times, potentially in parallel

  - Then aggregate the results
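
For concreteness, the three steps above can be sketched as a fan-out/aggregate pattern. This only illustrates the runtime behavior I am after, not a training recipe. `tool_a`, `tool_b`, and `run_workflow` are hypothetical names, and the fixed N of 3 and the aggregation by summed length are assumptions made for the example.

```python
import asyncio


# Hypothetical stand-ins for the two tools. Names, arguments, and return
# shapes are assumptions for illustration only.
async def tool_a(query: str) -> list[str]:
    """Return N items for a query (here N is fixed at 3)."""
    return [f"{query}-item-{i}" for i in range(3)]


async def tool_b(item: str) -> dict:
    """Process one item returned by tool_a."""
    await asyncio.sleep(0)  # placeholder for real I/O latency
    return {"item": item, "length": len(item)}


async def run_workflow(query: str) -> dict:
    # Step 1: a single call to Tool A yields N items.
    items = await tool_a(query)
    # Step 2: fan out, one Tool B call per item, run concurrently.
    # asyncio.gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(tool_b(item) for item in items))
    # Step 3: aggregate the N results into one answer.
    return {
        "count": len(results),
        "total_length": sum(r["length"] for r in results),
    }


if __name__ == "__main__":
    print(asyncio.run(run_workflow("doc")))
```

The point of the sketch is that the number of Tool B calls is not known until Tool A returns, so the model has to emit a variable number of calls in one turn and then consume all of their results together.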

How would you represent and train this behavior?

For those who have trained production-quality tool-use models, what training recipe worked best?
