WideSeek: Advancing Wide Research via Multi-Agent Scaling

Ziyang Huang*,1,2, Haolin Ren*,1,2, Xiaowei Yuan1,2, Jiawei Wang3, Zhongtao Jiang, Kun Xu, Shizhu He1,2, Jun Zhao1,2, Kang Liu1,2,✉
1Institute of Automation, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3University of Science and Technology of China
arXiv
*Indicates Equal Contribution
✉ Corresponding Author

Abstract

Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information sets under complex constraints in parallel. However, advancement in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the volume of target information, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

Key Contributions

Deep Research vs Wide Research

The Deep Research paradigm vs. the Wide Research paradigm.

📊 WideSeekBench

A General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline, ensuring diversity across information volume, logical constraints, and domains.

🤖 WideSeek Architecture

A dynamic hierarchical multi-agent system that autonomously forks parallel sub-agents based on task requirements, enabling scalable information retrieval.

🎯 Unified RL Framework

An end-to-end reinforcement learning framework that linearizes multi-agent trajectories and optimizes the system for wide research tasks.

Method Overview

WideSeek Method

An illustration of Multi-Agent Reinforcement Learning. As shown on the left, the main agent can fork any number of sub-agents at any step. The trajectories of the main agent and sub-agents are unified for RL training.

Dynamic Multi-Agent Architecture

WideSeek operates as a hierarchical multi-agent system with a centralized Main Agent (Planner) that dynamically forks a variable number of Sub-Agent (Executor) instances at any step. Unlike static multi-agent architectures with fixed roles, WideSeek grants the main agent complete autonomy to instantiate any number of sub-agents based on task requirements.
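
The planner/executor pattern described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the subtask decomposition and the `sub_agent` search stub are invented placeholders, and real sub-agents would call retrieval tools rather than return canned strings.

```python
# Minimal sketch of WideSeek-style dynamic forking: the main agent decides,
# at any planning step, how many sub-agents to spawn and runs them in
# parallel, then merges their results. All names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def sub_agent(subtask: str) -> list[str]:
    """Executor: retrieves items for one subtask (stubbed search)."""
    return [f"result for {subtask}"]


def main_agent(task: str, subtasks: list[str]) -> list[str]:
    """Planner: forks one sub-agent per subtask and merges their outputs."""
    # The number of forks is decided dynamically from the task decomposition,
    # not fixed in advance (unlike static multi-agent architectures).
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partials = pool.map(sub_agent, subtasks)
    merged: list[str] = []
    for items in partials:
        merged.extend(items)
    return merged
```

The key design point is that `subtasks` is produced by the planner at runtime, so the fan-out scales with the breadth of the query rather than with a hard-coded agent count.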

Unified Reinforcement Learning

We linearize the hierarchical execution trace into a single sequence and optimize the system using Group Relative Policy Optimization (GRPO). The unified trajectory interleaves Main Agent planning steps with Sub-Agent execution steps, enabling end-to-end optimization of the entire multi-agent system.
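
The two ingredients named above, interleaving sub-agent steps into one sequence and GRPO's group-relative reward normalization, can be sketched as below. This is a simplified illustration under assumed conventions (e.g. that sub-agent steps follow the planning step that forked them); the trajectory contents and function names are placeholders, not the paper's code.

```python
# Sketch of (1) linearizing a hierarchical trace into a single sequence and
# (2) the group-relative advantage at the heart of GRPO: each rollout's
# reward is normalized against its group's mean and standard deviation.
from statistics import mean, pstdev


def linearize(main_steps: list[str], sub_steps: dict[int, list[str]]) -> list[str]:
    """Interleave sub-agent steps after the main-agent step that forked them."""
    seq: list[str] = []
    for i, step in enumerate(main_steps):
        seq.append(step)
        seq.extend(sub_steps.get(i, []))  # sub-agents forked at step i
    return seq


def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Normalize rewards within a group of rollouts for the same task."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in group_rewards]
```

In training, the scalar advantage of a rollout would be applied to the tokens of its whole linearized trajectory, so planner and executor steps are optimized end to end with one objective.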

WideSeekBench

WideSeekBench

The data pipeline of WideSeekBench construction, which mines a set of target information under complex constraints.

WideSeekBench is a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline. The benchmark ensures diversity across:

  • Volume of target information: Varying scales of information retrieval tasks
  • Logical constraints: Complex set operations (intersection, union, difference)
  • Domains: Multi-dimensional domain coverage
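
The constraint types above reduce to set operations over candidate answer sets, which a tiny example makes concrete. The entity sets below are invented for illustration and are not benchmark data.

```python
# Minimal sketch of the three logical constraint types as set operations.
films_by_a = {"Film1", "Film2", "Film3"}  # hypothetical: films satisfying constraint A
films_by_b = {"Film2", "Film3", "Film4"}  # hypothetical: films satisfying constraint B

intersection = films_by_a & films_by_b   # items satisfying both constraints
union = films_by_a | films_by_b          # items satisfying either constraint
difference = films_by_a - films_by_b     # items satisfying A but not B
```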

Experimental Results

Main Results on WideSeekBench

| Model | Success Rate (%) | Row F1 Mean@4 (%) | Row F1 Max@4 (%) | Item F1 Mean@4 (%) | Item F1 Max@4 (%) | # Sub-Agents | # Tool Calls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Proprietary Models* | | | | | | | |
| GPT-5.2 | 0.00 | 4.45 | 6.75 | 21.03 | 26.88 | 11.21 | 408.64 |
| GPT-5.1 | 0.00 | 4.11 | 6.75 | 20.44 | 27.88 | 6.02 | 121.36 |
| DeepSeek-v3.2 | 0.00 | 4.34 | 6.85 | 20.51 | 27.09 | 31.25 | 326.41 |
| Kimi-K2-Thinking | 0.00 | 3.17 | 5.86 | 17.48 | 25.19 | 8.74 | 85.36 |
| Seed-1.8 | 0.14 | 3.44 | 5.92 | 17.88 | 25.23 | 7.93 | 88.36 |
| *Open-Sourced Models* | | | | | | | |
| Qwen3-8B-Thinking | 0.00 | 0.53 | 1.51 | 7.37 | 12.71 | 4.18 | 9.50 |
| Qwen3-30B-A3B-Thinking | 0.00 | 1.26 | 3.00 | 10.11 | 16.51 | 7.53 | 17.15 |
| WideSeek-8B-RL | 0.00 | 1.09 (+0.56) | 2.59 (+1.08) | 10.86 (+3.49) | 16.61 (+3.90) | 9.57 (×2.29) | 41.09 (×4.33) |
| WideSeek-8B-SFT | 0.14 | 1.74 (+1.21) | 3.66 (+2.15) | 11.35 (+3.98) | 18.92 (+6.21) | 13.16 (×3.15) | 121.98 (×12.84) |
| WideSeek-8B-SFT-RL | 0.00 | 1.95 (+1.42) | 3.88 (+2.37) | 12.87 (+5.50) | 19.73 (+7.02) | 26.60 (×6.36) | 273.75 (×28.82) |

Experiment results on WideSeekBench. We run each task 4 times.
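
For readers unfamiliar with the metric, an item-level F1 score as reported above is typically the harmonic mean of precision and recall of predicted items against the gold target set. The sketch below shows that standard computation; the exact matching rules used by WideSeekBench (e.g. normalization or fuzzy matching of items) may differ.

```python
# Standard set-based F1 over retrieved items vs. the gold target set.
def item_f1(predicted: set[str], gold: set[str]) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # correctly retrieved items
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```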

Training Reward Curve

The training dynamics of WideSeek-8B-RL. We present the evolution of training rewards and the frequency of tool calls throughout the entire training process.

Generalization to Deep Research

To assess whether the capabilities transfer to deep research tasks, we evaluate our models on the BrowseComp-Plus dataset.

| Model | Scaffold | Acc (%) |
| --- | --- | --- |
| *Baseline Models (ReAct)* | | |
| Gemini-2.5-Pro | ReAct | 29.52 |
| GPT-OSS-120B-Low | ReAct | 25.54 |
| DeepSeek-R1-0528 | ReAct | 16.39 |
| Search-R1-32B | ReAct | 11.08 |
| Qwen3-32B | ReAct | 10.72 |
| *WideSeek Models* | | |
| Qwen3-30B-A3B | WideSeek | 14.82 |
| Qwen3-8B | WideSeek | 14.22 |
| WideSeek-8B-SFT | WideSeek | 23.61 |
| WideSeek-8B-SFT-RL | WideSeek | 23.61 |
| WideSeek-8B-RL | WideSeek | 26.42 (+12.20) |

BrowseComp-Plus performance. We test the generalization of WideSeek to a Deep Research dataset.

Key Findings

  • Scalability: WideSeek-8B-SFT-RL achieves a 28.82× increase in tool calls and a 6.36× increase in sub-agent instantiation compared to the base model.
  • Performance: An Item F1 score of 12.87% (+5.50% over the base model) and a Max Row F1 of 3.88% on WideSeekBench.
  • Generalization: Achieves 26.42% accuracy on BrowseComp-Plus (+12.20% improvement), demonstrating transfer to Deep Research tasks.
  • Efficiency: The system learns to scale search effort aggressively, with a strong correlation between reward and tool usage.

Analysis

Bucket Analysis

Item-F1 score, the number of sub-agents, and the number of tool calls on task sets with different volumes of target information.

Domain Analysis

Item-F1 score on different domains.

Constraint Analysis

Item-F1 score on different constraint types.

BibTeX

@misc{huang2026wideseekadvancingwideresearch,
      title={WideSeek: Advancing Wide Research via Multi-Agent Scaling}, 
      author={Ziyang Huang and Haolin Ren and Xiaowei Yuan and Jiawei Wang and Zhongtao Jiang and Kun Xu and Shizhu He and Jun Zhao and Kang Liu},
      year={2026},
      eprint={2602.02636},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.02636}, 
}