Abstract
We study clinical trial table reasoning, in which answers are not stored directly in visible cells and must instead be inferred through normalization, classification, extraction, and lightweight domain reasoning. We introduce SCoPE, a multi-LLM planner-based framework that decomposes the problem into row selection, structured planning, and execution. On 1,500 hybrid reasoning questions over oncology clinical-trial tables, explicit planning improves grounded row-level reasoning accuracy over direct prompting and strong tabular baselines while maintaining a favorable accuracy-efficiency tradeoff.
Introduction
Many clinical trial questions are neither pure retrieval nor standard text-to-SQL. In this setting, the answer must be derived from row evidence rather than projected from an explicit schema field. SCoPE addresses this by separating three failure-prone decisions: row grounding, source-field identification, and transformation logic.
Research question: Can explicit planner-based decomposition improve grounded row-level reasoning over partially observed clinical-trial tables?
Method: SCoPE System and Flow
- Executor (row selection): identify candidate relevant rows from the visible table.
- Planner (structured reasoning): predict source field, relevant columns, reasoning rules, and output constraints.
- Executor (final generation): apply the plan and return row-aligned predictions.
This explicit planning interface improves interpretability and reduces ambiguity compared to single-step generation.
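The three stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Plan` fields mirror the planner outputs listed above, but the function names, the toy table, its column names, and the rule-based stand-ins for the LLM calls are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, str]


@dataclass
class Plan:
    """Structured plan; field names follow the planner outputs listed above."""
    source_field: str
    relevant_columns: list[str]
    reasoning_rules: list[str]
    output_constraints: str


def run_scope_like(
    question: str,
    table: list[Row],
    select_rows: Callable[[str, list[Row]], list[int]],
    make_plan: Callable[[str, list[Row]], Plan],
    execute: Callable[[Plan, Row], object],
) -> dict[int, object]:
    """Row selection -> planning -> execution, keyed by row index."""
    row_ids = select_rows(question, table)                    # executor, stage 1
    plan = make_plan(question, [table[i] for i in row_ids])   # planner
    return {i: execute(plan, table[i]) for i in row_ids}      # executor, stage 2


# Hypothetical toy table; trial labels and columns are placeholders.
TABLE = [
    {"trial": "trial-A", "cancer_type": "melanoma", "design": "Randomized, open-label"},
    {"trial": "trial-B", "cancer_type": "lung", "design": "Single-arm"},
    {"trial": "trial-C", "cancer_type": "melanoma", "design": "Single-arm, phase 2"},
]


# Rule-based stand-ins for what would be LLM calls in a real system.
def select_rows(question: str, table: list[Row]) -> list[int]:
    return [i for i, r in enumerate(table) if r["cancer_type"] in question.lower()]


def make_plan(question: str, rows: list[Row]) -> Plan:
    return Plan(
        source_field="design",
        relevant_columns=["trial", "design"],
        reasoning_rules=["answer True if the design text mentions randomization"],
        output_constraints="boolean",
    )


def execute(plan: Plan, row: Row) -> object:
    return "randomized" in row[plan.source_field].lower()


result = run_scope_like(
    "Which melanoma trials are randomized?", TABLE, select_rows, make_plan, execute
)
print(result)  # {0: True, 2: False}
```

The answer here is not stored in any cell: it is derived from the free-text `design` field, and the returned dictionary keeps each prediction aligned with the row it came from.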
Dataset
The benchmark contains 1,500 programmatically augmented hybrid reasoning questions constructed from an expert-authored seed set of 500 questions over oncology clinical-trial data.
Clinical-Trial Table Statistics
| Statistic | Value |
|---|---|
| Rows | 159 |
| Columns | 32 |
| Unique trials (NCT) | 105 |
| Cancer types | 19 |
| Immune checkpoint inhibitor (ICI) names | 13 |
Question/Answer Statistics
| Statistic | Value |
|---|---|
| Total questions | 1,500 |
| Mean question length | 21.2 tokens |
| Target fields | 31 |
| Questions with string answers | 957 |
| Questions with list answers | 241 |
| Questions with Boolean answers | 224 |
| Questions with null-only answers | 78 |
Experimental Setup
Baselines: Zero-shot, Few-shot, CoT, BlendSQL, EHRAgent, TableGPT2.
Backbones: Qwen3-30B-A3B-Instruct-2507, gpt-oss-20b, Llama-3.3-70B-Instruct.
Metrics: Table F1 (primary), Grounded Row Jaccard (RJ), Grounded Fowlkes-Mallows (FM).
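This section does not spell out how the metrics are computed, so the following is only a sketch of the standard set-based forms of F1, Jaccard, and Fowlkes-Mallows (the geometric mean of precision and recall), applied to predicted versus gold items for one question. The function name, the empty-set convention, and the example identifiers are assumptions; the paper's "grounded" variants and its aggregation over questions may differ.

```python
from math import sqrt


def row_set_scores(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-overlap scores between predicted and gold items for one question."""
    tp = len(predicted & gold)   # items in both sets
    fp = len(predicted - gold)   # predicted but not gold
    fn = len(gold - predicted)   # gold but not predicted
    if tp == 0:
        # Assumed convention: two empty sets count as a perfect match.
        score = 1.0 if not predicted and not gold else 0.0
        return {"f1": score, "jaccard": score, "fm": score}
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "jaccard": tp / (tp + fp + fn),
        "fm": sqrt(precision * recall),
    }


# Hypothetical row identifiers: two of three predictions are correct,
# and two gold rows are missed.
scores = row_set_scores({"r1", "r2", "r3"}, {"r2", "r3", "r4", "r5"})
print({k: round(v, 4) for k, v in scores.items()})
# {'f1': 0.5714, 'jaccard': 0.4, 'fm': 0.5774}
```

For a single pair of sets, Jaccard is never higher than F1, and Fowlkes-Mallows is never lower than F1, which matches the ordering RJ < F1 < FM seen in the results tables.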
Main Results
| Method | Qwen3 F1 | Qwen3 RJ | Qwen3 FM | Llama-3.3 F1 | Llama-3.3 RJ | Llama-3.3 FM | GPT-OSS F1 | GPT-OSS RJ | GPT-OSS FM |
|---|---|---|---|---|---|---|---|---|---|
| BlendSQL | 11.56 | 6.52 | 20.63 | 5.60 | 5.15 | 5.81 | 7.30 | 6.48 | 7.71 |
| EHRAgent | 32.99 | 29.79 | 33.74 | 30.99 | 28.07 | 31.69 | 34.85 | 31.23 | 35.65 |
| Zero-shot | 56.32 | 44.95 | 62.73 | 66.96 | 54.55 | 72.04 | 73.50 | 61.05 | 77.47 |
| CoT | 55.37 | 44.65 | 61.93 | 70.87 | 57.83 | 75.15 | 74.17 | 61.77 | 78.05 |
| Few-shot | 54.74 | 44.09 | 61.48 | 69.38 | 56.56 | 74.05 | 73.99 | 61.55 | 77.85 |
| SCoPE | 63.19 | 52.07 | 69.45 | 70.87 | 60.66 | 76.12 | 74.31 | 62.48 | 78.27 |
TableGPT2 is reported as a single result rather than per backbone: F1 44.03, RJ 33.78, FM 50.81.
SCoPE is the strongest method on all three metrics with GPT-OSS and Qwen3. With Llama-3.3 it ties CoT on Table F1 (70.87) while achieving higher grounding metrics (RJ 60.66 vs. 57.83, FM 76.12 vs. 75.15).
Cross-Model Ablation
| Executor | Planner | F1 | RJ | FM |
|---|---|---|---|---|
| GPT-OSS | Qwen3 | 75.07 | 63.74 | 79.26 |
| Qwen3 | GPT-OSS | 59.59 | 48.26 | 66.32 |
| Qwen3 | Llama-3.3 | 62.47 | 51.40 | 68.88 |
| GPT-OSS | Llama-3.3 | 75.12 | 63.88 | 79.28 |
| Llama-3.3 | GPT-OSS | 68.01 | 57.37 | 73.64 |
| Llama-3.3 | Qwen3 | 71.43 | 61.33 | 76.64 |
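One way to read this table is to ask how much Table F1 moves when one role is held fixed and the other is swapped. The sketch below does that arithmetic on the F1 column only. The helper name is hypothetical, and each group contains just two configurations, so the spreads are a rough indication rather than a statistical result.

```python
from collections import defaultdict

# (executor, planner, Table F1), copied from the ablation table above.
ABLATION_F1 = [
    ("GPT-OSS", "Qwen3", 75.07),
    ("Qwen3", "GPT-OSS", 59.59),
    ("Qwen3", "Llama-3.3", 62.47),
    ("GPT-OSS", "Llama-3.3", 75.12),
    ("Llama-3.3", "GPT-OSS", 68.01),
    ("Llama-3.3", "Qwen3", 71.43),
]


def f1_spread_by(role_index: int) -> dict[str, float]:
    """Max minus min F1 within each group sharing the model at role_index.

    role_index 0 groups by executor, 1 groups by planner.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for row in ABLATION_F1:
        groups[row[role_index]].append(row[2])
    return {name: round(max(v) - min(v), 2) for name, v in groups.items()}


by_executor = f1_spread_by(0)  # executor fixed, planner varies
by_planner = f1_spread_by(1)   # planner fixed, executor varies
print(by_executor)
print(by_planner)
```

With the executor fixed, F1 changes by at most 3.42 points across planners. With the planner fixed, it changes by between 3.64 and 12.65 points across executors, which suggests that in these configurations the executor choice has the larger effect on F1.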
Planner-Coder Baseline
| Coder Model | F1 | RJ | FM |
|---|---|---|---|
| GPT-OSS | 14.56 | 11.67 | 23.39 |
| Qwen3 | 48.92 | 37.53 | 56.01 |
| Llama-3.3 | 63.40 | 51.55 | 69.10 |
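Code-synthesis execution is substantially more brittle than constrained grounded execution: for every model, Table F1 falls below the corresponding SCoPE result in the main table (14.56 vs. 74.31 on GPT-OSS, 48.92 vs. 63.19 on Qwen3, and 63.40 vs. 70.87 on Llama-3.3).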
Model Cost-Effectiveness
In the Qwen-based comparison, SCoPE lies on the accuracy-cost frontier: it attains the highest Table F1 with moderate planning overhead, and it is substantially more accurate than structured baselines that cost more to run.
Related Work
Related datasets include MIMIC-III/IV, eICU, i2b2/n2c2, TabFact, and FeTaQA. Related methods include Seq2SQL, TaBERT, TAPAS, PAL, ReAct, TableGPT2, BlendSQL, and EHRAgent.
Conclusion
SCoPE frames clinical trial reasoning as a distinct table understanding setting and demonstrates that lightweight planner-executor decomposition improves grounded row-level inference while preserving efficiency.
Limitations
The current evaluation excludes the latest frontier proprietary models, does not fine-tune or specialize the planner or executor, lacks a dedicated memorization analysis, and covers a single oncology table setting.
Ethics Statement
This system supports trial evidence review, not patient-specific decision-making. It operates on public trial-level data without PHI/PII, and outputs should be expert-reviewed due to possible model errors.
Acknowledgment
Supported by the Mayo Clinic and Arizona State University Alliance for Health Care Collaborative Research Seed Grant Program (Award ID: AWD00041508; Sponsor Award ID: ARI-358187).