Accepted to BioNLP 2026

SCoPE: Planning for Hybrid Querying over Clinical Trial Data

Structured Clinical hybrid Planning for Evidence retrieval in clinical trials

Suparno Roy Chowdhury, Manan Roy Choudhury, Tejas Anvekar, Muhammad Ali Khan, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta

Arizona State University, Mayo Clinic

1,500 questions | 31 target fields | 159 x 32 table

Abstract

We study clinical trial table reasoning, where answers are not directly stored in visible cells and must be inferred from semantic understanding through normalization, classification, extraction, and lightweight domain reasoning. We introduce SCoPE, a multi-LLM planner-based framework that decomposes the problem into row selection, structured planning, and execution. Across 1,500 hybrid reasoning questions over oncology clinical-trial tables, explicit planning improves grounded row-level reasoning accuracy over direct prompting and stronger tabular baselines, while maintaining a favorable accuracy-efficiency tradeoff.

Introduction

Many clinical trial questions are neither pure retrieval nor standard text-to-SQL. In this setting, the answer must be derived from row evidence rather than projected from an explicit schema field. SCoPE addresses this by separating three failure-prone decisions: row grounding, source-field identification, and transformation logic.

Research question: can explicit planner-based decomposition improve grounded row-level reasoning over partially observed clinical-trial tables?

Figure: Hybrid reasoning exemplar for SCoPE. A sample hybrid reasoning question requiring semantic understanding of regimen structure and extraction of held-out target information.

Method: SCoPE System and Flow

  1. Executor (row selection): identify candidate relevant rows from the visible table.
  2. Planner (structured reasoning): predict source field, relevant columns, reasoning rules, and output constraints.
  3. Executor (final generation): apply the plan and return row-aligned predictions.

This explicit planning interface improves interpretability and reduces ambiguity compared to single-step generation.
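The three stages above can be sketched as a small pipeline. This is a minimal illustration rather than the authors' implementation: the `Plan` fields mirror the planner outputs listed in the method, while `run_pipeline`, the toy stand-in functions, the column names (`trial_id`, `regimen`), and the example rows are all hypothetical. In a real system the three callables would wrap LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Plan:
    """Structured plan; fields follow the planner outputs named in the method."""
    source_field: str
    relevant_columns: List[str]
    reasoning_rules: List[str]
    output_constraints: str


def run_pipeline(
    question: str,
    table: List[Row],
    select_rows: Callable[[str, List[Row]], List[int]],
    make_plan: Callable[[str, List[Row]], Plan],
    execute: Callable[[str, Row, Plan], object],
) -> Dict[int, object]:
    # Stage 1: the executor proposes candidate row indices.
    row_ids = select_rows(question, table)
    # Stage 2: the planner produces one structured plan for those rows.
    plan = make_plan(question, [table[i] for i in row_ids])
    # Stage 3: the executor applies the plan row by row, seeing only
    # the columns the plan marked as relevant.
    answers = {}
    for i in row_ids:
        evidence = {col: table[i][col] for col in plan.relevant_columns}
        answers[i] = execute(question, evidence, plan)
    return answers


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def select_rows(question, table):
    return [i for i, row in enumerate(table) if row["trial_id"] in question]


def make_plan(question, rows):
    return Plan(
        source_field="regimen",
        relevant_columns=["regimen"],
        reasoning_rules=["combination if the regimen lists agents joined by '+'"],
        output_constraints="boolean",
    )


def execute(question, evidence, plan):
    return "+" in evidence["regimen"]


table = [
    {"trial_id": "T1", "regimen": "drug_a + drug_b"},
    {"trial_id": "T2", "regimen": "drug_c"},
]
answers = run_pipeline(
    "Is the regimen in trial T1 a combination?",
    table, select_rows, make_plan, execute,
)
print(answers)  # {0: True}
```

Keeping the plan as an explicit object means each decision (which rows, which columns, which rule) can be inspected separately when a prediction is wrong.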

Figure: SCoPE planner-executor pipeline. Row selection, structured planning, and final row-aligned execution over visible evidence columns.

Dataset

The benchmark contains 1,500 hybrid reasoning questions over oncology clinical-trial data, programmatically augmented from an expert-authored seed set of 500 questions.

Clinical-Trial Table Statistics

Statistic             Value
Rows                  159
Columns               32
Unique trials (NCT)   105
Cancer types          19
ICI names             13

Question/Answer Statistics

Statistic              Value
Total questions        1,500
Mean question length   21.2 tokens
Target fields          31
String                 957
List                   241
Boolean                224
Null-only              78

Experimental Setup

Baselines: Zero-shot, Few-shot, CoT, BlendSQL, EHRAgent, TableGPT2.

Backbones: Qwen3-30B-A3B-Instruct-2507, gpt-oss-20b, Llama-3.3-70B-Instruct.

Metrics: Table F1 (primary), Grounded Row Jaccard (RJ), Grounded Fowlkes-Mallows (FM).
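The page does not spell out how these metrics are computed, so the following is only a sketch under a stated assumption: predictions and gold answers are compared as sets of (row, value) pairs, using the standard set-overlap definitions of F1, Jaccard, and Fowlkes-Mallows. The function name `overlap_metrics`, the empty-set convention, and the example pairs are hypothetical.

```python
import math
from typing import Set, Tuple


def overlap_metrics(pred: Set, gold: Set) -> Tuple[float, float, float]:
    """Return (F1, Jaccard, Fowlkes-Mallows) for two sets of items."""
    if not pred and not gold:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0, 1.0, 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    jaccard = tp / len(pred | gold)                      # intersection over union
    fm = math.sqrt(precision * recall)                   # geometric mean
    return f1, jaccard, fm


# Illustrative (row index, value) pairs, not benchmark data.
gold = {(0, "drug_a"), (1, "drug_b"), (2, "drug_c")}
pred = {(0, "drug_a"), (1, "drug_b")}
f1, rj, fm = overlap_metrics(pred, gold)
print(round(f1, 3), round(rj, 3), round(fm, 3))  # 0.8 0.667 0.816
```

Under these definitions Jaccard never exceeds F1 and F1 never exceeds Fowlkes-Mallows for the same pair of sets, which is consistent with the RJ < F1 < FM ordering seen in the reported results.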

Main Results

Method       Qwen3                   Llama-3.3               GPT-OSS
             F1      RJ      FM      F1      RJ      FM      F1      RJ      FM
BlendSQL     11.56    6.52   20.63    5.60    5.15    5.81    7.30    6.48    7.71
EHRAgent     32.99   29.79   33.74   30.99   28.07   31.69   34.85   31.23   35.65
Zero Shot    56.32   44.95   62.73   66.96   54.55   72.04   73.50   61.05   77.47
CoT          55.37   44.65   61.93   70.87   57.83   75.15   74.17   61.77   78.05
Few-Shot     54.74   44.09   61.48   69.38   56.56   74.05   73.99   61.55   77.85
TableGPT2    F1: 44.03, RJ: 33.78, FM: 50.81
SCoPE        63.19   52.07   69.45   70.87   60.66   76.12   74.31   62.48   78.27

SCoPE is the strongest method on all three metrics for GPT-OSS and Qwen3. On Llama-3.3 it ties CoT on Table F1 (70.87) while achieving higher grounding scores (RJ 60.66 vs. 57.83, FM 76.12 vs. 75.15).
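To make the per-backbone margins concrete, the short script below computes SCoPE's Table F1 gain over the best prompting baseline for each backbone. The numbers are copied from the main results; the dictionary layout and variable names are just one convenient arrangement. Only the prompting baselines are included, since the structured baselines score lower on every backbone.

```python
# Table F1 values copied from the main results.
table_f1 = {
    "Qwen3": {"Zero Shot": 56.32, "CoT": 55.37, "Few-Shot": 54.74, "SCoPE": 63.19},
    "Llama-3.3": {"Zero Shot": 66.96, "CoT": 70.87, "Few-Shot": 69.38, "SCoPE": 70.87},
    "GPT-OSS": {"Zero Shot": 73.50, "CoT": 74.17, "Few-Shot": 73.99, "SCoPE": 74.31},
}

gains = {}
for backbone, scores in table_f1.items():
    baselines = {m: s for m, s in scores.items() if m != "SCoPE"}
    best = max(baselines, key=baselines.get)  # strongest prompting baseline
    gains[backbone] = (best, round(scores["SCoPE"] - baselines[best], 2))

for backbone, (best, gain) in gains.items():
    print(f"{backbone}: +{gain:.2f} F1 over {best}")
# Qwen3: +6.87 F1 over Zero Shot
# Llama-3.3: +0.00 F1 over CoT
# GPT-OSS: +0.14 F1 over CoT
```

The gain is largest on Qwen3, where the prompting baselines are weakest, and shrinks to a tie or near-tie on the two backbones whose baselines already score above 70.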

Cross-Model Ablation

Executor     Planner      F1      RJ      FM
GPT-OSS      Qwen3        75.07   63.74   79.26
Qwen3        GPT-OSS      59.59   48.26   66.32
Qwen3        Llama-3.3    62.47   51.40   68.88
GPT-OSS      Llama-3.3    75.12   63.88   79.28
Llama-3.3    GPT-OSS      68.01   57.37   73.64
Llama-3.3    Qwen3        71.43   61.33   76.64
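One way to read the ablation is to fix the executor and see how much Table F1 moves when the planner is swapped. The script below does that with the F1 values copied from the table; the grouping and the `spread` name are an illustrative reading aid, not an analysis reported by the authors.

```python
from collections import defaultdict

# (executor, planner, Table F1) copied from the cross-model ablation.
ablation = [
    ("GPT-OSS", "Qwen3", 75.07),
    ("Qwen3", "GPT-OSS", 59.59),
    ("Qwen3", "Llama-3.3", 62.47),
    ("GPT-OSS", "Llama-3.3", 75.12),
    ("Llama-3.3", "GPT-OSS", 68.01),
    ("Llama-3.3", "Qwen3", 71.43),
]

by_executor = defaultdict(list)
for executor, _planner, f1 in ablation:
    by_executor[executor].append(f1)

# Spread = max minus min F1 across planners for a fixed executor.
spread = {e: round(max(v) - min(v), 2) for e, v in by_executor.items()}
print(spread)  # {'GPT-OSS': 0.05, 'Qwen3': 2.88, 'Llama-3.3': 3.42}
```

In these six runs, swapping the planner changes Table F1 by at most 3.42 points for a fixed executor, while the gap between executors is larger (for example, 75.07 vs. 71.43 with a Qwen3 planner). With only two planners per executor, this is suggestive rather than conclusive.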

Planner-Coder Baseline

Coder Model   F1      RJ      FM
GPT-OSS       14.56   11.67   23.39
Qwen3         48.92   37.53   56.01
Llama-3.3     63.40   51.55   69.10

Code-synthesis execution is substantially more brittle than constrained grounded execution.

Model Cost-Effectiveness

In the Qwen-based comparison, SCoPE lies on the strongest accuracy-cost frontier: highest Table F1 with moderate planning overhead, substantially better than heavier but less accurate structured baselines.

Related Work

Related datasets include MIMIC-III/IV, eICU, i2b2/n2c2, TabFact, and FeTaQA. Related methods include Seq2SQL, TaBERT, TAPAS, PAL, ReAct, TableGPT2, BlendSQL, and EHRAgent.

Conclusion

SCoPE frames clinical trial reasoning as a distinct table understanding setting and demonstrates that lightweight planner-executor decomposition improves grounded row-level inference while preserving efficiency.

Limitations

The current evaluation excludes the latest frontier proprietary models, does not fine-tune or specialize the planner or executor, lacks a dedicated memorization analysis, and focuses on a single oncology table setting.

Ethics Statement

This system supports trial evidence review, not patient-specific decision-making. It operates on public trial-level data without PHI/PII, and outputs should be expert-reviewed due to possible model errors.

Acknowledgment

Supported by the Mayo Clinic and Arizona State University Alliance for Health Care Collaborative Research Seed Grant Program (Award ID: AWD00041508; Sponsor Award ID: ARI-358187).