ICML 2026 · Preprint

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

A pioneering benchmark for evaluating LLMs on interdisciplinary ideation, identification, and recommendation.

Yuanhao Shen*1 Daniel Xavier De Sousa*2 Ricardo Marçal de Andrade Nascimento2 Hongyu Guo3 Xiaodan Zhu1
* Equal Contribution
1Queen's University, Kingston, Canada  ·  2Instituto Federal de Goiás, Anápolis, Brazil  ·  3National Research Council Canada, Ottawa

Abstract

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines — where significant innovation often emerges — has become increasingly challenging.

Recent advances in Large Language Models (LLMs) provide effective access to extensive knowledge sources and show impressive reasoning abilities, creating significant opportunities for interdisciplinary discovery. Our research aims to understand how well state-of-the-art LLMs integrate knowledge from different fields for interdisciplinary research (IDR).

To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis and establishes benchmarks and baselines for future research.

"If you would understand anything, observe its beginning and its development." — Aristotle

The IDRBench Framework

IDRBench is built on a knowledge triplet structure. Each positive instance consists of an IDR paper PA together with two cited papers PB and PC from distinct disciplines, whose key concepts are meaningfully integrated — not merely referenced — in PA.

Citing Paper PA (IDR Paper): "Tumor Location-weighted Contrastive Learning: Improving the Explainability of Pediatric Brain Tumor Diagnosis"
Cited Paper PB (Quantitative Biology): "Pediatric low-grade glioma: State-of-the-art and ongoing challenges"
Cited Paper PC (Computer Science): "Improving Pediatric Neuroepithelial Tumor Identification With Novel Loss Function for CNNs"
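The triplet structure can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`Paper`, `Triplet`, `has_distinct_disciplines`) are assumptions made for this example and are not taken from the IDRBench release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    discipline: str


@dataclass(frozen=True)
class Triplet:
    citing_title: str  # PA, the IDR paper
    cited_b: Paper     # PB
    cited_c: Paper     # PC

    def has_distinct_disciplines(self) -> bool:
        # Necessary but not sufficient for a positive instance: experts
        # must still judge that PA meaningfully integrates both papers.
        return self.cited_b.discipline != self.cited_c.discipline


example = Triplet(
    citing_title=(
        "Tumor Location-weighted Contrastive Learning: Improving the "
        "Explainability of Pediatric Brain Tumor Diagnosis"
    ),
    cited_b=Paper(
        "Pediatric low-grade glioma: State-of-the-art and ongoing challenges",
        "Quantitative Biology",
    ),
    cited_c=Paper(
        "Improving Pediatric Neuroepithelial Tumor Identification With "
        "Novel Loss Function for CNNs",
        "Computer Science",
    ),
)
```

The discipline check only captures the structural requirement; whether the concepts are integrated rather than merely referenced is an annotation decision that code like this cannot make.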

Dataset Statistics

335 expert-annotated positive paper triplets
31K+ synthetic data samples
271K arXiv papers in source corpus (Nov 2024 – Oct 2025)
10 LLMs evaluated across all tasks
6 distinct scientific disciplines covered
9 expert annotators from diverse backgrounds

Disciplines Covered

Computer Science, EE & Systems Science  ·  Quantitative Biology  ·  Physics  ·  Mathematics & Statistics  ·  Economics & Quantitative Finance  ·  Other (Medical, Chemistry, Law…)

Disciplines are selected to maximize conceptual distance, with CS/EE + Quantitative Biology combinations representing nearly 55% of annotated samples.

Three Progressive IDR Tasks

IDRBench's three tasks reflect progressive stages of interdisciplinary research, from basic classification through deep ideation and recommendation.

Task 1

IDR Paper Identification

IPI · Classification

Given the title and abstract of a paper PA, can the LLM determine whether it constitutes genuine interdisciplinary research and identify the disciplines involved?

Task 2

IDR Idea Integration

I3 · Integration Analysis

Given two papers PB and PC from distinct disciplines, can the LLM determine whether they can be meaningfully integrated into a feasible, novel IDR idea?

Task 3

IDR Idea Recommendation

I2R · Ranking

Given a seed paper PB and a ranked candidate list, can the LLM identify which paper best complements PB for interdisciplinary research?

Task I3: Two Complementary Subsets

Subset 1 (Feasibility): Negative instances are randomly sampled from different disciplines. Tests whether LLMs can evaluate the feasibility of integrating two papers into a valid IDR.

Subset 2 (Awareness): Negative instances come from the same discipline and sub-discipline. Tests whether LLMs are aware that two papers from a single discipline do not constitute interdisciplinary research.
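The difference between the two subsets comes down to how the negative pair is drawn. The following is a minimal sketch under stated assumptions: `Paper`, `sample_negative_pair`, and the toy pool are hypothetical names and data for illustration, not the authors' sampling code.

```python
import random
from typing import NamedTuple


class Paper(NamedTuple):
    paper_id: str
    discipline: str
    sub_discipline: str


def sample_negative_pair(pool, subset, rng):
    """Draw one negative (PB, PC) pair for I3 Subset 1 or Subset 2."""
    first = rng.choice(pool)
    if subset == 1:
        # Subset 1: the partner comes from a different discipline.
        partners = [p for p in pool if p.discipline != first.discipline]
    elif subset == 2:
        # Subset 2: the partner shares discipline and sub-discipline.
        partners = [
            p for p in pool
            if p != first
            and p.discipline == first.discipline
            and p.sub_discipline == first.sub_discipline
        ]
    else:
        raise ValueError("subset must be 1 or 2")
    if not partners:
        raise ValueError("no eligible partner for the sampled paper")
    return first, rng.choice(partners)


# Toy pool with made-up identifiers.
pool = [
    Paper("a1", "Computer Science", "cs.LG"),
    Paper("a2", "Computer Science", "cs.LG"),
    Paper("b1", "Quantitative Biology", "q-bio.NC"),
    Paper("b2", "Quantitative Biology", "q-bio.NC"),
]
rng = random.Random(0)
feasibility_negative = sample_negative_pair(pool, 1, rng)
awareness_negative = sample_negative_pair(pool, 2, rng)
```

A real pipeline would presumably also check that a randomly drawn cross-discipline pair does not coincide with an annotated positive triplet; that step is omitted here.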

Main Results

We evaluated 10 mainstream LLMs — including 5 with explicit reasoning capabilities — under zero-shot and few-shot prompting. Results are reported in Macro-F1 for classification tasks and Mean Reciprocal Rank (MRR) for the recommendation task.

Columns list the non-reasoning models first, followed by the reasoning models.

| Task / Metric | GPT-4o-mini | Gemini 2.0 Flash | Llama 3.1 70B | Llama 3.3 70B | DeepSeek-V3 | Qwen 2.5 | Qwen 3-32B | GPT o3-mini | GPT o4-mini | GPT-5-nano | Claude Sonnet 4 | DeepSeek-R1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IPI 0-shot | 0.587 | 0.538 | 0.551 | 0.468 | 0.607 | 0.571 | 0.527 | 0.486 | 0.610 | 0.630 | 0.582 | 0.507 |
| IPI 5-shot | 0.510 | 0.534 | 0.336 | 0.444 | 0.614 | 0.501 | 0.565 | 0.523 | 0.617 | 0.640 | 0.621 | 0.551 |
| I3 S1 0-shot | 0.750 | 0.666 | 0.696 | 0.814 | 0.769 | 0.793 | 0.782 | 0.254 | 0.372 | 0.495 | 0.563 | 0.526 |
| I3 S1 3-shot | 0.765 | 0.434 | 0.519 | 0.822 | 0.635 | 0.860 | 0.746 | 0.302 | 0.411 | 0.492 | 0.694 | 0.450 |
| I3 S2 0-shot | 0.588 | 0.473 | 0.372 | 0.407 | 0.509 | 0.500 | 0.486 | 0.151 | 0.190 | 0.339 | 0.555 | 0.333 |
| I3 S2 3-shot | 0.672 | 0.212 | 0.240 | 0.421 | 0.345 | 0.543 | 0.518 | 0.187 | 0.365 | 0.240 | 0.592 | 0.259 |
| I2R MRR | 0.646* | 0.623* | 0.650* | 0.623* | 0.585 | 0.661 | 0.642 | 0.571 | 0.640* | 0.486 | 0.446 | 0.588 |

* In the I2R row, an asterisk marks a result that shows no statistically significant difference from the best result (Wilcoxon signed-rank test). Best results in each row are highlighted.
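For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of Macro-F1 and MRR. The function names and toy inputs are assumptions for illustration; averaging F1 over every label that appears in either the gold or the predicted labels is a common convention, and the source does not state which convention the evaluation uses.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based position of the correct paper for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy data: 1 = positive (IDR), 0 = negative.
toy_f1 = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
toy_mrr = mean_reciprocal_rank([1, 2, 4])
```

In the toy classification case the per-class F1 scores are 0.8 (class 0) and 2/3 (class 1), so Macro-F1 is 11/15, about 0.733. The toy MRR is (1 + 1/2 + 1/4) / 3 = 7/12, about 0.583.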

Expert Assessment of LLM-Generated Ideas

Six human experts with diverse academic backgrounds assessed 60 LLM-generated IDR idea samples on two dimensions: Correctness (does the abstract integrate papers PB and PC in an interdisciplinary way?) and Clarity (how clearly is the integration described?). Scores are on a 1–5 scale.

Ideas were generated by gemini-2.0-flash — the most optimistic model in Task I3 — in two formats: a running abstract and an integration sentence.

Running Abstract Quality

Correctness: 3.96 / 5
Clarity: 4.13 / 5
Expert Confidence: 4.16 / 5

Integration Sentence Quality

Correctness: 3.99 / 5
Clarity: 4.20 / 5
Expert Confidence: 4.34 / 5

Additionally, a user study with 56 researchers across 9 disciplines at an academic institution examined IDRBench in real-world scenarios: over 50% of Clarity ratings were 4 or 5, which suggests that LLM-generated IDR proposals are generally well formulated.

Analysis & Findings

1. LLMs can generate valid IDR ideas but struggle to distinguish true integration

Human experts rated LLM-generated ideas positively for both correctness and clarity (scores >3.9/5). However, on Task I3, models fail to reliably distinguish genuine interdisciplinary integration from papers that are merely feasible to combine within a single discipline.

2. Reasoning models underperform on IDR integration tasks

In the zero-shot setting, non-reasoning models reach roughly 0.6–0.8 Macro-F1 on I3 Subset 1, whereas reasoning models reach only about 0.2–0.6. This echoes a known reasoning–creativity trade-off: chain-of-thought constraints appear to suppress the creative, analogical thinking that IDR requires.

3. Alignment targets predict optimism vs. pessimism in IDR

Models post-trained toward helpfulness (GPT-o series, Gemini, Llama) are more optimistic: they show a high true positive rate (TPR) but a lower true negative rate (TNR). Models trained toward safety and integrity (Claude Sonnet 4, GPT-4o-mini) are more conservative, with lower TPR and higher TNR. DeepSeek models, with balanced alignment, show intermediate behavior.
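The optimism/conservatism contrast can be made concrete with TPR and TNR computed from binary predictions. This is a minimal sketch; the function name `tpr_tnr` and the toy predictions are invented for illustration and are not results from the benchmark.

```python
def tpr_tnr(y_true, y_pred):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr


# 1 = "the two papers can be integrated", 0 = "they cannot".
gold = [1, 1, 0, 0, 0]
optimistic = [1, 1, 1, 1, 0]    # says "yes" to most pairs
conservative = [1, 0, 0, 0, 0]  # says "no" to most pairs

optimistic_rates = tpr_tnr(gold, optimistic)
conservative_rates = tpr_tnr(gold, conservative)
```

On this toy data the optimistic predictor gets TPR 1.0 and TNR 1/3, while the conservative one gets TPR 0.5 and TNR 1.0, which mirrors the pattern described above.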

4. Fine-tuning on IDRBench data improves integration performance

Fine-tuning a Llama 3.1-8B model on augmented IDRBench data with LoRA yields an average 12.7% gain in Macro-F1 on Task I3, demonstrating the benchmark's value as a training resource for future IDR-capable models.

5. LLM re-rankings in I2R are not driven by surface similarity

Kendall's τ correlations between LLM re-rankings and SciBERT semantic similarity rankings range from 0.18 to 0.25 across all models. Such weak agreement suggests that LLMs rely on deeper conceptual reasoning rather than simple topical matching.
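To show how such a rank correlation is obtained, the sketch below computes Kendall's τ for two tie-free rankings of the same candidates. It implements the simple tau-a variant; the source does not say which variant was used, and the function name and toy rankings are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Position of each of five candidates under two hypothetical rankers.
llm_rank = [1, 2, 3, 4, 5]
similarity_rank = [2, 1, 4, 3, 5]
tau = kendall_tau(llm_rank, similarity_rank)
```

With these toy rankings, 8 of the 10 pairs are concordant and 2 are discordant, giving τ = 0.6. Values around 0.2 therefore indicate much weaker agreement than in this example.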

BibTeX

If you use IDRBench in your research, please cite:

@article{shen2026idrbench,
  title     = {IDRBench: Understanding the Capability of Large Language
             Models on Interdisciplinary Research},
  author    = {Shen, Yuanhao and De Sousa, Daniel Xavier and
             Nascimento, Ricardo Mar\c{c}al de Andrade and
             Guo, Hongyu and Zhu, Xiaodan},
  journal   = {arXiv preprint arXiv:2507.15736},
  year      = {2026}
}