IDRBench: LLMs on Interdisciplinary Research

Overview

Abstract

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines — where significant innovation often emerges — has become increasingly challenging.

Recent advances in Large Language Models (LLMs) provide effective access to extensive knowledge sources and show impressive reasoning abilities, creating significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR).

To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis and establishes benchmarks and baselines for future research.

"If you would understand anything, observe its beginning and its development." — Aristotle

Dataset Design

The IDRBench Framework

IDRBench is built on a knowledge triplet structure. Each positive instance consists of an IDR paper P_A together with two cited papers P_B and P_C from distinct disciplines, whose key concepts are meaningfully integrated — not merely referenced — in P_A.

Citing Paper P_A (IDR Paper): "Tumor Location-weighted Contrastive Learning: Improving the Explainability of Pediatric Brain Tumor Diagnosis"

Cited Paper P_B (Quantitative Biology): "Pediatric low-grade glioma: State-of-the-art and ongoing challenges"

Cited Paper P_C (Computer Science): "Improving Pediatric Neuroepithelial Tumor Identification With Novel Loss Function for CNNs"
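The triplet structure can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`Paper`, `Triplet`, `citing_title`, `cited_b`, `cited_c`) are assumptions made for this example and are not taken from the IDRBench release. The discipline check is a necessary condition only; whether the concepts are meaningfully integrated is a judgment made by expert annotators and is not captured in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record for a cited paper; field names are assumptions.
    title: str
    discipline: str


@dataclass(frozen=True)
class Triplet:
    citing_title: str  # title of the IDR paper P_A
    cited_b: Paper     # P_B
    cited_c: Paper     # P_C

    def spans_distinct_disciplines(self) -> bool:
        """Return True when P_B and P_C come from different disciplines."""
        return self.cited_b.discipline != self.cited_c.discipline


# The example triplet shown above, encoded with the hypothetical types.
example = Triplet(
    citing_title=(
        "Tumor Location-weighted Contrastive Learning: Improving the "
        "Explainability of Pediatric Brain Tumor Diagnosis"
    ),
    cited_b=Paper(
        "Pediatric low-grade glioma: State-of-the-art and ongoing challenges",
        "Quantitative Biology",
    ),
    cited_c=Paper(
        "Improving Pediatric Neuroepithelial Tumor Identification With "
        "Novel Loss Function for CNNs",
        "Computer Science",
    ),
)
```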

Dataset Statistics

335 expert-annotated positive paper triplets

31K+ synthetic data samples

271K arXiv papers in source corpus (Nov 2024 – Oct 2025)

LLMs evaluated across all tasks

Distinct scientific disciplines covered

Expert annotators from diverse backgrounds

Disciplines Covered

Computer Science, EE & Systems Science; Quantitative Biology; Physics; Mathematics & Statistics; Economics & Quantitative Finance; Other (Medical, Chemistry, Law…)

Disciplines are selected to maximize conceptual distance, with CS/EE + Quantitative Biology combinations representing nearly 55% of annotated samples.

Evaluation

Three Progressive IDR Tasks

IDRBench's three tasks reflect progressive stages of interdisciplinary research, from basic classification through deep ideation and recommendation.

Task 1

IDR Paper Identification

IPI · Classification

Given the title and abstract of a paper P_A, can the LLM determine whether it constitutes genuine interdisciplinary research and identify the disciplines involved?

Task 2

IDR Idea Integration

I3 · Integration Analysis

Given two papers P_B and P_C from distinct disciplines, can the LLM determine whether they can be meaningfully integrated into a feasible, novel IDR idea?

Task 3

IDR Idea Recommendation

I2R · Ranking

Given a seed paper P_B and a ranked candidate list, can the LLM identify which paper best complements P_B for interdisciplinary research?

Task I3: Two Complementary Subsets

Subset 1 (Feasibility): Negative instances are randomly sampled from different disciplines. Tests whether LLMs can evaluate the feasibility of integrating two papers into a valid IDR idea.

Subset 2 (Awareness): Negative instances come from the same discipline and sub-discipline. Tests whether LLMs are aware that two papers from a single discipline do not constitute interdisciplinary research.
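The difference between the two subsets comes down to how negative pairs are drawn. The following is a minimal sketch of that distinction, not the authors' sampling code: the `Paper` tuple, its fields, and the `negative_pairs` helper are hypothetical, and a real pipeline would presumably also exclude pairs that appear in a positive triplet.

```python
from itertools import combinations
from typing import NamedTuple


class Paper(NamedTuple):
    # Hypothetical metadata record; field names are assumptions.
    paper_id: str
    discipline: str
    sub_discipline: str


def negative_pairs(papers, subset):
    """List candidate negative (P_B, P_C) pairs for Task I3.

    subset=1: the two papers come from different disciplines.
    subset=2: the two papers share both discipline and sub-discipline.
    """
    if subset not in (1, 2):
        raise ValueError("subset must be 1 or 2")
    pairs = []
    for a, b in combinations(papers, 2):
        cross = a.discipline != b.discipline
        same = (a.discipline, a.sub_discipline) == (b.discipline, b.sub_discipline)
        if (subset == 1 and cross) or (subset == 2 and same):
            pairs.append((a, b))
    return pairs


# Toy corpus with made-up identifiers.
corpus = [
    Paper("p1", "Computer Science", "cs.LG"),
    Paper("p2", "Computer Science", "cs.LG"),
    Paper("p3", "Quantitative Biology", "q-bio.NC"),
    Paper("p4", "Computer Science", "cs.CV"),
]
subset1 = negative_pairs(corpus, subset=1)
subset2 = negative_pairs(corpus, subset=2)
```

In this toy corpus, the pair (p1, p4) falls into neither subset, because the papers share a discipline but not a sub-discipline.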

Experiments

Main Results

We evaluated 10 mainstream LLMs — including 5 with explicit reasoning capabilities — under zero-shot and few-shot prompting. Results are reported in Macro-F1 for classification tasks and Mean Reciprocal Rank (MRR) for the recommendation task.

Task / Metric	Non-Reasoning Models						Reasoning Models
Task / Metric	GPT-4o-mini	Gemini 2.0 Flash	Llama 3.1 70B	Llama 3.3 70B	DeepSeek-V3	Qwen 2.5	Qwen 3-32B	GPT o3-mini	GPT o4-mini	GPT-5-nano	Claude Sonnet 4	DeepSeek-R1
IPI 0-shot	0.587	0.538	0.551	0.468	0.607	0.571	0.527	0.486	0.610	0.630	0.582	0.507
IPI 5-shot	0.510	0.534	0.336	0.444	0.614	0.501	0.565	0.523	0.617	0.640	0.621	0.551
I3 S1 0-shot	0.750	0.666	0.696	0.814	0.769	0.793	0.782	0.254	0.372	0.495	0.563	0.526
I3 S1 3-shot	0.765	0.434	0.519	0.822	0.635	0.860	0.746	0.302	0.411	0.492	0.694	0.450
I3 S2 0-shot	0.588	0.473	0.372	0.407	0.509	0.500	0.486	0.151	0.190	0.339	0.555	0.333
I3 S2 3-shot	0.672	0.212	0.240	0.421	0.345	0.543	0.518	0.187	0.365	0.240	0.592	0.259
I2R MRR	0.646*	0.623*	0.650*	0.623*	0.585	0.661	0.642	0.571	0.640*	0.486	0.446	0.588

* Asterisks in I2R denote absence of statistically significant difference from the best result (Wilcoxon signed-rank test). Best results in each row are highlighted.
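For readers unfamiliar with the two metrics, the sketch below computes Macro-F1 and MRR in plain Python. It follows the standard textbook definitions; the function names and toy inputs are illustrative and are not taken from the IDRBench evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the target item; a missing target contributes 0."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(targets)


# Toy inputs (not benchmark data).
f1 = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
mrr = mean_reciprocal_rank([["a", "b", "c"], ["b", "a", "c"]], ["a", "a"])
```

Macro-F1 weights each class equally regardless of how many instances it has, which matters when positive and negative instances are imbalanced.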

Human Evaluation

Expert Assessment of LLM-Generated Ideas

Six human experts with diverse academic backgrounds assessed 60 LLM-generated IDR idea samples on two dimensions: Correctness (does the abstract integrate papers P_B and P_C in an interdisciplinary way?) and Clarity (how clearly is the integration described?). Scores are on a 1–5 scale.

Ideas were generated by gemini-2.0-flash — the most optimistic model in Task I3 — in two formats: a running abstract and an integration sentence.

Running Abstract Quality

Correctness: 3.96 / 5

Clarity: 4.13 / 5

Expert Confidence: 4.16 / 5

Integration Sentence Quality

Correctness: 3.99 / 5

Clarity: 4.20 / 5

Expert Confidence: 4.34 / 5

Additionally, a user study with 56 researchers across 9 disciplines at an academic institution validated IDRBench in real-world scenarios: over 50% of Clarity ratings reached 4–5, confirming that LLM-generated IDR proposals are generally well-formulated.

Key Takeaways

Analysis & Findings

LLMs can generate valid IDR ideas — but struggle to distinguish true integration

Human experts rated LLM-generated ideas positively for both correctness and clarity (scores >3.9/5). However, on Task I3, models fail to reliably distinguish genuine interdisciplinary integration from papers that are merely feasible to combine within a single discipline.

Reasoning models underperform on IDR integration tasks

Non-reasoning models score 0.6–0.8 on I3 Subset 1, while reasoning models score only 0.2–0.6. This echoes a known reasoning–creativity trade-off: chain-of-thought constraints appear to suppress the creative, analogical thinking that IDR requires.

Alignment targets predict optimism vs. pessimism in IDR

Models post-trained toward helpfulness (GPT-o series, Gemini, Llama) are more optimistic — high TPR, lower TNR. Models trained toward safety and integrity (Claude Sonnet 4, GPT-4o-mini) are more conservative — lower TPR, higher TNR. DeepSeek models, with balanced alignment, show intermediate behavior.
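The optimism and pessimism labels rest on the true positive rate (TPR) and true negative rate (TNR). A minimal sketch of how these could be computed for a binary task such as I3 follows, assuming label 1 means "can be integrated" and label 0 means "cannot"; the function name and label encoding are assumptions for illustration.

```python
def tpr_tnr(y_true, y_pred):
    """Return (TPR, TNR) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Toy predictions (not benchmark data): this model accepts most pairs.
tpr, tnr = tpr_tnr([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

A model that accepts nearly every pair scores a high TPR and a low TNR (optimistic), while one that rejects nearly every pair shows the opposite pattern (conservative).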

Fine-tuning on IDRBench data improves integration performance

Fine-tuning a Llama 3.1-8B model on augmented IDRBench data with LoRA yields an average 12.7% gain in Macro-F1 on Task I3, demonstrating the benchmark's value as a training resource for future IDR-capable models.

LLM re-rankings in I2R are not driven by surface similarity

Kendall's τ correlations between LLM re-rankings and SciBERT semantic similarity rankings range from 0.18 to 0.25 across all models, indicating that the re-rankings are not explained by topical similarity alone. This is consistent with LLMs performing deeper conceptual reasoning rather than simple topical matching.
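Kendall's τ compares two orderings of the same items by counting item pairs that appear in the same relative order (concordant) against those that appear in opposite order (discordant). The sketch below implements the tie-free variant (tau-a) in plain Python; the specific variant used in the study is not stated here, so treat this as an illustration of the statistic rather than the exact computation behind the reported values.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings of the same items.

    Each ranking is a list of item identifiers ordered from best to worst.
    """
    if len(rank_a) < 2:
        raise ValueError("need at least two items")
    if len(set(rank_a)) != len(rank_a) or set(rank_a) != set(rank_b):
        raise ValueError("rankings must contain the same unique items")
    position_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    # combinations() yields (x, y) with x ahead of y in rank_a.
    for x, y in combinations(rank_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy rankings (not benchmark data): one adjacent swap out of four items.
tau = kendall_tau(["a", "b", "c", "d"], ["a", "c", "b", "d"])
```

A value of 1 means the two rankings agree on every pair, -1 means they are exact reverses, and values near 0 mean the orderings are largely unrelated.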

Citation

BibTeX

If you use IDRBench in your research, please cite:

@article{shen2026idrbench,
  title     = {IDRBench: Understanding the Capability of Large Language
             Models on Interdisciplinary Research},
  author    = {Shen, Yuanhao and De Sousa, Daniel Xavier and
             Nascimento, Ricardo Mar\c{c}al de Andrade and
             Guo, Hongyu and Zhu, Xiaodan},
  journal   = {arXiv preprint arXiv:2507.15736},
  year      = {2026}
}