Abstract
Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines — where significant innovation often emerges — has become increasingly challenging.
Recent advances in Large Language Models (LLMs) provide effective access to extensive knowledge sources and show impressive reasoning abilities, creating significant opportunities for interdisciplinary discovery. Our research aims to understand how well state-of-the-art LLMs integrate knowledge from different fields for interdisciplinary research (IDR).
To address this problem, we introduce IDRBench, a pioneering framework that includes both datasets and three evaluation tasks: (1) IDR Paper Identification (IPI), (2) IDR Idea Integration (I3), and (3) IDR Idea Recommendation (I2R). Our study on ten mainstream LLMs provides a comprehensive analysis and establishes benchmarks and baselines for future research.
The IDRBench Framework
IDRBench is built on a knowledge triplet structure. Each positive instance consists of an IDR paper PA together with two cited papers PB and PC from distinct disciplines, whose key concepts are meaningfully integrated — not merely referenced — in PA.
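The triplet structure described above can be pictured as a small data model. This is a minimal sketch only: the class and field names (`Paper`, `Triplet`, `discipline`, and so on) are illustrative assumptions, not the benchmark's actual schema. Whether PB and PC are meaningfully integrated in PA is a human annotation, so the code checks only the part that is mechanical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Minimal paper record; all field names are illustrative assumptions."""
    title: str
    abstract: str
    discipline: str


@dataclass(frozen=True)
class Triplet:
    """An IDR paper (pa) together with two papers it cites (pb, pc)."""
    pa: Paper
    pb: Paper
    pc: Paper

    def spans_distinct_disciplines(self) -> bool:
        # A positive instance requires PB and PC to come from different
        # disciplines; meaningful integration itself is not checked here.
        return self.pb.discipline != self.pc.discipline


example = Triplet(
    pa=Paper("Hypothetical IDR paper", "...", "Quantitative Biology"),
    pb=Paper("Hypothetical cited paper B", "...", "Quantitative Biology"),
    pc=Paper("Hypothetical cited paper C", "...", "Computer Science"),
)
```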
[Figure: example knowledge triplet linking an IDR paper to cited papers from Quantitative Biology and Computer Science]
Dataset Statistics
Disciplines Covered
Disciplines are selected to maximize conceptual distance, with CS/EE + Quantitative Biology combinations representing nearly 55% of annotated samples.
Three Progressive IDR Tasks
IDRBench's three tasks reflect progressive stages of interdisciplinary research, from basic classification through deep ideation and recommendation.
IDR Paper Identification
Given the title and abstract of a paper PA, can the LLM determine whether it constitutes genuine interdisciplinary research and identify the disciplines involved?
IDR Idea Integration
Given two papers PB and PC from distinct disciplines, can the LLM determine whether they can be meaningfully integrated into a feasible, novel IDR idea?
IDR Idea Recommendation
Given a seed paper PB and a ranked candidate list, can the LLM identify which paper best complements PB for interdisciplinary research?
Task I3: Two Complementary Subsets
Subset 1 (Feasibility): Negative instances are randomly sampled from different disciplines. Tests whether LLMs can evaluate the feasibility of integrating two papers into a valid IDR.
Subset 2 (Awareness): Negative instances come from the same discipline and sub-discipline. Tests whether LLMs are aware that two papers from a single discipline do not constitute interdisciplinary research.
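The difference between the two subsets comes down to how the negative pair is drawn. The sketch below illustrates that difference under stated assumptions: papers are plain dictionaries with hypothetical `id`, `discipline`, and `sub_discipline` keys, and the function name `sample_negative_pair` is invented for this example. A real pipeline would presumably also exclude pairs that already appear as positives, which is omitted here.

```python
import random
from typing import Dict, List, Optional, Tuple

PaperRecord = Dict[str, str]  # hypothetical keys: id, discipline, sub_discipline


def sample_negative_pair(
    papers: List[PaperRecord], subset: int, rng: random.Random
) -> Optional[Tuple[PaperRecord, PaperRecord]]:
    """Draw one negative (PB, PC) pair for Subset 1 or Subset 2."""
    first = rng.choice(papers)
    if subset == 1:
        # Subset 1 (Feasibility): partner comes from a different discipline.
        candidates = [p for p in papers if p["discipline"] != first["discipline"]]
    elif subset == 2:
        # Subset 2 (Awareness): partner shares discipline and sub-discipline.
        candidates = [
            p for p in papers
            if p["id"] != first["id"]
            and p["discipline"] == first["discipline"]
            and p["sub_discipline"] == first["sub_discipline"]
        ]
    else:
        raise ValueError("subset must be 1 or 2")
    if not candidates:
        return None  # no valid partner exists for this seed paper
    return first, rng.choice(candidates)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.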
Main Results
We evaluated 10 mainstream LLMs — including 5 with explicit reasoning capabilities — under zero-shot and few-shot prompting. Results are reported in Macro-F1 for classification tasks and Mean Reciprocal Rank (MRR) for the recommendation task.
Non-reasoning models are listed first, followed by reasoning models.

| Task / Metric | GPT-4o-mini | Gemini 2.0 Flash | Llama 3.1 70B | Llama 3.3 70B | DeepSeek-V3 | Qwen 2.5 | Qwen 3-32B | GPT o3-mini | GPT o4-mini | GPT-5-nano | Claude Sonnet 4 | DeepSeek-R1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IPI 0-shot | 0.587 | 0.538 | 0.551 | 0.468 | 0.607 | 0.571 | 0.527 | 0.486 | 0.610 | **0.630** | 0.582 | 0.507 |
| IPI 5-shot | 0.510 | 0.534 | 0.336 | 0.444 | 0.614 | 0.501 | 0.565 | 0.523 | 0.617 | **0.640** | 0.621 | 0.551 |
| I3 S1 0-shot | 0.750 | 0.666 | 0.696 | **0.814** | 0.769 | 0.793 | 0.782 | 0.254 | 0.372 | 0.495 | 0.563 | 0.526 |
| I3 S1 3-shot | 0.765 | 0.434 | 0.519 | 0.822 | 0.635 | **0.860** | 0.746 | 0.302 | 0.411 | 0.492 | 0.694 | 0.450 |
| I3 S2 0-shot | **0.588** | 0.473 | 0.372 | 0.407 | 0.509 | 0.500 | 0.486 | 0.151 | 0.190 | 0.339 | 0.555 | 0.333 |
| I3 S2 3-shot | **0.672** | 0.212 | 0.240 | 0.421 | 0.345 | 0.543 | 0.518 | 0.187 | 0.365 | 0.240 | 0.592 | 0.259 |
| I2R MRR | 0.646* | 0.623* | 0.650* | 0.623* | 0.585 | **0.661** | 0.642 | 0.571 | 0.640* | 0.486 | 0.446 | 0.588 |
Note: an asterisk (*) in the I2R row denotes no statistically significant difference from the best result (Wilcoxon signed-rank test). Best results in each row are highlighted.
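For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of their standard definitions in plain Python. The function names are chosen for this example, and this is not the authors' evaluation code. Macro-F1 averages the per-class F1 scores with equal weight per class, and MRR averages the reciprocal of the 1-based rank at which the correct paper appears.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank, where rank is the 1-based position of the correct item."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty sequence of integers >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

An MRR near 0.66, for instance, is what a system would score if it placed the correct paper first in some queries and second or third in most of the others.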
Expert Assessment of LLM-Generated Ideas
Six human experts with diverse academic backgrounds assessed 60 LLM-generated IDR idea samples on two dimensions: Correctness (does the abstract integrate papers PB and PC in an interdisciplinary way?) and Clarity (how clearly is the integration described?). Scores are on a 1–5 scale.
Ideas were generated by gemini-2.0-flash, the most optimistic model in Task I3, in two formats: a running abstract and an integration sentence.
[Figure panels: Running Abstract Quality; Integration Sentence Quality]
Additionally, a user study with 56 researchers across 9 disciplines at an academic institution tested IDRBench in real-world scenarios: over 50% of Clarity ratings were 4 or 5, indicating that LLM-generated IDR proposals are generally well formulated.
Analysis & Findings
LLMs can generate valid IDR ideas — but struggle to distinguish true integration
Human experts rated LLM-generated ideas positively for both correctness and clarity (scores >3.9/5). However, on Task I3, models fail to reliably distinguish genuine interdisciplinary integration from papers that are merely feasible to combine within a single discipline.
Reasoning models underperform on IDR integration tasks
Non-reasoning models score 0.6–0.8 on I3 Subset 1, while reasoning models score only 0.2–0.6. This echoes a known reasoning–creativity trade-off: chain-of-thought constraints appear to suppress the creative, analogical thinking that IDR requires.
Alignment targets predict optimism vs. pessimism in IDR
Models post-trained toward helpfulness (GPT-o series, Gemini, Llama) are more optimistic, with a high true positive rate (TPR) and a lower true negative rate (TNR). Models trained toward safety and integrity (Claude Sonnet 4, GPT-4o-mini) are more conservative, with lower TPR and higher TNR. DeepSeek models, with balanced alignment, show intermediate behavior.
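The optimism and pessimism pattern above is read off two rates from the binary confusion matrix. A minimal sketch of how they are computed, with a function name and toy data invented for this example: an "optimistic" model says yes to most pairs, which drives TPR up and TNR down.

```python
from typing import Sequence, Tuple


def tpr_tnr(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    positives = sum(bool(t) for t in y_true)
    negatives = len(y_true) - positives
    # Rates are reported as 0.0 when the corresponding class is absent.
    tpr = tp / positives if positives else 0.0
    tnr = tn / negatives if negatives else 0.0
    return tpr, tnr


# Toy illustration: a model that accepts almost every pair as valid IDR.
truth = [True, True, True, False, False, False]
optimistic = [True, True, True, True, True, False]
rates = tpr_tnr(truth, optimistic)
```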
Fine-tuning on IDRBench data improves integration performance
Fine-tuning a Llama 3.1-8B model on augmented IDRBench data with LoRA yields an average 12.7% gain in Macro-F1 on Task I3, demonstrating the benchmark's value as a training resource for future IDR-capable models.
LLM re-rankings in I2R are not driven by surface similarity
Kendall's τ correlations between LLM re-rankings and SciBERT semantic similarity rankings range from 0.18 to 0.25 across all models. These weak correlations suggest that LLMs perform deeper conceptual reasoning rather than simple topical matching.
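To make the correlation figure concrete, here is a minimal sketch of Kendall's τ between two rankings of the same candidates. It implements the tau-a variant, which assumes no ties; the source does not say which variant was used, so treat that choice as an assumption. The function name is invented for this example.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(rank_a: Sequence[float], rank_b: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both rankings order items i and j alike.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `rank_a[i]` and `rank_b[i]` are the positions assigned to candidate `i` by the two rankers. Values run from -1 (reversed order) to 1 (identical order), so 0.18 to 0.25 corresponds to only a modest excess of agreeing pairs over disagreeing ones.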
BibTeX
If you use IDRBench in your research, please cite:
@article{shen2026idrbench,
  title   = {IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research},
  author  = {Shen, Yuanhao and De Sousa, Daniel Xavier and Nascimento, Ricardo Mar\c{c}al de Andrade and Guo, Hongyu and Zhu, Xiaodan},
  journal = {arXiv preprint arXiv:2507.15736},
  year    = {2026}
}