Expert Lech Madeyski Conducts Systematic Review of AI-Powered Evidence Synthesis in Software Engineering


2026-06-21 | Source: Mastodon | Original article

The "most accurate" large language model reportedly discarded 63% of relevant papers while screening studies for a systematic review, raising concerns about the reliability of AI-assisted evidence synthesis.

A recent finding highlights the limits of large language models (LLMs) in systematic reviews. According to Lech Madeyski, a prominent voice in the field, the "most accurate" LLM discarded 63% of relevant papers during the screening stage of a systematic review. That raises significant concerns about the reliability of LLMs for evidence synthesis in software engineering.

This matters because systematic reviews depend on comprehensive, accurate coverage of existing research to inform decisions and guide future studies. If an LLM silently discards a substantial share of relevant papers, the resulting review may be flawed and the decisions based on it misinformed.

As LLM use in research and software engineering grows, researchers and developers should be cautious about relying on these models for systematic reviews and evidence synthesis, monitor their performance, and be transparent about their methods. Further investigation of LLM capabilities and limitations in these contexts is needed to protect the integrity of research and decision-making.
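To make the figure concrete: discarding 63% of relevant papers corresponds to a screening recall of 37%. The sketch below is illustrative only. The function name and all counts are hypothetical and are not taken from Madeyski's post, and it assumes accuracy is measured over every screened paper. Under that assumption, a screener can post a high accuracy score when most candidates are irrelevant while still missing most of the relevant ones.

```python
def screening_metrics(relevant_kept, relevant_discarded,
                      irrelevant_kept, irrelevant_discarded):
    """Return (recall, accuracy) for a paper-screening step.

    recall   = share of truly relevant papers that were kept
    accuracy = share of all screening decisions that were correct
    """
    total = (relevant_kept + relevant_discarded
             + irrelevant_kept + irrelevant_discarded)
    recall = relevant_kept / (relevant_kept + relevant_discarded)
    accuracy = (relevant_kept + irrelevant_discarded) / total
    return recall, accuracy


# Hypothetical counts (assumption, not reported data):
# 1,000 candidate papers, 100 of them relevant, 63 of those discarded.
recall, accuracy = screening_metrics(
    relevant_kept=37,
    relevant_discarded=63,
    irrelevant_kept=20,
    irrelevant_discarded=880,
)
print(f"recall={recall:.2f} accuracy={accuracy:.3f}")
```

With these made-up numbers the screener is right on more than nine decisions in ten, yet keeps only 37 of the 100 relevant papers, which is why recall is the figure to watch at the screening stage.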

Sources

Mastodon
