Callaghan, M., Mueller-Hansen, F., Bond, M., Hamel, C., Devane, D., Kusa, W., O'Mara-Eves, A., Spijker, R., Stevenson, M., Stansfield, C., Thomas, J., Minx, J.
Computer-assisted screening in systematic evidence synthesis requires robust and well-evaluated stopping criteria
In: Systematic Reviews, 22 November 2024
Systematic reviews, systematic maps, rapid reviews, and other evidence synthesis products are important resources for evidence-based decision-making [1,2,3]. To synthesise evidence systematically, potentially relevant documents are usually identified using systematic searches across multiple bibliographic databases [4]. Typically, and as recommended in widely applied review guidelines, two reviewers independently read each of these records at the title and abstract level and either include the record for further assessment at full text or exclude it [5]. This process, known as “screening”, is labour-intensive and time-consuming. As the number of records needing to be assessed grows [6], and as artificial intelligence (AI) and machine learning (ML) advance, particularly in the processing of text [7,8,9], calls to use ML to increase the efficiency of screening grow louder [10,11,12].
There is a long history of research demonstrating the potential of ML in screening (since [13]), as well as a large related literature on the use of ML for similar tasks within legal eDiscovery [14, 15]. A substantial part of this literature [16] uses ML to prioritise records by predicted relevance (ML-prioritised screening), such that a high proportion of the relevant records is identified after humans have screened only a smaller proportion of all available records. Human screening with ML prioritisation thus constitutes an “active learning” or “researcher-in-the-loop” procedure, in which the machine uses information from already screened records to select which ones to show the human coders in the next batch.
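To make the researcher-in-the-loop procedure concrete, the sketch below simulates ML-prioritised screening on a toy corpus. It is a minimal illustration only: the classifier, features, batch size, seed sample, and simulated reviewer decisions are all assumptions of this sketch and are not drawn from any particular screening tool.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy corpus: the hidden labels stand in for the human reviewer's decisions.
abstracts = [f"randomised trial of the intervention, outcome {i}" for i in range(200)] + \
            [f"unrelated report on another topic {i}" for i in range(800)]
true_labels = np.array([1] * 200 + [0] * 800)

X = TfidfVectorizer().fit_transform(abstracts)

# Seed the process with a small random batch screened by the human reviewer.
screened = list(rng.choice(len(abstracts), size=50, replace=False))
labels = {i: int(true_labels[i]) for i in screened}

batch_size = 50
while len(screened) < len(abstracts):
    # Re-train on all records screened so far, rank the unscreened records by
    # predicted relevance, and show the top-ranked batch to the reviewer next.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[screened], [labels[i] for i in screened])
    unscreened = [i for i in range(len(abstracts)) if i not in labels]
    scores = clf.predict_proba(X[unscreened])[:, 1]
    next_batch = [unscreened[j] for j in np.argsort(scores)[::-1][:batch_size]]
    for i in next_batch:  # the human-in-the-loop step (simulated here)
        labels[i] = int(true_labels[i])
        screened.append(i)
    recall = sum(labels[i] for i in screened) / true_labels.sum()
    print(f"screened {len(screened):4d} of {len(abstracts)} records, "
          f"recall of relevant records so far: {recall:.2f}")
```

In this toy setting the output shows the characteristic pattern of prioritised screening: recall rises quickly over the first few batches and then plateaus, which is precisely the regime in which early stopping becomes attractive.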
Once all relevant records have been identified, the remaining unscreened records represent work that could, in principle, be saved by stopping human screening early, that is, before all records have been screened. When this point is reached is, however, unknowable, because the total number of relevant records is not known a priori. To decide when to stop and achieve these work savings safely, a live review therefore needs methods that effectively manage the risk of missing more studies than would be acceptable in that review. This commentary uses the term “early stopping” to refer to stopping screening before all records have been screened, without implying that this is too early. We also refer to “safe” methods for early stopping while recognising that no method eliminates risk entirely, and that the consequences of missing studies vary depending on the review context.
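As one illustration of what such a criterion can look like, the sketch below implements a simple statistical rule: after prioritised screening, a random sample of the remaining records is screened, and a hypergeometric test asks how plausible it is that enough relevant records remain to push recall below a chosen target. The construction, parameter names, target recall, and example numbers are assumptions of this sketch, not a method endorsed or prescribed by this commentary.

```python
from math import floor
from scipy.stats import hypergeom

def stop_p_value(pool_size, relevant_found, sample_size, sample_relevant,
                 target_recall=0.95):
    """P-value for the null hypothesis that stopping now would miss enough
    records to leave recall below target_recall.

    pool_size       -- unscreened records remaining when the random sample was drawn
    relevant_found  -- relevant records identified during prioritised screening
    sample_size     -- size of the random sample drawn from the unscreened pool
    sample_relevant -- relevant records found in that random sample
    """
    # Smallest number of missed records in the pool that would violate the target:
    # relevant_found / (relevant_found + missed) < target_recall
    min_missed = floor(relevant_found * (1 - target_recall) / target_recall) + 1
    # Under the null hypothesis (at least min_missed relevant records left in the
    # pool), the count found in a random sample is hypergeometric. A small p-value
    # suggests few relevant records remain and screening could stop.
    return hypergeom.cdf(sample_relevant, pool_size, min_missed, sample_size)

# Hypothetical example: 480 relevant records found, 4,000 records still unscreened,
# and a random sample of 400 of them contains a single relevant record.
print(stop_p_value(pool_size=4000, relevant_found=480,
                   sample_size=400, sample_relevant=1))
```

A large p-value, as in this example, would indicate that stopping is not yet warranted at the chosen level of confidence; the point here is only the general shape of such a rule, not its specific parameterisation.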
In this commentary, we briefly assess the current state of early stopping across evidence synthesis practice, tools, and guidance, and highlight where it falls short of the demand for transparent and robust methods for identifying studies. To address this gap, we provide recommendations for promoting, developing, and applying safe stopping criteria, and we highlight leverage points for developing commonly agreed-upon principles for their implementation in ML-supported evidence synthesis.