LitQuest Design and Workflow
The Living Review System (LitQuest version 1.0; Grbin et al., 2022) is a cloud-based web platform designed to allow researchers to leverage AI to facilitate timely and living reviews. Researchers can engage the system when starting a new review to achieve time savings in the reference screening phase. They can also upload the results of a historical review (i.e. one in which all references have already been manually screened) to generate an algorithm that may be applied to new literature to maintain a living review. A web-based, menu-driven user interface was chosen in preference to a command-line, syntax-driven program to provide a more visually appealing interface and to reduce potential barriers to uptake arising from limited coding experience (O’Connor et al., 2019).
Several features familiar to users of other citation management systems (e.g. Covidence) are included in the LitQuest to obviate the need to use multiple programs to complete one’s review. Capacity is included for multiple raters to review the literature independently and then receive a report detailing articles for which disagreements arise. Further, data on duplicate removal, the number of articles screened, and their allocation into ‘relevant’ versus ‘not relevant’ categories are stored for the end-user to use in PRISMA flowcharts.
New features currently under development (but not used in the present paper) are designed to: (1) enhance confidence in the system (e.g. showing end-users the keywords that are most important to the algorithm, with the opportunity for human input to adjust/correct the algorithm), (2) enhance usefulness (e.g. text summarisation [within the constraints of copyright rules] to concisely summarise key findings; text flagging to help with extraction of key data), and (3) make the algorithm for one project accessible to other users, enhancing generalisability of the algorithm through transfer learning as part of a broader commitment to open science and scientific progress.
The key steps in our LitQuest workflow are described below.
In Step 1 (reference search/results upload), researchers conduct a search across relevant databases, export the results as an EndNote XML or .RIS file, and upload this to the LitQuest for de-duplication. As is standard for this stage of screening in a manual review, imported files contain the title, authors, abstract, and DOI (assisting with backward and forward searching) as inputs for screening.
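To make this step concrete, the sketch below shows one way the de-duplication could be performed on a parsed reference list; the field names and the DOI/title matching rules are assumptions for illustration, not a description of LitQuest’s actual implementation.

```python
import re

def normalise(text):
    """Lower-case and strip punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def deduplicate(references):
    """Drop duplicate references from a parsed EndNote XML/.RIS export.

    `references` is assumed to be a list of dicts with 'doi', 'title',
    'authors', and 'abstract' keys. Matching prefers the DOI where one is
    present and falls back to a normalised title otherwise.
    """
    seen_dois, seen_titles, unique = set(), set(), []
    for ref in references:
        doi = (ref.get("doi") or "").lower().strip()
        title_key = normalise(ref.get("title", ""))
        if doi and doi in seen_dois:
            continue
        if not doi and title_key in seen_titles:
            continue
        if doi:
            seen_dois.add(doi)
        seen_titles.add(title_key)
        unique.append(ref)
    return unique
```

The count of references removed at this point is the kind of figure later reported in the PRISMA flowchart.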
In Step 2 (reference screening), the LitQuest presents articles individually on screen for the end-user to rate. In traditional screening approaches, the rater manually evaluates all articles for possible inclusion in the intended review. In contrast, the LitQuest system utilises AI techniques (specifically, machine learning) to support researchers in determining whether a reference is relevant. This is enabled through the use of active learning, a technique whereby the machine learner (i.e. the algorithm) adapts its understanding of a set of data (the list of search results) with each example it is given (an end-user determination of reference relevance). In this way, the LitQuest can be considered a supervised machine learning model (i.e. a model that learns by example) that is trained to screen search results in partnership with human reviewers. This partnership is reflected in the description of the system as semi-automated. Recent work shows that an active learner can identify relevant, unsorted articles more quickly than other contemporary AI approaches (Yu et al., 2018).
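LitQuest’s exact model architecture is not described here, but the general active-learning pattern it draws on can be illustrated with a minimal sketch: a classifier is refit after each batch of human decisions and the unscreened references are re-ranked by predicted relevance. The TF-IDF/logistic-regression pipeline and the function names below are illustrative assumptions rather than LitQuest’s implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_unscreened(texts, labels):
    """Re-rank unscreened references by predicted relevance.

    texts  : list of title + abstract strings, one per imported reference
    labels : dict mapping reference index -> 1 (relevant) or 0 (not relevant)
             for references a human has already screened
    Returns (index, score) pairs for unscreened references, most likely
    relevant first.
    """
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    screened = sorted(labels)
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X[screened], [labels[i] for i in screened])

    unscreened = [i for i in range(len(texts)) if i not in labels]
    scores = model.predict_proba(X[unscreened])[:, 1]  # relevance score in [0, 1]
    order = np.argsort(scores)[::-1]                   # descending relevance
    return [(unscreened[i], float(scores[i])) for i in order]
```

Each time the human screens further references, the model is refit and the queue is re-ordered; this feedback loop is what makes the learner ‘active’.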
By default, the LitQuest requires end-users to manually screen 10 relevant and 10 not relevant articles before generating the first iteration of an algorithm to sort the remaining articles into ‘relevant’ and ‘not relevant’ categories. This initial algorithm is refined through active learning, and each remaining reference is assigned a relevance score between 0 and 1, which allows the LitQuest to simply present references in descending order of relevance. The relevance of manually sorted articles is displayed visually, with relevant articles coded green with two ticks and not relevant articles coded red with a dash. In cases where multiple raters independently screen references, discrepancies in the assessment of an article are flagged by the LitQuest for users to reconcile. Discrepant ratings do not enter into the algorithm, and hence the algorithm does not diverge across raters.
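The two bookkeeping rules just described can be expressed as a short sketch; the function names are hypothetical, and only the 10-per-class default and the exclusion of discrepant ratings come from the description above.

```python
MIN_PER_CLASS = 10  # LitQuest default: 10 relevant and 10 not relevant manual examples

def ready_to_train(manual_labels):
    """manual_labels: list of 0/1 human screening decisions made so far."""
    return manual_labels.count(1) >= MIN_PER_CLASS and manual_labels.count(0) >= MIN_PER_CLASS

def split_consensus(ratings):
    """ratings: dict mapping reference id -> list of per-rater 0/1 decisions.

    Only consensus labels enter the training set, so the algorithm does not
    diverge across raters; discrepant references are flagged for reconciliation.
    """
    consensus, discrepant = {}, []
    for ref_id, votes in ratings.items():
        if len(set(votes)) == 1:
            consensus[ref_id] = votes[0]
        else:
            discrepant.append(ref_id)
    return consensus, discrepant
```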
In Step 3 (stopping rules), the LitQuest recommends continued manual screening of references sorted by the active learning model until the stopping criteria are met: a streak of 40 consecutive papers rated not relevant, with relevance confidence scores of < 0.5 for all remaining references. At this point the LitQuest AI algorithm is programmed to stop, as it is unlikely that many more relevant papers will be found, and the user is prompted to stop screening. The LitQuest then asks whether the user would like to mark all remaining unscreened references as “not relevant” to complete the screening phase of the review. It is important to note that the LitQuest default stopping rule is only a guide and can be overridden by simply ignoring the recommendation to stop. The stopping rule may also change in future versions of the LitQuest as more data from end-user testing allow evaluation and comparison of the performance of different stopping rules. We also note that, at present, a consensus view is lacking in the literature with respect to the best approach to a stopping rule (Callaghan & Müller-Hansen, 2020).
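Expressed as code, the default stopping rule amounts to a check like the one below; the streak length of 40 and the 0.5 score threshold come from the description above, while the function and variable names are hypothetical.

```python
STREAK_LENGTH = 40      # consecutive manual 'not relevant' decisions
SCORE_THRESHOLD = 0.5   # maximum relevance score among unscreened references

def should_stop(manual_labels, remaining_scores):
    """Return True when the LitQuest default stopping criteria are met.

    manual_labels    : 0/1 human decisions in screening order (1 = relevant)
    remaining_scores : model relevance scores for references not yet screened
    """
    recent = manual_labels[-STREAK_LENGTH:]
    long_irrelevant_streak = len(recent) == STREAK_LENGTH and not any(recent)
    all_low_confidence = all(score < SCORE_THRESHOLD for score in remaining_scores)
    return long_irrelevant_streak and all_low_confidence
```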
In Step 4 (full-text screening and review), on completion of screening, the LitQuest produces a list sorting all references into ‘relevant’ versus ‘not relevant’ for full-text screening. PDFs are collated for full-text screening and data extraction. The LitQuest has functionality to attach PDFs individually to relevant citations. It also supports bulk uploading of PDFs. To do this, relevant citations are exported as an RIS file from the LitQuest. This RIS file is then imported into EndNote so that PDFs can be located using EndNote’s automated ‘Find Full Texts’ function. Once located, the citations and PDFs are exported from EndNote and imported into the LitQuest for full-text screening.
It is important to emphasise that the LitQuest does not currently perform automatic text extraction because (1) a model would need to be fine-tuned and calibrated to ensure reliable summarisation of article contents, lest a misleading summary bias the vote for inclusion, and (2) uploading non-Creative Commons articles to the OpenAI API for summarisation may represent a breach of the access and usage conditions for the material. In the future, LitQuest may be able to overcome this by using an open-source model that is self-hosted as part of LitQuest, accompanied by specific legal terms governing its use in conjunction with the LitQuest platform.
LitQuest is currently under IP and commercial development. To discuss access to LitQuest, please contact the corresponding author.
LitQuest Evaluation
The LitQuest was evaluated using three primary outcome measures: efficiency, performance, and accuracy. Data on each of these measures were drawn from a set of seven systematic reviews, comprising a review series on the predictors, natural history, and consequences of early patterns of relational health within family systems, commissioned by the Paul Ramsay Foundation, Australia. Reviews were based on comprehensive, library-based, systematic searches of the global literature on early relational health, using gold-standard database searching techniques and accessing literature across all major platforms/databases: PubMed, MedLine, PsycInfo, and EBSCOBase. Title and abstract screening was conducted within the LitQuest, overseen by a review series program manager and completed by small teams of trained research assistants.
LitQuest Efficiency was estimated in terms of cost savings in the human resources required for screening. Routinely collected meta-data on the number of articles screened to reach the stop criteria were gathered for each of the seven systematic searches, with the proportion needing human screening calculated as the number of papers screened to the stop criteria divided by the total number of articles to be screened (i.e. the total number of articles imported from the literature search after de-duplication). The mean proportion screened by the research team, together with the range of proportions across all reviews, was used as a marker of LitQuest efficiency.
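For concreteness, the efficiency metric reduces to the calculation sketched below; the per-review counts shown are hypothetical placeholders, not data from the seven reviews.

```python
def proportion_screened(n_screened_to_stop, n_after_dedup):
    """Proportion of de-duplicated references a human screened before the stop criteria."""
    return n_screened_to_stop / n_after_dedup

# Hypothetical (screened-to-stop, total-after-de-duplication) counts for illustration only.
reviews = [(350, 2000), (410, 1800), (290, 2500)]
proportions = [proportion_screened(s, t) for s, t in reviews]
print(f"mean = {sum(proportions) / len(proportions):.2f}, "
      f"range = {min(proportions):.2f}-{max(proportions):.2f}")
```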
LitQuest Performance was evaluated using Work Saved over Sampling (WSS) at a given level of recall (in this case, 95% recall). This was calculated as:
$$\text{WSS} = \frac{\text{Number of Candidate Papers} - \text{Number of Manual Screenings}}{\text{Number of Candidate Papers}} - \left( 1 - \text{Recall achieved} \right)$$
While other indicators (e.g. Balanced F-Score) are often employed for evaluation of model performance, these metrics fail to provide a holistic representation of rater preferences and, as such, provide a crude and unreliable approximation of model performance when benchmarked against human raters. WSS is used for evaluation of other AI tools for systematic reviews (e.g. van de Schoot et al., 2021, who also used a 95% recall threshold).
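Implemented directly from the formula above, WSS at a 95% recall threshold can be computed as in the sketch below (illustrative numbers only):

```python
def wss(n_candidates, n_manual_screenings, recall_achieved):
    """Work Saved over Sampling at a given recall level."""
    return (n_candidates - n_manual_screenings) / n_candidates - (1 - recall_achieved)

# Illustrative example: 2,000 candidate papers, 400 screened manually, 95% recall achieved.
print(wss(n_candidates=2000, n_manual_screenings=400, recall_achieved=0.95))  # 0.75
```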
LitQuest Accuracy was estimated in terms of the error rate in missing relevant articles after stop criteria had been reached. This was estimated by randomly sampling 100 articles from the full set of articles marked ‘not relevant’ by the LitQuest (and not screened by the research assistant team) after reaching the LitQuest stop criteria, and checking that each was appropriately classified. The mean proportion of misclassified articles flagged as ‘not relevant’ on completion of screening was used to provide a marker of LitQuest accuracy.
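This check amounts to a simple random audit, sketched below with hypothetical names; `is_actually_relevant` stands in for the human re-check of each sampled article.

```python
import random

def audit_error_rate(auto_not_relevant_ids, is_actually_relevant, sample_size=100):
    """Randomly re-check references auto-marked 'not relevant' after the stop criteria.

    auto_not_relevant_ids : ids of references marked not relevant without human screening
    is_actually_relevant  : callable returning True if a human judges the reference relevant
    Returns the proportion of sampled references that were misclassified.
    """
    sample = random.sample(auto_not_relevant_ids, k=min(sample_size, len(auto_not_relevant_ids)))
    if not sample:
        return 0.0
    missed = sum(1 for ref_id in sample if is_actually_relevant(ref_id))
    return missed / len(sample)
```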
We note too that, while the LitQuest efficiency and LitQuest performance metrics are conceptually similar, they may return different results. In particular, a stopping rule could be applied too early and thus miss relevant articles. The LitQuest performance metric therefore provides some assurance about the stopping rule by quantifying the number of titles and abstracts that need to be screened to identify a set percentage of relevant articles (in this case, 95%). Our LitQuest accuracy metric provides a further test of the appropriateness of the stopping rule.