Citationchaser: A tool for transparent and efficient forward and backward citation chasing in systematic searching

Systematic searching aims to find all possibly relevant research from multiple sources, the basis for an unbiased and comprehensive evidence base. Along with bibliographic databases, systematic reviewers use a variety of additional methods to minimise procedural bias. Citation chasing exploits connections between research articles to identify relevant records for a review by making use of explicit mentions of one article within another. Citation chasing is a popular supplementary search method because it helps to build on the work of primary research and review authors. It does so by identifying potentially relevant studies that might otherwise not be retrieved by other search methods; for example, because they did not use the review authors' search terms in the specified combinations in their titles, abstracts, or keywords. Here, we briefly provide an overview of citation chasing as a method for systematic reviews. Furthermore, given the challenges and high resource requirements associated with citation chasing, the limited application of citation chasing in otherwise rigorous systematic reviews, and the potential benefit of identifying terminologically disconnected but semantically linked research studies, we have developed and describe a free and open source tool that allows for rapid forward and backward citation chasing. We introduce citationchaser, an R package and Shiny app for conducting forward and backward citation chasing from a starting set of articles. We describe the sources of data, the backend code functionality, and the user interface provided in the Shiny app.

• Here, we introduce a free and open source R-based tool (citationchaser) for rapidly performing forward and backward citation chasing to support comprehensive systematic review searches
• We believe citationchaser could substantially improve search comprehensiveness with minimal additional time and resources

1 | BACKGROUND

| Citation chasing as a systematic search technique
Systematic searching is the process by which researchers seek out all possibly relevant research records from a variety of sources in an attempt to collate an unbiased and comprehensive set of bibliographic records. 1 Along with searching bibliographic databases using complex search strings, 2 systematic review authors employ a suite of additional methods to minimise procedural bias, 3 including but not limited to: targeted searches of organisational websites for grey literature; suggestions of articles from stakeholders; searches of institutional repositories for unpublished theses and data; and assessing records that are cited by and that cite a set of articles of known relevance (citation chasing). Citation chasing (also referred to as 'citation searching', 'citation tracking', 'snowballing', 'pearl growing', 'footnote chasing', 'reference scanning', 'reference checking', 'bibliographic checking', 'citation mining', or 'reference harvesting') in systematic reviews exploits connections between similar research articles to identify relevant records for consideration in a review by making use of explicit mentions of one article within another (citations). These connections from a single article travel back in time (a starting article references a set of previously published articles) and forwards in time (articles published after the starting article include it in their reference lists).
Citation chasing is a popular supplementary search method because it helps to build on the work of primary research and review authors. It does so by identifying potentially relevant studies that might otherwise not be retrieved by other search methods; for example, because they did not use the review authors' search terms in the specified combinations in their titles, abstracts or keywords. 4 Backward citation chasing (Figure 1) is the process of obtaining and assessing the relevance of all records cited within the bibliographies or reference lists of a set of articles (cited articles). Backward citation chasing involves the collation of a finite and definable set of bibliographic records, defined by the number of records across multiple reference lists. For example, if articles cite on average 50 articles, then backward citation chasing of 100 starting records would yield 5000 cited records (assuming minimal duplication). There is a quantifiable degree of accuracy to the obtained records, in that each included starting record has a 'true' number of references that can be obtained, and any attempt to retrieve these can be assessed for completeness. However, some databases only include references of a certain type (e.g., ignoring 'grey literature') or only those records that are already indexed in the database. This can lead to an incomplete set of references for a given article.

FIGURE 1 Schematic showing a graphical representation of backward and forward citation chasing, from a starting set of research articles
Forward citation chasing (Figure 1) involves identifying articles that have cited a particular set of articles (citing articles). This process looks forward in time, identifying citation networks in the years since the starting articles were published. Forward citation chasing is non-definable in volume, in that there is continued growth in the number of citations of the starting set. Furthermore, databases of citations vary in their estimates of the number of citing articles depending on the type of citing records included. For example, Web of Science may state that a certain record has 50 citations since it only includes records already indexed within its databases, while Google Scholar may list 75 citations, since it counts citations from any web-based document classified as 'scholarly'.
Previously cited benefits of citation chasing include:
• Even a few starting records can create a useful network 5;
• Related but terminologically disconnected (parallel) topics (i.e., those that cover the same topic but use entirely different vocabulary) can be identified that might otherwise be missed by targeted searching of bibliographic databases 6;
• It is useful in topics where the terminology is not used consistently across the evidence base. 7

In systematic reviews, the starting set of articles is often one of the following: the final list of included studies, for example, Reference 8; the set of identified search results, for example, Reference 9; a benchmark set of articles of known relevance assembled a priori by the review team, for example, Reference 10; or a list of relevant review articles, sometimes those encountered during the reviewing process, for example, Reference 11.

| Practical considerations
Prior to automated methods, backward citation chasing was often conducted manually (i.e., searching for citations in reference lists one by one) and nondigitally (i.e., based on transcribing and searching each reference one by one). Traditionally, this might be done by individually screening the reference lists from printouts of a relevant set of articles and noting relevance by hand. This is both time-consuming and very difficult to record transparently: ideally, every record assessed should be documented, which would involve transcribing or copying thousands of records into a record-keeping tool, such as a spreadsheet. However, bibliographic databases and search engines typically provide lists of references that can be exported and integrated into reference or review management tools for digital transparency.
Forward citation chasing requires the use of tools that track citations as the citing records are indexed. According to Cooper et al., 12 three resources have been most commonly used in citation chasing (both forward and backward) to date: Web of Science, Scopus and Google Scholar.
Numerous library guides exist to support researchers in conducting citation chasing (e.g., the University of Illinois Library; https://guides.library.illinois.edu/c.php?g=563215&p=3877582). These guides typically advise users to manually examine the reference lists and lists of citing articles for each record one by one.
Several factors affect the efficacy and efficiency of citation chasing:
• Currency refers to how up to date a resource is in tracking citations of its included records; a significant lag in recording a citation reduces the effectiveness of that resource for forward citation chasing over recent years and may omit important parts of the citation network. 13
• Network comprehensiveness is affected both by the proportion of reference lists for included articles that are stored as digitised citations (if the database used is selective in its indexing of journals, then indexed article reference lists will only be partially complete), and by the efficiency with which new citations from subsequently published articles are captured.
• Coverage refers to the proportion of the research literature that is indexed by a given bibliographic resource; poor coverage increases the chances that a starting record is missing from the resource.
• The functionality of bibliographic resources may not allow the user to export reference lists and lists of citing articles for more than one starting record at a time, forcing reviewers to export these lists one by one, which requires substantial time and reduces the efficiency of the overall process.

| Efficiency and time/resource cost of citation chasing
Here, we present several studies reporting on the time requirements for citation chasing. This is not intended to be a comprehensive or systematic appraisal of the literature, but rather an illustrative set of empirically quantified examples. Wright et al. 14 conducted forward citation chasing of 40 starting records using Web of Science (the precise databases were not specified), Medline via OVID, Google Scholar and Scopus on each record for comparison of citation coverage. This resulted in a total of 4161 results: Google Scholar, 1680; Scopus, 1173; Web of Science, 1095; and Medline via OVID, 213. The process yielded one relevant study that was omitted from bibliographic database searches. In addition, the authors estimated that forward citation chasing took a total of 5 days: 2 days downloading 1680 results from Google Scholar; 1 day downloading 2481 results from Web of Science (databases not specified), Scopus and Medline; and 2 days screening the records identified.
Hinde et al. 6 conducted backward and forward citation chasing from a starting set of 14 records. The authors performed four iterations of record retrieval in each direction until no new studies were identified (what they refer to as Bidirectional Citation Searching to Completion; BCSC). They used Web of Science (they did not specify which databases), screening a total of 4529 potentially relevant records and finally yielding an additional 76 relevant studies.
Levay et al. 13 performed forward citation chasing for 46 starting records using Google Scholar and Web of Science (Science Citation Index and Social Sciences Citation Index) to identify citations for each record. The authors noted that the time required for this process was 75 hours for Google Scholar and 4 hours for Web of Science, including the time to identify and deduplicate 783 records (Web of Science: 46 records; Google Scholar: 737 records).
It is clear from these examples that manual (but digital) citation chasing (i.e., copying and pasting the identifying information of individual references/citations) can be a particularly time-consuming process, despite using web-based digitised resources.

| Current citation chasing practices and best practice
The Collaboration for Environmental Evidence (CEE) states that citation chasing is frequently used but provides no guidelines or set of minimum standards for how it should be conducted. Examination of 16 recent systematic reviews (and maps) published in 2020 and 2021 (to date, September 2021) in the CEE journal Environmental Evidence shows that citation chasing has not been common in recent reviews (Table 1), with six reviews not performing any backward citation chasing at all and no review performing forward citation chasing. In five reviews, the set of starting records used in citation chasing was not provided or identifiable.
In a methodological review of 96 systematic reviews published in the journal Campbell Systematic Reviews, a total of 84 reviews included citation chasing methods. 15 The percentage of reviews including the method increased from 80% (January 2011 to September 2014) to 94% (October 2014 to February 2018). Forward citation chasing was not assessed.
Cochrane is a leading producer of systematic reviews for healthcare decision-making. Reference list checking (i.e., backward citation chasing), based on included studies or other relevant systematic reviews, is regarded as a mandatory requirement for authors conducting Cochrane reviews, under the current (2021) MECIR standards. 16 The Cochrane Handbook 17 suggests forward citation chasing as an option for reviews on complex and public health interventions.
Briscoe et al. 18 reviewed the use of citation chasing for 198 Cochrane reviews, published between November 2016 and January 2017, after excluding withdrawn and empty reviews. Of these, 172 used backward citation chasing and 18 used forward citation chasing. Backward citation chasing was conducted by manually checking reference lists of all included studies, per the MECIR standards, in all cases. Three reviews did not describe the resource used for forward citation chasing. Web of Science was used in most cases (13/15). Scopus was used twice and Google Scholar once. Only one review made use of two named sources to identify forward citations (the remaining 14 of 15 named just one source).

| Existing tools for citation chasing
Some software tools already exist that support basic forms of citation chasing, but they typically have substantial limitations. For example, bibliographic data platforms such as Web of Science allow users to extract sets of references and citing articles for a given bibliographic record, but these must be selected and exported individually (rather than for multiple starting records at once). The co-citation exploration tool CoCites (https://www.cocites.com/) allows a starting set of records to be used to identify referenced and citing records (as well as records that are frequently cited along with the starting set). CoCites uses the iCite tool from the United States National Institutes of Health and is therefore limited in scope and content.

| Objectives
Given the challenges and high resource requirements associated with citation chasing, the limited application of citation chasing in otherwise rigorous systematic reviews, and the potential benefit of identifying terminologically disconnected but semantically linked research studies, we developed a free and open source tool that allows for rapid forward and backward citation chasing.

[Table 1 near here: citation chasing practices in recent Environmental Evidence systematic reviews; 'n' corresponds to the number of starting records used for citation chasing; n/a, not applicable. One review described its backward citation chasing as follows: 'Reference sections of accepted articles and 168 relevant reviews were hand searched to evaluate relevant titles that were not found using the search strategy (see Additional file 2 for a list of relevant reviews). … we screened bibliographies of: (1) a large number of relevant reviews identified at title and abstract (84 reviews) or full-text screening (30 reviews); (2) additional relevant reviews identified from within the bibliographies of the reviews (54 reviews); and (3) included articles. We searched these reference lists of papers until the reviewer deemed that the number of relevant returns had significantly decreased'.]

| Tool framework and functionality
We set out to develop an easy-to-use tool designed to collate references (backward citation chasing) and citations (forward citation chasing) from across an entire set of starting records in a single step, thereby saving the time-consuming effort of searching for each record individually.
We also wanted to ensure that the records retrieved in both directions were provided in an easily usable way: as a Research Information Systems (RIS) bibliographic data file containing sufficient unique identifying information to allow record disambiguation without unreasonable effort. An RIS file is a standardised text-file format for bibliographic data, named after the company that developed it, Research Information Systems. This would allow users to perform automated deduplication of citation chasing results against bibliographic database search results to leave only those records missed by bibliographic searches. Such deduplication is a future aim for citationchaser.
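For illustration, a single record in an RIS file might look like the following minimal sketch (the article details and DOI here are entirely hypothetical; by convention, the TY tag opens a record with its type and ER closes it):

```
TY  - JOUR
AU  - Example, Author A.
TI  - An example article title
JO  - Example Journal
PY  - 2020
VL  - 9
SP  - 1
EP  - 10
DO  - 10.1234/example.doi
ER  - 
```

The DO (DOI) field is the key to low-effort disambiguation: records carrying a DOI can be matched exactly against database search results, without relying on fuzzy comparison of titles or author names.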
Finally, we aimed to produce open source software, with the entire code base published and freely available online to support incremental development and refinement by anyone. To do this, the backend functionality was written as an R package 19 dependent upon functions from the following packages: express 20 ; httr 21 ; jsonlite 22 ; tibble 23 and dplyr. 24 The frontend functionality was provided as a web-based Shiny app hosted on a Shiny server and depends upon the following additional packages: shiny 25 ; DT 26 ; shinycssloaders 27 ; data.table 28 ; scales 29 ; tidyr 30 ; networkD3 31 and stringr. 32

| Data sources
The tool makes use of The Lens.org bibliographic database aggregator, which collates content from across five bibliographic resources: Microsoft Academic Graph (MAG), CrossRef, CORE, PubMed and PubMed Central (https://www.lens.org/lens/search/scholar/structured). At the time of writing (January 2022), The Lens.org contains >245 million scholarly records. Microsoft Academic Graph and CrossRef have been shown to have considerable overlap with traditional resources such as Scopus and Web of Science. 33,34 The Lens.org is entirely free to use via its web-based interface, and an API (application programming interface) is provided for programmatic access to the database content. The API requires a token for personal identification, which can be requested free of charge for a 14-day trial, with extended access provided at the discretion of The Lens.org, which is a not-for-profit organisation that benefits from institutional support.
When used via R, the tool we developed requires the user to supply a token, but the web-based Shiny app has an inbuilt token that has been kindly provided free of charge by The Lens.org. This means that users of the web app have free access to the API without needing to apply for a token.
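As a rough sketch of how such programmatic access works from R, a query for a single starting DOI might look like the following. This is a minimal illustration, not the citationchaser implementation itself: the endpoint and header follow The Lens.org scholarly API documentation, but the query fields shown are illustrative and the token is a placeholder.

```r
# Minimal sketch of a Lens.org scholarly API query from R (illustrative).
# LENS_TOKEN is a placeholder: request a real token from The Lens.org.
token <- Sys.getenv("LENS_TOKEN")

# Structured query matching one starting DOI; requested fields include the
# reference list and citing works needed for citation chasing
request_body <- paste0(
  '{"query": {"match": {"external_id": "10.1038/s41559-020-01295-x"}},',
  ' "include": ["lens_id", "title", "references", "scholarly_citations"]}'
)

# Only attempt the network call if httr is installed and a token is set
if (requireNamespace("httr", quietly = TRUE) && nzchar(token)) {
  response <- httr::POST(
    "https://api.lens.org/scholarly/search",
    httr::add_headers(Authorization = paste("Bearer", token),
                      `Content-Type` = "application/json"),
    body = request_body
  )
  result_json <- httr::content(response, as = "text", encoding = "UTF-8")
}
```

The response is JSON, which citationchaser's backend parses (via jsonlite) into tabular records before encoding them as RIS.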

| Use cases
We planned for the following use cases: 1. Users wishing to assemble a set of forward and/or backward citation chasing results manually enter a set of starting records by copy-pasting/transcribing identifiers into the relevant input field, or by uploading a CSV file containing identifiers (and identifier types).

| Platform for development
We have provided the code behind citationchaser via GitHub to allow for collaborative, incremental updates, bug fixes and functionality expansions. For details on planned developments, see 'Potential and planned further developments', below. Users can flag problems and bugs that need to be resolved, or suggest future development needs, by raising a new 'Issue' (https://github.com/nealhaddaway/citationchaser/issues), creating a public discussion about the topic. Users can also make a copy of the code repository (a 'fork'), modify the code to suggest an improvement or extension of functionality, and propose that this be integrated into the main code base (a 'pull request'), becoming a co-developer on the project, if desired.

| Citationchaser functionality
The free-to-use, web-based citationchaser app is available here: https://estech.shinyapps.io/citationchaser. The app provides detailed instructions on how to use the tool on the landing page (Figure 2), including instructions for developers on how to structure referral URLs to pre-populate the article input field with identifiers.
Users can directly paste identifiers into the article input fields in the 'Article input' tab (Figure 3). Citationchaser accepts digital object identifiers (DOIs), COREIDs (the CORE repository identifiers), MAGIDs (Microsoft Academic Graph identifiers), PMIDs (PubMed identifiers), and PMCIDs (PubMed Central identifiers). A recent update allows multiple IDs to be searched concurrently by pasting them into the relevant box on the 'Article input' tab.
As a first step, users must confirm that the records they entered have been correctly identified in The Lens.org database. The table of retrieved starting records is downloadable as an RIS file.
Once this is completed, users can then proceed to the references (backward citation chasing) and/or citations (forward citation chasing) tabs to retrieve all of the corresponding records obtainable. For each, a textual summary is provided below the action button summarising the number of records obtained, the set of unique (i.e., deduplicated) records retrieved from citation chasing, and the number of records shown in the table and downloadable as an RIS file.
In an additional step, users can then visualise their citation network using the 'Network' tab (Figure 4), which provides an interactive visualisation of which retrieved articles are cited by (red dots) and cite (blue dots) the starting articles (black dots). Users can zoom in and out, and by clicking on an article, they can navigate to the relevant record in The Lens.org in a new browser tab.
Developers wishing to direct users from a reference/review management tool or other evidence synthesis technology can refer directly to citationchaser by including a list of identifiers in the suffix of the app URL using the following syntax: https://estech.shinyapps.io/citationchaser/?dois=[doiX]&pmids=[pmidX], where [doiX/pmidX] is replaced with a string of identification codes (e.g., https://estech.shinyapps.io/citationchaser/?dois=10.1038/s41559-020-01295-x,10.1371/journal.pone.0138237&pmids=32652585,32706299). Other identifiers can also be used together for a simultaneous search across multiple identifier types: PubMed IDs ('pmids='); PMC IDs ('pmcids='); CORE IDs ('coreids='); and Microsoft Academic IDs ('magids='). Each new set of identifiers should be prefixed with '&'. The resultant landing page from a referral URL is the set of retrieved starting articles. Users can then verify these records, retrieve backward and forward citation chasing results, and finally visualise their network if so desired.
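A referral URL of this form can be generated programmatically; a minimal base-R sketch, using the example identifiers above, is:

```r
# Build a citationchaser referral URL from vectors of identifiers,
# using the parameter names described above ('dois=', 'pmids=')
dois  <- c("10.1038/s41559-020-01295-x", "10.1371/journal.pone.0138237")
pmids <- c("32652585", "32706299")

referral_url <- paste0(
  "https://estech.shinyapps.io/citationchaser/",
  "?dois=",  paste(dois,  collapse = ","),
  "&pmids=", paste(pmids, collapse = ",")
)
```

Note that, in production code, identifiers containing reserved URL characters may need percent-encoding (e.g., with utils::URLencode()).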

| R package (and local Shiny app)
The backend functionality behind citationchaser that checks records with The Lens.org API and formulates the RIS outputs is also provided as an R package (see the GitHub repository; https://github.com/nealhaddaway/citationchaser). This R package has the same core functionality as the citationchaser web-based Shiny app, but it also provides the search results as a dataframe converted from the complete Lens.org API query response. This provides far more fields than are encoded into the RIS files (e.g., author identifiers and affiliations; for full details, see https://docs.api.lens.org/response-scholar.html). Users wishing to retain all citing and referenced article identifiers before internal deduplication should make use of the R package, which provides this information in the dataframe outputs.
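A typical session with the R package might look like the following sketch. The function name and arguments here follow the package's GitHub README at the time of writing and may change between versions; the token is a placeholder that must be replaced with one issued by The Lens.org.

```r
# Sketch of backward and forward citation chasing with the citationchaser
# R package (function and argument names follow the README; check the
# GitHub repository for the current interface).
token <- Sys.getenv("LENS_TOKEN")  # placeholder personal API token

if (requireNamespace("citationchaser", quietly = TRUE) && nzchar(token)) {
  # Backward chasing: records referenced by the starting article(s)
  refs <- citationchaser::get_refs(
    article_identifier = "10.1038/s41559-020-01295-x",
    type        = "doi",
    get_records = "references",
    token       = token)

  # Forward chasing: records that cite the starting article(s)
  cits <- citationchaser::get_refs(
    article_identifier = "10.1038/s41559-020-01295-x",
    type        = "doi",
    get_records = "citations",
    token       = token)
}
```

Both calls return dataframes containing the full set of Lens.org fields, which can then be written out as RIS files or processed further in R.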
The R package also requires the user to have obtained an API token from The Lens.org (https://www.lens.org/lens/user/subscriptions#scholar). The GitHub version of the R package also includes a locally deployable Shiny app, providing a user interface that is perhaps easier to use than running the code in the R console. This Shiny app will not appear in any version of the R package archived on the R package repository CRAN (https://cran.r-project.org/) but will remain accessible via the GitHub repository.

| Integrating citationchaser into systematic searching
We believe that citationchaser can fit well into existing common practice in systematic searching (and systematic reviews) in the following pipeline of activities: 1. Users identify a set of relevant articles they wish to use as starting records for forward and/or backward citation chasing to supplement bibliographic database searching (e.g., this may be a final set of included articles in a review, a list of reviews encountered during screening, or a list of articles of known relevance submitted by experts and stakeholders). 2. Unique identifiers (e.g., DOIs) for these articles are entered as comma-separated lists into the relevant fields of the 'Article input' tab of citationchaser.

| Allow multiple levels of citation chasing
It is implicitly accepted that most citation chasing goes only one level deep (articles that cite or are cited by a starting record). However, as demonstrated by one methods paper, 6 by going multiple levels forward and backward, and making use of automated deduplication within citation chasing results and against bibliographic search results, a comprehensive body of potentially missed records could be obtained. At present, citationchaser goes only one level deep, but this could be expanded. Prior to deduplication, the body of potentially relevant records for backward citation chasing multiplies at each level by the average number of references per paper (perhaps around 50); for forward citation chasing the multiplier depends on the field and age of the articles, but could be substantially larger. At present, this may be too computationally complex, but this is likely to change in the coming years. It therefore becomes increasingly important to build efficiency into citationchaser, and to make use of efficient deduplication and screening tools (e.g., machine learning screening algorithms) for each additional level searched. In addition, building in analysis of co-citation and other indirect citation chasing options could help to deal with the exponential increase in the volume of records beyond one level deep. Other tools exist for this independent co-citation analysis, but as yet no tool allows for it to be integrated directly into citation chasing.
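To make the scale of that growth concrete, a back-of-the-envelope calculation (using the illustrative figures from the Background: 100 starting records and roughly 50 references per article) shows how quickly pre-deduplication volumes become unmanageable:

```r
# Pre-deduplication record volume for multi-level backward citation
# chasing, assuming 100 starting records and ~50 references per article
starting_records <- 100
refs_per_article <- 50
levels <- 1:3

volume <- starting_records * refs_per_article ^ levels
# one level: 5000 records; two levels: 250,000; three levels: 12,500,000
```

Even at two levels deep, the volume of records already exceeds what most review teams could screen manually, underlining the need for automated deduplication and machine-assisted screening.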

| Upload an RIS file
In addition, it would be useful to allow users to upload an RIS file that could be used as a basis for extracting identifiers for the starting set. This has not been prioritised in citationchaser because there is an on-going project to fill gaps in RIS and other bibliographic files, and a core function of this package (bibfix; https://www. eshackathon.org/software/bibfix.html) will be to obtain record identifiers for those lacking them. This would drastically increase the efficiency of an RIS upload function in citationchaser, and so this functionality will be developed once bibfix is available.

| Deduplicate against an RIS file
The process of deduplicating a large set of bibliographic (and other) search results against the citation chasing output RIS files is somewhat complex, since it is ONLY the unique citation chasing results not found in the larger set that should be retained (see Figure 5). Traditional deduplication tools would only remove the overlap, retaining all unique records, so users must go one step further to tag and remove all records from the larger comparison set (the bibliographic search results in Figure 5). We hope to be able to provide a powerful deduplication tool that would automate this process with a high degree of accuracy, integrating tools like CrossRef for better record disambiguation where non-exact duplicates exist. Microsoft has recently announced that it will be retiring MAG from the end of 2021, but several organisations have begun to provide replacements aiming for minimal disruption (e.g., OpenAlex; https://blog.ourresearch.org/openalexupdate-june/). It is anticipated that this will not have any effect on backward citation chasing, although the comprehensiveness of the database and the completeness of forward citation chasing from the end of 2021 onwards may be temporarily affected. The Lens.org already has plans in place to supplement current data sources until a replacement is integrated (most likely OpenAlex; https://about.lens.org/the-lens-scholarly-metarecordstrategy-beyond-microsoft-academic-graph/). In the future, citationchaser could integrate other APIs that support bulk exports as they become available, should they not already be integrated into The Lens.org as an internal replacement for MAG. Building in source redundancy would reduce the risk of problems in data supply from any one source.
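The asymmetric deduplication described above can be illustrated with a minimal base-R sketch, matching on DOI (the DOIs below are hypothetical placeholders); a production tool would additionally need fuzzy matching on titles and other fields where identifiers are missing:

```r
# Asymmetric deduplication: keep ONLY the citation chasing records that
# do not appear in the bibliographic search results (matching on DOI).
bibliographic_dois <- c("10.1000/a", "10.1000/b", "10.1000/c")
chasing_dois       <- c("10.1000/b", "10.1000/c", "10.1000/d", "10.1000/e")

# setdiff() returns the elements of chasing_dois that are absent from
# bibliographic_dois, i.e., only the records missed by database searches
new_records <- setdiff(chasing_dois, bibliographic_dois)
```

Note the asymmetry: a conventional deduplication of the combined set would retain all five unique records, whereas here only the two records unique to the citation chasing output should survive.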

| CONCLUSIONS
Citation chasing is a useful supplementary source of information in systematic searches inherent to rigorous evidence syntheses. 3 However, currently there is little evidence that it is being used in rigorous systematic reviews conducted according to the standards of leading evidence synthesis coordinating bodies like Cochrane. Given the current systems and frameworks for backward and forward citation chasing, recommendations to integrate comprehensive citation chasing into robust systematic reviews (e.g., via review guidance published by the Collaboration for Environmental Evidence and the Campbell Collaboration) are likely to be ignored due to the method's time demands.
Here, we present a free-to-use, open source tool for conducting rapid backward and forward citation chasing. Furthermore, we show that it produces standardised output files that can be readily and efficiently merged with bibliographic database search results to minimise duplication while maximising the comprehensiveness of the body of potentially relevant records that can be screened within an evidence synthesis.
Relative to reviewers using fully manual, nondigitised citation chasing, citationchaser could present a significant time saving. We believe that the ease-of-use and efficiency of this tool will facilitate the integration of comprehensive citation chasing as a supplementary method in future evidence syntheses. We hope that evidence synthesis coordinating bodies (e.g., Cochrane, the Campbell Collaboration and the Collaboration for Environmental Evidence) will recommend the use of such open source technologies to improve and standardise rigorous information retrieval methods.

FIGURE 5 Schematic showing the body of results required from citation chasing deduplication against bibliographic search results in a systematic search