cffr: Create a CITATION.cff File for your R Package

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

Hex sticker for the cffr package. White background with a blue outline and blue text reading 'cffr' above a blue network diagram in the shape of a brain

A new R package, cffr, has been developed, peer-reviewed by rOpenSci and accepted by CRAN. This package has a single purpose: to create a valid CITATION.cff file using the metadata of any R package.

CITATION.cff files and why they matter

A Citation File Format (CFF) file is a plain text file with human- and machine-readable citation information for software (and datasets) [1].

Under the hood, a CFF file is a YAML file. YAML has the advantage of being easily understood by any user, and can also be easily converted to another data serialization language, such as JSON or XML. This is an example of the minimal content of a valid CITATION.cff file:

cff-version: 1.2.0
message: 'To cite package "cffr" in publications use:'
title: 'cffr: Generate Citation File Format (''cff'') Metadata for R Packages'
authors:
- family-names: Hernangómez
  given-names: Diego

In this example, identifying the software and its author is straightforward, as this information is provided by the fields title and authors. A CFF file can be further enriched with additional fields (like version, date-released or doi), as the Citation File Format schema version 1.2.0 accepts 21 different keys.
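As a rough illustration of the conversion mentioned above, the following sketch parses the minimal CITATION.cff content and re-serializes it as JSON. It assumes the yaml and jsonlite packages are installed; neither is used directly in this post, and the variable names are illustrative only.

```r
# Sketch only: assumes the 'yaml' and 'jsonlite' packages are installed.
library(yaml)
library(jsonlite)

cff_text <- "
cff-version: 1.2.0
message: 'To cite package \"cffr\" in publications use:'
title: 'cffr: Generate Citation File Format (''cff'') Metadata for R Packages'
authors:
- family-names: Hernangómez
  given-names: Diego
"

# Parse the YAML text into a nested R list
cff_list <- yaml.load(cff_text)

# The four keys of the minimal example
names(cff_list)

# Re-serialize the same content as JSON
cff_json <- toJSON(cff_list, auto_unbox = TRUE, pretty = TRUE)
cat(cff_json, "\n")
```

Because the parsed result is an ordinary nested list, the same content can be inspected or modified with regular R tools before being written back out.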

Why do CFF files matter?

Citing a book, an article, or a thesis is not difficult. The title, authors and publication date are easily identifiable in most cases. However, software is rarely cited in research projects. One of the reasons is “the lack of a clear citation information from package developers”, as already mentioned in a previous post (Make Your R Package Easier to Cite). Developers spend thousands of hours developing new and exciting software or adding new features to existing packages, so citing software is a matter of giving credit where credit is due. For more reasons why it is important to cite R software, see Steffi LaZerte’s blog post How to Cite R and R Packages.

In July 2021, GitHub announced a built-in citation feature that enables any software user to cite any repository in APA or BibTeX style.

This built-in feature heavily relies on the CFF format, rendering the information of the CITATION.cff file into the aforementioned styles.

This announcement was, in my very personal opinion, a game-changer for the software citation ecosystem. As proof of that, within the following two days Zenodo and Zotero announced support for CITATION.cff files in their GitHub integration:

Integration with Zenodo means that when creating a Digital Object Identifier (DOI) for a GitHub repository via Zenodo, the DOI record is generated from the metadata included in the CITATION.cff file of the repository. This feature saves developers the extra effort of keeping the DOI record and the software metadata consistent. See, for example, the DOI of cffr, whose title, description and author have been gathered from the cffr CITATION.cff file.

In the case of Zotero (reference management software), the information in the CITATION.cff file of any repository is detected when adding that repository to a library from a URL.

This handy feature means that a GitHub repository can effectively behave as a DOI, ISBN or arXiv ID. Zotero will recognize the author, title, date and other relevant metadata of the repository, including it in the personal Zotero reference library of the user.

And there is still more! The CiteAs service also supports CFF files [2], and in the future more platforms such as JabRef or GitLab [3] may add support for CITATION.cff files (and why not CRAN or Bioconductor?).

Other software citation projects

The CodeMeta Project [4] creates a concept vocabulary that can be used to standardize the exchange of software metadata across repositories and organizations. One of the many uses of a codemeta.json file is to provide citation metadata such as title, authors, publication year or version. The codemetar package [5] allows you to generate codemeta.json files from R package metadata.

Using cffr

Getting started with cffr is pretty easy. There is a main function (likely the only one you would need for an in-development package) named cff_write() that extracts the metadata of your package (already included in your DESCRIPTION and inst/CITATION files), converts it into a CITATION.cff file and validates it against the latest CFF validation schema using jsonvalidate [6]:

library(cffr)
# For in-development packages
cff_write()

#>
#> CITATION.cff generated
#>
#> cff_validate results-----
#> Congratulations! This .cff file is valid

Working with cff objects

It is also possible to create a cff object (a regular R list with a custom printing method) for any package installed locally on your machine. In the next example I create a cff object for the rtweet [7] package:

library(cffr)
cff_rtweet <- cff_create("rtweet")
cff_rtweet

#> cff-version: 1.2.0
#> message: 'To cite package "rtweet" in publications use:'
#> type: software
#> license: MIT
#> title: 'rtweet: Collecting Twitter Data'
#> version: 0.7.0
#> doi: 10.21105/joss.01829
#> abstract: 'An implementation of calls designed to collect and organize Twitter data
#>   via Twitter''s REST and stream Application Program Interfaces (API), which can be
#>   found at the following URL: <https://developer.twitter.com/en/docs>. This package
#>   has been peer-reviewed by rOpenSci (v. 0.6.9).'
#> authors:
#> - family-names: Kearney
#>   given-names: Michael W.
#>   email: [email protected]
#>   orcid: https://orcid.org/0000-0002-0730-4694
#> preferred-citation:
#>   type: article
#>   title: 'rtweet: Collecting and analyzing Twitter data'
#>   authors:
#>   - family-names: Kearney
#>     given-names: Michael W.
#>   year: '2019'
#>   journal: Journal of Open Source Software
#>   volume: '4'
#>   number: '42'
#>   pages: '1829'
#>   doi: 10.21105/joss.01829
#>   url: https://joss.theoj.org/papers/10.21105/joss.01829
#> repository: https://CRAN.R-project.org/package=rtweet
#> repository-code: https://github.com/ropensci/rtweet
#> url: https://CRAN.R-project.org/package=rtweet
#> date-released: '2020-01-08'
#> contact:
#> - family-names: Kearney
#>   given-names: Michael W.
#>   email: [email protected]
#>   orcid: https://orcid.org/0000-0002-0730-4694
#> keywords:
#> - r
#> - twitter

Note the special field, preferred-citation, that would be used to generate the citation information on GitHub. If this field is not present, GitHub would reuse other keys in the file to auto-generate a citation reference.

As already mentioned, cffr uses information from the DESCRIPTION (via the desc [8] package) and the inst/CITATION file to extract the metadata. I will focus now on comparing the citation info from rtweet and the information generated by cffr:

toBibtex(citation("rtweet"))

#> @Article{rtweet-package,
#>   title = {rtweet: Collecting and analyzing Twitter data},
#>   author = {Michael W. Kearney},
#>   year = {2019},
#>   note = {R package version 0.7.0},
#>   journal = {Journal of Open Source Software},
#>   volume = {4},
#>   number = {42},
#>   pages = {1829},
#>   doi = {10.21105/joss.01829},
#>   url = {https://joss.theoj.org/papers/10.21105/joss.01829},
#> }

cff_rtweet$`preferred-citation`

#> type: article
#> title: 'rtweet: Collecting and analyzing Twitter data'
#> authors:
#> - family-names: Kearney
#>   given-names: Michael W.
#> year: '2019'
#> journal: Journal of Open Source Software
#> volume: '4'
#> number: '42'
#> pages: '1829'
#> doi: 10.21105/joss.01829
#> url: https://joss.theoj.org/papers/10.21105/joss.01829

We can check that the core information of the rtweet citation has been included in the cff object, and we can also check fields included in the DESCRIPTION file of rtweet:

packageDescription("rtweet",
  fields = c(
    "Title", "Description", "Author", "Version", "URL"
  )
)

#> Title: Collecting Twitter Data
#> Description: An implementation of calls designed to collect and organize Twitter
#> data via Twitter's REST and stream Application Program Interfaces
#> (API), which can be found at the following URL:
#> <https://developer.twitter.com/en/docs>. This package has been
#> peer-reviewed by rOpenSci (v. 0.6.9).
#> Author: Michael W. Kearney [aut, cre] (<https://orcid.org/0000-0002-0730-4694>),
#> Andrew Heiss [rev] (<https://orcid.org/0000-0002-3948-3914>), Francois
#> Briatte [rev]
#> Version: 0.7.0
#> URL: https://CRAN.R-project.org/package=rtweet
#>
#> -- File: C:/Users/diego/Documents/R/win-library/4.1/rtweet/Meta/package.rds
#> -- Fields read: Title, Description, Author, Version, URL

In the next chunk I compare it with the corresponding fields from the cff object:

as.cff(cff_rtweet[
  c("title", "abstract", "authors", "version", "url")
])

#> title: 'rtweet: Collecting Twitter Data'
#> abstract: 'An implementation of calls designed to collect and organize Twitter data
#>   via Twitter''s REST and stream Application Program Interfaces (API), which can be
#>   found at the following URL: <https://developer.twitter.com/en/docs>. This package
#>   has been peer-reviewed by rOpenSci (v. 0.6.9).'
#> authors:
#> - family-names: Kearney
#>   given-names: Michael W.
#>   email: [email protected]
#>   orcid: https://orcid.org/0000-0002-0730-4694
#> version: 0.7.0
#> url: https://CRAN.R-project.org/package=rtweet
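The correspondence compared above can be mimicked in a few lines of base R. This is a simplified sketch and not cffr's actual implementation (cffr relies on the desc package and handles many more fields). The helper name desc_to_cff_fields() is hypothetical, and the sample DESCRIPTION text is an abbreviated stand-in for the real rtweet file.

```r
# Hypothetical helper: maps a few DESCRIPTION fields to CFF-style keys.
# Not cffr's real code; it only mirrors the correspondence shown above.
desc_to_cff_fields <- function(desc_lines) {
  con <- textConnection(desc_lines)
  on.exit(close(con))
  # read.dcf() parses the "Field: value" format used by DESCRIPTION files
  d <- read.dcf(con)[1, ]

  list(
    # CFF title = "<Package>: <Title>"
    title = paste0(d[["Package"]], ": ", d[["Title"]]),
    # CFF abstract = Description, with continuation lines collapsed
    abstract = gsub("\\s+", " ", d[["Description"]]),
    version = d[["Version"]],
    url = d[["URL"]]
  )
}

# Abbreviated sample, for illustration only
sample_desc <- c(
  "Package: rtweet",
  "Title: Collecting Twitter Data",
  "Version: 0.7.0",
  "Description: An implementation of calls designed to collect and",
  "    organize Twitter data.",
  "URL: https://CRAN.R-project.org/package=rtweet"
)

str(desc_to_cff_fields(sample_desc))
```

Author information is deliberately left out of this sketch, since splitting names into family-names and given-names and attaching ORCID identifiers requires more careful parsing.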

Valid keys

Here is a list of all the valid keys of the CFF schema. Most of them have an explicit mapping with the fields (or a combination of fields) in the DESCRIPTION and inst/CITATION files:

abstract         identifiers          repository
authors          keywords             repository-artifact
cff-version      license              repository-code
commit           license-url          title
contact          message              type
date-released    preferred-citation   url
doi              references           version

The cffr package also includes an extensive vignette describing how these fields are computed with several examples.
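When preparing a list of custom keys, it can help to check the names against the list above before handing them to cffr. The following base R sketch does that; the helper unknown_keys() and the sample list are hypothetical and not part of cffr.

```r
# The 21 top-level keys listed above
valid_keys <- c(
  "abstract", "authors", "cff-version", "commit", "contact",
  "date-released", "doi", "identifiers", "keywords", "license",
  "license-url", "message", "preferred-citation", "references",
  "repository", "repository-artifact", "repository-code",
  "title", "type", "url", "version"
)

# Hypothetical helper: names in 'x' that are not valid top-level keys
unknown_keys <- function(x) setdiff(names(x), valid_keys)

candidate <- list(
  title = "mypkg: An Example Package",
  version = "0.1.0",
  year = "2021" # 'year' is not in the list above, so it is reported
)

unknown_keys(candidate)
```

This only checks key names. Whether the values are acceptable is a separate question, which is what schema validation addresses.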

Validating a cff object

Once we have created a cff object, we can check its validity using the cff_validate() function. This function can be used with cff objects and with CITATION.cff files. If there are any errors, output messages will help us debug our object:

cff_validate(cff_rtweet)

#>
#> cff_validate results-----

#> Congratulations! This cff object is valid

# Creating a CITATION.cff file from a cff object and validating it
# The pipe operator (%>%) is provided by magrittr
library(magrittr)

cff_rtweet %>%
  # Write it to a tempfile
  cff_write(tempfile("CITATION", fileext = ".cff"),
    verbose = FALSE,
    validate = FALSE
  ) %>%
  cff_validate()

#>
#> cff_validate results-----
#> Congratulations! This cff object is valid

# Create a deliberate error and use the validator
# Override the defaults with the keys parameter
wrong_keys <- list(
  url = "I am not an url",
  doi = "I am not a doi"
)
cff_create("rtweet", keys = wrong_keys) %>%
  cff_validate()

#>
#> cff_validate results-----

#> Oops! This cff object has the following errors:

#>      field                           message
#> 1 data.doi referenced schema does not match
#> 2 data.url referenced schema does not match

Validation of the initial cff object is satisfactory, as seen in the messages. But in the last example, where I forced some invalid values using the keys parameter, the doi and url fields are flagged as errors, because the text strings do not match the expected patterns for those fields (e.g., “http*” for URLs and “10.XXXX/XXXX” for DOIs).
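To make the idea of pattern matching concrete, here is a small base R sketch with two simplified checks. The regular expressions are stand-ins written for this example and are looser than the ones in the official CFF schema; the helper names looks_like_url() and looks_like_doi() are hypothetical.

```r
# Simplified stand-ins for the schema patterns; the official CFF schema
# uses stricter regular expressions than these.
looks_like_url <- function(x) grepl("^https?://[^[:space:]]+$", x)
looks_like_doi <- function(x) grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x)

# One valid and one invalid value for each field
looks_like_url(c("https://CRAN.R-project.org/package=rtweet", "I am not an url"))
looks_like_doi(c("10.21105/joss.01829", "I am not a doi"))
```

In practice cff_validate() remains the reference check, since it validates the whole object against the schema and not just two fields.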

Keeping your CITATION.cff file up-to-date

A CITATION.cff file includes relevant information about the version, the release date and the DOI of your package, so you will want to keep this information up-to-date. cffr includes a GitHub Action that does the work for you.

It can be installed in your repo with the cff_gha_update() function or copied manually to your .github/workflows folder. It updates your CITATION.cff file on the following events:

  • When you publish a new release of the package on your GitHub repo.

  • Each time that you modify your DESCRIPTION or inst/CITATION files.

  • Additionally, the action can also be run manually.

This will ensure that the citation of your package is always accurate.
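To illustrate what "out of date" means here, the following base R sketch compares the version recorded in DESCRIPTION with the one in CITATION.cff. It is not what the GitHub Action does internally (the action regenerates the file); the helper cff_is_stale() and the sample files are hypothetical.

```r
# Hypothetical helper: TRUE when the version recorded in CITATION.cff
# is missing or differs from the one in DESCRIPTION.
cff_is_stale <- function(desc_path = "DESCRIPTION", cff_path = "CITATION.cff") {
  desc_version <- unname(read.dcf(desc_path, fields = "Version")[1, 1])

  cff_lines <- readLines(cff_path, warn = FALSE)
  # Top-level key only: nested 'version' keys are indented
  version_line <- grep("^version:", cff_lines, value = TRUE)
  if (length(version_line) == 0) {
    return(TRUE)
  }
  # Drop the key and any surrounding quotes
  cff_version <- trimws(gsub("^version:\\s*|['\"]", "", version_line[1]))

  !identical(cff_version, desc_version)
}

# Usage with throwaway files in a temporary directory
tmp <- tempfile()
dir.create(tmp)
desc_file <- file.path(tmp, "DESCRIPTION")
cff_file <- file.path(tmp, "CITATION.cff")
writeLines(c("Package: mypkg", "Version: 0.2.0"), desc_file)
writeLines(c("cff-version: 1.2.0", "version: 0.1.0"), cff_file)

cff_is_stale(desc_file, cff_file)
```

A check like this only detects version drift. Changes to authors, the DOI or the inst/CITATION file would go unnoticed, which is one reason to let the action regenerate the whole file.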

Conclusion

Over the last few months, support for CITATION.cff files has grown steadily in the scientific citation ecosystem. The cffr package allows any R package developer to easily integrate citation information with a wide variety of services by creating a CITATION.cff file, leveraging the support introduced by GitHub.

Acknowledgments

I would like to thank Carl Boettiger, Maëlle Salmon and the rest of the contributors to the codemetar package. This package was the primary inspiration for developing cffr and shares a common goal of increasing awareness of the efforts of software developers.

I would also like to thank João Martins and Scott Chamberlain for thorough reviews, which helped improve the package and the documentation, as well as Emily Riederer for handling the review process.


  1. Druskat, S., Spaaks, J. H., Chue Hong, N., Haines, R., Baker, J., Bliven, S., Willighagen, E., Pérez-Suárez, D., & Konovalov, A. (2021). Citation File Format (Version 1.2.0) [Computer software]. https://doi.org/10.5281/zenodo.5171937

  2. Du, C., Cohoon, J., Priem, J., Piwowar, H., Meyer, C., & Howison, J. (2021, October 23). CiteAs: Better Software through Sociotechnical Change for Better Software Citation. Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing. ACM. http://doi.org/10.1145/3462204.3482889

  3. Druskat, Stephan. (2021, September 27). Making software citation easi(er) – The Citation File Format and its integrations. Zenodo. https://doi.org/10.5281/zenodo.5529914

  4. Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Daniel S. Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. https://doi.org/10.5063/schema/codemeta-2.0

  5. Carl Boettiger and Maëlle Salmon (2021). codemetar: Generate ‘CodeMeta’ Metadata for R Packages. https://github.com/ropensci/codemetar, https://docs.ropensci.org/codemetar/

  6. Rich FitzJohn, Rob Ashton, Mathias Buus and Evgeny Poberezkin (2021). jsonvalidate: Validate ‘JSON’ Schema. R package version 1.3.2. https://CRAN.R-project.org/package=jsonvalidate

  7. Kearney, M. (2019, October 24). rtweet: Collecting and analyzing Twitter data. Journal of Open Source Software. The Open Journal. http://doi.org/10.21105/joss.01829

  8. Gábor Csárdi, Kirill Müller and Jim Hester (2021). desc: Manipulate DESCRIPTION Files. R package version 1.4.0. https://CRAN.R-project.org/package=desc
