1 Syllabus

Please see this webpage for more details on the Syllabus. In addition, the presentation slides can be downloaded here.

2 Chapter 1

2.1 Introduction

In this chapter, we will introduce the bioinformatic tools used to write and disseminate reproducible reports as implemented in RStudio (RStudio Team, 2020). More specifically, we will learn procedures to link and execute data and code in a unified environment (see Figure 2.1). This chapter focuses on learning the R Markdown syntax and protocols that allow you to include text, code, figures, tables and a bibliography in a document. This document is then compiled into an output file (in PDF, HTML or Word format), allowing you to share your research. More specifically, this tutorial provides students with the minimum knowledge required to complete their bioinformatic tutorials (PART 2) and individual projects (PART 3). The chapter is subdivided into five parts as follows:

  • PART A: Learning the basics.
  • PART B: Tables, Figures and References.
  • PART C: Advanced R Markdown settings.
  • PART D: User Defined Functions in R.
  • PART E: Interactive tutorials.

To make the best use of class time, we will not cover Part E in class, but students are encouraged to study this material on their own. This material could provide examples and ideas for your assignments.

Figure 2.1: The spectrum of reproducibility.

2.1.1 Learning outcomes

Learning outcomes associated with this chapter are provided at the top of each part.

2.1.2 Files supporting this tutorial

2.1.2.1 GitHub

The final files associated with Chapter 1 are all deposited on the instructor’s GitHub account at this URL:

2.1.2.2 Google Drive

Although less used in PART A, a set of files is provided to support teaching of the material presented in this chapter. These files are deposited in the shared Google Drive under this path:

  • Reproducible_Science/Chapters/Chapter_1/Tutorial_files

If you have difficulties accessing the Google Drive, click here.

Files are as follows:

  • EEB603_Syllabus_BUERKI.Rmd: This is the .Rmd file used to compile the syllabus of this class. This file provides a good source of information on the syntax and protocols described in this tutorial.
  • Bibliography_Reproducible_Science_2.bib: This file contains the references cited, in BibTeX format.
  • AmJBot.csl: This citation style language (CSL) file allows formatting citations and the bibliography following the citation style of the American Journal of Botany.
  • Bioinformatic workflow_PART2.pdf: A PDF file containing the bioinformatic workflow taught in this class. This file will be used to learn how to incorporate a figure into an R Markdown file.

2.1.3 Install RMarkdown software

The software and packages required to complete this tutorial are detailed below. Students should install this software and these packages on their personal computers to be able to complete this course. Additional packages might need to be installed; the instructor will provide guidance on how to install those as part of the forthcoming tutorials.

  • R: https://www.r-project.org
  • R packages: bookdown, knitr and rmarkdown. Use the following R command to install these packages:
install.packages(c("bookdown", "knitr", "rmarkdown"))
  • RStudio: https://www.rstudio.com/products/rstudio/download/
  • TeX: This software is required to compile documents into PDF format. Please install MiKTeX on Windows, MacTeX on OS X and TeX Live on Linux. For this class, you are not required to install this software, which takes significant hard drive space and is harder to operate on Windows OS.

NOTE: The instructor is using the following version of RStudio: Version 2022.12.0+353 (2022.12.0+353). If your computer is experiencing issues running the latest version of the software, you can install previous versions here.

2.1.4 RStudio

RStudio (RStudio Team, 2020) is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard RGUI, but it is considerably more user friendly. It has more drop-down menus, windows with multiple tabs, and many customization options (see Figure 2.2). Detailed information on using RStudio can be found at RStudio’s website.

Figure 2.2: Snapshot of the RStudio environment showing the four windows and their content.

2.1.5 Web resources

Please find below URLs to webpages providing key information for Chapter 1:

2.2 PART A: Learning the basics

In this part, we will survey the procedures to create and render (or knit) your first R Markdown document.

2.2.1 Learning outcomes

This tutorial is devoted to part A of chapter 1 and provides students with opportunities to learn procedures to:

  • Create and render (or knit) your first R Markdown document.
  • Learn the following basic R Markdown syntax and protocols:
    • Headers
    • Inside text commenting
    • Lists
    • Italicize and bold words
    • Create hyperlinks
    • Include code chunks and inline code
    • Check spelling

2.2.2 Introduction to R Markdown

Markdown is a simple formatting syntax language used for authoring HTML, PDF, and MS Word documents, which is implemented in the rmarkdown package. An R Markdown document is usually subdivided into three sections (see Figure 2.3):

  1. YAML metadata section: This section provides high level information about the output format of the R Markdown file. Information stored in this section will be used by the Pandoc program to format the output document (see Figure 2.3).
  2. Publication core text: This section represents the core of your document/publication and uses Markdown syntax.
  3. Code chunk: This section allows you to import and analyze data as well as produce figures and tables that will be directly displayed in the output file.

Figure 2.3: Example of an R Markdown file showing the three major sections.

2.2.3 Creating an R Markdown file

Figure 2.4: Snapshot of window to create an R Markdown file.

To create an R Markdown document execute the following steps in RStudio:

  1. Select: File -> New File -> R Markdown...
  2. Provide a title for the document and define the Default output format (see Figure 2.4). If you want to knit your document into PDF format, a version of the TeX program has to be installed on your computer (see Figure 2.4).
  3. Save the .Rmd document (using File -> Save As...). Save this file in a new folder devoted to the project (Warning: Knitting the document will generate several files).

2.2.4 Rendering (or knitting) an R Markdown document

To render (or knit) your R Markdown document/script into the format specified in the YAML metadata section, complete the following steps in RStudio:

  1. Select the Knit button (Figure 2.3) in the upper bar of your window to render the document.
  2. There are several options depending on the output format; however, if you just push the button, it will automatically knit the document following the settings provided in the YAML metadata section (Figure 2.3).
  3. The output file will automatically be created in the same directory as the .Rmd file. You can track progress in the R Markdown console. If the knitting fails, error messages will be printed in the R Markdown console (often including the line of the script where the error occurred, although this is not always the case). Error messages are very useful for debugging your R Markdown document.

2.2.5 How does the knitting process work?

When you knit your document, R Markdown will feed the .Rmd file to the R knitr package, which executes all of the code chunks and creates a new markdown (.md) document. This latter document includes the code and its output (Figure 2.5). The markdown file generated by knitr is then processed by the Pandoc program, which is responsible for creating the finished format (Figure 2.5).
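Although you will typically use the Knit button (see below), the same pipeline can be triggered from the R console with rmarkdown::render(). A minimal sketch, assuming a file named my_report.Rmd sits in the working directory (the file name is illustrative):

# Knit an R Markdown file from the R console
# (this runs knitr and then Pandoc, as described above)
rmarkdown::render("my_report.Rmd")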

Figure 2.5: R Markdown flow.

2.2.6 Basic R Markdown syntax and protocols

We will focus here on learning the syntax and protocols to produce:

  • Headers.
  • Inside text commenting.
  • Lists.
  • Italicize and bold words.
  • Embed code chunks and inline code.
  • Check spelling.

More syntax is available in the R Markdown Reference Guide. You can access this document as follows in RStudio:

  • Select: Help -> Cheatsheets -> R Markdown Reference Guide

Notice: The Cheatsheets section also allows accessing additional supporting documents related to R Markdown and data manipulation. These documents will be very useful for this class.

2.2.7 Headers

Please find below the syntax to create headers (3 levels):

Syntax:

The "#" refers to the level of the header
# Header 1  
## Header 2  
### Header 3  

2.2.8 Inside text commenting

Markdown does not have a syntax to add comments in your text, but this function can be borrowed from HTML as follows:

# HTML syntax to comment inside text
<!-- COMMENT -->

We usually use this syntax, for example, to highlight where the text needs improvement/work or to record editing points. Comments will not be visible when your document is knitted.

You can learn more about this HTML syntax on this webpage.

2.2.9 Lists

There are two types of lists:

  • Unordered
  • Ordered

2.2.9.1 Syntax for unordered lists

Syntax:

* unordered list
* item 2
  + sub-item 1
  + sub-item 2

Note: For each sub-level, indent the items (as shown in the syntax above) to create the hierarchy.

Output:

  • unordered list
  • item 2
    • sub-item 1
    • sub-item 2

2.2.9.2 Syntax for ordered lists

Syntax:

1. ordered list
2. item 2
   + sub-item 1
   + sub-item 2

Output:

  1. ordered list
  2. item 2
    • sub-item 1
    • sub-item 2

2.2.10 Italicize and bold words

The following syntax will render text in italics or bold:

#Syntax for italics
*italics*

#Syntax for bold
**bold**

2.2.12 Include code chunks

One of the most exciting features of working with the R Markdown format is the implementation of functions allowing you to directly “plug” the output of R code into the compiled document (see Figure 2.5). In other words, when you compile your .Rmd file, R Markdown will automatically run and process each code chunk and inline code expression (see below) and embed their results in your final document. If the output of the code is a table or a figure, you will be able to assign a label to this item (by adding information in the code chunk; see part B) and refer to it (= cross-referencing) in your PDF or HTML document. Cross-referencing is possible thanks to the \@ref() function implemented in the R bookdown package.

2.2.12.1 Code chunk

A code chunk can easily be inserted in your document as follows:

  • Using the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I).
  • Pressing on the Insert button in the editor toolbar.
  • Typing the chunk delimiters ```{r} and ```.

By default the code chunk will expect R code, but you can also insert code chunks supporting different computer languages (e.g. Bash, Python).
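For illustration, an R chunk and a Bash chunk might look like this in a .Rmd file (the chunk labels are arbitrary):

```{r summary-cars}
# R code chunk: summarize the built-in cars dataset
summary(cars)
```

```{bash list-files}
# Bash code chunk: list the files in the working directory
ls -l
```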

2.2.12.2 Chunk options

Chunk output can be customized with knitr options arguments set in the {} of a chunk header. In the examples displayed in Figure 2.6 five arguments are used:

  • include = FALSE prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
  • echo = FALSE prevents the code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
  • message = FALSE prevents messages that are generated by code from appearing in the finished file.
  • warning = FALSE prevents warnings that are generated by code from appearing in the finished file.
  • fig.cap = "..." adds a caption to graphical results.

We will delve more into chunk options in part C of chapter 1, but in the meantime please see the R Markdown Reference Guide for more details.
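As an illustration only (the chunk label and caption below are made up), a chunk header combining some of these arguments could look like this:

```{r pressure-plot, echo = FALSE, warning = FALSE, fig.cap = "Relationship between temperature and vapor pressure."}
# The figure and its caption are displayed in the output,
# but the code itself is hidden
plot(pressure)
```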

Figure 2.6: Example of code chunks.

2.2.13 Inline code

Code results can also be inserted directly into the text of a .Rmd file by enclosing the code in backticks and starting it with r (i.e. `r code`).

R Markdown will always:

  • Display the results of inline code, but not the code.
  • Apply relevant text formatting to the results.

As a result, inline output is indistinguishable from the surrounding text.

Warning: Inline expressions do not take knitr options and are therefore less versatile. We usually use inline code to report simple computations (e.g. 4*4, giving 16).
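For example (a sketch based on the built-in cars dataset), writing the following sentence in your .Rmd file:

The mean speed recorded in the `cars` dataset is `r mean(cars$speed)` mph.

will knit into: “The mean speed recorded in the cars dataset is 15.4 mph.”, with the value computed when the document is compiled.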

2.2.14 Check spelling

There are three ways to access spell checking in an R Markdown document in RStudio:

  1. A spell check button to the right of the save button.
  2. Edit > Check Spelling...
  3. The F7 key.

2.2.15 Exercises

Students will work individually to complete the following exercises:

  1. Create an *.Rmd file entitled Exercises chapter 1: part A and select HTML as output format.
  2. Save this document as Exe_chap1_partA.Rmd in a folder called Exercises located in:
    • Reproducible_Science/Chapters/Chapter_1
  3. Practice syntax to do:
    • headers
    • inside text commenting
    • lists
    • include R code chunk
    • include inline code
  4. Confirm that your syntax works by knitting your document and inspecting the output.

To further learn syntax and protocols, please look at associated files provided by the instructor (see above for more details).

2.3 PART B: Tables, Figures and References

The aim of this tutorial is to provide students with the expertise to generate reproducible reports using bookdown (Xie, 2016, 2023a) and allied R packages (see Appendix 1 for a full list). Unlike the functions implemented in the R rmarkdown package (Xie et al., 2018), which were better suited to generating PDF reproducible reports, bookdown allows you to use ONE unified set of functions to generate HTML and PDF documents. In addition, the same approach and functions are used to process tables and figures as well as cross-reference those in the main body of the text. In this tutorial, we will also cover procedures to cite references in the text, automatically generate a bibliography/References section, format citations to journal styles, and generate an Appendix containing citations of all R packages used to conduct your research (and produce the report).

2.3.1 Learning outcomes

This tutorial is devoted to part B of chapter 1 and provides students with opportunities to learn procedures to:

2.3.2 RMarkdown file structure

To facilitate teaching of the learning outcomes, a roadmap with the RMarkdown file (*.Rmd) structure is summarized in Figure 2.7. This structure also applies to Chapter 1 - part C.

To support reproducibility of your research, we are structuring the RMarkdown file as follows (Figure 2.7):

  1. The YAML metadata section (at the top of the document) must contain information about your bibliography (*.bib) and citation style language (*.csl) files, which have to be stored in the working directory.
  2. Two R code chunks are provided just under the YAML metadata section to:
    • Code chunk #1: Install and load packages and generate a packages.bib file (stored in the working directory).
    • Code chunk #2: Set global code chunk options.
  3. The Markdown section below the two R code chunks is edited to add headers associated with References and Appendices.
  4. Finally, two R code chunks are provided in the Appendices section to:
    • Code chunk #3: Provide citations of software (relying on the output of Code chunk #1), associated with Appendix 1.
    • Code chunk #4: Provide software versions (relying on sessionInfo()), associated with Appendix 2.

Figure 2.7: Representation of the RMarkdown file structure taught in this class. The following colors represent the three main computing languages found in an RMarkdown document: Black: Markdown (we also sometimes use HTML), Green: YAML, and Blue: R (included in code chunks). See text for more details.

2.3.3 Associated files supporting this tutorial

Please refer to section for more details on supporting files and their locations on the shared Google Drive.

2.3.4 Set your R Markdown environment

2.3.4.1 Install dependencies

To execute this tutorial, the following R packages (declared in the pkg object) have to be installed on your computer using the code provided below.

The best procedure to ensure reproducibility is to copy the code below into an R script file (entitled 01_R_dependencies.R) saved at the root of your working directory (if this directory does not exist yet, please start by creating it and naming it Chapter_1_PartB).

##~~~
# Check/Install R dependencies
##~~~
# This code is dedicated to packages for Chapter 1

##~~~
#1. List all required packages
##~~~
#Object (args) provided by user with names of packages stored into a vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools")

##~~~
#2. Check if pkg are installed
##~~~
print("Check if packages are installed")
## [1] "Check if packages are installed"
#This line outputs a list of packages that are not installed
new.pkg <- pkg[!(pkg %in% installed.packages())]

##~~~
#3. Install missing packages
##~~~
# Using an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then need to adjust code/function
if(length(new.pkg) > 0){
  print(paste("Install missing package(s):", new.pkg, sep=' '))
  install.packages(new.pkg, dependencies = TRUE)
}else{
  print("All packages are already installed!")
}
## [1] "All packages are already installed!"
##~~~
#4. Load all packages
##~~~
print("Load packages and return status")
## [1] "Load packages and return status"
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
sapply(pkg, require, character.only = TRUE)
##         knitr     rmarkdown      bookdown   formattable    kableExtra 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         dplyr      magrittr     prettydoc     htmltools knitcitations 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##        bibtex      devtools 
##          TRUE          TRUE

2.3.4.2 TeX distribution

If you are planning to create PDF documents, you will need to install a TeX distribution on your computers. Please refer to this website for more details: https://www.latex-project.org/get/

2.3.4.2.1 Set MiKTeX for compiling PDF documents on Windows computers

Several students working on Windows computers reported difficulties in compiling PDF documents in RStudio. This issue is associated with MiKTeX preventing RStudio from installing or updating the TeX packages required to knit your documents.

To solve this issue apply the following procedure:

  1. Start the MiKTeX Console by searching for and clicking MiKTeX Console in the application launcher.
  2. Enable automatic package installation by:
    1. Clicking on the Settings tab.
    2. Ticking the radio button Always install missing packages on-the-fly under the “You can choose whether missing packages are to be installed on-the-fly” header (see Figure 2.8).
  3. Restart RStudio; you should now be able to knit PDF documents.

Figure 2.8: Snapshot of the MiKTeX Console showing the procedure to always install packages on-the-fly.

2.3.4.3 YAML metadata section

The YAML metadata section (Figure 2.7) allows users to provide arguments (referred to as fields) to convert their R Markdown document into its final form. In this class, we will be using functions implemented in the knitr (Xie, 2015, 2023b) and bookdown (Xie, 2016, 2023a) packages to populate this section (field names as declared in the YAML metadata section are provided in parentheses):

  1. Title (title).
  2. Subtitle (subtitle).
  3. Author(s) (author).
  4. Date (date).
  5. Output format(s) (output).
  6. Citations link (link-citations).
  7. Font size (fontsize).
  8. Bibliography file(s) (bibliography).
  9. Format for citations to follow journal styles (csl).

The YAML code provided below outputs either an HTML or a PDF document (see output field) with a table of contents (see toc field) and generates in-text citations and a bibliography section as declared in the AmJBot.csl file (under the csl field). The bibliography has to be stored in a file (here Bibliography_Reproducible_Science_2.bib) deposited at the root of your working directory.

--- 
title: "Your title"
subtitle: "Your subtitle"
author: "Your name"
date: "`r Sys.Date()`"
output:
  bookdown::html_document2: 
    toc: TRUE
  bookdown::pdf_document2:
    toc: TRUE
link-citations: yes
fontsize: 12pt
bibliography: Bibliography_Reproducible_Science_2.bib 
csl: AmJBot.csl
---
2.3.4.3.1 Step-by-step procedure

Do the following steps to set your YAML metadata section (also see Figure 2.7):

  1. Create a new R Markdown document.
  2. Save the .Rmd document into a new project folder in Reproducible_Science/Chapters/Chapter_1/ (Note: this folder has to be created prior to executing this step).
  3. Copy Bibliography_Reproducible_Science_2.bib and AmJBot.csl into your project folder. These files are available in the shared Google Drive folder:
    • Reproducible_Science/Chapters/Chapter_1/Tutorial_files
  4. Edit the top YAML metadata section as shown in the section above.
2.3.4.3.2 Extra information
  • Warning: The .bib and .csl files have to be stored in the same working directory as your .Rmd file.
  • R functions can be used in the YAML metadata section by using inline R code syntax (see part A for more details). Here, we use Sys.Date() to automatically date the output document.
  • Multiple bibliography files can be used in your document by using this syntax: bibliography: [file1.bib, file2.bib].
  • Fields/parameters can be omitted in the YAML metadata section by inserting # in front of the command line (= equivalent of commenting).
  • To learn more about YAML and its application to R Markdown, please visit this website.

2.3.4.4 Knitting procedure

Since you have declared two output formats in the YAML metadata section, and those are specific to bookdown functions, you will have to select which output format you want to use to compile your document by clicking on the drop-down list next to the Knit button (see Figure 2.9). To use bookdown functions, please make sure to select one of the following options (see Figure 2.9): Knit to html_document2 or Knit to pdf_document2.

Figure 2.9: Snapshot of the RStudio console showing the drop-down list associated with the Knit button.

2.3.4.5 Load required R packages

It is best practice to add an R code chunk directly under the YAML metadata section to load all the R packages required to produce your report (the same code as presented here) (see Figure 2.7). This approach will also allow you to automatically generate a citation file with all R packages used to generate your report (see below), and it will contribute to improving the reproducibility of your research!

2.3.4.5.1 Step-by-step procedure
  1. Include an R code chunk directly under your YAML metadata section (Figure 2.7).
  2. Name the code chunk packages and set its options line as follows (a sketch of the resulting chunk header is provided after the code below):
    • echo = FALSE,
    • warning = FALSE,
    • include = FALSE.
  3. Add the following code to load the packages required to produce your report (or copy the content of 01_R_dependencies.R):
###~~~
# Load R packages
###~~~
#Create vector w/ R packages
# --> If you have a new dependency, don't forget to add it in this vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools")

##~~~
#2. Check if pkg are installed
##~~~
print("Check if packages are installed")
#This line outputs a list of packages that are not installed
new.pkg <- pkg[!(pkg %in% installed.packages())]

##~~~
#3. Install missing packages
##~~~
# Using an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then need to adjust code/function
if(length(new.pkg) > 0){
  print(paste("Install missing package(s):", new.pkg, sep=' '))
  install.packages(new.pkg, dependencies = TRUE)
}else{
  print("All packages are already installed!")
}

##~~~
#4. Load all packages
##~~~
print("Load packages and return status")
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
sapply(pkg, require, character.only = TRUE)
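For reference, the header of this packages code chunk (combining the name from step 2 with the options above) would look like this, with the code above as its body:

```{r packages, echo = FALSE, warning = FALSE, include = FALSE}
# Code from 01_R_dependencies.R goes here
```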

2.3.4.6 Generate citation file of R packages used to produce the report

I don’t know about you, but I am always struggling to properly cite R packages in my publications. If you want to retrieve the citation for an R package, you can use the base R function citation(). For instance, the citation for knitr can be obtained as follows:

# Generate citation for knitr
citation("knitr")
## 
## To cite the 'knitr' package in publications use:
## 
##   Yihui Xie (2023). knitr: A General-Purpose Package for Dynamic Report
##   Generation in R. R package version 1.42.
## 
##   Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition.
##   Chapman and Hall/CRC. ISBN 978-1498716963
## 
##   Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible
##   Research in R. In Victoria Stodden, Friedrich Leisch and Roger D.
##   Peng, editors, Implementing Reproducible Computational Research.
##   Chapman and Hall/CRC. ISBN 978-1466561595
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

If you want to generate these citation entries in BibTeX format, you can pass the object returned by citation() to toBibtex() as follows:

# Generate citation for knitr in BibTeX format.
# Note that there are no citation identifiers; those will be
# automatically generated by the code presented below.
toBibtex(citation("knitr"))
## @Manual{,
##   title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
##   author = {Yihui Xie},
##   year = {2023},
##   note = {R package version 1.42},
##   url = {https://yihui.org/knitr/},
## }
## 
## @Book{,
##   title = {Dynamic Documents with {R} and knitr},
##   author = {Yihui Xie},
##   publisher = {Chapman and Hall/CRC},
##   address = {Boca Raton, Florida},
##   year = {2015},
##   edition = {2nd},
##   note = {ISBN 978-1498716963},
##   url = {https://yihui.org/knitr/},
## }
## 
## @InCollection{,
##   booktitle = {Implementing Reproducible Computational Research},
##   editor = {Victoria Stodden and Friedrich Leisch and Roger D. Peng},
##   title = {knitr: A Comprehensive Tool for Reproducible Research in {R}},
##   author = {Yihui Xie},
##   publisher = {Chapman and Hall/CRC},
##   year = {2014},
##   note = {ISBN 978-1466561595},
## }

To use the citation entries generated by toBibtex(), you would have to copy the output into a .bib file saved in your working directory; you would then be able to cite these references directly in your R Markdown document. This step can be automated by adding the following code to your packages R code chunk:

# Generate a BibTeX citation file for all loaded R packages
# used to produce the report. Notice the package::function()
# syntax used here to call the function.
knitr::write_bib(.packages(), file = "packages.bib")

The .packages() argument returns (invisibly) the names of all packages loaded in the current R session (if you want to see the returned value, use .packages(all.available = TRUE)). This makes sure that all packages used in your code will have their citation entries written to the .bib file. Finally, to be able to cite those references (see Citation identifier) in your text, the YAML metadata section has to be edited (see below). See Appendix 1 for a full list of references associated with the R packages used to generate this report.
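A minimal sketch of the corresponding YAML edit and an in-text citation is shown below. Note that, by default, write_bib() prefixes the citation identifiers with R- (e.g. R-knitr); check the generated packages.bib file for the exact keys:

bibliography: [Bibliography_Reproducible_Science_2.bib, packages.bib]

and, in the body of the document:

Dynamic reports were generated with the knitr package [@R-knitr].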

2.3.4.7 Generate Appendix with citations for all R packages

Although a References section will be provided at the end of your document for in-text citations (see References and Figure 2.7), it is customary to add citations for all R packages used to generate the research in Appendix 1. We will learn here the procedure to assemble such an Appendix.

2.3.4.7.1 Step-by-step procedure
  1. Include Appendices after the References section (see Figure 2.7).
  • This is done by using <div id="refs"></div> as shown below, which allows printing Appendices (or any other material) after the References section (see here for more details):
# References

<div id="refs"></div>

# (APPENDIX) Appendices {-}

# Appendix 1

Citations of all R packages used to generate this report.
  2. Insert an R code chunk directly under # Appendix 1 to read in and print the citations saved in packages.bib. This is done as follows:
### Load R package
library("knitcitations")
### Process and print citations in packages.bib. Clear any
### bibliography that could be in the cache
cleanbib()
# Set pandoc as the default output option for bib
options(citation_format = "pandoc")
# Read and print bib from file
read.bibtex(file = "packages.bib")
  3. Edit your R code chunk options line as follows to correctly print out references: {r generateBibliography, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
  4. Knit your code to check that it produces the right output (see Knitting procedure). See Appendix 1 to get an idea of the output.

2.3.4.8 Generate Appendix with R package versions used to produce report

In addition to providing citations to R packages, you might also want to provide full information on R package versions and your operating system (see Figure 2.7). With R, the simplest (but useful and important) approach to document your R environment is to report the output of sessionInfo() (or devtools::session_info()). Among other information, this will show all the packages and their versions that are loaded in the session you used to run your analysis. If someone wants to reproduce your analysis, they will know which packages they need to install, which versions, and on which operating system the code was executed. For instance, here is the output of sessionInfo() showing the R version and packages that I used to create this document:

# Collect Information About the Current R Session
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rticles_0.23         DiagrammeR_1.0.9     DT_0.24             
##  [4] data.tree_1.0.0      kfigr_1.2.1          devtools_2.4.4      
##  [7] usethis_2.1.6        bibtex_0.4.2.3       knitcitations_1.0.12
## [10] htmltools_0.5.3      prettydoc_0.4.1      magrittr_2.0.3      
## [13] dplyr_1.1.2          kableExtra_1.3.4     formattable_0.2.1   
## [16] bookdown_0.33        rmarkdown_2.21       knitr_1.42          
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.5         sass_0.4.2         pkgload_1.3.2.1    jsonlite_1.8.7    
##  [5] viridisLite_0.4.2  bslib_0.4.0        shiny_1.7.2        highr_0.9         
##  [9] yaml_2.3.7         remotes_2.4.2      sessioninfo_1.2.2  pillar_1.9.0      
## [13] glue_1.6.2         digest_0.6.33      RColorBrewer_1.1-3 promises_1.2.0.1  
## [17] rvest_1.0.3        RefManageR_1.3.0   colorspace_2.1-0   httpuv_1.6.5      
## [21] plyr_1.8.7         pkgconfig_2.0.3    purrr_1.0.2        xtable_1.8-4      
## [25] scales_1.2.1       webshot_0.5.4      processx_3.8.2     svglite_2.1.0     
## [29] later_1.3.0        tibble_3.2.1       generics_0.1.3     ellipsis_0.3.2    
## [33] cachem_1.0.6       cli_3.6.1          crayon_1.5.2       mime_0.12         
## [37] memoise_2.0.1      evaluate_0.21      ps_1.7.5           fs_1.6.3          
## [41] fansi_1.0.4        xml2_1.3.5         pkgbuild_1.3.1     profvis_0.3.7     
## [45] tools_4.2.0        prettyunits_1.1.1  formatR_1.12       lifecycle_1.0.3   
## [49] stringr_1.5.0      munsell_0.5.0      callr_3.7.3        compiler_4.2.0    
## [53] jquerylib_0.1.4    systemfonts_1.0.4  rlang_1.1.1        rstudioapi_0.14   
## [57] visNetwork_2.1.0   htmlwidgets_1.5.4  miniUI_0.1.1.1     R6_2.5.1          
## [61] lubridate_1.8.0    fastmap_1.1.0      utf8_1.2.3         stringi_1.7.12    
## [65] Rcpp_1.0.11        vctrs_0.6.3        tidyselect_1.2.0   xfun_0.36         
## [69] urlchecker_1.0.1

I have also used the approach described above to add this information in Appendix 2. This can be done as follows:

  1. Edit the document to add the following text below Appendix 1:
# Appendix 2

Version information about R, the operating system (OS) and attached or loaded R packages. This appendix was generated using `sessionInfo()`.
  2. Then, add an R code chunk with the following code:
# Load and provide all packages and versions
sessionInfo()

2.3.5 All set, good to go!

We have now set our R Markdown environment and can start populating it! This means that you will be inserting your text and other code chunks directly under the packages code chunk. The References section constitutes the end of the main body of your document. If you want to add appendices, do so under Appendix 1; appendices are labelled differently from the main body of the document.

2.3.6 Insert tables and figures in R Markdown document

2.3.6.1 Tables

There will be more details about tables in chapter 9; however, this tutorial introduces key concepts related to table making in R Markdown, more specifically the following topics:

  • Creating a table in R.
  • Assigning a table caption.
  • Providing a unique label to the R code chunk allowing further cross-referencing in the text.
  • Displaying the table in the document.
2.3.6.1.1 Step-by-step protocol

Here, you will learn the R Markdown syntax and R code required to replicate the grading scale presented in the Syllabus (see Table 2.1):

Table 2.1: Grading scale applied in this class.

  Percentage   Grade
  100-98       A+
  97.9-93      A
  92.9-90      A-
  89.9-88      B+
  87.9-83      B
  82.9-80      B-
  79.9-78      C+
  77.9-73      C
  72.9-70      C-
  69.9-68      D+
  67.9-60      D
  59.9-0       F
  1. Use the same .Rmd document as above to practice working with tables.
  2. Add a first-level header entitled Tables.
  3. Insert an R code chunk under your header by clicking on the Insert button in the editor toolbar.
  4. Copy/paste the following R code in your code chunk:
# Create a data.frame w/ grading scale
grades <- data.frame(Percentage = c("100-98", "97.9-93", "92.9-90",
    "89.9-88", "87.9-83", "82.9-80", "79.9-78", "77.9-73", "72.9-70",
    "69.9-68", "67.9-60", "59.9-0"), Grade = c("A+", "A", "A-",
    "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F"))
# Plot table and add caption
knitr::kable(grades, caption = "Grading scale applied in this class.") %>%
    kable_styling(c("striped", "scale_down"))
  5. Edit the code chunk options line by adding the following argument (Note: each argument should be separated by a comma):
    • echo = FALSE
  6. Add the unique label tabgrades in the chunk options line (just after {r) to enable further cross-referencing.
  7. Test your code to check that it produces the expected table by using the Run button.
  8. Knit your document using the Knit button on the editor toolbar (Figure 2.9).

2.3.6.2 Figures

There will be more details about figures in chapter 10; however, this tutorial introduces key concepts related to figure making in R Markdown, more specifically the following topics:

  • Creating a figure in R (based on the cars dataset; Figure 2.10).
  • Assigning a figure caption.
  • Providing a unique label to the R code chunk allowing further cross-referencing in the text.
  • Displaying the figure in the document.

Figure 2.10: Plot of cars’ speed in relation to distance.

2.3.6.2.1 Step-by-step protocol

Here, you will learn the R Markdown syntax and R code required to replicate Figure 2.10:

  1. Use the same .Rmd document as above to practice working with figures.
  2. Add a first-level header entitled Figures.
  3. Insert an R code chunk under your header by clicking on the Insert button in the editor toolbar.
  4. Copy/paste the following R code in the code chunk:
# Load and summarize cars dataset
summary(cars)
# Plot data
plot(cars)
  5. Edit the code chunk options line by adding the following arguments (each argument should be separated by a comma):
    • echo = FALSE
    • results = "hide"
    • fig.cap = "Plot of cars' speed in relation to distance."
    • out.width = "100%"
  6. Add the unique label cars in the chunk options line (just after {r) to enable further cross-referencing.
  7. Test your code to check that it produces the expected plot by using the Run button.
  8. Knit your document using the Knit button on the editor toolbar (Figure 2.9).

2.3.7 Cross-reference tables and figures in the text

Cross-referencing tables and figures in the main body of your R Markdown document can easily be done by using the \@ref() function implemented in the bookdown package (see Figure 2.7).

2.3.7.1 General syntax

The general syntax is as follows:

# Cross-referencing tables in main body of text
\@ref(tab:code_chunk_ID)

# Cross-referencing figures in main body of text
\@ref(fig:code_chunk_ID)

2.3.7.2 Step-by-step procedure

To cross-reference the tabgrades table type:

\@ref(tab:tabgrades), which translates into 2.1.

To cross-reference the cars figure type:

\@ref(fig:cars), which translates into 2.10.

Note: This syntax doesn’t automatically include the Table or Figure handles in front of the cross-reference. You will have to manually add Table or Figure in front of \@ref().
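For example, typing the following sentence in your .Rmd file:

Grades are summarized in Table \@ref(tab:tabgrades), whereas the cars data are plotted in Figure \@ref(fig:cars).

will knit into: “Grades are summarized in Table 2.1, whereas the cars data are plotted in Figure 2.10.”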

Cross-reference your table and figure by adding a new level 1 header entitled Cross-referencing tables and figures and then typing examples as shown in this section.

2.3.8 Cite references in text and add a References section

2.3.8.1 The bibliography file

To cite references in an R Markdown document, they have to be saved in a bibliography file using the BibTeX format. Other formats can be used, but the BibTeX format is open source and easy to edit. Please see this webpage for more details on other formats.

Most journals allow saving citations of publications directly in BibTeX format; when this feature is not available, formats can be converted using online services (e.g. EndNote to BibTeX: https://www.bruot.org/ris2bib/).

2.3.8.1.1 Procedure to do prior to citing references in an R Markdown document
  1. Save all your BibTeX-formatted references in a text file and make sure to add the .bib extension.
  2. The .bib file has to be deposited in the same folder as your .Rmd file (= working directory).
  3. Provide the name of your references file in the YAML metadata section.
  4. You can visit this webpage and click on the Cite icon to download a citation in BibTeX format. More details on the BibTeX format are provided below.
  5. References formatted in the BibTeX format are available in the associated file:
    • Bibliography_Reproducible_Science_2.bib.

2.3.8.2 Specifying a bibliography in the RMarkdown file

The Pandoc program can automatically generate citations in the text and a bibliography/references section following various journal styles (see Figure 2.7). In order to use this feature, you need to specify a bibliography file in the YAML metadata section.

2.3.8.3 BibTeX format

Please find below an example of a reference formatted in BibTeX format:

# Example of BibTeX format for Baker (2016) published in Nature
@Article{Baker_2016,
  doi = {10.1038/533452a},
  url = {https://doi.org/10.1038/533452a},
  year = {2016},
  month = {may},
  publisher = {Springer Nature},
  volume = {533},
  number = {7604},
  pages = {452--454},
  author = {Monya Baker},
  title = {1,500 scientists lift the lid on reproducibility},
  journal = {Nature},
}

2.3.8.4 Citation identifier

The unique citation identifier of a reference (Baker_2016 in the example above) is set by the user in the BibTeX citation file (see first line in the example provided above). This unique identifier is used to refer to the reference/publication in the R Markdown document and also allows citing references and generating the References section.

2.3.8.5 Citing references

Citations go inside square brackets ([]) and are separated by semicolons. Each citation must have a key, composed of @ + the citation identifier (see above) as stored into the BibTeX file.

Please find below some examples on citation protocols:

#Syntax
Blah blah [see @Baker_2016, pp. 33-35; also @Smith2016, ch. 1].
Blah blah [@Baker_2016; @Smith2016].

Once knitted (using the Knit button), the above code/syntax turns into:

Blah blah (see Baker, 2016 pp. 33–35; also Smith et al., 2016, ch. 1).

Blah blah (Baker, 2016; Smith et al., 2016).

A minus sign (-) before the @ will suppress mention of the author in the citation. This can be useful when the author is already mentioned in the text:

#Syntax
Baker says blah blah [-@Baker_2016].

Once knitted, the above code/syntax turns into:

Baker says blah blah (2016).

You can also write an in-text citation, as follows:

#Syntax
@Baker_2016 says blah.
@Baker_2016 [p. 1] says blah.

Once knitted, the above code/syntax turns into:

Baker (2016) says blah.

Baker (2016 p. 1) says blah.

2.3.8.5.1 Practice citing references

Students have to use their .Rmd document to practice citing references in the text using procedures described above. To clearly define where you practice citing references, please do so under a Citing references header.

2.3.9 Adding a References section

Upon knitting, a References section will automatically be generated and inserted at the end of the document (see Figure 2.7). Usually, we recommend adding a References header (level 1) just after the last paragraph of the document as displayed below:

last paragraph...

# References

The bibliography will be inserted after this header (please see References section of this tutorial for more details).

2.3.10 Format citations to journal style

In this section, we are studying how your bibliography can be automatically formatted following a journal style. This is achieved by providing the name of a citation style language file (containing the protocol to format citations and bibliography following a journal style) in the YAML metadata section.

2.3.10.1 What is the citation style language (CSL)?

The Citation Style Language (CSL) was developed by an open-source project and aims at facilitating scholarly publishing by automating the formatting of citations and bibliographies. The project maintains a crowdsourced repository with over 8,000 free CSL citation styles. Please see the following website for more details: https://citationstyles.org

2.3.10.2 CSL repositories

There are two main CSL repositories:

2.3.10.3 How to use a CSL file in an R Markdown document to format citations and the bibliography?

Please follow the steps below to format your citations and bibliography following the citation style provided in a CSL file (see Figure 2.7 for more details):

  1. Download the CSL file using repositories provided above. Some journals provide their CSL files on their websites (one has been made available for you to use in the associated files: AmJBot.csl).
  2. Save the CSL file in the same working directory as the .Rmd file.
  3. Edit the YAML metadata section as follows to specify the CSL file:
# Add a "csl" argument and provide name of the CSL file (here AmJBot.csl) 
---
title: "Sample Document"
output:
  bookdown::html_document2: 
    toc: TRUE
  bookdown::pdf_document2:
    toc: TRUE
bibliography: bibliography.bib
csl: AmJBot.csl
---
  4. Knit the R Markdown document using the Knit button. The Pandoc program will use the information stored in the YAML metadata section to format the bibliography (citations and References section) following the citation style provided in the CSL file. Do not forget to add a References header at the end of your .Rmd document.

2.4 PART C: Advanced R and R Markdown settings

2.4.1 Learning outcomes

This tutorial is devoted to part C of chapter 1 and provides students with opportunities to learn procedures to (see Figure 2.7):

  • Set your working directory.
  • Set global options for code chunks related to:
    • text,
    • code decoration,
    • caching,
    • plots (output format and resolution),
    • positioning figures (close to associated code chunks).

2.4.2 Associated files supporting this tutorial

Please refer to section for more details on supporting files and their locations on the shared Google Drive.

2.4.3 Download presentation

The slides presented in class can be downloaded here. All the information presented in these slides is found in the text below.

2.4.4 Set your working directory

Unlike R scripts, where you have to set your working directory or provide the path to your files, the approach implemented in R Markdown documents (.Rmd) automatically sets the working directory to the location of your .Rmd file. This is done by knitr functions: knitr expects all declared files to be located in the same path as your .Rmd file or in a subfolder within this working directory. The main reason for this approach is to maximize the portability of your R Markdown project, which is usually composed of a set of files (see Figure 2.11).

Figure 2.11: Snapshot of the project structure associated to part C of Chapter 1. External figures used in the document are in Figures, whereas those generated by the document are saved in Figures MS.

2.4.4.1 Instructions to setup your working directory

Before knitting your document, you will be testing your code, and this requires setting your working directory. This can be done in RStudio by clicking (see Figure 2.12):

Session --> Set Working Directory --> To Source File Location

Figure 2.12: Snapshot of RStudio showing procedure to set your working directory to allow testing your code prior to knitting.

2.4.5 Set global options for code chunks

To improve code reproducibility and efficiency, and to follow publication requirements, it is customary to include a code chunk at the beginning of your .Rmd file to set global options applying to the whole document (see Figure 2.7). Those settings are related to the following elements of your code:

  • text results.
  • code decoration.
  • caching code.
  • plots (or figures).
  • positioning figures.

These general settings will be set using the opts_chunk$set() function implemented in knitr (Xie, 2023b). The following website contains valuable information on code chunk options:

2.4.5.1 The opts_chunk$set() function

The knitr function opts_chunk$set() is used to change the default global options in an .Rmd document.

Before starting, a few special notes about the options:

  1. Chunk options must be written in one line; no line breaks are allowed inside chunk options.
  2. Avoid spaces and periods (.) in chunk labels and directory names.
  3. All option values must be valid R expressions just like how we write function arguments.

Here we will discuss each part of the settings individually, but those will have to be merged into one code chunk entitled setup in your document (please see below for more details).

2.4.5.2 Text results

This section deals with settings related to text results generated by code chunks.

Please find below an example of options that could be applied across code chunks:

# Setup options for text results
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
2.4.5.2.1 Explanations of the text results options
  • echo = TRUE: Include all R source codes in the output file.
  • warning = TRUE: Preserve warnings (produced by warning()) in the output, as when running R code in a terminal.
  • message = TRUE: Preserve messages emitted by message() (similar to warnings).
  • include = TRUE: Include all chunk outputs in the final output document.

If you want some of the text results to have different options, please adjust those in their specific code chunks. This comment is valid for all the other general settings.

2.4.5.3 Code decoration

This section deals with settings related to code decoration (i.e. how it is outputted in the final html or pdf document) generated by code chunks.

Please find below an example of options that could be applied across code chunks:

# Setup options for code decoration
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
    highlight = TRUE)
2.4.5.3.1 Explanations of the code decoration options
  • tidy = TRUE: Use formatR::tidy_source() to reformat the code. Please see tidy.opts below.
  • tidy.opts = list(blank = FALSE, width.cutoff = 60): This provides a list of options to be passed to the function determined by the tidy option. Here we format the code to avoid blank lines and with a width cutoff of 60 characters.
  • highlight = TRUE: This highlights the source code.

2.4.5.4 Caching code

To compile your .Rmd document faster (especially if you have computationally intensive tasks), you can cache the output of your code into files associated with each of your code chunks. This process allows the output of compute-intensive chunks to be saved and reused later without being re-run.

The knitr package has options to only re-evaluate cached chunks when necessary, but this has to be set by the user. This procedure creates a unique MD5 digest (= a fingerprint of the chunk's content) to track when changes are present. When the option cache = TRUE is set (there are other, more granular settings; see below), the chunk will only be evaluated in the following scenarios:

  • There are no cached results (either this is the first time running or the results were moved/deleted).
  • The code chunk has been modified.

The following code allows implementing this procedure to your document:

# Setup options for code cache
opts_chunk$set(cache = 2, cache.path = "cache/")
2.4.5.4.1 Explanations of the caching code options
  • Besides TRUE and FALSE for the chunk option cache, advanced users can also consider more granular caching by using the numeric values 0, 1, 2 or 3, where 0 means FALSE and 3 is equivalent to TRUE. For cache = 1, the results of the computation are loaded from the cache, so the code is not evaluated again, but everything else is still executed, such as the output hooks and saving recorded plots to files. For cache = 2 (used here), the behaviour is very similar to 1; the only difference is that the recorded plots will not be re-saved to files when the plot files already exist, which might save some time when the plots are big.
  • cache.path = "cache/": Directory where cache files will be saved. You don’t have to create the directory before executing the code, it will be created automatically by knitr if it doesn’t exist yet.

2.4.5.5 Plots

Plots are a major element of your research and they are at the core of your figures. We can take advantage of options implemented in the knitr package to output plots meeting publication requirements. This approach will save precious time during the writing phase of your research (= no need to fiddle with the size and resolution of figures to meet journal policies).

Please find below an example of options that could be applied across code chunks:

# Setup options for plots. The first dev listed is the master
# for the output document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
    dpi = 300)
2.4.5.5.1 Explanations of the caching plots options
  • fig.path = "Figures_MS/": Set the directory where figures generated by the R Markdown document will be saved. As above, this folder doesn't need to exist prior to executing the code chunks. Files will be saved based on the code chunk label and the assigned figure number.
  • dev = c("png", "pdf"): Save figures in both PNG and PDF formats (the first device listed is the one used in the output document).
  • dpi = 300: The DPI (dots per inch) for bitmap devices (dpi * inches = pixels). Please look at publishing requirements to set this parameter appropriately.

It is worth noting that you might be using external figures in your .Rmd document. To avoid confusion between figures generated by the .Rmd document and those coming from outside, it is best practice to save them in two different subfolders (see Figure 2.11 for more details).

2.4.5.5.2 Additional plots options

Some journals have specific requirements on figure dimensions. You can easily set these by using the following option:

  • fig.dim: (NULL; numeric) if a numeric vector of length 2, it gives fig.width and fig.height, e.g., fig.dim = c(5, 7).
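For instance, adding the line below to your setup code chunk (dimensions chosen arbitrarily here) sets all figures to 5 x 7 inches unless overridden in an individual chunk:

# Set default figure dimensions (width, height) in inches
opts_chunk$set(fig.dim = c(5, 7))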

2.4.5.6 Positioning figures

Positioning figures close to their code chunks is critical and can be achieved by adding another opts_chunk$set() code line in your setup R code chunk. This is done by invoking the fig.pos argument and setting it to "H". Warning: Setting this argument might generate errors when documents are knitted to PDF. If that happens, please comment out this line using # and knit again.

## Locate figures as close as possible to requested
## position (=code)
opts_chunk$set(fig.pos = "H")

2.4.6 Apply global changes to all code chunks

In this section, we will collate all global settings discussed above into a code chunk entitled setup, which will be placed under the YAML metadata section (see Figure 2.7 for more details on location). In addition to containing the global settings, it is advisable to also include a code section devoted to loading required R packages (see Chapter 1 - part B and Figure 2.7).

Please find below the code for the setup code chunk based on the options presented above:

# Load packages
## Add any packages specific to your code
library("knitr")
library("bookdown")
# Chunk options: see http://yihui.name/knitr/options/ ###
## Text results
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
## Code decoration
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
    highlight = TRUE)
## Caching code
opts_chunk$set(cache = 2, cache.path = "cache/")
## Plots. The first dev listed is the master for the output
## document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
    dpi = 300)
## Locate figures as close as possible to requested
## position (=code)
opts_chunk$set(fig.pos = "H")

2.4.6.1 Settings of the setup R code chunk

When inserting the above code into an R code chunk (see Figure 2.7), please set the options of the chunk as follows:

  • setup: Unique ID of the code chunk.
  • include = FALSE: Nothing will be written into the output document, but the code will be evaluated and plot files will be generated (if there are any plots in the chunk).
  • cache = FALSE: Code chunk will not be cached (see above for more details).
  • message = FALSE: Messages emitted by message() will not be preserved.

Options (and their associated arguments) in the code chunk have to be separated by commas.
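Put together, the setup chunk header therefore looks as follows (its body being the code presented above):

```{r setup, include = FALSE, cache = FALSE, message = FALSE}
# Global settings and package loading code (see above) go here
```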

2.4.7 Exercise to work on learning objectives

Please work through the following exercise to get accustomed to the material presented in this tutorial. The exercise is divided into seven steps as follows:

  1. Open RStudio, and open your Chapter_1_PartB.Rmd document.
  2. Set your working directory according to file location. Note: This is especially important if you want to test your R code prior to knitting the document.
  3. Insert an R code chunk (using the Insert button) directly under the YAML metadata section and entitle it setup. This code chunk will be used to define the global settings for the following options as implemented in the opts_chunk$set() function:
    • Text results.
    • Code decoration.
    • Caching code.
    • Plots.
    • Positioning of figures.
  4. Copy the content of the code presented in this section inside of your setup code chunk.
  5. Use the R code provided below (which generates a plot) to learn more about the effect of global settings on code outputs.
  6. Once the file is completed, compile your document using the Knit button. Please pay attention to the outputs in your folder (= working directory).
  7. Use the bookdown \@ref() function to cite your figure/plot in the text.

2.4.7.1 R code for step 5

The R code provided below is associated with step 5 of the exercise and it produces the plot displayed in Figure 2.13.

Plot of y ~ x.

Figure 2.13: Plot of y ~ x.

To get there:

  1. Insert an R code chunk (using the Insert button) and set the following options and associated arguments:
    • Plot: Unique ID of code chunk.
    • fig.cap = "Plot of y ~ x.": Figure caption.
    • fig.show = "asis": Figure display.
    • out.width = "100%": Figure width on the page.
  2. Enter the following R code in the code chunk.
# Generate a set of observations (n=100) that have a normal
# distribution
x <- rnorm(100)
# Add a small amount of noise to x to generate a new vector
# (y)
y <- jitter(x, 1000)
# Plot y ~ x
plot(x, y)
  3. Test your code to make sure it does what it is supposed to do (using the Run button).
  4. Go to step 6 in the above section and complete the exercise.
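To address step 7 of the exercise, you can cite the figure produced by this chunk with bookdown's \@ref() syntax. A minimal sketch, assuming the chunk is labelled Plot as above, would be to write the following sentence in the body of your .Rmd document:

The relationship between the two variables is displayed in Figure \@ref(fig:Plot).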

2.5 PART D: User Defined Functions in R

2.5.1 Learning outcomes

This tutorial provides students with opportunities to gain the following skills:

  • Defining what a function is and its applicability.
  • Learning the syntax to implement user-defined functions in R.
  • Learning the R functions associated with producing and loading functions: return(), source().
  • Providing some background on R lists (list()) and their applicability to functions returning multiple values.
  • Developing, implementing and applying user-defined functions returning single and multiple values.
  • Learning protocols to implement defensive programming in your code.
  • Learning about the use of logical operators in R.

2.5.2 What do you need to complete this tutorial?

To complete part D, please create a new R script document (Figure 2.14) and save it in your working directory. All the code and functions presented here should be reported in your R script.

Snapshot of RStudio showing procedure to create a new R script document.

Figure 2.14: Snapshot of RStudio showing procedure to create a new R script document.

2.5.3 Introduction

This tutorial aims at providing an introduction to functions; more specifically, we will be studying user-defined functions (UDFs) as implemented in R. UDFs allow users to write their own functions and make them available in the R Global Environment (using the source() function) or ultimately in R packages.

To gain this knowledge, students will be conducting three exercises to learn about the following topics:

  1. Develop, implement and apply function returning a single value.
  2. Develop, implement and apply function returning multiple values.
  3. Implement defensive programming to your code to support debugging.

To show the broad applications of the teaching material, we will be using mathematical examples. Before delving into these topics, the instructor provides some general context, touching upon what a function is and when it is best applied, as well as best practices to write pseudocode/code (more during Chapter 4) and approaches to calling R functions.

2.5.4 What is a function?

In programming, you use functions to incorporate sets of instructions that you want to use repeatedly or that, because of their complexity, are better self-contained in a sub program and called when needed.

A function is a piece of code written to carry out a specified task; it may or may not accept arguments or parameters, and it may or may not return one or more values.

2.5.5 Functions in R

Many terms exist to define and express functions (subroutines, procedures, methods, etc.), but for the purposes of this tutorial, you will ignore these distinctions, which are often semantic and reminiscent of older programming languages (see here for more details on semantics). In our context, those definitions are less important, because in R we only have functions.

2.5.5.1 R syntax for function

In R, according to the base documentation, you define a function with the following construct:

function(arglist){
   body
}

The code between the curly braces is the body of the function.

When you use built-in functions, the only thing you need to worry about is how to effectively communicate the correct input arguments (arglist) and manage the return value(s) (or outputs), if there are any. To know more about arguments associated with a specific function you can access its documentation by using the following syntax (entered in the R console):

# General syntax (replace function_name with the name of the function)
?function_name

# Example with read.csv()
?read.csv

2.5.5.2 User defined functions (UDFs) in R

R allows users to define their own functions, which are based on the following syntax:

function.name <- function(arguments){
   computations on the arguments
   some more code
   return value(s)
}

So, in most cases, a function has a name (here function.name), some arguments (here arguments) used as input to the function (declared within the () following the keyword function); a body, which is the code within the curly braces {}, where you carry out the computation; and can have one or more return values (the output). You define the function similarly to variables, by “assigning” the directive function(arguments) to the “variable” function.name, followed by the rest.

2.5.6 Best practice to writing code

This topic will be covered in Chapter 4, but here is an outline of the best practice to write code and functions in R. Before delving into code writing, we will usually work on developing a pseudocode, which aims at providing a high-level description of the tasks that will have to be performed by the function. Once this job is done, we will start writing the code by turning the tasks identified in the pseudocode into real R code. This will be done by searching for existing R functions allowing to execute each task described in the pseudocode and, if they don't exist, developing new functions (this task might require some additional pseudocode). Please find below more detailed definitions of the two concepts described here.

2.5.6.1 Writing pseudocode

Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a normal programming language (here R), but is intended for human reading rather than machine reading. Here, you will establish the big steps (and their associated tasks) and tie R functions (existing or that have to be made) to those steps. This provides the backbone of your code and will support writing it.
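To illustrate, a hypothetical pseudocode (not part of the exercises below) for a function computing the mean of a numeric vector could look like this, with each step tied to an existing R function:

## Pseudocode: compute the mean of a numeric vector
# 1. Receive the input vector as an argument        -> function(x)
# 2. Sum all elements of the vector                 -> sum()
# 3. Count the number of elements                   -> length()
# 4. Divide the sum by the count and return result  -> return()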

2.5.6.2 Writing code

Writing clear, reproducible code has (at least) three main benefits:

  1. It makes returning to the code much easier a few months down the line; whether revisiting an old project, or making revisions following peer review.
  2. The results of your analysis are more easily scrutinized by the readers of your paper, meaning it is easier to show their validity.
  3. Having clean and reproducible code available can encourage greater uptake of new methods that you have developed (and therefore citations).

2.5.7 Loading and calling UDFs

When you work on your project, it is highly likely that you will develop multiple UDFs tailored to your research. In this case, it is good practice to create a new folder entitled R_functions, located in the same directory as your R scripts (see Figure 2.15). Save each UDF as an independent file (e.g. check.install.pkg.R) in the R_functions folder.

Example file structure of a simple analysis project. See Chapter 4 for more details

Figure 2.15: Example file structure of a simple analysis project. See Chapter 4 for more details

2.5.7.1 The source() function: Read R code from a file

Once your project is properly structured (see Figure 2.15), it is easy to call specific UDFs into any R script by using the source() function.

For instance, to load the check.install.pkg() (saved in R_functions/check.install.pkg.R) into the Global Environment enter the following code into the R console (see Figure 2.16):

source("R_functions/check.install.pkg.R")
Snapshot of RStudio showing output of source() function and the UDF made available in the Global Environment. You can also view the code underpinning the UDF.

Figure 2.16: Snapshot of RStudio showing output of source() function and the UDF made available in the Global Environment. You can also view the code underpinning the UDF.

Below, you will find some R code allowing you to load all your UDFs stored in the R_functions folder. This code is very handy when you have several UDFs associated with your pipeline.

### Load all UDFs stored in R_functions
# 1. Create vector with names of UDF files in R_functions
# (with full path)
files_source <- list.files("R_functions", full.names = TRUE)
# 2. Iterative sourcing of all UDFs
sapply(files_source, source)

2.5.8 Call your UDFs

Once loaded into your Global Environment (see Figure 2.16 for an example), you will be able to call your functions by typing their names directly into the console. For instance:

# Check if knitr package is installed, if not install it
# and then load it
check.install.pkg(pkg = c("knitr"))
## [1] "Check if packages are installed"
## [1] "Load packages"
## knitr 
##  TRUE
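The code of check.install.pkg() is not reproduced in this chapter, but a minimal sketch of what such a UDF might look like is provided below (this is an assumption for illustration; the instructor's version stored in R_functions/check.install.pkg.R may differ):

# Sketch of a UDF checking, installing and loading packages
check.install.pkg <- function(pkg){
  print("Check if packages are installed")
  # Identify packages that are not yet installed
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if(length(new.pkg) > 0){
    install.packages(new.pkg, dependencies = TRUE)
  }
  print("Load packages")
  # Load each package and report TRUE/FALSE on success
  sapply(pkg, require, character.only = TRUE)
}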

2.5.9 Exercise 1: Develop, implement and apply function returning a single value

In this exercise, we will be working on developing a function that returns a single value, but how do we tell our UDF to return the value? In R, this task is accomplished with the return() function.

2.5.9.1 Aim

Develop, implement and apply a UDF to calculate the square of a number.

In math, the squared symbol (\(^2\)) is an arithmetic operator that signifies multiplying a number by itself. The “square” of a number is the product of the number and itself. Multiplying a number by itself is called “squaring” the number.

2.5.9.2 Pseudocode

Although straightforward, the function will have to execute the following tasks:

  1. Store number inputted by user into an object entitled base as part of the argument(s) of the function.
  2. Infer square of base by multiplying number by itself.
  3. Save output of task 2 into an object called sq.
  4. Return sq object to user (here a single value). To return sq, we will be using the return() function.

This also means that the class of the input provided by the user has to be numeric and the output will also be numeric. We will further discuss this topic during the third exercise. The class of an object can be checked by using the class() function. Note: The class() function will be useful for implementing defensive programming.
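For instance, you can quickly inspect classes in the console (values are illustrative):

# Check the class of different objects
class(2)        # "numeric"
class("hello")  # "character"
class(TRUE)     # "logical"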

2.5.9.3 Turning your pseudocode into a function

In this section, we will implement the pseudocode proposed above into a function entitled square_number(). This function requires one argument from the user (base, which is a number) and returns the square of that number (here square of base = base*base).

## Create a UDF in R to calculate square number:
# - argument: base (= one number)
# - output: square of base (= one number)
square_number <- function(base){
  #Infer square of base and save it into object
  sq <- base*base
  
  #Return sq object
  return(sq)
}

Write the code associated with square_number() into your new R script (saved in your working directory) and load the function by executing all of its lines. Please carry on populating this document with the rest of the exercises.

2.5.9.4 Loading your function into the Global Environment

Before being able to use your UDF, execute the code associated with the square_number() function in the console. The UDF should now be loaded in the Global Environment and therefore be available for use. To verify that the UDF is loaded, please check the Environment panel in RStudio (see Figure 2.17).

Close-up of the Environment panel in RStudio showing that the UDF is loaded in the Global Environment and can be used.

Figure 2.17: Close-up of the Environment panel in RStudio showing that the UDF is loaded in the Global Environment and can be used.

2.5.9.5 Applying your function

The R language is quite flexible and allows functions to be applied to a single value (e.g. base = 2) or a vector (e.g. base = c(2,4,16,23,45)). Please see below for more examples:

  1. Apply function to one value:
# Square number of 2
square_number(base = 2)
## [1] 4
  2. Apply function to a vector containing multiple values:
# Create vector with numbers
bases <- c(2, 4, 16, 23, 45)
# Apply function to vector
square_number(base = bases)
## [1]    4   16  256  529 2025

2.5.10 Exercise 2: Develop, implement and apply function returning multiple values

In our previous exercise we developed a function returning only one value. As part of your research, there will be multiple instances where you will perform multiple actions on your data, which will call for multiple values to be returned by the function. To do so, we will harvest the different outputs of the function into a list, which will be returned to the user (using return()).

2.5.10.1 What is a list in R?

Lists are R objects containing elements of different types such as numbers, strings, vectors or even another list. A list can also contain a matrix or a function as one of its elements. Lists are created using the list() function.

2.5.10.2 Creating a list

Find below an example creating a list containing strings, numbers, a vector and a logical value:

# Create a list containing strings, numbers, vectors and a
# logical value
list_data <- list("Red", 51.3, 72, c(21, 32, 11), TRUE)
# Print object
print(list_data)
## [[1]]
## [1] "Red"
## 
## [[2]]
## [1] 51.3
## 
## [[3]]
## [1] 72
## 
## [[4]]
## [1] 21 32 11
## 
## [[5]]
## [1] TRUE

2.5.10.3 Naming list elements

The list elements can be given names and they can be accessed using these names (see below):

# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
    -2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Show the list
print(list_data)
## $`1st_Quarter`
## [1] "Jan" "Feb" "Mar"
## 
## $A_Matrix
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
## 
## $An_inner_list
## $An_inner_list[[1]]
## [1] "green"
## 
## $An_inner_list[[2]]
## [1] 12.3

2.5.10.4 Accessing list elements

Elements of the list can be accessed by their index in the list. In the case of named lists, they can also be accessed using their names. We use the same example as above to illustrate the procedure to access list elements:

# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
    -2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Access the first element of the list
print(list_data[1])
## $`1st_Quarter`
## [1] "Jan" "Feb" "Mar"
# Access the third element. As it is also a list, all its
# elements will be printed
print(list_data[3])
## $An_inner_list
## $An_inner_list[[1]]
## [1] "green"
## 
## $An_inner_list[[2]]
## [1] 12.3
# Access the list element using the name of the element
print(list_data$A_Matrix)
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
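Note that single brackets ([ ]) return a sub-list, whereas double brackets ([[ ]]) or the $ operator return the element itself. A short illustration using the list_data object created above:

# Single brackets return a list of length 1
class(list_data[1])    # "list"
# Double brackets return the element itself (here a character vector)
class(list_data[[1]])  # "character"
# Elements can then be indexed directly
list_data[[1]][2]      # "Feb"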

2.5.10.5 Let’s go back to our exercise

Now that you have more knowledge about list objects, we will be working on an exercise aiming at developing and implementing a UDF to calculate the log and square of a number.

2.5.10.6 Pseudocode and associated UDF

Each student is tasked to:

  1. Develop a pseudocode to execute the UDF.
  2. Implement the pseudocode into a UDF.
  3. Apply and test your code/function.

2.5.10.7 Solution proposed by the instructor

Please find below the solution proposed by the instructor:

##Create a user defined function in R to calculate log and square of a number:
# argument: base (= one number)
# output: log and square of base (= two numbers) returned in a list
my_log_square <- function(base){
  # Natural log (base e)
  log_value <- log(base)
  #Square of base
  square_value <- base^2
  
  #Return both objects
  return(list(log_val = log_value, square_val = square_value))
}

# Call the function
my_log_square(base = 2)
## $log_val
## [1] 0.6931472
## 
## $square_val
## [1] 4

2.5.11 Exercise 3: Implement defensive programming to your code to support debugging

2.5.11.1 Background information and examples

Defensive programming is a technique to ensure that code fails with well-defined errors, i.e. in situations where you know it should not work. The key here is to 'fail fast' and ensure that the code throws an error as soon as something unexpected happens. This creates a little more work for the programmer, but it makes debugging the code a lot easier at a later date.

In order to demonstrate how to apply defensive programming to your code, a new function will be defined:

# Define a power function (exp_number): y = x^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
  #Infer exp (y) based on base (x) and power (n) (y=base^power)
  exp <- base^power
  
  #Return exp object
  return(exp)
}

# Call function
exp_number(base = 2, power = 5)
## [1] 32

You can employ defensive programming on the exp_number() function defined above. The function requires that both arguments are of class numeric; if you were to provide a string (e.g. a word) as input, you would get an error:

# Example where we don't respect the class associated with
# the argument base
exp_number(base = "hello", power = 5)
## Error in base^power: non-numeric argument to binary operator

If you add in a line of code to test the data type of the inputs, you get a more meaningful error.

# Define a power function (exp_number): y = x^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
    # This if statement tests if classes of base and power are numeric.
    # If one of them is not numeric, it stops and returns a meaningful message
    if(class(base) != "numeric" | class(power) != "numeric"){
          stop("Both base and power inputs must be numeric")
    }
    #If classes are good then infer exp
    exp <- base^power
    
    # Return exp object
    return(exp)
}

# Call function
exp_number(base = "hello", power = 5)
## Error in exp_number(base = "hello", power = 5): Both base and power inputs must be numeric

Although in this case debugging the error would not have taken long, in more complicated functions you are likely to either have less meaningful error messages, or code that runs for some time before it fails. By applying defensive programming and adding these checks to the code, you can find unexpected behavior sooner and with more meaningful error messages.
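As a side note (this variant is an illustration, not part of the instructor's solution), the same check can be written with is.numeric(), which also accepts integer inputs:

# Variant of exp_number() using is.numeric() for the input check
exp_number <- function(base, power){
  if(!is.numeric(base) | !is.numeric(power)){
    stop("Both base and power inputs must be numeric")
  }
  return(base^power)
}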

2.5.12 The logical operators in R

As you have seen in the previous example, we used a logical operator (in this case OR, represented by the | syntax) to implement defensive programming in our UDF. In a nutshell, the most commonly used logical operators in R are as follows:

  • AND (&): This operator takes two logical values and returns TRUE only if both values are TRUE themselves.
  • OR (|): This operator takes two logical values and returns TRUE if just one value is TRUE.
  • NOT (!): This operator negates the logical value it is used on.
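A few quick examples typed in the console illustrate how these operators behave (values are illustrative):

# AND: TRUE only if both sides are TRUE
TRUE & FALSE                                          # FALSE
# OR: TRUE if at least one side is TRUE
TRUE | FALSE                                          # TRUE
# NOT: negates a logical value
!TRUE                                                 # FALSE
# Combined, as in the defensive programming example above
class("hello") != "numeric" | class(5) != "numeric"   # TRUE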

You can learn more about the logical operators (including exercises) on the website.

2.5.13 Exercise 4: Your turn to implement defensive programming!

Update my_log_square() to verify the class of base argument and print a meaningful error message if the class is not numeric.

2.6 PART E: Interactive tutorials

The objective of this section is to provide students with some tools and ideas to design their bioinformatic tutorials (for PART 2). Here, students will have an overview of the tools implemented in the R package learnr, which was developed to produce interactive tutorials. Although developed in R, the interactive tutorials are designed to be conducted in web browsers (but they can also be run entirely within RStudio).

The interactive tutorial presented here is subdivided into five topics:

  1. Introduction: This part sets the scene and provides some background information on the tutorial.
  2. Exercise: The aim of the exercise is presented together with a pseudocode (with associated R functions) outlining the steps to design and implement an R code to complete the exercise. Students are asked to develop their R code and execute it in the code chunk. You can type your code directly in the window/console and execute it using the Run Code button. The instructor has also provided the solution to the exercise, which can be accessed by pressing the Solution button available in the top banner of the code chunk window.
  3. Questions: These questions are designed to test students’ knowledge of the code.
  4. Solution: A commented solution proposed by the instructor is available in a code chunk. Students can execute the code and inspect outputs produced to generate the final answer.
  5. YouTube tutorial: Here, a short YouTube video presents the procedure to launch a learnr interactive tutorial and briefly presents the exercise. The main objective of this video is to show students that it is quite easy to integrate a video into their tutorials.

Finally, the instructor wants to stress that students are not obliged to design their tutorials using learnr. You can also use the R Markdown language/syntax and output tutorials in HTML or PDF formats (more on this subject in Chapter 1).

This document highlights steps to execute the interactive tutorial designed by the instructor.

2.6.1 Installation of required package

Open RStudio and install the learnr package from CRAN by typing the following command in the console:

install.packages("learnr")

2.6.2 Files location & description

Files associated to this tutorial are deposited on the Google Drive under this path: Reproducible_Science -> Bioinformatic_tutorials -> Intro_interactive_tutorial

There are two main files:

  • README.html: The documentation to install R package and run interactive tutorial.
  • EEB603_Interactive_tutorial.Rmd: The interactive tutorial written in R Markdown, but requiring functions from learnr.

2.6.3 YouTube video

The instructor has made a video explaining the procedure to launch the interactive tutorial (based on option 1; see below) as well as some additional explanations related to the exercise.

2.6.4 Running tutorial

  1. Download the Intro_interactive_tutorial folder and save it on your local computer.
  2. Set working directory to location of EEB603_Interactive_tutorial.Rmd.
  • This can be done as follows in RStudio: Session -> Set Working Directory -> Choose Directory....
  • You can also use the setwd() function to set your working directory (e.g. setwd("~/Documents/Course_Reproducible_Science/Timetable/Intro_interactive_tutorial")).
  3. Run the tutorial. There are two options to run the interactive tutorial:
  • Option 1: Open EEB603_Interactive_tutorial.Rmd in RStudio and press the Run Document button on the upper side bar to launch the tutorial. It will appear in the Viewer panel (in the bottom right corner). You can open the interactive tutorial in your web browser by clicking on the third icon at the top of the Viewer panel. This procedure is also explained in the YouTube video.
  • Option 2: Type the following command in the console. This will open a window in RStudio. You can also open the tutorial in your web browser by pressing the Open in Browser button.
rmarkdown::run("EEB603_Interactive_tutorial.Rmd")
  4. Enjoy going through the tutorial!

2.6.5 Learning syntax to develop interactive tutorials

The procedure to develop interactive tutorials using learnr is presented here. To learn more about the syntax, the instructor encourages you to open EEB603_Interactive_tutorial.Rmd in RStudio and inspect the document. This will allow learning syntax and associated procedures to:

  • Design exercises (with embedded R code chunks and associated solutions).
  • Include multiple choices questions.
  • Embed a YouTube video.
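To give a flavor of the syntax, the sketch below shows how an exercise chunk and a multiple-choice question are declared in a learnr .Rmd file (a minimal example based on the learnr documentation; chunk labels, question text and answers are illustrative, and the file typically declares output: learnr::tutorial and runtime: shiny_prerendered in its YAML header, with library(learnr) loaded in the setup chunk):

```{r square-exercise, exercise=TRUE}
# Students write their R code here, e.g. computing the square of 4
```

```{r quiz-example}
quiz(
  question("Which R function returns a value from a UDF?",
    answer("print()"),
    answer("return()", correct = TRUE)
  )
)
```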

3 Chapter 2

3.1 Introduction

In this chapter, we are investigating the causes leading to irreproducible science and discussing ways to mitigate this crisis. We are using results from the survey published by Baker (2016) as a baseline to support our discussions.

3.2 What’s the difference between reproduction and replication?

Before delving into the causes leading to irreproducible science, we need to look into the differences between replication and reproduction. The material presented here has been adapted from RESCIENCE C.

  • Reproduction of a study means running the same computation (mostly referred to as code in this class) on the same input data, and then checking if the results are the same. Reproduction can be considered as software testing at the level of a complete study.

  • Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions, but varying the technical details. For instance, it would mean using a protocol aiming at extracting genomic DNA developed on tomato and applying it on sagebrush. For computational work, this would mean using different software, running a simulation from different initial conditions, etc. The idea is to change something that everyone believes should not matter (e.g. both tomato and sagebrush are plants and have DNA), and see if the scientific conclusions are affected or not.

Overall, reproduction verifies that a computation was recorded with enough detail that it can be analyzed later or by someone else. On the other hand, replication explores which details matter for reaching a specific scientific conclusion. A replication attempt is most useful if reproducibility has already been verified. Otherwise, if replication fails or leads to different conclusions, you cannot trace the differences in the results back to the underlying code and data.

3.3 Online resources

The websites listed here have been used to design this chapter:

  • RESCIENCE C - Reproducible Science is good. Replicated Science is better.
  • Retraction Watch - Tracking retractions as a window into the scientific process.
  • Reproducibility and research integrity - A Special Issue published in BMC Research Notes.
  • Research Integrity and Peer Review - An international, open access, peer reviewed journal that encompasses all aspects of integrity in research publication, including peer review, study reporting, and research and publication ethics.
  • Peer Review Week - Peer Review Week is a community-led yearly global virtual event celebrating the essential role that peer review plays in maintaining research quality. The event brings together individuals, institutions, and organizations committed to sharing the central message that quality peer review in whatever shape or form it may take is critical to scholarly communication.

3.4 Teaching material

The pdf of the presentation can be downloaded here. The pdf of Baker (2016) is available on our shared Google Drive at this path:

Reproducible_Science > Publications > Baker_Nature_2016.pdf

4 Chapter 3

4.1 Introduction

Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery.

In this chapter, we will survey measures that can be adopted to optimize key elements of the scientific process by especially focusing on:

  • Methods
  • Reporting and dissemination
  • Reproducibility
  • Evaluation
  • Incentives

These measures aim at minimizing threats to the scientific process therefore making it more open and transparent (Figure 4.1).

While studying this topic, keep in mind this quote from Richard Feynman:

The first principle is that you must not fool yourself – and you are the easiest person to fool.

Overview of the scientific process and threats preventing reproducibility of the study (indicated in red). Abbreviations: HARKing: hypothesizing after the results are known; P-hacking: data dredging.

Figure 4.1: Overview of the scientific process and threats preventing reproducibility of the study (indicated in red). Abbreviations: HARKing: hypothesizing after the results are known; P-hacking: data dredging.

4.2 Learning outcomes

  • Review the steps involved in the scientific process.
  • Identify and define threats to the scientific process.
  • Discuss solutions to mitigate threats to the scientific process.
  • Learn about the Transparency and Openness Promotion (TOP) guidelines and their applications in scientific publications and funding activities.

4.3 Publications and web resources

This chapter is mostly based on the following resources:

The website of the Center for Open Science (https://cos.io) was also used to design the chapter content. More specifically, visit this webpage on study pre-registration.

4.4 Teaching material

The presentation slides for this chapter can be downloaded here.

4.5 Scientific process in a nutshell

The scientific process can be subdivided into six phases (Figure 4.1):

  1. Ask a question, review the literature and generate a hypothesis.
  2. Design study.
  3. Collect data.
  4. Analyze data and test hypothesis.
  5. Interpret results.
  6. Publish and/or conduct next experiment.

In order to facilitate the understanding of the material taught in this chapter, the six phases of the scientific process are split into three categories reflecting the study’s progress:

  • Pre-study: including phases 1 and 2.
  • Study: including phases 3 to 5.
  • Post-study: including phase 6.

The distinction between those categories is very important to ensure the reproducibility and transparency of the study and to avoid falling into the traps described in Figure 4.1. For instance, considering the pre-study category as an independent step in the scientific process will promote study pre-registration and therefore avoid HARKing, P-hacking and publication bias (see Figure 4.1 and the Glossary section). The recognition of the post-study category is also very important since it encourages pre- and post-publication reviews, therefore supporting better dissemination and transparency of your research.

4.6 Threats to the scientific process

A hallmark of scientific creativity is the ability to see novel and unexpected patterns in data. However, a major challenge for scientists is to be open to new and important insights while simultaneously avoiding being misled by our tendency to see structure in randomness. The combination of:

  1. Apophenia: the tendency to see patterns in random data.
  2. Confirmation bias: the tendency to focus on evidence that is in line with our expectations or favored explanation.
  3. Hindsight bias (also known as knew-it-all-along effect): the tendency to see an event as having been predictable only after it has occurred.

Together, these factors can easily lead us to false conclusions and therefore become threats to our scientific process. Some of these threats (e.g. HARKing, P-hacking) are displayed in Figure 4.1 and definitions are provided in the Glossary section.

4.7 Chapter content

Our objective is to tackle threats by providing measures to ensure reproducibility and transparency. Following the approach proposed by Munafo et al. (2017), the measures studied in this chapter to ensure research reproducibility and transparency are organized into five categories. In addition, when possible, categories contain specific working themes designed to minimize the threats discussed above (see Figure 4.1).

  1. Methods
    • Protecting against cognitive biases.
    • Improving methodological training.
    • Improving chain of evidence supporting (identification of) biodiversity occurrences.
    • Implementing independent methodological support.
    • Encouraging collaboration and team science.
  2. Reporting & dissemination
    • Promoting study pre-registration.
    • Improving the quality of reporting.
    • Promoting linkage of publication, data, code & analyses into a unified environment (see chapter 1).
  3. Reproducibility
    • Promoting transparency and open science.
  4. Evaluation
    • Diversifying peer review: pre- and post-publication reviews.
  5. Incentives

These measures are not intended to be exhaustive, but aim at providing a broad, practical and evidence-based set of actions that can be implemented by researchers, institutions, journals and funders. They will also provide a road map for students to design their thesis projects.

4.8 Methods

This section describes measures that can be implemented when performing research (including, for example, study design, methods, statistics, and collaboration).

4.8.1 Protecting against cognitive biases

There is a substantial literature on the difficulty of avoiding cognitive biases. An effective solution to mitigate self-deception and unwanted biases is blinding. In some research contexts, participants and data collectors can be blinded to the experimental condition that participants are assigned to, and to the research hypotheses, while the data analyst can be blinded to key parts of the data. For example, during data preparation and cleaning (see chapters 6, 7), the identity of experimental conditions or the variable labels can be masked so that the output is not interpretable in terms of the research hypothesis.

Pre-registration of the study design, primary outcome(s) and analysis plan (see the Promoting study pre-registration section below) is a highly effective form of blinding because the data do not exist and the outcomes are not yet known.

4.8.2 Improving methodological training

Research design and statistical analysis are mutually dependent. Common misperceptions, such as the interpretation of P values, limitations of null-hypothesis significance testing, the meaning and importance of statistical power, the accuracy of reported effect sizes, and the likelihood that a sample size that generated a statistically significant finding will also be adequate to replicate a true finding, could all be addressed through improved statistical training. These concepts are presented in BIOL603, ADVANCED BIOMETRY.

4.8.3 Improving chain of evidence supporting (identification of) biodiversity occurrences

4.8.3.1 Are occurrences in GBIF scientifically sound?

Primary biodiversity occurrence data are at the core of research in Ecology & Evolution. They are, however, no longer gathered as they used to be and the mass-production of observation-based (OB) occurrences is overthrowing the collection of specimen-based (SB) occurrences. Troudet et al. (2018) analyzed 536 million occurrences from the Global Biodiversity Information Facility (GBIF) database and concluded that from 1970 to 2016 the proportion of occurrences marked as traceable to tangible material (i.e., SB occurrences) fell from 68% to 18%. Moreover, the authors added that most of those specimen-based occurrences could not be readily traced back to a specimen because the necessary information was missing. This alarming trend (i.e. the low traceability of occurrences and therefore the low confidence of species identifications based on those observations) threatens the reproducibility of biodiversity research. For instance, low confidence in species identifications prevents mining into larger databases (to infer e.g. species distribution, ecology, phenology, conservation status, phylogenetic position) to gather data allowing to test hypotheses.

Overall, in their study, Troudet et al. (2018) advocated that SB occurrences must be gathered, as a warrant to allow both repeating ecological and evolutionary studies and conducting rich and diverse investigations. They also suggested that, when voucher specimens are impossible to secure, they must be replaced with OB occurrences combined with ancillary data (e.g., pictures, recordings, samples, DNA sequences). Ancillary data are instrumental for the usefulness of biodiversity occurrences and, sadly, those tend not to be shared. Such an approach will help ensure that recently collected primary biodiversity data do not become partly obsolete when identifications are in doubt.

4.8.3.2 Specimens are more than just dead stuff stored in dusty cabinets

Underpinning biodiversity occurrences with specimens (deposited in Natural History Museums and Botanical Gardens) allows:

  • Confirming species identifications.
  • Gathering ecological data from labels (e.g. GPS point, vegetation type, soil).
  • Gathering morphological features (conduct measurements allowing e.g. assessing phenotypic variability across species range).
  • Sampling tissues to support additional inferences:
    • molecular analyses (e.g. genotyping or sequencing).
    • anatomical analyses.
    • physiological analyses (e.g. infer diet based on carbon isotope analyses).
    • etc…

Those additional data provided by specimens are key in the data collecting phase and will allow further analyses to thoroughly test hypotheses. We fully understand that SB occurrences can be problematic to process (especially in Ecology), but we would urge students to consider gathering ancillary data to back-up their observations and make sure their analyses are reproducible.

4.8.4 Implementing independent methodological support

The need for independent methodological support is well-established in some areas — many clinical trials, for example, have multidisciplinary trial steering committees to provide advice and oversee the design and conduct of the trial. The need for these committees grew out of the well-understood financial conflicts of interest that exist in many clinical trials. Including independent researchers (particularly methodologists with no personal investment in a research topic) in the design, monitoring, analysis or interpretation of research outcomes may mitigate some of those influences, and can be done either at the level of the individual research project or through a process facilitated by a funding agency.

4.8.5 Encouraging collaboration and team science

Studies of statistical power persistently find it to be below (sometimes well below) 50%, across both time and the different disciplines studied (see Munafo et al., 2017 and references therein). Low statistical power increases the likelihood of obtaining both false-positive and false-negative results, meaning that it offers no advantage if the purpose is to accumulate knowledge. Despite this, low-powered research persists because of dysfunctional incentives, poor understanding of the consequences of low power, and lack of resources to improve power. Team science is a solution to the latter problem — instead of relying on the limited resources of single investigators, distributed collaboration across many study sites facilitates high-powered designs and greater potential for testing generalizability across the settings and populations sampled. This also brings greater scope for multiple theoretical and disciplinary perspectives, and a diverse range of research cultures and experiences, to be incorporated into a research project.

4.9 Reporting and dissemination

This section describes measures that can be implemented when communicating research (including, for example, reporting standards, study pre-registration, and disclosing conflicts of interest).

4.9.1 Promoting study pre-registration

Progress in science relies in part on generating hypotheses with existing observations and testing hypotheses with new observations. This distinction between postdiction and prediction is appreciated conceptually, but is not respected in practice. Mistaking the generation of postdictions for the testing of predictions reduces the credibility of research findings. However, ordinary biases in human reasoning, such as hindsight bias, make it hard to avoid this mistake. An effective solution is to define the research questions and analysis plan before observing the research outcomes, a process called pre-registration. Pre-registration distinguishes analyses and outcomes that result from predictions from those that result from postdictions. A variety of practical strategies are available to make the best possible use of pre-registration in circumstances that fall short of the ideal application, such as when the data are pre-existing. Services are now available for pre-registration across all disciplines, facilitating a rapid increase in the practice. Widespread adoption of pre-registration will increase the distinction between hypothesis generation and hypothesis testing and will improve the credibility of research findings (in terms of research quality and transparency).

In its simplest form study pre-registration (see Nosek et al., 2018) may simply comprise the registration of the basic study design, but it can also include a detailed pre-specification of the study procedures, outcomes and statistical analysis plan.

Study pre-registration was introduced to address two major problems:

  1. Publication bias.
  2. Analytical flexibility (in particular outcome switching).

Please see the Glossary section for definitions of these concepts.

4.9.2 Improving the quality of reporting

Pre-registration will improve discoverability of research, but discoverability does not guarantee usability. Poor usability reflects difficulty in evaluating what was done, in reusing the methodology to assess reproducibility, and in incorporating the evidence into systematic reviews and meta-analyses. Improving the quality and transparency in the reporting of research is necessary to address this.

4.9.2.1 The Transparency and Openness Promotion (TOP) guidelines: A means to improve usability

TOP guidelines (published in Nosek et al., 2015) offer standards as a basis for journals and funders to incentivize or require greater transparency in planning and reporting of research. More precisely, TOP guidelines include eight modular standards, each with three levels of increasing stringency (Figure 4.2). Journals select which of the eight transparency standards they wish to implement and also select a level of implementation for each. These features provide flexibility for adoption depending on disciplinary variation, but simultaneously establish community standards.

4.9.2.1.1 TOP transparency modular standards

Please find below the list of eight TOP modular standards:

  1. Citation standards (citation of data sets etc.).
  2. Data transparency (data archiving).
  3. Analytic methods transparency (code archiving).
  4. Research materials transparency (materials archiving).
  5. Design and analysis transparency (reporting of details of methods and results).
  6. Pre-registration of studies (registering study prior to initiation).
  7. Pre-registration of analysis plans (registering analysis plan prior to study initiation).
  8. Replication (a study designed to replicate a previously published study).

Each category has template text for three levels of transparency: Level 1, Level 2, and Level 3 (Figure 4.2). Adopting journals select among the levels based on their readiness to adopt milder to stronger transparency standards for authors and researchers. Many factors influence level selection, including considerations for implementation and concordance with disciplinary norms and expectations.

Table presenting the TOP standards and their associated levels of transparency.

Figure 4.2: Table presenting the TOP standards and their associated levels of transparency.

Over 1,000 journals or organizations had implemented one or more TOP-compliant policies as of August 2018 (e.g. Ecology Letters, The Royal Society, Science). The full list of journals implementing TOP guidelines is available at this URL: https://osf.io/2sk9f/

4.9.3 Promoting linkage of publication, data, code and analyses into a unified environment

The material presented in chapter 1 focusing on R Markdown (as implemented in RStudio) is a response to the need to provide a unified environment linking publication, code and data.

The Center for Open Science also proposes a platform called “Open Science Framework” or OSF to achieve the same goal (Figure 4.3). OSF is a free and open source project management repository that supports researchers across their entire project life-cycle. As a collaboration tool, OSF helps researchers work on projects privately with a limited number of collaborators and make parts of their projects public, or make all the project publicly accessible for broader dissemination with citable, discoverable DOIs. As a workflow system, OSF enables connections to the many products researchers already use to streamline their process and increase efficiency.

With OSF’s workflow and storage integrations, researchers can manage their entire projects from one place. The OSF workflow connects the valuable research tools researchers are already using, so that they can effectively share the story of their research projects and eliminate data silos and information gaps (Figure 4.3). The OSF ecosystem is designed to allow all those tools to work together the way researchers do, removing barriers to collaboration and knowledge (Figure 4.3).

Overview of the OSF workflow and connections with other widely used software.

Figure 4.3: Overview of the OSF workflow and connections with other widely used software.

4.10 Reproducibility

This section describes measures that can be implemented to support verification of research (including, for example, sharing data and methods).

4.10.1 Promoting transparency and open science

Science is a social enterprise: independent and collaborative groups work to accumulate knowledge as a public good. The credibility of scientific claims is rooted in the evidence supporting them, which includes the methodology applied, the data acquired, the process of methodology implementation, and data analysis and outcome interpretation. Claims become credible by the community reviewing, criticizing, extending and reproducing the supporting evidence. However, without transparency, claims only achieve credibility based on trust in the confidence or authority of the originator. Transparency is superior to trust.

Open science refers to the process of making the content and process of producing evidence and claims transparent and accessible to others. Transparency is a scientific ideal, and adding ‘open’ should therefore be redundant. In reality, science often lacks openness: many published articles are not available to people without a personal or institutional subscription, and most data, materials and code supporting research outcomes are not made accessible, for example, in a public repository (however this is rapidly changing with several initiatives, e.g., Dryad digital repository).

Very little of the research process (for example, study protocols, analysis workflows, peer review) is accessible because, historically, there have been few opportunities to make it accessible even if one wanted to do so. This has motivated calls for open access, open data and open workflows (including analysis pipelines), but there are substantial barriers to meeting these ideals, including vested financial interests (particularly in scholarly publishing) and few incentives for researchers to pursue open practices.

4.10.1.1 Promoting open science is good, but “open” costs fall onto researchers…

To promote open science, several open-access journals were recently created (e.g. BMC, Frontiers, PLoS). These journals facilitate sharing of scientific research (and associated methods, data and code), but they are quite expensive (>$1500 on average). Waivers can be obtained for researchers based in certain countries, or institutions can sponsor those initiatives and have a certain number of papers per year published for “free”. However, if you do not fall into one of these categories, it will be quite challenging to pay for these costs without support from a grant (NSF is making an effort to promote open science). The EEB program might be able to support some of those costs, but this will depend on the yearly budget (and when you ask). This topic is further investigated in Chapter 4.

4.11 Evaluation

This section describes measures that can be implemented when evaluating research (including, for example, peer review).

4.11.1 Diversifying peer review: pre- and post-publication reviews

For most of the history of scientific publishing, two functions have been confounded — evaluation and dissemination. Journals have provided dissemination via sorting and delivering content to the research community, and gate-keeping via peer review to determine what is worth disseminating. However, with the advent of the internet, individual researchers are no longer dependent on publishers to bind, print and mail their research to subscribers. Dissemination is now easy and can be controlled by researchers themselves (see examples of preprint publishers below).

With increasing ease of dissemination, the role of publishers as a gate-keeper is declining. Nevertheless, the other role of publishing — evaluation — remains a vital part of the research enterprise. Conventionally, a journal editor will select a limited number of reviewers to assess the suitability of a submission for a particular journal. However, more diverse evaluation processes are now emerging, allowing the collective wisdom of the scientific community to be harnessed. For example, some preprint services support public comments on manuscripts, a form of pre-publication review that can be used to improve the manuscript (see below). Other services, such as PubMed Commons and PubPeer, offer public platforms to comment on published works, facilitating post-publication peer review. At the same time, some journals are trialing ‘results-free’ review, where editorial decisions to accept are based solely on review of the rationale and study methods (that is, results-blind; for instance, PLoS ONE is applying this approach).

Both pre- and post-publication peer review mechanisms dramatically accelerate and expand the evaluation process. By sharing preprints, researchers can obtain rapid feedback on their work from a diverse community, rather than waiting several months for a few reviews in the conventional, closed peer review process. Using post-publication services, reviewers can make positive and critical commentary on articles instantly, rather than relying on the laborious, uncertain and lengthy process of authoring a commentary and submitting it to the publishing journal for possible publication.

4.11.1.1 Preprint services to disseminate your research early and even get feedback

bioRxiv (pronounced “bio-archive”) is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.

Articles are not peer-reviewed, edited, or typeset before being posted online. However, all articles undergo a basic screening process for offensive and/or non-scientific content and for material that might pose a health or bio-security risk and are checked for plagiarism. No endorsement of an article’s methods, assumptions, conclusions, or scientific quality by Cold Spring Harbor Laboratory is implied by its appearance in bioRxiv. An article may be posted prior to, or concurrently with, submission to a journal but should not be posted if it has already been accepted for publication by a journal.

PeerJ Preprints is a ‘preprint server’ for the Biological Sciences, Environmental Sciences, Medical Sciences, Health Sciences and Computer Sciences. A PeerJ Preprint is a draft of an article, abstract, or poster that has not yet been peer-reviewed for formal publication. Submit a draft, incomplete, or final version of your work for free.

Submissions to PeerJ Preprints are not formally peer-reviewed. Instead they are screened by PeerJ staff to ensure that they fit the subject area; do not contravene any of their policies; and that they can reasonably be considered a part of the academic literature. If a submission is found to be unsuitable in any of these respects then it will not be accepted for posting. Content which is considered to be non-scientific or pseudo-scientific will not pass the screening.

4.12 Incentives

Publication is the currency of academic science and increases the likelihood of employment, funding, promotion and tenure. However, not all research is equally publishable. Positive, novel and clean results are more likely to be published than negative results, replications and results with loose ends; as a consequence, researchers are incentivized to produce the former, even at the cost of accuracy (see Nosek et al., 2012). These incentives ultimately increase the likelihood of false positives in the published literature. Shifting the incentives therefore offers an opportunity to increase the credibility and reproducibility of published results.

Funders, publishers, societies, institutions, editors, reviewers and authors all contribute to the cultural norms that create and sustain dysfunctional incentives. Changing the incentives is therefore a problem that requires a coordinated effort by all stakeholders to alter reward structures. There will always be incentives for innovative outcomes — those who discover new things will be rewarded more than those who do not. However, there can also be incentives for efficiency and effectiveness — those who conduct rigorous, transparent and reproducible research could be rewarded more than those who do not. There are promising examples of effective interventions for nudging incentives. For example, journals are adopting:

  • Badges to acknowledge open practices.
  • Registered Reports as a results-blind publishing model.
  • TOP guidelines to promote openness and transparency.

Collectively, and at scale, such efforts can shift incentives such that what is good for the scientist is also good for science — rigorous, transparent and reproducible research practices producing credible results.

4.13 Glossary

HARKing: HARKing (hypothesizing after the results are known) is defined as presenting a post hoc hypothesis (i.e., one based on or informed by one’s results) in one’s research report as if it were, in fact, an a priori hypothesis.

Outcome switching: refers to the possibility of changing the outcomes of interest in the study depending on the observed results. A researcher may include ten variables that could be considered outcomes of the research, and — once the results are known — intentionally or unintentionally select the subset of outcomes that show statistically significant results as the outcomes of interest. The consequence is an increase in the likelihood that reported results are spurious by leveraging chance, while negative evidence gets ignored.

P-hacking: also known as “Data dredging” is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect. This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.

Publication bias: also known as the file drawer problem, refers to the fact that many more studies are conducted than published. Studies that obtain positive and novel results are more likely to be published than studies that obtain negative results or report replications of prior results. The consequence is that the published literature indicates stronger evidence for findings than exists in reality.

5 Chapter 4

5.1 Introduction

Balancing Open Science to support reproducibility with the values of your stakeholders can be a challenging task.

Nowadays, funding agencies and journals are moving forward to promote Open Science and we started discussing this approach in Chapter 3. For instance, the White House is requiring immediate public access to all U.S.-funded research papers by 2025 without an embargo or cost. You can learn more about this initiative in this Science news article (doi: 10.1126/science.ade6076).

5.2 Objectives

The overarching objective of this chapter is to study the mechanisms implemented to promote Open Science while highlighting circumstances where other principles might have to be adopted to respect different views, opinions and/or data sovereignty.

Overall, the content of this chapter aims at providing solutions to tackle threats to the scientific process (discussed in Chapters 2 and 3), as well as information to manage the data life cycle and research dissemination.

5.3 Learning outcomes

5.4 Web resources and publications

This chapter is based on:

Web resources

Publications

5.5 What is Open Science?

The Organisation for Economic Co-operation and Development (OECD) defines Open Science as follows:

“to make the primary outputs of publicly funded research results – publications and the research data – publicly accessible in digital format with no or minimal restriction”.

Several other agencies argue that the definition of Open Science should be expanded. For them, Open Science is about extending the principles of openness to the whole research cycle, fostering sharing and collaboration as early as possible thus entailing a systemic change to the way science and research is done.

5.6 What are the pillars of Open Science?

The 7 pillars of Open Science are:

5.6.1 FAIR Data

The FAIR Principles (Wilkinson et al., 2016) guide researchers in producing research outputs that are (Figure 5.1):

  • Findable: making research outputs discoverable by the wider academic community and the public.
  • Accessible: using unique identifiers, metadata and a clear use of language and access protocols.
  • Interoperable: applying standards to encode and exchange data and metadata. Interoperability is the ability of different systems, devices, applications or products to connect and communicate in a coordinated way, without effort from the end user.
  • Reusable: enabling the re-purposing of research outputs to maximize their research potential.

When combined, these four elements are designed to help:

  • Lower barriers to research outputs
  • Facilitate potential secondary researchers finding, understanding, reusing and re-purposing data to realize additional research opportunities
  • Maximize existing resources

The FAIR Principles (source: SangyaPundir, CC BY-SA 4.0).

Figure 5.1: The FAIR Principles (source: SangyaPundir, CC BY-SA 4.0).

5.6.1.1 The FAIR Guiding Principles

Wilkinson et al. (2016) have provided the following guidelines that have to be applied to support the FAIR Principles:

  1. To be Findable:
    • F1. (meta)data are assigned a globally unique and persistent identifier
    • F2. data are described with rich metadata (defined by R1 below)
    • F3. metadata clearly and explicitly include the identifier of the data it describes
    • F4. (meta)data are registered or indexed in a searchable resource
  2. To be Accessible:
    • A1. (meta)data are retrievable by their identifier using a standardized communications protocol
      • A1.1 the protocol is open, free, and universally implementable
      • A1.2 the protocol allows for an authentication and authorization procedure, where necessary
    • A2. metadata are accessible, even when the data are no longer available
  3. To be Interoperable:
    • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
    • I2. (meta)data use vocabularies that follow FAIR principles
    • I3. (meta)data include qualified references to other (meta)data
  4. To be Reusable:
    • R1. meta(data) are richly described with a plurality of accurate and relevant attributes
      • R1.1. (meta)data are released with a clear and accessible data usage license
      • R1.2. (meta)data are associated with detailed provenance
      • R1.3. (meta)data meet domain-relevant community standards

5.6.2 Research Integrity

This is the practice of researchers acting honestly, reliably and respectfully, and being accountable for their actions.

At BSU, the Office of Research Compliance provides information and guidance on research integrity.

5.6.3 Next Generation Metrics

The next-generation metrics pillar of Open Science seeks to catalyze a shift in cultural thinking around the way bibliometrics are used in research, particularly when evaluating quality, and to go beyond simple citation counts and journal impact factors (Figure 5.2). Appropriate metrics, drawn from different sources and describing different things, can help us gain a broader understanding of the significance and impact of research.

Example of new metrics associated to publications to assess impact of research beyond the impact factor.

Figure 5.2: Example of new metrics associated to publications to assess impact of research beyond the impact factor.

5.6.3.0.1 The San Francisco Declaration on Research Assessment - DORA

As an example showing changes in how research is perceived, more institutions and funders are supporting the San Francisco Declaration on Research Assessment (DORA) and openly rejecting the use of quantitative metrics commonly associated with journal impact factors as a measure of research quality. Among many others, Springer Nature has joined DORA and here are their commitments.

5.6.3.0.2 Altmetric

Another example is Altmetric, which tracks and demonstrates the reach and influence of your work among key stakeholders, mainly by focusing on social media (Figure 5.2).

5.6.4 Future of Scholarly Communication

The future of scholarly communication is one of the most prominent pillars of Open Scholarship given its intention to shift the current academic publishing model towards fully Open Access. We will develop this topic further below in our sections on open access and associated licenses.

5.6.5 Citizen Science

Citizen Science is a movement towards members of the public having a greater role within research, recognizing the invaluable role they play in providing insights a researcher may not typically have. Examples of such initiatives include eBird and iNaturalist.

Harnessing the advantages of the internet, openly available software packages and local knowledge, citizen science brings about a change in the way research is conducted – no longer limited to academic researchers, it encourages collaboration from groups across society.

Wagenknecht et al. (2021) discuss this topic and its implementation in the case of a European Citizen Science project, and Groom et al. (2017) reflect on the role of Citizen Science in biodiversity research. Overall, citizen science could really propel your research forward, but you have to be aware of its potential pitfalls. Please see the section in Chapter 3 where this topic was discussed.

5.6.6 Education and Skills

This pillar focuses on identifying the training needs of researchers and addressing any gaps in knowledge and skills around engaging with Open Science, such as making publications openly accessible, managing research data in line with the FAIR Principles (Figure 5.1) and acting with integrity.

All researchers at all levels should have access to education and skills programmes to support their work and continued learning. Further, skill development programmes should be opened up to other stakeholders in research, such as professional staff (including librarians and data managers) and members of the public, to facilitate the undertaking of citizen science. At BSU, the Albertsons Library has resources on this topic and also runs workshops and seminars.

5.6.7 Rewards and Initiatives

Fostering engagement with the principle of Open Science requires reward and recognition of the efforts to do so – this pillar addresses barriers and champions best practice.

A perceived lack of reward and recognition for work undertaken to manage research data and make publications openly accessible discourages researchers from engaging with the principle of Open Science. Work falling under this pillar seeks to address these challenges and champion engagement with Open Science practices. Please see Chapter 3 for more details on some of these initiatives and rewards to promote Open Science.

5.7 Open Access: A tool to promote Open Science?

Open access (OA) is a set of principles and a range of practices through which research outputs are distributed online, free of access charges or other barriers (Figure 5.3). Although publications are free of charge to users, researchers usually have to pay publication fees (typically between $1500 and $2500) to publish their research as OA. With open access strictly defined (according to the 2001 definition), barriers to copying or reuse are also reduced or removed by applying an open license for copyright. An open license is a license which allows others to reuse another creator’s work as they wish. Without a special license, these uses are normally prohibited by copyright, patent or commercial license. Most free licenses are worldwide, royalty-free, non-exclusive, and perpetual.

5.7.1 The Open Access conundrum

Since the revenue of most open access journals is earned from publication fees charged to the authors, OA publishers are motivated to increase their profits by accepting low-quality papers and by not performing thorough peer review. On the other hand, the prices for OA publications in the most prestigious journals have exceeded $5000, making such a publishing model unaffordable to a large number of researchers.

The increase in publishing cost has been called the “Open-Access Sequel to [the] Serials Crisis” (Khoo, 2019). To provide further context, the serials crisis involves unsustainable budgetary pressures on libraries due to hyperinflation of subscription costs. In this framework, OA was proposed as one way of coping with these costs because articles would not require ongoing subscriptions to remain accessible, but findings by Khoo (2019) suggest that both systems have limitations.

5.8 Creative Commons licenses

Most open access journals use Creative Commons (CC) licenses, and it is important that you and your co-authors are aware of their implications prior to submitting your manuscript, especially in regard to potential commercial usage of your work by third parties (Figure 5.3). A CC license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted “work”.

A CC license is used when an author wants to give other people the right to share, use, and build upon a work that the author has created. CC provides an author flexibility (e.g., they might choose to allow only non-commercial uses of a given work) and protects the people who use or redistribute an author’s work from concerns of copyright infringement as long as they abide by the conditions that are specified in the license by which the author distributes the work. To support this flexibility, the CC initiative is providing an online tool to choose the best license to share your work, which I would advise you to consult to identify what license best suits your work. To know more about CC licenses, please visit this Wikipedia page.

Front page of Ecology and Evolution, an open access journal publishing articles under the Creative Commons licensing system.

Figure 5.3: Front page of Ecology and Evolution, an open access journal publishing articles under the Creative Commons licensing system.

5.8.1 Available CC licenses

The available licenses provided by the CC initiative are shown in Figure 5.4.

Overview of Creative Commons licenses sorted from most open to most restrictive (provided by Andrew Child).

Figure 5.4: Overview of Creative Commons licenses sorted from most open to most restrictive (provided by Andrew Child).

The descriptions of these licenses are always presented in this format:

  1. You are free to:
    • Share - Conditions.
    • Adapt - Conditions.
  2. Under the following terms:
    • Attribution - Conditions.
    • No additional restrictions - Conditions.

Here is a list of some of the most commonly used CC licenses, with URLs leading to their descriptions:

5.9 CARE Principles

We have been promoting Open Science, but is this approach always possible?

As pointed out by Carroll et al. (2021), as big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle (Figure 5.5).

Be FAIR and CARE (source: https://www.gida-global.org/care).

Figure 5.5: Be FAIR and CARE (source: https://www.gida-global.org/care).

The CARE Principles detail that the use of Indigenous data should result in tangible benefits for Indigenous collectives through inclusive development and innovation, improved governance and citizen engagement, and equitable outcomes.

  • Collective Benefit: Collective benefit is more likely to be realized when data ecosystems are designed to support Indigenous nations and when the use/reuse of data for resource allocation is consistent with community values.
  • Authority to Control: The United Nations Declaration on the Rights of Indigenous Peoples asserts Indigenous Peoples’ rights and interests in data and their authority to control their data. Access to ‘data for governance’ is vital to support self-determination, and Indigenous nations should be actively involved in ‘governance of data’ to ensure ethical reuse of data.
  • Responsibility: Given that the majority of Indigenous data are controlled by non-Indigenous institutions, there is a responsibility to engage respectfully with those communities to ensure that the use of Indigenous data supports capacity development, increasing community data capabilities, and the strengthening of Indigenous languages and cultures.
  • Ethics: Similarly, Indigenous Peoples’ ethics should inform the use of data across time in order to minimize harm, maximize benefits, promote justice, and allow for future use (see Carroll et al., 2021 for more details).

5.10 Data sovereignty

The CARE Principles (Figure 5.5) are directly connected to the concept of data sovereignty. We provide a definition here; it is especially important to cover this material in this course since data sovereignty could impact your data sharing protocols. More on this topic will be discussed in Chapter 5.

Data sovereignty is the idea that data are subject to the laws and governance structures of the nation where they are collected. The concept is closely linked with data security, cloud computing, network sovereignty and technological sovereignty. Unlike technological sovereignty, which is vaguely defined and can be used as an umbrella term in policymaking, data sovereignty is specifically concerned with questions surrounding the data themselves. Data sovereignty is usually discussed in two ways: in relation to Indigenous groups and Indigenous autonomy from post-colonial states, or in relation to transnational data flow. With the rise of cloud computing, many countries have passed laws around the control and storage of data, all of which reflect measures of data sovereignty; more than 100 countries now have some form of data sovereignty law in place. With self-sovereign identity (SSI), individual identity holders can fully create and control their credentials, although a nation can still issue a digital identity in that paradigm.

6 Chapter 5

6.1 Introduction

This chapter is subdivided into two parts as follows:

  • PART A: Data management.
  • PART B: Reproducible code.

6.2 Data management

6.2.1 Introduction

Good data management is fundamental to research excellence. It produces high-quality research data that are accessible to others and usable in the future (see the TOP guidelines in Chapter 3). The value of data is now explicitly acknowledged through citations (e.g. the GBIF and Dryad repositories provide DOIs to cite datasets), so researchers can make a difference to their own careers, as well as to their fields of research, by sharing and making data available for reuse. These citations have to be directly included in publications in dedicated sections, usually entitled “Data Availability Statement”, located at the end of the article (see Figure 6.1 for an example).

Example of a Data Availability Statement.

Figure 6.1: Example of a Data Availability Statement.

This chapter aims at helping students navigate data management firstly by explaining what data and data management are and why data sharing is important, and secondly by providing advice and examples of best practice in data management.

6.2.2 Learning outcomes

  • Learn about the different types of research data and their specificities.
  • Provide an overview of the data life-cycle.
  • Highlight the benefits of good data management.
  • Study best-practices to planning data management.

6.2.3 Literature and web resources

This chapter is based on:

6.2.4 Teaching material

The presentation slides for Chapter 5 - part A can be downloaded here.

6.2.5 What are research data?

Research data are the factual pieces of information used to test research hypotheses. Data can be classified into five categories:

  • Observational: Data which are tied to time and place and are irreplaceable (e.g. field observations, weather station readings, satellite data).
  • Experimental: Data generated in a controlled or partially controlled environment, which can be reproduced, although it may be expensive to do so (e.g. field plots or greenhouse experiments, chemical analyses).
  • Simulation: Data generated from models (e.g. climate or population modelling).
  • Derived: Data which are not collected directly but inferred from (an)other data file(s) (e.g. a population biomass which has been calculated from population density and average body size data; see the short R sketch after this list).
  • Metadata: Data about data (more about this category later).
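
To make the distinction between raw and derived data concrete, here is a minimal R sketch computing the derived biomass variable mentioned above; the column names and values are purely hypothetical.

# Hypothetical raw observations: population density (individuals per m^2)
# and mean body size (g) recorded at three sites
raw <- data.frame(site = c("A", "B", "C"),
                  density = c(12, 8, 15),            # individuals per m^2
                  mean_body_size = c(2.1, 3.4, 1.8))  # grams
# Derived data: population biomass (g per m^2) inferred from the raw variables
raw$biomass <- raw$density * raw$mean_body_size
raw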

A key challenge facing researchers today is the need to work with different data sources. It is not uncommon for projects to integrate any combination of data types into a single analysis, even drawing on data from disciplines outside Ecology and Evolution. As research becomes increasingly collaborative and interdisciplinary, issues surrounding data management are growing in prevalence.

6.2.6 The data life cycle

Data have a longer lifespan than the project they were created for, as illustrated by the data life-cycle displayed in Figure 6.2.

The data lifecycle

Figure 6.2: The data lifecycle

Some projects may only focus on certain parts of the data life-cycle, such as primary data creation, or reusing others’ data. Other projects may go through several revolutions of the cycle. Either way, most researchers will work with data at all stages throughout their career.

6.2.7 Why should I manage data?

Data management concerns how you plan for all stages of the data life-cycle and implement this plan throughout the research project. Done effectively it will ensure that the data life-cycle is kept in motion. It will also keep the research process efficient and ensure that your data meet all the expectations set by you, funders, research institutions, legislation and publishers (e.g. copyright, data protection).

In order to bring some perspective on this topic, ask yourself this question:

Would a colleague be able to take over my project tomorrow if I disappeared, or make sense of the data without talking to me?

If you can answer YES, then you are managing your data well!

6.2.7.1 The benefits of good data management include:

  • Ensuring data are accurate, complete, authentic and reliable.
  • Increasing research efficiency.
  • Saving time and money in the long run – ‘undoing’ mistakes is frustrating.
  • Meeting funder requirements.
  • Minimizing the risk of data loss.
  • Preventing duplication by others.
  • Facilitating data sharing.

6.2.8 Why should I share my data?

It is increasingly common for funders and publishers to mandate data sharing wherever possible. In addition, some funding bodies require data management and sharing plans as part of grant applications (e.g. NSF). Sharing data can be daunting, but data are valuable resources and their usefulness could extend beyond the original purpose for which they were created. Benefits of sharing data include:

  • Increasing the impact and visibility of research.
  • Encouraging collaborations and partnerships with other researchers.
  • Maximizing transparency and accountability.
  • Encouraging the improvement and validation of research methods.
  • Reducing costs of duplicating data collection.
  • Advancing science by letting others use data in innovative ways.

6.2.8.1 There are circumstances where data can’t be shared!

  • If the datasets contain sensitive information about endangered or threatened species.
  • If the data contain personal information – sharing them may be a breach of protocol (and even break the law).
  • If parts of the data are owned by others – you may not have the rights to share them.

During the planning stages of your project, determine which of your data can’t and shouldn’t be shared. Journal data archiving policies recognize these reasons for not sharing.

6.2.9 Planning data management

Regardless of whether your funder requires you to have a data management or sharing plan as part of a grant application, having such a plan in place before you begin your research project will mean that you are prepared for any data management issues that may come your way (see the Data management checklist section below).

6.2.9.1 Before you start planning

Here are a few things that you should consider before planning your data management workflow:

  • Check funder specifications for data management plans.
  • Consult with your institution (especially regarding resources and policies).
  • Consider your budget.
  • Talk to your supervisor, colleagues and collaborators.

6.2.9.2 What are the key things to consider when planning?

  • Time: Writing a data management plan may well take a significant amount of time. It is not as simple as filling out a template or working through a checklist of things to include. Planning for data management should be thorough before the research project starts to ensure that data management is embedded throughout the research process.
  • Design according to your needs: Data management should be planned and implemented with the purpose of the research project in mind.
  • Roles and responsibilities: Creating a data management plan may be the responsibility of one single person, but data management implementation may involve various people at different stages. One of the major uses of a data management plan is to enable coordinated working and communication among researchers on a project.
  • Review: Plan how data management will be reviewed throughout the project and adapted if necessary; this will help to integrate data management into the research process and ensure that the best data management practices are being implemented. Reviewing will also help to catch any issues early on, before they turn into bigger problems.

6.2.10 Creating data

In the data life-cycle (Figure 6.2), creating datasets occurs as a researcher collects data in the field or lab, and digitizes them to end up with a raw dataset.

# Workflow associated with creating data

Collect data (in the field and/or lab) --> Digitize data --> Raw dataset

!! Perform quality checks @ each step to validate data !!

Quality control during data collection is important because often there is only one opportunity to collect data from a given situation. Researchers should be critical of methods before collection begins – high-quality methods will result in high-quality data. Likewise, when collection is under way, detailed documentation of the collection process should be kept as evidence of quality.

6.2.10.1 Key things to consider during data collection

  • Logistical issues in the field (what are the challenges associated with your fieldwork, e.g. no power to re-charge your batteries?).
  • Calibration of instruments (is the balance that I plan to use accurate enough to weigh my samples?).
  • Taking multiple measurements/observations/samples (to e.g. ensure statistical accuracy or cover genetic variation within population).
  • Creating a template (and associated protocols) for use during data collection to ensure that all information is collected consistently, especially if there are multiple collectors.
  • Describing any conditions during data collection that might affect the quality of the data (e.g. impact of weather conditions, tropical storm, on wildlife observations).
  • Creating an accompanying questionnaire for multiple collectors, asking them any questions which may affect the quality of the data (this could account for slight differences observed during the analytical phase of the project; in other words, it accounts for the effect of data collection on analyses).
  • Widening the possible applications of data, and therefore increasing their impact, by adding variables and parameters to the data (for instance, wider landscape variables), which will encourage reuse and potentially open up new avenues for research.

6.2.11 Data digitization

Data may be collected directly in a digital form using devices that feed results straight into a computer/tablet or they may be collected as hand-written notes. Either way, there will be some level of processing involved to end up with a digital raw dataset.

Key things to consider during data digitization include (a short R sketch illustrating several of these points follows the list):

  • Designing a database structure to organize data and data files (i.e. making sure that observations have unique IDs).
  • Using a consistent format for each data file – e.g. one row represents a complete record and the columns represent all the parameters that make up that record (this is known as spreadsheet format).
  • Atomizing data – make sure that only one piece of data is in each entry (this greatly helps analyses).
  • Using plain text characters (e.g. ASCII, which stands for American Standard Code for Information Interchange used for computers) to ensure data are readable by a maximum number of software programs.
  • Using code – coding assigns a numerical value to variables and allows for statistical analysis of data. Keep coding simple.
  • Describing the contents of your data files in a Readme.txt file or even better in protocol files (later attached as appendixes of your manuscript), or other metadata standard, including a definition of each parameter, the units used and codes for missing values.
  • Keeping raw data raw.
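
The short R sketch below illustrates some of these points (unique IDs, atomized columns, a non-proprietary .csv file and a plain-text Readme). All file and column names are hypothetical.

# Hypothetical example of digitizing field notes into a raw dataset
obs <- data.frame(
  obs_id    = c("OBS_001", "OBS_002"),      # unique identifier for each record
  site      = c("A", "B"),                  # one piece of data per entry (atomized)
  date      = c("2023-06-01", "2023-06-02"),
  count     = c(14, 9),                     # coded as numbers to ease analyses
  collector = c("SB", "AC")
)
# Save the raw data in a non-proprietary format and keep them raw
write.csv(obs, file = "species_counts_raw.csv", row.names = FALSE)
# Document the contents of the data file in a Readme.txt
writeLines(c("species_counts_raw.csv",
             "obs_id: unique record ID; site: site code;",
             "date: sampling date (YYYY-MM-DD); count: number of individuals;",
             "collector: collector initials; missing values coded as NA"),
           con = "Readme.txt")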

6.2.12 Processing data

Data should be processed into a format that is suited to subsequent analyses and ensures long-term usability. Data are at risk of being lost if the hardware and software originally used to create and process them are rendered obsolete. Therefore, data should be well organized, structured, named and versioned in standard formats that can be interpreted in the future (see the Data structure and organisation of files section below).

Here are some guidelines to ensure best processing of data (Figure 6.2):

  • File formats: Data should be written in non-proprietary formats, also known as open standard formats. These files can be used and implemented by anyone, as opposed to proprietary formats which can only be used by those with the correct software installed. The most common format used for spreadsheet data is the comma-separated values file (.csv). Other non-proprietary formats include: plain text files (.txt) for text; and GIF, JPEG and PNG for images.
  • File names and folders: To keep track of data and know how to find them, digital files and folders should be structured and well organized. Use a folder hierarchy that fits the structure of the project and ensure that it is used consistently. Drawing a folder map which details where specific data are saved may be particularly useful if others will be accessing folders, or if there is simply a lot to navigate (store this information in the Readme.txt file).
  • File names should be:
    • unique,
    • descriptive,
    • succinct,
    • naturally ordered and consistent,
    • describing the project, file contents, location, date, researcher’s initials and version.
  • File names should not include spaces – these can cause problems with scripting and metadata.
  • Quality assurance: Checking that data have been edited, cleaned, verified and validated to create a reliable masterfile which will become the basis for further analyses (tidying up data so that they are ready for analyses). Use a scripting language, such as R, to process your data for quality checking so that all steps are documented (see the R sketch after this list for an example of simple checks).
    Assurance checks may include:
    • Identifying estimated values, missing values or double entries.
    • Performing statistical analyses to check for questionable or impossible values and outliers (which may just be typos from data entry).
    • Checking the format of the data for consistency across the dataset.
    • Checking the data against similar data to identify potential problems.
  • Version control: Once the masterfile has been finalized, keeping track of ensuing versions of this file can be challenging, especially if working with collaborators in different locations. A version control strategy will help locate required versions, and clarify how versions differ from one another.
    Version control best practice includes:
    • Deciding how many and which versions to keep.
    • Adopting a systematic file naming convention, with filenames that include the version number and status of the file (e.g. v1_draft, v2_internal, v3_final).
    • Recording what changes took place to create the version in a separate file (e.g. a version table).
    • Mapping versions if they are stored in different locations.
    • Synchronizing versions across different locations.
    • Ensuring any cross-referenced information between files is also subject to version control.
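
As an illustration of the quality-assurance step mentioned above, here is a minimal R sketch. It assumes a hypothetical masterfile (species_counts_raw.csv, as created in the digitization example) and only shows a few generic checks; real checks should be tailored to your data.

# Read the raw data
obs <- read.csv("species_counts_raw.csv")
# Identify missing or estimated values
summary(obs)
colSums(is.na(obs))
# Identify double entries (duplicated record IDs)
obs[duplicated(obs$obs_id), ]
# Flag questionable or impossible values (e.g. negative counts)
obs[which(obs$count < 0), ]
# Check the format of the data for consistency (e.g. column classes)
str(obs)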

6.2.13 Data structure and organisation of files

Inferring the directory tree structure of your project provides a simple and efficient way to summarize the data structure and organization of files related to your project (see above). In UNIX, this can be easily achieved by using the tree package as shown in Figure 6.3.

Infer directory tree structure using the UNIX tree package.

Figure 6.3: Infer directory tree structure using the UNIX tree package.

6.2.13.1 The UNIX tree package: A way to support data structure & organisation

The UNIX tree package was used here to infer the directory structure of the folder Project_ID/, and several arguments were set to obtain information on files that can be used to check file names and locations as well as to support version control:

  • -s: Print the size of each file in bytes along with the name.
  • -D: Print the date of the last modification time (or, if -c is used, the last status change time) for the file listed.

To install the tree package on your Mac, do the following:

#Install tree by typing the following code in a terminal
brew install tree
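
Once tree is installed, it can be run from a terminal or directly from R via system(), as in the short sketch below. This assumes the Project_ID/ folder sits in your working directory and uses the -s and -D arguments described above.

# Run the UNIX tree utility from R with the -s (file size) and
# -D (last modification date) arguments
system("tree -s -D Project_ID")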

6.2.14 Documenting data

Producing good documentation and metadata ensures that data can be understood and used in the long term (Figure 6.2). Documentation (i.e. the manual associated with your data) describes the data, explains any manipulations and provides contextual information – no room should be left for others to misinterpret the data.

All documentation requirements should be identified at the planning stages so you are prepared for it during all stages of the data life-cycle, particularly during data creation and processing. This will avoid having to perform a rescue mission in situations where you have forgotten what has happened or when a collaborator has left without leaving any key documentation behind.

Data documentation includes information at the project and data levels and should cover the following.

6.2.14.1 Project level

  • The project aim, objectives and hypotheses.
  • Personnel involved throughout the project, including who to contact with questions.
  • Details of sponsors.
  • Data collection methods, including details of instrumentation and environmental conditions during collection, copies of collection instructions if applicable.
  • Standards used.
  • Data structure and organisation of files (see Figure 6.3).
  • Software used to prepare and read the data.
  • Procedures used for data processing, including quality control and versioning and the dates these were carried out.
  • Known problems that may limit data use.
  • Instructions on how to cite the data.
  • Intellectual property rights and other licensing considerations.

6.2.14.2 Data level

  • Names, labels and descriptions for variables.
  • Detailed explanation of codes used.
  • Definitions of acronyms or specialist terminology.
  • Reasons for missing values.
  • Derived data created from the raw file, including the code or algorithm used to create them.

If a software package such as R is used for processing data, much of the data level documentation will be created and embedded during analysis.

Metadata help others discover data through searching and browsing online and enable machine-to-machine interoperability of data, which is necessary for data reuse. Metadata are created using either a data center’s deposit form, a metadata editor, or a metadata creator tool, which can be searched for online. Metadata follow a standard structure and come in three forms (a minimal sketch of descriptive metadata is provided after the list):

  • Descriptive – fields such as title, author, abstract and keywords.
  • Administrative – rights and permissions and data on formatting.
  • Structural – explanations of e.g. tables within the data.
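
As a minimal illustration, the R sketch below writes a few descriptive metadata fields to a plain-text file. The fields, values and file name are hypothetical; real submissions would typically follow a dedicated metadata standard or a data center’s deposit form.

# Minimal descriptive metadata written to a plain-text file
metadata <- c(
  "Title:    Species counts at sites A-C (2023)",
  "Author:   Your Name (your.email@university.edu)",
  "Abstract: Field counts of individuals per site collected in June 2023.",
  "Keywords: species counts; field survey; raw data"
)
writeLines(metadata, con = "metadata_species_counts.txt")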

6.2.15 Preserving data

To protect data from loss and to make sure data are securely stored, good data management should include a strategy for backing up and storing data effectively (see where this step fits in Figure 6.2). It is recommended to keep three versions of your data: the original, an external/local copy and an external/remote copy. Talk with your thesis supervisor to identify the best procedure to ensure the preservation of your data.

6.2.15.1 Data backup

When designing a backup strategy, thought should be given to the possible means by which data loss could occur. These include:

  • Hardware failure.
  • Software faults.
  • Virus infection or hacking.
  • Power failure.
  • Human error.
  • Hardware theft or misplacement (especially common during fieldwork).
  • Hardware damage (e.g. fire, flood. Again common in the field).
  • Backups – good backups being overwritten with backups of corrupt data (this happens more often than you would imagine).

An ideal backup strategy should provide protection against all the risks, but it can be sensible to consider which are the most likely to occur in any particular context and be aware of these when designing your backup strategy.

6.2.15.1.1 Things to consider to establish a backup plan
  • Which files require backup.
  • Who is responsible for backups.
  • The frequency of backup needed; this will be affected by how regularly files are updated.
  • Whether full or incremental backups are needed – consider running a mix of frequent incremental backups (capturing recent data changes) along with periodic full backups (capturing a ‘snapshot’ of the state of all files).
  • Backup procedures for each location where data are held, e.g. tablets, home-based computers or remote drives.
  • How to organize and label backup files.

6.2.16 Data storage

Data storage, whether of the original or backed up data, needs to be robust. This is true whether the data are stored on paper or electronically, but electronic storage raises particular issues.

6.2.16.1 Guidelines to ensure best practice for electronic storage of data

  • Use high-quality storage systems (e.g. media, devices).
  • Use non-proprietary formats for long-term software readability.
  • Migrate data files every two to five years to new storage – storage media such as CDs, DVDs and hard drives can degrade over time or become outdated (e.g. floppy disks).
  • Check stored data regularly to make sure nothing has been lost (e.g. by comparing file checksums; see the R sketch after this list).
  • Use different forms of storage for the same data; this also acts as a method of backup (e.g. using remote storage, external hard drives and a network drive).
  • Label and organize stored files logically to make them easy to locate and access.
  • Think about encryption: sensitive data may be regarded as protected while on a password-protected computer, but when backed up onto a portable hard drive they may become accessible to anyone – backups may need to be encrypted or secured too.
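
The hedged R sketch below illustrates the two previous points: copying a master file to an external backup location and checking, via MD5 checksums, that the stored copy is identical to the original. The backup path is hypothetical and should be adapted to your own setup.

# Copy the master file to a (hypothetical) external backup location
file.copy(from = "Data/species_data.csv",
          to   = "/Volumes/Backup_drive/species_data.csv",
          overwrite = TRUE)
# Compare MD5 checksums to make sure nothing has been lost or altered
tools::md5sum(c("Data/species_data.csv",
                "/Volumes/Backup_drive/species_data.csv"))
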
6.2.16.1.1 Storage and backing-up of data

Data can be stored and backed up on:

  • Network drives which are managed by IT staff and are regularly backed up. They ensure secure storage and prohibit unauthorized access to files.
  • Personal devices such as laptops and tablets are convenient for short-term, temporary storage but should not be used for storing master files. They are at high risk of being lost, stolen or damaged.
  • External devices such as hard drives, USB sticks, CDs and DVDs are often convenient because of their cost and portability. However, they do not guarantee long-term preservation and can also be lost, stolen or damaged. High-quality external devices from reputable manufacturers should be used.
  • Remote or online services such as Google Drive or Dropbox use cloud technology to allow users to synchronize files across different computers. They provide some storage space for free but any extra space or functions will have to be bought.
  • Paper! If data files are not too big, do not overlook the idea of printing out a paper copy of important data files as a complement to electronic storage. It may not be convenient, but ink on paper has proven longevity and system independence (as long as you can remember where you put it)!

6.2.17 Sharing data

Research data can be shared in many ways and each method of sharing will have advantages and disadvantages. Ways to share data include:

  • Using a disciplinary data center such as Dryad or GenBank.
  • Depositing data in your research funder’s data center.
  • Depositing data in university repositories.
  • Making data available online via open notebooks or project websites (e.g. the Open Science Framework from the Center for Open Science).
  • Using virtual research environments such as SharePoint and Sakai.

Institutions, funders and journals have specific policies associated with data sharing and you should read those prior to establishing a data sharing plan. Here we will discuss the advantages of archiving your data in data repositories.

6.2.17.1 Data repositories

Archiving your data in a repository is a reliable method of sharing data. Data submitted to repositories will have to conform to submission guidelines, which may restrict which data you share via the repository. However, the advantages of sharing data through these centers include:

  • Assurance for others that the data meet quality standards.
  • Guaranteed long-term preservation.
  • Data are secure and access can be controlled.
  • Data are regularly backed up.
  • Chances of others discovering the data are improved.
  • Citation methods are specified.
  • Secondary usage of the data is monitored.

6.2.18 Reusing data

All aspects of data management lead up to data discovery and reuse by others. Intellectual property rights, licenses and permissions, which concern reuse of data, should be explained in the data documentation and/or metadata.

At this stage of the life-cycle it is important to state your expectations for the reuse of your data, e.g. terms of acknowledgment, citation and co-authorship. Likewise, it becomes the responsibility of others to reuse data effectively, credit the collectors of the original data, cite the original data and manage any subsequent research to the same effect.

When requesting to use someone else’s data it is important to clearly state the purpose of the request, including the idea you will be addressing and your expectations for co-authorship or acknowledgment. Co-authorship is a complex issue and should be discussed with any collaborators at the outset of a project.

Increasing openness to data and ensuring long-term preservation of data fosters collaboration and transparency, furthering research that aims to answer the big questions in ecology and evolution. By implementing good data management practices, researchers can ensure that high-quality data are preserved for the research community and will play a role in advancing science for future generations.

6.2.19 Data management checklist

The data management checklist from the UK Data Archive will help you design your own data management plan and data sharing strategy. The text is available below:

6.2.19.1 Planning

  • Who is responsible for which part of data management?
  • Are new skills required for any activities?
  • Do you need extra resources to manage data, such as people, time or hardware?
  • Have you accounted for costs associated with depositing data for longer-term preservation and access?

6.2.19.2 Documenting

  • Will others be able to understand your data and use them properly?
  • Are your structured data self-explanatory in terms of variable names, codes and abbreviations used?
  • Which descriptions and contextual documentation explain what your data mean, how they were collected and the methods used to create them?
  • How will you label and organize data, records and files?
  • Will you be consistent in how data are cataloged?

6.2.19.3 Formatting

  • Are you using standardized and consistent procedures to collect, process, transcribe, check, validate and verify data, such as standard protocols, templates or input forms?
  • Which data formats will you use? Do formats and software enable sharing and long-term sustainability of data, such as non-proprietary software and software based on open standards?
  • When converting data across formats, do you check that no data, annotation or internal metadata have been lost or changed?

6.2.19.4 Storing

  • Are your digital and non-digital data, and any copies, held in multiple safe and secure locations?
  • Do you need to securely store personal or sensitive data? If so, are they properly protected?
  • If data are collected with mobile devices, how will you transfer and store the data?
  • If data are held in multiple places, how will you keep track of versions?
  • Are your files backed up sufficiently and regularly and are backups stored safely?
  • Do you know which version of your data files is the master?
  • Who has access to which data during and after research? Is there a need for access restrictions? How will these be managed after you are dead?
  • How long will you store your data for and do you need to select which data to keep and which to destroy?

6.2.19.7 Sharing

  • Do you intend to make all your data available for sharing or how will you select which data to preserve and share?
  • How and where will you preserve your research data for the longer-term?
  • How will you make your data accessible to future users?

6.3 Reproducible code

6.3.1 Introduction

The following steps have to be fully integrated in order to produce reproducible code:

  • Step 1: Establish a reproducible project workflow.
  • Step 2: Organize project for reproducibility.
  • Step 3: Ensure basic programming standards.
  • Step 4: Document and manage dependencies.
  • Step 5: Produce a reproducible report (with R Markdown).
  • Step 6: Implement a version control protocol (with Git).
  • Step 7: Ensure archiving and citation of code.

In this chapter, we will cover steps 1 to 4. Steps 5 and 7 were respectively studied in Chapters 1 and 5 (Data management), whereas step 6 will be covered in the bioinformatic tutorial associated to Chapter 12.

6.3.2 Learning outcomes

  • Learn protocols to organize projects for reproducibility.
  • Discuss licenses for code and software.
  • Learn basic programming standards to ensure transparency and broad understanding of the data workflow.
  • Learn how to use R to infer data structure and file organization.
  • Learn about code portability: Absolute vs. Relative paths.
  • Review knowledge on documenting and managing software dependencies.

6.3.3 Literature and web resources

This chapter is based on:

6.3.4 Teaching material

The presentation slides for Chapter 5 - part B can be downloaded here.

6.3.5 Data and dependencies

To conduct the exercises presented in this chapter, you will need to:

  1. Download a folder from the shared Google Drive
  2. Install R packages

More information on these items is provided below.

6.3.5.1 Data

The data (Project_ID) used for this chapter are deposited on our shared Google Drive at this address:

Reproducible_Science > Exercises > Chapter_5 > Project_ID

Please download this whole folder prior to starting the exercises.

6.3.5.2 R packages

Here, we are using code developed in Chapter 1 to make sure the R dependencies (data.tree, DiagrammeR) are installed and loaded prior to pursuing the exercises.

## ~~~ 1. List all required packages ~~~ Object (args)
## provided by user with names of packages stored into a
## vector
pkg <- c("data.tree", "DiagrammeR")
## ~~~ 2. Check if pkg are installed ~~~
print("Check if packages are installed")
## [1] "Check if packages are installed"
# This line outputs a list of packages that are not
# installed
new.pkg <- pkg[!(pkg %in% installed.packages())]
## ~~~ 3. Install missing packages ~~~
if (length(new.pkg) > 0) {
    print(paste("Install missing package(s):", new.pkg, sep = " "))
    install.packages(new.pkg, dependencies = TRUE)
} else {
    print("All packages are already installed!")
}
## [1] "All packages are already installed!"
## ~~~ 4. Load all packages ~~~
print("Load packages and return status")
## [1] "Load packages and return status"
# Here we use the sapply() function to require all the
# packages
sapply(pkg, require, character.only = TRUE)
##  data.tree DiagrammeR 
##       TRUE       TRUE

6.3.6 Organize project for reproducibility

6.3.6.1 “Draw” a reproducible project workflow

Organizing your project for reproducibility starts by drawing a workflow serving as a basis to guide your project implementation. An example of a reproducible project workflow is displayed in Figure 6.4. This workflow will be used as a template for the material presented in this chapter.

A simple reproducible project workflow.

Figure 6.4: A simple reproducible project workflow.

6.3.6.2 The repeatable, reproducible analysis workflow

The fundamental idea behind a robust, reproducible analysis is a clean, repeatable script-based workflow (i.e. the sequence of tasks from the start to the end of a project) linking raw data through to clean data and to final analysis outputs.

6.3.6.3 Principles of a good analysis workflow

Please find below some key concepts associated with this task:

  • Start your analysis from copies of your raw data.
  • Any cleaning, merging, transforming, etc. of data should be done in scripts, not manually.
  • Split your workflow (scripts) into logical thematic units. For example, you might separate your code into scripts that
    1. load, merge and clean data,
    2. analyse data,
    3. produce outputs like figures and tables.
  • Eliminate code duplication by packaging up useful code into custom functions (see Step 3: Ensure basic programming standards). Make sure to comment your functions thoroughly, explaining their expected inputs and outputs (as well as the associated arguments and their options), and what they are doing and why (a short example of a documented custom function follows this list).
  • Document your code and data as comments in your scripts and by producing separate documentation (using the R Markdown format).
  • Any intermediary outputs generated by your workflow should be kept separate from raw data.
  • Keep your raw data raw.
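
To illustrate the point on custom functions made above, here is a small, hypothetical example of a documented function; the function name, arguments and values are purely illustrative.

# convert_counts_to_density(): hypothetical helper used to avoid
# duplicating the same transformation across scripts.
# Arguments:
#   counts    - numeric vector of raw counts per plot
#   plot_area - numeric, area of each plot in square meters (default = 1)
# Returns:
#   numeric vector of densities (individuals per square meter)
convert_counts_to_density <- function(counts, plot_area = 1) {
  counts / plot_area
}
# Example usage
convert_counts_to_density(c(14, 9, 21), plot_area = 4)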

6.3.6.4 Organizing and documenting workflows

The simplest and most effective way of documenting your workflow – its inputs and outputs – is through good file system organization, and informative, consistent naming of materials associated with your analysis. The name and location of files should be as informative as possible on what a file contains, why it exists, and how it relates to other files in the project. These principles extend to all files in your project (not just scripts) and are also intimately linked to good research data management (see Chapter 5: Data management).

6.3.6.5 File system structure

It is best to keep all files associated with a particular project in a single root directory. RStudio projects offer a great way to keep everything together in a self-contained and portable (i.e. so they can be moved from computer to computer) manner, allowing internal pathways to data and other scripts to remain valid even when shared or moved.
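
For instance, within an RStudio project rooted in Project_ID/, data would be read with a relative path rather than an absolute one, so the code keeps working when the project is moved or shared. A minimal sketch, assuming the Data/species_data.csv file from the example project:

# Portable: relative path resolved from the project root
species <- read.csv("Data/species_data.csv")
# Not portable: absolute path tied to one specific computer (avoid)
# species <- read.csv("/Users/yourname/Documents/Project_ID/Data/species_data.csv")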

Example file structure of a simple analysis project. Make sure you left-pad single digit numbers with a zero for the R scripts to avoid having those misordered.

Figure 6.5: Example file structure of a simple analysis project. Make sure you left-pad single digit numbers with a zero for the R scripts to avoid having those misordered.

There is no single best way to organize a file system. The key is to make sure that the structure of directories and the location of files are consistent and informative, and work for you. Please find below an example of a basic project directory structure (Figure 6.5); an R sketch for scaffolding such a structure is provided after the list:

  • The data folder contains all input data (and metadata) used in the analysis.
  • The MS folder contains the manuscript.
  • The Figures_&_Tables folder contains figures and tables generated by the analyses.
  • The Output folder contains any type of intermediate or output files (e.g. simulation outputs, models, processed datasets, etc.). You might separate this and also have a cleaned-data folder.
  • The R_functions folder contains R scripts with function definitions.
  • The Reports folder contains RMarkdown files that document the analysis or report on results.
  • The scripts (*.R) that actually do things are stored in the root directory together with the README.md file. If your project has too many scripts, you might consider organizing them in a separate folder.
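
As announced above, here is a minimal R sketch that scaffolds such a directory structure from scratch; the root folder name is arbitrary and you may of course adapt the folder names to your own conventions.

# Create a project root and the folders shown in Figure 6.5
root <- "Project_ID"
folders <- c("Data", "MS", "Figures_&_Tables", "Output", "R_functions", "Reports")
dir.create(root, showWarnings = FALSE)
for (f in folders) {
  dir.create(file.path(root, f), showWarnings = FALSE)
}
# Add an empty README at the root of the project
# (scripts can live at the root or, if numerous, in a Scripts/ folder)
file.create(file.path(root, "README.md"))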

6.3.6.6 License file

Code and software are often licensed under the GNU Affero General Public License (AGPL). This is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software. For instance, this is the license that the instructor has used for this class.

6.3.6.7 Using R to infer data structure and files organization

Inferring the directory tree structure of your project provides a simple and efficient way to summarize the data structure and organization of files related to your project as well as track versioning. The R base functions list.files() and file.info() can be combined to obtain information on files stored in your project. Please see code below for an example associated to Figure 6.5.

# Produce a list of all files in working directory
# (Project-ID) together with info related to those files
file.info(list.files(path = "Project_ID", recursive = TRUE, full.names = TRUE))
##                                                         size isdir mode
## Project_ID/Data/species_data.csv                           6 FALSE  700
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf  10282 FALSE  700
## Project_ID/MS/MS_et_al.docx                            11705 FALSE  700
## Project_ID/Output/DataStr_network.html                256368 FALSE  700
## Project_ID/Output/DataStr_tree.pdf                      3611 FALSE  700
## Project_ID/R_functions/check.install.pkg.R               684 FALSE  700
## Project_ID/README.md                                    1000 FALSE  700
## Project_ID/Reports/Documentation.md                       14 FALSE  700
## Project_ID/Scripts/01_download_data.R                     14 FALSE  700
## Project_ID/Scripts/02_clean_data.R                        14 FALSE  700
## Project_ID/Scripts/03_exploratory_analyses.R              14 FALSE  700
## Project_ID/Scripts/04_fit_models.R                        14 FALSE  700
## Project_ID/Scripts/05_generate_figures.R                  14 FALSE  700
##                                                                     mtime
## Project_ID/Data/species_data.csv                      2018-09-12 10:02:55
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2018-09-09 10:38:23
## Project_ID/MS/MS_et_al.docx                           2019-09-23 11:51:58
## Project_ID/Output/DataStr_network.html                2019-09-23 13:27:30
## Project_ID/Output/DataStr_tree.pdf                    2019-09-23 13:04:03
## Project_ID/R_functions/check.install.pkg.R            2019-09-10 09:54:03
## Project_ID/README.md                                  2021-10-03 13:32:58
## Project_ID/Reports/Documentation.md                   2018-09-12 08:56:42
## Project_ID/Scripts/01_download_data.R                 2018-09-12 08:56:42
## Project_ID/Scripts/02_clean_data.R                    2018-09-12 08:56:42
## Project_ID/Scripts/03_exploratory_analyses.R          2018-09-12 08:56:42
## Project_ID/Scripts/04_fit_models.R                    2018-09-12 08:56:42
## Project_ID/Scripts/05_generate_figures.R              2018-09-12 08:56:42
##                                                                     ctime
## Project_ID/Data/species_data.csv                      2022-06-24 14:49:58
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 14:49:58
## Project_ID/MS/MS_et_al.docx                           2022-06-24 14:49:56
## Project_ID/Output/DataStr_network.html                2022-06-24 14:49:56
## Project_ID/Output/DataStr_tree.pdf                    2022-06-24 14:49:57
## Project_ID/R_functions/check.install.pkg.R            2022-06-24 14:49:57
## Project_ID/README.md                                  2022-06-24 14:49:57
## Project_ID/Reports/Documentation.md                   2022-06-24 14:49:58
## Project_ID/Scripts/01_download_data.R                 2022-06-24 14:49:57
## Project_ID/Scripts/02_clean_data.R                    2022-06-24 14:49:57
## Project_ID/Scripts/03_exploratory_analyses.R          2022-06-24 14:49:58
## Project_ID/Scripts/04_fit_models.R                    2022-06-24 14:49:57
## Project_ID/Scripts/05_generate_figures.R              2022-06-24 14:49:57
##                                                                     atime uid
## Project_ID/Data/species_data.csv                      2022-06-24 15:34:11 502
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 15:34:11 502
## Project_ID/MS/MS_et_al.docx                           2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_network.html                2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_tree.pdf                    2022-06-24 15:34:11 502
## Project_ID/R_functions/check.install.pkg.R            2022-06-24 15:34:11 502
## Project_ID/README.md                                  2022-06-24 15:34:11 502
## Project_ID/Reports/Documentation.md                   2022-06-24 15:34:11 502
## Project_ID/Scripts/01_download_data.R                 2022-06-24 15:34:11 502
## Project_ID/Scripts/02_clean_data.R                    2022-06-24 15:34:11 502
## Project_ID/Scripts/03_exploratory_analyses.R          2022-06-24 15:34:11 502
## Project_ID/Scripts/04_fit_models.R                    2022-06-24 15:34:11 502
## Project_ID/Scripts/05_generate_figures.R              2022-06-24 15:34:11 502
##                                                       gid uname grname
## Project_ID/Data/species_data.csv                       20  sven  staff
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf  20  sven  staff
## Project_ID/MS/MS_et_al.docx                            20  sven  staff
## Project_ID/Output/DataStr_network.html                 20  sven  staff
## Project_ID/Output/DataStr_tree.pdf                     20  sven  staff
## Project_ID/R_functions/check.install.pkg.R             20  sven  staff
## Project_ID/README.md                                   20  sven  staff
## Project_ID/Reports/Documentation.md                    20  sven  staff
## Project_ID/Scripts/01_download_data.R                  20  sven  staff
## Project_ID/Scripts/02_clean_data.R                     20  sven  staff
## Project_ID/Scripts/03_exploratory_analyses.R           20  sven  staff
## Project_ID/Scripts/04_fit_models.R                     20  sven  staff
## Project_ID/Scripts/05_generate_figures.R               20  sven  staff

6.3.6.8 Pseudocode and associated R script

The code presented above could be at the core of a user-defined function aimed at managing files and ensuring data reliability for your project. To help you define such a function, let’s investigate this further and produce a diagram summarizing your project structure. This objective is achieved in four steps (see Figure 6.6):

  • Step 1: Produce a list of all files in your working directory using the file.info() and list.files() functions. The output of this code is a data.frame.
  • Step 2: Convert the data.frame into a data.tree class object. This is done using the data.tree::as.Node() function from the data.tree package.
  • Step 3: Prepare and plot a diagram of the project structure. This is done using a set of functions from data.tree and plotting the output with the DiagrammeR package.
  • Step 4: Save the output from Step 3.

WARNING: Prior to starting the exercise, please go to the shared Google Drive and download the Project_ID folder at this path:

Reproducible_Science > Exercises > Chapter_5 > Project_ID

6.3.6.8.1 R code and associated outputs

Procedure to follow:

  1. Copy Project_ID/ into a directory entitled Chapter_5_PartB.
  2. Create an R script and save it in Chapter_5_PartB.
  3. Make sure the required R packages are installed. See here for more details.
  4. Study and execute the R code below to infer the file structure in Project_ID/. Also read the text below, which further explains the approach.
### Load R packages
library(data.tree)
library(DiagrammeR)
### Step 1: Produce a list of all files in Project_ID
filesInfo <- file.info(list.files(path = "Project_ID", recursive = TRUE,
    full.names = TRUE))
### Step 2: Convert filesInfo into data.tree class
myproj <- data.tree::as.Node(data.frame(pathString = rownames(filesInfo)))
# Inspect output
print(myproj)
##                            levelName
## 1  Project_ID                       
## 2   ¦--Data                         
## 3   ¦   °--species_data.csv         
## 4   ¦--Figures_&_Tables             
## 5   ¦   °--Fig_01_Data_lifecycle.pdf
## 6   ¦--MS                           
## 7   ¦   °--MS_et_al.docx            
## 8   ¦--Output                       
## 9   ¦   ¦--DataStr_network.html     
## 10  ¦   °--DataStr_tree.pdf         
## 11  ¦--R_functions                  
## 12  ¦   °--check.install.pkg.R      
## 13  ¦--README.md                    
## 14  ¦--Reports                      
## 15  ¦   °--Documentation.md         
## 16  °--Scripts                      
## 17      ¦--01_download_data.R       
## 18      ¦--02_clean_data.R          
## 19      ¦--03_exploratory_analyses.R
## 20      ¦--04_fit_models.R          
## 21      °--05_generate_figures.R
### Step 3: Prepare and plot diagram of project structure
### (it requires DiagrammeR)
# Set general parameters related to graph
data.tree::SetGraphStyle(myproj$root, rankdir = "LR")
# Set parameters for edges
data.tree::SetEdgeStyle(myproj$root, arrowhead = "vee", color = "grey",
    penwidth = "2px")
# Set parameters for nodes
data.tree::SetNodeStyle(myproj, style = "rounded", shape = "box")
# Apply specific criteria only to children nodes of Scripts
# and R_functions folders
data.tree::SetNodeStyle(myproj$Scripts, style = "box", penwidth = "2px")
data.tree::SetNodeStyle(myproj$R_functions, style = "box", penwidth = "2px")
# Plot diagram
plot(myproj)

Figure 6.6: Diagram of project structure for the Project_ID directory. Nodes representing folders and files associated to R code are symbolized by boxes, whereas the others are rounded.

Finally, the R DiagrammeR package unfortunately does not provide an easy way to save the graph to a file (Step 4) using e.g. the pdf() and dev.off() functions, but this task can be done in RStudio as follows:

  1. The diagram produced in Step 3 is displayed in the Viewer window (in the bottom right panel) and can be exported by clicking on the Export button and selecting Save as Image... as shown in Figure 6.7.
Snapshot showing how to export the diagram in RStudio.

Figure 6.7: Snapshot showing how to export the diagram in RStudio.

  2. Once you have executed the previous step, a window will open allowing you to select the image format, directory and file name as shown in Figure 6.8.
Snapshot showing how to save the diagram in RStudio.

Figure 6.8: Snapshot showing how to save the diagram in RStudio.

To find out more about your options to export/save DiagrammeR graphs please visit this website:

https://rich-iannone.github.io/DiagrammeR/io.html
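
If you prefer a scripted alternative to exporting from the Viewer, one option is to save the htmlwidget returned by plot() as a standalone HTML file. This is only a minimal sketch, assuming that plot() on a data.tree object returns a DiagrammeR htmlwidget (as in the example above) and that the htmlwidgets package is installed; the output file name is arbitrary:

### Scripted alternative (sketch): save the diagram as a standalone HTML file
# Assumes plot(myproj) returns a DiagrammeR htmlwidget and that the
# htmlwidgets package is installed
library(htmlwidgets)
proj_diagram <- plot(myproj)
saveWidget(proj_diagram, file = "Project_structure.html")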

6.3.6.9 Informative, consistent naming

Good naming extends to all files, folders and even objects in your analysis; it makes the contents and relationships among elements of your analysis understandable, searchable and organised in a logical fashion (see Figure 6.5 for examples and Chapter 5: Data management for more details).

6.3.7 Ensure basic programming standards

6.3.7.1 Writing pseudocode

Pseudocode is an informal, high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a normal programming language (here R), but is intended for human reading rather than machine reading. Here, you will establish the big steps (and their associated tasks) and tie R functions (existing or to be written) to those steps. This provides the backbone of your code and will support writing it.
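
For instance, a pseudocode skeleton for the project-diagram exercise above, written as R comments and tied to the functions used earlier in this chapter, might look like this:

### Pseudocode skeleton (R comments only)
# Step 1: List all files in the project     -> list.files() + file.info()
# Step 2: Convert the file list into a tree -> data.tree::as.Node()
# Step 3: Style and plot the tree           -> SetGraphStyle(), SetNodeStyle(), plot()
# Step 4: Save the diagram                  -> export from the RStudio Viewer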

6.3.7.2 Writing code

Writing clear, reproducible code has (at least) three main benefits:

  1. It makes returning to the code much easier a few months down the line, whether revisiting an old project or making revisions following peer review.
  2. The results of your analysis are more easily scrutinized by the readers of your paper, meaning it is easier to show their validity.
  3. Having clean and reproducible code available can encourage greater uptake of new methods that you have developed.

To write clear and reproducible code, it is recommended to follow the workflow depicted in Figure 6.4. The following section explains each part of the workflow along with some tips for writing code. Although this workflow would be the ‘gold standard’, just picking up some of the elements can help to make your code more effective and readable.

6.3.7.3 Style guides

The foundation of writing readable code is to choose a logical and readable coding style, and to stick to it. Some key elements to consider when developing a coding style are:

  • Using meaningful file names, and numbering these if they are in a sequence (see Figure 6.5 for examples and Chapter 5: Data management for more details).
### Naming files
# Good
fit-models.R
utility-functions.R

# Bad
foo.r
stuff.r
  • Concise and descriptive object names. Variable names should usually be nouns and function names verbs. Using names of existing variables or functions should be avoided.
### Naming objects
# Good
day_one
day_1

# Bad
first_day_of_the_month
DayOne
dayone
djm1
  • Spacing should be used to improve visual effect: use spaces around operators (=, +, -, <-, etc.), and after commas (much like in a sentence).
### Spacing
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
  • Indentation should be with two spaces, not tabs, and definitely not a mixture of tabs and spaces.
### Indentation
long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}
  • Assignment. Use <-, not =, for assignment.
### Assignment
# Good
x <- 5

# Bad
x = 5

The most important role of a style guide, however, is to introduce consistency to scripts.

6.3.7.4 Portable code: Absolute vs. Relative paths

When working collaboratively, portability between machines is very important, i.e. will your code work on someone else’s computer? Portability is also important if code is being run on another server, for example a High Performance Cluster. One way to improve portability of code is to avoid using absolute paths and use only relative paths (see https://en.wikipedia.org/wiki/Path_(computing)#Absolute_and_relative_paths).

  • An absolute path is one that gives the full address to a folder or file.
  • A relative path gives the location of the file from the current working directory. Relative paths are favored to promote code portability.

For example, based on species_data.csv stored in the Data folder shown in Figure 6.5:

# Absolute path -----------------------------
C:/Project_ID/Data/species_data.csv

Project_ID = Project root folder = working directory

# Relative path ------
Data/species_data.csv

Relative paths are particularly useful when transferring between computers because while I may have stored my project folder in ‘C:/Project_ID/’, you may have yours stored in ‘C:/Users/My Documents’. Using relative paths and running from the project folder will ensure that file-not-found errors are avoided.
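
In R, this difference is easy to see when reading the data set (a minimal sketch, assuming the working directory is set to the Project_ID folder):

### Reading species_data.csv (sketch; assumes the working directory is Project_ID/)
# Portable: relative path from the project root
species <- read.csv("Data/species_data.csv")
# Not portable: absolute path tied to one specific computer
# species <- read.csv("C:/Project_ID/Data/species_data.csv")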

RStudio is designed to make your code more portable. For instance, you can easily set the working directory in RStudio by clicking:

Session > Set Working Directory > To Source File Location

or using the setwd() function. Please see Chapter 1 for more details.

6.3.7.5 Commenting code

How often have you revisited an old script six months down the line and not been able to figure out what you had been doing? Or have taken on a script from a collaborator and not been able to understand what their code is doing and why? An easy win for making code more readable and reproducible is the liberal, and effective, use of comments. A comment is a line of code that is visible, but does not get run with the rest of the script. In R and Python this is signified by beginning the line with a #.

One good principle to adhere to is to comment the ‘why’ rather than the ‘what’. The code itself tells the reader what is being done, it is far more important to document the reasoning behind a particular section of code or, if it is doing something nonstandard or complicated, to take some time to describe that section of code.

It is also good practice to use comments to add an overview at the beginning of a script, and commented lines of --- to break up the script: e.g. in R

# Load data -------
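
A slightly fuller sketch of a commented script skeleton (the file name, purpose and section names below are placeholders) could look like this:

### 01_download_data.R
### Purpose: download and save the raw data used in this project
### Author: <your name> | Date: <date>

# Load packages -----------------------------------------------------------

# Download data -----------------------------------------------------------
# Comment the 'why': raw data are re-downloaded so that the analysis can be
# rerun from scratch by anyone.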

6.3.7.6 Writing functions

Often when you are analyzing data, you need to repeat the same task many times. For example, you might have several files that all need loading and cleaning in the same way, or you might need to perform the same analysis for multiple species or parameters. The best way (by far) to handle these tasks is to write functions and store them in a dedicated folder (e.g. R_functions in Figure 6.5). These functions are loaded into the R environment (and therefore made available to users) with the source() function, placed at the top of your R script (see the sketch below). Please see Chapter 1 part D for more details.
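
As a minimal sketch (following the Project_ID example used in this chapter), user-defined functions stored in the R_functions folder are loaded at the top of a script as follows:

### Load user-defined functions stored in R_functions/
# (path follows the Project_ID structure shown in Figure 6.5)
source("R_functions/check.install.pkg.R")
# Functions defined in that file are now available in the R environment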

6.3.7.7 Testing scientific code

In the experimental sciences, rigorous testing is applied to ensure that results are accurate, reproducible and reliable. Testing will show that the experimental setup is doing what it is meant to do and will quantify any systematic biases. Results of experiments will not be trusted without such tests; why should your code be any different? Testing scientific code allows you to be sure that it is working as intended and to understand and quantify any limitations of the code. Using tests can also help to speed up the code development process by finding errors early on.

Although the instructor recognizes the importance of establishing formal testing protocols, such tests will be especially relevant if you were to design R packages. For this class, the instructor recommends informal testing: load the functions you have written and run ad hoc tests in the command line to make sure that they perform as expected.
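
For example, an informal ad hoc test run in the console can compare a function's output against a value you know to be correct (a sketch using base R only):

### Informal ad hoc test (sketch): check that a computation behaves as expected
x <- c(2, 4, 6)
stopifnot(identical(mean(x), 4))  # stops with an error if the check fails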

6.3.8 Document and manage dependencies

Reproducibility is also about making sure someone else can re-use your code to obtain the same results as you (see Appendix 1 for more details).

For someone else to be able to reproduce the results included in your report, you need to provide more than the code and the data. You also need to document the exact versions of all the packages, libraries, and software you used, and potentially your operating system as well as your hardware.

R itself is very stable, and the core team of developers takes backward compatibility (old code works with recent versions of R) very seriously. However, default values in some functions change, and new functions get introduced regularly. If you wrote your code on a recent version of R and give it to someone who has not upgraded recently, they may not be able to run your code. Code written for one version of a package may produce very different results with a more recent version.

6.3.8.1 Reporting R packages and versions

With R, the simplest (yet useful and important) approach to documenting your dependencies is to report the output of sessionInfo() (or devtools::session_info()). Among other information, this shows all the packages, and their versions, that are loaded in the session you used to run your analysis. If someone wants to recreate your analysis, they will know which packages to install. Please see Appendix 2 for more details.

For instance, here is the output of sessionInfo() showing the R version and packages that I used to create this document:

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rticles_0.23         DiagrammeR_1.0.9     DT_0.24             
##  [4] data.tree_1.0.0      kfigr_1.2.1          devtools_2.4.4      
##  [7] usethis_2.1.6        bibtex_0.4.2.3       knitcitations_1.0.12
## [10] htmltools_0.5.3      prettydoc_0.4.1      magrittr_2.0.3      
## [13] dplyr_1.1.2          kableExtra_1.3.4     formattable_0.2.1   
## [16] bookdown_0.33        rmarkdown_2.21       knitr_1.42          
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.5         tidyr_1.3.0        sass_0.4.2         pkgload_1.3.2.1   
##  [5] jsonlite_1.8.7     viridisLite_0.4.2  bslib_0.4.0        shiny_1.7.2       
##  [9] highr_0.9          yaml_2.3.7         remotes_2.4.2      sessioninfo_1.2.2 
## [13] pillar_1.9.0       glue_1.6.2         digest_0.6.33      RColorBrewer_1.1-3
## [17] promises_1.2.0.1   rvest_1.0.3        RefManageR_1.3.0   colorspace_2.1-0  
## [21] httpuv_1.6.5       plyr_1.8.7         pkgconfig_2.0.3    purrr_1.0.2       
## [25] xtable_1.8-4       scales_1.2.1       webshot_0.5.4      processx_3.8.2    
## [29] svglite_2.1.0      later_1.3.0        tibble_3.2.1       generics_0.1.3    
## [33] ellipsis_0.3.2     withr_2.5.0        cachem_1.0.6       cli_3.6.1         
## [37] crayon_1.5.2       mime_0.12          memoise_2.0.1      evaluate_0.21     
## [41] ps_1.7.5           fs_1.6.3           fansi_1.0.4        xml2_1.3.5        
## [45] pkgbuild_1.3.1     profvis_0.3.7      tools_4.2.0        prettyunits_1.1.1 
## [49] formatR_1.12       lifecycle_1.0.3    stringr_1.5.0      munsell_0.5.0     
## [53] callr_3.7.3        compiler_4.2.0     jquerylib_0.1.4    systemfonts_1.0.4 
## [57] rlang_1.1.1        rstudioapi_0.14    visNetwork_2.1.0   htmlwidgets_1.5.4 
## [61] miniUI_0.1.1.1     R6_2.5.1           lubridate_1.8.0    fastmap_1.1.0     
## [65] utf8_1.2.3         stringi_1.7.12     Rcpp_1.0.11        vctrs_0.6.3       
## [69] tidyselect_1.2.0   xfun_0.36          urlchecker_1.0.1

6.3.8.2 Using R packages to manage dependencies and recreate your setup

This will not be covered in this class, but it is worth noting that there are at least two R packages that help manage dependencies and recreate your setup. Those packages are:

The instructor expects students to discuss with their supervisors to identify if those packages are potentially useful for their projects.

7 Chapter 6

7.1 Introduction

This chapter is subdivided into two parts as follows:

  • PART A: Getting published.
  • PART B: Writing papers in R Markdown.

7.2 Getting published

7.2.1 Learning outcome

  • Learn procedures to present your research and disseminate it through publications.

To support the learning outcome, we will focus on discussing:

  1. The importance of knowing the key conclusion that you want to communicate and building your publication around it.
  2. The steps towards getting your research published:
    • Selecting your journal.
    • Writing your manuscript.
    • Submitting your manuscript.
    • Acceptance and publication.

7.2.2 Literature and web resources

This chapter is mostly based on the following books, publications and web resources:

Books and Guides

Publications

  • Allen et al. (2014) – Comment published in Nature advocating for a “taxonomy” of author roles.
  • Fox and Burns (2015) – A publication investigating the relationship between manuscript title structure and success in Ecology.

Websites

7.2.3 Download presentation

The presentation slides for Chapter 6 - part A can be downloaded here.

7.2.4 Why publish?

Publishing research results is the one thing that unites scientists across disciplines, and it is a necessary part of the scientific process. You can have the best ideas in the world, but if you don’t communicate them clearly enough to be published, your work won’t be acknowledged by the scientific community.

By publishing you are achieving three key goals for yourself and the larger scientific endeavor:

  • Disseminating your research.
  • Advancing your career.
  • Advancing science.

In biology, publishing research essentially means publishing journal articles.

7.2.5 Know your message

Before beginning writing your journal article and thinking where to submit it, it is important to thoroughly understand your own research and know the key conclusion you want to communicate (see chapter 3). In other words, what is your take home message?

Consider your conclusion and ask yourself, is it:

  • New and interesting?
  • Contributing to a hot topic?
  • Providing solutions to difficult problems?

If you can answer ‘yes’ to all three, you have a good foundation message for a paper.

Shape the whole narrative of your paper around this message.

7.2.6 Steps towards getting your research published

Once you know your message, getting your research published will be a four-step process:

  • Step 1: Selecting your journal.
  • Step 2: Writing your manuscript.
  • Step 3: Submitting your manuscript.
  • Step 4: Acceptance and publication.

Each step will be discussed below. Please seek support from your supervisor to learn more about the specifics of your field.

7.2.7 Step 1: Selecting your journal

To target the best journal to publish your research, you need to ask yourself: what audience do I want my paper to reach?

Your manuscript should be tailored to the journal you want to submit to in terms of content and in terms of style (as outlined in journals’ author guidelines). To confirm that a journal is the best outlet to publish your research ask yourself this question: can I relate my research to other papers published in this journal?

Here are some things to consider when choosing which journal to submit to:

7.2.7.1 Journal aims and scope

Look closely at what the journal publishes; manuscripts are often rejected on the basis that they would be more suitable for another journal. There can be crossover between different journals’ aims and scope – differences may be subtle, but all important when it comes to getting accepted.

Do you want your article read by a more specialist audience working on closely related topics to yours, or researchers within your broader discipline?

Once you have decided which journal you are most interested in, make sure that you tailor the article according to its aims and scope.

7.2.7.2 Editors and editorial boards

It is a good sign if you recognize the names of the editors and editorial board members of a journal from the work you have already encountered (even better if they contributed to some of the references cited in your manuscript). Research who would likely deal with your paper if you submitted to a journal and find someone who would appreciate reading your paper. You can suggest handling editors in your cover letter or in the submission form, if it allows, but be aware that journals do not have to follow your suggestions and/or requests.

7.2.7.3 Impact factor and other metrics

A summary of our previously discussed material is presented below to provide more context for this chapter, but please consult Chapter 4 for more details on this topic.

Impact factors are the measure most widely used to compare journal quality, based on the citations a journal receives. However, other metrics are becoming more common, e.g. the altmetric score, which measures the impact of individual articles through online activity (shares on different social media platforms, etc.), or article download figures listed next to the published paper.

None of the metrics described here are an exact measure of the quality of the journal or published research. You will have to decide which of these metrics (if any) matter most to your work or your funders and institutions.

7.2.7.4 Open access

Do you need to publish open access (OA)? Some funders mandate it and grant money often has an amount earmarked to cover the article processing charge (APC) required for Gold OA. Some universities have established agreements with publishers whereby their staff get discounts on APCs when publishing in certain journals (or even a quota of manuscripts that can be published for “free” on a yearly basis). If you do not have grant funding, check whether your university or department has got an OA fund that you could tap into.

However, if you are not mandated to publish OA by your funder and/or you do not have the funds to do so, your paper will still reach your target audience if you select the right journal for your paper. Remember, you can share your paper over email.

7.2.7.5 Author guidelines

Author guidelines will outline the journal’s requirements for submissions:

  • Aims and scope.
  • Formatting requirements (incl. word limit, number of figures, tables, references).
  • Journal policies (e.g. on data sharing and citation).

Always follow the author guidelines, stick to the word limit and tailor your manuscript accordingly. Remember that papers can be rejected immediately if they do not meet the author guidelines.

7.2.7.6 Time to publication

The length of time a paper takes to be peer reviewed does not correlate to the quality of peer review, but rather reflects the resources a journal has to manage the process (e.g. do they have paid editorial staff or is it managed by full-time academics?).

Journals usually give their average time to a decision on their website, so take note of this if time is a consideration for you.

Some journals also make it clear that they are reviewing for soundness of science rather than novelty and will therefore often have a faster review process (e.g. PLoS ONE).

7.2.7.7 Ethics

Ethics can be divided into two groups:

  • Research ethics: this term includes aspects such as how you manage sensitive species information, whether you adhere to animal welfare guidelines and regulations or how you deal with data protection.
  • Publication ethics: this term concerns practices around the publication process. Standards set across scholarly publishing help define good practice and identify cases of misconduct. The Committee on Publication Ethics (COPE) provides the main forum for advice on ethics within scholarly publishing and has issued several sets of guidelines that help journals, editors and publishers handle cases of misconduct such as data fabrication, peer review fraud, plagiarism, etc.

As an author, it helps if you are familiar with what constitutes good practices and what is considered unacceptable. Please see section “Used literature & web resources” for more details on this topic.

7.2.7.8 Authorship

Start talking about authorship and author order for your paper with collaborators at an early stage – before submitting and ideally before writing the paper. To deal with potential issues, ask yourself the question: Who will do what?

Some journals now encourage ‘authorship contribution statements’, so check the journal guidelines to see if this is required and how to format it. To help you figure out authorship associated with your research article, you could apply the CRediT (Contributor Roles Taxonomy) model presented by Wiley. The classification is as follows:

CRediT Classification

  • Conceptualization: Ideas; formulation or evolution of overarching research goals and aims.
  • Data Curation: Management activities to annotate (produce metadata), scrub data and maintain research data (including software code, where it is necessary for interpreting the data itself) for initial use and later reuse.
  • Formal Analysis: Application of statistical, mathematical, computational, or other formal techniques to analyze or synthesize study data.
  • Funding Acquisition: Acquisition of the financial support for the project leading to this publication.
  • Investigation: Conducting a research and investigation process, specifically performing the experiments, or data/evidence collection.
  • Methodology: Development or design of methodology; creation of models.
  • Project Administration: Management and coordination responsibility for the research activity planning and execution.
  • Resources: Provision of study materials, reagents, materials, patients, laboratory samples, animals, instrumentation, computing resources, or other analysis tools.
  • Software: Programming, software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components.
  • Supervision: Oversight and leadership responsibility for the research activity planning and execution, including mentorship external to the core team.
  • Validation: Verification, whether as a part of the activity or separate, of the overall replication/reproducibility of results/experiments and other research outputs.
  • Visualization: Preparation, creation and/or presentation of the published work, specifically visualization/data presentation.
  • Writing – Original Draft Preparation: Creation and/or presentation of the published work, specifically writing the initial draft (including substantive translation).
  • Writing – Review & Editing: Preparation, creation and/or presentation of the published work by those from the original research group, specifically critical review, commentary or revision – including pre- or post-publication stages.

Here is an example of an author contributions paragraph inspired by the CRediT model:

F.F., N.A.B. and S.B. designed study; F.F., E.B., S.P.B., J.M., and S.B. compiled data; F.F., S.B. and J.M. performed analyses; S.I.B., P.M.H., A.L., D.P.L., S.M., H.R., C.R., D.W.S., and P.T. provided material and/or sequences; F.F., J.M. and S.B. wrote the manuscript; all authors critically read and revised the manuscript, and approved the final submitted version.

7.2.8 Step 2: Writing your manuscript

7.2.8.1 Planning to write

Develop a narrative that leads to your main conclusion and develop a backbone around that narrative. The narrative should progress logically, which does not necessarily mean chronologically. Work out approximate word counts for each section to help manage the article structure and keep you on track for word limits.

It is important to set aside enough time to write your manuscript and – importantly – enough time to edit, which may actually take longer than the writing itself.

7.2.8.2 Structure

The article structure will be defined in the author guidelines, but if the journal’s guidelines permit it, there may be scope to use your own subheadings. By breaking down your manuscript into smaller sections, you will be communicating your message in a much more digestible form.

Use subheadings to shape your narrative and write each subheading in statement form (e.g. ecological variables do not predict genome size variation).

7.2.8.3 Title

The title is the most visible part of your paper and it should thus clearly communicate your key message. Pre-publication, reviewers base their decision on whether to review a paper on the quality of the title and abstract. Post-publication, if you publish in a subscription journal and not OA, the title and abstract are the only freely available parts of your paper, which will turn up in search engines and thus reach the widest audience. A good title will help you get citations and may even be picked up by the press.

Draft a title before you write your manuscript to help focus your paper. The title needs to be informative and interesting to make it stand out to reviewers and subsequently readers. Some key tips for a successful title include:

  • Write it in statement form. When scanning papers, most people skip to the last sentence of the abstract to look for the key message, so make that sentence your title.
  • Keep it around 15 words – any longer or shorter and it has more chance of being rejected at peer review.
  • Use punctuation to split the main message and qualifier/subtitle e.g. ‘Feeding evolution of a herbivore influences an arthropod community through plants: implications for plant-mediated eco-evolutionary feedback loop’.
  • Keep it general – readers prefer titles that emphasize broader conceptual or comparative issues, and these titles fare better both pre- and post-publication than papers with organism-specific titles. Try to avoid using species names, put them in the abstract and keywords instead.
  • Do not use abbreviations even if they are familiar in your field. You should keep a broad audience in mind.
  • Do not use phrases such as ‘The effect of…’, ‘The involvement of…’. These phrases give the reader scope to question your message – instead tell the reader what they are being told.

7.2.8.4 Abstract

Write your abstract after you have written your paper, when you are fully aware of the narrative of your paper. After the title, the abstract is the most read part of your paper. Abstracts are freely available and affect how discoverable your article is via search engines.

Given its importance, your abstract should:

  • Articulate your new and interesting key message.
  • Outline the methods and results.
  • Contextualize the work.
  • Highlight how your research contributes to the field and its future implications.
  • Have the last sentence communicating the key message.

7.2.8.5 Writing style

Writing with clarity, simplicity and accuracy takes practice and we can all get carried away with what we think is ‘academic writing’ (i.e. long words and jargon) but good science speaks for itself. Write short sentences (ca. 12 words on average).

Every extra word you write is another word for a reviewer to disagree with. Single out the narrative that leads to your main conclusion and write that – it is easy to get side tracked with lots of interesting avenues distracting from your work, but by including those in your paper, you are inviting more criticism from reviewers.

Write in an active, positive voice (e.g. ‘we found this…’ ‘we did this…’) and be direct so that your message is clear. Ambiguous writing is another invitation for reviewers to disagree with you.

In your introduction, state that your research is timely, important, and why. Begin each section with that section’s key message and end each section with that message again plus further implications. This will place your work in the broader context that high-quality journals like.

Draft and redraft your work to ensure it flows well and your message is clear and focused throughout. Throughout this process, keep the reader in mind at all times (to take a critical look at your research and its presentation).

7.2.8.6 Keywords

Keywords are used by readers to discover your paper. You will increase chances of your paper being discovered through search engines by using them strategically throughout your paper – this is search engine optimization (SEO).

Think of the words you would search for to bring up your paper in a Google search. Try it and see what comes up – are there papers that cover similar research to your own?

Build up a list of 15–20 terms relevant to your paper and divide them into two groups:

  • A core group of around 5 keywords,
  • A larger group of secondary keywords.

Place your core keywords in the title, abstract and subheadings, and the secondary keywords throughout the text and in figures and tables. Repeat keywords in the abstract and text naturally.

7.2.8.7 References

Reference all sources and do it as you go along (e.g. copy the BibTeX citation into a reference file; see chapter 1 part B), then tidy them once the paper is complete.

Make sure that most of your references are recent to demonstrate both that you have a good understanding of current literature, and that your research is relevant to current topics.

7.2.8.8 Figures and tables

Figures and tables enhance your paper by communicating results or data concisely (more on this topic in chapters 9 and 10).

Use figures and tables to maintain the flow of your narrative – e.g. instead of trying to describe patterns in your results, create a figure and say ‘see Fig. 1’. Not only does this keep your word count down, but a well-designed figure can replace 1000 words!

Figures are useful for communicating overall trends and shapes, allowing simple comparisons between fewer elements. Tables should be used to display precise data values that require comparisons between many different elements.

Figure captions and table titles should explain what is presented and highlight the key message of this part of your narrative – the figure/table and its caption/title should be understandable in isolation from the rest of your manuscript.

Check the journal’s author guidelines for details on table and figure formatting, appropriate file types, number of tables and figures allowed and any other specifications that may apply. Material presented in chapter 1 part C can help you produce figures meeting journal expectations.

7.2.8.9 Editing

Once you have finished writing your manuscript, put it on ice for a week so you come back to it with fresh eyes. Take your time to read it through. Editing can take more time than you expect, but this is your opportunity to fine-tune and submit the best paper possible. Don’t hesitate to seek support from your thesis committee to speed up and streamline this process.

Key things to look out for when editing include:

  • Spelling and grammar – a surprising number of errors slip through. If you are a non-native English speaker, ask a native speaker, ideally a colleague who knows a little bit about the subject, to read it through, or use a language-editing service if you have the funds to do so.
  • Make sure all statements and assumptions are explained.
  • Remove redundant words or phrases – keep it concise and jargon-free to avoid diluting your message.
  • Abbreviations – check that they have been expanded on the first use.
  • Acknowledgements – make sure all funders are clearly mentioned and that all people who contributed in any way are acknowledged.
  • Keywords – they should be consistent, evenly spaced throughout the text and placed at key points in your manuscript e.g. subheadings.
  • Make sure you have specifically dealt with the hypothesis set out in the introduction – you’d be surprised at the number of papers submitted that don’t!
  • Circulate the manuscript to your co-authors to get their comments and final approval before submission.

7.2.9 Step 3: Submitting your manuscript

You are now ready to submit your paper to your chosen journal. Each journal will have a different submission procedure that you will have to adhere to, and most manage their submissions through online submission systems (e.g. ScholarOne Manuscripts).

Only submit your paper for consideration to one journal at a time otherwise you will be breaching publishing ethics.

7.2.9.1 Cover letters

A great cover letter can set the stage towards convincing editors to send your paper for review. Write a concise and engaging letter addressed to the editor-in-chief, who may not be an expert in your field or sub-field.

The following points should be covered in your cover letter:

  • State your key message and why your paper is important and relevant to the journal.
  • State that your paper is not under review in another journal and has not been published before (you will most likely confirm this during the submission process, in which case you do not need to repeat it in the cover letter).
  • The cover letter should be shorter than your abstract and be written in less technical language.
  • Use it to recommend reviewers (include their emails) and/or a relevant handling editor. Pick suggested reviewers with a good reputation to demonstrate both your knowledge of the field and your belief that your paper can stand up to their scrutiny.
  • State any potential conflict of interest with other teams and, if necessary, list reviewers you would prefer to exclude.

7.2.9.2 Handling revisions

Very rarely is a paper immediately accepted – almost all papers go through a few rounds of review before they get published.

If a decision comes back asking for revisions you should reply to all comments politely. Here are some tips on handling reviewer comments and revising your paper:

  • Look at the reviewer comments with scrutiny and make a list of all the points that need to be addressed.
  • Start with the minor revisions such as spelling, grammar, inconsistencies – these are often the most numerous but the easiest to correct.
  • If you disagree with certain comments, disagree politely and with evidence. Do not skip over them when writing your reply.
  • If things can’t be dealt with in this paper then explain that to the editor – reviewers may try to push their own agenda e.g. ‘why don’t you write this paper instead’, but you have the right to disagree if you don’t feel it is appropriate to deal with this in your paper.
  • Respond to comments as thoroughly as you can.
  • Include a point-by-point response to the reviewer comments in the relevant section of the online system.

7.2.9.3 Handling rejection

Reviewers are volunteers, but the service they provide is invaluable – by undergoing peer review, regardless of the outcome, you are receiving some of the best advice from leading experts for free. With this in mind, any feedback you get will be constructive in the end and will lead you on the way to a successful publishing portfolio.

Keep in mind that feedback is another person’s opinion on what you have done, not on who you are, and it is up to you to decide what to do with it.

If your paper is rejected look at the reviewer’s comments and use their feedback to improve your paper before resubmitting it.

7.2.9.4 Appeals

If you are unhappy with a reject decision, 99.9% of the time, move on. However, don’t be afraid of appealing if you have well-founded concerns or think that the reviewers have done a bad job. There are instances where journals grant your appeal and allow you to revise your paper, but in the large majority of cases, the decision to reject will be upheld.

7.2.10 Step 4: Acceptance and publication

Congratulations! By now you should have an acceptance email from the editor-in-chief in your inbox. The process from here will vary according to each journal, but the post-acceptance workflow is usually as follows:

  • Your paper will be published online, unedited, but citable as an ‘Accepted Article’ within a week of acceptance (usually a DOI is assigned and your paper is citable at this stage).
  • Your paper will be copy-edited. The level of copy-editing your paper will receive will vary according to each journal, so it is worth checking your proof thoroughly.
  • Your paper will be typeset and a proof will be sent to you for checking. Author queries will be marked on the proof. At this stage, only minor corrections related to the typesetting are allowed.
  • Your finalized proof will be published online in ‘Early View’.
  • Finally, according to the journal’s schedule, your paper will be placed in an issue (or not if it is an online only journal, e.g. Scientific Reports).

It might be then time to coordinate the publication of a press release or post the link of your article on social media to share your joy!

7.3 Writing papers in R Markdown

7.3.1 Download presentation

The presentation slides for Chapter 6 - part B can be downloaded here.

7.3.2 Literature and web resources

This chapter is mostly based on the following books and web resources:

7.3.3 Software requirements

To apply the approach described below, make sure that you have a TeX distribution installed on your computer. More information on this topic is available here. You will also need to install the R rticles package as demonstrated here.

7.3.4 Challenges to writing publications in R Markdown

Traditionally, journals accept manuscripts submitted in either Word (.doc) or LaTeX (.tex) formats. In addition, most journals request figures to be submitted as separate files (in e.g. .tiff or .eps formats). Online submission platforms collate your different files to produce a .pdf document, which is shared with reviewers for evaluation. In this context, although the .Rmd format is growing in popularity (due to its ability to “mesh” data analyses with data communication), this format is currently not accepted by journals. In this document, we discuss ways that have been developed to circumvent this issue and allow using the R Markdown approach for journal submissions.

7.3.5 Solution: Develop templates producing LaTeX files matching journal requirements!

As mentioned above, many journals support the LaTeX format (.tex) for manuscript submissions. While you can convert R Markdown (.Rmd) to LaTeX, different journals have different typesetting requirements and LaTeX styles. The solution is to develop scripts converting R Markdown files into LaTeX files that meet journal requirements.

7.3.6 The rticles package

Submitting scientific manuscripts written in R Markdown is still challenging; however the R rticles package was designed to simplify the creation of documents that conform to submission standards for academic journals (see Allaire et al., 2022). The package provides a suite of custom R Markdown LaTeX formats and templates for the following journals/publishers that are relevant to the EEB program:

  • Biometrics articles
  • Elsevier journal submissions
  • Frontiers articles
  • MDPI journal submissions
  • PeerJ articles
  • PNAS articles
  • Royal Society Open Science journal submissions
  • Sage journal submissions
  • Springer journal submissions
  • The R Journal articles
  • Taylor & Francis articles

An understanding of LaTeX is recommended, but not essential, in order to use this package. R Markdown templates may sometimes inevitably contain LaTeX code, but usually we can rely on the simpler R Markdown syntax and knitr functions to produce elements like figures, tables, and math equations.

7.3.6.1 Install rticles and use templates to write publications

  1. Install the rticles package:
  • Type the following code in the R console:
install.packages("rticles")
  • Or use the RStudio interface to install the package by clicking:
Tools -> Install Packages...

Then, type "rticles" in the window that opens to install the package.
  • If you wish to install the development version from GitHub (which often contains new article formats), you can do this (note that this code uses a function from the remotes package, Hester et al., 2019):
remotes::install_github("rstudio/rticles")
  2. Create a new R Markdown document in RStudio:
File -> New File -> R Markdown... 
  3. In the New R Markdown window, click on From Template in the left panel and select the journal style that you would like to follow for your article (here PNAS Journal Article; see Figure 7.1). Before clicking the OK button, provide a name for your project and set a location where the project will be saved (see Figure 7.1).
R Markdown window allowing to select templates following journal styles.

Figure 7.1: R Markdown window allowing to select templates following journal styles.

  4. Once you have completed this task, a folder will be created (and saved in the path that you provided) containing the files associated with the article submission process, and the template will automatically open in RStudio (see Figure 7.2).
Snapshot showing the template R Markdown and the associated folder created to generate your submission.

Figure 7.2: Snapshot showing the template R Markdown and the associated folder created to generate your submission.

  5. In this example, the following suite of files was created (see Figure 7.3):
  • Submission_PNAS.Rmd: R Markdown file that will be used to write your article.
  • pnas-sample.bib: BibTeX file to store your bibliography.
  • pnas.csl and pnas-new.cls: Files containing information about the formatting of citations and bibliography adapted to journal policies.
  • frog.png: A .png file used to show you how to include figures in the .Rmd document.
Snapshot showing the suite of files associated to your submission and saved in your project folder.

Figure 7.3: Snapshot showing the suite of files associated to your submission and saved in your project folder.

  6. Start writing your article:
  • Open Submission_PNAS.Rmd and update the YAML metadata section with information on authors, your abstract, summary and keywords (see Figure 7.4).
Update YAML metadata section with information on authors, your abstract, summary and keywords.

Figure 7.4: Update YAML metadata section with information on authors, your abstract, summary and keywords.

  • Write your manuscript by following the journal’s structure. You can take advantage of the R Markdown language in your manuscript (e.g. include R code chunks and outputs) and those will be converted by knitr and rticles packages during the compilation procedure.
  7. Compile your document and use both the .pdf and .tex files to submit your article (see Figure 7.5). The output files will be saved in your project folder.
Snapshot of the procedure to knit the document.

Figure 7.5: Snapshot of the procedure to knit the document.
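
Alternatively, the compilation can be launched from the R console (a minimal sketch, using the file name of this example):

### Compile the manuscript from the R console (sketch)
# Equivalent to pressing the Knit button in RStudio
rmarkdown::render("Submission_PNAS.Rmd")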

7.3.6.2 Exercise

To get familiar with this procedure, please practice by applying the above approach to different journal templates, favoring those of journals to which you might submit.

Enjoy writing scientific publications in R Markdown!

8 Chapter 7

8.1 Introduction

In this chapter, we will study protocols to import and gather data with R. As stated by Gandrud (2015) in chapter 6 of his book, how you gather your data directly impacts how reproducible your research will be. In this context, it is your duty to try your best to document every step of your data gathering process. Reproduction will be easier if all of your data gathering steps are tied together by your source code: independent researchers (and you) can then more easily regather the data. Regathering data will be easiest if running your code allows you to get all the way back to the raw data files (the rawer the better). Of course, this may not always be possible. You may need to conduct interviews or compile information from paper-based archives, for example. The best you can sometimes do is describe your data gathering process in detail. Nonetheless, R’s automated data gathering capabilities for internet-based information are extensive. Learning how to take full advantage of these capabilities greatly increases reproducibility and can save you considerable time and effort over the long run.

Gathering data can be done either by importing locally stored data sets (= files stored on your computer) or by importing data sets from the Internet. Usually, these data sets are saved in plain-text format (usually comma-separated values, or csv), making importing them into R a fairly straightforward task (using the read.csv() function). However, if data sets are not saved in plain-text format, users will have to start by converting them. In most cases, data sets will be saved in xls or xlsx formats and functions implemented in the readxl package (Wickham and Bryan, 2019), such as read_xlsx(), would be used. If your data sets were created by other statistical programs such as SPSS, SAS or Stata, these could be imported into R using functions from the foreign package (R Core Team, 2020). Finally, data sets could be saved in compressed documents, which will have to be processed prior to importing the data into R.
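
As a brief illustration (the file names below are hypothetical and the readxl package must be installed), these import functions are used as follows:

### Importing data sets stored in different formats (sketch; file names are hypothetical)
# Plain-text csv file
my_csv <- read.csv("Data/species_data.csv")
# Excel file (requires the readxl package)
library(readxl)
my_xlsx <- read_xlsx("Data/species_data.xlsx")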

Learning skills to import and gather data sets is especially important in the fields of Ecology, Evolution and Behavior since your research is highly likely to depend on large and complex data sets (see Figure 8.1). In addition, testing your hypotheses will rely on your ability to manage your data sets to test for complex interactions (e.g. do abiotic factors such as temperature drive selection processes in plants?; Figure 8.1).

Example of datasets involved in Ecology, Evolution and Behavior and their interactions.

Figure 8.1: Example of datasets involved in Ecology, Evolution and Behavior and their interactions.

Here, we are providing methods and R functions that are applied to manage your projects and gather, convert and clean your data. Ultimately these tools will be applied to document and produce the raw data at the basis of your research.

8.2 Learning outcomes

  • Creating RStudio projects to manage your reproducible project.
  • Importing csv files deposited on GitHub into R.
  • Learning about SHA-1 hash accession numbers and their usage to retrieve csv files associated to specific GitHub commit events.
  • Downloading a whole GitHub repository (in zip format) to your computer.
  • Listing all files in a compressed zip file.
  • Extracting and saving selected files from a zip file without decompressing it.
  • Manipulating files and directories to organize your project.

8.3 Managing projects in RStudio

RStudio projects (.Rproj) allow users to manage their project, more specifically by dividing their work into multiple contexts, each with their own working directory, workspace, history, and source documents.

8.3.1 Creating projects

RStudio projects are associated with R working directories. You can create an RStudio project:

  1. In a brand new directory.
  2. In an existing directory where you already have R code and data.
  3. By cloning a version control (Git or Subversion) repository.

We will be covering the last option during Chapter 11.

To create a new project in RStudio, do File > New Project... and then a window will pop up allowing you to select among the 3 options (see Figure 8.2).

Window allowing you to create a New RStudio project. See text for more details.

Figure 8.2: Window allowing you to create a New RStudio project. See text for more details.

When a new project is created RStudio:

  1. Creates a project file (with an .Rproj extension) within the project directory. This file contains various project options and can also be used as a shortcut for opening the project directly from the filesystem.
  2. Creates a hidden directory (named .Rproj.user) where project-specific temporary files (e.g. auto-saved source documents, window-state, etc.) are stored. This directory is also automatically added to .Rbuildignore, .gitignore, etc. if required.
  3. Loads the project into RStudio and displays its name in the Projects toolbar (which is located on the far right side of the main toolbar).

8.3.2 Working with projects

To open a project, go to your project directory and double-click on the project file (*.Rproj). When your project opens within RStudio, the following actions will be taken:

  • A new R session (process) is started.
  • The current working directory is set to the project directory.
  • Previously edited source documents are restored into editor tabs.
  • If you saved your workspace into a .RData file (see below), it will be loaded into your environment, allowing you to resume your work.

When you are within a project and choose to either Quit, close the project, or open another project the following actions will be taken:

  • .RData and/or .Rhistory are written to the project directory (if current options indicate they should be).
  • The list of open source documents is saved (so it can be restored next time the project is opened).
  • The R session is terminated.

8.3.3 Additional information

Additional information on RStudio projects can be found here:

https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects

8.4 Bioinformatic dependencies and data repositories

The R packages and GitHub repositories used in this chapter are listed below. Please make sure you have all of these set up before starting to read and complete the material presented in this chapter.

8.4.1 R packages

Most of the functions that we will be using in this chapter are base R functions installed (by default) in the utils package (R Core Team, 2019). However, the following R package (and its dependencies) has to be installed on your computer prior to starting this tutorial: repmis (Gandrud, 2016). In the event that you wanted to import xls or xlsx files into R, you would also have to install the readxl package (Wickham and Bryan, 2019).

8.4.2 GitHub repositories used in this chapter

8.4.2.1 EEB603_Reproducible_Science repository

This repository is dedicated to this course and is used to demonstrate the procedure to import csv files into R from GitHub repositories. More specifically, we will be importing different versions of the file (Timetable_EEB603_topic_tasks.csv) at the origin of the Timetable to study the procedure associated with file versioning in GitHub. It is located at this URL (Uniform Resource Locator):

8.4.2.2 Sagebrush_rooting_in_vitro_prop

This repository is associated to Barron et al. (2020) and is used to demonstrate the procedure to download whole GitHub repositories. We will be downloading the whole repository to your local computer and then extracting all csv files in the 01_Raw_Data/ folder and saving them on your local computer using R (see Figure 8.3 for more details on file content).

Snapshot showing files in the 01_Raw_Data folder that we will be targeting in the GitHub repository.

Figure 8.3: Snapshot showing files in the 01_Raw_Data folder that we will be targeting in the GitHub repository.

This approach is aimed at demonstrating how you could access raw data deposited on GitHub. The repository is located at this URL:

8.5 Overview of methods applied in this chapter

In this chapter, we will be focusing on learning procedures to import data from the internet, with an emphasis on GitHub. More precisely, we will be learning procedures to:

  1. Import comma-separated values (csv) files stored in GitHub repositories.
  2. Download and process whole GitHub repositories saved in compressed zip format.

8.6 List of R functions covered in this chapter

The list of major R functions covered in this chapter is provided in the Table below.

8.7 Getting started!

Before starting coding, please do the following:

  1. Open RStudio and create a New Project in a New Directory entitled EEB603_Chapter_06.
  2. Open the new project and create and save a new R script entitled 01_Data_gathering.R.

8.8 Importing csv files from GitHub into R

Before delving into this subject, there are several topics that we need to cover.

8.8.1 The SHA-1 hash accession number

With the growing popularity of GitHub, several authors are depositing their data sets on GitHub and you might want to access those for your research. Since Git and GitHub support version control, it is important to report the exact version of the file or data set that you have downloaded. To support this feature, each version of a file/data set is associated to a unique encrypted SHA-1 (Secure Hash Algorithm 1) hash accession number. This means that if the file changes (because the owner of the repository updated it), its SHA-1 hash accession number will change. This feature allows users to refer to the exact file/data set used in their analyses, therefore supporting reproducibility.

8.8.2 How to retrieve the URL of a GitHub csv file?

Before being able to import a csv file deposited on GitHub into R, you have to find its raw URL. In this section, we will demonstrate how to obtain this information using the Timetable_EEB603_topic_tasks.csv file located in the course’s GitHub repository.

To retrieve the raw URL associated with Timetable_EEB603_topic_tasks.csv, do the following:

  1. Navigate to the file location on the GitHub repository by clicking here (see Figure 8.4).

    Figure 8.4: Location of Timetable_EEB603_topic_tasks.csv on the EEB603_Reproducible_Science GitHub repository.

  2. Click on the Raw button on the right just above the file preview (Figure 8.4). This action should open a new window showing you the raw csv file (see Figure 8.5).

    Figure 8.5: Raw csv file associated to Timetable_EEB603_topic_tasks.csv. See text for more details.

  3. The raw URL for this csv file can be retrieved by copying the URL address (see Figure 8.5). In this case, the URL is as follows: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/master/Data/Timetable_EEB603_topic_tasks.csv
  4. The URL can be shortened for free using Bitly (https://bitly.com). In this case, the short URL pointing to our csv file is https://bit.ly/3BDxECl. We will be using this URL in the example below.

8.8.3 Importing GitHub csv into R

Now that we have retrieved and shortened the raw GitHub URL pointing to our target csv file, we can use the source_data() function implemented in the R package repmis (Gandrud, 2016) to download the file. The object returned by source_data() is a data.frame and can therefore be easily manipulated and saved on your local computer (using e.g. write.csv()). Retrieving a csv data set from a GitHub repository can be done as follows:

### ~~~ Load package ~~~
library(repmis)
### ~~~ Store raw short URL into object ~~~
urlcsv <- "https://bit.ly/3BDxECl"
### ~~~ Download/Import csv into R ~~~
csvTimeTable <- repmis::source_data(url = urlcsv)
## Downloading data from: https://bit.ly/3BDxECl
## SHA-1 hash of the downloaded data file is:
## e1feec6965718f2b9299d24119afa7aac310425c
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTable)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTable)
##                                                                                                                                                                       Topic
## 1                                                                                                                                                                  Syllabus
## 2                                                                                                                     Intro. Chapt. 1 & Chap. 2: The reproducibility crisis
## 3                                                                                                                                     Chap. 1 - part A: Learning the basics
## 4                                                                                            Chap. 1 - Complete part A & overview of part B: Tables, Figures and References
## 5                                                                                                                          Chap. 1 - part B: Tables, Figures and References
## 6                                                                                                                          Chap. 1 - part B: Tables, Figures and References
## 7                                                                                                        Chap. 1 - Complete part B and part C: Advanced R Markdown settings
## 8                                                                                                                             Chap. 1 - part D: User Defined Functions in R
## 9                                                                                                                             Chap. 1 - part D: User Defined Functions in R
## 10                                                                                                                                  Q&A and work on bioinformatic tutorials
## 11                                                                                    Chap. 3: A roadmap to implement reproducible science in Ecology, Evolution & Behavior
## 12                                                                                                                                Chap. 4: Open science and CARE principles
## 13                                                                                                                                                 Chap. 5: Data management
## 14                                                                                                         Complete Chap. 5: Data management and Chap. 5: Reproducible code
## 15                                                                                                                                               Chap. 6: Getting published
## 16 Chap. 6: Writing papers in R Markdown ([rticles](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#rticlespkg)) and Work on bioinformatic tutorials
## 17                                                                                                                                              Bioinfo. tutorial - Chap. 7
## 18                                                                                                                                              Bioinfo. tutorial - Chap. 7
## 19                                                                                                                                              Bioinfo. tutorial - Chap. 8
## 20                                                                                                                                              Bioinfo. tutorial - Chap. 8
## 21                                                                                                                                              Bioinfo. tutorial - Chap. 9
## 22                                                                                                                                              Bioinfo. tutorial - Chap. 9
## 23                                                                                                                                             Bioinfo. tutorial - Chap. 10
## 24                                                                                                                                             Bioinfo. tutorial - Chap. 10
## 25                                                                                                                                             Bioinfo. tutorial - Chap. 11
## 26                                                                                                                                             Bioinfo. tutorial - Chap. 11
## 27                                                                                                                                             Bioinfo. tutorial - Chap. 12
## 28                                                                                                                                             Bioinfo. tutorial - Chap. 12
## 29                                                                                                                                                Individual projects - Q&A
## 30                                                                                                                                                Individual projects - Q&A
## 31                                                                                                                                Oral presentations of individual projects
## 32                                                                                                                                Oral presentations of individual projects
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Homework
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## 2                                                                                                                                                                                                                                                                                                                                                                                                                       Due to internet outage on campus, we had to change schedule. Read Baker (2016) and prepare for discussing outcome of study
## 3                                                                                                                                                                                                                                                                                                                                                                                                                       [Install software](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#213_Install_R_Markdown_software)
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Read Chapt. 1 - part B (Set your R Markdown environment)
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Work on bioinfo. tutorials
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Work on bioinfo. tutorials
## 7  **1.** Complete material all the way to the end of section 2.3.6; **2.** Read sections 2.3.7 and 2.3.8 and make sure that the bibliography file ""Bibliography_Reproducible_Science_2.bib” is copied in your working directory; **3.** Read section 2.3.9 and make sure that the csl file ""AmJBot.csl” is copied in your working directory; **4.** Read section 2.4 (Chapter 1 - part C) and get accustomed with the concepts presented in this part. I will be giving a presentation on these concepts and you will be implementing the code.
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                   **1.** Complete section 2.4.6 of part C and **2.** Read part D until the end of section 2.5.8.
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                       **1.** Complete exercise in section 2.4.7 of part C and **2.** Read part D until the end of section 2.5.8.
## 10                                                                                                                                                                                                                                                                                                                                                                                      **1.** Sign up for bioinformatics tutorial (see Syllabus sections 5.1, 6 and 11 ) and **2.** Work with your group to organize your bioinformatics tutorial
## 11                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Read material presented in Chapter 3
## 12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Read material presented in Chapter 4
## 13                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Read material presented in Chapter 5 - Part A
## 14                                                                                                                                                                                                                                                                                                                                                                                     Read material presented in Chapter 5 - Part B + Download folder “Project_ID/“ on Google Drive under this path: Reproducible_Science > Exercises > Chapter_5
## 15                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Turn in tutorial for Chap. 7 & start individual reports
## 16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Upload tutorial of Chap. 7 on Google Drive
## 17                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Turn in tutorial for Chap. 8
## 18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Upload tutorial of Chap. 8 on Google Drive
## 19                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Turn in tutorial for Chap. 9
## 20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Upload tutorial of Chap. 9 on Google Drive
## 21                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Turn in tutorial for Chap. 10
## 22                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Upload tutorial of Chap. 10 on Google Drive
## 23                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Turn in tutorial for Chap. 11
## 24                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Upload tutorial of Chap. 11 on Google Drive
## 25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Turn in tutorial for Chap. 12
## 26                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Upload tutorial of Chap. 12 on Google Drive
## 27                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Students work on ind. projects: Review literature (see Syllabus)
## 28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Students work on ind. projects: Data management workflow
## 29                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Students work on ind. projects: Data management workflow
## 30                                                                                                                                                                                                                                                                                                                                                                                                                 Turn in ind. projects (on [Google Drive](https://drive.google.com/drive/folders/1MZt5kNKusCv6OeZpjuuPxUiqBaoQVQAc?usp=sharing))
## 31                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##                                                                                                                                                                                                                                                            URL
## 1                                                                                                                                                                              [Syllabus](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html)
## 2                                                            [Chapter 1](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#2_Chapter_1) & [Chapter 2](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#3_Chapter_2)
## 3                                                                                                                                                                       [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA)
## 4               [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA) & [Part B: Set your R Markdown environment](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#233_Set_your_R_Markdown_environment)
## 5                                                                                                                                   [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 6                                                                                                                                   [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 7  [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References) & [Part C](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#14_PART_C:_Advanced_R_and_R_Markdown_settings)
## 8                                                                                                                                      [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 9                                                                                                                                      [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 10                                                                                                     [Publications and Resources for bioinformatic tutorials](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html#7_Publications__Textbooks)
## 11                                                                                                                                                             [Chapter 3](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#4_Chapter_3)
## 12                                                                                                                                                             [Chapter 4](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#5_Chapter_4)
## 13                                                                                                                                                         [Chapter 5 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partDM)
## 14                                                                                                                                                        [Chapter 5 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#repcode)
## 15                                                                                                                                                         [Chapter 6 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#getpub)
## 16                                                                                                                                                        [Chapter 6 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#writpub)
## 17                                                                                                                                                                                                                                                            
## 18                                                                                                                                                                                                                                                            
## 19                                                                                                                                                                                                                                                            
## 20                                                                                                                                                                                                                                                            
## 21                                                                                                                                                                                                                                                            
## 22                                                                                                                                                                                                                                                            
## 23                                                                                                                                                                                                                                                            
## 24                                                                                                                                                                                                                                                            
## 25                                                                                                                                                                                                                                                            
## 26                                                                                                                                                                                                                                                            
## 27                                                                                                                                                                                                                                                            
## 28                                                                                                                                                                                                                                                            
## 29                                                                                                                                                                                                                                                            
## 30                                                                                                                                                                                                                                                            
## 31                                                                                                                                                                                                                                                            
## 32

source_data() will always download the most recent version of the file from the master branch and return its unique SHA-1 hash. However, you can also download a prior version of the file by using the raw URL associated with a previous commit, as shown below.

8.8.4 Retrieving csv from previous commit

Retrieving the csv file associated with a specific commit can be done by applying the following approach (here using the same example as above):

  1. To find the raw URL of the file at a particular commit, navigate to its location on GitHub by clicking here.
  2. Click the History button. This action will take you to a page listing all of the file’s versions (see Figure 8.6).

    Figure 8.6: Page showing commit history associated to Timetable_EEB603_topic_tasks.csv.

  3. Click on the button next to the version of the file that you want to use (here the commit from Aug 23, 2021; Figure 8.6). This action will take you to the version of the file at that point in history.

    Figure 8.7: Page showing prior commit associated to Timetable_EEB603_topic_tasks.csv.

  4. Clicking on the Raw button will load the csv file and allow you to retrieve the URL (as done above).
  5. Download the file using the following R code:
### ~~~ Store raw URL into object ~~~
urlcsvold <- "https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv"
### ~~~ Download/Import csv into R ~~~
csvTimeTableOld <- repmis::source_data(url = urlcsvold)
## Downloading data from: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv
## SHA-1 hash of the downloaded data file is:
## e7071b1e85ada38d6b2cf2a93bbb43c2b96a331f
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTableOld)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTableOld)
##                                              Topic
## 1                                         Syllabus
## 2              Example of a bioinformatic tutorial
## 3                      Chap. 1 - R Markdown part A
## 4                      Chap. 1 - R Markdown part B
## 5                      Chap. 1 - R Markdown part C
## 6                 Chap. 1 - User-defined functions
## 7                                Chap. 1 - Wrap-up
## 8                                          Chap. 2
## 9                                          Chap. 3
## 10                        Chap. 4: Data management
## 11                      Chap. 4: Reproducible code
## 12                      Chap. 5: Getting published
## 13 Chap. 5: Writing papers in R Markdown (rticles)
## 14                                             TBD
## 15                     Bioinfo. tutorial - Chap. 6
## 16                     Bioinfo. tutorial - Chap. 6
## 17                     Bioinfo. tutorial - Chap. 7
## 18                     Bioinfo. tutorial - Chap. 7
## 19                     Bioinfo. tutorial - Chap. 8
## 20                     Bioinfo. tutorial - Chap. 8
## 21                     Bioinfo. tutorial - Chap. 9
## 22                     Bioinfo. tutorial - Chap. 9
## 23                    Bioinfo. tutorial - Chap. 10
## 24                    Bioinfo. tutorial - Chap. 10
## 25                    Bioinfo. tutorial - Chap. 11
## 26                    Bioinfo. tutorial - Chap. 11
## 27                       Individual projects - Q&A
## 28                              Oral presentations
## 29                              Oral presentations
## 30                              Oral presentations
##                                                       Task Deadline
## 1                                                                  
## 2                                                                  
## 3                                        Work on bioinfo. tutorials
## 4                                        Work on bioinfo. tutorials
## 5                                        Work on bioinfo. tutorials
## 6                                        Work on bioinfo. tutorials
## 7                                        Work on bioinfo. tutorials
## 8     Read Baker (2016) and prepare for discussing outcome of study
## 9                                        Work on bioinfo. tutorials
## 10                                       Work on bioinfo. tutorials
## 11                                       Work on bioinfo. tutorials
## 12                                       Work on bioinfo. tutorials
## 13          Turn in tutorial for Chap. 6 & start individual reports
## 14                       Upload tutorial of Chap. 6 on Google Drive
## 15                                     Turn in tutorial for Chap. 7
## 16                       Upload tutorial of Chap. 7 on Google Drive
## 17                                     Turn in tutorial for Chap. 8
## 18                       Upload tutorial of Chap. 8 on Google Drive
## 19                                     Turn in tutorial for Chap. 9
## 20                       Upload tutorial of Chap. 9 on Google Drive
## 21                                    Turn in tutorial for Chap. 10
## 22                      Upload tutorial of Chap. 10 on Google Drive
## 23                                    Turn in tutorial for Chap. 11
## 24                      Upload tutorial of Chap. 11 on Google Drive
## 25 Students work on ind. projects: Review literature (see Syllabus)
## 26         Students work on ind. projects: Data management workflow
## 27                                    Students work on ind. project
## 28                                    Work on reports/presentations
## 29                                             Turn in ind. reports
## 30
  6. You can see here that the SHA-1 hash differs from the one returned by our previous download, confirming that the two files are different.
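
As an additional sanity check, you can compare the two downloaded objects directly in R. The sketch below uses only base R functions and assumes csvTimeTable and csvTimeTableOld are both still in your workspace.

### ~~~ Sketch: compare current and archived versions ~~~
# Dimensions of the current and archived versions
dim(csvTimeTable)
dim(csvTimeTableOld)
# identical() returns FALSE when the two versions differ
identical(csvTimeTable, csvTimeTableOld)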

8.9 Downloading and processing GitHub repositories

Before being able to download a GitHub repository and work with its files in R, you have to find the URL pointing to the compressed zip file containing all the files of the target repository. In this section, we will demonstrate how to obtain this information using the Sagebrush_rooting_in_vitro_prop GitHub repository. As mentioned above, we will download the whole repository and then extract all the csv files located in the 01_Raw_Data/ folder (see Figure 8.3).

8.9.1 How to retrieve the URL for the GitHub repository?

To retrieve the URL associated with the compressed zip file containing all files of the repository, do the following:

  1. Navigate to the GitHub repository page by clicking here (see Figure 8.8).

    Figure 8.8: GitHub repository page for Sagebrush_rooting_in_vitro_prop.

  2. To copy the URL pointing to the compressed zip file, do the following actions:
  • Click on the green Code button.
  • Navigate to the Download ZIP entry.
  • Right-click on Download ZIP to copy the link (as shown in Figure 8.8).
  3. The copied URL (https://github.com/svenbuerki/Sagebrush_rooting_in_vitro_prop/archive/refs/heads/master.zip) will serve as input for downloading the repository on your local computer (see below).

8.9.2 Downloading GitHub repository on your computer

Now that we have secured the URL pointing to the compressed zip file of the target repository (by copying it), we will use this URL together with the base R download.file() function to download the file to our local computer. Since compressed files can be large, we also provide some code to check whether the file already exists on your computer before downloading it.

### ~~~ Store URL in object ~~~
### Paste the URL that you copied in the previous section here
URLrepo <- "https://github.com/svenbuerki/Sagebrush_rooting_in_vitro_prop/archive/refs/heads/master.zip"
### ~~~ Download the repository from GitHub ~~~
### Arguments:
###  - url: URLrepo
###  - destfile: path and name of the destination file on your computer
###    (YOU HAVE TO ADJUST THE PATH TO YOUR COMPUTER)
# First check if the file exists; if yes, report that it was already
# downloaded, else proceed with the download
if (file.exists("Data/GitHubRepoSagebrush.zip") == TRUE) {
    # File already exists!
    print("file already exists and doesn't need to be downloaded!")
} else {
    # Download the file
    print("Downloading GitHub repository!")
    download.file(url = URLrepo, destfile = "Data/GitHubRepoSagebrush.zip")
}
## [1] "file already exists and doesn't need to be downloaded!"

8.9.3 Extracting raw data from compressed GitHub repository

Compressed files can be quite large and you might want to avoid decompressing them entirely, instead accessing the target files and decompressing only those. Here, we practice this approach using GitHubRepoSagebrush.zip and targeting the csv files in the 01_Raw_Data/ folder.

8.9.3.1 What is the size of the zip file?

To estimate the size (in bytes) of a file, you can use the base R function file.size() as follows:

### ~~~ Infer file size of GitHubRepoSagebrush.zip ~~~
# Transform and round file size from bytes to Mb
ZipSize <- round(file.size("Data/GitHubRepoSagebrush.zip")/1e+06,
    2)
print(paste("The zip file size is", ZipSize, "Mb", sep = " "))
## [1] "The zip file size is 22.26 Mb"

8.9.3.2 Decompressing and saving csv files in 01_Raw_Data/

Finally, we can now i) list all the files in the zip archive, ii) identify the csv files located in 01_Raw_Data/, and iii) save these files on our local computer in a folder entitled 01_Raw_Data/. These files will then constitute the raw data for your subsequent analyses.

### ~~~ List all files in zip file without decompressing it
### ~~~
filesZip <- as.character(unzip("Data/GitHubRepoSagebrush.zip",
    list = TRUE)$Name)
### ~~~ Identify files in 01_Raw_Data/ that are csv ~~~
### Use grepl() to match both criteria ("\\.csv$" matches names ending in .csv)
targetF <- which(grepl("01_Raw_Data/", filesZip) & grepl("\\.csv$",
    filesZip))
# Subset files from filesZip to only get our target files
rawcsvfiles <- filesZip[targetF]
# print list of target files
print(rawcsvfiles)
## [1] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/1_block_8_12_2020 - 1_block.csv"  
## [2] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/2_block_8_15_2020 - 2_block.csv"  
## [3] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/3_block_8_15_2020 - 3_block.csv"  
## [4] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/4_block_8_16_2020 - 4_block.csv"  
## [5] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/5_block_8_19_2020 - 5_block.csv"  
## [6] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Phenotypes_sagebrush_in_vitro.csv"
## [7] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Survival_height_clones.csv"
### ~~~ Create local directory to save csv files ~~~
### Check if the folder already exists; if not, create it
output_dir <- "Data/01_Raw_Data/"
if (dir.exists(output_dir)) {
    print(paste0("Dir ", output_dir, " already exists!"))
} else {
    print(paste0("Created ", output_dir))
    dir.create(output_dir)
}
## [1] "Dir Data/01_Raw_Data/ already exists!"
### ~~~ Save csv files in output_dir ~~~
### Use a loop to read each csv file in and then save it in output_dir
for (i in 1:length(rawcsvfiles)) {
    ### ~~~ Decompress and read in csv file ~~~
    tempcsv <- read.csv(unz("Data/GitHubRepoSagebrush.zip", rawcsvfiles[i]))
    ### ~~~ Write file out ~~~
    # Extract file name
    csvName <- strsplit(rawcsvfiles[i], split = "01_Raw_Data/")[[1]][2]
    # Write csv file in output_dir
    write.csv(tempcsv, file = paste0(output_dir, csvName))
}

We can verify that all the files are in the newly created directory on your computer by listing them as follows (compare your results with files shown in Figure 8.3):

# List all the files in output_dir (on your local computer)
list.files(paste0(output_dir))
## [1] "1_block_8_12_2020 - 1_block.csv"   "2_block_8_15_2020 - 2_block.csv"  
## [3] "3_block_8_15_2020 - 3_block.csv"   "4_block_8_16_2020 - 4_block.csv"  
## [5] "5_block_8_19_2020 - 5_block.csv"   "Phenotypes_sagebrush_in_vitro.csv"
## [7] "Survival_height_clones.csv"

9 References

Allaire, J., Y. Xie, C. Dervieux, R Foundation, H. Wickham, Journal of Statistical Software, R. Vaidyanathan, et al. 2022. Rticles: Article formats for r markdown. Available at: https://github.com/rstudio/rticles.
Allen, L., J. Scott, A. Brand, M. Hlava, and M. Altman. 2014. Publishing: Credit where credit is due. Nature 508: 312–313.
Baker, M. 2016. 1,500 scientists lift the lid on reproducibility. Nature 533: 452–454. Available at: https://doi.org/10.1038/533452a.
Barron, R., P. Martinez, M. Serpe, and S. Buerki. 2020. Development of an in vitro method of propagation for Artemisia tridentata subsp. tridentata to support genome sequencing and genotype-by-environment research. Plants 9: 1717. Available at: https://www.mdpi.com/2223-7747/9/12/1717.
British Ecological Society ed. 2014a. A guide to data management in ecology and evolution. British Ecological Society.
British Ecological Society ed. 2014b. A guide to getting published in ecology and evolution. British Ecological Society.
British Ecological Society ed. 2014c. A guide to reproducible code in ecology and evolution. British Ecological Society.
Cargill, M., and P. O’Connor. 2011. Writing scientific research articles: Strategy and steps. Wiley-Blackwell.
Carroll, S.R., E. Herczog, M. Hudson, K. Russell, and S. Stall. 2021. Operationalizing the CARE and FAIR principles for indigenous data futures. Scientific Data 8: 108. Available at: https://doi.org/10.1038/s41597-021-00892-0.
Fox, C.W., and C.S. Burns. 2015. The relationship between manuscript title structure and success: Editorial decisions and citation performance for an ecological journal. Ecology and Evolution 5: 1970–1980. Available at: https://onlinelibrary.wiley.com/doi/abs/10.1002/ece3.1480.
Gandrud, C. 2016. Repmis: Miscellaneous tools for reproducible research. Available at: https://CRAN.R-project.org/package=repmis.
Gandrud, C. 2015. Reproducible Research with R and RStudio. CRC Press.
Groom, Q., L. Weatherdon, and I.R. Geijzendorffer. 2017. Is citizen science an open science in the case of biodiversity observations? Journal of Applied Ecology 54: 612–617. Available at: https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/1365-2664.12767.
Hester, J., G. Csárdi, H. Wickham, W. Chang, M. Morgan, and D. Tenenbaum. 2019. Remotes: R package installation from remote repositories, including GitHub. Available at: https://github.com/r-lib/remotes#readme.
Khoo, S.Y.-S. 2019. Article processing charge hyperinflation and price insensitivity: An open access sequel to the serials crisis. LIBER Quarterly: The Journal of the Association of European Research Libraries 29: 1–18. Available at: https://liberquarterly.eu/article/view/10729.
Munafò, M.R., B.A. Nosek, D.V.M. Bishop, K.S. Button, C.D. Chambers, N.P. du Sert, U. Simonsohn, et al. 2017. A manifesto for reproducible science. Nature Human Behaviour 1: 0021.
Nosek, B.A., G. Alter, G.C. Banks, D. Borsboom, S.D. Bowman, S.J. Breckler, S. Buck, et al. 2015. Promoting an open research culture. Science 348: 1422–1425. Available at: http://science.sciencemag.org/content/348/6242/1422.
Nosek, B.A., C.R. Ebersole, A.C. DeHaven, and D.T. Mellor. 2018. The preregistration revolution. Proceedings of the National Academy of Sciences 115: 2600–2606. Available at: http://www.pnas.org/content/115/11/2600.
Nosek, B.A., J.R. Spies, and M. Motyl. 2012. Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science 7: 615–631. Available at: https://doi.org/10.1177/1745691612459058.
R Core Team. 2020. Foreign: Read data stored by ’minitab’, ’s’, ’SAS’, ’SPSS’, ’stata’, ’systat’, ’weka’, ’dBase’, ... Available at: https://CRAN.R-project.org/package=foreign.
R Core Team. 2019. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org/.
RStudio Team. 2020. RStudio: Integrated development environment for r. RStudio, PBC., Boston, MA. Available at: http://www.rstudio.com/.
Smith, J.F., T.H. Parker, S. Nakagawa, J. Gurevitch, and the Tools for Transparency in Ecology and Evolution (TTEE) Working Group. 2016. Promoting transparency in evolutionary biology and ecology. Systematic Botany 41: 495–497. Available at: http://www.bioone.org/doi/abs/10.1600/036364416X692262.
Troudet, J., R. Vignes-Lebbe, P. Grandcolas, and F. Legendre. 2018. The increasing disconnection of primary biodiversity data from specimens: How does it happen and how to handle it? Systematic Biology syy044. Available at: http://dx.doi.org/10.1093/sysbio/syy044.
Wagenknecht, K., T. Woods, F.G. Sanz, M. Gold, A. Bowser, S. Rüfenacht, L. Ceccaroni, and J. Piera. 2021. EU-Citizen.Science: A platform for mainstreaming citizen science and open science in Europe. Data Intelligence 3: 136–149. Available at: https://doi.org/10.1162/dint_a_00085.
Wickham, H. 2014. Advanced R. Taylor & Francis. Available at: https://books.google.com/books?id=PFHFNAEACAAJ.
Wickham, H., and J. Bryan. 2019. Readxl: Read excel files. Available at: https://CRAN.R-project.org/package=readxl.
Wilkinson, M.D., M. Dumontier, Ij.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, et al. 2016. The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3: 160018. Available at: https://doi.org/10.1038/sdata.2016.18.
Xie, Y. 2023a. Bookdown: Authoring books and technical documents with r markdown. Available at: https://CRAN.R-project.org/package=bookdown.
Xie, Y. 2016. Bookdown: Authoring books and technical documents with R markdown. Chapman; Hall/CRC, Boca Raton, Florida. Available at: https://bookdown.org/yihui/bookdown.
Xie, Y. 2015. Dynamic documents with R and knitr. 2nd ed. Chapman; Hall/CRC, Boca Raton, Florida. Available at: https://yihui.org/knitr/.
Xie, Y. 2023b. Knitr: A general-purpose package for dynamic report generation in r. Available at: https://yihui.org/knitr/.
Xie, Y., J.J. Allaire, and G. Grolemund. 2018. R markdown: The definitive guide. Chapman; Hall/CRC, Boca Raton, Florida. Available at: https://bookdown.org/yihui/rmarkdown.

Appendices

A Appendix 1

Citations of all R packages used to generate this report.

[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.21. 2023. https://CRAN.R-project.org/package=rmarkdown.

[2] J. Allaire, Y. Xie, C. Dervieux, et al. rticles: Article Formats for R Markdown. R package version 0.23. 2022. https://github.com/rstudio/rticles.

[3] S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R. R package version 2.0.3. 2022. https://CRAN.R-project.org/package=magrittr.

[4] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.

[5] J. Cheng, C. Sievert, B. Schloerke, et al. htmltools: Tools for HTML. R package version 0.5.3. 2022. https://github.com/rstudio/htmltools.

[6] R. Francois. bibtex: Bibtex Parser. R package version 0.4.2.3. 2020. https://github.com/romainfrancois/bibtex.

[7] C. Glur. data.tree: General Purpose Hierarchical Data Structure. R package version 1.0.0. 2020. http://github.com/gluc/data.tree.

[8] R. Iannone. DiagrammeR: Graph/Network Visualization. R package version 1.0.9. 2022. https://github.com/rich-iannone/DiagrammeR.

[9] M. C. Koohafkan. kfigr: Integrated Code Chunk Anchoring and Referencing for R Markdown Documents. R package version 1.2.1. 2021. https://github.com/mkoohafkan/kfigr.

[10] Y. Qiu. prettydoc: Creating Pretty Documents from R Markdown. R package version 0.4.1. 2021. https://github.com/yixuan/prettydoc.

[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022. https://www.R-project.org/.

[12] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.1. 2021. https://CRAN.R-project.org/package=formattable.

[13] H. Wickham, J. Bryan, and M. Barrett. usethis: Automate Package and Project Setup. R package version 2.1.6. 2022. https://CRAN.R-project.org/package=usethis.

[14] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.2. 2023. https://CRAN.R-project.org/package=dplyr.

[15] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.4. 2022. https://CRAN.R-project.org/package=devtools.

[16] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. ISBN 978-1138700109. Boca Raton, Florida: Chapman and Hall/CRC, 2016. https://bookdown.org/yihui/bookdown.

[17] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.33. 2023. https://CRAN.R-project.org/package=bookdown.

[18] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[19] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[20] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.42. 2023. https://yihui.org/knitr/.

[21] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.

[22] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.24. 2022. https://github.com/rstudio/DT.

[23] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.

[24] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.3.4. 2021. https://CRAN.R-project.org/package=kableExtra.

B Appendix 2

Version information about R, the operating system (OS), and attached or loaded R packages. This appendix was generated using sessionInfo().

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] repmis_0.5           rticles_0.23         DiagrammeR_1.0.9    
##  [4] DT_0.24              data.tree_1.0.0      kfigr_1.2.1         
##  [7] devtools_2.4.4       usethis_2.1.6        bibtex_0.4.2.3      
## [10] knitcitations_1.0.12 htmltools_0.5.3      prettydoc_0.4.1     
## [13] magrittr_2.0.3       dplyr_1.1.2          kableExtra_1.3.4    
## [16] formattable_0.2.1    bookdown_0.33        rmarkdown_2.21      
## [19] knitr_1.42          
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.5         tidyr_1.3.0        sass_0.4.2         pkgload_1.3.2.1   
##  [5] jsonlite_1.8.7     viridisLite_0.4.2  R.utils_2.12.0     bslib_0.4.0       
##  [9] shiny_1.7.2        highr_0.9          yaml_2.3.7         remotes_2.4.2     
## [13] sessioninfo_1.2.2  pillar_1.9.0       glue_1.6.2         digest_0.6.33     
## [17] RColorBrewer_1.1-3 promises_1.2.0.1   rvest_1.0.3        RefManageR_1.3.0  
## [21] colorspace_2.1-0   R.oo_1.25.0        httpuv_1.6.5       plyr_1.8.7        
## [25] pkgconfig_2.0.3    purrr_1.0.2        xtable_1.8-4       scales_1.2.1      
## [29] webshot_0.5.4      processx_3.8.2     svglite_2.1.0      later_1.3.0       
## [33] tibble_3.2.1       generics_0.1.3     ellipsis_0.3.2     withr_2.5.0       
## [37] cachem_1.0.6       cli_3.6.1          crayon_1.5.2       mime_0.12         
## [41] memoise_2.0.1      evaluate_0.21      ps_1.7.5           R.methodsS3_1.8.2 
## [45] fs_1.6.3           fansi_1.0.4        R.cache_0.16.0     xml2_1.3.5        
## [49] pkgbuild_1.3.1     data.table_1.14.2  profvis_0.3.7      tools_4.2.0       
## [53] prettyunits_1.1.1  formatR_1.12       lifecycle_1.0.3    stringr_1.5.0     
## [57] munsell_0.5.0      callr_3.7.3        compiler_4.2.0     jquerylib_0.1.4   
## [61] systemfonts_1.0.4  rlang_1.1.1        rstudioapi_0.14    visNetwork_2.1.0  
## [65] htmlwidgets_1.5.4  crosstalk_1.2.0    miniUI_0.1.1.1     curl_5.0.0        
## [69] R6_2.5.1           lubridate_1.8.0    fastmap_1.1.0      utf8_1.2.3        
## [73] stringi_1.7.12     Rcpp_1.0.11        vctrs_0.6.3        tidyselect_1.2.0  
## [77] xfun_0.36          urlchecker_1.0.1

  1. What is a Hypothesis? A hypothesis is a tentative, testable answer to a scientific question. Once a scientist has a scientific question, they perform a literature review to find out what is already known on the topic. This information is then used to form a tentative answer to the scientific question. Keep in mind that the hypothesis also has to be testable, since the next step is to do an experiment to determine whether or not the hypothesis is right! A hypothesis leads to one or more predictions that can be tested by experimenting. Predictions often take the shape of “If ____, then ____” statements, but do not have to. Predictions should include both an independent variable (the factor you change in an experiment) and a dependent variable (the factor you observe or measure in an experiment). A single hypothesis can lead to multiple predictions.↩︎

  2. GBIF — the Global Biodiversity Information Facility — is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.↩︎

  3. Postdiction involves explanation after the fact.↩︎