Please see this webpage for more details on the Syllabus. In addition, the presentation slides can be downloaded here.
In this chapter, we will introduce the bioinformatic tools implemented in RStudio (RStudio Team, 2020) to write and disseminate reproducible reports. More specifically, we will learn procedures to link and execute data and code in a unified environment (see Figure 2.1). This chapter focuses on learning the R Markdown syntax and protocols for including text, code, figures, tables and a bibliography in a document. This document is then compiled into an output file (in either pdf, HTML or Word format) that can be shared to disseminate your research. This tutorial provides students with the minimum knowledge required to complete their bioinformatic tutorials (PART 2) and individual projects (PART 3). The chapter is subdivided into the following parts:
To make the most of class time, we will not cover Part E in class, but students are encouraged to study this material on their own. This material could also provide examples and ideas for your assignments.
Learning outcomes associated with this chapter are provided at the top of each part.
The final files associated with Chapter 1 are all deposited on the instructor’s GitHub account at this URL:
Although less used in PART A, a set of files is provided to support teaching of the material presented in this chapter. These files are deposited in the shared Google Drive under this path:
Reproducible_Science/Chapters/Chapter_1/Tutorial_files
If you have difficulties accessing the Google Drive, click here.
Files are as follows:
- EEB603_Syllabus_BUERKI.Rmd: This is the .Rmd file used to compile the syllabus of this class. It provides a good source of information for the syntax and protocols described in this tutorial.
- Bibliography_Reproducible_Science_2.bib: This file contains references cited in BibTeX format.
- AmJBot.csl: This citation style language (CSL) file allows formatting citations and the bibliography following the citation style of the American Journal of Botany.
- Bioinformatic workflow_PART2.pdf: A pdf file containing the bioinformatic workflow taught in this class. It will be used to learn how to incorporate a figure into an R Markdown file.
Software and packages required to perform this tutorial are detailed below. Students should install this software and these packages on their personal computers to be able to complete this course. Additional packages might need to be installed; the instructor will provide guidance on how to install those as part of the forthcoming tutorials.
- R packages: bookdown, knitr and R Markdown. Use the following R command to install those packages: install.packages(c("bookdown", "knitr", "rmarkdown"))
- A TeX distribution is required to knit documents in pdf format. Please install MiKTeX on Windows, MacTeX on OS X and TeXLive on Linux. For this class, you are not requested to install this software, which takes significant hard drive space and is harder to operate on Windows OS.
NOTE: The instructor is using the following version of RStudio: Version 2022.12.0+353 (2022.12.0+353). If your computer is experiencing issues running the latest version of the software, you can install previous versions here.
RStudio (RStudio Team, 2020) is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard RGUI, but it is considerably more user-friendly. It has more drop-down menus, windows with multiple tabs, and many customization options (see Figure 2.2). Detailed information on using RStudio can be found at RStudio’s website.
Please find below URLs to webpages that provide key information for chapter 1:
In this part, we will provide a survey of the procedures to create and render (or knit) your first R Markdown document.
This tutorial is devoted to part A of chapter 1 and provides students with opportunities to learn procedures to:
Markdown is a simple formatting syntax used for authoring HTML, PDF, and MS Word documents, which is implemented in the rmarkdown package. An R Markdown document is usually subdivided into three sections (see Figure 2.3):
To create an R Markdown document, execute the following steps in RStudio:
1. Go to File -> New File -> R Markdown....
2. Provide a title and author, and select the Default output format (see Figure 2.4).
3. If you want to knit your document in pdf format, a version of the TeX program has to be installed on your computer (see Figure 2.4).
4. Save your .Rmd document (using File -> Save As...). Save this file in a new folder devoted to the project (Warning: knitting the document will generate several files).

To render or knit your R Markdown document/script into the format specified in the YAML metadata section, do the following steps in RStudio:
1. Press the Knit button (Figure 2.3) in the upper bar of your window to render the document.
2. Knitting is applied to your saved .Rmd file. You can track progress in the R Markdown console. If the knitting fails, error messages will be printed in the R Markdown console (usually including information on which line of the script the error occurred, although this might not always be the case). Error messages are very useful to debug your R Markdown document.

When you knit your document, R Markdown will feed the .Rmd file to the R knitr package, which executes all of the code chunks and creates a new markdown (.md) document. This latter document includes the code and its output (Figure 2.5). The markdown file generated by knitr is then processed by the Pandoc program, which is responsible for creating the finished output format (Figure 2.5).
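As an alternative to the Knit button, the same knitr/Pandoc pipeline can be triggered from the R console with rmarkdown::render(); the file name below is only a placeholder for your own .Rmd file:
# Knit an R Markdown file from the R console (file name is an example)
rmarkdown::render("my_report.Rmd")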
We will focus here on learning the syntax and protocols to produce:
Additional syntax is described in the R Markdown Reference Guide. You can access this document as follows in RStudio:
Help -> Cheatsheets -> R Markdown Reference Guide
Notice: The Cheatsheets section also gives access to additional supporting documents related to R Markdown and data manipulation. Those documents will be very useful for this class.
Please find below the syntax to create headers (3 levels):
Syntax:
The "#" refers to the level of the header
# Header 1
## Header 2
### Header 3
Markdown does not have a syntax to add comments to your text, but this functionality can be borrowed from HTML as follows:
# HTML syntax to comment inside text
<!-- COMMENT -->
We usually use this syntax to, for example, highlight where the text needs improvement or to record editing points. Comments will not be visible when your document is knitted.
You can learn more about this HTML syntax on this webpage.
There are two types of lists:
Syntax:
* unordered list
* item 2
+ sub-item 1
+ sub-item 2
Note: For each sub-level, indent by two tabs to create the hierarchy.
Output:
Syntax:
1. ordered list
2. item 2
+ sub-item 1
+ sub-item 2
Output:
The following syntax will render text in italics or bold:
#Syntax for italics
*italics*
#Syntax for bold
**bold**
Adding hyperlinks to your documents supports reproducible science and can easily be done with the following syntax:
#Syntax to add hyperlink
[text](link)
#Example to provide hyperlink to RStudio
[RStudio](https://www.rstudio.com)
When knitted, the RStudio example above turns into a clickable hyperlink: RStudio.
One of the most exciting features of working with the R Markdown format is the implementation of functions allowing the output of R code to be directly “plugged” into the compiled document (see Figure 2.5). In other words, when you compile your .Rmd file, R Markdown will automatically run and process each code chunk and inline code expression (see below) and embed their results in your final document. If the output of the code is a table or a figure, you will be able to assign a label to this item (by adding information in the code chunk; see part B) and refer to it (= cross-referencing) in your pdf or html document. Cross-referencing is possible thanks to the \@ref() function implemented in the R bookdown package.

A code chunk can easily be inserted in your document as follows:
- press the Insert button in the editor toolbar, or
- use the keyboard shortcut Ctrl + Alt + I (Cmd + Option + I on macOS).
By default the code chunk will expect R code, but you can also insert code chunks supporting different computer languages (e.g. Bash, Python).
Chunk output can be customized with knitr options, i.e. arguments set in the {} of a chunk header. In the examples displayed in Figure 2.6 five arguments are used:
- include = FALSE prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- echo = FALSE prevents code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
- message = FALSE prevents messages that are generated by code from appearing in the finished file.
- warning = FALSE prevents warnings that are generated by code from appearing in the finished file.
- fig.cap = "..." adds a caption to graphical results.
We will delve more into chunk options in part C of chapter 1, but in the meantime please see the R Markdown Reference Guide for more details.
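For instance, a hypothetical chunk header combining several of these options (the label mychunk and the caption text are placeholders, not part of the tutorial files) could look as follows:
{r mychunk, echo = FALSE, warning = FALSE, message = FALSE, fig.cap = "Caption describing the graphical output."}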
Code results can also be inserted directly into the text of a .Rmd file by enclosing the code between backticks and prefixing it with r (i.e. `r code`).
R Markdown will always:
- display the results of inline code, but not the code;
- apply relevant text formatting to the results.
As a result, inline output is indistinguishable from the surrounding text.
Warning: Inline expressions do not take knitr options and are therefore less versatile. We usually use inline code to report simple statistics (e.g. 4x4 = 16).
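As a minimal illustration, a sentence such as the following could be written in your .Rmd file (the backticks are part of the inline code syntax):
The square of 4 is `r 4*4`.
Once knitted, it renders as “The square of 4 is 16.”, with the R code itself hidden.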
There are three ways to access spell checking in an R Markdown document in RStudio:
- use the spell-check button (ABC with a tick) in the editor toolbar;
- go to Edit > Check Spelling...;
- press the F7 key.
Students will work individually to complete the following exercises:
1. Create a new *.Rmd file entitled Exercises chapter 1: part A and select HTML as output format.
2. Save this file as Exe_chap1_partA.Rmd in a folder called Exercises located in:
Reproducible_Science/Chapters/Chapter_1
To further learn syntax and protocols, please look at associated files provided by the instructor (see above for more details).
The aim of this tutorial is to provide students with the expertise to generate reproducible reports using bookdown (Xie, 2016, 2023a) and allied R packages (see Appendix 1 for a full list). Unlike the functions implemented in the R rmarkdown package (Xie et al., 2018), which were better suited to generating PDF reproducible reports, bookdown allows using ONE unified set of functions to generate both HTML and PDF documents. In addition, the same approach and functions are used to process tables and figures as well as to cross-reference them in the main body of the text. In this tutorial, we will also cover procedures to cite references in the text, automatically generate a bibliography/references section, format citations to journal styles, and generate an Appendix containing citations of all R packages used to conduct your research (and produce the report).
This tutorial is devoted to part B of chapter 1 and provides students with opportunities to learn procedures to:
- Check and install R package dependencies (using an *.R script saved at the root of your working directory).
To facilitate teaching of the learning outcomes, a roadmap with the R Markdown file (*.Rmd) structure is summarized in Figure 2.7. This structure also applies to Chapter 1 - part C.
To support reproducibility of your research, we are structuring the R Markdown file as follows (Figure 2.7):
- The bibliography and citation style rely on BibTeX (*.bib) and citation style language (*.csl) files, which have to be stored in the working directory.
Please refer to the corresponding section above for more details on supporting files and their locations on the shared Google Drive.
To execute this tutorial, the following R packages (declared in the pkg object) have to be installed on your computer using the code provided below.
The best procedure to ensure reproducibility is to copy the code below into an R script file (entitled 01_R_dependencies.R) saved at the root of your working directory (if this directory does not exist yet, please start by creating it and naming it Chapter_1_PartB).
##~~~
# Check/Install R dependencies
##~~~
# This code is dedicated to packages for Chapter 1
##~~~
#1. List all required packages
##~~~
#Object (args) provided by user with names of packages stored into a vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools")
##~~~
#2. Check if pkg are installed
##~~~
print("Check if packages are installed")
## [1] "Check if packages are installed"
#This line outputs a list of packages that are not installed
new.pkg <- pkg[!(pkg %in% installed.packages())]
##~~~
#3. Install missing packages
##~~~
# Using an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then need to adjust code/function
if(length(new.pkg) > 0){
print(paste("Install missing package(s):", new.pkg, sep=' '))
install.packages(new.pkg, dependencies = TRUE)
}else{
print("All packages are already installed!")
}
## [1] "All packages are already installed!"
##~~~
#4. Load all packages
##~~~
print("Load packages and return status")
## [1] "Load packages and return status"
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
sapply(pkg, require, character.only = TRUE)
## knitr rmarkdown bookdown formattable kableExtra
## TRUE TRUE TRUE TRUE TRUE
## dplyr magrittr prettydoc htmltools knitcitations
## TRUE TRUE TRUE TRUE TRUE
## bibtex devtools
## TRUE TRUE
If you are planning to create PDF documents, you will need to install a TeX distribution on your computer. Please refer to this website for more details: https://www.latex-project.org/get/
Several students working on Windows computers have reported difficulties in compiling PDF documents in RStudio. This issue is caused by MiKTeX preventing RStudio from installing or updating the TeX packages required to knit your documents.
To solve this issue, apply the following procedure:
1. Open the MiKTeX console by searching for and clicking MiKTeX Console in the application launcher.
2. Go to the Settings tab.
3. Select Always install missing packages on-the-fly under the “You can choose whether missing packages are to be installed on-the-fly” header (see Figure 2.8).
4. Restart RStudio and you should be able to knit pdf documents.

The YAML metadata section (Figure 2.7) allows users to provide arguments (referred to as fields) to convert their R Markdown document into its final form. In this class, we will be using functions implemented in the knitr (Xie, 2015, 2023b) and bookdown (Xie, 2016, 2023a) packages to populate this section (field names as declared in the YAML metadata section are provided between parentheses):
- Title of the document (title).
- Subtitle of the document (subtitle).
- Author(s) (author).
- Date (date).
- Output format(s) (output).
- Linking of in-text citations to the bibliography (link-citations).
- Font size (fontsize).
- Bibliography file (bibliography).
- Citation style language file (csl).
The YAML code provided below outputs either an HTML or PDF document (see output field) with a table of contents (see toc field) and generates in-text citations and a bibliography section formatted as declared in the AmJBot.csl file (under the csl field). The bibliography has to be stored in a file (here Bibliography_Reproducible_Science_2.bib) deposited at the root of your working directory.
---
title: "Your title"
subtitle: "Your subtitle"
author: "Your name"
date: "`r Sys.Date()`"
output:
  bookdown::html_document2:
    toc: TRUE
  bookdown::pdf_document2:
    toc: TRUE
link-citations: yes
fontsize: 12pt
bibliography: Bibliography_Reproducible_Science_2.bib
csl: AmJBot.csl
---
Do the following steps to set your YAML metadata section (also see Figure 2.7):
1. Save your .Rmd document into a new project folder in Reproducible_Science/Chapters/Chapter_1/ (Note: this folder has to be created prior to executing this step).
2. Copy Bibliography_Reproducible_Science_2.bib and AmJBot.csl into your project folder. These files are available in the shared Google Drive folder: Reproducible_Science/Chapters/Chapter_1/Tutorial_files
3. The .bib and .csl files have to be stored in the same working directory as your .Rmd file.
4. R functions can be used in the YAML metadata section by using inline R code syntax (see part A for more details). Here, we use Sys.Date() to automatically date the output document.
5. Several bibliography files can be declared as follows: bibliography: [file1.bib, file2.bib].
6. To disable a field (e.g. one of the output formats), add a # in front of the corresponding line (= equivalent of commenting).

Since you have declared two output documents in the YAML metadata section and those are specific to bookdown functions, you will have to select which output format you want to use to compile your document by clicking on the drop-down list on the left side of the Knit button (see Figure 2.9). To use bookdown functions, please make sure to select one of the following options (see Figure 2.9): Knit to html_document2 or Knit to pdf_document2.
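If you prefer working from the R console, the same bookdown formats can also be produced with rmarkdown::render(); the file name below is only an example standing in for your own document:
# Knit to the bookdown HTML format from the R console (file name is an example)
rmarkdown::render("Chapter_1_PartB.Rmd", output_format = "bookdown::html_document2")
# Or knit to the bookdown PDF format
rmarkdown::render("Chapter_1_PartB.Rmd", output_format = "bookdown::pdf_document2")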
It is best practice to add an R code chunk directly under the YAML metadata section to load all the required R packages used to produce your report (same code as presented here) (see Figure 2.7). This feature will also allow you to automatically generate a citation file with all the R packages used to generate your report (see below). Applying this approach will contribute to improving the reproducibility of your research!
1. Insert an R code chunk entitled packages and set its options line as follows: echo = FALSE, warning = FALSE, include = FALSE.
2. Copy and paste the following code into the code chunk (same as in 01_R_dependencies.R):
###~~~
# Load R packages
###~~~
#Create vector w/ R packages
# --> If you have a new dependency, don't forget to add it in this vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools")
##~~~
#2. Check if pkg are installed
##~~~
print("Check if packages are installed")
#This line outputs a list of packages that are not installed
new.pkg <- pkg[!(pkg %in% installed.packages())]
##~~~
#3. Install missing packages
##~~~
# Using an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then need to adjust code/function
if(length(new.pkg) > 0){
print(paste("Install missing package(s):", new.pkg, sep=' '))
install.packages(new.pkg, dependencies = TRUE)
}else{
print("All packages are already installed!")
}
##~~~
#4. Load all packages
##~~~
print("Load packages and return status")
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
sapply(pkg, require, character.only = TRUE)
I don’t know about you, but I am always struggling to properly cite R packages in my publications. If you want to retrieve the citation for an R package, you can use the base R function citation(). For instance, citations for knitr can be obtained as follows:
# Generate citation for knitr
citation("knitr")
##
## To cite the 'knitr' package in publications use:
##
## Yihui Xie (2023). knitr: A General-Purpose Package for Dynamic Report
## Generation in R. R package version 1.42.
##
## Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition.
## Chapman and Hall/CRC. ISBN 978-1498716963
##
## Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible
## Research in R. In Victoria Stodden, Friedrich Leisch and Roger D.
## Peng, editors, Implementing Reproducible Computational Research.
## Chapman and Hall/CRC. ISBN 978-1466561595
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
If you want to generate those citation entries in BibTeX format, you can pass the object returned by citation() to toBibtex() as follows:
# Generate citation for knitr in BibTeX format. Note that
# there are no citation identifiers. Those will be
# automatically generated by our next piece of code.
toBibtex(citation("knitr"))
## @Manual{,
## title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
## author = {Yihui Xie},
## year = {2023},
## note = {R package version 1.42},
## url = {https://yihui.org/knitr/},
## }
##
## @Book{,
## title = {Dynamic Documents with {R} and knitr},
## author = {Yihui Xie},
## publisher = {Chapman and Hall/CRC},
## address = {Boca Raton, Florida},
## year = {2015},
## edition = {2nd},
## note = {ISBN 978-1498716963},
## url = {https://yihui.org/knitr/},
## }
##
## @InCollection{,
## booktitle = {Implementing Reproducible Computational Research},
## editor = {Victoria Stodden and Friedrich Leisch and Roger D. Peng},
## title = {knitr: A Comprehensive Tool for Reproducible Research in {R}},
## author = {Yihui Xie},
## publisher = {Chapman and Hall/CRC},
## year = {2014},
## note = {ISBN 978-1466561595},
## }
To use citation entries generated by toBibtex(), you have to copy the output to a .bib file and save it in your working directory. You will then be able to cite the references found in this file directly in your R Markdown document. This can be done automatically by adding the following code to your packages R code chunk:
# Generate BibTex citation file for all loaded R packages
# used to produce report Notice the syntax used here to
# call the function
knitr::write_bib(.packages(), file = "packages.bib")
The .packages() argument invisibly returns the names of all packages loaded in the current R session (if you want to see a visible return, use .packages(all.available = TRUE)). This makes sure that all packages being used in your code will have their citation entries written to the .bib file. Finally, to be able to cite those references (see Citation identifier) in your text, the YAML metadata section has to be edited. See Appendix 1 for a full list of references associated with the R packages used to generate this report.
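By default, knitr::write_bib() prefixes the citation identifiers it generates with R- (e.g. R-knitr for the knitr package), so a package reference stored in packages.bib can be cited in your text like any other reference:
#Example of in-text citation of an R package (identifier generated by write_bib)
This report was produced with the knitr package [@R-knitr].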
Although a References section will be provided at the end of your document to cite in-text references (see References and Figure 2.7), it is customary to add citations for all R packages used to generate the research in Appendix 1. We will learn here the procedure to assemble such an Appendix.
1. At the end of your document, add the tag <div id="refs"></div> under the References header as shown below, which allows printing Appendices (or any other material) after the References section (see here for more details):
# References
<div id="refs"></div>
# (APPENDIX) Appendices {-}
# Appendix 1
Citations of all R packages used to generate this report.
2. Add an R code chunk under # Appendix 1 to read in and print the citations saved in packages.bib. This is done as follows:
### Load R package
library("knitcitations")
### Process and print citations in packages.bib. Clear all
### bibliography entries that could be in the cache
cleanbib()
# Set pandoc as the default output option for bib
options(citation_format = "pandoc")
# Read and print bib from file
read.bibtex(file = "packages.bib")
Set the options of this code chunk as follows: {r generateBibliography, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
In addition to providing citations to R packages, you might also want to provide full information on R package versions and your operating system (see Figure 2.7). With R, the simplest (but useful and important) approach to document your R environment is to report the output of sessionInfo() (or devtools::session_info()). Among other information, this will show all the packages, and their versions, that are loaded in the session you used to run your analysis. If someone wants to reproduce your analysis, they will know which packages and versions they need to install and on which operating system the code was executed. For instance, here is the output of sessionInfo() showing the R version and packages that I used to create this document:
# Collect Information About the Current R Session
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rticles_0.23 DiagrammeR_1.0.9 DT_0.24
## [4] data.tree_1.0.0 kfigr_1.2.1 devtools_2.4.4
## [7] usethis_2.1.6 bibtex_0.4.2.3 knitcitations_1.0.12
## [10] htmltools_0.5.3 prettydoc_0.4.1 magrittr_2.0.3
## [13] dplyr_1.1.2 kableExtra_1.3.4 formattable_0.2.1
## [16] bookdown_0.33 rmarkdown_2.21 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.5 sass_0.4.2 pkgload_1.3.2.1 jsonlite_1.8.7
## [5] viridisLite_0.4.2 bslib_0.4.0 shiny_1.7.2 highr_0.9
## [9] yaml_2.3.7 remotes_2.4.2 sessioninfo_1.2.2 pillar_1.9.0
## [13] glue_1.6.2 digest_0.6.33 RColorBrewer_1.1-3 promises_1.2.0.1
## [17] rvest_1.0.3 RefManageR_1.3.0 colorspace_2.1-0 httpuv_1.6.5
## [21] plyr_1.8.7 pkgconfig_2.0.3 purrr_1.0.2 xtable_1.8-4
## [25] scales_1.2.1 webshot_0.5.4 processx_3.8.2 svglite_2.1.0
## [29] later_1.3.0 tibble_3.2.1 generics_0.1.3 ellipsis_0.3.2
## [33] cachem_1.0.6 cli_3.6.1 crayon_1.5.2 mime_0.12
## [37] memoise_2.0.1 evaluate_0.21 ps_1.7.5 fs_1.6.3
## [41] fansi_1.0.4 xml2_1.3.5 pkgbuild_1.3.1 profvis_0.3.7
## [45] tools_4.2.0 prettyunits_1.1.1 formatR_1.12 lifecycle_1.0.3
## [49] stringr_1.5.0 munsell_0.5.0 callr_3.7.3 compiler_4.2.0
## [53] jquerylib_0.1.4 systemfonts_1.0.4 rlang_1.1.1 rstudioapi_0.14
## [57] visNetwork_2.1.0 htmlwidgets_1.5.4 miniUI_0.1.1.1 R6_2.5.1
## [61] lubridate_1.8.0 fastmap_1.1.0 utf8_1.2.3 stringi_1.7.12
## [65] Rcpp_1.0.11 vctrs_0.6.3 tidyselect_1.2.0 xfun_0.36
## [69] urlchecker_1.0.1
I have also used the approach described above to add this information in Appendix 2. This can be done as follows:
# Appendix 2
Version information about R, the operating system (OS) and attached or loaded R packages. This appendix was generated using `sessionInfo()`.
# Load and provide all packages and versions
sessionInfo()
We have now set up our R Markdown environment and can start populating it! This means that you will be inserting your text and other code chunks directly under the packages code chunk. The References section constitutes the end of the main body of your document. If you want to add Appendices, do so under Appendix 1; appendices will be labelled differently from the main body of the document.
There will be more details about tables in chapter 9; however, this tutorial introduces key concepts related to table making in R Markdown, more specifically the following topics:
Here, you will learn the R Markdown syntax and R code required to replicate the grading scale presented in the Syllabus (see Table 2.1):
| Percentage | Grade |
|------------|-------|
| 100-98 | A+ |
| 97.9-93 | A |
| 92.9-90 | A- |
| 89.9-88 | B+ |
| 87.9-83 | B |
| 82.9-80 | B- |
| 79.9-78 | C+ |
| 77.9-73 | C |
| 72.9-70 | C- |
| 69.9-68 | D+ |
| 67.9-60 | D |
| 59.9-0 | F |
1. Use the same .Rmd document as above to practice working with tables.
2. Create a new level 1 header entitled Tables.
3. Insert an R code chunk by pressing the Insert button in the editor toolbar.
4. Copy and paste the following code into the code chunk:
# Create a data.frame w/ grading scale
grades <- data.frame(Percentage = c("100-98", "97.9-93", "92.9-90",
"89.9-88", "87.9-83", "82.9-80", "79.9-78", "77.9-73", "72.9-70",
"69.9-68", "67.9-60", "59.9-0"), Grade = c("A+", "A", "A-",
"B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F"))
# Plot table and add caption
knitr::kable(grades, caption = "Grading scale applied in this class.") %>%
kable_styling(c("striped", "scale_down"))
5. Hide the R code in the output by adding echo = FALSE to the chunk options.
6. Add a code chunk ID entitled tabgrades in the chunk options line (just after {r) to enable further cross-referencing.
7. Run the code chunk by pressing the Run button.
8. Knit the document by pressing the Knit button on the editor toolbar (Figure 2.9).

There will be more details about figures in chapter 10; however, this tutorial introduces key concepts related to figure making in R Markdown, more specifically the following topics:
- Inserting a plot produced by R code (based on the cars dataset; Figure 2.10).
Here, you will learn the R Markdown syntax and R code required to replicate Figure 2.10:
1. Use the same .Rmd document as above to practice working with figures.
2. Create a new level 1 header entitled Figures.
3. Insert an R code chunk by pressing the Insert button in the editor toolbar.
4. Copy and paste the following code into the code chunk:
# Load and summarize cars dataset
summary(cars)
# Plot data
plot(cars)
5. Set the following chunk options: echo = FALSE, results = "hide", fig.cap = "Plot of cars' speed in relation to distance." and out.width = "100%".
6. Add a code chunk ID entitled cars in the chunk options line (just after {r) to enable further cross-referencing.
7. Run the code chunk by pressing the Run button.
8. Knit the document by pressing the Knit button on the editor toolbar (Figure 2.9).

Cross-referencing tables and figures in the main body of your R Markdown document can easily be done by using the \@ref() function implemented in the bookdown package (see Figure 2.7).
The general syntax is as follows:
# Cross-referencing tables in main body of text
\@ref(tab:code_chunk_ID)
# Cross-referencing figures in main body of text
\@ref(fig:code_chunk_ID)
To cross-reference the tabgrades table, type \@ref(tab:tabgrades), which translates into 2.1.
To cross-reference the cars figure, type \@ref(fig:cars), which translates into 2.10.
Note: This syntax doesn’t automatically include the Table or Figure handles in front of the cross-reference. You will have to manually add Table or Figure in front of \@ref().
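For instance, a sentence combining the handles and the cross-references could read as follows in your .Rmd file:
#Example of cross-references with their handles added manually
The grading scale is presented in Table \@ref(tab:tabgrades), whereas the cars data are plotted in Figure \@ref(fig:cars).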
Cross-reference your table and figure by adding a new level 1 header entitled Cross-referencing tables and figures and then typing examples as shown in this section.
To cite references in the R Markdown document, those have to be saved in a bibliography file using the BibTeX format. Other formats can be used, but the BibTeX format is open-source and easy to edit. Please see this webpage for more details on other formats.
Most journals allow saving citations of publications directly in BibTeX format, but when this feature is not available, formats can be converted using online services (e.g. EndNote to BibTeX: https://www.bruot.org/ris2bib/).
- Save your BibTeX formatted references in a text file and make sure to add the .bib extension.
- The .bib file has to be deposited in the same folder as your .Rmd file (= working directory).
- On most journal websites and citation databases, you can press the Cite icon to download a citation in .bibtex format. More details on the BibTeX format are provided below.
- References in BibTeX format are available in the associated file: Bibliography_Reproducible_Science_2.bib.
The Pandoc program can automatically generate citations in the text and a bibliography/references section following various journal styles (see Figure 2.7). In order to use this feature, you need to specify a bibliography file in the YAML metadata section.
Please find below an example of a reference formatted in BibTeX format:
# Example of BibTex format for Baker (2016) published in Nature
@Article{Baker_2016,
doi = {10.1038/533452a},
url = {https://doi.org/10.1038/533452a},
year = {2016},
month = {may},
publisher = {Springer Nature},
volume = {533},
number = {7604},
pages = {452--454},
author = {Monya Baker},
title = {1,500 scientists lift the lid on reproducibility},
journal = {Nature},
}
The unique citation identifier of a reference (Baker_2016 in the example above) is set by the user in the BibTeX citation file (see the first line in the example provided above). This unique identifier is used to refer to the reference/publication in the R Markdown document and also allows citing references and generating the References section.
Citations go inside square brackets ([]) and are separated by semicolons. Each citation must have a key, composed of @ + the citation identifier (see above) as stored in the BibTeX file.
Please find below some examples on citation protocols:
#Syntax
Blah blah [see @Baker_2016, pp. 33-35; also @Smith2016, ch. 1].
Blah blah [@Baker_2016; @Smith2016].
Once knitted (using the Knit button), the above code/syntax turns into:
Blah blah (see Baker, 2016 pp. 33–35; also Smith et al., 2016, ch. 1).
Blah blah (Baker, 2016; Smith et al., 2016).
A minus sign (-) before the @ will suppress mention of the author in the citation. This can be useful when the author is already mentioned in the text:
#Syntax
Baker says blah blah [-@Baker_2016].
Once knitted, the above code/syntax turns into:
Baker says blah blah (2016).
You can also write an in-text citation, as follows:
#Syntax
@Baker_2016 says blah.
@Baker_2016 [p. 1] says blah.
Once knitted, the above code/syntax turns into:
Baker (2016) says blah.
Baker (2016 p. 1) says blah.
Students have to use their .Rmd document to practice citing references in the text using the procedures described above. To clearly define where you practice citing references, please do so under a Citing references header.
Upon knitting, a References section will automatically be generated and inserted at the end of the document (see Figure 2.7). Usually, we recommend adding a References header (level 1) just after the last paragraph of the document as displayed below:
last paragraph...
# References
The bibliography will be inserted after this header (please see References section of this tutorial for more details).
In this section, we are studying how your bibliography can be automatically formatted following a journal style. This is achieved by providing the name of a citation style language file (containing the protocol to format citations and bibliography following a journal style) in the YAML metadata section.
The Citation Style Language (CSL) was developed by an open-source project and aims at facilitating scholarly publishing by automating the formatting of citations and bibliographies. This project has developed the CSL and maintains a crowdsourced repository with over 8000 free CSL citation styles. Please see the following website for more details: https://citationstyles.org
There are two main CSL repositories:
Please follow the steps below to format your citations and bibliography following the citation style provided in a CSL file (see Figure 2.7 for more details):
1. Download the CSL file of your target journal (here AmJBot.csl).
2. Save the CSL file in the same working directory as your .Rmd file.
3. Declare the CSL file in the YAML metadata section (under the csl field) as shown below:
# Add a "csl" argument and provide name of the CSL file (here AmJBot.csl)
---
title: "Sample Document"
output:
  bookdown::html_document2:
    toc: TRUE
  bookdown::pdf_document2:
    toc: TRUE
bibliography: bibliography.bib
csl: AmJBot.csl
---
4. Knit your document by pressing the Knit button. The Pandoc program will use the information stored in the YAML metadata section to format the bibliography (citations and bibliography section) following the citation style provided in the CSL file. Do not forget to add a References header at the end of your .Rmd document.

This tutorial is devoted to part C of chapter 1 and provides students with opportunities to learn procedures to (see Figure 2.7):
Please refer to the corresponding section above for more details on supporting files and their locations on the shared Google Drive.
The slides presented in class can be downloaded here. All the information presented in these slides is found in the text below.
Unlike R scripts, where you have to set your working directory or provide the path to your files, the approach implemented in R Markdown documents (.Rmd) automatically sets your working directory to the location of your .Rmd file. This procedure is handled by knitr functions: knitr expects all declared files to be located in the same path as your .Rmd file or in a subfolder within this working directory. The main reason for this approach is to maximize the portability of your R Markdown project, which is usually composed of a set of files (see Figure 2.11).
Before knitting your document, you will be testing your code, and this requires setting your working directory. This can be done in RStudio by clicking (see Figure 2.12):
Session --> Set Working Directory --> To Source File Location
To improve code reproducibility and efficiency and to follow publication requirements, it is customary to include a “code chunk” at the beginning of your .Rmd file to set global options applying to the whole document (see Figure 2.7). Those settings are related to the following elements of your code: text results, code decoration, caching and plots.
These general settings will be set using the opts_chunk$set() function implemented in knitr (Xie, 2023b). The following website contains valuable information on code chunk options:
The knitr function opts_chunk$set() is used to change the default global options in an .Rmd document.
Before starting, a few special notes on the options should be kept in mind:
- Avoid using spaces and periods (.) in chunk labels and directory names.
Here we will be discussing each part of the settings individually, but those will have to be merged into one code chunk in your document entitled setup (please see below for more details).
This section deals with settings related to text results generated by code chunks.
Please find below an example of options that could be applied across code chunks:
# Setup options for text results
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
- echo = TRUE: Include all R source code in the output file.
- warning = TRUE: Preserve warnings (produced by warning()) in the output, as when R code is run in a terminal.
- message = TRUE: Preserve messages emitted by message() (similar to warnings).
- include = TRUE: Include all chunk outputs in the final output document.
If you want some of the text results to have different options, please adjust those in their specific code chunks. This comment is valid for all the other general settings.
This section deals with settings related to code decoration (i.e. how the code is displayed in the final html or pdf document) generated by code chunks.
Please find below an example of options that could be applied across code chunks:
# Setup options for code decoration
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
- tidy = TRUE: Use formatR::tidy_source() to reformat the code. Please see tidy.opts below.
- tidy.opts = list(blank = FALSE, width.cutoff = 60): This provides a list of options to be passed to the function determined by the tidy option. Here we format the code to avoid blank lines and with a width cutoff of 60 characters.
- highlight = TRUE: This highlights the source code.
To compile your .Rmd document faster (especially if you have computing-intensive tasks), you can cache the output of your code into files associated with each of your code chunks. This process allows compute-intensive chunks to be saved and their output used later without being re-run.
The knitr package has options to only evaluate cached chunks when necessary, but this has to be set by users. This procedure creates a unique MD5 digest (= a fingerprint) of each chunk to track when changes are present. When the option cache = TRUE (there are other, more granular settings; see below) is set, the chunk will only be re-evaluated when no cached results exist yet or when the chunk code (or its options) has been modified.
The following code allows implementing this procedure to your document:
# Setup options for code cache
opts_chunk$set(cache = 2, cache.path = "cache/")
- Besides TRUE and FALSE for the chunk option cache, advanced users can also consider a more granular cache by using the numeric values cache = 0, 1, 2, 3. 0 means FALSE, and 3 is equivalent to TRUE. For cache = 1, the results of the computation are loaded from the cache, so the code is not evaluated again, but everything else is still executed, such as the output hooks and saving recorded plots to files. For cache = 2 (used here), it is very similar to 1, and the only difference is that the recorded plots will not be re-saved to files when the plot files already exist, which might save some time when the plots are big.
- cache.path = "cache/": Directory where cache files will be saved. You don’t have to create the directory before executing the code; it will be created automatically by knitr if it doesn’t exist yet.
Plots are a major element of your research and they are at the core of your figures. We can take advantage of options implemented in the knitr package to output plots meeting publication requirements. This approach will save precious time during the writing phase of your research (= no need to fiddle with the size and resolution of figures to meet journal policies).
Please find below an example of options that could be applied across code chunks:
# Setup options for plots The first dev is the master for
# the output document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
fig.path = "Figures_MS/""
: Set directory to save figures generated by the R Markdown document. As above, this folder doesn’t need to exist prior to executing the code chunks. Files will be save based on code chunk title and assigned figure number.dev = c('pdf', 'png')
: Save figures in both pdf
and png
formats.dpi = 300
: The DPI (dots per inch) for bitmap devices (dpi * inches = pixels). Please look at publishing requirements to set this parameter appropriately.It is worth noting that you might be using external figures in your .Rmd
document. To avoid confusions between figures generated by the .Rmd
document and those coming from outside, it is best practice to have them saved in two different subfolders (see Figure 2.11 for more details).
Some journals have specific requirements on figure dimensions. You can easily set these by using the following option:
- fig.dim: (NULL; numeric) if a numeric vector of length 2, it gives fig.width and fig.height, e.g., fig.dim = c(5, 7).
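For instance, a hypothetical chunk header fixing the dimensions of a single figure (the label sizedplot and the caption are placeholders) could look as follows:
{r sizedplot, fig.dim = c(5, 7), fig.cap = "Caption of the resized figure."}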
Positioning figures as close as possible to their code chunks is critical and can be achieved by adding another opts_chunk$set() code line in your setup R code chunk. This is done by invoking the fig.pos argument and setting it to "H". Warning: Setting this argument might generate errors when documents are knitted as pdf documents. If that happens, please comment this line out using # and knit again.
## Locate figures as close as possible to requested
## position (=code)
opts_chunk$set(fig.pos = "H")
In this section, we will collate all the global settings discussed above into a code chunk entitled setup, which will be placed under the YAML metadata section (see Figure 2.7 for more details on location). In addition to containing the global settings, it is advisable to also include a code section devoted to loading the required R packages (see Chapter 1 - part B and Figure 2.7).
Please find below the code for the setup code chunk based on the options presented above:
# Load packages
## Add any packages specific to your code
library("knitr")
library("bookdown")
# Chunk options: see http://yihui.name/knitr/options/ ###
## Text results
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
## Code decoration
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
## Caching code
opts_chunk$set(cache = 2, cache.path = "cache/")
## Plots The first dev is the master for the output
## document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
## Locate figures as close as possible to requested
## position (=code)
opts_chunk$set(fig.pos = "H")
The setup R code chunk
When inserting the above code into an R code chunk (see Figure 2.7), please set the options of the chunk as follows:
- setup: Unique ID of the code chunk.
- include = FALSE: Nothing will be written into the output document, but the code will be evaluated and plot files will be generated (if there are any plots in the chunk).
- cache = FALSE: The code chunk will not be cached (see above for more details).
- message = FALSE: Messages emitted by message() will not be preserved.
Options (and their associated arguments) in the code chunk have to be separated by commas.
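Putting these options together, the header of the setup code chunk would therefore read as follows:
{r setup, include = FALSE, cache = FALSE, message = FALSE}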
Please conduct the following exercise to get accustomed to the material presented in this tutorial. The exercise is divided into the following steps:
1. Open your Chapter_1_PartB.Rmd document.
2. Insert a new R code chunk (using the Insert button) directly under the YAML metadata section and entitle it setup. This code chunk will be used to define the global settings for the following options as implemented in the opts_chunk$set() function: text results, code decoration, caching and plots.
3. Copy the global settings presented above into the setup code chunk.
4. Knit the document by pressing the Knit button. Please pay attention to the outputs in your folder (= working directory).
5. Insert a code chunk producing a plot (see below) and use the bookdown \@ref() function to cite your figure/plot in the text.
The R code provided below is associated with step 5 of the exercise and it produces the plot displayed in Figure 2.13.
To get there:
1. Insert a new R code chunk (using the Insert button) and set the following options and associated arguments:
- Plot: Unique ID of the code chunk.
- fig.cap = "Plot of y ~ x.": Figure caption.
- fig.show = "asis": Figure display.
- out.width = "100%": Figure width on the page.
2. Copy and paste the following code into the code chunk:
# Generate a set of observations (n=100) that have a normal
# distribution
x <- rnorm(100)
# Add a small amount of noise to x to generate a new vector
# (y)
y <- jitter(x, 1000)
# Plot y ~ x
plot(x, y)
3. Execute the code chunk (by pressing the Run button).

This tutorial provides students with opportunities to gain the following skills by:
- learning how to write user defined functions and associated R functions such as return() and source();
- learning about list objects (list()) and their applicability to functions returning multiple values.
To complete part D, please create a new R script document (Figure 2.14) and save it in your working directory. All the code and functions presented here should be reported in your R script.
This tutorial aims at providing an introduction to functions; more specifically, we will be studying user defined functions (UDFs) as implemented in R. UDFs allow users to write their own functions and make them available in the R Global Environment (using the source() function) or ultimately in R packages.
To gain this knowledge, students will be conducting three exercises to learn about the following topics:
To show the broad applications of the teaching material, we will be using mathematical examples. Before delving into these topics, the instructor provides some general context and touches upon what a function is and when it is best applied, as well as best practices to write pseudocode/code (more during Chapter 4) and approaches to calling R functions.
In programming, you use functions to incorporate sets of instructions that you want to use repeatedly or that, because of their complexity, are better self-contained in a subprogram and called when needed.
A function is a piece of code written to carry out a specified task; it may or may not accept arguments or parameters, and it may or may not return one or more values.
There exist many terms to define and express functions: subroutines, procedures, methods, etc., but for the purposes of this tutorial, you will ignore these distinctions, which are often semantic and reminiscent of other, older programming languages (see here for more details on semantics). In our context, those definitions are less important, because in R we only have functions.
In R, according to the base documentation, you define a function with the following construct:
function(arglist){
body
}
The code between the curly braces is the body of the function.
When you use built-in functions, the only thing you need to worry about is how to effectively communicate the correct input arguments (arglist) and manage the return value(s) (or outputs), if there are any. To know more about the arguments associated with a specific function, you can access its documentation by using the following syntax (entered in the R console):
#General syntax
?function()
#Example with read.csv()
?read.csv()
R allows users to define their own functions, which are based on the following syntax:
function.name <- function(arguments){
computations on the arguments
some more code
return value(s)
}
So, in most cases, a function has a name (here function.name), some arguments (here arguments) used as input to the function (declared within the () following the keyword function), and a body, which is the code within the curly braces {}, where you carry out the computation; it can have one or more return values (the output). You define the function similarly to variables, by “assigning” the directive function(arguments) to the “variable” function.name, followed by the rest.
This topic will be covered in chapter 4, but here is an outline of best practice to write code and functions in R. Before delving into code writing, we will usually work on developing pseudocode, which aims at providing a high-level description of the tasks that will have to be performed by the function. Once this job is done, we will start writing the code by turning the tasks identified in the pseudocode into real R code. This will be done by searching for existing R functions allowing each task described in the pseudocode to be executed and, if they don’t exist, by developing new functions (this task might require some additional pseudocode). Please find below more detailed definitions of the two concepts described here.
Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a normal programming language (here R), but is intended for human reading rather than machine reading. Here, you will establish the big steps (and their associated tasks) and tie R functions (existing or that have to be made) to those steps. This provides the backbone of your code and will support writing it.
Writing clear, reproducible code has (at least) three main benefits:
When working on your project, it is highly likely that you will have developed multiple UDFs tailored to your research. In this case, it would be appropriate to create a new folder entitled R_functions, which will be located in the same directory as your R scripts (see Figure 2.15). Save all your UDFs as independent files (e.g. check.install.pkg.R) in the R_functions folder.
The source() function: Read R code from a file
Once your project is properly structured (see Figure 2.15), it will be easy to call specific UDFs into any R script by using the source() function.
For instance, to load the check.install.pkg() function (saved in R_functions/check.install.pkg.R) into the Global Environment, enter the following code into the R console (see Figure 2.16):
source("R_functions/check.install.pkg.R")
Below, you will find some R code allowing you to load all your UDFs stored in the R_functions folder. This code is very handy when you have several UDFs associated with your pipeline.
### Load all UDFs stored in R_functions
# 1. Create vector with names of UDF files in R_functions
# (with full path)
files_source <- list.files("R_functions", full.names = T)
# 2. Iterative sourcing of all UDFs
sapply(files_source, source)
Once loaded into your Global Environment (see Figure 2.16 for an example), you will be able to call your functions by typing their names directly into the console. For instance:
# Check if knitr package is installed, if not install it
# and then load it
check.install.pkg(pkg = c("knitr"))
## [1] "Check if packages are installed"
## [1] "Load packages"
## knitr
## TRUE
In this exercise, we will be working on developing a function that returns a single value, but how do we tell our UDF to return the value? In R, this task is accomplished with the return() function.
Develop, implement and apply a UDF to calculate the square of a number.
In math, the squared symbol (\(^2\)) is an arithmetic operator that signifies multiplying a number by itself. The “square” of a number is the product of the number and itself. Multiplying a number by itself is called “squaring” the number.
Although straightforward, the function will have to execute the following tasks:
1. Take a number provided by the user (base) as part of the argument(s) of the function.
2. Calculate the square of base by multiplying the number by itself.
3. Save the result in an object entitled sq.
4. Return the sq object to the user (here a single value). To return sq, we will be using the return() function.
This also means that the class of the input provided by the user has to be numeric and the output will also be numeric. We will further discuss this topic during the third exercise. The class of an object can be checked by using the class() function. Note: The class() function will be useful for implementing defensive programming.
In this section, we will implement the pseudocode proposed above into a function entitled square_number(). This function requires one argument from the user (base, which is a number) and returns the square of that number (here square of base = base*base).
## Create a UDF in R to calculate square number:
# - argument: base (= one number)
# - output: square of base (= one number)
square_number <- function(base){
#Infer square of base and save it into object
sq <- base*base
#Return sq object
return(sq)
}
Write the code associated with the square_number() function into your new R script (saved in your working directory) and load the function by executing all the lines associated with it. Please carry on populating this document with the rest of the exercises.
Before being able to use your UDF, execute the code associated with the square_number() function in the console. The UDF should now be loaded in the Global Environment and therefore be available for use. To verify that the UDF is loaded, please check the Environment panel in RStudio (see Figure 2.17).
The R language is quite flexible and allows functions to be applied to a single value (e.g. base = 2) or a vector (e.g. base = c(2,4,16,23,45)). Please see below for more examples:
# Square number of 2
square_number(base = 2)
## [1] 4
# Create vector with numbers
bases <- c(2, 4, 16, 23, 45)
# Apply function to vector
square_number(base = bases)
## [1] 4 16 256 529 2025
In our previous exercise, we developed a function returning only one value. As part of your research, there will be multiple instances where you will be performing multiple actions on your data, which will call for multiple values to be output by the function. To do so, we will be harvesting the different outputs of the function into a list, which will be returned to the user (using return()).
Lists are R objects containing elements of different types such as numbers, strings, vectors or another list inside them. A list can also contain a matrix or a function as one of its elements. Lists are created using the list() function.
Find below an example creating a list containing strings, numbers, vectors and a logical value:
# Create a list containing strings, numbers, vectors and a
# logical value
list_data <- list("Red", 51.3, 72, c(21, 32, 11), TRUE)
# Print object
print(list_data)
## [[1]]
## [1] "Red"
##
## [[2]]
## [1] 51.3
##
## [[3]]
## [1] 72
##
## [[4]]
## [1] 21 32 11
##
## [[5]]
## [1] TRUE
The list elements can be given names and they can be accessed using these names (see below):
# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
-2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Show the list
print(list_data)
## $`1st_Quarter`
## [1] "Jan" "Feb" "Mar"
##
## $A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## $An_inner_list
## $An_inner_list[[1]]
## [1] "green"
##
## $An_inner_list[[2]]
## [1] 12.3
Elements of the list can be accessed by their index in the list. In the case of named lists, they can also be accessed using their names. We use the same example as above to illustrate the procedure to access list elements:
# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
-2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Access the first element of the list
print(list_data[1])
## $`1st_Quarter`
## [1] "Jan" "Feb" "Mar"
# Access the third element. As it is also a list, all its
# elements will be printed
print(list_data[3])
## $An_inner_list
## $An_inner_list[[1]]
## [1] "green"
##
## $An_inner_list[[2]]
## [1] 12.3
# Access the list element using the name of the element
print(list_data$A_Matrix)
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
Now that you have more knowledge about list objects, we will be working on an exercise aiming at developing and implementing a UDF to calculate the log and square of a number.
Each student is tasked to:
Please find below the solution proposed by the instructor:
##Create a user defined function in R to calculate log and square of a number:
# argument: base (= one number)
# output: log and square of base (= two numbers) returned in a list
my_log_square <- function(base){
#log (base 10)
log_value <- log(base)
#Square of base
square_value <- base^2
#Return both objects
return(list(log_val = log_value, square_val = square_value))
}
# Call the function
my_log_square(base = 2)
## $log_val
## [1] 0.6931472
##
## $square_val
## [1] 4
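Because the function returns a named list, each value can also be extracted individually using the $ operator:
# Extract only the square value from the returned list
my_log_square(base = 2)$square_val
## [1] 4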
Defensive programming is a technique to ensure that code fails with well-defined errors, i.e. in situations where you know it should not work. The key here is to ‘fail fast’ and ensure that the code throws an error as soon as something unexpected happens. This creates a little more work for the programmer, but it makes debugging the code a lot easier at a later date.
In order to demonstrate how to apply defensive programming to your code, a new function will be defined:
# Define a power function (exp_number): y = x^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
#Infer exp (y) based on base (x) and power (n) (y=base^power)
exp <- base^power
#Return exp object
return(exp)
}
# Call function
exp_number(base = 2, power = 5)
## [1] 32
You can employ defensive programming on the exp_number function defined above. The function requires that both arguments are of class numeric; if you were to provide a string (e.g. a word) as input, you would get an error:
# Example where we don't respect the class associated with
# the argument base
exp_number(base = "hello", power = 5)
## Error in base^power: non-numeric argument to binary operator
If you add in a line of code to test the data type of the inputs, you get a more meaningful error.
# Define a power function (exp_number): y = x ^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
  # This if statement tests whether base and power are numeric.
  # If one of them is not numeric, it stops and returns a meaningful message
  if(!is.numeric(base) | !is.numeric(power)){
    stop("Both base and power inputs must be numeric")
  }
  # If the inputs are valid, infer exp
  exp <- base^power
  # Return exp object
  return(exp)
}
# Call function
exp_number(base = "hello", power = 5)
## Error in exp_number(base = "hello", power = 5): Both base and power inputs must be numeric
Although in this case debugging the error would not have taken long, in more complicated functions you are likely to either have less meaningful error messages, or code that runs for some time before it fails. By applying defensive programming and adding in these checks to the code, you can find unexpected behavior sooner and with more meaningful error messages.
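Base R also offers stopifnot() as a compact way to add such checks. The sketch below (an alternative illustration, not part of the original example) behaves like the version above but generates the error message automatically:
# Compact alternative: stopifnot() raises an error if any condition is FALSE
exp_number <- function(base, power){
  stopifnot(is.numeric(base), is.numeric(power))
  base^power
}
exp_number(base = 2, power = 5)  # 32
# exp_number(base = "hello", power = 5) would stop with an informative error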
As you have seen in the previous example, we used a logical operator (in this case OR, represented by the | syntax) to implement defensive programming in our UDF. In a nutshell, the most used logical operators in R are as follows:
- AND (&): This operator takes two logical values and returns TRUE only if both values are TRUE themselves.
- OR (|): This operator takes two logical values and returns TRUE if just one value is TRUE.
- NOT (!): This operator negates the logical value it is used on.
You can learn more about the logical operators (including exercises) on the website.
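A few one-line illustrations of these operators (added here for quick reference):
# AND, OR and NOT on logical values
TRUE & FALSE   # FALSE: both values must be TRUE
TRUE | FALSE   # TRUE: one TRUE value is enough
!TRUE          # FALSE: negation
# In the defensive programming context above:
!is.numeric("hello") | !is.numeric(5)  # TRUE, so the function would stop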
As an exercise, update my_log_square() to verify the class of the base argument and print a meaningful error message if the class is not numeric.
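One possible way to approach this exercise (a sketch mirroring the exp_number example above, not the instructor's official solution):
# Sketch: stop early if base is not numeric
my_log_square <- function(base){
  if(!is.numeric(base)){
    stop("The base argument must be numeric")
  }
  log_value <- log(base)
  square_value <- base^2
  return(list(log_val = log_value, square_val = square_value))
}
# my_log_square(base = "hello") now fails with a meaningful message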
The objective of this section is to provide students with some tools and ideas to design their bioinformatic tutorials (for PART 2). Here, students will have an overview of the tools implemented in the R package learnr, which was developed to produce interactive tutorials. Although developed in R, the interactive tutorials are designed to be run in a web browser (though they can also be completed entirely within RStudio).
The interactive tutorial presented here is subdivided into five topics:
Code chunks embedded in the tutorial can be executed by pressing the Run Code button. The instructor has also provided the solution to the exercise, which can be accessed by pressing the Solution button available in the top banner of the code chunk window.

Finally, the instructor wants to stress that students are not obliged to design their tutorials using learnr. You can also use the R Markdown language/syntax and output tutorials in HTML or PDF formats (more on this subject in Chapter 1).
This document highlights steps to execute the interactive tutorial designed by the instructor.
Open RStudio and install the learnr package from CRAN by typing the following command in the console:
install.packages("learnr")
Files associated to this tutorial are deposited on the Google Drive under this path:
Reproducible_Science -> Bioinformatic_tutorials -> Intro_interactive_tutorial
There are two main files:
- README.html: The documentation to install the R package and run the interactive tutorial.
- EEB603_Interactive_tutorial.Rmd: The interactive tutorial written in R Markdown, but requiring functions from learnr.

The instructor has made a video explaining the procedure to launch the interactive tutorial (based on option 1; see below) as well as some additional explanations related to the exercise.
- Download the Intro_interactive_tutorial folder and save it on your local computer.
- Set your working directory to the folder containing EEB603_Interactive_tutorial.Rmd. You can do this in RStudio via Session -> Set Working Directory -> Choose Directory... or by using the setwd() function (e.g. setwd("~/Documents/Course_Reproducible_Science/Timetable/Intro_interactive_tutorial")).
- Option 1: Open EEB603_Interactive_tutorial.Rmd in RStudio and press the Run Document button on the upper side bar to launch the tutorial. It will appear in the Viewer panel (in the bottom right corner). You can open the interactive tutorial in your web browser by clicking on the third icon at the top of the Viewer panel. This procedure is also explained in the Youtube video. The tutorial can also be opened in your web browser by pressing the Open in Browser button.
- Option 2: Launch the tutorial directly from the R console by typing:
rmarkdown::run("EEB603_Interactive_tutorial.Rmd")
The procedure to develop interactive tutorials using learnr is presented here. To learn more about the syntax, the instructor encourages you to open EEB603_Interactive_tutorial.Rmd in RStudio and inspect the document. This will allow you to learn the syntax and associated procedures to:
In this chapter, we are investigating the causes leading to irreproducible science and discussing ways to mitigate this crisis. We are using results from the survey published by Baker (2016) as baseline to support our discussions.
Before delving into the causes leading to irreproducible science, we need to look into the differences between replication and reproduction. The material presented here has been adapted from RESCIENCE C.
Reproduction of a study means running the same computation (mostly referred to as code in this class) on the same input data, and then checking whether the results are the same. Reproduction can be considered as software testing at the level of a complete study.
Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions, but varying the technical details. For instance, it would mean using a protocol aiming at extracting genomic DNA developed on tomato and applying it on sagebrush. For computational work, this would mean using different software, running a simulation from different initial conditions, etc. The idea is to change something that everyone believes should not matter (e.g. both tomato and sagebrush are plants and have DNA), and see if the scientific conclusions are affected or not.
Overall, reproduction verifies that a computation was recorded with enough detail that it can be analyzed later or by someone else. On the other hand, replication explores which details matter for reaching a specific scientific conclusion. A replication attempt is most useful if reproducibility has already been verified. Otherwise, if replication fails or leads to different conclusions, you cannot trace the differences in the results back to the underlying code and data.
The websites listed here have been used to design this chapter:
Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery.
In this chapter, we will survey measures that can be adopted to optimize key elements of the scientific process by especially focusing on:
These measures aim at minimizing threats to the scientific process, thereby making it more open and transparent (Figure 4.1).
While studying this topic, keep in mind this quote from Richard Feynman:
The first principle is that you must not fool yourself – and you are the easiest person to fool.
This chapter is mostly based on the following resources:
The website of the Center for Open Science (https://cos.io) was also used to design the chapter content. More specifically, visit this webpage on study pre-registration.
The presentation slides for this chapter can be downloaded here.
The scientific process can be subdivided into six phases (Figure 4.1):
In order to facilitate the understanding of the material taught in this chapter, the six phases of the scientific process are split into three categories reflecting the study’s progress:
The distinction of those categories is very important to ensure the reproducibility and transparency of the study and to avoid falling into the traps described in Figure 4.1. For instance, considering the pre-study category as an independent step in the scientific process will promote study pre-registration and therefore avoid HARKing, P-hacking and publication bias (see Figure 4.1 and the Glossary section). The recognition of the post-study category is also very important since it encourages pre- and post-publication reviews therefore supporting a better dissemination and transparency of your research.
A hallmark of scientific creativity is the ability to see novel and unexpected patterns in data. However, a major challenge for scientists is to be open to new and important insights while simultaneously avoiding being misled by our tendency to see structure in randomness. The combination of:
Those factors can easily lead us to false conclusions and therefore be threats to our scientific process. Some of these threats (e.g. HARKing, P-hacking) are displayed in Figure 4.1 and definitions are provided in the Glossary section.
Our objective is to tackle threats by providing measures to ensure reproducibility and transparency. Following the approach proposed by Munafo et al. (2017), the measures studied in this chapter to ensure research reproducibility and transparency are organized into five categories. In addition, when possible, categories contain specific working themes designed to minimize the threats discussed above (see Figure 4.1).
These measures are not intended to be exhaustive, but aim at providing a broad, practical and evidence-based set of actions that can be implemented by researchers, institutions, journals and funders. They will also provide a road map for students to design their thesis projects.
This section describes measures that can be implemented when performing research (including, for example, study design, methods, statistics, and collaboration).
There is a substantial literature on the difficulty of avoiding cognitive biases. An effective solution to mitigate self-deception and unwanted biases is blinding. In some research contexts, participants and data collectors can be blinded to the experimental condition that participants are assigned to, and to the research hypotheses, while the data analyst can be blinded to key parts of the data. For example, during data preparation and cleaning (see chapters 6, 7), the identity of experimental conditions or the variable labels can be masked so that the output is not interpretable in terms of the research hypothesis.
Pre-registration of the study design, primary outcome(s) and analysis plan (see the Promoting study pre-registration section below) is a highly effective form of blinding because the data do not exist and the outcomes are not yet known.
Research design and statistical analysis are mutually dependent. Common misperceptions, such as the interpretation of P values, limitations of null-hypothesis significance testing, the meaning and importance of statistical power, the accuracy of reported effect sizes, and the likelihood that a sample size that generated a statistically significant finding will also be adequate to replicate a true finding, could all be addressed through improved statistical training. These concepts are presented in BIOL603, ADVANCED BIOMETRY.
Primary biodiversity occurrence data are at the core of research in Ecology & Evolution. They are, however, no longer gathered as they used to be, and the mass-production of observation-based (OB) occurrences is overthrowing the collection of specimen-based (SB) occurrences. Troudet et al. (2018) analyzed 536 million occurrences from the Global Biodiversity Information Facility (GBIF) database and concluded that from 1970 to 2016 the proportion of occurrences marked as traceable to tangible material (i.e., SB occurrences) fell from 68% to 18%. Moreover, the authors added that most of those specimen-based occurrences could not be readily traced back to a specimen because the necessary information was missing. This alarming trend (i.e. the low traceability of occurrences and therefore the low confidence of species identifications based on those observations) threatens the reproducibility of biodiversity research. For instance, low confidence in species identifications prevents mining larger databases (to infer e.g. species distribution, ecology, phenology, conservation status, phylogenetic position) to gather data allowing the testing of hypotheses.
Overall, in their study, Troudet et al. (2018) advocated that SB occurrences must be gathered, as a warrant to allow both repeating ecological and evolutionary studies and conducting rich and diverse investigations. They also suggested that, when securing voucher specimens is impossible, they must be replaced with OB occurrences combined with ancillary data (e.g., pictures, recordings, samples, DNA sequences). Ancillary data are instrumental for the usefulness of biodiversity occurrences and, sadly, they tend not to be shared. Such an approach helps ensure that recently collected primary biodiversity data do not become partly obsolete when identifications are in doubt.
Underpinning biodiversity occurrences with specimens (deposited in Natural History Museums and Botanical Gardens) allows:
Those additional data provided by specimens are key in the data collecting phase and will allow further analyses to thoroughly test hypotheses. We fully understand that SB occurrences can be problematic to process (especially in Ecology), but we would urge students to consider gathering ancillary data to back up their observations and make sure their analyses are reproducible.
The need for independent methodological support is well-established in some areas — many clinical trials, for example, have multidisciplinary trial steering committees to provide advice and oversee the design and conduct of the trial. The need for these committees grew out of the well-understood financial conflicts of interest that exist in many clinical trials. Including independent researchers (particularly methodologists with no personal investment in a research topic) in the design, monitoring, analysis or interpretation of research outcomes may mitigate some of those influences, and can be done either at the level of the individual research project or through a process facilitated by a funding agency.
Studies of statistical power persistently find it to be below (sometimes well below) 50%, across both time and the different disciplines studied (see Munafo et al., 2017 and references therein). Low statistical power increases the likelihood of obtaining both false-positive and false-negative results, meaning that it offers no advantage if the purpose is to accumulate knowledge. Despite this, low-powered research persists because of dysfunctional incentives, poor understanding of the consequences of low power, and lack of resources to improve power. Team science is a solution to the latter problem — instead of relying on the limited resources of single investigators, distributed collaboration across many study sites facilitates high-powered designs and greater potential for testing generalizability across the settings and populations sampled. This also brings greater scope for multiple theoretical and disciplinary perspectives, and a diverse range of research cultures and experiences, to be incorporated into a research project.
This section describes measures that can be implemented when communicating research (including, for example, reporting standards, study pre-registration, and disclosing conflicts of interest).
Progress in science relies in part on generating hypotheses with existing observations and testing hypotheses with new observations. This distinction between postdiction and prediction is appreciated conceptually, but is not respected in practice. Mistaking the generation of postdictions for the testing of predictions reduces the credibility of research findings. However, ordinary biases in human reasoning, such as hindsight bias, make it hard to avoid this mistake. An effective solution is to define the research questions and analysis plan before observing the research outcomes, a process called pre-registration. Pre-registration distinguishes analyses and outcomes that result from predictions from those that result from postdictions. A variety of practical strategies are available to make the best possible use of pre-registration in circumstances that fall short of the ideal application, such as when the data are pre-existing. Services are now available for pre-registration across all disciplines, facilitating a rapid increase in the practice. Widespread adoption of pre-registration will increase the distinctiveness between hypothesis generation and hypothesis testing and will improve the credibility of research findings (in terms of research quality and transparency).
In its simplest form study pre-registration (see Nosek et al., 2018) may simply comprise the registration of the basic study design, but it can also include a detailed pre-specification of the study procedures, outcomes and statistical analysis plan.
Study pre-registration was introduced to address two major problems:
Please see the Glossary section for definitions of these concepts.
Pre-registration will improve discoverability of research, but discoverability does not guarantee usability. Poor usability reflects difficulty in evaluating what was done, in reusing the methodology to assess reproducibility, and in incorporating the evidence into systematic reviews and meta-analyses. Improving the quality and transparency in the reporting of research is necessary to address this.
TOP guidelines (published in Nosek et al., 2015) offer standards as a basis for journals and funders to incentivize or require greater transparency in planning and reporting of research. More precisely, TOP guidelines include eight modular standards, each with three levels of increasing stringency (Figure 4.3). Journals are selecting which of the eight transparency standards they wish to implement and also select a level of implementation for each. These features provide flexibility for adoption depending on disciplinary variation, but simultaneously establish community standards.
Please find below the list of eight TOP modular standards:
Each standard has template text for three levels of transparency: Level 1, Level 2, and Level 3 (Figure 4.3). Adopting journals select among the levels based on their readiness to adopt milder to stronger transparency standards for authors and researchers. There are many factors that will influence level selection, including considerations for implementation, and concordance with disciplinary norms and expectations.
Over 1,000 journals or organizations had implemented one or more TOP-compliant policies as of August 2018 (e.g. Ecology Letters, The Royal Society, Science). The full list of journals implementing TOP guidelines is available at this URL: https://osf.io/2sk9f/
The material presented in chapter 1 focusing on R Markdown (as implemented in RStudio) is a response to the need to provide a unified environment linking publication, code and data.
The Center for Open Science also proposes a platform called “Open Science Framework” or OSF to achieve the same goal (Figure 4.3). OSF is a free and open source project management repository that supports researchers across their entire project life-cycle. As a collaboration tool, OSF helps researchers work on projects privately with a limited number of collaborators and make parts of their projects public, or make all the project publicly accessible for broader dissemination with citable, discoverable DOIs. As a workflow system, OSF enables connections to the many products researchers already use to streamline their process and increase efficiency.
With OSF’s workflow and storage integrations, researchers can manage their entire projects from one place. The OSF workflow connects the valuable research tools researchers are already using, so that they can effectively share the story of their research projects and eliminate data silos and information gaps (Figure 4.3). The OSF ecosystem is designed to allow all those tools to work together the way researchers do, removing barriers to collaboration and knowledge (Figure 4.3).
This section describes measures that can be implemented to support verification of research (including, for example, sharing data and methods).
Science is a social enterprise: independent and collaborative groups work to accumulate knowledge as a public good. The credibility of scientific claims is rooted in the evidence supporting them, which includes the methodology applied, the data acquired, the process of methodology implementation, and data analysis and outcome interpretation. Claims become credible by the community reviewing, criticizing, extending and reproducing the supporting evidence. However, without transparency, claims only achieve credibility based on trust in the confidence or authority of the originator. Transparency is superior to trust.
Open science refers to the process of making the content and process of producing evidence and claims transparent and accessible to others. Transparency is a scientific ideal, and adding ‘open’ should therefore be redundant. In reality, science often lacks openness: many published articles are not available to people without a personal or institutional subscription, and most data, materials and code supporting research outcomes are not made accessible, for example, in a public repository (however this is rapidly changing with several initiatives, e.g., Dryad digital repository).
Very little of the research process (for example, study protocols, analysis workflows, peer review) is accessible because, historically, there have been few opportunities to make it accessible even if one wanted to do so. This has motivated calls for open access, open data and open workflows (including analysis pipelines), but there are substantial barriers to meeting these ideals, including vested financial interests (particularly in scholarly publishing) and few incentives for researchers to pursue open practices.
To promote open science, several open-access journals were recently created (e.g. BMC, Frontiers, PLoS). These journals facilitate sharing of scientific research (and associated methods, data and code), but they are quite expensive (>$1500 on average). Waivers can be obtained for researchers based in certain countries, or institutions can sponsor those initiatives and have a certain number of papers per year published for “free”. However, if you do not fall into one of these categories, it will be quite challenging to pay these costs without support from a grant (NSF is making an effort to promote open science). The EEB program might be able to support some of those costs, but this will depend on the yearly budget (and when you ask). This topic is further investigated in Chapter 4.
This section describes measures that can be implemented when evaluating research (including, for example, peer review).
For most of the history of scientific publishing, two functions have been confounded — evaluation and dissemination. Journals have provided dissemination via sorting and delivering content to the research community, and gate-keeping via peer review to determine what is worth disseminating. However, with the advent of the internet, individual researchers are no longer dependent on publishers to bind, print and mail their research to subscribers. Dissemination is now easy and can be controlled by researchers themselves (see examples of preprint publishers below).
With increasing ease of dissemination, the role of publishers as gate-keepers is declining. Nevertheless, the other role of publishing, evaluation, remains a vital part of the research enterprise. Conventionally, a journal editor will select a limited number of reviewers to assess the suitability of a submission for a particular journal. However, more diverse evaluation processes are now emerging, allowing the collective wisdom of the scientific community to be harnessed. For example, some preprint services support public comments on manuscripts, a form of pre-publication review that can be used to improve the manuscript (see below). Other services, such as PubMed Commons and PubPeer, offer public platforms to comment on published works, facilitating post-publication peer review. At the same time, some journals are trialing ‘results-free’ review, where editorial decisions to accept are based solely on review of the rationale and study methods (that is, results-blind); for instance, PLoS ONE is applying this approach.
Both pre- and post-publication peer review mechanisms dramatically accelerate and expand the evaluation process. By sharing preprints, researchers can obtain rapid feedback on their work from a diverse community, rather than waiting several months for a few reviews in the conventional, closed peer review process. Using post-publication services, reviewers can make positive and critical commentary on articles instantly, rather than relying on the laborious, uncertain and lengthy process of authoring a commentary and submitting it to the publishing journal for possible publication.
bioRxiv (pronounced “bio-archive”) is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.
Articles are not peer-reviewed, edited, or typeset before being posted online. However, all articles undergo a basic screening process for offensive and/or non-scientific content and for material that might pose a health or bio-security risk and are checked for plagiarism. No endorsement of an article’s methods, assumptions, conclusions, or scientific quality by Cold Spring Harbor Laboratory is implied by its appearance in bioRxiv. An article may be posted prior to, or concurrently with, submission to a journal but should not be posted if it has already been accepted for publication by a journal.
PeerJ Preprints is a ‘preprint server’ for the Biological Sciences, Environmental Sciences, Medical Sciences, Health Sciences and Computer Sciences. A PeerJ Preprint is a draft of an article, abstract, or poster that has not yet been peer-reviewed for formal publication. Submit a draft, incomplete, or final version of your work for free.
Submissions to PeerJ Preprints are not formally peer-reviewed. Instead they are screened by PeerJ staff to ensure that they fit the subject area; do not contravene any of their policies; and that they can reasonably be considered a part of the academic literature. If a submission is found to be unsuitable in any of these respects then it will not be accepted for posting. Content which is considered to be non-scientific or pseudo-scientific will not pass the screening.
Publication is the currency of academic science and increases the likelihood of employment, funding, promotion and tenure. However, not all research is equally publishable. Positive, novel and clean results are more likely to be published than negative results, replications and results with loose ends; as a consequence, researchers are incentivized to produce the former, even at the cost of accuracy (see Nosek et al., 2012). These incentives ultimately increase the likelihood of false positives in the published literature. Shifting the incentives therefore offers an opportunity to increase the credibility and reproducibility of published results.
Funders, publishers, societies, institutions, editors, reviewers and authors all contribute to the cultural norms that create and sustain dysfunctional incentives. Changing the incentives is therefore a problem that requires a coordinated effort by all stakeholders to alter reward structures. There will always be incentives for innovative outcomes — those who discover new things will be rewarded more than those who do not. However, there can also be incentives for efficiency and effectiveness — those who conduct rigorous, transparent and reproducible research could be rewarded more than those who do not. There are promising examples of effective interventions for nudging incentives. For example, journals are adopting:
Collectively, and at scale, such efforts can shift incentives such that what is good for the scientist is also good for science — rigorous, transparent and reproducible research practices producing credible results.
HARKing: HARKing (hypothesizing after the results are known) is defined as presenting a post hoc hypothesis (i.e., one based on or informed by one’s results) in one’s research report as if it were, in fact, an a priori hypothesis.
Outcome switching: refers to the possibility of changing the outcomes of interest in the study depending on the observed results. A researcher may include ten variables that could be considered outcomes of the research, and — once the results are known — intentionally or unintentionally select the subset of outcomes that show statistically significant results as the outcomes of interest. The consequence is an increase in the likelihood that reported results are spurious by leveraging chance, while negative evidence gets ignored.
P-hacking: also known as “Data dredging” is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect. This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.
Publication bias: also known as the file drawer problem, refers to the fact that many more studies are conducted than published. Studies that obtain positive and novel results are more likely to be published than studies that obtain negative results or report replications of prior results. The consequence is that the published literature indicates stronger evidence for findings than exists in reality.
Balancing Open Science in support of reproducibility with the values of your stakeholders can be a challenging task.
Nowadays, funding agencies and journals are moving forward to promote Open Science and we started discussing this approach in Chapter 3. For instance, the White House is requiring immediate public access to all U.S.-funded research papers by 2025 without an embargo or cost. You can learn more about this initiative in this Science news article (doi: 10.1126/science.ade6076).
The overarching objective of this chapter is to study the mechanisms implemented to promote Open Science while highlighting circumstances where other principles might have to be adopted to respect different views, opinions and/or data sovereignty.
Overall, the content of this chapter aims at providing solutions to tackle threats to the scientific process (discussed in Chapters 2 and 3) as well as information to manage the data life cycle and research dissemination.
This chapter is based on:
Web resources
Publications
The Organisation for Economic Co-operation and Development (OECD) defines Open Science as follows:
“to make the primary outputs of publicly funded research results – publications and the research data – publicly accessible in digital format with no or minimal restriction”.
Several other agencies argue that the definition of Open Science should be expanded. For them, Open Science is about extending the principles of openness to the whole research cycle, fostering sharing and collaboration as early as possible thus entailing a systemic change to the way science and research is done.
The 7 pillars of Open Science are:
The FAIR Principles (Wilkinson et al., 2016) steward researchers in making outputs of research that are (Figure 5.1):
When combined, these four elements are designed to help:
Wilkinson et al. (2016) have provided the following guidelines that have to be applied to support the FAIR Principles:
This is the practice of researchers acting honestly, reliably, respectfully and being accountable for their actions.
At BSU, the Office of Research Compliance provides information and guidance on research integrity.
The next-generation metrics pillar of Open Science seeks to catalyze a shift in cultural thinking around the way in which bibliometrics are utilized in research, particularly when evaluating quality, and to go beyond simple citation counts and journal impact (Figure 5.2). Appropriate metrics, drawn from different sources and describing different things, can help us gain a broader understanding of the significance and impact of research.
As an example showing changes in how research is perceived, more institutions and funders are supporting the San Francisco Declaration of Research Assessment - DORA and openly reject the use of quantitative metrics commonly associated with journal impact factors as a measure of research quality. Among many others, Springer Nature has joined DORA and here are their commitments.
The future of scholarly communication is one of the most prominent pillars of Open Scholarship given its intention to shift the current academic publishing model towards fully Open Access. We will develop this topic further below in our section on open access and associated licenses.
This pillar describes the movement towards members of the public having a greater role within research, recognizing the invaluable role they play in providing insights a researcher may not typically have. Examples of such initiatives are e.g. eBird or iNaturalist.
Harnessing the advantages of the internet, openly available software packages and local knowledge, citizen science brings about a change in the way research is conducted – no longer limited to academic researchers, it encourages collaboration from groups across society.
Wagenknecht et al. (2021) discuss this topic and its implementation in the case of a European Citizen Science project, and Groom et al. (2017) reflect on the role of Citizen Science in biodiversity research. Overall, citizen science could really propel your research forward, but you have to be aware of the potential pitfalls. Please see this section in Chapter 3 where this topic was discussed.
This pillar focuses on identifying the training needs of researchers and sufficiently addressing any gaps in knowledge and skills around engaging with Open Science, such as making publications openly accessible, managing research data in line with the FAIR Principles (Figure 5.1) and acting with integrity.
All researchers at all levels should have access to education and skills programmes to support their work and continued learning. Further, skill development programmes should be opened up to other stakeholders in research, such as professional staff (including librarians and data managers) and members of the public, to facilitate the undertaking of citizen science. At BSU, the Albertsons Library has resources on this topic and also runs workshops and seminars.
Fostering engagement with the principle of Open Science requires reward and recognition of the efforts to do so – this pillar addresses barriers and champions best practice.
A perceived lack of reward and recognition for work undertaken to manage research data and make publications openly accessible discourages researchers from engaging with the principle of Open Science. Work falling under this pillar seeks to address these challenges and champion engagement with Open Science practices. Please see Chapter 3 for more details on some of these initiatives and rewards to promote Open Science.
Open access (OA) is a set of principles and a range of practices through which research outputs are distributed online, free of access charges or other barriers (Figure 5.3). Although publications are free of charge to users, researchers have to pay publication fees (usually between $1500 and $2500) to publish their research as OA. With open access strictly defined (according to the 2001 definition), barriers to copying or reuse are also reduced or removed by applying an open license for copyright. An open license is a license which allows others to reuse another creator’s work as they wish. Without a special license, these uses are normally prohibited by copyright, patent or commercial license. Most free licenses are worldwide, royalty-free, non-exclusive, and perpetual.
Since the revenue of most open access journals is earned from publication fees charged to the authors, OA publishers are motivated to increase their profits by accepting low-quality papers and by not performing thorough peer review. On the other hand, the prices for OA publications in the most prestigious journals have exceeded $5000, making such a publishing model unaffordable for a large number of researchers.
The increase in publishing cost has been called the “Open-Access Sequel to [the] Serials Crisis” (Khoo, 2019). To provide further context, the serial crisis involves unsustainable budgetary pressures on libraries due to hyperinflation of subscription costs. In this framework, OA was proposed as one way of coping with these costs because articles would not require ongoing subscriptions to remain accessible, but findings by Khoo (2019) might suggest that both systems have limitations.
Most open access journals use Creative Commons (CC) licenses and it is important that you and your co-authors are aware of their implications prior to submitting your manuscript, especially in regard to potential commercial usage of your work by third parties (Figure 5.3). A CC license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted “work”.
A CC license is used when an author wants to give other people the right to share, use, and build upon a work that the author has created. CC provides an author flexibility (e.g., they might choose to allow only non-commercial uses of a given work) and protects the people who use or redistribute an author’s work from concerns of copyright infringement as long as they abide by the conditions that are specified in the license by which the author distributes the work. To support this flexibility, the CC initiative provides an online tool to choose the best license to share your work, which I would advise you to consult to identify what license best suits your work. To know more about CC licenses, please visit this Wikipedia page.
The available licenses provided by the CC initiative are shown in Figure 5.4.
The descriptions of these licenses are always presented in this format:
Here is a list of some of the most commonly used CC licenses, with URLs leading to their descriptions:
We have been promoting Open Science, but is this approach always possible?
As pointed out by Carroll et al. (2021), as big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle (Figure 5.5).
The CARE Principles detail that the use of Indigenous data should result in tangible benefits for Indigenous collectives through inclusive development and innovation, improved governance and citizen engagement, and result in equitable outcomes.
Collective benefit is more likely to be realized when data ecosystems are designed to support Indigenous nations and when the use/reuse of data for resource allocation is consistent with community values. The United Nations Declaration on the Rights of Indigenous Peoples asserts Indigenous Peoples’ rights and interests in data and their authority to control their data. Access to ‘data for governance’ is vital to support self-determination, and Indigenous nations should be actively involved in ‘governance of data’ to ensure ethical reuse of data. Given that the majority of Indigenous data are controlled by non-Indigenous institutions, there is a responsibility to engage respectfully with those communities to ensure that the use of Indigenous data supports capacity development, increasing community data capabilities, and the strengthening of Indigenous languages and cultures. Similarly, Indigenous Peoples’ ethics should inform the use of data across time in order to minimize harm, maximize benefits, promote justice, and allow for future use (see Carroll et al., 2021 for more details).
The CARE Principles (Figure 5.5) are directly connected to the concept of data sovereignty. We will provide a definition here and it is especially important to cover such material in this course since data sovereignty could impact your data sharing protocols. More on this topic will be discussed in Chapter 5.
Data sovereignty is the idea that data are subject to the laws and governance structures of the nation where they are collected. The concept of data sovereignty is closely linked with data security, cloud computing, network sovereignty and technological sovereignty. Unlike technological sovereignty, which is vaguely defined and can be used as an umbrella term in policymaking, data sovereignty is specifically concerned with questions surrounding the data itself. Data sovereignty is usually discussed in two ways: in relation to Indigenous groups and Indigenous autonomy from post-colonial states, or in relation to transnational data flow. With the rise of cloud computing, many countries have passed various laws around the control and storage of data, all of which reflect measures of data sovereignty. More than 100 countries have some sort of data sovereignty laws in place. With self-sovereign identity (SSI), individual identity holders can fully create and control their credentials, although a nation can still issue a digital identity in that paradigm.
This chapter is subdivided into two parts as follows:
Good data management is fundamental to research excellence. It produces high-quality research data that are accessible to others and usable in the future (see TOP guidelines in chapter 3). The value of data is now explicitly acknowledged through citations (e.g. GBIF and Dryad repositories provide DOIs to cite datasets) so researchers can make a difference to their own careers, as well as to their fields of research, by sharing and making data available for reuse. These citations have to be directly included in publications in devoted sections usually entitled “Data Availability Statement” located at the end of the article (see Figure 6.1 for an example).
This chapter aims at helping students navigate data management firstly by explaining what data and data management are and why data sharing is important, and secondly by providing advice and examples of best practice in data management.
This chapter is based on:
The presentation slides for Chapter 5 - part A can be downloaded here.
Research data are the factual pieces of information used to test research hypotheses. Data can be classified into five categories:
A key challenge facing researchers today is the need to work with different data sources. It is not uncommon for projects to integrate any combination of data types into a single analysis, even drawing on data from disciplines outside Ecology and Evolution. As research becomes increasingly collaborative and interdisciplinary, issues surrounding data management are growing in prevalence.
Data have a longer lifespan than the project they were created for, as illustrated by the data life-cycle displayed in Figure 6.2.
Some projects may only focus on certain parts of the data life-cycle, such as primary data creation, or reusing others’ data. Other projects may go through several revolutions of the cycle. Either way, most researchers will work with data at all stages throughout their career.
Data management concerns how you plan for all stages of the data life-cycle and implement this plan throughout the research project. Done effectively it will ensure that the data life-cycle is kept in motion. It will also keep the research process efficient and ensure that your data meet all the expectations set by you, funders, research institutions, legislation and publishers (e.g. copyright, data protection).
In order to bring some perspective on this topic, ask yourself this question:
Would a colleague be able to take over my project tomorrow if I disappeared, or make sense of the data without talking to me?
If you can answer YES, then you are managing your data well!
Regardless of whether your funder requires you to have a data management or sharing plan as part of a grant application, having such a plan in place before you begin your research project will mean that you are prepared for any data management issues that may come your way (see the Data management checklist section below).
Here are a few things that you should consider before planning your data management workflow:
In the data life-cycle (Figure 6.2), creating datasets occurs as a researcher collects data in the field or lab, and digitizes them to end up with a raw dataset.
# Workflow associated with creating data
Collect data (in the field and/or lab) --> Digitize data --> Raw dataset
!! Perform quality checks @ each step to validate data !!
Quality control during data collection is important because often there is only one opportunity to collect data from a given situation. Researchers should be critical of methods before collection begins – high-quality methods will result in high-quality data. Likewise, when collection is under way, detailed documentation of the collection process should be kept as evidence of quality.
Data may be collected directly in a digital form using devices that feed results straight into a computer/tablet or they may be collected as hand-written notes. Either way, there will be some level of processing involved to end up with a digital raw dataset.
Key things to consider during data digitization include:
- Documenting your data, for instance in a Readme.txt file or, even better, in protocol files (later attached as appendixes of your manuscript), or following another metadata standard, including a definition of each parameter, the units used and codes for missing values.

Data should be processed into a format that is suited to subsequent analyses and ensures long-term usability. Data are at risk of being lost if the hardware and software originally used to create and process them are rendered obsolete. Therefore, data should be well organized, structured, named and versioned in standard formats that can be interpreted in the future (see the Data structure and organisation of files section below).
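For instance, missing-value codes defined during digitization can be declared when importing the raw data, so that they are converted to NA rather than treated as measurements. The sketch below assumes a -9999 code; this code and the file contents are illustrative, not part of the course material:
# Minimal sketch: declare the missing-value code(s) used during digitization
# (-9999 is an assumed convention; adapt to your own Readme/protocol files)
species_data <- read.csv("Project_ID/Data/species_data.csv",
    na.strings = c("-9999", "NA"))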
Here are some guidelines to ensure best processing of data (Figure 6.2):
- Use non-proprietary file formats, such as comma-separated values (.csv) for tabular data. Other non-proprietary formats include: plain text files (.txt) for text; and GIF, JPEG and PNG for images.
- Keep a record of the processing steps applied to your data (e.g. in a Readme.txt file).
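As an illustration of the first guideline, a processed dataset can be exported to a plain-text format (a sketch; the object name and output path are illustrative):
# Minimal sketch: save processed data as .csv for long-term usability
write.csv(species_data, file = "Project_ID/Output/species_data_clean.csv",
    row.names = FALSE)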
The UNIX tree package was used here to infer the directory structure of the folder Project_ID/
and several arguments were set to obtain data on files that could be used to check for file names and location as well as version control:
- -s: Print the size of each file in bytes along with the name.
- -D: Print the date of the last modification time or, if -c is used, the last status change time for the file listed.

To install the tree package on your Mac, do the following:
#Install tree by typing the following code in a terminal
brew install tree
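Once installed, a directory tree similar to Figure 6.3 can be produced with a command along these lines (a sketch; adjust the path to wherever Project_ID is stored on your computer):
# Print the directory tree with file sizes (-s) and modification dates (-D)
tree -s -D Project_ID/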
Producing good documentation and metadata ensures that data can be understood and used in the long term (Figure 6.2). Documentation (i.e. the manual associated with your data) describes the data, explains any manipulations and provides contextual information – no room should be left for others to misinterpret the data.
All documentation requirements should be identified at the planning stages so you are prepared for it during all stages of the data life-cycle, particularly during data creation and processing. This will avoid having to perform a rescue mission in situations where you have forgotten what has happened or when a collaborator has left without leaving any key documentation behind.
Data documentation includes information at project and data levels and should cover the following information.
If a software package such as R is used for processing data, much of the data level documentation will be created and embedded during analysis.
Metadata help others discover data through searching and browsing online and enable machine-to-machine interoperability of data, which is necessary for data reuse. Metadata are created by using either a data center’s deposit form, a metadata editor, or a metadata creator tool, which can be searched for online. Metadata follow a standard structure and come in three forms:
To protect data from loss and to make sure data are securely stored, good data management should include a strategy for backing up and storing data effectively (see where this step fits in Figure 6.2). It is recommended to keep three versions of your data: the original, an external/local copy and an external/remote copy. Talk with your thesis supervisor to identify the best procedure for ensuring the preservation of your data.
When designing a backup strategy, thought should be given to the possible means by which data loss could occur. These include:
An ideal backup strategy should provide protection against all the risks, but it can be sensible to consider which are the most likely to occur in any particular context and be aware of these when designing your backup strategy.
Data storage, whether of the original or backed up data, needs to be robust. This is true whether the data are stored on paper or electronically, but electronic storage raises particular issues.
Data can be stored and backed up on:
All aspects of data management lead up to data discovery and reuse by others. Intellectual property rights, licenses and permissions, which concern reuse of data, should be explained in the data documentation and/or metadata.
At this stage of the life-cycle it is important to state your expectations for the reuse of your data, e.g. terms of acknowledgment, citation and co-authorship. Likewise, it becomes the responsibility of others to reuse data effectively, credit the collectors of the original data, cite the original data and manage any subsequent research to the same effect.
When requesting to use someone else’s data it is important to clearly state the purpose of the request, including the idea you will be addressing and your expectations for co-authorship or acknowledgment. Co-authorship is a complex issue and should be discussed with any collaborators at the outset of a project.
Increasing openness to data and ensuring long-term preservation of data fosters collaboration and transparency, furthering research that aims to answer the big questions in ecology and evolution. By implementing good data management practices, researchers can ensure that high-quality data are preserved for the research community and will play a role in advancing science for future generations.
The data management checklist from the UK Data Archive will help you design your own data management planning and data sharing. The text is available below:
The following steps have to be fully integrated in order to produce reproducible code:
In this chapter, we will cover steps 1 to 4. Steps 5 and 7 have been respectively studied in chapters 1 and 5 (Data management), whereas step 6 will be covered in the bioinformatic tutorial associated to chapter 12.
This chapter is based on:
The presentation slides for Chapter 5 - part B can be downloaded here.
To conduct the exercises presented in this chapter, you will need to:
More information on these items is provided below.
The data (Project_ID
) used for this chapter are deposited on our shared Google Drive at this address:
Reproducible_Science > Exercises > Chapter_5 > Project_ID
Please download this whole folder prior to starting the exercises.
Here, we are using code developed in Chapter 1 to make sure the R dependencies (data.tree, DiagrammeR) are installed and loaded prior to pursuing the exercises.
## ~~~ 1. List all required packages ~~~ Object (args)
## provided by user with names of packages stored into a
## vector
pkg <- c("data.tree", "DiagrammeR")
## ~~~ 2. Check if pkg are installed ~~~
print("Check if packages are installed")
## [1] "Check if packages are installed"
# This line outputs a list of packages that are not
# installed
new.pkg <- pkg[!(pkg %in% installed.packages())]
## ~~~ 3. Install missing packages ~~~
if (length(new.pkg) > 0) {
print(paste("Install missing package(s):", new.pkg, sep = " "))
install.packages(new.pkg, dependencies = TRUE)
} else {
print("All packages are already installed!")
}
## [1] "All packages are already installed!"
## ~~~ 4. Load all packages ~~~
print("Load packages and return status")
## [1] "Load packages and return status"
# Here we use the sapply() function to require all the
# packages
sapply(pkg, require, character.only = TRUE)
## data.tree DiagrammeR
## TRUE TRUE
Organizing your project for reproducibility starts by drawing a workflow serving as a basis to guide your project implementation. An example of a reproducible project workflow is displayed in Figure 6.4. This workflow will be used as a template for the material presented in this chapter.
The fundamental idea behind a robust, reproducible analysis is a clean, repeatable script-based workflow (i.e. the sequence of tasks from the start to the end of a project) linking raw data through to clean data and to final analysis outputs.
Please find below some key concepts associated with this task:
The simplest and most effective way of documenting your workflow – its inputs and outputs – is through good file system organization, and informative, consistent naming of materials associated with your analysis. The name and location of files should be as informative as possible on what a file contains, why it exists, and how it relates to other files in the project. These principles extend to all files in your project (not just scripts) and are also intimately linked to good research data management (see Chapter 5: Data management).
It is best to keep all files associated with a particular project in a single root directory. RStudio projects offer a great way to keep everything together in a self-contained and portable (i.e. so they can be moved from computer to computer) manner, allowing internal pathways to data and other scripts to remain valid even when shared or moved.
There is no single best way to organize a file system. The key is to make sure that the structure of directories and the location of files are consistent and informative, and that they work for you. Please find below an example of a basic project directory structure (Figure 6.5):
- The data folder contains all input data (and metadata) used in the analysis.
- The MS folder contains the manuscript.
- The Figures_&_Tables folder contains figures and tables generated by the analyses.
- The Output folder contains any type of intermediate or output files (e.g. simulation outputs, models, processed datasets, etc.). You might separate this and also have a cleaned-data folder.
- The R_functions folder contains R scripts with function definitions.
- The Reports folder contains RMarkdown files that document the analysis or report on results.
- R scripts (*.R) that actually do things are stored in the root directory together with the README.md file. If your project has too many scripts, you might consider organizing them in a separate folder.

Usually, code and software are licensed under a GNU Affero General Public License. This is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software. For instance, this is the licence that the instructor has used for this class.
Inferring the directory tree structure of your project provides a simple and efficient way to summarize the data structure and organization of files related to your project as well as track versioning. The R base functions list.files() and file.info() can be combined to obtain information on files stored in your project. Please see code below for an example associated to Figure 6.5.
# Produce a list of all files in working directory
# (Project-ID) together with info related to those files
file.info(list.files(path = "Project_ID", recursive = TRUE, full.names = TRUE))
## size isdir mode
## Project_ID/Data/species_data.csv 6 FALSE 700
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 10282 FALSE 700
## Project_ID/MS/MS_et_al.docx 11705 FALSE 700
## Project_ID/Output/DataStr_network.html 256368 FALSE 700
## Project_ID/Output/DataStr_tree.pdf 3611 FALSE 700
## Project_ID/R_functions/check.install.pkg.R 684 FALSE 700
## Project_ID/README.md 1000 FALSE 700
## Project_ID/Reports/Documentation.md 14 FALSE 700
## Project_ID/Scripts/01_download_data.R 14 FALSE 700
## Project_ID/Scripts/02_clean_data.R 14 FALSE 700
## Project_ID/Scripts/03_exploratory_analyses.R 14 FALSE 700
## Project_ID/Scripts/04_fit_models.R 14 FALSE 700
## Project_ID/Scripts/05_generate_figures.R 14 FALSE 700
## mtime
## Project_ID/Data/species_data.csv 2018-09-12 10:02:55
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2018-09-09 10:38:23
## Project_ID/MS/MS_et_al.docx 2019-09-23 11:51:58
## Project_ID/Output/DataStr_network.html 2019-09-23 13:27:30
## Project_ID/Output/DataStr_tree.pdf 2019-09-23 13:04:03
## Project_ID/R_functions/check.install.pkg.R 2019-09-10 09:54:03
## Project_ID/README.md 2021-10-03 13:32:58
## Project_ID/Reports/Documentation.md 2018-09-12 08:56:42
## Project_ID/Scripts/01_download_data.R 2018-09-12 08:56:42
## Project_ID/Scripts/02_clean_data.R 2018-09-12 08:56:42
## Project_ID/Scripts/03_exploratory_analyses.R 2018-09-12 08:56:42
## Project_ID/Scripts/04_fit_models.R 2018-09-12 08:56:42
## Project_ID/Scripts/05_generate_figures.R 2018-09-12 08:56:42
## ctime
## Project_ID/Data/species_data.csv 2022-06-24 14:49:58
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 14:49:58
## Project_ID/MS/MS_et_al.docx 2022-06-24 14:49:56
## Project_ID/Output/DataStr_network.html 2022-06-24 14:49:56
## Project_ID/Output/DataStr_tree.pdf 2022-06-24 14:49:57
## Project_ID/R_functions/check.install.pkg.R 2022-06-24 14:49:57
## Project_ID/README.md 2022-06-24 14:49:57
## Project_ID/Reports/Documentation.md 2022-06-24 14:49:58
## Project_ID/Scripts/01_download_data.R 2022-06-24 14:49:57
## Project_ID/Scripts/02_clean_data.R 2022-06-24 14:49:57
## Project_ID/Scripts/03_exploratory_analyses.R 2022-06-24 14:49:58
## Project_ID/Scripts/04_fit_models.R 2022-06-24 14:49:57
## Project_ID/Scripts/05_generate_figures.R 2022-06-24 14:49:57
## atime uid
## Project_ID/Data/species_data.csv 2022-06-24 15:34:11 502
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 15:34:11 502
## Project_ID/MS/MS_et_al.docx 2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_network.html 2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_tree.pdf 2022-06-24 15:34:11 502
## Project_ID/R_functions/check.install.pkg.R 2022-06-24 15:34:11 502
## Project_ID/README.md 2022-06-24 15:34:11 502
## Project_ID/Reports/Documentation.md 2022-06-24 15:34:11 502
## Project_ID/Scripts/01_download_data.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/02_clean_data.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/03_exploratory_analyses.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/04_fit_models.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/05_generate_figures.R 2022-06-24 15:34:11 502
## gid uname grname
## Project_ID/Data/species_data.csv 20 sven staff
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 20 sven staff
## Project_ID/MS/MS_et_al.docx 20 sven staff
## Project_ID/Output/DataStr_network.html 20 sven staff
## Project_ID/Output/DataStr_tree.pdf 20 sven staff
## Project_ID/R_functions/check.install.pkg.R 20 sven staff
## Project_ID/README.md 20 sven staff
## Project_ID/Reports/Documentation.md 20 sven staff
## Project_ID/Scripts/01_download_data.R 20 sven staff
## Project_ID/Scripts/02_clean_data.R 20 sven staff
## Project_ID/Scripts/03_exploratory_analyses.R 20 sven staff
## Project_ID/Scripts/04_fit_models.R 20 sven staff
## Project_ID/Scripts/05_generate_figures.R 20 sven staff
The code presented above could form the core of a user-defined function aimed at managing files and ensuring data reliability for your project. To help you define such a function, let’s investigate this further and produce a diagram summarizing your project structure. This objective is achieved in three steps (see Figure 6.6):
1. Produce a data.frame containing the paths of all files in the project (using the list.files() and file.info() functions shown above).
2. Convert the data.frame into a data.tree class object. This is done using the data.tree::as.Node() function from the data.tree package.
3. Style the data.tree object and plot the output using the DiagrammeR package.

WARNING: Prior to starting the exercise, please go on the shared Google Drive and download the Project_ID folder at this path:
Reproducible_Science > Exercises > Chapter_5 > Project_ID
Procedure to follow:
1. Save the Project_ID/ folder in a directory entitled Chapter_5_PartB.
2. Open a new .R script and save it in Chapter_5_PartB.
3. Reproduce the code below to generate the diagram of Project_ID/. Also read the text below, which further explains the approach.
### Load R packages
library(data.tree)
library(DiagrammeR)
### Step 1: Produce a list of all files in Project_ID
filesInfo <- file.info(list.files(path = "Project_ID", recursive = TRUE,
full.name = TRUE))
### Step 2: Convert filesInfo into data.tree class
myproj <- data.tree::as.Node(data.frame(pathString = rownames(filesInfo)))
# Inspect output
print(myproj)
## levelName
## 1 Project_ID
## 2 ¦--Data
## 3 ¦ °--species_data.csv
## 4 ¦--Figures_&_Tables
## 5 ¦ °--Fig_01_Data_lifecycle.pdf
## 6 ¦--MS
## 7 ¦ °--MS_et_al.docx
## 8 ¦--Output
## 9 ¦ ¦--DataStr_network.html
## 10 ¦ °--DataStr_tree.pdf
## 11 ¦--R_functions
## 12 ¦ °--check.install.pkg.R
## 13 ¦--README.md
## 14 ¦--Reports
## 15 ¦ °--Documentation.md
## 16 °--Scripts
## 17 ¦--01_download_data.R
## 18 ¦--02_clean_data.R
## 19 ¦--03_exploratory_analyses.R
## 20 ¦--04_fit_models.R
## 21 °--05_generate_figures.R
### Step 3: Prepare and plot diagram of project structure
### (it requires DiagrammeR)
# Set general parameters related to graph
data.tree::SetGraphStyle(myproj$root, rankdir = "LR")
# Set parameters for edges
data.tree::SetEdgeStyle(myproj$root, arrowhead = "vee", color = "grey",
penwidth = "2px")
# Set parameters for nodes
data.tree::SetNodeStyle(myproj, style = "rounded", shape = "box")
# Apply specific criteria only to children nodes of Scripts
# and R_functions folders
data.tree::SetNodeStyle(myproj$Scripts, style = "box", penwidth = "2px")
data.tree::SetNodeStyle(myproj$R_functions, style = "box", penwidth = "2px")
# Plot diagram
plot(myproj)
Finally, the R DiagrammeR package unfortunately does not allow you to easily save the graph to a file (step 4) using e.g. the pdf() and dev.off() functions, but this task can be done in RStudio as follows: the graph is displayed in the Viewer window (in the bottom right panel) and can be exported by clicking on the Export button and selecting Save as Image..., as shown in Figure 6.7.
To find out more about your options to export/save DiagrammeR graphs please visit this website:
Good naming extends to all files, folders and even objects in your analysis and serves to make the contents and relationships among elements of your analysis understandable, searchable and organised in a logical fashion (see Figure 6.5 for examples and Chapter 5: Data management for more details).
Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a normal programming language (here R), but is intended for human reading rather than machine reading. Here, you will establish the big steps (and their associated tasks) and tie R functions (existing or to be written) to those steps. This provides the backbone of your code and will support writing it.
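For instance, the pseudocode backbone of a simple analysis could be sketched as a commented outline in an R script (the steps and function names below are hypothetical placeholders):
### Pseudocode backbone (hypothetical steps and function names)
# Step 1: Import raw data
#   - read.csv() on files stored in Data/
# Step 2: Clean data
#   - clean_data(): user-defined function to be written in R_functions/
# Step 3: Fit models
#   - fit_models(): user-defined function to be written
# Step 4: Produce figures and tables
#   - generate_figures(): user-defined function to be written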
Writing clear, reproducible code has (at least) three main benefits:
1. It makes returning to the code much easier a few months down the line, whether revisiting an old project or making revisions following peer review.
2. The results of your analysis are more easily scrutinized by the readers of your paper, meaning it is easier to show their validity.
3. Having clean and reproducible code available can encourage greater uptake of new methods that you have developed.
To write clear and reproducible code, it is recommended to follow the workflow depicted in Figure 6.4. The following section explains each part of the workflow along with some tips for writing code. Although this workflow would be the ‘gold standard’, just picking up some of the elements can help to make your code more effective and readable.
The foundation of writing readable code is to choose a logical and readable coding style, and to stick to it. Some key elements to consider when developing a coding style are:
### Naming files
# Good
fit-models.R
utility-functions.R
# Bad
foo.r
stuff.r
### Naming objects
# Good
day_one
day_1
# Bad
first_day_of_the_month
DayOne
dayone
djm1
Place spaces around all operators (=, +, -, <-, etc.) and after commas (much like in a sentence).
### Spacing
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
### Indentation
long_function_name <- function(a = "a long argument",
b = "another argument",
c = "another long argument") {
# As usual code is indented by two spaces.
}
Use <-, not =, for assignment.
### Assignment
# Good
x <- 5
# Bad
x = 5
The most important role of a style guide, however, is to introduce consistency to scripts.
When working collaboratively, portability between machines is very important, i.e. will your code work on someone else’s computer? Portability is also important if code is being run on another server, for example a High Performance Cluster. One way to improve portability of code is to avoid using absolute paths and use only relative paths (see https://en.wikipedia.org/wiki/Path_(computing)#Absolute_and_relative_paths).
For example, based on species_data.csv stored in the Data folder shown in Figure 6.5:
# Absolute path -----------------------------
C:/Project_ID/Data/species_data.csv
# Project_ID = Project root folder = working directory
# Relative path ------
Data/species_data.csv
Relative paths are particularly useful when transferring between computers because while I may have stored my project folder in ‘C:/Project_ID/’, you may have yours stored in ‘C:/Users/My Documents’. Using relative paths and running from the project folder will ensure that file-not-found errors are avoided.
RStudio is especially designed to help portability of code. For instance, you can easily set the working directory in RStudio by clicking:
Session > Set Working Directory > To Source File Location
or using the setwd() function. Please see Chapter 1 for more details.
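As a minimal sketch (assuming the working directory is set to the project root and the file from Figure 6.5 exists), reading data with a relative path could look as follows:
### Read data using a relative path from the project root (Project_ID);
### this call remains valid on any computer where the project is copied
species <- read.csv("Data/species_data.csv")
# The equivalent absolute path (e.g. "C:/Project_ID/Data/species_data.csv")
# would break as soon as the project is moved or shared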
How often have you revisited an old script six months down the line and not been able to figure out what you had been doing? Or have taken on a script from a collaborator and not been able to understand what their code is doing and why? An easy win for making code more readable and reproducible is the liberal, and effective, use of comments. A comment is a line of code that is visible, but does not get run with the rest of the script. In R and Python this is signified by beginning the line with a #.
One good principle to adhere to is to comment the ‘why’ rather than the ‘what’. The code itself tells the reader what is being done, it is far more important to document the reasoning behind a particular section of code or, if it is doing something nonstandard or complicated, to take some time to describe that section of code.
It is also good practice to use comments to add an overview at the beginning of a script, and to use commented lines of --- to break up the script, e.g. in R:
# Load data -------
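For example, a script overview and section breaks could look like the sketch below (the project details are hypothetical):
### Hypothetical script overview and section breaks
# Project: Project_ID
# Author: Your Name
# Aim: Clean raw species data and fit models
# Load data ---------------------------------------------------------
# Clean data --------------------------------------------------------
# Fit models --------------------------------------------------------
# Generate figures --------------------------------------------------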
Often when you are analyzing data, you need to repeat the same task many times. For example, you might have several files that all need loading and cleaning in the same way, or you might need to perform the same analysis for multiple species or parameters. The best way (by far) to handle such tasks is to write functions and store them in a specific folder (e.g. R_functions in Figure 6.5). These functions will be loaded into the R environment (and therefore made available to users) by using the source() function, which will be placed at the top of your R script. Please see Chapter 1 part D for more details.
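As a minimal sketch (the function name and file paths are hypothetical), a user-defined function stored in R_functions/ could be defined and loaded as follows:
### In R_functions/clean_data.R (hypothetical user-defined function)
clean_data <- function(x) {
    # Remove rows with missing values and standardize column names
    x <- na.omit(x)
    names(x) <- tolower(names(x))
    return(x)
}

### At the top of your analysis script: load the function into the
### R environment and apply it to each raw data set
source("R_functions/clean_data.R")
species <- clean_data(read.csv("Data/species_data.csv"))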
In the experimental sciences, rigorous testing is applied to ensure that results are accurate, reproducible and reliable. Testing will show that the experimental setup is doing what it is meant to do and will quantify any systematic biases. Results of experiments will not be trusted without such tests; why should your code be any different? Testing scientific code allows you to be sure that it is working as intended and to understand and quantify any limitations of the code. Using tests can also help to speed up the code development process by finding errors early on.
Although the instructor recognizes the importance of establishing formal testing protocols, such tests are especially relevant if you design R packages. For this class, the instructor recommends informal testing: load the functions you have written and run ad hoc tests at the command line to make sure that they perform as expected.
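For instance, an ad hoc test of the hypothetical clean_data() function sketched above could be run at the command line with stopifnot():
### Ad hoc test of the (hypothetical) clean_data() function
source("R_functions/clean_data.R")
test_input <- data.frame(Height = c(1.2, NA, 3.4), Species = c("a", "b", "c"))
test_output <- clean_data(test_input)
# The cleaned data should contain no missing values and lower-case names
stopifnot(!anyNA(test_output), names(test_output) == c("height", "species"))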
Reproducibility is also about making sure someone else can re-use your code to obtain the same results as you (see Appendix 1 for more details).
For someone else to be able to reproduce the results included in your report, you need to provide more than the code and the data. You also need to document the exact versions of all the packages, libraries, and software you used, and potentially your operating system as well as your hardware.
R itself is very stable, and the core team of developers takes backward compatibility (old code works with recent versions of R) very seriously. However, default values in some functions change, and new functions get introduced regularly. If you wrote your code on a recent version of R and give it to someone who has not upgraded recently, they may not be able to run your code. Code written for one version of a package may produce very different results with a more recent version.
With R, the simplest (but a useful and important) approach to document your dependencies is to report the output of sessionInfo() (or devtools::session_info()). Among other information, this will show all the packages and their versions that are loaded in the session you used to run your analysis. If someone wants to recreate your analysis, they will know which packages they will need to install. Please see Appendix 2 for more details.
For instance, here is the output of sessionInfo() showing the R version and packages that I used to create this document:
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rticles_0.23 DiagrammeR_1.0.9 DT_0.24
## [4] data.tree_1.0.0 kfigr_1.2.1 devtools_2.4.4
## [7] usethis_2.1.6 bibtex_0.4.2.3 knitcitations_1.0.12
## [10] htmltools_0.5.3 prettydoc_0.4.1 magrittr_2.0.3
## [13] dplyr_1.1.2 kableExtra_1.3.4 formattable_0.2.1
## [16] bookdown_0.33 rmarkdown_2.21 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.5 tidyr_1.3.0 sass_0.4.2 pkgload_1.3.2.1
## [5] jsonlite_1.8.7 viridisLite_0.4.2 bslib_0.4.0 shiny_1.7.2
## [9] highr_0.9 yaml_2.3.7 remotes_2.4.2 sessioninfo_1.2.2
## [13] pillar_1.9.0 glue_1.6.2 digest_0.6.33 RColorBrewer_1.1-3
## [17] promises_1.2.0.1 rvest_1.0.3 RefManageR_1.3.0 colorspace_2.1-0
## [21] httpuv_1.6.5 plyr_1.8.7 pkgconfig_2.0.3 purrr_1.0.2
## [25] xtable_1.8-4 scales_1.2.1 webshot_0.5.4 processx_3.8.2
## [29] svglite_2.1.0 later_1.3.0 tibble_3.2.1 generics_0.1.3
## [33] ellipsis_0.3.2 withr_2.5.0 cachem_1.0.6 cli_3.6.1
## [37] crayon_1.5.2 mime_0.12 memoise_2.0.1 evaluate_0.21
## [41] ps_1.7.5 fs_1.6.3 fansi_1.0.4 xml2_1.3.5
## [45] pkgbuild_1.3.1 profvis_0.3.7 tools_4.2.0 prettyunits_1.1.1
## [49] formatR_1.12 lifecycle_1.0.3 stringr_1.5.0 munsell_0.5.0
## [53] callr_3.7.3 compiler_4.2.0 jquerylib_0.1.4 systemfonts_1.0.4
## [57] rlang_1.1.1 rstudioapi_0.14 visNetwork_2.1.0 htmlwidgets_1.5.4
## [61] miniUI_0.1.1.1 R6_2.5.1 lubridate_1.8.0 fastmap_1.1.0
## [65] utf8_1.2.3 stringi_1.7.12 Rcpp_1.0.11 vctrs_0.6.3
## [69] tidyselect_1.2.0 xfun_0.36 urlchecker_1.0.1
This will not be covered in this class, but it is worth noting that there are at least two R packages that allow you to better manage dependencies and recreate your setup. Those packages are:
The instructor expects students to discuss with their supervisors to identify if those packages are potentially useful for their projects.
This chapter is subdivided into two parts as follows:
To support the learning outcome, we will focus on discussing:
This chapter is mostly based on the following books, publications and web resources:
Books and Guides
Publications
Websites
The presentation slides for Chapter 6 - part A can be downloaded here.
Publishing research results is the one thing that unites scientists across disciplines, and it is a necessary part of the scientific process. You can have the best ideas in the world, but if you don’t communicate them clearly enough to be published, your work won’t be acknowledged by the scientific community.
By publishing you are achieving three key goals for yourself and the larger scientific endeavor:
In biology, publishing research essentially means publishing journal articles.
Before beginning writing your journal article and thinking where to submit it, it is important to thoroughly understand your own research and know the key conclusion you want to communicate (see chapter 3). In other words, what is your take home message?
Consider your conclusion and ask yourself, is it:
If you can answer ‘yes’ to all three, you have a good foundation message for a paper.
Shape the whole narrative of your paper around this message.
Once you know your message, getting your research published will be a four-step process:
Each step will be discussed below. Please seek support from your supervisor to learn more about the specifics of your field.
To target the best journal to publish your research, you need to ask yourself what audience do I want my paper to reach?
Your manuscript should be tailored to the journal you want to submit to in terms of content and in terms of style (as outlined in journals’ author guidelines). To confirm that a journal is the best outlet to publish your research ask yourself this question: can I relate my research to other papers published in this journal?
Here are some things to consider when choosing which journal to submit to:
Look closely at what the journal publishes; manuscripts are often rejected on the basis that they would be more suitable for another journal. There can be crossover between different journals’ aims and scope – differences may be subtle, but all important when it comes to getting accepted.
Do you want your article read by a more specialist audience working on closely related topics to yours, or researchers within your broader discipline?
Once you have decided which journal you are most interested in, make sure that you tailor the article according to its aims and scope.
It is a good sign if you recognize the names of the editors and editorial board members of a journal from the work you have already encountered (even better if they contributed to some of the references cited in your manuscript). Research who would likely deal with your paper if you submitted to a journal and find someone who would appreciate reading your paper. You can suggest handling editors in your cover letter or in the submission form, if it allows, but be aware that journals do not have to follow your suggestions and/or requests.
A summary of our previously discussed material is presented below to provide more context for this chapter, but please consult Chapter 4 for more details on this topic.
Impact factors are the most widely used metric to compare journal quality, based on the citations a journal receives. However, other metrics are becoming more common, e.g. the altmetric score, which measures the impact of individual articles through online activity (shares on different social media platforms, etc.), or article download figures listed next to the published paper.
None of the metrics described here are an exact measure of the quality of the journal or published research. You will have to decide which of these metrics (if any) matter most to your work or your funders and institutions.
Do you need to publish open access (OA)? Some funders mandate it and grant money often has an amount earmarked to cover the article processing charge (APC) required for Gold OA. Some universities have established agreements with publishers whereby their staff get discounts on APCs when publishing in certain journals (or even a quota of manuscripts that can be published for “free” on a yearly basis). If you do not have grant funding, check whether your university or department has got an OA fund that you could tap into.
However, if you are not mandated to publish OA by your funder and/or you do not have the funds to do so, your paper will still reach your target audience if you select the right journal for your paper. Remember, you can share your paper over email.
The length of time a paper takes to be peer reviewed does not correlate to the quality of peer review, but rather reflects the resources a journal has to manage the process (e.g. do they have paid editorial staff or is it managed by full-time academics?).
Journals usually give their average time to a decision on their website, so take note of this if time is a consideration for you.
Some journals also make it clear that they are reviewing for soundness of science rather than novelty and will therefore often have a faster review process (e.g. PLoS ONE).
Ethics can be divided into two groups:
As an author, it helps if you are familiar with what constitutes good practices and what is considered unacceptable. Please see section “Used literature & web resources” for more details on this topic.
Develop a narrative that leads to your main conclusion and develop a backbone around that narrative. The narrative should progress logically, which does not necessarily mean chronologically. Work out approximate word counts for each section to help manage the article structure and keep you on track for word limits.
It is important to set aside enough time to write your manuscript and – importantly – enough time to edit, which may actually take longer than the writing itself.
The article structure will be defined in the author guidelines, but if the journal’s guidelines permit it, there may be scope to use your own subheadings. By breaking down your manuscript into smaller sections, you will be communicating your message in a much more digestible form.
Use subheadings to shape your narrative and write each subheading in statement form (e.g. ecological variables do not predict genome size variation).
The title is the most visible part of your paper and it should thus clearly communicate your key message. Pre-publication, reviewers base their decision on whether to review a paper on the quality of the title and abstract. Post-publication, if you publish in a subscription journal and not OA, the title and abstract are the only freely available parts of your paper; they will turn up in search engines and thus reach the widest audience. A good title will help you get citations and may even be picked up by the press.
Draft a title before you write your manuscript to help focus your paper. The title needs to be informative and interesting to make it stand out to reviewers and subsequently readers. Some key tips for a successful title include:
Write your abstract after you have written your paper, when you are fully aware of the narrative of your paper. After the title, the abstract is the most read part of your paper. Abstracts are freely available and affect how discoverable your article is via search engines.
Given its importance, your abstract should:
Writing with clarity, simplicity and accuracy takes practice and we can all get carried away with what we think is ‘academic writing’ (i.e. long words and jargon) but good science speaks for itself. Write short sentences (ca. 12 words on average).
Every extra word you write is another word for a reviewer to disagree with. Single out the narrative that leads to your main conclusion and write that – it is easy to get sidetracked by lots of interesting avenues distracting from your work, but by including those in your paper, you are inviting more criticism from reviewers.
Write in an active, positive voice (e.g. ‘we found this…’ ‘we did this…’) and be direct so that your message is clear. Ambiguous writing is another invitation for reviewers to disagree with you.
In your introduction, state that your research is timely, important, and why. Begin each section with that section’s key message and end each section with that message again plus further implications. This will place your work in the broader context that high-quality journals like.
Draft and redraft your work to ensure it flows well and your message is clear and focused throughout. Throughout this process, keep the reader in mind at all times to maintain a critical eye on your research and its presentation.
Keywords are used by readers to discover your paper. You will increase chances of your paper being discovered through search engines by using them strategically throughout your paper – this is search engine optimization (SEO).
Think of the words you would search for to bring up your paper in a Google search. Try it and see what comes up – are there papers that cover similar research to your own?
Build up a list of 15–20 terms relevant to your paper and divide them into two groups:
Place your core keywords in the title, abstract and subheadings, and the secondary keywords throughout the text and in figures and tables. Repeat keywords in the abstract and text naturally.
Reference all sources and do it as you go along (e.g. copy the BibTeX citation into a reference file; see chapter 1 part B), then tidy them once the paper is complete.
Make sure that most of your references are recent to demonstrate both that you have a good understanding of current literature, and that your research is relevant to current topics.
Figures and tables enhance your paper by communicating results or data concisely (more on this topic in chapters 9 and 10).
Use figures and tables to maintain the flow of your narrative – e.g. instead of trying to describe patterns in your results, create a figure and say ‘see Fig. 1’. Not only does this keep your word count down, but a well-designed figure can replace 1000 words!
Figures are useful for communicating overall trends and shapes, allowing simple comparisons between fewer elements. Tables should be used to display precise data values that require comparisons between many different elements.
Figure captions and table titles should explain what is presented and highlight the key message of this part of your narrative – the figure/table and its caption/title should be understandable in isolation from the rest of your manuscript.
Check the journal’s author guidelines for details on table and figure formatting, appropriate file types, number of tables and figures allowed and any other specifications that may apply. Material presented in chapter 1 part C can help you produce figures meeting journal expectations.
Once you have finished writing your manuscript, put it on ice for a week so you come back to it with fresh eyes. Take your time to read it through. Editing can take more time than you expect, but this is your opportunity to fine-tune and submit the best paper possible. Don’t hesitate to seek support from your thesis committee to speed up and streamline this process.
Key things to look out for when editing include:
You are now ready to submit your paper to your chosen journal. Each journal will have a different submission procedure that you will have to adhere to, and most manage their submissions through online submission systems (e.g. ScholarOne Manuscripts).
Only submit your paper for consideration to one journal at a time; otherwise you will be breaching publishing ethics.
A great cover letter can set the stage towards convincing editors to send your paper for review. Write a concise and engaging letter addressed to the editor-in-chief, who may not be an expert in your field or sub-field.
The following points should be covered in your cover letter:
Very rarely is a paper immediately accepted – almost all papers go through a few rounds of review before they get published.
If a decision comes back asking for revisions you should reply to all comments politely. Here are some tips on handling reviewer comments and revising your paper:
Reviewers are volunteers, but the service they provide is invaluable – by undergoing peer review, regardless of the outcome, you are receiving some of the best advice from leading experts for free. With this in mind, any feedback you get will be constructive in the end and will lead you on the way to a successful publishing portfolio.
Keep in mind that feedback is another person’s opinion on what you have done, not on who you are, and it is up to you to decide what to do with it.
If your paper is rejected look at the reviewer’s comments and use their feedback to improve your paper before resubmitting it.
If you are unhappy with a reject decision, 99.9% of the time you should simply move on. However, don’t be afraid of appealing if you have well-founded concerns or think that the reviewers have done a bad job. There are instances where journals grant your appeal and allow you to revise your paper, but in the large majority of cases, the decision to reject will be upheld.
Congratulations! By now you should have an acceptance email from the editor-in-chief in your inbox. The process from here will vary according to each journal, but the post-acceptance workflow is usually as follows:
It might be then time to coordinate the publication of a press release or post the link of your article on social media to share your joy!
The presentation slides for Chapter 6 - part B can be downloaded here.
This chapter is mostly based on the following books and web resources:
Bookdown ebook on writing scientific papers: https://bookdown.org/yihui/rmarkdown/journals.html
A blog on “Writing papers in R Markdown” by Francisco Rodriguez-Sanchez: https://sites.google.com/site/rodriguezsanchezf/news/writingpapersinrmarkdown
rmdTemplates R package GitHub repository: https://github.com/Pakillo/rmdTemplates
rticles R package GitHub repository: https://github.com/rstudio/rticles
To apply the approach described below make sure that you have a Tex distribution installed on your computer. More information on this topic is available here. You will also need to install the R rticles package as demonstrated here.
Traditionally, journals accept manuscripts submitted in either Word (.doc) or LaTeX (.tex) formats. In addition, most journals request figures to be submitted as separate files (in e.g. .tiff or .eps formats). Online submission platforms collate your different files to produce a .pdf document, which is shared with reviewers for evaluation. In this context, although the .Rmd format is growing in popularity (due to its ability to “mesh” data analyses with data communication), this format is currently not accepted by journals. In this document, we discuss ways that have been developed to circumvent this issue and allow using the approach implemented in R Markdown for journal submissions.
As mentioned above, many journals support the LaTeX format (.tex) for manuscript submissions. While you can convert R Markdown (.Rmd) to LaTeX, different journals have different typesetting requirements and LaTeX styles. The solution is to develop scripts converting R Markdown files into LaTeX files that meet journal requirements.
Submitting scientific manuscripts written in R Markdown is still challenging; however the R rticles package was designed to simplify the creation of documents that conform to submission standards for academic journals (see Allaire et al., 2022). The package provides a suite of custom R Markdown LaTeX formats and templates for the following journals/publishers that are relevant to the EEB program:
An understanding of LaTeX is recommended, but not essential in order to use this package. R Markdown templates may sometimes inevitably contain LaTeX code, but usually we can use the simpler R Markdown and knitr functions to produce elements like figures, tables, and math equations.
install.packages("rticles")
Alternatively, in RStudio go to Tools -> Install Packages... and then type "rticles" in the prompted window to install the package.
remotes::install_github("rstudio/rticles")
1. In RStudio, go to File -> New File -> R Markdown...
2. In the New R Markdown window, click on From Template in the left panel and select the journal style that you would like to follow for your article (here PNAS Journal Article; see Figure 7.1). Before pushing the OK button, provide a name for your project and set a location where the project will be saved (see Figure 7.1).
3. This action creates a set of files in your project folder:
- Submission_PNAS.Rmd: R Markdown file that will be used to write your article.
- pnas-sample.bib: BibTeX file to store your bibliography.
- pnas.csl and pnas-new.cls: files containing information about the formatting of citations and bibliography adapted to journal policies.
- frog.png: a .png file used to show you how to include figures in the .Rmd document.
4. Open Submission_PNAS.Rmd and update the YAML metadata section with information on authors, your abstract, summary and keywords (see Figure 7.4).
5. Knit the document to produce the .pdf and .tex files to submit your article (see Figure 7.5). The output files will be saved in your project folder.

To get familiar with this procedure, please practice by applying the above approach to different journal templates, favoring those where you might be submitting.
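If you prefer working from the R console, the same kind of template can also be created and rendered programmatically with the rmarkdown package; in the sketch below, the template name "pnas" is an assumption (check the rticles documentation for the exact names of available templates):
### Create a PNAS-style article from an rticles template
### (the template name "pnas" is an assumption; see the rticles docs)
rmarkdown::draft("Submission_PNAS.Rmd", template = "pnas",
    package = "rticles", edit = FALSE)
# Depending on the template, draft() may place the file in a new
# Submission_PNAS/ folder; knit it to produce the .pdf and .tex outputs
rmarkdown::render("Submission_PNAS/Submission_PNAS.Rmd")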
Enjoy writing scientific publications in R Markdown!
In this chapter, we will study protocols to import and gather data with R. As stated by Gandrud (2015) in chapter 6 of his book, how you gather your data directly impacts how reproducible your research will be. In this context, it is your duty to try your best to document every step of your data gathering process. Reproduction will be easier if all of your data gathering steps are tied together by your source code: independent researchers (and you) can then more easily regather the data. Regathering data will be easiest if running your code allows you to get all the way back to the raw data files (the rawer the better). Of course, this may not always be possible. You may need to conduct interviews or compile information from paper-based archives, for example. The best you can sometimes do is describe your data gathering process in detail. Nonetheless, R’s automated data gathering capabilities for internet-based information are extensive. Learning how to take full advantage of these capabilities greatly increases reproducibility and can save you considerable time and effort over the long run.
Gathering data can be done by either importing locally stored data sets (i.e. files stored on your computer) or by importing data sets from the Internet. Usually, these data sets are saved in plain-text format (most often in comma-separated values or csv format), making importing them into R a fairly straightforward task (using the read.csv() function). However, if data sets are not saved in plain-text format, users will have to start by converting them. In most cases, such data sets will be saved in xls or xlsx formats, and functions implemented in the readxl package (Wickham and Bryan, 2019) would be used (e.g. the read_xlsx() function). If your data sets were created by other statistical programs such as SPSS, SAS or Stata, these could be imported into R using functions from the foreign package (R Core Team, 2020). Finally, data sets could be saved in compressed documents, which will have to be processed prior to importing the data into R.
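As a minimal sketch (the file names are hypothetical), importing locally stored data sets could look as follows:
### Import a plain-text data set saved in csv format
traits <- read.csv("Data/trait_measurements.csv")
### Import an Excel spreadsheet using the readxl package
library(readxl)
field_data <- read_xlsx("Data/field_survey.xlsx", sheet = 1)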
Learning skills to import and gather data sets is especially important in the fields of Ecology, Evolution and Behavior since your research is highly likely to depend on large and complex data sets (see Figure 8.1). In addition, testing your hypotheses will rely on your ability to manage your data sets to test for complex interactions (e.g. do abiotic factors such as temperature drive selection processes in plants?; Figure 8.1).
Here, we are providing methods and R functions that are applied to manage your projects and gather, convert and clean your data. Ultimately these tools will be applied to document and produce the raw data at the basis of your research.
Learning outcomes: after completing this chapter, you will be able to:
- Import csv files deposited on GitHub into R.
- Import csv files associated with specific GitHub commit events.
- Download whole GitHub repositories (in zip format) on your computer.
- Estimate the size of a zip file.
- List and extract target files stored in a zip file without decompressing it.

RStudio projects (.Rproj) allow users to manage their project, more specifically by dividing their work into multiple contexts, each with their own working directory, workspace, history, and source documents.
RStudio projects are associated with R working directories. You can create an RStudio project in a brand new directory, in an existing directory where you already have R code and data, or by cloning a version control (e.g. Git) repository. We will be covering the last option during Chapter 11.
To create a new project in RStudio, go to File > New Project... and a window will pop up allowing you to select among the 3 options (see Figure 8.2).
When a new project is created, RStudio:
- Creates a project file (with an .Rproj extension) within the project directory. This file contains various project options and can also be used as a shortcut for opening the project directly from the filesystem.
- Creates a hidden directory (named .Rproj.user) where project-specific temporary files (e.g. auto-saved source documents, window-state, etc.) are stored. This directory is also automatically added to .Rbuildignore, .gitignore, etc. if required.

To open a project, go to your project directory and double-click on the project file (*.Rproj). When your project opens within RStudio, the following action will be taken: if the project directory contains an .RData file (see below), it will be loaded into your environment, allowing you to pursue your work.

When you are within a project and choose to either Quit, close the project, or open another project, the following action will be taken: .RData and/or .Rhistory files are written to the project directory (if current options indicate they should be).

Additional information on RStudio projects can be found here:
https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
The list of R packages and GitHub repositories used in this chapter is provided below. Please make sure you have all of these set up before reading and completing the material presented in this chapter.
Most of the functions that we will be using in this chapter are base R functions installed (by default) in the utils package (R Core Team, 2019). However, the following R package (and its dependencies) has to be installed on your computer prior to starting this tutorial: repmis (Gandrud, 2016). In the event that you wanted to import xls or xlsx files into R, you would also have to install the readxl package (Wickham and Bryan, 2019).
This repository is dedicated to this course and is used to demonstrate the procedure to import csv files into R from GitHub repositories. More specifically, we will be importing different versions of the file (Timetable_EEB603_topic_tasks.csv) at the origin of the Timetable to study procedures associated with file versioning in GitHub. It is located at this URL (Uniform Resource Locator):
This repository is associated with Barron et al. (2020) and is used to demonstrate the procedure to download whole GitHub repositories. We will be downloading the whole repository on your local computer and then extracting all csv files in the 01_Raw_Data/ folder and saving them on your local computer using R (see Figure 8.3 for more details on file content).
This approach is aimed at demonstrating how you could access raw data deposited on GitHub. The repository is located at this URL:
In this chapter, we will be focusing on learning procedures to import data from the internet, focusing on GitHub. More precisely, we will be learning procedures to:
- Import data sets (csv) stored in GitHub repositories.
- Download whole GitHub repositories in zip format.

The list of major R functions covered in this chapter is provided in the Table below.
Before starting coding, please do the following:
1. Create a directory entitled EEB603_Chapter_06.
2. Open a new R script and save it as 01_Data_gathering.R in this directory.

Before delving into this subject, there are several topics that we need to cover.
With the growing popularity of GitHub, several authors are depositing their data sets on GitHub and you might like to access those for your research. Since Git and GitHub support version control, it is important to report the exact version of the file or data set that you have downloaded. To support this feature, each version of a file/data set is associated with a unique encrypted SHA-1 (Secure Hash Algorithm 1) hash accession number. This means that if the file changes (because the owner of the repository updated it), its SHA-1 hash accession number will change. This feature allows users to refer to the exact file/data set used in their analyses, thereby supporting reproducibility.
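As an aside, you can compute the same type of SHA-1 hash for any local file to document the exact version used in your analyses; the sketch below relies on the digest package, which is an assumption and is not required elsewhere in this chapter:
### Compute the SHA-1 hash of a local file (requires the digest package,
### which is an assumption and not otherwise used in this chapter)
library(digest)
digest("Data/Timetable_EEB603_topic_tasks.csv", algo = "sha1", file = TRUE)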
Before being able to import a csv file deposited on GitHub into R, you have to find its raw URL. In this section, we will demonstrate how to obtain this information by using the Timetable_EEB603_topic_tasks.csv file located on the course’s GitHub repository.
To retrieve the raw URL associated with Timetable_EEB603_topic_tasks.csv do the following:
1. Click on the Raw button on the right just above the file preview (Figure 8.4). This action should open a new window showing you the raw csv file (see Figure 8.5).
2. The raw URL of the csv file can be retrieved by copying the URL address (see Figure 8.5). In this case, the URL is as follows: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/master/Data/Timetable_EEB603_topic_tasks.csv
3. A shortened version of this URL pointing to the raw csv file is https://bit.ly/3BDxECl. We will be using this URL in the example below.

Now that we have retrieved and shortened the raw GitHub URL pointing to our target csv file, we can use the source_data() function implemented in the R package repmis (Gandrud, 2016) to download the file. The object returned by source_data() is a data.frame and can therefore be easily manipulated and saved on your local computer (using e.g. write.csv()). Retrieving a csv data set from a GitHub repository can be done as follows:
### ~~~ Load package ~~~
library(repmis)
### ~~~ Store raw short URL into object ~~~
urlcsv <- "https://bit.ly/3BDxECl"
### ~~~ Download/Import csv into R ~~~
csvTimeTable <- repmis::source_data(url = urlcsv)
## Downloading data from: https://bit.ly/3BDxECl
## SHA-1 hash of the downloaded data file is:
## e1feec6965718f2b9299d24119afa7aac310425c
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTable)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTable)
## Topic
## 1 Syllabus
## 2 Intro. Chapt. 1 & Chap. 2: The reproducibility crisis
## 3 Chap. 1 - part A: Learning the basics
## 4 Chap. 1 - Complete part A & overview of part B: Tables, Figures and References
## 5 Chap. 1 - part B: Tables, Figures and References
## 6 Chap. 1 - part B: Tables, Figures and References
## 7 Chap. 1 - Complete part B and part C: Advanced R Markdown settings
## 8 Chap. 1 - part D: User Defined Functions in R
## 9 Chap. 1 - part D: User Defined Functions in R
## 10 Q&A and work on bioinformatic tutorials
## 11 Chap. 3: A roadmap to implement reproducible science in Ecology, Evolution & Behavior
## 12 Chap. 4: Open science and CARE principles
## 13 Chap. 5: Data management
## 14 Complete Chap. 5: Data management and Chap. 5: Reproducible code
## 15 Chap. 6: Getting published
## 16 Chap. 6: Writing papers in R Markdown ([rticles](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#rticlespkg)) and Work on bioinformatic tutorials
## 17 Bioinfo. tutorial - Chap. 7
## 18 Bioinfo. tutorial - Chap. 7
## 19 Bioinfo. tutorial - Chap. 8
## 20 Bioinfo. tutorial - Chap. 8
## 21 Bioinfo. tutorial - Chap. 9
## 22 Bioinfo. tutorial - Chap. 9
## 23 Bioinfo. tutorial - Chap. 10
## 24 Bioinfo. tutorial - Chap. 10
## 25 Bioinfo. tutorial - Chap. 11
## 26 Bioinfo. tutorial - Chap. 11
## 27 Bioinfo. tutorial - Chap. 12
## 28 Bioinfo. tutorial - Chap. 12
## 29 Individual projects - Q&A
## 30 Individual projects - Q&A
## 31 Oral presentations of individual projects
## 32 Oral presentations of individual projects
## Homework
## 1
## 2 Due to internet outage on campus, we had to change schedule. Read Baker (2016) and prepare for discussing outcome of study
## 3 [Install software](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#213_Install_R_Markdown_software)
## 4 Read Chapt. 1 - part B (Set your R Markdown environment)
## 5 Work on bioinfo. tutorials
## 6 Work on bioinfo. tutorials
## 7 **1.** Complete material all the way to the end of section 2.3.6; **2.** Read sections 2.3.7 and 2.3.8 and make sure that the bibliography file ""Bibliography_Reproducible_Science_2.bib” is copied in your working directory; **3.** Read section 2.3.9 and make sure that the csl file ""AmJBot.csl” is copied in your working directory; **4.** Read section 2.4 (Chapter 1 - part C) and get accustomed with the concepts presented in this part. I will be giving a presentation on these concepts and you will be implementing the code.
## 8 **1.** Complete section 2.4.6 of part C and **2.** Read part D until the end of section 2.5.8.
## 9 **1.** Complete exercise in section 2.4.7 of part C and **2.** Read part D until the end of section 2.5.8.
## 10 **1.** Sign up for bioinformatics tutorial (see Syllabus sections 5.1, 6 and 11 ) and **2.** Work with your group to organize your bioinformatics tutorial
## 11 Read material presented in Chapter 3
## 12 Read material presented in Chapter 4
## 13 Read material presented in Chapter 5 - Part A
## 14 Read material presented in Chapter 5 - Part B + Download folder “Project_ID/“ on Google Drive under this path: Reproducible_Science > Exercises > Chapter_5
## 15 Turn in tutorial for Chap. 7 & start individual reports
## 16 Upload tutorial of Chap. 7 on Google Drive
## 17 Turn in tutorial for Chap. 8
## 18 Upload tutorial of Chap. 8 on Google Drive
## 19 Turn in tutorial for Chap. 9
## 20 Upload tutorial of Chap. 9 on Google Drive
## 21 Turn in tutorial for Chap. 10
## 22 Upload tutorial of Chap. 10 on Google Drive
## 23 Turn in tutorial for Chap. 11
## 24 Upload tutorial of Chap. 11 on Google Drive
## 25 Turn in tutorial for Chap. 12
## 26 Upload tutorial of Chap. 12 on Google Drive
## 27 Students work on ind. projects: Review literature (see Syllabus)
## 28 Students work on ind. projects: Data management workflow
## 29 Students work on ind. projects: Data management workflow
## 30 Turn in ind. projects (on [Google Drive](https://drive.google.com/drive/folders/1MZt5kNKusCv6OeZpjuuPxUiqBaoQVQAc?usp=sharing))
## 31
## 32
## URL
## 1 [Syllabus](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html)
## 2 [Chapter 1](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#2_Chapter_1) & [Chapter 2](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#3_Chapter_2)
## 3 [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA)
## 4 [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA) & [Part B: Set your R Markdown environment](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#233_Set_your_R_Markdown_environment)
## 5 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 6 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 7 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References) & [Part C](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#14_PART_C:_Advanced_R_and_R_Markdown_settings)
## 8 [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 9 [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 10 [Publications and Resources for bioinformatic tutorials](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html#7_Publications__Textbooks)
## 11 [Chapter 3](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#4_Chapter_3)
## 12 [Chapter 4](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#5_Chapter_4)
## 13 [Chapter 5 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partDM)
## 14 [Chapter 5 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#repcode)
## 15 [Chapter 6 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#getpub)
## 16 [Chapter 6 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#writpub)
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
source_data() will always download the most recent version of the file from the master branch and return its unique SHA-1 hash. However, you could also download a prior version of the file by pointing to the raw URL associated with a previous commit.
Retrieving the csv file associated with a specific commit can be done by applying the following approach (here using the same example as above):
1. To find the commit raw URL, navigate to its location on GitHub by clicking here.
2. Click on the commit of interest (here Commits on Aug 23, 2021; Figure 8.6). This will take you to the version of the file at this point in history.
3. Clicking on the Raw button will load the csv file and allow you to retrieve the URL (as done above).
### ~~~ Store raw URL into object ~~~
urlcsvold <- "https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv"
### ~~~ Download/Import csv into R ~~~
csvTimeTableOld <- repmis::source_data(url = urlcsvold)
## Downloading data from: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv
## SHA-1 hash of the downloaded data file is:
## e7071b1e85ada38d6b2cf2a93bbb43c2b96a331f
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTableOld)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTableOld)
## Topic
## 1 Syllabus
## 2 Example of a bioinformatic tutorial
## 3 Chap. 1 - R Markdown part A
## 4 Chap. 1 - R Markdown part B
## 5 Chap. 1 - R Markdown part C
## 6 Chap. 1 - User-defined functions
## 7 Chap. 1 - Wrap-up
## 8 Chap. 2
## 9 Chap. 3
## 10 Chap. 4: Data management
## 11 Chap. 4: Reproducible code
## 12 Chap. 5: Getting published
## 13 Chap. 5: Writing papers in R Markdown (rticles)
## 14 TBD
## 15 Bioinfo. tutorial - Chap. 6
## 16 Bioinfo. tutorial - Chap. 6
## 17 Bioinfo. tutorial - Chap. 7
## 18 Bioinfo. tutorial - Chap. 7
## 19 Bioinfo. tutorial - Chap. 8
## 20 Bioinfo. tutorial - Chap. 8
## 21 Bioinfo. tutorial - Chap. 9
## 22 Bioinfo. tutorial - Chap. 9
## 23 Bioinfo. tutorial - Chap. 10
## 24 Bioinfo. tutorial - Chap. 10
## 25 Bioinfo. tutorial - Chap. 11
## 26 Bioinfo. tutorial - Chap. 11
## 27 Individual projects - Q&A
## 28 Oral presentations
## 29 Oral presentations
## 30 Oral presentations
## Task Deadline
## 1
## 2
## 3 Work on bioinfo. tutorials
## 4 Work on bioinfo. tutorials
## 5 Work on bioinfo. tutorials
## 6 Work on bioinfo. tutorials
## 7 Work on bioinfo. tutorials
## 8 Read Baker (2016) and prepare for discussing outcome of study
## 9 Work on bioinfo. tutorials
## 10 Work on bioinfo. tutorials
## 11 Work on bioinfo. tutorials
## 12 Work on bioinfo. tutorials
## 13 Turn in tutorial for Chap. 6 & start individual reports
## 14 Upload tutorial of Chap. 6 on Google Drive
## 15 Turn in tutorial for Chap. 7
## 16 Upload tutorial of Chap. 7 on Google Drive
## 17 Turn in tutorial for Chap. 8
## 18 Upload tutorial of Chap. 8 on Google Drive
## 19 Turn in tutorial for Chap. 9
## 20 Upload tutorial of Chap. 9 on Google Drive
## 21 Turn in tutorial for Chap. 10
## 22 Upload tutorial of Chap. 10 on Google Drive
## 23 Turn in tutorial for Chap. 11
## 24 Upload tutorial of Chap. 11 on Google Drive
## 25 Students work on ind. projects: Review literature (see Syllabus)
## 26 Students work on ind. projects: Data management workflow
## 27 Students work on ind. project
## 28 Work on reports/presentations
## 29 Turn in ind. reports
## 30
Before being able to download a GitHub repository and work with its files into R, you have to find the URL pointing to the compressed zip file containing all the files for the target repository. In this section, we will demonstrate how to obtain this information by using the Sagebrush_rooting_in_vitro_prop GitHub repository. As mentioned above, we will be downloading the whole repository and then extracting all the csv files in the 01_Raw_Data/ folder (see Figure 8.3).
To retrieve the URL associated with the compressed zip file containing all files for the repository, navigate to the repository on GitHub, click on the green Code button and copy the link associated with the Download ZIP option.

Now that we have secured the URL pointing to the compressed zip file for the target repository (by copying it), we will use this URL and the base R download.file() function to download the file on our local computer. Since compressed files can be large, we also provide some code to check whether the file already exists on your computer before downloading.
### ~~~ Store URL in object ~~~ Paste the URL that you
### copied in the previous section here
URLrepo <- "https://github.com/svenbuerki/Sagebrush_rooting_in_vitro_prop/archive/refs/heads/master.zip"
### ~~~ Download the repository from GitHub ~~~ Arguments:
### - url: URLrepo - destfile: Path and name of destination
### file on your computer YOU HAVE TO ADJUST PATH TO YOUR
### COMPUTER
# First check if the file exists, if yes then return file
# already downloaded else proceed with download
if (file.exists("Data/GitHubRepoSagebrush.zip") == TRUE) {
# File already exists!
print("file already exists and doesn't need to be downloaded!")
} else {
# Download the file
print("Downloading GitHub repository!")
download.file(url = URLrepo, destfile = "Data/GitHubRepoSagebrush.zip")
}
## [1] "file already exists and doesn't need to be downloaded!"
Compressed files can be quite large and you might want to avoid decompressing them entirely, and instead access target files and only decompress those. Here, we practice this approach by using GitHubRepoSagebrush.zip and targeting the csv files in the 01_Raw_Data/ folder.
To estimate the size (in bytes) of a file, you can use the base R function file.size() as follows:
### ~~~ Infer file size of GitHubRepoSagebrush.zip ~~~
# Transform and round file size from bytes to Mb
ZipSize <- round(file.size("Data/GitHubRepoSagebrush.zip")/1e+06,
2)
print(paste("The zip file size is", ZipSize, "Mb", sep = " "))
## [1] "The zip file size is 22.26 Mb"
Finally, we can now i) list all files in the zip file, ii) identify the csv files in 01_Raw_Data/, and iii) save these files on our local computer in a folder entitled 01_Raw_Data/. These files will then constitute the raw data for your subsequent analyses.
### ~~~ List all files in zip file without decompressing it
### ~~~
filesZip <- as.character(unzip("Data/GitHubRepoSagebrush.zip",
list = TRUE)$Name)
### ~~~ Identify files in 01_Raw_Data/ that are csv ~~~ Use
### grepl() to match criteria
targetF <- which(grepl("01_Raw_Data/", filesZip) & grepl(".csv",
filesZip) == TRUE)
# Subset files from filesZip to only get our target files
rawcsvfiles <- filesZip[targetF]
# print list of target files
print(rawcsvfiles)
## [1] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/1_block_8_12_2020 - 1_block.csv"
## [2] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/2_block_8_15_2020 - 2_block.csv"
## [3] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/3_block_8_15_2020 - 3_block.csv"
## [4] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/4_block_8_16_2020 - 4_block.csv"
## [5] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/5_block_8_19_2020 - 5_block.csv"
## [6] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Phenotypes_sagebrush_in_vitro.csv"
## [7] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Survival_height_clones.csv"
### ~~~ Create local directory to save csv files ~~~ Check
### if the folder already exists if not then creates it
output_dir <- file.path(paste0("Data/01_Raw_Data/"))
if (dir.exists(output_dir)) {
print(paste0("Dir", output_dir, " already exists!"))
} else {
print(paste0("Created ", output_dir))
dir.create(output_dir)
}
## [1] "DirData/01_Raw_Data/ already exists!"
### ~~~ Save csv files in output_dir ~~~
### Use a loop to read each csv file in and then save it in output_dir
for (i in seq_along(rawcsvfiles)) {
  ### ~~~ Decompress and read in csv file ~~~
  tempcsv <- read.csv(unz("Data/GitHubRepoSagebrush.zip", rawcsvfiles[i]))
  ### ~~~ Write file out ~~~
  # Extract file name
  csvName <- strsplit(rawcsvfiles[i], split = "01_Raw_Data/")[[1]][2]
  # Write csv file in output_dir (row.names = FALSE avoids adding a
  # column of row numbers to the raw data)
  write.csv(tempcsv, file = paste0(output_dir, csvName), row.names = FALSE)
}
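A quick programmatic check that every target file was written is to compare the number of csv files in output_dir with the number of target files identified in the zip archive:
# TRUE if every target csv file was written to output_dir
length(list.files(output_dir, pattern = "\\.csv$")) == length(rawcsvfiles)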
We can verify that all the files are in the newly created directory on your computer by listing them as follows (compare your results with files shown in Figure 8.3):
# List all the files in output_dir (on your local computer)
list.files(paste0(output_dir))
## [1] "1_block_8_12_2020 - 1_block.csv" "2_block_8_15_2020 - 2_block.csv"
## [3] "3_block_8_15_2020 - 3_block.csv" "4_block_8_16_2020 - 4_block.csv"
## [5] "5_block_8_19_2020 - 5_block.csv" "Phenotypes_sagebrush_in_vitro.csv"
## [7] "Survival_height_clones.csv"
Citations of all R packages used to generate this report are provided below.
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.21. 2023. https://CRAN.R-project.org/package=rmarkdown.
[2] J. Allaire, Y. Xie, C. Dervieux, et al. rticles: Article Formats for R Markdown. R package version 0.23. 2022. https://github.com/rstudio/rticles.
[3] S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R. R package version 2.0.3. 2022. https://CRAN.R-project.org/package=magrittr.
[4] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.
[5] J. Cheng, C. Sievert, B. Schloerke, et al. htmltools: Tools for HTML. R package version 0.5.3. 2022. https://github.com/rstudio/htmltools.
[6] R. Francois. bibtex: Bibtex Parser. R package version 0.4.2.3. 2020. https://github.com/romainfrancois/bibtex.
[7] C. Glur. data.tree: General Purpose Hierarchical Data Structure. R package version 1.0.0. 2020. http://github.com/gluc/data.tree.
[8] R. Iannone. DiagrammeR: Graph/Network Visualization. R package version 1.0.9. 2022. https://github.com/rich-iannone/DiagrammeR.
[9] M. C. Koohafkan. kfigr: Integrated Code Chunk Anchoring and Referencing for R Markdown Documents. R package version 1.2.1. 2021. https://github.com/mkoohafkan/kfigr.
[10] Y. Qiu. prettydoc: Creating Pretty Documents from R Markdown. R package version 0.4.1. 2021. https://github.com/yixuan/prettydoc.
[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022. https://www.R-project.org/.
[12] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.1. 2021. https://CRAN.R-project.org/package=formattable.
[13] H. Wickham, J. Bryan, and M. Barrett. usethis: Automate Package and Project Setup. R package version 2.1.6. 2022. https://CRAN.R-project.org/package=usethis.
[14] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.2. 2023. https://CRAN.R-project.org/package=dplyr.
[15] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.4. 2022. https://CRAN.R-project.org/package=devtools.
[16] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. ISBN 978-1138700109. Boca Raton, Florida: Chapman and Hall/CRC, 2016. https://bookdown.org/yihui/bookdown.
[17] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.33. 2023. https://CRAN.R-project.org/package=bookdown.
[18] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.
[19] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[20] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.42. 2023. https://yihui.org/knitr/.
[21] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.
[22] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.24. 2022. https://github.com/rstudio/DT.
[23] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.
[24] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.3.4. 2021. https://CRAN.R-project.org/package=kableExtra.
Version information about R, the operating system (OS) and attached or loaded R packages. This appendix was generated using sessionInfo().
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] repmis_0.5 rticles_0.23 DiagrammeR_1.0.9
## [4] DT_0.24 data.tree_1.0.0 kfigr_1.2.1
## [7] devtools_2.4.4 usethis_2.1.6 bibtex_0.4.2.3
## [10] knitcitations_1.0.12 htmltools_0.5.3 prettydoc_0.4.1
## [13] magrittr_2.0.3 dplyr_1.1.2 kableExtra_1.3.4
## [16] formattable_0.2.1 bookdown_0.33 rmarkdown_2.21
## [19] knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.5 tidyr_1.3.0 sass_0.4.2 pkgload_1.3.2.1
## [5] jsonlite_1.8.7 viridisLite_0.4.2 R.utils_2.12.0 bslib_0.4.0
## [9] shiny_1.7.2 highr_0.9 yaml_2.3.7 remotes_2.4.2
## [13] sessioninfo_1.2.2 pillar_1.9.0 glue_1.6.2 digest_0.6.33
## [17] RColorBrewer_1.1-3 promises_1.2.0.1 rvest_1.0.3 RefManageR_1.3.0
## [21] colorspace_2.1-0 R.oo_1.25.0 httpuv_1.6.5 plyr_1.8.7
## [25] pkgconfig_2.0.3 purrr_1.0.2 xtable_1.8-4 scales_1.2.1
## [29] webshot_0.5.4 processx_3.8.2 svglite_2.1.0 later_1.3.0
## [33] tibble_3.2.1 generics_0.1.3 ellipsis_0.3.2 withr_2.5.0
## [37] cachem_1.0.6 cli_3.6.1 crayon_1.5.2 mime_0.12
## [41] memoise_2.0.1 evaluate_0.21 ps_1.7.5 R.methodsS3_1.8.2
## [45] fs_1.6.3 fansi_1.0.4 R.cache_0.16.0 xml2_1.3.5
## [49] pkgbuild_1.3.1 data.table_1.14.2 profvis_0.3.7 tools_4.2.0
## [53] prettyunits_1.1.1 formatR_1.12 lifecycle_1.0.3 stringr_1.5.0
## [57] munsell_0.5.0 callr_3.7.3 compiler_4.2.0 jquerylib_0.1.4
## [61] systemfonts_1.0.4 rlang_1.1.1 rstudioapi_0.14 visNetwork_2.1.0
## [65] htmlwidgets_1.5.4 crosstalk_1.2.0 miniUI_0.1.1.1 curl_5.0.0
## [69] R6_2.5.1 lubridate_1.8.0 fastmap_1.1.0 utf8_1.2.3
## [73] stringi_1.7.12 Rcpp_1.0.11 vctrs_0.6.3 tidyselect_1.2.0
## [77] xfun_0.36 urlchecker_1.0.1
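To produce a similar appendix in your own report, add a code chunk at the end of your .Rmd file that calls sessionInfo(); when the document is compiled, knitr will print the version of R, the OS and all packages loaded at that time:
# Place this in a code chunk at the end of your .Rmd to document your environment
sessionInfo()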
What is a Hypothesis? A hypothesis is a tentative, testable answer to a scientific question. Once a scientist has a scientific question, they perform a literature review to find out what is already known on the topic. This information is then used to form a tentative answer to the scientific question. Keep in mind that the hypothesis also has to be testable, since the next step is to do an experiment to determine whether or not the hypothesis is right! A hypothesis leads to one or more predictions that can be tested by experimenting. Predictions often take the shape of "If ____, then ____" statements, but do not have to. Predictions should include both an independent variable (the factor you change in an experiment) and a dependent variable (the factor you observe or measure in an experiment). A single hypothesis can lead to multiple predictions.↩︎
GBIF — the Global Biodiversity Information Facility — is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.↩︎
Postdiction involves explanation after the fact.↩︎