Please see this webpage for more details on the Syllabus.
In this chapter, we engage in group activities and literature reading to explore the concept of reproducible science, including its challenges, benefits, and practical implementation in research. Overall, this chapter aims to provide a broader understanding of reproducible science and the context that underpins the material covered throughout the course.
The chapter is subdivided into two parts:
This list of online resources supports our group activities:
To investigate this question, students engage in self-reflection and group activities designed to explore the meaning of reproducible science and its implementation in research and scientific publications.
Disclaimer: The instructor has provided definitions and other supporting materials (in collapsible boxes) throughout this section. However, students are advised not to read these materials in advance, in order to complete the exercises sincerely and derive the greatest benefit from the experience.
The structure of this class, held over two sessions, is as follows:
The first two sections will be covered during Session 1, while the remaining sections will be addressed in Session 2.
Please ensure that you have identified the reproducibility challenges in your assigned publication (see Case Studies and Challenges) prior to attending the second session.
Students reflect on what “reproducible science” means to them and write down a personal definition.
Standard definitions are introduced to clarify the distinction between reproducibility and replicability.
Students are invited to relate these definitions to their own research experience.
Reproducible science means achieving the same findings using the original data, methods, and code.
Key elements:
Reproduction = Running the same code on the same data to verify results (like software testing for an entire study).
Replication = Repeating the study with intentional variations (e.g., different species, software, or parameters) to test if conclusions hold.
Key insight: Replication is most useful after reproduction is verified.
Each group is given a real-world publication that illustrates challenges to reproducibility, such as:
The instructor will hand out copies of the publications, but they are also available on our Google Drive.
📝 Task Instructions:
Each group:
💡 Note: Students are encouraged to explore their assigned study online and consult the provided resources.
However, please form your own conclusions before researching externally.
The data supporting the retractions of the case studies are presented in Table 2.1 (sourced from Retraction Watch). Digital Object Identifiers (DOI) and PubMed IDs are provided for each case study.
Note that the article assigned to Group 3 has not been formally retracted; however, several experts have raised concerns on PubPeer. These concerns relate to: (i) the use of machine-generated passages, (ii) the inclusion of references that do not appear to exist, and (iii) references to figures in the article that are themselves nonexistent.
GroupID | Title | Journal | ArticleType | RetractionDOI | RetractionPubMedID | OriginalPaperDOI | OriginalPaperPubMedID | Reason |
---|---|---|---|---|---|---|---|---|
1 | Primary Prevention of Cardiovascular Disease with a Mediterranean Diet | NEJM: The New England Journal of Medicine | Research Article; | 10.1056/NEJMc1806491 | 29897867 | 10.1056/NEJMoa1200303 | 23432189 | +Error in Analyses;+Error in Methods;+Error in Results and/or Conclusions;+Retract and Replace; |
2 | Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial | International Journal of Antimicrobial Agents | Clinical Study; | 10.1016/j.ijantimicag.2024.107416 | 39730229 | 10.1016/j.ijantimicag.2020.105949 | 32205204 | +Concerns/Issues About Authorship/Affiliation;+Concerns/Issues About Data;+Concerns/Issues about Results and/or Conclusions;+Concerns/Issues about Article;+Concerns/Issues about Human Subject Welfare;+Date of Article and/or Notice Unknown;+Informed/Patient Consent - None/Withdrawn;+Investigation by Journal/Publisher;+Investigation by Third Party; |
4 | Correlation of Carotid Artery Intima-Media Thickness with Calcium and Phosphorus Metabolism, Parathyroid Hormone, Microinflammatory State, and Cardiovascular Disease | BioMed Research International | Research Article; | 10.1155/2024/9893064 | 38550095 | 10.1155/2022/2786147 | 35313627 | +Computer-Aided Content or Computer-Generated Content;+Concerns/Issues About Data;+Concerns/Issues about Referencing/Attributions;+Concerns/Issues about Results and/or Conclusions;+Concerns/Issues with Peer Review;+Investigation by Journal/Publisher;+Investigation by Third Party;+Paper Mill;+Unreliable Results and/or Conclusions; |
In small groups, students:
Afterward, we will share and compare our recommendations as a class.
The instructor goes over the points in this collapsible box to further discuss implementing reproducible science in your research. Please don’t look at the content of the box before we get here!
In Part B, we are further investigating the causes leading to irreproducible science and discussing ways to mitigate this crisis. For this purpose, we are using results from the survey published by Baker (2016) and recommendations from the report published by the National Academies of Sciences, Engineering, and Medicine (National Academies of Sciences and Medicine, 2019).
We will be investigating the following questions:
The PDFs of Baker (2016) and National Academies of Sciences and Medicine (2019) are available in our shared Google Drive at the following paths:
Reproducible_Science > Publications > Baker_Nature_2016.pdf
Reproducible_Science > Publications > Reproducibility_and_Replicability_in_Science_2019.pdf
The presentation associated with this class is available here:
The most relevant recommendations proposed by the National Academies of Sciences and Medicine (2019) for our course are listed in the table below.
In this chapter, we introduce the use of bioinformatics tools for writing and disseminating reproducible reports, as implemented in RStudio (RStudio Team, 2020). More specifically, we will learn how to link and execute data and code within a unified environment (see Figure 3.1).
This chapter aims to equip students with the essential skills required to effectively use R Markdown for integrating text, code, figures, tables, and references into a cohesive, reproducible document. The final output can be rendered into multiple formats, including PDF, HTML, and Word, to streamline the communication and sharing of research findings.
This tutorial provides students with the foundational knowledge necessary to complete their bioinformatics tutorials (PART 2) and individual projects (PART 3).
The chapter is divided into six parts, each with specific learning objectives:
Figure 3.1: The spectrum of reproducibility.
Files supporting this chapter are available on Google Drive.
Software and packages required to perform this tutorial are detailed below. Students should install this software and these packages on their personal computers to complete this course. Additional packages might need to be installed; the instructor will provide guidance on how to install them as part of the forthcoming tutorials.
The following R packages are required: bookdown, knitr, and rmarkdown. Use the following R command to install those packages:

install.packages(c("bookdown", "knitr", "rmarkdown"))
A TeX program is required to knit documents in pdf format. Please install MiKTeX on Windows, MacTeX on macOS, and TeX Live on Linux. For this class, you are not required to install this software, which takes significant hard drive space and is harder to operate on Windows OS.

NOTE: The instructor is using RStudio Version 2025.05.0+496. If your computer is experiencing issues running the latest version of the software, you can install previous versions here.
RStudio (RStudio Team, 2020) is an integrated development environment (IDE) that allows you to interact with R more efficiently. While similar to the standard R GUI, RStudio is significantly more user-friendly. It offers more drop-down menus, tabbed windows, and extensive customization options (see Figure 3.2).
Detailed information on using RStudio can be found on RStudio’s website.
Figure 3.2: Snapshot of the RStudio environment showing the four windows and their content.
Below are URLs to web resources that provide key information related to Chapter 2:
In this part, we will provide a survey of the procedures to create and render (or knit) your first R Markdown document.
This tutorial provides students with the opportunity to learn how to:
Markdown is a simple formatting syntax used for authoring HTML, PDF, and Microsoft Word documents. It is implemented in the rmarkdown package. An R Markdown document is typically divided into three sections (see Figure 3.3):
Figure 3.3: Example of an R Markdown file showing the three major sections.
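To make the three sections concrete, here is a minimal sketch of what a complete .Rmd file might contain (the title and chunk label below are hypothetical):

````markdown
---
title: "My first R Markdown document"
output: html_document
---

# Introduction

Narrative text written in Markdown goes here.

```{r example-chunk}
# An R code chunk, executed when the document is knitted
summary(cars)
```
````

The block between the --- delimiters is the YAML metadata section, the header and narrative text are Markdown-formatted text, and the fenced block is an R code chunk.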
In this section, you will learn how to create and knit your first R Markdown document.
To create an R Markdown document, follow these steps in RStudio:
1. Select: File -> New File -> R Markdown...
2. Choose HTML as the output format (see Figure 3.4).
3. Enter the title: Chapter 2 - Part A (this is the title of your document, not the name of your file; see below).
4. Note: This tutorial uses the HTML format, which is the default output in RStudio. If you prefer to generate a PDF document, you must have a version of the TeX program installed on your computer (see Figure 3.4).
5. Save your .Rmd file by selecting: File -> Save As...
6. Name your file: Chapter_2_part_A.Rmd
7. Save it in: Reproducible_Science/Chapters/Chapter_2
8. Tip: It is best practice to save your .Rmd file in a dedicated project folder.

Figure 3.4: Snapshot of window to create an R Markdown file.
To knit or render your R Markdown document into an HTML file, follow these steps in RStudio:
1. Click the Knit button (Figure 3.3) in the top toolbar to render the document.
2. The knitted HTML file is saved in the same folder as your .Rmd file (Figure 3.5). You can monitor the progress in the R Markdown console.

💡 Info: If the knitting process fails, error messages will appear in the Render console, often indicating the line in the script where the error occurred (although this is not always the case). These messages are helpful for debugging your R Markdown document.
Figure 3.5: Snapshot of your project folder with the knitted html document.
When you knit your document, R Markdown passes the .Rmd file to the R knitr package, which executes all code chunks and generates a new Markdown (.md) file. This Markdown file includes both the code and its output (see Figure 3.6).
The .md file created by knitr is then processed by the Pandoc program, which converts it into the final output format (e.g., HTML, PDF, or Word), as illustrated in Figure 3.6.
Figure 3.6: R Markdown flow.
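The same knitr-then-Pandoc pipeline can also be triggered from the R console rather than the Knit button. A minimal sketch, assuming you saved the Chapter_2_part_A.Rmd file created earlier in your working directory:

```r
# Runs knitr (code execution) and then Pandoc (format conversion),
# exactly as illustrated in Figure 3.6
rmarkdown::render("Chapter_2_part_A.Rmd")
```

This can be handy when you want to render documents from a script instead of interactively.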
In this section, we will focus on learning the syntax and protocols needed to create:
To master these skills, edit your .Rmd file as we progress, and knit the document to view the final result.
Additional syntax and formatting options are available in the R Markdown Reference Guide, which you can access in RStudio as follows:
Help -> Cheatsheets -> R Markdown Reference Guide
Note: The Cheatsheets section also provides access to other helpful resources related to R Markdown and data manipulation. These documents will be very useful throughout this course.
Before you start this tutorial, please delete the content of your R Markdown file from line 11 (## R Markdown) to the end of the document (see Figure 3.7).
Figure 3.7: R Markdown of Chapter 2 - Part A.
Below is the syntax to create headers at three levels:
Syntax:
The "#" refers to the level of the header
# Header 1
## Header 2
### Header 3
💡 Info
💡 Your turn to practice this syntax in your document by:
Feel free to pick great names for your headers!
Markdown does not have a built-in syntax for adding comments within text, but you can use HTML-style comments instead:
# HTML syntax to comment inside text
<!-- COMMENT -->
This syntax is typically used to highlight areas of the text that need revision, clarification, or further work. Comments inserted this way will not be visible in the knitted document.
You can learn more about this HTML syntax on this webpage.
💡 Your turn to practice this syntax in your document by:
There are two types of lists:
Syntax:
* unordered list
* item 2
+ sub-item 1
+ sub-item 2
Note: For each sub-level, indent by two tabs (four spaces) to create the hierarchy.
Output:
💡 Your turn to practice this syntax in your document by:
Syntax:
1. ordered list
2. item 2
+ sub-item 1
+ sub-item 2
Output:
💡 Your turn to practice this syntax in your document by:
The following syntax renders text in italics, bold, or both italics and bold:
#Syntax for italics
*italics*
#Syntax for bold
**bold**
#Syntax for italic and bold
***italic and bold***
Output:
💡 Your turn to practice this syntax in your document by:
Adding hyperlinks to your documents supports reproducible science, and it can easily be done with the following syntax:
#Syntax to add hyperlink
[text](link)
#Example to provide hyperlink to RStudio
[RStudio](https://www.rstudio.com)
When knitted, the RStudio example renders as a clickable link: RStudio.
💡 Your turn to practice this syntax in your document by:
One of the most exciting features of working with the R Markdown format is the ability to directly embed the output of R code into the compiled document (see Figure 3.6).
In other words, when you compile your .Rmd file, R Markdown will automatically execute each code chunk and inline code expression (see example below), and insert their results into the final document.
If the output is a table or a figure, you can assign a label to it (by adding metadata within the code chunk; see Part B) and refer to it later in your PDF or HTML document. This process, known as cross-referencing, is made possible through the \@ref() function, which is implemented in the R bookdown package.
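For example, assuming a table chunk labeled tabgrades and a figure chunk labeled cars-plot (hypothetical labels for this sketch), cross-references in your text might look like this:

```markdown
The grading scale is shown in Table \@ref(tab:tabgrades),
and the associated data are plotted in Figure \@ref(fig:cars-plot).
```

Note that bookdown prefixes chunk labels with tab: for tables and fig: for figures.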
A code chunk can be easily inserted into your document using one of the following methods:
- Click the Insert button in the editor toolbar.
- Type three backticks followed by {r} (i.e., ```{r}) on their own line to open the chunk, and three backticks (```) on a later line to close it.

By default, the code chunk expects R code, but you can also insert chunks for other programming languages (e.g., Bash, Python).
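For instance, a minimal chunk (with a hypothetical label) that summarizes the built-in cars dataset looks like this:

````markdown
```{r cars-summary}
# Summarize the built-in cars dataset
summary(cars)
```
````

When the document is knitted, both the code and its output appear in the final document (unless chunk options say otherwise; see below).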
💡 Your turn to practice this syntax in your document by:
Chunk output can be customized with knitr options, i.e., arguments set in the {} of a code chunk header. In the examples displayed in Figure 3.8, five arguments are used:
- include = FALSE prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- echo = FALSE prevents code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
- message = FALSE prevents messages that are generated by code from appearing in the finished file.
- warning = FALSE prevents warnings that are generated by code from appearing in the finished file.
- fig.cap = "..." adds a caption to graphical results.

We will delve more into chunk options in Part D of Chapter 2, but in the meantime please see the R Markdown Reference Guide for more details.
Figure 3.8: Example of code chunks.
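As a sketch of how these arguments can be combined, the following chunk (hypothetical label and caption) hides its code, suppresses warnings, and captions its plot:

````markdown
```{r cars-plot, echo = FALSE, warning = FALSE, fig.cap = "Stopping distance as a function of speed (built-in cars dataset)."}
plot(cars)
```
````

In the knitted document, only the captioned figure appears; the plot(cars) call itself is hidden.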
💡 Your turn to practice this syntax in your document by:
Code results can also be inserted directly into the text of a .Rmd file by enclosing the code as follows:
# Syntax for inline code (one backtick on each side)
`r 2+2`

Output: 4
R Markdown will always:

- display the results of inline code, but not the code itself;
- apply relevant text formatting to the results.

As a result, inline output is indistinguishable from the surrounding text.
Warning: Inline expressions do not take knitr options and are therefore less versatile. We usually use inline code to report simple statistics (e.g., `r 4*4` renders as 16).
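For example, a sentence in your .Rmd file can embed a computed value inline (the built-in cars dataset is used only for illustration):

```markdown
The `cars` dataset contains `r nrow(cars)` observations.
```

When knitted, only the computed number appears in the sentence, indistinguishable from the surrounding text.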
💡 Your turn to practice this syntax in your document by:
There are three ways to access spell checking in an R Markdown document in RStudio:
Edit > Check Spelling...
The aim of this tutorial is to develop an R Markdown document that can be used as a template to produce your reproducible reports.
We will focus on learning how to configure your document to manage software dependencies, set global parameters for knitting, and produce appendices with software citations, package version information, and details about the operating system used to generate the reproducible report.
This tutorial provides students with the opportunity to learn how to:
To facilitate the teaching of the learning outcomes, a roadmap of the R Markdown file (*.Rmd) structure is summarized in Figure 3.9.
To support the reproducibility of your research, we structure the R Markdown file as follows (Figure 3.9):

- The YAML metadata section declares the bibliography (*.bib) and citation style language (*.csl) files, both of which must be stored in the working directory.
- An R code chunk writes citations for all loaded R packages to a packages.bib file (stored in the working directory).
file (stored in the working directory).Figure 3.9: Representation of the RMarkdown file structure taught in this class. The following colors represent the three main computing languages found in an RMarkdown document: Black: Markdown (we also sometimes use HTML), Green: YAML, and Blue: R (included in code chunks). See text for more details.
Please refer to section for more details on supporting files and their locations on the shared Google Drive.
Use the material from the Chapter 2 – Part A and Before Starting sections to create and save your R Markdown document for this tutorial.
💡 Guidelines
Reproducible_Science/Chapters/Chapter_2
The YAML metadata section (Figure 3.9) allows users to provide arguments (referred to as fields) that control how an R Markdown document is converted into its final output format. In this class, we will use functions from the knitr (Xie, 2015, 2023b) and bookdown (Xie, 2016, 2023a) packages to populate this section. (Field names, as declared in the YAML metadata section, are provided in parentheses.)
- Title (title)
- Subtitle (subtitle)
- Author (author)
- Date (date)
- Output format(s) (output)
- Link citations (link-citations)
- Font size (fontsize)
- Bibliography file (bibliography)
- Citation style language file (csl)
The YAML code shown below generates either an HTML or PDF document (see the output field), includes a table of contents (see the toc field), and formats in-text citations and the bibliography section according to the journal style specified in the AmJBot.csl file (via the csl field). The bibliography must be stored in a .bib file (in this case, Bibliography_Reproducible_Science_2.bib) placed at the root of your working directory.
---
title: "Your title"
subtitle: "Your subtitle"
author: "Your name"
date: "`r Sys.Date()`"
output:
  bookdown::html_document2:
    toc: TRUE
  bookdown::pdf_document2:
    toc: TRUE
link-citations: yes
fontsize: 12pt
bibliography: Bibliography_Reproducible_Science_2.bib
csl: AmJBot.csl
---
Follow these steps to set up your YAML metadata section (also see Figure 3.9):
Bibliography_Reproducible_Science_2.bib
AmJBot.csl
💡 Info

The .bib and .csl files must be stored in the same working directory as your .Rmd file.
Executing R functions in the YAML metadata section: this can be done using inline R code syntax. For example, use Sys.Date() to automatically add the current date to the output document.
Declaring multiple bibliography files: use the syntax bibliography: [file1.bib, file2.bib].
Disabling a line in the YAML metadata section: add # at the beginning of the line (equivalent to commenting it out).
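For instance, to temporarily disable the PDF output declared in the YAML code above while keeping the HTML output active, you could comment out its lines as follows:

```yaml
output:
  bookdown::html_document2:
    toc: TRUE
#  bookdown::pdf_document2:
#    toc: TRUE
```

Removing the # characters later re-enables the PDF output format.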
Since you have declared two output formats in the YAML metadata section, and both are specific to bookdown functions, you need to select which format you want to use to compile your document. To do this, click the drop-down menu to the left of the Knit button (see Figure 3.10).
To ensure that bookdown functions are applied correctly, make sure to select one of the following options (see Figure 3.10):
Knit to html_document2
Knit to pdf_document2
Figure 3.10: Snapshot of RStudio console showing the drop-down list associated to Knit button.
It is best practice to add an R code chunk directly below the YAML metadata section to automatically install and load the required R packages.
This approach offers two key benefits that support the reproducibility of your research:
💡 Disclaimer

- Required R packages on your computer: "knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools".
- R packages: These packages are required to produce your reproducible reports in RStudio. All of them are available on CRAN, the official repository for R.
- R package repositories: Make sure you have set your R package repositories in RStudio before proceeding with this tutorial. You can do this by following this procedure.
Create an R code chunk entitled packages and set the chunk options as follows (see Figure 3.11):

- echo = FALSE
- warning = FALSE
- include = FALSE
Figure 3.11: This is how your packages R code chunk should look at this stage of the procedure.
###~~~
# Load R packages
###~~~
##~~~
#1. Create a vector w/ the required R packages
##~~~
# --> If you have a new dependency, don't forget to add it in this vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools")
##~~~
#2. Check if pkg are already installed on your computer
##~~~
print("Check if packages are installed")
#This line outputs a list of packages that are not installed
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
##~~~
#3. Install missing packages
##~~~
# Use an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then you need to adjust code/function
if(length(new.pkg) > 0){
print(paste("Install missing package(s):", new.pkg, sep=' '))
install.packages(new.pkg, dependencies = TRUE)
}else{
print("All packages are already installed!")
}
##~~~
#4. Load all required packages
##~~~
print("Load packages and return status")
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
sapply(pkg, require, character.only = TRUE)
I do not know about you, but I often struggle to properly cite R packages in my publications. Fortunately, R provides a built-in function to help with this. If you want to retrieve the citation for an R package, you can use the base R function citation().
For example, the citation for the knitr package can be obtained as follows:
# Generate citation for knitr
# Type this code directly in the Console
citation("knitr")
##
## To cite package 'knitr' in publications use:
##
## Xie Y (2023). _knitr: A General-Purpose Package for Dynamic Report
## Generation in R_. R package version 1.44, <https://yihui.org/knitr/>.
##
## Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition.
## Chapman and Hall/CRC. ISBN 978-1498716963
##
## Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible
## Research in R. In Victoria Stodden, Friedrich Leisch and Roger D.
## Peng, editors, Implementing Reproducible Computational Research.
## Chapman and Hall/CRC. ISBN 978-1466561595
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
If you want to generate those citation entries in BibTeX format, you can pass the object returned by citation() to toBibtex() as follows:
# Generate citation for knitr in BibTex format
# Note that there are no citation identifiers. Those will be
# automatically generated in our next code.
utils::toBibtex(utils::citation("knitr"))
## @Manual{,
## title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
## author = {Yihui Xie},
## year = {2023},
## note = {R package version 1.44},
## url = {https://yihui.org/knitr/},
## }
##
## @Book{,
## title = {Dynamic Documents with {R} and knitr},
## author = {Yihui Xie},
## publisher = {Chapman and Hall/CRC},
## address = {Boca Raton, Florida},
## year = {2015},
## edition = {2nd},
## note = {ISBN 978-1498716963},
## url = {https://yihui.org/knitr/},
## }
##
## @InCollection{,
## booktitle = {Implementing Reproducible Computational Research},
## editor = {Victoria Stodden and Friedrich Leisch and Roger D. Peng},
## title = {knitr: A Comprehensive Tool for Reproducible Research in {R}},
## author = {Yihui Xie},
## publisher = {Chapman and Hall/CRC},
## year = {2014},
## note = {ISBN 978-1466561595},
## }
💡 Info

- citation() and toBibtex() are functions in the utils R package.
- You can call a function from a specific package with the :: operator. In this example, the code is as follows: utils::toBibtex(utils::citation("knitr")).
Here, you will edit the packages code chunk to output a .bib file containing references for all R packages used to generate your document. This file will be included in Appendix 1, presented in the next section.
This can be done by adding the following code at the end of your packages R code chunk:
# Generate BibTex citation file for all loaded R packages
# used to produce this report
# Notice the syntax used here to call the function
knitr::write_bib(.packages(), file = "packages.bib")
The .packages() function invisibly returns the names of all packages loaded in the current R session (wrap the call in print() to see the output). This ensures that all packages used in your code will have their citation entries written to the .bib file.
Finally, to be able to cite these references (see Citation identifier) in your text, you need to edit the YAML metadata section. See Appendix 1 for a full list of references associated with the R packages used to generate this report.
💡 Your turn to practice this syntax in your document by:

- Adding the code to generate packages.bib in your R code chunk entitled packages.
- Knitting your document and checking that it generates packages.bib.
Although a References section will be provided at the end of your document to cite in-text references (see References and Figure 3.9), it is useful to create a customized appendix listing citations for all R packages used to conduct the research, as shown in Appendix 1.
Here, we will learn the procedure to assemble such an appendix.
Force the position of the References section by inserting the HTML tag <div id="refs"></div> as shown below. This ensures that the appendices (or any other material) appear after the References section.

# References

<div id="refs"></div>
# (APPENDIX) Appendices {-}
# Appendix 1
Citations of all R packages used to generate this report.
Add an R code chunk under the # Appendix 1 header to read and print the citations saved in packages.bib. Use the following code:

### Load R package
library("knitcitations")
### Process and print citations in packages.bib
### Clear any bibliography that could be in the cache
cleanbib()
# Set pandoc as the default output option for bib
options(citation_format = "pandoc")
# Read and print bib from file
read.bibtex(file = "packages.bib")
Use the following chunk options for this code chunk:

{r generateBibliography, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
In addition to citing R packages, you may also want to provide detailed information about the R package versions and operating system used (see Figure 3.9).
In R, the simplest, yet highly useful and important, way to document your environment is to include the output of sessionInfo() (or devtools::session_info()). Among other details, this output shows all loaded packages and their versions from the R session used to run your analysis.
Providing this information allows others to reproduce your work more reliably, as they will know exactly which packages, versions, and operating system were used to execute the code.
For example, here is the output of sessionInfo() showing the R version and packages I used to create this document:
# Collect Information About the Current R Session
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rticles_0.27 DiagrammeR_1.0.11 DT_0.34.0
## [4] data.tree_1.2.0 kfigr_1.2.1 devtools_2.4.5
## [7] usethis_3.2.1 bibtex_0.5.1 knitcitations_1.0.12
## [10] htmltools_0.5.7 prettydoc_0.4.1 magrittr_2.0.3
## [13] dplyr_1.1.4 kableExtra_1.4.0 formattable_0.2.1
## [16] bookdown_0.36 rmarkdown_2.29 knitr_1.44
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.11 svglite_2.1.3 lubridate_1.9.3 visNetwork_2.1.4
## [5] digest_0.6.33 mime_0.12 R6_2.6.1 plyr_1.8.9
## [9] backports_1.4.1 evaluate_1.0.5 highr_0.11 httr_1.4.7
## [13] pillar_1.11.1 rlang_1.1.2 rstudioapi_0.17.1 miniUI_0.1.2
## [17] jquerylib_0.1.4 urlchecker_1.0.1 RefManageR_1.4.0 stringr_1.5.2
## [21] htmlwidgets_1.6.4 shiny_1.7.5.1 compiler_4.2.0 httpuv_1.6.13
## [25] xfun_0.41 pkgconfig_2.0.3 systemfonts_1.0.5 pkgbuild_1.4.8
## [29] tidyselect_1.2.0 tibble_3.2.1 viridisLite_0.4.2 later_1.3.2
## [33] jsonlite_1.8.8 xtable_1.8-4 lifecycle_1.0.4 formatR_1.14
## [37] scales_1.4.0 cli_3.6.2 stringi_1.8.3 cachem_1.0.8
## [41] farver_2.1.1 fs_1.6.3 promises_1.2.1 remotes_2.5.0
## [45] xml2_1.3.6 bslib_0.5.1 ellipsis_0.3.2 generics_0.1.4
## [49] vctrs_0.6.5 RColorBrewer_1.1-3 tools_4.2.0 glue_1.6.2
## [53] purrr_1.0.2 crosstalk_1.2.2 pkgload_1.4.1 fastmap_1.1.1
## [57] yaml_2.3.8 timechange_0.2.0 sessioninfo_1.2.3 memoise_2.0.1
## [61] profvis_0.3.8 sass_0.4.8
I have also used the approach described above to add this information in Appendix 2. This can be done as follows:
# Appendix 2
Version information about R, the operating system (OS), and attached or loaded R packages. This appendix was generated using `sessionInfo()`.
{r sessionInfo, eval = TRUE, echo = FALSE, warning = FALSE, message = FALSE}
# Load and provide all packages and versions
sessionInfo()
You have now set up your R Markdown environment and are ready to start populating it! This means you can begin inserting your text and additional code chunks directly below the packages code chunk.
The References section marks the end of the main body of your document. If you wish to add appendices, do so under Appendix 2. Note that appendices will be labeled differently from the main sections of the document.
💡 Info
The aim of this tutorial is to provide students with the expertise to generate reproducible reports using bookdown (Xie, 2016, 2023a) and related R packages (see Appendix 1 for a full list). Unlike the functions implemented in the R rmarkdown package (Xie et al., 2018), which are better suited for generating PDF reproducible reports, bookdown allows the use of one unified set of functions to generate both HTML and PDF documents.
In addition, the same approach and functions are used to process tables and figures, as well as to cross-reference them in the main body of the text. This tutorial will also cover how to cite references in the text, automatically generate a references section, and format citations according to journal styles.
This tutorial provides students with the opportunity to learn how to:
Follow the approach described in the box below to create your R Markdown file for Chapter 2 – Part C.
💡 Approach to Create Your R Markdown

- Copy Chapter2_partB.Rmd in the same project folder and rename it Chapter2_partC.Rmd.
- Update the title in the YAML metadata section to: Chapter 2 – Part C
This tutorial introduces key concepts related to table creation in R Markdown, specifically the following:
More details on this topic will be provided in Chapter 9.
In this section, you will learn the R Markdown syntax and R code needed to replicate the grading scale presented in the syllabus (see Table 3.1).
Percentage | Grade |
---|---|
100-98 | A+ |
97.9-93 | A |
92.9-90 | A- |
89.9-88 | B+ |
87.9-83 | B |
82.9-80 | B- |
79.9-78 | C+ |
77.9-73 | C |
72.9-70 | C- |
69.9-68 | D+ |
67.9-60 | D |
59.9-0 | F |
Follow these steps to reproduce Table 3.1 in your R Markdown document:
1. Open Chapter2_partC.Rmd in RStudio.
2. Create a header entitled Tables below the packages code chunk.
3. Insert an R code chunk using the Insert button and add the following code:

### Load package (for testing)
library(dplyr)
### Create a data.frame w/ grading scale
grades <- data.frame(Percentage = c("100-98", "97.9-93", "92.9-90",
"89.9-88", "87.9-83", "82.9-80", "79.9-78", "77.9-73", "72.9-70",
"69.9-68", "67.9-60", "59.9-0"), Grade = c("A+", "A", "A-",
"B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F"))
### Plot table and add caption
knitr::kable(grades, caption = "Grading scale applied in this class.",
format = "html") %>%
kableExtra::kable_styling(c("striped", "scale_down"))
4. Set the chunk option echo = FALSE.
5. Add the label tabgrades in the chunk options line (immediately after {r) to enable cross-referencing.
6. Test your code by clicking the Play button in the top-right corner of the code chunk (see Fig. 3.12).
Figure 3.12: Demonstrates how to test your code within the R Markdown document. Click the play icon in the top-right corner of the code chunk to execute the code and return its output directly in the R Markdown document without knitting.
7. Knit your document by clicking the Knit button.
💡 Is something missing?
Several students have encountered an issue with duplicate table labels in the caption. The instructor did a web search and found a possible solution to help debug the issue:
If you use knitr::kable() in combination with kableExtra functions, you might see a caption like “Table Table X: Your Caption”. To fix this, explicitly set the format argument in knitr::kable() (e.g., format = "html"), or use kableExtra::kbl() directly, which automatically handles the format. This prevents kableExtra from re-interpreting a markdown table as a new table and adding an extra “Table” prefix.
In our case, we could try editing the code as follows (two options):
### Load package (for testing)
library(dplyr)
### Create a data.frame w/ grading scale
grades <- data.frame(Percentage = c("100-98", "97.9-93", "92.9-90",
"89.9-88", "87.9-83", "82.9-80", "79.9-78", "77.9-73", "72.9-70",
"69.9-68", "67.9-60", "59.9-0"), Grade = c("A+", "A", "A-",
"B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F"))
### Plot table and add caption
# Option 1
knitr::kable(grades, caption = "Grading scale applied in this class.",
format = "html") %>%
kableExtra::kable_styling(c("striped", "scale_down"))
# Option 2
kableExtra::kbl(grades, caption = "Grading scale applied in this class.") %>%
kableExtra::kable_styling(c("striped", "scale_down"))
The instructor encourages students to test the options proposed above in their own documents and determine whether they resolve the issue.
This tutorial introduces key concepts related to figure creation in R Markdown, specifically the following:
More details on this topic will be provided in Chapter 10.
In this section, you will learn the R Markdown syntax and R code needed to replicate Figure 3.13.
Figure 3.13: Plot of cars’ speed in relation to distance.
Follow these steps to reproduce Figure 3.13 in your R Markdown document:
1. Open Chapter2_partC.Rmd in RStudio.
2. Add a header titled Figures below the Tables header.
3. Insert a new R code chunk using the Insert button and add the following code:
### Load and summarize the cars dataset
summary(cars)
### Plot data
plot(cars)
4. Set the following chunk options: echo = FALSE, results = "hide", fig.cap = "Plot of cars' speed in relation to distance." and out.width = "100%".
5. Add the label cars in the chunk options line (immediately after {r) to enable cross-referencing.
6. Test your code by clicking the Play button at the top right of the code chunk (as done in Tables).
7. Knit your document by clicking the Knit button.
💡 Challenge
Cross-referencing tables and figures in the main body of your R Markdown document can easily be done by using the \@ref() function implemented in the bookdown package (see Figure 3.9).
The general syntax is as follows:
# Cross-referencing tables in main body of text
Table \@ref(tab:code_chunk_ID)
# Cross-referencing figures in main body of text
Figure \@ref(fig:code_chunk_ID)
💡 More Info on the Syntax
Note that you must manually type Table or Figure in front of @ref(tab:code_chunk_ID) or @ref(fig:code_chunk_ID). This design allows the user to choose how tables and figures are referenced in the text. Journals often have distinct formatting styles, which must be followed during submission.
To cite the table (labelled tabgrades) in the text, follow these steps:
1. Add the following sentence under the Tables header: The grading scale presented in the syllabus is available in Table \@ref(tab:tabgrades).
2. Knit your document by clicking the Knit button.
To cite the figure (labelled cars) in the text, follow these steps:
1. Add the following sentence under the Figures header: The plot of the cars data is available in Figure \@ref(fig:cars).
2. Knit your document by clicking the Knit button.
In this section, we will cover the following topics before delving into the practical implementation of citing references in your R Markdown document:
Pandoc can automatically generate citations in your text and a References section following a specific journal style (see Figure 3.9). To use this feature, you need to declare a bibliography file in the YAML metadata section under the bibliography: field.
In this course, we are working with bibliography files formatted using the BibTeX format. Other formats can also be used; please see this resource for more details.
Most journals allow citations to be downloaded in BibTeX format, but if this feature is not available, you can convert citation formats using online services (e.g., EndNote to BibTeX: https://www.bruot.org/ris2bib/).
BibTeX is a reference management tool commonly used in conjunction with LaTeX (a markup language) to format lists of references in scientific documents. It allows users to maintain a bibliographic database (.bib file) and cite references in a consistent and automated way.
Each entry in a .bib file follows this general structure:
@entrytype{citationIndentifier,
field1 = {value1},
field2 = {value2},
...
}
💡 Info on BibTeX format
@entrytype: Type of reference (e.g., @article, @book, @inproceedings).
Here is an example associated with Baker (2016):
# Example of BibTeX format for Baker (2016) published in Nature
@Article{Baker_2016,
doi = {10.1038/533452a},
url = {https://doi.org/10.1038/533452a},
year = {2016},
month = {may},
publisher = {Springer Nature},
volume = {533},
number = {7604},
pages = {452--454},
author = {Monya Baker},
title = {1,500 scientists lift the lid on reproducibility},
journal = {Nature},
}
The unique citation identifier of a reference (Baker_2016 in the example above) is set by the user in the BibTeX citation file (see the first lines in the examples provided above). This identifier is used to refer to the publication in the R Markdown document and enables citing references and generating the References section.
In this section, we will cover the following topics to implement the tools for citing references in your R Markdown document:
1. Save your BibTeX-formatted references in a bibliography text file, and make sure to add the .bib or .bibtex extension.
2. Place this file in your project folder (alongside your .Rmd file).
3. Declare the bibliography file in the YAML metadata section under the bibliography: field.
References in BibTeX format are available in the following file:
Bibliography_Reproducible_Science_2.bib
💡 Download a Citation in BibTeX Format
On the journal webpage of a publication, click the Cite icon and download a citation in .bibtex format. Then open the .bibtex file in a text editor and inspect its contents.
We will examine the syntax for citing references, either as parenthetical citations or in-text citations.
In this case, citations are placed inside square brackets ([]) (usually at the end of a sentence or clause) and separated by semicolons. Each citation must include a key composed of @ followed by the citation identifier (as stored in the BibTeX file).
Please find below some examples on citation syntax:
#Syntax
Blah blah [see @Baker_2016, pp. 33-35; also @Smith2016, ch. 1].
Blah blah [@Baker_2016; @Smith2016].
Once knitted (using the Knit button), the citation syntax is rendered as:
Blah blah (see Baker, 2016, pp. 33–35; also Smith et al., 2016, ch. 1).
Blah blah (Baker, 2016; Smith et al., 2016).
A minus sign (-) before the @ symbol will suppress the author’s name in the citation. This is useful when the author is already mentioned in the text:
#Syntax
Baker says blah blah [-@Baker_2016].
Once knitted, the citation is rendered as:
Baker says blah blah (2016).
In this case, in-text citations can be rendered with the following syntax:
#Syntax
@Baker_2016 says blah.
@Baker_2016 [p. 1] says blah.
Once knitted, the citation is rendered as:
Baker (2016) says blah.
Baker (2016, p. 1) says blah.
💡 Your turn to practice citing references in your document by:
Upon knitting, a References section will automatically be generated and inserted at the end of your document (see Figure 3.9).
We recommend adding a level-1 “References” header immediately after the final paragraph of the document, as shown below:
last paragraph...
# References
The bibliography will be inserted after this header (please see the References section of this tutorial for more details).
💡 Your Turn!
In this section, we will explore how your bibliography can be automatically formatted to match a specific journal style. This is done by specifying a Citation Style Language (CSL) file in the YAML metadata section using the csl: field. The CSL file contains the formatting rules required to style both in-text citations and the bibliography according to the selected journal or publication.
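For reference, a YAML metadata section declaring both files might look like this (a sketch: the output format shown, bookdown::html_document2, is one common bookdown option and the title is illustrative; the bibliography and csl file names are those used in this course):

```yaml
---
title: "Chapter 2 - Part C"
output: bookdown::html_document2
bibliography: Bibliography_Reproducible_Science_2.bib
csl: AmJBot.csl
---
```

Both files must sit in the same folder as the .Rmd document (or be referenced with a relative path).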
The Citation Style Language (CSL) is an open-source project designed to simplify scholarly publishing by automating the formatting of citations and bibliographies. The project maintains a crowdsourced repository of over 8,000 free CSL citation styles. For more information, visit: https://citationstyles.org
There are two main CSL repositories:
To learn this procedure and syntax, you will be tasked with downloading a .csl file and using it in your document (in place of AmJBot.csl).
Follow these steps to format your citations and bibliography according to a specific citation style (see Figure 3.9 for more details):
1. Select a .csl file from the Zotero Style Repository.
2. Download the .csl file and save it in your project folder (see Figure 3.14).
Figure 3.14: Procedure to save csl file for PNAS on the Zotero Style Repository.
3. Make sure the .csl file is in your project folder (alongside your .Rmd file).
4. Declare the file in the YAML metadata section under the csl: field, replacing the previous .csl file.
5. Knit your document.
6. Inspect the formatted citations and the References header at the end of your .Rmd document.
The aim of this tutorial is to provide an overview of procedures that can be applied to streamline your R Markdown document, supporting both your computing needs and formatting style.
This tutorial provides students with the opportunity to learn how to:
Please refer to the relevant section for more details on supporting files and their locations on the shared Google Drive.
Unlike R scripts, where you must set your working directory or provide the path to your files, the approach implemented in an R Markdown document (.Rmd) automatically sets the working directory to the location of the .Rmd file. This behavior is managed by functions from the knitr package.
💡 More Info
knitr expects all referenced files to be located either in the same directory as the .Rmd file or in a subfolder within that directory.
Before knitting your document, you will be testing your code (see Figure 3.12). To ensure smooth code testing, you need to set your working directory. This can be done in RStudio by clicking (see Figure 3.15):
Session --> Set Working Directory --> To Source File Location
Figure 3.15: Snapshot of RStudio showing procedure to set your working directory to allow testing your code prior to knitting.
To improve code reproducibility and efficiency, and to comply with publication requirements, it is customary to include a code chunk at the beginning of your .Rmd file that sets global options for the entire document (see Figure 3.9). These settings relate to the following elements of your code:
These general settings will be configured using the opts_chunk$set() function provided by the knitr package (Xie, 2023b). The following website contains valuable information on code chunk options:
The opts_chunk$set() Function
The knitr function opts_chunk$set() is used to modify the default global options in an .Rmd document.
Before starting, please note the following important point about the options: avoid using periods (.) in chunk labels and directory names.
We will discuss each part of the settings individually; however, these settings must be combined into a single code chunk in your document named setup (please see below for more details).
This section covers settings related to the text output generated by code chunks.
Below is an example of options that can be applied across code chunks:
# Setup options for text results
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
- echo = TRUE: Include all R source code in the output file.
- warning = TRUE: Preserve warnings (produced by warning()) in the output, as if running R code in a terminal.
- message = TRUE: Preserve messages emitted by message() (similar to warnings).
- include = TRUE: Include all chunk outputs in the final output document.
If you want some text results to use different options, please adjust those in their specific code chunks. This comment applies to all other general settings as well.
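For example, to override a global setting in a single chunk, add the option to that chunk's header; the global defaults still apply everywhere else (the chunk label quietchunk is illustrative, not part of the course materials):

````
```{r quietchunk, warning = FALSE}
# Warnings are suppressed in this chunk only;
# all other chunks keep the global warning = TRUE setting
sqrt(-1)
```
````

Per-chunk options always take precedence over the values set with opts_chunk$set().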
This section covers settings related to code formatting (that is, how code is displayed in the final html or pdf document) generated by code chunks.
Below is an example of options that can be applied across code chunks:
# Setup options for code formatting
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
- tidy = TRUE: Use formatR::tidy_source() to reformat the code. Please see the tidy.opts option below.
- tidy.opts = list(blank = FALSE, width.cutoff = 60): A list of options passed to the function specified by the tidy option. Here, the code is formatted to avoid blank lines and has a width cutoff of 60 characters.
- highlight = TRUE: Highlight the source code.
To compile your .Rmd document faster, especially when you have computationally intensive tasks, you can cache the output of your code chunks. This process saves the results of these chunks, allowing you to reuse the output later without re-running the code.
The knitr package provides options to evaluate cached chunks only when necessary, but this must be set by the user. This procedure creates a unique MD5 digest (a checksum used to detect changes) for each chunk. When the option cache = TRUE (there are other, more granular settings; see below) is set, the chunk will only be re-evaluated when its digest changes, that is, when the chunk's code or options are modified, or when its cache files are removed.
The following code allows you to implement this caching procedure in your document:
# Setup options for code caching
opts_chunk$set(cache = 2, cache.path = "cache/")
In addition to TRUE and FALSE, the cache option also accepts numeric values (cache = 0, 1, 2, 3) for more granular control: 0 is equivalent to FALSE, and 3 is equivalent to TRUE.
- With cache = 1, the results are loaded from the cache, so the code is not re-evaluated; however, everything else is still executed, including output hooks and saving recorded plots to files.
- With cache = 2 (used here), the behavior is similar to 1, but recorded plots will not be re-saved to files if the plot files already exist, saving time when dealing with large plots.
- cache.path = "cache/": Specifies the directory where cache files will be saved. You do not need to create the directory manually; knitr will create it automatically if it does not already exist.
Plots are a major component of your research and form the basis of your figures. You can take advantage of options provided by the knitr package to produce plots that meet publication requirements. This approach can save valuable time during the writing phase, as it eliminates the need to manually adjust figure size and resolution to comply with journal guidelines.
Below is an example of options that can be applied across code chunks:
# Setup options for plots: the first dev is the master for
# the output document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
- fig.path = "Figures_MS/": Specifies the directory where figures generated by the R Markdown document will be saved. As with caching, this folder does not need to exist beforehand; it will be created automatically. Files will be saved based on the code chunk label and assigned figure number.
- dev = c("png", "pdf"): Saves each figure in both png and pdf formats; the first device listed is the one used in the output document.
- dpi = 300: Sets the resolution (dots per inch) for bitmap devices. The resulting image dimensions follow the formula: DPI × inches = pixels. Check the submission guidelines of your target publication to ensure this value meets their requirements.
💡 More Than One Type of Figure
Your project may also include figures that were not generated by the .Rmd document. To avoid confusion between figures generated by the .Rmd file and those imported from outside sources, it is best practice to save them in two separate subfolders.
Some journals have specific requirements for figure dimensions. You can easily set these using the following option:
- fig.dim: (NULL; numeric) If a numeric vector of length 2, it specifies fig.width and fig.height. For example: fig.dim = c(5, 7).
Positioning figures close to their corresponding code chunks is important for clarity and reproducibility. This can be controlled by adding another opts_chunk$set() option in your setup code chunk.
Use the fig.pos argument and set it to "H" to enforce figure placement near the relevant code.
## Locate figures as close as possible to requested
## position (=code)
opts_chunk$set(fig.pos = "H")
💡 Warning
This option may generate an error when the .Rmd file is knitted to a pdf document. If this occurs, comment out the line of code using # and try knitting again.
In this section, we will combine all the global settings discussed above into a code chunk named setup, which should be placed below the YAML metadata section (see Figure 3.9 for more details on its location).
In addition to containing the global settings, it is advisable to include a code section for loading the required R packages (see Chapter 2 - Part B and Figure 3.9).
Below is the code for the setup code chunk based on the options presented above:
### Load packages: add any packages specific to your code
library("knitr")
library("bookdown")
### Chunk options: see http://yihui.name/knitr/options/ ###
### Text output
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
## Code formatting
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
## Code caching
opts_chunk$set(cache = 2, cache.path = "cache/")
## Plot output: the first dev is the master for the output
## document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
## Figure positioning
opts_chunk$set(fig.pos = "H")
Options for the setup R Code Chunk
When inserting the code above into an R code chunk (see Figure 3.9), please set the chunk options as follows:
- setup: Unique ID of the code chunk.
- include = FALSE: The code will be evaluated, and any plot files will be generated, but nothing will be written to the output document.
- cache = FALSE: The code chunk will not be cached (see above for more details).
- message = FALSE: Messages emitted by message() will not be included in the output.
Please follow the procedure outlined above to implement the material presented in this section and become familiar with the tutorial content. This exercise is divided into seven steps as follows:
1. Open your Chapter_2_PartC.Rmd document.
2. Insert a new R code chunk near the packages code chunk and name it setup. This code chunk will be used to define the global settings via the opts_chunk$set() function.
3. Copy the global settings presented above into the setup code chunk.
4. Knit your document by clicking the Knit button.
5. Add a Figures header if your document does not already have one.
6. Insert a code chunk that produces the plot described below.
7. Use the bookdown \@ref() function to cite your figure or plot in the text.
The R code provided below corresponds to step 6 of the procedure and produces the plot shown in Figure 3.16.
Figure 3.16: Plot of y ~ x.
Insert a new R code chunk under the Figures header of your document and set the following options and arguments:
- plotcache: Unique ID of the code chunk.
- fig.cap = "Plot of y ~ x.": Figure caption.
- fig.show = "asis": Figure display.
- out.width = "100%": Figure width on the page.
# Generate a set of observations (n=100) that have a normal
# distribution
x <- rnorm(100)
# Add a small amount of noise to x to generate a new vector
# (y)
y <- jitter(x, 1000)
# Plot y ~ x
plot(x, y)
Test the code by clicking the Run button before knitting your document.
The aim of this tutorial is to provide an introduction to functions; more specifically, we will be studying user-defined functions (UDFs) as implemented in R.
💡 UDFs 101
UDFs are usually stored in .R scripts and are made available to users using the source() function, which reads the file content and loads the UDFs.
By the end of this tutorial, students will be able to:
💡 Disclaimer
In this tutorial, you will develop and save your UDFs in an .R script. You will then learn how to source the .R script into your R Markdown document.
In programming, you use functions to incorporate sets of instructions that you want to use repeatedly or that, because of their complexity, are better self-contained in a subprogram and called when needed.
A function is a piece of code written to carry out a specified task; it may or may not accept arguments or parameters, and it may or may not return one or more values.
There are many terms used to define and describe functions—subroutines, procedures, methods, etc.—but for the purposes of this tutorial, you will ignore these distinctions, which are often semantic and reminiscent of older programming languages (see here for more details on semantics). In our context, these definitions are less important because, in R, we only have functions.
In R, according to the base documentation, you define a function with the following construct:
function(arglist){
body
}
The code between the curly braces is the body of the function.
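To make the construct concrete, here is a minimal example (the function name add_one is illustrative, not part of the course materials):

```r
# A minimal function following the construct above:
# 'x' is the single argument; the last evaluated
# expression (x + 1) is the return value
add_one <- function(x){
  x + 1
}

add_one(4)
## [1] 5
```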
When you use built-in functions, the only things you need to worry about are how to effectively communicate the correct input arguments (arglist) and how to manage the return value(s) (or outputs), if there are any. To learn more about the arguments associated with a specific function, you can access its documentation using the following syntax (entered in the R console):
#General syntax (replace function_name with the name of the function)
?function_name
#Example with read.csv()
?read.csv()
R allows users to define their own functions (UDFs), which are based on the following syntax:
function.name <- function(arguments){
computations on the arguments
some more code
return value(s)
}
💡 In most cases, an R function has the following components:
- Function name: the name assigned to the function, here function.name.
- Arguments (arguments): these are inputs to the function, declared within the parentheses () following the keyword function.
- Function body: the code between the curly braces {}, where the computation is carried out.
- Assignment: you assign function(arguments) to the “variable” function.name, followed by the function body.
This topic will be explored in more detail in Chapter 5, but here we introduce some best practices for writing clear and maintainable code.
A key part of this process involves starting with pseudocode before transitioning to actual R code. Pseudocode provides a high-level, language-agnostic description of the tasks your function needs to perform. It allows you to plan the structure and logic of your function before dealing with R syntax.
Once the pseudocode is defined, the next step is to translate each task into actual R code. This involves identifying existing R functions that can perform the required operations. If suitable functions do not exist, you may need to write your own—potentially supported by additional pseudocode to clarify the logic.
Below, you will find more detailed definitions of the two key concepts introduced here: pseudocode and code writing/implementation.
Pseudocode is an informal, high-level description of the operating logic of a computer program or algorithm. It uses the structural conventions of a programming language (in this case, R) but is intended for human understanding rather than machine execution.
In this stage, you outline the major steps of your function and break them down into associated tasks. You then link these steps to specific R functions, either existing ones or those you will need to develop. This provides the backbone of your code and supports the transition to implementation.
Writing clear, reproducible code has (at least) three main benefits:
When working on your project, it is highly likely that you will develop multiple UDFs tailored to your research. In this case, it is recommended to create a dedicated folder named R_functions within your project directory (see Figure 3.17). Save each UDF as a separate file (e.g., check.install.pkg.R) inside the R_functions folder.
Figure 3.17: Example file structure of a simple analysis project. See Chapter 4 for more details.
The source() Function
Once your project is properly structured (see Figure 3.17), it becomes easy to load specific UDFs stored in separate R scripts using the source() function.
For example, to load the check.install.pkg() function (saved as R_functions/check.install.pkg.R) into the Global Environment, enter the following code in the R console (see Figure 3.18):
source("R_functions/check.install.pkg.R")
Figure 3.18: Snapshot of RStudio showing output of source() function and the UDF made available in the Global Environment. You can also view the code underpinning the UDF.
Below is an example of R code that loads all your UDFs stored in the R_functions folder. This approach is especially useful when you have multiple UDFs associated with your analysis pipeline.
### Load all UDFs stored in R_functions
# 1. Create vector with names of UDF files in R_functions
# (with full path)
files_source <- list.files("R_functions", full.names = TRUE)
# 2. Iterative sourcing of all UDFs
sapply(files_source, source)
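A small precaution you may want to add (the pattern argument is a suggestion, not part of the original course code): restrict list.files() to files whose names end in .R, so that stray files in the folder are not sourced by mistake.

```r
### Safer variant: only source files whose names end in .R
files_source <- list.files("R_functions", pattern = "\\.R$", full.names = TRUE)
sapply(files_source, source)
```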
Once your UDFs are loaded into the Global Environment (see Figure 3.18 for an example), you can call them by typing their names directly into the R console. For example:
# Check if knitr package is installed, if not install it
# and then load it
check.install.pkg(pkg = c("knitr"))
## [1] "Check if packages are installed"
## [1] "Load packages"
## knitr
## TRUE
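The body of check.install.pkg() is not reproduced here; a possible implementation consistent with the output above might look like this (a sketch only; the version distributed with the course may differ):

```r
### Possible implementation of check.install.pkg() (sketch)
check.install.pkg <- function(pkg){
  print("Check if packages are installed")
  # Identify packages that are not yet installed
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg) > 0) install.packages(new.pkg)
  print("Load packages")
  # Load each package and report TRUE/FALSE per package
  sapply(pkg, require, character.only = TRUE)
}
```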
In this section, we aim to develop UDFs that return a single value. But how do we tell a function what to return?
In R, this is done using the return() function.
To learn this skill, we will work on a challenge.
We are tasked with
Developing, implementing, and applying a UDF to calculate the square of a number.
Let’s begin by understanding this challenge from a mathematical perspective. Then, we will develop a pseudocode to solve it computationally. Finally, we will implement our solution in R.
In mathematics, the squared symbol (\(^2\)) is an arithmetic operator that signifies multiplying a number by itself. The “square” of a number is the product of the number and itself. Multiplying a number by itself is called “squaring” the number.
Although this task is straightforward, the pseudocode to implement the function requires the following steps:
1. Take the input number, named base, as part of the function’s argument(s).
2. Compute the square of base by multiplying the number by itself.
3. Save the result in an object named sq.
4. Return the sq object to the user (a single numeric value). To do this, we will use the return() function.
💡 Let’s talk about the class of input/output data
The input (base) must be of class numeric. The output (sq) will also be of class numeric. You can check the class of an R object using the class() function. The class() function is also useful when implementing defensive programming.
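For instance (the values are illustrative):

```r
# Inspect the class of candidate inputs before passing them to a UDF
class(2)        # "numeric"
class("hello")  # "character"

# is.numeric() returns a logical, handy inside if() statements
is.numeric(2)        # TRUE
is.numeric("hello")  # FALSE
```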
In this section, we will implement the pseudocode outlined above in a function called square_number().
This function takes one argument from the user (base, a number) and returns the square of that number (i.e., base * base).
## Create a UDF in R to calculate square number:
# - argument: base (= one number)
# - output: square of base (= one number)
square_number <- function(base){
#Infer square of base and save it into object
sq <- base*base
#Return sq object
return(sq)
}
Follow these steps to implement and load the UDF in R:
1. Create a folder named R_functions inside your Chapter_2 folder.
2. Open a new .R script and save it as square_number.R in Chapter_2/R_functions.
3. Copy the square_number function into your new R script and save it.
4. Load square_number into the Global Environment.
5. Confirm that square_number has been loaded by checking the Environment panel in RStudio (see Figure 3.19).
Figure 3.19: Close-up of the Environment panel in RStudio showing that the UDF is loaded in the Global Environment and can be used.
💡 Use the source() function to load the UDF.
Load the square_number() function by typing the following command in the console:
source("R_functions/square_number.R")
The R language is quite flexible and allows functions to be applied to a single value (e.g., base = 2) or a vector (e.g., base = c(2, 4, 16, 23, 45)). Let’s further explore this concept by completing the following tasks in your R script:
1. Create a new .R script titled Chapter_2_partE.R and save it in Chapter_2/.
2. Edit your .R script as follows to load your UDF using the source() function.
### Source UDFs: load the square_number function
source("R_functions/square_number.R")
### Apply your UDF: square the number 2
square_number(base = 2)
## [1] 4
# Create vector with numbers
bases <- c(2, 4, 16, 23, 45)
# Apply function to vector
square_number(base = bases)
## [1] 4 16 256 529 2025
In this section, we introduce lists in R and learn how to create, name, and access list elements. We will also explore how lists can contain mixed data types and nested structures.
The procedures and syntax covered here will be applied in the next section, where we investigate how to implement user-defined functions (UDFs) that return multiple values.
More specifically, this section provides procedures and syntax for:
💡 Create Your R Script to Practice
Create a new R script titled Chapter_2_partE.R and save it in your project folder (Chapter_2).
Lists are R objects that can contain elements of different types, such as numbers, strings, vectors, or even another list. A list can also include a matrix or a function as an element.
Lists are created using the list() function.
Below is an example of how to create a list containing strings, numbers, vectors, and logical values using the list() function:
# Create a list containing strings, numbers, vectors and a
# logical value
list_data <- list("Red", 51.3, 72, c(21, 32, 11), TRUE)
# Print object
print(list_data)
## [[1]]
## [1] "Red"
##
## [[2]]
## [1] 51.3
##
## [[3]]
## [1] 72
##
## [[4]]
## [1] 21 32 11
##
## [[5]]
## [1] TRUE
List elements can be assigned names using the names() function, allowing them to be accessed by those names, as shown below:
# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
-2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Show the list
print(list_data)
## $`1st_Quarter`
## [1] "Jan" "Feb" "Mar"
##
## $A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## $An_inner_list
## $An_inner_list[[1]]
## [1] "green"
##
## $An_inner_list[[2]]
## [1] 12.3
Elements of a list can be accessed using their index or, in the case of named lists, using their names (combined with the $ symbol).
The following example illustrates how to access list elements using both methods:
# Create a list containing a vector, a matrix and a list
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1,
-2, 8), nrow = 2), list("green", 12.3))
# Give names to the elements in the list
names(list_data) <- c("1st_Quarter", "A_Matrix", "An_inner_list")
# Access the first element of the list
print(list_data[[1]])
## [1] "Jan" "Feb" "Mar"
# Access the third element. As it is also a list, all its
# elements will be printed
print(list_data[[3]])
## [[1]]
## [1] "green"
##
## [[2]]
## [1] 12.3
# Access the list element using the name of the element
# (combined with $ in front)
print(list_data$A_Matrix)
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
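As a shortcut (the element names below are chosen for illustration), list elements can also be named directly when the list is created, rather than assigning names afterwards with names():

```r
# Name list elements directly at creation time
quarter <- list(months = c("Jan", "Feb", "Mar"), n_months = 3)

# Access a named element with $
print(quarter$months)
## [1] "Jan" "Feb" "Mar"
```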
We will now apply what you have learned to return multiple values from a UDF.
In this section, we will focus on developing UDFs that return multiple values.
To learn this skill, we will work on a challenge.
We are tasked with
Developing, implementing, and applying a UDF to calculate both the logarithm and the square of a number.
To accomplish this, we will collect the different outputs into a list, which will then be returned to the user using return().
Students will work in groups of 2–3 to complete the following tasks:
Please find below the solution proposed by the instructor:
##Create a user defined function in R to calculate log and square of a number:
# argument: base (= one number)
# output: log and square of base (= two numbers) returned in a list
my_log_square <- function(base){
## 1. Infer log and square of base
#Natural logarithm of base
log_value <- log(base)
#Square of base
square_value <- base^2
## 2. Return both objects/values in a list
# Name objects in the list to help user determine what value refers to which output
return(list(log_val = log_value, square_val = square_value))
}
# Call the function
my_log_square(base = 2)
## $log_val
## [1] 0.6931472
##
## $square_val
## [1] 4
Defensive programming is a technique used to ensure that code fails with well-defined errors—that is, in situations where you know it should not work. The key idea is to ‘fail fast’ by ensuring the code throws an error as soon as something unexpected occurs. This may require a bit more effort from the programmer, but it makes debugging much easier later on.
To demonstrate how to apply defensive programming in your code, a power function y = x^n is implemented as a UDF and used as an example.
💡 Create Your R Script to Practice

Create an R script titled exp_number.R and save it in Chapter_2/R_functions.
# Define a power function (exp_number): y = x^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
  # Infer exp (y) based on base (x) and power (n): y = base^power
  exp <- base^power
  # Return exp object
  return(exp)
}
# Call function
exp_number(base = 2, power = 5)
## [1] 32
You can apply defensive programming to the exp_number
function defined above. The function requires that both arguments be of class numeric
. If you provide a string (e.g., a word) as input, the function will return an error:
# Example where we don't respect the class associated with
# the argument base
exp_number(base = "hello", power = 5)
## Error in base^power: non-numeric argument to binary operator
To help users identify and resolve potential issues, we can implement a routine that checks the class of the arguments (i.e., the input data provided by the user). If the inputs are not of class numeric
, the function will notify the user of the issue and provide guidance on how to fix it.
This routine can be implemented as follows:
# Define a power function (exp_number): y = x^n
# - Arguments: base (= x) and power (= n)
# - Output: a number (y)
exp_number <- function(base, power){
  # This if statement tests whether base and power are numeric.
  # If one of them is not, it stops and returns a meaningful message.
  # (is.numeric() with the scalar operator || is safer than comparing
  # class() with |, since class() can return more than one value.)
  if(!is.numeric(base) || !is.numeric(power)){
    stop("Both base and power inputs must be numeric")
  }
  # If classes are good then infer exp
  exp <- base^power
  # Return exp object
  return(exp)
}
# Call function
exp_number(base = "hello", power = 5)
## Error in exp_number(base = "hello", power = 5): Both base and power inputs must be numeric
Although in this case debugging the error might not take long, in more complex functions you’re more likely to encounter vague error messages or code that runs for some time before failing. By applying defensive programming and adding checks to your code, you can detect unexpected behavior earlier and receive more meaningful error messages.
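Base R also provides stopifnot(), a shortcut for this kind of input check. The sketch below is a hypothetical variant of exp_number (not the instructor's solution); stopifnot() stops with an automatically generated message naming the first condition that failed:

```r
# Hypothetical variant of exp_number() using stopifnot()
exp_number_v2 <- function(base, power){
  # Each condition must evaluate to TRUE, otherwise execution stops
  stopifnot(is.numeric(base), is.numeric(power))
  base^power
}

exp_number_v2(base = 2, power = 5)
## [1] 32
```

A downside is that the default message (e.g., "is.numeric(base) is not TRUE") is less friendly than a custom stop() message, so stop() remains preferable when you want to guide the user toward a fix.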
As you saw in the previous example, we used a logical operator—in that case, OR
(represented by the |
symbol)—to implement defensive programming in our user-defined function (UDF).
In a nutshell, the most commonly used logical operators in R
are:
- AND (&): Returns TRUE only if both values are TRUE.
- OR (|): Returns TRUE if at least one of the values is TRUE.
- NOT (!): Negates the logical value it is applied to.

You can learn more about logical operators (including practice exercises) on this website.
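These operators can be tried directly in the console. Note that & and | compare vectors element-wise, while the doubled forms && and || operate on single values and are the ones typically used inside if() conditions (a small sketch):

```r
x <- TRUE
y <- FALSE
x & y    # FALSE: both must be TRUE
x | y    # TRUE: at least one is TRUE
!x       # FALSE: negation

# & and | are vectorized (element-wise comparison)
c(TRUE, FALSE) & c(TRUE, TRUE)   # TRUE FALSE

# && and || expect single values, as used in if() statements
if (is.numeric(2) && is.numeric(5)) "both numeric"
```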
The objective of this section is to provide students with tools and ideas for designing their bioinformatic tutorials (for PART 2). Here, students will get an overview of the tools implemented in the R package learnr, which was developed to produce interactive tutorials. Although developed in R, the interactive tutorials are designed to be conducted in web browsers (though they can also be completed entirely within RStudio).
The interactive tutorial presented here is subdivided into five topics:
Exercise code can be executed by pressing the R Code button. The instructor has also provided the solution to each exercise, which can be accessed by pressing the Solution button available in the top banner of the code chunk window.

Finally, the instructor wants to stress that students are not obliged to design their tutorials using learnr. You can also use the R Markdown language/syntax and output tutorials in HTML or PDF formats (more on this subject in Chapter 1).
This document highlights steps to execute the interactive tutorial designed by the instructor.
Open RStudio and install the learnr package from CRAN by typing the following command in the console:
install.packages("learnr")
Files associated with this tutorial are deposited on the Google Drive under this path:
Reproducible_Science -> Bioinformatic_tutorials -> Intro_interactive_tutorial
There are two main files:
- README.html: The documentation to install the R package and run the interactive tutorial.
- EEB603_Interactive_tutorial.Rmd: The interactive tutorial written in R Markdown, requiring functions from learnr.

The instructor has made a video explaining the procedure to launch the interactive tutorial (based on option 1; see below), as well as some additional explanations related to the exercise.
Option 1:

1. Download the Intro_interactive_tutorial folder and save it on your local computer.
2. Set your working directory to the folder containing EEB603_Interactive_tutorial.Rmd. In RStudio, go to Session -> Set Working Directory -> Choose Directory.... Alternatively, use the setwd() function to set your working directory (e.g. setwd("~/Documents/Course_Reproducible_Science/Timetable/Intro_interactive_tutorial")).
3. Open EEB603_Interactive_tutorial.Rmd in RStudio and press the Run Document button on the upper side bar to launch the tutorial. It will appear in the Viewer panel (in the bottom right corner). You can open the interactive tutorial in your web browser by clicking on the third icon at the top of the Viewer panel. This procedure is also explained in the YouTube video.
4. You can also open the tutorial in your web browser by pressing the Open in Browser button.

Option 2: Launch the tutorial by typing the following command in the console:

rmarkdown::run("EEB603_Interactive_tutorial.Rmd")
The procedure to develop interactive tutorials using learnr is presented here. To learn more about the syntax, the instructor encourages you to open EEB603_Interactive_tutorial.Rmd
in RStudio and inspect the document. This will allow you to learn the syntax and associated procedures needed to:
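For reference, a learnr exercise is an ordinary R Markdown code chunk with the exercise = TRUE chunk option, and its solution lives in a companion chunk whose label ends in -solution. The chunk label below (square-exercise) is a hypothetical example, not taken from the instructor's tutorial:

````markdown
```{r square-exercise, exercise=TRUE}
# Ask the learner to compute the square of 5

```

```{r square-exercise-solution}
5^2
```
````

When the document is run, learnr renders the first chunk as an editable code box with a Run Code button, and the -solution chunk powers the Solution button.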
The aim of this chapter is to engage in readings and group discussions that equip you with the knowledge and tools needed to implement reproducible science practices in support of your research.
We will begin by reviewing the steps of the scientific process and identifying potential threats to its integrity. Then, we will discuss strategies to mitigate these threats and explore initiatives that promote and reward reproducibility in science.
While studying the material presented in this chapter, keep in mind this quote from Richard Feynman:
The first principle is that you must not fool yourself — and you are the easiest person to fool.
By the end of this chapter, you will be able to:
This chapter is primarily based on the following resources:
The website of the Center for Open Science was also used to design the chapter content. More specifically, see this webpage on study pre-registration.
The presentation associated with this class is available here:
💡 Best browser to view presentation
The scientific process can be subdivided into six phases (Figure 4.1):
Figure 4.1: Overview of the scientific process.
To facilitate understanding of the material in this chapter, the six phases of the scientific process are grouped into three categories that reflect the progress of a study:
Distinguishing these categories is essential for ensuring the reproducibility and transparency of a study, and for avoiding common pitfalls described in the next section of this chapter.
For instance, treating the pre-study category as a distinct step in the scientific process promotes study pre-registration and helps prevent practices such as HARKing, p-hacking, and publication bias (see the Glossary section).
Recognizing the post-study category is equally important, as it encourages both pre- and post-publication review, thereby supporting improved dissemination and transparency of your research.
A hallmark of scientific creativity is the ability to recognize novel and unexpected patterns in data. However, a major challenge for scientists is remaining open to new and important insights while also avoiding the tendency to be misled by perceived structure in randomness.
This challenge is often driven by a combination of cognitive biases, including:
These factors can easily lead us to false conclusions and therefore pose significant threats to the scientific process. Some of these threats—such as HARKing and p-hacking—are illustrated in Figure 4.2, and definitions are provided in the Glossary section.
Figure 4.2: Overview of the scientific process and threats preventing reproducibility of the study (indicated in red). Abbreviations: HARKing: hypothesizing after the results are known; P-hacking: data dredging.
Below are definitions of common threats to the scientific process (see Figure 4.2):
HARKing: Hypothesizing After the Results are Known. This occurs when a post hoc hypothesis—that is, a hypothesis developed based on or informed by the observed results—is presented in a research report as if it were an a priori hypothesis (formulated before the study).
Outcome switching: The practice of changing the study’s outcomes of interest after seeing the results. For example, a researcher may measure ten possible outcomes but then selectively report only those that show statistically significant results, intentionally or unintentionally. This increases the likelihood of reporting spurious findings by capitalizing on chance, while ignoring negative or non-significant results.
P-hacking: Also known as “data dredging,” this involves conducting multiple statistical tests on data and focusing only on those with significant results, rather than pre-specifying a hypothesis and performing a single appropriate test. This misuse inflates the chance of false positives by finding patterns that appear statistically significant but have no real underlying effect.
Publication bias: Also called the file drawer problem, this refers to the tendency for studies with positive, novel, or significant results to be published more often than studies with negative results or replications. As a result, the published literature overrepresents positive findings and can give a misleading impression of the strength of evidence.
The aim of this section is to discuss solutions for addressing threats to the scientific process by presenting measures that ensure reproducibility and transparency.
Following the framework proposed by Munafo et al. (2017), the measures covered in this section are organized into five categories that promote research reproducibility and transparency. When possible, these categories include specific working themes designed to minimize the threats discussed earlier (see Figure 4.2).
These measures are not intended to be exhaustive, but they provide a broad, practical, and evidence-based set of actions that can be implemented by researchers, institutions, journals, and funders. They also serve as a roadmap for students when designing their thesis projects.
This section outlines key measures that can be implemented during the research process—such as study design, methodology, statistical analysis, and collaboration—to improve scientific rigor, reproducibility, and transparency.
We have organized the content into the following themes:
There is a substantial body of literature on the difficulty of avoiding cognitive biases. An effective strategy to mitigate self-deception and unwanted biases is blinding. In some research contexts, participants and data collectors can be blinded to the experimental conditions to which participants are assigned, as well as to the research hypotheses. Data analysts can also be blinded to key parts of the dataset. For example, during data preparation and cleaning (see Chapters 6 and 7), the identity of experimental conditions or variable labels can be masked so that outputs are not interpretable in terms of the research hypotheses.
Pre-registration of the study design, primary outcome(s), and analysis plan (see the Promoting Study Pre-registration section below) is another highly effective form of blinding, as the data do not yet exist and the outcomes are not yet known.
Research design and statistical analysis are deeply interconnected. Common misconceptions—such as misunderstanding p-values, the limitations of null hypothesis significance testing, the importance of statistical power, the accuracy of effect size estimates, and the likelihood that a statistically significant result will replicate—can all be addressed through improved statistical training. These topics are covered in BIOL603: Advanced Biometry.
Primary biodiversity occurrence data are central to research in Ecology and Evolution. However, these data are no longer collected as they once were. The mass production of observation-based (OB) occurrences is displacing the collection of specimen-based (SB) occurrences. Troudet et al. (2018) analyzed 536 million occurrences from the Global Biodiversity Information Facility (GBIF)2 and found that, between 1970 and 2016, the proportion of occurrences traceable to tangible material (i.e., SB occurrences) dropped from 68% to 18%. Moreover, many of the remaining specimen-based occurrences could not be readily traced back to a physical specimen due to missing information.
This alarming trend—characterized by low traceability and, therefore, low confidence in species identification—threatens the reproducibility of biodiversity research. For instance, low-confidence identifications limit the utility of large databases for deriving insights into species distributions, ecological traits, conservation status, phylogenetic relationships, and more.
Troudet et al. (2018) advocate that SB occurrences must continue to be collected to allow replication of ecological and evolutionary studies and to support rich, diverse research questions. When SB collection is not possible, they suggest OB occurrences should be accompanied by ancillary data (e.g., photos, audio recordings, tissue samples, DNA sequences). Unfortunately, such data are often not shared. A more rigorous approach to data collection—including proper documentation of ancillary evidence—can help ensure that recently collected biodiversity data remain useful and scientifically credible.
Specimens deposited in natural history museums and botanical gardens serve critical roles in biodiversity research. They enable:
These additional data are invaluable during the data collection phase and enhance the rigor of hypothesis testing. While SB occurrences may present logistical challenges (especially in field ecology), we strongly encourage students to collect and document ancillary data to support their observations and ensure the reproducibility of their analyses.
The need for independent methodological oversight is well-established in some fields. For example, many clinical trials employ multidisciplinary steering committees to oversee study design and execution. These committees arose in response to known financial conflicts of interest in clinical research.
Involving independent researchers—particularly methodologists who have no personal or financial stake in the research question—can reduce bias and improve study quality. This can be done either at the level of individual research projects or coordinated through funding agencies.
Studies of statistical power consistently find it to be low—often below 50%—across time and disciplines (see Munafo et al., 2017 and references therein). Low statistical power increases the likelihood of both false positives and false negatives, making it ineffective for building reliable scientific knowledge.
Despite this, low-powered research persists due to poor incentives, limited understanding of statistical power, and a lack of resources. Team science offers a solution: instead of relying on the limited capacity of individual investigators, distributed collaboration across multiple research sites supports high-powered study designs. It also improves generalizability across populations and contexts, and fosters integration of multiple theoretical frameworks and research cultures into a single project.
This section describes measures that can be implemented when communicating research, including (for example) reporting standards, study pre-registration, and the disclosure of conflicts of interest.
We have organized the content into the following themes:
The last theme is addressed in more detail in Chapter 2.
Progress in science relies, in part, on generating hypotheses from existing observations and testing hypotheses with new observations.
The distinction between postdiction3 and prediction is well understood conceptually, but often not respected in practice. Confusing postdictions with predictions reduces the credibility of research findings. Cognitive biases—particularly hindsight bias—make it difficult to avoid this mistake.
An effective solution is to define the research questions and analysis plan before observing the outcomes — a process known as pre-registration. Pre-registration clearly distinguishes between analyses and outcomes that result from a priori hypotheses and those that arise post hoc. Various practical strategies now exist to implement pre-registration effectively, even in cases involving pre-existing data. Services to support pre-registration are now available across disciplines, and their growing adoption is contributing to increased research transparency and credibility.
At its simplest, study pre-registration (see Nosek et al., 2018) may involve the basic registration of study design. More thorough pre-registration includes detailed specifications of procedures, outcomes, and the statistical analysis plan.
Study pre-registration addresses two major problems:
Definitions of these terms are provided in the Glossary section.
Pre-registration enhances the discoverability of research, but discoverability does not guarantee usability.
Poor usability occurs when it is difficult to evaluate what was done, replicate the methods to assess reproducibility, or incorporate findings into systematic reviews and meta-analyses.
Improving the quality and transparency of research reporting is therefore essential. This includes using standardized reporting guidelines, ensuring that methods and data are clearly described, and making materials available for reuse wherever possible.
This section describes measures that can be implemented to support the verification of research, including, for example, sharing data and methods.
Science is a social enterprise: independent and collaborative groups work to accumulate knowledge as a public good.
The credibility of scientific claims is rooted in the evidence supporting them, which includes the methodology applied, the data collected, the process of methodology implementation, data analysis, and outcome interpretation. Claims become credible through community review, criticism, extension, and reproduction of the supporting evidence. However, without transparency, credibility relies solely on trust in the confidence or authority of the originator. Transparency is superior to trust.
Open science refers to the process of making both the content and the process of producing evidence and claims transparent and accessible to others.
Transparency is a scientific ideal, and therefore adding ‘open’ should be redundant. In practice, however, science often lacks openness: many published articles are not available to those without personal or institutional subscriptions, and most data, materials, and code supporting research outcomes are not publicly accessible (though this is rapidly changing thanks to several initiatives, such as the Dryad digital repository).
Much of the research process—for example, study protocols, analysis workflows, and peer review—is historically inaccessible, partly because there were few opportunities to make it accessible, even if desired. This has motivated calls for open access, open data, and open workflows (including analysis pipelines). However, substantial barriers remain, including vested financial interests (particularly in scholarly publishing) and limited incentives for researchers to adopt open practices.
To promote open science, several open-access journals have recently been established (e.g., BMC, Frontiers, PLoS). These journals facilitate the sharing of scientific research—including associated methods, data, and code—but they are often expensive, with fees averaging over $1500 per publication.
Waivers may be available for researchers based in certain countries, and some institutions sponsor these initiatives, allowing a number of papers to be published for “free” each year. However, if you do not qualify for such waivers or institutional support, it can be challenging to cover these costs without grant funding. For example, the NSF is making efforts to promote open science through funding support.
The EEB program may be able to support some of these costs, but support will vary depending on the yearly budget and timing of the request. This topic is explored further in Chapter 4.
This section describes measures that can be implemented when evaluating research, including, for example, peer review.
For most of the history of scientific publishing, two functions have been confounded: evaluation and dissemination. Journals have provided dissemination by sorting and delivering content to the research community, and gatekeeping via peer review to determine what is worth disseminating. However, with the advent of the internet, individual researchers are no longer dependent on publishers to bind, print, and mail their research to subscribers. Dissemination is now easy and can be controlled by researchers themselves (see examples of preprint publishers below).
With the increasing ease of dissemination, the role of publishers as gatekeepers is declining. Nevertheless, the other role of publishing—evaluation—remains a vital part of the research enterprise. Conventionally, a journal editor selects a limited number of reviewers to assess the suitability of a submission for a particular journal. However, more diverse evaluation processes are now emerging, allowing the collective wisdom of the scientific community to be harnessed. For example, some preprint services support public comments on manuscripts, a form of pre-publication review that can be used to improve the manuscript (see below). Other platforms, such as PubMed Commons and PubPeer, offer public forums to comment on published works, facilitating post-publication peer review. At the same time, some journals are trialing ‘results-free’ review, where editorial decisions to accept are based solely on the rationale and study methods (i.e., results-blind; for instance, PLoS ONE applies this approach).
Both pre- and post-publication peer review mechanisms dramatically accelerate and broaden the evaluation process. By sharing preprints, researchers can obtain rapid feedback on their work from a diverse community, rather than waiting several months for a few reviews in the conventional, closed peer review process. Using post-publication services, reviewers can make positive and critical commentary on articles instantly, rather than relying on the laborious, uncertain, and lengthy process of authoring a commentary and submitting it to a publishing journal for possible publication.
bioRxiv (pronounced “bio-archive”) is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors can make their findings immediately available to the scientific community and receive feedback on draft manuscripts before submitting them to journals.
Articles on bioRxiv are not peer-reviewed, edited, or typeset before being posted online. However, all articles undergo a basic screening process to check for offensive and/or non-scientific content, potential health or biosecurity risks, and plagiarism. No endorsement of an article’s methods, assumptions, conclusions, or scientific quality by Cold Spring Harbor Laboratory is implied by its appearance on bioRxiv. An article may be posted prior to, or concurrently with, submission to a journal but should not be posted if it has already been accepted for publication.
PeerJ Preprints is a preprint server for the Biological Sciences, Environmental Sciences, Medical Sciences, Health Sciences, and Computer Sciences. A PeerJ Preprint is a draft of an article, abstract, or poster that has not yet been peer-reviewed for formal publication. Authors can submit a draft, incomplete, or final version of their work for free.
Submissions to PeerJ Preprints are not formally peer-reviewed. Instead, they are screened by PeerJ staff to ensure that they fit the subject area, do not contravene any policies, and can reasonably be considered part of the academic literature. Submissions deemed unsuitable in any of these respects will not be accepted for posting. Content considered non-scientific or pseudo-scientific will not pass screening.
Publication is the currency of academic science, influencing employment, funding, promotion, and tenure. However, not all research is equally publishable. Positive, novel, and clean results are more likely to be published than negative findings, replications, or results with loose ends. Consequently, researchers are incentivized to produce the former, sometimes at the expense of accuracy (see Nosek et al., 2012). These incentives ultimately increase the likelihood of false positives in the published literature. Changing these incentives offers an opportunity to enhance the credibility and reproducibility of published results.
Funders, publishers, societies, institutions, editors, reviewers, and authors all contribute to the cultural norms that create and sustain dysfunctional incentives. Therefore, shifting incentives requires a coordinated effort by all stakeholders to alter reward structures. While there will always be incentives for innovative outcomes—rewarding those who discover new things—there can also be incentives for efficiency and rigor. Researchers who conduct transparent, rigorous, and reproducible research should be rewarded more than those who do not.
Promising examples of effective interventions include journals adopting:
Collectively, and at scale, these efforts can shift incentives so that what benefits the scientist also benefits science—encouraging rigorous, transparent, and reproducible research that produces credible results.
The Transparency and Openness Promotion (TOP) Guidelines (published in Nosek et al., 2015) provide a framework for journals and funders to encourage greater transparency in research planning and reporting.
The guidelines consist of eight modular standards, each addressing a key aspect of research transparency.
Each standard includes three levels of increasing stringency, allowing for a gradual implementation of open science practices (see Figure 4.3).
Journals can choose which standards to adopt and at what level, providing flexibility to account for disciplinary norms.
This structure helps establish community-wide transparency standards while respecting the diversity of research practices across fields.
Students explore how journals implement the Transparency and Openness Promotion (TOP) Guidelines and reflect on how these practices support open science.
The eight modular standards are:
Each standard has three levels of transparency (Level 1, Level 2, and Level 3) reflecting increasing rigor (see Figure 4.4). Journals select levels based on their readiness and disciplinary norms, balancing implementation feasibility with the desire for stronger transparency requirements.
Figure 4.3: Table presenting the TOP standards and their associated levels of transparency.
Over 1,000 journals and organizations had adopted one or more TOP-compliant policies as of August 2018, including Ecology Letters, The Royal Society, and Science. The full list of journals implementing the TOP guidelines can be found here: https://osf.io/2sk9f/
The material in Chapter 1 on R Markdown (as used in RStudio) addresses the need for a unified environment that links publications, code, and data.
The Center for Open Science offers a platform called the Open Science Framework (OSF) to support this goal (see Figure 4.4). OSF is a free, open-source project management and repository platform designed to assist researchers throughout the entire project life cycle.
As a collaboration tool, OSF allows researchers to work privately with selected collaborators or to make parts or all of their projects publicly accessible, with citable and discoverable DOIs for broader dissemination.
As a workflow system, OSF integrates with many existing research tools, enabling researchers to manage their entire projects from one place. This connectivity helps eliminate data silos and information gaps by allowing tools to work together seamlessly, reflecting how researchers actually collaborate and share knowledge (see Figure 4.4).
Figure 4.4: Overview of the OSF workflow and connections with other widely used software.
Today, funding agencies and journals are increasingly promoting Open Science, a topic we began discussing in Chapter 3. In this context:
Balancing the promotion of Open Science to support reproducibility with the values of your stakeholders can be a challenging task.
This chapter aims to explore how Open Science is promoted while recognizing when other principles, like the CARE Principles, must be applied to respect diverse perspectives, values, and data sovereignty.
It also provides solutions to address threats to the scientific process (introduced in Chapters 1 and 3) and guidance on managing the data life cycle and responsibly sharing research.
By the end of this chapter, you will be able to:
This chapter is based on:
Web resources
Publications
The Organisation for Economic Co-operation and Development (OECD) defines Open Science as:
“to make the primary outputs of publicly funded research results – publications and the research data – publicly accessible in digital format with no or minimal restriction.”
While this definition focuses on access to research outputs, several other organizations advocate for a broader view of Open Science. According to the FOSTER Open Science initiative, Open Science involves:
“extending the principles of openness to the whole research cycle, fostering sharing and collaboration as early as possible, thus entailing a systemic change to the way science and research is done.”
In this section, we conduct three group activities to study the material presented in this chapter. We focus on ensuring open science while respecting data governance for our stakeholders. Use the resources provided in this chapter to complete the group assignments below.
Students collaboratively explore how the FAIR Principles apply to real-world research data and reflect on challenges and best practices for making data FAIR.
Students collaboratively explore how the CARE Principles apply to data involving Indigenous Peoples and reflect on ethical, cultural, and governance considerations in data stewardship.
Students critically compare the FAIR and CARE Principles and explore how they can be applied together to support ethical, inclusive, and technically sound data practices.
The 7 pillars of Open Science are:
The FAIR Principles (Wilkinson et al., 2016) guide researchers in ensuring that research outputs are (Figure 5.1):
💡 Why FAIR Matters:
Figure 5.1: The FAIR Principles (source: SangyaPundir, CC BY-SA 4.0).
Wilkinson et al. (2016) outline a set of guidelines that support the implementation of the FAIR Principles. These guidelines ensure that data and metadata are structured in a way that promotes discovery, access, integration, and reuse.
Research integrity refers to the commitment of researchers to act honestly, reliably, respectfully, and to be accountable for their actions throughout the research process.
At Boise State University, the Office of Research Compliance provides resources and guidance to support best practices in research integrity.
The Next Generation Metrics pillar of Open Science aims to shift cultural thinking around how research is evaluated. It encourages moving beyond traditional metrics—such as citation counts and journal impact factors—toward more diverse, meaningful indicators of research quality and influence.
These metrics, drawn from a variety of sources and describing different aspects of scholarly impact, can help us gain a broader and more nuanced understanding of the significance and reach of research.
In this chapter, we demonstrate this topic with two examples:
The San Francisco Declaration on Research Assessment (DORA) exemplifies the growing shift in how research quality is evaluated. Increasingly, institutions and funding bodies are supporting DORA and openly rejecting the use of journal impact factors and other quantitative metrics as primary indicators of research quality.
Among many others, Springer Nature has endorsed DORA. You can view their specific commitments here.
Another example is Altmetric, a tool that tracks and demonstrates the reach and influence of research outputs—particularly among key stakeholders—by focusing primarily on social media and other online platforms (Figure 5.2).
Figure 5.2: Example of new metrics associated with publications to assess the impact of research beyond the impact factor.
The future of scholarly communication is one of the most prominent pillars of Open Scholarship, as it aims to shift the current academic publishing model toward fully Open Access. We will explore this topic further in the section on Open Access and associated licenses below.
Citizen Science refers to the increased involvement of the public in the research process, recognizing the invaluable insights that non-academic contributors can offer. These contributions often provide perspectives or data that researchers might not otherwise access. Examples of such initiatives include eBird and iNaturalist.
Leveraging the internet, openly available tools, and local knowledge, citizen science is transforming how research is conducted. No longer limited to academic researchers, it encourages collaboration across society.
Wagenknecht et al. (2021) discuss this concept through a European Citizen Science project, while Groom et al. (2017) reflect on its role in biodiversity research. Citizen science has the potential to significantly enhance your research—but it’s important to be aware of its challenges. For more on this topic, refer to this section in Chapter 3.
This pillar focuses on identifying researchers’ training needs and addressing gaps in knowledge and skills related to Open Science—such as making publications openly accessible, managing research data in alignment with the FAIR Principles (Figure 5.1), and upholding research integrity.
All researchers, regardless of career stage, should have access to training and professional development opportunities that support Open Science practices. Additionally, skills development should be extended to other stakeholders in the research ecosystem, including librarians, data managers, and the public, particularly in support of citizen science initiatives.
At Boise State University, the Albertsons Library offers resources, workshops, and seminars on these topics.
Promoting engagement with Open Science requires recognizing and rewarding the efforts of those who contribute to it. This pillar addresses systemic barriers while advocating for best practices.
A common barrier is the perceived lack of recognition for tasks such as managing research data or publishing open access. These efforts are often undervalued in traditional academic assessments, discouraging wider adoption of Open Science practices.
Work under this pillar seeks to overcome these challenges and promote greater participation in Open Science. For more on current initiatives and reward systems, refer to Chapter 3.
Open Access (OA) refers to a set of principles and practices that make research outputs freely available online, without access charges or other barriers (Figure 5.3). While OA publications are free for users to read, researchers often pay publication fees—typically between $1,500 and $2,500—to make their work openly accessible.
According to the 2001 definition of Open Access, it also involves removing or reducing barriers to copying and reuse by applying an open license. An open license allows others to reuse a creator’s work freely, in ways that are typically restricted by copyright, patent, or commercial licenses.
Most open licenses are:
These characteristics ensure that research can be reused, adapted, and redistributed in alignment with the goals of Open Science.
Since most Open Access (OA) journals rely on publication fees paid by authors to generate revenue, there is concern that some OA publishers may be motivated to accept lower-quality papers and forgo rigorous peer review in order to maximize profits.
At the same time, publication fees in prestigious OA journals can exceed $5,000, making this publishing model inaccessible for many researchers, particularly those without institutional funding or support.
This situation has been described as the “Open Access sequel to the Serials Crisis” (Khoo, 2019). The original Serials Crisis refers to the unsustainable pressure on library budgets caused by the rapid inflation of journal subscription costs. Open Access was initially proposed as a solution to this problem—by removing subscription barriers, research would remain accessible without ongoing costs. However, as Khoo (2019) notes, both traditional subscription models and Open Access publishing present their own challenges and limitations.
Most Open Access journals use Creative Commons (CC) licenses, so it is important that you and your co-authors understand their implications before submitting your manuscript—especially regarding potential commercial use of your work by third parties (Figure 5.3).
A CC license is a type of public copyright license that enables the free distribution of a copyrighted work. It allows authors to grant others the right to share, use, and build upon their work while retaining some control over how it is used.
Creative Commons licenses provide flexibility; for example, authors can choose to allow only non-commercial uses. They also protect users and redistributors from copyright infringement claims, provided they follow the license terms set by the author.
To help authors select the most appropriate license for their work, the Creative Commons initiative offers an online tool, which I recommend consulting. For more detailed information on CC licenses, see this Wikipedia page.
Figure 5.3: Front page of Ecology and Evolution, an open access journal publishing articles under the Creative Commons licensing system.
The Creative Commons licenses provided by the CC initiative are summarized in Figure 5.4.
Figure 5.4: Overview of Creative Commons licenses sorted from most open to most restrictive (provided by Andrew Child).
The description of these licenses is always presented in this format:
Here is a list of some of the most commonly used CC licenses with URLs leading to their descriptions:
In this section, we investigate both the CARE Principles and Data Sovereignty, emphasizing how these concepts are closely connected.
As noted by Carroll et al. (2021):
“while big data, open data, and Open Science increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these environments remain limited.”
Integrating the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability while prioritizing people and purpose. This combined approach supports Indigenous Peoples’ rights and interests in their data throughout the data lifecycle (Figure 5.5).
Figure 5.5: Be FAIR and CARE (source: https://www.gida-global.org/care).
The CARE Principles emphasize the importance of people and purpose in data governance, particularly for Indigenous communities. CARE stands for:
📝 The CARE Principles complement the FAIR Principles by adding a crucial human- and rights-centered lens to data governance.
Finally, the United Nations Declaration on the Rights of Indigenous Peoples affirms Indigenous Peoples’ rights and interests in data and their authority to control it. Access to data for governance is essential to support self-determination, and Indigenous nations should be actively involved in the governance of data to ensure ethical data use and reuse.
The CARE Principles (Figure 5.5) are directly connected to the concept of data sovereignty. We provide a definition here because it is especially important for this course, as data sovereignty can impact your data sharing protocols. More on this topic will be discussed in Chapter 5.
Data sovereignty is the idea that data are subject to the laws and governance structures of the nation where they are collected. This concept is closely linked with data security, cloud computing, network sovereignty, and technological sovereignty. Unlike technological sovereignty, which is vaguely defined and often used as an umbrella term in policymaking, data sovereignty specifically concerns the governance of data itself.
Data sovereignty is usually discussed in two contexts: Indigenous groups and their autonomy from post-colonial states, and the control of transnational data flows. With the rise of cloud computing, many countries have enacted laws governing data control and storage, reflecting measures of data sovereignty. More than 100 countries now have data sovereignty laws in place.
In the context of self-sovereign identity (SSI), individuals can fully create and control their credentials, although nations may still issue digital identities within this framework.
This chapter is divided into two parts:
This part of Chapter 5 aims to help students navigate data management by first explaining what data and data management are, and why data sharing is important. It then provides advice and examples of best practices in data management.
Good data management is essential for research excellence. It ensures that research data are of high quality, accessible to others, and usable in the future (see the TOP Guidelines in Chapter 3).
The value of data is increasingly recognized through formal citation. Repositories such as GBIF and Dryad provide DOIs that allow datasets to be cited, giving researchers credit for sharing their data. By making data available for reuse, researchers can contribute to their field while also enhancing their own professional visibility.
To ensure data citations are properly acknowledged, they must be included in publications—typically in a dedicated “Data Availability Statement” section at the end of the article. (See Figure 6.1 for an example.)
Figure 6.1: Example of a Data Availability Statement.
This chapter is based on the following sources:
The presentation associated with this class is available here:
💡 Best browser to view presentation
Research data are the factual pieces of information used to test research hypotheses.
Data can be classified into five categories:
A key challenge facing researchers today is the need to work with diverse data sources. It is not uncommon for projects to integrate multiple data types into a single analysis—even drawing on data from disciplines outside of Ecology and Evolution. As research becomes increasingly collaborative and interdisciplinary, issues surrounding data management are becoming more prevalent.
Data have a longer lifespan than the project they were created for, as illustrated by the data life-cycle displayed in Figure 6.2.
Figure 6.2: The data lifecycle
Some projects may only focus on certain parts of the data life-cycle, such as primary data creation, or reusing others’ data. Other projects may go through several revolutions of the cycle. Either way, most researchers will work with data at all stages throughout their career.
Data management involves planning for all stages of the data life cycle and implementing this plan throughout your research project. When done effectively, it keeps the data life cycle moving smoothly, improves research efficiency, and ensures your data meet expectations set by yourself, funders, institutions, legislation, and publishers (e.g., copyright, data protection).
To put this into perspective, ask yourself:
Would a colleague be able to take over my project tomorrow if I disappeared, or make sense of the data without talking to me?
If your answer is YES, then you are managing your data well!
Effective data management brings several important benefits, including:
Whether or not your funder requires a data management or sharing plan as part of your grant application, having a plan in place before starting your research project will prepare you for potential data management challenges (see the Data Management Checklist section below).
Before designing your data management workflow, consider the following:
In the data life cycle (Figure 6.2), data creation begins when a researcher collects information in the field or lab and digitizes it to produce a raw dataset.
# Workflow associated with creating data
Collect data (in the field and/or lab) --> Digitize data --> Raw dataset
!! Perform quality checks @ each step to validate data !!
Quality control during data collection is essential, as there is often only one opportunity to gather data from a given situation. Researchers should critically evaluate their methods before data collection begins—high-quality methods lead to high-quality data. Once collection is underway, it’s equally important to document the process in detail as evidence of data quality.
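Such checks are easiest to apply consistently when they are scripted. Below is a minimal sketch in R; the dataset and its column names (`species`, `mass_g`) are hypothetical stand-ins for your own raw data:

```r
## Minimal quality check applied to a freshly digitized dataset
## (the data and column names here are hypothetical)
species_data <- data.frame(
  species = c("Pinus ponderosa", "Abies grandis", NA),
  mass_g  = c(12.4, -1, 8.9)
)

## Flag missing species names and impossible (non-positive) masses
problems <- which(is.na(species_data$species) | species_data$mass_g <= 0)
species_data[problems, ]  # rows to review before accepting the raw dataset
```

Running a check like this at each step (collection, digitization) also documents the evidence of data quality mentioned above.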
Data may be collected directly in digital form using electronic devices, or recorded manually as handwritten notes. In either case, some processing is required to produce a structured, digital raw dataset.
Key things to consider during data digitization include:
- Document your dataset in a `Readme.txt` or protocol file (ideally attached as an appendix to your manuscript). Include definitions for all variables, units of measurement, and codes for missing values.

Data should be processed into a format that supports analysis and ensures long-term usability. Data are at risk of being lost if the hardware or software used to create or process them becomes obsolete. To prevent this, data should be well organized, clearly structured, appropriately named, and version-controlled using standard, future-proof formats (see the Data Structure and Organisation of Files section below).
Here are some best practices for data processing (Figure 6.2):
- Use standard, future-proof file formats: `.csv` for tabular data, `.txt` for plain text, and `.gif`, `.jpeg`, or `.png` for images.
- Documentation (e.g., a `Readme.txt` file) can help others navigate your project or aid your own memory over time.
- Use version labels in file names (e.g., `v1_draft`, `v2_reviewed`, `v3_final`).

A clearly structured directory tree offers a simple yet powerful way to summarize your project’s file organization. It improves clarity, supports reproducibility, and makes onboarding easier for collaborators.
In UNIX-based systems, you can generate a directory tree using the `tree` command or package (see Figure 6.3).
Figure 6.3: Infer directory tree structure using the UNIX tree package.
The `tree` Package: Supporting Data Structure and Organisation

The UNIX `tree` package can be used to generate a visual representation of your project’s directory structure. In this example, it was used to inspect the contents of the `Project_ID/` folder. This is a helpful way to check file names, file locations, and version control practices.
Some useful `tree` command options include:
- `-s` — Displays the size (in bytes) of each file alongside its name.
- `-D` — Shows the date and time of the last modification for each file. When combined with the `-c` option, it instead shows the time of the last status change.

These options make it easy to identify inconsistencies in file naming, outdated versions, or misplaced files.
To install the `tree` package on macOS (this assumes the Homebrew package manager is installed), run:

# Install tree by typing the following command in a terminal
brew install tree
Creating clear documentation and metadata is essential for ensuring that your data can be understood, interpreted, and reused over the long term (Figure 6.2). Documentation—essentially the manual accompanying your dataset—should describe the data, explain any transformations or processing steps, and provide contextual information. The goal is to leave no room for misinterpretation.
Documentation requirements should be identified during the planning phase, so they can be implemented throughout all stages of the data life cycle, especially during data creation and processing. This proactive approach helps avoid “rescue missions” later—such as trying to remember undocumented steps or dealing with missing information when a collaborator leaves the project.
Data documentation includes information at both the project and data levels.
This includes broader information about the research project and how data were collected and managed:
This focuses on the details within individual datasets:
If you are using a language like R or Python for processing, much of this data-level documentation can be embedded directly in your scripts or notebooks.
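For instance, a small data dictionary can live next to the data in the same script. The sketch below illustrates the idea; the variable names, definitions, and units are hypothetical:

```r
## A data dictionary embedded in the analysis script documents
## variables, units, and missing-value codes at the data level
## (variable names and units here are hypothetical)
data_dictionary <- data.frame(
  variable     = c("species", "mass_g", "site"),
  definition   = c("Latin binomial", "Body mass", "Sampling site code"),
  unit         = c(NA, "grams", NA),
  missing_code = c("NA", "NA", "NA")
)
print(data_dictionary)
```

Because the dictionary travels with the processing script, it is versioned and shared together with the analysis itself.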
Metadata support data discovery, reuse, and interoperability—especially in online repositories and databases. They also facilitate machine-readable access, which is increasingly important in large-scale, automated research workflows.
Metadata are typically created using:
- A repository’s data submission form
- A metadata editor
- A standalone metadata generation tool (many of which are freely available online)
Metadata follow standard schemas and usually fall into three main categories:
To protect against data loss and ensure long-term accessibility, every research project should include a clear strategy for data backup and storage (see where this step fits in Figure 6.2). A general recommendation is to maintain at least three copies of your data:
- the original file,
- an external/local backup (e.g. an external hard drive), and
- an external/remote backup (e.g. cloud storage or a network drive).
Consult with your thesis supervisor or institutional IT team to determine the most suitable procedures for your project and research environment.
When designing a backup strategy, consider the various ways in which data could be lost or compromised, including:
While it’s not always possible to protect against every risk, you should tailor your backup strategy to address the most likely scenarios for your project and context.
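A backup step can also be scripted so it runs the same way every time. The sketch below uses base R and illustrative paths under `tempdir()`; in practice, point `src` at your project root and `dest` at an external or network drive:

```r
## Scripted local backup of a project folder (paths are illustrative)
src  <- file.path(tempdir(), "Project_ID")    # your project root in practice
dest <- file.path(tempdir(), "backup_drive")  # e.g. an external drive

## Create a small demo project so the example is self-contained
dir.create(file.path(src, "Data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(species = "Pinus ponderosa"),
          file.path(src, "Data", "species_data.csv"), row.names = FALSE)

## Copy the whole folder to the backup location, preserving dates
dir.create(dest, showWarnings = FALSE)
file.copy(src, dest, recursive = TRUE, copy.date = TRUE)
list.files(file.path(dest, "Project_ID"), recursive = TRUE)
```

Scheduling such a script (or its equivalent with a sync tool) addresses the most likely loss scenarios identified above.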
Robust data storage is essential for both original and backup files. While this applies to paper records as well, electronic data present unique challenges related to format obsolescence, media degradation, and digital security.
Network Drives
Managed by institutional IT staff, these are regularly backed up, secure, and typically access-controlled.
Personal Devices (e.g. laptops, tablets)
Convenient for working, but not recommended for long-term or master copies due to higher risks of loss, damage, or theft.
External Devices (e.g. USBs, hard drives, CDs)
Affordable and portable, but also vulnerable. Use only high-quality devices from reputable brands, and avoid relying on them for long-term preservation.
Remote or Cloud Services (e.g. Google Drive, Dropbox)
Useful for syncing and off-site backup. Free plans are often limited, and advanced features may require a subscription. Check institutional policies before using third-party services for sensitive data.
Paper Copies
Surprisingly resilient, paper copies of small datasets or key documents can complement digital backups. While not ideal for all data types, ink on paper offers long-term accessibility—as long as it’s stored safely and can be found when needed!
All aspects of data management ultimately support the discovery and reuse of data by others. This is one of the main goals of reproducible science and open research.
To facilitate responsible reuse, it is essential to clearly document intellectual property rights, licenses, and permissions. These should be included in your data documentation and metadata so that others understand how your data can be reused.
At this stage of the data life-cycle, it is important to clearly communicate your expectations for reuse, which may include:
Likewise, those who wish to reuse data must take responsibility for ethical and transparent use. This includes:
When requesting access to someone else’s data, be transparent about your intentions. Clearly state:
- The purpose of your request
- The research question or idea being explored
- Your expectations regarding collaboration or authorship
Co-authorship can be a sensitive and complex issue. It should be discussed openly with collaborators at the outset of a project to avoid misunderstandings later.
Openness to data reuse, combined with long-term data preservation, fosters collaboration, transparency, and innovation. Good data management enables high-quality datasets to remain useful beyond the original study and to contribute meaningfully to the broader research community—helping to answer the big questions in fields like ecology and evolution.
By making your data discoverable and reusable, you help ensure that science continues to grow on a strong, transparent foundation for future generations.
The following checklist, adapted from the UK Data Archive, is designed to guide you in planning effective data management and sharing. Use it to help design your own data management plan and ensure best practices throughout your research.
This section of Chapter 5 aims to introduce students to the principles and practices of reproducible coding. It provides guidance on organizing projects, maintaining clean and transparent code, and managing software dependencies to support reproducible research workflows.
To produce reproducible code, the following steps must be integrated:
In this chapter, we will cover Steps 1 to 4. Step 5 was introduced in Chapter 2, and Step 7 is discussed in Chapter 5 under Data Management. Step 6 will be addressed in the bioinformatics tutorial associated with Chapter 12.
This chapter draws on the following resources:
The presentation associated with this class is available here:
💡 Best browser to view presentation
To complete the exercise associated with the use of R to explore and summarize data structure and file organization, you will need to:
Detailed instructions for each step are provided below.
The dataset for this chapter is contained in the `Project_ID` folder, which you can access via our shared Google Drive:

Path: Reproducible_Science > Exercises > Chapter_5 > Project_ID

Please download the entire `Project_ID` folder to your local computer before starting the exercises.
To ensure you have all the necessary R packages, we will reuse code introduced in Chapter 1. The exercises depend on the following packages:
- `data.tree`
- `DiagrammeR`
Please make sure these packages are installed and loaded in your R environment before proceeding. You can use the R code provided below to do this.
Before running the code, please read the disclaimer:
💡 Disclaimer
- Running this code will install `R` packages (and their dependencies) on your computer: `data.tree` and `DiagrammeR`.
- About the `R` packages: These packages help you organize your project for reproducibility. Both are available on CRAN, the official repository for `R` packages.
- `R` package repositories: Before running the installation code, ensure your `R` package repositories are properly set in RStudio. You can follow this guide for detailed instructions.
The code to install R packages is as follows:
## ~~~ 1. List all required packages ~~~
## The user provides the names of the required packages in a
## character vector
pkg <- c("data.tree", "DiagrammeR")
## ~~~ 2. Check if pkg are installed ~~~
print("Check if packages are installed")
## [1] "Check if packages are installed"
# Identify packages that are not yet
# installed
new.pkg <- pkg[!(pkg %in% installed.packages())]
## ~~~ 3. Install missing packages ~~~
if (length(new.pkg) > 0) {
print(paste("Install missing package(s):", new.pkg, sep = " "))
install.packages(new.pkg, dependencies = TRUE)
} else {
print("All packages are already installed!")
}
## [1] "All packages are already installed!"
## ~~~ 4. Load all packages ~~~
print("Load packages and return status")
## [1] "Load packages and return status"
# Here we use the sapply() function to require all the
# packages
sapply(pkg, require, character.only = TRUE)
## data.tree DiagrammeR
## TRUE TRUE
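The four steps above can be wrapped into a small helper function so they are reusable across projects. The sketch below is a reconstruction for illustration; the course project ships a similar function in `R_functions/check.install.pkg.R`:

```r
## Helper that installs missing packages (if any) and then loads them
## (a sketch; a similar function ships with the course project)
check.install.pkg <- function(pkg) {
  # Identify packages that are not yet installed
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg) > 0) {
    install.packages(new.pkg, dependencies = TRUE)
  }
  # Load each package and return its loading status
  sapply(pkg, require, character.only = TRUE)
}

check.install.pkg(c("data.tree", "DiagrammeR"))
```

Sourcing this function at the top of each script keeps package handling out of your analysis code.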
A key step to organizing your project for reproducibility is to create a clear workflow diagram that outlines how your project will be structured and executed. An example of such a reproducible project workflow is shown in Figure 6.4. We will use this workflow as a guiding template throughout this chapter to help you implement your own reproducible project.
Figure 6.4: A simple reproducible project workflow.
The fundamental idea behind a robust, reproducible analysis is a clean, repeatable, script-based workflow (i.e. the sequence of tasks from the start to the end of a project), linking raw data through to clean data and final analysis outputs.
Please find below some key concepts associated with this task:
The simplest and most effective way to document your workflow—its inputs and outputs—is through good file system organization and informative, consistent naming of materials associated with your analysis. The name and location of files should be as informative as possible regarding what a file contains, why it exists, and how it relates to other files in the project. These principles apply to all files in your project (not just scripts) and are also closely linked to good research data management (see Chapter 5: Data Management).
It is best to keep all files associated with a particular project in a single root directory. RStudio Projects offer a great way to keep everything together in a self-contained and portable manner (i.e. so files can be moved from computer to computer), allowing internal pathways to data and scripts to remain valid even when shared or relocated.
Figure 6.5: Example file structure of a simple analysis project. Left-pad single-digit numbers with a zero in the R script names so the files are not misordered.
There is no single best way to organize a file system. The key is to ensure that the structure of directories and the location of files are consistent, informative, and tailored to your workflow.
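One way to keep the structure consistent across projects is to create the skeleton programmatically. The sketch below uses base R; the folder names follow the example in Figure 6.5, and the root is placed in `tempdir()` for illustration:

```r
## Create a project skeleton matching the structure in Figure 6.5
root <- file.path(tempdir(), "my_project")  # use your project root in practice
dirs <- c("data", "MS", "Figures_&_Tables", "Output",
          "R_functions", "Reports")
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
## Scripts live in the root; left-pad numbers with a zero so they sort
file.create(file.path(root, c("01_download_data.R", "02_clean_data.R",
                              "README.md")))
list.files(root)
```

Re-running the same script for each new project guarantees an identical, predictable layout.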
Below is an example of a basic project directory structure (Figure 6.5):
- The `data` folder contains all input data and associated metadata used in the analysis.
- The `MS` folder contains the manuscript files.
- The `Figures_&_Tables` folder stores all figures and tables generated by the analyses.
- The `Output` folder includes any intermediate or final output files (e.g., simulation outputs, models, processed datasets such as `cleaned-data`).
- The `R_functions` folder stores R scripts that define custom functions used throughout the project.
- The `Reports` folder includes R Markdown (`.Rmd`) files that document the analysis process or summarize results.
- Analysis scripts (`*.R`) are stored in the root directory, along with the `README.md` file.

Inferring the directory tree structure of your project provides a simple and efficient way to summarize the data structure and organization of files related to your project, as well as to track versioning. The R base functions list.files() and file.info() can be combined to obtain information about the files stored in your project. Please see the code below for an example associated with Figure 6.5.
# Produce a list of all files in the working directory
# (Project_ID) together with info related to those files
file.info(list.files(path = "Project_ID", recursive = TRUE, full.names = TRUE))
## size isdir mode
## Project_ID/Data/species_data.csv 6 FALSE 700
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 10282 FALSE 700
## Project_ID/MS/MS_et_al.docx 11705 FALSE 700
## Project_ID/Output/DataStr_network.html 256368 FALSE 700
## Project_ID/Output/DataStr_tree.pdf 3611 FALSE 700
## Project_ID/R_functions/check.install.pkg.R 684 FALSE 700
## Project_ID/README.md 1000 FALSE 700
## Project_ID/Reports/Documentation.md 14 FALSE 700
## Project_ID/Scripts/01_download_data.R 14 FALSE 700
## Project_ID/Scripts/02_clean_data.R 14 FALSE 700
## Project_ID/Scripts/03_exploratory_analyses.R 14 FALSE 700
## Project_ID/Scripts/04_fit_models.R 14 FALSE 700
## Project_ID/Scripts/05_generate_figures.R 14 FALSE 700
## mtime
## Project_ID/Data/species_data.csv 2018-09-12 10:02:55
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2018-09-09 10:38:23
## Project_ID/MS/MS_et_al.docx 2019-09-23 11:51:58
## Project_ID/Output/DataStr_network.html 2019-09-23 13:27:30
## Project_ID/Output/DataStr_tree.pdf 2019-09-23 13:04:03
## Project_ID/R_functions/check.install.pkg.R 2019-09-10 09:54:03
## Project_ID/README.md 2021-10-03 13:32:58
## Project_ID/Reports/Documentation.md 2018-09-12 08:56:42
## Project_ID/Scripts/01_download_data.R 2018-09-12 08:56:42
## Project_ID/Scripts/02_clean_data.R 2018-09-12 08:56:42
## Project_ID/Scripts/03_exploratory_analyses.R 2018-09-12 08:56:42
## Project_ID/Scripts/04_fit_models.R 2018-09-12 08:56:42
## Project_ID/Scripts/05_generate_figures.R 2018-09-12 08:56:42
## ctime
## Project_ID/Data/species_data.csv 2022-06-24 14:49:58
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 14:49:58
## Project_ID/MS/MS_et_al.docx 2022-06-24 14:49:56
## Project_ID/Output/DataStr_network.html 2022-06-24 14:49:56
## Project_ID/Output/DataStr_tree.pdf 2022-06-24 14:49:57
## Project_ID/R_functions/check.install.pkg.R 2022-06-24 14:49:57
## Project_ID/README.md 2022-06-24 14:49:57
## Project_ID/Reports/Documentation.md 2022-06-24 14:49:58
## Project_ID/Scripts/01_download_data.R 2022-06-24 14:49:57
## Project_ID/Scripts/02_clean_data.R 2022-06-24 14:49:57
## Project_ID/Scripts/03_exploratory_analyses.R 2022-06-24 14:49:58
## Project_ID/Scripts/04_fit_models.R 2022-06-24 14:49:57
## Project_ID/Scripts/05_generate_figures.R 2022-06-24 14:49:57
## atime uid
## Project_ID/Data/species_data.csv 2022-06-24 15:34:11 502
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 2022-06-24 15:34:11 502
## Project_ID/MS/MS_et_al.docx 2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_network.html 2022-06-24 15:34:11 502
## Project_ID/Output/DataStr_tree.pdf 2025-08-30 23:05:54 502
## Project_ID/R_functions/check.install.pkg.R 2022-06-24 15:34:11 502
## Project_ID/README.md 2022-06-24 15:34:11 502
## Project_ID/Reports/Documentation.md 2022-06-24 15:34:11 502
## Project_ID/Scripts/01_download_data.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/02_clean_data.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/03_exploratory_analyses.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/04_fit_models.R 2022-06-24 15:34:11 502
## Project_ID/Scripts/05_generate_figures.R 2022-06-24 15:34:11 502
## gid uname grname
## Project_ID/Data/species_data.csv 20 sven staff
## Project_ID/Figures_&_Tables/Fig_01_Data_lifecycle.pdf 20 sven staff
## Project_ID/MS/MS_et_al.docx 20 sven staff
## Project_ID/Output/DataStr_network.html 20 sven staff
## Project_ID/Output/DataStr_tree.pdf 20 sven staff
## Project_ID/R_functions/check.install.pkg.R 20 sven staff
## Project_ID/README.md 20 sven staff
## Project_ID/Reports/Documentation.md 20 sven staff
## Project_ID/Scripts/01_download_data.R 20 sven staff
## Project_ID/Scripts/02_clean_data.R 20 sven staff
## Project_ID/Scripts/03_exploratory_analyses.R 20 sven staff
## Project_ID/Scripts/04_fit_models.R 20 sven staff
## Project_ID/Scripts/05_generate_figures.R 20 sven staff
The code presented above could be at the core of a user-defined function designed to manage files and ensure data reliability for your project. To help you define such a function, let’s investigate this further and produce a diagram summarizing your project structure. This objective is achieved in four steps (see Figure 6.6):
1. Produce a list of all files in the project, together with their associated information, and store it in a `data.frame`.
2. Convert the `data.frame` into a `data.tree` class object using the data.tree::as.Node() function from the `data.tree` package.
3. Prepare and style the diagram with `data.tree`, and plot the output using the `DiagrammeR` package.
4. Save the diagram to a file.

💡 Disclaimer
This exercise uses the `Project_ID` folder located at: Reproducible_Science > Exercises > Chapter_5 > Project_ID.

Procedure to follow:
1. Copy the `Project_ID/` folder into a directory named `Chapter_5_PartB`.
2. Set your R working directory to `Chapter_5_PartB`.
3. Execute the R code below to produce a diagram summarizing the structure of `Project_ID/`. Also, read the accompanying text, which further explains the approach.

### Load R packages
library(data.tree)
library(DiagrammeR)
### Step 1: Produce a list of all files in Project_ID
filesInfo <- file.info(list.files(path = "Project_ID", recursive = TRUE,
                                  full.names = TRUE))
### Step 2: Convert filesInfo into data.tree class
myproj <- data.tree::as.Node(data.frame(pathString = rownames(filesInfo)))
# Inspect output
print(myproj)
## levelName
## 1 Project_ID
## 2 ¦--Data
## 3 ¦ °--species_data.csv
## 4 ¦--Figures_&_Tables
## 5 ¦ °--Fig_01_Data_lifecycle.pdf
## 6 ¦--MS
## 7 ¦ °--MS_et_al.docx
## 8 ¦--Output
## 9 ¦ ¦--DataStr_network.html
## 10 ¦ °--DataStr_tree.pdf
## 11 ¦--R_functions
## 12 ¦ °--check.install.pkg.R
## 13 ¦--README.md
## 14 ¦--Reports
## 15 ¦ °--Documentation.md
## 16 °--Scripts
## 17 ¦--01_download_data.R
## 18 ¦--02_clean_data.R
## 19 ¦--03_exploratory_analyses.R
## 20 ¦--04_fit_models.R
## 21 °--05_generate_figures.R
### Step 3: Prepare and plot diagram of project structure
### (it requires DiagrammeR)
# Set general parameters related to graph
data.tree::SetGraphStyle(myproj$root, rankdir = "LR")
# Set parameters for edges
data.tree::SetEdgeStyle(myproj$root, arrowhead = "vee", color = "grey",
penwidth = "2px")
# Set parameters for nodes
data.tree::SetNodeStyle(myproj, style = "rounded", shape = "box")
# Apply specific criteria only to children nodes of Scripts
# and R_functions folders
data.tree::SetNodeStyle(myproj$Scripts, style = "box", penwidth = "2px")
data.tree::SetNodeStyle(myproj$R_functions, style = "box", penwidth = "2px")
# Plot diagram
plot(myproj)
Figure 6.6: Diagram of project structure for the Project_ID directory. Nodes representing folders and files associated with R code are symbolized by boxes, whereas the others are rounded.
Finally, the DiagrammeR package unfortunately doesn’t allow you to easily save the graph to a file (Step 4) using, for example, the pdf() and dev.off() functions. However, this task can be accomplished in RStudio as follows: the diagram is displayed in the Viewer window (bottom right panel), and it can be exported by clicking the Save as Image... button, as shown in Figure 6.7.

Figure 6.7: Snapshot showing how to export the diagram in RStudio.
Figure 6.8: Snapshot showing how to save the diagram in RStudio.
To learn more about the options for exporting or saving DiagrammeR
graphs, please visit the following website:
https://rich-iannone.github.io/DiagrammeR/io.html
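If you prefer a fully scripted export, the diagram returned by plot() is an HTML widget, so it can also be saved programmatically. The sketch below assumes the htmlwidgets and webshot packages are installed (webshot additionally requires PhantomJS, installable via webshot::install_phantomjs()), and the file names are only examples:

```r
# Save the data.tree/DiagrammeR diagram without going through the RStudio GUI
p <- plot(myproj)                                    # plot() returns an HTML widget
htmlwidgets::saveWidget(p, "DataStr_tree.html")      # save as a standalone HTML file
webshot::webshot("DataStr_tree.html", "DataStr_tree.png")  # render the HTML to an image
```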
Code and software can, for example, be licensed under the GNU Affero General Public License. This is a free, copyleft license for software and other types of works, specifically designed to promote community cooperation, especially in the context of network server software.
This is the license used by the instructor for this class.
Good naming practices should extend to all files, folders, and even objects in your analysis. This helps make the contents and relationships among the elements of your analysis understandable, searchable, and logically organized (see Figure 6.5 for examples, and Chapter 5: Data Management for more details).
Pseudocode is an informal, high-level description of how a computer program or algorithm operates. It uses the structural conventions of a standard programming language (in this case, R) but is intended for human understanding rather than execution by a machine.
Here, you will define the major steps in your process (and their associated tasks) and link them to R functions—either existing ones or those you need to create. This serves as the backbone of your code and supports the process of writing it.
Writing clear, reproducible code has (at least) three major benefits:
To support writing effective code, it is recommended to follow the workflow shown in Figure 6.4. The next section explains each part of this workflow and offers tips for writing better code. Although the workflow represents a ‘gold standard,’ even adopting a few of its elements can significantly improve the clarity and quality of your code.
The foundation of writing readable code is choosing a clear and consistent coding style—and sticking to it. Some key elements to consider when developing your coding style include:
### Naming files
# Good
fit-models.R
utility-functions.R
# Bad
foo.r
stuff.r
### Naming objects
# Good
day_one
day_1
# Bad
first_day_of_the_month
DayOne
dayone
djm1
Place spaces around all operators (=, +, -, <-, etc.) and after commas, just as you would in a sentence.

### Spacing
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
### Indentation
long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual, code is indented by two spaces.
}
Use <- for assignment, not =.

### Assignment
# Good
x <- 5
# Bad
x = 5
The most important role of a style guide, however, is to ensure consistency across scripts.
How often have you revisited an old script six months later and struggled to remember what you were doing? Or taken on a script from a collaborator and found it difficult to understand what their code does and why?
An easy way to make code more readable and reproducible is the liberal—and effective—use of comments. A comment is a line of text that is visible in the script but is not executed. In both R and Python, comments begin with a #.
A good principle to follow is to comment the ‘why’ rather than the ‘what’. The code itself shows what is being done; it is far more important to explain the reasoning behind a particular section, or to clarify any nonstandard or complex parts of the code.
It is also good practice to use comments to provide an overview at the beginning of a script and to use commented lines of --- to visually break up the script—for example, in R:
# Load data -------
When analyzing data, you often need to repeat the same task multiple times. For example, you might have several files that all require loading and cleaning in the same way, or you might need to perform the same analysis for multiple species or parameters. The best way to handle these tasks is to write functions and store them in a dedicated folder (e.g. R_functions as shown in Figure 6.5). These functions can then be loaded into the R environment—and made available for use—by calling the source() function, which should be placed at the top of your R script. For more details, please see Chapter 1, Part D.
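As an illustration, the following sketch (assuming your functions live in the R_functions folder shown in Figure 6.5, and that the working directory is the project root) loads every function file at the top of a script:

```r
# Load all user-defined functions stored in R_functions/
fun_files <- list.files("R_functions", pattern = "\\.R$", full.names = TRUE)
invisible(lapply(fun_files, source))
```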
In the experimental sciences, rigorous testing ensures that results are accurate, reproducible, and reliable. Testing confirms that the experimental setup functions as intended and helps quantify any systematic biases. Since experimental results are not trusted without such validation, the same standard should apply to your code.
Testing scientific code helps ensure it works as intended and allows you to understand and quantify any limitations. Additionally, using tests can speed up code development by identifying errors early.
While establishing formal testing protocols is especially important when designing R packages, informal testing is also valuable. The instructor recommends that students load their written functions and run ad hoc tests in the command line to verify that the functions perform as expected.
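Such an ad hoc test can be as simple as a few stopifnot() calls run in the console, which halt with an error whenever a function does not return the expected value (the function below is purely hypothetical):

```r
# A hypothetical function to test
celsius_to_fahrenheit <- function(x) x * 9 / 5 + 32

# Informal tests: silent if all checks pass, an error otherwise
stopifnot(celsius_to_fahrenheit(0) == 32,
          celsius_to_fahrenheit(100) == 212)
```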
When working collaboratively, ensuring code portability between machines is crucial—that is, will your code run correctly on someone else’s computer? Portability is also important when running code on different servers, such as a High Performance Cluster.
One effective way to improve code portability is to avoid using absolute paths and instead use relative paths (see https://en.wikipedia.org/wiki/Path_(computing)#Absolute_and_relative_paths).
For example, consider the file species_data.csv stored in the Data folder shown in Figure 6.5:
# Absolute path -----------------------------
C:/Project_ID/Data/species_data.csv
Project_ID = Project root folder = working directory
# Relative path ------
Data/species_data.csv
Relative paths are especially useful when transferring projects between computers. For example, while you might have stored your project folder in C:/Project_ID/, someone else might have theirs in C:/Users/My Documents. Using relative paths—and running your code from the project folder—helps avoid file-not-found errors.
RStudio is specifically designed to facilitate code portability. For instance, you can easily set the working directory in RStudio by clicking:
Session > Set Working Directory > To Source File Location
Alternatively, you can use the setwd() function. For more details, please see Chapter 2.
Reproducibility also means ensuring that someone else can reuse your code to obtain the same results as you (see Appendix 1 for more details).
To enable others to reproduce the results in your report, you need to provide more than just the code and data. You must also document the exact versions of all packages, libraries, and software used, as well as potentially your operating system and hardware.
R itself is quite stable, and the core development team takes backward compatibility seriously—meaning old code generally works with recent versions of R. However, default values for some functions can change, and new functions are regularly introduced. If you wrote your code on a recent R version and share it with someone using an older version, they might not be able to run it. Similarly, code written for one package version may produce different results with a newer version.
In R, the simplest—and a very useful—way to document your dependencies is to include the output of sessionInfo() (or devtools::session_info()). This output shows all the packages and their versions loaded in the session used to run your analysis. Anyone wishing to recreate your work will then know which packages and versions to install. See Appendix 2 for more details.
For example, here is the output of sessionInfo()
showing the R version and packages used to create this document:
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rticles_0.27 DiagrammeR_1.0.11 DT_0.34.0
## [4] data.tree_1.2.0 kfigr_1.2.1 devtools_2.4.5
## [7] usethis_3.2.1 bibtex_0.5.1 knitcitations_1.0.12
## [10] htmltools_0.5.7 prettydoc_0.4.1 magrittr_2.0.3
## [13] dplyr_1.1.4 kableExtra_1.4.0 formattable_0.2.1
## [16] bookdown_0.36 rmarkdown_2.29 knitr_1.44
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.11 svglite_2.1.3 lubridate_1.9.3 visNetwork_2.1.4
## [5] digest_0.6.33 mime_0.12 R6_2.6.1 plyr_1.8.9
## [9] backports_1.4.1 evaluate_1.0.5 highr_0.11 httr_1.4.7
## [13] pillar_1.11.1 rlang_1.1.2 rstudioapi_0.17.1 miniUI_0.1.2
## [17] jquerylib_0.1.4 urlchecker_1.0.1 RefManageR_1.4.0 stringr_1.5.2
## [21] htmlwidgets_1.6.4 shiny_1.7.5.1 compiler_4.2.0 httpuv_1.6.13
## [25] xfun_0.41 pkgconfig_2.0.3 systemfonts_1.0.5 pkgbuild_1.4.8
## [29] tidyselect_1.2.0 tibble_3.2.1 viridisLite_0.4.2 withr_3.0.2
## [33] later_1.3.2 jsonlite_1.8.8 xtable_1.8-4 lifecycle_1.0.4
## [37] formatR_1.14 scales_1.4.0 cli_3.6.2 stringi_1.8.3
## [41] cachem_1.0.8 farver_2.1.1 fs_1.6.3 promises_1.2.1
## [45] remotes_2.5.0 xml2_1.3.6 bslib_0.5.1 ellipsis_0.3.2
## [49] generics_0.1.4 vctrs_0.6.5 RColorBrewer_1.1-3 tools_4.2.0
## [53] glue_1.6.2 purrr_1.0.2 crosstalk_1.2.2 pkgload_1.4.1
## [57] fastmap_1.1.1 yaml_2.3.8 timechange_0.2.0 sessioninfo_1.2.3
## [61] memoise_2.0.1 profvis_0.3.8 sass_0.4.8
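To ship this information with your project, you can also write it to a text file at the end of your analysis. A small sketch (the Output/ folder and file name are just examples):

```r
# Record the exact R and package versions used for this analysis
writeLines(capture.output(sessionInfo()), "Output/sessionInfo.txt")
```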
This topic will not be covered in this class, but it is worth noting that there are at least two R packages designed to better manage dependencies and help recreate your project setup. These packages are:
The instructor encourages students to discuss with their supervisors whether these packages might be useful for their projects.
This chapter is subdivided into two parts as follows:
To support the learning outcome, we will focus on discussing:
This chapter is mostly based on the following books, publications and web resources:
Books and Guides
Publications
Websites
The presentation slides for Chapter 6 - part A can be downloaded here.
Publishing research results is the one thing that unites scientists across disciplines, and it is a necessary part of the scientific process. You can have the best ideas in the world, but if you don’t communicate them clearly enough to be published, your work won’t be acknowledged by the scientific community.
By publishing you are achieving three key goals for yourself and the larger scientific endeavor:
In biology, publishing research essentially means publishing journal articles.
Before you begin writing your journal article and thinking about where to submit it, it is important to thoroughly understand your own research and know the key conclusion you want to communicate (see chapter 3). In other words, what is your take-home message?
Consider your conclusion and ask yourself, is it:
If you can answer ‘yes’ to all three, you have a good foundation message for a paper.
Shape the whole narrative of your paper around this message.
Once you know your message, getting your research published is a four-step process:
Each step will be discussed below. Please seek support from your supervisor to learn more about the specifics of your field.
To target the best journal to publish your research, you need to ask yourself: what audience do I want my paper to reach?
Your manuscript should be tailored to the journal you want to submit to in terms of content and in terms of style (as outlined in journals’ author guidelines). To confirm that a journal is the best outlet to publish your research ask yourself this question: can I relate my research to other papers published in this journal?
Here are some things to consider when choosing which journal to submit to:
Look closely at what the journal publishes; manuscripts are often rejected on the basis that they would be more suitable for another journal. There can be crossover between different journals’ aims and scope – differences may be subtle, but all important when it comes to getting accepted.
Do you want your article read by a more specialist audience working on closely related topics to yours, or researchers within your broader discipline?
Once you have decided which journal you are most interested in, make sure that you tailor the article according to its aims and scope.
It is a good sign if you recognize the names of the editors and editorial board members of a journal from the work you have already encountered (even better if they contributed to some of the references cited in your manuscript). Research who would likely deal with your paper if you submitted to a journal and find someone who would appreciate reading your paper. You can suggest handling editors in your cover letter or in the submission form, if it allows, but be aware that journals do not have to follow your suggestions and/or requests.
A summary of our previously discussed material is presented below to provide more context for this chapter, but please consult Chapter 4 for more details on this topic.
Impact factors are the most widely used metric for comparing journal quality based on the citations a journal receives. However, other metrics are becoming more common, e.g. the Altmetric score, which measures the impact of individual articles through online activity (shares on different social media platforms, etc.), or article download figures listed next to the published paper.
None of the metrics described here are an exact measure of the quality of the journal or published research. You will have to decide which of these metrics (if any) matter most to your work or your funders and institutions.
Do you need to publish open access (OA)? Some funders mandate it, and grant money often has an amount earmarked to cover the article processing charge (APC) required for Gold OA. Some universities have established agreements with publishers whereby their staff get discounts on APCs when publishing in certain journals (or even a quota of manuscripts that can be published for “free” on a yearly basis). If you do not have grant funding, check whether your university or department has an OA fund that you could tap into.
However, if you are not mandated to publish OA by your funder and/or you do not have the funds to do so, your paper will still reach your target audience if you select the right journal for your paper. Remember, you can share your paper over email.
The length of time a paper takes to be peer reviewed does not correlate with the quality of peer review, but rather reflects the resources a journal has to manage the process (e.g. do they have paid editorial staff or is it managed by full-time academics?).
Journals usually give their average time to a decision on their website, so take note of this if time is a consideration for you.
Some journals also make it clear that they are reviewing for soundness of science rather than novelty and will therefore often have a faster review process (e.g. PLoS ONE).
Ethics can be divided into two groups:
As an author, it helps if you are familiar with what constitutes good practices and what is considered unacceptable. Please see section “Used literature & web resources” for more details on this topic.
Develop a narrative that leads to your main conclusion and develop a backbone around that narrative. The narrative should progress logically, which does not necessarily mean chronologically. Work out approximate word counts for each section to help manage the article structure and keep you on track for word limits.
It is important to set aside enough time to write your manuscript and – importantly – enough time to edit, which may actually take longer than the writing itself.
The article structure will be defined in the author guidelines, but if the journal’s guidelines permit it, there may be scope to use your own subheadings. By breaking down your manuscript into smaller sections, you will be communicating your message in a much more digestible form.
Use subheadings to shape your narrative and write each subheading in statement form (e.g. ecological variables do not predict genome size variation).
The title is the most visible part of your paper and it should thus clearly communicate your key message. Pre-publication, reviewers decide whether to review a paper based on the quality of the title and abstract. Post-publication, if you publish in a subscription journal and not OA, the title and abstract are the only freely available parts of your paper, which will turn up in search engines and thus reach the widest audience. A good title will help you get citations and may even be picked up by the press.
Draft a title before you write your manuscript to help focus your paper. The title needs to be informative and interesting to make it stand out to reviewers and subsequently readers. Some key tips for a successful title include:
Write your abstract after you have written your paper, when you are fully aware of the narrative of your paper. After the title, the abstract is the most read part of your paper. Abstracts are freely available and affect how discoverable your article is via search engines.
Given its importance, your abstract should:
Writing with clarity, simplicity and accuracy takes practice and we can all get carried away with what we think is ‘academic writing’ (i.e. long words and jargon) but good science speaks for itself. Write short sentences (ca. 12 words on average).
Every extra word you write is another word for a reviewer to disagree with. Single out the narrative that leads to your main conclusion and write that – it is easy to get sidetracked by lots of interesting avenues distracting from your work, but by including those in your paper, you are inviting more criticism from reviewers.
Write in an active, positive voice (e.g. ‘we found this…’ ‘we did this…’) and be direct so that your message is clear. Ambiguous writing is another invitation for reviewers to disagree with you.
In your introduction, state why your research is timely and important. Begin each section with that section’s key message and end each section with that message again plus further implications. This will place your work in the broader context that high-quality journals like.
Draft and redraft your work to ensure it flows well and your message is clear and focused throughout. During this process, keep the reader in mind at all times, so you maintain a critical eye on your research and its presentation.
Keywords are used by readers to discover your paper. You will increase the chances of your paper being discovered through search engines by using keywords strategically throughout your paper – this is search engine optimization (SEO).
Think of the words you would search for to bring up your paper in a Google search. Try it and see what comes up – are there papers that cover similar research to your own?
Build up a list of 15–20 terms relevant to your paper and divide them into two groups:
Place your core keywords in the title, abstract and subheadings, and the secondary keywords throughout the text and in figures and tables. Repeat keywords in the abstract and text naturally.
Reference all sources and do it as you go along (e.g. copy the BibTeX citation into a reference file; see chapter 1 part B), then tidy them once the paper is complete.
Make sure that most of your references are recent to demonstrate both that you have a good understanding of current literature, and that your research is relevant to current topics.
Figures and tables enhance your paper by communicating results or data concisely (more on this topic in chapters 9 and 10).
Use figures and tables to maintain the flow of your narrative – e.g. instead of trying to describe patterns in your results, create a figure and say ‘see Fig. 1’. Not only does this keep your word count down, but a well-designed figure can replace 1000 words!
Figures are useful for communicating overall trends and shapes, allowing simple comparisons between fewer elements. Tables should be used to display precise data values that require comparisons between many different elements.
Figure captions and table titles should explain what is presented and highlight the key message of this part of your narrative – the figure/table and its caption/title should be understandable in isolation from the rest of your manuscript.
Check the journal’s author guidelines for details on table and figure formatting, appropriate file types, number of tables and figures allowed and any other specifications that may apply. Material presented in chapter 1 part C can help you produce figures meeting journal expectations.
Once you have finished writing your manuscript, put it on ice for a week so you come back to it with fresh eyes. Take your time to read it through. Editing can take more time than you expect, but this is your opportunity to fine-tune and submit the best paper possible. Don’t hesitate to seek support from your thesis committee to speed up and streamline this process.
Key things to look out for when editing include:
You are now ready to submit your paper to your chosen journal. Each journal will have a different submission procedure that you will have to adhere to, and most manage their submissions through online submission systems (e.g. ScholarOne Manuscripts).
Only submit your paper for consideration to one journal at a time; otherwise, you will be breaching publishing ethics.
A great cover letter can set the stage towards convincing editors to send your paper for review. Write a concise and engaging letter addressed to the editor-in-chief, who may not be an expert in your field or sub-field.
The following points should be covered in your cover letter:
Very rarely is a paper immediately accepted – almost all papers go through a few rounds of review before they get published.
If a decision comes back asking for revisions you should reply to all comments politely. Here are some tips on handling reviewer comments and revising your paper:
Reviewers are volunteers, but the service they provide is invaluable – by undergoing peer review, regardless of the outcome, you are receiving some of the best advice from leading experts for free. With this in mind, any feedback you get will be constructive in the end and will lead you on the way to a successful publishing portfolio.
Keep in mind that feedback is another person’s opinion on what you have done, not on who you are, and it is up to you to decide what to do with it.
If your paper is rejected look at the reviewer’s comments and use their feedback to improve your paper before resubmitting it.
If you are unhappy with a reject decision, 99.9% of the time you should simply move on. However, don’t be afraid to appeal if you have well-founded concerns or think that the reviewers have done a poor job. There are instances where journals grant your appeal and allow you to revise your paper, but in the large majority of cases, the decision to reject will be upheld.
Congratulations! By now you should have an acceptance email from the editor-in-chief in your inbox. The process from here will vary according to each journal, but the post-acceptance workflow is usually as follows:
It might be then time to coordinate the publication of a press release or post the link of your article on social media to share your joy!
The presentation slides for Chapter 6 - part B can be downloaded here.
This chapter is mostly based on the following books and web resources:
Bookdown ebook on writing scientific papers: https://bookdown.org/yihui/rmarkdown/journals.html
A blog on “Writing papers in R Markdown” by Francisco Rodriguez-Sanchez: https://sites.google.com/site/rodriguezsanchezf/news/writingpapersinrmarkdown
rmdTemplates R package GitHub repository: https://github.com/Pakillo/rmdTemplates
rticles R package GitHub repository: https://github.com/rstudio/rticles
To apply the approach described below, make sure that you have a TeX distribution installed on your computer. More information on this topic is available here. You will also need to install the R rticles package as demonstrated here.
Traditionally, journals accept manuscripts submitted in either Word (.doc) or LaTeX (.tex) format. In addition, most journals request that figures be submitted as separate files (in e.g. .tiff or .eps format). Online submission platforms collate your different files to produce a .pdf document, which is shared with reviewers for evaluation. In this context, although the .Rmd format is growing in popularity (due to its ability to “mesh” data analyses with data communication), this format is currently not accepted by journals. In this document, we discuss ways that have been developed to circumvent this issue and allow the approach implemented in R Markdown to be used for journal submissions.
As mentioned above, many journals support the LaTeX format (.tex) for manuscript submissions. While you can convert R Markdown (.Rmd) to LaTeX, different journals have different typesetting requirements and LaTeX styles. The solution is to develop scripts that convert R Markdown files into LaTeX files meeting journal requirements.
Submitting scientific manuscripts written in R Markdown is still challenging; however, the R rticles package was designed to simplify the creation of documents that conform to submission standards for academic journals (see Allaire et al., 2024). The package provides a suite of custom R Markdown LaTeX formats and templates for the following journals/publishers that are relevant to the EEB program:
An understanding of LaTeX is recommended, but not essential, in order to use this package. R Markdown templates may sometimes inevitably contain LaTeX code, but usually you can use the simpler R Markdown syntax and knitr functions to produce elements like figures, tables, and math equations.
install.packages("rticles")
Tools -> Install Packages...
Then, type "rticles" in the window that opens to install the package.
remotes::install_github("rstudio/rticles")
- Go to File -> New File -> R Markdown...
- In the New R Markdown window, click on From Template in the left panel and select the journal style that you would like to follow for your article (here PNAS Journal Article; see Figure 7.1). Before pushing the OK button, provide a name for your project and set a location where the project will be saved (see Figure 7.1).

Figure 7.1: R Markdown window allowing you to select templates following journal styles.

Figure 7.2: Snapshot showing the template R Markdown file and the associated folder created to generate your submission.
- Submission_PNAS.Rmd: the R Markdown file that will be used to write your article.
- pnas-sample.bib: a BibTeX file to store your bibliography.
- pnas.csl and pnas-new.cls: files containing information about the formatting of citations and the bibliography adapted to journal policies.
- frog.png: a .png file used to show you how to include figures in a .Rmd document.

Figure 7.3: Snapshot showing the suite of files associated with your submission and saved in your project folder.
- Open Submission_PNAS.Rmd and update the YAML metadata section with information on authors, your abstract, summary and keywords (see Figure 7.4).

Figure 7.4: Update the YAML metadata section with information on authors, your abstract, summary and keywords.
- Knit the document to generate the .pdf and .tex files needed to submit your article (see Figure 7.5). The output files will be saved in your project folder.

Figure 7.5: Snapshot of the procedure to knit the document.
To get familiar with this procedure, please practice by applying the above approach to different journal templates, favoring those of journals you might submit to.
Enjoy writing scientific publications in R Markdown!
In this chapter, we will study protocols to import and gather data with R. As stated by Gandrud (2015) in chapter 6 of his book, how you gather your data directly impacts how reproducible your research will be. In this context, it is your duty to try your best to document every step of your data gathering process. Reproduction will be easier if all of your data gathering steps are tied together by your source code; independent researchers (and you) can then more easily regather the data. Regathering data will be easiest if running your code allows you to get all the way back to the raw data files (the rawer the better). Of course, this may not always be possible. You may need to conduct interviews or compile information from paper-based archives, for example. The best you can sometimes do is describe your data gathering process in detail. Nonetheless, R’s automated data gathering capabilities for internet-based information are extensive. Learning how to take full advantage of these capabilities greatly increases reproducibility and can save you considerable time and effort over the long run.
Gathering data can be done either by importing locally stored data sets (i.e. files stored on your computer) or by importing data sets from the Internet. Usually, these data sets are saved in plain-text format (typically the comma-separated values format, or csv), making importing them into R a fairly straightforward task (using the read.csv() function). However, if data sets are not saved in plain-text format, users will have to start by converting them. In most cases, data sets will be saved in xls or xlsx format, and functions implemented in the readxl package (Wickham and Bryan, 2019) would be used (e.g. the read_xlsx() function). If your data sets were created by other statistical programs such as SPSS, SAS or Stata, they can be imported into R using functions from the foreign package (R Core Team, 2020). Finally, data sets could be saved in compressed archives, which will have to be processed prior to importing the data into R.
Learning skills to import and gather data sets is especially important in the fields of Ecology, Evolution and Behavior, since your research is highly likely to depend on large and complex data sets (see Figure 8.1). In addition, testing your hypotheses will rely on your ability to manage your data sets to test for complex interactions (e.g. do abiotic factors such as temperature drive selection processes in plants?; Figure 8.1).
Figure 8.1: Example of data sets involved in Ecology, Evolution and Behavior and their interactions.
Here, we provide methods and R functions to manage your projects and to gather, convert and clean your data. Ultimately, these tools will be applied to document and produce the raw data at the basis of your research.
More specifically, you will learn how to:

- Import csv files deposited on GitHub into R.
- Import csv files associated with specific GitHub commit events.
- Download whole GitHub repositories (in zip format) on your computer.
- Estimate the size of a zip file.
- List and extract target files from a zip file without decompressing it.

RStudio projects (.Rproj) allow users to manage their projects, more specifically by dividing their work into multiple contexts, each with their own working directory, workspace, history, and source documents.
RStudio projects are associated with R working directories. You can create an RStudio project:
We will be covering the last option during Chapter 11.
To create a new project in RStudio, go to File > New Project... and a window will pop up allowing you to select among the 3 options (see Figure 8.2).
Figure 8.2: Window allowing you to create a New RStudio project. See text for more details.
When a new project is created, RStudio:

- Creates a project file (with an .Rproj extension) within the project directory. This file contains various project options and can also be used as a shortcut for opening the project directly from the filesystem.
- Creates a hidden directory (named .Rproj.user) where project-specific temporary files (e.g. auto-saved source documents, window-state, etc.) are stored. This directory is also automatically added to .Rbuildignore, .gitignore, etc. if required.

To open a project, go to your project directory and double-click on the project file (*.Rproj). When your project opens within RStudio, the following actions will be taken:
- If an .RData file (see below) exists in the project directory, it will be loaded into your environment, allowing you to pursue your work.

When you are within a project and choose to either Quit, close the project, or open another project, the following actions will be taken:
- .RData and/or .Rhistory are written to the project directory (if current options indicate they should be).

Additional information on RStudio projects can be found here:
https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
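Note that, besides the File > New Project... menu, projects can also be created programmatically, for instance with the usethis package. This is a sketch, not part of the official course workflow; the path below is a made-up example.

```r
# Create an RStudio project (directory + .Rproj file) at the given
# path without opening it. Requires the usethis package; the path
# is a hypothetical example.
# install.packages("usethis")
usethis::create_project("~/EEB603_example_project", open = FALSE)
```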
The R packages and GitHub repositories used in this chapter are listed below. Please make sure you have all of these set up before starting to read and complete the material presented in this chapter.
Most of the functions that we will be using in this chapter are base R functions installed (by default) in the utils package (R Core Team, 2019). However, the following R package (and its dependencies) has to be installed on your computer prior to starting this tutorial: repmis (Gandrud, 2016). In the event that you want to import xls or xlsx files into R, you will also have to install the readxl package (Wickham and Bryan, 2019).
This repository is dedicated to this course and is used to demonstrate the procedure to import csv files into R from GitHub repositories. More specifically, we will be importing different versions of the file (Timetable_EEB603_topic_tasks.csv) at the origin of the Timetable to study procedures associated with file versioning in GitHub. It is located at this URL (Uniform Resource Locator):
This repository is associated with Barron et al. (2020) and is used to demonstrate the procedure to download whole GitHub repositories. We will be downloading the whole repository on your local computer and then extracting all csv files in the 01_Raw_Data/ folder and saving them on your local computer using R (see Figure 8.3 for more details on file content).
Figure 8.3: Snapshot showing files in the 01_Raw_Data folder that we will be targeting in the GitHub repository.
This approach is aimed at demonstrating how you could access raw data deposited on GitHub. The repository is located at this URL:
In this chapter, we will focus on learning procedures to import data from the internet, with an emphasis on GitHub. More precisely, we will learn procedures to:

- Import individual data sets (in csv format) stored in GitHub repositories.
- Download whole GitHub repositories in zip format.

The list of major R functions covered in this chapter is provided in the Table below.
Before starting coding, please do the following:

- Create an RStudio project entitled EEB603_Chapter_06.
- Create an R script entitled 01_Data_gathering.R.

Before delving into this subject, there are several topics that we need to cover.
With the growing popularity of GitHub, several authors are depositing their data sets on GitHub, and you might want to access those for your research. Since Git and GitHub support version control, it is important to report the exact version of the file or data set that you have downloaded. To support this, each version of a file/data set is associated with a unique SHA-1 (Secure Hash Algorithm 1) hash accession number. This means that if the file changes (because the owner of the repository updated it), its SHA-1 hash accession number will change. This feature allows users to refer to the exact file/data set used in their analyses, therefore supporting reproducibility.
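The principle behind these hashes can be illustrated with base R: hash a file, modify its content, and hash it again. Base R only ships an MD5 digest (tools::md5sum); Git itself uses SHA-1, but the content-addressing behavior illustrated here is the same.

```r
# Content-based hashing: the hash changes whenever the file changes.
# (tools::md5sum is used for illustration only; Git uses SHA-1.)
f <- tempfile(fileext = ".csv")
writeLines("x,y\n1,2", f)
hash1 <- unname(tools::md5sum(f))
writeLines("x,y\n1,3", f)  # the owner updates the file
hash2 <- unname(tools::md5sum(f))
hash1 == hash2  # FALSE: a new version yields a new hash
```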
Before being able to import a csv file deposited on GitHub into R, you have to find its raw URL. In this section, we will demonstrate how to obtain this information by using the Timetable_EEB603_topic_tasks.csv file located on the course's GitHub repository.
To retrieve the raw URL associated with Timetable_EEB603_topic_tasks.csv, do the following:
Figure 8.4: Location of Timetable_EEB603_topic_tasks.csv on the EEB603_Reproducible_Science GitHub repository.
- Click on the Raw button on the right just above the file preview (Figure 8.4). This action should open a new window showing you the raw csv file (see Figure 8.5).
Figure 8.5: Raw csv file associated to Timetable_EEB603_topic_tasks.csv. See text for more details.
- The raw URL of the csv file can be retrieved by copying the URL address (see Figure 8.5). In this case, the URL is as follows: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/master/Data/Timetable_EEB603_topic_tasks.csv
- A shortened URL pointing to the raw csv file is https://bit.ly/3BDxECl. We will be using this URL in the example below.

Now that we have retrieved and shortened the raw GitHub URL pointing to our target csv file, we can use the source_data() function implemented in the R package repmis (Gandrud, 2016) to download the file. The object returned by source_data() is a data.frame and can therefore be easily manipulated and saved on your local computer (using e.g. write.csv()). Retrieving a csv data set from a GitHub repository can be done as follows:
### ~~~ Load package ~~~
library(repmis)
### ~~~ Store raw short URL into object ~~~
urlcsv <- "https://bit.ly/3BDxECl"
### ~~~ Download/Import csv into R ~~~
csvTimeTable <- repmis::source_data(url = urlcsv)
## Downloading data from: https://bit.ly/3BDxECl
## SHA-1 hash of the downloaded data file is:
## 12b62fd7b02ba87f99bd3e3c46803fa3b014fc9f
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTable)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTable)
## Topic
## 1 [Syllabus](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html)
## 2 [Chapt. 1 - part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#23_What_is_Reproducible_Science)
## 3 [Chapt. 1 - part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#23_What_is_Reproducible_Science)
## 4 [Chapt. 1 - part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#24_What_Factors_Break_Reproducibility)
## 5 [Chap. 2 - part A: Learning the basics](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#32_PART_A:_Learning_the_Basics)
## 6 Review Chap. 2 - part A and [Chap. 2 - part B: Setting Your R Markdown Document](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#33_PART_B:_Setting_Your_R_Markdown_Document)
## 7 Complete [Chap. 2 - Part B: Setting Your R Markdown Document](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#33_PART_B:_Setting_Your_R_Markdown_Document) (we will start at the ""Install R Dependencies and Load Packages"" section)
## 8 Start [Chap. 2 - Part C: Tables, Figures, and References](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#34_PART_C:_Tables,_Figures_and_References) and complete it until the end of the ""Cross-reference Tables and Figures in the Text"" section
## 9 Complete [Chap. 2 - part C: Tables, Figures and References](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#34_PART_C:_Tables,_Figures_and_References) (we will start at section 3.4.7 ""Cite References in the Text"")
## 10 [Chap. 2 - part D: Advanced R and R Markdown settings](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#35_PART_D:_Advanced_R_and_R_Markdown_settings)
## 11 [Chap. 2 - part E: User-Defined Functions in R](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#36_PART_E:_User-Defined_Functions_in_R)
## 12 Complete [Chap. 2 - part E: User-Defined Functions in R](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#36_PART_E:_User-Defined_Functions_in_R) (we will start at section 3.6.8 Create, Name, and Access Elements of a List in R)
## 13 [Chap. 3: A roadmap to implement reproducible science](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#4_Chapter_3)
## 14 [Chap. 4: Open science and CARE principles](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#5_Chapter_4)
## 15 Complete Chap. 4 (group activities on [CARE Principles and comparing FAIR & CARE Principles](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#551_%F0%9F%A4%9D_Group_Sharing_and_Discussions)) and [Chap. 5: Data Management](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partDM)
## 16 [Chap. 5: Reproducible Code](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#repcode)
## 17 [Chap. 6: Getting published & Writing papers in R Markdown](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#7_Chapter_6)
## 18 Bioinfo. tutorial - Chap. 7
## 19 Bioinfo. tutorial - Chap. 8
## 20 Bioinfo. tutorial - Chap. 8
## 21 Bioinfo. tutorial - Chap. 9
## 22 Bioinfo. tutorial - Chap. 9
## 23 Bioinfo. tutorial - Chap. 10
## 24 Bioinfo. tutorial - Chap. 10
## 25 Bioinfo. tutorial - Chap. 11
## 26 Bioinfo. tutorial - Chap. 11
## 27 Bioinfo. tutorial - Chap. 12
## 28 Bioinfo. tutorial - Chap. 12
## 29 Individual projects - Q&A
## 30 Individual projects - Q&A
## 31 Oral presentations of individual projects
## 32 Oral presentations of individual projects
## Homework
## 1
## 2 Read the syllabus carefully and ask questions if you need any clarification
## 3 Read your assigned publication for Chapter 1 - part A and identify its reproducibility challenges
## 4 Read Baker (2016) and Summary (pages 5-19) of the Reproducibility and replicability in science report (from The National Academies Press)
## 5 [Install software](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html#82_Installing_Software) and Sign up for bioinformatics tutorial (see [Syllabus](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html#1013_Google_Sheet) )
## 6 Complete Chapter 2 - part A
## 7 Complete the sections of Chapter 2 – Part B up to the end of the “YAML Metadata Section.” Read the “Install R Dependencies and Load Packages” section to familiarize yourself with its content and purpose
## 8 Make sure that your ""Chapter2_partB.Rmd"" file knits successfully and includes all the material from Chapter 2 - Part B
## 9 Make sure that your `Chapter2_partC.Rmd` file knits successfully and includes all the material up to the end of section [3.4.6 ""Cross-reference Tables and Figures in the Text""](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#346_Cross-reference_Tables_and_Figures_in_the_Text). Refer to section 3.4.4.2 ""Double Table in Caption"" for guidance on how to debug this issue if needed
## 10 Complete Chap. 2 - part C: Tables, Figures and References
## 11 Read material presented in Chap. 2 - part E: User-Defined Functions in R
## 12 Complete Chap. 2 - part E up to section 3.6.8 Create, Name, and Access Elements of a List in R
## 13 Read material presented in Chapter 3
## 14 Read material presented in Chapter 4
## 15 Read material presented in Chapter 5 - Data Management
## 16 Read material presented in Chapter 5 - Reproducible Code
## 17 Read material presented in Chapter 6
## 18 Students work on bioinformatics tutorials and their individual projects
## 19 Students work on bioinformatics tutorials and their individual projects
## 20 Students work on bioinformatics tutorials and their individual projects
## 21 Students work on bioinformatics tutorials and their individual projects
## 22 Students work on bioinformatics tutorials and their individual projects
## 23 Students work on bioinformatics tutorials and their individual projects
## 24 Students work on bioinformatics tutorials and their individual projects
## 25 Students work on bioinformatics tutorials and their individual projects
## 26 Students work on bioinformatics tutorials and their individual projects
## 27 Students work on bioinformatics tutorials and their individual projects
## 28 Students work on bioinformatics tutorials and their individual projects
## 29 Students work on bioinformatics tutorials and their individual projects
## 30 Upload slides of individual projects (on [Google Drive](https://drive.google.com/drive/folders/1MZt5kNKusCv6OeZpjuuPxUiqBaoQVQAc?usp=sharing))
## 31
## 32
## Deadline
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15 Turn in tutorial for Chap. 7
## 16 Upload tutorial of Chap. 7 on Google Drive
## 17 Turn in tutorial for Chap. 8
## 18 Upload tutorial of Chap. 8 on Google Drive
## 19 Turn in tutorial for Chap. 9
## 20 Upload tutorial of Chap. 9 on Google Drive
## 21 Turn in tutorial for Chap. 10
## 22 Upload tutorial of Chap. 10 on Google Drive
## 23 Turn in tutorial for Chap. 11
## 24 Upload tutorial of Chap. 11 on Google Drive
## 25 Turn in tutorial for Chap. 12
## 26 Upload tutorial of Chap. 12 on Google Drive
## 27
## 28
## 29
## 30 Turn in ind. projects (on [Google Drive](https://drive.google.com/drive/folders/1MZt5kNKusCv6OeZpjuuPxUiqBaoQVQAc?usp=sharing))
## 31
## 32
## URL
## 1 [Syllabus](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html)
## 2 [Chapter 1](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#2_Chapter_1) & [Chapter 2](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#3_Chapter_2)
## 3 [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA)
## 4 [Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partA) & [Part B: Set your R Markdown environment](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#233_Set_your_R_Markdown_environment)
## 5 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 6 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References)
## 7 [Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#13_PART_B:_Tables,_Figures_and_References) & [Part C](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#14_PART_C:_Advanced_R_and_R_Markdown_settings)
## 8 [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 9 [Part D](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#15_PART_D:_User_Defined_Functions_in_R)
## 10 [Publications and Resources for bioinformatic tutorials](https://svenbuerki.github.io/EEB603_Reproducible_Science/index.html#7_Publications__Textbooks)
## 11 [Chapter 3](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#4_Chapter_3)
## 12 [Chapter 4](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#5_Chapter_4)
## 13 [Chapter 5 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#partDM)
## 14 [Chapter 5 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#repcode)
## 15 [Chapter 6 - Part A](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#getpub)
## 16 [Chapter 6 - Part B](https://svenbuerki.github.io/EEB603_Reproducible_Science/Chapters.html#writpub)
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
source_data() will always download the most recent version of the file from the master branch and return its unique SHA-1 hash. However, you can also download a prior version of the file by using the raw URL associated with a previous commit (whose path embeds that commit's SHA-1 hash).
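In addition, source_data() accepts a sha1 argument that pins the expected hash of the download: if the file on GitHub has changed since you recorded the hash, the call will flag the mismatch instead of silently using the new version. A sketch, reusing the hash reported in the run above:

```r
library(repmis)

# Pin the expected SHA-1 hash (taken from the earlier run). A
# mismatch indicates the file on GitHub has changed since then.
csvTimeTable <- repmis::source_data(
    url  = "https://bit.ly/3BDxECl",
    sha1 = "12b62fd7b02ba87f99bd3e3c46803fa3b014fc9f"
)
```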
Retrieving the csv file associated with a specific commit can be done by applying the following approach (here using the same example as above):
- To retrieve the commit raw URL, navigate to the file's location on GitHub by clicking here.

Figure 8.6: Page showing commit history associated to Timetable_EEB603_topic_tasks.csv.
- Click on the target commit event (under Commits on Aug 23, 2021; Figure 8.6). This will take you to the version of the file at this point in history.
Figure 8.7: Page showing prior commit associated to Timetable_EEB603_topic_tasks.csv.
- Clicking on the Raw button will load the csv file and allow you to retrieve the URL (as done above).

### ~~~ Store raw URL into object ~~~
urlcsvold <- "https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv"
### ~~~ Download/Import csv into R ~~~
csvTimeTableOld <- repmis::source_data(url = urlcsvold)
## Downloading data from: https://raw.githubusercontent.com/svenbuerki/EEB603_Reproducible_Science/1312eae07e0515d8d2423ff83834b184cf6eeb8d/Data/Timetable_EEB603_topic_tasks.csv
## SHA-1 hash of the downloaded data file is:
## e7071b1e85ada38d6b2cf2a93bbb43c2b96a331f
### ~~~ Check class ~~~ Class should be `data.frame`
class(csvTimeTableOld)
## [1] "data.frame"
### ~~~ Print csv ~~~
print(csvTimeTableOld)
## Topic
## 1 Syllabus
## 2 Example of a bioinformatic tutorial
## 3 Chap. 1 - R Markdown part A
## 4 Chap. 1 - R Markdown part B
## 5 Chap. 1 - R Markdown part C
## 6 Chap. 1 - User-defined functions
## 7 Chap. 1 - Wrap-up
## 8 Chap. 2
## 9 Chap. 3
## 10 Chap. 4: Data management
## 11 Chap. 4: Reproducible code
## 12 Chap. 5: Getting published
## 13 Chap. 5: Writing papers in R Markdown (rticles)
## 14 TBD
## 15 Bioinfo. tutorial - Chap. 6
## 16 Bioinfo. tutorial - Chap. 6
## 17 Bioinfo. tutorial - Chap. 7
## 18 Bioinfo. tutorial - Chap. 7
## 19 Bioinfo. tutorial - Chap. 8
## 20 Bioinfo. tutorial - Chap. 8
## 21 Bioinfo. tutorial - Chap. 9
## 22 Bioinfo. tutorial - Chap. 9
## 23 Bioinfo. tutorial - Chap. 10
## 24 Bioinfo. tutorial - Chap. 10
## 25 Bioinfo. tutorial - Chap. 11
## 26 Bioinfo. tutorial - Chap. 11
## 27 Individual projects - Q&A
## 28 Oral presentations
## 29 Oral presentations
## 30 Oral presentations
## Task Deadline
## 1
## 2
## 3 Work on bioinfo. tutorials
## 4 Work on bioinfo. tutorials
## 5 Work on bioinfo. tutorials
## 6 Work on bioinfo. tutorials
## 7 Work on bioinfo. tutorials
## 8 Read Baker (2016) and prepare for discussing outcome of study
## 9 Work on bioinfo. tutorials
## 10 Work on bioinfo. tutorials
## 11 Work on bioinfo. tutorials
## 12 Work on bioinfo. tutorials
## 13 Turn in tutorial for Chap. 6 & start individual reports
## 14 Upload tutorial of Chap. 6 on Google Drive
## 15 Turn in tutorial for Chap. 7
## 16 Upload tutorial of Chap. 7 on Google Drive
## 17 Turn in tutorial for Chap. 8
## 18 Upload tutorial of Chap. 8 on Google Drive
## 19 Turn in tutorial for Chap. 9
## 20 Upload tutorial of Chap. 9 on Google Drive
## 21 Turn in tutorial for Chap. 10
## 22 Upload tutorial of Chap. 10 on Google Drive
## 23 Turn in tutorial for Chap. 11
## 24 Upload tutorial of Chap. 11 on Google Drive
## 25 Students work on ind. projects: Review literature (see Syllabus)
## 26 Students work on ind. projects: Data management workflow
## 27 Students work on ind. project
## 28 Work on reports/presentations
## 29 Turn in ind. reports
## 30
Before being able to download a GitHub repository and work with its files in R, you have to find the URL pointing to the compressed zip file containing all the files of the target repository. In this section, we will demonstrate how to obtain this information by using the Sagebrush_rooting_in_vitro_prop GitHub repository. As mentioned above, we will be downloading the whole repository and then extracting all the csv files in the 01_Raw_Data/ folder (see Figure 8.3).
To retrieve the URL associated with the compressed zip file containing all files for the repository, do the following:
Figure 8.8: GitHub repository page for Sagebrush_rooting_in_vitro_prop.
- To copy the URL associated with the zip file, do the following actions:

Now that we have secured the URL pointing to the compressed zip file for the target repository (by copying it), we will use this URL and the base R download.file() function to download the file on our local computer. Since compressed files can be large, we also provide some code to check whether the file already exists on your computer before downloading it.
### ~~~ Store URL in object ~~~ Paste the URL that you
### copied in the previous section here
URLrepo <- "https://github.com/svenbuerki/Sagebrush_rooting_in_vitro_prop/archive/refs/heads/master.zip"
### ~~~ Download the repository from GitHub ~~~ Arguments:
### - url: URLrepo - destfile: Path and name of destination
### file on your computer YOU HAVE TO ADJUST PATH TO YOUR
### COMPUTER
# First check if the file exists, if yes then return file
# already downloaded else proceed with download
if (file.exists("Data/GitHubRepoSagebrush.zip") == TRUE) {
# File already exists!
print("file already exists and doesn't need to be downloaded!")
} else {
# Download the file
print("Downloading GitHub repository!")
download.file(url = URLrepo, destfile = "Data/GitHubRepoSagebrush.zip")
}
## [1] "file already exists and doesn't need to be downloaded!"
Compressed files can be quite large, and you might want to avoid decompressing them entirely, instead accessing target files and decompressing only those. Here, we practice this approach using GitHubRepoSagebrush.zip and targeting the csv files in the 01_Raw_Data/ folder.
To estimate the size (in bytes) of a file you can use base R function file.size() as follows:
### ~~~ Infer file size of GitHubRepoSagebrush.zip ~~~
# Transform and round file size from bytes to Mb
ZipSize <- round(file.size("Data/GitHubRepoSagebrush.zip")/1e+06,
2)
print(paste("The zip file size is", ZipSize, "Mb", sep = " "))
## [1] "The zip file size is 22.26 Mb"
Finally, we can now i) list all files in the zip file, ii) identify the csv files in 01_Raw_Data/, and iii) save these files on our local computer in a folder entitled 01_Raw_Data/. These files will then constitute the raw data for your subsequent analyses.
### ~~~ List all files in zip file without decompressing it
### ~~~
filesZip <- as.character(unzip("Data/GitHubRepoSagebrush.zip",
list = TRUE)$Name)
### ~~~ Identify files in 01_Raw_Data/ that are csv ~~~ Use
### grepl() to match criteria
targetF <- which(grepl("01_Raw_Data/", filesZip) & grepl(".csv",
filesZip) == TRUE)
# Subset files from filesZip to only get our target files
rawcsvfiles <- filesZip[targetF]
# print list of target files
print(rawcsvfiles)
## [1] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/1_block_8_12_2020 - 1_block.csv"
## [2] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/2_block_8_15_2020 - 2_block.csv"
## [3] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/3_block_8_15_2020 - 3_block.csv"
## [4] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/4_block_8_16_2020 - 4_block.csv"
## [5] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/5_block_8_19_2020 - 5_block.csv"
## [6] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Phenotypes_sagebrush_in_vitro.csv"
## [7] "Sagebrush_rooting_in_vitro_prop-master/01_Raw_Data/Survival_height_clones.csv"
### ~~~ Create local directory to save csv files ~~~ Check
### if the folder already exists if not then creates it
output_dir <- file.path(paste0("Data/01_Raw_Data/"))
if (dir.exists(output_dir)) {
print(paste0("Dir ", output_dir, " already exists!"))
} else {
print(paste0("Created ", output_dir))
dir.create(output_dir)
}
## [1] "Dir Data/01_Raw_Data/ already exists!"
### ~~~ Save csv in output_dir ~~~ Use loop to read csv
### file in and then save it in output_dir
for (i in 1:length(rawcsvfiles)) {
### ~~~ Decompress and read in csv file ~~~
tempcsv <- read.csv(unz("Data/GitHubRepoSagebrush.zip", rawcsvfiles[i]))
### ~~~ Write file in ~~~ Extract file name
csvName <- strsplit(rawcsvfiles[i], split = "01_Raw_Data/")[[1]][2]
# Write csv file in output_dir
write.csv(tempcsv, file = paste0(output_dir, csvName))
}
We can verify that all the files are in the newly created directory on your computer by listing them as follows (compare your results with files shown in Figure 8.3):
# List all the files in output_dir (on your local computer)
list.files(paste0(output_dir))
## [1] "1_block_8_12_2020 - 1_block.csv" "2_block_8_15_2020 - 2_block.csv"
## [3] "3_block_8_15_2020 - 3_block.csv" "4_block_8_16_2020 - 4_block.csv"
## [5] "5_block_8_19_2020 - 5_block.csv" "Phenotypes_sagebrush_in_vitro.csv"
## [7] "Survival_height_clones.csv"
Citations of all R packages used to generate this report.
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. https://github.com/rstudio/rmarkdown.
[2] J. Allaire, Y. Xie, C. Dervieux, et al. rticles: Article Formats for R Markdown. R package version 0.27. 2024. https://github.com/rstudio/rticles.
[3] S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R. R package version 2.0.3. 2022. https://magrittr.tidyverse.org.
[4] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.
[5] J. Cheng, C. Sievert, B. Schloerke, et al. htmltools: Tools for HTML. R package version 0.5.7. 2023. https://github.com/rstudio/htmltools.
[6] R. Francois and D. Hernangómez. bibtex: Bibtex Parser. R package version 0.5.1. 2023. https://github.com/ropensci/bibtex.
[7] C. Glur. data.tree: General Purpose Hierarchical Data Structure. R package version 1.2.0. 2025. https://github.com/gluc/data.tree.
[8] R. Iannone and O. Roy. DiagrammeR: Graph/Network Visualization. R package version 1.0.11. 2024. https://rich-iannone.github.io/DiagrammeR/.
[9] M. C. Koohafkan. kfigr: Integrated Code Chunk Anchoring and Referencing for R Markdown Documents. R package version 1.2.1. 2021. https://github.com/mkoohafkan/kfigr.
[10] Y. Qiu. prettydoc: Creating Pretty Documents from R Markdown. R package version 0.4.1. 2021. https://github.com/yixuan/prettydoc.
[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022. https://www.R-project.org/.
[12] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.1. 2021. https://renkun-ken.github.io/formattable/.
[13] H. Wickham, J. Bryan, M. Barrett, et al. usethis: Automate Package and Project Setup. R package version 3.2.1. 2025. https://usethis.r-lib.org.
[14] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.
[15] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.5. 2022. https://devtools.r-lib.org/.
[16] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC, 2016. ISBN: 978-1138700109. https://bookdown.org/yihui/bookdown.
[17] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.36. 2023. https://github.com/rstudio/bookdown.
[18] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.
[19] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[20] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.44. 2023. https://yihui.org/knitr/.
[21] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.
[22] Y. Xie, J. Cheng, X. Tan, et al. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.34.0. 2025. https://github.com/rstudio/DT.
[23] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.
[24] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.4.0. 2024. http://haozhu233.github.io/kableExtra/.
Version information about R, the operating system (OS), and attached or loaded R packages. This appendix was generated using sessionInfo().
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] repmis_0.5.1 rticles_0.27 DiagrammeR_1.0.11
## [4] DT_0.34.0 data.tree_1.2.0 kfigr_1.2.1
## [7] devtools_2.4.5 usethis_3.2.1 bibtex_0.5.1
## [10] knitcitations_1.0.12 htmltools_0.5.7 prettydoc_0.4.1
## [13] magrittr_2.0.3 dplyr_1.1.4 kableExtra_1.4.0
## [16] formattable_0.2.1 bookdown_0.36 rmarkdown_2.29
## [19] knitr_1.44
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.7 sass_0.4.8 pkgload_1.4.1 jsonlite_1.8.8
## [5] viridisLite_0.4.2 R.utils_2.13.0 bslib_0.5.1 shiny_1.7.5.1
## [9] highr_0.11 yaml_2.3.8 remotes_2.5.0 sessioninfo_1.2.3
## [13] pillar_1.11.1 backports_1.4.1 glue_1.6.2 digest_0.6.33
## [17] RColorBrewer_1.1-3 promises_1.2.1 RefManageR_1.4.0 R.oo_1.27.1
## [21] httpuv_1.6.13 plyr_1.8.9 pkgconfig_2.0.3 purrr_1.0.2
## [25] xtable_1.8-4 scales_1.4.0 svglite_2.1.3 later_1.3.2
## [29] timechange_0.2.0 tibble_3.2.1 generics_0.1.4 farver_2.1.1
## [33] ellipsis_0.3.2 cachem_1.0.8 withr_3.0.2 cli_3.6.2
## [37] mime_0.12 memoise_2.0.1 evaluate_1.0.5 R.methodsS3_1.8.2
## [41] fs_1.6.3 R.cache_0.17.0 xml2_1.3.6 pkgbuild_1.4.8
## [45] data.table_1.14.10 profvis_0.3.8 tools_4.2.0 formatR_1.14
## [49] lifecycle_1.0.4 stringr_1.5.2 compiler_4.2.0 jquerylib_0.1.4
## [53] systemfonts_1.0.5 rlang_1.1.2 rstudioapi_0.17.1 htmlwidgets_1.6.4
## [57] visNetwork_2.1.4 crosstalk_1.2.2 miniUI_0.1.2 curl_5.2.0
## [61] R6_2.6.1 lubridate_1.9.3 fastmap_1.1.1 stringi_1.8.3
## [65] Rcpp_1.0.11 vctrs_0.6.5 tidyselect_1.2.0 xfun_0.41
## [69] urlchecker_1.0.1
What is a Hypothesis? A hypothesis is a tentative, testable answer to a scientific question. Once scientists have a scientific question, they perform a literature review to find out what is already known on the topic. This information is then used to form a tentative answer to the scientific question. Keep in mind that the hypothesis also has to be testable, since the next step is to do an experiment to determine whether or not the hypothesis is right! A hypothesis leads to one or more predictions that can be tested by experimenting. Predictions often take the shape of "If ____, then ____" statements, but do not have to. Predictions should include both an independent variable (the factor you change in an experiment) and a dependent variable (the factor you observe or measure in an experiment). A single hypothesis can lead to multiple predictions.↩︎
GBIF — the Global Biodiversity Information Facility — is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.↩︎
Postdiction involves explanation after the fact.↩︎