To further gain expertise in the field of genomics, students are producing three mini-reports on the following topics:
Time will be allocated in class to work on these mini-reports, but the instructor expects students to complete those on their own time.
To have a full overview of available sequencing platforms, students are tasked to produce individual mini-reports on the following sequencing platforms and their associated technologies:
Each student was assigned a sequencing platform to work on (please see Google spreadsheet for more information).
The individual assignment is mandatory and will be graded. Please see this task as an opportunity to consolidate your knowledge on next-generation sequencing technologies and writing scientific reports.
Time will be allocated in class on the 19th of January 2024 to allow students to work on their assignments. Reports are due on the 2nd of February 2024 and should be uploaded on the shared Google Drive in the Mini_Report_1
folder.
Please structure your reports to ensure covering the following aspects:
For this assignment, students are not allowed to use AI to complete their work.
Maximum two pages, single spaced, using either Arial 11 pt or Times New Roman 12 pt, and with all margins set to 1 inch. The two pages do not include a title page and a references section.
The title page should include the following information:
The references section should contain full citations of all references cited in the text. Each reference cited in the References section should be clearly cited in the text to support transparency in your work and reproducible science.
A citation is a way of giving credit to individuals for their creative and intellectual works that you utilized to support your research. It can also be used to locate particular sources and combat plagiarism. Typically, a citation can include the author’s name, date, location of the publishing company, journal title, or DOI (Digital Object Identifier). Citation styles dictate the information necessary for a citation and how the information is ordered, as well as punctuation and other formatting. There are different citation styles (see Figure 2.1) and the instructor let you pick the one you like. However, please make sure to only use one citation style in your mini-report.
In the case of web resources, please provide the following information in your citation: author, title (of the web page), URL and accessed date. Here is an example:
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcostsdata. Accessed 2023-01-16.
Please keep in mind that a figure can convey a lot more information than a long text! Your reports should follow the structure presented in section 2.2.
Name your report document following this pattern: Sequencing platform_Surname
.
The instructor has provided pdf
documents devoted to each sequencing platform below to support students in starting their assignments. Please see Satam et al. (2023) for a review on this subject and Marx (2023) for more information on long-read sequencing.
In addition, please use information available from manufacturers’ websites (see below), publications, online NGS training sites (e.g. EMBL-EBI Training), Google, YouTube and Wikipedia to prepare your reports and presentations. Do not forget to provide citations supporting your claims in your reports. I let you decide which reference style you want to use (look at your favorite journals for examples).
Here are a few additional online references to support your assignment:
To have a full overview of available molecular biology databases and assess their contributions to genome annotation, students are tasked to produce individual mini-reports on the following molecular databases:
Each student was assigned a molecular database to work on (please see Google spreadsheet for more information).
This individual assignment is mandatory and will be graded. Please see this task as an opportunity to consolidate your knowledge on molecular databases and writing scientific reports.
Time will be allocated in class on the 2nd of February 2024 to allow students to work on their assignments. Reports are due on the 21st of February 2024 and should be uploaded on the shared Google Drive in the Mini_Report_2
folder.
Please structure your reports to cover the following aspects:
Follow the same reporting format than with Mini-Report 1.
Name your report document following this pattern: Molecular database_Surname
.
The instructor provided below short descriptions of each of the major types of molecular biology databases. Students are encouraged to use the information provided below to prime their research and start populating their reports. Please do not forget to provide citations and references in your reports following information provided in Mini-Report 1.
There are three major protein sequences databases:
In 2002, these three databases coordinated their efforts to form the UniProt consortium (Consortium, 2014). The partners in this enterprise share the database, but continue to offer separate information-retrieval tools for access. The UniProtKB database is available at this URL: http://www.uniprot.org. This latter database will be especially important to retrieve information on gene annotations (Chapters 6, 7) using the online tool available at this URL: http://www.uniprot.org/uploadlists/. This will be done by specifically looking at Gene ontology (see GeneOntology).
Gene Ontology (GO, Consortium, 2020) (http://www.geneontology.org): the framework for the model of biology. The GO defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:
GO is also available for all the protein sequences deposited on the UniProt database. Other useful tool to perform gene annotations:
The Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al., 2022) (http://www.genome.jp/kegg/) collects individual genomes, gene products, and their functions, but its special strength lies in its integration of biochemical and genetic information. KEGG focuses on interactions: molecular assemblies, and metabolic and regulatory networks. KEGG organizes five data types into a comprehensive system:
The catalogs of chemical compounds and genes contain information about particular molecules or sequences. Genome maps integrate the genes themselves according to their chromosomal position. Pathway maps describe potential networks of molecular activities, both metabolic and regulatory. One enzyme in one organism would be referred to in KEGG in its orthologue tables, which link the enzyme to related ones in other organisms. This permits analysis of relationships between the metabolic pathways of different organisms.
The following learning outcomes will be taught as part of this Mini-Report:
R
(R Core Team, 2016) packages.MUSCLE
(Edgar, 2004) implemented in MEGA
(Kumar et al., 2018).R
(R Core Team, 2016) packages.This mini-report is mostly based on the following publication:
In each section, references are provided to help you master the material taught in this assignment.
This report is subdivided into four parts:
Please see the Timetable for more details on the schedule.
Projections of global biodiversity have ranged from 2 to 100 million species (Larsen et al., 2017).
But these estimations are without accounting for cryptic species! See for instance Dentinger and Suz (2014) for an example on porcini. In this study, the authors used DNA-sequencing to identify three species of mushroom contained within a commercial packet of dried Chinese porcini purchased in London. Surprisingly, all three have never been formally described by science and required new scientific names.
Larsen et al. (2017) recently published a keystone paper where they predicted 1 to 6 billion species on Earth! This prediction was based on an average of 6 cryptic species per described species.
Overall, data presented here demonstrate that most species on this planet are either poorly known or pending description.
In this context, the fields of genetics/genomics (more specifically DNA barcoding) and phylogenetics have the potential to contribute to:
Approaches combining these two objectives were applied across lineages. See for instance, an article describing a new species belonging to the soapberry family (Sapindaceae) endemic to Fiji (Buerki et al., 2017). In this study, the phylogeny is used to confirm the new taxon and place it within a larger evolutionary and biogeographical framework. Finally, this evidence is coupled with occurrence data to infer a species extinction risk following IUCN guidelines.
The Consortium for the Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species. CBOL has more than 130 Member Organizations from more than 40 countries (CBOL, 2021).
DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections, an individual DNA sequence can be used to uniquely identify an organism to species. This is analogous to a supermarket scanner using the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database (CBOL, 2021). DNA barcodes are sometimes used in an effort to identify unknown species, parts of an organism, or simply to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species boundaries. For more details on DNA barcodes applied to plants, please see section below dedicated to vanilla.
DNA barcoding has a wide range of applications from biodiversity surveys (e.g. Telfer et al., 2015), illegal trade of wildlife (e.g. Gonçalves et al., 2015) to food security (e.g. Quinto et al., 2016). The DNA barcoding approach has been reviewed by DeSalle and Goldstein (2019).
A DNA barcoding approach is subdivided into four steps:
As described in Masters and Pozzi (2017), phylogenetic inference is the practice of reconstructing the evolutionary history of related species by grouping them in successively more inclusive sets based on shared ancestry. Homologous characters in independent lineages are similar because they have been inherited from a common ancestor, and they alone should be used in phylogenetic reconstructions. Homoplasies are characters that appear similar, but have evolved from different ancestral states. They may mislead interpretations of evolutionary history. Both molecular and morphological datasets are subject to obfuscation by homoplasy. Methods of phylogenetic inference aim to distinguish between homologous (signal) and homoplastic (noise) resemblance. Molecular datasets tend to be very large and are analyzed using statistical techniques that fit the data to models of molecular evolution. These methods are not well suited to morphological data, and combined analyses including both kinds of data tend to obscure the morphological signal. Rates of molecular change may be used to estimate divergence ages.
In this mini-report, we will be focusing on inferring phylogenetic trees based on DNA sequences obtained through the DNA barcoding approach described above.
A phylogenetic analysis is subdivided into four steps:
Like porcini (Dentinger and Suz, 2014), vanilla is one of our most beloved spice, but we still know very little about its biology, taxonomy and evolution (see Ellestad et al., 2021 for a review). As part of a study aiming at inferring domestication processes in the genus Vanilla and species adaptation to drought, fieldwork was conducted in Mexico by Paige Ellestad (an EEB Ph.D. student at BSU) and her in-country partners.
For her Ph.D. dissertation, Paige is applying a DNA barcoding approach coupled with phylogenetic inference (and ecological niche modeling) to answer the following questions:
Q1: What species of vanilla are cultivated in Mexico?
Q2: How are those species related and is there evidence of gene flow/hybridization?
Q3: What level of genetic diversity is there among each species?
To answer these questions, she sampled >30 locations (mostly plantations, but also wild populations) representing >60 samples and sequenced two DNA barcode regions widely used in plants: one chloroplastic (the coding rbcL region) and one nuclear ribosomal (the ITS region, which stands for for “Internal Transcribed Spacer”). These two DNA regions are traditionally used by researchers as DNA barcodes supporting species delimitation and identifications. Since you are already familiar with rbcL (see Hollingsworth et al., 2009 for more details), we will be providing more details on the ITS region. The nuclear ribosomal ITS region includes part of 18S, ITS1, 5.8S, ITS2 and part of 26S (see Figure 4.1, Cheng et al., 2016). This region is repeated in tandem 1000 of times and duplicated across chromosomes in order to produce ribosomes, which are key to produce proteins. This latter DNA region is shared across kingdoms and plant specific primers have to be used otherwise we may be at risk of amplifying this region for symbiotic organisms (e.g. fungi, see Cheng et al., 2016).
DNA extractions of leaf material were conducted at BSU using the Qiagen Plant Mini kit followed by PCR amplifications using plant specific primers [for instance, the IT4p and ITS5p primers for the ITS region; see Cheng et al. (2016)]. Sequencing was outsourced to GENEWIZ and performed using Sanger sequencing technology. PCR amplicons/fragments were only sequenced using forward primers. As per provider, we are expecting high quality DNA sequencing read lengths of up to ca. 1000 bases. The company also mentions that a typical read would provide 800 bases with Phred score of 20.
For this assignment, students will investigate the following question:
Q1: To what species of vanilla do the individuals presented in Figure 4.2 belong to?
To answer these questions, we will be focusing on four samples/individuals of vanilla (see Figure 4.2) and their associated ITS sequence data as well as ITS sequences deposited on GenBank. See the scientific process section below for more details.
Throughout this project, we will be referring to these vanilla samples/individuals as follows (see Figure 4.2):
Our overachieving question is as follows:
To what species of vanilla do individuals presented in Figure 4.2 belong to?
Based on evidence detailed above, our hypothesis is the following:
Individuals belong to the same species (V. planifolia) and thr observed phenotypic variation is due to phenotypic plasticity (= exhibiting responses to contrasting climates).
To test our hypothesis, we are applying the phylogenetic species concept (Wheeler, 1999) (mainly criterion of monophyly) using DNA barcoding data analyzed using a phylogenetic approach. If our hypothesis is correct, then we predict that all the individuals will form a monophyletic clade and fall with other samples belonging to V. planifolia.
To test this hypothesis, we have generated ITS barcode data for individuals in Figure 4.2 and aim at comparing them with data from GenBank (= reference data).
Our analyses will be done using the following methodology:
R
.R
to test working hypothesis.Finally, we will intersect evidence from all our analyses (especially phylogenetic clustering combined with node supports and taxonomy of ITS sequences from GenBank) to test our hypothesis and infer species identity of our target individuals. Please see section 4.9 for more details our the formatting of the report.
The instructor provides below information to complete the Mini-Report 3. All the material, data, code and references required to complete this report are provided here and are covered in class. In this context, students will have to focus on formatting their reports following guidelines presented here as well as making sure that their R
code is working and ready to be shared.
Your reports will be structured and organized following the IMMRAD format: Introduction, Material & Methods, Results, and Discussion. This format is widely used to report experimental research in many scientific disciplines. In addition, you will be complementing your reports with an Abstract (see below). Finally, since this research relies heavily on bioinformatics, students will complete their reports by supplying their commented R
scripts.
The instructor provides guidance on content for each section of your individual reports:
R
code for further details. The subsections are as follows:
R
code has to be provided as an Appendix either by directly coping your code in the main document under this section or by providing the name and location of the R
script in the Appendix section (e.g. by providing a text similar to this: “All the R code associated to this research can be downloaded at this path.”).Individual reports should be submitted in either Google Docs, Word or Rmarkdown formats (and if you decide to submit an additional file containing your R script, submit this file as an .R
format). These files should be uploaded on the shared Google Drive in the Mini_Report_3
folder in a subfolder entitled as follows: Mini_Report_3_Surname
.
Upload your reports on the shared Google drive by the 29th of March 2024 before 5 PM MT.
To support mastering the learning outcomes, we will be using a subset of a DNA dataset focusing on the vanilla spice (from the Orchidaceae family) published by Ellestad et al. (2022).
The data for this assignment are deposited on the shared Google Drive in a folder entitled DNA_barcoding
(located in Mini_Report_3
). All students enrolled in this class have been granted access to this folder. The folder is subdivided into four sub-folders (see Figure 4.3) and contains a master spreadsheet Vanilla_samples_records.xlsx
at its root:
01_Field_images
: jpeg
images of samples.02_Raw_ITS_data_ab1
: .ab1
sequence electropherograms of ITS sequences.03_Processed_ITS_data_FASTA
: Folder where cleaned ITS sequences will be saved in FASTA
format.04_Data_analyses
: Folder where we will be saving outputs of analyses conducted in this project.Vanilla_samples_records.xlsx
contains meta-data information about the samples. We will also update this file with information on species hypotheses.PART2_Vanilla.R
is an R script containing code for the part 2 of this mini-report.The data availability statement associated with Ellestad et al. (2022) is as follows:
All sequence data for this project are available at the National Center for Biotechnology Information (NCBI) under GenBank ITS sequences ON525161–ON525228, GenBank rbcL sequences ON531917–ON531986, BioProject accession no. PRJNA841950, and BioSample accession nos. SAMN28632719–SAMN28632734. All raw sequence files are available from the NCBI SRA database, nos. SRR19374405–SRR193744012, SRR19374414, SRR19374417, and SRR19374418. DNA alignments are available at Zenodo: https://doi.org/10.5281/zenodo.6577744
Process and clean ITS DNA sequence electropherograms and infer species working hypothesis.
The bioinformatic tools used in part 1 are as follows:
FinchTV
: A popular desktop software developed by Geospiza, Inc. for viewing trace data from Sanger DNA Sequencing. FinchTV
is freely available and operates on Windows and Mac platforms. Start by downloading and installing the software on your computers at this URL: https://digitalworldbiology.com/FinchTVBLAST
: The Basic Local Alignment Search Tool (BLAST, Altschul et al., 1990) will be applied onto cleaned DNA sequences to:
Although BLAST
can be run locally, we will be using the web portal available here: https://blast.ncbi.nlm.nih.gov/Blast.cgi
When evaluating .ab1
files (= raw DNA data from an Applied Biosystems’ Sequencing instrument containing an electropherogram showing the Phred scores and the DNA base sequence), you should first see the electropherogram and come to a conclusion whether your data can be considered of good quality or not.
Good quality sequencing data/positions are characterized by:
Bad quality sequencing data/positions are characterized by:
We will be discussing this topic further during class.
Nuclear DNA regions such as ITS could show evidence of recombination. This means that there could be polymorphism at a specific base [also know as single-nucleotide polymorphism or SNP; see Poplin et al. (2018) for bioinformatics technics to identify SNPs based on NGS data]. The signature of recombination in an electropherogram would be recognized by the occurrence of “peak under peak” (Figure 4.5). The International Union of Pure and Applied Chemistry (IUPAC) has defined a standard representation of DNA bases by single characters that specify either a single base (e.g. G for guanine, A for adenine) or a set of bases (e.g. R for either G or A). UCSC uses these single character codes to represent multiple observed alleles of single-base polymorphisms (Table 4.1).
IUPAC nucleotide code | Base |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T (or U) | Thymine (or Uracil) |
R | A or G |
Y | C or T |
S | G or C |
W | A or T |
K | G or T |
M | A or C |
B | C or G or T |
D | A or G or T |
H | A or C or T |
V | A or C or G |
N | any base |
. or - | gap |
Here, we will be using its25-ITSp4.ab1
as an example for the analysis. This file is located in DNA_barcoding/02_Raw_ITS_data_ab1
.
DNA_barcoding/
folder onto your computers.FinchTV
..ab1
file (one at a time) using the File --> Open...
tab or by dragging your .ab1
file in the main window.Base Position Numbers
, Base Calls
and Quality Values
settings are ticked using the View
tab (Figure 4.4).Shift
command and then pressing Delete
(Figure 4.6).FASTA
format in 03_Processed_ITS_data_FASTA
. Exporting the cleaned sequence is done by pressing File -> Export -> DNA Sequence: FASTA
. Please do not rename file, leave it as proposed by FinchTV
. The file is saved in interleaved FASTA format as shown in Figure 4.9.>
; see Figure 4.10).BLAST
button to start your query.Click on the Distance tree of results
link to perform phylogenetic distance analysis showing position of your sequence compared to sequences available on GenBank (see Figure 4.11). This will open a new window.
Expand tree to locate your DNA sequence by following procedure in Figure 4.12.
Open Vanilla_samples_records.xlsx
and update Species_BLAST
column with a species name (your first working hypothesis) and add the GenBank accession number of the most closely related DNA sequence deposited on GenBank. Don’t forget to save the file.
Repeat this procedure until you analyzed all the ab1
files.
DNA sequence retrieval and prepare data for analyses.
To execute Part 2, you need to install the following software and R packages on your computer:
R
: https://www.r-project.orgRStudio
: https://rstudio.com
R
package:
If you don’t know how to install an R package, don’t worry, this topic is covered here.
Please find below two documents providing a comprehensive introduction to R:
R for beginners (a tutorial by Emmanuel Paradis): https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf An introduction to R: https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
RStudio is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard RGUI, but it is considerably more user friendly. It has more drop-down menus, windows with multiple tabs, and many customization options (see Figure 4.14). Detailed information on using RStudio can be found at at RStudio’s website.
Please consult this RStudio article to learn more about procedures to edit and execute code in the RStudio environment.
Tip: To execute a line of code and send it to the Console you can press Ctrl+Enter
on Windows or Command+Enter
on Mac or use the Run toolbar button (see Figure 4.14).
In this course, we will be using built-in R functions that are implemented in packages.
Functions are useful when you want to perform a certain task multiple times. A function accepts input arguments and produces the output by executing valid R commands that are inside the function.
Arguments have associated data types that need to be entered by the user to execute the function and retrieve its output(s).
The basic R data types are as follows:
numeric
: Numbers, written as either integers or decimals.integer
: Whole numbers without any decimal point.character
(a.k.a string): A sequence of characters (declared between “” or ’’)logical
(a.k.a. boolean): Binary values, TRUE or FALSE.vector
: Elements of a vector using subscripts (using this syntax c(1,2,3)
).matrix
: A matrix from the given set of values (using this function matrix(ncol = 2, nrow = 3)
).data.frame
: Data frames are data displayed in a format as a table. Data frames can have different types of data inside it. While the first column can be character
, the second and third can be numeric
or logical
. However, each column should have the same type of data.We store/save outputs of built-in functions in variables. R
does not have a command for declaring a variable. A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <-
sign. For instance:
# Assign output of simple math to variable
x <- 2+2
# You can call variable as follow
x
## [1] 4
To know the class of data stored in a variable, you can use the class() function as follows:
# What is the class of x
class(x)
## [1] "numeric"
Finally, you can retrieve the documentation associated with a built-in function by using the following syntax:
# Pull up documentation for class function
?class()
To infer the ML phylogenetic analysis, the following workflow will be executed:
R
package (Winter, 2017) to download DNA sequences and metadata.FASTA
and csv
formats..seq
files from Part 1 into a FASTA
object/file using the following approach:
.seq
files, open them individually and merge them into one object (ultimately saved as a file) using a for
loop.FASTA
objects/files from steps 1 and 2 to perform a multiple sequence alignment (msa).Protocols associated to each step within the workflow are detailed below.
Here, we are demonstrating an approach to search GenBank for ITS sequences of Vanilla species using rentrez (Winter, 2017). Students are tasked to study/examine and execute the code and finally adapt it to search for ITS sequences. The code is subdivided into the following steps:
FASTA
and csv
formats.The following method (or pseudo-code) is applied to download (here referred to as “fetch”) data from Entrez using rentrez (Winter, 2017):
Before delving into the code, do the following:
RStudio
..R
script (File > R Script
)..R
script at the root of your project (DNA_barcoding/
) with the following name 01_PART2.R
.All the R code provided below will be copied into your newly saved R
script and executed in class together.
This R code uses rentrez functions to interact with GenBank and the nucleotide database to remotely retrieve data based on your query.
###~~~
#Check if package is installed if not then install it
###~~~
if("rentrez" %in% rownames(installed.packages()) == FALSE){
print("Install rentrez")
install.packages("rentrez")
}else{
print("rentrez is installed!")
}
###~~~
#Load package
###~~~
library(rentrez)
###~~~
#Set working directory
###~~~
#Set working directory to path leading to DNA_barcoding folder
# WARNING: This path as to be adapted to match your computer
setwd("~/Documents/Class_Genomics&Bioinfo_Spring/DNA_barcoding/")
#Check that working directory is set correctly
getwd()
###~~~
#Build a query
###~~~
#Taxon
sp <- "Vanilla"
#DNA region: here ITS
DNA <- "internal transcribed spacer"
#Organism
org <- "Plants"
#Build query: sp AND DNA region
query <- paste0(sp," [All Fields] AND ", DNA," [All Fields] ", org, " [filter]")
Boolean operators provide a way of generating precise queries that produce well-defined sets of results. The Boolean operators used in Entrez and how they work are as follows.
Entrez requires the Boolean operator AND to be entered in uppercase. This is not required in all databases for the other two operators, but it is simplest to enter all of them in uppercase:
promoters OR response elements NOT human AND mammals
Entrez processes all Boolean operators in a left-to-right sequence. Enclosing individual concepts in parentheses changes this priority. The terms inside the parentheses are processed first as a unit and then incorporated into the overall strategy. For example, in the following search statement, the union of response element and promoter results is generated first and then is intersected with the result of the g1p3 search.
g1p3 AND (response element OR promoter)
###~~~
#Retrieve DNA accessions in GenBank
###~~~
GBresults <- rentrez::entrez_search(db = "nuccore", term = query, retmax = 50000)
#How may hits did we get
print(GBresults$count)
Even when you execute a query directly on the GenBank portal, there will always be sequences that do neither match your target taxon (here Vanilla) nor your target DNA region. In this context, it is paramount to retrieve meta-data associated to the DNA accessions (stored in GBresults
) in order to clean-up your dataset prior to analyses (see next step).
For each DNA sequence the following meta-data are gathered using functions implemented in rentrez:
Please see the R
code below for more details on the procedure to retrieve the meta-data.
Disclaimer: The R code below might stop because you do not have an API key registered to NCBI and the system might time you out. If it is the case, you will have to edit the for
loop to pursue downloading the data.
###~~~
#Fetch meta-data associated to sequences
###~~~
#Use loop to automatically retrieve species,
# seq definition line, seq length and DNA sequence associated
# to each DNA accession
#Create empty matrix to be populated
OUT <- matrix(ncol = 5, nrow = length(GBresults$ids))
colnames(OUT) <- c("GenBankID", "Species", "Definition", "Seq_length", "Sequence")
#Add GenBank ID
OUT[,1] <- GBresults$ids
print("Processing sequences: fetching meta-data")
#Set a progress bar
pb <- txtProgressBar(min = 0, max = length(GBresults$ids), style = 3)
for(i in 1:length(GBresults$ids)){
#Wait time to avoid being timed out by NCBI
# but it still might happen because you don't have an API key
Sys.sleep(5)
#Print iteration number to assess progress
# This info will help reset the loop if your are timed out
print(paste("Iteration", i, "of", length(GBresults$ids), sep= ' '))
#Download sequence
seq <- entrez_fetch(db = 'nuccore', id = GBresults$ids[i], rettype = 'fasta', retmode = "text")
#Extract definition line
OUT[i,3] <- strsplit(seq, split = "\n")[[1]][1]
#Infer seq length
nbp <- strsplit(seq, split = "\n")
OUT[i,4] <- as.numeric(length(strsplit(paste(nbp[[1]][2:length(nbp[[1]])],
collapse=''),"")[[1]]))
#Extract sequence
OUT[i,5] <- as.vector(paste(nbp[[1]][2:length(nbp[[1]])], collapse = '')[1])
#Fetch taxon ID associated to GenBank accessions
taxID <- entrez_link(dbfrom = 'nuccore', id = GBresults$ids[i], db = 'taxonomy')
#Extract taxonomy: genus and species epithet
tmp <- strsplit(
strsplit(entrez_fetch(db = 'taxonomy', id = taxID$links, rettype = "native"),
split = '\n')[[1]][1]
, split = ' ')
OUT[i,2] <- paste(tmp[[1]][2:length(tmp[[1]])], collapse = ' ')
# update progress bar
setTxtProgressBar(pb, i)
}
close(pb)
SEQ <- as.data.frame(OUT)
###~~~
#Write raw data in 04_Data_analyses
###~~~
FileIDRawcsv <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "Raw_GenBank.csv", sep = '_')
#Write FASTA file
write.table(SEQ,
paste("04_Data_analyses/CSV/",
FileIDRawcsv, sep = ''), row.names = F, col.names = T, quote = T)
Let’s have a look at the data downloaded from GenBank:
## [1] "GenBank query retrieved 185 DNA sequences."
## GenBankID Species
## 1 1708599397 Vanilla planifolia
## Definition
## 1 >MN221421.1 Vanilla planifolia voucher PDBKTMA2013-1860 external transcribed spacer, partial sequence; small subunit ribosomal RNA gene, internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and large subunit ribosomal RNA gene, partial sequence
## Seq_length
## 1 9463
## Sequence
## 1 CGCGGCTGAGGGCAACGCCACGCCGCGCGGGGCGTGCGGTCGTGGCCAATGATTAGGCGCGCGCGGCGCGCCGTGCAGGGCACAACGTTGCAACCCCGGGCGCGCGGTCGGGGGCACCGCGCCCCTGCGCACCGCCGCGGGCACCGCGCACCTGCGTGCGGTCCTGTGCACCGCGCAGGTTGGTTATGTTGGCTGATGAGGGGACAAAAAAGTGCAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCTTCTGCACGGATGCGGCGACAACCAATTGGTAGTCTTGCCTCGGCGCATGATGGCATCGTCGCAAAAAAAAAAATCCGAGTTGCGGGCGTGTGCATCGCGCGCCTGACCCGGGCGTGGGCACCGCGCTCCGGCGTGGGGGCATGTTCATCGAGCTCCTGCTTGCGTGCTTTCGAACCGCGCTCCTGTTGGCTTGGCTATCCGCGCGTTCTCGGCGACTAGGTGGGCCGGCATCGTTGCGCTGCAGAACAGAGCAACTTTCGTCGGCATCTTTAGCCCTGTGTTCTCGCGAGGGGACCCAAAAGTGTAACTTTTTTTGGGATTTTTAGCGCCGCGGGCTGCTTCTGCTTGGATATGGTGACAACCCATTGGTAGTCTGCCTCGTCCCAAGATGCGATTGTCGCAATACACGATCGGAAATGTGGKCATGTTCATAACGCACCTGCGGGCGGGCATGTGCACGGCGCGCCGGCSCGTGGGCATTGGCACCGCGCTCCCGCGGGCCGTGGAGTGCACCGCCCCCCGCGCGCGGCGCGCGTCCGTGGGCACGGCTCGGGTGCGTGCGGCGCGACGTGAATGGCACCGCACCGCCGCGCGGATCGTCGATGGCACCGCGCCGCCGCGAGGGTCGCGTGGTCGTGTGCACCTCGCAGGCGGGCGCGGCGCGCCGTGGATTGAACCGCACCGTCGCCCGGGTCGCGCGGCGGTGTGCACCTCGGAGGCTCGCGCCGCGCTTGGAGCGCGATCGAGGGCGCCGCGCCGCGGCGCGCGGTCGTGCGACCCGCGTAAGCGCGCGAGGCGCGTCGGCGCAGGCACCGCGCCCCTACGCGCCCGGCGCGGCCGCGGGCACCGCGCGTCTGGGTGCTGGCCTTTGCGCCTCGCTCCTGCGCACGGGCATGTGCACCGCGCCCCAGCGCGCGGGCATCTCCCCGCGCCCCAGTGCGTGTGCATGTGGACTGCTCTCCCGCGCGCGGCAGCGTGCAACGCGCTCCTGCTGTTTTTTTTGTGGAGTGCGATGTGGTGCACTGTCTGAACTGCTGCCCTCTTCCATCTGCTTTGGGCTTTGTTTCTTAGTGGTGGTTCGTTTTGTCAATTTGGTTGGAGAAATGTGCAAACTCTTTTCGATGTTATGTCATATCTCCGGAGGGCCAATGTAGGGATGAGGTGTTGCTCAAATCGTGTCGTTTAAGCGGTGCAATTTTTGCTTGGCTCTTTGTCGGTTGGGCGTGATTTTCGCTTTGCATGTGGTGCGAGGTATCGCTTGTCGTGCTATGCGGGTGTTAGCCAATGCCTTCATCCCTAATTAGACCCTTTGCGCTTCCTGTTTTAAACATCGTTGAGGCAATTTGGGCTAATGGAGTTGGATTGGGGAGATGCCGTTTTTGCACAAGAGATGTTTGTTGTGGCGCTACTCTGTATGGCTTGGTCACCTCGTGGTGTGGTCTGTCAATACATGAGTAGGGCTCTGTCACTTCGTTGAACCTTCATACGGAGGAAGTCCGCATGTTGTTTTGCCTTTCTATCGGATGGGTAGGTTCCGTTCTATTATTTTTTATGGCGCCGCACGACTACGGTGTTGGAATGGTGGCGTTACGCGATGCCGTCGAACGCATGATCGTACATGCTCCCCTGTTGGCAAGGGCTCGGAAATCACACGTTTCGGTCATATCGTTCTCCAACAAGGAGGGTTGTCCCTGAGGAATGTTATTTGTCGAGAGGCAATGTTTCGGAACTTGGGCGATGGTATCCCAACCTGGGAGCATGCGCTCTTTCCTTTGTGGAGTATCGCCAGACACGACGAACATGTCAAATTACGACATGGGGTTTTGCTTGGCGTTTCATCATTGTATTGCTACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAGTATGAACAATTTCAGACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTGTTTGATGGTACTTGCTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCACCAAACCCCGACTTTTGGAAGGGATGCATTTATTAGGTAAAAGGTCGACGCGGGCTTTTGCCCGGCTCCTTGACGATTCATGATAACTTGTCGGATCGCACGGCCTTCGTGCCGGCGATGCATCATTCGAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGGGGCCTACCATGGTGGTGACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAAATAACAATACCGGGCTCCACGAGTCTGGTAATTGGAATGAGTACAATCTAAATCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGACCTTGGTTTGGGTCGGTCGGTCCGCCTTTTGGTGTGCACCGCCCGCCCTGATCCTTTTGTCGACGATGCGGTCTGGCCTTAGCTGGCCGGGTCGTGCCCTCGGCGTTGTTACTTTGAAGAAATTAGAGTGCTCAAAGCAAGCCCACGCTCTGGATACATTAGCATGGGATAACATCACAGGATTTCGATCCTATTGTGTTGGCCTTCGGGATCGGAGTAATGATTAAGAGGGACAGTCGTGGGCATTCGTATTTCATAGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACCACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGCTCGAAGACGATCAGATACCGTCCTAGTCTCAACCATAAACGATGCCGACCAGGGATTGGCGGATGTTGCTCTTTGGACTCCGTCAGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAGCTTACCAGGTCCAGACATAGCAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCTCAGCTTGCTAACTAGCTATGCGGGGTGCAAGCCCTGTGGCCAGCTTCTTAGAGGGACTATGGCCGCTTAGGCCATGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGTATTCAACGAGTCCATTGCCTTGGTCGAAAGGCCTGGGTAATCTTATGAAAATTTCATCGTGATGGGGATAGATCATTGCAATTGTTGGTCTTCAACGAGGAATTCCTAGTAAGCGCGAGTCATCAGCTCGCGTTGACTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAATGGTCCGGTGAAGTGTTCGGATCGCTGCGATGCGGGCGGTTTGCCGCGTGCGACTCGGCGAGAAGTCCACTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGACGAGAGGTTCGAACGATGGAACGATCTGTCCAACGCGTGGGAGTGCGACGGTTGTTCGATGTCGCGTTCTTTCGTAGCGCGTGCTCTCGCGATCACGCGGAGCTCGACGTTATGGGGGATAAACAAAAGCTTATGGGCGTGGTCAGGCGCCAAGGGAGAGCAAATGTTAGGCTGGCAAACGAGTATGCCGTGCGTCGTCAGGTCCATTGAGTCCTGGCAATCGAACGACTCTCGACAACGGATATCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATACGTGTTGTGAATTGTAGAATCCCGTGAACCATCCATTTTTTGAACGCAAGTTGCGCCCGAGGATGCAAGCCGAGGGCACGCCTGCATGGGCGTAATGCGACCCGTCGCTCCCTGCGGAAGCCTGGAATCTTTCGGTTTGGTTCCTCGGCCCCGTCGCGGAGGCGATTGATGGCACCCCGTGCGATGGCACGGCGTGTTGAAGCTTGGGGCGACGGTCGACTTCGACACGATACGAGGTGGACGCCACTGGCTGTCGTGGTGTTGGCCAGCAAGAATCGATGTTGTTGTGCGACGAGCAGATGCCCCGCATAGATCCTGCTCCGTCCTCCACGGTGTGGAATCGTGACCCCATGTTAGGTGAGGCTACCCGCCGAGTTTAAGCATATAAATAAGCGGAGGAGAAGGAACTTACAAGGATTCCCTTAGTAACGGCGAGCGAAACGGGACCAGCCCAGTTTGGAAATCGGGCAGCCTGAAGCCTGAATTGTAGTCTGGAGAGGCGTCCTCAGCGACGGATCGGGATCAAGTCCCCTGGAAAGGGGCGCCGGGGAGGGTGAGAGCCCCGTTCGGCCCGTACCCTGCTGCTACACGAGGCGCCGTCAACGAGTCGGGTTGTTTGGGAATGCAGCCCAAATTGGGTGGTAAATTCCGTCCAAGGCTAAATAATTGCGAGAGACCGATAGCGAACAAGTACCGCGAGGGAAAGATGAAAAGGACTTTGAAAAGAGAGTCAAAGAGTGCTTGAAATTGTCGGGAGGGAAGCAGATGGGGGCCGGCGGTGCGCCTCGGCTGGATGCAGAACGTCGAATGACGGTTTGCTGCACGGCTCGAGGAGCGGACCGTCTCGGGCCATCGCGGCGACCGGAGCCCGGGCGCACGTCGCTCGTGGAGAAATCGTCGGCGTGGCCGATCGCAATGCCCGCGCCATCGAGGCGTGCCACGCGGCACCGCGTGCATTGGTGATGGCCAGTGGGCTCCCCATCTGACCCGTCTTGAAACACGGACCAAGGAGTCTGACATGCTTGCGAGTCGACGGGTGGGCAAGCCCGGAAGGCGCAAGGAAGCTGATTGGTGGGATCCCCGTTTGAGGGGTGAGTCATCGACCGACCCAGATCTTTTGTGAAGGGTTCGAGTGAGAGCATGCCTGTCGGGACCCGAAAGATGGTGAACTATGCCTGAGCGGGGCGAAGCCAGAGGAAACTCTGGTGGAGGCCCGCAGCGATACTGACGTGCAAATCGTTCGTCTGACTTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTTCGAAGTTTCCCTCAGGATAGCTGGAGCCACAGTCGGAGTTCTATCGGGTAAAGCCAATGATTAGAGGCATCGGGGGCACAATGCCCTCGACCTATTCTCAAACTTTAAATAGGTAGGAGGGCTCGGCTGCTTTGATGAGTTGAGCCAAGGAATCCAGGCTCCAAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAAAGCCGGGTTACGGTGTCCAACTGCGCGCTAACCTAGAGCCCACAAAGGGTGTTGGTCGATTAAGACAGCAGGACGGTGGTCATGGAAGTTGAAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCCGAAAATGGATGGCGCTTAAGCGTGCGACCCACACCTGGCCGTCGGTGCAATGCAAGGCCCCGACGAGTAGGAGGGTGCAACGGTCGCTGCAAAACGTGGGGCGTGAGCCCGCGTGGAGCGTCTGTTGGTGCAGATCTTGGTGGTAGTAGCAAATATTCAAATGAGAACTTTGAAGGCCGAAGGGGGGAAAGGTTCCATGTGAACGGCACTTGCACATGGGTTAGCCGATCCTAAGAGACGGGGGAAACCCGTCGGATAGTGCCATTCGGCACGAGCTTCGAAAGGGAGTCGAGTTAAAATTCTCGAGCCGGGATGTGGCGGTCGATGGCAACGTCAAGTGGTCCGGAGACGTCGGTGGGGGCCTCGGGAAGAGTTATCTTTTCTGCTTAACAGCCCATCGGCCCTGGAAACGGCTCAGCCGGAGGTAGGGCCAAGCGGTTGGAAGAGCACCGCATGTTGAGCGGTGTCCGATGCGCCCCTGGCGACCCTTGAAAATCCGGATGGCCGAGTGCCTTCCACGCCCGGTCGTACTCATAACCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGATGGAGCAATGTAGGCAAGGGAAGTCGGCAAAATGGATCCGTAACTTCGGGAAAAGGATTGGCTCTGAGGGTTGGGCACGGGGGTCCCAATCCCGAACCCATCGGCTGTCGGCGGACTGCTCGAGCTGCTTGCGCGGCGAGAGCGGGTCGACGCGTGCCGGCTGGGGGACGAATTGGGAGCGGCCCCTTCGGGGGCCTTCCCCGGGCGTCGAACAATCGACTCAGAACTGGTACGGACATGGGGAATCCGACTGTTTAATTAAAACAAAGCATTGCGATGGTCCTCGAGGATGCTCACGCAATGTGATTTCTGCCCAGTGCTCTGAATGTCAAAGTGAAGAAATTCAACCAAGCGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTAACGAGATTCCCACTGTCCCTGTCTACTATCCAGCGAAACCACAGCCAAGGGAACGGGCTTGGCAGAATCAGCGGGGAAAGAAGACCCTGTTGAGCTTGACTCTAGTCCGACTTTGTGAAATGACTTGAGAGGTGTAGCATAAGTGGGAGTCGGCGTGCCGACGGAATTGAAATACCACTACTTTTAACGTTATTTTACTTATTCCGTGAGACGGAGGCGGGGCCCAGCCCCTCCTTTTGGCTCCAAGTCTCGCCTCGGCGAGTCGATCCGGGCGGAAGACATTGTCAGGTGGGGAGTTTGGCTGGGGCGGCACATCTGTTAAAAGATAACGCAGGTGTCCTAAGATGAGCTCAACGAGAACAGAAATCTCGTGTGGAACAAAAGGGTAAAAGCTCGTTTGATTCTGATTTCCAGTACGAATACGAACCGTGAAAGCGTGGCCTATCGATCCTTTAGGCTTTCAGAATTTGAAGCTAGAGGTGTCAGAAAAGTTACCACAGGGATAACTGGCTTGTGGCAGCCAAGCGTTCATAGCGACGTTGCTTTTTGATCCTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGTGTTGGATTGTTCACCCACCAATAGGGAACGTGAGCTGGGTTTAGACCGTCGTGAGACAGGTTAGTTTTACCCTACTGATGAATGTGTCGTGATAGTAATTCAACCTAGTACGAGAGGAACCGTTGATTCACACAATTGGTCATCGCGCTTGGTTGAAAAGCCAGTGGCGCGAAGCTATCGTGTGTAGGATTATGACTGAACGCCTCTAAGTCAGAATCCACGCTAAGATGCGACGCTTAGGCCTATCGTTCGCCTGTTGGCCTAAAGTAGGGGCTTGGCCCCCAAGGGCACGCGACCACGGGCTAGTCTGGTGTCATAGAAGGTTGGATGGCATGGGCCCCGTAAGAAATGATAGTTAAGAACGAACGATGGGTAGAATCCTTTGCAGACGACTTAAATTTGCGACGGGGCATTGTAAGTGGCAGAGTGGCCTTGCTGCCACGATCCACTGAGATCCAACCCTATGTTGCAGTAGATTCGTCCCCTTCTGCCGCAGAACACGAAGAGCTTGGTTTTTGGTTTTCTAAAGGGGGACCAAGCTCGCTTGCTCTACAGCTCTACTTATTGCGGTACTTTGTGCGGGTCCTGTGCAACTGGCCTTGCGATGCGTCCCTCGGTGGCCCTACCAATGTGTAAGGCTTAAGCCTGACATTGCACGATTCAGTTATCACAAACGTGAAATAGCTATGAAATGCTGAAAAAGGGTTTGAAAAATATCTGCTGGCTCTGTCTACGTGCACATAGAGGCTTGTCTGCACGCACTTTGGGGCTTGTTTGCGTGCACAGCAGCAACCTACGCGCGCTCCATGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGATTTTCAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAACCTATTGGTAGTCTTACCTTGTGCTGCGTGCTGTCTGTGCACCGGGCCTGTTTGCGTGCACGGCAGCAACCTACGCGCGCTCCGTGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAACCTATCGGTAGTCTTGCCTCAGCCCAAGATGGCATTGTCGCAAAACACGATCCGGATGGCGTGCATGGGCATCACGCTCCTGCACGCGGGCATGCGCATCCCGCCCGCCTGCGCCGCGCTCCTGCGTGGGGGAAAGTGTGCTTCTCGCTCCTGCGTGTTGGCCTATCCCCGCTTCTTCGGTCGCAAGGTGCGTCCCCCGCATCATGGTCTTGGAGAATCCTGTAACTTTCACCGAGATCATTAGCTATGCGGTCTTCATGAGTGGACCAAAAAGTGTAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCGTCTGCATGGATGCGGTGGCAACCTATTGGTAGTCTTGCCTCTTCCCGAGGTGCGACCATCGCAATACACGATCCGGATGGCGTGCATGTGCATCACGCTCCAGCGATTGGGCCGGTGCGGCGATCCGTCGCGGGCACCGTTCCCTTGCTCGGAACGCGCGGCCGTGGGCACCGCGCTCTCGCGCTCGGTCCTGTGCGCCGCGCAAGTTAGCTCTGCGGGCTCGTGAGGGGACCGAAATTTGTAACTTTTTTCCGGATTTTTAGGGCCGCGGGCTGCTTCAGCATGGTTACGGTGACAACCTATTGGTAGTATTTGCACGTCCCGAGATGCGATCGTCTCAAAACACGTTCCGGATGGCGGTCGTGGTCATCACGCTCCTGCGTGTGGGCAAGCGCGCCGCGCGGCCCCCCGACGCACGAGCGCGCGGCGCGATCATTGGCTCTCCGGTCTCCACGAGGGGACAAAAAAGTGCAACTTTTTTCGGGATTTTCAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAGCCTTTTGGTAGTCTTACCTTGTGCACCGGGCCTGTTTGCGTGCACCGCTGCAACCAACGCGCGCTCCATGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGACCTTTCAGCGCGGCGGG
If you need to restart coding from the previous point, please execute the code below, which will load the Raw GB data from the CSV
file saved in 04_Data_analyses/CSV
.
To do so, do the following:
csv
file produced by the instructor on Google Drive at this path: DNA_barcoding/04_Data_analyses/CSV/Instructor_files/Vanilla_internal transcribed spacer_2024_02_08_Raw_GenBank.csv
DNA_barcoding
.###~~~
#Load Raw GB query csv file
###~~~
#If you have saved the SEQ file and need to restart (from SEQ), execute this code
# --> Reload csv file from 04_Data_analyses/CSV/
#Adjust the file name based on your data
SEQfileName <- "Vanilla_internal transcribed spacer_2023_02_14_Raw_GenBank.csv"
#Load the csv in the environment
RawGBDat2 <- read.csv(paste0("04_Data_analyses/CSV/", SEQfileName), quote = "\"", sep = ' ')
#Change name of object to make it compatible with code
SEQ <- RawGBDat2
Write code to determine how many DNA sequences where downloaded from GenBank?
To answer this question, use SEQ
as input.
Here we apply filters to discard DNA sequences that are:
Finally, we are preparing/formatting data for the production of the FASTA
file.
###~~~
#Tidy dataset
###~~~
##
#1. The search retrieved sequences that are NEITHER ITS, NOR Vanilla
##
# Use grep to search for internal (more used than ITS) in definition
# Use grep to search for Vanilla in species
gene <- SEQ[grep("internal", SEQ$Definition),]
gene <- gene[grep("Vanilla", gene$Species),]
##
#2. Previous analysis showed that the V. mexicana sequence is contaminated/corrupted.
##
# We are therefore discarding it from our dataset
gene <- gene[-which(gene$Species == "Vanilla mexicana"),]
##
#3. Discard DNA sequences identified at genus level
##
#Create vector with names of taxa in dataset
taxaVan <- unique(as.vector(gene$Species))
#Find DNA accessions identified at genus level
# and discard them
# 1. All species matching "sp."
spVan <- taxaVan[grep("sp.", taxaVan)]
# 2. Exclude those that have "subsp." since they are accurately identified
spVan <- spVan[-grep("subsp.", spVan)]
# 3. Subset gene to only keep DNA sequences identified at species level
gene <- subset(gene, !(gene$Species %in% spVan))
##
#4. Discard DNA sequences < 500 OR >= 1000 bp
##
gene <- gene[-which(as.numeric(as.vector(gene$Seq_length)) < 500 | as.numeric(as.vector(gene$Seq_length)) >= 1000),]
###~~~
#Prepare dataset for FASTA format
###~~~
# FASTA first line contains GenBank ID and species.
# Want these fields to be separated by "_"
# and need to include those in species field
gene$Species <- gsub(" ", "_", gene$Species)
Q1: Write code to tell us how many DNA sequences were discarded through your filtering analyses?
Q2: Write code to determine how many taxa of Vanilla are in your sampling?
Q3: Write code to determine if your sampling is skewed towards specific taxa? If yes, which ones?
FASTA
and csv
formatsWe are using R
to generate a FASTA
file with the DNA sequences from our GenBank query. To do that we are concatenating information from 3 columns in the gene
object:
- gene$GenBankID
: Contains unique GenBank ID for DNA sequence.
- gene$Species
: Contains species taxonomy associated to sequence.
- gene$Sequence
: Contains the DNA sequence.
###~~~
#Create FASTA
###~~~
FASTAGB <- paste(paste(">", as.vector(gene$GenBankID), "_",
as.vector(gene$Species), sep = ""),
as.vector(gene$Sequence), sep = '\n')
###~~~
#Write FASTA & CSV (incl. meta-data) files
###~~~
##File name FASTA
FileIDFASTA <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank.fasta", sep='_')
#Write FASTA file in DNA_barcoding/04_Data_analyses/
write.table(FASTAGB,
paste("04_Data_analyses/FASTA/",
FileIDFASTA, sep = ''), row.names = F, col.names = F, quote = F)
#File name CSV
FileIDcsv <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank.csv", sep='_')
#Write CSV file
write.table(gene,
paste("04_Data_analyses/CSV/",
FileIDcsv, sep=''), row.names = F, col.names = T, quote = T)
Let’s have a look at the tidy data:
## [1] "After cleaning 148 DNA sequences remain."
## GenBankID Species
## 1 1789804163 Vanilla_trigonocarpa
## Definition
## 1 >MN902067.1 Vanilla trigonocarpa voucher W. Stern s.n. internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
## Seq_length
## 1 617
## Sequence
## 1 AGAGGCGTGAATGATGGAACGATCTGTCCAACATGTGGGAGTGCGACAGTTCGATGTCGCCTTCTTCCGTAGCGCGTGCTCTTGCTTCGACGTGGAGCTCGACGCTACGGGGGATAAACAAAAGCTTATGGGCGTTGTCTGGCGCCAAGGGAGAGCAAATGTTCAAGCTGGCAAACGAGTGTGTTGTCGTCAGGTCCATTGAGTCCTGGCAATCGAACGACTCTCGACAACGGATATCTTGGCTCTCGCATCGATGAAGAACGCAGCTTGAAATGCGATACGTGTTGTGAATTGTAGAATCCCGTGAACCATCCATTTTTTGAACGCAAGTTGCGCCCGAGGATGCAAGCCGAGGGCACGCCTGCATGGGTGTAATGCGACCCGTCGCTCCTTGCGGAAGGCTGGAATCTTTGGTTTGGTTCCTCGTCCCCGTTGTGGAGGCGATTGATGGCACCCCGTGCAATAGCATGGCGTGTCGAAGTGTGGGGCGACGGTCGACTGTCGACATGATAAGAGGTGGGCAGCCACCGGCTGTTGTGGTGTTGGCCAGCAATAATCGATGTTGTCGTGCGACAAGCAGGTGCCCCGCATAGATCCAACTCCGTCCTCGATGGTGT
.seq
files from Part 1 into a FASTA
fileIn this section, we are focusing on developing an R
code to format and merge our clean sequences from Part 1 by using the following approach:
.seq
files in folder 03_Processed_ITS_data_FASTA/
using list.files()
.readLines()
,FASTA
file from interleave to sequential.FASTA
objects using rbind()
.FASTA
object into file using write.table()
.Please notice that steps 2 to 5 will take place within a for
loop.
###~~~
#List of .seq files
###~~~
#Please adapt path to your working directory/project
Files <- list.files("03_Processed_ITS_data_FASTA/", pattern = '.seq', full.names = T)
###~~~
#Format and merge all files into one file
###~~~
#This is done by using a loop
seqOUT <- NULL
for(i in 1:length(Files)){
#Read FASTA
tmp <- readLines(Files[i])
#Extract and concatenate DNA sequence
DNAseq <- paste(tmp[2:length(tmp)], collapse = '')
#Infer complementary sequence
DNAseqcomp <- unname(sapply(strsplit(DNAseq,"")[[1]], switch, "A"="T", "T"="A","G"="C","C"="G"))
#Infer reverse and complement sequence (and collapse into one element)
DNAseqcomprev <- paste(rev(DNAseqcomp), collapse='')
#Edit FASTA to have only one object/line (=sequential format)
tmp <- paste(tmp[1], DNAseqcomprev, sep='\n')
#Merge objects
seqOUT <- rbind(seqOUT, tmp)
}
###~~~
#Write data
###~~~
#File name
FileIDfastaSeq <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "msa_input.fasta", sep='_')
#Write file (= input for msa analysis)
write.table(seqOUT, file =
paste("04_Data_analyses/FASTA/",
FileIDfastaSeq, sep = ''), col.names = F, row.names = F, quote = F)
Please check that the format of your output file is as expected using a text editor (see Figure 4.15).
FASTA
objects/files to perform multiple sequence alignmentHere, we are aiming at merging FASTA
outputs to produce the input of the multiple sequence alignment. This step is pretty straightforward and involves using the c() function and writing the output.
The objects containg the GenBank DNA sequences and your DNA sequences are FASTAGB
and seqOUT
, respectively.
###~~~
#Start by merging datasets (GenBank and your newly produced seq.)
###~~~
#If your FASTA objects are still in R
# merge FASTA into one object
FASTAall <- c(FASTAGB, seqOUT)
#Else, you will have to load fasta files using readLines() before merging them
###~~~
#Write data
###~~~
#File name
FileIDfastaALL <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank_seq_msa_input.fasta", sep='_')
#Write file (= input for msa analysis)
write.table(FASTAall, file =
paste("04_Data_analyses/FASTA/",
FileIDfastaALL, sep=''), col.names = F, row.names = F, quote = F)
Why would you need to use readLines() instead of read.csv() to open a FASTA file?
DNA multiple sequence alignment and phylogenetic inference.
To execute Part 3, you need to install the following software and R packages on your computer:
MEGA
(Kumar et al., 2018): Please download the GUI version of the software associated to your operating system at this URL:
FigTree
(a software to visualize and manipulate trees): https://github.com/rambaut/figtree/releasesR
package: ape (Paradis et al., 2004).If you don’t remember how to install an R package, don’t worry, this topic was covered here.
To infer the ML phylogenetic analysis, the following workflow will be executed: implemented:
MUSCLE
algorithm (Edgar, 2004). This algorithm is implemented in MEGA
. The input data for this analysis is Vanilla_internal transcribed spacer_2021_02_09_GenBank_seq_msa_input.fasta
, which is stored in 04_Data_analyses/FASTA/
.MEGA
.RAxML
algorithm (available on a web platform).The analysis described below relies on MEGA
; however you might be experiencing issues with the .fasta
file outputted by this program (especially on Windows operating system). These issues might compromise downstream analyses, more specifically the RAxML inference and R code used to generate figures. Those issues are associated to changes in file encryption and changing formatting of samples names (i.e. replacing “_” by ” “). In the event that it happens to you, please switch over and use the online MUSCLE
platform available at this URL: https://www.ebi.ac.uk/Tools/msa/muscle/
WARNING: When you submit your job on the portal, do not forget to select Pearson/FASTA
as output file format in step 2. In addition, the output will only be made available in a window and you will have to copy and past the whole content in a new file using your favorite text editor. The file should be saved and named as detailed in the step 8 of the section below.
To conduct a multiple sequence alignment in MEGA
do the following:
MEGA
.Vanilla_internal transcribed spacer_2021_02_09_GenBank_seq_msa_input.fasta
) by clicking File -> Open A File/Session...
and selecting the right file.How would you like...
and you will then click on the Align
button. This will open a new window with your data (corresponding to an unaligned DNA matrix; see Figure 4.16).4.To start a MUSCLE
analysis do has shown in Figure 4.17 (Alignment -> Align by MUSCLE
. A window will open saying Nothing selected for alignment. Select all?
, click the OK
button. Further details on the MUSCLE
algorithm can be found in Edgar (2004).
OK
to start the analysis. The analysis should take ca. 5-10 minutes to complete.-
) that had to be incorporated to accommodate for differences in DNA sequences between samples. Your DNA sequences of Vanilla are at the bottom of the file.FASTA
format by clicking Data -> Export Alignment -> FASTA Format
. This procedure is also shown in Figure 4.20.04_Data_analyses/FASTA/
and adjust file name as follows: _GenBank_seq_msa_output.fasta
. This procedure is shown in Figure 4.21. Make sure to have your file extension set as .fasta
otherwise you won’t be able to load your file during the RAxML
analysis.The RAxML
algorithm will be used to conduct the ML phylogenetic analysis and infer node statistical supports using the bootstrap procedure (Kozlov et al., 2019). This algorithm has been implemented on a web service platform accessible at this URL:
- https://raxml-ng.vital-it.ch/#/
WARNING: If the RAxML web service platform is down, the instructor will run the analysis locally (on a Linux computer). Please click here for more details on the procedure.
To perform the ML phylogenetic analysis:
MEGA
analysis (*_GenBank_seq_msa_output.fasta
) stored in 04_Data_analyses/FASTA/
as input data.Bootstrapping settings
(see below). Settings have to be set as shown in Figure 4.22.Finally, don’t forget to provide your email before submitting the analysis. The analysis should take 1-5 hours to run. You will receive an email to download results or you can access results by using your unique URL.
When your analysis is completed, download results (see Figure 4.23) and save it in 04_Data_analyses/Phylogenetic_analyses/
.
raxmlArg.txt
: Contains RAxML command line used to conduct analysis.result.raxml.bestModel
: Contains estimated parameters for model used for ML analysis.result.raxml.bestTree
: Best ML tree in newick format.result.raxml.bootstraps
: Bootstrap trees in newick format.result.raxml.support
: Best ML tree (same tree than in result.raxml.bestTree
) with node supports (inferred from data in result.raxml.bootstraps
) in newick format.sequenceAlignment.fasta
: Your input aligned FASTA file.result.raxml.support
in FigTree
to visualize it and locate the four samples of vanilla from Mexico. We will learn how to process phylogenetic trees in R in the next section.Connect to Linux computer:
ssh svenbuerki@XX.XX.XXX.XXX
Navigate to the right folder:
cd /media/SeaGate/Mini_Report_3/RAxML_analysis/
Run the RAxML analysis locally:
This analysis is done on the output of the MUSCLE analysis
Vanilla_internal\ transcribed\ spacer_2024_02_20_GenBank_seq_msa_output.fas
raxmlHPC-PTHREADS -T 20 -f a -m GTRGAMMA -p 12345 -o 1789804116_Vanilla_inodora -x 12345 -# 100 -s Vanilla_internal\ transcribed\ spacer_2024_02_20_GenBank_seq_msa_output.fas -n Vanilla_ML.tre
Arguments - Short definitions
-T
: number of threads-m
: molecular model to build the phylogenetic tree-o
: outgroup taxon to root analysis-p
and -x
: seed numbers for analysis-#
: number of bootstrap analyses-s
: input file in FASTA format-n
: output fileOutput files
RAxML_bootstrap.Vanilla_ML.tre
RAxML_bestTree.Vanilla_ML.tre
RAxML_bipartitions.Vanilla_ML.tre
(FILE FOR NEXT ANALYSES)RAxML_bipartitionsBranchLabels.Vanilla_ML.tre
Files on Google Drive
The output files are available on Google Drive at this path:
DNA_barcoding > 04_Data_analyses > Phylogenetic_analyses > RAxML_analysis
Download the folder on your computer and save it under the same path.
R
functions implemented in the ape packageThe Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses. We owe this format to the British mathematician Arthur Cayley. Here we will be providing an overview of the Newick format using the R package ape (Paradis et al., 2004).
To produce the “dummy” phylogenetic tree displayed in Figure 4.24, the following syntax should be applied:
Before starting, create a new R
script entitled 02_Part3_exercises.R
saved at the root of DNA_barcoding
.
###~~~
#Check if package is installed if not then install it
###~~~
if("ape" %in% rownames(installed.packages()) == FALSE){
print("Install ape")
install.packages("ape")
}else{
print("ape is installed!")
}
###~~~
#Load package
###~~~
library(ape)
###~~~
#Set working directory
###~~~
#Set working directory to path leading to DNA_barcoding folder
# WARNING: This path as to be adapted to match your computer
setwd("~/Documents/Class_Genomics&Bioinfo_Spring/DNA_barcoding/")
#Check that working directory is set correctly
getwd()
###~~~
#Your first tree in Newick format
###~~~
tr1 <- "(A._speciosa,(B._dimorpha,(C._elegans,D._viridis)));"
###~~~
#Plot your tree with ape package
###~~~
plot(ape::read.tree(text=tr1))
Some considerations associated to the syntax:
;
).()
). Between them are representations of the nodes that are immediately descended from that node, separated by commas (,
)._
) stands for a blank; any of these in a name will be converted to a blank when it is read in (see code above and Figure 4.24 for more details).Branch lengths (representing e.g. molecular substitutions, which can be turned into time in a dated phylogenetic tree) can be incorporated into a tree by putting a real number, with or without decimal point, after a node and preceded by a colon (:
). This represents the length of the branch immediately below that node. Thus, the above tree might have lengths represented as in Figure 4.25.
###~~~
#Tree in Newick format with branch lengths
###~~~
# Here we added branch length values using ":" and inserting value
tr2 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0):1):4.0);"
###~~~
#Plot your tree with ape package
###~~~
plot(ape::read.tree(text=tr2))
Now, we can finally add node supports (inferred using e.g. the bootstrap approach in the case of the vanilla Maximum Likelihood analysis conducted here) by simply adding values after each closing parenthesis. This is done as follows and displayed in Figure 4.26:
###~~~
#Tree in Newick format with branch lengths and node supports
###~~~
# Here we added branch lengths and bootstrap values
tr3 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0)60:1)90:4.0)100;"
###~~~
#Plot your tree with ape package
###~~~
# Note that to display node labels, we have to set show.node.label = TRUE
plot(ape::read.tree(text=tr3), show.node.label = TRUE)
phylo
class and phylogenetic tree in RHere, we will be introducing the phylo
class implemented in R packages dealing with phylogenetic analyses. To study this topic, we will be using the phylogenetic tree with branch lengths and node supports (see Figure 4.26).
###~~~
#Tree in Newick format with branch lengths and node supports
###~~~
tr3 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0)60:1)90:4.0)100;"
###~~~
#Create and load tree
###~~~
tr <- ape::read.tree(text=tr3)
tr
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## A._speciosa, B._dimorpha, C._elegans, D._viridis
## Node labels:
## 100, 90, 60
##
## Rooted; includes branch lengths.
###~~~
#Check class
###~~~
class(tr)
## [1] "phylo"
phylo
class objects (here tr
) are lists allowing to access multiple attributes associated to the phylogenetic tree. Three of these lists are especially useful to us:
tr$tip.label
: Vector with tip labels.tr$node.label
: Vector with node labels.tr$edge.length
: Vector with branch lengths.Here, are the details with our example:
###~~~
#To access tip labels
###~~~
tr$tip.label
## [1] "A._speciosa" "B._dimorpha" "C._elegans" "D._viridis"
###~~~
#To access node labels
###~~~
tr$node.label
## [1] "100" "90" "60"
###~~~
#To access branch lengths
###~~~
tr$edge.length
## [1] 2.0 4.0 4.2 1.0 2.0 3.0
Using this knowledge, draw and code a phylogenetic tree with 8 samples (tips) containing branch lengths and node supports
This challenge is done in small groups and each group will take turn to (i) draw their phylogenetic tree on the white board, and (ii) code it using the newick syntax.
Tip: To make sure that your phylogenetic tree is correct, code it and execute it in R
using the ape functions shown above.
In this section, we will be analyzing the output of the RAxML analysis focusing on result.raxml.support
, which contains the phylogenetic tree with node supports (here obtained through the bootstrap method). The file is stored in 04_Data_analyses/Phylogenetic_analysis/RAxML_analysis
.
Our analyses are divided into seven steps:
ITS$node.label
) and replacing their values by nothing (""
).1789804116_Vanilla_inodora
) is a suitable outgroup taxon.pdf
format.Before starting, create a new R
script entitled 03_Part3_PhyloFig.R
saved at the root of DNA_barcoding
.
###~~~
#1. Load Vanilla RAxML ITS tree
###~~~
#Please adjust path based on your computer (= delete "Data/")
ITS <- ape::read.tree(file="Data/04_Data_analyses/Phylogenetic_analyses/RAxML_analysis/RAxML_bipartitions.Vanilla_ML.tre")
###~~~
#2. Discard bootstrap values < 50%
###~~~
#Find nodes with low statistical supports
ITS$node.label[which(as.numeric(ITS$node.label) < 50)]
## [1] "32" "5" "0" "18" "9" "1" "6" "0" "0" "12" "29" "2" "26" "33" "26"
## [16] "32" "14" "14" "19" "11" "0" "3" "1" "1" "7" "4" "15" "16" "15" "5"
## [31] "19" "10" "13" "7" "3" "34" "16" "3" "2" "2" "9" "31" "19" "28" "6"
## [46] "3" "9" "24" "4" "3" "7" "30" "24" "34" "30" "16" "17" "28" "23" "39"
## [61] "44" "38" "19" "45" "28" "28" "7" "5" "22" "45" "31" "35" "34" "38" "6"
## [76] "6" "27" "1" "0" "6" "1" "5" "0" "15" "3" "42" "16" "7" "11" "10"
## [91] "21" "35" "1" "0" "0" "0" "1" "5" "5" "26" "23" "25" "38" "9" "16"
## [106] "22" "21" "2" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [121] "1" "20" "28" "37"
#Replace/overwrite values by ""
ITS$node.label[which(as.numeric(ITS$node.label) < 50)] <- ""
###~~~
#3. Rename tips: Vanilla by V.
###~~~
#Replace/overwrite values
ITS$tip.label <- gsub("Vanilla", "V.", ITS$tip.label)
###~~~
#4. Root phylogenetic tree with outgroup taxon
###~~~
#Re-root tree with suitable sample
ITS <- ape::root(ITS, outgroup = "1789804116_V._inodora", resolve.root=T)
#This procedure add a node label "Root", which needs to be discarded.
ITS$node.label[which(ITS$node.label == "Root")] <- ""
###~~~
#5. Ladderize and plot tree by sorting nodes
###~~~
#Ladderize
ITS <- ape::ladderize(ITS)
#Color tips: your samples in red
tipcol <- rep("black", length(ITS$tip.label))
#Find which tips are your samples and replace color by red
tipcol[grep("-ITSp4", ITS$tip.label)] <- "red"
#Plot
ape::plot.phylo(ITS, cex=.3, tip.color = tipcol, show.node.label = TRUE, align.tip.label = T, no.margin = TRUE)
###~~~
#6. Extract subtree and plot tree
###~~~
#Find node corresponding to MRCA of our vanilla samples
MRCAsamples <- ape::getMRCA(ITS, tip=ITS$tip.label[grep("-ITSp4", ITS$tip.label)])
#Extract subtree
VanITS <- ape::extract.clade(ITS, node=MRCAsamples)
#Rename vanilla samples using apply and switch as learned before
VanITS$tip.label[grep("-ITSp4", VanITS$tip.label)] <- unname(sapply(VanITS$tip.label[grep("-ITSp4", VanITS$tip.label)], switch, "its25-ITSp4"="PE25", "its50-ITSp4"="PE50","its44-ITSp4"="PE44","its49-ITSp4"="PE49"))
#Color tips: your samples in red
tipcolVan <- rep("black", length(VanITS$tip.label))
#Find which tips are your samples and replace color by red
# ^ means that the search has to begin with PE
tipcolVan[grep("^PE", VanITS$tip.label)] <- "red"
#Plot tree in radial mode
ape::plot.phylo(VanITS, cex = .4, use.edge.length = F, tip.color = tipcolVan, type = "radial", label.offset = 0.04, no.margin = TRUE)
#Add bootstrap supports
ape::nodelabels(text = VanITS$node.label, adj = c(0.5,0.5), frame = "none", cex = 0.5)
#Add circles next to your vanilla samples
ape::tiplabels(tip = grep("^PE", VanITS$tip.label), pch = 16, col = "red", offset = 0.02)
###~~~
#7. Export tree from step 6 in pdf format
###~~~
#Use the pdf function to create pdf
# Adjust path for your computer (= delete "Data/")
pdf("Data/04_Data_analyses/Phylogenetic_analyses/RAxML_analysis/VanSubTree.pdf")
#Plot tree in radial mode
ape::plot.phylo(VanITS, cex = .4, use.edge.length = F, tip.color = tipcolVan, type = "radial", label.offset = 0.04, no.margin = TRUE)
#Add bootstrap supports
ape::nodelabels(text = VanITS$node.label, adj = c(0.5,0.5), frame = "none", cex = 0.5)
#Add circles next to your vanilla samples
ape::tiplabels(tip = grep("^PE", VanITS$tip.label), pch = 16, col = "red", offset = 0.02)
#This function closes the pdf
dev.off()
## quartz_off_screen
## 2
Click on the button below to inspect the final Vanilla phylogenetic tree (in pdf
format) and answer this scientific question:
What species do the samples of vanilla analyzed here belong to (see Figure 4.2) and why?
Note: It is not sufficient to provide taxonomic names for the analyzed samples, you have to motivate why you are advocating for these names.