1 Introduction

To further gain expertise in the field of genomics, students are producing three mini-reports on the following topics:

  • Mini-Report 1: Sequencing technologies (25 points; this task is related to Chapter 2).
  • Mini-Report 2: Molecular biology databases (25 points; this task is related to Chapter 3).
  • Mini-Report 3: Species identifications based on DNA barcoding and phylogenetic inference (50 points). This report provides students with an introduction to methods applied in the laboratory sessions and prepare them to analyze NGS data.

Time will be allocated in class to work on these mini-reports, but the instructor expects students to complete those on their own time.

2 Mini-report 1 – Sequencing platforms & technologies

To have a full overview of available sequencing platforms, students are tasked to produce individual mini-reports on the following sequencing platforms and their associated technologies:

  • Sequencing platform 1: Illumina.
  • Sequencing platform 2: PacBio.
  • Sequencing platform 3: Oxford Nanopore.

2.1 Background information

Each student was assigned a sequencing platform to work on (please see Google spreadsheet for more information).

The individual assignment is mandatory and will be graded. Please see this task as an opportunity to consolidate your knowledge on next-generation sequencing technologies and writing scientific reports.

Time will be allocated in class on the 19th of January 2024 to allow students to work on their assignments. Reports are due on the 2nd of February 2024 and should be uploaded on the shared Google Drive in the Mini_Report_1 folder.

2.2 Structure of mini-reports

Please structure your reports to ensure covering the following aspects:

  • Introduce the sequencing technology (incl. library preparation and DNA requirements).
  • Provide an overview of the available sequencing systems or platforms.
  • Outline the research potential/applicability of the sequencing systems or platform.
  • Provide any additional information such as e.g. price, accessibility, technical support, that you feel are relevant to your sequencing platform.

2.3 Warning

For this assignment, students are not allowed to use AI to complete their work.

2.4 Reporting format

Maximum two pages, single spaced, using either Arial 11 pt or Times New Roman 12 pt, and with all margins set to 1 inch. The two pages do not include a title page and a references section.

The title page should include the following information:

  1. Name of the sequencing platform as title.
  2. Name of student contributing the report.

The references section should contain full citations of all references cited in the text. Each reference cited in the References section should be clearly cited in the text to support transparency in your work and reproducible science.

A citation is a way of giving credit to individuals for their creative and intellectual works that you utilized to support your research. It can also be used to locate particular sources and combat plagiarism. Typically, a citation can include the author’s name, date, location of the publishing company, journal title, or DOI (Digital Object Identifier). Citation styles dictate the information necessary for a citation and how the information is ordered, as well as punctuation and other formatting. There are different citation styles (see Figure 2.1) and the instructor let you pick the one you like. However, please make sure to only use one citation style in your mini-report.

Example of citation styles for an article published in Plants.

Figure 2.1: Example of citation styles for an article published in Plants.

In the case of web resources, please provide the following information in your citation: author, title (of the web page), URL and accessed date. Here is an example:

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcostsdata. Accessed 2023-01-16.

Please keep in mind that a figure can convey a lot more information than a long text! Your reports should follow the structure presented in section 2.2.

2.5 Format to name document

Name your report document following this pattern: Sequencing platform_Surname.

2.6 Resources

The instructor has provided pdf documents devoted to each sequencing platform below to support students in starting their assignments. Please see Satam et al. (2023) for a review on this subject and Marx (2023) for more information on long-read sequencing.

In addition, please use information available from manufacturers’ websites (see below), publications, online NGS training sites (e.g. EMBL-EBI Training), Google, YouTube and Wikipedia to prepare your reports and presentations. Do not forget to provide citations supporting your claims in your reports. I let you decide which reference style you want to use (look at your favorite journals for examples).

Here are a few additional online references to support your assignment:

3 Mini-report 2 – Molecular biology databases

To have a full overview of available molecular biology databases and assess their contributions to genome annotation, students are tasked to produce individual mini-reports on the following molecular databases:

3.1 Background information

Each student was assigned a molecular database to work on (please see Google spreadsheet for more information).

This individual assignment is mandatory and will be graded. Please see this task as an opportunity to consolidate your knowledge on molecular databases and writing scientific reports.

Time will be allocated in class on the 2nd of February 2024 to allow students to work on their assignments. Reports are due on the 21st of February 2024 and should be uploaded on the shared Google Drive in the Mini_Report_2 folder.

3.2 Structure of mini-reports

Please structure your reports to cover the following aspects:

  • Provide a brief summary of the purpose/mission of your assigned molecular database to introduce its overarching goals and objectives.
  • Provide an overview of the available databases falling under your major type of molecular biology data.
  • Highlight potential contributions (and limitations) of these databases to support genome annotation.
  • Provide an overview or survey of available bioinformatic tools and resources to query/access these databases and conduct analyses.
  • Complement reports with information relevant/specific to your data repositories.

3.3 Reporting format

Follow the same reporting format than with Mini-Report 1.

3.4 Format to name document

Name your report document following this pattern: Molecular database_Surname.

3.5 Resources

The instructor provided below short descriptions of each of the major types of molecular biology databases. Students are encouraged to use the information provided below to prime their research and start populating their reports. Please do not forget to provide citations and references in your reports following information provided in Mini-Report 1.

3.5.1 Proteins sequences databases

There are three major protein sequences databases:

In 2002, these three databases coordinated their efforts to form the UniProt consortium (Consortium, 2014). The partners in this enterprise share the database, but continue to offer separate information-retrieval tools for access. The UniProtKB database is available at this URL: http://www.uniprot.org. This latter database will be especially important to retrieve information on gene annotations (Chapters 6, 7) using the online tool available at this URL: http://www.uniprot.org/uploadlists/. This will be done by specifically looking at Gene ontology (see GeneOntology).

3.5.2 Gene ontology databases

Gene Ontology (GO, Consortium, 2020) (http://www.geneontology.org): the framework for the model of biology. The GO defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:

  • Cellular component: This term describes a component of a cell that is part of a larger object, such as an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
  • Biological process: A biological process term describes a series of events accomplished by one or more organized assemblies of molecular functions. Examples of broad biological process terms are “cellular physiological process” or “signal transduction”. Examples of more specific terms are “pyrimidine metabolic process” or “alpha-glucoside transport”. The general rule to assist in distinguishing between a biological process and a molecular function is that a process must have more than one distinct steps. A biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.
  • Molecular function: Molecular function terms are describing activities occurring at the molecular level, such as “catalytic activity” or “binding activity”. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are “catalytic activity” and “transporter activity”; examples of narrower functional terms are “adenylate cyclase activity” or “toll receptor binding”. It is easy to confuse a gene product name with its molecular function; for that reason GO molecular functions are often appended with the word “activity”.

GO is also available for all the protein sequences deposited on the UniProt database. Other useful tool to perform gene annotations:

3.5.3 Metabolic pathways databases

The Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al., 2022) (http://www.genome.jp/kegg/) collects individual genomes, gene products, and their functions, but its special strength lies in its integration of biochemical and genetic information. KEGG focuses on interactions: molecular assemblies, and metabolic and regulatory networks. KEGG organizes five data types into a comprehensive system:

  • Catalogs of chemical compounds in living cells.
  • Gene catalogs.
  • Genome maps.
  • Pathway maps.
  • Orthologue tables.

The catalogs of chemical compounds and genes contain information about particular molecules or sequences. Genome maps integrate the genes themselves according to their chromosomal position. Pathway maps describe potential networks of molecular activities, both metabolic and regulatory. One enzyme in one organism would be referred to in KEGG in its orthologue tables, which link the enzyme to related ones in other organisms. This permits analysis of relationships between the metabolic pathways of different organisms.

4 Mini-report 3 – Species identifications based on DNA barcoding and phylogenetic inference

4.1 Learning outcomes

The following learning outcomes will be taught as part of this Mini-Report:

  1. Process and clean DNA sequence electropherograms generated using the Sanger sequencing approach (based on PCR amplifications).
  2. Validate DNA sequence and provide species working hypothesis by conducting BLAST similarity DNA sequences analyses. This will be based on DNA sequences deposited on GenBank (Benson et al., 2005).
  3. Query GenBank and download DNA sequences for phylogenetic analyses. This will be done using R (R Core Team, 2016) packages.
  4. Produce DNA multiple sequence alignments (containing both your target DNA sequences and those from GenBank). This will be done using MUSCLE (Edgar, 2004) implemented in MEGA (Kumar et al., 2018).
  5. Infer phylogenetic analysis based on Maximum Likelihood criterion as implemented in RAxML (Stamatakis, 2014). This phylogenetic framework will be used to test and refine species working hypotheses using the phylogenetic species concept (Wheeler, 1999).
  6. Visualize and interpret phylogenetic analysis using R (R Core Team, 2016) packages.
  7. Learn the terminology associated with DNA barcoding and phylogenetic analyses. Key definitions associated to the most used terms are available in the Lexicon.

4.2 Publications

This mini-report is mostly based on the following publication:

  • Ellestad et al. (2022) - DNA barcoding and phylogenetics of Vanilla

In each section, references are provided to help you master the material taught in this assignment.

4.3 Schedule

This report is subdivided into four parts:

  • Introduction: Provide introductions on key theoretical concepts, scientific questions, hypothesis (incl. prediction), methodology, data availability, and report structure and formatting.
  • Part 1: Process and clean ITS DNA sequence electropherograms and infer species working hypothesis.
  • Part 2: DNA sequence retrieval and prepare data for analyses.
  • Part 3: DNA multiple sequence alignment and phylogenetic inference.

Please see the Timetable for more details on the schedule.

4.4 Introduction

4.4.1 How many species are there on Earth?

Projections of global biodiversity have ranged from 2 to 100 million species (Larsen et al., 2017).

But these estimations are without accounting for cryptic species! See for instance Dentinger and Suz (2014) for an example on porcini. In this study, the authors used DNA-sequencing to identify three species of mushroom contained within a commercial packet of dried Chinese porcini purchased in London. Surprisingly, all three have never been formally described by science and required new scientific names.

Larsen et al. (2017) recently published a keystone paper where they predicted 1 to 6 billion species on Earth! This prediction was based on an average of 6 cryptic species per described species.

Overall, data presented here demonstrate that most species on this planet are either poorly known or pending description.

In this context, the fields of genetics/genomics (more specifically DNA barcoding) and phylogenetics have the potential to contribute to:

  • species identifications and naming (taxonomy).
  • infer evolutionary frameworks allowing developing working hypotheses on species boundaries, relationships and their spatio-temporal histories (e.g. knowing how did species cope with past climatic conditions will provide insights into their adaptive capacity to forecasted conditions).

Approaches combining these two objectives were applied across lineages. See for instance, an article describing a new species belonging to the soapberry family (Sapindaceae) endemic to Fiji (Buerki et al., 2017). In this study, the phylogeny is used to confirm the new taxon and place it within a larger evolutionary and biogeographical framework. Finally, this evidence is coupled with occurrence data to infer a species extinction risk following IUCN guidelines.

4.4.2 DNA barcoding: Species identification

The Consortium for the Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species. CBOL has more than 130 Member Organizations from more than 40 countries (CBOL, 2021).

DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections, an individual DNA sequence can be used to uniquely identify an organism to species. This is analogous to a supermarket scanner using the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database (CBOL, 2021). DNA barcodes are sometimes used in an effort to identify unknown species, parts of an organism, or simply to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species boundaries. For more details on DNA barcodes applied to plants, please see section below dedicated to vanilla.

DNA barcoding has a wide range of applications from biodiversity surveys (e.g. Telfer et al., 2015), illegal trade of wildlife (e.g. Gonçalves et al., 2015) to food security (e.g. Quinto et al., 2016). The DNA barcoding approach has been reviewed by DeSalle and Goldstein (2019).

4.4.2.1 Procedure

A DNA barcoding approach is subdivided into four steps:

  1. Isolate DNA from the target sample.
  2. Amplify the target DNA barcode region using PCR.
  3. Sequence the PCR products using either Sanger sequencing or NGS (mostly on Illumina platform).
  4. Compare the resulting sequences against reference databases to find the matching species. Here, we will be using DNA sequences from GenBank as our reference (Benson et al., 2005). See Chapter 3 for more details on GenBank.

4.4.3 Phylogenetic inference

As described in Masters and Pozzi (2017), phylogenetic inference is the practice of reconstructing the evolutionary history of related species by grouping them in successively more inclusive sets based on shared ancestry. Homologous characters in independent lineages are similar because they have been inherited from a common ancestor, and they alone should be used in phylogenetic reconstructions. Homoplasies are characters that appear similar, but have evolved from different ancestral states. They may mislead interpretations of evolutionary history. Both molecular and morphological datasets are subject to obfuscation by homoplasy. Methods of phylogenetic inference aim to distinguish between homologous (signal) and homoplastic (noise) resemblance. Molecular datasets tend to be very large and are analyzed using statistical techniques that fit the data to models of molecular evolution. These methods are not well suited to morphological data, and combined analyses including both kinds of data tend to obscure the morphological signal. Rates of molecular change may be used to estimate divergence ages.

In this mini-report, we will be focusing on inferring phylogenetic trees based on DNA sequences obtained through the DNA barcoding approach described above.

4.4.3.1 Procedure

A phylogenetic analysis is subdivided into four steps:

  1. Produce DNA sequences following steps 1 to 3 as described in DNA barcoding section.
  2. Produce DNA multiple sequence alignments.
  3. Test for best-fit model reflecting evolution of DNA region.
  4. Infer phylogenetic tree, which is usually based either on Maximum Likelihood and/or Bayesian criteria.

4.5 Introducing our model species: Vanilla

Like porcini (Dentinger and Suz, 2014), vanilla is one of our most beloved spice, but we still know very little about its biology, taxonomy and evolution (see Ellestad et al., 2021 for a review). As part of a study aiming at inferring domestication processes in the genus Vanilla and species adaptation to drought, fieldwork was conducted in Mexico by Paige Ellestad (an EEB Ph.D. student at BSU) and her in-country partners.

For her Ph.D. dissertation, Paige is applying a DNA barcoding approach coupled with phylogenetic inference (and ecological niche modeling) to answer the following questions:

Q1: What species of vanilla are cultivated in Mexico?

Q2: How are those species related and is there evidence of gene flow/hybridization?

Q3: What level of genetic diversity is there among each species?

4.6 Applied DNA barcodes for the vanilla project

To answer these questions, she sampled >30 locations (mostly plantations, but also wild populations) representing >60 samples and sequenced two DNA barcode regions widely used in plants: one chloroplastic (the coding rbcL region) and one nuclear ribosomal (the ITS region, which stands for for “Internal Transcribed Spacer”). These two DNA regions are traditionally used by researchers as DNA barcodes supporting species delimitation and identifications. Since you are already familiar with rbcL (see Hollingsworth et al., 2009 for more details), we will be providing more details on the ITS region. The nuclear ribosomal ITS region includes part of 18S, ITS1, 5.8S, ITS2 and part of 26S (see Figure 4.1, Cheng et al., 2016). This region is repeated in tandem 1000 of times and duplicated across chromosomes in order to produce ribosomes, which are key to produce proteins. This latter DNA region is shared across kingdoms and plant specific primers have to be used otherwise we may be at risk of amplifying this region for symbiotic organisms (e.g. fungi, see Cheng et al., 2016).

Map of the ITS nuclear ribsomal region with primers (from Cheng et al, 2016).

Figure 4.1: Map of the ITS nuclear ribsomal region with primers (from Cheng et al, 2016).

DNA extractions of leaf material were conducted at BSU using the Qiagen Plant Mini kit followed by PCR amplifications using plant specific primers [for instance, the IT4p and ITS5p primers for the ITS region; see Cheng et al. (2016)]. Sequencing was outsourced to GENEWIZ and performed using Sanger sequencing technology. PCR amplicons/fragments were only sequenced using forward primers. As per provider, we are expecting high quality DNA sequencing read lengths of up to ca. 1000 bases. The company also mentions that a typical read would provide 800 bases with Phred score of 20.

4.7 Mini-Report 3 scientific questions

For this assignment, students will investigate the following question:

Q1: To what species of vanilla do the individuals presented in Figure 4.2 belong to?

To answer these questions, we will be focusing on four samples/individuals of vanilla (see Figure 4.2) and their associated ITS sequence data as well as ITS sequences deposited on GenBank. See the scientific process section below for more details.

Images of the four samples of vanilla collected in Mexico studied in this project.

Figure 4.2: Images of the four samples of vanilla collected in Mexico studied in this project.

Throughout this project, we will be referring to these vanilla samples/individuals as follows (see Figure 4.2):

  • PE25, PE44, PE49 and PE50.

4.8 Question, hypothesis, and methodology

Our overachieving question is as follows:

To what species of vanilla do individuals presented in Figure 4.2 belong to?

Based on evidence detailed above, our hypothesis is the following:

Individuals belong to the same species (V. planifolia) and thr observed phenotypic variation is due to phenotypic plasticity (= exhibiting responses to contrasting climates).

To test our hypothesis, we are applying the phylogenetic species concept (Wheeler, 1999) (mainly criterion of monophyly) using DNA barcoding data analyzed using a phylogenetic approach. If our hypothesis is correct, then we predict that all the individuals will form a monophyletic clade and fall with other samples belonging to V. planifolia.

To test this hypothesis, we have generated ITS barcode data for individuals in Figure 4.2 and aim at comparing them with data from GenBank (= reference data).

Our analyses will be done using the following methodology:

  1. Produce ITS barcodes for target individuals using PCR and Sanger sequencing (= raw data are provided on Google Drive).
  2. Assembly of reference ITS data available on GenBank using R.
  3. BLAST and RAxML Maximum Likelihood phylogenetic analyses (including bootstrap analysis to assess node supports) to identify species identity of target samples.
  4. Produce figures using R to test working hypothesis.

Finally, we will intersect evidence from all our analyses (especially phylogenetic clustering combined with node supports and taxonomy of ITS sequences from GenBank) to test our hypothesis and infer species identity of our target individuals. Please see section 4.9 for more details our the formatting of the report.

4.9 Writing your report

The instructor provides below information to complete the Mini-Report 3. All the material, data, code and references required to complete this report are provided here and are covered in class. In this context, students will have to focus on formatting their reports following guidelines presented here as well as making sure that their R code is working and ready to be shared.

4.9.1 Apply the IMRAD format and supply R code

Your reports will be structured and organized following the IMMRAD format: Introduction, Material & Methods, Results, and Discussion. This format is widely used to report experimental research in many scientific disciplines. In addition, you will be complementing your reports with an Abstract (see below). Finally, since this research relies heavily on bioinformatics, students will complete their reports by supplying their commented R scripts.

4.9.2 Structure of Mini-Report 3

The instructor provides guidance on content for each section of your individual reports:

  1. Title & author: Please supply a first page containing the title and author information.
  2. Abstract: Provide a short text summarizing the i) objective(s), ii) methods (incl. sampling), iii) most important results and iv) major conclusions and significance of your research (maximum: 300 words).
  3. Introduction: Summarize challenges associated to biodiversity research and benefit of DNA barcoding approach in tackling those issues (e.g. by rapidly identifying species, this approach supports species descriptions, monitoring and development of conservation strategies). Then, move on to talk about vanilla, your model organism, and provide the scientific question and hypothesis investigated in this report (see section 4.7). Finally, finish this section with an overview of the applied methodology used to answer your questions.
  4. Material & Methods: This section should be subdivided into four subsections. Subsections a. and b. are associated to wet-lab analyses, whereas subsections c. and d. correspond to bioinformatic analyses. Please, make sure to cite references supporting your methods and report used bioinformatic tools (incl. references) together with their applied settings. Finally, you can also refer to your R code for further details. The subsections are as follows:
    1. Sampling (see section 4.7).
    2. DNA extraction and sequencing (see section 4.6).
    3. BLAST analysis.
    4. Phylogenetic inference.
  5. Results: Results should be presented in the same order as presented in the Material & Methods section. Tables, Figures and code are reported in this section (by directly embedding them in the text and providing captions). To speed-up the writing of this report, you don’t have to report on results associated to Sampling (a.) and DNA extraction and sequencing (b.) subsections, but rather focus on the BLAST analysis (c.) and phylogenetic inference (d.) subsections.
  6. Discussion: In your discussion, we are expecting you to collate all the evidence presented in the study to answer your scientific questions. In this context, please subdivide this section into three subsections named as follows:
    1. To what species of vanilla do the individuals studied here belong to?
    2. How are those individuals related to each other?
    3. Conclusions and Perspectives.
  7. References: Add a references section containing full citations of references cited in the text. Students can pick their preferred citation format, but references need to be consistently edited in the same fashion.
  8. Appendix: Your R code has to be provided as an Appendix either by directly coping your code in the main document under this section or by providing the name and location of the R script in the Appendix section (e.g. by providing a text similar to this: “All the R code associated to this research can be downloaded at this path.”).

4.9.3 Submission and deadline

Individual reports should be submitted in either Google Docs, Word or Rmarkdown formats (and if you decide to submit an additional file containing your R script, submit this file as an .R format). These files should be uploaded on the shared Google Drive in the Mini_Report_3 folder in a subfolder entitled as follows: Mini_Report_3_Surname.

Upload your reports on the shared Google drive by the 29th of March 2024 before 5 PM MT.

4.10 Project structure and data

To support mastering the learning outcomes, we will be using a subset of a DNA dataset focusing on the vanilla spice (from the Orchidaceae family) published by Ellestad et al. (2022).

The data for this assignment are deposited on the shared Google Drive in a folder entitled DNA_barcoding (located in Mini_Report_3). All students enrolled in this class have been granted access to this folder. The folder is subdivided into four sub-folders (see Figure 4.3) and contains a master spreadsheet Vanilla_samples_records.xlsx at its root:

  • 01_Field_images: jpeg images of samples.
  • 02_Raw_ITS_data_ab1: .ab1 sequence electropherograms of ITS sequences.
  • 03_Processed_ITS_data_FASTA: Folder where cleaned ITS sequences will be saved in FASTA format.
  • 04_Data_analyses: Folder where we will be saving outputs of analyses conducted in this project.
  • Vanilla_samples_records.xlsx contains meta-data information about the samples. We will also update this file with information on species hypotheses.
  • PART2_Vanilla.R is an R script containing code for the part 2 of this mini-report.
Screenshot of the DNA_barcoding folder showing data structure for this project.

Figure 4.3: Screenshot of the DNA_barcoding folder showing data structure for this project.

4.10.1 Data availability

The data availability statement associated with Ellestad et al. (2022) is as follows:

All sequence data for this project are available at the National Center for Biotechnology Information (NCBI) under GenBank ITS sequences ON525161–ON525228, GenBank rbcL sequences ON531917–ON531986, BioProject accession no. PRJNA841950, and BioSample accession nos. SAMN28632719–SAMN28632734. All raw sequence files are available from the NCBI SRA database, nos. SRR19374405–SRR193744012, SRR19374414, SRR19374417, and SRR19374418. DNA alignments are available at Zenodo: https://doi.org/10.5281/zenodo.6577744

4.11 Part 1

4.11.1 Aim

Process and clean ITS DNA sequence electropherograms and infer species working hypothesis.

4.11.2 Bioinformatic tools

The bioinformatic tools used in part 1 are as follows:

  1. FinchTV: A popular desktop software developed by Geospiza, Inc. for viewing trace data from Sanger DNA Sequencing. FinchTV is freely available and operates on Windows and Mac platforms. Start by downloading and installing the software on your computers at this URL: https://digitalworldbiology.com/FinchTV
  2. BLAST: The Basic Local Alignment Search Tool (BLAST, Altschul et al., 1990) will be applied onto cleaned DNA sequences to:
    • Confirm that the DNA sequences correspond to the correct DNA region (here ITS region).
    • Validate that DNA sequences belong to the right taxon (here belonging to the genus Vanilla) and are therefore not contaminated.
    • Provide species working hypotheses using the distance-tree approach implemented on the online version of BLAST.

Although BLAST can be run locally, we will be using the web portal available here: https://blast.ncbi.nlm.nih.gov/Blast.cgi

4.11.3 Overview of the DNA sequence data cleaning procedure

When evaluating .ab1 files (= raw DNA data from an Applied Biosystems’ Sequencing instrument containing an electropherogram showing the Phred scores and the DNA base sequence), you should first see the electropherogram and come to a conclusion whether your data can be considered of good quality or not.

Good quality sequencing data/positions are characterized by:

  • Well-defined peak resolution (bad resolution of the first 10-25 bases is acceptable; see Figure 4.4). The Phred score is displayed at top of the window (see bars on Figure 4.4).
  • Uniform peak spacing.
  • High signal-to-noise ratios.

Bad quality sequencing data/positions are characterized by:

  • Presence of “N”s in the sequence. Indeed, when the base-calling software is unable to accurately identify a nucleotide, it will score it as “N” (meaning that it can be any base; see Figure 4.4). In this tutorial, we will open files individually and search for peaks scored as “N”s. If we can confidently correct those peaks/positions (to either “A/T/C/G or any other IUPAC code; see below) then we will edit the sequence accordingly. To keep track of changes, it is standard procedure to identify edited peaks/positions by using lower cases letters (instead of capital letters as set by default). See below for more details on cases of base polymorphism.

We will be discussing this topic further during class.

Screenshot of FinchTV app.

Figure 4.4: Screenshot of FinchTV app.

4.11.4 IUPAC codes

Nuclear DNA regions such as ITS could show evidence of recombination. This means that there could be polymorphism at a specific base [also know as single-nucleotide polymorphism or SNP; see Poplin et al. (2018) for bioinformatics technics to identify SNPs based on NGS data]. The signature of recombination in an electropherogram would be recognized by the occurrence of “peak under peak” (Figure 4.5). The International Union of Pure and Applied Chemistry (IUPAC) has defined a standard representation of DNA bases by single characters that specify either a single base (e.g. G for guanine, A for adenine) or a set of bases (e.g. R for either G or A). UCSC uses these single character codes to represent multiple observed alleles of single-base polymorphisms (Table 4.1).

Screenshot of DNA sequence electropherogram showing signature of peak under peak suggesting recombination.

Figure 4.5: Screenshot of DNA sequence electropherogram showing signature of peak under peak suggesting recombination.

Table 4.1: IUPAC codes used to score base polymorphism. This is especially useful when analyzing nuclear DNA regions.
IUPAC nucleotide code Base
A Adenine
C Cytosine
G Guanine
T (or U) Thymine (or Uracil)
R A or G
Y C or T
S G or C
W A or T
K G or T
M A or C
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N any base
. or - gap

4.11.5 Step-by-step protocol

Here, we will be using its25-ITSp4.ab1 as an example for the analysis. This file is located in DNA_barcoding/02_Raw_ITS_data_ab1.

  1. Copy the DNA_barcoding/ folder onto your computers.
  2. Launch FinchTV.
  3. Open .ab1 file (one at a time) using the File --> Open... tab or by dragging your .ab1 file in the main window.
  4. Make sure that the Base Position Numbers, Base Calls and Quality Values settings are ticked using the View tab (Figure 4.4).
  5. Trim the first 10-20 bp of the sequence. This is done by selecting bases using the Shift command and then pressing Delete (Figure 4.6).
Screenshot of FinchTV app showing trimming procedure.

Figure 4.6: Screenshot of FinchTV app showing trimming procedure.

  1. Scroll through the sequence and edit “N” peaks/positions by replacing those with lower cases “a/t/c/g”. If you are unable to call the nucleotide, do not edit the position. In the example, position 10 was edited to “g” (Figure 4.7).
Screenshot of FinchTV app showing editing procedure.

Figure 4.7: Screenshot of FinchTV app showing editing procedure.

  1. Peaks at the end of the sequences will become more rounded and harder to call. It is therefore standard procedure to trim the last 10-30 bp/positions. Please trim these bp following the procedure explained above. In the example, the quality drops around position 670 (see Figure 4.8).
Screenshot of FinchTV app showing trimming procedure at the end of the sequence.

Figure 4.8: Screenshot of FinchTV app showing trimming procedure at the end of the sequence.

  1. Save cleaned sequence in FASTA format in 03_Processed_ITS_data_FASTA. Exporting the cleaned sequence is done by pressing File -> Export -> DNA Sequence: FASTA. Please do not rename file, leave it as proposed by FinchTV. The file is saved in interleaved FASTA format as shown in Figure 4.9.
Screenshot of cleaned FASTA sequence as outputted by FinchTV.

Figure 4.9: Screenshot of cleaned FASTA sequence as outputted by FinchTV.

  1. Open fasta file in a text editor and copy DNA sequence (including header starting with >; see Figure 4.10).
  2. Go on the BLAST website by clicking here and copy your DNA sequence as shown in Figure 4.10. Click on the BLAST button to start your query.
Screenshot of BLAST form where you copy content of its25-ITSp4.seq

Figure 4.10: Screenshot of BLAST form where you copy content of its25-ITSp4.seq

  1. Inspect the BLAST output to make sure that the top hits are associated to Vanilla and refer to the right DNA region (Figure 4.11).
Screenshot of BLAST search based on its25-ITSp4. Note that top hits are ITS sequences of Vanilla species.

Figure 4.11: Screenshot of BLAST search based on its25-ITSp4. Note that top hits are ITS sequences of Vanilla species.

  1. Click on the Distance tree of results link to perform phylogenetic distance analysis showing position of your sequence compared to sequences available on GenBank (see Figure 4.11). This will open a new window.

  2. Expand tree to locate your DNA sequence by following procedure in Figure 4.12.

Procedure to expand tree to show position of your DNA sequence in phylogeny.

Figure 4.12: Procedure to expand tree to show position of your DNA sequence in phylogeny.

  1. Use the Zoom toggle to locate your DNA sequence on the tree (see Figure 4.13).
Position of your DNA sequence on tree.

Figure 4.13: Position of your DNA sequence on tree.

  1. Open Vanilla_samples_records.xlsx and update Species_BLAST column with a species name (your first working hypothesis) and add the GenBank accession number of the most closely related DNA sequence deposited on GenBank. Don’t forget to save the file.

  2. Repeat this procedure until you analyzed all the ab1 files.

4.12 Part 2

4.12.1 Aim

DNA sequence retrieval and prepare data for analyses.

4.12.2 Bioinformatic tools

To execute Part 2, you need to install the following software and R packages on your computer:

If you don’t know how to install an R package, don’t worry, this topic is covered here.

4.12.3 R tutorials

Please find below two documents providing a comprehensive introduction to R:

R for beginners (a tutorial by Emmanuel Paradis): https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf An introduction to R: https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf

4.12.4 RStudio

RStudio is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard RGUI, but it is considerably more user friendly. It has more drop-down menus, windows with multiple tabs, and many customization options (see Figure 4.14). Detailed information on using RStudio can be found at at RStudio’s website.

Snapshot of the RStudio environment showing the four windows and their content.

Figure 4.14: Snapshot of the RStudio environment showing the four windows and their content.

4.12.4.1 Editing and Executing Code in RStudio

Please consult this RStudio article to learn more about procedures to edit and execute code in the RStudio environment.

Tip: To execute a line of code and send it to the Console you can press Ctrl+Enter on Windows or Command+Enter on Mac or use the Run toolbar button (see Figure 4.14).

4.12.5 Introduction to R built-in functions

In this course, we will be using built-in R functions that are implemented in packages.

Functions are useful when you want to perform a certain task multiple times. A function accepts input arguments and produces the output by executing valid R commands that are inside the function.

Arguments have associated data types that need to be entered by the user to execute the function and retrieve its output(s).

The basic R data types are as follows:

  • numeric: Numbers, written as either integers or decimals.
  • integer: Whole numbers without any decimal point.
  • character (a.k.a string): A sequence of characters (declared between “” or ’’)
  • logical (a.k.a. boolean): Binary values, TRUE or FALSE.
  • vector: Elements of a vector using subscripts (using this syntax c(1,2,3)).
  • matrix: A matrix from the given set of values (using this function matrix(ncol = 2, nrow = 3)).
  • data.frame: Data frames are data displayed in a format as a table. Data frames can have different types of data inside it. While the first column can be character, the second and third can be numeric or logical. However, each column should have the same type of data.

We store/save outputs of built-in functions in variables. R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. For instance:

# Assign output of simple math to variable
x <- 2+2

# You can call variable as follow
x
## [1] 4

To know the class of data stored in a variable, you can use the class() function as follows:

# What is the class of x
class(x)
## [1] "numeric"

Finally, you can retrieve the documentation associated with a built-in function by using the following syntax:

# Pull up documentation for class function
?class()

4.12.6 Analytical workflow

To infer the ML phylogenetic analysis, the following workflow will be executed:

  1. Expand sampling for phylogenetic analyses by downloading ITS DNA sequences of Vanilla species deposited on GenBank using the following approach:
    • Use functions from the rentrez R package (Winter, 2017) to download DNA sequences and metadata.
    • Build a table with species taxonomy associated to GenBank accessions.
    • Clean the data and save outputs in FASTA and csv formats.
  2. Format and merge individual .seq files from Part 1 into a FASTA object/file using the following approach:
    • Create a list of all .seq files, open them individually and merge them into one object (ultimately saved as a file) using a for loop.
  3. Merge FASTA objects/files from steps 1 and 2 to perform a multiple sequence alignment (msa).

Protocols associated to each step within the workflow are detailed below.

4.12.7 Download and write DNA sequences and retrieve taxonomy

Here, we are demonstrating an approach to search GenBank for ITS sequences of Vanilla species using rentrez (Winter, 2017). Students are tasked to study/examine and execute the code and finally adapt it to search for ITS sequences. The code is subdivided into the following steps:

  1. Build a GenBank query. But prior to that we will load R package and set our working directory.
  2. Retrieve DNA sequences matching query deposited on GenBank.
  3. Fetch meta-data associated to sequences.
  4. Tidy dataset based on meta-data.
  5. Write results in FASTA and csv formats.

4.12.8 Method to download data from Entrez

The following method (or pseudo-code) is applied to download (here referred to as “fetch”) data from Entrez using rentrez (Winter, 2017):

  1. Build GenBank query
    • Function: paste0()
  2. Search GenBank using query to retrieve unique IDs associated to target DNA sequences
    • Function: rentrez::entrez_search()
  3. Fetch GenBank DNA sequences using unique IDs
    • Function: rentrez::entrez_fetch()
  4. Link GenBank database with other databases (e.g., Taxonomy) to retrieve unique IDs of additional data associated with DNA sequences
    • Function: rentrez::entrez_link()
  5. Fetch additional data
    • Function: rentrez::entrez_fetch()

4.12.9 Create an R script

Before delving into the code, do the following:

  1. Launch RStudio.
  2. Create a new .R script (File > R Script).
  3. Save the .R script at the root of your project (DNA_barcoding/) with the following name 01_PART2.R.

All the R code provided below will be copied into your newly saved R script and executed in class together.

4.12.10 Build a GenBank query

This R code uses rentrez functions to interact with GenBank and the nucleotide database to remotely retrieve data based on your query.

###~~~
#Check if package is installed if not then install it
###~~~
if("rentrez" %in% rownames(installed.packages()) == FALSE){
  print("Install rentrez")
  install.packages("rentrez")
  }else{
    print("rentrez is installed!")
}

###~~~
#Load package
###~~~
library(rentrez)

###~~~
#Set working directory
###~~~
#Set working directory to path leading to DNA_barcoding folder
# WARNING: This path as to be adapted to match your computer
setwd("~/Documents/Class_Genomics&Bioinfo_Spring/DNA_barcoding/")

#Check that working directory is set correctly
getwd()

###~~~
#Build a query
###~~~

#Taxon
sp <- "Vanilla"

#DNA region: here ITS
DNA <- "internal transcribed spacer"

#Organism
org <- "Plants"

#Build query: sp AND DNA region
query <- paste0(sp," [All Fields] AND ", DNA," [All Fields] ", org, " [filter]")

4.12.11 Using Entrez bollean operators

Boolean operators provide a way of generating precise queries that produce well-defined sets of results. The Boolean operators used in Entrez and how they work are as follows.

  • AND: Finds documents that contain terms on both sides of the operator terms, the intersection of both searches.
  • OR: Finds documents that contain either term, the union of both searches.
  • NOT: Finds documents that contain the term on the left but not the term on the right of the operator, the subtraction of the right hand search from the one on the left.

Entrez requires the Boolean operator AND to be entered in uppercase. This is not required in all databases for the other two operators, but it is simplest to enter all of them in uppercase:

promoters OR response elements NOT human AND mammals

Entrez processes all Boolean operators in a left-to-right sequence. Enclosing individual concepts in parentheses changes this priority. The terms inside the parentheses are processed first as a unit and then incorporated into the overall strategy. For example, in the following search statement, the union of response element and promoter results is generated first and then is intersected with the result of the g1p3 search.

g1p3 AND (response element OR promoter)

4.12.12 Retrieve GenBank DNA sequences matching query

###~~~
#Retrieve DNA accessions in GenBank
###~~~
GBresults <- rentrez::entrez_search(db = "nuccore", term = query, retmax = 50000)

#How may hits did we get
print(GBresults$count)

4.12.13 Fetch meta-data associated to sequences

Even when you execute a query directly on the GenBank portal, there will always be sequences that do neither match your target taxon (here Vanilla) nor your target DNA region. In this context, it is paramount to retrieve meta-data associated to the DNA accessions (stored in GBresults) in order to clean-up your dataset prior to analyses (see next step).

For each DNA sequence the following meta-data are gathered using functions implemented in rentrez:

  • GenBank DNA accession number.
  • Taxonomy.
  • Sequence definition line as displayed on GenBank.
  • Sequence length (in bp).
  • DNA sequence.

Please see the R code below for more details on the procedure to retrieve the meta-data.

Disclaimer: The R code below might stop because you do not have an API key registered to NCBI and the system might time you out. If it is the case, you will have to edit the for loop to pursue downloading the data.

###~~~
#Fetch meta-data associated to sequences
###~~~

#Use loop to automatically retrieve species, 
# seq definition line, seq length and DNA sequence associated 
# to each DNA accession

#Create empty matrix to be populated
OUT <- matrix(ncol = 5, nrow = length(GBresults$ids))
colnames(OUT) <- c("GenBankID", "Species", "Definition", "Seq_length", "Sequence")

#Add GenBank ID
OUT[,1] <- GBresults$ids
print("Processing sequences: fetching meta-data")
#Set a progress bar
pb <- txtProgressBar(min = 0, max = length(GBresults$ids), style = 3)
for(i in 1:length(GBresults$ids)){
  #Wait time to avoid being timed out by NCBI
  # but it still might happen because you don't have an API key
  Sys.sleep(5)
  
  #Print iteration number to assess progress
  # This info will help reset the loop if your are timed out
  print(paste("Iteration", i, "of", length(GBresults$ids), sep= ' '))
  
  #Download sequence
  seq <- entrez_fetch(db = 'nuccore', id = GBresults$ids[i], rettype = 'fasta', retmode = "text")
  
  #Extract definition line
  OUT[i,3] <- strsplit(seq, split = "\n")[[1]][1]
  
  #Infer seq length
  nbp <- strsplit(seq, split = "\n")
  OUT[i,4] <- as.numeric(length(strsplit(paste(nbp[[1]][2:length(nbp[[1]])], 
                                               collapse=''),"")[[1]]))
  
  #Extract sequence
  OUT[i,5] <- as.vector(paste(nbp[[1]][2:length(nbp[[1]])], collapse = '')[1])
  
  #Fetch taxon ID associated to GenBank accessions
  taxID <- entrez_link(dbfrom = 'nuccore', id = GBresults$ids[i], db = 'taxonomy')

  #Extract taxonomy: genus and species epithet
  tmp <- strsplit(
    strsplit(entrez_fetch(db = 'taxonomy', id = taxID$links, rettype = "native"), 
             split = '\n')[[1]][1]
    , split = ' ')
  
  OUT[i,2] <- paste(tmp[[1]][2:length(tmp[[1]])], collapse = ' ')
  
  # update progress bar
  setTxtProgressBar(pb, i)
}
close(pb)

SEQ <- as.data.frame(OUT)

###~~~
#Write raw data in 04_Data_analyses
###~~~
FileIDRawcsv <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "Raw_GenBank.csv", sep = '_')
#Write FASTA file
write.table(SEQ, 
            paste("04_Data_analyses/CSV/", 
                  FileIDRawcsv, sep = ''), row.names = F, col.names = T, quote = T)

Let’s have a look at the data downloaded from GenBank:

## [1] "GenBank query retrieved 185 DNA sequences."
##    GenBankID            Species
## 1 1708599397 Vanilla planifolia
##                                                                                                                                                                                                                                                                                                      Definition
## 1 >MN221421.1 Vanilla planifolia voucher PDBKTMA2013-1860 external transcribed spacer, partial sequence; small subunit ribosomal RNA gene, internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and large subunit ribosomal RNA gene, partial sequence
##   Seq_length
## 1       9463
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Sequence
## 1 CGCGGCTGAGGGCAACGCCACGCCGCGCGGGGCGTGCGGTCGTGGCCAATGATTAGGCGCGCGCGGCGCGCCGTGCAGGGCACAACGTTGCAACCCCGGGCGCGCGGTCGGGGGCACCGCGCCCCTGCGCACCGCCGCGGGCACCGCGCACCTGCGTGCGGTCCTGTGCACCGCGCAGGTTGGTTATGTTGGCTGATGAGGGGACAAAAAAGTGCAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCTTCTGCACGGATGCGGCGACAACCAATTGGTAGTCTTGCCTCGGCGCATGATGGCATCGTCGCAAAAAAAAAAATCCGAGTTGCGGGCGTGTGCATCGCGCGCCTGACCCGGGCGTGGGCACCGCGCTCCGGCGTGGGGGCATGTTCATCGAGCTCCTGCTTGCGTGCTTTCGAACCGCGCTCCTGTTGGCTTGGCTATCCGCGCGTTCTCGGCGACTAGGTGGGCCGGCATCGTTGCGCTGCAGAACAGAGCAACTTTCGTCGGCATCTTTAGCCCTGTGTTCTCGCGAGGGGACCCAAAAGTGTAACTTTTTTTGGGATTTTTAGCGCCGCGGGCTGCTTCTGCTTGGATATGGTGACAACCCATTGGTAGTCTGCCTCGTCCCAAGATGCGATTGTCGCAATACACGATCGGAAATGTGGKCATGTTCATAACGCACCTGCGGGCGGGCATGTGCACGGCGCGCCGGCSCGTGGGCATTGGCACCGCGCTCCCGCGGGCCGTGGAGTGCACCGCCCCCCGCGCGCGGCGCGCGTCCGTGGGCACGGCTCGGGTGCGTGCGGCGCGACGTGAATGGCACCGCACCGCCGCGCGGATCGTCGATGGCACCGCGCCGCCGCGAGGGTCGCGTGGTCGTGTGCACCTCGCAGGCGGGCGCGGCGCGCCGTGGATTGAACCGCACCGTCGCCCGGGTCGCGCGGCGGTGTGCACCTCGGAGGCTCGCGCCGCGCTTGGAGCGCGATCGAGGGCGCCGCGCCGCGGCGCGCGGTCGTGCGACCCGCGTAAGCGCGCGAGGCGCGTCGGCGCAGGCACCGCGCCCCTACGCGCCCGGCGCGGCCGCGGGCACCGCGCGTCTGGGTGCTGGCCTTTGCGCCTCGCTCCTGCGCACGGGCATGTGCACCGCGCCCCAGCGCGCGGGCATCTCCCCGCGCCCCAGTGCGTGTGCATGTGGACTGCTCTCCCGCGCGCGGCAGCGTGCAACGCGCTCCTGCTGTTTTTTTTGTGGAGTGCGATGTGGTGCACTGTCTGAACTGCTGCCCTCTTCCATCTGCTTTGGGCTTTGTTTCTTAGTGGTGGTTCGTTTTGTCAATTTGGTTGGAGAAATGTGCAAACTCTTTTCGATGTTATGTCATATCTCCGGAGGGCCAATGTAGGGATGAGGTGTTGCTCAAATCGTGTCGTTTAAGCGGTGCAATTTTTGCTTGGCTCTTTGTCGGTTGGGCGTGATTTTCGCTTTGCATGTGGTGCGAGGTATCGCTTGTCGTGCTATGCGGGTGTTAGCCAATGCCTTCATCCCTAATTAGACCCTTTGCGCTTCCTGTTTTAAACATCGTTGAGGCAATTTGGGCTAATGGAGTTGGATTGGGGAGATGCCGTTTTTGCACAAGAGATGTTTGTTGTGGCGCTACTCTGTATGGCTTGGTCACCTCGTGGTGTGGTCTGTCAATACATGAGTAGGGCTCTGTCACTTCGTTGAACCTTCATACGGAGGAAGTCCGCATGTTGTTTTGCCTTTCTATCGGATGGGTAGGTTCCGTTCTATTATTTTTTATGGCGCCGCACGACTACGGTGTTGGAATGGTGGCGTTACGCGATGCCGTCGAACGCATGATCGTACATGCTCCCCTGTTGGCAAGGGCTCGGAAATCACACGTTTCGGTCATATCGTTCTCCAACAAGGAGGGTTGTCCCTGAGGAATGTTATTTGTCGAGAGGCAATGTTTCGGAACTTGGGCGATGGTATCCCAACCTGGGAGCATGCGCTCTTTCCTTTGTGGAGTATCGCCAGACACGACGAACATGTCAAATTACGACATGGGGTTTTGCTTGGCGTTTCATCATTGTATTGCTACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAGTATGAACAATTTCAGACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTGTTTGATGGTACTTGCTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCACCAAACCCCGACTTTTGGAAGGGATGCATTTATTAGGTAAAAGGTCGACGCGGGCTTTTGCCCGGCTCCTTGACGATTCATGATAACTTGTCGGATCGCACGGCCTTCGTGCCGGCGATGCATCATTCGAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGGGGCCTACCATGGTGGTGACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAGACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAAATAACAATACCGGGCTCCACGAGTCTGGTAATTGGAATGAGTACAATCTAAATCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGACCTTGGTTTGGGTCGGTCGGTCCGCCTTTTGGTGTGCACCGCCCGCCCTGATCCTTTTGTCGACGATGCGGTCTGGCCTTAGCTGGCCGGGTCGTGCCCTCGGCGTTGTTACTTTGAAGAAATTAGAGTGCTCAAAGCAAGCCCACGCTCTGGATACATTAGCATGGGATAACATCACAGGATTTCGATCCTATTGTGTTGGCCTTCGGGATCGGAGTAATGATTAAGAGGGACAGTCGTGGGCATTCGTATTTCATAGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACCACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGCTCGAAGACGATCAGATACCGTCCTAGTCTCAACCATAAACGATGCCGACCAGGGATTGGCGGATGTTGCTCTTTGGACTCCGTCAGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAGCTTACCAGGTCCAGACATAGCAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCTCAGCTTGCTAACTAGCTATGCGGGGTGCAAGCCCTGTGGCCAGCTTCTTAGAGGGACTATGGCCGCTTAGGCCATGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGTATTCAACGAGTCCATTGCCTTGGTCGAAAGGCCTGGGTAATCTTATGAAAATTTCATCGTGATGGGGATAGATCATTGCAATTGTTGGTCTTCAACGAGGAATTCCTAGTAAGCGCGAGTCATCAGCTCGCGTTGACTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAATGGTCCGGTGAAGTGTTCGGATCGCTGCGATGCGGGCGGTTTGCCGCGTGCGACTCGGCGAGAAGTCCACTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGACGAGAGGTTCGAACGATGGAACGATCTGTCCAACGCGTGGGAGTGCGACGGTTGTTCGATGTCGCGTTCTTTCGTAGCGCGTGCTCTCGCGATCACGCGGAGCTCGACGTTATGGGGGATAAACAAAAGCTTATGGGCGTGGTCAGGCGCCAAGGGAGAGCAAATGTTAGGCTGGCAAACGAGTATGCCGTGCGTCGTCAGGTCCATTGAGTCCTGGCAATCGAACGACTCTCGACAACGGATATCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATACGTGTTGTGAATTGTAGAATCCCGTGAACCATCCATTTTTTGAACGCAAGTTGCGCCCGAGGATGCAAGCCGAGGGCACGCCTGCATGGGCGTAATGCGACCCGTCGCTCCCTGCGGAAGCCTGGAATCTTTCGGTTTGGTTCCTCGGCCCCGTCGCGGAGGCGATTGATGGCACCCCGTGCGATGGCACGGCGTGTTGAAGCTTGGGGCGACGGTCGACTTCGACACGATACGAGGTGGACGCCACTGGCTGTCGTGGTGTTGGCCAGCAAGAATCGATGTTGTTGTGCGACGAGCAGATGCCCCGCATAGATCCTGCTCCGTCCTCCACGGTGTGGAATCGTGACCCCATGTTAGGTGAGGCTACCCGCCGAGTTTAAGCATATAAATAAGCGGAGGAGAAGGAACTTACAAGGATTCCCTTAGTAACGGCGAGCGAAACGGGACCAGCCCAGTTTGGAAATCGGGCAGCCTGAAGCCTGAATTGTAGTCTGGAGAGGCGTCCTCAGCGACGGATCGGGATCAAGTCCCCTGGAAAGGGGCGCCGGGGAGGGTGAGAGCCCCGTTCGGCCCGTACCCTGCTGCTACACGAGGCGCCGTCAACGAGTCGGGTTGTTTGGGAATGCAGCCCAAATTGGGTGGTAAATTCCGTCCAAGGCTAAATAATTGCGAGAGACCGATAGCGAACAAGTACCGCGAGGGAAAGATGAAAAGGACTTTGAAAAGAGAGTCAAAGAGTGCTTGAAATTGTCGGGAGGGAAGCAGATGGGGGCCGGCGGTGCGCCTCGGCTGGATGCAGAACGTCGAATGACGGTTTGCTGCACGGCTCGAGGAGCGGACCGTCTCGGGCCATCGCGGCGACCGGAGCCCGGGCGCACGTCGCTCGTGGAGAAATCGTCGGCGTGGCCGATCGCAATGCCCGCGCCATCGAGGCGTGCCACGCGGCACCGCGTGCATTGGTGATGGCCAGTGGGCTCCCCATCTGACCCGTCTTGAAACACGGACCAAGGAGTCTGACATGCTTGCGAGTCGACGGGTGGGCAAGCCCGGAAGGCGCAAGGAAGCTGATTGGTGGGATCCCCGTTTGAGGGGTGAGTCATCGACCGACCCAGATCTTTTGTGAAGGGTTCGAGTGAGAGCATGCCTGTCGGGACCCGAAAGATGGTGAACTATGCCTGAGCGGGGCGAAGCCAGAGGAAACTCTGGTGGAGGCCCGCAGCGATACTGACGTGCAAATCGTTCGTCTGACTTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTTCGAAGTTTCCCTCAGGATAGCTGGAGCCACAGTCGGAGTTCTATCGGGTAAAGCCAATGATTAGAGGCATCGGGGGCACAATGCCCTCGACCTATTCTCAAACTTTAAATAGGTAGGAGGGCTCGGCTGCTTTGATGAGTTGAGCCAAGGAATCCAGGCTCCAAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAAAGCCGGGTTACGGTGTCCAACTGCGCGCTAACCTAGAGCCCACAAAGGGTGTTGGTCGATTAAGACAGCAGGACGGTGGTCATGGAAGTTGAAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCCGAAAATGGATGGCGCTTAAGCGTGCGACCCACACCTGGCCGTCGGTGCAATGCAAGGCCCCGACGAGTAGGAGGGTGCAACGGTCGCTGCAAAACGTGGGGCGTGAGCCCGCGTGGAGCGTCTGTTGGTGCAGATCTTGGTGGTAGTAGCAAATATTCAAATGAGAACTTTGAAGGCCGAAGGGGGGAAAGGTTCCATGTGAACGGCACTTGCACATGGGTTAGCCGATCCTAAGAGACGGGGGAAACCCGTCGGATAGTGCCATTCGGCACGAGCTTCGAAAGGGAGTCGAGTTAAAATTCTCGAGCCGGGATGTGGCGGTCGATGGCAACGTCAAGTGGTCCGGAGACGTCGGTGGGGGCCTCGGGAAGAGTTATCTTTTCTGCTTAACAGCCCATCGGCCCTGGAAACGGCTCAGCCGGAGGTAGGGCCAAGCGGTTGGAAGAGCACCGCATGTTGAGCGGTGTCCGATGCGCCCCTGGCGACCCTTGAAAATCCGGATGGCCGAGTGCCTTCCACGCCCGGTCGTACTCATAACCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGATGGAGCAATGTAGGCAAGGGAAGTCGGCAAAATGGATCCGTAACTTCGGGAAAAGGATTGGCTCTGAGGGTTGGGCACGGGGGTCCCAATCCCGAACCCATCGGCTGTCGGCGGACTGCTCGAGCTGCTTGCGCGGCGAGAGCGGGTCGACGCGTGCCGGCTGGGGGACGAATTGGGAGCGGCCCCTTCGGGGGCCTTCCCCGGGCGTCGAACAATCGACTCAGAACTGGTACGGACATGGGGAATCCGACTGTTTAATTAAAACAAAGCATTGCGATGGTCCTCGAGGATGCTCACGCAATGTGATTTCTGCCCAGTGCTCTGAATGTCAAAGTGAAGAAATTCAACCAAGCGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTAACGAGATTCCCACTGTCCCTGTCTACTATCCAGCGAAACCACAGCCAAGGGAACGGGCTTGGCAGAATCAGCGGGGAAAGAAGACCCTGTTGAGCTTGACTCTAGTCCGACTTTGTGAAATGACTTGAGAGGTGTAGCATAAGTGGGAGTCGGCGTGCCGACGGAATTGAAATACCACTACTTTTAACGTTATTTTACTTATTCCGTGAGACGGAGGCGGGGCCCAGCCCCTCCTTTTGGCTCCAAGTCTCGCCTCGGCGAGTCGATCCGGGCGGAAGACATTGTCAGGTGGGGAGTTTGGCTGGGGCGGCACATCTGTTAAAAGATAACGCAGGTGTCCTAAGATGAGCTCAACGAGAACAGAAATCTCGTGTGGAACAAAAGGGTAAAAGCTCGTTTGATTCTGATTTCCAGTACGAATACGAACCGTGAAAGCGTGGCCTATCGATCCTTTAGGCTTTCAGAATTTGAAGCTAGAGGTGTCAGAAAAGTTACCACAGGGATAACTGGCTTGTGGCAGCCAAGCGTTCATAGCGACGTTGCTTTTTGATCCTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGTGTTGGATTGTTCACCCACCAATAGGGAACGTGAGCTGGGTTTAGACCGTCGTGAGACAGGTTAGTTTTACCCTACTGATGAATGTGTCGTGATAGTAATTCAACCTAGTACGAGAGGAACCGTTGATTCACACAATTGGTCATCGCGCTTGGTTGAAAAGCCAGTGGCGCGAAGCTATCGTGTGTAGGATTATGACTGAACGCCTCTAAGTCAGAATCCACGCTAAGATGCGACGCTTAGGCCTATCGTTCGCCTGTTGGCCTAAAGTAGGGGCTTGGCCCCCAAGGGCACGCGACCACGGGCTAGTCTGGTGTCATAGAAGGTTGGATGGCATGGGCCCCGTAAGAAATGATAGTTAAGAACGAACGATGGGTAGAATCCTTTGCAGACGACTTAAATTTGCGACGGGGCATTGTAAGTGGCAGAGTGGCCTTGCTGCCACGATCCACTGAGATCCAACCCTATGTTGCAGTAGATTCGTCCCCTTCTGCCGCAGAACACGAAGAGCTTGGTTTTTGGTTTTCTAAAGGGGGACCAAGCTCGCTTGCTCTACAGCTCTACTTATTGCGGTACTTTGTGCGGGTCCTGTGCAACTGGCCTTGCGATGCGTCCCTCGGTGGCCCTACCAATGTGTAAGGCTTAAGCCTGACATTGCACGATTCAGTTATCACAAACGTGAAATAGCTATGAAATGCTGAAAAAGGGTTTGAAAAATATCTGCTGGCTCTGTCTACGTGCACATAGAGGCTTGTCTGCACGCACTTTGGGGCTTGTTTGCGTGCACAGCAGCAACCTACGCGCGCTCCATGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGATTTTCAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAACCTATTGGTAGTCTTACCTTGTGCTGCGTGCTGTCTGTGCACCGGGCCTGTTTGCGTGCACGGCAGCAACCTACGCGCGCTCCGTGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAACCTATCGGTAGTCTTGCCTCAGCCCAAGATGGCATTGTCGCAAAACACGATCCGGATGGCGTGCATGGGCATCACGCTCCTGCACGCGGGCATGCGCATCCCGCCCGCCTGCGCCGCGCTCCTGCGTGGGGGAAAGTGTGCTTCTCGCTCCTGCGTGTTGGCCTATCCCCGCTTCTTCGGTCGCAAGGTGCGTCCCCCGCATCATGGTCTTGGAGAATCCTGTAACTTTCACCGAGATCATTAGCTATGCGGTCTTCATGAGTGGACCAAAAAGTGTAACTTTTTTCGGGATTTTTAGCGCTGCGGGCTGCGTCTGCATGGATGCGGTGGCAACCTATTGGTAGTCTTGCCTCTTCCCGAGGTGCGACCATCGCAATACACGATCCGGATGGCGTGCATGTGCATCACGCTCCAGCGATTGGGCCGGTGCGGCGATCCGTCGCGGGCACCGTTCCCTTGCTCGGAACGCGCGGCCGTGGGCACCGCGCTCTCGCGCTCGGTCCTGTGCGCCGCGCAAGTTAGCTCTGCGGGCTCGTGAGGGGACCGAAATTTGTAACTTTTTTCCGGATTTTTAGGGCCGCGGGCTGCTTCAGCATGGTTACGGTGACAACCTATTGGTAGTATTTGCACGTCCCGAGATGCGATCGTCTCAAAACACGTTCCGGATGGCGGTCGTGGTCATCACGCTCCTGCGTGTGGGCAAGCGCGCCGCGCGGCCCCCCGACGCACGAGCGCGCGGCGCGATCATTGGCTCTCCGGTCTCCACGAGGGGACAAAAAAGTGCAACTTTTTTCGGGATTTTCAGCGCTGCGGGCTGCTTCTGCATGGATGCGGTGACAGCCTTTTGGTAGTCTTACCTTGTGCACCGGGCCTGTTTGCGTGCACCGCTGCAACCAACGCGCGCTCCATGTTGGCACCAGAAAAGTGTAACTTTTTTCGGGACCTTTCAGCGCGGCGGG

4.12.14 Reload raw GB query in csv

If you need to restart coding from the previous point, please execute the code below, which will load the Raw GB data from the CSV file saved in 04_Data_analyses/CSV.

To do so, do the following:

  • Download the csv file produced by the instructor on Google Drive at this path: DNA_barcoding/04_Data_analyses/CSV/Instructor_files/Vanilla_internal transcribed spacer_2024_02_08_Raw_GenBank.csv
  • Save the file under the right path in DNA_barcoding.
###~~~
#Load Raw GB query csv file
###~~~
#If you have saved the SEQ file and need to restart (from SEQ), execute this code
# --> Reload csv file from 04_Data_analyses/CSV/

#Adjust the file name based on your data
SEQfileName <- "Vanilla_internal transcribed spacer_2023_02_14_Raw_GenBank.csv"

#Load the csv in the environment
RawGBDat2 <- read.csv(paste0("04_Data_analyses/CSV/", SEQfileName), quote = "\"", sep = ' ')

#Change name of object to make it compatible with code
SEQ <- RawGBDat2

4.12.14.1 Question

Write code to determine how many DNA sequences where downloaded from GenBank?

To answer this question, use SEQ as input.

4.12.15 Tidy dataset based on meta-data

Here we apply filters to discard DNA sequences that are:

  • Not belonging to the genus Vanilla OR to the target DNA barcode (here the ITS region).
  • Contaminated. Previous analysis showed that the V. mexicana ITS sequence is corrupted. We are therefore discarding it from our dataset.
  • Identified at genus level. Since these sequences cannot be used as reference to identify species and infer relationships with confidence.
  • Either too short (< 500 bp) or too long (>= 1000 bp).

Finally, we are preparing/formatting data for the production of the FASTA file.

###~~~
#Tidy dataset
###~~~

##
#1. The search retrieved sequences that are NEITHER ITS, NOR Vanilla
##
# Use grep to search for internal (more used than ITS) in definition
# Use grep to search for Vanilla in species
gene <- SEQ[grep("internal", SEQ$Definition),]
gene <- gene[grep("Vanilla", gene$Species),]

##
#2. Previous analysis showed that the V. mexicana sequence is contaminated/corrupted.
##
# We are therefore discarding it from our dataset
gene <- gene[-which(gene$Species == "Vanilla mexicana"),]

##
#3. Discard DNA sequences identified at genus level
##
#Create vector with names of taxa in dataset
taxaVan <- unique(as.vector(gene$Species))

#Find DNA accessions identified at genus level
# and discard them
# 1. All species matching "sp."
spVan <- taxaVan[grep("sp.", taxaVan)]
# 2. Exclude those that have "subsp." since they are accurately identified
spVan <- spVan[-grep("subsp.", spVan)]
# 3. Subset gene to only keep DNA sequences identified at species level
gene <- subset(gene, !(gene$Species %in% spVan))

##
#4. Discard DNA sequences < 500 OR >= 1000 bp
##
gene <- gene[-which(as.numeric(as.vector(gene$Seq_length)) < 500 | as.numeric(as.vector(gene$Seq_length)) >= 1000),]

###~~~
#Prepare dataset for FASTA format
###~~~
# FASTA first line contains GenBank ID and species. 
# Want these fields to be separated by "_" 
# and need to include those in species field
gene$Species <- gsub(" ", "_", gene$Species)

4.12.15.1 Questions

Q1: Write code to tell us how many DNA sequences were discarded through your filtering analyses?

Q2: Write code to determine how many taxa of Vanilla are in your sampling?

Q3: Write code to determine if your sampling is skewed towards specific taxa? If yes, which ones?

4.12.16 Write results of query in FASTA and csv formats

We are using R to generate a FASTA file with the DNA sequences from our GenBank query. To do that we are concatenating information from 3 columns in the gene object: - gene$GenBankID: Contains unique GenBank ID for DNA sequence. - gene$Species: Contains species taxonomy associated to sequence. - gene$Sequence: Contains the DNA sequence.

###~~~
#Create FASTA
###~~~
FASTAGB <- paste(paste(">", as.vector(gene$GenBankID), "_", 
                         as.vector(gene$Species), sep = ""), 
                   as.vector(gene$Sequence), sep = '\n')

###~~~
#Write FASTA & CSV (incl. meta-data) files
###~~~

##File name FASTA
FileIDFASTA <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank.fasta", sep='_')
#Write FASTA file in DNA_barcoding/04_Data_analyses/
write.table(FASTAGB, 
            paste("04_Data_analyses/FASTA/", 
                  FileIDFASTA, sep = ''), row.names = F, col.names = F, quote = F)

#File name CSV
FileIDcsv <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank.csv", sep='_')
#Write CSV file
write.table(gene, 
            paste("04_Data_analyses/CSV/", 
                  FileIDcsv, sep=''), row.names = F, col.names = T, quote = T)

Let’s have a look at the tidy data:

## [1] "After cleaning 148 DNA sequences remain."
##    GenBankID              Species
## 1 1789804163 Vanilla_trigonocarpa
##                                                                                                                                                                                                Definition
## 1 >MN902067.1 Vanilla trigonocarpa voucher W. Stern s.n. internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
##   Seq_length
## 1        617
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Sequence
## 1 AGAGGCGTGAATGATGGAACGATCTGTCCAACATGTGGGAGTGCGACAGTTCGATGTCGCCTTCTTCCGTAGCGCGTGCTCTTGCTTCGACGTGGAGCTCGACGCTACGGGGGATAAACAAAAGCTTATGGGCGTTGTCTGGCGCCAAGGGAGAGCAAATGTTCAAGCTGGCAAACGAGTGTGTTGTCGTCAGGTCCATTGAGTCCTGGCAATCGAACGACTCTCGACAACGGATATCTTGGCTCTCGCATCGATGAAGAACGCAGCTTGAAATGCGATACGTGTTGTGAATTGTAGAATCCCGTGAACCATCCATTTTTTGAACGCAAGTTGCGCCCGAGGATGCAAGCCGAGGGCACGCCTGCATGGGTGTAATGCGACCCGTCGCTCCTTGCGGAAGGCTGGAATCTTTGGTTTGGTTCCTCGTCCCCGTTGTGGAGGCGATTGATGGCACCCCGTGCAATAGCATGGCGTGTCGAAGTGTGGGGCGACGGTCGACTGTCGACATGATAAGAGGTGGGCAGCCACCGGCTGTTGTGGTGTTGGCCAGCAATAATCGATGTTGTCGTGCGACAAGCAGGTGCCCCGCATAGATCCAACTCCGTCCTCGATGGTGT

4.12.17 Format and merge individual .seq files from Part 1 into a FASTA file

In this section, we are focusing on developing an R code to format and merge our clean sequences from Part 1 by using the following approach:

  1. Establish a list of all .seq files in folder 03_Processed_ITS_data_FASTA/ using list.files().
  2. Open/read files individually using readLines(),
  3. Convert format of FASTA file from interleave to sequential.
  4. Infer reverse complement DNA sequences to be formatted in the same manner as GenBank sequences. For more details on this protocol click here.
  5. Merge individual FASTA objects using rbind().
  6. Write output FASTA object into file using write.table().

Please notice that steps 2 to 5 will take place within a for loop.

###~~~
#List of .seq files
###~~~

#Please adapt path to your working directory/project
Files <- list.files("03_Processed_ITS_data_FASTA/", pattern = '.seq', full.names = T)

###~~~
#Format and merge all files into one file
###~~~

#This is done by using a loop
seqOUT <- NULL
for(i in 1:length(Files)){
  #Read FASTA
  tmp <- readLines(Files[i])
  #Extract and concatenate DNA sequence
  DNAseq <- paste(tmp[2:length(tmp)], collapse = '')
  #Infer complementary sequence
  DNAseqcomp <- unname(sapply(strsplit(DNAseq,"")[[1]], switch,  "A"="T", "T"="A","G"="C","C"="G"))
  #Infer reverse and complement sequence (and collapse into one element)
  DNAseqcomprev <- paste(rev(DNAseqcomp), collapse='')
  #Edit FASTA to have only one object/line (=sequential format)
  tmp <- paste(tmp[1], DNAseqcomprev, sep='\n')
  
  #Merge objects
  seqOUT <- rbind(seqOUT, tmp)
}

###~~~
#Write data
###~~~

#File name
FileIDfastaSeq <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "msa_input.fasta", sep='_')
#Write file (= input for msa analysis)
write.table(seqOUT, file = 
              paste("04_Data_analyses/FASTA/", 
                    FileIDfastaSeq, sep = ''), col.names = F, row.names = F, quote = F)

Please check that the format of your output file is as expected using a text editor (see Figure 4.15).

Screenshot of merged FASTA file used as input for msa analysis.

Figure 4.15: Screenshot of merged FASTA file used as input for msa analysis.

4.12.18 Merge FASTA objects/files to perform multiple sequence alignment

Here, we are aiming at merging FASTA outputs to produce the input of the multiple sequence alignment. This step is pretty straightforward and involves using the c() function and writing the output.

The objects containg the GenBank DNA sequences and your DNA sequences are FASTAGB and seqOUT, respectively.

###~~~
#Start by merging datasets (GenBank and your newly produced seq.)
###~~~

#If your FASTA objects are still in R
# merge FASTA into one object
FASTAall <- c(FASTAGB, seqOUT)

#Else, you will have to load fasta files using readLines() before merging them

###~~~
#Write data
###~~~

#File name
FileIDfastaALL <- paste(sp, DNA, gsub("-", "_", Sys.Date()), "GenBank_seq_msa_input.fasta", sep='_')

#Write file (= input for msa analysis)
write.table(FASTAall, file = 
              paste("04_Data_analyses/FASTA/", 
                    FileIDfastaALL, sep=''), col.names = F, row.names = F, quote = F)

4.12.18.1 Question

Why would you need to use readLines() instead of read.csv() to open a FASTA file?

4.13 Part 3

4.13.1 Aim

DNA multiple sequence alignment and phylogenetic inference.

4.13.2 Bioinformatic tools

To execute Part 3, you need to install the following software and R packages on your computer:

If you don’t remember how to install an R package, don’t worry, this topic was covered here.

4.13.3 Analytical workflow

To infer the ML phylogenetic analysis, the following workflow will be executed: implemented:

  1. Conduct the msa analysis using the MUSCLE algorithm (Edgar, 2004). This algorithm is implemented in MEGA. The input data for this analysis is Vanilla_internal transcribed spacer_2021_02_09_GenBank_seq_msa_input.fasta, which is stored in 04_Data_analyses/FASTA/.
  2. Check and manually edit msa. This will be done in MEGA.
  3. Infer ML phylogenetic tree based on msa file. This will be done using the RAxML algorithm (available on a web platform).
  4. Visualize phylogenetic tree and interpret results.

4.13.4 Conduct msa analysis

4.13.4.1 Disclaimer

The analysis described below relies on MEGA; however you might be experiencing issues with the .fasta file outputted by this program (especially on Windows operating system). These issues might compromise downstream analyses, more specifically the RAxML inference and R code used to generate figures. Those issues are associated to changes in file encryption and changing formatting of samples names (i.e. replacing “_” by ” “). In the event that it happens to you, please switch over and use the online MUSCLE platform available at this URL: https://www.ebi.ac.uk/Tools/msa/muscle/

WARNING: When you submit your job on the portal, do not forget to select Pearson/FASTA as output file format in step 2. In addition, the output will only be made available in a window and you will have to copy and past the whole content in a new file using your favorite text editor. The file should be saved and named as detailed in the step 8 of the section below.

4.13.4.2 Step-by-step protocol

To conduct a multiple sequence alignment in MEGA do the following:

  1. Launch MEGA.
  2. Load your input file (Vanilla_internal transcribed spacer_2021_02_09_GenBank_seq_msa_input.fasta) by clicking File -> Open A File/Session... and selecting the right file.
  3. The program will ask you How would you like... and you will then click on the Align button. This will open a new window with your data (corresponding to an unaligned DNA matrix; see Figure 4.16).
Screenshot of DNA matrix in MEGA.

Figure 4.16: Screenshot of DNA matrix in MEGA.

4.To start a MUSCLE analysis do has shown in Figure 4.17 (Alignment -> Align by MUSCLE. A window will open saying Nothing selected for alignment. Select all?, click the OK button. Further details on the MUSCLE algorithm can be found in Edgar (2004).

Screenshot showing how to start a MUSCLE analysis in MEGA.

Figure 4.17: Screenshot showing how to start a MUSCLE analysis in MEGA.

  1. A window showing settings associated with the analysis will appear as shown on Figure 4.18. Please use the parameters set by default and click OK to start the analysis. The analysis should take ca. 5-10 minutes to complete.
Screenshot showing MUSCLE settings in MEGA.

Figure 4.18: Screenshot showing MUSCLE settings in MEGA.

  1. Once the analysis is completed, you will be able to inspect and edit the alignment. Please see Figure 4.19 for an example. Notice the gaps (-) that had to be incorporated to accommodate for differences in DNA sequences between samples. Your DNA sequences of Vanilla are at the bottom of the file.
Screenshot showing MUSCLE alignment in MEGA.

Figure 4.19: Screenshot showing MUSCLE alignment in MEGA.

  1. Save the analysis in FASTA format by clicking Data -> Export Alignment -> FASTA Format. This procedure is also shown in Figure 4.20.
Screenshot showing how to export MUSCLE alignment into a FASTA format in MEGA.

Figure 4.20: Screenshot showing how to export MUSCLE alignment into a FASTA format in MEGA.

  1. Save the file in 04_Data_analyses/FASTA/ and adjust file name as follows: _GenBank_seq_msa_output.fasta. This procedure is shown in Figure 4.21. Make sure to have your file extension set as .fasta otherwise you won’t be able to load your file during the RAxML analysis.
Screenshot showing how to save file in MEGA.

Figure 4.21: Screenshot showing how to save file in MEGA.

4.13.5 Infer ML phylogenetic tree based on msa file

The RAxML algorithm will be used to conduct the ML phylogenetic analysis and infer node statistical supports using the bootstrap procedure (Kozlov et al., 2019). This algorithm has been implemented on a web service platform accessible at this URL: - https://raxml-ng.vital-it.ch/#/

WARNING: If the RAxML web service platform is down, the instructor will run the analysis locally (on a Linux computer). Please click here for more details on the procedure.

To perform the ML phylogenetic analysis:

  1. Go on the RAxML portal.
  2. Select the output of the MEGA analysis (*_GenBank_seq_msa_output.fasta) stored in 04_Data_analyses/FASTA/ as input data.
  3. For the rest of the settings use default parameters with the exception of the Bootstrapping settings (see below). Settings have to be set as shown in Figure 4.22.
Screenshot showing Bootstrapping settings for the RAxML analysis.

Figure 4.22: Screenshot showing Bootstrapping settings for the RAxML analysis.

  1. Finally, don’t forget to provide your email before submitting the analysis. The analysis should take 1-5 hours to run. You will receive an email to download results or you can access results by using your unique URL.

  2. When your analysis is completed, download results (see Figure 4.23) and save it in 04_Data_analyses/Phylogenetic_analyses/.

Screenshot of RAxML portal showing results of your analysis. The best ML phylogenetic tree is also displayed.

Figure 4.23: Screenshot of RAxML portal showing results of your analysis. The best ML phylogenetic tree is also displayed.

  1. Unzip the file and inspect output files (use a text editor to open files). Descriptions of key files is provided here:
    • raxmlArg.txt: Contains RAxML command line used to conduct analysis.
    • result.raxml.bestModel: Contains estimated parameters for model used for ML analysis.
    • result.raxml.bestTree: Best ML tree in newick format.
    • result.raxml.bootstraps: Bootstrap trees in newick format.
    • result.raxml.support: Best ML tree (same tree than in result.raxml.bestTree) with node supports (inferred from data in result.raxml.bootstraps) in newick format.
    • sequenceAlignment.fasta: Your input aligned FASTA file.
  2. You can open result.raxml.support in FigTree to visualize it and locate the four samples of vanilla from Mexico. We will learn how to process phylogenetic trees in R in the next section.

4.13.6 Run RAxML locally

Connect to Linux computer:

ssh svenbuerki@XX.XX.XXX.XXX

Navigate to the right folder:

cd /media/SeaGate/Mini_Report_3/RAxML_analysis/

Run the RAxML analysis locally:

This analysis is done on the output of the MUSCLE analysis Vanilla_internal\ transcribed\ spacer_2024_02_20_GenBank_seq_msa_output.fas

raxmlHPC-PTHREADS -T 20 -f a -m GTRGAMMA -p 12345 -o 1789804116_Vanilla_inodora -x 12345 -# 100 -s Vanilla_internal\ transcribed\ spacer_2024_02_20_GenBank_seq_msa_output.fas -n Vanilla_ML.tre

Arguments - Short definitions

  • -T : number of threads
  • -m : molecular model to build the phylogenetic tree
  • -o : outgroup taxon to root analysis
  • -p and -x : seed numbers for analysis
  • -# : number of bootstrap analyses
  • -s : input file in FASTA format
  • -n : output file

Output files

  • All 1000 bootstrapped trees : RAxML_bootstrap.Vanilla_ML.tre
  • Best-scoring ML tree : RAxML_bestTree.Vanilla_ML.tre
  • Best-scoring ML tree with support values : RAxML_bipartitions.Vanilla_ML.tre (FILE FOR NEXT ANALYSES)
  • Best-scoring ML tree with support values as branch labels : RAxML_bipartitionsBranchLabels.Vanilla_ML.tre

Files on Google Drive

The output files are available on Google Drive at this path:

DNA_barcoding > 04_Data_analyses > Phylogenetic_analyses > RAxML_analysis

Download the folder on your computer and save it under the same path.

4.13.7 Visualize phylogenetic tree and interpret results

4.13.7.1 Learning outcomes

  • Learn the newick syntax to represent phylogenetic trees
  • Produce figures of phylogenetic trees using R functions implemented in the ape package

4.13.7.2 Introduction to Newick tree format

The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses. We owe this format to the British mathematician Arthur Cayley. Here we will be providing an overview of the Newick format using the R package ape (Paradis et al., 2004).

To produce the “dummy” phylogenetic tree displayed in Figure 4.24, the following syntax should be applied:

Rooted tree showing relationsip between 4 species.

Figure 4.24: Rooted tree showing relationsip between 4 species.

4.13.7.2.1 Code to produce your first phylogenetic tree

Before starting, create a new R script entitled 02_Part3_exercises.R saved at the root of DNA_barcoding.

###~~~
#Check if package is installed if not then install it
###~~~

if("ape" %in% rownames(installed.packages()) == FALSE){
  print("Install ape")
  install.packages("ape")
  }else{
    print("ape is installed!")
}

###~~~
#Load package
###~~~
library(ape)

###~~~
#Set working directory
###~~~
#Set working directory to path leading to DNA_barcoding folder
# WARNING: This path as to be adapted to match your computer
setwd("~/Documents/Class_Genomics&Bioinfo_Spring/DNA_barcoding/")

#Check that working directory is set correctly
getwd()

###~~~
#Your first tree in Newick format
###~~~
tr1 <- "(A._speciosa,(B._dimorpha,(C._elegans,D._viridis)));"

###~~~
#Plot your tree with ape package
###~~~
plot(ape::read.tree(text=tr1))

Some considerations associated to the syntax:

  • The tree ends with a semicolon (;).
  • The bottommost node in the tree is an interior node, not a tip.
  • Interior nodes are represented by a pair of matched parentheses (()). Between them are representations of the nodes that are immediately descended from that node, separated by commas (,).
  • Tips (or samples) are represented by their names. A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.
  • Because you may want to include a blank in a tip name, it is assumed that an underscore character (_) stands for a blank; any of these in a name will be converted to a blank when it is read in (see code above and Figure 4.24 for more details).

Branch lengths (representing e.g. molecular substitutions, which can be turned into time in a dated phylogenetic tree) can be incorporated into a tree by putting a real number, with or without decimal point, after a node and preceded by a colon (:). This represents the length of the branch immediately below that node. Thus, the above tree might have lengths represented as in Figure 4.25.

###~~~
#Tree in Newick format with branch lengths
###~~~
# Here we added branch length values using ":" and inserting value
tr2 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0):1):4.0);"

###~~~
#Plot your tree with ape package
###~~~
plot(ape::read.tree(text=tr2))
Rooted tree showing relationsip between 4 species with branch lengths.

Figure 4.25: Rooted tree showing relationsip between 4 species with branch lengths.

Now, we can finally add node supports (inferred using e.g. the bootstrap approach in the case of the vanilla Maximum Likelihood analysis conducted here) by simply adding values after each closing parenthesis. This is done as follows and displayed in Figure 4.26:

###~~~
#Tree in Newick format with branch lengths and node supports
###~~~
# Here we added branch lengths and bootstrap values
tr3 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0)60:1)90:4.0)100;"

###~~~
#Plot your tree with ape package
###~~~
# Note that to display node labels, we have to set show.node.label = TRUE 
plot(ape::read.tree(text=tr3), show.node.label = TRUE)
Rooted tree showing relationsip between 4 species with branch lengths and node supports.

Figure 4.26: Rooted tree showing relationsip between 4 species with branch lengths and node supports.

4.13.7.3 phylo class and phylogenetic tree in R

Here, we will be introducing the phylo class implemented in R packages dealing with phylogenetic analyses. To study this topic, we will be using the phylogenetic tree with branch lengths and node supports (see Figure 4.26).

###~~~
#Tree in Newick format with branch lengths and node supports
###~~~
tr3 <- "(A._speciosa:2.0,(B._dimorpha:4.2,(C._elegans:2.0,D._viridis:3.0)60:1)90:4.0)100;"

###~~~
#Create and load tree
###~~~
tr <- ape::read.tree(text=tr3)
tr
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   A._speciosa, B._dimorpha, C._elegans, D._viridis
## Node labels:
##   100, 90, 60
## 
## Rooted; includes branch lengths.
###~~~
#Check class
###~~~
class(tr)
## [1] "phylo"

phylo class objects (here tr) are lists allowing to access multiple attributes associated to the phylogenetic tree. Three of these lists are especially useful to us:

  • tr$tip.label: Vector with tip labels.
  • tr$node.label: Vector with node labels.
  • tr$edge.length: Vector with branch lengths.

Here, are the details with our example:

###~~~
#To access tip labels
###~~~
tr$tip.label
## [1] "A._speciosa" "B._dimorpha" "C._elegans"  "D._viridis"
###~~~
#To access node labels
###~~~
tr$node.label
## [1] "100" "90"  "60"
###~~~
#To access branch lengths
###~~~
tr$edge.length
## [1] 2.0 4.0 4.2 1.0 2.0 3.0

4.13.7.4 Challenge

Using this knowledge, draw and code a phylogenetic tree with 8 samples (tips) containing branch lengths and node supports

This challenge is done in small groups and each group will take turn to (i) draw their phylogenetic tree on the white board, and (ii) code it using the newick syntax.

Tip: To make sure that your phylogenetic tree is correct, code it and execute it in R using the ape functions shown above.

4.13.7.5 Vanilla phylogenetic tree

In this section, we will be analyzing the output of the RAxML analysis focusing on result.raxml.support, which contains the phylogenetic tree with node supports (here obtained through the bootstrap method). The file is stored in 04_Data_analyses/Phylogenetic_analysis/RAxML_analysis.

Our analyses are divided into seven steps:

  1. Load ITS tree using the ape::read.tree() function.
  2. Discard bootstrap values < 50% since these nodes are not statistically supported. This is done by finding nodes matching this criterion (in ITS$node.label) and replacing their values by nothing ("").
  3. Rename tips to facilitate reading. Since all accessions belong to same genus, we will replace “Vanilla” by “V.”.
  4. Root phylogenetic tree with outgroup taxon (using the ape::root() function). Analyses conducted by Paige (Ellestad et al., 2022) showed that Vanilla inodora (represented by the sample 1789804116_Vanilla_inodora) is a suitable outgroup taxon.
  5. Ladderize and plot phylogenetic tree (using ape::ladderize() and ape::plot.phylo() functions; see Figure 4.27). We will also color in red the tips associated with your vanilla samples to improve readability.
  6. Extract subtree containing our vanilla samples (using ape::getMRCA() and ape::extract.clade() functions) and plot tree (this time using the radial plotting mode; see Figure 4.28). To better visualize your vanilla samples, we will plot a red circle next to their tips (using the ape::tiplabels() function) and rename them based on Figure 4.2.
  7. Export phylogenetic tree from step 6 in pdf format.
4.13.7.5.1 Code to generate figures of the phylogenetic tree

Before starting, create a new R script entitled 03_Part3_PhyloFig.R saved at the root of DNA_barcoding.

###~~~
#1. Load Vanilla RAxML ITS tree
###~~~
#Please adjust path based on your computer (= delete "Data/")
ITS <- ape::read.tree(file="Data/04_Data_analyses/Phylogenetic_analyses/RAxML_analysis/RAxML_bipartitions.Vanilla_ML.tre")

###~~~
#2. Discard bootstrap values < 50%
###~~~
#Find nodes with low statistical supports
ITS$node.label[which(as.numeric(ITS$node.label) < 50)]
##   [1] "32" "5"  "0"  "18" "9"  "1"  "6"  "0"  "0"  "12" "29" "2"  "26" "33" "26"
##  [16] "32" "14" "14" "19" "11" "0"  "3"  "1"  "1"  "7"  "4"  "15" "16" "15" "5" 
##  [31] "19" "10" "13" "7"  "3"  "34" "16" "3"  "2"  "2"  "9"  "31" "19" "28" "6" 
##  [46] "3"  "9"  "24" "4"  "3"  "7"  "30" "24" "34" "30" "16" "17" "28" "23" "39"
##  [61] "44" "38" "19" "45" "28" "28" "7"  "5"  "22" "45" "31" "35" "34" "38" "6" 
##  [76] "6"  "27" "1"  "0"  "6"  "1"  "5"  "0"  "15" "3"  "42" "16" "7"  "11" "10"
##  [91] "21" "35" "1"  "0"  "0"  "0"  "1"  "5"  "5"  "26" "23" "25" "38" "9"  "16"
## [106] "22" "21" "2"  "1"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0" 
## [121] "1"  "20" "28" "37"
#Replace/overwrite values by ""
ITS$node.label[which(as.numeric(ITS$node.label) < 50)] <- ""

###~~~
#3. Rename tips: Vanilla by V.
###~~~
#Replace/overwrite values
ITS$tip.label <- gsub("Vanilla", "V.", ITS$tip.label)

###~~~
#4. Root phylogenetic tree with outgroup taxon
###~~~
#Re-root tree with suitable sample
ITS <- ape::root(ITS, outgroup = "1789804116_V._inodora", resolve.root=T)
#This procedure add a node label "Root", which needs to be discarded.
ITS$node.label[which(ITS$node.label == "Root")] <- ""

###~~~
#5. Ladderize and plot tree by sorting nodes
###~~~
#Ladderize
ITS <- ape::ladderize(ITS)

#Color tips: your samples in red
tipcol <- rep("black", length(ITS$tip.label))
#Find which tips are your samples and replace color by red 
tipcol[grep("-ITSp4", ITS$tip.label)] <- "red"

#Plot
ape::plot.phylo(ITS, cex=.3, tip.color = tipcol, show.node.label = TRUE, align.tip.label = T, no.margin = TRUE)
RaxML ITS rooted phylogenetic tree of species of Vanilla.

Figure 4.27: RaxML ITS rooted phylogenetic tree of species of Vanilla.

###~~~
#6. Extract subtree and plot tree
###~~~
#Find node corresponding to MRCA of our vanilla samples
MRCAsamples <- ape::getMRCA(ITS, tip=ITS$tip.label[grep("-ITSp4", ITS$tip.label)])

#Extract subtree
VanITS <- ape::extract.clade(ITS, node=MRCAsamples)

#Rename vanilla samples using apply and switch as learned before
VanITS$tip.label[grep("-ITSp4", VanITS$tip.label)] <- unname(sapply(VanITS$tip.label[grep("-ITSp4", VanITS$tip.label)], switch,  "its25-ITSp4"="PE25", "its50-ITSp4"="PE50","its44-ITSp4"="PE44","its49-ITSp4"="PE49"))

#Color tips: your samples in red
tipcolVan <- rep("black", length(VanITS$tip.label))
#Find which tips are your samples and replace color by red 
# ^ means that the search has to begin with PE
tipcolVan[grep("^PE", VanITS$tip.label)] <- "red"

#Plot tree in radial mode
ape::plot.phylo(VanITS, cex = .4, use.edge.length = F, tip.color = tipcolVan, type = "radial", label.offset = 0.04, no.margin = TRUE)
#Add bootstrap supports
ape::nodelabels(text = VanITS$node.label, adj = c(0.5,0.5), frame = "none", cex = 0.5)
#Add circles next to your vanilla samples
ape::tiplabels(tip = grep("^PE", VanITS$tip.label), pch = 16, col = "red", offset = 0.02)
RAxML ITS phylogenetic tree with focus on clade containing all accessions of vanilla studied here.

Figure 4.28: RAxML ITS phylogenetic tree with focus on clade containing all accessions of vanilla studied here.

###~~~
#7. Export tree from step 6 in pdf format 
###~~~
#Use the pdf function to create pdf
# Adjust path for your computer (= delete "Data/")
pdf("Data/04_Data_analyses/Phylogenetic_analyses/RAxML_analysis/VanSubTree.pdf")
#Plot tree in radial mode
ape::plot.phylo(VanITS, cex = .4, use.edge.length = F, tip.color = tipcolVan, type = "radial", label.offset = 0.04, no.margin = TRUE)
#Add bootstrap supports
ape::nodelabels(text = VanITS$node.label, adj = c(0.5,0.5), frame = "none", cex = 0.5)
#Add circles next to your vanilla samples
ape::tiplabels(tip = grep("^PE", VanITS$tip.label), pch = 16, col = "red", offset = 0.02)
#This function closes the pdf
dev.off()
## quartz_off_screen 
##                 2

4.13.8 Discussion

Click on the button below to inspect the final Vanilla phylogenetic tree (in pdf format) and answer this scientific question:

What species do the samples of vanilla analyzed here belong to (see Figure 4.2) and why?

Note: It is not sufficient to provide taxonomic names for the analyzed samples, you have to motivate why you are advocating for these names.

References

Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410. Available at: http://www.sciencedirect.com/science/article/pii/S0022283605803602.
Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2005. GenBank. Nucleic Acids Research 33: D34–D38.
Buerki, S., P.P. Lowry II, J. Munzinger, M. Tuiwawa, A. Naikatini, and M.W. Callmander. 2017. Alectryon vitiensis: A new species of sapindaceae endemic to fiji. Novon: A Journal for Botanical Nomenclature 25: 421–429.
CBOL. 2021. International barcode of life: DNA barcoding. Available at: https://ibol.org/about/dna-barcoding/.
Cheng, T., C. Xu, L. Lei, C. Li, Y. Zhang, and S. Zhou. 2016. Barcoding the kingdom plantae: New PCR primers for ITS regions of plants with improved universality and specificity. Molecular Ecology Resources 16: 138–149. Available at: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12438.
Consortium, T.G.O. 2020. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Research 49: D325–D334. Available at: https://doi.org/10.1093/nar/gkaa1113.
Consortium, T.U. 2014. UniProt: a hub for protein information. Nucleic Acids Research 43: D204–D212. Available at: https://doi.org/10.1093/nar/gku989.
Dentinger, B.T., and L.M. Suz. 2014. What’s for dinner? Undescribed species of porcini in a commercial packet. PeerJ 2: e570.
DeSalle, R., and P. Goldstein. 2019. Review and interpretation of trends in DNA barcoding. Frontiers in Ecology and Evolution 7: 302. Available at: https://www.frontiersin.org/article/10.3389/fevo.2019.00302.
Edgar, R.C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32: 1792–1797.
Ellestad, P., M.A.P. Farrera, F. Forest, and S. Buerki. 2022. Uncovering haplotype diversity in cultivated mexican vanilla species. American Journal of Botany 109: 1120–1138. Available at: https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/ajb2.16024.
Ellestad, P., F. Forest, M. Serpe, S.J. Novak, and S. Buerki. 2021. Harnessing large-scale biodiversity data to infer the current distribution of Vanilla planifolia (Orchidaceae). Botanical Journal of the Linnean Society. Available at: https://doi.org/10.1093/botlinnean/boab005.
Gonçalves, P.F., A.R. Oliveira-Marques, T.E. Matsumoto, and C.Y. Miyaki. 2015. DNA barcoding identifies illegal parrot trade. Journal of Heredity 106: 560–564.
Hollingsworth, P.M., L.L. Forrest, J.L. Spouge, M. Hajibabaei, S. Ratnasingham, M. van der Bank, M.W. Chase, et al. 2009. A DNA barcode for land plants. Proceedings of the National Academy of Sciences 106: 12794–12797. Available at: https://www.pnas.org/content/106/31/12794.
Kanehisa, M., M. Furumichi, Y. Sato, M. Kawashima, and M. Ishiguro-Watanabe. 2022. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research 51: D587–D592. Available at: https://doi.org/10.1093/nar/gkac963.
Kozlov, A.M., D. Darriba, T. Flouri, B. Morel, and A. Stamatakis. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35: 4453–4455. Available at: https://doi.org/10.1093/bioinformatics/btz305.
Kumar, S., G. Stecher, M. Li, C. Knyaz, and K. Tamura. 2018. MEGA x: Molecular evolutionary genetics analysis across computing platforms. Molecular biology and evolution 35: 1547–1549.
Larsen, B.B., E.C. Miller, M.K. Rhodes, and J.J. Wiens. 2017. Inordinate fondness multiplied and redistributed: The number of species on earth and the new pie of life. The Quarterly Review of Biology 92: 229–265.
Marx, V. 2023. Method of the year: Long-read sequencing. Nature Methods 20: 6–11. Available at: https://doi.org/10.1038/s41592-022-01730-w.
Masters, J.C., and L. Pozzi. 2017. Phylogenetic inference. In The international encyclopedia of primatology, 1–6. American Cancer Society. Available at: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119179313.wbprim0419.
Paradis, E., J. Claude, and K. Strimmer. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289–290.
Poplin, R., P.-C. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, et al. 2018. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36: 983–987. Available at: https://doi.org/10.1038/nbt.4235.
Quinto, C.A., R. Tinoco, and R.S. Hellberg. 2016. DNA barcoding reveals mislabeling of game meat species on the US commercial market. Food Control 59: 386–392.
R Core Team. 2016. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org/.
Satam, H., K. Joshi, U. Mangrolia, S. Waghoo, G. Zaidi, S. Rawool, R.P. Thakare, et al. 2023. Next-generation sequencing technology: Current trends and advancements. Biology 12: Available at: https://www.mdpi.com/2079-7737/12/7/997.
Stamatakis, A. 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics (Oxford, England) 30: 1312–1313. Available at: https://doi.org/10.1093/bioinformatics/btu033.
Telfer, A.C., M.R. Young, J. Quinn, K. Perez, C.N. Sobel, J.E. Sones, V. Levesque-Beaudin, et al. 2015. Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity data journal.
Wheeler, Q.D. 1999. Why the phylogenetic species concept?—elementary. Journal of nematology 31: 134.
Winter, D.J. 2017. rentrez: An r package for the NCBI eUtils API. The R Journal 9: 520–526.