This webpage contains information related to the chapters covered in the Genomics & Bioinformatics course at BSU. The instructor will provide links to download the presentations as well as tutorials to complete in-class exercises (see e.g. Chapter 4).
The syllabus can be accessed here. In addition, you can download the presentation slides (in pdf
format) here.
This chapter provides the theoretical background information (incl. definitions; see Lexicon) necessary to support material taught in this course. We will review information on DNA and RNA that are important for genomic projects, especially to support genome sequencing, assembly and annotation.
The key learning outcomes of this chapter are:
This chapter is mostly based on the following resources:
The presentation slides (in pdf
format) of Chapter 1 can be downloaded here.
This chapter focuses on learning about next-generation sequencing and students will have to produce individual mini-reports on a sequencing technology (see below).
This chapter is mostly based on the following resources:
Note: These publications will be very helpful to complete Mini-report 1.
The following resources are helpful to complete Mini-report 1 and support reading scientific publications (Journal Club):
The presentation slides (in pdf
format) of Chapter 2 can be downloaded here.
As part of this chapter, students are tasked to to produce individual mini-reports on the following sequencing platforms and their associated technologies:
More details on this assignment is available here.
Warning: For this assignment, students are not allowed to use AI to complete their work.
To become more familiar with NGS technology (incl. jargon) and its application to assemble and annotate reference genomes, we are reading a publication on the Giant panda (Li, Fan, et al., 2009) for our first journal club (available on Google Drive).
The presentation used by the instructor to summarize the publication and initiate a group discussion is available here.
This chapter is mostly based on the following resources:
The presentation slides (in pdf
format) of Chapter 3 can be downloaded here.
To support reproducibility, research institutions, funders and publishers are requiring scientists to support their research by
To provide more insights into these practices here is an example of data policy from the G3 journal and a data availability statement from the sagebrush genome (Melton et al., 2022).
Here the goal is to get familiar with NCBI by doing the following exercises:
To have a full overview of available molecular biology databases and assess their contributions to genome annotation, students are tasked to produce individual mini-reports on the following molecular databases:
More details on this assignment is available here.
Before getting into conducting bioinformatics analyses to de novo assemble a draft nuclear genome, we will devote time to learn computing procedures. Please click here to access this material.
In this chapter, we aim at training students in producing a draft nuclear genome for a non-model organism using Illumina data.
The whole-genome shotgun (WGS) sequencing dataset published by Zhang et al. (2017) on the orchid species Apostasia shenzhenica is used as case-study. This dataset provides an opportunity to gain firsthand experience in analyzing NGS data. Zhang et al. (2017) have also produced RNA-Seq data, which were used in their study to support genome annotation. Zhang et al. (2017) is available in our shared Google Drive.
The chapter is subdivided into three parts:
Before you start working on Chapter 4, please make sure that you have completed the tutorials available here.
The two presentations (in pdf
format) associated to material presented in Chapter 4 can be downloaded here:
Details on the WGS PE Illumina library studied here are provided in Figure 6.1. The SRA accession number is SRR5759389 and the most important information to know about this data are that the fragments insert-size of the library is 180 bp and that the length of each PE read is 90 bp. Note that the authors have performed a round of PCR amplification during their library preparation. This step is know to potentially introduce errors in reads, which could be identified by performing k-mer analyses.
The number of bases shown in Figure 6.1 was inferred as follows: \(\text{N bases} = \text{N. spots * reads length (bp)}\). In this example, \(\text{N bases} = 84.1e6 * 180 (90 + 90) = 15.1e9bp = 15.1Gbp\).
You can also obtain a rough estimate of haploid genome coverage (x) by using the following equation: \(\text{Raw haploid genome coverage (x)} = \text{N bases / Genome size (haploid)}\). In this example, \(\text{Raw haploid genome coverage (x)} = 15.1e9 / 471.0e6 = 32x\). The estimation of haploid genome size was taken from Zhang et al. (2017).
As you go along this document and perform analyses, please COPY ALL COMMAND LINES into an .Rmd
document saved in a folder entitled Report/
(create this folder in your working directory; see below). Remember to comment your script (using #s). This will greatly help you in repeating your analyses or using parts of your code to create new scripts. Enjoy scripting!
Files for this chapter are deposited on each group Linux computer under the following path, which is shared between all the users on the computer: /home/Genomics_shared/Chapter_04/
.
There are three folders in this project:
01_SRA/
: This folder contains the SRA
file (SRR5759389_pe12.fastq
) downloaded from the SRA
database. This file is very large, >35GB!02_Kmers_analyses/
: This folder contains the structure that will be used for this tutorial. Each student will have to copy this folder onto the account prior to starting the analyses.03_Output_files/
: This folder contains all the outputs files. Students can look at these files to help solve potential coding issues.To support remotely connecting to the Linux computers and retrieving files (from the Linux computer to your personal computer), please make sure that you have installed the following software on your personal computer before pursing with this tutorial:
If you are working from outside of the University network (= off campus) on your personal computer (e.g., you are not connect to the eduroam WiFi), then you will need to use the BSU VPN (Virtual Private Network) to connect to the BSU secured network prior to remotely connecting to your Linux computer account (using ssh
protocol as taught in class; see Tutorials).
The BSU VPN software and instructions are available at this URL: https://www.boisestate.edu/oit-network/vpn-services/
The overarching objective of this tutorial is to gain theoretical and bioinformatics knowledge on the steps required to prepare PE Illumina reads for de novo nuclear genome assembly as well as infer genome size and complexity (see Figure 6.2).
The five main steps associated to the objectives of PART 1 are as follows (see Figure 6.3):
fasqt
format file. This latter format (where both reads are paired and combined into one file) is the input for most bioinformatic programs.FastQC
(Andrews, 2012).Before starting coding, each student has to copy the 02_Kmers_analyses/
folder project located in /home/Genomics_shared/Chapter_04
in their ~/Documents/
folder.
To do that, please execute the following commands in a Terminal
window:
Remotely connect to your computer account using ssh
Start a new tmux
session and rename it Part1
Execute the code below to copy the data on your account:
#Navigate to /home/Genomics_shared/Chapter_04
cd /home/Genomics_shared/Chapter_04
#Copy all files in 02_Kmers_analyses/ in ~/Documents/ (This takes ca. 10 minutes to complete)
cp -r 02_Kmers_analyses/ ~/Documents/
#Navigate to ~/Documents/02_Kmers_analyses/ to start the analyses
cd ~/Documents/02_Kmers_analyses/
Use this resource to:
Which command can be applied to list the content of
02_Kmers_analyses/
?
What is in
02_Kmers_analyses/
?
How much space this folder takes on your computer?
Although all the bioinformatic tools (software) necessary to complete this tutorial are already installed on your Linux computers, URLs to their repositories (incl. documentations) are provided below together with details on their publications (when available). Software are sorted by steps in our analytical workflow (see Figure 6.3):
fastq-dump
: https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/.FastQC
(Andrews, 2012): https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.seqtk
: https://github.com/lh3/seqtk.khmer
(Crusoe et al., 2015): https://github.com/dib-lab/khmer.jellyfish
(Marçais and Kingsford, 2011): http://www.genome.umd.edu/jellyfish.html.GenomeScope
(Vurture et al., 2017): http://qb.cshl.edu/genomescope/.If you are remotely connecting to the Linux computers please install FileZilla to support file transfer. Please see section 6.8.6.2 for more details on how to set up remote connection for transferring files. Finally, the R
software is also used in steps 4 and 5. You can find more details on this software in the Syllabus and Mini-Report webpages.
The SRA format (containing raw NGS data) can be quite difficult to play with. Here, we aim at downloading the WGS data, split PE reads, but store both reads (R1 and R2) in the same file using the interleaved format (Figure 6.3). This format is very convenient and is the entry point for most reads cleaning programs. This task can be done by using fastq-dump
from the SRA Toolkit.
!!! DON’T EXECUTE THIS CODE !!!
The command to download the SRA file is available below, but please don’t execute the command since the file weighs >35GB and it takes a LONG time to download.
#Download PE reads (properly edited) in interleave format using SRA Toolkit
# !! DON'T EXECUTE THIS COMMAND !!
fastq-dump --split-files --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' -Z SRR5759389 > SRR5759389_pe12.fastq
Open a Terminal, connect to your Part1
tmux
session and navigate to the /home/Genomics_shared/Chapter_04/01_SRA
folder using the cd
command. This folder contains the SRA
file (SRR5759389_pe12.fastq
) that we will be using in this tutorial.
Use this resource to:
Identify a command to show the first 10 lines of
SRR5759389_pe12.fastq
What is the format of
SRR5759389_pe12.fastq
and why do you reach this conclusion?
To determine the quality of the raw Illumina data, we are conducting analyses to evaluate the following criteria:
By intersecting results provided by these analyses, we can determine best-practices to conduct reads cleaning (incl. removal of contaminants) and determine whether our Illumina data are suitable for genomic analyses.
The program FastQC
is used to obtain preliminary information about the raw Illumina data (see Figure 6.1). To perform this analysis do the following:
tmux
session~/Documents/02_Kmers_analyses/FastQC
using cd
01_SRA/SRR5759389_pe12.fastq
SRR5759389_pe12_fastqc.html
#Reads QCs with FastQC
./FastQC_program/fastqc -o ~/Documents/02_Kmers_analyses/FastQC/ /home/Genomics_shared/Chapter_04/01_SRA/SRR5759389_pe12.fastq
This step will have to be repeated after completion of the step3 (reads cleaning) to confirm that our reads trimming and cleaning procedure was successful. To inspect SRR5759389_pe12_fastqc.html
click here.
In small groups, inspect the output file and answer the following question:
What conclusions can you draw on the quality of the raw data based on the FastQC analysis?
Note: Go over each category outputted by the FastQC
analyses and determine why these metrics were inferred and how they help you drawing conclusions on your data.
At this stage, you will need to transfer SRR5759389_pe12_fastqc.html
on your local computer to be able to inspect the results. This can be done using i) the scp
protocol (but it only works on Unix-based operating systems) or ii) the sftp
protocol as implemented in FileZilla (this works on all operating systems).
scp
protocolWARNING: This protocol works only on UNIX OS (Mac, Linux).
You can use the scp
(Secure Copy Protocol) command to copy SRR5759389_pe12_fastqc.html
or any other file to your local computer. This can be done as follows:
Open a Terminal
window on your computer and navigate (using cd
) where you want to store the target file.
Type this command line:
#General syntax
scp USER@IP:PATH_ON_REMOTE_COMPUTER PATH_ON_YOUR_COMPUTER
#Example:
# To copy SRR5759389_pe12_fastqc.html on my computer from bio_11 user do:
scp bio_11@132.178.142.214:~/Documents/02_Kmers_analyses/FastQC/SRR5759389_pe12_fastqc.html .
sftp
protocolsftp
is a SSH File Transfer Protocol that can be used to support file exchange between remote computers and your personal computer. This protocol is implemented in FileZilla and other free software.
To copy the output of the FastQC
analysis on your computer do the following:
Open FileZilla on your personal computer and establish remote connection with Linux computer as shown in Figure 6.4. Once the connection is established:
SRR5759389_pe12_fastqc.html
) is located on the remote computer and select it.This protocol will have to be replicated at different stages of this tutorial.
This step is at the core of our analytical workflow. It is paramount to conduct thorough reads cleaning when assembling a nuclear genome due to the very nature of this genome (i.e. repetitive elements, heterozygosity, recombination, etc…). In this case, aim at minimizing the effect of PCR errors, which could jeopardize our assembly by creating false polymorphism. In addition, de novo genome assembly is computing intensive and by properly cleaning reads we will dramatically decrease RAM requirements.
Reads will be cleaned/trimmed based on (see Figure 6.3):
Finally, we will format the clean data for de novo genome assembly, which will be conducted using SOAPdenovo2
(Luo et al., 2012).
A k-mer
is a substring of length \(k\) in a string of DNA bases. For a given sequence of length \(L\), and a k-mer size of \(k\), the total k-mer’s possible (\(n\)) will be given by \((L-k) + 1\). For instance, in a sequence of length of 9 (\(L\)), and a k-mer length of 2 (\(k\)), the number of k-mer’s generated will be: \(n = (9-2) + 1 = 8\).
All eight 2-mers of the sequence “AATTGGCCG” are: AA, AT, TT, TG, GG, GC, CC, CG
In most studies, the authors provide an estimate of sequencing coverage prior to assembly (e.g. 73 fold in the case of the giant panda genome, Li, Fan, et al., 2009), but the real coverage distribution will be influenced by factors including DNA quality, library preparation and local GC content. On average, you might expect most of the genome (especially the single/low copy genes) to be covered between 20 and 70x. One way of rapidly examining the coverage distribution before assembling a reference genome is to chop your cleaned sequence reads into short “k-mers” of 21 nucleotides, and count how often you get each possible k-mer. By doing so, you will find out that:
Please find below the plot of k-mer frequencies inferred from the trimmed library studied here (SRR5759389
), which suggests that the haploid genome of Apostasia shenzhenica is sequenced at ca. 25x (Figure 6.5). The methodology used to infer this graph is explained in Step 4.
The peak around 25 in Figure 6.5 is the coverage with the highest number of different 21-mers. This means that there are \(1.4e^7\) unique 21-mers (frequency) that have been observed 25 times (coverage). The normal-like distribution is due to the fact that the sequencing did not provide a perfect coverage of the whole genome. Some regions have less coverage, whereas others have a little more coverage, but the average coverage depth is around 25.
The large number of unique k-mers (\(1.7e^7\)) that have a frequency of 1 (right left side of the graph; Figure 6.5) is most likely due to PCR errors. Please find below an example explaining how PCR errors impact on the k-mer procedure.
All 3-mers of the sequence “AATTGGCCG” are:
Now, let’s consider that the 4th letter (T
) in the sequence above is replaced with a C
to simulate a PCR error. In this context, all 3-mers of this sequence “AATCGGCCG” are:
The k-mers in bold are the incorrect 3-mers that are now unique and end up at the beginning of the graph in Figure 6.5.
Overall, the general rule is that for a given sequence, a single PCR error will result in \(k\) unique and incorrect k-mers.
Please follow the approach below to:
Note: Perform the analyses in your tmux
session entitled Part1
.
Let’s start the cleaning of our interleaved PE reads by trimming reads using Phred scores (33) as implemented in seqtk
.
Before executing the commands below do the following:
tmux
session~/Documents/02_Kmers_analyses/khmer
using cd
SRR5759389_pe12.fastq
SRR5759389_pe12.trimmed.fastq
#Trim reads based on Phred scores (default of 33)
# --> This code uses the raw Illumina data deposited on our shared folder
seqtk trimfq /home/Genomics_shared/Chapter_04/01_SRA/SRR5759389_pe12.fastq > SRR5759389_pe12.trimmed.fastq
Using Phred scores to clean our data is not going to deal with PCR errors and potential effects of high repeated DNA sequences on the de novo assembly. Below, we will use information on k-mer distributions to address these issues and clean our dataset accordingly.
In the next commands, khmer
(Crusoe et al., 2015) will be used to:
To conduct these analyses, run the following commands in your Part1 tmux
session (you should be in ~/Documents/02_Kmers_analyses/khmer
):
SRR5759389_pe12.trimmed.fastq
SRR5759389_pe12.max100.trimmed.fastq
# Normalize high coverage reads (>100x)
khmer normalize-by-median.py -p --ksize 21 -C 100 -M 1e9 -s kmer.counts -o SRR5759389_pe12.max100.trimmed.fastq SRR5759389_pe12.trimmed.fastq
SRR5759389_pe12.max100.trimmed.fastq
SRR5759389_pe12.max100.trimmed.norare.fastq
# Filter low abundance reads based on k-mer frequencies to minimize PCR errors
khmer filter-abund.py -V kmer.counts -o SRR5759389_pe12.max100.trimmed.norare.fastq SRR5759389_pe12.max100.trimmed.fastq
Final clean-up of reads to remove low quality bases, short sequences, and non-paired reads
SRR5759389_pe12.max100.trimmed.norare.fastq
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
# Final clean-up of reads (remove low quality bases, short sequences, and non-paired reads)
seqtk seq -q 10 -N -L 80 SRR5759389_pe12.max100.trimmed.norare.fastq | seqtk dropse > SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
To save disk space and avoid cluttering the computer (with redundant information), please remove the following intermediate files as follows uisng the rm
command:
#Removing intermediate, temporary files to save disk space
rm SRR5759389_pe12.max100.trimmed.fastq
rm SRR5759389_pe12.max100.trimmed.norare.fastq
–> Each of the file that you discarded is >30G of data!
In this last section, our cleaned reads are prepared for the de novo genome assembly analysis conducted in SOAPdenovo2
(Luo et al., 2012) (done in PART 2). This is achieved by using khmer
to de-interleave reads and rename files using mv
as follows:
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.1
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.2
# De-interleave filtered reads
khmer split-paired-reads.py SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.1
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.2
SRR5759389.pe1.clean.fastq
SRR5759389.pe2.clean.fastq
#Rename output reads
mv SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.1 SRR5759389.pe1.clean.fastq
mv SRR5759389_pe12.max100.trimmed.norare.noshort.fastq.2 SRR5759389.pe2.clean.fastq
–> These final files are the input files for the de novo analysis.
Use FastQC
to validate the reads cleaning conducted above. Please see Step 2 for more details on the methodology.
After inspecting the output of the reads QC analyses, we discovered that our Illumina library contained two distinct sequence content patterns. This evidence together with the nature of the organism that we are studying allowed us to hypothesize that our data were enriched in plastid (or chloroplast) DNA over nuclear DNA. In this context, we predict that the 32x sequencing nuclear genome coverage inferred based on the raw data could be significantly lower. To test our hypothesis and determine the actual sequencing coverage of the nuclear genome, it is paramount to:
To investigate these topics, the instructor proposes that we conduct the following analyses:
All the analyses from this step rely on SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
. If you have not been able to generate the latter file, please copy it from our shared hard drive space into your khmer/
folder. The path is as follows:
## SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
# can be found at this path
/home/Genomics_shared/Chapter_04/02_Kmers_analyses/khmer/instructor_files
Students are divided into 3-4 small groups. Each group work as a team to conduct the analyses and answer questions. Prior to assigning tasks, the group has to work on a workflow describing the analyses (input/output), used software (and arguments) and carefully reading the material prior to running jobs in secured tmux
sessions.
The GC content of a sequence library can provide evidence of contamination or the presence of sequence from multiple organisms (since most organisms have specific GC profiles). A GC plot inferred from a non-contaminated library would display a smooth, uni-modal distribution. The existence of shoulders, or in more extreme cases a bi-modal distribution, could be indicative of the presence of sequence reads from an organism with a different GC content, which is most likely a contaminant (see Figure 6.3).
Here we infer GC contents for all clean reads using the perl script GC_content.pl
. This script requires reads to be in fasta
format. We will then start by converting our fastq
file into fasta
by using seqtk
. Data are available in the ~/Documents/02_Kmers_analyses/GC_content/
folder. Please see below for the commands to be executed:
fastq
to fasta
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
SRR5759389_pe12.max100.trimmed.norare.noshort.fasta
#Convert fastq to fasta
seqtk seq -a ~/Documents/02_Kmers_analyses/khmer/SRR5759389_pe12.max100.trimmed.norare.noshort.fastq > SRR5759389_pe12.max100.trimmed.norare.noshort.fasta
fasta
formatSRR5759389_pe12.max100.trimmed.norare.noshort.fasta
gc_out.txt
–> Don’t execute command and just use this file for next analysesPLEASE DON’T EXECUTE THE COMMAND BELOW! It takes few hours to calculate reads GC contents. Use the output of this command gc_out.txt
(weighing ca. 5GB), which is available in the GC_content/
folder to carry on our analyses.
#Infer GC content using fasta file as input (this can take a long time)
./GC_content.pl SRR5759389_pe12.max100.trimmed.norare.noshort.fasta > gc_out.txt
gc_out.txt
gc_simple.txt
The next command extracts the column containing the GC content per read from gc_out.txt
(located in GC_content/
). Since the file is very big, we use the BASH command awk
to extract this information instead of R
.
#Extract reads GC contents (the file is big and can't be processed easily)
awk '{print$2}' gc_out.txt > gc_simple.txt
gc_simple.txt
GC_content_hist.pdf
The last part of the analysis will be executed in R
where we will load the data and create a histogram to look at the distribution of reads GC contents.
### ~~~ Infer GC content per reads ~~~ Load the data in R
gc_frac <- read.table("gc_simple.txt", header = T)
# Create pdf to save output of hist
pdf("GC_content_hist.pdf")
# Do the histogram of GC content
hist(as.numeric(gc_frac[, 1]), main = "Histogram of GC content per reads",
xlab = "GC fraction")
# Close pdf file
dev.off()
The histogram displaying the distribution of GC values based on cleaned reads is provided in Figure 6.6.
To your knowledge, based on data presented in Figure 6.6, is the SRR5759389 library contaminated with alien DNA?
Here we aim at assessing the proportions of reads in the clean library belonging to either nuclear or chloroplast genomes by mapping our clean reads against two reference genomes using bwa
(Li and Durbin, 2009) and samtools
(Li, Handsaker, et al., 2009). Data are available in Map_reads/
.
The nuclear genome of Apostasia shenzhenica published by Zhang et al. (2017) is not listed in the NCBI database as “Genome”, but rather as “Assembly” (see BioProject PRJNA310678). Please see Figure 6.7 for more details on the genome. This file (a compressed FASTA) is already available in your folder, but you could download it as follows:
#Get Fasta files for genomes of reference
#PEFY01 nuclear genome of Apostasia
wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/PE/FY/PEFY01/PEFY01.1.fsa_nt.gz
Please go through the commands below to map your clean reads against the reference nuclear genome assembly:
fastq
cleaned reads to nuclear reference genome assembly (in fasta
) using bowtie
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
and PEFY01.1.fsa_nt.gz
PEFY01_map_pe.sorted.bam
#Mapping clean reads against reference genome using bwa and converting into human readable format using samtools
#Before starting, you have to create an index for your reference genome to allow efficient mapping
bwa index -a bwtsw PEFY01.1.fsa_nt.gz
bwa mem -t 20 PEFY01.1.fsa_nt.gz ~/Documents/02_Kmers_analyses/khmer/SRR5759389_pe12.max100.trimmed.norare.noshort.fastq | samtools view -buS - | samtools sort - -o PEFY01_map_pe.sorted.bam
samtools index PEFY01_map_pe.sorted.bam
We will be devoting time to decipher these commands during our Lab sessions. For now, here is a short description of bwa
from the manual:
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
The chloroplast genome of Apostasia shenzhenica was not published. To assess the proportion of chloroplastic reads in our Illumina cleaned reads, we will use the chloroplast genome of a sister species, A. wallichii (Niu et al., 2017), which is available on the NCBI website under the accession number LC199394
. This file (in FASTA) is already available in your folder and you do not have to download it.
Please execute the commands below to map the clean reads against the reference chloroplast genome of the sister species:
fastq
cleaned reads to chloroplast genome assembly (in fasta
) using bowtie
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
and LC199394.fasta
LC199394_map_pe.sorted.bam
#Mapping reads against the chloropolast genome of a sister species
# following the same approach as with the nuclear genome
bwa index -a bwtsw LC199394.fasta
bwa mem -t 20 LC199394.fasta ~/Documents/02_Kmers_analyses/khmer/SRR5759389_pe12.max100.trimmed.norare.noshort.fastq | samtools view -buS - | samtools sort - -o LC199394_map_pe.sorted.bam
samtools index LC199394_map_pe.sorted.bam
We have now all the information at hand to count how many reads are matching with either nuclear or chloroplast genomes (and infer proportions). Remember that it is just a very preliminary assessment, which aims at assessing the quality of the library and its suitability for de novo genome assembly.
count_fastq.sh
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
#Count the total number of clean reads
./count_fastq.sh ~/Documents/02_Kmers_analyses/khmer/SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
PEFY01_map_pe.sorted.bam
and LC199394_map_pe.sorted.bam
Now, we will use samtools
to count how many unique reads mapped the reference genomes:
#How many unique reads mapped the nuclear genome
samtools view -F 0x904 -c PEFY01_map_pe.sorted.bam
#How many unique reads mapped the chloroplast genome
samtools view -F 0x904 -c LC199394_map_pe.sorted.bam
The instructor let you work on code to determine these values.
What are the proportions of nuclear and chloroplast reads in your library as well as potential contaminants?
Based on your previous answer, is this library suitable to assemble the nuclear genome of Apostasia shenzhenica?
In this section, we aim at utilizing data on k-mer frequencies to estimate the nuclear genome size and tease apart the proportion of the genome associated with low copy genes (also referred to as unique sequences) and repetitive DNA.
The JELLYFISH
(Marçais and Kingsford, 2011) program is a tool for fast, memory-efficient counting of k-mers in DNA sequences. JELLYFISH
is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences and it outputs k-mer counts in an binary format (which can easily be translated into a human-readable text). Finally, the output of JELLYFISH
will be used to estimate the genome size and complexity of the Apostasia shenzhenica nuclear genome by using R
and GenomeScope
(Vurture et al., 2017).
In this section, the JELLYFISH
analysis has to be done in groups, but the rest of the analyses (in R
and with GenomeScope
) can be done individually.
Here, we are using JELLYFISH
to generate countings of 21-mers based on the cleaned Illumina reads (SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
). We will subsequently input the output of this program into R
to estimate the genome size based on k-mer frequencies.
The data are available in Jellyfish/
and all analyses must be done in your tmux
Part1 session.
SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo
Note: This job uses 30 CPUs (or Central Processing Units) to infer k-mers and it would be best to work in groups to execute these tasks.
# 1. Navigate to the right folder (Jellyfish/) using cd
# 2. Run jellyfish to obtain k-mer counts and frequencies (based on trimmed, normalized and filtered reads)
jellyfish count -t 30 -C -m 21 -s 5G -o Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare ~/Documents/02_Kmers_analyses/khmer/SRR5759389_pe12.max100.trimmed.norare.noshort.fastq
jellyfish histo -o Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare
Note: To ensure that everybody can run the rest of the analyses on their own, copy Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo
into the shared folder at this path:
/home/Genomics_shared/Chapter_04/02_Kmers_analyses/
Each member of the group will then copy this file over to their account (at the right path) to pursue the analyses.
R
Once the JELLYFISH
analysis is completed, please open an R
session and execute the following command to estimate the haploid (1x) genome size. Here, genome size is estimate as follows:
#Genome size (N) is equal to:
N = Total numbers of k-mers / Peak of coverage (here ca. 22x)
N = Area under the curve / Peak of coverage
The R
script to infer the genome size of Apostasia shenzhenica based on the data generated by JELLYFISH
is presented below:
# 1. Load the output of Jellyfish in R
Clean <- read.table("Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo")
# 2. Execute this user defined function to infer genome
# size and plot k-mer distribution A function to infer
# genome size (1C or haploid) on jellyfish output input -
# hist = table containing output of Jellyfish (hist file) -
# plotKmers if 'TRUE' outputs a graph
GenomeSize <- function(hist, plotKmers) {
# Infer Genome size
x <- grep(min(hist[1:20, 2]), hist[, 2])
tmp <- hist[x:1000, 1] * hist[x:1000, 2]
# Infer the total number of kmers in the distribution
Totkmers <- sum(as.numeric(as.vector(tmp)))
# Infer Peak of coverage (when enough data is equal to
# mean) of the genome (13x)
PeakCov <- grep(max(hist[x:1000, 2]), hist[, 2])
# Genome size (1C or haploid) is the area under the
# curve, which is the Total numner of K-mers/Mean
# coverage
GenomeSizeOut <- Totkmers/PeakCov
GenomeSizeOut <- round(GenomeSizeOut/1e+06, 2)
print(GenomeSizeOut)
if (plotKmers == "TRUE") {
# Plot
plot(hist[2:200, ], type = "n", ylim = c(0, 2e+07), xlab = "Coverage",
ylab = "Frequency", main = "K-mer (k=21) distribution")
# Plot trimmed data
lines(hist[2:200, ])
# These are low coverage K-mers (K=1:8), which have
# very high frequencies
rect(xleft = 0, ybottom = 0, xright = x, ytop = 2e+07,
col = "red")
text(x = 25, y = hist[PeakCov, 2] + 3e+06, paste("Error k-mers",
sep = "\n"), cex = 0.7, col = "red")
# Add the distribution line
lines(hist[2:200, ], type = "l")
# Add a line showing the peak of coverage (here at
# 26x)
segments(x0 = PeakCov, y0 = 0, x1 = PeakCov, y1 = hist[PeakCov,
2], lty = 2)
text(x = 26, y = hist[PeakCov, 2] + 9e+05, paste("Peak cov.: ",
PeakCov, "x", sep = ""), cex = 0.6)
text(x = 50, y = 8e+06, paste("True k-mers", sep = "\n"),
cex = 0.8)
text(x = 170, y = 18500000, paste("Estimated genome size: ",
GenomeSizeOut, "Mb", sep = ""), cex = 0.6)
}
}
# 3. Estimate genome size on cleaned reads using the
# function GenomeSize
pdf("GenomeSize.pdf")
GenomeSize(Clean, plotKmers = "TRUE")
dev.off()
Please use FileZilla
or the sftp
protocol (if you have a UNIX-based OS) to transfer GenomeSize.pdf
and Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo
onto your personal computer. The latter file will be required to conduct the GenomeScope
analysis (see below).
What is the haploid genome size of Apostasia shenzhenica?
How does this genome size estimation compare with the one conducted based on reads that were only trimmed provided in Figure 6.5?
Finally, are the latter two estimations close to the 349 Mb provided in Zhang et al. (2017)?
Finally, the last step of our preliminary analyses prior to de novo genome assembly aims at estimating heterozygosity, unique genes, repetitive DNA and PCR errors. These analyses are conducted by using the approach implemented in the online tool GenomeScope
(Vurture et al., 2017). The authors of this program developed an equation to model the shape and size of the k-mer graph by using four negative binomial peaks which shape and size are determined by % heterozygosity, % PCR duplication, and % PCR Error.
Open a web browser and navigate to the GenomeScope website (http://qb.cshl.edu/genomescope/). Once on the website, upload Orchid_21mer_SRR5759389_pe12_max100.trimmed.norare.histo
by dragging your file onto the small window and set the analysis as provided in Figure 6.8.
The output of GenomeScope
is displayed in Figure 6.9. The program outputs two plots and the only difference between those is that the the second plot is using a log transformation for the axes. This transformation allows better teasing apart the different elements studied here. The big peak at 25 in the graph above is in fact the diploid homozygous (i.e., AA) portions of the genome that account for the identical 21-mers from both strands of the DNA (haploid genome mostly composed of single-copy genes). The dotted line corresponds to the predicted center of that peak. The absence of a shoulder to the left of the peak suggests that this genome has a very low level of heterozygosity (i.e., AB). The red line on the left corresponds to PCR errors. Finally, the proportion of unique sequences, which would correspond to low-copy genes is depicted by the yellow line and the amount of repetitive DNA is estimated by comparing the “space” between the black and yellow lines (on the right side of the graph).
Please see this tutorial for more information on the model implemented in GenomeScope
and interpretation of the results.
How do estimations of genome size differ between R and GenomeScope?
What is the level of genomic heterozygosity of Apostasia shenzhenica?
Is the genome of Apostasia shenzhenica enriched in repeats?
The overarching objective of PART 2 is to gain theoretical and bioinformatics knowledge on the steps required to perform a de novo genome assembly based on cleaned Illumina reads. The procedure applied to assess and validate the quality of the genome assembly will also be covered in PART 3.
PART 2 will be using the output files of the PART 1 reads cleaning step, which are deposited in ~/Documents/02_Kmers_analyses/khmer/
and named as follows:
SRR5759389.pe1.clean.fastq
SRR5759389.pe2.clean.fastq
WARNING: You have to complete section 6.8.7.3.5 before starting this section.
To achieve our overarching objective, our next classes will be divided into two steps:
SOAPdenovo2
(Luo et al., 2012).Here, students will be learning the procedure to set-up and perform a de novo genome assembly using cleaned PE Illumina reads as implemented in SOAPdenovo2
(Luo et al., 2012). The analysis will be conducted on the output files obtained after cleaning reads based on Phred scores and k-mer distributions (see section 6.9.2 for more details on the input files).
SOAPdenovo2
is the updated version of the program that was used by Li, Fan, et al. (2009) to assemble the nuclear genome of the giant panda. The draft nuclear genome of Apostasia shenzhenica was not inferred with this latter program, but with ALLPATHS-LG
(Gnerre et al., 2011). The draft assembly was obtained based on multiple Illumina short-reads libraries exhibiting different insert-sizes (see Zhang et al., 2017). The authors have conducted a final round of gap closures to polish their assembly based on PacBio long-reads. It will therefore be interesting to compare our assembly based on one Illumina library with the one obtained by Zhang et al. (2017) inferred uisng a combination of short and long read data. Further information justifying using SOAPdenovo2
instead of ALLPATHS-LG
are provided below.
SOAPdenovo2
SOAPdenovo2
(Luo et al., 2012) is a short-read genome assembler, which was especially developed to handle Illumina reads and to build de novo
draft assemblies for human-sized genomes. This program creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost effective way. As stated above, SOAPdenovo2
aims for large eukaryote genomes, but it also works well on bacteria and fungi genomes. It runs on 64-bit Linux system with a minimum of 5GB of physical memory. For big genomes like human (3.2Gbp), about 150GB of memory are required.
In our case, we have estimated the haploid genome size of Apostasia shenzhenica to be around 340Mbp in our previous tutorial. This latter evidence therefore suggests that is adapted to assemble a draft genome for this species of orchid.
SOAPdenovo2
This program is made up of six modules handling:
Please read Luo et al. (2012) for more details on the approach implemented in SOAPdenovo2
.
Once reads are fully cleaned and ready for de novo assembly, one of the main questions that you will ask yourself is: What de novo program should I use to assemble my genome? Or in other words, which program best fit my data?
Most researchers working on Illumina short-reads are using either SOAPdenovo2
or ALLPATHS-LG
to assemble nuclear genomes of eukaryote species. Please find below a short comparison of both programs:
SOAPdenovo2
, ALLPATHS‐-LG
requires high sequence coverage of the genome in order to compensate for the shortness of the Illumina reads. The precise coverage required depends on the length and quality of the paired reads, but typically is of the order 100x or above. This is raw read coverage, before any error correction or filtering.SOAPdenovo2
, ALLPATHS‐-LG
requires a minimum of 2 paired-end libraries – one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap – for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage.Overall, because we estimated in the previous tutorial that this library provided a coverage depth of ca. 25x for the haploid genome and that we only have one paired-end library, SOAPdenovo2
therefore best fits our data.
SOAPdenovo2
analysisRunning a SOAPdenovo2
analysis is a three steps process:
SOAPdenovo2
configuration file providing settings for the analysis.Run the code in a tmux
session!
Use the mkdir
command to create a new folder entitled SOAPdenovo/
in ~/Documents/02_Kmers_analyses/
as follows:
mkdir SOAPdenovo2
Move the paired-end cleaned reads files (i.e. input files for the de novo analysis) in SOAPdenovo2/
using the mv
command as follows:
#Move the split pe files in SOAPdenovo2
# WARNING: You have to be in the SOAPdenovo2 folder
mv ../khmer/SRR5759389.pe* .
SOAPdenovo2
requires a configuration file to run the analysis.
Please create a file entitled soap_config.txt
deposited in SOAPdenovo2/
using vim
.
vim
session:vim soap_config.txt
Type Shift
and I
on your keyboard to enter into INSERT
mode.
Copy and paste the text below in the vim
session:
max_rd_len=90 # maximal read length
[LIB] # One [LIB] section per library
avg_ins=180 # average insert size
reverse_seq=0 # if sequence needs to be reversed
asm_flags=3 # use for contig building and subsequent scaffolding
rank=1 # in which order the reads are used while scaffolding
q1=SRR5759389.pe1.clean.fastq
q2=SRR5759389.pe2.clean.fastq
To better understand the settings contained in this file, comments have been provided to explain each parameter. The last 2 lines contain the names of the input files. This is a very simple analysis based only on one paired-end Illumina library ([LIB]
).
Esc
on your keyboard to quit the INSERT
mode and type :wq!
to save the document and exit from vim
.Once the input and configuration files are sorted, just type the following command in your terminal to start the analysis:
#Run the de novo analysis
soapdenovo2-63mer all -s soap_config.txt -K 63 -R -p 5 -o SRR5759389_assembly
Here, SOAPdenovo2
works with a k-mer size of 63 (-K
). The analysis will be performed on 5 threads (or CPUs) (-p
) and the program will save the output files under SRR5759239_assembly
as shown by the -o
argument.
The analysis will run for SEVERAL HOURS. Please log out from your sessions, but DON’T SWITCH OFF COMPUTERS.
The output file of the SOAPdenovo2
analysis are presented in section 6.10.2.
In this part, students are learning how to use QUAST
to assess the quality of the de novo genome assembly inferred in PART 2 using SOAPdenovo2
.
QUAST
is used to validate the assembly (based on one PE Illumina library; SRR5759389
) by comparing it with the assembly published by Zhang et al. (2017) (available under the GenBank accession number: PEFY01
). Such approach is especially useful to assess the level of misassembly (see Figure 6.10). Please keep in mind that this assembly has only been inferred from one PE Illumina library.
SOAPdenovo2
assembly filesThe input of the QUAST
analysis is the fasta
assembly file produced by SOAPdenovo2
containing scaffolds (see Figure 6.10). This file is entitled SRR5759389_assembly.scafSeq
and it is located in the SOAPdenovo2/
folder (see PART 2 for more details).
Navigate to the file location and inspect SRR5759389_assembly.scafSeq
by using the head
command. It should be similar to the example provided below:
>scaffold2 6.9
TATATATATTAGCATTAAATTAACTTTATTTTCATTCTTTCACAAAGCTCGAAACAAAGAT
ATCACCAAACAACTATGGCTTAGCAAACACATCATTTATACGAGAAATATTAACGATTTG
AGTGGTGTATGTAGGTTTATGTTCTGAAGTTAACTCGCCCTTTTTTTTTTTTTTCTTAAGT
ACTGATTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTCTTCTTCT
TCTAATCATAATATATGTGGTGCTTGCATTTGTGTGCCATGGTGTCACCCTCCCTTCTTG
CATTGGCCATTGACTCTACTCTTCCCCTCATCAGTGTCTTCCTGTTTAAATTGTAAACCA
You can also gather general statistics on your assembly (both at scaffold and contig levels) by consulting SRR5759389_assembly.scafStatistics
. Please find below a print out of the content of this latter file:
<-- Information for assembly Scaffold 'SRR5759389_assembly.scafSeq'.(cut_off_length < 100bp) -->
Size_includeN 319316154
Size_withoutN 316496824
Scaffold_Num 206180
Mean_Size 1548
Median_Size 339
Longest_Seq 58464
Shortest_Seq 100
Singleton_Num 148213
Average_length_of_break(N)_in_scaffold 13
Known_genome_size NaN
Total_scaffold_length_as_percentage_of_known_genome_size NaN
scaffolds>100 205022 99.44%
scaffolds>500 84690 41.08%
scaffolds>1K 60675 29.43%
scaffolds>10K 6062 2.94%
scaffolds>100K 0 0.00%
scaffolds>1M 0 0.00%
Nucleotide_A 105171893 32.94%
Nucleotide_C 54211428 16.98%
Nucleotide_G 53194169 16.66%
Nucleotide_T 103919334 32.54%
GapContent_N 2819330 0.88%
Non_ACGTN 0 0.00%
GC_Content 33.94% (G+C)/(A+C+G+T)
N10 17086 1447
N20 12498 3661
N30 9573 6593
N40 7459 10380
N50 5728 15270
N60 4208 21769
N70 2902 30880
N80 1712 45089
N90 649 74738
NG50 NaN NaN
N50_scaffold-NG50_scaffold_length_difference NaN
<-- Information for assembly Contig 'SRR5759389_assembly.contig'.(cut_off_length < 100bp) -->
Size_includeN 323703442
Size_withoutN 323703442
Contig_Num 326278
Mean_Size 992
Median_Size 328
Longest_Seq 32360
Shortest_Seq 100
Contig>100 324883 99.57%
Contig>500 125875 38.58%
Contig>1K 81379 24.94%
Contig>10K 1818 0.56%
Contig>100K 0 0.00%
Contig>1M 0 0.00%
Nucleotide_A 108192955 33.42%
Nucleotide_C 55873838 17.26%
Nucleotide_G 54056600 16.70%
Nucleotide_T 105580049 32.62%
GapContent_N 0 0.00%
Non_ACGTN 0 0.00%
GC_Content 33.96% (G+C)/(A+C+G+T)
N10 8835 2797
N20 6338 7177
N30 4798 13093
N40 3648 20857
N50 2743 31105
N60 1995 44959
N70 1350 64640
N80 786 95910
N90 342 158749
NG50 NaN NaN
N50_contig-NG50_contig_length_difference NaN
Number_of_contigs_in_scaffolds 178065
Number_of_contigs_not_in_scaffolds(Singleton) 148213
Average_number_of_contigs_per_scaffold 3.1
Why do scaffold sequences contain Ns?
Use the workflow presented in Figure 6.10 to answer this question.
What are the assembly length, N50 statistics and overall quality (i.e., how fragmented is the genome assembly) of the de novo assembly at contig and scaffold levels?
What proportion of the nuclear genome of Apostasia shenzhenica was assembled?
Does the assembly contain only nuclear genomic data?
Based on all the data at hand, what part of the genome could be annotated?
QUAST
stands for QUality ASsessment Tool for genome assemblies (Gurevich et al., 2013). The manual of this program is available here: http://quast.sourceforge.net/docs/manual.html
QUAST
uses MUMmer v3.23
(Kurtz et al., 2004) to
Unlike other similar software, QUAST
can also compute metrics that are useful for assessing assemblies of previously un-sequenced species (i.e. non model organisms).
QUAST
metricsThe metrics evaluated by QUAST
can be subdivided into several groups as follows:
QUAST
can evaluate them only with respect to a known reference genome. If the reference genome exactly matches the dataset being assembled, differences may be attributed to misassemblies by the software or to sequencing errors, such as chimeric reads. Sometimes one uses a reference genome that is related to but different than the dataset being sequenced. In this case, the differences may still be misassemblies, but they may also be true structural variations, such as rearrangements, large indels, different repeat copy numbers and so forth.
QUAST
also generates reports with detailed information about each contig, including whether the contig is unaligned, ambiguously mapped, misassembled or correct.QUAST
also generates a more detailed report with the coordinates of mismatches. This metric does not distinguish between single-nucleotide polymorphisms, which are true differences in the assembled genome versus the reference genome, and single-nucleotide errors, which are due to errors in reads or errors in the assembly algorithm.QUAST
presents a number of statistics in graphical form and supports svg
, png
and pdf
formats. These plots are divided into several groups as follows:
QUAST
is written in python
and can be run on Linux or MacOS. The assembly and reference genome files have to be in fasta
formats (they can be compressed). The output files are in html
and pdf
formats.
In this section, we are reviewing our data and computational resources and learning the procedure to best-fit those to the options provided in the QUAST
program:
SOAPdenovo2
assembly is made of scaffolds and not contigs. This will have to be stated by using the --scaffolds
option.PEFY01.1.fsa_nt.gz
) is compressed. QUAST
can handle compressed files, but not all software can.--fragmented
option.QUAST
can be run in parallel (i.e. simultaneously using multiple threads/CPUs to speed up the analysis). We will take advantage of the 32 cores of our computers and set the --threads
option to 5 (to allow all students to run the analyses in parallel on the same computer).QUAST
runs from a command line as follows:
#General syntax
quast.py [options] <contig_file(s)>
Options relevant to our data are as follows:
-R <path>
: Reference genome file. Optional. Many metrics can’t be evaluated without a reference. If this is omitted, QUAST
will only report the metrics that can be evaluated without a reference.--threads (or -t) <int>
: Maximum number of threads. The default value is 25% of all available CPUs but not less than 1. If QUAST
fails to determine the number of CPUs, maximum threads number is set to 4.--scaffolds (or -s)
: The assemblies are scaffolds (rather than contigs). QUAST
will add split versions of assemblies to the comparison (named <assembly_name>_broken
). Assemblies are split by continuous fragments of N’s of length \(\geq 10\). This is especially important to assess the level of misassemblies constructed by the algorithm.--fragmented
: Reference genome is fragmented (e.g. a scaffold reference). QUAST
will try to detect misassemblies caused by the fragmentation and mark them fake (will be excluded from # misassemblies
).If you want to get more information about options available in QUAST
, please consult the manual.
The command provided below will run a QUAST
analysis on the SOAPdenovo2
de novo assembly file using two input files:
SRR5759389_assembly.scafSeq
as output of the de novo analysisPEFY01.1.fsa_nt.gz
as reference genome sequenceBefore starting the QUAST
analysis execute the following tasks:
tmux
session, and navigate to 02_Kmers_analyses/
.PEFY01.1.fsa_nt.gz
: ~/Documents/02_Kmers_analyses/Map_reads/
SRR5759389_assembly.scafSeq
: ~/Documents/02_Kmers_analyses/SOAPdenovo2/
QUAST/
using the mkdir
command at the root of 02_Kmers_analyses/
.cd
command to navigate to QUAST/
.QUAST
analysis:#Run QUAST on the SOAPdenovo2 assembly using the reference nuclear genome
quast.py -R ../Map_reads/PEFY01.1.fsa_nt.gz --threads 5 --scaffolds --fragmented ../SOAPdenovo2/SRR5759389_assembly.scafSeq
QUAST
output filesBy default, the output files of the analysis will be saved in a folder named as follows:
quast_results/results_<date_time>
In each folder, there will be a suite of output files, please find below the ones that are more relevant to our goals:
report.txt
: Assessment summary in plain text format.report.html
: HTML version of the report with interactive plots inside.report.pdf
: All plots combined with all tables (see Figure 6.11).icarus.html
: Icarus main menu with links to interactive viewers. Icarus generates contig size viewer and one or more contig alignment viewers (if reference genome is provided). All Icarus viewers contain a legend with color scheme description (see Figure 6.12).Do the following tasks:
Icarus
tab; see Figure 6.11).Questions
What is your overall assessment of the quality of the de novo assembly?
To reach your conclusion, ask yourself the following questions:
Did
SOAPdenovo2
do a good job at assembling contigs into scaffolds?
What could be the implication(s) of using scaffolds to conduct gene annotation?