1 Introduction

This document supports analyses aiming at mining the sagebrush (Artemisia tridentata Nutt; Asteraceae) draft genome for Aquaporin (AQP) genes and it is based on data and analyses presented in Melton et al. (2021). AQPs are multi-exon genes (see Figure 1.1) encoding for a large family of proteins known to function in the transport of water and other molecules across cell membranes (reviewed in Li et al., 2014).

Flowchart showing the structure of a multi-exon gene and different resulting proteins. As a reminder, exons are pieces of coding DNA that encode proteins. Different exons code for different domains of a protein. The domains may be encoded by a single exon or multiple exons spliced together. In the case of multi-exon genes, the presence of exons and introns allows for greater molecular evolution through the process of exon shuffling and alternative splicing. Exon shuffling occurs when exons on sister chromosomes are exchanged during recombination. This allows for the formation of new genes. On the other hand, exons also allow for multiple proteins to be translated from the same gene through alternative splicing. This process allows the exons to be arranged in different combinations when the introns are removed/spliced. The different configurations can include the complete removal of an exon, the inclusion of part of an exon, or the inclusion of part of an intron. Alternative splicing can occur in the same location to produce different variants of a gene with a similar role, such as the human slo gene, or it can occur in different cell or tissue types, such as the mouse alpha-amylase gene. Alternative splicing, and defects in alternative splicing, can result in a number of diseases including cancer.

Figure 1.1: Flowchart showing the structure of a multi-exon gene and different resulting proteins. As a reminder, exons are pieces of coding DNA that encode proteins. Different exons code for different domains of a protein. The domains may be encoded by a single exon or multiple exons spliced together. In the case of multi-exon genes, the presence of exons and introns allows for greater molecular evolution through the process of exon shuffling and alternative splicing. Exon shuffling occurs when exons on sister chromosomes are exchanged during recombination. This allows for the formation of new genes. On the other hand, exons also allow for multiple proteins to be translated from the same gene through alternative splicing. This process allows the exons to be arranged in different combinations when the introns are removed/spliced. The different configurations can include the complete removal of an exon, the inclusion of part of an exon, or the inclusion of part of an intron. Alternative splicing can occur in the same location to produce different variants of a gene with a similar role, such as the human slo gene, or it can occur in different cell or tissue types, such as the mouse alpha-amylase gene. Alternative splicing, and defects in alternative splicing, can result in a number of diseases including cancer.

1.1 AQP defining characteristics

The defining characteristics of AQP proteins include having:

Six membrane-spanning alpha-helices
Two hydrophobic loops
Each loop contains a conserved asparagine-proline-alanine (NPA) motif forming a barrel surrounding a central pore-like region that contains additional protein density (see Figure 1.2)

While the NPA motifs are generally highly conserved, there are some AQP genes that have undergone mutations of the alanine residue in the NPA motif (Ishibashi, 2006).

1.2 AQP subfamilies

AQPs in flowering plants comprise five subfamilies (Danielson and Johanson, 2008):

NOD26-like intrinsic proteins (NIPs);
plasma membrane intrinsic proteins (PIPs);
small basic intrinsic proteins (SIPs);
tonoplast intrinsic proteins (TIPs);
X intrinsic proteins (XIPs).

Genes from each subfamily tend to move water or other substrate depending on their NPA motifs. Some AQPs, such as NIPs, have acquired a mutation in their NPA motif, such as alanine to leucine, which confer the ability to move substrates such as urea or ammonium (reviewed in Chaumont et al., 2005).

Figure 1.2: Example of a 3D model of an Aquaporin protein recovered in the sagebrush genome. NPA motifs are shown by red/orange colors

2 Learning outcomes

As part of this lab, students will gain the following learning outcomes:

Learn protocol to remotely access to a unix-based computer (see Tutorials).
Learn unix-based commands to organize projects and execute code (see Tutorials).
Provide introduction to running R code and developing your own R code (this topic was also partially covered in Mini-Report 3).
Study protocols to mine Illumina de novo draft genome (at scaffold level) for target genes.
Predict and annotate genes recovered from genome mining process.
Reconstruct protein sequences of annotated genes.
Validate protein sequence and predict function.

3 Objectives and scientific question

Instructors teach students to conduct analyses by going over each module in class demonstrating the approach to mine genome for target gene and go over steps to assemble and validate protein product. Then, each group will be assigned a scaffold (from the de novo genome assembly) and they will have to conduct analyses presented in modules 2 to 4 on their own to answer the following question:

What Aquaporin protein coding sequence is “hidden” in your assigned scaffold?

Very much the same as with the “Where’s waldo?” children books, students will be able to make predictions to “find” their Aquaporin protein coding sequence based on material presented in the Introduction and further developed in modules 2, 3 and 4. These material will allow students to design their analytical workflow covering the following major steps:

Predict, annotate and identify AQP genes along scaffold.
Reconstruct and validate AQP protein sequences.
Predict AQP protein function (by intersecting evidence recovered from the previous steps).

3.1 Scaffolds assigned to groups

The FASTA file with scaffolds sequences can be downloaded here. The scaffolds assigned to each group are as followed:

Group 1: Scaffold106379.
Group 2: Scaffold254734.
Group 3: Scaffold348135.

Let’s complete the training first going by through each module and then students will be working in groups to replicate these analyses to their assigned scaffold.

4 Group lab report (TOTAL: 150 points)

The instructor provides below information to complete the group report. All the material, data, code and references required to complete this report are provided here and are covered in class. In this context, students will have to focus on formatting their reports following guidelines presented here as well as making sure that their code is working and ready to be shared. Please DON’T FORGET TO PROVIDE REFERENCES FOR SOFTWARE/PACKAGES cited in the Material & Methods section.

4.1 Apply the IMRAD format and supply data and code

Your group reports will be structured and organized following the IMMRAD format: Introduction, Material & Methods, Results, and Discussion. This format is widely used to report experimental research in many scientific disciplines. In addition, you will be complementing your reports with an Abstract. Finally, since this research relies heavily on bioinformatics, students will complete their reports by supplying their data and commented code/scripts. Please see Mini-Report 3 for more details.

4.2 Report deadline

Group reports should be uploaded on the shared Google Drive by April 26th 2024 before 5 pm MT.

5 Group lab oral presentation (TOTAL: 50 points)

Students will use their lab reports to prepare 10 minutes presentations (+ 5 minutes for questions) taking place on April 26th 2024 (during class time).

The instructor expects each student to take part to the presentation and to be involved in answering questions.

Presentations will have to be deposited in advance onto a shared Google Drive folder or be sent to the instructor prior to the presentation.

6 Analytical workflow

Overall, our analytical workflow is subdivided into 4 modules as follows:

Module 1: Set-up environment and identify scaffolds containing AQP genes
Module 2: Extract scaffolds identified in module 1 and find Open Reading Frames
Module 3: Annotate ORFs and assemble AQP protein(s)
Module 4: Validate AQP protein sequences and infer their function(s)

7 Project structure

7.1 Where are the data located?

All the data associated to this project are located on your individual account in the DraftGenomeMineR/ folder. This folder is located in ~/DraftGenomeMineR. Your account detail are available here.

Key files:

Draft_Genome_Assembly.fasta: Sagebrush draft genome assembly (at scaffold level).
FASTAs/PIP1_3.fa: Reference containing PIP1 gene (AA) for BLAST analysis. This data will be used to find Aquaporin genes in the draft genome.
Scripts related to each module are located in Lesson_Modules/.

7.2 How do I access the data and run analyses?

To access the data and run the analyses, please remotely connect to your dedicated computer using ssh protocol. Your credentials are available here.

8 Module 1

8.1 Objectives

This module is dedicated to presenting approach to set-up your working environment and then to identify scaffolds in the de novo genome assembly containing AQP genes (using BLAST).

8.2 Remotely connect to computer

Start by using ssh protocol to connect to your Linux computer. This is done as follows (enter password when prompted to):

#SSH with bio_11 as example
ssh bio_11@132.178.142.214

8.3 Install `R` v.4.1.3

!!WARNING: YOU DON’T HAVE TO INSTALL R!!

The code used in this lab depends on a specific version of R (version 4.1.3 (2022-03-10) – “One Push-Up”) and it needs to be installed on your computer prior to starting the tutorial. This can be done by following the procedure described on these two websites:

Disclaimer: The right version of R was already installed on our lab computers, but if you wanted to execute this code on your personal computer please make sure you have the right version of R!

8.4 Start R session and associated R script

8.4.1 To do before starting

Before starting this tutorial, do the following:

Remotely connect to your computer
Start a new tmux session entitled “Lab” on your remote computer
Navigate to DraftGenomeMineR/ and start a new R session:

#Navigate to DraftGenomeMineR/
cd DraftGenomeMineR/

#Start R session
R

Create and save a new R file in RStudio entitled Lab_mining_genome.R on your personal computer

8.5 R dependencies

!!! DON’T EXECUTE THIS CODE !! The instructor already installed the R packages on your computers, but the code below shows you how these packages were installed.

Some of the required packages cannot be installed with install.packages() and require the BiocManager R package to be installed. We are also installing some packing directly from GitHub repositories.

#Check if BiocManager is installed and if not install it
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

#Install the other packages using BiocManager
install.packages("devtools")

library(devtools)
BiocManager::install("Biostrings")
BiocManager::install("ORFik")

#Install required packages for this tutorial
install_github("mhahsler/rBLAST")
install.packages("remotes")
remotes::install_github("GuillemSalazar/FastaUtils")
install.packages("ape")
install.packages("seqinr")

8.6 Load R packages and user-defined functions

To conduct the analyses, you need to load R packages and source R user-defined functions. Please copy the following code in your R script and then execute it.

8.6.1 Bioinformatics

Create an object with all the required R packages and load them:

# This will make a list of packages that another function will use to make sure they are all installed and loaded.
list.of.packages <- c("ape",
                      "Biostrings",
                      "dplyr",
                      "FastaUtils",
                      "ORFik",
                      "readr",
                      "tidyr",
                      "rBLAST",
                      "seqinr",
                      "stringr")

# Use lapply to load all of the packages in the list.
lapply(list.of.packages, require, character.only = TRUE)

Load all the user-defined function located in Functions/:

# Load all of the functions written for DraftGenomeMineR     
files.sources <- list.files("Functions", full.names=T)
sapply(files.sources, source)

8.7 Set environment for BLAST analysis

To be able to navigate between folders within our project, we will set an object entitled project.folder, which contains the path to the root of the project folder.

8.7.1 Bioinformatics

To create this object, do as follows:

Create an object with working directory

#Copy the output of getwd() in the object as follows:
project.folder <- getwd()

Call project.folder in the Console to check that it is correct. It should return: "~/DraftGenomeMineR".

Set working directory to project.folder as follows:

#Set working directory
setwd(project.folder)

8.8 Run the BLAST analysis

8.8.1 Overview

In this section, you will perform a BLAST analysis to identify scaffolds containing AQP genes. For this analysis, you will need the following files:

Input:
- Draft_Genome_Assembly.fasta: The file containing all the scaffolds (= draft genome)
- FASTAs/PIP1_3.fa: The file containing an AQP protein sequence used to mine the draft genome
Output:
- Unique_Filtered_Blast_Hit_Info.csv: A CSV containing the results of the BLAST analysis (= which scaffold matches the AQP protein sequence)

8.8.2 Bioinformatics

Set general parameters for the BLAST analysis:

# There are a few parameters that need to be set. 
# Let's create some R objects to store file paths and parameters for the BLAST search.

# We need a query (a fasta file of a gene you want to find in the draft genome), a draft genome assembly, BLAST databases, paths to other required folders (described in the README), and other parameters that can be used to filter results.
query.file.path <- "FASTAs/PIP1_3.fa"
genome.file.name <- "Draft_Genome_Assembly.fasta"
genome.path <- "FASTAs/Draft_Genome_Assembly.fasta"
blast.db.path <- "BlastDBs/Draft_Genome_Assembly.fasta"
AA.BlastDB.folder <- "AA_BlastDB/"
AA.ORF.folder <- "AA_ORFs/"
max.e <- 5E-50
perc.ident <- 90.000
query.type <- "AA"
blast.type <- "tblastn"
make.BlastDB <- T
BlastDB.type <- "nucl"

Read in the sagebrush draft genome assembly in FASTA format:

# Read in the genome assembly file
genome <- readLines(con = genome.path)
head(genome) # Print the top 6 lines of the fasta. There should be no spaces in what is printed. Each header should be on one line, followed by its entire scaffold on the next.

Make a BLAST database for query:

# Do you need to make a BLAST database or use an existing one?
if(make.BlastDB == TRUE){
  setwd("BlastDBs/")
  makeblastdb(file = genome.file.name, dbtype = BlastDB.type)
}

Read in the query file in FASTA format:

# Read in the query. What type of molecule is the query? DNA (or RNA) or amino acid sequences?
setwd(project.folder) # Previous lines changed the wd, so we need to go back to the project folder.
if (query.type == "DNA") {
  query <- readDNAStringSet(filepath = query.file.path,
                            format = "fasta")
} else {
  query <- readAAStringSet(filepath = query.file.path,
                           format = "fasta")
}

Perform BLAST analysis (BLAST needs to be installed on the computer):

# Now we have everything set up and can perform a BLAST search.
bl <- blast(db = blast.db.path, type = blast.type) # Create an object with the BLAST database and the type of BLAST search to perform.
cl <- predict(bl, query) # Perform the BLAST search
head(cl) # Look at the top BLAST hits

Filter output of BLAST analysis:

# We can now filter out bad BLAST hits using a few thresholds and parameters. This leaves us with only the best matches to our query.
# What parameters do you think would give you just the best candidates to be your gene of interest?
cl.filt <- subset(x = cl, Perc.Ident >= perc.ident & E <= max.e)
cl.filt.unique <- cl.filt[!duplicated(cl.filt[,c('SubjectID')]),] # SubjectID is the column that contains the scaffold names
cl.filt.unique
nrow(cl)
nrow(cl.filt)
nrow(cl.filt.unique)
blast.hits.to.extract <- subset(x = cl.filt.unique, SubjectID == cl.filt.unique[1,2])

Write output of BLAST analysis in csv file:

# We have evaluated the top BLAST hits and can see a clear best hit. Let's save this data so we can extract just that scaffold later.
write.csv(x = blast.hits.to.extract, file = "Unique_Filtered_Blast_Hit_Info.csv", row.names = F)

Quit R (using quit()) and inspect the content of Unique_Filtered_Blast_Hit_Info.csv using e.g. vim.

9 Module 2

9.1 Objectives

In this module, we are conducting the following tasks:

Setting up your working environment (as shown in module 1)
Extracting scaffold(s) identified in module 1 from the de novo draft genome assembly (in FASTAs/Draft_Genome_Assembly.fasta) and saving the output as a FASTA object/file.
Finding ORFs along scaffold(s) using the user-defined function findORFsTranslateDNA2AA() and saving the output into a FASTA object/file. In molecular genetics, an Open Reading Frame (ORF) is the part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that begins with a start codon (usually AUG) and ends at a stop codon (usually UAA, UAG or UGA). We will study the code implemented in findORFsTranslateDNA2AA() to fully understand the applied approach.
Produce maps of recovered ORFs along scaffold(s). These pdf maps will be saved in ORFs_map/.

9.2 Setting up your working environment

Before starting analyses in R, students have to complete these following tasks:

Remotely connect to their computers (using their individual accounts) with ssh protocol.
Navigate to project directory (DraftGenomeMineR/) using cd.
Create a new folder (Output_FASTAs/) using mkdir to store results of BLAST scaffold analysis.
Create a new folder (ORFs_report/) using mkdir to store results of ORFs analysis.
Start a new R session using R-4 command.
Load R packages and user-defined functions.
Set working directory to DraftGenomeMineR/.

For an example of code, see below:

#1. ssh
ssh svenbuerki@132.178.142.214

#2. Navigate to DraftGenomeMineR/
cd DraftGenomeMineR/

#3. & 4. Create new folders to save files
# --> If these folders already exist then don't execute the following code 
mkdir Output_FASTAs/
mkdir ORFs_report/
  
#5. Start R session
R

#6. This will make a list of packages that another function will use to make sure they are all installed and loaded.
list.of.packages <- c("ape",
                      "Biostrings",
                      "dplyr",
                      "FastaUtils",
                      "ORFik",
                      "readr",
                      "tidyr",
                      "rBLAST",
                      "seqinr",
                      "stringr")

# Use lapply to load all of the packages in the list.
lapply(list.of.packages, require, character.only = TRUE)

# Load all of the functions written for DraftGenomeMineR     
files.sources <- list.files("Functions", full.names=T)
sapply(files.sources, source)

#7. Copy the output of getwd() in the object as follows:
project.folder <- getwd()

#Set wd
setwd(project.folder)

9.3 Extract scaffold(s) in draft genome

9.3.1 Overview

In this section, you will use the output of the BLAST analysis to extract the scaffold containing the AQP gene. For this analysis, you will need the following files:

Input:
- FASTAs/Draft_Genome_Assembly.fasta: The file containing all the scaffolds (= draft genome)
- Unique_Filtered_Blast_Hit_Info.csv: A CSV containing the results of the BLAST analysis (= which scaffold matches the AQP protein sequence)
Output:
- Output_FASTAs/Scaffold151535.fa: A FASTA file with the scaffold containing the AQP gene

!!NOTE: Don’t forget to copy paste the code in your R script!!

9.3.2 Bioinformatics

In R, open Unique_Filtered_Blast_Hit_Info.csv containing results of BLAST analysis and extract name of target scaffold(s)

#Open BLAST file
cl.filt.unique <- read.csv(file = "Unique_Filtered_Blast_Hit_Info.csv") # Output of module 1

#Scaffold IDs
cl.filt.unique$SubjectID

Extract scaffold(s) sequences from draft genome file (FASTAs/Draft_Genome_Assembly.fasta)

#Open draft genome file
genome <- readLines("FASTAs/Draft_Genome_Assembly.fasta")

#Check formatting of file
head(genome)

#How many scaffolds?
Nscaff <- length(genome)/2

#Extract scaffolds in a loop

#Create data frame to store scaffold data
scaffold <- data.frame("scaffoldID" = character(length(cl.filt.unique$SubjectID)), "scaffoldSeq" = character(length(cl.filt.unique$SubjectID)))
#Start populating data frame
scaffold$scaffoldID <- cl.filt.unique$SubjectID

for(i in 1:nrow(scaffold)){
  # Find and Extract seq of each scaffold from draft genome FASTA file
  scaffold$scaffoldSeq[i] <- genome[match(paste0(">",scaffold$scaffoldID[i]), genome)+1]
}

#Convert into FASTA format
scaffoldFASTA <- apply(scaffold, 1, paste, collapse="\n")

#Save/export FASTA file
write.table(scaffoldFASTA, "Output_FASTAs/Scaffold151535.fa", row.names = F, col.names=F, quote = F)

9.4 Find ORFs along scaffold(s)

9.4.1 Overview

In this section, you will find ORFs along the scaffold containing the AQP gene. For this analysis, you will need the following files:

Input:
- Output_FASTAs/Scaffold151535.fa: A FASTA file with the scaffold containing the AQP gene
Output:
- ORFs_report/Scaffold151535_ORFs.csv: A CSV file with all predicted ORFs along scaffold
- AA_ORFs/Scaffold151535_ORFs.fa: A FASTA file with AA sequences for all predicted ORFs (= input for module 3)
- ORFs_map/Scaffold151535_ORFs_annotated.pdf: A PDF with the map of all recovered ORFs along the scaffold

!!NOTE: Don’t forget to copy paste the code in your R script!!

9.4.2 Bioinformatics

We will now find ORFs in scaffold(s) using findORFsTranslateDNA2AA(). This user-defined function (UDF) relies on functions from the ORFik R package. The output will be saved in ORFs_report/.

Let’s first look at the structure of the findORFsTranslateDNA2AA() function to understand the applied approach:

##################################
#WARNING: DON'T EXECUTE THIS CODE#
##################################

#####~~~
#A UDF used to findORFs in scaffold file (FASTA format) and produce AA sequences as well as report
#####~~~
findORFsTranslateDNA2AA <- function(scaffold, scaffoldID, MinLen){
  #Extract scaffold sequence (and convert to right format)
  seqRaw <- scaffold[c(grep(paste(">", scaffoldID, sep=''), scaffold)+1)] 
  seqs <- DNAStringSet(seqRaw)
  
  #positive strands 
  pos <- ORFik::findORFs(seqs, startCodon = "ATG", minimumLength = MinLen) 
  
  #negative strands (DNAStringSet only if character) 
  neg <- ORFik::findORFs(reverseComplement(DNAStringSet(seqs)), startCodon = "ATG", minimumLength = MinLen) 
  pos <- relist(c(GRanges(pos,strand = "+"),GRanges(neg,strand = "-")),skeleton = merge(pos,neg))
  
  #Process output
  pos <- as.data.frame(pos)
  pos$scaffoldID <- rep(scaffoldID, nrow(pos))
  pos$ORFID <- paste("ORF_", seq(from=1, to=nrow(pos)), sep='')
  
  #Extract and write ORFs from seq and produce AA sequence
  for(i in 1:nrow(pos)){
    if(as.character(pos$strand[i]) == "+"){
      extSeq <- as.DNAbin(DNAStringSet(paste(strsplit(seqRaw, split='')[[1]][c(pos$start[i]:pos$end[i])], collapse = "")))
      AAseq <- paste(paste(">", pos$scaffoldID[i], "_", pos$ORFID[i], sep=''), paste(as.character(trans(extSeq, code = 1, codonstart = 1))[[1]], collapse = ''), sep='\n')
      write.table(AAseq, file = paste("AA_ORFs/", pos$scaffoldID[i], "_", pos$ORFID[i], ".fa", sep=''), col.names = F, row.names = F, quote = F)
    }
    if(as.character(pos$strand[i]) == "-"){
      revComp <- reverseComplement(DNAStringSet(seqs))
      
      extSeq <- as.DNAbin(DNAStringSet(paste(strsplit(as.vector(revComp), split='')[[1]][c(pos$start[i]:pos$end[i])], collapse = "")))
      AAseq <- paste(paste(">", pos$scaffoldID[i], "_", pos$ORFID[i], sep=''), paste(as.character(trans(extSeq, code = 1, codonstart = 1))[[1]], collapse = ''), sep='\n')
      write.table(AAseq, file = paste("AA_ORFs/", pos$scaffoldID[i], "_", pos$ORFID[i], ".fa", sep=''), col.names = F, row.names = F, quote = F)
    }
  }
  
  #Merge all ORFs for BLAST analysis
  system(paste("cat AA_ORFs/", pos$scaffoldID[i], "* > AA_ORFs/", pos$scaffoldID[i], "_ORFs.fa", sep=''))
  system(paste("rm AA_ORFs/", pos$scaffoldID[i], "_ORF_*", sep=""))
  #Save pos file
  write.csv(pos, file = paste("ORFs_report/",pos$scaffoldID[i], "_", "ORFs.csv", sep=''), col.names = T, row.names = F, quote = F)
}

Now, we can apply findORFsTranslateDNA2AA() to our data:

# Open FASTA file with target scaffold
scaffold <- readLines("Output_FASTAs/Scaffold151535.fa")
scaffoldID <- grep(pattern = "^>", x = scaffold, value = T)
scaffoldID <- gsub(pattern = ">", replacement = "", x = scaffoldID)
# Apply UDF to find ORFs and translate these into AA
# --> we will use the AA sequences to conduct a BLAST analysis and identify ORFs coding for Aquaporin exons
tryCatch(
  {
    for(i in 1:length(scaffoldID)){
      findORFsTranslateDNA2AA(scaffold = scaffold, scaffoldID = scaffoldID[i], MinLen = 40)
    }
  })

We should now have ORFs reports and FASTA files of translated ORFs. Let’s check these outputs.

# Open the CSV output of the UDF containing data on predicted ORFs and associated sequences 
orf.report <- read.csv("ORFs_report/Scaffold151535_ORFs.csv")

# How many ORFs were found?
nrow(orf.report) 

# What are the longest ORFs?
orf.report[order(-orf.report$width),] 

# Look at translated ORFs (=AA sequences)
translated.orfs <- readLines("AA_ORFs/Scaffold151535_ORFs.fa")
head(translated.orfs)

9.5 Produce maps of recovered ORFs along scaffold(s)

Here, we are visualizing the ORFs recovered by our analysis along each scaffold by producing maps saved in pdf format in ORFs_map/. The code also checks if the folder ORFs_map/ where files will be saved exists and if not creates it.

9.5.1 Bioinformatics

First study the code provided below and then execute it. Don’t forget to copy and paste it in your R script.

###
#Build map of scaffold with ORFs
###

#Read scaffold FASTA file (line by line)
scaffold <- readLines("Output_FASTAs/Scaffold151535.fa")

#List all files with ORFs
ORFfiles <- list.files(path = "ORFs_report", pattern = ".csv", full.names = T)

#Check if folder where ORF maps will be saved exists
# if not then creates it
output_dir <- file.path(paste0(getwd(), "/ORFs_map/"))

if(dir.exists(output_dir)){
  print(paste0("Dir", output_dir, " already exists!"))
}else{
  print(paste0("Created ", output_dir))
  dir.create(output_dir)  
}

#Produce a map (in pdf format) for each scaffold
for(i in 1:length(ORFfiles)){
  print(ORFfiles[i])
  #Read file in
  ORF <- read.csv(ORFfiles[i])
  
  #Process FASTA scaffold sequence
  seq <- strsplit(scaffold[grep(paste(">", ORF$scaffoldID[1], sep=''), scaffold)+1], split='')
  
  #Separate ORFs by strand
  ORFplus <- subset(ORF, ORF$strand == "+")
  ORFneg <- subset(ORF, ORF$strand == "-")
  
  #Create plot
  pdf(paste("ORFs_map/", ORF$scaffoldID[1], "_ORFs_annotated.pdf", sep=''))
  
  #Initiate plot
  plot(x=1, y=1, xlim=c(0,length(seq[[1]])), ylim=c(0,2), type='n', bty="n", axes=F, xlab = "", ylab='')
  #Add title
  text(x=5, y=2, paste(ORF$scaffoldID[1], " (strand: + in grey and - in blue)", sep=''), adj=0, cex=.8)
    
  #Create a segment with length of scaffold
  segments(x0=0, x1=length(seq[[1]]), y0=1, y1=1, col='black', lwd=3)
    
  #Add ORFs: rectangles (grey: +, blue: -)
  rect(xleft=ORFplus$start, xright=ORFplus$end, ybottom=0.75, ytop=1.25, col='grey')
  rect(xleft=ORFneg$start, xright=ORFneg$end, ybottom=0.75, ytop=1.25, col='blue')
  text(x=(ORFneg$start + ORFneg$end)/2, y=0.7, paste(ORFneg$ORFID, " (", ORFneg$start, ":", ORFneg$end, ")", sep=''), srt=90, col='blue', adj=1, cex=0.4)
  text(x=(ORFplus$start + ORFplus$end)/2, y=1.3, paste(ORFplus$ORFID, " (", ORFplus$start, ":", ORFplus$end, ")", sep=''), srt=90, col='black', adj=0, cex=0.4)
    
  #Add x axis
  axis(side = 1)
  mtext("Sequence (bp)", pos = c(0,0.5), side=1, line=2, cex.lab=0.6,las=1)
  #Close pdf
  dev.off()
}

Figure 9.1 shows the location of the predicted ORFs along Scaffold151535 inferred by the code displayed above.

Figure 9.1: Map of predicted ORFs along scaffold151535.

9.6 Questions

How many ORFs were discovered on Scaffold151535?

How many ORFs are on the + and - strands?

Write an R code based on ORFs_report/Scaffold151535_ORFs.csv to answer the questions.

10 Module 3

10.1 Objectives

In this module, we are conducting the following tasks:

Annotate ORFs inferred in module 2 using online protein BLAST tool. Students will use the output of the protein BLAST analysis to identify ORF(s) coding for Aquaporin genes.
Extract ORF(s) identified by protein BLAST analysis to reconstruct Aquaporin gene sequence. Students will provide DNA sequence of Aquaporin gene located on scaffold and its associated protein sequence.

10.2 Annotate ORFs using online protein BLAST tool

10.2.1 Input file for protein BLAST analysis

The FASTA file Scaffold151535_ORFs.fa (see Figure 10.1) containing AA sequences of ORFs identified in module 2 is available on the shared Google Drive or on GitHub.

Figure 10.1: FASTA file containing all AA sequences for ORFs inferred along scaffold. See module 2 for more details.

10.2.2 Protein BLAST analysis

Students download input file on their personal computers and open it in their favorite text editor.
Go on the online protein BLAST platform (described in Altschul et al., 1997).
Copy content of Scaffold151535_ORFs.fa as shown in Figure 10.2 and press BLAST button to send the query.

Figure 10.2: Online protein BLAST form.
Inspect output of protein BLAST analysis and identified ORF(s) along scaffold coding for Aquaporin gene products (see Figure 10.3).

Figure 10.3: Output of the protein BLAST analysis. Use the dropdown button to select each ORF and identify their gene products.
What ORFs are coding for Aquaporin gene products?

10.2.3 Questions

How many ORFs on Scaffold151535 are coding for AQP exons and which are they?

Based on the BLAST analysis and the ORFs map (Figure 9.1), do you predict that the Aquaporin gene sequence in Scaffold151535 is complete or not? Please motivate your answer

10.3 Extract ORF(s) identified by protein BLAST analysis to reconstruct Aquaporin gene sequence

Students are tasked to develop an R code producing the DNA sequence of Aquaporin gene located on identified scaffold and its associated protein sequence.

10.4 A user-defined function to produce a protein sequence

Please find below a user defined function (UDF) that concatenates ORFs into a protein sequence in FASTA format. This function implements defensive programming aiming at providing the users with meaningful error messages to debug their code.

In this UDF, defensive programming has been implemented 2-ways:

The code checks that the FASTA file declared in fastaAA exists in the working directory (or in the provided path) and stops and return an error message pertaining to this issue if it doesn’t.
The code checks that all the ORFs declared in ORFsToCat exists in the FASTA file (= declared in fastaAA) and stops and return an error message pertaining to this issue if they are not all present in the file.

Finally, the UDF as a logical argument entitled stopCodon allowing to declare whether stop codons (represented by *) are present in the AA sequences (in your FASTA file). Here it is especially convenient since the concatenated protein sequence should not contain stop codons for our subsequent analyses. See UDF below for more details.

The function is provided below for you to inspect.

###~~~~
#ORFs2protSeq - A UDF to produce a protein sequence from ORFs
# Arguments:
# - fastaAA: Name of Fasta file containing ORFs AA sequences
# - ORFsToCat: vector with names of ORFs to concatenate (as in FASTA file and including ">" and in the right order)
# - stopCodon: Logical (TRUE/FALSE) variable to state whether stop (*) codons are in the AA sequences
# Output:
# - An object with the concatenated protein sequence in FASTA format
ORFs2protSeq <- function(fastaAA, ORFsToCat, stopCodon){
  #Check if the FASTA file exists
  checkFile <- file.exists(fastaAA)
  if(checkFile == FALSE){
    #Stop the function and print error
    stop(call. = FALSE, paste(fastaAA, "does not exists! Please check that your working directory is set properly.", sep = " "))
  }else{
    #Open FASTA file
    fas <- readLines(fastaAA)
    #Find ORFs in Fasta file
    ToFind <- match(ORFsToCat, fas)
    #Check that all the ORFs were found in the Fasta
    checkORFs <- is.na(ToFind)
    if(length(which(checkORFs == TRUE)) > 0){
      #Stop the function and print error
      stop(call. = FALSE, paste(paste(ORFsToCat[which(checkORFs == TRUE)], collapse = " "), "couldn't be found in", fastaAA, " Please fix the error(s).", sep = " "))
    }else{
      print(paste("Producing protein sequence with", paste(ORFsToCat, collapse = ", "), sep = " "))
      #Cat ORFs into AA sequence
      protSeq <- paste(fas[ToFind+1], collapse = "")
      if(is.logical(stopCodon) == TRUE){
        #Discard stop codons from final sequence
        protSeq <- gsub("[*]", "", protSeq)  
      }
    }
    #Produce final protein sequence in FASTA format
    # header
    head <- paste(">Protein_sequence_from_", paste(gsub("[>]", "", ORFsToCat), collapse = "|"), sep= "")
    protSeqOUT <- paste(head, protSeq, sep = "\n")
    print(protSeqOUT)
  }
  return(protSeqOUT)
}

10.4.1 Bioinformatics

In this section, we apply the UDF ORFs2protSeq to Scaffolds_groups.fasta (= fastaAA). First, we will use an example where the ORFsToCat argument contains ORFs that are not found in the FASTA file to demonstrate how defensive programming was implemented in the UDF.

The step-by-step approach is as follows:

Copy the UDF ORFs2protSeq() in your R code and execute it to source the function in the global environment (on the remote computer)
Apply ORFs2protSeq() to a FASTA file containing ORFs

#Name of the FASTA file (make sure the path is correct)
fastaAA <- "Data/Scaffold151535_ORFs.fa"

#Fasta headers of the ORFs you want to concatenate
# Here some ORFs names are faulty
ORFsToCat <- c(">Scaffold151535_ORF_1tt", ">Scaffold151535_ORF_13c", ">Scaffold151535_ORF_2")

#Stop codon
stopCodon <- TRUE

#Apply UDF
ORFs2protSeq(fastaAA, ORFsToCat, stopCodon)

## Error: >Scaffold151535_ORF_1tt >Scaffold151535_ORF_13c couldn't be found in Data/Scaffold151535_ORFs.fa  Please fix the error(s).

Apply the UDF again, but this time with matching ORFs (= they are declared in the FASTA file)

#Name of the FASTA file
fastaAA <- "Data/Scaffold151535_ORFs.fa"

#Fasta headers of the ORFs you want to concatenate
# Here ORF names are accurate
ORFsToCat <- c(">Scaffold151535_ORF_1", ">Scaffold151535_ORF_13", ">Scaffold151535_ORF_2")

#Stop codon
stopCodon <- TRUE

#Apply UDF
ORFs2protSeq(fastaAA, ORFsToCat, stopCodon)

## [1] "Producing protein sequence with >Scaffold151535_ORF_1, >Scaffold151535_ORF_13, >Scaffold151535_ORF_2"
## [1] ">Protein_sequence_from_Scaffold151535_ORF_1|Scaffold151535_ORF_13|Scaffold151535_ORF_2\nMSSKPEIFEISSNDSTSSDSITSPWSSYFPSTYSNSVSKKSTAARSGKKVSTSVNPPVTTQALSMTSKRTQVLGLPTKENYAKITAKGKGKRTLTMEYLTILPEASALFNICLIGWSVMNFIGWASKYLRSLLAICSRARANFSMGAMHLCMHVCNEMLLVSVKIKLLVFIHHQCQNTTWYIVTCLDLTYQLNSYFKHKQKTQLHTFPTL"

## [1] ">Protein_sequence_from_Scaffold151535_ORF_1|Scaffold151535_ORF_13|Scaffold151535_ORF_2\nMSSKPEIFEISSNDSTSSDSITSPWSSYFPSTYSNSVSKKSTAARSGKKVSTSVNPPVTTQALSMTSKRTQVLGLPTKENYAKITAKGKGKRTLTMEYLTILPEASALFNICLIGWSVMNFIGWASKYLRSLLAICSRARANFSMGAMHLCMHVCNEMLLVSVKIKLLVFIHHQCQNTTWYIVTCLDLTYQLNSYFKHKQKTQLHTFPTL"

Amend the above code to save the output of ORFs2protSeq() in a FASTA file located at a specific path on your remote computer

11 Module 4

11.1 Objectives

The objectives of this module are to validate the AQP protein sequences obtained in module 3 and infer its function. This will be done by conducting the following analyses:

Predict the number of protein transmembrane helices based on the approach implemented in TMHMM (Krogh et al., 2001). This approach relies on Hidden Markov Models.
Identify NPA motifs using an R script relying on Biostring (Pagès et al., 2019) and IRanges (Lawrence et al., 2013) packages.
Model 3D AQP protein using an approach implemented in Phyre² and predict protein (therefore its function). The approach for protein modeling, prediction and analysis is presented in Kelley et al. (2015).

11.1.1 Predictions

To be valid, the AQP protein sequence should have:

Six transmembrane helices.
Two hydrophobic loops.
Two NPA motifs (one per loop).
Mutations in the NPA motifs (at N or A positions) changes affinity with water and therefore suggests different substrate (i.e., this would be indicative of a change of function).

11.2 Predict the number of protein transmembrane helices (THM) (and outside loops)

To conduct this analysis (based on >Scaffold151535_ORF_26) do the following:

Go to the TMHMM website: https://services.healthtech.dtu.dk/service.php?TMHMM-2.0
Paste the FASTA AA sequence generated in module 3, here corresponding to Scaffold151535_ORF_26 as shown in Figure 11.1 and submit the analysis.

Figure 11.1: Snapshot of TMHMM website showing how to submit job.
Inspect the output of the TMHMM analysis (see Figure 11.2). For AQP proteins, we are expecting:
- Number of predicted TMHs (= Trans-Membrane Helix): 6.
- Outside loops: 2.
  
  Figure 11.2: Snapshot of TMHMM results.

11.3 Identify the number of NPA motifs and type

In this section, we are identifying the location and type of NPA motif along the AQP protein sequence. The instructors provide a template of an R code (based on Scaffold151535_ORF_26) to conduct this task relying on the Biostring (Pagès et al., 2019) and IRanges (Lawrence et al., 2013) packages. These latter packages should already be installed on your computer, but if they are not then install and load them as follows:

#Install Biostrings and IRanges (deposited on Bioconductor)
BiocManager::install("Biostrings")
BiocManager::install("IRanges")

#Load the packages to check installation
pkgs <- c("Biostrings", "IRanges")
lapply(pkgs, require, character.only = TRUE)

Now that the R packages are installed and loaded we can carry on with our coding:

##
#AA sequence
##
#User copy AA sequence in the object below
# Here, associated to Scaffold151535_ORF_26
AAseq <- "MEGKEEDVKLGANKYSERQPIGTSAQTDKDYKEPPPAPLFEPGELSSWSFYRAGIAEFIATFLFLYITVLTVMGVVKSPTKCGTVGIQGIAWAFGGMIFALVYCTAGISGIFSEKPLFQLSF*"

##
#Create a data.frame to store data on
# - N_NPA: Number of NPA motifs
# - NPA_motif: NPA motif
# - StartNPA: Starting position of NPA motif in AA seq
##
NPAdat <- data.frame("N_NPA" = numeric(length(AAseq)), "NPA_motif" = character(length(AAseq)), "StartNPA" = numeric(length(AAseq)))

#Start by matching to find NP motif and locations along sequence
tmp <- Biostrings::matchPattern(pattern = "NP", Biostrings::AAString(AAseq)) 
tmp <- as.data.frame(IRanges::ranges(tmp))
    
if(nrow(tmp) > 0){
  #Print that it found NP motif
  print("NP found")
  #There is an NP motif
  #N NP
  NPAdat$N_NPA <- nrow(tmp)
  #Infer motifs
  NPAdat$NPA_motif <- paste(paste(rep("NP", nrow(tmp)), strsplit(as.vector(AAseq), split='')[[1]][as.numeric(tmp[,2]+1)], sep = ''), collapse = '/')
  #Start motifs
  NPAdat$StartNPA <- paste(tmp[,1], collapse = '/')
  }else{
    #There isn't an NP motif, so we look for a PA motif
  tmp <- Biostrings::matchPattern(pattern = "PA", Biostrings::AAString(AAseq))
  tmp <- as.data.frame(IRanges::ranges(tmp))
      if(nrow(tmp) > 0){
      #Print that it found PA motif
      print("PA found")
      #N NP
      NPAdat$N_NPA <- nrow(tmp)
      #Infer motifs
      NPAdat$NPA_motif <- paste(paste(strsplit(as.vector(AAseq), split='')[[1]][as.numeric(tmp[,1]-1)], rep("PA", nrow(tmp)), sep = ''), collapse = '/')
      #Start motifs
      NPAdat$StartNPA <- paste(tmp[,1]-1, collapse = '/')
      }else{
        next
      }
}

## [1] "PA found"

print(paste("The following NPA motifs were found for AA sequence:", AAseq, sep=" "))

## [1] "The following NPA motifs were found for AA sequence: MEGKEEDVKLGANKYSERQPIGTSAQTDKDYKEPPPAPLFEPGELSSWSFYRAGIAEFIATFLFLYITVLTVMGVVKSPTKCGTVGIQGIAWAFGGMIFALVYCTAGISGIFSEKPLFQLSF*"

print(NPAdat)

##   N_NPA NPA_motif StartNPA
## 1     1       PPA       35

##
#Write data out
##
# If you want to write the data, execute the following command (don't forget to adjust your path)
#write.csv(NPAdat, file='Scaffold151535_ORF_26_NPA_motifs.csv', row.names = F, quote = F)

11.3.1 Question

Based on your number of NPA motifs, how many hydrophobic loops does your Aquaporin protein contain? How does this result compare with the output of the TMHMM analysis?

11.4 Model 3D AQP protein

To model 3D structure of AQP protein and predict its function do the following (here based on >Scaffold151535_ORF_26):

Go to Phyre² website: http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index
Copy only the AA sequence (without the FASTA header) and fill the form as shown in Figure 11.3. Once completed, please submit job by pressing Phyre Search. The job will take ca. 50 minutes to run.

Figure 11.3: Snapshot of Phyre2 website used to model 3D structure of protein and predict its function.
You will relieve an email when the analysis is completed. A link to the output of the Phyre² analysis will send you to the report page as shown in Figure 11.4.

Figure 11.4: Results of Phyre2 analysis.
Note that although the AQP protein sequence is not complete, the analysis still modeled it correctly and predicted that it belonged to the PIP subfamily (see Figure 11.4).

12 Discussion and Conclusions

Based on the analyses and data gathered in this lab, what are your thoughts and conclusions on the Aquaporin gene/protein located on Scaffold151535?

Examples of questions that we could ask ourselves to answer the above question are:

How many Aquaporin exons were recovered and where are they located on the scaffold?
How many TMHs and loops were inferred?
How many NPA motifs and types were inferred?
What is the predicted protein and its function based on the modeling analysis conducted on the AA sequence?

Based on these latter evidence, do you think that the Aquaporin protein sequence is complete? The instructor invites you to consult Melton et al. (2021) for more information allowing answering these questions and come to a conclusion.

13 Your turn!

Each group is now working as a team to reproduce the analyses presented here on their assigned scaffold.

Good luck!

13.1 Scaffolds assigned to groups

The FASTA file with scaffolds sequences can be downloaded here. The scaffolds assigned to each group are as followed:

Group 1: Scaffold106379.
Group 2: Scaffold254734.
Group 3: Scaffold348135.

References

Altschul, S.F., T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic acids research 25: 3389–3402.

Chaumont, F., M. Moshelion, and M.J. Daniels. 2005. Regulation of plant aquaporin activity. Biology of the Cell 97: 749–764.

Danielson, J.Å., and U. Johanson. 2008. Unexpected complexity of the aquaporin gene family in the moss physcomitrella patens. BMC plant biology 8: 1–15.

Ishibashi, K. 2006. Aquaporin subfamily with unusual NPA boxes. Biochimica et Biophysica Acta (BBA)-Biomembranes 1758: 989–993.

Kelley, L.A., S. Mezulis, C.M. Yates, M.N. Wass, and M.J. Sternberg. 2015. The Phyre2 web portal for protein modeling, prediction and analysis. Nature protocols 10: 845–858.

Krogh, A., B. Larsson, G. Von Heijne, and E.L. Sonnhammer. 2001. Predicting transmembrane protein topology with a hidden markov model: Application to complete genomes. Journal of molecular biology 305: 567–580.

Lawrence, M., W. Huber, H. Pagès, P. Aboyoun, M. Carlson, R. Gentleman, M. Morgan, and V. Carey. 2013. Software for computing and annotating genomic ranges. PLoS Computational Biology 9: Available at: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118.

Li, G., V. Santoni, and C. Maurel. 2014. Plant aquaporins: Roles in plant physiology. Biochimica et Biophysica Acta (BBA)-General Subjects 1840: 1574–1582.

Melton, A.E., J. Beck, S.J. Galla, J. Jenkins, L. Handley, M. Kim, J. Grimwood, et al. 2021. A draft genome provides hypotheses on drought tolerance in a keystone plant species in western north america threatened by climate change. Ecology and Evolution 11: 15417–15429. Available at: https://onlinelibrary.wiley.com/doi/abs/10.1002/ece3.8245.

Pagès, H., P. Aboyoun, R. Gentleman, and S. DebRoy. 2019. Biostrings: Efficient manipulation of biological strings.

BIOL 497/597 - Genomics & Bioinformatics

Lab: Mining sagebrush draft genome

Sven Buerki & Anthony Melton - Boise State University

2024-04-12

1 Introduction

1.1 AQP defining characteristics

1.2 AQP subfamilies

2 Learning outcomes

3 Objectives and scientific question

3.1 Scaffolds assigned to groups

4 Group lab report (TOTAL: 150 points)

4.1 Apply the IMRAD format and supply data and code

4.2 Report deadline

5 Group lab oral presentation (TOTAL: 50 points)

6 Analytical workflow

7 Project structure

7.1 Where are the data located?

7.2 How do I access the data and run analyses?

8 Module 1

8.1 Objectives

8.2 Remotely connect to computer

8.3 Install R v.4.1.3

8.4 Start R session and associated R script

8.4.1 To do before starting

8.5 R dependencies

8.6 Load R packages and user-defined functions

8.6.1 Bioinformatics

8.7 Set environment for BLAST analysis

8.7.1 Bioinformatics

8.8 Run the BLAST analysis

8.8.1 Overview

8.8.2 Bioinformatics

9 Module 2

9.1 Objectives

9.2 Setting up your working environment

9.3 Extract scaffold(s) in draft genome

9.3.1 Overview

9.3.2 Bioinformatics

9.4 Find ORFs along scaffold(s)

9.4.1 Overview

9.4.2 Bioinformatics

9.5 Produce maps of recovered ORFs along scaffold(s)

9.5.1 Bioinformatics

9.6 Questions

10 Module 3

10.1 Objectives

10.2 Annotate ORFs using online protein BLAST tool

10.2.1 Input file for protein BLAST analysis

10.2.2 Protein BLAST analysis

10.2.3 Questions

10.3 Extract ORF(s) identified by protein BLAST analysis to reconstruct Aquaporin gene sequence

10.4 A user-defined function to produce a protein sequence

10.4.1 Bioinformatics

11 Module 4

11.1 Objectives

11.1.1 Predictions

11.2 Predict the number of protein transmembrane helices (THM) (and outside loops)

11.3 Identify the number of NPA motifs and type

11.3.1 Question

11.4 Model 3D AQP protein

12 Discussion and Conclusions

13 Your turn!

13.1 Scaffolds assigned to groups

References

8.3 Install `R` v.4.1.3