1 Overview

This document summarizes the procedures applied in this project to sequence, phase and annotate the sagebrush genome, based on the individual line G2_b24_1. Below, we briefly present the predicted sequence data and biomass requirements for each sequencing technology.

1.1 Sequence data

Our sequencing and assembly strategy is described in section 2, but a summary of sequence data and their associated NGS platforms is provided below (see also Table 1.1 for biomass requirement and Table 1.2 for sequencing data per technology):

Whole Genome Shotgun:
- Illumina HiSeq platform: 5 runs (each 2x150bp yielding 350M raw paired-end reads). These data will be used to infer genome size and complexity and produce a haploid draft genome.
- PacBio Sequel II (using 8M SMRT cells): 5 cells (each yielding 50Gbp). These data will be used to conduct de novo genome assembly.
Proximity ligation:
- Omni-C libraries (sequenced on an Illumina HiSeq X instrument): 3 runs (each 2x150bp yielding 100M raw paired-end reads). These data will be used to conduct phasing of the genome.
RNASeq:
- Illumina NovaSeq platform: A fraction of a run (2x150bp yielding 20M raw paired-end reads). These data will be used to perform genome annotation. They will be complemented by Illumina RNA-Seq data produced at HudsonAlpha (representing 150 leaf and root tissue samples from diploid A. tridentata subsp. tridentata involved in a drought GxE experiment).
SRA Experiments: Sequence Read Archive (SRA) data will be made available on NCBI upon submission of the genome.
Protein Sequences: Protein sequences will be made available.
Assembly/Genome: An assembly at pseudo-chromosome level will be made available on NCBI upon submission of the genome.

1.2 Biomass requirements per sequencing technology

We estimate that ca. 120 g of leaf biomass is necessary for genome sequencing, phasing and annotation. Table 1.1 provides a summary of the sequencing technologies applied in this project, their purpose and their biomass requirements. This estimate does not account for a preliminary DNA extraction trial (for PacBio sequencing).

Table 1.1: Summary of sequencing technologies, their purpose and requirements to complete the sagebrush genome project.
Type	Purpose	Biomass per unit	Total units / biomass	Number of plantlets
Illumina HiSeq	Genome size and complexity (incl. haploid draft genome)	1 Illumina library = 20 mg	1 library = 20 mg	1
PacBio sequencing	De novo genome assembly	1 cell = 20 g	5 cells = 100 g	125
Proximity ligation (Hi-C for phasing genome)	Phasing	1 Illumina library = 6 g	3 libraries = 18 g	23
RNA-seq	Annotation	1 library = 20 mg	1 library = 20 mg	1

For the Illumina HiSeq, we will produce only one library, but it will be sequenced across five HiSeq runs.
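The biomass budget in Table 1.1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-unit figures are copied from the table, whereas the value of 0.8 g of leaf biomass per plantlet is an assumption back-calculated from the plantlet column (it is not stated in the text), and the function names are hypothetical.

```python
# Biomass budget from Table 1.1, kept in integer milligrams to avoid rounding issues.
# Each entry: (biomass per unit in mg, number of units).
BIOMASS_MG = {
    "Illumina HiSeq": (20, 1),
    "PacBio sequencing": (20_000, 5),
    "Proximity ligation": (6_000, 3),
    "RNA-seq": (20, 1),
}

# ASSUMPTION: inferred from the plantlet counts in Table 1.1, not given in the text.
ASSUMED_MG_PER_PLANTLET = 800


def total_mg(mg_per_unit: int, units: int) -> int:
    """Total biomass (mg) for one technology."""
    return mg_per_unit * units


def plantlets_needed(biomass_mg: int, mg_per_plantlet: int = ASSUMED_MG_PER_PLANTLET) -> int:
    """Number of whole plantlets needed (integer ceiling division)."""
    return -(-biomass_mg // mg_per_plantlet)


if __name__ == "__main__":
    grand_total = 0
    for name, (per_unit, units) in BIOMASS_MG.items():
        mg = total_mg(per_unit, units)
        grand_total += mg
        print(f"{name}: {mg / 1000:g} g, {plantlets_needed(mg)} plantlet(s)")
    print(f"Total: {grand_total / 1000:g} g")
```

Under these figures the total comes to 118.04 g, which is consistent with the ca. 120 g estimate given above.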

1.3 Sequencing data by technology

The amount of data (in Gbp) produced per sequencing technology for de novo genome assembly is provided in Table 1.2, together with estimates of haploid genome coverage (x). We aim to sequence the sagebrush genome at between 50x and 100x coverage. Please see Wet-lab procedures for more details on these data.

Table 1.2: Summary of sequencing data produced by technology for the sagebrush genome project. Haploid genome size = 4.5Gbp
Type	Purpose	Data (Gbp)/Run	N. runs	Total data (Gbp)	Haploid genome coverage (x)
Illumina HiSeq	Genome size and complexity (incl. haploid draft genome)	105	5	525	116.7
PacBio sequencing	De novo genome assembly	50	5	250	55.6
Proximity ligation (Hi-C for phasing genome)	Phasing	30	3	90	20.0
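The figures in Table 1.2 follow from simple arithmetic: the yield of a paired-end run is the number of read pairs x 2 reads x read length, and coverage is total yield divided by the haploid genome size. The sketch below reproduces that calculation; the function names are hypothetical and the inputs are the values quoted in sections 1.1 and 1.3.

```python
import math

HAPLOID_GENOME_GBP = 4.5  # haploid genome size stated in Table 1.2


def paired_end_yield_gbp(read_pairs_millions: float, read_length_bp: int) -> float:
    """Gbp produced by one paired-end run: pairs x 2 reads x read length."""
    return read_pairs_millions * 1e6 * 2 * read_length_bp / 1e9


def coverage(gbp_per_run: float, n_runs: int, genome_gbp: float = HAPLOID_GENOME_GBP) -> float:
    """Haploid genome coverage (x) for n_runs runs of gbp_per_run each."""
    return gbp_per_run * n_runs / genome_gbp


def runs_needed(target_coverage: float, gbp_per_run: float,
                genome_gbp: float = HAPLOID_GENOME_GBP) -> int:
    """Smallest whole number of runs (or SMRT cells) reaching the target coverage."""
    return math.ceil(target_coverage * genome_gbp / gbp_per_run)


if __name__ == "__main__":
    hiseq = paired_end_yield_gbp(350, 150)   # Illumina HiSeq, 350M pairs, 2x150bp
    omnic = paired_end_yield_gbp(100, 150)   # Omni-C, 100M pairs, 2x150bp
    print(f"HiSeq:  {hiseq:g} Gbp/run, {coverage(hiseq, 5):.1f}x")
    print(f"PacBio: 50 Gbp/cell, {coverage(50, 5):.1f}x")
    print(f"Omni-C: {omnic:g} Gbp/run, {coverage(omnic, 3):.1f}x")
    print(f"SMRT cells for 50x:  {runs_needed(50, 50)}")
    print(f"SMRT cells for 100x: {runs_needed(100, 50)}")
```

With 50 Gbp per SMRT cell, five cells is the smallest number that reaches the 50x lower bound of the target range.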

2 Wet-lab procedures

The wet-lab work carried out by Dovetail Genomics, detailed here, is predicted to take 28 weeks (7 months) from receipt of the biomass. For the biomass production timetable, see Biomass production.

Ascertain the ploidy level, genome size and genome complexity of G2_b24_1. This will be done by counting chromosomes (based on root squashes), inferring 2C genome size using flow cytometry (based on root and leaf tissues) and estimating genome size and complexity by applying a k-mer approach on Illumina HiSeq data (5 runs, each 2x150bp; see Table 1.2 for more details). In addition, the Illumina data will be used to assemble a haploid draft genome (which should have a coverage of ca. 100x). The sequencing will be outsourced to GENEWIZ.
Dovetail Genomics will extract HMW DNA with an average fragment size ca. 50-100kb based on biomass supplied by the team (see Biomass production).
Whole genome sequencing (WGS) will be conducted using PacBio technology. We aim to generate 50-100x raw data coverage (haploid sagebrush genome size: 4.5 Gbp). To achieve this objective, 5 PacBio CLR (continuous long-read) libraries will be built and sequenced on the PacBio Sequel II using 8M SMRT cells. A SMRT cell should generate >50 Gbp of raw data. In our case, 250 Gbp (5 x 50 Gbp) of raw data will be generated, which corresponds to ca. 55x haploid genome coverage (this is at the low end of our target range, but it should be sufficient). These data will be used to assemble the haploid draft genome (used as input for phasing based on Omni-C libraries).
Three Omni-C libraries (= proximity ligation libraries) will be constructed (one library per 3 Gbp of the organism’s genome) based on the HMW DNA extractions. These libraries will be sequenced on an Illumina HiSeq X instrument (ca. 100M PE150bp read pairs per Gbp of genome size). Before final sequencing, Dovetail will conduct some shallow sequencing (ca. 2M PE75 bp) to assess library quality.
RNA will be extracted from leaf tissue (20 mg) by Dovetail and a standard RNA library will be prepared with rRNA depletion. Paired-end reads (2x150bp) will be sequenced on an Illumina platform, and raw data from ca. 20M read pairs per sample (one sample in this project) will be used for genome annotation.
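The k-mer approach mentioned in the first step is often summarized as: genome size is approximately the total number of k-mers divided by the depth at the main peak of the k-mer histogram. The sketch below illustrates that idea on toy data only. It is an assumption that an estimator of this kind will be applied; the text does not name the software, and real data would require a dedicated k-mer counter and careful peak selection, especially for a heterozygous or polyploid genome. All function names are hypothetical.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Total k-mers at or above min_depth, divided by the main peak depth.

    K-mers below min_depth are treated as sequencing errors and ignored.
    The main peak is taken as the depth with the most distinct k-mers.
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda depth: solid[depth])
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: an error peak at depth 1 and a main peak at depth 20.
    toy = {1: 1000, 19: 50, 20: 100, 21: 50}
    print(estimate_genome_size(toy))
```

The `min_depth` cut-off is a simplification; in practice the boundary between error k-mers and genomic k-mers is chosen from the shape of the histogram.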

3 Bioinformatic procedures

PacBio de novo assembly (= produce haploid draft assembly). We aim at a minimum N50 of 100kb (needed for HiRise pipeline, Putnam et al. 2016).
The haploid draft assembly will be scaffolded (phased) through the HiRise software pipeline using the proximity ligation data (this software is owned by Dovetail). The Omni-C libraries can be used for genome assembly and haplotype phasing. We will also be able to use these data to call SNPs and look at structural variations.
Whole genome annotation will be conducted using RNA-seq data. The following basic services will be included:
1. Repeat masking.
2. Ab initio gene prediction using related species (e.g. Artemisia annua).
3. RNAseq mapping to enhance annotation.
4. Manual curation of 5 genes (not very interesting for us). The work done by Anthony Melton based on seedlings from GxE experiment will be key here.
5. Assignment of functional tags to genes.
Results files will be delivered electronically via secure FTP.
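The minimum N50 of 100 kb targeted for the draft assembly can be checked directly from a list of contig lengths. The sketch below uses the standard definition of N50 (the length of the contig at which the cumulative length, summed from the longest contig down, first reaches half of the assembly). The function names are hypothetical, and this is not part of the HiRise pipeline.

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def meets_minimum_n50(lengths, min_n50=100_000):
    """True if the assembly N50 is at least min_n50 (100 kb, as targeted in the text)."""
    return n50(lengths) >= min_n50


if __name__ == "__main__":
    toy_contigs = [200_000, 150_000, 100_000, 50_000]  # made-up lengths in bp
    print(n50(toy_contigs), meets_minimum_n50(toy_contigs))
```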

4 References

Putnam, Nicholas H, Brendan L O’Connell, Jonathan C Stites, Brandon J Rice, Marco Blanchette, Robert Calef, Christopher J Troll, et al. 2016. “Chromosome-Scale Shotgun Assembly Using an in Vitro Method for Long-Range Linkage.” Genome Research 26 (3): 342–50.

5 Appendix 1

Citations of all R packages used to generate this report.

[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.6. 2020. <URL: https://github.com/rstudio/rmarkdown>.

[2] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.10. 2019. <URL: https://github.com/cboettig/knitcitations>.

[3] J. Bryan. googlesheets4: Access Google Sheets using the Sheets API V4. R package version 0.2.0. 2020. <URL: https://github.com/tidyverse/googlesheets4>.

[4] J. Bryan, C. Citro, and H. Wickham. gargle: Utilities for Working with Google APIs. R package version 0.5.0. 2020. <URL: https://CRAN.R-project.org/package=gargle>.

[5] J. Cheng, B. Karambelkar, and Y. Xie. leaflet: Create Interactive Web Maps with the JavaScript Leaflet Library. R package version 2.0.3. 2019. <URL: http://rstudio.github.io/leaflet/>.

[6] D. Ebbert. chisq.posthoc.test: A Post Hoc Analysis for Pearson’s Chi-Squared Test for Count Data. R package version 0.1.2. 2019. <URL: http://chisq-posthoc-test.ebbert.nrw/>.

[7] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate.” In: Journal of Statistical Software 40.3 (2011), pp. 1-25. <URL: https://www.jstatsoft.org/v40/i03/>.

[8] T. Hothorn, A. Zeileis, R. W. Farebrother, et al. lmtest: Testing Linear Regression Models. R package version 0.9-38. 2020. <URL: https://CRAN.R-project.org/package=lmtest>.

[9] S. Jackman, A. Tahk, A. Zeileis, et al. pscl: Political Science Computational Laboratory. R package version 1.5.5. 2020. <URL: http://github.com/atahk/pscl>.

[10] A. Kassambara. ggpubr: ggplot2 Based Publication Ready Plots. R package version 0.4.0. 2020. <URL: https://rpkgs.datanovia.com/ggpubr/>.

[11] M. C. Koohafkan. kfigr: Integrated Code Chunk Anchoring and Referencing for R Markdown Documents. R package version 1.2. 2015. <URL: https://github.com/mkoohafkan/kfigr>.

[12] R. Lenth. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.5.2-1. 2020. <URL: https://github.com/rvlenth/emmeans>.

[13] R. Lenth. lsmeans: Least-Squares Means. R package version 2.30-0. 2018. <URL: https://CRAN.R-project.org/package=lsmeans>.

[14] R. V. Lenth. “Least-Squares Means: The R Package lsmeans.” In: Journal of Statistical Software 69.1 (2016), pp. 1-33. DOI: 10.18637/jss.v069.i01.

[15] E. Neuwirth. RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. 2014. <URL: https://CRAN.R-project.org/package=RColorBrewer>.

[16] E. Paradis, S. Blomberg, B. Bolker, et al. ape: Analyses of Phylogenetics and Evolution. R package version 5.4-1. 2020. <URL: http://ape-package.ird.fr/>.

[17] E. Paradis and K. Schliep. “ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R.” In: Bioinformatics 35 (2019), pp. 526-528.

[18] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2019. <URL: https://www.R-project.org/>.

[19] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.0.1. 2016. <URL: https://CRAN.R-project.org/package=formattable>.

[20] B. Ripley. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. R package version 7.3-53. 2020. <URL: http://www.stats.ox.ac.uk/pub/MASS4/>.

[21] M. R. Smith. TreeTools: Create, Modify and Analyse Phylogenetic Trees. R package version 1.4.0. 2020. <URL: https://CRAN.R-project.org/package=TreeTools>.

[22] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.7.9.2. 2020. <URL: https://CRAN.R-project.org/package=lubridate>.

[23] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth. ISBN 0-387-95457-0. New York: Springer, 2002. <URL: http://www.stats.ox.ac.uk/pub/MASS4/>.

[24] G. R. Warnes, B. Bolker, L. Bonebakker, et al. gplots: Various R Programming Tools for Plotting Data. R package version 3.1.0. 2020. <URL: https://github.com/talgalili/gplots>.

[25] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.0. 2020. <URL: https://CRAN.R-project.org/package=forcats>.

[26] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. <URL: https://ggplot2.tidyverse.org>.

[27] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. 2019. <URL: https://CRAN.R-project.org/package=stringr>.

[28] H. Wickham and J. Bryan. usethis: Automate Package and Project Setup. R package version 2.0.0. 2020. <URL: https://CRAN.R-project.org/package=usethis>.

[29] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.3.3. 2020. <URL: https://CRAN.R-project.org/package=ggplot2>.

[30] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.0.2. 2020. <URL: https://CRAN.R-project.org/package=dplyr>.

[31] H. Wickham, J. Hester, and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 2.3.2. 2020. <URL: https://CRAN.R-project.org/package=devtools>.

[32] H. Wickham and D. Seidel. scales: Scale Functions for Visualization. R package version 1.1.1. 2020. <URL: https://CRAN.R-project.org/package=scales>.

[33] C. O. Wilke. ggridges: Ridgeline Plots in ggplot2. R package version 0.5.2. 2020. <URL: https://wilkelab.org/ggridges>.

[34] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. ISBN 978-1138700109. Boca Raton, Florida: Chapman and Hall/CRC, 2016. <URL: https://github.com/rstudio/bookdown>.

[35] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.21. 2020. <URL: https://github.com/rstudio/bookdown>.

[36] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. <URL: https://yihui.org/knitr/>.

[37] Y. Xie. formatR: Format R Code Automatically. R package version 1.7. 2019. <URL: https://github.com/yihui/formatR>.

[38] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R.” In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. <URL: http://www.crcpress.com/product/isbn/9781466561595>.

[39] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30. 2020. <URL: https://yihui.org/knitr/>.

[40] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC, 2018. <URL: https://bookdown.org/yihui/rmarkdown>.

[41] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.16. 2020. <URL: https://github.com/rstudio/DT>.

[42] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. ISBN 9780367563837. Boca Raton, Florida: Chapman and Hall/CRC, 2020. <URL: https://bookdown.org/yihui/rmarkdown-cookbook>.

[43] G. Yu and T. T. Lam. ggtree: an R package for visualization of tree and annotation data. R package version 2.0.4. 2020. <URL: https://yulab-smu.github.io/treedata-book/>.

[44] G. Yu, T. T. Lam, H. Zhu, et al. “Two methods for mapping and visualizing associated data on phylogeny using ggtree.” In: Molecular Biology and Evolution 35 (2 2018), pp. 3041-3043. DOI: 10.1093/molbev/msy194. <URL: https://doi.org/10.1093/molbev/msy194>.

[45] G. Yu, D. Smith, H. Zhu, et al. “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” In: Methods in Ecology and Evolution 8 (1 2017), pp. 28-36. DOI: 10.1111/2041-210X.12628. <URL: http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract>.

[46] A. Zeileis and G. Grothendieck. “zoo: S3 Infrastructure for Regular and Irregular Time Series.” In: Journal of Statistical Software 14.6 (2005), pp. 1-27. DOI: 10.18637/jss.v014.i06.

[47] A. Zeileis, G. Grothendieck, and J. A. Ryan. zoo: S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations). R package version 1.8-8. 2020. <URL: http://zoo.R-Forge.R-project.org/>.

[48] A. Zeileis and T. Hothorn. “Diagnostic Checking in Regression Relationships.” In: R News 2.3 (2002), pp. 7-10. <URL: https://CRAN.R-project.org/doc/Rnews/>.

[49] A. Zeileis, C. Kleiber, and S. Jackman. “Regression Models for Count Data in R.” In: Journal of Statistical Software 27.8 (2008). <URL: http://www.jstatsoft.org/v27/i08/>.

[50] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.2.1. 2020. <URL: https://CRAN.R-project.org/package=kableExtra>.

6 Appendix 2

Version information about R, the operating system (OS) and the attached or loaded R packages. This appendix was generated using sessionInfo().

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gargle_0.5.0             formattable_0.2.0.1      leaflet_2.0.3           
##  [4] googlesheets4_0.2.0      kableExtra_1.2.1         dplyr_1.0.2             
##  [7] kfigr_1.2                scales_1.1.1             lubridate_1.7.9.2       
## [10] MASS_7.3-53              forcats_0.5.0            TreeTools_1.4.0         
## [13] ggridges_0.5.2           stringr_1.4.0            ape_5.4-1               
## [16] ggtree_2.0.4             ggpubr_0.4.0             ggplot2_3.3.3           
## [19] chisq.posthoc.test_0.1.2 DT_0.16                  lsmeans_2.30-0          
## [22] emmeans_1.5.2-1          lmtest_0.9-38            zoo_1.8-8               
## [25] pscl_1.5.5               RColorBrewer_1.1-2       gplots_3.1.0            
## [28] devtools_2.3.2           usethis_2.0.0            formatR_1.7             
## [31] knitcitations_1.0.10     bookdown_0.21            rmarkdown_2.6           
## [34] knitr_1.30              
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1        backports_1.2.1     fastmatch_1.1-0    
##   [4] plyr_1.8.6          igraph_1.2.6        lazyeval_0.2.2     
##   [7] crosstalk_1.1.0.1   digest_0.6.27       htmltools_0.5.0    
##  [10] fansi_0.4.1         magrittr_2.0.1      memoise_1.1.0      
##  [13] openxlsx_4.2.2      remotes_2.2.0       R.utils_2.10.1     
##  [16] askpass_1.1         prettyunits_1.1.1   colorspace_2.0-0   
##  [19] rvest_0.3.6         haven_2.3.1         rbibutils_1.4      
##  [22] xfun_0.20           callr_3.5.1         crayon_1.3.4       
##  [25] jsonlite_1.7.2      phangorn_2.5.5      glue_1.4.2         
##  [28] gtable_0.3.0        webshot_0.5.2       R.cache_0.14.0     
##  [31] car_3.0-10          pkgbuild_1.2.0      abind_1.4-5        
##  [34] mvtnorm_1.1-1       bibtex_0.4.2.3      rstatix_0.6.0      
##  [37] Rcpp_1.0.5          viridisLite_0.3.0   xtable_1.8-4       
##  [40] tidytree_0.3.3      foreign_0.8-75      bit_4.0.4          
##  [43] htmlwidgets_1.5.3   httr_1.4.2          ellipsis_0.3.1     
##  [46] pkgconfig_2.0.3     R.methodsS3_1.8.1   tidyselect_1.1.0   
##  [49] rlang_0.4.10        munsell_0.5.0       cellranger_1.1.0   
##  [52] tools_3.6.1         cli_2.2.0           generics_0.1.0     
##  [55] broom_0.7.1         evaluate_0.14       yaml_2.2.1         
##  [58] RefManageR_1.2.12   processx_3.4.5      bit64_4.0.5        
##  [61] fs_1.5.0            zip_2.1.1           caTools_1.18.0     
##  [64] purrr_0.3.4         nlme_3.1-149        R.oo_1.24.0        
##  [67] xml2_1.3.2          compiler_3.6.1      rstudioapi_0.13    
##  [70] curl_4.3            testthat_3.0.1      ggsignif_0.6.0     
##  [73] treeio_1.10.0       tibble_3.0.4        stringi_1.5.3      
##  [76] highr_0.8           ps_1.5.0            desc_1.2.0         
##  [79] lattice_0.20-41     Matrix_1.2-18       vctrs_0.3.6        
##  [82] pillar_1.4.7        lifecycle_0.2.0     BiocManager_1.30.10
##  [85] Rdpack_2.1          estimability_1.3    data.table_1.13.6  
##  [88] bitops_1.0-6        gbRd_0.4-11         R6_2.5.0           
##  [91] KernSmooth_2.23-17  rio_0.5.16          codetools_0.2-16   
##  [94] sessioninfo_1.1.1   gtools_3.8.2        assertthat_0.2.1   
##  [97] pkgload_1.1.0       openssl_1.4.3       rprojroot_2.0.2    
## [100] withr_2.3.0         parallel_3.6.1      hms_0.5.3          
## [103] quadprog_1.5-8      grid_3.6.1          tidyr_1.1.2        
## [106] coda_0.19-4         rvcheck_0.1.8       carData_3.0-4      
## [109] googledrive_1.0.1

Genome sequencing, phasing and annotation