Raw data on GitHub

1 Introduction

To conduct a literature review on genomic heterozygosity levels in plants inferred using the approaches implemented in GenomeScope (Vurture et al., 2017) and GenomeScope v.2.0 (Ranallo-Benavidez et al., 2020), an R function (CitedInExtractoR) was developed. This function takes a PubMed ID as input and retrieves the records of the publications citing that study. The retrieved publications were then manually inspected to extract genomic heterozygosity values, taxonomy, ploidy and estimated genome size, and the data were summarized using violin plots. The sections below provide the code used to execute the analyses presented here.

1.1 Retrieving publications citing GenomeScope

The function used to retrieve publications citing the GenomeScope software is presented below; it depends on the rentrez and easyPubMed R packages. To install these packages, run:

#Install packages
install.packages(c("rentrez", "easyPubMed"))
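Before installing, it can be useful to check which of the two packages are already available. The following is a minimal sketch, not part of the original workflow; `missing_packages` is a hypothetical helper name, and the snippet only reports what is missing rather than installing anything.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs){
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("rentrez", "easyPubMed")
to_install <- missing_packages(needed)

# Report missing packages; they can then be passed to install.packages()
if(length(to_install) > 0){
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```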

1.1.1 CitedInExtractoR function

This function needs to be sourced prior to conducting the analyses.

#A function to list the publications citing a target article and
# download data on those citing publications
CitedInExtractoR <- function(PubMedID){
  message("Fetch data associated with ", PubMedID)
  #Search PubMed for records linked to PubMedID
  src <- rentrez::entrez_link(dbfrom = "pubmed", id = PubMedID, db = "pubmed")
  
  message("Extract list of publications citing ", PubMedID)
  #Extract PubMed IDs of pubs citing our target pub (= vector of PubMed IDs)
  citedIn <- src$links$pubmed_pubmed_citedin
  message(PubMedID, " is cited in ", length(citedIn), " publications")
  #Return early if nothing cites the target pub
  # (the progress bar and loop below need at least one record)
  if(length(citedIn) == 0){
    return(NULL)
  }
  #Create empty object to save output of loop
  OUT <- NULL
  #Generate table with output data
  message("Download data on publications citing ", PubMedID)
  #Set a progress bar
  pb <- txtProgressBar(min = 0, max = length(citedIn), style = 3)
  for(i in seq_along(citedIn)){
    #Fetch the XML record of the i-th citing pub and keep its first row
    RefData <- rentrez::entrez_fetch(db = "pubmed", id = citedIn[i], rettype = "xml")
    tabPubMed <- easyPubMed::article_to_df(RefData)[1,]
    OUT <- rbind(OUT, tabPubMed)
    # update progress bar
    setTxtProgressBar(pb, i)
  }
  close(pb)
  
  #Add col with TargetPubMed ID
  TargetPubMed <- rep(PubMedID, nrow(OUT))
  
  #Note: the col with a clickable DOI URL is added by the calling code
  #FINAL dataset
  FINAL <- cbind(TargetPubMed, OUT)
  
  FINAL
}
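One edge case is worth noting: a target article with no citing records yields an empty vector of citing IDs. The sketch below uses made-up IDs and no network access to show how `1:length(x)` and `seq_along(x)` differ in that situation; `collect_ids` is a hypothetical stand-in for the download loop, not the function above.

```r
# Hypothetical stand-in for the download loop: builds one row per citing ID
collect_ids <- function(ids){
  out <- NULL
  for(i in seq_along(ids)){
    out <- rbind(out, data.frame(pmid = ids[i], stringsAsFactors = FALSE))
  }
  out
}

no_citations <- character(0)
1:length(no_citations)    # counts down from 1 to 0, so a loop body would run twice
seq_along(no_citations)   # empty, so a loop body is skipped

res_empty <- collect_ids(no_citations)   # NULL: nothing to combine
res_two <- collect_ids(c("111", "222"))  # two rows (made-up IDs)
```

Because `rbind(x, NULL)` returns `x` unchanged, a caller that accumulates results with `rbind` can pass such an empty result through without special handling.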

1.1.2 Applying CitedInExtractoR to retrieve publications

CitedInExtractoR was applied to the PubMed IDs of the two GenomeScope publications to retrieve the publications citing this software (executed on March 29th 2022).

# Load packages
library(rentrez)
library(easyPubMed)

# List PubMed IDs for GenomeScope publications
PubMedID <- c("28369201", "32188846")

# Apply function across PubMedID vector
DatPubMed <- NULL
for(i in seq_along(PubMedID)){
  tmp <- CitedInExtractoR(PubMedID[i])
  DatPubMed <- rbind(DatPubMed, tmp)
}
# Add URLs to directly access DOIs (done once, after the loop, so that
# every rbind() call combines tables with identical columns)
DatPubMed$URLDOI <- paste0("http://doi.org/", DatPubMed$doi)

# Write data out (as csv)
write.csv(DatPubMed, file = "PubCitedIn_GenomeScope_publications_28369201_32188846_29March2022.csv", row.names = FALSE, quote = TRUE)
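Records without a DOI deserve some care when building the URL column, because pasting a prefix onto a missing value produces a string ending in "NA". The following is a minimal sketch on an invented table, assuming missing DOIs appear as NA or as empty strings; `doi_to_url` is a hypothetical helper, not part of the original code.

```r
# Toy table standing in for DatPubMed (values are invented)
toy <- data.frame(pmid = c("1", "2", "3"),
                  doi = c("10.1000/abc", NA, ""),
                  stringsAsFactors = FALSE)

# Hypothetical helper: build a DOI URL, leaving NA where no DOI is recorded
doi_to_url <- function(doi){
  has_doi <- !is.na(doi) & nzchar(doi)
  ifelse(has_doi, paste0("http://doi.org/", doi), NA_character_)
}

toy$URLDOI <- doi_to_url(toy$doi)
toy
```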

1.2 Summarizing extracted data

The code below was used to summarize the genomic data using the R package ggplot2.

####Creating plots using genomic and plant usage data####
##Paige Ellestad
##June 6th, 2022

#load Genomescope review data
data <- read.csv("PubCitedIn_GenomeScope_publications_28369201_32188846_29March2022 - PubCitedIn_GenomeScope_publications_28369201_32188846_29March2022.csv")

#convert genome size and heterozygosity values to numeric
#(non-numeric entries such as "na" become NA, with a warning)
data$Heterozygosity_Perc <- as.numeric(data$Heterozygosity_Perc)
data$GS_bp <- as.numeric(data$GS_bp)

#For rows with ploidy levels as "na", change to "unknown" (for plots)
data$Ploidy[which(data$Ploidy == "na")] <- "unknown"

#Make Ploidy levels and Plant assignation factors
data$Ploidy <- as.factor(data$Ploidy)
data$Plant <- as.factor(data$Plant)

##make two subsets of data for separate heterozygosity and genome size plots
#subset data to exclude rows without heterozygosity values
data_het <- data[!is.na(data$Heterozygosity_Perc),]

#subset data to exclude rows without genome size values
data_gensize <- data[!is.na(data$GS_bp),]
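The cleaning steps above can be illustrated on a small invented table. This is a sketch under stated assumptions: the column names follow the review dataset, the values are made up, and missing entries are assumed to be coded as the string "na".

```r
# Toy table mimicking the manually curated columns (values are invented)
toy <- data.frame(Species = c("sp_A", "sp_B", "sp_C"),
                  Heterozygosity_Perc = c("1.2", "na", "0.45"),
                  Ploidy = c("2", "na", "4"),
                  stringsAsFactors = FALSE)

# "na" cannot be parsed as a number, so it becomes NA (warning silenced here)
toy$Heterozygosity_Perc <- suppressWarnings(as.numeric(toy$Heterozygosity_Perc))

# Recode missing ploidy before converting to a factor
toy$Ploidy[toy$Ploidy == "na"] <- "unknown"
toy$Ploidy <- as.factor(toy$Ploidy)

# Keep only rows with a heterozygosity value
toy_het <- toy[!is.na(toy$Heterozygosity_Perc), ]
toy_het
```

Recoding "na" before calling `as.factor` matters: recoding afterwards would assign a value that is not yet a factor level and produce NA instead of "unknown".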

##use ggplot2 to make violin plots of distributions
library(ggplot2)

#heterozygosity plot
#assign labels to chosen species (this was manually done within the Genomescope review 
#dataset by adding a "1" in a new column "figure" for species to be labelled within the plot)
species_labels <- data_het[which(data_het$figure == 1),]

#Make basic violin plot grouped by ploidy levels
grouped_plot <- ggplot(data_het, aes(x=Ploidy, y=Heterozygosity_Perc)) + 
  geom_violin() 

#Make labelled, colored violin plot with all ploidy levels in one plot
het_plot <- 
  ggplot(data_het, aes(x=Plant, y=Heterozygosity_Perc)) + 
  geom_violin() +
  geom_text(data=species_labels,aes(x=Plant, y = Heterozygosity_Perc,label=Species, 
                                    color = Ploidy),hjust=0, nudge_x = .01, vjust=.35, size= 3)+
  geom_point(aes(color = Ploidy)) +
  xlab("") +
  ylab("Genomic Heterozygosity(%)") +
  theme_classic() +
  theme(legend.title.align = 0.5, axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

#Print to a pdf (explicit print() so the plot is also drawn when the script is sourced)
pdf("het_violin_plot.pdf")
print(het_plot)
dev.off()

#genome size plot
gensize_plot <- 
  ggplot(data_gensize, aes(x=Plant, y=GS_bp)) + 
  geom_violin() +
  geom_point(aes(color = Ploidy)) + 
  xlab("") +
  ylab("Genome Size (bp)") +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

#Print to a pdf (explicit print() so the plot is also drawn when the script is sourced)
pdf("gensize_violin_plot.pdf")
print(gensize_plot)
dev.off()
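The distributions shown in the violin plots can also be checked numerically. The following is a minimal sketch using base R only, on invented values; the column names follow the dataset above, but the numbers are made up and this summary is not part of the original analysis.

```r
# Toy data (invented values) with the same column names as the review dataset
toy <- data.frame(Ploidy = factor(c("2", "2", "4", "unknown")),
                  Heterozygosity_Perc = c(0.5, 1.5, 2.0, 0.8))

# Median heterozygosity and number of records per ploidy level
het_median <- tapply(toy$Heterozygosity_Perc, toy$Ploidy, median)
het_n <- table(toy$Ploidy)

het_summary <- data.frame(Ploidy = names(het_median),
                          median_het = as.numeric(het_median),
                          n = as.integer(het_n))
het_summary
```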

1.3 Output

The graph showing variation in genomic heterozygosity levels by ploidy in plants inferred by GenomeScope is in Figure 1.1, and genome size variation from the same dataset is presented in Figure 1.2.

Figure 1.1: GenomeScope output showing variation in heterozygosity levels in plants.

Figure 1.2: GenomeScope output showing variation in genome size in plants.

2 References

Ranallo-Benavidez, T.R., K.S. Jaron, and M.C. Schatz. 2020. GenomeScope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11: 1432. Available at: https://doi.org/10.1038/s41467-020-14998-3.
Vurture, G.W., F.J. Sedlazeck, M. Nattestad, C.J. Underwood, H. Fang, J. Gurtowski, and M.C. Schatz. 2017. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics 33: 2202–2204. Available at: http://dx.doi.org/10.1093/bioinformatics/btx153.

3 Appendix 1

Citations of all R packages used to generate this report.

[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.14. 2022. https://CRAN.R-project.org/package=rmarkdown.

[2] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.

[3] D. Fantini. easyPubMed: Search and Retrieve Scientific Publication Records from PubMed. R package version 2.13. 2019. https://www.data-pulse.com/dev_site/easypubmed/.

[4] M. C. Koohafkan. kfigr: Integrated Code Chunk Anchoring and Referencing for R Markdown Documents. R package version 1.2.1. 2021. https://github.com/mkoohafkan/kfigr.

[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022. https://www.R-project.org/.

[6] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.

[7] H. Wickham, J. Bryan, and M. Barrett. usethis: Automate Package and Project Setup. R package version 2.1.6. 2022. https://CRAN.R-project.org/package=usethis.

[8] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.3.6. 2022. https://CRAN.R-project.org/package=ggplot2.

[9] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.0.9. 2022. https://CRAN.R-project.org/package=dplyr.

[10] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.4. 2022. https://CRAN.R-project.org/package=devtools.

[11] D. Winter. rentrez: Entrez in R. R package version 1.2.3. 2020. https://CRAN.R-project.org/package=rentrez.

[12] D. J. Winter. “rentrez: an R package for the NCBI eUtils API”. In: The R Journal 9 (2 2017), pp. 520-526.

[13] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. ISBN 978-1138700109. Boca Raton, Florida: Chapman and Hall/CRC, 2016. https://bookdown.org/yihui/bookdown.

[14] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.27. 2022. https://CRAN.R-project.org/package=bookdown.

[15] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[16] Y. Xie. formatR: Format R Code Automatically. R package version 1.12. 2022. https://github.com/yihui/formatR.

[17] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. http://www.crcpress.com/product/isbn/9781466561595.

[18] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.39. 2022. https://yihui.org/knitr/.

[19] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC, 2018. https://bookdown.org/yihui/rmarkdown.

[20] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. ISBN 9780367563837. Boca Raton, Florida: Chapman and Hall/CRC, 2020. https://bookdown.org/yihui/rmarkdown-cookbook.

[21] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.3.4. 2021. https://CRAN.R-project.org/package=kableExtra.