Overview
Allele accumulation curves quantify how rapidly new alleles are discovered as additional individuals are sampled. In systems under negative frequency-dependent selection (NFDS), such as plant self-incompatibility loci, rare alleles are continuously maintained in populations. This leads to a characteristic accumulation pattern where new alleles continue to appear even after substantial sampling effort.
In contrast, populations evolving primarily under genetic drift show rapid early allele discovery followed by saturation, indicating that most diversity is captured quickly.
This pipeline calculates allele accumulation curves at two hierarchical levels:
- Species level – using all ingroup individuals
- Population level – for populations with sufficient sampling (≥5 individuals)
From these curves, several statistics are extracted that provide quantitative indicators of NFDS versus drift. In addition, the pipeline estimates the total number of SRK alleles in the species using three complementary richness estimators applied to the species-level data. This estimated total serves as a baseline for evaluating how far each population is from a balanced NFDS equilibrium.
Pipeline Steps
Data Integration
Two input datasets are combined:
- SRK genotype matrix
- Sampling metadata
The genotype table contains allele copy counts for each individual (the same count matrix produced in Step 11), while the metadata provides population assignments and ingroup/outgroup classification. Allele presence for accumulation purposes is determined by count > 0, so the count values do not affect which alleles are considered discovered.
geno <- read.table(
"SRK_individual_allele_genotypes.tsv",
header = TRUE,
sep = "\t",
check.names = FALSE,
stringsAsFactors = FALSE
)
sample_info <- read.csv(
"sampling_metadata.csv",
stringsAsFactors = FALSE
)
Ingroup Filtering and Individual Matching
Outgroup samples are excluded to focus the analysis on the focal species.
Individuals in the genotype matrix are then matched to metadata entries using the SampleID field, ensuring consistent population assignment.
sample_info <- sample_info[sample_info$Ingroup == 1, ]
geno <- geno[geno$Individual %in% sample_info$SampleID, ]
geno$Population <- sample_info$Pop[match(geno$Individual, sample_info$SampleID)]
This step ensures:
- Accurate population assignments
- Removal of external species
- Consistency between genotype and metadata datasets
Genotype Matrix Construction
Allele count columns begin at column 2 of the genotype table. These columns are extracted to construct the allele matrix used for accumulation analyses.
allele_start_col <- 2
allele_end_col <- ncol(geno) - 1
allele_matrix <- as.matrix(geno[, allele_start_col:allele_end_col])
allele_matrix <- apply(allele_matrix, 2, as.numeric)
allele_matrix[is.na(allele_matrix)] <- 0
rownames(allele_matrix) <- geno$Individual
Key processing steps include:
- Conversion to numeric format
- Replacement of missing values with zero
- Assignment of individual identifiers as row names
This produces an allele count matrix with:
rows = individuals
columns = SRK allele bins
values = allele copy counts (0 = absent, >0 = present with that many gene copies)
For accumulation curve purposes, presence is assessed as count > 0; the magnitude of the count does not affect allele discovery.
Allele Accumulation Algorithm
Allele accumulation curves are calculated using randomized sampling permutations.
For each permutation:
- Individuals are randomly ordered
- The number of unique alleles is counted cumulatively
- The process is repeated across many permutations (default = 1000)
The average accumulation trajectory and variance are then computed.
allele_counts <- colSums(mat)
present_alleles <- allele_counts > 0
true_alleles <- sum(present_alleles)
Allele presence is determined using the condition:
colSums(matrix) > 0
This counts alleles that occur in at least one individual.
Species-Level Analysis
The first analysis considers the entire ingroup dataset as a single population.
species_res <- allele_accumulation(allele_matrix, "Species")
This produces:
- Mean allele accumulation curve
- Standard deviation across permutations
- Summary statistics describing the curve shape
The resulting curve represents total SRK allelic diversity across the species.
Population-Level Analysis
Population-level curves are calculated for populations with sufficient sampling effort.
Only populations with at least five individuals are retained:
pop_counts <- table(geno$Population)
valid_pops <- names(pop_counts[pop_counts >= 5])
Each population is analyzed independently using the same accumulation algorithm.
for (pop in valid_pops) {
pop_geno <- geno[geno$Population == pop, ]
pop_allele_matrix <- as.matrix(pop_geno[, allele_start_col:allele_end_col])
}
This allows direct comparison between:
- species-wide diversity
- local population diversity
Allele Richness Estimation
After computing each accumulation curve, three complementary estimators are applied to the species-level data (and independently to each population) to estimate the true total number of SRK alleles:
Michaelis-Menten (MM) asymptote
The mean accumulation curve is fitted to a rectangular hyperbola using non-linear least squares:
\[
S(n) = \frac{S_{max} \cdot n}{K + n}
\]
where \(S_{max}\) is the estimated asymptote (total allele richness) and \(K\) is the half-saturation constant. This estimator directly reflects the shape of the observed accumulation curve and is preferred when sampling is approaching saturation.
Chao1 non-parametric estimator
A lower-bound richness estimator based on singleton and doubleton allele counts:
\[
\hat{S}_{Chao1} = S_{obs} + \frac{f_1^2}{2 f_2}
\]
where \(f_1\) and \(f_2\) are the number of alleles observed in exactly one and two individuals, respectively. When \(f_2 = 0\), the bias-corrected formula \(S_{obs} + f_1(f_1-1)/2\) is used. Chao1 provides a conservative lower bound; note that under strong NFDS, rare alleles are suppressed which can reduce \(f_1\) and lead to underestimation.
iNEXT asymptotic estimator (optional)
If the iNEXT R package is installed, the Hill numbers framework is used to provide an asymptotic richness estimate with bootstrap confidence intervals. Install with install.packages("iNEXT").
The species-level results from all three estimators are written to SRK_species_richness_estimates.tsv and used downstream by the allele frequency analysis as the empirical optimum allele count.
Sampling Adequacy Estimation
Once the MM model is fitted, the pipeline inverts the Michaelis-Menten equation to estimate how many additional individuals would need to be sampled to reach a given level of allelic diversity:
\[
n^* = \frac{S^* \cdot K}{S_{max} - S^*}
\]
where \(S^*\) is the allele target and \(K\) is the MM half-saturation constant. Additional individuals needed = \(n^* - n_{current}\) (floored at 0 if the target is already met).
Five targets are calculated at each level:
| +1 allele |
Additional individuals to discover the next new allele beyond observed |
| 80% of MM |
Individuals to reach 80% of the estimated asymptote |
| 90% of MM |
Individuals to reach 90% of the estimated asymptote |
| 95% of MM |
Individuals to reach 95% of the estimated asymptote |
| 99% of MM |
Individuals to reach 99% of the estimated asymptote |
The +1 allele estimate gives an intuitive sense of current sampling efficiency (how many more individuals are needed just to find one new allele). The percentage-based targets illustrate the hyperbolic cost of approaching saturation — the last few percent of an asymptote require disproportionately large sample sizes, especially under NFDS where rare alleles are maintained at low frequency.
If MM fitting failed for a given level, all five targets are reported as NA.
Curve Visualization
Accumulation curves are exported to a multi-page PDF file.
pdf(
"SRK_allele_accumulation_curves.pdf",
width = 8,
height = 6
)
Each plot includes:
- Mean accumulation curve
- Confidence envelope (± SD)
- Number of individuals sampled
- Total alleles detected
- Estimated asymptotes as horizontal reference lines (MM = dark green dashed, Chao1 = purple dotted, iNEXT = orange if available)
These plots allow visual evaluation of diversity saturation patterns and comparison of observed diversity against estimated total richness.
Biological Interpretation
Three statistics extracted from accumulation curves provide insight into the evolutionary forces shaping allele diversity.
Initial Slope
Measures how rapidly new alleles appear early in sampling.
Initial_slope > 1 → high rare allele diversity
Initial_slope < 0.5 → few dominant alleles
High values are expected under negative frequency-dependent selection.
Tail Slope
Measures whether new alleles continue appearing late in sampling.
Tail_slope ≈ 0 → curve saturation (drift)
Tail_slope > 0.2 → ongoing rare allele discovery (NFDS)
Persistent discovery of alleles is characteristic of balancing selection.
Late Fraction
Quantifies how much allelic diversity is composed of rare alleles appearing late in sampling.
Late_fraction =
(alleles_total − alleles_half) / alleles_total
Interpretation:
Late_fraction > 0.30 → strong NFDS
Late_fraction < 0.10 → drift-dominated diversity
This metric captures the proportion of diversity contributed by rare alleles.