1 GitHub Repository

The companion repository for this webpage is available at:

https://github.com/svenbuerki/SRK_bioinformatics/

For a concise step-by-step protocol (scripts, inputs, outputs, key parameters), see Bioinformatics_pipeline.md.


2 Introduction

Self-incompatibility (SI) systems represent one of the most important mechanisms preventing inbreeding in flowering plants, with the sporophytic SI system found in the Brassicaceae being controlled by the highly polymorphic S-locus. The S-receptor kinase (SRK) gene, encoding the female determinant of incompatibility specificity, exhibits extraordinary allelic diversity maintained by balancing selection over evolutionary timescales. Understanding SRK haplotype diversity and population genetic structure is critical for assessing the evolutionary maintenance of outcrossing systems and identifying populations at risk of SI breakdown.

Traditional approaches to SRK characterization have been limited by the technical challenges of resolving highly similar allelic sequences and the complex genetic architecture of polyploid species. Long-read Nanopore sequencing offers unprecedented opportunities to reconstruct complete SRK haplotypes and resolve allelic relationships within individuals, but requires specialized bioinformatics approaches to handle the unique characteristics of amplicon data from polymorphic loci.

This comprehensive pipeline integrates Nanopore amplicon assembly, haplotype phasing, and population genetic analysis to provide a complete workflow for SRK diversity assessment. The approach combines multiple CANU assembly replicates optimized for rare haplotype recovery with polyploid-aware variant calling and read-backed phasing to generate high-confidence haplotype reconstructions. Downstream analysis components translate DNA sequences to functional proteins, apply abundance-based filtering to distinguish authentic alleles from technical artifacts, and generate population genetic datasets suitable for demographic inference and conservation assessment.

The pipeline is specifically designed for threatened plant species research, addressing the critical need to assess SI system integrity in small or fragmented populations where genetic diversity loss may compromise reproductive success. By providing both molecular-level haplotype reconstruction and population-level genetic analysis, this workflow enables comprehensive evaluation of self-incompatibility system evolution and maintenance in natural populations facing conservation challenges.

Key innovations include multi-replicate assembly strategies for maximizing allelic recovery, reference-guided sequence orientation and gap-filling procedures, SRK-specific protein validation criteria, and tetraploid-aware zygosity analysis. The modular design allows adaptation to different species and study systems while maintaining rigorous quality control throughout the analysis workflow.


3 Pipeline Overview

The pipeline consists of 20 main steps organized into three phases, progressing from within-library sequence assembly to cross-library integration and final population genetic analyses.

3.1 Phase 1: SRK amplicon sequence assembly

  1. Prepare Canonical Sequences – Preparation of reference SRK sequences used for downstream assembly and validation.
  2. Nanopore Amplicon Assembly and Phasing Pipeline – Multi-CANU assembly of Nanopore amplicons followed by haplotype phasing to reconstruct allelic sequences.
  3. Haplotype Orientation and Consolidation Pipeline – Standardization of sequence orientation and consolidation of phased haplotypes.
  4. FASTA Sequence Filtering and Reference Integration Pipeline – Quality filtering of sequences and integration of validated reference alleles.
  5. Multiple Sequence Alignment – Alignment of nucleotide sequences using MAFFT.
  6. Exon Extraction from Multiple Sequence Alignments Pipeline – Extraction of coding regions based on AUGUSTUS gene annotations.
  7. MSA Gap Backfilling and Terminus Processing Pipeline – Reference-guided backfilling of terminal alignment gaps and sequence terminus correction.
  8. DNA to Amino Acid Translation Pipeline – Frame-specific translation of nucleotide sequences into proteins. Note: This step is optional and may be skipped if protein sequences are not required.

3.2 Phase 2: Functional proteins, S alleles and genotyping

  9. SRK Protein Translation, Alignment, and Abundance Filtering Pipeline – Translation of SRK alleles into proteins, alignment, and filtering based on abundance thresholds to retain functional candidates.
  10. Distance-Based SRK S-Allele Definition – Grouping of functional proteins into allele bins using pairwise amino acid p-distance on the S-domain ectodomain, with sensitivity analysis and amino acid variation heatmaps.
  11. SRK S-Allele Genotyping Pipeline – Assignment of distance-defined S-allele bins to individuals, producing long-format allele tables (with allele copy counts) and count-based genotype matrices for population genetic analysis.
  12. SRK Zygosity Analysis Pipeline for Tetraploid Species – Classification of individuals as homozygous or heterozygous and estimation of population-level zygosity statistics in tetraploid species.

⚠️ In Progress — Step 12b: SRK Class Assignment and Dominance Prediction Pipeline – Phylogenetic classification of SRK proteins into Class I (dominant) or Class II (recessive) using IQ-TREE3, with per-individual dominance prediction for conservation breeding decisions. This analysis is under active development and may not function correctly on all datasets.
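The distance-based binning above can be sketched with a toy example. The p-distance is the proportion of differing residues over ungapped aligned positions; the 5% threshold and the greedy single-linkage grouping below are illustrative placeholders, not the pipeline's actual settings:

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing residues over ungapped, aligned positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)

def bin_alleles(seqs: dict, threshold: float = 0.05) -> list:
    """Greedy single-linkage grouping: a sequence joins the first bin that
    contains a member within `threshold` p-distance (illustrative only)."""
    bins = []
    for name, seq in seqs.items():
        for b in bins:
            if any(p_distance(seq, seqs[m]) <= threshold for m in b):
                b.append(name)
                break
        else:
            bins.append([name])
    return bins

seqs = {
    "hap1": "MKTAYIAK",
    "hap2": "MKTAYIAK",   # identical -> joins hap1's bin
    "hap3": "MQSPWLGR",   # divergent -> founds a new bin
}
print(bin_alleles(seqs))  # → [['hap1', 'hap2'], ['hap3']]
```

In the real pipeline the distance is computed on the aligned S-domain ectodomain only, and the threshold is chosen via the sensitivity analysis mentioned above.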

3.3 Phase 3: Data Analyses

  13. Population Genetics Statistics – Estimation of population-level diversity metrics, including heterozygosity, mean alleles per individual, total allele counts, and effective allele numbers.
  14. Allele Accumulation Curves – Rarefaction-based analysis of SRK allele discovery across individuals to evaluate patterns consistent with negative frequency-dependent selection versus genetic drift. Includes estimation of total species allele richness using Michaelis-Menten asymptote fitting, Chao1, and iNEXT estimators.
  15. Allele Frequency Analysis – Species- and population-level χ² tests of allele frequency distributions to assess deviations from equal-frequency expectations under negative frequency-dependent selection. The estimated species allele richness (from step 14) is used as the NFDS optimum baseline.
  16. TP1 Tipping Point Analysis (Health of the SI System) – Diagnostic scatter plot asking (x) how many different alleles a population has retained and (y) how evenly those alleles are distributed across individuals, with CRITICAL-zone shading and TP1 status flags.
  17. Allele Composition Comparison Across Element Occurrences – UpSet plot and pairwise sharing heatmap quantifying how S-allele sets overlap and partition across Element Occurrences, identifying private alleles and alleles shared by all populations.
  18. Individual Genotypic Fitness Score – Per-individual GFS quantifying the proportion of heterozygous diploid gametes a tetraploid individual can produce, differentiating dosage-imbalanced genotypes (e.g., AABB vs. AAAB) that are invisible to simple zygosity classification.
  19. TP2 Tipping Point Analysis (Reproductive Fitness) – EO-level tipping-point assessment asking (x) what fraction of individuals are reproductive dead-ends (prop_AAAA) and (y) how much reproductive capacity the average individual retains (mean GFS); EOs breaching both thresholds simultaneously are flagged CRITICAL.
  20. Reproductive Effort Support per Element Occurrence – Horizontal proportional bar chart showing the fraction of individuals at each GFS tier per EO, quantifying the proportion of the population capable of contributing allelic diversity to compatible crosses.
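The Individual Genotypic Fitness Score above can be illustrated combinatorially. Assuming diploid gametes form as unordered pairs of the four allele copies (random bivalent segregation, an assumption of this sketch rather than necessarily the pipeline's exact model), the dosage classes separate cleanly:

```python
from itertools import combinations
from fractions import Fraction

def gfs(genotype: str) -> Fraction:
    """Fraction of heterozygous diploid gametes a tetraploid genotype
    produces under random chromosome (bivalent) segregation."""
    gametes = list(combinations(genotype, 2))  # all 6 unordered allele pairs
    het = sum(a != b for a, b in gametes)
    return Fraction(het, len(gametes))

for g in ["AABB", "AAAB", "AAAA", "ABCD"]:
    print(g, gfs(g))
# AABB 2/3   (balanced heterozygote)
# AAAB 1/2   (dosage-imbalanced: lower score despite being "heterozygous")
# AAAA 0     (reproductive dead-end)
# ABCD 1     (fully heterozygous)
```

Note that AABB and AAAB are both classified as heterozygous by simple zygosity analysis, yet their gamete-level scores differ, which is exactly the distinction the GFS captures.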

4 Phase 1: SRK amplicon sequence assembly

4.1 Prepare Canonical Sequences

Warning: This needs to be done ONLY ONCE!

  • SRK paralogs are in SRK_canonical_haplotype_sequences.fasta. These sequences are the product of genome mining and must be curated before our analyses can proceed:
    1. Reverse-complement the sequences (the positions are aligned to the genome, but preliminary annotation showed that the gene lies on the negative strand)
    2. Check that each sequence starts with a start codon (ATG); the automatic annotation missed the start of the gene, so the first one or two nucleotides must be added manually
    3. Annotate one sequence
    4. Check the amino acid (AA) product
  • Folder: /Users/buerkilab/Documents/Amplicon_SRK/Canu_amplicon/Canonical_sequences

4.1.1 Reverse Complement Sequences and Check Start Codon

  • reverse_complement_startcodon_fasta.py
#!/usr/bin/env python3

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# ----------------------------
# User input
# ----------------------------
in_fasta = input("Enter input DNA FASTA file: ").strip()

out_fasta = in_fasta.replace(".fasta", "_revcomp_ATGfixed.fasta")
log_file = in_fasta.replace(".fasta", "_startcodon_fix.log")

# ----------------------------
# Process sequences
# ----------------------------
records_out = []

with open(log_file, "w") as log:
    log.write("Sequence_ID\tAction\tNew_start\n")

    for record in SeqIO.parse(in_fasta, "fasta"):
        seq_rc = record.seq.reverse_complement()
        seq_str = str(seq_rc)

        action = "OK"

        if seq_str.startswith("ATG"):
            pass
        elif seq_str.startswith("TG"):
            seq_str = "A" + seq_str
            action = "Added_A"
        elif seq_str.startswith("G"):
            seq_str = "AT" + seq_str
            action = "Added_AT"
        else:
            action = "No_ATG_found"

        log.write(f"{record.id}\t{action}\t{seq_str[:3]}\n")

        records_out.append(
            SeqRecord(
                Seq(seq_str),
                id=record.id,
                description=record.description
            )
        )

# ----------------------------
# Write output
# ----------------------------
SeqIO.write(records_out, out_fasta, "fasta")

print("\n✅ Reverse complement + start codon correction complete")
print(f"Corrected FASTA: {out_fasta}")
print(f"Log file:        {log_file}")

4.1.2 Gene Annotation

  • Run on the lab Linux computer (/media/Artemisia/Slicksport_PacBio/BEA/Augustus) with SRK_canonical_haplotype_sequences_revcomp_ATGfixed.fasta as the template.
  • Output: augustus_SRK_canonical_haplotype_sequences_revcomp_ATGfixed_arabidopsis.gff
## Model = Arabidopsis
# SRK_canonical_haplotype_sequences.fasta
augustus --species=arabidopsis --strand=both --singlestrand=false --genemodel=partial --protein=on --codingseq=on --introns=on --start=on --stop=on --cds=on --gff3=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 SRK_canonical_haplotype_sequences_revcomp_ATGfixed.fasta --exonnames=on > augustus_SRK_canonical_haplotype_sequences_revcomp_ATGfixed_arabidopsis.gff
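AUGUSTUS writes GFF3, and the CDS coordinates needed for later exon extraction can be pulled out with a short parser. A minimal sketch (the column layout is standard GFF3; the example records below are fabricated for illustration):

```python
import csv
import io

def cds_coords(gff_text: str):
    """Yield (seqid, start, end, strand) for every CDS line in a GFF3 string."""
    for row in csv.reader(io.StringIO(gff_text), delimiter="\t"):
        if len(row) < 8 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        if row[2] == "CDS":
            yield row[0], int(row[3]), int(row[4]), row[6]

example = """##gff-version 3
seq1\tAUGUSTUS\tgene\t1\t3000\t.\t+\t.\tID=g1
seq1\tAUGUSTUS\tCDS\t1\t450\t.\t+\t0\tParent=g1.t1
seq1\tAUGUSTUS\tCDS\t601\t1200\t.\t+\t0\tParent=g1.t1
"""
print(list(cds_coords(example)))
# → [('seq1', 1, 450, '+'), ('seq1', 601, 1200, '+')]
```

For production use, a dedicated parser such as gffutils would be more robust, but the nine-column structure above is all the exon-extraction step needs.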

4.2 Nanopore Amplicon Assembly and Phasing Pipeline

4.2.1 Overview

This pipeline implements a multi-replicate CANU assembly approach optimized for rare haplotype detection in Nanopore amplicon sequencing data. The workflow reconstructs SRK amplicons and phases their haplotypes as summarized below.

4.2.2 Pipeline Steps

4.2.2.1 Amplicon Reconstruction and Haplotype Phasing

In brief: for each individual, Nanopore reads from SRK amplicons are assembled de novo using CANU, followed by iterative polishing with RACON to improve consensus accuracy. Reads are then realigned to the polished assembly, and sequence variants are called using a polyploid-aware variant caller implemented in FreeBayes. Haplotypes are phased using read-backed phasing as implemented in WhatsHap, generating up to four phased SRK haplotypes per individual.

4.2.2.2 Input Parameters

The script requires three user inputs:

  • Library name: Identifier for the sequencing library (e.g., Library001)
  • Starting barcode number: First barcode to process (e.g., 01)
  • Ending barcode number: Last barcode to process (e.g., 10)

4.2.2.3 FASTQ File Management

The pipeline includes robust FASTQ concatenation logic:

# Concatenates individual .fastq.gz files if needed
# Checks for existing concatenated files
# Validates file integrity and timestamps

Key features:

  • Automatic detection of existing concatenated files
  • Validation of file completeness and timestamps
  • Error handling for missing or empty files

4.2.2.4 Multi-Replicate CANU Assembly

The pipeline runs four independent CANU assemblies per sample to maximize rare haplotype recovery:

REPS=4  # Number of CANU replicates

canu \
-p "${library}_${sample}_rep${rep}" \
-d "CANU_rep${rep}" \
genomeSize=5k \
-nanopore "$reads" \
useGrid=false \
correctedErrorRate=0.15 \
minInputCoverage=0 \
stopOnLowCoverage=0 \
corOutCoverage=10000 \
batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50"

Parameters optimized for amplicon assembly:

  • genomeSize=5k: Appropriate for amplicon targets
  • correctedErrorRate=0.15: Accommodates Nanopore error rates
  • minInputCoverage=0 and stopOnLowCoverage=0: Prevents assembly termination with low coverage
  • Custom batOptions: Fine-tuned for small amplicon assembly

4.2.2.5 Contig Consolidation

Combines contigs from multiple CANU replicates with proper naming:

# Renames contigs with replicate prefixes
# Removes duplicate sequences using seqkit

4.2.2.6 Iterative RACON Polishing

Implements three rounds of RACON polishing to improve consensus accuracy:

for round in 1 2 3; do
    input=$([ $round -eq 1 ] && echo "$assembly" || echo "polished_r$((round-1)).fasta")
    minimap2 -ax map-ont "$input" "$reads" > align.sam
    racon "$reads" align.sam "$input" > polished_r${round}.fasta
done

4.2.2.7 Read Alignment and Indexing

Uses minimap2 for optimal Nanopore read alignment:

minimap2 -ax map-ont -t 8 "$polished_final" "$reads" | \
samtools view -bS - | \
samtools sort -o aligned.bam -

4.2.2.8 Polyploid-Aware Variant Calling

Employs FreeBayes with tetraploid settings (-p 4):

freebayes \
-f "$polished_final" \
-p 4 \
aligned.bam > variants.vcf

4.2.2.9 Read-Backed Phasing

Implements WhatsHap polyphase for tetraploid phasing:

whatshap polyphase \
--ploidy 4 \
--reference "$polished_final" \
--ignore-read-groups \
-o phased.vcf \
variants.vcf.gz aligned.bam

4.2.2.10 Haplotype Export

Generates up to four phased haplotypes per individual using bcftools consensus:

for i in 1 2 3 4; do
    bcftools consensus \
    -H $i \
    -f "$polished_final" \
    phased.vcf.gz \
    > "${library}_${sample}_hap${i}.fasta"
done

4.2.3 Output Files

For each sample, the pipeline generates:

  • combined_unique_contigs.fasta: Deduplicated contigs from all CANU replicates
  • polished_r3.fasta: Final polished assembly
  • aligned.bam: Sorted and indexed read alignments
  • variants.vcf.gz: Called variants
  • phased.vcf.gz: Phased variants
  • ${library}_${sample}_hap[1-4].fasta: Individual haplotype sequences
  • ${library}_${sample}_Phased_haplotypes.fasta: Combined haplotype file
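A quick sanity check on these outputs is to count FASTA records per file, since each header line starts with `>`. The sketch below fabricates two tiny demo files to show the idea; in practice you would run the loop over the real ${library}_${sample}_hap*.fasta files:

```shell
# grep -c '^>' counts header lines, i.e. the number of sequences per file.
printf '>hap1_tig1\nACGT\n>hap1_tig2\nGGCC\n' > demo_hap1.fasta
printf '>hap2_tig1\nACGA\n' > demo_hap2.fasta

for f in demo_hap*.fasta; do
    echo "$f: $(grep -c '^>' "$f") sequence(s)"
done
# demo_hap1.fasta: 2 sequence(s)
# demo_hap2.fasta: 1 sequence(s)

rm -f demo_hap1.fasta demo_hap2.fasta
```

Large discrepancies in sequence counts between haplotype files of the same sample are worth investigating before downstream analysis.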

4.2.4 Error Handling

The script includes comprehensive error checking:

  • File existence validation
  • Empty file detection
  • Timestamp comparison for file freshness
  • Process failure detection with set -euo pipefail

4.2.5 Scientific Rationale

This multi-replicate approach addresses the stochastic nature of long-read assembly algorithms, particularly important for detecting rare haplotypes in self-incompatibility systems where allelic diversity is crucial for population viability. The combination of multiple assembly attempts with rigorous polishing and polyploid-aware variant calling maximizes the recovery of authentic haplotype sequences from complex amplicon pools.

4.2.6 Bioinformatics

  • Load the environment first with conda activate nanopore_pipeline.
  • The LibraryID MUST follow the pattern Library001; avoid underscores to support downstream analyses.
  • Script: ./nanopore_assembly_pipeline_barcode_range.sh; enter the LibraryID and barcode range when prompted.
  • CANU is run 4 times on large read subsets to detect low-frequency haplotypes; the outputs are combined and the remaining analyses are performed on the combined file.
#!/bin/bash
set -euo pipefail

# ==========================================
# Nanopore Amplicon Assembly & Phasing Script
# Multi-CANU Rare Haplotype Optimized Version
# WITH FASTQ CONCATENATION
# ==========================================

orig_dir=$(pwd)

read -p "Enter library name (e.g., Library001): " library
read -p "Enter starting barcode number (e.g., 01): " start_barcode
read -p "Enter ending barcode number (e.g., 10): " end_barcode

start_num=$((10#$start_barcode))
end_num=$((10#$end_barcode))

# Number of CANU replicates
REPS=4

for n in $(seq "$start_num" "$end_num"); do

    sample=$(printf "barcode%02d" "$n")

    echo ""
    echo "========================================="
    echo "Processing $library / $sample"
    echo "========================================="

    cd "$orig_dir/$library/$sample"

    reads="${library}_${sample}.fastq.gz"

    # ==========================================
    # FASTQ CONCATENATION LOGIC
    # ==========================================
    
    if [[ ! -f "$reads" ]]; then
        echo "Concatenated FASTQ not found. Creating $reads..."
        
        # Check if there are individual .fastq.gz files to concatenate
        if ls *.fastq.gz 1> /dev/null 2>&1; then
            # Count existing .fastq.gz files (excluding the target file if it exists)
            gz_count=$(ls *.fastq.gz 2>/dev/null | grep -v "^${reads}$" | wc -l || echo "0")
            
            if [[ $gz_count -gt 0 ]]; then
                echo "Found $gz_count .fastq.gz files to concatenate"
                
                # Concatenate all .fastq.gz files except the target file
                cat $(ls *.fastq.gz | grep -v "^${reads}$") > "$reads"
                
                echo "Created $reads from individual files"
            else
                echo "ERROR: No .fastq.gz files found to concatenate in $(pwd)"
                echo "Available files:"
                ls -la
                exit 1
            fi
        else
            echo "ERROR: No .fastq.gz files found in $(pwd)"
            echo "Available files:"
            ls -la
            exit 1
        fi
    else
        echo "Using existing $reads"
        
        # Optional: Check if the existing file is newer than individual files
        if ls *.fastq.gz 1> /dev/null 2>&1; then
            individual_files=$(ls *.fastq.gz | grep -v "^${reads}$" || true)
            if [[ -n "$individual_files" ]]; then
                # Check if any individual file is newer than the concatenated file
                newer_files=$(find . -name "*.fastq.gz" -not -name "$reads" -newer "$reads" 2>/dev/null || true)
                if [[ -n "$newer_files" ]]; then
                    echo "WARNING: Some individual .fastq.gz files are newer than $reads"
                    echo "Consider regenerating the concatenated file"
                fi
            fi
        fi
    fi

    # Final check that the reads file exists and is not empty
    if [[ ! -f "$reads" ]]; then
        echo "ERROR: $reads still not found after concatenation attempt"
        exit 1
    fi
    
    if [[ ! -s "$reads" ]]; then
        echo "ERROR: $reads exists but is empty"
        exit 1
    fi
    
    echo "Using reads file: $reads ($(du -h "$reads" | cut -f1))"

    # ==========================================
    # Run multiple CANU assemblies
    # ==========================================

    for rep in $(seq 1 $REPS); do

        echo ""
        echo "Running CANU replicate $rep"

        canu \
        -p "${library}_${sample}_rep${rep}" \
        -d "CANU_rep${rep}" \
        genomeSize=5k \
        -nanopore "$reads" \
        useGrid=false \
        correctedErrorRate=0.15 \
        minInputCoverage=0 \
        stopOnLowCoverage=0 \
        corOutCoverage=10000 \
        batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50"

    done

    # ==========================================
    # Combine contigs with proper renaming
    # ==========================================

    echo ""
    echo "Combining contigs"

    > combined_contigs.fasta

    rep=1

    for dir in CANU_rep*; do

        fasta=$(ls $dir/*.contigs.fasta)

        awk -v prefix="rep${rep}_" '
        /^>/ {print ">" prefix substr($0,2); next}
        {print}
        ' "$fasta" >> combined_contigs.fasta

        ((rep++))

    done

    echo "Removing duplicate sequences"

    seqkit rmdup -s combined_contigs.fasta \
    > combined_unique_contigs.fasta

    assembly="combined_unique_contigs.fasta"

    # ==========================================
    # RACON polishing
    # ==========================================

    echo ""
    echo "Running RACON polishing"

    for round in 1 2 3; do

        input=$([ $round -eq 1 ] && echo "$assembly" || echo "polished_r$((round-1)).fasta")

        minimap2 -ax map-ont "$input" "$reads" > align.sam

        racon "$reads" align.sam "$input" > polished_r${round}.fasta

    done

    polished_final="polished_r3.fasta"

    # ==========================================
    # Align reads
    # ==========================================

    echo ""
    echo "Aligning reads"

    minimap2 -ax map-ont -t 8 "$polished_final" "$reads" | \
    samtools view -bS - | \
    samtools sort -o aligned.bam -

    samtools index aligned.bam

    # ==========================================
    # Variant Calling
    # ==========================================

    echo ""
    echo "Calling variants"

    freebayes \
    -f "$polished_final" \
    -p 4 \
    aligned.bam > variants.vcf

    bgzip -f variants.vcf

    tabix -p vcf variants.vcf.gz

    # ==========================================
    # WhatsHap Phasing
    # ==========================================

    echo ""
    echo "Phasing variants"

    whatshap polyphase \
    --ploidy 4 \
    --reference "$polished_final" \
    --ignore-read-groups \
    -o phased.vcf \
    variants.vcf.gz aligned.bam

    bgzip -f phased.vcf

    tabix -p vcf phased.vcf.gz

    # ==========================================
    # Export haplotypes
    # ==========================================

    echo ""
    echo "Exporting haplotypes"

    for i in 1 2 3 4; do

        bcftools consensus \
        -H $i \
        -f "$polished_final" \
        phased.vcf.gz \
        > temp.fasta

        sed "s/^>/>${library}_${sample}_hap${i}_/" temp.fasta \
        > "${library}_${sample}_hap${i}.fasta"

    done

    cat ${library}_${sample}_hap*.fasta \
    > ${library}_${sample}_Phased_haplotypes.fasta

    rm temp.fasta

    echo ""
    echo "Finished $library $sample"

    cd "$orig_dir"

done

echo ""
echo "====================================="
echo "ALL SAMPLES COMPLETE"
echo "====================================="

4.3 Haplotype Orientation and Consolidation Pipeline

4.3.1 Overview

This pipeline standardizes the orientation of phased haplotypes across all samples within a sequencing library by aligning them to a reference sequence and reverse-complementing sequences that align in the antisense orientation. The workflow consolidates all oriented haplotypes into a single multi-FASTA file for downstream analysis.

4.3.2 Pipeline Steps

4.3.2.1 Haplotype Orientation Standardization

The pipeline addresses the inherent directionality ambiguity in amplicon sequencing data. Since PCR amplification can produce products in both forward and reverse orientations, and assembly algorithms may reconstruct haplotypes in either direction, standardization against a reference sequence ensures consistent orientation for comparative analyses.

4.3.2.2 Input Parameters

The script requires two user inputs:

  • Library name: Identifier for the sequencing library being processed
  • Reference FASTA: Path to reference sequence file for orientation determination
read -p "Enter Library name: " Library
read -p "Enter reference FASTA: " REF_FASTA
REF_FASTA=$(realpath "$REF_FASTA")

4.3.2.3 Output File Initialization

Creates a consolidated output file for all oriented haplotypes:

FINAL_OUT="all_${Library}_Phased_haplotypes.fasta"
> "$FINAL_OUT"  # Initialize empty file

4.3.2.4 Barcode Directory Processing

Iterates through all barcode subdirectories within the specified library:

for BARCODE_DIR in "$Library"/barcode*/; do
    BARCODE=$(basename "$BARCODE_DIR")
    FULL_FASTA="${BARCODE_DIR}/${Library}_${BARCODE}_Phased_haplotypes.fasta"

Key features:

  • Automatic detection of barcode directories
  • Dynamic file path construction
  • Missing file handling with warning messages

4.3.2.5 File Validation

Implements robust error checking for missing input files:

if [[ ! -f "$FULL_FASTA" ]]; then
    echo "WARNING missing $FULL_FASTA"
    continue
fi

This ensures the pipeline continues processing even if individual samples failed in upstream steps.

4.3.2.6 Reference Alignment

Uses minimap2 for sensitive alignment of haplotypes to the reference sequence:

minimap2 -a "$REF_FASTA" "$FILE" > hap_vs_ref.sam

Alignment parameters:

  • -a: Output in SAM format for downstream processing
  • Optimized for sequence similarity detection rather than read mapping

4.3.2.7 Reverse-Complement Detection

Identifies sequences aligned in antisense orientation using SAM flags:

samtools view -f 16 hap_vs_ref.sam \
| cut -f1 \
| sort -u \
> to_reverse.txt

SAM flag interpretation:

  • Flag 16: Sequence is reverse-complemented in alignment
  • Extracts sequence identifiers requiring orientation correction
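The flag test can be reproduced in a few lines: bit 0x10 (decimal 16) marks reverse-complemented alignments in the SAM specification. Note that `samtools view -f 16` keeps any record with that bit set, including supplementary alignments; the sketch below checks the bit the same way:

```python
REVERSE = 0x10  # SAM FLAG bit: sequence aligned as reverse complement

def is_reverse(flag: int) -> bool:
    """True if the alignment record is reverse-complemented."""
    return bool(flag & REVERSE)

# flag 16:   reverse-complemented primary alignment
# flag 0:    forward primary alignment
# flag 2064: reverse-complemented supplementary alignment (16 + 2048)
print([is_reverse(f) for f in (16, 0, 2064)])  # → [True, False, True]
```

Because supplementary alignments also carry the bit, the `sort -u` step above is what guarantees each haplotype identifier appears only once in to_reverse.txt.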

4.3.2.8 Conditional Sequence Reversal

Implements conditional logic for sequence orientation correction:

if [[ -s to_reverse.txt ]]; then
    # Reverse-complement identified sequences
    seqkit grep -f to_reverse.txt "$FILE" \
    | seqkit seq -r -p \
    > reversed.fasta
    
    # Keep forward-oriented sequences unchanged
    seqkit grep -v -f to_reverse.txt "$FILE" \
    > forward.fasta
    
    # Combine oriented sequences
    cat forward.fasta reversed.fasta > "$ORIENTED"
else
    # No sequences require reversal
    cp "$FILE" "$ORIENTED"
fi

Seqkit parameters:

  • -r: Reverse the sequence
  • -p: Complement the sequence (together with -r, produces the reverse complement)
  • -f: Filter sequences by identifier list
  • -v: Invert selection (exclude listed sequences)

4.3.2.9 File Consolidation

Appends oriented haplotypes to the master output file:

cat "$ORIENTED" >> "$OLDPWD/$FINAL_OUT"

Uses $OLDPWD to maintain correct path reference when changing directories.

4.3.2.10 Cleanup Operations

Removes intermediate files to conserve disk space:

rm -f hap_vs_ref.sam forward.fasta reversed.fasta

4.3.3 Output Files

4.3.3.1 Per-Sample Outputs

  • ${Library}_${BARCODE}_Phased_haplotypes.oriented.fasta: Oriented haplotypes for individual samples

4.3.3.2 Consolidated Output

  • all_${Library}_Phased_haplotypes.fasta: Master file containing all oriented haplotypes from the library

4.3.4 Error Handling and Robustness

The pipeline includes several robustness features:

  1. Missing file tolerance: Continues processing if individual samples are missing
  2. Path resolution: Uses realpath for absolute reference paths
  3. Directory navigation: Proper handling of working directory changes
  4. Conditional processing: Only performs reverse-complement operations when necessary
  5. Process failure detection: Uses set -euo pipefail for immediate error detection

4.3.5 Scientific Rationale

Consistent haplotype orientation is critical for downstream phylogenetic and population genetic analyses. In self-incompatibility research, where allelic relationships and evolutionary patterns are paramount, standardized sequence orientation ensures:

  1. Accurate sequence comparisons: Prevents false divergence estimates due to orientation differences
  2. Proper alignment preparation: Facilitates multiple sequence alignment for phylogenetic reconstruction
  3. Consistent annotation: Enables uniform feature annotation across all haplotypes
  4. Reproducible analyses: Ensures consistent results across different analysis runs

4.3.6 Computational Efficiency

The pipeline optimizes computational resources through:

  • Single alignment per sample: Minimizes computational overhead
  • Conditional processing: Only processes sequences requiring orientation correction
  • Intermediate file cleanup: Prevents disk space accumulation
  • Batch processing: Handles multiple samples in a single execution

This approach is particularly valuable when processing large numbers of samples with variable haplotype counts, as commonly encountered in population-level studies of self-incompatibility systems.

4.3.7 Bioinformatics

  • For each barcode, find all *_Phased_haplotypes.fasta files, orient the haplotype sequences against the canonical SRK reference haplotypes, and produce the final curated file
  • Input:
    • Library folder name (e.g., Library001)
    • Canonical fasta file (incl. path) (e.g., /Users/buerkilab/Documents/Amplicon_SRK/Canu_amplicon/Canonical_sequences/SRK_canonical_haplotype_sequences_revcomp_ATGfixed.fasta)
  • Script: orient_haplotypes_library.sh
#!/bin/bash
set -euo pipefail

# ==========================================
# Orient haplotypes for ALL barcodes
# ==========================================

read -p "Enter Library name: " Library
read -p "Enter reference FASTA: " REF_FASTA

REF_FASTA=$(realpath "$REF_FASTA")

FINAL_OUT="all_${Library}_Phased_haplotypes.fasta"

> "$FINAL_OUT"

echo
echo "Processing $Library"
echo

for BARCODE_DIR in "$Library"/barcode*/; do

    BARCODE=$(basename "$BARCODE_DIR")
    FULL_FASTA="${BARCODE_DIR}/${Library}_${BARCODE}_Phased_haplotypes.fasta"
    FILE=$(basename "$FULL_FASTA")

    if [[ ! -f "$FULL_FASTA" ]]; then
        echo "WARNING missing $FULL_FASTA"
        continue
    fi

    echo "Processing $FULL_FASTA"

    cd "$BARCODE_DIR"

    ORIENTED="${Library}_${BARCODE}_Phased_haplotypes.oriented.fasta"

    # Align haplotypes to the reference
    minimap2 -a "$REF_FASTA" "$FILE" > hap_vs_ref.sam

    # Detect reverse-complemented alignments (SAM flag 16)
    samtools view -f 16 hap_vs_ref.sam \
    | cut -f1 \
    | sort -u \
    > to_reverse.txt

    # Reverse-complement only where needed
    if [[ -s to_reverse.txt ]]; then

        seqkit grep -f to_reverse.txt "$FILE" \
        | seqkit seq -r -p \
        > reversed.fasta

        seqkit grep -v -f to_reverse.txt "$FILE" \
        > forward.fasta

        cat forward.fasta reversed.fasta > "$ORIENTED"

    else
        cp "$FILE" "$ORIENTED"
    fi

    rm -f hap_vs_ref.sam forward.fasta reversed.fasta

    cat "$ORIENTED" >> "$OLDPWD/$FINAL_OUT"

    cd "$OLDPWD"

done

echo
echo "DONE"
echo "Output: $FINAL_OUT"

4.4 FASTA Sequence Filtering and Reference Integration Pipeline

4.4.1 Overview

This pipeline filters haplotype sequences by minimum length criteria and integrates canonical reference sequences to create a comprehensive dataset for downstream phylogenetic and comparative analyses. The workflow is essential for preparing high-quality sequence datasets where length uniformity and reference inclusion are critical for accurate evolutionary inferences.

4.4.2 Pipeline Steps

4.4.2.1 Quality Control Through Length Filtering

Length filtering serves multiple scientific purposes in haplotype analysis:

  • Assembly quality control: Removes truncated or incomplete assemblies
  • Alignment preparation: Ensures sequences are suitable for multiple sequence alignment
  • Phylogenetic reliability: Excludes sequences with insufficient informative sites
  • Comparative analysis: Maintains dataset homogeneity for population genetic analyses
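The filtering criterion itself can be sketched in pure Python, mirroring the seqkit length filter applied later in this section. The minimal parser below assumes well-formed FASTA and is illustrative only; the 3250 bp threshold used in this pipeline is shown as the example cutoff:

```python
def filter_fasta(text: str, min_len: int):
    """Return (header, sequence) pairs whose sequence length >= min_len."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return [(h, s) for h, s in records if len(s) >= min_len]

fasta = ">long\n" + "A" * 3300 + "\n>short\n" + "A" * 1000 + "\n"
kept = filter_fasta(fasta, 3250)
print([h for h, _ in kept])  # → ['>long']
```

For the datasets in this workflow, seqkit is preferred for speed and memory efficiency; this sketch only makes the criterion explicit.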

4.4.2.2 Input Parameters and Validation

The script requires three user-specified parameters:

read -p "Enter merged haplotype FASTA file: " MERGED_FASTA
read -p "Enter canonical reference FASTA file: " REF_FASTA
read -p "Enter minimum sequence length (bp): " MINLEN

Parameter specifications:

  • Merged haplotype FASTA: Output from previous haplotype consolidation step
  • Canonical reference FASTA: Known reference sequences for comparative context
  • Minimum length threshold: Quality control parameter (typically 80-90% of expected amplicon length)

4.4.2.3 File Existence Validation

Implements robust input validation to prevent downstream errors:

for f in "$MERGED_FASTA" "$REF_FASTA"; do
    if [[ ! -f "$f" ]]; then
        echo "ERROR: File not found: $f" >&2
        exit 1
    fi
done

Error handling features:

  • Checks all input files before processing
  • Outputs errors to stderr (>&2)
  • Immediate termination on missing files
  • Prevents partial processing with invalid inputs

4.4.2.4 Dynamic Output Naming

Generates descriptive output filenames that preserve processing parameters:

BASENAME="${MERGED_FASTA%.fasta}"
OUT="${BASENAME}_filtered_min${MINLEN}.fasta"

Naming convention benefits:

  • Preserves original filename stem
  • Documents filtering parameters in filename
  • Enables parameter tracking in complex workflows
  • Prevents accidental file overwrites
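The naming scheme above can be walked through directly in the shell; the filenames here are illustrative placeholders, not pipeline outputs:

```shell
# Hypothetical example of the dynamic output naming:
MERGED_FASTA="all_Library006_Phased_haplotypes.fasta"
MINLEN=3250
BASENAME="${MERGED_FASTA%.fasta}"             # strips the .fasta extension
OUT="${BASENAME}_filtered_min${MINLEN}.fasta" # appends the filter parameter
echo "$OUT"   # all_Library006_Phased_haplotypes_filtered_min3250.fasta
```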

4.4.2.5 Reference Integration

Combines canonical references with experimental haplotypes before filtering:

TMP_COMBINED=$(mktemp)
cat "$REF_FASTA" "$MERGED_FASTA" > "$TMP_COMBINED"

Scientific rationale:

  • Phylogenetic rooting: Provides known sequences for tree rooting
  • Allelic classification: Enables assignment of haplotypes to known allelic classes
  • Evolutionary context: Places novel haplotypes within established evolutionary framework
  • Quality assessment: Allows comparison with validated reference sequences

4.4.2.6 Length-Based Filtering

Applies minimum length threshold using seqkit for efficient processing:

seqkit seq -m "$MINLEN" "$TMP_COMBINED" > "$OUT"

Seqkit parameters:

  • -m "$MINLEN": Minimum sequence length filter (3250 bp in our case)
  • Efficient processing of large FASTA files
  • Preserves sequence headers and formatting
  • Maintains sequence order from input

4.4.2.7 Temporary File Management

Implements proper cleanup of intermediate files:

TMP_COMBINED=$(mktemp)  # Creates secure temporary file
# ... processing ...
rm -f "$TMP_COMBINED"   # Cleanup after processing

Security and efficiency features:

  • Uses mktemp for secure temporary file creation
  • Prevents disk space accumulation
  • Avoids naming conflicts in concurrent executions
  • Maintains system cleanliness

4.4.3 Output Files

4.4.3.1 Primary Output

  • ${BASENAME}_filtered_min${MINLEN}.fasta: Filtered sequences meeting length criteria, including canonical references

Output characteristics:

  • Contains both reference and experimental sequences
  • All sequences meet minimum length requirement
  • Preserves original sequence identifiers
  • Ready for downstream phylogenetic analysis

4.4.4 Quality Control Metrics

The filtering process provides implicit quality metrics:

  1. Sequence retention rate: Proportion of sequences passing length filter
  2. Length distribution: Range of sequence lengths in final dataset
  3. Reference inclusion: Confirmation of canonical sequence integration
  4. Dataset completeness: Total number of sequences for analysis
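The retention rate in point 1 can be computed by counting FASTA headers before and after filtering. This is a minimal sketch using placeholder filenames created on the fly, not the pipeline's actual inputs:

```shell
# Count FASTA records by counting '>' header lines
count_seqs() { grep -c '^>' "$1"; }

# Toy input (2 sequences) and toy filtered output (1 sequence)
printf '>seq1\nACGT\n>seq2\nAC\n' > example_in.fasta
printf '>seq1\nACGT\n' > example_out.fasta

BEFORE=$(count_seqs example_in.fasta)
AFTER=$(count_seqs example_out.fasta)
echo "Retained $AFTER of $BEFORE sequences"   # Retained 1 of 2 sequences
rm -f example_in.fasta example_out.fasta
```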

4.4.5 Scientific Applications

This filtered dataset serves multiple research objectives:

4.4.5.1 Phylogenetic Analysis

  • Tree construction: Uniform sequence lengths improve alignment quality
  • Bootstrap support: Adequate sequence length enhances statistical confidence
  • Evolutionary relationships: Reference sequences provide phylogenetic context

4.4.5.2 Population Genetics

  • Allelic diversity: Quantification of haplotype variation within populations
  • Frequency estimation: Accurate representation of allelic frequencies
  • Selection analysis: Detection of balancing selection signatures

4.4.5.3 Self-Incompatibility Research

  • Allele identification: Classification of novel haplotypes relative to known alleles
  • Functional prediction: Sequence-based inference of incompatibility phenotypes
  • Evolutionary dynamics: Analysis of allelic turnover and maintenance

4.4.6 Parameter Optimization

4.4.6.1 Length Threshold Selection

Optimal minimum length depends on:

  • Expected amplicon size: Typically 80-90% of full-length sequence
  • Assembly quality: Balance between inclusivity and quality
  • Downstream requirements: Alignment and phylogenetic analysis needs
  • Reference sequence lengths: Consistency with canonical sequences

4.4.6.2 Typical Thresholds

  • SRK amplicons: 3000-4000 bp (for ~4500 bp expected length)
  • Partial sequences: 1500-2000 bp (for targeted regions)
  • Quality control: 70-80% of expected full length

4.4.7 Integration with Upstream Workflows

This pipeline integrates seamlessly with the haplotype orientation workflow:

  1. Input compatibility: Accepts consolidated haplotype files
  2. Reference consistency: Uses same reference sequences for orientation and filtering
  3. Parameter continuity: Maintains sample and library identifiers
  4. Output formatting: Produces analysis-ready FASTA files

The filtered output serves as high-quality input for subsequent phylogenetic reconstruction, population genetic analysis, and functional annotation workflows essential for understanding self-incompatibility system evolution and maintenance.

4.4.8 Bioinformatics

  • Script: filter_and_add_references.sh
  • Input:
    • Merged haplotypes for a library (e.g., all_Library006_Phased_haplotypes.fasta)
    • Canonical reference FASTA file (/Users/buerkilab/Documents/Amplicon_SRK/Canu_amplicon/Canonical_sequences/SRK_canonical_haplotype_sequences_revcomp_ATGfixed.fasta)
    • Minimum length: 3250
  • Output: all_Library006_Phased_haplotypes_filtered_min3250.fasta
#!/bin/bash
set -euo pipefail

# ==========================================
# Filter FASTA sequences by minimum length
# and add canonical references
# ==========================================

# ----------------------------
# User input
# ----------------------------
read -p "Enter merged haplotype FASTA file: " MERGED_FASTA
read -p "Enter canonical reference FASTA file: " REF_FASTA
read -p "Enter minimum sequence length (bp): " MINLEN

# Check if input files exist
for f in "$MERGED_FASTA" "$REF_FASTA"; do
    if [[ ! -f "$f" ]]; then
        echo "ERROR: File not found: $f" >&2
        exit 1
    fi
done

# Determine output file name
BASENAME="${MERGED_FASTA%.fasta}"
OUT="${BASENAME}_filtered_min${MINLEN}.fasta"

# ----------------------------
# Step 1: Combine canonical references with merged FASTA
# ----------------------------
TMP_COMBINED=$(mktemp)
cat "$REF_FASTA" "$MERGED_FASTA" > "$TMP_COMBINED"

# ----------------------------
# Step 2: Filter sequences by minimum length using seqkit
# ----------------------------
seqkit seq -m "$MINLEN" "$TMP_COMBINED" > "$OUT"

# Cleanup
rm -f "$TMP_COMBINED"

echo "Filtered FASTA written to: $OUT"

4.5 Multiple Sequence Alignment

  • Align sequences with MAFFT
  • Input: filtered FASTA per library (e.g., all_Library003_Phased_haplotypes_filtered_min3250.fasta)

# Library001
mafft --auto --adjustdirection all_Library001_Phased_haplotypes_filtered_min3250.fasta > all_Library001_Phased_haplotypes_filtered_min3250_aligned.fasta

# Library002
mafft --auto --adjustdirection all_Library002_Phased_haplotypes_filtered_min3250.fasta > all_Library002_Phased_haplotypes_filtered_min3250_aligned.fasta

# Library003
mafft --auto --adjustdirection all_Library003_Phased_haplotypes_filtered_min3250.fasta > all_Library003_Phased_haplotypes_filtered_min3250_aligned.fasta

# Library004
mafft --auto --adjustdirection all_Library004_Phased_haplotypes_filtered_min3250.fasta > all_Library004_Phased_haplotypes_filtered_min3250_aligned.fasta

# Library005
mafft --auto --adjustdirection all_Library005_Phased_haplotypes_filtered_min3250.fasta > all_Library005_Phased_haplotypes_filtered_min3250_aligned.fasta

# Library006
mafft --auto --adjustdirection all_Library006_Phased_haplotypes_filtered_min3250.fasta > all_Library006_Phased_haplotypes_filtered_min3250_aligned.fasta
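The six per-library commands above can be condensed into a loop. This dry-run sketch prints each MAFFT command instead of executing it (drop the `echo` to run for real):

```shell
# Build and print the per-library MAFFT command (same flags as above)
for LIB in Library001 Library002 Library003 Library004 Library005 Library006; do
    IN="all_${LIB}_Phased_haplotypes_filtered_min3250.fasta"
    echo "mafft --auto --adjustdirection $IN > ${IN%.fasta}_aligned.fasta"
done
```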

4.6 Exon Extraction from Multiple Sequence Alignments Pipeline

4.6.1 Overview

This Python pipeline extracts exonic sequences from multiple sequence alignments (MSAs) using gene structure annotations from AUGUSTUS gene prediction software. The workflow maps genomic coordinates to alignment positions and extracts coding sequences while preserving alignment-derived insertions and deletions, enabling comparative analysis of protein-coding regions across haplotypes.

4.6.2 Pipeline Steps

4.6.2.1 Exon-Focused Comparative Analysis

Exon extraction serves critical functions in self-incompatibility research:

  • Functional sequence analysis: Focuses on protein-coding regions
  • Selection pressure detection: Enables analysis of synonymous vs. non-synonymous changes
  • Structural annotation: Identifies conserved and variable coding domains
  • Phylogenetic inference: Improves signal-to-noise ratio for evolutionary analysis

4.6.2.2 Input Parameters and Validation

The script requires five user-specified parameters:

msa_file = input("Enter MSA FASTA path (aligned sequences, including canonical reference): ").strip()
canonical_id = input("Enter canonical reference ID (exact header in MSA): ").strip()
augustus_csv = input("Enter AUGUSTUS CSV with exon positions: ").strip()
strand = input("Enter strand of annotation for canonical reference (+/-): ").strip()
min_exon_len = input("Enter minimum exon length (default 0): ").strip()

Parameter specifications:

  • MSA FASTA: Multiple sequence alignment including reference and haplotype sequences
  • Canonical reference ID: Exact sequence identifier for coordinate mapping
  • AUGUSTUS CSV: Gene structure annotation in CSV format from AUGUSTUS gene prediction
  • Strand orientation: Directionality of gene annotation (+ or -)
  • Minimum exon length: Quality filter for exon inclusion (default: 0 bp)

4.6.2.3 Multiple Sequence Alignment Loading

Loads and validates the input MSA using BioPython:

msa = SeqIO.to_dict(SeqIO.parse(msa_file, "fasta"))

if canonical_id not in msa:
    raise ValueError("Canonical reference ID not found in MSA")

canonical_seq = msa[canonical_id].seq

Validation features:

  • Dictionary conversion for efficient sequence access
  • Reference sequence validation
  • Error handling for missing canonical reference
  • Sequence object preservation for downstream processing

4.6.2.4 AUGUSTUS Annotation Processing

Parses gene structure annotations and applies quality filters:

with open(augustus_csv, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    exons = [row for row in reader if row['V3'] == 'exon']

# Filter by minimum exon length and sort by start
exons = [e for e in exons if int(e['V5']) - int(e['V4']) + 1 >= min_exon_len]
exons.sort(key=lambda x: int(x['V4']))

Processing features:

  • Feature filtering: Selects only exon annotations from AUGUSTUS output
  • Length filtering: Removes exons below minimum length threshold
  • Coordinate sorting: Orders exons by genomic start position
  • Quality control: Ensures reliable exon boundary definition
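The filtering and sorting logic above can be illustrated on toy rows; the `V3`/`V4`/`V5` keys follow the AUGUSTUS CSV columns used by the script, and the coordinate values are invented for the example:

```python
# Toy AUGUSTUS rows: one passing exon, one non-exon feature, one short exon
rows = [
    {'V3': 'exon', 'V4': '120', 'V5': '180'},
    {'V3': 'CDS',  'V4': '120', 'V5': '180'},   # non-exon feature, dropped
    {'V3': 'exon', 'V4': '10',  'V5': '15'},    # length 6, below threshold
]
min_exon_len = 20

exons = [r for r in rows if r['V3'] == 'exon']
exons = [e for e in exons if int(e['V5']) - int(e['V4']) + 1 >= min_exon_len]
exons.sort(key=lambda x: int(x['V4']))
print(exons)  # [{'V3': 'exon', 'V4': '120', 'V5': '180'}]
```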

4.6.2.5 Genomic-to-MSA Coordinate Mapping

Creates position mapping between genomic coordinates and alignment positions:

genomic_to_msa = {}
seq_index = 0
for i, nt in enumerate(canonical_seq):
    if nt != '-':
        genomic_to_msa[seq_index + 1] = i  # 1-based genomic positions
        seq_index += 1

Mapping algorithm:

  • Gap handling: Skips alignment gaps in canonical reference
  • Coordinate conversion: Maps 1-based genomic positions to 0-based alignment indices
  • Reference-based: Uses canonical sequence as coordinate reference
  • Insertion preservation: Maintains alignment-derived insertions in other sequences
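A toy aligned reference makes the mapping concrete: 1-based ungapped (genomic) positions map to 0-based alignment columns, skipping gap columns:

```python
# Same loop as the script, on a 6-column toy alignment with two gaps
canonical_seq = "A-CG-T"
genomic_to_msa = {}
seq_index = 0
for i, nt in enumerate(canonical_seq):
    if nt != '-':
        genomic_to_msa[seq_index + 1] = i  # 1-based genomic -> 0-based column
        seq_index += 1

print(genomic_to_msa)  # {1: 0, 2: 2, 3: 3, 4: 5}
```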

4.6.2.6 Exon Sequence Extraction

Extracts exonic sequences from all sequences in the MSA:

for seq_name, seq_record in msa.items():
    seq_aln = seq_record.seq
    exon_seq = Seq("")

    for e in exons:
        start_gen = int(e['V4'])
        end_gen = int(e['V5'])
        try:
            msa_start = genomic_to_msa[start_gen]
            msa_end = genomic_to_msa[end_gen]
        except KeyError:
            continue  # skip if exon position not in mapping

        exon_seq += seq_aln[msa_start:msa_end+1]

Extraction features:

  • Concatenated exons: Joins all exonic sequences into single sequence
  • Error handling: Skips exons with unmappable coordinates
  • Sequence preservation: Maintains gaps and insertions from alignment
  • Universal processing: Applies same extraction to all sequences

4.6.2.7 Strand-Specific Processing

Handles gene orientation through conditional reverse-complementation:

if strand == '-':
    exon_seq = exon_seq.reverse_complement()

Biological significance:

  • Coding strand correction: Ensures sequences represent actual coding strand
  • Functional analysis: Maintains proper reading frame orientation
  • Comparative consistency: Standardizes orientation across all sequences
  • Translation compatibility: Enables downstream protein analysis
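For intuition, this is a pure-Python sketch of the reverse-complement step for unambiguous bases; Biopython's `Seq.reverse_complement()` does the same, plus full IUPAC ambiguity-code handling:

```python
# Complement table for plain A/C/G/T/N (upper and lower case)
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(s):
    """Reverse-complement a DNA string: complement each base, then reverse."""
    return s.translate(COMP)[::-1]

print(revcomp("ATGC"))  # GCAT
```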

4.6.2.8 Output Generation

Creates BioPython SeqRecord objects and writes output FASTA:

exon_records.append(SeqRecord(exon_seq, id=seq_name, description=""))

out_file = msa_file.replace(".fasta", "_exons.fasta")
SeqIO.write(exon_records, out_file, "fasta")

Output features:

  • Identifier preservation: Maintains original sequence names
  • Automatic naming: Generates descriptive output filename
  • Standard format: Produces FASTA-compatible output
  • Metadata retention: Preserves sequence relationships

4.6.3 Output Files

4.6.3.1 Primary Output

  • ${input_basename}_exons.fasta: Concatenated exonic sequences for all input sequences

Output characteristics:

  • Contains only protein-coding sequences
  • Preserves alignment-derived gaps and insertions
  • Maintains sequence identifier consistency
  • Ready for downstream protein analysis

4.6.4 Scientific Applications

4.6.4.1 Molecular Evolution Analysis

Synonymous vs. Non-synonymous Analysis:

  • Calculation of dN/dS ratios for selection detection
  • Identification of sites under positive or purifying selection
  • Comparative analysis of selection pressures across gene regions

Protein Structure Prediction:

  • Translation to amino acid sequences for structural analysis
  • Domain identification and functional annotation
  • Comparative protein modeling across haplotypes

4.6.4.2 Self-Incompatibility Research

Functional Domain Analysis:

  • Hypervariable regions: Identification of polymorphic sites critical for specificity
  • Conserved domains: Detection of structurally important regions
  • Binding sites: Analysis of receptor-ligand interaction domains

Allelic Classification:

  • Protein-based phylogenetic reconstruction
  • Functional group assignment based on coding sequence similarity
  • Prediction of incompatibility phenotypes from sequence data

4.6.5 Quality Control and Error Handling

4.6.5.1 Coordinate Mapping Validation

  • Boundary checking: Ensures exon coordinates fall within sequence bounds
  • Gap handling: Properly manages alignment gaps in coordinate conversion
  • Missing data: Graceful handling of unmappable exon positions

4.6.5.2 Sequence Integrity

  • Length validation: Confirms expected exon lengths after extraction
  • Frame preservation: Maintains reading frame consistency across sequences
  • Quality metrics: Implicit validation through successful coordinate mapping

4.6.6 Integration with Upstream Workflows

This pipeline integrates with previous steps in the haplotype analysis workflow:

  1. MSA dependency: Requires high-quality multiple sequence alignment
  2. Annotation integration: Utilizes AUGUSTUS gene predictions for structural annotation
  3. Reference consistency: Uses same canonical reference across all analysis steps
  4. Output compatibility: Produces sequences suitable for phylogenetic and selection analysis

The extracted exonic sequences provide the foundation for detailed molecular evolutionary analyses essential for understanding the functional significance of sequence variation in self-incompatibility systems, where protein-level changes directly impact reproductive compatibility and population genetic dynamics.

4.6.7 Bioinformatics

  • Activate container: source srk_env/bin/activate
  • Script: extract_exons_with_annotation.py

  • Input:
    • MSA file: all_Library006_Phased_haplotypes_filtered_min3250_aligned.fasta
    • Name of reference sequence associated with the AUGUSTUS annotation: SRK_BEA_hybrid_bp_hap1_p_ctg_fa|amp_1|h1tg000019l|1700582|3407
    • Annotation file in CSV: /Users/buerkilab/Documents/Amplicon_SRK/Canu_amplicon/Canonical_sequences/augustus_SRK_canonical_haplotype_sequences_revcomp_ATGfixed_arabidopsis.csv (generated by annotating the canonical sequences with AUGUSTUS)
    • Annotation strand: +
    • Minimum exon length: 0
  • Output: all_Library006_Phased_haplotypes_filtered_min3250_aligned_exons.fasta

#!/usr/bin/env python3
import csv
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# ----------------------------
# User inputs
# ----------------------------
msa_file = input("Enter MSA FASTA path (aligned sequences, including canonical reference): ").strip()
canonical_id = input("Enter canonical reference ID (exact header in MSA): ").strip()
augustus_csv = input("Enter AUGUSTUS CSV with exon positions: ").strip()
strand = input("Enter strand of annotation for canonical reference (+/-): ").strip()
min_exon_len = input("Enter minimum exon length (default 0): ").strip()

min_exon_len = int(min_exon_len) if min_exon_len else 0

# ----------------------------
# Load MSA
# ----------------------------
msa = SeqIO.to_dict(SeqIO.parse(msa_file, "fasta"))

if canonical_id not in msa:
    raise ValueError("Canonical reference ID not found in MSA")

canonical_seq = msa[canonical_id].seq

# ----------------------------
# Read AUGUSTUS CSV
# ----------------------------
with open(augustus_csv, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    exons = [row for row in reader if row['V3'] == 'exon']

# Filter by minimum exon length and sort by start
exons = [e for e in exons if int(e['V5']) - int(e['V4']) + 1 >= min_exon_len]
exons.sort(key=lambda x: int(x['V4']))

# ----------------------------
# Build genomic -> MSA mapping
# ----------------------------
genomic_to_msa = {}
seq_index = 0
for i, nt in enumerate(canonical_seq):
    if nt != '-':
        genomic_to_msa[seq_index + 1] = i  # 1-based genomic positions
        seq_index += 1

# ----------------------------
# Extract exon sequences
# ----------------------------
exon_records = []

for seq_name, seq_record in msa.items():
    seq_aln = seq_record.seq
    exon_seq = Seq("")

    for e in exons:
        start_gen = int(e['V4'])
        end_gen = int(e['V5'])
        try:
            msa_start = genomic_to_msa[start_gen]
            msa_end = genomic_to_msa[end_gen]
        except KeyError:
            continue  # skip if exon position not in mapping

        exon_seq += seq_aln[msa_start:msa_end+1]

    if strand == '-':
        exon_seq = exon_seq.reverse_complement()

    exon_records.append(SeqRecord(exon_seq, id=seq_name, description=""))

# ----------------------------
# Write exon-only FASTA
# ----------------------------
out_file = msa_file.replace(".fasta", "_exons.fasta")
SeqIO.write(exon_records, out_file, "fasta")
print(f"Exon DNA FASTA written to: {out_file}")

4.7 MSA Gap Backfilling and Terminus Processing Pipeline

4.7.1 Overview

This Python pipeline addresses alignment gaps in multiple sequence alignments (MSAs) by implementing a reference-based backfilling strategy combined with terminus gap replacement. The workflow improves alignment quality for downstream phylogenetic analysis by reducing the impact of assembly artifacts and sequencing limitations while preserving authentic biological variation.

4.7.2 Pipeline Steps

4.7.2.1 Alignment Quality Enhancement Strategy

Gap backfilling serves multiple analytical purposes in haplotype analysis:

  • Assembly artifact correction: Addresses incomplete terminal regions in de novo assemblies
  • Phylogenetic signal preservation: Maintains informative sites that may be lost due to technical limitations
  • Alignment consistency: Standardizes sequence termini for comparative analysis
  • Reference-guided improvement: Leverages high-quality reference sequences to fill technical gaps

4.7.2.2 Input Parameters and Configuration

The script requires two user-specified parameters with one hardcoded constant:

msa_fasta = input("Enter MSA FASTA file: ").strip()
ref_id = input("Enter canonical reference ID (exact FASTA header): ").strip()
WINDOW = 25  # bp to backfill from reference

Parameter specifications:

  • MSA FASTA file: Multiple sequence alignment containing haplotypes and reference
  • Canonical reference ID: Exact sequence identifier for high-quality reference sequence
  • Backfill window: Fixed 25 bp window at sequence termini for reference-based gap filling

4.7.2.3 MSA Loading and Validation

Loads alignment data and performs reference sequence validation:

records = list(SeqIO.parse(msa_fasta, "fasta"))
seqs = {r.id: list(str(r.seq)) for r in records}

if ref_id not in seqs:
    raise ValueError("Reference ID not found in MSA")

ref_seq = seqs[ref_id]
aln_len = len(ref_seq)

Data structure features:

  • Dictionary conversion: Enables efficient sequence access by identifier
  • List conversion: Allows character-level sequence manipulation
  • Length standardization: Ensures uniform alignment length across sequences
  • Reference validation: Confirms canonical sequence availability

4.7.2.4 Sequence Boundary Detection

Implements helper function to identify authentic sequence boundaries:

def real_bounds(seq):
    start = next((i for i, x in enumerate(seq) if x != '-'), None)
    end = next((i for i in range(len(seq)-1, -1, -1) if seq[i] != '-'), None)
    return start, end

Boundary detection algorithm:

  • Start boundary: First non-gap character position
  • End boundary: Last non-gap character position (reverse iteration)
  • Empty sequence handling: Returns None for sequences with no real content
  • Efficient search: Uses generator expressions for optimal performance
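A worked example of `real_bounds` on toy gapped sequences (same function as above):

```python
def real_bounds(seq):
    # First and last non-gap positions; None if the sequence is all gaps
    start = next((i for i, x in enumerate(seq) if x != '-'), None)
    end = next((i for i in range(len(seq)-1, -1, -1) if seq[i] != '-'), None)
    return start, end

print(real_bounds(list("--ACG--")))  # (2, 4)
print(real_bounds(list("-------")))  # (None, None)
```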

4.7.2.5 Terminal Region Backfilling

Applies reference-based gap filling in terminal regions:

# ---- Backfill first 25 bp ----
for i in range(min(WINDOW, aln_len)):
    if new_seq[i] == '-':
        new_seq[i] = ref_seq[i]

# ---- Backfill last 25 bp ----
for i in range(max(0, aln_len - WINDOW), aln_len):
    if new_seq[i] == '-':
        new_seq[i] = ref_seq[i]

Backfilling strategy:

  • Selective replacement: Only fills gaps, preserves existing sequence data
  • Boundary protection: Uses min() and max() functions to prevent index errors
  • Terminal focus: Targets regions most likely affected by assembly truncation
  • Reference-guided: Uses high-quality canonical sequence as gap filler

Scientific rationale:

Terminal regions in amplicon assemblies are frequently incomplete due to:

  • PCR primer effects: Reduced coverage at amplicon termini
  • Assembly algorithms: Difficulty resolving terminal repeats or low-complexity regions
  • Sequencing bias: Reduced quality at read termini affecting assembly
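A toy run of the two backfill loops makes the selective replacement visible; a 3 bp window is used here for brevity (the script uses 25):

```python
WINDOW = 3                      # the script uses 25; 3 keeps the toy short
ref_seq = list("ACGTACGT")      # gap-free canonical reference
new_seq = list("--GTAC--")      # haplotype with truncated termini
aln_len = len(ref_seq)

for i in range(min(WINDOW, aln_len)):               # backfill leading gaps
    if new_seq[i] == '-':
        new_seq[i] = ref_seq[i]
for i in range(max(0, aln_len - WINDOW), aln_len):  # backfill trailing gaps
    if new_seq[i] == '-':
        new_seq[i] = ref_seq[i]

print("".join(new_seq))  # ACGTACGT
```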

4.7.2.6 Extended Gap Processing

Converts remaining gaps to ambiguous nucleotides outside sequence boundaries:

# ---- Replace extra leading gaps with Ns ----
for i in range(0, start):
    if new_seq[i] == '-':
        new_seq[i] = 'N'

# ---- Replace extra trailing gaps with Ns ----
for i in range(end + 1, aln_len):
    if new_seq[i] == '-':
        new_seq[i] = 'N'

Gap replacement logic:

  • Leading gaps: Converts gaps before first real nucleotide to ‘N’
  • Trailing gaps: Converts gaps after last real nucleotide to ‘N’
  • Preservation principle: Maintains internal gaps as authentic indels
  • Phylogenetic compatibility: ‘N’ characters are properly handled by most phylogenetic software
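The leading/trailing replacement can be demonstrated on a toy sequence; note how the internal gap survives as an authentic indel:

```python
new_seq = list("---ACG-TT---")   # leading gaps, one internal gap, trailing gaps
start, end = 3, 8                # bounds of the real sequence (first/last non-gap)

for i in range(0, start):                  # leading gaps -> N
    if new_seq[i] == '-':
        new_seq[i] = 'N'
for i in range(end + 1, len(new_seq)):     # trailing gaps -> N
    if new_seq[i] == '-':
        new_seq[i] = 'N'

print("".join(new_seq))  # NNNACG-TTNNN (internal gap at index 6 preserved)
```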

4.7.2.7 Sequence Reconstruction and Output

Creates new SeqRecord objects and writes processed alignment:

new_records.append(
    SeqIO.SeqRecord(
        Seq("".join(new_seq)),
        id=sid,
        description=""
    )
)

out_fasta = msa_fasta.replace(".fasta", "_backfilled.fasta")
SeqIO.write(new_records, out_fasta, "fasta")

Output features:

  • Identifier preservation: Maintains original sequence names and order
  • Clean descriptions: Removes potentially confusing description fields
  • Automatic naming: Generates descriptive output filename
  • BioPython compatibility: Creates proper SeqRecord objects for downstream analysis

4.7.3 Output Files

4.7.3.1 Primary Output

  • ${input_basename}_backfilled.fasta: Processed MSA with gap-filled terminal regions

Output characteristics:

  • Improved terminal region completeness through reference backfilling
  • Converted external gaps to ‘N’ characters for phylogenetic compatibility
  • Preserved internal gaps representing authentic biological variation
  • Maintained alignment structure and sequence relationships

4.7.4 Quality Control and Processing Logic

4.7.4.1 Gap Classification System

The pipeline implements a three-tier gap classification:

  1. Terminal gaps within backfill window: Filled with reference sequence
  2. External gaps beyond sequence boundaries: Converted to ‘N’ characters
  3. Internal gaps within sequence boundaries: Preserved as authentic indels

4.7.4.2 Sequence Integrity Preservation

Biological variation protection:

  • Internal gaps representing genuine insertions/deletions are preserved
  • Only technical gaps (assembly artifacts) are modified
  • Reference sequence provides high-confidence terminal sequence data

Phylogenetic analysis optimization:

  • ‘N’ characters are treated as missing data by phylogenetic software
  • Reduces alignment ambiguity without introducing false homology
  • Maintains positional homology for comparative analysis

4.7.5 Scientific Applications

4.7.5.1 Phylogenetic Analysis Enhancement

Tree reconstruction improvement:

  • Reduced terminal gap artifacts improve branch length estimation
  • Enhanced signal-to-noise ratio in phylogenetic inference
  • Better resolution of closely related haplotypes

Bootstrap support enhancement:

  • Increased informative sites through reference backfilling
  • Reduced alignment uncertainty in terminal regions
  • Improved statistical confidence in phylogenetic relationships

4.7.5.2 Population Genetic Analysis

Allelic diversity assessment:

  • More accurate sequence length estimates for diversity calculations
  • Improved detection of terminal sequence variation
  • Enhanced resolution of rare haplotype differences

Selection analysis:

  • Better terminal region coverage for comprehensive selection scans
  • Reduced false positive signals from assembly artifacts
  • Improved power for detecting selection in terminal domains

4.7.6 Integration with Downstream Workflows

This processed alignment serves as optimized input for:

  1. Phylogenetic reconstruction: Maximum likelihood and Bayesian inference
  2. Selection analysis: dN/dS calculations and site-specific selection tests
  3. Population genetics: Diversity indices and demographic inference
  4. Functional annotation: Improved terminal region analysis for domain identification

The backfilled alignment provides a balance between technical artifact correction and biological variation preservation, essential for accurate evolutionary analysis of self-incompatibility haplotypes where terminal regions may contain functionally important domains.

4.7.7 Bioinformatics

Backfill the ends of the exon sequences (the first and last 25 bp of the alignment, corresponding to the primer-binding regions) from the canonical reference sequence.

  • Script: backfill_alignment_ends.py
  • Input:
    • MSA fasta file (exon): all_Library006_Phased_haplotypes_filtered_min3250_aligned_exons.fasta
    • Canonical reference ID (exact FASTA header): SRK_BEA_hybrid_bp_hap1_p_ctg_fa|amp_1|h1tg000019l|1700582|3407
  • Output: all_Library006_Phased_haplotypes_filtered_min3250_aligned_exons_backfilled.fasta
#!/usr/bin/env python3
from Bio import SeqIO
from Bio.Seq import Seq

# ----------------------------
# User input
# ----------------------------
msa_fasta = input("Enter MSA FASTA file: ").strip()
ref_id = input("Enter canonical reference ID (exact FASTA header): ").strip()
WINDOW = 25  # bp to backfill from reference

# ----------------------------
# Load MSA
# ----------------------------
records = list(SeqIO.parse(msa_fasta, "fasta"))
seqs = {r.id: list(str(r.seq)) for r in records}

if ref_id not in seqs:
    raise ValueError("Reference ID not found in MSA")

ref_seq = seqs[ref_id]
aln_len = len(ref_seq)

# ----------------------------
# Helper: find real sequence bounds
# ----------------------------
def real_bounds(seq):
    start = next((i for i, x in enumerate(seq) if x != '-'), None)
    end = next((i for i in range(len(seq)-1, -1, -1) if seq[i] != '-'), None)
    return start, end

# ----------------------------
# Process sequences
# ----------------------------
new_records = []

for sid, seq in seqs.items():
    new_seq = seq.copy()
    start, end = real_bounds(seq)

    if start is None:
        continue  # skip empty sequence

    # ---- Backfill first 25 bp ----
    for i in range(min(WINDOW, aln_len)):
        if new_seq[i] == '-':
            new_seq[i] = ref_seq[i]

    # ---- Backfill last 25 bp ----
    for i in range(max(0, aln_len - WINDOW), aln_len):
        if new_seq[i] == '-':
            new_seq[i] = ref_seq[i]

    # ---- Replace extra leading gaps with Ns ----
    for i in range(0, start):
        if new_seq[i] == '-':
            new_seq[i] = 'N'

    # ---- Replace extra trailing gaps with Ns ----
    for i in range(end + 1, aln_len):
        if new_seq[i] == '-':
            new_seq[i] = 'N'

    new_records.append(
        SeqIO.SeqRecord(
            Seq("".join(new_seq)),
            id=sid,
            description=""
        )
    )

# ----------------------------
# Write output
# ----------------------------
out_fasta = msa_fasta.replace(".fasta", "_backfilled.fasta")
SeqIO.write(new_records, out_fasta, "fasta")

print("\n✅ Done!")
print(f"Backfilled alignment written to: {out_fasta}")

4.8 DNA to Amino Acid Translation Pipeline

4.8.1 Overview

This Python pipeline translates aligned DNA sequences to amino acid sequences using a specified reading frame and the standard nuclear genetic code. The workflow is essential for protein-level comparative analysis of coding sequences, enabling detection of selection pressures, functional domain identification, and phylogenetic analysis based on protein evolution rather than nucleotide changes.

4.8.2 Pipeline Steps

4.8.2.1 Protein-Level Evolutionary Analysis

Translation to amino acids serves critical functions in molecular evolution studies:

  • Selection pressure detection: Enables synonymous vs. non-synonymous substitution analysis
  • Functional constraint identification: Reveals structurally and functionally important regions
  • Phylogenetic signal enhancement: Reduces noise from synonymous changes in closely related sequences
  • Protein structure analysis: Facilitates domain identification and structural modeling

4.8.2.2 Input Parameters and Validation

The script requires two user-specified parameters:

dna_fasta = input("Enter aligned DNA FASTA file: ").strip()
frame = int(input("Enter translation frame (1, 2, or 3): ").strip())

if frame not in (1, 2, 3):
    raise ValueError("Frame must be 1, 2, or 3")

Parameter specifications:

  • DNA FASTA file: Aligned coding sequences from previous pipeline steps
  • Translation frame: Reading frame specification (1-based indexing)
  • Frame validation: Ensures valid frame selection to prevent downstream errors

Reading frame significance:

  • Frame 1: Translation starting from first nucleotide (most common for complete CDS)
  • Frame 2: Translation starting from second nucleotide (for sequences with 5’ extensions)
  • Frame 3: Translation starting from third nucleotide (for sequences with longer 5’ extensions)

4.8.2.3 Genetic Code Configuration

Specifies the appropriate genetic code for translation:

GENETIC_CODE = 1  # Standard nuclear code (plants)

Genetic code selection:

  • Standard nuclear code: Appropriate for plant nuclear genes including self-incompatibility loci
  • Universal application: Compatible with most eukaryotic nuclear genes
  • Translation accuracy: Ensures correct amino acid assignment for all codons

4.8.2.4 Output File Generation

Creates descriptive output filename incorporating processing parameters:

out_fasta = dna_fasta.replace(".fasta", f"_frame{frame}_AA.fasta")

Naming convention benefits:

  • Parameter documentation: Records translation frame in filename
  • Format specification: Clearly indicates amino acid content with “_AA” suffix
  • Traceability: Maintains connection to source DNA file
  • Workflow integration: Enables automated downstream processing
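Note that `.replace(".fasta", ...)` assumes the input filename actually ends in `.fasta` (and substitutes every occurrence). An extension-agnostic sketch of the same naming convention, using `os.path.splitext` (an alternative, not the pipeline's own code):

```python
import os

def aa_outname(dna_fasta, frame):
    # Extension-agnostic variant of the script's .replace() naming
    base, ext = os.path.splitext(dna_fasta)
    return f"{base}_frame{frame}_AA{ext or '.fasta'}"

aa_outname("haps_aligned.fasta", 1)   # 'haps_aligned_frame1_AA.fasta'
aa_outname("haps_aligned.fa", 2)      # also works for .fa inputs
```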

4.8.2.5 Sequence Processing and Gap Handling

Processes each DNA sequence with gap removal and frame application:

for record in SeqIO.parse(dna_fasta, "fasta"):
    # Remove alignment gaps ("N" characters are kept as-is)
    ungapped = str(record.seq).replace("-", "")
    
    # Apply frame
    cds = ungapped[frame - 1 :]

Gap handling strategy:

  • Alignment gap removal: Eliminates “-” characters that disrupt reading frame
  • Ambiguous nucleotide preservation: Maintains “N” characters for downstream handling
  • Frame offset application: Adjusts start position based on specified reading frame
  • Sequence integrity: Preserves biological sequence content while removing alignment artifacts

4.8.2.6 Codon Boundary Management

Ensures translation occurs on complete codons only:

# Trim to full codons
trim_len = len(cds) - (len(cds) % 3)
cds = cds[:trim_len]

Codon trimming logic:

  • Modulo calculation: Determines excess nucleotides beyond complete codons
  • 3’ trimming: Removes incomplete terminal codon to prevent translation errors
  • Length preservation: Maintains maximum possible coding sequence length
  • Translation compatibility: Ensures all sequences translate without frame errors
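The trimming arithmetic can be checked on toy strings:

```python
def trim_to_codons(cds):
    # Drop the 0-2 trailing nucleotides left over after the last full codon
    return cds[:len(cds) - (len(cds) % 3)]

trim_to_codons("ATGAAAT")   # 'ATGAAA' (one trailing nt removed)
trim_to_codons("ATGAAA")    # unchanged: already a multiple of 3
trim_to_codons("AT")        # '' (shorter than one codon)
```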

4.8.2.7 Translation Execution

Performs actual DNA to amino acid translation:

if len(cds) == 0:
    aa_seq = ""
else:
    aa_seq = str(Seq(cds).translate(table=GENETIC_CODE, to_stop=False))

Translation parameters:

  • Genetic code table: Uses specified genetic code for accurate translation
  • Stop codon handling: to_stop=False translates through internal stop codons
  • Empty sequence handling: Gracefully manages sequences with no translatable content
  • Complete translation: Processes entire coding sequence regardless of stop codons

Stop codon preservation rationale:

Internal stop codons may represent:

  • Pseudogenization events: Important for evolutionary analysis
  • Sequencing errors: Should be identified rather than truncated
  • Rare allelic variants: May have biological significance in population studies
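The `to_stop=False` behaviour can be illustrated with a tiny three-codon subset of the standard code (illustrative only; the pipeline uses Biopython's full table):

```python
# Tiny three-codon subset of the standard code, for illustration only
CODONS = {"ATG": "M", "TAA": "*", "AAA": "K"}

def translate_through_stops(cds):
    # Like Biopython's to_stop=False: emit '*' for stops instead of truncating
    return "".join(CODONS[cds[i:i+3]] for i in range(0, len(cds), 3))

translate_through_stops("ATGTAAAAA")  # 'M*K' — the internal stop survives for inspection
```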

4.8.2.8 SeqRecord Creation and Metadata

Creates properly formatted output records with descriptive metadata:

aa_records.append(
    SeqRecord(
        Seq(aa_seq),
        id=record.id,
        description=f"frame={frame}"
    )
)

Metadata preservation:

  • Identifier consistency: Maintains original sequence identifiers
  • Processing documentation: Records translation frame in description
  • BioPython compatibility: Creates proper SeqRecord objects
  • Downstream traceability: Enables tracking of processing parameters

4.8.2.9 Output Generation and Reporting

Writes translated sequences and provides processing summary:

SeqIO.write(aa_records, out_fasta, "fasta")

print("\n✅ Translation complete")
print(f"AA FASTA written to: {out_fasta}")

4.8.3 Output Files

4.8.3.1 Primary Output

  • ${input_basename}_frame${frame}_AA.fasta: Amino acid sequences translated from specified reading frame

Output characteristics:

  • Contains protein sequences corresponding to input DNA sequences
  • Maintains sequence identifier consistency with input
  • Documents translation frame in sequence descriptions
  • Ready for protein-level phylogenetic and evolutionary analysis

4.8.4 Quality Control Considerations

4.8.4.1 Translation Validation

Frame selection verification:

  • Correct frame identification crucial for meaningful protein analysis
  • Frame 1 typically appropriate for complete coding sequences
  • Alternative frames may be necessary for partial sequences or sequences with 5’ extensions

Stop codon analysis:

  • Internal stop codons may indicate pseudogenes or sequencing errors
  • Premature termination patterns can reveal evolutionary processes
  • Complete translation enables comprehensive analysis of all sequence variants
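Because stops are preserved, the translated output can be screened for putative pseudogenes downstream. A minimal sketch over a hypothetical id-to-protein mapping (terminal stops are ignored; only internal stops are flagged):

```python
def flag_internal_stops(proteins):
    """proteins: dict mapping sequence ID -> translated amino acid string."""
    # '*' inside the body (not just at the end) marks a putative pseudogene
    return sorted(pid for pid, aa in proteins.items() if "*" in aa.rstrip("*"))

flag_internal_stops({"hap1": "MKLS*", "hap2": "MK*LS"})  # ['hap2']
```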

4.8.4.2 Sequence Length Considerations

Codon trimming effects:

  • 1-2 nucleotide loss at 3’ end for sequences not divisible by 3
  • Minimal impact on overall protein analysis
  • Maintains reading frame consistency across all sequences

4.8.5 Scientific Applications

4.8.5.1 Molecular Evolution Analysis

Selection pressure quantification:

  • dN/dS analysis: Ratio of non-synonymous to synonymous substitution rates
  • Site-specific selection: Identification of amino acid positions under selection
  • Functional constraint mapping: Detection of structurally important regions

Phylogenetic reconstruction:

  • Protein-based trees: Often more reliable for distant evolutionary relationships
  • Reduced homoplasy: Amino acid changes less prone to multiple hits than nucleotides
  • Functional grouping: Clustering based on protein similarity rather than DNA similarity

4.8.5.2 Self-Incompatibility Research

Functional domain analysis:

  • Hypervariable regions: Identification of specificity-determining domains
  • Conserved motifs: Detection of structurally essential amino acid patterns
  • Binding site prediction: Analysis of receptor-ligand interaction domains

Allelic classification:

  • Protein-based clustering: Grouping alleles by amino acid similarity
  • Functional prediction: Inference of incompatibility phenotypes from protein sequences
  • Evolutionary relationships: Protein-level phylogenetic analysis of allelic lineages

4.8.6 Integration with Downstream Workflows

The translated amino acid sequences serve as input for:

  1. Phylogenetic analysis: Maximum likelihood protein evolution models
  2. Selection analysis: PAML, HyPhy, and other selection detection software
  3. Structural analysis: Protein folding prediction and domain identification
  4. Functional annotation: Conserved domain searches and motif identification

This translation step bridges the gap between nucleotide-level haplotype reconstruction and protein-level functional analysis, essential for understanding the molecular basis of self-incompatibility specificity and the evolutionary forces maintaining allelic diversity in natural populations.

4.8.7 Bioinformatics

Note: this step is potentially not needed because it applies to an individual library.

  • translate_filter_align_AA.py
  • Only keeps amino acid sequences without internal stop codons
  • Input:
    • Exon aligned fasta: all_Library006_Phased_haplotypes_filtered_min3250_aligned_exons_backfilled.fasta
    • Position for translation (here, 1st)
  • Output:
    • 4 files, but this one is important: all_Library003_Phased_haplotypes_filtered_min3250_aligned_exons_backfilled_frame1_AA_filtered_aligned.fasta
#!/usr/bin/env python3
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# ----------------------------
# User input
# ----------------------------
dna_fasta = input("Enter aligned DNA FASTA file: ").strip()
frame = int(input("Enter translation frame (1, 2, or 3): ").strip())

if frame not in (1, 2, 3):
    raise ValueError("Frame must be 1, 2, or 3")

# Plant nuclear code
GENETIC_CODE = 1  # Standard nuclear code (plants)

# ----------------------------
# Output file
# ----------------------------
out_fasta = dna_fasta.replace(".fasta", f"_frame{frame}_AA.fasta")

# ----------------------------
# Translate sequences
# ----------------------------
aa_records = []

for record in SeqIO.parse(dna_fasta, "fasta"):
    # Remove alignment gaps ("N" characters are kept as-is)
    ungapped = str(record.seq).replace("-", "")

    # Apply frame
    cds = ungapped[frame - 1 :]

    # Trim to full codons
    trim_len = len(cds) - (len(cds) % 3)
    cds = cds[:trim_len]

    if len(cds) == 0:
        aa_seq = ""
    else:
        aa_seq = str(Seq(cds).translate(table=GENETIC_CODE, to_stop=False))

    aa_records.append(
        SeqRecord(
            Seq(aa_seq),
            id=record.id,
            description=f"frame={frame}"
        )
    )

# ----------------------------
# Write output
# ----------------------------
SeqIO.write(aa_records, out_fasta, "fasta")

print("\n✅ Translation complete")
print(f"AA FASTA written to: {out_fasta}")

5 Phase 2: Functional proteins, S alleles and genotyping

5.1 SRK Protein Translation, Alignment, and Abundance Filtering Pipeline

5.1.1 Overview

This comprehensive Python pipeline translates DNA sequences to amino acids, performs multiple sequence alignment, and applies abundance-based filtering to identify functional SRK (S-receptor kinase) proteins. The workflow includes enhanced tracking and validation features specifically designed for self-incompatibility research, enabling identification of potentially self-compatible individuals through protein-level analysis. The script is designed to be re-run incrementally as new sequencing libraries are added, with built-in safeguards to prevent accidental use of stale output files.

5.1.2 Pipeline Steps

5.1.2.1 SRK Protein Analysis Framework

This pipeline addresses critical aspects of self-incompatibility research:

  • Functional protein identification: Distinguishes authentic SRK proteins from assembly artifacts
  • Abundance-based validation: Uses protein frequency to confirm biological authenticity
  • Ploidy-aware filtering: Removes individuals carrying more distinct proteins than expected given ploidy level
  • Self-compatibility detection: Identifies individuals lacking functional SRK proteins
  • Population-level analysis: Enables comparative analysis across multiple individuals

5.1.2.2 Output File Management and Overwrite Safeguard

Before prompting for any parameters, the script checks whether output files from a previous run already exist in the working directory:

existing = [f for f in OUTPUT_FILES if os.path.exists(f)]
if existing:
    print("\nWARNING: The following output files already exist and will be overwritten:")
    for f in existing:
        print(f"  {f}")
    confirm = input("\nProceed and overwrite? (y/n): ").strip().lower()
    if confirm != 'y':
        print("Aborted. No files were modified.")
        sys.exit(0)

Design rationale:

  • The check runs before parameter prompts so the user can abort without entering any input
  • All seven output files are checked as a group
  • Entering anything other than y exits cleanly with no files modified
  • If no existing files are found, the check is silent and execution continues

The input glob is also hardened to prevent previous output files from accidentally being read back as inputs:

for fasta in sorted(set(glob.glob(pattern)) - set(OUTPUT_FILES)):
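The effect of the set subtraction can be traced with toy filenames (hypothetical; the real pattern and `OUTPUT_FILES` are defined in the script below):

```python
# Hypothetical glob hits: two real inputs plus a leftover from a previous run
matched = {"A_exons_backfilled.fasta", "B_exons_backfilled.fasta",
           "old_run_exons_backfilled.fasta"}
OUTPUT_FILES = {"old_run_exons_backfilled.fasta"}

kept = sorted(matched - OUTPUT_FILES)
# ['A_exons_backfilled.fasta', 'B_exons_backfilled.fasta'] — stale outputs never re-enter
```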

5.1.2.3 Input Parameters and Configuration

After the overwrite check, the script prompts for analysis parameters:

frame = int(input("Translation frame (1,2,3): ")) - 1
min_length = 100
min_count = int(input("Minimum sequences per protein (recommended 2-5): "))
max_alleles = int(input("Maximum functional proteins per individual (e.g. 4 for allotetraploid): "))

use_enhanced_validation = input("Use enhanced SRK domain validation? (y/n, default=n): ").strip().lower()
use_enhanced_validation = use_enhanced_validation == 'y'

Parameter specifications:

  • Translation frame: Reading frame for DNA to protein translation (0-based internally)
  • Minimum length: Quality threshold for protein sequences (100 amino acids, hardcoded)
  • Minimum count: Abundance threshold for protein validation (2-5 recommended)
  • Maximum alleles: Upper bound on distinct functional proteins per individual — set to ploidy level (e.g. 4 for allotetraploid). Individuals exceeding this threshold are excluded as likely assembly artefacts or contaminated samples
  • Enhanced validation: Optional SRK-specific domain validation

5.1.2.4 Enhanced SRK Domain Validation

Implements optional domain-specific validation for SRK proteins:

def validate_srk_features(protein_seq):
    """Enhanced SRK domain validation - OPTIONAL"""
    if not use_enhanced_validation:
        return True  # Skip enhanced validation, use original logic

    # Check for minimum cysteine content (SRK should have ~12)
    if protein_seq.count('C') < 6:
        return False

    # Check for basic kinase motifs
    kinase_motifs = [
        r'[LIVMFYC]G.{0,20}G.G..G',  # ATP-binding GxGxxG loop ('x' = any residue = '.')
        r'[LIVMFYC]K.{6,20}D[LIVMFY]K',  # Catalytic lysine/aspartate
    ]

    motif_count = sum(1 for motif in kinase_motifs if re.search(motif, protein_seq))
    return motif_count >= 1

Validation criteria:

  • Cysteine content: Minimum 6 cysteines (SRK proteins typically contain ~12)
  • ATP-binding motif: Glycine-rich loop characteristic of kinase domains
  • Catalytic motif: Lysine and aspartate residues essential for kinase activity
  • Motif requirement: At least one kinase motif must be present
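Note that in protein-motif notation 'x' denotes any residue, which corresponds to '.' in a Python regex; a literal 'x' can never match an uppercase protein string. A toy check of the converted ATP-binding pattern (hypothetical kinase-like fragment):

```python
import re

# 'x' in motif notation means "any residue"; the regex equivalent is '.'
ATP_LOOP = re.compile(r'[LIVMFYC]G.{0,20}G.G..G')   # GxGxxG glycine-rich loop

toy = "LGKSAGAGLVG"                        # hypothetical kinase-like fragment
assert ATP_LOOP.search(toy)                # converted pattern matches
assert re.search(r'GxGxxG', toy) is None   # a literal 'x' never matches
```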

5.1.2.5 Comprehensive Individual Tracking

Implements detailed tracking of sequence processing for each individual:

individual_status = defaultdict(lambda: {
    'total_sequences': 0,
    'functional_proteins': 0,
    'failed_reasons': [],
    'sequences_details': []
})

Tracking categories:

  • Total sequences: All sequences processed per individual
  • Functional proteins: Sequences passing validation criteria
  • Failed reasons: Detailed failure mode documentation
  • Sequence details: Per-sequence processing information

5.1.2.6 DNA Translation and Quality Control

Processes DNA sequences with comprehensive quality validation:

dna = str(record.seq).replace("-", "").upper()
dna = dna[frame:]
dna = dna[:len(dna)-(len(dna)%3)]

failure_reasons = []

if len(dna) < 3:
    failure_reasons.append("too_short_dna")
    # Record the failure before skipping so the individual report reflects it
    individual_status[individual_id]['failed_reasons'].extend(failure_reasons)
    continue

prot = str(Seq(dna).translate()).rstrip("*")

if len(prot) < min_length:
    failure_reasons.append("too_short_protein")

if "X" in prot:
    failure_reasons.append("ambiguous_aa")

if "*" in prot:
    failure_reasons.append("premature_stop")

Quality control criteria:

  • Minimum DNA length: At least 3 nucleotides for translation
  • Protein length: Minimum 100 amino acids for functional analysis
  • Ambiguous residues: Rejection of sequences with ‘X’ characters
  • Premature stops: Identification of internal stop codons
  • Frame adjustment: Proper reading frame application and codon boundary management
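The protein-level checks can be factored into a small helper for illustration, mirroring the failure labels the script records per sequence:

```python
def qc_reasons(prot, min_length=100):
    # Mirrors the per-sequence failure labels recorded by the script
    reasons = []
    if len(prot) < min_length:
        reasons.append("too_short_protein")
    if "X" in prot:
        reasons.append("ambiguous_aa")
    if "*" in prot:
        reasons.append("premature_stop")
    return reasons

qc_reasons("MK*X")       # all three labels fire on this toy protein
qc_reasons("M" * 120)    # [] — passes every check
```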

5.1.2.7 Multiple Sequence Alignment

Performs amino acid sequence alignment using MAFFT:

subprocess.run(
    f"mafft --auto --amino {raw_fasta} > {aligned_fasta}",
    shell=True,
    check=True
)

Alignment parameters:

  • MAFFT auto mode: Automatically selects optimal alignment strategy
  • Amino acid mode: Optimized for protein sequence alignment
  • Quality assurance: Error checking for alignment completion
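As a design note, an equivalent invocation can avoid `shell=True` by passing the argument list directly and capturing stdout explicitly, so filenames with spaces or shell metacharacters stay safe. A sketch (assuming `mafft` is on the PATH; not the pipeline's own code):

```python
import subprocess

def mafft_argv(raw_fasta):
    # Same flags as the pipeline's shell command
    return ["mafft", "--auto", "--amino", raw_fasta]

def run_mafft(raw_fasta, aligned_fasta):
    # MAFFT writes the alignment to stdout; capture it into the output file
    with open(aligned_fasta, "w") as out:
        subprocess.run(mafft_argv(raw_fasta), stdout=out, check=True)
```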

5.1.2.8 Abundance-Based Filtering

Implements protein frequency filtering to identify authentic variants:

counts = Counter(aligned_seqs)
filtered = {
    seq: count
    for seq, count in counts.items()
    if count >= min_count
}

Filtering rationale:

  • Assembly artifact removal: Low-frequency sequences likely represent errors
  • Biological validation: Authentic alleles should appear in multiple individuals
  • Population genetics: Rare alleles vs. technical artifacts distinction
  • Statistical confidence: Higher frequency increases reliability
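A toy run of the abundance filter (hypothetical protein strings) shows singletons being dropped:

```python
from collections import Counter

seqs = ["MKLS", "MKLS", "MKLS", "MKTV", "MQPS"]   # toy ungapped proteins
counts = Counter(seqs)
min_count = 2

filtered = {s: c for s, c in counts.items() if c >= min_count}
# {'MKLS': 3} — the singletons 'MKTV' and 'MQPS' are treated as likely artifacts
```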

5.1.2.9 Ploidy-Aware Individual Filtering

After abundance filtering, the script counts the number of distinct functional proteins per individual and removes those exceeding the expected ploidy-based maximum:

individual_to_proteins = defaultdict(set)
for seq_id, prot in original_to_protein.items():
    if prot in protein_names:
        individual_id = "_".join(seq_id.split('_')[:2])
        individual_to_proteins[individual_id].add(protein_names[prot])

overloaded_individuals = {
    ind for ind, proteins in individual_to_proteins.items()
    if len(proteins) > max_alleles
}

Design:

  • Uses a set of distinct protein names per individual, so sequence count does not affect the filter
  • Overloaded individuals are excluded from the key file; the protein FASTA is unchanged (proteins seen in valid individuals are preserved)
  • Excluded individuals are written to a dedicated output file (SRK_too_many_alleles.txt) and classified separately from self-compatibility candidates, as their exclusion reflects data-quality concerns rather than biology
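The set-based ploidy filter can be traced on toy identifiers (hypothetical IDs and protein names; the individual is the first two underscore-separated fields of each sequence ID):

```python
from collections import defaultdict

# Toy seq_id -> protein-name assignments (hypothetical)
protein_of = {
    "Lib001_bc01_hap1": "P1", "Lib001_bc01_hap2": "P1",  # same protein twice: counted once
    "Lib001_bc01_hap3": "P2",
    "Lib001_bc02_hap1": "P1", "Lib001_bc02_hap2": "P3",
    "Lib001_bc02_hap3": "P4", "Lib001_bc02_hap4": "P5",
    "Lib001_bc02_hap5": "P6",
}
max_alleles = 4

ind_to_proteins = defaultdict(set)
for seq_id, prot in protein_of.items():
    ind = "_".join(seq_id.split("_")[:2])
    ind_to_proteins[ind].add(prot)

overloaded = {i for i, p in ind_to_proteins.items() if len(p) > max_alleles}
# bc02 carries 5 distinct proteins (> 4) and is flagged; bc01 keeps 2 despite 3 sequences
```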

5.1.2.10 Enhanced Individual Status Reporting

Generates comprehensive reports tracking processing outcomes for every individual:

def generate_individual_report(individual_status, individuals_in_final_key, max_alleles):

    # ... per-individual totals (total_seq, functional, in_final_data) computed above ...
    if total_seq == 0:
        classification = "no_sequences"
    elif functional == 0:
        classification = "no_functional_proteins"
    elif not in_final_data and individual_status[individual].get('too_many_alleles', False):
        classification = "too_many_alleles"
    elif not in_final_data:
        classification = "dropped_abundance_filter"
    elif functional < total_seq * 0.5:
        classification = "low_functional_rate"
    else:
        classification = "normal"

Classification system:

  • normal: Standard processing with adequate functional proteins
  • low_functional_rate: <50% of sequences produced functional proteins
  • dropped_abundance_filter: Functional proteins did not meet the abundance threshold
  • no_functional_proteins: All sequences failed quality control
  • no_sequences: No sequences processed for individual
  • too_many_alleles: Individual has more distinct proteins than allowed by max_alleles; excluded from the key file and downstream genotyping

5.1.2.11 Self-Compatibility Candidate Identification

Individuals classified as no_functional_proteins, low_functional_rate, or dropped_abundance_filter are flagged as potential self-compatibility candidates. Individuals classified as too_many_alleles are written to a separate file and are not included in the self-compatibility candidates list, as their exclusion is due to data quality rather than loss of self-incompatibility function:

sc_candidates = [row['Individual'] for row in report_data
                if row['Classification'] in ['no_functional_proteins',
                                             'low_functional_rate',
                                             'dropped_abundance_filter']]

too_many = [row['Individual'] for row in report_data
            if row['Classification'] == 'too_many_alleles']

Self-compatibility indicators:

  • Absence of functional SRK: Complete loss of S-receptor function
  • Low functional rate: Potential pseudogenization events
  • Abundance filter dropout: Possible rare or degraded alleles

5.1.3 Output Files

5.1.3.1 Primary Outputs

  • SRK_proteins_raw.fasta: All functional proteins before alignment
  • SRK_proteins_aligned.fasta: MAFFT-aligned protein sequences
  • SRK_functional_proteins.fasta: Final abundance-filtered protein dataset
  • SRK_functional_protein_key.tsv: Mapping between original sequences and final proteins (overloaded individuals excluded)

5.1.3.2 Analysis Reports

  • SRK_individual_status_report.tsv: Comprehensive per-individual processing summary
  • SRK_self_compatible_candidates.txt: Individuals with no or low functional SRK proteins
  • SRK_too_many_alleles.txt: Individuals excluded for exceeding the max_alleles threshold

Report columns (SRK_individual_status_report.tsv):

  • Individual identifier and total sequence counts
  • Functional protein count and rate
  • Number of sequences retained in the final key
  • Classification and top failure reasons

5.1.4 Scientific Applications

5.1.4.1 Population Genetics of Self-Incompatibility

Allelic diversity assessment:

  • Quantification of functional SRK protein variants
  • Population-level frequency estimation of S-alleles
  • Detection of rare vs. common incompatibility alleles

Self-compatibility evolution:

  • Identification of individuals with compromised SRK function
  • Analysis of pseudogenization patterns in S-loci
  • Evolutionary transitions from outcrossing to selfing

5.1.4.2 Molecular Evolution Analysis

Selection pressure detection:

  • Protein-level analysis of selection on SRK domains
  • Identification of sites under balancing selection
  • Comparative analysis of selection across populations

Functional domain analysis:

  • Kinase domain conservation and variation
  • Receptor domain hypervariability analysis
  • Structure-function relationship investigation

5.1.5 Quality Control and Validation

5.1.5.1 Multi-Level Filtering System

  1. Translation quality: DNA length, reading frame, stop codons
  2. Protein validation: Length, ambiguous residues, optional domain features
  3. Abundance filtering: Population-level frequency validation
  4. Ploidy filtering: Per-individual distinct protein count bounded by max_alleles
  5. Individual tracking: Comprehensive failure mode analysis with six classification categories

5.1.5.2 Statistical Considerations

Abundance threshold optimisation:

  • Balance between artifact removal and rare allele retention
  • Population size considerations for frequency thresholds
  • Technical replicate validation for threshold setting

Maximum allele threshold:

  • Should match the ploidy level of the target species (e.g. 4 for allotetraploid)
  • Individuals exceeding the threshold are more likely to reflect barcode bleed-through, contamination, or chimeric assemblies than genuine allelic diversity
  • These individuals are tracked separately to allow manual investigation

Classification reliability:

  • Multiple validation criteria for self-compatibility candidates
  • Conservative approach to minimise false positives
  • Detailed documentation for manual review of edge cases

This pipeline provides a robust framework for identifying functional SRK proteins and detecting potential self-compatibility evolution, essential for understanding the maintenance and breakdown of self-incompatibility systems in natural populations.

5.1.6 Bioinformatics

  • Definition: A functional allele is defined by the number of distinct functional protein sequences (isoforms) underpinning all DNA SRK sequences. A frequency threshold (set to 5) is applied to remove bioinformatic and sequencing noise. Individuals with more distinct proteins than the ploidy level (set to 4 for allotetraploid) are excluded as likely data-quality issues.

  • Approach: Merge _exons_backfilled.fasta files, translate sequences and discard those that do not produce functional proteins, and create protein sequences (with a key). Discarded samples are informative in their own right: individuals lacking functional SRK may have lost SI and would then be self-compatible.

  • Script: define_functional_proteins.py

  • Input: *_exons_backfilled.fasta (e.g., all_Library003_Phased_haplotypes_filtered_min3250_aligned_exons_backfilled.fasta)

    • frame = 1
    • filter = 5
    • max_alleles = 4
    • Enhanced = n
  • Output:

    • SRK_proteins_raw.fasta - All translated proteins before filtering
    • SRK_proteins_aligned.fasta - MAFFT-aligned proteins
    • SRK_functional_proteins.fasta - Final functional proteins
    • SRK_functional_protein_key.tsv - Mapping of sequences to protein names (overloaded individuals excluded)
    • Quality control files:
      • SRK_individual_status_report.tsv - Detailed per-individual analysis
      • SRK_self_compatible_candidates.txt - Individuals with no/low functional SRK
      • SRK_too_many_alleles.txt - Individuals excluded for exceeding max_alleles
Example run output:

Unique aligned proteins: 399

After abundance filter: 94


FINAL SUMMARY

Final functional proteins: 94
Protein FASTA: SRK_functional_proteins.fasta
Protein key: SRK_functional_protein_key.tsv

Individual classifications:
  normal: 75
  low_functional_rate: 53
  dropped_abundance_filter: 21
  no_functional_proteins: 5

Individuals dropped by abundance filter (min_count=5): 21

Dropped individuals:
  Library004_barcode82: 20 functional proteins → 0 in final data
  Library004_barcode86: 4 functional proteins → 0 in final data
  Library002_barcode26: 9 functional proteins → 0 in final data
  Library002_barcode27: 11 functional proteins → 0 in final data
  Library002_barcode28: 8 functional proteins → 0 in final data
  Library002_barcode34: 11 functional proteins → 0 in final data
  Library002_barcode36: 7 functional proteins → 0 in final data
  Library002_barcode38: 13 functional proteins → 0 in final data
  Library002_barcode45: 8 functional proteins → 0 in final data
  Library002_barcode55: 16 functional proteins → 0 in final data
  Library003_barcode68: 7 functional proteins → 0 in final data
  Library006_barcode39: 26 functional proteins → 0 in final data
  Library006_barcode45: 2 functional proteins → 0 in final data
  Library006_barcode49: 11 functional proteins → 0 in final data
  Library006_barcode52: 12 functional proteins → 0 in final data
  Library001_barcode03: 4 functional proteins → 0 in final data
  Library001_barcode12: 8 functional proteins → 0 in final data
  Library001_barcode13: 5 functional proteins → 0 in final data
  Library001_barcode15: 12 functional proteins → 0 in final data
  Library001_barcode16: 7 functional proteins → 0 in final data
  Library001_barcode22: 2 functional proteins → 0 in final data

Individuals excluded by allele count filter (max_alleles=4): 3

Excluded individuals:
  Library003_barcode12: 6 distinct proteins (exceeds max 4)
  Library005_barcode31: 5 distinct proteins (exceeds max 4)
  Library007_barcode09: 7 distinct proteins (exceeds max 4)

Potential self-compatible individuals: 79

Detailed report: SRK_individual_status_report.tsv
SC candidates: SRK_self_compatible_candidates.txt
Too-many-alleles exclusions: SRK_too_many_alleles.txt

CODE

#!/usr/bin/env python3
# Enhanced version with abundance filtering tracking

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import Counter, defaultdict
import glob
import os
import subprocess
import sys
import re

# ----------------------------
# Files
# ----------------------------
pattern = "*_exons_backfilled.fasta"
raw_fasta = "SRK_proteins_raw.fasta"
aligned_fasta = "SRK_proteins_aligned.fasta"
final_fasta = "SRK_functional_proteins.fasta"
key_file = "SRK_functional_protein_key.tsv"
report_file = "SRK_individual_status_report.tsv"
sc_file = "SRK_self_compatible_candidates.txt"
too_many_file = "SRK_too_many_alleles.txt"

OUTPUT_FILES = [raw_fasta, aligned_fasta, final_fasta, key_file,
                report_file, sc_file, too_many_file]

# ----------------------------
# Overwrite check
# ----------------------------
existing = [f for f in OUTPUT_FILES if os.path.exists(f)]
if existing:
    print("\nWARNING: The following output files already exist and will be overwritten:")
    for f in existing:
        print(f"  {f}")
    confirm = input("\nProceed and overwrite? (y/n): ").strip().lower()
    if confirm != 'y':
        print("Aborted. No files were modified.")
        sys.exit(0)

# ----------------------------
# Parameters
# ----------------------------
frame = int(input("Translation frame (1,2,3): ")) - 1
min_length = 100
min_count = int(input("Minimum sequences per protein (recommended 2-5): "))
max_alleles = int(input("Maximum functional proteins per individual (e.g. 4 for allotetraploid): "))

# Add option for enhanced validation
use_enhanced_validation = input("Use enhanced SRK domain validation? (y/n, default=n): ").strip().lower()
use_enhanced_validation = use_enhanced_validation == 'y'

# ----------------------------
# DEFINE FUNCTIONS FIRST
# ----------------------------
def validate_srk_features(protein_seq):
    """Enhanced SRK domain validation - OPTIONAL"""
    if not use_enhanced_validation:
        return True  # Skip enhanced validation, use original logic

    # Check for minimum cysteine content (SRK should have ~12)
    if protein_seq.count('C') < 6:
        return False

    # Check for basic kinase motifs
    kinase_motifs = [
        r'[LIVMFYC]G.{0,20}G.G..G',  # ATP-binding GxGxxG loop ('x' = any residue = '.')
        r'[LIVMFYC]K.{6,20}D[LIVMFY]K',  # Catalytic lysine/aspartate
    ]

    motif_count = sum(1 for motif in kinase_motifs if re.search(motif, protein_seq))
    return motif_count >= 1

def generate_individual_report(individual_status, individuals_in_final_key, max_alleles):
    """Generate comprehensive individual status report with abundance filtering tracking"""

    report_data = []

    for individual, status in individual_status.items():
        total_seq = status['total_sequences']
        functional = status['functional_proteins']

        # Check if individual made it to final genotype data
        in_final_data = individual in individuals_in_final_key
        proteins_in_final = individuals_in_final_key.get(individual, 0)

        # Classify individual status
        if total_seq == 0:
            classification = "no_sequences"
        elif functional == 0:
            classification = "no_functional_proteins"
        elif not in_final_data and individual_status[individual].get('too_many_alleles', False):
            classification = "too_many_alleles"
        elif not in_final_data:
            classification = "dropped_abundance_filter"
        elif functional < total_seq * 0.5:
            classification = "low_functional_rate"
        else:
            classification = "normal"

        # Most common failure reasons
        failure_counts = Counter(status['failed_reasons'])
        top_failures = failure_counts.most_common(3)

        report_data.append({
            'Individual': individual,
            'Total_sequences': total_seq,
            'Functional_proteins': functional,
            'Functional_rate': functional/total_seq if total_seq > 0 else 0,
            'Proteins_in_final_data': proteins_in_final,
            'In_final_genotype_data': 'Yes' if in_final_data else 'No',
            'Classification': classification,
            'Top_failure_reasons': ';'.join([f"{reason}({count})" for reason, count in top_failures])
        })

    return report_data

# ----------------------------
# Enhanced tracking dictionaries
# ----------------------------
individual_status = defaultdict(lambda: {
    'total_sequences': 0,
    'functional_proteins': 0,
    'failed_reasons': [],
    'sequences_details': []
})

# ----------------------------
# Step 1: Translate and store mapping
# ----------------------------
print("\nTranslating sequences with enhanced tracking...")

raw_records = []
original_to_protein = {}

# Exclude known output files in case they accidentally match the input pattern
for fasta in sorted(set(glob.glob(pattern)) - set(OUTPUT_FILES)):
    print(f"Processing {fasta}...")

    for record in SeqIO.parse(fasta, "fasta"):
        individual_id = "_".join(record.id.split('_')[:2])

        individual_status[individual_id]['total_sequences'] += 1

        dna = str(record.seq).replace("-", "").upper()
        dna = dna[frame:]
        dna = dna[:len(dna)-(len(dna)%3)]

        failure_reasons = []
        prot = ""

        if len(dna) < 3:
            failure_reasons.append("too_short_dna")
            # Record the failure before skipping so the individual report reflects it
            individual_status[individual_id]['failed_reasons'].extend(failure_reasons)
            continue

        prot = str(Seq(dna).translate()).rstrip("*")

        if len(prot) < min_length:
            failure_reasons.append("too_short_protein")

        if "X" in prot:
            failure_reasons.append("ambiguous_aa")

        if "*" in prot:
            failure_reasons.append("premature_stop")

        if use_enhanced_validation and not validate_srk_features(prot):
            failure_reasons.append("missing_srk_domains")

        seq_detail = {
            'sequence_id': record.id,
            'length_dna': len(dna),
            'length_protein': len(prot),
            'failure_reasons': failure_reasons
        }
        individual_status[individual_id]['sequences_details'].append(seq_detail)

        if not failure_reasons:
            raw_records.append(
                SeqRecord(Seq(prot), id=record.id, description="")
            )
            original_to_protein[record.id] = prot
            individual_status[individual_id]['functional_proteins'] += 1
        else:
            individual_status[individual_id]['failed_reasons'].extend(failure_reasons)

print("Raw functional proteins:", len(raw_records))
SeqIO.write(raw_records, raw_fasta, "fasta")

# ----------------------------
# Step 2: Align
# ----------------------------
print("\nRunning MAFFT alignment...")
with open(aligned_fasta, "w") as out:
    subprocess.run(
        ["mafft", "--auto", "--amino", raw_fasta],
        stdout=out,
        check=True
    )

# ----------------------------
# Step 3: Count aligned proteins
# ----------------------------
aligned_records = list(SeqIO.parse(aligned_fasta, "fasta"))
aligned_seq_dict = {}
aligned_seqs = []

for rec in aligned_records:
    ungapped = str(rec.seq).replace("-", "")
    aligned_seq_dict[rec.id] = ungapped
    aligned_seqs.append(ungapped)

counts = Counter(aligned_seqs)
print("Unique aligned proteins:", len(counts))

# ----------------------------
# Step 4: Apply abundance filter
# ----------------------------
filtered = {
    seq: count
    for seq, count in counts.items()
    if count >= min_count
}
print("After abundance filter:", len(filtered))

# ----------------------------
# Step 5: Assign protein names
# ----------------------------
protein_names = {}
for i, seq in enumerate(filtered):
    protein_names[seq] = f"SRK_protein_{i+1:03d}"

# ----------------------------
# Step 6: Write final FASTA
# ----------------------------
final_records = []
for seq, name in protein_names.items():
    count = filtered[seq]
    final_records.append(
        SeqRecord(
            Seq(seq),
            id=name,
            description=f"count={count}"
        )
    )

SeqIO.write(final_records, final_fasta, "fasta")

# ----------------------------
# Step 7: Build per-individual protein sets and filter by max_alleles
# ----------------------------

# Build per-individual set of distinct protein names that passed all filters
individual_to_proteins = defaultdict(set)
for seq_id, prot in original_to_protein.items():
    if prot in protein_names:
        individual_id = "_".join(seq_id.split('_')[:2])
        individual_to_proteins[individual_id].add(protein_names[prot])

# Identify overloaded individuals
overloaded_individuals = {
    ind for ind, proteins in individual_to_proteins.items()
    if len(proteins) > max_alleles
}

if overloaded_individuals:
    print(f"\nFiltering {len(overloaded_individuals)} individuals with >{max_alleles} distinct proteins...")
    for ind in sorted(overloaded_individuals):
        n = len(individual_to_proteins[ind])
        print(f"  {ind}: {n} proteins (excluded)")
        individual_status[ind]['too_many_alleles'] = True

# Write key excluding overloaded individuals
with open(key_file, "w") as out:
    out.write("Original_sequence_ID\tProtein\tCount\n")
    for seq_id, prot in original_to_protein.items():
        if prot in protein_names:
            individual_id = "_".join(seq_id.split('_')[:2])
            if individual_id not in overloaded_individuals:
                name = protein_names[prot]
                count = filtered[prot]
                out.write(f"{seq_id}\t{name}\t{count}\n")

# Track which individuals made it to final data
individuals_in_final_key = defaultdict(int)

for seq_id, prot in original_to_protein.items():
    if prot in protein_names:
        individual_id = "_".join(seq_id.split('_')[:2])
        if individual_id not in overloaded_individuals:
            individuals_in_final_key[individual_id] += 1

# ----------------------------
# Generate reports
# ----------------------------
report_data = generate_individual_report(individual_status, individuals_in_final_key, max_alleles)

with open(report_file, "w") as f:
    f.write("Individual\tTotal_sequences\tFunctional_proteins\tFunctional_rate\t"
            "Proteins_in_final_data\tIn_final_genotype_data\tClassification\tTop_failure_reasons\n")
    for row in report_data:
        f.write(f"{row['Individual']}\t{row['Total_sequences']}\t{row['Functional_proteins']}\t"
                f"{row['Functional_rate']:.3f}\t{row['Proteins_in_final_data']}\t"
                f"{row['In_final_genotype_data']}\t{row['Classification']}\t{row['Top_failure_reasons']}\n")

sc_candidates = [row['Individual'] for row in report_data
                if row['Classification'] in ['no_functional_proteins', 'low_functional_rate', 'dropped_abundance_filter']]

with open(sc_file, "w") as f:
    f.write("# Individuals with no or low functional SRK proteins\n")
    f.write("# These may be self-compatible\n")
    for individual in sc_candidates:
        f.write(f"{individual}\n")

too_many = [row['Individual'] for row in report_data
            if row['Classification'] == 'too_many_alleles']

with open(too_many_file, "w") as f:
    f.write(f"# Individuals excluded: >{max_alleles} distinct functional proteins\n")
    f.write("# Likely reflects assembly artefacts, contamination, or mixed samples\n")
    for individual in too_many:
        n = len(individual_to_proteins.get(individual, set()))
        f.write(f"{individual}\t{n}_proteins\n")

# ----------------------------
# Final summary
# ----------------------------
print("\nFINAL SUMMARY")
print(f"\nFinal functional proteins: {len(protein_names)}")
print(f"Protein FASTA: {final_fasta}")
print(f"Protein key: {key_file}")

classification_counts = Counter([row['Classification'] for row in report_data])
print("\nIndividual classifications:")
for classification, count in sorted(classification_counts.items(), key=lambda x: -x[1]):
    print(f"  {classification}: {count}")

dropped_abundance = [row for row in report_data if row['Classification'] == 'dropped_abundance_filter']
print(f"\nIndividuals dropped by abundance filter (min_count={min_count}): {len(dropped_abundance)}")
if dropped_abundance:
    print("\nDropped individuals:")
    for row in dropped_abundance:
        print(f"  {row['Individual']}: {row['Functional_proteins']} functional proteins → 0 in final data")

too_many = [row['Individual'] for row in report_data if row['Classification'] == 'too_many_alleles']
print(f"\nIndividuals excluded by allele count filter (max_alleles={max_alleles}): {len(too_many)}")
if too_many:
    print("\nExcluded individuals:")
    for ind in too_many:
        n = len(individual_to_proteins.get(ind, set()))
        print(f"  {ind}: {n} distinct proteins (exceeds max {max_alleles})")

print(f"\nPotential self-compatible individuals: {len(sc_candidates)}")
print(f"\nDetailed report: {report_file}")
print(f"SC candidates: {sc_file}")
print(f"Too-many-alleles exclusions: {too_many_file}")

5.2 Distance-Based SRK S-Allele Definition

5.2.1 Overview

This step addresses a fundamental limitation of the upstream protein pipeline: treating every unique protein sequence as a distinct S-allele overestimates allelic diversity, because Nanopore sequencing noise and assembly artefacts generate near-identical sequence variants that do not represent biologically distinct recognition specificities. Two scripts implement a distance-based proxy for functional allele identity, mirroring the biological criterion used in crossing experiments: two proteins specify the same allele if pollen bearing one is rejected by a pistil expressing the other.

Two complementary outputs are produced: an allele assignment table grouping proteins into bins, and amino acid variation heatmaps that visualise where along the protein sequence the informative differences are located and how frequent each amino acid is at each variable position.

Input: SRK_functional_proteins_aligned.fasta (228 aligned functional proteins, 848 alignment positions)


5.2.2 Motivation

The standard pipeline identifies functional SRK proteins by abundance filtering and translation quality, then treats every unique protein sequence as a separate allele. This approach is likely to overestimate allele diversity for several reasons:

  • Nanopore sequencing error introduces ~1–2% per-base noise, creating near-identical sequence variants that do not represent distinct biological entities.
  • Assembly and phasing artefacts can produce spurious haplotypes from the same template.
  • SRK allele identity is functionally defined by crosses: two proteins specify the same allele if pollen bearing one is rejected by a pistil expressing the other. This functional criterion does not map one-to-one onto sequence uniqueness.

In Brassica and Arabidopsis, SRK alleles from different recognition classes typically differ by 20–50% in the ectodomain (S-domain). Within the same recognition class, variants may differ by only a few percent. A distance-based grouping approach — well established in the allele-typing literature — provides a principled proxy for functional allele identity without requiring crossing experiments.


5.2.3 Approach

5.2.3.1 Sequence distance metric

Pairwise amino acid p-distance is computed between all protein pairs:

p-distance = (number of differing positions) / (number of valid aligned positions)

Gap columns are excluded from both the numerator and denominator. This gives a value between 0 (identical) and 1 (completely divergent).
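This metric can be sketched in a few lines of Python (a minimal illustration, not the repository script; the function name is ours):

```python
def p_distance(a, b):
    """Amino acid p-distance between two aligned sequences.

    Columns where either sequence has a gap are excluded from both
    the numerator and the denominator.
    """
    valid = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not valid:
        return 0.0
    return sum(x != y for x, y in valid) / len(valid)

# One difference over four valid sites (the gap column is ignored):
p_distance("ACD-F", "ACE-F")   # 0.25
```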

5.2.3.2 Region of interest: S-domain (ectodomain)

Distance is computed on the S-domain only (alignment columns 31–430 by default), not on the full protein. The S-domain ectodomain is the receptor-binding region that determines pollination specificity. The kinase domain is much more conserved and would dilute the signal from specificity-determining positions.

Approximate domain structure of Brassicaceae SRK (~858 aa):

Domain                  Alignment positions   Role
Signal peptide          1–30                  Secretion targeting
S-domain (ectodomain)   31–430                Allele specificity — key region
Transmembrane domain    431–455               Membrane anchor
Kinase domain           456–848               Intracellular signalling

These boundaries are approximate and based on published Brassicaceae SRK structures. Adjust DOMAIN_REGION in the script if species-specific annotation is available.
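Note that DOMAIN_REGION uses 1-based, inclusive coordinates, which must be converted to a 0-based, end-exclusive Python slice. A small sketch of that conversion (the helper name is ours, not from the script):

```python
DOMAIN_REGION = (31, 430)  # S-domain columns, 1-based and inclusive

def extract_domain(aligned_seq, region=DOMAIN_REGION):
    """Slice alignment columns given a 1-based inclusive (start, end) pair."""
    start, end = region
    # 1-based inclusive -> 0-based end-exclusive slice
    return aligned_seq[start - 1:end]

# On a 10-column toy alignment, columns 2-4 are "BCD"
extract_domain("ABCDEFGHIJ", (2, 4))   # "BCD"
```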

5.2.3.3 Hierarchical clustering

All pairwise distances are clustered using average-linkage hierarchical clustering (UPGMA). A flat partition is obtained by cutting the dendrogram at a user-defined distance threshold: all proteins within a cluster have average pairwise distance below this threshold and are treated as the same allele. A sensitivity analysis sweeps the threshold from 0% to 5% and records the resulting allele count. The plot of allele count versus threshold is used to identify the “elbow” where further lumping slows down, which should correspond to genuine allele boundaries.
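Assuming a SciPy-based implementation (the toy distance matrix and variable names below are ours, not taken from the repository script), the flat cut and the threshold sweep look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric p-distance matrix: two tight pairs separated by ~30%
D = np.array([
    [0.000, 0.005, 0.300, 0.310],
    [0.005, 0.000, 0.310, 0.300],
    [0.300, 0.310, 0.000, 0.004],
    [0.310, 0.300, 0.004, 0.000],
])

Z = linkage(squareform(D), method="average")        # UPGMA
labels = fcluster(Z, t=0.01, criterion="distance")  # flat cut at 1%

# Sensitivity sweep: allele (cluster) count at each candidate threshold
thresholds = np.arange(0.0, 0.05, 0.0005)
counts = [len(set(fcluster(Z, t=t, criterion="distance"))) for t in thresholds]
```

With this toy matrix the 1% cut merges each tight pair but keeps the two groups apart, and the sweep drops from four singletons at threshold 0 to two clusters once the within-pair distances are absorbed.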

5.2.3.4 Representative sequence per allele

For each allele cluster, the most central sequence (minimum mean p-distance to all other cluster members) is selected as the representative and written to the output FASTA.
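The centroid choice reduces to an argmin over row means of the cluster's distance sub-matrix; a sketch (function name ours):

```python
import numpy as np

def cluster_representative(D, members):
    """Index of the member with minimum mean distance to its cluster-mates."""
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin(sub.mean(axis=1)))]

D = np.array([
    [0.00, 0.01, 0.02],
    [0.01, 0.00, 0.01],
    [0.02, 0.01, 0.00],
])
cluster_representative(D, [0, 1, 2])   # 1, the most central sequence
```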


5.2.4 Pipeline Steps

5.2.4.1 Step 1 — Distance-based allele grouping (define_SRK_alleles_from_distance.py)

Key parameters:

Parameter            Default                                Description
INPUT_FASTA          SRK_functional_proteins_aligned.fasta  Aligned protein FASTA
N_ALLELES            None                                   Option A — fix the number of alleles directly (overrides DIST_THRESHOLD when not None). Use when you can identify the elbow on the sensitivity curve but cannot read the exact distance.
DIST_THRESHOLD       0.01 (1%)                              Option B — fix the distance cutoff. Used only when N_ALLELES is None. Adjust based on the sensitivity plot elbow.
DOMAIN_REGION        (31, 430)                              Alignment columns used for distance (1-based)
THRESH_MIN/MAX/STEP  0.0 / 0.05 / 0.0005                    Sensitivity analysis range

Outputs:

File                                       Description
SRK_protein_allele_assignments.tsv         Protein → allele mapping table
SRK_protein_allele_representatives.fasta   One representative sequence per allele
SRK_protein_distance_analysis.pdf          Sensitivity curve + pairwise distance heatmap

Output table columns (SRK_protein_allele_assignments.tsv):

Column             Description
Protein            Protein ID from input FASTA
Allele             Assigned allele bin (e.g. Allele_021)
Cluster_id         Numeric cluster identifier
Is_representative  True for the cluster centroid

5.2.4.2 Step 2 — Amino acid variation heatmaps (SRK_AA_mutation_heatmap.py)

Produces two complementary figures sharing the same set of variable positions and domain annotation track.

Key parameters:

Parameter       Default                                Description
INPUT_FASTA     SRK_functional_proteins_aligned.fasta  Aligned protein FASTA
ALLELE_TSV      SRK_protein_allele_assignments.tsv     Orders rows by allele cluster (Figure 1)
MAX_GAP_FREQ    0.20                                   Excludes positions where >20% of sequences have a gap
ENTROPY_MIN     0.0                                    Minimum Shannon entropy (bits) to include a position
MAX_POSITIONS   300                                    Maximum variable positions to display
DOMAIN_REGIONS  See script                             List of (name, start, end, color) tuples

Outputs:

File                             Description
SRK_AA_mutation_heatmap.pdf      Figure 1: protein × position heatmap (AA physicochemical class)
SRK_AA_frequency_heatmap.pdf     Figure 2: AA × position frequency heatmap
SRK_AA_variable_positions.tsv    Variable positions with entropy and AA composition

Figure 1 — Protein × position heatmap:

  • Rows = proteins sorted by allele cluster; columns = variable alignment positions
  • Cell colour = amino acid physicochemical class (aliphatic: grey; aromatic: gold; positive: blue; negative: red; polar: green; cysteine: orange; proline: orchid; gap: white)
  • Top strip = domain annotation band

Figure 2 — AA frequency heatmap:

Three stacked panels: (i) domain annotation strip; (ii) 20 AAs (rows) × variable positions (columns), white = absent, dark red = high frequency, black dot = consensus AA, white dividers separate physicochemical groups, coloured left margin strip shows AA class; (iii) Shannon entropy bar chart per position.


5.2.5 Interpreting the sensitivity curve

The left panel of SRK_protein_distance_analysis.pdf shows allele count as a function of distance threshold. Reference lines are drawn (the grey lines always; the red lines depend on whether Option A or Option B is active):

  • Dotted grey (horizontal): total proteins in input (n = 228)
  • Dashed grey (horizontal): unique S-domain sequences at threshold = 0 (n = 109); 119 proteins share their S-domain exactly with at least one other protein, differing only in the kinase domain or elsewhere
  • Option B — dashed red (vertical): current DIST_THRESHOLD value set in the script, not computed algorithmically
  • Option A — dashed red (horizontal): target allele count set by N_ALLELES
  • Option A — dotted red (vertical): implied distance threshold, the smallest distance at which allele count ≤ N_ALLELES (derived from the sensitivity sweep)

Why the curve starts at 109, not 228: distance is computed on the S-domain only. Of the 228 functional proteins, 119 are S-domain identical to at least one other protein (they differ only in the kinase domain). These are merged at threshold = 0, so the curve begins at 109. This is biologically correct: two proteins with an identical ectodomain present the same recognition specificity.

How to choose the threshold: look for the elbow — the point where the curve transitions from a steep descent (merging genuine alleles) to a shallow plateau (absorbing only sequencing noise). If you can identify the elbow on the y-axis (allele count) but cannot read the exact distance, use N_ALLELES (Option A) — the script will derive the corresponding distance and annotate the plot. Alternatively, set DIST_THRESHOLD directly (Option B).


5.2.6 Interpreting the AA frequency heatmap

At genuine allele-discriminating positions (high Shannon entropy bar), exactly two rows light up in the frequency heatmap with roughly equal intensity — one for each of the two competing amino acids. A near-50:50 split is the molecular signature of balancing selection (NFDS) maintaining two SRK recognition specificities at equal frequency. Positions where the black dot (consensus) sits in a very dark cell with all other rows near white reflect fixed differences rather than balanced polymorphism.
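The per-position entropy underlying these figures is the standard Shannon formula; a minimal sketch (the helper name is ours, gaps ignored as in the script's MAX_GAP_FREQ logic):

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, gaps ignored."""
    residues = [aa for aa in column if aa != "-"]
    n = len(residues)
    return -sum((c / n) * log2(c / n) for c in Counter(residues).values())

# A perfect 50:50 split between two amino acids gives exactly 1 bit,
# the balanced-polymorphism signature described above
column_entropy("AAGG")   # 1.0
```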


5.2.7 Findings

5.2.7.1 Pairwise distance distribution

Metric    S-domain (cols 31–430)   Full protein
Minimum   0.000                    0.000
Maximum   4.76%                    3.44%
Mean      1.26%                    0.95%

This is strikingly low relative to the 20–50% divergence expected between different SRK recognition classes in Brassica and Arabidopsis, confirming that the pipeline substantially overestimates allele diversity when treating each unique protein as a distinct allele.

5.2.7.2 S-domain uniqueness

Of the 228 functional proteins, 109 have a unique S-domain sequence. The remaining 119 proteins are S-domain identical to at least one other protein (differing only in the kinase domain or elsewhere). The effective starting point for allele grouping is therefore 109, not 228.

5.2.7.3 Sensitivity analysis

Threshold   Alleles   Interpretation
0%          109       S-domain unique sequences (exact duplicates already merged)
0.5%        ~50–80    Still capturing most sequencing noise as alleles
1.0%        24        Working intermediate; 131 proteins in the largest cluster
2.0%        few       Most noise absorbed; 2–5 genuine alleles likely visible
4.0%+       1–3       Near-complete lumping

A threshold of 1–2% is biologically motivated: above typical Nanopore per-base error rates (~1%) but well below inter-allele divergence values from the literature. Use the sensitivity plot to confirm the elbow position for this dataset.

5.2.7.4 Allele composition at 1% threshold

  • 24 allele clusters from 228 proteins
  • Largest cluster (Allele_021): 131 proteins (57%)
  • Second largest (Allele_009): 35 proteins (15%)
  • Together: 73% of all proteins in two clusters
  • 9 singleton clusters (likely rare alleles or residual artefacts)

The dominance of 2–3 large clusters is consistent with a population carrying a small number of true SRK alleles, expected for a small or historically bottlenecked threatened species.

5.2.7.5 Variable positions and domain distribution

152 variable positions identified (out of 848; 18%), all with only 2–4 amino acid variants — diagnostic of a small number of alleles rather than high diversity.

Domain                   Variable positions
Signal peptide (1–30)     8
S-domain (31–430)        82
TM domain (431–455)       8
Kinase domain (456–848)  54

The most variable S-domain position is alignment position 118 (A:113 / G:115, entropy = 1.000 bits — near-equal split), the strongest signal of two balanced allele classes maintained by NFDS.


5.2.8 Recommendations

  1. Calibrate DIST_THRESHOLD using the sensitivity plot elbow. A value of 1–2% is recommended as a starting point.
  2. Validate against crossing data: if crossing experiments are available (Crosses_w_seeds_LEPA.csv), test whether incompatible pairs share the same distance-defined allele to empirically calibrate the threshold.
  3. Use distance-defined alleles in downstream population genetic analyses (allele accumulation, chi-square tests) to avoid inflated allele counts.
  4. Adjust domain boundaries (DOMAIN_REGIONS / DOMAIN_REGION) if species-specific SRK annotation is available.
  5. Inspect singletons at the chosen threshold: they may represent genuine rare alleles or residual assembly artefacts.

5.2.9 Bioinformatics

5.2.9.1 Scripts

Script Role
define_SRK_alleles_from_distance.py Pairwise p-distance, clustering, sensitivity analysis
SRK_AA_mutation_heatmap.py AA class heatmap (Figure 1) and frequency heatmap (Figure 2)

5.2.9.2 Execution

Run Step 1 first so allele assignments are available for Figure 1 row ordering in Step 2.

source srk_env/bin/activate
python define_SRK_alleles_from_distance.py
python SRK_AA_mutation_heatmap.py

5.2.9.3 define_SRK_alleles_from_distance.py

#!/usr/bin/env python3
# See full script in repository.
# Key user-settings block (top of file):

INPUT_FASTA    = "SRK_functional_proteins_aligned.fasta"

# ── Clustering mode: choose ONE ──────────────────────────────────────────────
# Option A — fix the NUMBER OF ALLELES (set N_ALLELES; DIST_THRESHOLD ignored)
N_ALLELES = None   # e.g. 50 — set to None to use Option B instead
# Option B — fix the DISTANCE THRESHOLD (used only when N_ALLELES is None)
DIST_THRESHOLD = 0.01   # ← adjust based on sensitivity plot elbow
# ─────────────────────────────────────────────────────────────────────────────

DOMAIN_REGION  = (31, 430)  # S-domain columns (1-based, inclusive)
OUT_TSV        = "SRK_protein_allele_assignments.tsv"
OUT_FASTA      = "SRK_protein_allele_representatives.fasta"
OUT_PDF        = "SRK_protein_distance_analysis.pdf"
THRESH_MIN, THRESH_MAX, THRESH_STEP = 0.0, 0.05, 0.0005

5.2.9.4 SRK_AA_mutation_heatmap.py

#!/usr/bin/env python3
# See full script in repository.
# Key user-settings block (top of file):

INPUT_FASTA   = "SRK_functional_proteins_aligned.fasta"
ALLELE_TSV    = "SRK_protein_allele_assignments.tsv"  # set None to skip
MAX_GAP_FREQ  = 0.20
ENTROPY_MIN   = 0.0
MAX_POSITIONS = 300
OUT_PDF       = "SRK_AA_mutation_heatmap.pdf"
OUT_FREQ_PDF  = "SRK_AA_frequency_heatmap.pdf"
OUT_VAR_TSV   = "SRK_AA_variable_positions.tsv"
# Domain regions: list of (name, start, end, color) — adjust for your species
DOMAIN_REGIONS = [
    ("Signal peptide",  1,   30,  "#b0c4de"),
    ("S-domain",        31,  430, "#90ee90"),
    ("TM domain",       431, 455, "#ffa07a"),
    ("Kinase domain",   456, 848, "#d3a4d3"),
]

5.3 SRK S-Allele Genotyping Pipeline

5.3.1 Overview

This Python pipeline genotypes individuals based on distance-defined S-allele bins rather than raw functional protein sequences. The workflow joins two upstream outputs — the functional protein key (sequence → protein) and the allele assignment table (protein → allele) — to produce per-individual allele lists and a count-based genotype matrix for population genetic analysis. Reference sequences are automatically excluded; only Library* individuals are retained.

The wide genotype matrix records allele copy counts (number of distinct proteins per individual assigned to each allele bin) rather than binary presence/absence. For a tetraploid, an individual with two proteins in Allele_044 and two in Allele_047 receives counts of 2 and 2 respectively, directly reflecting its AABB dosage state. This information drives zygosity inference and allele frequency estimation downstream.

Data flow:

SRK_functional_protein_key.tsv      (Original_sequence_ID → Protein)
        +
SRK_protein_allele_assignments.tsv  (Protein → Allele)
        ↓  join + deduplicate per individual
SRK_individual_allele_table.tsv     (long format: Individual | Protein | Allele)
SRK_individual_allele_genotypes.tsv (wide count matrix: individuals × alleles, values = allele copy counts)

5.3.2 Inputs

File                                Source      Description
SRK_functional_protein_key.tsv      Step 9      Maps every sequence ID to its functional protein
SRK_protein_allele_assignments.tsv  Step 10     Maps every protein to its distance-defined allele bin
sampling_metadata.csv               Field data  Sample metadata; used to restrict genotyping to ingroup individuals (optional)

5.3.3 Pipeline Steps

5.3.3.1 Ingroup filtering (optional)

If METADATA_FILE is set, sampling_metadata.csv is read and individuals with Ingroup == 1 are collected into an allow-list. Only these individuals are retained in the genotype outputs. Setting INGROUP_ONLY = False disables the filter and includes all Library* individuals regardless of ingroup status. Setting METADATA_FILE = None skips metadata loading entirely.
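The allow-list construction amounts to a small CSV filter. A sketch (the 'Individual' column name is our assumption about sampling_metadata.csv; the Ingroup == 1 test is from the description above):

```python
import csv
import io

def load_ingroup(handle):
    """Allow-list of individuals with Ingroup == 1.

    Assumes 'Individual' and 'Ingroup' column headers; adjust to match
    the actual sampling_metadata.csv.
    """
    return {row["Individual"] for row in csv.DictReader(handle)
            if row.get("Ingroup", "").strip() == "1"}

# Demo with an in-memory CSV; with the real file use open(...) instead
demo = io.StringIO("Individual,Ingroup\nLibrary001_barcode01,1\nSRK_BEA_ref,0\n")
load_ingroup(demo)   # {'Library001_barcode01'}
```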

5.3.3.2 Individual ID extraction

Individual IDs are parsed from the sequence name by splitting on _ and combining the first two components (e.g. Library001_barcode01_hap1_rep1_tig00000001 → Library001_barcode01). Sequences whose parsed ID does not start with Library (e.g. reference sequences SRK_BEA_*) are excluded.

5.3.3.3 Protein → Allele join

For each sequence assigned to an individual, the protein name is looked up in the allele assignment table. Sequences whose protein has no allele mapping are skipped with a warning count.

5.3.3.4 Deduplication and allele copy counting

Multiple sequences from the same individual may map to the same protein, and multiple proteins may belong to the same allele bin. Identical (protein, allele) pairs are deduplicated. The number of distinct proteins per individual assigned to each allele bin is then counted using Counter — this count is the allele copy count, reflecting how many gene copies from that individual carry that allele class.
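The deduplicate-then-count step can be sketched as follows (hypothetical pairs for one individual; the script's own version appears in full further below):

```python
from collections import Counter

# Hypothetical (protein, allele) pairs for one individual; the duplicate
# row simulates the same protein recovered from two sequences
pairs = [
    ("SRK_protein_001", "Allele_044"),
    ("SRK_protein_001", "Allele_044"),   # duplicate: removed by set()
    ("SRK_protein_004", "Allele_044"),
    ("SRK_protein_002", "Allele_047"),
]
unique_pairs = set(pairs)
copy_counts = Counter(allele for _, allele in unique_pairs)
# Two distinct proteins in Allele_044, one in Allele_047:
# Counter({'Allele_044': 2, 'Allele_047': 1})
```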

5.3.3.5 Output generation

A long-format table (one row per individual–protein–allele combination, with a Count column showing the allele copy count) and a wide count matrix (individuals × allele bins, values = allele copy counts) are written. The long table retains protein-level information so the protein → allele mapping can be traced. The count matrix replaces the old binary presence/absence format and directly encodes dosage information needed for zygosity inference and allele frequency analysis.

5.3.4 Output Files

File                                 Format                                     Description
SRK_individual_allele_table.tsv      Long (Individual, Protein, Allele, Count)  One row per unique individual–protein–allele combination; Count = distinct proteins per individual per allele bin
SRK_individual_allele_genotypes.tsv  Wide count matrix                          Rows = individuals, columns = allele bins, values = allele copy counts

Long format example:

Individual            Protein          Allele      Count
Library001_barcode01  SRK_protein_001  Allele_044  2
Library001_barcode01  SRK_protein_004  Allele_044  2
Library001_barcode01  SRK_protein_002  Allele_047  2
Library001_barcode01  SRK_protein_003  Allele_047  2
Library001_barcode02  SRK_protein_006  Allele_019  1
Library001_barcode02  SRK_protein_005  Allele_046  2

The Count column records how many distinct proteins from that individual map to the same allele bin — repeated identically on every row sharing the same Individual + Allele combination.

Wide format example (allele copy counts):

Individual            Allele_019  Allele_044  Allele_046  Allele_047
Library001_barcode01  0           2           0           2
Library001_barcode02  1           0           2           1

Values > 1 indicate that more than one distinct protein from that individual was assigned to the same allele bin, consistent with multiple gene copies carrying that allele class.

5.3.5 Results

Running with the current dataset (N_ALLELES = 50, implied distance ≈ 0.006, ingroup only):

Metadata loaded: 170 ingroup, 8 outgroup samples
INGROUP_ONLY = True  →  restricting to 170 ingroup samples
Loaded 228 protein → allele mappings
Reference sequences excluded : 4
Proteins with no allele match: 0
Individuals found            : 152

── FINAL SUMMARY ──
Individuals genotyped : 152
S-alleles in matrix   : 43
Min alleles/individual: 1
Max alleles/individual: 3
Mean alleles/individual: 1.47

152 ingroup individuals genotyped; 8 outgroup samples excluded. 43 of the 50 defined allele bins are observed in the ingroup. Two alleles dominate: Allele_044 (43 individuals) and Allele_047 (40 individuals), together accounting for ~54% of the ingroup.

5.3.6 Bioinformatics

  • Script: genotype_SRK_from_alleles.py
  • Inputs: SRK_functional_protein_key.tsv, SRK_protein_allele_assignments.tsv, sampling_metadata.csv (optional, for ingroup filtering)
  • Outputs: SRK_individual_allele_table.tsv, SRK_individual_allele_genotypes.tsv
#!/usr/bin/env python3
"""
genotype_SRK_from_alleles.py

Builds individual-level genotype matrices based on distance-defined S-allele
bins rather than raw functional protein sequences.

Data flow
---------
  SRK_functional_protein_key.tsv     Original_sequence_ID -> Protein
        + (parse individual from sequence ID)
  SRK_protein_allele_assignments.tsv  Protein -> Allele
        + (join + deduplicate)
  Per-individual allele set
        +
  SRK_individual_allele_table.tsv     long format  (Individual, Protein, Allele, Count)
  SRK_individual_allele_genotypes.tsv wide matrix  (individuals x alleles, copy counts)

Reference sequences (IDs not starting with "Library") are excluded.
"""

import csv, os, sys
from collections import defaultdict, OrderedDict, Counter

PROTEIN_KEY_FILE  = "SRK_functional_protein_key.tsv"
ALLELE_TSV_FILE   = "SRK_protein_allele_assignments.tsv"
LONG_OUT          = "SRK_individual_allele_table.tsv"
WIDE_OUT          = "SRK_individual_allele_genotypes.tsv"
INDIVIDUAL_PREFIX = "Library"

existing = [f for f in [LONG_OUT, WIDE_OUT] if os.path.exists(f)]
if existing:
    for f in existing: print(f"WARNING: will overwrite {f}")
    if input("Proceed? (y/n): ").strip().lower() != "y": sys.exit(0)

protein_to_allele = {}
with open(ALLELE_TSV_FILE, newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        protein_to_allele[row["Protein"]] = row["Allele"]

individual_data = defaultdict(list)
n_skipped_ref = n_skipped_nomap = 0
with open(PROTEIN_KEY_FILE, newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        parts = row["Original_sequence_ID"].split("_")
        if len(parts) < 2: continue
        individual = f"{parts[0]}_{parts[1]}"
        if not individual.startswith(INDIVIDUAL_PREFIX):
            n_skipped_ref += 1; continue
        allele = protein_to_allele.get(row["Protein"])
        if allele is None:
            n_skipped_nomap += 1; continue
        individual_data[individual].append((row["Protein"], allele))

individual_pairs = {ind: sorted(set(pairs), key=lambda x: (x[1], x[0]))
                    for ind, pairs in individual_data.items()}

# Count distinct proteins per allele bin per individual (= allele copy count)
individual_allele_counts = {ind: Counter(a for _, a in pairs)
                             for ind, pairs in individual_pairs.items()}

individual_alleles = {ind: sorted(counts.keys())
                      for ind, counts in individual_allele_counts.items()}

with open(LONG_OUT, "w", newline="") as out:
    w = csv.writer(out, delimiter="\t")
    w.writerow(["Individual", "Protein", "Allele", "Count"])
    for ind in sorted(individual_pairs):
        for protein, allele in individual_pairs[ind]:
            w.writerow([ind, protein, allele, individual_allele_counts[ind][allele]])

all_alleles = sorted({a for v in individual_alleles.values() for a in v})
with open(WIDE_OUT, "w", newline="") as out:
    w = csv.writer(out, delimiter="\t")
    w.writerow(["Individual"] + all_alleles)
    for ind in sorted(individual_allele_counts):
        counts = individual_allele_counts[ind]
        w.writerow([ind] + [counts.get(a, 0) for a in all_alleles])

n_alleles_per_ind = [len(v) for v in individual_alleles.values()]
print(f"Individuals genotyped  : {len(individual_alleles)}")
print(f"S-alleles in matrix    : {len(all_alleles)}")
print(f"Min alleles/individual : {min(n_alleles_per_ind)}")
print(f"Max alleles/individual : {max(n_alleles_per_ind)}")
print(f"Mean alleles/individual: {sum(n_alleles_per_ind)/len(n_alleles_per_ind):.2f}")

5.4 SRK Class Assignment and Dominance Prediction Pipeline

⚠️ Work in Progress — This analysis is under active development and may not function correctly.

The automated class assignment (srk_phylogenetic_analysis.py) relies on IQ-TREE3 and phylogenetic topology inference that has not yet been fully validated for this dataset. In particular:

  • The “longest internal branch” heuristic for separating Class I from Class II has not been confirmed against published SRK references with known class assignments.
  • The script reads a binary presence/absence genotype matrix (SRK_individual_genotypes.tsv) that predates the current allele copy-count matrix (SRK_individual_allele_genotypes.tsv) produced by Step 11. Input file compatibility must be verified before running.
  • Results should be treated with caution and validated in FigTree/iTOL before use in conservation management decisions.

A working draft of this analysis is also available in tmp_SRK_classes_prediction.Rmd (may be outdated relative to this section).

5.4.1 Overview

This Python pipeline takes the functional SRK protein alignment and individual genotype data, filters to ingroup individuals, infers a maximum-likelihood phylogenetic tree, and automatically assigns proteins to Class I or Class II based on tree topology. Per-individual class genotypes are then derived to predict dominance behaviour for crossing decisions in a conservation breeding program.

5.4.2 Biological Background

5.4.2.1 The Two-Class System

SRK alleles in Brassicaceae self-incompatibility (SI) systems are phylogenetically structured into two ancient classes with a strict dominance hierarchy:

  • Class I: fewer alleles; dominant over Class II alleles in heterozygous individuals
  • Class II: more alleles; codominant within class; recessive to Class I

This hierarchy directly determines which S-allele is expressed in a given individual and therefore which crosses will be compatible (produce seed) or incompatible (rejected):

Individual A genotype   Individual B genotype   Compatibility
Class I allele X        Class II only           Compatible if B lacks allele X
Class I allele X        Class I allele X        Incompatible
Class II only           Class II only           Compatible if no shared alleles
No functional SRK       Any                     Potentially self-compatible
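The dominance rules in the table above can be sketched as a small compatibility checker. This is a simplified illustration, not pipeline code: the function names and allele labels (S1, S20, S21) are hypothetical, and pollen/stigma asymmetries and modifier loci are ignored.

```python
def expressed_alleles(class_I, class_II):
    # Class I alleles mask Class II alleles when both are present
    # (sporophytic dominance hierarchy, simplified)
    return set(class_I) if class_I else set(class_II)

def cross_compatible(a, b):
    # A cross is predicted incompatible when the two parents share
    # any expressed S-allele
    return not (expressed_alleles(*a) & expressed_alleles(*b))

plant_a = ({"S1"}, {"S20"})        # Class I + Class II -> expresses S1 only
plant_b = (set(), {"S20", "S21"})  # Class II only -> expresses S20 and S21
plant_c = ({"S1"}, set())          # Class I only -> expresses S1

print(cross_compatible(plant_a, plant_b))  # True: no shared expressed allele
print(cross_compatible(plant_a, plant_c))  # False: both express S1
```

Note that plant_a and plant_b share the unexpressed Class II allele S20, yet the cross is predicted compatible because S20 is masked in plant_a by its Class I allele.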

5.4.2.2 The Grade Problem and Rooting

Class I forms a paraphyletic grade at the base of the SRK phylogeny; Class II is the derived monophyletic clade. The typical topology is:

    ┌── C1a
────┤        ┌── C1b
    └────────┤        ┌──────────── [Class II — monophyletic]
             └────────┤
                      └── C1c

This means standard midpoint rooting followed by a simple “two clades from the root” split will misclassify basal Class I sequences. The correct automated approach is to identify the longest internal branch in the unrooted tree, which separates the two classes due to extreme divergence accumulated under ancient balancing selection. The subtree connected by that branch is Class II; everything else is Class I. This is independent of rooting and correctly handles the grade structure.

Midpoint rooting is still produced as a separate output file for visual validation in FigTree or iTOL.
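The longest-internal-branch criterion can be demonstrated on a toy tree without any phylogenetics library. All node names, leaf names, and branch lengths below are invented for illustration; the 1.5 branch mimics the deep Class I/Class II divergence.

```python
from collections import defaultdict

# Toy unrooted SRK-like tree as weighted edges; n1..n6 are internal nodes.
edges = [
    ("n1", "C1a", 0.3), ("n1", "n2", 0.1),
    ("n2", "C1b", 0.3), ("n2", "n3", 0.2),
    ("n3", "C1c", 0.3), ("n3", "n4", 1.5),   # long branch to Class II clade
    ("n4", "C2a", 0.1), ("n4", "n5", 0.05),
    ("n5", "C2b", 0.1), ("n5", "n6", 0.05),
    ("n6", "C2c", 0.1), ("n6", "C2d", 0.1),
]

adj = defaultdict(list)
for a, b, _ in edges:
    adj[a].append(b)
    adj[b].append(a)
leaves = {n for n in adj if len(adj[n]) == 1}

def leaves_behind(start, blocked):
    # Leaf names reachable from `start` without crossing node `blocked`
    seen, stack, found = {blocked}, [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in leaves:
            found.add(node)
        stack.extend(adj[node])
    return found

# Longest internal branch = longest edge joining two internal nodes
a, b, _ = max((e for e in edges if e[0] not in leaves and e[1] not in leaves),
              key=lambda e: e[2])
side = leaves_behind(b, blocked=a)
# Larger side = Class II (recessive); smaller side = Class I (dominant grade)
class_II = side if len(side) >= len(leaves - side) else leaves - side
class_I = leaves - class_II
print(sorted(class_I), sorted(class_II))
```

Removing the 1.5 branch cleanly splits the paraphyletic Class I grade (C1a, C1b, C1c) from the Class II clade (C2a–C2d), with no rooting step required.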

5.4.3 Pipeline Steps

5.4.3.1 Output File Management and Overwrite Safeguard

Output files and IQ-TREE outputs are registered before any analysis. The script checks for existing files and prompts for confirmation before overwriting — consistent with the rest of the pipeline.

5.4.3.2 IQ-TREE3 Executable Detection

The script automatically searches for the IQ-TREE3 executable in the following order, without requiring it to be on the system PATH:

for candidate in ["./iqtree3", "./iqtree", "iqtree3", "iqtree"]:

If not found automatically, the user is prompted for the full path.

5.4.3.3 Step 1: Load Ingroup Individuals from Metadata

Reads sampling_metadata.csv and collects individuals where Ingroup == 1:

with open(metadata_file, newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row.get("Ingroup", "").strip() == "1":
            library = row["Library"].strip()
            barcode = row["barcode"].strip()
            if library.lstrip("-").isdigit():
                library = f"Library{int(library):03d}"
            if barcode.lstrip("-").isdigit():
                barcode = f"barcode{int(barcode):02d}"
            individual_id = f"{library}_{barcode}"
            ingroup_individuals.add(individual_id)

Key features:

  • BOM-safe reading: encoding="utf-8-sig" handles files that have been opened in Excel
  • ID normalisation: bare numbers in the Library and barcode columns (e.g. 1, 45) are automatically converted to the Library001_barcode45 format used in the genotype file
  • Metadata columns used: Library, barcode, Ingroup
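The ID normalisation rule can be isolated as a small helper for testing. normalise_id is a hypothetical name; the script performs this logic inline in Step 1.

```python
def normalise_id(library: str, barcode: str) -> str:
    # Bare numbers are zero-padded to the LibraryNNN_barcodeNN convention;
    # already-formatted values pass through unchanged.
    library, barcode = library.strip(), barcode.strip()
    if library.lstrip("-").isdigit():
        library = f"Library{int(library):03d}"
    if barcode.lstrip("-").isdigit():
        barcode = f"barcode{int(barcode):02d}"
    return f"{library}_{barcode}"

print(normalise_id("1", "45"))                  # Library001_barcode45
print(normalise_id("Library002", "barcode07"))  # Library002_barcode07
```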

5.4.3.4 Step 2: Load Genotype Matrix

Reads SRK_individual_genotypes.tsv and collects, for each ingroup individual, the set of proteins they carry (presence = 1):

proteins = {p for p in protein_columns if row.get(p, "0").strip() == "1"}
ingroup_individual_proteins[individual] = proteins
ingroup_proteins.update(proteins)

Two data structures are built:

  • ingroup_proteins: union of all proteins present in any ingroup individual — used to filter the alignment
  • ingroup_individual_proteins: per-individual protein sets — used in Step 5 for class genotype assignment

Diagnostic: individuals present in the metadata but absent from the genotype file are reported as a NOTE. This is expected for individuals excluded by the abundance filter or max_alleles filter in define_functional_proteins.py. A WARNING is only raised if >50% of ingroup individuals are missing, which would indicate an ID format mismatch.

5.4.3.5 Step 3: Filter Alignment to Ingroup Proteins

Reads SRK_functional_proteins_aligned.fasta and retains only sequences whose ID is in ingroup_proteins:

filtered_records = [rec for rec in all_records if rec.id in ingroup_proteins]
SeqIO.write(filtered_records, filtered_alignment, "fasta")

Proteins from outgroup individuals or those not present in any ingroup individual are excluded from the tree. Any expected proteins missing from the alignment are reported as warnings.

5.4.3.6 Step 4: Maximum-Likelihood Phylogenetic Analysis with IQ-TREE3

Runs IQ-TREE3 on the filtered ingroup alignment with automatic model selection optimised for SRK protein evolution:

cmd = [
    iqtree_exe,
    "-s", filtered_alignment,
    "-m", "TEST",
    "-mset", "LG,WAG,JTT",
    "-mfreq", "FU",
    "-mrate", "G4,I,R4",
    "-bb", "1000",
    "-alrt", "1000",
    "-redo",
    "-nt", "AUTO",
    "-pre", iqtree_prefix
]

Parameter rationale:

Parameter          Value                                  Rationale
-m TEST            Automatic model selection              Selects best-fit substitution model from the candidate set
-mset LG,WAG,JTT   Candidate model set                    Standard empirical protein models; LG and WAG are widely used for plant proteins
-mfreq FU          Empirical frequencies from alignment   More accurate than equal frequencies for divergent sequences
-mrate G4,I,R4     Gamma + invariant + FreeRate           Tests rate heterogeneity models; important for SRK given strong balancing selection creating unequal rates
-bb 1000           1000 ultrafast bootstrap replicates    Branch support for tree topology
-alrt 1000         1000 SH-aLRT replicates                Additional branch support metric
-nt AUTO           Automatic thread detection             Uses all available CPU cores
-redo              Overwrite existing output              Required for re-runs; user confirmation is handled before IQ-TREE is called

5.4.3.7 Step 5: SRK Class Assignment by Longest Internal Branch

Parses the unrooted IQ-TREE tree and identifies the longest internal branch:

for clade in tree.find_clades(order="level"):
    if clade == tree.root or clade.is_terminal():
        continue
    if clade.branch_length and clade.branch_length > max_branch_len:
        max_branch_len = clade.branch_length
        best_clade = clade

The subtree connected by this branch defines one group; all remaining leaves define the other:

all_leaves = {t.name for t in tree.get_terminals()}
group_A = {t.name for t in best_clade.get_terminals()}
group_B = all_leaves - group_A
# Class I = smaller (dominant), Class II = larger (recessive)
if len(group_A) <= len(group_B):
    class_I, class_II = group_A, group_B
else:
    class_I, class_II = group_B, group_A

Why this works for the grade structure: The branch connecting the Class II clade to the rest of the tree is systematically the longest internal branch in SRK phylogenies, because Class I and Class II diverged over an exceptionally long evolutionary time under strong balancing selection. This criterion is independent of rooting and correctly places all paraphyletic Class I sequences outside the Class II clade without requiring an outgroup.

A midpoint-rooted version of the tree is also written for visual validation:

tree_viz.root_at_midpoint()
Phylo.write(tree_viz, rooted_treefile, "newick")

Important: this automated classification should be validated by opening SRK_phylogeny_midpoint_rooted.treefile in FigTree or iTOL. For higher confidence, published SRK sequences with known class assignments (e.g. from Arabidopsis halleri or Brassica) can be added to the alignment as anchors before running the pipeline.

5.4.3.8 Per-Individual Class Genotype and Dominance Prediction

For each ingroup individual, the proteins they carry are split by class assignment and a dominance prediction is derived:

c1 = sorted(proteins & class_I)
c2 = sorted(proteins & class_II)

if c1 and c2:
    prediction = "Class_I_dominant"
elif c1:
    prediction = "Class_I_only_dominant"
elif c2:
    prediction = "Class_II_only_codominant"
else:
    prediction = "no_functional_SRK"

Dominance prediction categories:

  • Class_I_dominant: carries Class I and Class II alleles. Class I allele expressed; incompatible with any plant sharing that Class I allele
  • Class_I_only_dominant: carries only Class I alleles. Class I allele(s) expressed; compatible with Class II-only plants
  • Class_II_only_codominant: carries only Class II alleles. Both alleles co-expressed; incompatible only with plants sharing the exact same allele(s)
  • no_functional_SRK: no functional proteins detected. Potentially self-compatible; will set seed with most partners

5.4.4 Output Files

5.4.4.1 Primary Outputs

  • SRK_ingroup_proteins_aligned.fasta: filtered alignment of ingroup proteins used as input to IQ-TREE
  • SRK_protein_classes.tsv: two-column table mapping each protein to Class_I or Class_II
  • SRK_individual_class_genotype.tsv: per-individual class genotype and dominance prediction

5.4.4.2 Tree Files

  • SRK_phylogeny.treefile: maximum-likelihood tree (unrooted, Newick format) — primary tree output
  • SRK_phylogeny_midpoint_rooted.treefile: midpoint-rooted version for visualisation in FigTree/iTOL
  • SRK_phylogeny.iqtree: full IQ-TREE log including model selection results and tree statistics
  • SRK_phylogeny.contree: consensus tree from bootstrap replicates

5.4.4.3 SRK_individual_class_genotype.tsv Column Description

Column                 Description
Individual             Individual identifier (Library_barcode format)
n_Class_I              Number of distinct Class I proteins carried
Class_I_proteins       Semicolon-separated list of Class I protein names
n_Class_II             Number of distinct Class II proteins carried
Class_II_proteins      Semicolon-separated list of Class II protein names
Dominance_prediction   Predicted dominance category for crossing decisions

5.4.4.4 Example output

Individual              n_Class_I  Class_I_proteins   n_Class_II  Class_II_proteins              Dominance_prediction
Library001_barcode01    1          SRK_protein_003    2           SRK_protein_041;SRK_protein_078 Class_I_dominant
Library001_barcode05    0                             2           SRK_protein_012;SRK_protein_033 Class_II_only_codominant
Library002_barcode14    0                             0                                           no_functional_SRK

5.4.5 Bioinformatics

  • Goal: Assign functional SRK proteins to Class I or Class II using phylogenetic topology, then predict dominance behaviour for ingroup individuals to guide crossing decisions in the breeding program
  • Script: srk_phylogenetic_analysis.py
  • Inputs:
    • SRK_functional_proteins_aligned.fasta — one aligned sequence per functional protein (from define_functional_proteins.py)
    • SRK_individual_genotypes.tsv — binary presence/absence matrix (from genotype_SRK_from_functional_proteins.py)
    • sampling_metadata.csv — sample metadata; must contain Library, barcode, and Ingroup columns
  • Parameters:
    • IQ-TREE model set: LG,WAG,JTT
    • Bootstrap replicates: 1000 (UFBoot + SH-aLRT)
    • Class split: longest internal branch (automated)
  • Outputs:
    • SRK_ingroup_proteins_aligned.fasta — filtered alignment (ingroup proteins only)
    • SRK_phylogeny.treefile — ML tree (unrooted)
    • SRK_phylogeny_midpoint_rooted.treefile — midpoint-rooted tree for visualisation
    • SRK_phylogeny.iqtree — IQ-TREE full log
    • SRK_protein_classes.tsv — protein → Class I / Class II assignment
    • SRK_individual_class_genotype.tsv — per-individual class genotype and dominance prediction

Example console output:
Using IQ-TREE executable: ./iqtree3
Loading metadata...
Ingroup individuals found in metadata: 170

Loading genotype matrix...
Individuals in genotype file:        160
Ingroup individuals matched:         152
Proteins present in ingroup:         202

NOTE: 18 ingroup individuals from metadata absent from genotype file.
  These were likely excluded by the abundance or max_alleles filter upstream.
  Absent individuals: ['Library001_barcode13', 'Library001_barcode22', ...]

Filtering alignment to ingroup proteins...
Sequences in full alignment:         228
Sequences retained for ingroup:      202
Filtered alignment written: SRK_ingroup_proteins_aligned.fasta

Running IQ-TREE3...
Command: ./iqtree3 -s SRK_ingroup_proteins_aligned.fasta -m TEST -mset LG,WAG,JTT -mfreq FU -mrate G4,I,R4 -bb 1000 -alrt 1000 -redo -nt AUTO -pre SRK_phylogeny

[IQ-TREE3 output...]

Classifying proteins into SRK classes...
Longest internal branch length:  1.2847
Class I  (dominant,  smaller):   52 proteins
Class II (recessive, larger):    150 proteins

NOTE: Validate this split in FigTree using the midpoint-rooted tree.
      For higher confidence, add published SRK reference sequences with
      known class assignments to anchor the tree.

FINAL SUMMARY

Ingroup individuals:     152
Ingroup proteins:        202
Sequences in tree:       202

Class I  (dominant):     52 proteins
Class II (recessive):    150 proteins

Filtered alignment:      SRK_ingroup_proteins_aligned.fasta
Tree file (unrooted):    SRK_phylogeny.treefile
Tree file (midpt rooted):SRK_phylogeny_midpoint_rooted.treefile
IQ-TREE log:             SRK_phylogeny.iqtree
Protein classes:         SRK_protein_classes.tsv
Individual class geno:   SRK_individual_class_genotype.tsv

CODE

#!/usr/bin/env python3

import csv
import os
import shutil
import subprocess
import sys
from Bio import Phylo, SeqIO

# ----------------------------
# Files
# ----------------------------
metadata_file      = "sampling_metadata.csv"
genotype_file      = "SRK_individual_genotypes.tsv"
alignment_file     = "SRK_functional_proteins_aligned.fasta"
filtered_alignment = "SRK_ingroup_proteins_aligned.fasta"
iqtree_prefix      = "SRK_phylogeny"

IQTREE_EXTENSIONS = ["treefile", "iqtree", "log", "model.gz", "contree", "splits.nex", "bionj", "mldist"]
iqtree_outputs = [f"{iqtree_prefix}.{ext}" for ext in IQTREE_EXTENSIONS]

class_file      = "SRK_protein_classes.tsv"
ind_class_file  = "SRK_individual_class_genotype.tsv"
rooted_treefile = f"{iqtree_prefix}_midpoint_rooted.treefile"

OUTPUT_FILES = [filtered_alignment, class_file, ind_class_file,
                rooted_treefile] + iqtree_outputs

# ----------------------------
# Locate IQ-TREE3 executable
# ----------------------------
iqtree_exe = None
for candidate in ["./iqtree3", "./iqtree", "iqtree3", "iqtree"]:
    if candidate.startswith("./"):
        if os.path.isfile(candidate[2:]):
            iqtree_exe = candidate
            break
    elif shutil.which(candidate):
        iqtree_exe = candidate
        break

if iqtree_exe is None:
    user_path = input("IQ-TREE3 not found. Enter path to executable (e.g. ./iqtree3): ").strip()
    # os.path.isfile handles a leading "./" directly; stripping it with
    # lstrip("./") would mangle absolute paths such as /opt/iqtree3
    if not os.path.isfile(user_path):
        print(f"ERROR: File not found: {user_path}")
        sys.exit(1)
    iqtree_exe = user_path
print(f"Using IQ-TREE executable: {iqtree_exe}")

# ----------------------------
# Overwrite check
# ----------------------------
existing = [f for f in OUTPUT_FILES if os.path.exists(f)]
if existing:
    print("\nWARNING: The following output files already exist and will be overwritten:")
    for f in existing:
        print(f"  {f}")
    confirm = input("\nProceed and overwrite? (y/n): ").strip().lower()
    if confirm != 'y':
        print("Aborted. No files were modified.")
        sys.exit(0)

# ----------------------------
# Step 1: Load ingroup individuals from metadata
# ----------------------------
print("Loading metadata...")
ingroup_individuals = set()

with open(metadata_file, newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row.get("Ingroup", "").strip() == "1":
            library = row["Library"].strip()
            barcode = row["barcode"].strip()
            if library.lstrip("-").isdigit():
                library = f"Library{int(library):03d}"
            if barcode.lstrip("-").isdigit():
                barcode = f"barcode{int(barcode):02d}"
            individual_id = f"{library}_{barcode}"
            ingroup_individuals.add(individual_id)

print(f"Ingroup individuals found in metadata: {len(ingroup_individuals)}")

if not ingroup_individuals:
    print("ERROR: No ingroup individuals found.")
    print("  Check that the 'Ingroup' column contains 1 for ingroup samples.")
    sys.exit(1)

# ----------------------------
# Step 2: Load genotype matrix and collect proteins present in ingroup
# ----------------------------
print("\nLoading genotype matrix...")
ingroup_proteins = set()
genotype_individuals = set()
matched_individuals = set()
ingroup_individual_proteins = {}

with open(genotype_file, newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f, delimiter="\t")
    protein_columns = [col for col in (reader.fieldnames or []) if col != "Individual"]

    for row in reader:
        individual = row["Individual"]
        genotype_individuals.add(individual)
        if individual in ingroup_individuals:
            matched_individuals.add(individual)
            proteins = {p for p in protein_columns if row.get(p, "0").strip() == "1"}
            ingroup_individual_proteins[individual] = proteins
            ingroup_proteins.update(proteins)

print(f"Individuals in genotype file:        {len(genotype_individuals)}")
print(f"Ingroup individuals matched:         {len(matched_individuals)}")
print(f"Proteins present in ingroup:         {len(ingroup_proteins)}")

unmatched = ingroup_individuals - genotype_individuals
if unmatched:
    print(f"\nNOTE: {len(unmatched)} ingroup individuals from metadata absent from genotype file.")
    print(f"  These were likely excluded by the abundance or max_alleles filter upstream.")
    print(f"  Absent individuals: {sorted(unmatched)[:5]}{' ...' if len(unmatched) > 5 else ''}")
    if len(unmatched) > len(ingroup_individuals) * 0.5:
        print(f"  WARNING: >50% of ingroup individuals are missing — check ID format.")
        print(f"  Example metadata IDs:  {sorted(ingroup_individuals)[:3]}")
        print(f"  Example genotype IDs:  {sorted(genotype_individuals)[:3]}")

if not ingroup_proteins:
    print("\nERROR: No proteins found for ingroup individuals. Cannot build tree.")
    sys.exit(1)

# ----------------------------
# Step 3: Filter alignment to ingroup proteins
# ----------------------------
print("\nFiltering alignment to ingroup proteins...")
all_records = list(SeqIO.parse(alignment_file, "fasta"))
filtered_records = [rec for rec in all_records if rec.id in ingroup_proteins]

print(f"Sequences in full alignment:         {len(all_records)}")
print(f"Sequences retained for ingroup:      {len(filtered_records)}")

missing_from_alignment = ingroup_proteins - {rec.id for rec in all_records}
if missing_from_alignment:
    print(f"\nWARNING: {len(missing_from_alignment)} ingroup proteins not found in alignment:")
    for p in sorted(missing_from_alignment):
        print(f"  {p}")

if not filtered_records:
    print("\nERROR: No sequences matched ingroup proteins in the alignment.")
    sys.exit(1)

if len(filtered_records) < 4:
    print(f"\nWARNING: Only {len(filtered_records)} sequences — tree may not be informative.")

SeqIO.write(filtered_records, filtered_alignment, "fasta")
print(f"\nFiltered alignment written: {filtered_alignment}")

# ----------------------------
# Step 4: Run IQ-TREE3
# ----------------------------
print("\nRunning IQ-TREE3...")

cmd = [
    iqtree_exe,
    "-s", filtered_alignment,
    "-m", "TEST",
    "-mset", "LG,WAG,JTT",
    "-mfreq", "FU",
    "-mrate", "G4,I,R4",
    "-bb", "1000",
    "-alrt", "1000",
    "-redo",
    "-nt", "AUTO",
    "-pre", iqtree_prefix
]

print("Command:", " ".join(cmd))
print()
subprocess.run(cmd, check=True)

# ----------------------------
# Step 5: Classify proteins into SRK Class I and Class II
# ----------------------------
# Strategy: find the longest internal branch in the unrooted tree.
# In SRK phylogenies the two classes are separated by an exceptionally long
# branch due to ancient balancing selection. The subtree connected by that
# branch = Class II (larger, recessive); everything else = Class I (smaller,
# dominant). This avoids the rooting problem and handles the Class I grade.

print("\nClassifying proteins into SRK classes...")
treefile = f"{iqtree_prefix}.treefile"
tree = Phylo.read(treefile, "newick")

max_branch_len = 0
best_clade = None

for clade in tree.find_clades(order="level"):
    if clade == tree.root or clade.is_terminal():
        continue
    if clade.branch_length and clade.branch_length > max_branch_len:
        max_branch_len = clade.branch_length
        best_clade = clade

if best_clade is None:
    print("WARNING: Could not identify a splitting branch. Check tree file.")
    class_I, class_II = set(), {t.name for t in tree.get_terminals()}
else:
    all_leaves = {t.name for t in tree.get_terminals()}
    group_A    = {t.name for t in best_clade.get_terminals()}
    group_B    = all_leaves - group_A
    if len(group_A) <= len(group_B):
        class_I, class_II = group_A, group_B
    else:
        class_I, class_II = group_B, group_A

print(f"Longest internal branch length:  {max_branch_len:.4f}")
print(f"Class I  (dominant,  smaller):   {len(class_I)} proteins")
print(f"Class II (recessive, larger):    {len(class_II)} proteins")
print()
print("NOTE: Validate this split in FigTree using the midpoint-rooted tree.")
print("      For higher confidence, add published SRK reference sequences with")
print("      known class assignments to anchor the tree.")

# Write protein class assignments
with open(class_file, "w") as f:
    f.write("Protein\tClass\n")
    for p in sorted(class_I):
        f.write(f"{p}\tClass_I\n")
    for p in sorted(class_II):
        f.write(f"{p}\tClass_II\n")

# Write midpoint-rooted tree for visualisation
tree_viz = Phylo.read(treefile, "newick")
tree_viz.root_at_midpoint()
Phylo.write(tree_viz, rooted_treefile, "newick")

# Per-individual class genotype and dominance prediction
with open(ind_class_file, "w") as f:
    f.write("Individual\t"
            "n_Class_I\tClass_I_proteins\t"
            "n_Class_II\tClass_II_proteins\t"
            "Dominance_prediction\n")
    for individual in sorted(ingroup_individual_proteins):
        proteins = ingroup_individual_proteins[individual]
        c1 = sorted(proteins & class_I)
        c2 = sorted(proteins & class_II)

        if c1 and c2:
            prediction = "Class_I_dominant"
        elif c1:
            prediction = "Class_I_only_dominant"
        elif c2:
            prediction = "Class_II_only_codominant"
        else:
            prediction = "no_functional_SRK"

        f.write(f"{individual}\t"
                f"{len(c1)}\t{';'.join(c1)}\t"
                f"{len(c2)}\t{';'.join(c2)}\t"
                f"{prediction}\n")

# ----------------------------
# Summary
# ----------------------------
print("\nFINAL SUMMARY")
print(f"\nIngroup individuals:     {len(matched_individuals)}")
print(f"Ingroup proteins:        {len(ingroup_proteins)}")
print(f"Sequences in tree:       {len(filtered_records)}")
print(f"\nClass I  (dominant):     {len(class_I)} proteins")
print(f"Class II (recessive):    {len(class_II)} proteins")
print(f"\nFiltered alignment:      {filtered_alignment}")
print(f"Tree file (unrooted):    {iqtree_prefix}.treefile")
print(f"Tree file (midpt rooted):{rooted_treefile}")
print(f"IQ-TREE log:             {iqtree_prefix}.iqtree")
print(f"Protein classes:         {class_file}")
print(f"Individual class geno:   {ind_class_file}")

5.5 SRK Zygosity Analysis Pipeline for Tetraploid Species

5.5.1 Overview

This R pipeline analyzes S-locus zygosity patterns in tetraploid species by quantifying the number of distinct SRK proteins per individual and classifying genotype configurations. The workflow addresses the complex genetics of polyploid self-incompatibility systems, where individuals can carry multiple S-alleles with varying degrees of heterozygosity, providing insights into mating system evolution and population genetic structure.

5.5.2 Pipeline Steps

5.5.2.1 Polyploid Self-Incompatibility Analysis Framework

Zygosity analysis serves critical functions in polyploid self-incompatibility research:

  • Genotype complexity assessment: Quantifies S-allele diversity within individuals
  • Mating system inference: Identifies patterns consistent with outcrossing vs. selfing
  • Population structure analysis: Reveals demographic processes affecting S-locus diversity
  • Evolutionary constraint evaluation: Assesses maintenance of incompatibility function in polyploids

5.5.2.2 Data Loading and Validation

Loads the allele copy-count matrix produced by the genotyping pipeline:

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE,
  check.names = FALSE
)
allele_cols <- setdiff(colnames(geno), "Individual")
geno[, allele_cols] <- lapply(geno[, allele_cols], as.numeric)

Data structure expectations:

  • Wide format: rows = individuals, columns = allele bins
  • Count values: each cell records the number of distinct proteins from that individual assigned to that allele bin
  • Tab-delimited: standard format from upstream genotyping pipeline

5.5.2.3 Allele Diversity Quantification

Computes two complementary measures of per-individual allele diversity:

N_distinct_alleles <- rowSums(geno[, allele_cols] > 0)
N_total_proteins   <- rowSums(geno[, allele_cols])

  • N_distinct_alleles: number of allele bins with at least one copy — the primary zygosity signal
  • N_total_proteins: sum of all copy counts across allele bins — reflects total protein recovery (approaches ploidy when sampling is complete)

Biological significance:

In tetraploid species, these metrics reflect:

  • Allelic diversity: Higher N_distinct_alleles indicates greater S-locus heterozygosity
  • Dosage information: N_total_proteins captures how many gene copies were recovered per allele class
  • Mating patterns: Outcrossing maintains higher diversity than selfing
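The two rowSums calls reduce to simple per-row counts. A minimal Python cross-check on an invented copy-count matrix (individual IDs and counts are hypothetical, mirroring the structure of SRK_individual_allele_genotypes.tsv):

```python
# Rows = individuals, columns = allele bins (copy counts per bin)
matrix = {
    "ind01": [2, 2, 0, 0],   # two bins, two copies each -> AABB-like
    "ind02": [1, 2, 1, 0],   # three bins                -> AABC-like
    "ind03": [2, 0, 0, 0],   # one bin, partial recovery -> AAAA-like
}

# Equivalent of rowSums(geno > 0) and rowSums(geno) in the R code
n_distinct = {ind: sum(c > 0 for c in row) for ind, row in matrix.items()}
n_total    = {ind: sum(row) for ind, row in matrix.items()}

print(n_distinct)  # {'ind01': 2, 'ind02': 3, 'ind03': 1}
print(n_total)     # {'ind01': 4, 'ind02': 4, 'ind03': 2}
```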

5.5.2.4 Zygosity Classification

Assigns homozygous vs. heterozygous categories based on the number of distinct allele bins:

Zygosity <- ifelse(N_distinct_alleles == 1, "Homozygous", "Heterozygous")

Classification logic:

  • Homozygous: Single distinct allele bin present (AAAA genotype)
  • Heterozygous: Two or more distinct allele bins present (AAAB, AABB, AABC, ABCD genotypes)
  • Functional relevance: Homozygous individuals may exhibit compromised self-incompatibility

5.5.2.5 Detailed Genotype Model Assignment

Assigns specific tetraploid genotype configurations using both the number of distinct allele bins and their copy counts:

Genotype <- apply(geno[, allele_cols, drop = FALSE], 1, function(row) {
  counts_present <- sort(row[row > 0], decreasing = TRUE)
  n <- length(counts_present)
  if (n == 1) return("AAAA")
  if (n == 2) {
    if (counts_present[1] >= 2 * counts_present[2]) return("AAAB")
    return("AABB")
  }
  if (n == 3) return("AABC")
  if (n >= 4) return("ABCD")
  return("Unknown")
})

Genotype model interpretations:

AAAA (Homozygous):

  • All four chromosome copies carry identical S-allele
  • Likely self-compatible due to lack of allelic diversity
  • May result from inbreeding or population bottlenecks
  • Represents potential breakdown of self-incompatibility system

AAAB (Simplex heterozygous):

  • Two distinct allele bins, but the dominant allele has ≥ 2× the copy count of the minor allele
  • Inferred when counts_present[1] >= 2 * counts_present[2]
  • One allele predominates (e.g. three copies of allele A, one copy of allele B)
  • Asymmetric dosage; the dominant allele may suppress the minor in sporophytic SI

AABB (Duplex heterozygous):

  • Two distinct allele bins with roughly equal copy counts
  • Symmetric dosage — both alleles equally represented
  • Common pattern in balanced polyploid populations
  • Maintains incompatibility through allelic diversity

AABC (Triplex heterozygous):

  • Three distinct allele bins; one allele present in two copies, the other two each in one copy
  • Intermediate heterozygosity level
  • Maintains incompatibility with good diversity

ABCD (Quadruplex heterozygous):

  • Four distinct allele bins, each in one copy
  • Maximum within-individual S-locus diversity
  • Indicates strong outcrossing and large effective population size
  • Optimal configuration for maintaining incompatibility function
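The R classification rule above can be transliterated to Python for spot-checking edge cases. classify_genotype is a hypothetical standalone helper, not part of the pipeline; the logic mirrors the apply() call exactly.

```python
def classify_genotype(counts):
    # counts = copy number per allele bin for one individual
    present = sorted((c for c in counts if c > 0), reverse=True)
    n = len(present)
    if n == 0:
        return "Unknown"
    if n == 1:
        return "AAAA"
    if n == 2:
        # dominant bin has at least twice the copies of the minor bin -> AAAB
        return "AAAB" if present[0] >= 2 * present[1] else "AABB"
    if n == 3:
        return "AABC"
    return "ABCD"

print(classify_genotype([3, 1, 0, 0]))  # AAAB
print(classify_genotype([2, 2, 0, 0]))  # AABB
print(classify_genotype([1, 1, 0, 0]))  # AABB
```

Note the last case: two singleton counts (1, 1) are binned as AABB because neither bin dominates, even though only two of four copies were recovered; incomplete recovery can therefore blur the AAAB/AABB distinction.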

5.5.2.6 Data Export and Documentation

Saves zygosity analysis results with proper formatting:

write.table(
    zyg,
    "SRK_individual_zygosity.tsv",
    sep="\t",
    quote=FALSE,
    row.names=FALSE
)

Export features:

  • Tab-delimited format: Standard for downstream analysis
  • No quotes: Clean formatting for data import
  • No row names: Prevents index column creation
  • Header preservation: Maintains column names for clarity

5.5.2.7 Summary Statistics Generation

Provides comprehensive descriptive statistics:

cat("\nDistinct allele count distribution:\n")
print(table(zyg$N_distinct_alleles))
cat("\nZygosity distribution:\n")
print(table(zyg$Zygosity))
cat("\nGenotype distribution:\n")
print(table(zyg$Genotype))

Statistical summaries:

  • Distinct allele distribution: Frequency of each N_distinct_alleles value
  • Zygosity proportions: Homozygous vs. heterozygous ratios
  • Genotype distribution: Breakdown across AAAA, AAAB, AABB, AABC, ABCD classes
  • Quality control: Identification of unusual genotype frequencies

5.5.2.8 Visualization Generation

Creates graphical summary of genotype diversity:

pdf("SRK_zygosity_distribution.pdf")

barplot(
  table(zyg$N_distinct_alleles),
  main = "SRK genotype diversity (tetraploid)",
  xlab = "Number of distinct SRK alleles",
  ylab = "Number of individuals",
  col  = "gray70"
)

dev.off()

Visualization features:

  • Bar plot format: Distribution of N_distinct_alleles across individuals
  • PDF output: High-quality vector graphics for publication
  • Descriptive labels: Clear axis labels and title
  • Distribution assessment: Visual identification of dominant genotype classes

5.5.3 Output Files

5.5.3.1 Primary Outputs

Zygosity table (SRK_individual_zygosity.tsv):

  • Individual identifiers
  • N_distinct_alleles: number of allele bins with ≥1 copy
  • N_total_proteins: total allele copy count across all bins
  • Binary zygosity classification (Homozygous/Heterozygous)
  • Detailed genotype models (AAAA, AAAB, AABB, AABC, ABCD)
  • Allele_composition: allele copy string, e.g. Allele_044(2)+Allele_047(2)
  • Ready for downstream population genetic analysis and cross planning

Distribution plot (SRK_zygosity_distribution.pdf):

  • Bar plot showing frequency of each genotype class
  • Visual assessment of population genetic structure
  • Publication-ready figure for manuscript inclusion

5.5.3.2 Data Structure Example

Individual            N_distinct_alleles  N_total_proteins  Zygosity      Genotype  Allele_composition
Library001_barcode01  2                   4                 Heterozygous  AABB      Allele_044(2)+Allele_047(2)
Library001_barcode02  3                   4                 Heterozygous  AABC      Allele_019(1)+Allele_046(2)+Allele_047(1)
Library001_barcode03  1                   1                 Homozygous    AAAA      Allele_017(1)
Library001_barcode05  1                   2                 Homozygous    AAAA      Allele_017(2)

Note: N_total_proteins = 4 indicates complete tetraploid allele recovery; values < 4 indicate partial recovery from sequencing. Homozygous individuals may have count > 1 (e.g., two copies of the same allele sequenced).

5.5.4 Scientific Applications

5.5.5 Population Genetic Inference

Demographic history reconstruction:

  • High ABCD frequency: Large, outcrossing populations with minimal drift
  • High AAAA frequency: Population bottlenecks or inbreeding
  • AABB dominance: Structured populations or assortative mating
  • Mixed patterns: Complex demographic history or population admixture

Effective population size estimation:

  • S-allele diversity correlates with effective population size
  • Homozygosity excess indicates small population effects
  • Heterozygosity patterns reveal demographic constraints

5.5.5.1 Self-Incompatibility System Analysis

Functional assessment:

  • AAAA individuals: Likely self-compatible due to S-allele homozygosity
  • Multi-allelic individuals: Maintain strong incompatibility function
  • Population-level patterns: Overall system integrity assessment

Evolutionary constraints:

  • Balancing selection effectiveness in maintaining diversity
  • Drift effects on S-locus polymorphism
  • Population size thresholds for incompatibility maintenance

5.5.5.2 Conservation Genetics Applications

Population viability assessment:

  • S-locus diversity as indicator of reproductive potential
  • Identification of populations at risk of incompatibility breakdown
  • Prioritization of conservation efforts based on genetic diversity

Management recommendations:

  • High homozygosity: Genetic rescue through outcrossing
  • Low diversity: Population augmentation strategies
  • Fragmented patterns: Habitat connectivity restoration priorities

5.5.6 Statistical Considerations

5.5.6.1 Sample Size Requirements

Minimum sample sizes:

  • At least 20-30 individuals for basic frequency estimation
  • Larger samples (>50) for demographic inference
  • Population-specific considerations based on expected diversity

5.5.6.2 Interpretation Guidelines

Genotype frequency expectations:

  • Random mating: Hardy-Weinberg proportions modified for polyploidy
  • Inbreeding: Excess homozygosity relative to random mating
  • Population structure: Non-random genotype associations

Comparative analysis:

  • Comparison with neutral markers to detect selection
  • Temporal sampling to assess diversity changes
  • Multi-population studies to understand regional patterns

This zygosity analysis pipeline provides essential insights into the genetic architecture of self-incompatibility in polyploid species, enabling assessment of system functionality and population genetic health critical for both evolutionary understanding and conservation management.

5.5.7 Bioinformatics

  • Goal: Classify each individual's genotype model (AAAA, AAAB, AABB, AABC, ABCD) from the allele count matrix, based on the number of distinct allele bins and their copy counts
  • Script: SRK_zygosity_from_genotype.R
  • Execute code: Rscript SRK_zygosity_from_genotype.R
  • Input: SRK_individual_allele_genotypes.tsv
  • Output:
    • SRK_individual_zygosity.tsv
    • SRK_zygosity_distribution.pdf
Example console output:

Starting SRK zygosity analysis from genotype matrix

Number of individuals: 128

Protein count distribution:
 1  2  3  4
66 51  5  6

Zygosity distribution:
Heterozygous   Homozygous
          62           66

Output files created:
SRK_individual_zygosity.tsv
SRK_zygosity_distribution.pdf

Full script:
############################################################
# SRK zygosity analysis from allele count matrix
############################################################

cat("\nStarting SRK zygosity analysis from allele count matrix\n")

############################################################
# Load allele count matrix
############################################################

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

allele_cols <- setdiff(colnames(geno), "Individual")
geno[, allele_cols] <- lapply(geno[, allele_cols], as.numeric)

############################################################
# Compute per-individual statistics
############################################################

N_distinct_alleles <- rowSums(geno[, allele_cols] > 0)
N_total_proteins   <- rowSums(geno[, allele_cols])

############################################################
# Assign zygosity
############################################################

Zygosity <- ifelse(N_distinct_alleles == 1, "Homozygous", "Heterozygous")

############################################################
# Assign genotype model
# AAAB: 2 distinct alleles, dominant allele has >= 2x copies of the minor allele
# AABB: 2 distinct alleles, roughly equal copy numbers
############################################################

Genotype <- apply(geno[, allele_cols, drop = FALSE], 1, function(row) {
  counts_present <- sort(row[row > 0], decreasing = TRUE)
  n <- length(counts_present)
  if (n == 0) return("Unknown")
  if (n == 1) return("AAAA")
  if (n == 2) {
    if (counts_present[1] >= 2 * counts_present[2]) return("AAAB")
    return("AABB")
  }
  if (n == 3) return("AABC")
  if (n >= 4) return("ABCD")
  return("Unknown")
})

############################################################
# Build allele composition string
############################################################

Allele_composition <- apply(geno[, allele_cols, drop = FALSE], 1, function(row) {
  present <- sort(row[row > 0])
  if (length(present) == 0) return("None")
  paste(paste0(names(present), "(", present, ")"), collapse = "+")
})

############################################################
# Assemble and save
############################################################

zyg <- data.frame(
  Individual         = geno$Individual,
  N_distinct_alleles = N_distinct_alleles,
  N_total_proteins   = N_total_proteins,
  Zygosity           = Zygosity,
  Genotype           = Genotype,
  Allele_composition = Allele_composition,
  stringsAsFactors = FALSE
)

write.table(zyg, "SRK_individual_zygosity.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

############################################################
# Summary and plot
############################################################

cat("\nNumber of individuals:", nrow(zyg), "\n")
cat("\nDistinct allele count distribution:\n")
print(table(zyg$N_distinct_alleles))
cat("\nZygosity distribution:\n")
print(table(zyg$Zygosity))
cat("\nGenotype distribution:\n")
print(table(zyg$Genotype))

pdf("SRK_zygosity_distribution.pdf")
barplot(
  table(zyg$N_distinct_alleles),
  main = "SRK genotype diversity (tetraploid)",
  xlab = "Number of distinct SRK alleles",
  ylab = "Number of individuals",
  col  = "gray70"
)
dev.off()

cat("\nOutput files:\n")
cat("SRK_individual_zygosity.tsv\n")
cat("SRK_zygosity_distribution.pdf\n")

6 Phase 3: Data Analyses

6.1 Scientific Approach

Phase 3 addresses a single overarching conservation question: what is the impact of genetic drift on reproductive success? In plant species with sporophytic self-incompatibility (SI), reproduction requires that pollen and pistil carry different SRK alleles. Under negative frequency-dependent selection (NFDS), all S-alleles are maintained at approximately equal frequencies, maximising the proportion of compatible mating pairs and sustaining seed production across a population. In small or isolated populations, genetic drift counteracts NFDS — stochastically eliminating rare alleles, skewing frequencies away from the equal-frequency expectation, and degrading individual genotype quality — with direct consequences for reproductive success. Three sequential questions structure the analyses.

Q1 — What is the S-allele richness of the species? The species-level allele pool approximates the NFDS richness equilibrium: the total number of functionally distinct S-alleles that selection is capable of maintaining across the species. This is estimated from allele accumulation curves and asymptotic richness estimators (Michaelis-Menten, Chao1, iNEXT) applied to the full dataset (Step 14), and serves as the reference optimum against which population-level deficits are measured. Population genetic parameters — allele richness, effective allele number, heterozygosity, and mean alleles per individual — are characterised per EO (Step 13) to provide the empirical foundation for downstream comparisons.

Q2 — How did genetic drift impact populations’ S-allele diversity? Genetic drift reduces allele richness and skews remaining frequencies away from the equal-frequency NFDS expectation. Tipping Point 1 (TP1) marks the threshold at which allele loss is so severe that inter-population allele transfers are required to restore the SI system. Two criteria are evaluated per EO: the proportion of the species optimum retained (allele richness deficit), and the frequency evenness of those alleles — the ratio of effective allele number to observed allele count (Ne/N) — as a measure of departure from NFDS equal-frequency expectations. These are assessed through allele accumulation (Step 14), frequency analysis (Step 15), TP1 synthesis (Step 16), and allele composition comparison across EOs (Step 17). EOs breaching both thresholds (< 50% of optimum and Ne/N < 0.80) are flagged CRITICAL.

Note that “genetic drift” here encompasses two conceptually distinct processes: (1) a discrete historical bottleneck — a sudden reduction in population size that causes rapid simultaneous loss of allele richness; and (2) ongoing drift in small populations — the continuous stochastic erosion of allele frequency evenness as dominant alleles accumulate. The two TP1 axes are designed to capture these separately: low prop_optimum reflects richness loss consistent with a bottleneck or founder event; low Ne/N reflects frequency distortion consistent with ongoing drift. EOs deficient on both axes simultaneously — as observed across all five LEPA EOs — are most consistent with a historical bottleneck followed by prolonged ongoing drift. The SRK locus is under balancing selection (NFDS) and cannot be used directly with standard neutral-marker bottleneck tests; formally partitioning the two processes requires genome-wide neutral marker data and demographic modelling (e.g., SMC++, PSMC).

Q3 — How did genetic drift impact the reproductive fitness of individuals within populations? Even where some allele diversity persists, drift causes allele copy-count imbalances at the individual level that directly reduce reproductive output. In a tetraploid, the proportion of compatible diploid gametes an individual produces — the Genotypic Fitness Score (GFS) — depends not only on which alleles are present but on their dosage balance across the four allele copies. An AABB individual (balanced 2+2) produces heterozygous gametes in 4 of 6 combinations (GFS = 0.667), whereas an AAAB individual (unbalanced 3+1) produces only 3 of 6 (GFS = 0.500), despite both carrying two distinct alleles. In self-incompatible plants, an individual’s value as a breeding partner depends not only on which key/lock types it carries, but also on the number of distinct alleles present across its genome copies. Individuals with more key/lock types can participate in more compatible crosses, making them especially valuable in a managed breeding program. Tipping Point 2 (TP2) marks the threshold at which this individual-level degradation is so advanced that within-population crosses alone cannot restore reproductive fitness. Two criteria are evaluated per EO: mean GFS below the AABB benchmark (< 0.667) and the proportion of AAAA individuals — which produce no heterozygous gametes — exceeding 30%. These are assessed through individual GFS computation (Step 18), TP2 synthesis (Step 19), and the reproductive effort visualisation (Step 20). EOs breaching both criteria are flagged CRITICAL.
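The GFS arithmetic in this paragraph can be checked with a few lines of R (a standalone sketch; the helper `gfs()` is illustrative and not part of the pipeline scripts):

```r
# Genotypic Fitness Score for a tetraploid: the fraction of the choose(4, 2) = 6
# possible diploid gamete draws that combine two distinct allele copies.
# `copies` is a vector of per-allele copy counts summing to 4, e.g. c(2, 2).
gfs <- function(copies) {
  stopifnot(sum(copies) == 4)
  homozygous_pairs <- sum(choose(copies, 2))  # pairs drawing the same allele twice
  1 - homozygous_pairs / choose(4, 2)
}

gfs(c(2, 2))        # AABB -> 4/6 ~ 0.667
gfs(c(3, 1))        # AAAB -> 3/6 = 0.500
gfs(c(4))           # AAAA -> 0: no heterozygous gametes
gfs(c(1, 1, 1, 1))  # ABCD -> 1: every gamete is heterozygous
```

This reproduces the AABB (0.667) and AAAB (0.500) benchmarks quoted above, and shows why AAAA individuals contribute nothing to compatible matings.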

Together, these three questions trace the consequences of demographic decline from the species-level allele pool, through population-level richness erosion and frequency skew, to individual-level reproductive fitness — providing a quantitative framework for prioritising conservation interventions.


6.2 Population Genetic Statistics

6.2.1 Overview

This pipeline calculates key population genetic statistics from SRK genotyping data in order to evaluate whether self-incompatibility allele diversity is maintained by negative frequency-dependent selection (NFDS) or instead influenced by genetic drift or demographic bottlenecks.

The analysis integrates individual genotype data, inferred zygosity states, and population metadata to produce summary statistics at the population level. These statistics provide insight into the evolutionary dynamics governing the SRK self-incompatibility system.

For SRK genes under a functional self-incompatibility system, theoretical expectations include:

  • High allelic diversity
  • High heterozygosity
  • High effective number of alleles (Ne)

Deviations from these expectations can indicate demographic or evolutionary processes affecting allele maintenance.

6.2.2 Pipeline Steps

6.2.2.1 Biological Expectations for SRK Systems

In sporophytic self-incompatibility systems such as that of the Brassicaceae, SRK alleles are typically maintained by negative frequency-dependent selection (NFDS). Under this model, rare alleles gain a reproductive advantage because they encounter fewer incompatible mates.

6.2.2.1.1 Expected patterns under strong NFDS
  • High heterozygosity among individuals
  • High effective number of alleles (Ne)
  • High allele richness within populations
  • Low frequency of homozygous individuals

6.2.2.2 Expected patterns under genetic drift or demographic bottlenecks

  • Reduced effective number of alleles (Ne)
  • Reduced allele richness
  • Increased proportion of homozygous individuals
  • Skewed allele frequency distributions

These statistics therefore provide a quantitative framework for distinguishing selection-driven allele maintenance from demographic erosion of diversity.

6.2.2.3 Input Data

The pipeline integrates three datasets produced earlier in the SRK genotyping workflow.

6.2.2.3.1 Individual genotype matrix
SRK_individual_allele_genotypes.tsv

Allele copy-count matrix: each cell records the number of distinct proteins from that individual assigned to that allele bin. Values > 0 indicate presence; values > 1 indicate multiple gene copies carrying that allele class. Used for colSums-based allele frequency estimation.

Example structure:

Individual            Allele_044  Allele_046  Allele_047
Library001_barcode01  2           0           2
Library001_barcode02  0           2           1

6.2.2.3.2 Individual zygosity table
SRK_individual_zygosity.tsv

Table summarizing the number of distinct SRK allele bins and total protein copies per individual, with inferred zygosity and genotype.

Example:

Individual            N_distinct_alleles  N_total_proteins  Zygosity      Genotype  Allele_composition
Library001_barcode01  2                   4                 Heterozygous  AABB      Allele_044(2)+Allele_047(2)
Library001_barcode03  1                   1                 Homozygous    AAAA      Allele_017(1)

6.2.2.3.3 Sampling metadata
sampling_metadata.csv

Metadata describing each sampled individual, including:

  • population identity
  • ingroup/outgroup status
  • sampling information

6.2.2.4 Output Files

The analysis generates two output files.

6.2.2.4.1 Population summary statistics
SRK_population_genetic_summary.tsv

This table summarizes key statistics for each population:

Population  N_individuals  N_alleles  Effective_alleles_Ne  Prop_heterozygous  Prop_homozygous  Mean_alleles

Example results:

Population  N_individuals  N_alleles  Effective_alleles_Ne  Prop_heterozygous  Prop_homozygous  Mean_alleles
25          32             12         5.853                 0.438              0.562            1.438
67          32             12         5.404                 0.438              0.562            1.531
76          25             9          4.930                 0.480              0.520            1.480

6.2.2.4.2 Summary plots
SRK_population_genetic_summary.pdf

This PDF file contains four visual summaries:

  1. Effective number of alleles per population
  2. Population heterozygosity
  3. Allele richness
  4. Sample size per population

These plots facilitate visual comparison among populations and help identify potential signatures of drift-driven allele loss.

6.2.2.5 Computational Workflow

6.2.2.5.0.1 Loading and filtering input data

The script first loads the allele count matrix, zygosity, and metadata files. Only ingroup samples are retained for downstream analysis.

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

zyg <- read.table(
  "SRK_individual_zygosity.tsv",
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)

meta <- read.csv(
  "sampling_metadata.csv",
  stringsAsFactors = FALSE
)

meta <- meta[meta$Ingroup == 1, ]

This filtering step ensures that only biologically relevant individuals are included in the population genetic analysis.

6.2.2.5.1 Standardized genotype matrix processing

The genotype matrix may contain mixed data types or missing values. The script converts all allele columns into a clean numeric matrix of allele copy counts.

This step:

  • identifies allele bin columns
  • converts character or factor values to numeric
  • replaces missing values with zeros
  • produces the count matrix used for colSums-based allele frequency calculation

colSums across the count matrix gives total allele copy counts per allele bin (not just the number of individuals carrying each allele), which is the correct denominator for allele frequency estimation in a tetraploid.

6.2.2.5.2 Population-level allele counting

For each population:

  1. All individuals belonging to the population are extracted.
  2. Allele presence is summed across individuals.
  3. Alleles with non-zero counts are considered present.

This procedure determines the number of alleles segregating within each population.

6.2.2.5.3 Allele frequency estimation

Allele frequencies are calculated as:

\[ p_i = \frac{n_i}{\sum n_i} \]

where:

  • \(n_i\) is the count of allele \(i\)
  • the denominator is the total number of allele observations in the population.

These frequencies are used to estimate the effective number of alleles.

6.2.2.5.4 Effective number of alleles

The effective number of alleles (Ne) is calculated using:

\[ N_e = \frac{1}{\sum p_i^2} \]

This metric accounts for both allele richness and frequency evenness.

Interpretation:

  • High Ne → many alleles with balanced frequencies (NFDS expected)
  • Low Ne → skewed frequencies or allele loss (drift signal)
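The two formulas above reduce to a couple of lines of R; the allele copy counts below are invented for illustration:

```r
# Effective number of alleles from a vector of allele copy counts.
allele_counts <- c(Allele_044 = 40, Allele_046 = 30, Allele_047 = 20, Allele_017 = 10)

p  <- allele_counts / sum(allele_counts)  # allele frequencies p_i
Ne <- 1 / sum(p^2)                        # effective number of alleles

Ne  # about 3.33: four observed alleles, but uneven frequencies lower Ne
```

With perfectly even frequencies Ne equals the observed allele count, so the ratio Ne/N used in the TP1 synthesis directly measures frequency evenness.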

6.2.2.5.5 Population heterozygosity

Heterozygosity is estimated using the zygosity table:

\[ H = \frac{\text{number of heterozygous individuals}}{\text{total individuals}} \]

The script also calculates:

  • proportion of homozygous individuals
  • mean number of proteins per individual

These metrics provide additional insight into functional diversity within populations.

6.2.3 Population Genetic Interpretation

The combined statistics allow evaluation of selection vs drift:

6.2.3.1 Indicators of strong NFDS

  • High allele richness
  • High effective allele number
  • High heterozygosity
  • Balanced allele frequencies

6.2.3.2 Indicators of drift or demographic bottlenecks

  • Reduced allele richness
  • Low effective allele number
  • Increased homozygosity
  • Dominance of a few alleles

Comparing these metrics across populations can identify populations experiencing allele erosion or reduced mating compatibility.

6.2.4 Bioinformatics

6.2.4.1 Script

SRK_population_genetic_summary.R

6.2.4.2 Execution

Rscript SRK_population_genetic_summary.R

6.2.4.3 Input files

  • SRK_individual_allele_genotypes.tsv
  • SRK_individual_zygosity.tsv
  • sampling_metadata.csv

6.2.4.4 Output files

  • SRK_population_genetic_summary.tsv
  • SRK_population_genetic_summary.pdf

6.2.4.5 Full Script

############################################################
# SRK population genetic summary
############################################################

cat("Starting SRK population genetic summary analysis\n")

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

zyg <- read.table(
  "SRK_individual_zygosity.tsv",
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)

meta <- read.csv(
  "sampling_metadata.csv",
  stringsAsFactors = FALSE
)

meta <- meta[meta$Ingroup == 1, ]

geno <- geno[geno$Individual %in% meta$SampleID, ]
zyg  <- zyg[zyg$Individual %in% meta$SampleID, ]

geno$Population <- meta$Pop[match(geno$Individual, meta$SampleID)]
zyg$Population  <- meta$Pop[match(zyg$Individual, meta$SampleID)]

allele_cols <- setdiff(colnames(geno), c("Individual","Population"))

geno[allele_cols] <- lapply(geno[allele_cols], function(x){
  x <- as.numeric(as.character(x))
  x[is.na(x)] <- 0
  return(x)
})

geno_matrix <- as.matrix(geno[,allele_cols])

analyze_population <- function(pop){

  g <- geno[geno$Population == pop,]
  z <- zyg[zyg$Population == pop,]

  mat <- as.matrix(g[,allele_cols])

  allele_counts <- colSums(mat)
  present <- allele_counts > 0

  N_alleles <- sum(present)

  freqs <- allele_counts[present] / sum(allele_counts[present])

  Ne <- if(length(freqs)>0) 1/sum(freqs^2) else NA

  prop_het <- mean(z$Zygosity=="Heterozygous",na.rm=TRUE)

  mean_alleles <- mean(z$N_distinct_alleles,na.rm=TRUE)

  data.frame(
    Population=pop,
    N_individuals=nrow(g),
    N_alleles=N_alleles,
    Effective_alleles_Ne=round(Ne,3),
    Prop_heterozygous=round(prop_het,3),
    Prop_homozygous=round(1-prop_het,3),
    Mean_alleles=round(mean_alleles,3)
  )
}

pops <- sort(unique(geno$Population))

results <- do.call(rbind,lapply(pops,analyze_population))

write.table(
  results,
  "SRK_population_genetic_summary.tsv",
  sep="\t",
  quote=FALSE,
  row.names=FALSE
)

cat("Analysis complete\n")

6.3 Allele Accumulation Curves

6.3.1 Overview

Allele accumulation curves quantify how rapidly new alleles are discovered as additional individuals are sampled. In systems under negative frequency-dependent selection (NFDS), such as plant self-incompatibility loci, rare alleles are continuously maintained in populations. This leads to a characteristic accumulation pattern where new alleles continue to appear even after substantial sampling effort.

In contrast, populations evolving primarily under genetic drift show rapid early allele discovery followed by saturation, indicating that most diversity is captured quickly.

This pipeline calculates allele accumulation curves at two hierarchical levels:

  • Species level – using all ingroup individuals
  • Population level – for populations with sufficient sampling (≥5 individuals)

From these curves, several statistics are extracted that provide quantitative indicators of NFDS versus drift. In addition, the pipeline estimates the total number of SRK alleles in the species using three complementary richness estimators applied to the species-level data. This estimated total serves as a baseline for evaluating how far each population is from a balanced NFDS equilibrium.

6.3.2 Pipeline Steps

6.3.2.1 Data Integration

Two input datasets are combined:

  1. SRK genotype matrix
  2. Sampling metadata

The genotype table contains allele copy counts for each individual (the same count matrix produced in Step 11), while the metadata provides population assignments and ingroup/outgroup classification. Allele presence for accumulation purposes is determined by count > 0, so the count values do not affect which alleles are considered discovered.

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

sample_info <- read.csv(
  "sampling_metadata.csv",
  stringsAsFactors = FALSE
)

6.3.2.2 Ingroup Filtering and Individual Matching

Outgroup samples are excluded to focus the analysis on the focal species.

Individuals in the genotype matrix are then matched to metadata entries using the SampleID field, ensuring consistent population assignment.

sample_info <- sample_info[sample_info$Ingroup == 1, ]
geno <- geno[geno$Individual %in% sample_info$SampleID, ]

geno$Population <- sample_info$Pop[match(geno$Individual, sample_info$SampleID)]

This step ensures:

  • Accurate population assignments
  • Removal of external species
  • Consistency between genotype and metadata datasets

6.3.2.3 Genotype Matrix Construction

Allele count columns begin at column 2 of the genotype table. These columns are extracted to construct the allele matrix used for accumulation analyses.

allele_start_col <- 2
allele_end_col <- ncol(geno) - 1

allele_matrix <- as.matrix(geno[, allele_start_col:allele_end_col])

allele_matrix <- apply(allele_matrix, 2, as.numeric)
allele_matrix[is.na(allele_matrix)] <- 0

rownames(allele_matrix) <- geno$Individual

Key processing steps include:

  • Conversion to numeric format
  • Replacement of missing values with zero
  • Assignment of individual identifiers as row names

This produces an allele count matrix with:

rows = individuals
columns = SRK allele bins
values = allele copy counts (0 = absent, >0 = present with that many gene copies)

For accumulation curve purposes, presence is assessed as count > 0; the magnitude of the count does not affect allele discovery.

6.3.2.4 Allele Accumulation Algorithm

Allele accumulation curves are calculated using randomized sampling permutations.

For each permutation:

  1. Individuals are randomly ordered
  2. The number of unique alleles is counted cumulatively
  3. The process is repeated across many permutations (default = 1000)

The average accumulation trajectory and variance are then computed.

allele_counts <- colSums(mat)
present_alleles <- allele_counts > 0
true_alleles <- sum(present_alleles)

Allele presence is determined using the condition:

colSums(matrix) > 0

This counts alleles that occur in at least one individual.
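A minimal sketch of this permutation procedure, assuming the pipeline's `allele_accumulation()` returns the mean curve and its standard deviation (the script's actual implementation may differ):

```r
# Allele accumulation via randomized sampling permutations.
# mat: individuals x allele-bin count matrix; presence is count > 0.
allele_accumulation <- function(mat, label, n_perm = 1000) {
  n <- nrow(mat)
  curves <- replicate(n_perm, {
    ord  <- sample(n)                      # random individual order
    seen <- rep(FALSE, ncol(mat))
    vapply(seq_len(n), function(i) {
      seen <<- seen | (mat[ord[i], ] > 0)  # mark alleles discovered so far
      sum(seen)                            # cumulative unique allele count
    }, numeric(1))
  })
  list(label      = label,
       mean_curve = rowMeans(curves),      # average trajectory
       sd_curve   = apply(curves, 1, sd))  # variance across permutations
}
```

Because only presence (count > 0) enters the union, the magnitude of the copy counts never affects the curve, as noted above.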

6.3.2.5 Species-Level Analysis

The first analysis considers the entire ingroup dataset as a single population.

species_res <- allele_accumulation(allele_matrix, "Species")

This produces:

  • Mean allele accumulation curve
  • Standard deviation across permutations
  • Summary statistics describing the curve shape

The resulting curve represents total SRK allelic diversity across the species.

6.3.2.6 Population-Level Analysis

Population-level curves are calculated for populations with sufficient sampling effort.

Only populations with at least five individuals are retained:

pop_counts <- table(geno$Population)
valid_pops <- names(pop_counts[pop_counts >= 5])

Each population is analyzed independently using the same accumulation algorithm.

for (pop in valid_pops) {
  pop_geno <- geno[geno$Population == pop, ]
  pop_allele_matrix <- as.matrix(pop_geno[, allele_start_col:allele_end_col])
  pop_res <- allele_accumulation(pop_allele_matrix, pop)
}

This allows direct comparison between:

  • species-wide diversity
  • local population diversity

6.3.2.7 Allele Richness Estimation

After computing each accumulation curve, three complementary estimators are applied to the species-level data (and independently to each population) to estimate the true total number of SRK alleles:

Michaelis-Menten (MM) asymptote

The mean accumulation curve is fitted to a rectangular hyperbola using non-linear least squares:

\[ S(n) = \frac{S_{max} \cdot n}{K + n} \]

where \(S_{max}\) is the estimated asymptote (total allele richness) and \(K\) is the half-saturation constant. This estimator directly reflects the shape of the observed accumulation curve and is preferred when sampling is approaching saturation.
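A sketch of the MM fit with base R's `nls()` (the start values are heuristic guesses, not the script's actual settings):

```r
# Fit S(n) = Smax * n / (K + n) to a mean accumulation curve with nls().
fit_mm <- function(mean_curve) {
  df <- data.frame(n = seq_along(mean_curve), S = mean_curve)
  fit <- nls(S ~ Smax * n / (K + n), data = df,
             start = list(Smax = max(df$S) * 1.5,  # guess: above the observed max
                          K    = max(df$n) / 2))   # guess: half the sample size
  coef(fit)  # named vector: Smax (asymptote), K (half-saturation constant)
}
```

Note that `nls()` can fail on artificial zero-residual data, so in practice the mean curve's permutation noise helps rather than hurts the fit.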

Chao1 non-parametric estimator

A lower-bound richness estimator based on singleton and doubleton allele counts:

\[ \hat{S}_{Chao1} = S_{obs} + \frac{f_1^2}{2 f_2} \]

where \(f_1\) and \(f_2\) are the number of alleles observed in exactly one and two individuals, respectively. When \(f_2 = 0\), the bias-corrected formula \(S_{obs} + f_1(f_1-1)/2\) is used. Chao1 provides a conservative lower bound; note that under strong NFDS, rare alleles are suppressed which can reduce \(f_1\) and lead to underestimation.
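Chao1 as defined above can be computed directly from the incidence matrix (a sketch; the second branch is the bias-corrected formula for \(f_2 = 0\)):

```r
# Chao1 richness from an individuals x allele-bin count matrix.
chao1 <- function(mat) {
  occ  <- colSums(mat > 0)  # number of individuals carrying each allele
  Sobs <- sum(occ > 0)      # observed allele richness
  f1   <- sum(occ == 1)     # singleton alleles (seen in one individual)
  f2   <- sum(occ == 2)     # doubleton alleles (seen in two individuals)
  if (f2 > 0) Sobs + f1^2 / (2 * f2)
  else        Sobs + f1 * (f1 - 1) / 2  # bias-corrected form when f2 = 0
}
```

For example, four alleles seen in 1, 1, 2, and 3 individuals give Sobs = 4, f1 = 2, f2 = 1, hence Chao1 = 4 + 4/2 = 6.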

iNEXT asymptotic estimator (optional)

If the iNEXT R package is installed, the Hill numbers framework is used to provide an asymptotic richness estimate with bootstrap confidence intervals. Install with install.packages("iNEXT").

The species-level results from all three estimators are written to SRK_species_richness_estimates.tsv and used downstream by the allele frequency analysis as the empirical optimum allele count.

6.3.2.8 Sampling Adequacy Estimation

Once the MM model is fitted, the pipeline inverts the Michaelis-Menten equation to estimate how many additional individuals would need to be sampled to reach a given level of allelic diversity:

\[ n^* = \frac{S^* \cdot K}{S_{max} - S^*} \]

where \(S^*\) is the allele target and \(K\) is the MM half-saturation constant. Additional individuals needed = \(n^* - n_{current}\) (floored at 0 if the target is already met).

Five targets are calculated at each level:

Target Description
+1 allele Additional individuals to discover the next new allele beyond observed
80% of MM Individuals to reach 80% of the estimated asymptote
90% of MM Individuals to reach 90% of the estimated asymptote
95% of MM Individuals to reach 95% of the estimated asymptote
99% of MM Individuals to reach 99% of the estimated asymptote

The +1 allele estimate gives an intuitive sense of current sampling efficiency (how many more individuals are needed just to find one new allele). The percentage-based targets illustrate the hyperbolic cost of approaching saturation — the last few percent of an asymptote require disproportionately large sample sizes, especially under NFDS where rare alleles are maintained at low frequency.

If MM fitting failed for a given level, all five targets are reported as NA.
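The inversion can be sketched as follows (the helper name and example parameter values are illustrative, not taken from the pipeline output):

```r
# Invert S(n) = Smax * n / (K + n) to find the sample size n* reaching an
# allele target S_star, then report additional individuals beyond the current n.
n_more_for_target <- function(S_star, Smax, K, n_current) {
  if (S_star >= Smax) return(NA)       # target at or beyond the asymptote
  n_star <- S_star * K / (Smax - S_star)
  max(0, ceiling(n_star - n_current))  # floored at 0 if the target is met
}

# Toy example: asymptote 50 alleles, K = 20, 30 individuals sampled so far.
n_more_for_target(S_star = 40, Smax = 50, K = 20, n_current = 30)  # 50 more
```

The hyperbolic cost described above falls out directly: as S_star approaches Smax the denominator shrinks, so n* grows without bound.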

6.3.2.9 Curve Visualization

Accumulation curves are exported to a multi-page PDF file.

pdf(
  "SRK_allele_accumulation_curves.pdf",
  width = 8,
  height = 6
)

Each plot includes:

  • Mean accumulation curve
  • Confidence envelope (± SD)
  • Number of individuals sampled
  • Total alleles detected
  • Estimated asymptotes as horizontal reference lines (MM = dark green dashed, Chao1 = purple dotted, iNEXT = orange if available)

These plots allow visual evaluation of diversity saturation patterns and comparison of observed diversity against estimated total richness.

6.3.3 Output Files

6.3.3.1 Primary Outputs

SRK_allele_accumulation_curves.pdf

Contains:

  • Species-level allele accumulation curve
  • Individual curves for each sufficiently sampled population

SRK_allele_accumulation_stats.tsv

Tabulated statistics extracted from each curve, including richness estimates for each level.

Column Description
Level Species or Population
Population Population identifier (or “All” for species)
N_individuals Number of individuals analyzed
N_alleles Observed alleles
Initial_slope Mean allele gain per individual in early sampling
Tail_slope Mean allele gain per individual in late sampling
Late_fraction Proportion of diversity contributed by rare alleles
MM_estimate Michaelis-Menten asymptote estimate
Chao1_estimate Chao1 non-parametric richness estimate
iNEXT_estimate iNEXT asymptotic estimate (NA if package not installed)
N_more_next_allele Additional individuals needed to discover one more allele beyond observed
N_more_80pct_MM Additional individuals needed to reach 80% of the MM asymptote
N_more_90pct_MM Additional individuals needed to reach 90% of the MM asymptote
N_more_95pct_MM Additional individuals needed to reach 95% of the MM asymptote
N_more_99pct_MM Additional individuals needed to reach 99% of the MM asymptote

Example output:

Level Population N_ind N_alleles MM_est N_more_next N_more_80% N_more_90% N_more_95% N_more_99%
Species All 152 43 59 28 94 400 1013 5915
Population 25 32 12 17 5 14 70 183 1086
Population 27 30 15 25 6 50 149 348 1935
Population 67 32 12 16 14 10 63 168 1006
Population 76 25 9 12 10 3 37 105 649

SRK_species_richness_estimates.tsv

A dedicated summary of species-level richness estimates used as the NFDS optimum in downstream frequency analyses.

Column Description
S_observed Observed allele count across all ingroup individuals
MM_estimate Michaelis-Menten asymptote
Chao1_estimate Chao1 lower-bound estimate
iNEXT_estimate iNEXT asymptotic estimate
Consensus_estimate Mean of all available estimators
N_more_next_allele Additional individuals needed to discover one more allele beyond observed
N_more_80pct_MM Additional individuals needed to reach 80% of the MM asymptote
N_more_90pct_MM Additional individuals needed to reach 90% of the MM asymptote
N_more_95pct_MM Additional individuals needed to reach 95% of the MM asymptote
N_more_99pct_MM Additional individuals needed to reach 99% of the MM asymptote
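The Chao1 column follows the classic non-parametric estimator based on singleton (f1) and doubleton (f2) allele counts. An illustrative sketch (the bias-corrected branch for f2 = 0 is the standard variant; whether the script uses it is an assumption):

```python
def chao1(s_obs, f1, f2):
    """Chao1 lower-bound richness estimate: S_obs + f1^2 / (2*f2),
    with the bias-corrected form when no doubletons were observed."""
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical counts: 43 observed alleles, 10 singletons, 5 doubletons
print(chao1(s_obs=43, f1=10, f2=5))  # 43 + 100/10 = 53.0
```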

6.3.4 Biological Interpretation

Three statistics extracted from accumulation curves provide insight into the evolutionary forces shaping allele diversity.

6.3.4.1 Initial Slope

Measures how rapidly new alleles appear early in sampling.

Initial_slope > 1   → high rare allele diversity
Initial_slope < 0.5 → few dominant alleles

High values are expected under negative frequency-dependent selection.

6.3.4.2 Tail Slope

Measures whether new alleles continue appearing late in sampling.

Tail_slope ≈ 0 → curve saturation (drift)
Tail_slope > 0.2 → ongoing rare allele discovery (NFDS)

Persistent discovery of alleles is characteristic of balancing selection.

6.3.4.3 Late Fraction

Quantifies how much allelic diversity is composed of rare alleles appearing late in sampling.

Late_fraction =
(alleles_total − alleles_half) / alleles_total

where alleles_half is the number of alleles already discovered once half of the individuals have been sampled.

Interpretation:

Late_fraction > 0.30 → strong NFDS
Late_fraction < 0.10 → drift-dominated diversity

This metric captures the proportion of diversity contributed by rare alleles.
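All three statistics can be read off a single cumulative discovery curve. A sketch in Python, assuming "early" and "late" mean the first and last 20% of individuals and that alleles_half is the richness at the midpoint (the script's actual window definitions may differ):

```python
def accumulation_stats(cum_alleles):
    """cum_alleles[i] = total distinct alleles seen after sampling i+1 individuals."""
    n = len(cum_alleles)
    w = max(2, n // 5)                      # assumed 20% window at each end
    early, late = cum_alleles[:w], cum_alleles[-w:]
    initial_slope = (early[-1] - early[0]) / (len(early) - 1)
    tail_slope = (late[-1] - late[0]) / (len(late) - 1)
    alleles_half = cum_alleles[n // 2 - 1]  # richness at half the sample size
    late_fraction = (cum_alleles[-1] - alleles_half) / cum_alleles[-1]
    return initial_slope, tail_slope, late_fraction

# A curve that saturates quickly: steep start, flat tail, small late fraction
print(accumulation_stats([3, 5, 7, 8, 9, 9, 10, 10, 10, 10]))  # (2.0, 0.0, 0.1)
```

A drift-dominated population would look like this example (tail slope near 0, late fraction below 0.10); under strong NFDS the tail keeps climbing and the late fraction grows.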

6.3.5 Evolutionary Interpretation Framework

Comparing species-level and population-level statistics provides insight into evolutionary dynamics.

Pattern Interpretation
High species slopes, low population slopes Migration + local drift
High slopes at both levels Strong NFDS maintaining diversity
Low slopes at both levels Genetic drift dominates

This framework complements traditional neutrality tests by directly analyzing allele discovery dynamics.

6.3.6 Bioinformatics

  • Script: SRK_allele_accumulation_analysis.R
  • Execution:
Rscript SRK_allele_accumulation_analysis.R

6.3.6.1 Input files

  • SRK_individual_allele_genotypes.tsv
  • sampling_metadata.csv

Filtering criteria:

  • Exclude outgroups (Ingroup == 0)
  • Match individuals using SampleID
  • Retain populations with ≥ 5 individuals

6.3.6.2 Output files

  • SRK_allele_accumulation_curves.pdf
  • SRK_allele_accumulation_combined.png — single-panel combined plot showing all Element Occurrence (EO) accumulation curves on the same axes; end-of-curve labels show observed allele count and MM-predicted total (e.g. EO27 (15/26)); legend entries include obs= and MM= values
  • SRK_allele_accumulation_drift_erosion.png — stacked bar chart decomposing the allele deficit per EO into observed alleles, alleles predicted-undetected (EO MM − observed), and alleles lost to genetic drift (species MM − EO MM); quantifies the irreversible erosion of SI diversity driven by drift (60–89% of the 65-allele species optimum lost per EO)
  • SRK_allele_accumulation_stats.tsv
  • SRK_species_richness_estimates.tsv

Note: SRK_species_richness_estimates.tsv must be generated before running SRK_chisq_species_population.R, as it provides the empirical species optimum used in the frequency plots.

6.3.6.3 Scientific Applications

Allele accumulation curves provide an empirical framework for testing evolutionary hypotheses in self-incompatibility systems.

Applications include:

6.3.6.4 Detection of Balancing Selection

NFDS maintains many rare alleles, producing accumulation curves with persistent allele discovery.

6.3.6.5 Population Genetic Comparisons

Differences among populations can reveal:

  • bottlenecks
  • founder effects
  • migration patterns

6.3.6.6 Self-Incompatibility Evolution

For SRK loci, accumulation dynamics reflect:

  • maintenance of allelic diversity
  • turnover of functional incompatibility alleles
  • strength of balancing selection across populations.

6.4 Allele Frequency Analysis

6.4.1 Overview

While negative frequency-dependent selection (NFDS) maintains high allelic diversity at the SRK self-incompatibility locus, allele frequencies may still deviate from equal expectations due to demographic processes such as genetic drift, population structure, or recent allele turnover.

This analysis evaluates whether observed SRK allele frequency distributions deviate from the theoretical NFDS expectation, where all alleles should occur at approximately equal frequencies.

A key feature of this pipeline is the use of an empirically estimated species-level allele optimum (from SRK_species_richness_estimates.tsv) rather than an arbitrary fixed value. The Michaelis-Menten asymptote is used by default as it directly reflects the shape of the observed accumulation curve. This optimum sets the x-axis limit for all frequency plots and defines the “missing alleles to optimum” zone, quantifying how far each population is from carrying the full species-level SRK diversity.

The pipeline:

  • Loads the estimated species allele optimum from the accumulation analysis
  • Filters ingroup individuals only
  • Links genotype data to sampling metadata
  • Performs χ² goodness-of-fit tests at the species and population levels
  • Generates allele frequency plots with missing allele zones relative to the estimated optimum
  • Outputs a summary table of statistical tests

6.4.2 Pipeline Steps

6.4.2.1 Species Richness Optimum

At startup the script loads SRK_species_richness_estimates.tsv produced by SRK_allele_accumulation_analysis.R. The Michaelis-Menten estimate is used preferentially as the species optimum; if MM fitting failed, Chao1 is used, then the consensus mean. If the file is absent a fallback value of 50 is applied with a warning.

est_file <- "SRK_species_richness_estimates.tsv"
if (file.exists(est_file)) {
  est_df <- read.table(est_file, header = TRUE, sep = "\t")
  # Prefer MM; fall back to Chao1, then the consensus mean, if MM fitting failed
  species_optimum <- as.integer(est_df$MM_estimate[1])
  if (is.na(species_optimum)) species_optimum <- as.integer(est_df$Chao1_estimate[1])
  if (is.na(species_optimum)) species_optimum <- as.integer(est_df$Consensus_estimate[1])
} else {
  warning("Species richness estimates not found; using fallback optimum of 50")
  species_optimum <- 50
}

6.4.2.2 Input Data Integration

Three input datasets are required:

  • SRK species richness estimates (SRK_species_richness_estimates.tsv) — produced by the accumulation analysis
  • SRK protein assignment table containing allele identities per individual
  • Sampling metadata containing population information and ingroup status

Individuals classified as outgroups (Ingroup == 0) are excluded to ensure that only relevant samples are analysed.

Individual identifiers in the genotype dataset are matched with the SampleID column of the metadata table to assign population membership.

6.4.2.3 Allele Frequency Estimation

For each analysis level (species and populations), allele counts are calculated as:

\[ p_i = \frac{n_i}{\sum n_i} \]

where

  • ( n_i ) is the number of observations of allele (i)
  • ( \sum n_i ) is the total number of allele observations

Under ideal NFDS expectations, all alleles should have equal frequency:

\[ p_{expected} = \frac{1}{k} \]

where (k) is the number of alleles present.

6.4.2.4 χ² Goodness-of-Fit Test

To test for deviations from equal allele frequencies, a chi-square goodness-of-fit test is performed:

\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

where:

  • (O_i) = observed allele counts
  • (E_i) = expected allele counts assuming equal frequencies

A significant result indicates that allele frequencies are uneven, which may reflect:

  • genetic drift
  • population subdivision
  • sampling effects
  • transient demographic processes
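In the pipeline this test is run in R (chisq.test performs the same computation); the statistic is simple enough to sketch in a few lines of stdlib Python for clarity:

```python
def chisq_equal_freq(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts E_i = N/k.
    Returns the statistic and its degrees of freedom (k - 1)."""
    k = len(observed)
    expected = sum(observed) / k
    x2 = sum((o - expected) ** 2 / expected for o in observed)
    return x2, k - 1

# Hypothetical allele counts for 3 alleles; expected = 20 each
print(chisq_equal_freq([30, 10, 20]))  # (10.0, 2)
```

The resulting statistic is then compared against the chi-square distribution with k − 1 degrees of freedom to obtain the p-value reported in SRK_chisq_species_population.tsv.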

6.4.2.5 Population-Level Filtering

Population-level analyses are restricted to populations with:

  • at least 5 individuals

This filtering avoids unstable statistical estimates caused by very small sample sizes.

6.4.2.6 Allele Frequency Visualization

For each analysis level, the pipeline produces a frequency distribution plot including:

  • Observed allele frequencies (black line, rank-ordered)
  • NFDS expectation (1/k) (blue dashed line)
  • LOESS smoothed trend of the observed distribution (red line)
  • Missing alleles zone (tomato-shaded rectangle) — the gap between the last observed allele and the estimated species optimum, labelled with the count of missing alleles

The x-axis spans from 1 to species_optimum (the MM-estimated total), so all plots share the same scale and deficits are directly comparable across populations.

These visualizations allow rapid inspection of:

  • dominant alleles
  • rare allele tails
  • deviations from theoretical NFDS equilibrium
  • how many alleles each population is missing relative to the species pool.

6.4.3 Output Files

6.4.3.1 Statistical Results

SRK_chisq_species_population.tsv

This table summarizes the χ² test statistics at the species and population levels.

Column Description
N_individuals Number of individuals analyzed
N_alleles Number of SRK alleles detected
X2 Chi-square statistic
df Degrees of freedom
p_value Significance level
Level Analysis level (Species or Population)
Population Population identifier

Example output:

N_individuals N_alleles X2 df p_value Level Population
97 73 619.34 72 5.37e-88 Species All
30 24 321.72 23 1.82e-54 Population 25
26 21 153.11 20 1.59e-22 Population 76

6.4.3.2 Frequency Plots

SRK_chisq_species_population_frequency_plots.pdf

The PDF file contains allele frequency plots for each analysis level, allowing visual comparison between observed allele distributions and NFDS expectations.

6.4.4 Biological Interpretation

Results can be interpreted using the following guidelines:

6.4.4.1 Non-significant deviation

If the chi-square test is not significant:

A chi-square test revealed no significant deviation from equal allele frequencies (χ² = X, df = Y, p = Z), consistent with negative frequency-dependent selection maintaining SRK allele diversity.

6.4.4.2 Significant deviation

If the chi-square test is significant:

Allele frequencies deviated significantly from equal expectations (χ² = X, df = Y, p = Z), suggesting that additional evolutionary or demographic processes may influence SRK allele frequencies.

Possible explanations include:

  • population bottlenecks
  • local drift
  • unequal migration rates
  • recent allele turnover

6.4.5 Bioinformatics

  • Script: SRK_chisq_species_population.R
  • Run command:
Rscript SRK_chisq_species_population.R

6.4.5.1 Input Files

  • SRK_species_richness_estimates.tsv — produced by SRK_allele_accumulation_analysis.R (run first)
  • SRK_individual_allele_genotypes.tsv — allele count matrix (individuals × allele bins, values = copy counts)
  • sampling_metadata.csv

6.4.5.2 Output Files

  • SRK_chisq_species_population.tsv
  • SRK_chisq_species_population_frequency_plots.pdf

6.4.5.3 R Script

############################################################
# SRK allele frequency χ² tests
# Uses allele count matrix for proper copy-number frequencies
############################################################

cat("\nStarting SRK chi-square analysis\n")

############################################################
# 1. Load files
############################################################

geno <- read.table(
  "SRK_individual_allele_genotypes.tsv",
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

meta_df <- read.csv(
  "sampling_metadata.csv",
  stringsAsFactors = FALSE
)

############################################################
# 2. Keep ingroup only
############################################################

meta_df <- meta_df[meta_df$Ingroup == 1, ]

geno <- geno[geno$Individual %in% meta_df$SampleID, ]

geno$Population <- meta_df$Pop[match(geno$Individual, meta_df$SampleID)]

allele_cols <- setdiff(colnames(geno), c("Individual", "Population"))

geno[, allele_cols] <- lapply(geno[, allele_cols], as.numeric)

cat("Individuals analysed:", nrow(geno), "\n")
cat("Allele bins:", length(allele_cols), "\n")

6.5 TP1 Tipping Point Analysis

Requires outputs from Steps 13 and 14. Run after both have completed.

Script: SRK_TP1_tipping_point.R

Command:

Rscript SRK_TP1_tipping_point.R

Inputs:

  • SRK_population_genetic_summary.tsv — from Step 13
  • SRK_species_richness_estimates.tsv — from Step 14

6.5.1 What TP1 measures — health of the SI system

TP1 assesses the health of the SI system at the population level: the degree to which the SRK allele pool within an EO is still capable of sustaining compatible mating. Two complementary questions define the two axes:

  • How many different alleles has a population retained? (x-axis) — A population holding all alleles can offer each individual a large pool of compatible partners; as alleles are lost, compatible pairings become progressively rarer.
  • How evenly are the remaining alleles distributed across individuals? (y-axis) — Even if some alleles survive, a system where one allele dominates is functionally impoverished: most individuals share that allele and cannot mate with each other.

Together, the two axes diagnose whether SI system health has degraded beyond the point where within-population crossing alone can restore it.

Key metrics per EO:

  • prop_optimum = N_alleles / MM_estimate — proportion of the species-level S-allele pool retained, using the Michaelis-Menten asymptote as the species optimum. Answers: how many different alleles does this EO still have?
  • evenness = Ne / N_alleles — frequency evenness index. Ne (effective allele number) = 1/Σpᵢ², where pᵢ is the frequency of allele i; it answers how many equally frequent alleles would produce the same level of diversity as observed. When all alleles are at equal frequency (the NFDS ideal), Ne = N and the ratio = 1.0. Genetic drift pushes the ratio downward by creating dominant alleles and marginalising rare ones. An EO with evenness = 0.50, for example, has allele frequencies so uneven that only half the observed alleles are effectively contributing to SI function — the remainder are present in such low copy numbers that they are rarely expressed in crosses and are at elevated risk of loss through drift. Answers: how evenly are the surviving alleles distributed?

Interpreting the two axes — bottleneck vs. ongoing drift

The two metrics capture different aspects of demographic history. prop_optimum (allele richness relative to the species optimum) is the primary footprint of a historical bottleneck or founder event: the rapid, simultaneous elimination of rare alleles when population size collapsed. evenness (Ne/N) reflects ongoing drift in small populations: the progressive skewing of surviving allele frequencies as dominant alleles accumulate across generations. An EO that experienced only a historical bottleneck — with no subsequent drift — would show low richness but moderate evenness (founders carry whatever they carried, at roughly similar initial frequencies). An EO experiencing only ongoing drift without a prior bottleneck would retain more alleles but show declining evenness over time.

EOs that score poorly on both axes simultaneously — as observed across all five LEPA EOs — are most consistent with a historical bottleneck (which reduced richness) followed by prolonged ongoing drift (which subsequently eroded evenness). The allele composition analysis (Step 17) adds a further signal: if all EOs were founded from a single shared bottlenecked pool, they would be missing the same alleles. Predominantly private allele sets across EOs instead point to independent drift in isolated populations. Note that the SRK locus, being under balancing selection, cannot be used with standard neutral-marker bottleneck tests (e.g., BOTTLENECK software, M-ratio); formally distinguishing the two processes requires genome-wide neutral marker data and demographic modelling.

TP1 thresholds (adjustable at top of script):

Parameter Default Criterion
TP1_PROP_OPTIMUM 0.50 EO retains < 50% of species optimum
TP1_EVENNESS 0.80 Ne/N_alleles below 0.80

EOs breaching both criteria are flagged CRITICAL; those breaching either one are flagged AT RISK; the remainder are OK.
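The classification logic reduces to a few lines. A sketch in Python (the script itself is in R; the function name is ours, and the defaults mirror TP1_PROP_OPTIMUM and TP1_EVENNESS):

```python
def tp1_status(n_alleles, mm_estimate, freqs,
               prop_thresh=0.50, even_thresh=0.80):
    """Flag an EO by the two TP1 axes: richness retained vs. frequency evenness."""
    prop_optimum = n_alleles / mm_estimate
    ne = 1 / sum(p * p for p in freqs)      # effective allele number, 1 / sum(p_i^2)
    evenness = ne / n_alleles
    breaches = (prop_optimum < prop_thresh) + (evenness < even_thresh)
    return {0: "OK", 1: "AT RISK", 2: "CRITICAL"}[breaches]

# 10 equally frequent alleles out of an MM optimum of 12: healthy on both axes
print(tp1_status(10, 12, [0.1] * 10))  # OK
```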

Outputs:

  • SRK_TP1_summary.tsv — per-EO N_alleles, prop_optimum, evenness, and TP1 status
  • SRK_TP1_tipping_point.png — scatter plot positioning each EO by richness retained (x) and frequency evenness (y), with colour-coded zone polygons for all three categories (OK, AT RISK, CRITICAL) and threshold lines
  • SRK_TP1_tipping_point_blank.png — same plot without data points, for presentation use


6.6 Allele Composition Comparison Across Element Occurrences

6.6.1 Overview

This step quantifies how S-allele sets are partitioned and shared across Element Occurrences (EOs). While allele accumulation curves (Step 14) and frequency analyses (Step 15) characterise diversity within and relative to the species optimum, this analysis asks a complementary question: which alleles are exclusive to a single EO, which are shared between subsets, and which are ubiquitous? The answer directly informs cross-population breeding decisions — EOs that share few alleles are the most important exchange partners for increasing allele richness in the focal population.

Two complementary visualisations are produced:

  1. UpSet plot — displays all 2ⁿ−1 non-empty intersection sizes for n EOs, sorted by allele count. Single-EO bars (private alleles) are colour-coded per EO; multi-EO intersections are shown in grey with dot-matrix membership indicators below.
  2. Pairwise sharing heatmap — a symmetric n × n matrix showing the raw count of alleles present in both EOs of each pair. Diagonal values equal each EO’s total allele richness.

6.6.2 Scientific Rationale

Under negative frequency-dependent selection (NFDS), S-alleles are maintained at approximate frequency equality and are expected to be widely distributed across populations over evolutionary time. Conversely, genetic drift and habitat fragmentation lead to allele loss and population-private allele compositions — each remnant population retains only a subset of the ancestral species repertoire, with little overlap between populations. The UpSet and heatmap visualisations make this partitioning explicit:

  • Many private alleles per EO: indicates strong historical isolation and drift; each EO carries unique diversity that cannot be recovered from any other.
  • Very few alleles shared across all EOs: confirms that the species-wide allele pool is distributed rather than redundant — any single EO is an inadequate proxy for species diversity.
  • Low pairwise sharing: identifies which EO pairs are most compositionally distinct, making them the highest-priority partners for managed cross-population transfers.
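The two partitioning questions above reduce to set operations on per-EO allele sets. A simplified Python sketch of the core of what SRK_allele_sharing_EOs.py computes (function names and the example allele sets are hypothetical):

```python
from itertools import combinations

def pairwise_sharing(eo_allele_sets):
    """Raw count of alleles present in both EOs of each pair
    (the off-diagonal cells of the sharing heatmap)."""
    return {(a, b): len(eo_allele_sets[a] & eo_allele_sets[b])
            for a, b in combinations(sorted(eo_allele_sets), 2)}

def private_alleles(eo_allele_sets):
    """Alleles exclusive to a single EO (the single-EO bars of the UpSet plot)."""
    return {eo: s - set().union(*(t for e, t in eo_allele_sets.items() if e != eo))
            for eo, s in eo_allele_sets.items()}

# Hypothetical allele sets per EO
sets = {"EO25": {"S1", "S2", "S3"}, "EO27": {"S2", "S4"}, "EO76": {"S5"}}
print(pairwise_sharing(sets))  # {('EO25', 'EO27'): 1, ('EO25', 'EO76'): 0, ('EO27', 'EO76'): 0}
```

In this toy example EO76 shares nothing with either partner, so it would be the highest-priority candidate for reciprocal allele exchange.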

6.6.3 Bioinformatics

  • Script: SRK_allele_sharing_EOs.py
  • Environment: polyploid-model (requires only pandas, numpy, matplotlib)
  • Execution:
# Default: reads SRK_individual_allele_genotypes.tsv and sampling_metadata.csv from current directory
python SRK_allele_sharing_EOs.py

# Custom paths / population selection
python SRK_allele_sharing_EOs.py \
    --genotypes SRK_individual_allele_genotypes.tsv \
    --metadata sampling_metadata.csv \
    --pops 25 27 67 70 76 \
    --min-samples 5 \
    --outdir figures/

6.6.3.1 Key Parameters

Parameter Default Description
--genotypes SRK_individual_allele_genotypes.tsv Allele count matrix from Step 11
--metadata sampling_metadata.csv Sampling metadata with Pop and Ingroup columns
--pops 25 27 67 70 76 EO population IDs to include (space-separated)
--min-samples 5 Minimum individuals per EO
--outdir . Output directory for figures

6.6.3.2 Input Files

  • SRK_individual_allele_genotypes.tsv — allele count matrix (individuals × allele bins), from Step 11
  • sampling_metadata.csv — must contain Pop and Ingroup columns; Pop is matched as string

6.6.3.3 Output Files

  • SRK_allele_upset_EOs.pdf — UpSet plot of all pairwise and higher-order allele set intersections
  • SRK_allele_sharing_heatmap_EOs.pdf — symmetric pairwise sharing heatmap

6.6.3.4 Scientific Applications

  • Identify private alleles unique to one EO — these are irreplaceable and must be prioritised in rescue crosses
  • Identify alleles shared by all EOs — likely the most abundant, frequency-skewed alleles (consistent with NFDS erosion of rare alleles)
  • Guide cross-population transfer priorities: EO pairs with the lowest pairwise sharing offer the greatest complementarity for restoring allele richness

6.7 Individual Genotypic Fitness Score

6.7.1 Overview

This analysis introduces the Genotypic Fitness Score (GFS) — a per-individual metric that quantifies reproductive potential under the tetraploid SI system. GFS is defined as the proportion of diploid gametes that carry two distinct alleles (heterozygous gametes), computed directly from allele copy numbers at the SRK locus.

The key motivation is that two individuals carrying the same number of unique alleles can have very different seed production potential depending on the dosage balance of those alleles. For example:

  • AABB (two alleles, 2 copies each): produces 4 of 6 possible gamete types as heterozygous → GFS = 0.667
  • AAAB (two alleles, 3+1 copies): produces only 3 of 6 gamete types as heterozygous → GFS = 0.500

This distinction is invisible to simple zygosity classification (both are “2-allele heterozygotes”) but has direct implications for managed crossing and seed orchard design.

6.7.2 Biological Rationale

In a tetraploid, gametes are formed by sampling 2 alleles from the 4 copies at the SRK locus (C(4,2) = 6 equally likely combinations). The GFS captures the fraction of these gametes that are heterozygous — i.e., carry two different alleles. Heterozygous gametes maximise:

  1. Compatibility range: offspring receiving two distinct alleles are less likely to be incompatible with any given mate
  2. Allele diversity contribution: heterozygous gametes pass both alleles to the next generation, preventing allele loss through dosage asymmetry
  3. NFDS benefit: diverse offspring have higher reproductive success under negative frequency-dependent selection

6.7.3 Formula

\[\text{GFS}_i = 1 - \frac{\displaystyle\sum_k n_k\,(n_k - 1)}{12}\]

where \(n_k\) is the copy number of allele \(k\) in individual \(i\), summed over all alleles present, and \(12 = 2 \times \binom{4}{2}\) normalises to the full gamete space.

6.7.4 Genotype Class Mapping

Genotype Copy-number profile GFS Heterozygous gametes
ABCD 1, 1, 1, 1 1.000 6 / 6
AABC 2, 1, 1 0.833 5 / 6
AABB 2, 2 0.667 4 / 6
AAAB 3, 1 0.500 3 / 6
AAAA 4 0.000 0 / 6
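The table above can be reproduced by enumerating the C(4,2) gametes directly, which also serves as a check on the closed-form formula. A stdlib Python sketch (the script itself is SRK_individual_GFS.R; the function name is ours):

```python
from itertools import combinations

def gfs(alleles):
    """GFS for a tetraploid genotype given as a list of 4 allele labels:
    the fraction of the 6 possible diploid gametes carrying two distinct alleles."""
    gametes = list(combinations(alleles, 2))   # C(4,2) = 6 gametes (by position)
    het = sum(a != b for a, b in gametes)
    return het / len(gametes)

for g in ["ABCD", "AABC", "AABB", "AAAB", "AAAA"]:
    print(g, round(gfs(list(g)), 3))  # 1.0, 0.833, 0.667, 0.5, 0.0
```

Because combinations are taken over the four positional copies, identical labels in different copies are counted separately, exactly as the formula's n_k(n_k − 1) term requires.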

6.7.5 Bioinformatics

  • Script: SRK_individual_GFS.R
  • Environment: base R + tidyverse + ggplot2; ggrepel optional (for labelled scatter plot)
  • Execution (run from project root):
Rscript SRK_individual_GFS.R

6.7.5.1 Input Files

File Description
modeling/data/assigned_genotypes.tsv Per-individual tetraploid genotype assignments (4 allele columns + method)
modeling/data/salleles/sampling_metadata.csv Population metadata; must contain SampleID, Pop, and Ingroup columns

6.7.5.2 Key Parameters (top of script)

Parameter Default Description
EO_MAP 5-EO lookup vector Maps sub-population labels to EO identifiers
TP2_MEAN_GFS 0.667 Mean GFS threshold for TP2 breach
TP2_PROP_AAAA 0.30 Proportion AAAA threshold for TP2 breach

6.7.5.3 Output Files

File Content
SRK_individual_GFS.tsv Per-individual GFS, genotype class, EO, and method

6.7.5.4 Diagnostic Plots

  1. Proportional stacked bar — genotype class composition per EO, with 30% AAAA threshold marked
  2. Individual GFS jitter + mean crossbar — per-individual scores with AABB (0.667) and AAAB (0.500) threshold lines

6.8 TP2 Tipping Point Analysis

Requires SRK_individual_GFS.tsv from Step 18. Run after Step 18 has completed.

6.8.1 Overview

The TP2 Tipping Point Analysis assesses reproductive fitness at the population level — the degree to which individuals within an EO can actually participate in compatible crosses and contribute allelic diversity to the next generation. While Step 18 characterises each individual’s gametic diversity potential, this step synthesises those scores at the EO level and combines them with AAAA prevalence into a single diagnostic framework. Two complementary questions define the two axes:

  • What fraction of individuals are reproductive dead-ends? (x-axis: prop_AAAA) — An AAAA individual cannot contribute to any compatible cross regardless of its mate, making it a complete loss to the breeding pool. As this fraction grows, the effective breeding population shrinks accordingly.
  • How much reproductive capacity does the average individual still have? (y-axis: mean GFS) — Even individuals that are not dead-ends vary in how much allelic diversity they can contribute per cross. When the population mean falls below the AABB benchmark (GFS = 0.667), even the functional breeders are producing predominantly homotypic gametes.

Together, the two axes establish whether reproductive fitness has degraded beyond the point where within-EO crossing alone can recover it.

6.8.2 Tipping Point 2 — Thresholds

Two threshold conditions are evaluated simultaneously for each EO:

Threshold Criterion Biological meaning
TP2-AAAA prop(AAAA) > 0.30 More than 30% of individuals are reproductive dead-ends
TP2-mean mean GFS < 0.667 The average individual has less reproductive capacity than an AABB genotype

An EO is flagged CRITICAL when both thresholds are breached simultaneously; AT RISK when either one is breached; OK otherwise.
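The flagging rule is a direct two-criterion test. A minimal Python sketch (the analysis itself runs inside SRK_individual_GFS.R; the function name is ours, and the defaults mirror TP2_PROP_AAAA and TP2_MEAN_GFS):

```python
def tp2_status(prop_aaaa, mean_gfs, aaaa_thresh=0.30, gfs_thresh=0.667):
    """Flag an EO by the two TP2 criteria: dead-end prevalence and mean capacity."""
    breaches = (prop_aaaa > aaaa_thresh) + (mean_gfs < gfs_thresh)
    return {0: "OK", 1: "AT RISK", 2: "CRITICAL"}[breaches]

# 40% reproductive dead-ends and a sub-AABB population mean: both criteria breached
print(tp2_status(0.40, 0.52))  # CRITICAL
```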

6.8.3 Bioinformatics

The TP2 analysis is performed by the same script as Step 18 (SRK_individual_GFS.R). The EO-level summary and TP2 status flags are computed after per-individual GFS values are aggregated.

6.8.3.1 Output Files

File Content
SRK_EO_GFS_summary.tsv EO-level mean GFS, class proportions, and TP2 status flags
SRK_GFS_plots.pdf Four diagnostic plots (see below)
SRK_GFS_plots_p3_TP2_tipping_point.png Standalone TP2 tipping point map with colour-coded zone polygons for all three categories (OK, AT RISK, CRITICAL)
SRK_GFS_plots_p3_TP2_tipping_point_blank.png Same plot without data points, for presentation use

6.8.3.2 Diagnostic Plots

  1. TP2 tipping point map — scatter plot of prop(AAAA) × mean GFS with colour-coded zone polygons for OK, AT RISK, and CRITICAL categories; EOs labelled by TP2 status; also saved as standalone PNG
  2. Absolute count stacked bar — raw individual counts per genotype class per EO

6.8.4 Scientific Applications

  • EO prioritisation: EOs in the CRITICAL quadrant (high prop_AAAA AND low mean_GFS) need both allele importation (TP1 intervention) and targeted within-EO crossing to rebuild AABB/AABC/ABCD individuals
  • Seed parent selection: rank individuals within each EO by GFS; AABC and AABB individuals are priority seed parents over AAAB and AAAA
  • Crossing design: pairing two AAAB individuals carrying different allele pairs can produce AABB or AABC offspring — GFS identifies which pairs are worth prioritising
  • Monitoring: GFS can be tracked across generations to evaluate whether managed crossing is effectively redistributing allele dosage toward AABB/AABC/ABCD classes

6.9 Reproductive Effort Support per Element Occurrence

Requires SRK_individual_GFS.tsv from Step 18. Run after Step 18 has completed.

6.9.1 Overview

This analysis visualises the proportion of individuals at each GFS tier per Element Occurrence, making explicit the fraction of each population that actively contributes allelic diversity to compatible crosses. In self-incompatible plants, an individual’s value as a breeding partner depends not only on which key/lock types it carries, but also on the number of distinct alleles present across its genome copies. Individuals with more key/lock types can participate in more compatible crosses, making them especially valuable in a managed breeding program. The figure translates this principle into a population-level diagnostic: for each EO, it shows what fraction of individuals can “support reproductive effort” (GFS > 0 — carrying at least two distinct alleles and producing at least one heterozygous gamete combination) and how those individuals are distributed across GFS tiers.

6.9.2 Biological Rationale

AAAA individuals (GFS = 0) carry a single SRK allele across all four genome copies. All six of their diploid gamete combinations are homozygous for that allele, meaning they contribute no allelic diversity to crosses and cannot drive compatible pairings with any partner carrying the same allele. All other genotype classes (AAAB through ABCD) carry at least two distinct alleles and produce at least 3 of 6 heterozygous gametes — they “support reproductive effort” in the sense that their gametes diversify the allele combinations available to offspring.

The proportion of supporting individuals per EO, combined with the GFS tier distribution within that fraction, provides a direct measure of how much reproductive capacity the population retains.
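The per-EO summary behind the figure's annotations can be sketched in a few lines of Python (SRK_GFS_reproductive_effort.R does this in R; the function name is ours):

```python
def reproductive_effort(gfs_scores):
    """Per-EO summary: the fraction of individuals with GFS > 0
    (supporting reproductive effort) and the EO mean GFS."""
    n = len(gfs_scores)
    supporting = sum(g > 0 for g in gfs_scores)
    return {"n_supporting": supporting, "n_total": n,
            "prop_supporting": supporting / n,
            "mean_gfs": sum(gfs_scores) / n}

# Hypothetical EO: two AAAA dead-ends, one AAAB, one AABB, one ABCD
print(reproductive_effort([0.0, 0.0, 0.5, 0.667, 1.0]))
```

Here 3 of 5 individuals (60%) support reproductive effort, and the AAAA fraction (40%) would breach the TP2 threshold marked on the figure.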

6.9.3 Bioinformatics

  • Script: SRK_GFS_reproductive_effort.R
  • Environment: base R + tidyverse + ggplot2
  • Execution (run from project root):
Rscript SRK_GFS_reproductive_effort.R

6.9.3.1 Input Files

File Description
SRK_individual_GFS.tsv Per-individual GFS scores and genotype classes (output of Step 18)

6.9.3.2 Key Parameters (top of script)

Parameter Default Description
EO_FOCUS 5-EO vector Element Occurrences to include
TP2_PROP_AAAA 0.30 Threshold line position; matches TP2 threshold from Step 19

6.9.3.3 Output Files

File Content
SRK_GFS_reproductive_effort.pdf Horizontal proportional bar chart — GFS tier composition per EO, ordered by mean GFS, with reproductive effort annotations
SRK_GFS_reproductive_effort.png Same figure as PNG at 200 dpi

6.9.3.4 Figure Description

Each horizontal bar represents one EO. Segments are coloured by GFS class (red = AAAA through blue = ABCD), stacked left to right from lowest to highest GFS. The dashed vertical line marks the TP2 AAAA threshold (30%): bars where the red (AAAA) segment extends past this line indicate a TP2 breach. Annotations to the right of each bar show the proportion of supporting individuals (GFS > 0), the count (n / N), and the mean GFS for that EO. EOs are ordered from top (lowest mean GFS, most degraded) to bottom (highest mean GFS).

6.9.4 Scientific Applications

  • Conservation triage: the proportion of supporting individuals per EO directly quantifies how much of the population is functional for managed crossing
  • Seed parent pool size: the annotated n / N count identifies how many individuals are candidates for inclusion in a crossing programme
  • Cross-EO comparison: ordering by mean GFS highlights which EOs are most degraded and therefore highest priority for allele importation combined with GFS-targeted crossing
  • Communication: the figure is designed for presentation to conservation managers, translating the GFS metric into an intuitive “proportion supporting reproductive effort” framing