The article analyzes 625 metagenomic samples from the Yangtze River Basin: 606 curated from previous studies and 19 newly collected from the Chenhang Reservoir, which together give broad spatial coverage. The field samples underwent quality control, DNA extraction, and metagenomic sequencing on the Illumina NovaSeq platform.
To identify potential pathogens, a curated Pathogen Genome-Specific Markers (GSM) database was constructed from multiple authoritative sources, yielding 723,000 pathogen-specific markers. Although the GSM database is specific for known pathogens, it has limitations: some species may be underrepresented, and because it relies on markers from known pathogens, it cannot discover novel ones.
After quality filtering, 586 of the 625 samples were retained for further analysis. Clean reads were aligned against the GSM database, and the frequency of pathogen marker hits in each sample was recorded and normalized so that samples could be compared on a common scale.
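The summary does not say which normalization the authors applied, so the following is only a minimal sketch of one common choice: scaling marker hits to hits per million clean reads, which removes the effect of sequencing depth. The function name `normalize_hits`, the pathogen labels, and the read counts are hypothetical and are not taken from the article.

```python
from typing import Dict


def normalize_hits(
    marker_hits: Dict[str, int],
    total_clean_reads: int,
    scale: float = 1_000_000,
) -> Dict[str, float]:
    """Convert raw marker hit counts into hits per `scale` clean reads.

    Assumption: depth normalization (hits per million reads). The article's
    actual normalization method is not specified in this summary.
    """
    if total_clean_reads <= 0:
        raise ValueError("total_clean_reads must be positive")
    return {
        pathogen: hits * scale / total_clean_reads
        for pathogen, hits in marker_hits.items()
    }


# Two hypothetical samples sequenced at different depths.
sample_a = normalize_hits({"pathogen_X": 40}, total_clean_reads=2_000_000)
sample_b = normalize_hits({"pathogen_X": 10}, total_clean_reads=500_000)

# Raw counts differ (40 vs 10), but the normalized frequencies agree.
print(sample_a["pathogen_X"], sample_b["pathogen_X"])
```

In this illustration the raw counts differ fourfold only because one sample was sequenced four times deeper; after scaling, both report the same relative frequency, which is the kind of cross-sample comparability the normalization step is meant to provide.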
