Fig. 2. HiFi-based assembly string graph of the CHM13 genome. (A) Bandage (36) string graph visualization, where nodes represent unambiguously assembled sequences colored by source chromosome and scaled by length. Edges correspond to the overlaps between node sequences, due to either repeats or true adjacencies in the underlying genome. Centromeric satellite sequences are the source of most ambiguity in the graph (gray highlights). The graph is partially fragmented due to HiFi coverage dropout surrounding GA-rich sequence (black triangles). Local graph structures are enlarged with insets. The correct graph walks through these complex structures were resolved and confirmed with ultra-long ONT reads. (B) The identified graph traversal for the 2p11 locus is given by numerical order. Based on a depth-of-coverage analysis, the unlabeled light gray node represents an artifact or possible heterozygous variant and was not used. (C) The multi-megabase tandem HSat3 duplication (9qh+) at 9q12 requires two traversals of the large loop structure (note that the size of the loop is exaggerated because graph edges are of constant size). The first traversal of the loop is given in dark purple and the second in light purple. Nodes used by both traversals are also in dark purple and typically have twice the sequencing coverage. (D) The telomeric ends of four acrocentric p-arms form an ambiguous graph structure due to the highly similar sequence shared between all four chromosomes, specifically within the distal junction (DJ) sequence adjacent to the rDNA arrays, which themselves form dense, but separate, clusters of small nodes.

bioRxiv preprint doi: https://doi.org/10.1101/2021.05.26.445798; this version posted May 27, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Assembly validation and polishing

The first step of genome assembly validation is to confirm that the constructed assembly is consistent with the data used to generate it (37). To evaluate concordance between the reads and the assembly we mapped all available primary data, including HiFi, ONT, ILMN, Strand-seq, and Hi-C, to the v0.9 draft assembly using Winnowmap2 (38) for long reads and BWA (39) for short reads (Note S5). Structural variants were identified with Sniffles (40), and small variants were called with DeepVariant (41) for ILMN and HiFi and PEPPER-DeepVariant for ONT (42). Small variants were further filtered using Merfin (43) to exclude any corrections that were not supported by the underlying ILMN and HiFi reads. After manual curation to differentiate true errors from heterozygous variants and mapping artifacts, a total of 4 large variants and 993 small variants were corrected, 52% of which were small indels within homopolymers. An additional 44 large and 3,901 small heterozygous variants were cataloged during curation (44), including the hemizygous insertion of an hTERT vector on Chromosome 21, consistent with the immortalization process used to create the CHM13 line (this insertion was excluded from the final assembly). The assembled sizes of major repeat arrays were consistent with ddPCR copy-number estimates for those tested (Tables S1-2, Fig. S7), and both Strand-seq (Figs. S8-9) and Hi-C (Fig. S10) data were concordant with the overall structure of the assembly, showing no signs of misorientations or other large-scale structural errors. In addition, the assembly correctly resolved 644 of 647 previously sequenced CHM13 BACs at >99.99% identity, with the three unresolved BACs appearing to be errors in the BACs themselves rather than the T2T assembly (Figs. S11-14).

The entire validation process was then repeated on the polished assembly, and investigation of the remaining variant calls revealed additional base-calling errors within some telomeric [TTAGGG]n repeats. These putative errors were primarily a result of decreased coverage by both HiFi and ONT technologies towards the telomeres and not flagged by the initial variant-calling pipeline due to a telomere-associated strand bias in the ONT data. Telomeric ONT reads were only found oriented in the direction of the chromosome end and never away from it, which led to low-confidence variant calls and omission from polishing. After further curation and adjustment of the variant calling strategy (Note S4), an additional 454 corrections were made to the telomeres using PEPPER (42), followed by addition of the rDNA arrays as described below, resulting in a gapless CHM13v1.1 assembly, the first telomere-to-telomere representation of a human genome.

Mapped sequencing read depth across the final assembly shows uniform coverage across all chromosomes (Fig. 3A), with 99.86% of the assembly within three standard deviations of the mean coverage for both HiFi and ONT (HiFi coverage 34.70 ± 7.03, ONT coverage 116.16 ± 16.96, excluding the mitochondrial genome). Ignoring the 10 Mbp of rDNA sequence, where most of the coverage deviation resides, 99.99% of the assembly is within three standard deviations (Note S5). This is consistent with uniform coverage of the genome and confirms both the overall accuracy of the assembly and the absence of aneuploidy in the sequenced CHM13 cells. Copy-number concordance with raw ILMN and HiFi data also increased with successive versions of the assembly (Figs. S15-16). Local coverage anomalies were, however, observed across multiple satellite arrays (Table S3, Note S6).
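The three-standard-deviation coverage summary quoted above can be illustrated with a short Python sketch. This is not the procedure from Note S5: the function name `fraction_within_k_sd`, the use of per-window depths as input, and the choice of a population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fraction_within_k_sd(depths, k=3.0):
    """Return the fraction of windows whose depth lies within k standard
    deviations of the mean depth.

    `depths` is a hypothetical list of per-window read depths; how the
    windows are defined is not specified here and is an assumption.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    mu = mean(depths)
    sd = pstdev(depths)  # population SD; a sample SD would also be defensible
    lo, hi = mu - k * sd, mu + k * sd
    inside = sum(1 for d in depths if lo <= d <= hi)
    return inside / len(depths)


# Toy data: 99 windows at 35x depth plus one strongly elevated window.
toy_depths = [35.0] * 99 + [200.0]
toy_fraction = fraction_within_k_sd(toy_depths)
print(f"{toy_fraction:.2%} of windows within 3 SD of the mean")
```

On real data the outlying windows would then be intersected with annotations (for example rDNA or satellite arrays) to see where the deviation concentrates, which is the kind of breakdown the text reports.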
Given the uniformity of coverage increases and decreases across these arrays, association with specific satellite classes, and the sometimes opposite effect observed for HiFi and ONT, we hypothesize that these anomalies are related to systematic biases introduced during either sample preparation (e.g., shearing bias) or sequencing (e.g., polymerase kinetics), rather than assembly error (Note S6). For example, HiFi coverage is consistently elevated across HSat2 and HSat3 arrays, while ONT coverage remains normal but with an apparent strand bias and reduced read lengths for HSat2 (Fig. 3B-C, Figs. S17-S21). On the other hand, both HiFi and ONT coverage is depleted across the AT-rich HSat1 arrays, with ONT reads also showing shorter read lengths (Fig. 3D, Figs. S17-18, Table S3). While the specific mechanisms require further investigation, prior studies have noted similar biases within certain satellite arrays and sequence contexts for both ONT and HiFi (45,46).

Due to the challenge of assembling them correctly, we performed targeted validation of all large satellite arrays and segmental duplications (Note S7). For centromeric alpha satellite arrays, we used the TandemTools package (47) to catalog additional variants that were missed by the standard approach. TandemTools was used throughout the process to guide development of the assembly method, and analysis of the final assembly shows high accuracy across all centromeric arrays (Fig. S22, Table S4). Independent ILMN-based copy number estimates of alpha satellite higher-order repeats (HOR) also correlate strongly with the assembly (Fig. S23).

The beta satellite (BSat) and HSat arrays were separately validated by measuring the frequency of secondary variants identified by HiFi read mappings using a technique previously developed to identify collapsed segmental duplications (48). Because CHM13 is mostly homozygous, we expect to find very few heterozygous variants when mapping the raw reads back to the assembly and any variant clusters would indicate potential mis-assembly. This analysis shows consistent coverage across all satellite arrays, with only a handful of potential variants flagged (Fig. S24). A companion study (26) used this same approach to validate segmentally duplicated regions of the genome, along with an analysis of copy number variation compared to a collection of diverse human genomes, demonstrating that T2T-CHM13 represents these complex regions better than GRCh38.

In addition to high structural accuracy, we estimate the average consensus accuracy of the assembly to be between Phred Q67 and Q73 (Note S5), which is equivalent to 1 error per 10 Mbp and far exceeds the original Q40 definition of “finished” sequence (49). However, this represents an average across the entire genome and some regions are expected to be higher quality than others. In particular, regions of low HiFi coverage were found to be associated with an enrichment of potential consensus errors, as estimated from both HiFi and ILMN data (44). To guide future use of the assembly, we provide a curated list of all low-coverage and known heterozygous sites identified by the above validation procedures (Note S5). The total number of bases covered by potential issues in the T2T-CHM13 assembly is just 0.3% of the total assembly length compared to 8% for GRCh38 (Fig. 3A), making T2T-CHM13 a more complete, accurate, and representative reference sequence for both short- and long-read variant calling across human samples of all ancestries (50).
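The Phred figures quoted above follow from the standard definition Q = -10 log10(p), where p is the per-base error probability. A minimal Python sketch of the conversion is given here; the helper names are hypothetical and the snippet only restates the arithmetic, not the estimation method of Note S5.

```python
import math


def phred_to_error_rate(q):
    """Per-base error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def bases_per_error(q):
    """Expected number of bases per single error at Phred score q."""
    return 1 / phred_to_error_rate(q)


def error_rate_to_phred(p):
    """Phred quality score implied by a per-base error probability p."""
    return -10 * math.log10(p)


# Q40 ("finished" sequence), the quoted Q67-Q73 range, and its midpoint Q70.
for q in (40, 67, 70, 73):
    print(f"Q{q}: about one error per {bases_per_error(q):,.0f} bases")
```

Under this definition Q70 corresponds to one error per 10 Mbp, with Q67 and Q73 bracketing it at roughly one error per 5 Mbp and one per 20 Mbp respectively, while Q40 corresponds to one error per 10 kbp.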
Compared to GRCh38, T2T-CHM13 reduces false negative variant calls by adding 182 Mbp of novel sequence and removing 1.2 Mbp of falsely duplicated sequence, while simultaneously reducing false positive variant calls by fixing collapsed segmental duplications and other errors, affecting a total of at least 388 genes (68 protein coding) in GRCh38. Lastly, the T2T-CHM13 haplotype structure and SNP density is much more consistent than the mosaic GRCh38 when calling variants. A full comparison of GRCh38 versus CHM13 as a reference for variant calling is provided by Aganezov et al. (50), and a discussion of validation and polishing strategies for T2T genome assemblies by McCartney et al. (44).

Fig. 3. Sequencing coverage and assembly validation. Both HiFi and ONT sequencing reads mapped to the assembly show uniform coverage (cov.) across the whole genome as visualized by IGV (51), with the exception of certain human satellite repeat classes. Coverage deviations in these regions were found to be caused by sequencing biases associated with specific repeats rather than misassembly. (A) Whole-genome coverage of HiFi and ONT reads with primary alignments is shown in light shades and marker-assisted alignments overlaid in dark shades. Large HSat2 and HSat3 arrays are noted with light and dark blue triangles, respectively, and the location of the rDNA arrays is marked with asterisks (the inset regions are marked with arrowheads). Regions with low marker-assisted alignment coverage correspond with a lack of unique 21-mer markers (density shown in green), but are recovered by the primary alignments, albeit with low mapping quality.
Suspected assembly issues in T2T-CHM13 are compared to known assembly gaps and issues in GRCh38/hg38 below, as reported by the GRC. (B–D) Enlargements corresponding to regions of the genome featured in Figure 2, along with annotations of the major satellite repeats contained (primarily HSat1, HSat2, HSat3, and alpha satellite HOR arrays). Elevated HiFi sequencing coverage is observed for HSat2 and HSat3, while reduced ONT coverage is observed for HSat1. Identified errors (Issues) and heterozygous variants (Het SVs) are shown below, which typically correspond with low HiFi coverage of the primary allele (black) and elevated coverage of a secondary allele (red). Microsatellite repeats (%) in every 128 bp window are shown at the bottom, labeled with dimer notation in homopolymer compressed space.

rDNA assembly

The most complex region of the HiFi string graph involves the human ribosomal DNA arrays and their surrounding sequence (Fig. 2). Human rDNAs are 45 kbp near-identical repeats that encode the 45S rRNA and are arranged in large, tandem repeat arrays embedded within the

Contract Address: 0x495f...7b5e
Token ID
Token Standard: ERC-1155
Blockchain: Ethereum
Metadata: Centralized
Creator Earnings: 0%

Genome Human 2

