Abstract In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.

Introduction The latest major update to the human reference genome was released by the Genome Reference Consortium (GRC) in 2013 and most recently patched in 2019 (GRCh38.p13) (1). This assembly traces its origin to the publicly funded Human Genome Project (2) and has been continually improved over the past two decades. Unlike the competing Celera assembly (3), and most modern genome projects that are also based on shotgun sequence assembly (4), the GRC human reference assembly is primarily based on Sanger sequencing data derived from bacterial artificial chromosome (BAC) clones that were ordered and oriented along the genome via radiation hybrid, genetic linkage, and fingerprint maps (5). This laborious approach resulted in what remains one of the most continuous and accurate reference genomes today. However, reliance on these technologies limited the assembly to only the euchromatic regions of the genome that could be reliably cloned into BACs, mapped, and assembled. Restriction enzyme biases led to the underrepresentation of many long, tandem repeats in the resulting BAC libraries, and the opportunistic assembly of BACs derived from multiple different individuals resulted in a mosaic assembly that does not represent a continuous haplotype. As such, the current GRC assembly contains several unsolvable gaps, where a correct genomic reconstruction is impossible due to incompatible structural polymorphisms associated with segmental duplications on either side of the gap (6). As a result of these shortcomings, many repetitive and polymorphic regions of the genome have been left unfinished or incorrectly assembled for over 20 years.

The current GRCh38.p13 reference genome contains 151 Mbp of unknown sequence distributed throughout the genome, including pericentromeric and subtelomeric regions, recent segmental duplications, ampliconic gene arrays, and ribosomal DNA (rDNA) arrays, all of which are necessary for fundamental cellular processes (Fig. 1A). Some of the largest reference gaps include the entire p-arms (short arms) of all five acrocentric chromosomes (Chr13, Chr14, Chr15, Chr21, and Chr22), and large human satellite arrays (e.g., Chr1, Chr9, and Chr16), which are currently represented in the reference simply as multi-megabase stretches of unknown bases (‘N’s). In addition to these apparent gaps, other regions of the current reference are artificial or are otherwise incorrect. The centromeric alpha satellite arrays, for example, are represented in GRCh38 as computationally generated models of alpha satellite monomers to serve as decoys for resequencing analyses (7). In the case of the acrocentrics, some sequence is included for the p-arm of Chromosome 21 but appears incorrectly localized and poorly assembled, resulting in false gene duplications that complicate downstream analyses (8). When compared to other human genomes, the current reference also shows a genome-wide deletion bias, suggesting the systematic collapse of repeats during its initial cloning and/or assembly (9).

Fig. 1. Summary of the complete T2T-CHM13 human genome assembly. (A) karyoploteR (25) ideogram of the T2T-CHM13v1.1 assembly improvements. The bottom track shows the density of known genes in green and new paralogs in red. GRCh38 gaps and issues that are resolved by the CHM13 assembly are highlighted by black rectangles. Above, the density of segmental duplications is given in blue (26) and centromeric satellites (CenSat) in red (27). The top track is a local ancestry analysis where the majority of the genome is predicted to be of European ancestry (1000 Genomes EUR), with regions of admixture colored as specified in the legend. (B) New bases in the CHM13 assembly relative to GRCh38 per chromosome, with the acrocentrics highlighted in yellow. (C) New or structurally variable bases added by sequence type (“CenSat & SDs” is the overlap between these two annotations). (D) Total non-gap bases in UCSC reference genome releases dating back to September 2000 (hg4) and ending with T2T-CHM13 in 2021.
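
As an illustration of how reference gaps appear in practice, the following minimal Python sketch locates runs of unknown bases ('N') in a sequence string and totals them. The function names and the toy sequence are hypothetical and are not part of the T2T analysis; a real reference would be streamed from a FASTA file rather than held as a short string.

```python
import re
from typing import List, Tuple


def find_gap_runs(sequence: str, min_length: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive 'N' bases.

    Matching is case-insensitive because soft-masked references may
    contain lowercase letters. Runs shorter than min_length are skipped.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            runs.append((match.start(), match.end()))
    return runs


def total_gap_bases(sequence: str) -> int:
    """Sum the lengths of all 'N' runs in the sequence."""
    return sum(end - start for start, end in find_gap_runs(sequence))


# Toy example (hypothetical sequence, not real reference data).
toy_sequence = "ACGTNNNNNACGTNNACG"
print(find_gap_runs(toy_sequence))
print(total_gap_bases(toy_sequence))
```

Applied per chromosome, a tally of this kind is one way to arrive at a genome-wide count of unknown bases such as the one quoted above for GRCh38.p13.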

Despite the functional importance of these missing or erroneous regions, the Human Genome Project was officially declared complete in 2003 (10), and there was limited progress towards closing the remaining gaps in the years that followed. This was largely due to limitations of its construction discussed above, but also due to the sequencing technologies of the time, which were dominated by low-cost, high-throughput methods capable of sequencing only a few hundred bases per read. Thus, shotgun-based assembly methods were unable to surpass the quality of the existing reference. However, recent advances in long-read genome sequencing and assembly methods have enabled the complete assembly of individual human chromosomes from telomere to telomere without gaps (11, 12). In addition to using long reads, these T2T projects have targeted the genomes of clonal, complete hydatidiform mole (CHM) cell lines, which are almost completely homozygous and therefore easier to assemble than heterozygous diploid genomes (13). This single-haplotype, de novo strategy overcomes the limitations of the GRC’s mosaic BAC-based legacy, bypasses the challenges of structural polymorphism, and allows the use of modern genome sequencing and assembly methods.

Application of long-read sequencing for the improvement of the human reference genome followed the introduction of PacBio’s single-molecule, polymerase-based technology (14). This was the first commercial sequencing technology capable of producing multi-kilobase sequence reads, which, even with a 15% error rate, proved capable of resolving complex forms of structural variation and gaps in GRCh38 (9, 15). The next major advance in sequencing read lengths came from Oxford Nanopore’s single-molecule, nanopore-based technology, capable of sequencing “ultra-long” reads in excess of 1 Mbp (16), but again with an error rate of 15%. By spanning most genomic repeats, these ultra-long reads enabled highly continuous de novo assembly (17), including the first complete assemblies of a human centromere (ChrY) (18) and a human chromosome (ChrX) (11). However, due to their high error rate, these long-read technologies have posed considerable algorithmic challenges, especially for the reliable assembly of long, highly similar repeat arrays (19). Improved sequencing accuracy simplifies the problem, but past technologies have excelled at either accuracy or length, not both. PacBio’s recent “HiFi” circular consensus sequencing offers a compromise of 20 kbp read lengths and a median accuracy of 99.9% (20, 21), which has resulted in unprecedented assembly accuracy with relatively minor adjustments to standard assembly approaches (22, 23). Whereas ultra-long nanopore sequencing excels at spanning long, identical repeats, HiFi sequencing excels at differentiating subtly diverged repeat copies or haplotypes.

In order to create a complete and gapless human genome assembly, we leveraged the complementary aspects of PacBio HiFi and Oxford Nanopore ultra-long read sequencing, combined with the essentially haploid nature of the CHM13hTERT cell line (hereafter, CHM13) (24). The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, including all centromeric regions and the entire short arms of five human chromosomes. Here we describe the construction, validation, and initial analysis of the first truly complete human reference genome and discuss its potential impact on the field.

Cell line and sequencing As with many prior reference genome improvement efforts (1, 9, 13, 24, 28, 29), including the T2T assemblies of human chromosomes X (11) and 8 (12), we utilized a complete hydatidiform mole for sequencing. CHM genomes arise from the loss of the maternal complement and duplication of the paternal complement postfertilization and are, therefore, homozygous for one set of alleles. This simplifies the genome assembly problem by removing the confounding effect of heterozygous variation. We selected CHM13 for its stable 46,XX karyotype compared to other CHMs (11), but later found that CHM13 does possess a low level of heterozygosity, notably including a megabase-scale heterozygous deletion within the rDNA array on Chromosome 15, which was revealed by both FISH and nanopore sequencing (Figs. S1-2, Note S1). This and other identified heterozygous variants appear fixed in CHM13 and may have arisen during growth of the mole or passaging of the cell line. Local ancestry analysis shows the majority of the CHM13 genome is of European origin, including regions of Neanderthal introgression, with some predicted admixture from other populations (30) (Fig. 1A, Note S2).

Over the past 6 years, we have extensively sequenced CHM13 with multiple technologies (Note S3), including 30× PacBio circular consensus sequencing (HiFi) (29), 120× Oxford Nanopore ultra-long read sequencing (ONT) (11, 12), 100× Illumina PCR-Free sequencing (ILMN) (1), 70× Illumina / Arima Genomics Hi-C (Hi-C) (11), BioNano optical maps (11), and Strand-seq (29). Here we developed new methods for assembly, polishing, and validation that better utilize these datasets. In contrast to the first T2T assembly of Chromosome X (11)—which relied on ONT sequencing to create a backbone that was then polished with other technologies—we shifted to a new strategy that leverages the combined accuracy and length of HiFi reads to enable assembly of highly repetitive centromeric satellite arrays and closely related segmental duplications (12, 22, 29).
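
To give a sense of scale for the coverage figures above, the sketch below converts a coverage depth into total sequenced bases and an approximate read count. It is a back-of-the-envelope calculation only: it assumes the 3.055 Gbp genome size from the abstract and treats the 20 kbp HiFi read length quoted in the introduction as a uniform mean, which real libraries do not have. The function names are hypothetical.

```python
def total_sequenced_bases(genome_size_bp: int, coverage: int) -> int:
    """Total bases implied by a given average coverage depth."""
    return genome_size_bp * coverage


def approx_read_count(genome_size_bp: int, coverage: int, mean_read_length_bp: int) -> int:
    """Approximate number of reads, rounding up, assuming a uniform mean read length."""
    total = total_sequenced_bases(genome_size_bp, coverage)
    return -(-total // mean_read_length_bp)  # ceiling division on integers


GENOME_SIZE_BP = 3_055_000_000   # from the abstract
HIFI_COVERAGE = 30               # 30x HiFi, as stated above
HIFI_MEAN_READ_BP = 20_000       # assumption: 20 kbp taken as the mean

print(total_sequenced_bases(GENOME_SIZE_BP, HIFI_COVERAGE))
print(approx_read_count(GENOME_SIZE_BP, HIFI_COVERAGE, HIFI_MEAN_READ_BP))
```

Under these assumptions, 30× HiFi coverage corresponds to roughly 92 Gbp of sequence in about 4.6 million reads.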

Genome assembly The basis of the T2T-CHM13 assembly is a high-resolution assembly string graph (31) built directly from HiFi reads. In a bidirected string graph, nodes represent unambiguously assembled sequences and edges correspond to the overlaps between them, due to either repeats or true adjacencies in the underlying genome. The HiFi-based string graph was constructed using a purpose-built method that combines components from the HiCanu (22) and Miniasm (32) assemblers along with specialized graph processing. Although HiFi reads are very accurate, their primary error mode is small insertions or deletions within homopolymer runs, so, like HiCanu, the first step of the T2T string graph construction process was to “compress” homopolymer runs in the reads to a single nucleotide (e.g., [A]n becomes [A]1 for n > 1) (33). All compressed reads were then aligned to one another to identify and correct small errors, and differences within simple sequence repeats were masked to overcome this other known source of HiFi errors (22). After compression, correction, and masking, only exact overlaps were considered during graph construction, and new methods were developed for iterative graph simplification, as described in the supplementary methods (Fig. S3, Note S4). Edges in the resulting string graph correspond to exact overlaps of at least 8 kbp in homopolymer-compressed space.
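
The homopolymer compression step can be illustrated with a short Python sketch. This is a toy illustration of the idea, not the HiCanu or T2T implementation: the function names, the toy reads, and the brute-force overlap search are all assumptions made for the example. It shows how two reads that disagree only in homopolymer run lengths fail to overlap exactly in raw space but overlap exactly once compressed.

```python
from itertools import groupby


def compress_homopolymers(read: str) -> str:
    """Collapse each run of identical bases to one base, e.g. 'AAAC' -> 'AC'."""
    return "".join(base for base, _run in groupby(read))


def exact_overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that exactly equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    Brute force, for illustration only; real assemblers use indexed search.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


# Hypothetical toy reads that differ only in homopolymer run lengths
# (GGG vs GG and TTT vs TTTT), mimicking the dominant HiFi error mode.
read_a = "ACGGGTTTACCA"
read_b = "GGTTTTACCAGT"

print(exact_overlap_length(read_a, read_b, min_overlap=4))
print(compress_homopolymers(read_a), compress_homopolymers(read_b))
print(exact_overlap_length(compress_homopolymers(read_a),
                           compress_homopolymers(read_b),
                           min_overlap=4))
```

The trade-off is that compressed coordinates no longer match raw read coordinates, which is why the overlap threshold above is stated in homopolymer-compressed space.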

In the resulting graph, most chromosomes are represented by one or more connected components, each having a mostly linear structure (Fig. 2A). This suggests very few perfect repeats greater than roughly 10 kbp exist between different chromosomes or distant loci, with the exception of the five acrocentric chromosomes, which form a single connected component in the graph. Another complex region is the HSat3 array on Chromosome 9, which includes a recent multi-megabase tandem HSat3 duplication consistent with the 9qh+ (34) karyotype of CHM13 (Fig. S4). Minor fragmentation of the chromosomes into multiple connected components resulted from HiFi sequencing dropout across some GA-rich simple sequence repeats, presumably due to a bias of the HiFi sequencing or base-calling process (22). These gaps were later filled using a prior ONT-based assembly (CHM13v0.7) (11).
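
The notion of connected components used above can be made concrete with a small union-find sketch in Python. The node names and edges below are invented for illustration and do not come from the actual assembly graph; the sketch also ignores the bidirected nature of a string graph and treats overlaps as plain undirected edges.

```python
from collections import defaultdict
from typing import List, Sequence, Set, Tuple


def connected_components(nodes: Sequence[str],
                         edges: Sequence[Tuple[str, str]]) -> List[Set[str]]:
    """Group graph nodes into connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(node: str) -> str:
        # Walk to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in nodes:
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical toy graph: chr1 is one linear component, two acrocentric
# short-arm nodes are joined by a shared repeat, and chr9 is split in two
# because no overlap (edge) spans a sequencing dropout.
toy_nodes = ["chr1_a", "chr1_b", "chr13_p", "chr14_p", "chr9_a", "chr9_b"]
toy_edges = [("chr1_a", "chr1_b"), ("chr13_p", "chr14_p")]

for component in connected_components(toy_nodes, toy_edges):
    print(sorted(component))
```

In this toy graph, the missing edge between the two chr9 nodes plays the role of a coverage dropout that fragments a chromosome, while the edge joining the chr13 and chr14 nodes mimics a shared repeat that merges different chromosomes into one component.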

