Yarek Waszul/The New York Times

May 4, 2022

Mapping the Final Blueprints: Identifying the Remaining 8 Percent of a Whole Human Genome


In 2003, the Human Genome Project made history by sequencing 92 percent of the human genome, which created breakthroughs in medicine such as the ability to identify specific mutations that may lead to cancer.

Almost 20 years later, on March 31, the Telomere-to-Telomere Consortium, an open community-based effort to complete the human genome, identified the last sequences, revealing a complete human genome for the first time.

Sanger sequencing was originally used to build the genome in the Human Genome Project. The technique chops DNA into small fragments of 100 to 150 base pairs, sequences each fragment, identifies overlapping patterns between them and recombines the fragments into a full sequence.
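The overlap-and-recombine idea can be illustrated with a toy sketch. The short reads below are invented for illustration, and the greedy merging strategy is a simplification, not the pipeline the Human Genome Project actually used.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return frags  # nothing left to merge
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]


# Hypothetical reads, each sharing four letters with the next one.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "CTTAGGA"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACCTTAGGA']
```

Because every overlap in this example is unique, the reads fit together in only one way. That stops being true once the same stretch of letters appears many times.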

However, according to Prof. Praveen Sethupathy, biomedical sciences, this sequencing technique has limitations, particularly in some regions of the genome.

“They’re highly repetitive…which makes it tough to know what order the pieces are in… so those regions were left aside [since] they were too complicated to deal with the [past] technology,” Sethupathy said. 

Prof. Andrew Clark, biology, added that the two hardest regions to sequence are the telomeres (the tips of the chromosomes) and the centromeres (the middle sections of chromosomes where the two identical sister chromatids come together).

In addition to being the ends of chromosomes, telomeres are non-coding sections of DNA. They consist of repetitive sequences, which makes them hard to sequence: when many fragments are identical, it is difficult to tell where each one belongs.

The centromere is necessary for the separation of the sister chromatids during cell division. The area around it contains many repeated DNA sequences, which also makes sequencing difficult.

Next-generation sequencing (NGS), which sequences millions of DNA fragments in parallel, allows whole genomes to be sequenced faster and more cheaply.

Long-read sequencing, developed by Cornell’s Watt Webb and his colleagues in the physics department, is a type of NGS that has greatly contributed to identifying the remaining 8 percent of the human genome. It can read 10,000 to 100,000 base pairs at a time.
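A small sketch shows why read length matters in repetitive regions. It uses the human telomere repeat TTAGGG, but the flanking sequences, the copy counts and the read lengths are made up for illustration and are far smaller than anything in a real genome.

```python
def reads_of_length(genome: str, k: int) -> set[str]:
    """Every distinct read of length k that could be drawn from the genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


REPEAT = "TTAGGG"                 # human telomere repeat unit
LEFT, RIGHT = "GATTACA", "CCGTA"  # hypothetical flanking sequences

genome_a = LEFT + REPEAT * 4 + RIGHT  # four copies of the repeat
genome_b = LEFT + REPEAT * 7 + RIGHT  # seven copies of the repeat

# Short reads: both genomes yield exactly the same set of pieces,
# so the number of repeat copies cannot be recovered from them.
short_same = reads_of_length(genome_a, 6) == reads_of_length(genome_b, 6)

# Long reads: some reads reach from one flank to the other,
# so the two genomes no longer look alike.
long_same = reads_of_length(genome_a, 30) == reads_of_length(genome_b, 30)

print(short_same, long_same)  # True False
```

In this toy case, a read has to be long enough to span the whole repeat and touch unique sequence on both sides before the copy number becomes visible.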

With the complete sequence in hand, scientists will be able to study phenomena in science and medicine that previously seemed impossible to access.

According to Prof. Hojoong Kwak, molecular biology and genetics, the identification of these very similar repeating sequences could also provide clues to the evolutionary history of the human genome. 

As the human genome is copied over many generations, mutations accumulate, so older sequences become more distinguishable from one another.

However, the similarity of the repeating sequences in the previously missing 8 percent indicates that this is a younger part of the genome, one that could provide new evidence of the more recent evolutionary origins of the human genome.
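The reasoning can be made concrete by measuring how alike two copies of a repeat are. The sketch below is only an illustration: the sequences are invented, and percent identity is a rough proxy for age, not the method the researchers used.

```python
def percent_identity(a: str, b: str) -> float:
    """Share of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical repeat copies. The "young" pair has had no time to diverge;
# the "old" pair has picked up three mutations across twelve positions.
young_pair = ("TTAGGGTTAGGG", "TTAGGGTTAGGG")
old_pair = ("TTAGGGTTAGGG", "TTAGCGTAAGGA")

print(percent_identity(*young_pair))  # 100.0
print(percent_identity(*old_pair))    # 75.0
```

Under the assumption that mutations build up steadily, copies that still match almost perfectly are likely to have been duplicated more recently.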

“These repetitive regions might contain small sequences of viruses or genetic parasites that are incorporated into the human genome… and can help us understand how they could impact certain populations and their susceptibility to diseases,” Kwak said. 

Kwak believes these sequences will help him identify unknown sequences that come up in his research, which focuses on finding regulatory regions in the cancer genome and exploring how genetic variation in the human population can impact gene expression.

Even with the completion of the human genome, the work is still not done. This advancement will catalyze future research in understanding how genomes function and differ from each other and move humanity closer to individualized medicine.