Bringing diversity to the reference genome


The release of the human reference genome in 2001 established a foundation to explore the evolution of the human genome and how it influences human health. A person’s genome has over 6 billion base pairs (the chemical building blocks of DNA represented by the letters CGAT). Overall, human genomes are extremely similar. However, small differences play a role in making each individual unique, including in their health. A reference genome helps identify these differences by providing a map of where genes are on the genome and what many of the genetic differences mean.

The current reference genome is a composite created from the DNA sequences of about 20 individuals, with most of the sequence coming from one person. As a result, the reference genome does not reflect global genomic diversity. It is also incomplete. While some of the original gaps have been filled in, more than 150 million bases are missing. For example, a study that compared the genomes of 910 people of African ancestry with the reference genome found that almost 10% of the information contained within their genomes was not present in the reference.

The current reliance on a single, and not quite complete, reference genome clearly needs addressing. Sequences that differ considerably from the reference can be incorrectly interpreted, leading to an incomplete understanding of the genomic causes of disease and hindering improvements in clinical care. To tackle these issues, global collaborative sequencing projects such as the Human Pangenome Project are using the latest developments in sequencing technology to fill in the gaps and create a human reference genome that better reflects global genomic diversity.

Filling in the gaps

The gaps in the human genome exist in regions that are hard to sequence using conventional technologies. These regions can be highly repetitive sections of genetic sequence or structural variants, where large sections of DNA have moved position. However, with recent advancements such as long-read sequencing technologies, scientists are now able to interrogate them. It is important to understand these regions because they have been associated with several diseases.

Long-read sequencing reads longer sections of the genome in one go, making it easier to piece together these challenging genome sequences in the correct order and so create a more reliable and accurate reference. A major achievement has been the use of extremely long and highly accurate sequence reads to reconstruct entire human chromosomes from telomere to telomere (T2T). However, despite its success, the T2T assembly does not capture the diversity of sequences across populations.
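The reason longer reads help with repetitive regions can be shown with a toy calculation. The Python sketch below is purely illustrative and is not part of any real assembly pipeline: the function name, the 1,000-base "genome", the repeat position and the read lengths are all hypothetical. It assumes a read can only place a repeat in the correct order if it covers the whole repeat plus at least one unique base on each side.

```python
def count_spanning_reads(genome_len, read_len, repeat_start, repeat_end, flank=1):
    """Count read start positions whose read covers the whole repeat
    [repeat_start, repeat_end) plus at least `flank` bases on each side.

    All values are hypothetical and chosen only for illustration.
    """
    count = 0
    for start in range(0, genome_len - read_len + 1):
        end = start + read_len  # exclusive end of the read
        covers_left_flank = start <= repeat_start - flank
        covers_right_flank = end >= repeat_end + flank
        if covers_left_flank and covers_right_flank:
            count += 1
    return count


# Toy genome of 1,000 bases with a 200-base repeat at positions 400-599.
short_read_hits = count_spanning_reads(1000, 150, 400, 600)
long_read_hits = count_spanning_reads(1000, 500, 400, 600)
print(short_read_hits, long_read_hits)
```

In this toy setting, no 150-base read can span the 200-base repeat, so short reads alone cannot anchor it to its unique neighbours, whereas many possible 500-base reads can.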

Pangenome research

Pangenome research aims to broaden the reference genome to represent genomic diversity within and across all human populations, which is vital to addressing the imbalance in population representation in genomic data. The Human Pangenome Project will utilise the latest advances in sequencing technologies, including long-read, single-cell, and advanced imaging, to meet its initial goal of creating 350 highly detailed genome sequences. In time, the aim is to expand to sequencing thousands of genomes to capture as much human genetic diversity as possible.

The Project is a big science effort, requiring collaboration between multidisciplinary teams of scientists as well as policy experts and ethicists to navigate the technological and societal challenges that will be encountered when collecting data for this project.

Importance of inclusivity

The genomes of different individuals and populations harbour a wealth of information on humanity’s responses to historical environmental and biological pressures. Some of these genetic differences have no effect on a person’s health whilst others can have a profound effect. It is this molecular diversity that underlies genetic disorders, inherited traits and disease susceptibility. Diversity in genomic research has numerous benefits, including novel insights into health disparities, a better understanding of human biology, improved clinical care, and better-informed genetic diagnosis.

In addition, therapies and drugs developed using genetic data from specific populations that share the same genetic ancestry will most likely work best in those populations. By examining previously underrepresented populations, new ancestry-specific associations for different diseases could be found, which also furthers the understanding of the genetic background of traits.

Apart from the scientific advancements, inclusion is a matter of justice; individuals benefit most from research conducted in those with a similar ancestral background to them. Including diverse populations in genomic research is the right thing to do for reasons of equity. It will ensure that all populations can benefit from genomic knowledge and its impact on healthcare.

Societal challenges

To increase diversity in genomic datasets there needs to be an acknowledgment and understanding that many of the groups that are underrepresented suffer from health inequalities. Past events have significantly affected the public’s perception of genomic research, particularly abuses of genomic data from certain populations. For example, members of the Havasupai Tribe donated DNA for studies on type 2 diabetes, but it was then used without their consent for studies on schizophrenia and migration. This resulted in a lawsuit and in the Navajo Nation placing a moratorium on genetic research studies, which is now being reconsidered. Some communities have set out codes of conduct and guidelines on how the scientific community is expected to engage with them, for example the Global Code of Conduct for Research. There are also ongoing challenges around informed consent, privacy, and data sharing.

Population sampling for pangenome efforts requires purposeful engagement with communities, fair representation, and careful policy and ethical guidance to ensure respectful partnerships with communities and participants. The Project is tackling this by embedding social ethicists in its decision-making processes and through their continuous vetting of the project’s work. The Project is also encouraging scientists within Indigenous populations to generate their own reference sequences. A number of countries have launched their own population-specific projects that aim to produce high-quality reference genomes using their own frameworks for sample collection and consent.

Better care for all

There are many challenges that the Human Pangenome Project will need to overcome, but once complete, the release of the pangenome will be a major upgrade to the reference genome. It is expected to accelerate genotype-to-phenotype studies, drive technology innovation and enable a new era of human biomedical research. By being more inclusive and representative of the global population, it will enable a better understanding of disease and of how clinical care can be improved for all. It will transform the way that basic and clinical research is done, while leading to improved standards for genomics research, data sharing, and reproducible workflows. The results of the project are highly anticipated and could create an important shift in how genomics research is done and used in healthcare.
