First responders who bravely entered the collapsing twin towers on September 11, 2001 were forever changed. But the lasting effects of this tragedy were more than psychological. The first responders were exposed to toxic, carcinogenic particles that irreversibly changed their DNA and subsequently increased their risks for cardiovascular disease and cancer.
Firefighters are already five times more likely to get lung cancer than the general population and twice as likely to develop cardiovascular disease. But a recent study concluded that firefighters on the scene at the twin towers carried more disease-causing mutations than the average firefighter (1).
How did researchers definitively say that these firefighters accumulated these mutations on September 11, 2001? People aren’t mice, and creating a proper study by recruiting firefighters with matched age, race, smoking history, lifestyle, and health backgrounds is nearly impossible. Yet researchers at Vanderbilt University did just that: They recruited more than 200 firefighters who were age, sex, and smoking history matched to 52 firefighters who were exposed to toxic particles as they navigated the burning twin towers.
In 2004, clinicians began extracting and sequencing DNA from unused blood collected during patient visits to Vanderbilt University Medical Center to develop a biorepository, or biobank, called BioVU. As of 2018, the repository contained more than 250,000 patient samples, and the researchers boast that they collect 500 additional DNA samples every week.
“We understood almost 20 years ago that coming soon to biomedicine would be the ability to sequence people on scale and then use the genetic information to guide treatment — guide pharma, selection of drugs, selection of drug doses — and identify people at high risk for certain diseases,” said Dan Roden, a cardiac disease researcher and leader of the BioVU biobank. “In order to enable that kind of vision, we created the biobank.”
This team is not the first to start a biorepository. In fact, some biobanks are more than 100 years old and include biological samples beyond DNA. In 1862, clinicians collected blood and biopsy samples from soldiers during the Civil War and stored them in Washington, DC. Biobanks weren’t as organized in the 1800s though, and it wasn’t until the early 2000s that biobanks began collecting massive amounts of genetic data for researchers to mine.
Universities first started collecting patient samples in freezers. Now, entire countries such as the United States, United Kingdom, and Canada amass hundreds of thousands of biological samples from highly engaged participants in publicly and privately funded biobanks. Many of these countries follow their participants for decades; starting as young as 18 years of age, participants regularly provide new samples and complete questionnaires about their health and behavior.
With the continuation of these biobanks, scientists will have access to the complete genetic data of individuals — and possibly their children and grandchildren — to solve the genetic mysteries of disease.
“For years, everybody who works in or who's interested in medical research has been talking about precision medicine, or personalized medicine, whatever you want to call it. And it's exactly this approach,” said Naomi Allen, the chief scientist at the UK Biobank. “We want someone to be able to go to their general practitioner, and with a blood test, right then and there, they can tell someone, ‘The good news is you’ve got a very low risk of developing breast cancer, but you’ve got quite a high risk of developing osteoporosis.’”
“That’s what we’re going to do. We’re going to better prevent [disease] from happening in the first place,” she added.
The first generation
The Human Genome Project, a 13-year-long quest by researchers across the globe to map the first human genome, published the first (mostly) complete human genome in 2003. Before the first human genome was complete, the UK government thought that the genome held the keys to developing preventative treatments for diseases such as dementia and cancer. In 2002, the Wellcome Trust and the UK Department of Health announced that they would donate 45 million euros to start the UK Biobank. The venture grew from an idea to a reality only four years later when they started recruiting participants.
Unlike the Vanderbilt clinicians, the organizing researchers at the UK Biobank didn’t ask patients at routine doctor appointments to opt in to their program. Instead, they mailed letters to every person between the ages of 40 and 69 who lived within 30 miles of one of their 22 assessment centers (2). After sending letters to 9 million people over four years, they recruited approximately 500,000 people to participate in the study.
“When we first set up the study, a lot of people asked, ‘Why have you focused on people aged 40 to 69? Why aren’t you recruiting the younger adult population?’” Allen recalled. “People 40 to 69 are young enough that they haven't already developed a lot of the conditions researchers are interested in, such as cardiovascular disease, cancer, or dementia. Researchers can actually look at risk factors before the participants develop those diseases, which is the whole purpose of the study.”
Allen and a board of directors with experience in academia and industry have been following the 500,000 participants since 2006. Participants regularly visit assessment centers where they fill out surveys about their health and medical history. Researchers also collect and store blood samples and measure vital signs such as blood pressure for the participants. The team automatically tracks participants as they age and develop diseases through their clinical medical records, which are updated each time they see a physician, including for annual exams or admission to an emergency room.
The team also sends web-based surveys that assess mental health, dementia, and gastrointestinal symptoms to participants with an email address. In the coming months, they plan to send out questionnaires to determine the effects of long COVID in participants who were infected during the pandemic.
“That's obviously the most important part about long-term, longitudinal prospective studies. Once we've got all these people, we then want to find out what happens to them, and what diseases they develop,” Allen said.
She and the team first released longitudinal data from the UK Biobank study in 2017. “At the time, it was by far the largest study that had genetic data coupled with all the lifestyle and clinical health outcomes,” Allen added.
Allen boasted that more than 6,000 publications have used data pulled from the UK Biobank. The massive amount of information the biobank offers has propelled the use of polygenic risk scores (PRS) to predict disease risk in humans. A PRS is calculated using not just one genetic variant, but all genetic variation across an individual’s genome. Each individual variant may only slightly increase a person’s chances of disease, but the cumulative effect of many seemingly low-risk variants could put someone at very high risk for developing a disease.
“It really shows the global research community that almost every single condition has a genetic component and that researchers and clinicians need to take into account variation across the whole genome,” said Allen.
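To make the idea of a cumulative score concrete, here is a minimal sketch of one common way to compute a PRS: a weighted sum in which each variant contributes its estimated effect size multiplied by the number of risk alleles a person carries (0, 1, or 2). The variant names, effect sizes, and the function name are hypothetical and chosen only for illustration; they do not come from the UK Biobank or any study cited here, and real pipelines add steps such as quality control and ancestry adjustment.

```python
from typing import Mapping


def polygenic_risk_score(
    effect_sizes: Mapping[str, float],
    allele_counts: Mapping[str, int],
) -> float:
    """Weighted sum of risk-allele counts across variants.

    effect_sizes: per-variant weight (hypothetical values here).
    allele_counts: copies of the risk allele a person carries (0, 1, or 2).
    Variants missing from allele_counts are treated as 0 copies,
    which is a simplifying assumption of this sketch.
    """
    score = 0.0
    for variant, weight in effect_sizes.items():
        count = allele_counts.get(variant, 0)
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {variant} must be 0, 1, or 2")
        score += weight * count
    return score


# Hypothetical weights: each variant has only a small effect on its own.
effect_sizes = {
    "variant_a": 0.25,
    "variant_b": 0.5,
    "variant_c": -0.125,  # a protective variant lowers the score
    "variant_d": 0.125,
}

# Hypothetical genotype for one person.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0, "variant_d": 1}

score = polygenic_risk_score(effect_sizes, person)
print(score)
```

No single weight in this toy example is large, but the contributions add up across variants, which is the effect the paragraph above describes. In practice a raw score is usually compared against the distribution of scores in a reference population rather than read on its own.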
Plant and animal breeders had used PRSs to aid in selective breeding efforts for decades before they were first used to calculate human disease risk, but calculating a PRS in humans was extremely challenging before the advent of biobanks (3-4). (The method was first proposed in 1932.) A seminal 2009 publication in Nature, which has been cited more than 4,000 times, used genome-wide association studies (GWAS) to analyze genetic data from more than 3,000 individuals with schizophrenia and 3,000 people without schizophrenia. These data came, in part, from small biobanks such as the Trinity College Biobank (4). The study showed for the first time that a disease with previously unknown genetic underpinnings developed due to polygenic effects (5).
The authors of the Nature paper stated that their results should be replicated in a larger cohort. Researchers have since validated their results, most notably in 2019 when researchers analyzed data from nearly 150,000 individuals in the UK Biobank to calculate PRS for schizophrenia (6).
The next generation
Biobanks are good for more than calculating a PRS. Researchers can use a massive set of data to follow the genetic lives of multiple generations. The UK Biobank researchers recently showed that recruiting the offspring of participants to follow the genetic pathway of a disease across generations is possible.
Researchers at the UK Biobank recruited participants from their 500,000-person cohort along with their children and grandchildren who were over the age of 18 and contracted COVID-19 to provide blood samples for antibody analysis. 116,000 participants and their children and grandchildren volunteered over the course of four weeks. The team collected blood samples from each of the participants every month for six months.
Across this population, 99% of participants still carried antibodies against SARS-CoV-2 three months after their initial infection, and 88% still held antibodies six months later. There were differences between participant populations. People under age 30 had higher rates of infection than people over age 70, and a higher proportion of Black individuals had COVID-19 antibodies than white or Chinese individuals.
“[The study] led on to help the UK government introduce the spacing of vaccines — when someone had their first dose, when they had their second dose, how long they have a third dose based on how long antibodies are likely to persist,” said Allen.
The participants continued turning in blood samples for another 12 months. Allen’s team is currently analyzing the results to see if people still have antibodies against SARS-CoV-2 12 and 18 months after their initial infection.
Allen is consistently impressed with how engaged the participants are in the study. She actually receives emails from participants asking why they haven’t received a questionnaire recently. The participants don’t get any data back from the study, but the biobank holds participant gatherings where researchers present studies that used the biobank.
When Allen’s team asked first degree relatives to participate in the COVID-19 study, they hoped that 20,000 people would respond. 116,000 people registered.
“If there is genomic data from first degree relatives, that opens up a new avenue of research to look at nature versus nurture, how things run in families, and those [mutations] that are de novo. It really helps you to pick apart family structure and how shared environment or hereditary factors influence disease risk,” said Allen. “The added bonus is that researchers can do research on health outcomes that particularly affect younger adults such as mental health and reproductive outcomes.”
The generation’s location
The University of Toronto houses the Canadian Partnership for Tomorrow’s Health (CanPath), which is funded by the Canadian Partnership Against Cancer (CPAC) and private entities such as Genome Canada. It includes 330,000 participants who were between ages 30 and 79 at recruitment. But participants under the age of 35 are more difficult to recruit. This group makes up only 6% of the CanPath cohort.
“Most of our recruitment, regardless, happened beyond age 40,” said Philip Awadalla, the national scientific director for CanPath. “But given the population of the cohort, we have significant numbers to start exploring some of these early disease signatures.”
CanPath is similar to the UK Biobank; they even share research protocols. But there are differences beyond the age range. Researchers using CanPath can directly access biological samples — blood, saliva, urine, among others — from the patient. For example, one research group supported by CanPath develops early cancer detection tests using blood samples collected from participants who developed cancer during the study.
A computational biologist by training, Awadalla is interested in the unique geography of Canada and how environmental exposures specific to certain parts of the country alter an individual’s genome and risk for developing metabolic syndromes common in older adults, such as hypertension, high cholesterol, obesity, and insulin resistance. He initiated a set of studies he calls the EpiCan project in 2016 to investigate.
Canada’s unique provincial healthcare system gives CanPath a peculiar advantage for studying environmental effects. Each of Canada’s 10 provinces and three territories organizes its own public healthcare within its region, making everyone’s individual services a bit different. Because of this, each individual’s clinical records are housed by the province or territory they lived in at the time of care.
“What we've had to do is get not just 10 regional provinces to recruit participants, but also get participants to consent to allow us to link to administrative health records, which fortunately or unfortunately, will be hosted at each one, depending on who the data custodian is in each one of these 10 provinces. So in a sense, what CanPath does is consents participants to not only give us information about themselves in terms of giving us biologics that we can store in the biobank, but about what they've been exposed to during their lifetimes,” said Awadalla.
“We can take that information and tie that to information we're seeing at the genomic level and right at the transcriptomic level and use that information to identify and capture genes that may be regulated or dysregulated, depending on what environmental exposures they've been associated with,” said Awadalla.
In 2018, he and his team analyzed data from 1,000 CanPath participants and reported that people exposed to air pollution in regions of Quebec showed gene expression patterns associated with respiratory diseases more often than others with a similar genetic ancestry (7).
“Given what we've learned in Quebec, we're trying to do those same [studies examining] genetic signatures associated with variation across Canada,” said Awadalla. “We could track environmental exposures by postal codes. Does it also impact other things like BMI, blood sugar levels, and so on?”
Awadalla is already studying the genetic influence from other environmental exposures unique to certain regions of Canada. In the Maritimes region, arsenic is naturally present in many of the geological features in the area. Even rocks in someone’s basement in the area could easily expose them to harmful levels of arsenic.
“For many cancer projects, researchers can almost count the number of cigarettes a person has had based on mutations in biopsies from the lung,” he said. “We see in blood samples a specific kind of methylation pattern due to arsenic exposure that's distinct from that of smoking.”
An isolated generation
While researchers in Canada ask questions about the environment’s influence on gene expression, researchers in Finland use their own biobank, FinnGen, to identify rare genetic variants. Similar to the UK Biobank, FinnGen has a whopping 500,000 participants, which equates to about 10% of Finland’s total population. It also holds complete medical histories of its participants. Each health care visit and service over the lifetime of every resident is stored in Nordic health registers. Researchers can do longitudinal analyses using FinnGen just like they can with the UK Biobank or CanPath.
According to Aarno Palotie, the scientific director of FinnGen, that’s where the similarities end.
“What’s unique is the population. Even if it’s a European population, it’s a population isolate, which sets it apart from a discovery perspective,” said Palotie. “Certain variants have been enriched and they are much more frequent and easier to identify than in mixed populations. And that’s why one of the unique aspects of FinnGen is that we have been able to identify hundreds of coding variants associated to diseases.”
Clinicians were the first to notice the high prevalence of rare diseases in Finland (8). In the 1950s, pediatricians at the University of Helsinki Children’s Hospital repeatedly recorded newborn patients with a familial nephrosis with 100% mortality rate within two years. Most forms of nephrosis are not genetic, and they couldn’t find any solutions in the literature, so they took matters into their own hands.
The pediatricians traced the ancestry of all parents and close relatives of 57 affected families (9). Most of the parents and relatives had common ancestors from a group that settled in an isolated area in the 1500s. The team realized that the newly described congenital nephrosis was a rare, recessive disorder uniquely prevalent in Finland. Since then, researchers have identified more than 30 mostly pediatric diseases that are overrepresented in the isolated Finnish population; they refer to these disorders as Finnish diseases.
Researchers use this population knowledge to identify disease-causing mutations in rare disorders. For example, in 1997, Finnish researchers found that a rare genetic variant in a previously undescribed gene, AIRE, caused autoimmune polyendocrinopathy-candidiasis-ectodermal dystrophy (APECED), a rare autoimmune disorder that occurs in about 1 in 90,000 individuals in most countries, but 1 in 9,000 in Finland (10).
Now, finding rare genetic variants in Finnish populations is easier than ever with FinnGen. Palotie was particularly excited about a recent study that revealed a rare variant that protects against glaucoma (11).
“We have an easy time making discoveries, but we also have a harder time for replication. We partner a lot with Estonia, which is the closest genetic relative to our population. However, we are not from Mars. Even if a variant might be specific, the biology is usually the same. We often find rare variants in the same genes, but they have been so low frequency that they couldn’t have been discovered otherwise,” said Palotie.
All of the generations
Palotie said FinnGen-supported results can be validated in biobanks with more diverse populations such as the UK Biobank or CanPath. But people with European heritage are overrepresented in both of these biobanks. The United States’ National Institutes of Health has created a biobank that reflects the diversity of its population for a program it calls All of Us.
“We imagine our study to be very complementary to the ones that are happening now, particularly in Europe. The hope would be that we can do crosstalk between these longitudinal population studies (discovering in All of Us and replicating in FinnGen, or discovering in UK Biobank and replicating in All of Us) and identify the gaps that exist in all of our studies so that we can continue to, in a longitudinal fashion, really address those gaps,” said Geoffrey Ginsburg, chief medical and scientific officer at the All of Us Program at the National Institutes of Health (NIH).
All of Us is the new kid on the biobank block. It started enrolling participants in 2018, 12 years after the UK Biobank, and currently boasts 52 publications using the resource compared to the UK Biobank’s 6,000. Its goal is to enroll one million participants in the program, including 50% from minority populations in an effort to over-represent populations that other biobanks lack.
To recruit a diverse population, the All of Us program is less selective than CanPath or the UK Biobank, which only recruit people between 30 and 80 years of age. Any United States citizen over 18 can join the program. The team is currently organizing a pediatric advisory board to collect biological samples from participants younger than 18 in the future.
“We have funded recruitment sites around the country that have been involved in clinical trials and other longitudinal population studies and have the expertise to bring in participants of all walks of life,” said Ginsburg. “But we recognize that that's also not the only way that people can be recruited. You know, people might just want to raise their hand and say, ‘I want to be part of All of Us.’”
Interested participants simply have to pick up their phone, download an app, and enroll in the All of Us program. (That’s how Ginsburg signed up himself.) They can then go to a designated clinic or hospital and donate their blood for sequencing, or if they don’t want to leave the house, All of Us will mail them a kit to send in a spit sample. Ginsburg said that was a huge help during the initial phases of the COVID-19 pandemic; it’s also how he submitted his initial sample. All of Us even has a recreational vehicle that drives around the country to visit rural areas with individuals who are often overlooked by recruiters for clinical trials and other longitudinal studies.
All of Us recruiters are making headway. Approximately 350,000 people from all 50 states are participating in the All of Us program as of 2022. A study published in April in The Journal of the American Medical Informatics Association investigated whether individuals with different demographic backgrounds were more or less willing to share their electronic health records with the program, which allows researchers to look at medical histories decades before volunteers joined the All of Us program (12). 90% of the 26,000 participants enrolled, and only 2% refused to share their electronic health record data. Of the participants who agreed to share their electronic health records, 66.5% were White, 18.7% were Black/African American, and 7.7% were Hispanic. Race was not significantly associated with opting out of sharing health records.
Stephen Mikita, one of the earliest All of Us participants and a former participant advocate for the disabled community on the All of Us participant steering committee, thinks that the key to recruiting participants from groups under-represented in biomedical research is including them in the process in a real way. It’s one of the things that initially drew him to the program. He learned about All of Us while serving on an FDA steering committee.
“It was something that I thought was just extraordinary, ambitious, and so powerful from that early research standpoint. But also from a personal standpoint, as someone who’s lived his life with a rare disease, spinal muscular atrophy type two, I thought, ‘What a wonderful opportunity,’” Mikita said.
The All of Us program is one of the few major biobanks to share data with participants about their own genes. Mikita thinks that’s the key to recruiting people and keeping them engaged.
“Transparency means that we, as participants, should know what is being done with our data to drive the types of discoveries and research that will not only benefit us, but generations of our families and our communities. That is an indispensable core ingredient that really informs every aspect of this program. Participants are elevated and our voices are amplified, just as any other key stakeholder around the table,” said Mikita.
In 2021, All of Us shared “fun” phenotypic data with participants when they first joined, including if they carried a gene for curly hair or brown eyes. Later in 2022, they will send health-related data to participants who wanted to learn more about their health. They will tell participants if they have any genetic variants that the American College of Medical Genetics and Genomics (ACMG) has deemed clinically actionable. Declaring a variant clinically actionable is no small task; the ACMG has only given 78 variants this designation.
“A very small number have reached that level of evidence where it's very clear that the genetic variation is very well correlated, almost 100% correlated, with the clinical phenotype of question and that there's demonstrable evidence that having that information can actually be useful and have the utility to potentially change the course of a disease,” said Ginsburg.
For example, certain mutations in breast cancer 1 (BRCA1) or BRCA2 increase a person’s chances of developing breast cancer from 12% to 72% and ovarian cancer from 1.3% to 12% (13). People who have one of these mutations may want to closely monitor their breasts or even undergo surgery to remove them. There isn’t a good way to monitor for ovarian cancer, so some women opt to remove their ovaries.
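The risk figures quoted above can be read either as an absolute change (percentage points added) or as a relative change (multiples of the baseline). The short sketch below works through both, using only the percentages stated in this article; the function name is hypothetical and the calculation is illustrative arithmetic, not a clinical tool.

```python
from typing import Tuple


def risk_change(baseline_pct: float, carrier_pct: float) -> Tuple[float, float]:
    """Return (absolute increase in percentage points, relative risk).

    baseline_pct: lifetime risk quoted for the general population.
    carrier_pct: lifetime risk quoted for people carrying the mutation.
    """
    absolute_increase = carrier_pct - baseline_pct
    relative_risk = carrier_pct / baseline_pct
    return absolute_increase, relative_risk


# Figures as quoted in the article text.
breast_abs, breast_rel = risk_change(12.0, 72.0)
ovarian_abs, ovarian_rel = risk_change(1.3, 12.0)

print(f"Breast cancer: +{breast_abs:.1f} points, {breast_rel:.1f}x baseline")
print(f"Ovarian cancer: +{ovarian_abs:.1f} points, {ovarian_rel:.1f}x baseline")
```

On these numbers, breast cancer risk rises by 60 percentage points, or six times the baseline, while ovarian cancer risk rises by about 11 points but roughly nine times the baseline. The two framings can leave very different impressions, which is part of why returning genetic results to participants calls for careful communication.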
But the jury is still out on whether knowledge of genetic mutations has a positive effect on mental health and wellbeing. Ginsburg cited studies that claim people feel less worried when they know they have a high-risk mutation, while Allen from the UK Biobank cited studies that claim the opposite.
“The problem with that is that it causes a lot of anxiety. It causes a lot of work downstream. You need genetic counselors on board. The general practitioners need to know what you're doing. If you've got half a million participants across the UK, you've suddenly massively increased the workload for the national health service, which is what we didn't want to do,” said Allen.
Allen said the UK Biobank recruited 100,000 participants to come in for MRIs to add neurological data to the repository. They decided to report to patients if they had significant abnormalities, such as a potential brain tumor. Only 2% of patients had an abnormality, and less than half of those turned out to be anything significant.
The All of Us team is willing to take the risk. They work with genetic counseling resources to thoughtfully return any genetic results to participants.
“We're very committed to these feedback loops. We want to learn from our participants how they are receiving the results we're giving them, if we're not communicating the results well, if they don't understand them, if there are things that we can do better. We're going to get feedback so that we can continually optimize the way that we learn about delivery of results and the way our participants learn about the meaning of the information we're giving them,” said Ginsburg.
Ginsburg is excited about the wealth of information the All of Us program plans to collect. In the future, the team wants to use proteomics and metabolomics to analyze participants' biological samples. They will collect data from other sources such as wearables and surveys about lifestyle, diet, and socio-economic status.
“Imagine that we define an individual’s polygenic risk score, environmental risk score, and social risk score. In combination with the genetics, it’s going to give us the full complement of our ability to tell you what your true risk for having a disease in the future might be,” he said. “Eventually, we’ll use that power to identify therapeutic options for people who are at risk for prevention, or for people who already have disease.”
References
1. Jasra, S. et al. High burden of clonal hematopoiesis in first responders exposed to the World Trade Center disaster. Nat Med 28, 468-471 (2022).
2. Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med 1371 (2015).
3. Crouch, D.J.M. and Bodmer, W.F. Polygenic inheritance, GWAS, polygenic risk scores, and the search for functional variants. PNAS 117, 18924–18933 (2020).
4. International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia that overlaps with bipolar disorder. Nature 460, 748-752 (2009).
5. Smeland, O.B. et al. The polygenic architecture of schizophrenia — rethinking pathogenesis and nosology. Nat Revs Neurol 16, 366–379 (2020).
6. Escott-Price, V. et al. The relationship between common variant schizophrenia liability and number of offspring in the UK biobank. Am J Psych 176, 661-666 (2019).
7. Fave, M-J. et al. Gene-by-environment interactions in urban populations modulate risk phenotypes. Nat Comms 9, 827 (2018).
8. Norio, R. Finnish Disease Heritage I. Hum Genet 112, 441–456 (2003).
9. Norio, R. et al. Heredity in the congenital nephrotic syndrome; a genetic study of 57 Finnish families with a review of reported cases. Ann Paediatr Fenn 12, Suppl 27 (1966).
10. Aaltonen, J. et al. An autoimmune disease, APECED, caused by mutations in a novel gene featuring two PHD-type zinc-finger domains. Nat Gen 17, 399-403 (1997).
11. Tanigawa, Y. et al. Rare protein-altering variants in ANGPTL7 lower intraocular pressure and protect against glaucoma. PLoS Gen (2020).
12. Joseph, C.L.M. et al. Demographic differences in willingness to share electronic health records in the All of Us Research Program. JAMIA 29, 1271-1278 (2022).
13. NIH. BRCA Gene Mutations: Cancer Risk and Genetic Testing. at < https://www.cancer.gov/about-cancer/causes-prevention/genetics/brca-fact-sheet#how-much-does-having-a-brca1-or-brca2-gene-mutation-increase-a-womans-risk-of-breast-and-ovarian-cancer >