I spent most of last week at the Genomics of Common Diseases meeting held by Nature Genetics and the Wellcome Trust. Below I summarize some of the big themes of the conference (with some editorial and biased by my own opinions), and give the big takeaway message I got from each of the talks.
Before the meeting, Orli Bahcall presented a list of “big questions” regarding the state and future of genomics of common diseases. While going through the questions and absorbing the talks, I came up with the following big themes:
Clearly, GWAS has made a lot of progress in identifying genetic variants associated with disease. We saw many examples of this, including from Martin Tobin in COPD, from Peter Gregerson in rheumatoid arthritis, and from Judy Cho in IBD. GWAS methods have become pretty solid, and sample sizes are continuously expanding to allow large meta-analyses to identify more and more susceptibility loci (see the recent progress in schizophrenia for example). Clearly, one of the big challenges going forward will be in interpreting genetic associations and determining the biology driving them.
Speakers highlighted several ways we can improve association studies and gain better insight into the genetic architecture of disease:
The importance of phenotyping: If we choose the right phenotype, we can increase our power and effect sizes in association tests. Tobin showed how correctly stratifying patient groups, in his case smokers and non-smokers with COPD, can be helpful. Gregerson and Cho emphasized better use of molecular “endophenotypes”, such as biomarkers associated with disease, which may give stronger associations than more heterogeneous disease phenotypes. For instance, in a study on IBD, Cho looked at the more specific phenotype of mycobaecterial infection. In a study on rheumatoid arthritis, Gregerson looked at immunological quantitative traits.
Integrating datasets for better interpretation We are now generating tons of datasets, including transcriptomics, proteomics, epigenomics, transcription factor binding maps, and more. A key to GWAS interpretation will be integrating these datasets on top of association results, as alluded to by Nancy Cox. For example, Huyghe showed that GWAS SNPs act as eQTLs in relevant disease tissues, and we can use such overlaps to learn about function and tissue specificity. Several posters integrated functional datasets to interpret specific GWAS loci.
Better understanding of underlying processes: genetic architecture and mutation We still do not fully understand the genetic architecture of many complex traits. Is most heritability due to rare variants? common variants? interactions we don’t account for? The answer is likely to be different across different traits. Vineeta Agarwala dug into a specific locus, 9p21 implicated across an impressive array of different diseases, to investigate which variants are driving an association with Type 2 Diabetes. She asked whether rare variants on common haplotypes could explain the signal, and concluded that common, not rare, variants are likely the causal loci. This is just one example, and we still have a lot to learn about the architecture both behind GWAS variants and behind genetic variance not yet accounted for by association tests. Additionally, to better assess the relevance of mutations, it is important to have quantitative models that give us better understanding of mutation processes. Molly Przeworski’s keynote emphasized that there are still many key aspects of the point mutation process that we don’t understand. David Goldstein stressed the importance of having an accurate null model when interpreting de novo mutations. Rebecca Rothwell presented a model of mitochondrial transmission. Gil McVean presented on one of my favorite topics: how can we study the whole spectrum of mutation events, including repeats and larger structural variants, not just point mutations? He suggested assembly based approaches for doing this, focusing on some results from plasmodium. While most of these talks did not relate directly to studies of human disease, as Gil pointed out, mutations are at the heart of what we study and it’s important to understand how they work.
There have been some heroic efforts to track down individual GWAS hits all the way to the causal variant and its mechanism. For instance, Hoskins investigated a GWAS hit for pancreatic cancer risk down to an indel that creates a novel binding site that represses a distant gene through long range interaction. These efforts are an extremely important next step in GWAS. Yet, fully investigating each of these loci is a huge task. To make progress on a large scale forward, I think we’re going to have to develop better computational methods for performing this on a large scale. A large part of this will be integrating many layers of genomics datasets, mentioned above. But I’m personally excited by the following classes of methods that will help generate meaningful predictions about the effects of variants:
Heritability partitioning We are learning that many complex traits (e.g. schizophrenia, Type 2 Diabetes, cardiovascular disease, many many others) are quite polygenic. With so many GWAS loci of small effect, how can we possibly learn the big picture of what’s going on? Hilary Finucane presented the LD score metric and how it can be used to measure overall heritability in polygenic traits, but also to partition heritability between different annotations. This is potentially very valuable: for instance, we can learn how much heritability comes from coding regions vs. enhancers vs. promoters. But we can also start to break traits down by cell type and ask for example how much heritability is due to enhancers in a certain cell type vs. another cell type? Some initial results for this: Gusev et al found that DNase Hypersensitivity sites explained on average nearly 80% of heritability at complex traits like schizophrenia; a poster at GCD14 and a tweet from GI2014 suggested that much of heritability in Type 2 Diabetes is partitioned to enhancers in islet cells. These methods seem to be a valuable step in figuring out what biological functions and which cell types are driving many complex traits.
Predicting functional effects of specific variants Beyond learning about overall processes involved in complex traits, it will be important to elucidate the functions of individual variants. Above I discussed the issue of overlapping many different types of functional datasets to do this, but I think an additional set of promising methods are those that directly predict the impact of sequence variation on function. Whereas the first type of approach says something like “this variant overlaps an enhancer and is a conserved base in a transcription factor binding motif”, the second asks questions like: “if I change this A to a T, what happens to the predicted regulatory function of this site?”. Two approaches for this type of analysis caught my attention: Nancy Cox presented a method to use genotypes to predict gene expression levels, and then associate expression levels with a quantitative trait. This accomplishes the goal of both predicting the impact of specific variants, and also directly implicating the genes involved with the trait of interest. Another method I was excited about came from a poster from Dongwon Lee from work he did with Michael Beer (see here and here). They are using a kmer based model to directly impact the influence of sequence variants on marks like DNaseI hypersensitivity and CTCF. I think such model based methods will be important in the future of GWAS interpretation.
As David Altshuler pointed out, in many cases to understand a single genome, we need to have access to the genomes of thousands or millions of people. He mentioned the cases of rare mutations in families, estimating penetrance, ethnic diversity, and cancer. We are now capable of approaching sample sizes of this scale. It is not news that genomics is now generating tons of data. We saw presentations from large projects including the GTEx, 1000 Genomes, and Genomics England’s 100K Genomes Project, plus a talk by Teri Manolio about many different clinical genomics projects from around the world. In a talk titled “Big Data: Transforming Cancer Research and Care”, Lynda Chin presented work on enormous datasets that included genomic, transcriptomic, epigenomic, and more data from many cancer patients. She emphasized the need to develop resources to crowdsource analysis to maximize use of these datasets.
While many of the datasets mentioned above and elsewhere are available publicly or with approved access, it is still difficult to ask questions like “at this location in the genome, what fraction of people have allele A” or “across all eQTL studies ever done, what is the effect of this variant.” While it is possible to answer some of these questions, it requires a huge amount of logistics to query many diverse datasets. If all publicly available genomics data was accessible and query-able within the same “interoperable framework”, it would be much easier to ask these kinds of questions and would make existing datasets much more powerful. This is a major goal of the Global Alliance for Genomics and Health, and I think its importance should be emphasized.
Although this topic did not come up as much as I expected during the meeting, clearly one of the most transformative new technologies in genomics is the ability to precisely edit genomes in place with the use of the CRISPR-Cas9 system. Feng Zhang gave a great overview of the technology, and some exciting new developments they are working on. He was followed by Kiran Musunuru who talked about using CRISPR to follow up a specific GWAS locus.
This technology obviously has tons of use cases. For the first time, instead of just identifying associations, we can directly assess the effects of specific variants within their original genomic context. This will allow us to validate and study effects of GWAS loci and QTLs, easily generate gene knock-outs in tissue-specific models, and more. However, there are still many challenges. Musunuru pointed out issues in throughput: he wants to test all variants at a GWAS locus. Performing separate CRISPR experiments on each variant using current methods would be a huge task. Instead, he outlined a plan where he will generate a pooled library of guide RNAs for each mutation in a cell line with his gene of interest tagged with GFP. He will then assay expression using FACs and analyze extremes of the expression distribution to determine variants leading to increases and decreases in expression. I think this is the direction we need to be thinking. His example is only for a single locus. Imagine now that we’d like to interrogate variants at thousands of loci across the genome. There are still technical challenges, such as isolating specific populations of cells that received specific CRISPR edits, that make genome editing on such a large scale infeasible.
Howard Jacob joked (or not) that the future of common disease is “phenotyping using an Apple watch and clinical sequencing”. However, there are many barriers to overcome. Two key areas were discussed regarding what needs to happen to make clinical genomics a reality:
I really enjoyed seeing the work of deCODE, presented by Hreinn Stefanson, in which they took CNVs implicated in schizophrenia and autism, and asked how they behave in controls. These types of studies are crucial for understanding complex traits: we have to look both at individuals with the disease phenotype, but also at individuals without the phenotype. Interestingly, they found that CNVs for schizophrenia indeed showed a cognitive phenotype in healthy controls, suggesting the variant plays a role in cognition separate from its role in the manifest disease. This study is important both to learn about penetrance, but also because it allows better elucidation of the phenotype conferred by that specific variant.
David Altshuler passionately advocated the importance of penetrance in variant interpretation, both in many of the Q&A sessions, but also in his talk about the Global Alliance. The big issue is that to assess the impact of a variant in an individual, we really need to look at all other individuals that harbor the same variant and ask what their phenotypes were.
(Note a couple are missing, mostly because some were closed to social media, but a couple of them I didn’t attend, sorry!)
Bert Vogelstein (keynote) Deaths from cardiovascular disease over the last 60 years have decreased about 70%, while deaths from cancer have stayed stagnant. Why? Vogelstein suggests this is because most funding for cancer research has focused on curing advanced cancers, but the most promise for impact comes from prevention and early detection.
Lynda Chin We are generating overwhelmingly large cancer genomics datasets: whole genome sequencing, transcriptomics, proteomics, and scores of epigenetic marks. We need to make better visualization and data resources to crowdsource analyses of these data to maximize its utility.
Jason Hoskins Hoskins tackles the important task of following up a GWAS hit for predisposition to pancreatic cancer all the way to the causal variant and mechanism. He finds a double insertion event tagged by the GWAS hit that forms a novel binding site for LEF1 that silences DIS3 through long range interaction.
Brad Bernstein Bernstein uses high throughput and single cell genomics to find the factors driving stemness in glioblastomas.
Martin Tobin Lung function and COPD are highly heritable conditions, and associations found by GWAS explain only a small portion of the genetic variance. By smartly stratifying cohorts (here non-smokers vs. smokers) we can gain better insight into variants contributing to the disease.
Peter Gregersen “Phenotype. phenotype. phenotype.” Focusing on rheumatoid arthritis, Gregersen emphasizes that often disease phenotypes can be very heterogeneous. We can increase our power and effect sizes by looking at more homogeneous phenotypes, such as molecular “endophenotypes” underlying the disease.
Judy Cho Cho also emphasizes, this time in IBD, that a key to GWAS success is identifying the correct molecular endophenotype. She identifies links between IBD and mycobacterial susceptibility.
Eric Green (keynote) Eric Green expresses optimism about the future of genomic medicine, and gives his 6 hottest areas upcoming areas of importance: (1) cancer genomics, (2) pharmacogenomics, (3) genomic medicine “test drive” programs, (4) prenatal and newborn genomic analysis, (5) clinical genomics information systems, and (6) ultra-rare disease genomics.
Matthew Maurano Deep sequencing of DNaseI hypersensitivity sites and other molecular phenotypes elucidates widespread allelic imbalance and shows cell type specific differences in the genome’s regulatory architecture.
Nancy Cox In order for us to accurately translate a variant into patient care, we will need to integrate data from a large number of sources, which by themselves have no clinical utility. Cox also presents a clever approach of using genotypes to predict gene expression, then associating expression levels directly with disease phenotypes to gain insight into key genes related to disease.
Simona Volpi The GTEx project is generating large datasets of genotype and expression data across many individuals in a large range of tissue types that will serve as a valuable resource for interpreting the effect of genetic variants.
Jeroen Huyghe Integrating GWAS hits with functional data will be a key to interpretation. GWAS SNPs from Type 2 Diabetes alter expression in relevant tissues of liver, pancreas, and skeletal muscle.
Howard Jacob “The future of common disease: phenotype with your apple watch, plus sequencing in the clinic”. Jacob claims the future of clinical sequencing lies in whole genome, rather than whole exome sequencing, which should capture a larger range of relevant variation.
Hreinn Stefannson (deCODE) Most association studies study the effect of a variant on individuals with a certain phenotype. However, this does not allow us to answer the question of penetrance. Here deCODE looks at how CNVs implicated in schizophrenia and autism behave in normal control samples. They find that these CNVs significantly affect cognitive function in controls, separate from the manifest disease.
David Goldstein De novo mutations are in clear excess in patients arriving at the genetics clinic. To evaluate the significance of these, it is important to have an accurate null model of de novo mutation.
Leslie Biesecker A key objection to using sequencing results in the clinic is that the penetrance of most variants is unknown. Currently, reported variants suffer from a high false positive rate. Biesecker claims that we have been underestimating the prior probability of disease, and that by more accurately estimating this we may be able to boost the positive predictive value of mutations returned by these tests.
David Hinds (23andMe) 23andMe have collected a large genetic dataset from a diverse set of Americans. Through online surveys, they achieve striking GWAS results for such traits as “do you cry easily” and “are you a morning person”.
Yaniv Erlich Using an awesome analysis of a 43 million person pedigree obtained using social media, Erlich showed that the overwhelmingly best model for the genetic architecture of longevity is an additive model.
Vineeta Agarwala Are “synthetic associations”, or rare variants on common haplotypes, driving GWAS signals? Agarwala finds no evidence of this at the 9p21 locus implicated in Type 2 Diabetes, suggesting the signal is due primarily to common variants.
Hilary Finucane Using a new statistical genetics metric, the LD score, we can more accurately measure the heritability of polygenic traits, and can partition heritability between different functional annotations.
Teri Manolio Manolio describes growing genomic medicine initiatives from around the world.
Tim Hubbard UK is sequencing 100,000 clinical grade genomes with a focus on rare inherited diseases, cancer, and pathogens. This project will be directly integrated with the health care system, and aims to develop important infrastructure for integrating sequencing into health care.
David Altshuler To fully understand the genomics of many diseases, we will need to look across hundreds of thousands or even millions of genomes. No single institute has the capacity for this, and a key element of progress will be to ensure interoperability of genomics data across many institutions worldwide. Unless we take action, genomics data will become “balkanized” in different silos. The Global Alliance seeks to advance interoperability of genomics data by developing important ethical and technical frameworks.
Steve Munger Many mouse studies are done using the Black 6 strain, and are found not to recapitulate human responses. However, this strain behaves like an outlier strain in many respects. At Jackson Laboratory, they are creating an outbred mouse strain from 8 founder strains to allow more complex studies of genotype-phenotype relationships.
Feng Zhang CRISPR-Cas9 allows precise genome-editing. It can be used for a huge variety of genomics applications, including generating genome-scale knockout libraries and editing specific genomic variants. Zhang’s group developed a cre-dependent Cas9 mouse, allowing cell-type specific Cas9 expression and genome editing.
Kiran Musunuru In associated genotype to phenotype, we need to address four key questions: (1) what is the causal variant? (2) How do we link the causal variant to a gene? (3) What is the causal gene? (4) How do we relate the causal gene to the phenotype? Musunuru is following up a GWAS hit for Type 2 Diabetes and HDL in KLF14 using a CRISPR screen to cover all of the variants at this locus.
Molly Przeworski (keynote) There are a lot of basic properties of the human mutation process that we still don’t understand. Recent studies have estimated the germline mutation rate using different methods, including looking at disease related genes, phylogenetic relationships, and deep sequencing trios, each of which gives different estimates. Przeworski presents a model accounting for mutation repair rates, paternal age, and generating time and shows how it could reconcile estimates from different methods.
Gil McVean Many germline mutations are not simple point mutations. How can we access the full spectrum of de novo mutations? McVean looks at plasmodium genomes using an assembly based approach to study all types of variation in a hypothesis free way.
Joanna Mountain (23andme) 23andme looks at the genetic landscape of individuals across the US. Using data from its large user-base, they can infer different ancestry components in diverse populations including Latinos, African Americans, and Americans of European descent.
Rebecca Rothwell Rothwell develops a method of mitochondrial transmission in humans. She shows that a “nucleoid model”, in which groups of 5-10 mitochondria replicate as a unit, fits best to current data.
Julie Segre Segre shows the use of PacBio and Illumina sequencing technologies to track outbreaks of hospital acquired infections.