Welcome and Intro
Modern artificial intelligence (AI) models can accurately predict patient progress, an individual's phenotype, or molecular events such as transcription factor binding. However, they do not explain why selected features make sense or why a particular prediction is made.
The eukaryotic cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescent images in the Human Protein Atlas with ongoing affinity purification experiments from the BioPlex resource to create a unified hierarchical map of eukaryotic cell architecture. Integration involves configuring each approach to produce a general measure of protein distance, then calibrating the two measures using machine learning. The evolving map, called the Multi-Scale Integrated Cell (MuSIC 1.0), currently resolves 69 subcellular systems of which approximately half are undocumented. Based on these findings we perform 134 additional affinity purifications, validating close subunit associations for the majority of systems. The map elucidates roles for poorly characterized proteins, such as the appearance of FAM120C in chromatin; identifies new protein assemblies in ribosomal biogenesis, RNA splicing, nuclear speckles, and ion transport; and reveals crosstalk between cytoplasmic and mitochondrial ribosomal proteins. By integration across scales, MuSIC substantially increases the mapping resolution obtained from imaging while giving protein interactions a spatial dimension, paving the way to incorporate many molecular data types in proteome-wide maps of cells.
Comparative functional genomics offers a powerful approach to study the evolution of complex traits by measuring and comparing large-scale molecular profiles, such as transcriptomes, epigenomes, and proteomes, across multiple species. These studies have been instrumental in advancing our understanding of the role of gene regulation in the evolution of complex traits and species-specific adaptations. With the availability of functional genomic profiles across species, a parallel goal is to develop computational approaches to analyse these datasets to reveal patterns of conservation and divergence at the molecular level across species. Identification of such patterns can be challenging in complex phylogenies with a large number of duplications and losses. Furthermore, interpretation of conservation and divergence patterns in the context of known biological processes and pathways can be difficult due to the lack of comprehensive functional annotation in less studied species. Here, we present a comparative study of plant proteomes comprising a novel dataset of six plant species from a diversity of land plant clades, including Arabidopsis thaliana, Medicago truncatula, Oryza sativa (rice), Physcomitrella patens, Solanum tuberosum (potato) and Zea mays (corn). To interpret the evolutionary patterns of conservation and divergence, we develop a novel analytical pipeline to analyse these data, which consists of (a) two multi-task clustering methods, Arboretum-Proteome for identification of modules of genes exhibiting similar protein levels and Muscari for defining modules of genes that have similar co-expression patterns, (b) identification of clade-specific gene sets exhibiting different phylogenetic patterns of conservation and divergence of protein levels, and (c) network-based diffusion to assess the association of different processes with the clade-specific gene sets.
Globally, we find that protein levels diverge according to phylogenetic distance but are more constrained than mRNA levels. Highly expressed proteins tend to be more conserved than less highly expressed proteins. Furthermore, gene duplication plays an important role in the divergence of protein modules. We also found that highly expressed gene modules are enriched in biological processes that are ubiquitously needed, whereas intermediately and less expressed gene modules are diverged and enriched in regulatory and environmental information-processing processes. Our clade-specific gene set analysis provides a fine-grained view of the evolutionary dynamics of proteomes at the level of sets of genes. Taken together, our approach offers a useful resource for evolutionary studies and is broadly applicable to comparative studies in a phylogeny with a large number of poorly studied species.
Obtaining a systems view of cell states and cell state transitions, by understanding transcriptional regulation, is one of the major challenges in the field of genome research. In this study we provide the bridge between the genome and the transcriptome for more than 80 cell types in the fly brain, by profiling chromatin accessibility and transcriptome at single-cell resolution. We have built a resource containing 240,000 single cells, spanning nine developmental time points from larval, pupal, and adult brains. We show that most cell types are characterised by a unique chromatin accessibility landscape, with thousands of specific accessible elements, summing to a total of 207,000 regulatory regions, plus a set of 60,000 enhancers that change during metamorphosis and are linked to specific developmental paths. Exploiting the powerful genetic tools in Drosophila, we generate transgenic enhancer-reporter flies and demonstrate that uniquely accessible regions are functional enhancers, thereby paving the way to the construction of specific genetic driver lines for each neuronal cell type. Finally, using a combination of machine learning methods, we explore the combinatorial transcription factor code for given cell types, and associate regulatory regions with candidate target genes, resulting in the first core gene regulatory networks in the fly brain uniting enhancer sequence, chromatin accessibility and gene expression.
High-throughput chromatin immunoprecipitation (ChIP)-based assays capture genomic regions associated with the profiled transcription factor (TF). These regions are usually a mixture of direct and indirect TF-DNA interactions. Our previous method DIVERSITY identifies different modes of protein-DNA binding from ChIP-seq genomic data. These modes show distinct evolutionary signatures and chromatin organization, suggesting functional diversity. ChIP-exo, a modified version of the ChIP-seq protocol, maps TF binding regions at higher resolution. It uses an exonuclease enzyme to digest the DNA strands from 5’ to 3’, causing the 5’ cuts to be concentrated near the binding sites. Thus, the profile of the 5’ cuts can be used to identify TF binding sites precisely. Like ChIP-seq, ChIP-exo captures both direct and indirect binding sites of the target protein; moreover, different protein-DNA complexes are expected to have different 5’ read distributions. Existing methods that identify putative binding modes either use known motifs to scan the sequences and then examine their read distributions, or refine an initial partitioning of the dataset based on either read distribution or motif discovery. These methods can miss less frequently occurring novel motifs or motifs that have no clear read distribution. We propose a statistical framework that learns a joint distribution over the reads and the nucleotides to identify different modes of protein-DNA binding. It uses no motif databases, but directly learns the distinct motifs and their read profiles from the data. As expected in GR ChIP-exo data, our method finds motifs for GR as well as its cofactors. Interestingly, for most TFs that we investigated, we see small nucleotide differences within the motifs, which are associated with different read profiles, suggesting that the structure of the complex formed at these modes is indeed different.
Furthermore, the discovered motifs have distinct evolutionary signatures, which correlate with the associated read distributions. Finally, our method is general enough to be run with any kind of positional data, and therefore its applicability goes beyond ChIP-exo experiments.
References:
1. S. Mitra, A. Biswas, and L. Narlikar. PLoS Comput. Biol. 14(4):e1006090, 2018.
2. H. S. Rhee and B. F. Pugh. Cell 147(6):1408–1419, 2011.
3. S. R. Starick, et al. Genome Research 25(6):825–835, 2015.
4. N. Yamada, et al. Bioinformatics 35(6):903–913, 2019.
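The idea of a joint distribution over reads and nucleotides can be illustrated with a toy Bayesian classifier over binding modes: each mode contributes a PWM likelihood for the sequence and a multinomial likelihood for the positional 5’ read counts, and the two combine into a posterior over modes. This is an illustrative sketch only, not the published method; in the actual framework the per-mode motifs and read profiles are learned from data, whereas here they are fixed, and all function names are hypothetical.

```python
import numpy as np

def seq_loglik(seq, pwm):
    """Log-likelihood of a one-hot encoded sequence (L x 4) under a PWM (L x 4)."""
    return float(np.sum(seq * np.log(pwm)))

def read_loglik(reads, profile):
    """Multinomial log-likelihood (up to a constant) of positional 5' read
    counts under a mode-specific read profile."""
    return float(np.sum(reads * np.log(profile)))

def mode_posterior(seq, reads, pwms, profiles, priors):
    """Posterior over binding modes for one region, combining the sequence
    evidence (PWM) with the read-distribution evidence (profile)."""
    logp = np.array([
        np.log(priors[k]) + seq_loglik(seq, pwms[k]) + read_loglik(reads, profiles[k])
        for k in range(len(priors))
    ])
    logp -= logp.max()                  # numerical stability before exponentiating
    p = np.exp(logp)
    return p / p.sum()
```

In a full method these parameters would be re-estimated inside an EM-style loop; the point here is only that sequence and read evidence enter one shared objective.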
Take a break and get some fresh air, visit the Poster Hall or stop by Café Connect to network with other attendees.
The growing availability of hundreds of different functional genomic assays across thousands of individuals presents an exciting opportunity to understand the inner workings of biological systems and to identify molecular causes of disease. Toward this goal, machine learning (ML) provides a powerful set of tools to integrate diverse datasets, uncovering hidden structure that can reveal how different layers of biological systems relate to each other. However, to harness the power of ML for biology, we need to be able to tune it to distinguish meaningful structure from structure that arises from artifacts and noise. In this talk, I’ll present machine learning approaches recently developed by my lab for leveraging heterogeneous data and prior knowledge to guide discovery of meaningful biological structure. In particular, I will first describe our deep learning approach (AI-TAC) to combining a large compendium of epigenomic data, in order to learn the relationship between non-coding sequence and regulatory activity across the immune system (Yoshida et al., Cell 2019; Maslova et al., PNAS 2020). I will describe how we robustly interrogate this model to gain mechanistic insights into how the non-coding genome implements cell-type-specific gene regulation. I will then focus on the challenging task of understanding molecular causes of complex disease. Here, I will describe our robust ML techniques for revealing mechanistic insights into Alzheimer’s disease by combining large and heterogeneous gene expression datasets from the brain (Mostafavi et al., Nat Neur 2018).
Metastatic progression is the primary cause of death in most cancers, yet the regulatory dynamics driving the cellular changes necessary for metastasis remain poorly understood. Multi-omics approaches hold great promise for addressing this challenge; however, current analysis tools have limited capabilities to systematically integrate transcriptomic, epigenomic and cistromic information to accurately define regulatory networks critical for metastasis. To address this limitation, we used a purposefully generated cellular model of colon cancer invasiveness to generate multi-omics data (expression, accessibility, and select histone modification profiles) for increasing levels of invasiveness. These data were analyzed using a rigorous probabilistic framework for joint inference from the resulting heterogeneous data, along with transcription factor binding profiles. Our approach used probabilistic graphical models to leverage the functional information provided by specific epigenomic changes, model the influence of multiple transcription factors simultaneously, and automatically learn the activating or repressive roles of cis-regulatory events. Global analysis of these relationships revealed key transcription factors driving invasiveness, as well as their likely target genes. Disrupting expression of one of the highly ranked transcription factors, JunD, an AP-1 complex protein, confirmed its functional relevance to colon cancer cell migration and invasion. Transcriptomic profiling confirmed key regulatory targets of JunD, and a gene signature derived from the model demonstrated strong prognostic potential in TCGA colorectal cancer data. Our work sheds new light on the complex molecular processes driving colon cancer metastasis and presents a statistically sound integrative approach to analyzing multi-omics profiles of a dynamic biological process.
Population studies such as genome-wide association studies (GWAS) have identified many genomic variants associated with human diseases. To further understand the potential mechanisms of disease variants, recent statistical methods associate functional omic data (e.g., gene expression) with genotype and phenotype and link variants to genes. However, interpreting molecular mechanisms from such associations remains challenging. To address this problem, we developed an interpretable deep learning method, Varmole, to simultaneously reveal genomic functions and mechanisms while predicting phenotype from genotype. Varmole embeds multi-omic networks into a deep neural network architecture and prioritizes variants, genes, and regulatory linkages via biological drop-connect without needing prior feature selection. In particular, Varmole predicts disease phenotypes from genotype and gene expression data. Its transparent layer models the regulatory mechanisms of (1) transcription factors (TFs) to genes as gene regulatory networks (GRNs) and (2) SNPs to gene expression via eQTLs. This part defines the biological architecture of Varmole, rather than a fully connected “black box”. Further, Varmole regularizes the neural network architecture via drop-connect, enabling interpretable ranking of the importance of SNPs, genes, and their links for prediction, and addressing the overfitting caused by small numbers of samples versus large numbers of features (e.g., SNPs, genes). Varmole is also scalable, enabling implicit feature selection via Lasso regularization and accepting input data with continuous values. Finally, Varmole scores each SNP, gene, or SNP-gene pair for its importance in predicting the phenotype. We applied Varmole to population-level human brain data in PsychENCODE for predicting schizophrenia from SNP genotype dosage and RNA-seq gene expression. We found that Varmole outperforms other state-of-the-art methods (accuracy = 0.77).
In addition, we used an integrated gradient-based method to prioritize top genes for predicting schizophrenia and found that they are enriched with many known functions and pathways in schizophrenia, such as neuron development, axon guidance, cell adhesion, calcium signaling, response to external stimulus, NMDA receptor, and insulin secretion. Also, we overlapped the SNP-gene pairs with interacting enhancers and promoters from Hi-C data of the human brain. We found that the overlapping SNP-gene pairs have significantly higher importance scores than the rest (p<5e-5). This suggests distal regulatory roles for the SNPs prioritized by Varmole, implying potential novel schizophrenia-associated regulatory pathways linking risk variants and enhancers to genes. Varmole is open source and available at: https://github.com/daifengwanglab/Varmole. This work has been accepted by Bioinformatics.
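The "biological drop-connect" idea described above can be sketched as a linear layer whose weight matrix is element-wise masked by a prior biological adjacency (e.g. eQTL SNP-to-gene or GRN TF-to-gene links), so each output unit only receives input from its biologically supported predecessors. This is a minimal numpy illustration of the concept, not Varmole's actual implementation; the class name and shapes are hypothetical.

```python
import numpy as np

class MaskedLinear:
    """Linear layer whose connections are restricted to a prior biological
    adjacency mask (e.g. eQTL SNP->gene or GRN TF->gene links).
    Masked-out weights are zeroed on every forward pass, so the layer stays
    interpretable: each output unit only sees its known inputs."""

    def __init__(self, mask, seed=0):
        rng = np.random.default_rng(seed)
        self.mask = mask.astype(float)            # (n_in, n_out) binary prior
        self.W = rng.normal(scale=0.1, size=self.mask.shape)
        self.b = np.zeros(self.mask.shape[1])

    def forward(self, x):
        # drop-connect by masking: unsupported links contribute exactly zero
        return x @ (self.W * self.mask) + self.b
```

Because unsupported weights are always zero, perturbing an input with no prior link to an output cannot change that output, which is what makes importance attribution over the remaining links biologically interpretable.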
Single-cell RNA sequencing (scRNA-seq) has emerged as a dominant tool for characterizing the transcriptional states of individual cells in diverse biological systems. Such datasets can also be used to infer a temporal ordering of dynamic cellular states, or cellular trajectories. However, current computational tools for analyzing scRNA-seq data are often unable to infer coherent cellular trajectories in heterogeneous cellular compartments with complex dynamics. In part, this is because current methods for trajectory inference rely on unbiased dimensionality reduction techniques. Such biologically agnostic ordering of cells can prove difficult for modeling complex developmental pathways, where cells utilizing concurrent transcriptional regulatory modules, such as those controlling cell cycle, metabolism and differentiation, may confound trajectory inference. Furthermore, dynamic biological compartments can result in sparse sampling of key intermediate cell states. These scenarios are especially pronounced in the dynamic immune responses of innate and adaptive immune cells. To overcome these limitations, we introduce a supervised machine learning framework, called Pseudocell Tracer, which infers trajectories in pseudospace rather than in pseudotime. This enables us to map out a surface or manifold that predicts the cell states that a cell can occupy during a process. Notably, Pseudocell Tracer is capable of inferring cellular trajectories in complex systems by integrating prior biological knowledge. In contrast, existing computational tools typically analyze scRNA-seq datasets without reference to any of the underlying biology of the system that generates the data. We demonstrate that use of prior knowledge of the underlying biological system aids in the extraction of obscured information from scRNA-seq datasets, especially in the context of modeling a complex process.
Pseudocell Tracer uses a supervised encoder, trained with adjacent biological knowledge, to project scRNA-seq data into a low-dimensional manifold. A generative adversarial network (GAN) is then used to simulate pseudocells at regular intervals along a virtual cell-state axis. We demonstrate the utility of Pseudocell Tracer by modeling B cells undergoing immunoglobulin class switch recombination (CSR) during a prototypic antigen-induced antibody response. Our results reveal an ordering of key transcription factors regulating CSR, including the concomitant induction of Nfkb1 and Stat6 prior to the upregulation of Bach2 expression. Furthermore, the expression dynamics of genes encoding cytokine receptors point to the existence of a regulatory mechanism that reinforces IL-4 signaling to direct CSR to the IgG1 isotype. This framework is potentially applicable to single-cell data from many other fields with complex dynamics.
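The pseudocell-generation step can be sketched as sweeping one latent coordinate over a virtual cell-state axis at regular intervals and decoding each latent point back to expression space. Here a plain decoder function stands in for the trained GAN generator; the function name and the linear toy decoder in the usage test are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def simulate_pseudocells(decoder, axis_range, n_steps, latent_rest):
    """Generate pseudocells at regular intervals along a virtual cell-state
    axis: sweep the first latent coordinate over axis_range while holding
    the remaining latent coordinates fixed, then decode each latent point
    to expression space. `decoder` stands in for the trained generator
    (a GAN in the described framework)."""
    positions = np.linspace(axis_range[0], axis_range[1], n_steps)
    latents = np.column_stack([positions,
                               np.tile(latent_rest, (n_steps, 1))])
    return positions, np.array([decoder(z) for z in latents])
```

Ordering genes by where along the axis their decoded expression rises or falls is what yields statements like "Nfkb1 and Stat6 are induced before Bach2".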
Cancer treatment has been revolutionized by recent advances in immunotherapy. However, even state-of-the-art methods, such as immune checkpoint inhibition, have succeeded in only a fraction of cancer patients. This limitation stems from the fact that current knowledge of the tumor microenvironment has only touched the surface of the complex tumor-host interactions. Growing evidence supports the hypothesis of bidirectional causal effects between driver events in the cancer genome and the patient’s immune response against cancer cells. The two well-established phenomena shedding light on these reciprocal interactions are immunoediting and immune-evasion mechanisms: the former may determine how the immune system affects the sub-clonal selection of driver mutations, and the latter may explain the effect of oncogenic mechanisms on the immune contexture of the tumor microenvironment. Studies focusing on the relationships between cellular mechanisms and the tumor immune landscape will push the field of immunotherapy forward by helping researchers better understand heterogeneity among patients responding to immunomodulation and by paving the way towards the identification of novel molecular biomarkers of prognosis and therapy response. To address these challenges, we have developed a machine-learning framework to integrate tumor microenvironmental and immune profiles inferred from bulk transcriptomes with patient clinical information and somatic alterations in cancer genomes. Systematic analysis of multiple cancer types reveals dozens of combinations of driver mutations and immune cell infiltrates that define novel infrequent subgroups of high-risk tumors. Prognostic models integrating these genomic and tumor microenvironmental signals outperform baseline models of clinical variables and can be validated computationally using additional tumor cohorts.
We can also learn about the underlying biology of these high-risk tumors by analyzing transcriptomic and pathway-level differences. For instance, in hepatocellular carcinoma, high infiltration of monocytes coupled with driver mutations in TP53 indicates poor patient prognosis, and the corresponding group of high-risk tumors is characterized by transcriptome-wide differences spanning hundreds of genes involved in cell cycle deregulation. Our methodology and the resulting observations will serve future studies in gaining a system-level understanding of the crosstalk between tumors and the immune system required to advance immunotherapy development, and may inform the development of prognostic and predictive molecular biomarkers for precision medicine approaches across multiple cancer types.
We are exploring genomic regulatory codes and transcriptional circuits that control distinctive mammalian cellular states and their dynamics. Our experimental models include B cells of the adaptive immune system that are a featured cell state within the ENCODE project.
Gene regulatory networks (GRNs) are widely utilized in systems biology, but their explanatory power remains to be fully harnessed by integrating mathematical modeling and computational genomics with experimental testing. We are undertaking such an analysis of a GRN that controls exceptional B cell fate dynamics. Upon sensing pathogens, B cells undergo bifurcating trajectories, initiated by the transcription factors (TFs) IRF4 and IRF8, that converge to generate germinal center (GC)-independent or GC-dependent plasma cells. Combining mechanistic modeling with machine learning, we uncover a dominant parameter in the GRN that underlies this unusual emergent property. This prediction has been experimentally tested and resulted in the discovery of a novel feedforward loop that reinforces the GC trajectory. Collectively, this generalizable strategy illustrates the power of combining mechanistic modeling, machine learning and computational genomics with experimental testing to uncover key design principles underlying GRNs.
To date, cis-regulomes underlying cell type-specific mammalian genomic states have been analyzed by structure-based chromatin profiling. By coupling FAIRE-seq with STARR-seq, we assemble the first functionally filtered cis-regulome for a mammalian cell type, the activated B cell. In contrast with accessible chromatin regions not associated with enhancer activity, functional enhancers are preferentially occupied in vivo, in a combinatorial manner, by canonical B-lineage-determining transcription factors (TFs). Active enhancer sequences are resolved from accessible but inactive chromatin regions by the covariance of a diverse set of TF motifs. Hi-C demonstrates enrichment of multiplex, activated enhancer-promoter configurations dominated by long-range interactions. The functionally integrated cis-regulome reveals TF codes for pathway-specific enhancers. The overall framework can be readily extended to diverse mammalian cell types.
Visit the Poster Hall and explore the collection of scientific research. Listen to poster presentations, examine the posters, and visit the presenter's table for a live conversation with the presenter.
Welcome and Intro
A powerful method to study the genotype-to-phenotype relationship is the systematic assessment of mutant phenotypes using genetically accessible model systems. We have developed and applied methods for quantitative analysis of genetic interactions in double mutants using yeast colony size as a proxy for cell fitness. Our global digenic interaction network reveals a hierarchy of functional modules, including pathways and complexes, bioprocesses and cell compartments. Recently, we have leveraged the principles about genetic networks that we discovered in yeast to map genetic interactions in human HAP1 cells using genome-wide CRISPR/Cas9 screens. Our yeast work guided our selection of query genes to screen and provided a road-map for extraction of functional information from the resulting data. The interactions screened to date include more than 85% of the genes in the human genome that are expressed in HAP1 cells, and as was observed in yeast, interaction profile similarity is highly predictive of gene function. I will describe our results in the context of our ongoing efforts to discover the principles of genetic networks in yeast and apply what we learn to understand the functional organization of the human genome.
Recent developments in spatially resolved, multiplexed transcriptome profiling promise to further our understanding of cell-type composition and the spatial relationships of cells in complex tissues. However, to realize this potential, new computational models are urgently needed to capture the unique properties of spatial transcriptome data for single cells. Here we develop a new framework, named SpiceMix, which significantly advances current methodology for the analysis of spatial transcriptome data by effectively integrating both spatial information and gene expression of single cells. Underlying SpiceMix is a novel probabilistic model for the spatial transcriptome that integrates state-of-the-art models of single-cell and spatial data analysis, namely nonnegative matrix factorization (NMF) and the hidden Markov random field (HMRF). In SpiceMix, each cell is modeled as a node in the HMRF, with edges connecting and constraining neighboring cells. To integrate the NMF formulation of gene expression, the hidden state of each node is defined as the latent-variable representation of the cell via NMF. A learned affinity matrix captures the correlation of latent variables between neighboring cells and enforces spatial coherence of cell types. Importantly, the parameters of both the NMF and the HMRF are learned simultaneously, by an effective alternating optimization method, allowing each aspect of the model to refine the other and leading to a comprehensive framework of spatial gene expression. To demonstrate the effectiveness of SpiceMix, we designed simulated spatial transcriptomic datasets that model the mouse cortex and compared the results from SpiceMix with those of state-of-the-art methods for single-cell and spatial transcriptome analysis. Our results demonstrate that SpiceMix consistently improves upon the inference of intrinsic cell types compared with other methods by learning spatial enrichment patterns in the data.
As a proof-of-principle, we further used SpiceMix to analyze spatial transcriptome data of the mouse primary visual cortex acquired by two spatial transcriptomic methods, seqFISH+ and STARmap. We found that SpiceMix refined the cell assignments of the original studies and achieved dramatic improvement over other competing methods. SpiceMix revealed several cell subtypes with strong spatial enrichment that were missed by existing methods. Taken together, we believe that SpiceMix is a new generalizable, unsupervised framework for analyzing spatial transcriptome data with the potential to provide critical new insights into the composition and heterogeneity of the spatial organization of cells.
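The alternating optimization of a matrix-factorization term coupled with a spatial-coherence term can be sketched in miniature. This toy uses a simple quadratic penalty tying neighboring cells' latent states and projected gradient steps, rather than SpiceMix's actual HMRF with a learned affinity matrix; the function name and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def fit_spatial_nmf(X, edges, k, lam=1.0, iters=200, lr=1e-2, seed=0):
    """Toy alternating minimization of a spatially regularized NMF objective:
        ||X - W Z||^2 + lam * sum_{(i,j) in edges} ||z_i - z_j||^2
    where X is genes x cells, W is genes x k, Z is k x cells, and `edges`
    lists pairs of spatially neighboring cells. Projected gradient steps
    keep W and Z nonnegative. A stand-in for SpiceMix's joint NMF/HMRF
    inference, not the published algorithm."""
    rng = np.random.default_rng(seed)
    g, n = X.shape
    W = rng.random((g, k))
    Z = rng.random((k, n))
    for _ in range(iters):
        R = W @ Z - X                           # reconstruction residual
        gW = R @ Z.T                            # gradient w.r.t. W
        gZ = W.T @ R                            # gradient w.r.t. Z
        for i, j in edges:                      # spatial coherence term
            d = Z[:, i] - Z[:, j]
            gZ[:, i] += lam * d
            gZ[:, j] -= lam * d
        W = np.maximum(W - lr * gW, 0.0)        # projection to nonnegativity
        Z = np.maximum(Z - lr * gZ, 0.0)
    return W, Z
```

The key structural point mirrored from the abstract is that the expression model (W, Z) and the spatial model (the neighbor penalty) are updated within one loop, so each refines the other.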
The simplicity and low cell-number requirements of the assay for transposase-accessible chromatin using sequencing (ATAC-seq) have made it the standard method for detection of open chromatin. Moreover, careful consideration of the digestion events of open chromatin protocols with computational footprinting (Gusmao et al. 2016; Li et al. 2019) allows the detection of transcription factor binding sites and the activity level of transcription factors in particular cell types. The combination of ATAC-seq with single-cell sequencing (scATAC-seq) allows the characterization of the open chromatin status of thousands of single cells from healthy and diseased tissues. A major drawback of scATAC-seq is the so-called dropout events, i.e. open chromatin regions with no reads due to loss of DNA material during the scATAC-seq protocol. Our estimates indicate that dropout events affect at least 50% of open chromatin sites. Therefore, scATAC-seq data only contain a partial picture of the open chromatin status of single cells, which greatly impairs its computational analysis. We describe scOpen, a computational method for estimation of the open chromatin status of single cells from scATAC-seq experiments (Li et al. 2020). We demonstrate that scOpen-estimated scATAC-seq matrices improve computational analyses of scATAC-seq, including clustering of cells, inference of regulatory players and chromatin conformation. Next, we describe scHINT, which allows prediction of transcription factor activity in groups of single cells. We are particularly interested in the detection of cellular changes during the onset of fibrosis (Kramann et al., 2015; 2016). We generated a complex scATAC-seq dataset with 30,000 high-quality cells across the majority of kidney cell types in homeostasis, early and late fibrosis. We detect fibrosis-driving cell types and regulatory features driving cellular proliferation and (de-)differentiation of fibrosis-driving cells at unprecedented resolution.
This case study on a complex and sparse dataset further supports the power of scOpen and of footprinting analysis with scHINT for the detection of regulators in scATAC-seq data.
References:
Gusmao EG, et al. Nat. Methods. 2016;13(4):303–309.
Li Z, et al. Genome Biology. 2019;20(1):45.
Li Z, et al. bioRxiv. 2019.
Kramann R, et al. Cell Stem Cell. 2016;19(5):628–642.
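The intuition behind estimating open-chromatin status despite dropout can be sketched with a low-rank reconstruction of the binarized region-by-cell matrix: a zero in a cell whose similar cells share an open region gets pulled toward one, while zeros in truly closed blocks stay near zero. This truncated-SVD stand-in is not the published scOpen model (which is based on regularized matrix factorization); the function name and rank choice are assumptions for illustration.

```python
import numpy as np

def impute_low_rank(counts, rank):
    """Estimate open-chromatin status behind dropout events by low-rank
    reconstruction: truncated SVD of the binarized region-by-cell matrix.
    Zeros that co-occur with similar cells' open regions are pulled up,
    giving probability-like openness scores in [0, 1]."""
    B = (counts > 0).astype(float)              # observed accessibility
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.clip(approx, 0.0, 1.0)
```

The usage test below builds two cell groups with block-structured accessibility and a single dropout zero inside one block; the reconstruction scores that entry well above a truly closed entry.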
Molecular networks with intricate circuit designs and interlocking feedback loops govern cellular dynamics and cell fate decisions. A central task in systems biology is to reconstruct the network structure and governing equations. The goal of this study is to reconstruct genome-wide, ordinary-differential-equation-based governing equations from single-cell RNA-seq data by integrating single-cell measurements and computational analyses. From the transcriptional data (x) and derived instantaneous time derivatives (dx/dt, called RNA velocity), we developed a procedure for learning the analytical form of the vector field F(x) and the equation dx/dt = F(x) in a Reproducing Kernel Hilbert Space. Experimentally, we further adapted single-cell metabolic RNA labeling (scSLAM-seq) for more accurate RNA velocity estimation. We applied the procedure to several published datasets and our own scSLAM-seq data on human HL60 differentiation. We extracted topological features of the vector fields of these systems, such as fixed points and separatrices. Analyzing the corresponding Jacobian fields revealed regulatory relations between genes. Our memory-seq studies confirm that the vector field reliably predicts cell state evolution over long times. The developed method of single-cell vector field reconstruction opens a new direction for systems biology studies that reveals coupling between different regulatory modules.
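Learning F(x) from paired states and velocities in an RKHS amounts, in its simplest form, to kernel ridge regression of dx/dt on x, which yields an analytical vector field as a finite kernel expansion F(x) = Σᵢ k(x, xᵢ) αᵢ. The sketch below is a generic illustration under that assumption, not the authors' implementation; the kernel choice, hyperparameters, and function names are hypothetical.

```python
import numpy as np

def rbf(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between row-wise point sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_vector_field(X, V, gamma=1.0, ridge=1e-6):
    """Kernel ridge regression of velocities V (rows of dx/dt) on states X
    (rows of x), giving an analytical vector field
        F(x) = sum_i k(x, x_i) alpha_i
    in the RKHS induced by the kernel. Returns F as a callable."""
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X)), V)
    return lambda Xq: rbf(np.atleast_2d(Xq), X, gamma) @ alpha
```

Because F has a closed kernel-expansion form, quantities such as the Jacobian (and hence fixed points and gene-gene regulation signs) can be obtained analytically by differentiating the kernel; here we only check that the fitted field reproduces the training velocities.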
Hematopoietic stem and progenitor cells (HSPCs) in the bone marrow are derived from a small population of hemogenic endothelial (HE) cells located in the major arteries of the mammalian embryo. HE cells undergo an endothelial-to-hematopoietic transition, giving rise to HSPCs that accumulate in intra-arterial clusters (IAC) before colonizing the fetal liver. To examine the cell and molecular transitions between endothelial (E), HE, and IAC cells, and the heterogeneity of HSPCs within IACs, we profiled ∼40,000 cells from the caudal arteries (dorsal aorta, umbilical, vitelline) of 9.5 days post coitus (dpc) to 11.5 dpc mouse embryos by single-cell RNA sequencing and single-cell assay for transposase-accessible chromatin sequencing. We identified a continuous developmental trajectory from E to HE to IAC cells, with identifiable intermediate stages. The intermediate stage most proximal to HE, which we term pre-HE, is characterized by increased accessibility of chromatin enriched for SOX, FOX, GATA, and SMAD motifs. A developmental bottleneck separates pre-HE from HE, with RUNX1 dosage regulating the efficiency of the pre-HE to HE transition. A distal candidate Runx1 enhancer exhibits high chromatin accessibility specifically in pre-HE cells at the bottleneck, but loses accessibility thereafter. Distinct developmental trajectories within IAC cells result in two populations of CD45+ HSPCs: an initial wave of lymphomyeloid-biased progenitors, followed by precursors of hematopoietic stem cells (pre-HSCs). This multiomics single-cell atlas significantly expands our understanding of pre-HSC ontogeny.
Take a break and get some fresh air, visit the Poster Hall or stop by Café Connect to network with other attendees.
Biological processes, including those involved in immune response, disease progression and development, are often dynamic. To fully understand and reconstruct regulatory and signaling networks inside human cells requires the collection, analysis and integration
Recent advances in genome editing, driven by the discovery and development of the CRISPR-Cas9 system, have significantly reduced the cost of precisely modifying genomic sequence within living cells; however, optimal use of these methods requires knowing precisely what edits will yield a desired effect while minimizing unintended consequences. We propose a method, named Ledidi, for designing edits that induce a desired functional landscape. Ledidi phrases the design task as an explicit optimization problem where the goal is to identify a compact set of edits that results in the desired functional profiles according to a predictive model. This predictive model can be any pre-trained machine learning model: in this work, we chose Basenji because the model provides a detailed output and, thus, fine-grained control over the design process. A difficulty in this optimization problem is that genomic sequence is discrete; however, we overcome this difficulty by using the Gumbel-softmax reparameterization trick, which enables standard gradient descent methods to be used on discrete inputs. An important distinction between Ledidi and previous works is that Ledidi does not design entire sequences but rather a small set of edits, and, because Ledidi is model-free, does not require the training of a machine learning model to work well. We first validated Ledidi by knocking out and knocking in CTCF binding. When applied to 53 CTCF binding sites, we found that Ledidi proposed 3.04 edits on average per locus, primarily on the most conserved nucleotides in the CTCF motif. According to Basenji, these edits reduced predicted CTCF binding signal from a median fold-change of 74.2 to 5.2. Similarly, when trying to induce CTCF binding at regions without a CTCF motif, Ledidi's edits resulted in an increase of predicted signal from a median fold-change of 2.4 to 63.7. 
Although CTCF is an ideal candidate for initial validation because the protein binds to a known motif, a more compelling use case is when reasonable edits to produce an outcome are not known in advance. Intriguingly, we found that Ledidi was capable of inducing cell-type specific binding of JUND between GM12878 and h1-hESC. In this evaluation, Ledidi edited a sequence that initially exhibited binding in both cell types such that it only exhibited binding in GM12878, reducing the median signal in h1-hESC from 13.2 to 5.6 while preserving the signal in GM12878. Taken together, the results indicate that Ledidi is a powerful tool to design compact sets of edits. Co-authors
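The discrete-sequence trick mentioned above can be illustrated in isolation. The snippet below shows only the Gumbel-softmax relaxation that makes a one-hot DNA sequence differentiable; the full Ledidi procedure additionally couples this to a pre-trained predictor (e.g. Basenji) and an edit-count penalty, which are omitted here.

```python
# Sketch of the Gumbel-softmax relaxation: per-position logits over the four
# bases, plus Gumbel noise and a temperature-scaled softmax, yield a nearly
# one-hot but differentiable sequence representation that gradient descent
# can optimize. Sequence length and temperature are illustrative.
import numpy as np

def gumbel_softmax(logits, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                    # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.zeros((10, 4))          # 10 bp sequence, uniform over A/C/G/T
relaxed = gumbel_softmax(logits)
print(relaxed.shape)                # (10, 4); each row sums to 1
```

Lowering tau pushes each row toward a hard one-hot base call while keeping the sampling step differentiable with respect to the logits.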
RNA conformational switches have significant impacts on biological processes and diseases. An emerging genetic factor in RNA conformational switching is a new class of single nucleotide variant (SNV), the riboSNitch. Identification of riboSNitches is notably difficult, as the signals of RNA structural disruption are often subtle. Here, we introduce a new method, named RiboSNitch Predictor based on Robust Analysis of Pairing probabilities (Riprap), for large-scale riboSNitch identification. Compared to previous approaches, Riprap shows higher accuracy in identifying known riboSNitches captured by various experimental RNA structure probing methods, including the parallel analysis of RNA structure and the selective 2’-hydroxyl acylation analyzed by primer extension. Furthermore, Riprap detects a pain-associated riboSNitch that regulates human catechol-O-methyltransferase haplotypes and outputs structurally disrupted regions precisely at base resolution. Riprap provides a new approach to interpreting disease-related genetic variants and is freely available at https://github.com/ouyang-lab/riprap. In addition, we construct a database (RiboSNitchDB) that integrates the annotation and visualization of known riboSNitches and those predicted from the human expression quantitative trait loci derived from the GTEx Portal. RiboSNitchDB is a useful resource for future studies of RNA regulatory functions and can be accessed via https://people.umass.edu/ouyanglab/ribosnitchdb. Co-authors
Transcription factors (TFs) bind genomic DNA in a sequence-specific manner to regulate gene expression. In-vitro TF-DNA binding assays can measure the intrinsic DNA binding affinity of individual TFs. Complementary in-vivo TF binding experiments can profile genome-wide TF occupancy, which is influenced by several factors besides intrinsic sequence specificity. Deep learning models can accurately map genomic DNA sequence to in-vivo TF binding profiles and predict effects of sequence mutations on binding occupancy. However, it has been difficult to provide biophysical interpretations of these predictions. Here, we show that neural networks trained to model base-resolution genomic occupancy profiles of TFs in yeast and humans can predict effects of sequence variation in and around core binding sites that are remarkably correlated with corresponding binding energy measurements from in-vitro experiments. We show that the models can learn exquisitely detailed motif-flanking sequence preferences of paralogous TFs as well as effects of repetitive motif-flanking sequences on occupancy and affinity. In yeast, models of Pho4 and Cbf1 are able to rank over a million synthetic sequences containing a high-affinity core E-box motif (CACGTG) with systematic sequence variations in the +/-5 bp flanking the motif. In the human A549 cell line, models of the glucocorticoid receptor (NR3C1) are able to rank hundreds of synthetic sequences containing the NR3C1 consensus motif with random sequence variation at one or two random positions in the motif. The rankings from the models had strong agreement (R>0.9) with corresponding relative binding affinities for the same libraries of sequences estimated using in-vitro microfluidic experiments. We also find that binding affinity makes a much stronger contribution to the genomic occupancy signal of TFs in yeast than to occupancy profiles of TFs in humans. 
Our results indicate that with appropriate correction of experimental biases, deep learning models can learn to extract thermodynamic affinities de-novo from genomic occupancy profiles. This unique biophysical interpretation of predictions of deep learning oracles of genomic TF occupancy opens a new avenue to perform massive in-silico perturbation experiments to comprehensively decipher the influence of sequence context and variation on intrinsic affinity and in-vivo occupancy. Co-authors
Over 90% of somatic mutations in cancer genomes lie in non-coding DNA, where binding sites for transcription factors (TFs) are also located. While these sites show an enrichment for somatic mutations [1,2], the mechanism for this enrichment is disputed. Some studies propose that TFs bind to damaged DNA and act as roadblocks for DNA repair, increasing the local mutation rate; other studies point to transcription initiation as the source of the enrichment. A robust understanding of mutagenesis in TF-binding sites is vital to identify non-coding driver mutations. Enrichment of binding site mutations has been observed for several TFs in individual cancers [3,4], but these previous analyses suffer from major limitations. First, they assume that TF recognition of damaged DNA is identical to that of Watson-Crick DNA. Second, they rely on using ChIP-seq and/or chromatin accessibility data in related cell types/tissues to determine bound regions, as in-vivo data is rarely available for tumors. Thus, the choice of datasets, as well as binding site-calling methods, significantly affects mutation profiles. Finally, previous studies point to potential mechanisms, but do not provide a clear path for experimental validation. We present novel experimental methods and principled computational approaches to examine mutagenesis in TF-binding sites. To test the roadblock hypothesis, we need to map TF binding to damaged DNA. Since most somatic mutations come from DNA mismatches resulting from replication errors, we built a novel high-throughput assay to measure TF binding to mismatched DNA. We show that this binding is not captured by Watson-Crick-based models, underlining the importance of our assay. Furthermore, we present unbiased approaches for determining mutation profiles for TF-binding sites. During tumor development, the mutational and TF-binding landscapes change continuously. 
By carefully benchmarking accessibility and ChIP-seq data from cell lines, tissues and primary tumors, we build high-confidence ‘binding regions’ that account for these dynamics and are robust to the choice of data. In these regions, we identify TF-binding sites using new computational techniques developed in the lab. Finally, by determining highly-expressed TFs in different tumors, we build robust mutation profiles for TF-binding sites in 11 cancer types. The combination of experimental data and robust mutation profiles allows us to determine TF-binding sites with specific enrichment patterns and tie this to biological mechanisms that can be validated experimentally. 1. Sabarinathan et al., Nature (2016); 2. Perera et al., Nature (2016); 3. Androva et al., Nature Communications (2020); 4. Katainen et al., Nature (2015); 5. Afek et al., Nature (in press). Co-authors
The clinical application of next generation sequencing and associated computational analyses in the setting of pediatric CNS cancers at Nationwide Children’s Hospital has provided a rich data set for exploring changes in the tumor immune microenvironment. Our interests lie in understanding these changes over time, as primary cancers that were removed by gross total resection recur. As the resulting data inform these interests, we can formulate therapeutic approaches that may address the infiltration of immune cell subsets, reversing the immunosuppressive environment and enabling T cell killing. Further understanding of immune infiltrates from the peripheral circulation, coupled with the ability to detect them in the interval between primary tumor removal by surgery and recurrence may permit early detection of the recurrent tumor as well.
Visit Café Connect to network with other attendees. Ask additional questions at the themed RSG or DREAM Challenge tables, stop by one of our other topic tables or pop into an open networking table and create your own discussion.
Welcome and Introductory Remarks
Identification of pregnancies at risk of preterm birth (PTB), the leading cause of newborn deaths, remains challenging given the syndromic nature of the disease. We report a longitudinal multi-omics study coupled with a DREAM challenge to develop predictive models of PTB. We found that whole blood gene expression predicts ultrasound-based gestational ages in normal and complicated pregnancies (r=0.83), as well as the delivery date in normal pregnancies (r=0.86), with an accuracy comparable to ultrasound. However, unlike the latter, transcriptomic data collected at <37 weeks of gestation predicted the delivery date of one third of spontaneous (sPTB) cases within 2 weeks of the actual date. Based on samples collected before 33 weeks in asymptomatic women, we found expression changes preceding preterm prelabor rupture of the membranes that were consistent across time points and cohorts, involving, among others, leukocyte-mediated immunity. Plasma proteomic random forests predicted sPTB with higher accuracy and earlier in pregnancy than whole blood transcriptomic models (e.g. AUROC=0.76 vs. AUROC=0.6 at 27-33 weeks of gestation). Co-authors
The gene expression of human cells is a complex system with thousands of interacting components. In several studies, researchers have successfully used machine learning methods to predict high-level biological phenomena such as preterm birth, as in the recent DREAM PTB challenge. Can we really get biologically meaningful insights with this approach? Co-authors
We used a combination of SVM and GPR. The main task of the challenge was to tune the parameters of these two algorithms and to assemble them. We included all samples in training, regardless of platform (microarray or RNA-seq), and quantile normalized each sample. The purpose of tuning the SVM and GPR parameters is to determine how much noise is present in the expression data; this was done through a systematic grid search. The two models were weighted equally when their predictions were combined.
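On invented data, the pipeline described above (quantile normalization of each sample, grid-search tuning of noise-related hyperparameters, and equal-weight ensembling of SVM and GPR) might look like the sketch below; the data, parameter grids, and kernels are assumptions for illustration only.

```python
# Sketch: quantile-normalize samples, tune an SVR by grid search, fit a GPR
# whose WhiteKernel absorbs measurement noise, and average the predictions.
import numpy as np
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))                        # samples x genes (toy)
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=80)   # e.g. gestational age

def quantile_normalize(M):
    """Give every sample (row) the same empirical value distribution."""
    ranks = np.argsort(np.argsort(M, axis=1), axis=1)
    means = np.sort(M, axis=1).mean(axis=0)
    return means[ranks]

Xq = quantile_normalize(X)
svr = GridSearchCV(SVR(), {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1]}).fit(Xq, y)
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(Xq, y)
pred = 0.5 * svr.predict(Xq) + 0.5 * gpr.predict(Xq)  # equal weighting
print(np.corrcoef(pred, y)[0, 1])
```

The WhiteKernel's fitted noise level plays the role the abstract describes: it estimates how much of the expression variance is noise rather than signal.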
In the DREAM Preterm Birth Prediction Challenge, Transcriptomics (Sub-challenge 2), the goal was to predict the preterm birth phenotypes (sPTD and PPROM) with a minimal set (at most 100) of transcriptomic features. We (team IGIB) 1) performed differential expression analysis between sPTD vs. control and PPROM vs. control using the t-test or Wilcoxon test, 2) prioritized the top 100 features by statistical significance (p-value), 3) built SVM-based classification models (kernel types: linear, sigmoid, and radial) with 5-fold cross-validation, and 4) based on the overall sensitivity and specificity across the 5-fold CV, selected the best approach, the radial SVM, for predicting the preterm birth phenotypes (sPTD and PPROM). Overall, the radial-SVM models achieved 96.51% sensitivity and 96% specificity for sPTD, and 100% sensitivity and 100% specificity for PPROM.
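The four steps above can be sketched on synthetic data as follows; the dataset sizes, effect sizes, and labels are invented, and only the t-test/radial-SVM branch of the team's pipeline is shown.

```python
# Sketch: rank transcripts by t-test p-value, keep the top 100, and
# cross-validate a radial-kernel SVM on the selected features.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1000))             # samples x transcripts (toy)
y = rng.integers(0, 2, size=120)             # sPTD / control labels (toy)
X[y == 1, :20] += 1.0                        # make 20 transcripts informative

# Steps 1-2: differential expression, then the 100 smallest p-values
_, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
top = np.argsort(pvals)[:100]

# Steps 3-4: radial SVM evaluated with 5-fold cross-validation
scores = cross_val_score(SVC(kernel="rbf"), X[:, top], y, cv=5)
print(scores.mean())
```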
The ability to discover data across multiple repositories in an interoperable fashion is essential to support a FAIR (Findable, Accessible, Interoperable, Reusable) research ecosystem. With the increasing volume of biomedical data, describing these data in a standardized format that allows for integration remains largely a human endeavor that is difficult to scale. Annotating datasets with consistent metadata can reveal interrelationships among data repositories and improve researchers' ability to reuse data generated at disparate sites in their own cancer research, and is therefore a necessary step in biological research. Because adding machine-readable terms requires considerable manual time and human expertise, properly annotating data prior to sharing is a bottleneck that impedes scientific discovery. The goal of our DREAM Challenge was to address the need for automated tools that facilitate the creation of metadata to enable computational data transformation, query, and analysis.
The ENGR Dynamics approach focused on scalable, computationally efficient heuristics that score many comparisons quickly without complex querying and filtering. While a number of standardized approaches were considered, the final solution focused on capturing the nature of the exceptions and implemented a standardized approach for columns with an expected, predictable naming schema. Co-authors
Annotating medical metadata—and metadata in general—is a tedious and error-prone task for humans. There are usually many usable machine-assisted methods to achieve the same goals, from simple rule-based systems to algorithms applying the latest and most sophisticated findings in the ML world. During the Metadata Annotation DREAM Challenge, teams attempted to mimic the ability of individual curators to choose common data elements—standardized and curated definitions of fields that can be used on clinical forms—that are appropriate for a given data set, containing given header labels and data values. The CEDAR Team developed an algorithm which tries to achieve good results against the provided scoring algorithm, while keeping a relatively simple algorithm with a quick runtime. Our team chose this path so that our algorithm can be easily deployed in real life systems, operated in real-time, used to support human selection, and understood and maintained by its adopters. In this talk, we will describe our approach, its strengths and weaknesses, and why we felt it was a good solution for likely real-world applications involving these types of selection problems. Co-authors
The goal of the Metadata Automation DREAM Challenge was to develop a tool to automate the annotation of metadata fields and values in structured biomedical data files with the best candidate Common Data Element (CDE) matches from the Cancer Data Standards Registry and Repository (caDSR). We chose to implement our model in Python 3.6 and approached this challenge from the perspective that it was essentially a fuzzy matching problem. Our approach utilizes Scikit-Learn’s TfidfVectorizer class along with a custom n-gram function to vectorize the data. These term frequency - inverse document frequency (TF-IDF) vectors are passed to Scikit-Learn’s Nearest Neighbor class which returns the k nearest CDE neighbors and their associated distance scores for each column header in the biomedical data file. For the returned CDEs with enumerated values, the Levenshtein distances from the observed values in the data to the CDE’s permissible values are computed using Python’s FuzzyWuzzy library. We then use a decision tree approach based on the TF-IDF distance scores and the observed values’ average Levenshtein distance scores to select and rank the top three CDE matches from the set of nearest neighbors for each column header. In this final ranking step, we apply cutoff values to the distance scores to determine when to include ‘NOMATCH’ as one of the three results. Throughout the challenge we experimented with many aspects of the algorithm including modifying the n-gram function, the selection of caDSR fields to include in the TF-IDF vectorization and applying different cutoff values. The final version of our model was arrived at by selecting the features and parameters that maximized the overall score across all the provided test datasets. Co-authors
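The fuzzy-matching core of this approach (character n-gram TF-IDF plus nearest-neighbor lookup) can be sketched as below; the CDE names are invented examples, and the real pipeline adds the Levenshtein scoring of permissible values and the decision-tree ranking described above.

```python
# Sketch: vectorize candidate CDE names with character-n-gram TF-IDF, then
# retrieve the k nearest CDEs for a column header by cosine distance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

cde_names = ["Patient Age at Diagnosis", "Tumor Grade", "Specimen Type",
             "Patient Gender", "Days to Death"]       # invented examples
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
cde_vecs = vec.fit_transform(cde_names)

nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(cde_vecs)
dist, idx = nn.kneighbors(vec.transform(["age_at_diagnosis"]))
print([cde_names[i] for i in idx[0]])   # top-3 candidate CDEs for the header
```

Character n-grams make the match robust to formatting differences such as underscores versus spaces, which is why the header `age_at_diagnosis` still retrieves the spelled-out CDE name.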
Take a break and get some fresh air, visit the Poster Hall or stop by Café Connect to network with other attendees.
The Columbia Cancer Target Discovery and Development (CTD2) Center has developed Pancancer Analysis of Chemical Entity Activity (PanACEA), a database of dose-response curves and drug-perturbed RNAseq profiles for 400 clinical oncology drugs. We used this resource to host the CTD2 Pancancer Drug Activity DREAM Challenge, a crowdsourced competition to develop and benchmark computational models for the prediction of drug polypharmacology using drug sensitivity and gene expression information. We provided dose-response and drug-perturbed RNAseq data on 32 kinase inhibitors and asked the community to use this data to predict target binding across 255 kinases. Top performing teams employed two distinct strategies: simple similarity analysis using many highly curated training datasets, or more advanced deep-learning trained on a single large data set. Detailed analyses of the best performing methods provide (1) a framework for using pharmacogenomic data to predict drug-target interactions, (2) reconciliation of different “drug-target” gold-standard definitions, and (3) insights into therapeutically actionable associations between kinase signalling and transcriptional networks. Co-authors
Misidentifying a drug’s mechanism of action is a common problem in drug discovery. Despite recent efforts on profiling transcriptomic changes after drug treatment, it remains unknown whether they can facilitate the prediction of drug targets. The CTD2 Pancancer Drug Activity DREAM Challenge provided dose-response data and drug-gene signatures for 32 kinase inhibitors and asked the participants to predict the binding targets of these anonymized drugs. We collected: 1) drug sensitivity data; 2) gene signature data; and 3) drug-target interaction data. We utilized DrugComb (http://drugcomb.fimm.fi), a crowd-sourced database of comprehensive drug sensitivity data for combinatorial and monotherapy screenings. Furthermore, we determined robust drug sensitivity metrics, including IC20 and the RI (relative inhibition) score, which is based on the area under the log10-scaled dose-response curve. Drug-target interactions were derived from DrugTargetCommons (http://drugtargetcommons.fimm.fi/), a crowd-sourced database of manually curated drug-target bioactivity values from the literature. The final training dataset includes 116 drugs that have cell line sensitivity features (d = 2*11), consensus gene expression signatures (d = 973, provided by the organizers) and drug target profiles (d = 1259). To determine the best machine learning models to predict the drug targets, we considered two classes of methods: weighted averaging and regression. For weighted averaging, the prediction was made by multiplying the Pearson correlation matrix with the drug-target interaction matrix, while for regression, we considered standard machine learning algorithms including ElasticNet, RandomForest and GBM, for which the model was trained on the n = 116 compounds in the training set and then tested on the n = 32 Challenge compounds. We found that regression methods produced less accurate results, probably due to overfitting. 
Instead, our weighted averaging method, which directly uses Pearson correlation to transform the original predictor space into a drug similarity space, produced superior performance. In conclusion, we believe the hypothesis holds that drug targets can be inferred from drug responses and perturbational profiles, given the proper choice of data and model. Specifically, we found that RI and IC20 are robust estimates of drug response. Deeply curated quantitative pharmacological databases (i.e., DrugComb, DrugTargetCommons and L1000) pave the way for advanced pharmacological modelling that may help identify the mechanisms of drugs with improved accuracy. Co-authors
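The weighted-averaging step can be sketched in a few lines: correlate each challenge drug's feature profile with the training drugs, then multiply the correlation matrix by the known drug-target matrix so that targets of similar drugs accumulate high scores. All dimensions and data below are invented for illustration.

```python
# Sketch of weighted averaging for drug-target prediction: similarity of
# each test drug to the training drugs (Pearson correlation over features)
# times the binary training drug-target matrix gives per-target scores.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_targets = 116, 32, 50, 20
F_train = rng.normal(size=(n_train, n_feat))             # training drug features
F_test = rng.normal(size=(n_test, n_feat))               # challenge drug features
T_train = rng.integers(0, 2, size=(n_train, n_targets))  # known drug-target matrix

def rowwise_pearson(A, B):
    A = (A - A.mean(1, keepdims=True)) / A.std(1, keepdims=True)
    B = (B - B.mean(1, keepdims=True)) / B.std(1, keepdims=True)
    return A @ B.T / A.shape[1]

C = rowwise_pearson(F_test, F_train)   # (32, 116) drug similarity space
scores = C @ T_train                   # (32, 20) per-target prediction scores
print(scores.shape)
```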
Accurately identifying drug-target interactions (DTIs) in silico can greatly facilitate the process of drug discovery and development, as it can provide valuable insights into drug mechanisms of action and off-target adverse events. With the emergence of chemogenomic data (e.g., drug perturbational gene expression profiles), researchers can now utilize more information beyond drug structures to build data-driven DTI prediction tools. In this talk, we present a winning method in the CTD-squared Pancancer Drug Activity DREAM Challenge for this problem. We developed a multitask neural network approach to simultaneously model DTI relationships for a set of targets based on drug-perturbed gene expression data. By incorporating a positive-unlabeled learning objective and a multitask learning constraint (i.e., graph Laplacian regularization), our method exhibits strong predictive power in both computational experiments and competitions. Co-authors
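The graph Laplacian constraint mentioned above can be illustrated on its own: given a similarity graph A over targets, the penalty tr(W L W^T) with L = D - A shrinks the weight vectors of connected targets toward each other. The graph and sizes below are invented, and the actual network and positive-unlabeled objective are omitted.

```python
# Sketch of graph Laplacian multitask regularization: the trace penalty
# equals the pairwise form 0.5 * sum_ij A_ij * ||w_i - w_j||^2, which ties
# together the weights of targets that are neighbors in the graph.
import numpy as np

rng = np.random.default_rng(0)
n_targets, n_feat = 5, 8
A = (rng.uniform(size=(n_targets, n_targets)) > 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric adjacency, zero diagonal
L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A

W = rng.normal(size=(n_feat, n_targets))  # column w_i = weights for target i
penalty = np.trace(W @ L @ W.T)           # tr(W L W^T)
pairwise = 0.5 * sum(A[i, j] * np.sum((W[:, i] - W[:, j]) ** 2)
                     for i in range(n_targets) for j in range(n_targets))
print(np.isclose(penalty, pairwise))      # True
```

In a neural network, this penalty would be added to the loss so that gradient descent smooths the per-target output weights over the target graph.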
Rheumatoid arthritis (RA) is a common chronic autoimmune disease characterized by inflammation of the synovium leading to joint space narrowing and bony erosions around the joints. The current state-of-the-art method for quantifying the degree of joint damage is human visual inspection of radiographic images by highly trained readers. This tedious, expensive, and non-scalable method is an impediment to research on factors associated with RA joint damage and its progression, and may delay appropriate treatment decisions by clinicians. We sought to develop automatic, rapid, accurate methods to quantify the degree of joint damage in patients with RA using machine learning or deep learning through the community crowdsourced RA2-DREAM Challenge. The motivation for the Challenge, background related to the scoring of joint damage in RA, and the scored radiographic images from clinical studies that supported the Challenge will be described. In addition, each of the three sub-challenges will be discussed: 1) predict overall RA damage from radiographic images of hands and feet; 2) predict joint space narrowing scores from radiographic images of hands and feet; and 3) predict joint erosion scores from radiographic images of hands and feet. Co-authors
We'll talk about our entry to the RA2 DREAM Challenge, which won the overall damage prediction category (SC1); for details see our writeup at https://www.synapse.org/#!Synapse:syn21478998/wiki/604432. The main difficulty in this competition was the lack of training data. We'll review the strategies we used to deal with this, including: using a DL model to convert all images to the same dihedral orientation; using a DL model to locate joints and cut out joint images, which enabled us to merge groups of joints into one model, multiplying the training data available per prediction; thoughtful use of data augmentation, including perspective warps; and using a carefully chosen pretrained architecture and cross-validation for final damage prediction. Used together, these strategies enabled us to use potentially higher-performance deep learning models without overfitting. Finally, we'll discuss what we think is an interesting open question: whether to use a postprocessing stage that adjusts a patient's individual joint predictions based on the predictions from their other joints. Unlike the other winning entries, we didn't do this, because we felt unsure about whether it is a good thing to do in practice. We'll present some preliminary analysis of this question based on the competition training set. Co-authors
Rheumatoid arthritis (RA) is an autoimmune disease affecting the joints of the hands, feet, wrists, ankles, elbows, and knees. An estimated 0.6 percent of adults in the United States are affected by joint damage associated with RA, including pain and swelling around the joint regions. A standard way to evaluate joint damage is to manually examine radiographic images of the joints and estimate the severity of joint space narrowing and erosion, which is labor-intensive and time-consuming even for experienced radiologists. Here we present a deep learning-based approach for automatically predicting joint damage and segmenting the regions of interest. This approach ranked top in the 2020 RA2 DREAM Challenge - Automated Scoring of Radiographic Joint Damage.
In this work we created a method for automated joint scoring: we detect joints in the hands and feet with high confidence and score them with an ensemble model, while accounting for joint damage across all limbs with a random forest model. We arrived at this approach after extensive experimentation, including many failed attempts to improve the score. As far as we are aware, there is no similar work in the literature, and we did not use any additional datasets to achieve these results.
Visit the Poster Hall and explore the collection of scientific research. Listen to poster presentations, examine the posters, and visit the presenter's table for a live conversation with the presenter.
Welcome and Introductory Remarks
Poultry diseases such as Salmonella, Newcastle disease, and coccidiosis have a significant economic impact on poultry production in Africa every year. In East Africa, most poultry farmers operate on a small scale and do not have a systematic way to collect, analyze and store information related to disease diagnostics. Convolutional Neural Networks (CNNs) have outperformed traditional imaging techniques in solving practical problems, including disease diagnostics. We discuss the application of CNNs to poultry disease diagnostics using a fecal imagery dataset. With the help of CNNs, farmers will have the potential to better diagnose poultry diseases and improve livestock health.
In the era of precision medicine, acute myeloid leukemia (AML) patients have few therapeutic options: “7 + 3” induction chemotherapy has remained the standard for decades. While several agents targeting the myeloid marker CD33, alterations in FLT3 or IDH1/2, or the anti-apoptotic protein BCL2 have demonstrated efficacy in patients, responses are muted in some populations and relapse remains prevalent. There is an urgent need for targeted treatment options that are tailored to more refined patient subpopulations in order to achieve durable responses. To address this need, we hosted an NCI-sponsored Beat AML DREAM Challenge under the auspices of the Cancer Target Discovery and Development (CTD2) program. In this community-wide assessment, participants predicted ex vivo sensitivity of AML patient primary cells to 122 targeted and chemotherapeutic agents using genomic, transcriptomic, and clinical data (sub-Challenge 1; SC1) and predicted clinical response using these data as well as the ex vivo drug sensitivity data (SC2). Data were furnished by the Beat AML initiative, which comprehensively profiled AML patient samples using whole-exome sequencing (WES), transcriptome sequencing (RNA-seq), and ex vivo functional drug sensitivity screens. Participants developed and tuned their methods using published training data (n=213 specimens) and subsequently had their submissions scored on published “leaderboard” data (n=80). Final submissions were ranked on validation data (n=65) we generated for this Challenge using a primary scoring metric, with statistical ties resolved using a secondary metric. Twenty-eight participants entered submissions for SC1. 
We applied two baseline comparator models: a ridge regression model using only expression data (primary metric Spearman’s rho = 0.32; secondary metric Pearson’s r = 0.32) and a Bayesian multitask multiple kernel learning method using expression and mutation data (rho = 0.31; r = 0.32), which was the top-performing method in a related assessment of drug sensitivity prediction across breast cancer cell lines in vitro. The top-performing participant improved upon both models (rho = 0.37; r = 0.38). Six of the top seven participants, including the first-ranked, used multitask approaches or otherwise shared information across the drugs. Fourteen participants entered submissions for SC2. A baseline Cox proportional hazards model with LASSO regularization using all available data modalities achieved a concordance index (CI; primary metric) of 0.68 and an AUC (secondary metric) of 0.65. Four participants were tied based on the primary metric, with the top participant determined by the secondary metric (CI = 0.77; AUC = 0.75). Co-authors
Sub-Challenge 2 (SC2) concerned predicting clinical response (represented as days of survival after inclusion in the study) using ex vivo drug sensitivity, genomic variants, gene expression, and clinical data for the patients. We approached this problem using a two-step survival analysis in which the covariates identified as significant by univariate Cox analyses are used in a multivariate Cox model to predict survival. The clinical variables age and prior malignancy status, as well as principal components of gene expression, turn out to explain much of the variability in survival.
Recent advances in mobile health have demonstrated great potential to leverage sensor-based technologies for quantitative, remote monitoring of health and disease, particularly for diseases affecting motor function such as Parkinson’s disease. While infrequent doctor’s visits along with patient recall can be subject to bias, remote monitoring offers the promise of a more objective, holistic picture of the symptoms and complications experienced by patients on a daily basis, which is critical for making decisions about treatment. Previous work, including the 2017 Parkinson’s Disease Digital Biomarker DREAM Challenge, showed that Parkinson’s diagnosis and symptom severity can be predicted using wearable and consumer sensors worn during the completion of specific short tasks. The BEAT-PD Challenge sought to understand whether symptom severity could be predicted from passive monitoring of patients as they went about their daily lives, which is a critical component of developing algorithms for remote monitoring. To this end, we leveraged two previously unavailable data sets which collected passive accelerometer data from wrist-worn devices coupled with patient self-reports of symptom severity. Participants were asked to build patient-specific models to predict on/off medication status (subchallenge 1), dyskinesia, an often-violent involuntary movement which arises as a side-effect of medication (subchallenge 2), and tremor (subchallenge 3) for 28 patients. The participant models were compared to a patient-specific null model. Through this challenge, as well as the post-challenge community phase, we determined that sensor measurements from passive monitoring of Parkinson’s patients can be used to predict symptom severity for a subset of patients. Moreover, these models were also predictive of in-clinic physician assessments of severity. 
Patient predictability was generally not related to factors like sample size or reporting lag but was somewhat related to overall disease severity.
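The patient-specific null model that submissions were compared against can be illustrated with a minimal sketch. The assumption here, predicting each patient's own training-label mean for every test observation, is one common form of such a baseline and not a confirmed detail of the challenge scoring:

```python
import numpy as np

def null_model_predictions(train_labels_by_patient, test_patient_ids):
    """Patient-specific null model (hedged sketch): for each test
    observation, predict the mean of that patient's training labels."""
    means = {pid: float(np.mean(labels))
             for pid, labels in train_labels_by_patient.items()}
    return np.array([means[pid] for pid in test_patient_ids])

# Hypothetical toy data: two patients with self-reported severities.
train = {"A": [0, 1, 1, 2], "B": [3, 3]}
preds = null_model_predictions(train, ["A", "B", "A"])  # -> [1.0, 3.0, 1.0]
```

A submitted model must beat this per-patient baseline to demonstrate that the sensor data carries signal beyond each patient's typical severity level.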
Wearables hold potential for rich monitoring of patient state, particularly in chronic conditions such as Parkinson's disease. However, clinically useful information is difficult to extract due to the high dimensionality and large amounts of noise inherent to real-world sensor data. In such data regimes, deep learning techniques can be susceptible to overfitting, and simpler techniques may actually be preferable. We developed a data pipeline to predict on-off states for Parkinson’s disease from wearable accelerometer data while minimizing overfitting. The input to our pipeline was raw sensor data consisting of triaxial acceleration time-series signals measured from smartwatches. We combined individual sensor axes and removed gravitational acceleration from the combined signal. We then extracted time-series features from the processed signal and fit a random forest to predict on-off state for each patient. To expand the training set, we divided each full-length observation into 10-second segments. Our pipeline generated predictions for each segment and used the ensembled median value as the prediction for the observation. This pipeline significantly outperformed the null model, as well as deep learning approaches, in both an internal validation and a held-out test set. Our approach emphasized parsimony and interpretability without sacrificing model performance.
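The stages of such a pipeline can be sketched as below. The 50 Hz sampling rate, the moving-average gravity removal, and the specific features are illustrative assumptions (the abstract does not specify them), and the fitted random forest is stood in for by an arbitrary callable model:

```python
import numpy as np

FS = 50  # assumed sampling rate in Hz; the actual devices may differ

def preprocess(acc_xyz):
    """Combine the three axes into a magnitude signal and remove the
    roughly constant gravitational component with a moving mean."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    window = 2 * FS                              # 2-second moving average
    gravity = np.convolve(mag, np.ones(window) / window, mode="same")
    return mag - gravity

def segment(signal, seconds=10):
    """Split a full-length observation into fixed 10-s segments,
    dropping any incomplete tail."""
    n = seconds * FS
    k = len(signal) // n
    return signal[: k * n].reshape(k, n)

def features(seg):
    """A few simple time-series features per segment (illustrative)."""
    return np.array([seg.mean(), seg.std(), np.abs(np.diff(seg)).mean()])

def predict_observation(acc_xyz, model):
    """Predict each segment, then take the median across segments as
    the observation-level prediction (the ensembling step above)."""
    segs = segment(preprocess(acc_xyz))
    per_segment = np.array([model(features(s)) for s in segs])
    return float(np.median(per_segment))
```

Segmenting multiplies the number of training examples per observation, while the median aggregation damps out segments where the patient happened to be still or moving atypically.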
While there is inherent value in a clinician examination, the “gold standard” clinical assessment for Parkinson's disease (PD), the MDS-UPDRS, is subjective, administered sporadically, and may not reflect the full burden of disease severity. Thus, more frequently administered, objective measurements via digital technologies can allow for a more accurate detailing of one's disease severity and treatment response. The BEAT-PD DREAM Challenge offered a dataset of smartwatch/smartphone accelerometer recordings from 16 patients with PD, who self-reported their level of dyskinesia (on a scale of 0-4) during each recording period. We trained a random forest regression model to predict the level of dyskinesia based on measurements extracted from the accelerometer signals. Sixteen features were extracted from the accelerometers, such as the mean acceleration and the dominant frequency of motion. In addition to the accelerometer features, patient characteristics (e.g. age, gender, and baseline UPDRS scores) were used to train the model. This allows the model to develop branches personalized for certain (types of) patients. Personalization is important not only due to differing patient lifestyles and disease progression, but also because the labels for these data are patient-reported, i.e. they are subjective. The total set of features can be reduced by principal component analysis (PCA) or recursive feature elimination (RFE) without significant impact on accuracy. The model makes a prediction for every 30 seconds of activity. For more stable predictions, these estimates can be averaged over longer periods of time (such as the 20-minute recordings of the DREAM challenge). Our model predicted dyskinesia severity with a mean per-patient error of 0.4053. In validation, we found that the model performed well on less severe dyskinesias, but under-estimated in relatively rare cases of high severity (e.g. 4 out of 4 dyskinesia).
Future improvements could be made by addressing this class imbalance. We would also like to incorporate time and date into the model to capture circadian patterns. The current version outperformed all 37 other teams in the BEAT-PD DREAM Challenge. We found that the UPDRS scores were very important features for dyskinesia prediction; in many cases, sensor-derived features were secondary to the UPDRS values. Some of the most important sensor-derived features were the mean acceleration, power spectral entropy, and correlation coefficients between acceleration axes. Our code is publicly available: https://bitbucket.org/atpage/beat-pd/.
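Several of the named features can be computed with plain NumPy. The 50 Hz sampling rate is an assumption, and these are illustrative reimplementations rather than the team's actual feature code (which is at the Bitbucket link above):

```python
import numpy as np

FS = 50.0  # assumed sampling rate in Hz

def dominant_frequency(sig, fs=FS):
    """Frequency bin carrying the most power in the detrended signal."""
    spec = np.abs(np.fft.rfft(sig - sig.mean())) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    return float(freqs[int(np.argmax(spec))])

def spectral_entropy(sig):
    """Shannon entropy of the normalized power spectrum: low for a
    narrowband tremor, high for broadband noise-like motion."""
    spec = np.abs(np.fft.rfft(sig - sig.mean())) ** 2
    p = spec / spec.sum()
    return float(-np.sum(p * np.log2(p + 1e-12)))

def axis_correlations(x, y, z):
    """Pairwise correlation coefficients between acceleration axes."""
    c = np.corrcoef(np.vstack([x, y, z]))
    return c[0, 1], c[0, 2], c[1, 2]
```

The resulting feature vector would then be concatenated with the patient covariates (age, gender, baseline UPDRS) before being fed to the random forest regressor.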
The hallmark of digital medicine is the ability to monitor patients remotely without a physician. While accelerometer/gyroscope-based digital biomarkers have been developed to classify many diseases such as Parkinson’s, it remains an open question whether they can be used to monitor severity, particularly in a free-living environment. We report modalities and algorithms that combat the confounding factors in free-living environments and enable remote tremor-severity monitoring for individual Parkinson’s patients. We identify fundamental reasons why previous attempts failed: direct regression against severity scores produced no signal, consistent with existing studies, and we point to the critical aspects of constructing personalized parameters that allowed the model to place top in the BEAT-PD End Point Challenge. We envision that the methodology will have direct applications in clinical trials and patient care that require objective, fine-grained scoring, and can be adapted to the digital biomarker field for many other neurological or movement conditions.
Implementation of machine learning-based methods in healthcare is of high interest and has the potential to positively impact patient care. However, real-world accuracy and outcomes from the application of these methods remain largely unknown, and performance on different subpopulations of patients also remains unclear. To address these important questions, we hosted a community challenge to evaluate disparate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as it is quantitative and clinically unambiguous. To overcome patient-privacy concerns, we employed a Model-to-Data approach, allowing citizen scientists and researchers to train and evaluate machine learning models on electronic health records from the University of Washington medical system. We held the EHR DREAM Challenge: Patient Mortality from May 2019 to April 2020, asking participants to predict the 180-day mortality status from the last visit that each patient had in UW Medicine. In total, we had 354 registered participants, coalescing into 25 independent teams. The top-performing team achieved an area under the receiver operating characteristic curve of 0.947 (95% CI 0.942, 0.951) and an area under the precision-recall curve of 0.487 on all patients over a one-year observation of a large health system. In a follow-up phase of the challenge, we extracted the trained features from the best-performing methods and evaluated the generalizability of the models across different patient populations, revealing that models differ in accuracy on subpopulations, such as race or gender, even when they are trained on the same data and have similar accuracy on the overall population. This is the broadest community challenge focused on the evaluation of state-of-the-art machine learning methods in healthcare performed to date, and it shows the importance of prospective evaluation and collaborative development of individual models.
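The subpopulation analysis amounts to scoring one fixed model separately on each subgroup. A rank-based AUROC (the Mann-Whitney statistic; this sketch ignores tied scores, which a production implementation would midrank) makes the idea concrete:

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via the Mann-Whitney rank statistic; no sklearn needed.
    labels: 0/1 array; scores: predicted risk for each patient."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def auroc_by_group(labels, scores, groups):
    """Evaluate the same predictions separately on each subpopulation,
    e.g. strata of race or gender."""
    return {g: auroc(labels[groups == g], scores[groups == g])
            for g in np.unique(groups)}
```

Divergent per-group values from `auroc_by_group`, despite a shared training set and a similar pooled AUROC, is exactly the disparity the follow-up phase surfaced.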
The aim of the EHR DREAM Challenge is to predict the mortality status of patients within 180 days of their last visit using their EHR data. We trained and tuned a LightGBM model to predict the mortality risk of each patient with a tailored feature-engineering step. In particular, we used an ontology rollup to reduce the dimensionality of the features, and used time binning and sample reweighting to capture longitudinal feature information. The AUROC and AUPR of our model on the independent validation data are 0.9470 and 0.4779 respectively, both the highest among all models submitted for this challenge.
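The ontology-rollup and time-binning steps can be illustrated with a toy example. The codes, parent map, and bin edges below are hypothetical stand-ins, not the team's actual vocabulary or settings:

```python
# Hypothetical miniature ontology: each leaf code maps to a parent concept.
PARENT = {"E11.9": "E11", "E11.65": "E11", "I10.1": "I10"}

def rollup(code, levels=1):
    """Replace a leaf code by an ancestor concept, shrinking the
    feature space by merging sibling codes."""
    for _ in range(levels):
        code = PARENT.get(code, code)
    return code

def time_bin(days_before_last_visit, edges=(30, 180, 365)):
    """Bucket an event by recency relative to the prediction date,
    giving the model coarse longitudinal structure."""
    for i, edge in enumerate(edges):
        if days_before_last_visit <= edge:
            return i
    return len(edges)

def featurize(events):
    """events: list of (code, days_before_last_visit) pairs -> count
    features keyed by (rolled-up concept, time bin)."""
    counts = {}
    for code, days in events:
        key = (rollup(code), time_bin(days))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The resulting sparse counts (two sibling diabetes codes collapse into one recent-window feature below) would then be the input matrix for the gradient-boosted LightGBM model.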
The Future of DREAM
Visit Café Connect to network with other attendees. Ask additional questions at the themed RSG or DREAM Challenge tables, stop by one of our other topic tables or pop into an open networking table and create your own discussion.