- Sep 10th: Sanjay Krishnan, Assistant Professor, Department of Computer Science, The University of Chicago
- Sep 17th: Michael Schatz, Bloomberg Distinguished Professor, Computer Science & Biology, Johns Hopkins University
- Sep 24th: Dr. Dimitriy Dligach, Associate Professor of Computer Science, Loyola University
- Oct 01st: Natalie Parde, Assistant Professor, Department of Computer Science, University of Illinois at Chicago
- Oct 08th: Mohammed El-Kebir, Assistant Professor, Thomas M. Siebel Center for Computer Science, University of Illinois, Urbana-Champaign
- Oct 15th: Sergey Koren, Staff Scientist, National Institutes of Health (NIH), Bethesda
- Oct 22nd: Nik Sultana, Assistant Professor of Computer Science, Illinois Institute of Technology, Chicago
- Oct 29th: Wesley Swingley, Associate Professor, Dept. of Biological Sciences, Northern Illinois University, DeKalb
- Nov 05th: Indika Kahanda, Assistant Professor, School of Computing, University of North Florida, Jacksonville
- Nov 12th: Andrew (Andy) Dahl, Assistant Professor of Medicine, The University of Chicago
- Nov 19th: NO SEMINAR – FINALS WEEK – GOOD LUCK!
Title: Histograms and How to Make Them Better
Abstract: Summarizing a large dataset with a reduced-size “synopsis” has applications from visualization to approximate computing. Data dimensionality is an acute obstacle: techniques that work well in lower dimensions, such as histograms, fail to scale to higher-dimensional data. My talk surveys a few years of research in this area by my research group and discusses the theory and practice of high-dimensional data summarization. The survey will start by examining how histograms fail at high-dimensional estimation, and then present simple but powerful extensions that have a much wider operating regime (and why these extensions work!). Then, I will discuss the relationship between data summarization and generative modeling in machine learning. I will conclude by describing the practical computer systems that we are building with these algorithmic building blocks.
Bio: Sanjay Krishnan is an Assistant Professor of Computer Science at the University of Chicago. His research studies the intersection of machine learning and database systems. Sanjay completed his PhD and Master’s Degree at UC Berkeley in Computer Science in 2018. Sanjay’s work has received a number of awards including the 2016 SIGMOD Best Demonstration award, 2015 IEEE GHTC Best Paper award, and Sage Scholar award.
Title: Basepairs to petabytes: Computing the Genomics Revolution
Abstract: The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than 1 billion genomes, bringing even deeper insight into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential for medicine and biology, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. During this presentation, I aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead.
Bio: Michael Schatz, Bloomberg Distinguished Professor of Computer Science and Biology at Johns Hopkins University, is among the world’s foremost experts in solving computational problems in genomics research. His innovative biotechnologies and computational tools to study the sequence and function of genomes are advancing the understanding of the structure, evolution, and function of genomes for medicine – particularly autism spectrum disorders, cancer, and other human disease – and agriculture.
Title: Automatic Phenotyping in the Age of Deep Learning
Abstract: It is often estimated that 80% of clinical data today is stored in an unstructured form, mostly as electronic health records (EHR). Within this corpus of text lies a vast amount of valuable information that can be leveraged for phenotyping, pharmacogenomic studies, and clinical decision support, ultimately improving patient care and reducing healthcare costs. Until fairly recently, automatic phenotyping (patient cohort identification) had been conducted using feature-based approaches in combination with linear classifiers. Deep learning revolutionized clinical informatics, but obtaining large datasets to take advantage of highly expressive neural network models is difficult and expensive. In this talk, I will argue that amenability to pretraining is a key benefit of deep learning for healthcare. I will then outline my contributions related to pretraining phenotyping classifiers using various sources of freely available supervision. If time permits, I will briefly review several other projects involving substance misuse classification and information extraction from medical records.
Bio: The overarching goal of Dr. Dligach’s research is developing methods for automatic semantic analysis of texts. His work spans such areas of computer science as natural language processing, machine learning, and data mining. Most recently his research has focused on semantic analysis of clinical texts. He works both on method development and applications. Prior to joining Loyola, Dr. Dligach was a researcher at Boston Children’s Hospital and Harvard Medical School. Dr. Dligach received his PhD in computer science from the University of Colorado Boulder, his MS in computer science from the State University of New York at Buffalo, and his BS in computer science from Loyola University Chicago.
Title: The Doctor (or Chatbot?) Is In: Towards Automated Support for Healthcare Tasks using Natural Language Processing
Abstract: Natural language processing is a powerful tool that opens a wide range of opportunities in many domains, including the healthcare sector. In this talk I’ll introduce two intriguing healthcare tasks recently explored by my research team: predicting cognitive health status and detecting medical self-disclosure. We explore the former at both a coarse-grained level, classifying individuals into dementia and control groups, and a fine-grained level, predicting cognitive health scores along a continuum. For the latter, we develop a large dataset from publicly available posts in online health communities and train a predictive model that establishes a strong performance benchmark for the task. Finally, I’ll conclude by introducing some intriguing directions for future work in the healthcare space.
Bio: Natalie Parde is an Assistant Professor in the Department of Computer Science at the University of Illinois at Chicago, where she also co-directs UIC’s Natural Language Processing Laboratory. Her research interests are in natural language processing, with emphases in healthcare applications, interactive systems, multimodality, and creative language. Her research has been funded by the National Science Foundation, the Office Ergonomics Research Committee, the Discovery Partners Institute, and several internal seed funding programs. She serves on the program committees of the Conference on Empirical Methods in Natural Language Processing (EMNLP), the Association for Computational Linguistics (ACL), and the North American Chapter of the ACL (NAACL), among other conferences and workshops. In her spare time, Dr. Parde enjoys engaging in mentorship and outreach for underrepresented CS students.
Title: Combinatorial Algorithms for Tumor Phylogenetics
Abstract: Cancer is a genetic disease, where cell division, mutation and selection produce a heterogeneous tumor composed of multiple subpopulations of cells with different sets of mutations. During later stages of cancer progression, cancerous cells from the primary tumor migrate and seed metastases at distant anatomical sites. The cell division and mutation history of an individual tumor can be represented by a phylogenetic tree, which helps guide patient-specific treatments. In this talk, I will introduce combinatorial algorithms for reconstructing tumor phylogenies from bulk DNA sequencing data, where the measurements are a mixture of thousands of cells. These algorithms are based on a combinatorial characterization of phylogenetic trees as a restricted class of spanning trees in a graph, a characterization that also demonstrates the computational complexity of the problem. In addition, I will introduce a novel framework for analyzing the history of cellular migrations between anatomical sites in metastatic cancers. Finally, I will discuss algorithmic challenges in tumor phylogeny reconstruction from single-cell DNA sequencing data.
Bio: El-Kebir received his PhD in Computer Science at VU University Amsterdam and Centrum Wiskunde & Informatica (2015) under the direction of Jaap Heringa and Gunnar Klau. He did postdoctoral training with Ben Raphael at Brown University and Princeton University (2014-2017). In 2018, he joined the University of Illinois at Urbana-Champaign as an Assistant Professor of Computer Science. El-Kebir has affiliate faculty appointments in Electrical and Computer Engineering, the Institute of Genomic Biology and the National Center for Supercomputing Applications. He received the National Science Foundation CISE Research Initiation Initiative (CRII) Award in 2019 and the CAREER Award in 2021.
Title: The complete sequence of a human genome
Abstract: In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.
Bio: Sergey received his PhD in computer science in 2012 under the supervision of Mihai Pop at the University of Maryland. He joined the National Bioforensics Analysis Center in 2011 and was appointed as an associate principal investigator in 2014. During this time, he pioneered the use of single-molecule sequencing for the reconstruction of complete genomes. In 2015, he joined the National Human Genome Research Institute as a founding member of the Genome Informatics Section. His research focuses on the efficient analysis of large-scale genomic datasets and new methods for metagenomic analysis and assembly of high-noise single-molecule sequencing data.
Title: Disaggregation and Placement of In-Network Programs
Abstract: Programmable network switches and NICs are enabling the execution of increasingly rich computations inside the network using languages like P4. Today’s in-network programming approach maps a whole P4 program to a single target, limiting a P4 program’s performance and functionality to what a single target device can offer. Disaggregating a single P4 program into subprograms that execute across different targets can improve performance, utilization, and cost. But doing this manually is tedious and error-prone, and it must be repeated as topologies or hardware resources change. This talk describes Flightplan: a target-agnostic programming toolchain that helps with splitting a P4 program into a set of cooperating P4 programs and maps them to run as a distributed system formed of several, possibly heterogeneous, targets. We’ll look at evaluation results from testbed experiments and simulation. During the talk I’ll also describe how Flightplan’s design addresses practical concerns, including the provision of a distributed diagnostics interface and the mitigation of partial failures. Code, documentation, tests, a demo, and videos can be obtained from https://flightplan.cis.upenn.edu/
Bio: Before joining Illinois Tech I was a postdoc at the UPenn Distributed Systems Lab and at the Cambridge Systems Research Group where I worked on various research projects on computer systems. Up to my PhD I did theoretical research. For my PhD I developed a compiler-based approach to proof translation, and before that I worked on constructive proof search and the verification of refactorings using interactive theorem-proving. I did my undergrad at the University of Malta.
Title: Disentangling the Complex Network of Soil Bacteria in a Restored Prairie Chronosequence
Abstract: Understanding microbial diversity and function in natural ecosystems has long remained a challenging endeavor. The function of microbial populations in temperate soils is still relatively understudied due to the sheer diversity of life in these systems and the underrepresentation of cultured isolates to study soil metabolic and elemental cycles. As part of a larger project investigating plant-soil interactions in two northern Illinois prairie restoration systems, this work seeks to use machine learning methodologies to disentangle soil microbial responses to a number of ecological stressors, including fire, bison presence, nutrient abundance, and pH.
Bio: Dr. Swingley’s research focuses on three approaches to tackle the central challenges in analyzing complex environmental communities: 1) to develop novel computational techniques to inform a new generation of genomic and community genomic data; 2) to model the co-evolution of organisms and the environment; and 3) to illuminate the evolutionary origin and history of phenotypes and environmental adaptation.
Title: Biomedical Natural Language Processing for Extracting Protein-Phenotype Relations from Text
Abstract: Natural language processing is concerned with the interaction of computers and human languages, and programming computers to process and understand large natural language corpora. Biomedical Natural Language Processing (BioNLP) is the application of natural language processing to biology and medicine. Driven by the urgent need to utilize the knowledge discovered from the exponentially growing biomedical literature, BioNLP is becoming one of the most rapidly growing research areas, especially because it can alleviate the challenges associated with the manual curation of literature. One of my primary research interests is to work with biologists, clinicians, and biocurators to develop innovative BioNLP methods for automated curation tasks. In this talk, I will describe our recent work that deals with developing BioNLP methods for extracting human protein-phenotype relations from biomedical literature using machine learning.
Bio: Dr. Indika Kahanda is an Assistant Professor at the School of Computing, University of North Florida, where he directs the BioMedInfo (bioinformatics and biomedical informatics) lab. Prior to that, he worked as an Assistant Professor in the Gianforte School of Computing at Montana State University. He focuses on the application of machine learning and natural language processing techniques to solve problems involving large-scale biological, molecular, and biomedical data. In particular, he investigates approaches to (1) develop computational methods for functional genomics and (2) develop natural language processing tools for biomedical literature and clinical notes. He received his Ph.D. in Computer Science from Colorado State University in 2016 in the area of Bioinformatics, a Master of Science in Computer Engineering from Purdue University in 2010, and a Bachelor of Science in Computer Engineering from the University of Peradeniya, Sri Lanka in 2007.
Title: What’s a subtype? Using genetics to identify endotypes of complex disease
Abstract: Many common diseases result from a heterogeneous mix of causal factors that differentially impact different people. This suggests dividing patients into subtypes to improve power and precision in scientific studies and clinical treatment. Nonetheless, subtypes remain under-characterized in many diseases, which has motivated computational approaches to uncover novel subtypes using disease-relevant features. However, many prior approaches are severely biased or under-powered, and they rarely shed light on core disease biology. In this talk, I argue that genetics can be used to help identify, characterize, and biologically validate disease subtypes, and propose our genetic subtyping framework. I describe recent methods that focus on two key challenges: (i) true genetic effects are often individually weak and distributed across the genome and (ii) false positive genetic subtypes can, and do, exacerbate medical racism. I anchor the discussion on applications to subtyping asthma and metabolism, which reveal dovetailing heterogeneity in genetic architecture and treatment response.
Bio: Dr. Dahl develops statistical methods to better understand the genetic basis of complex diseases. He focuses on genetic approaches to define and validate new disease subtypes, with the aim of making treatment more precise. Currently, he is developing tools to parse the functional genomic basis of metabolic diseases like type 2 diabetes and psychiatric diseases like major depression.