Fall 2020 Schedule – CDM Colloquium

Sept. 11th

Title: Why distributed systems are hard, and how to deal with it.

Abstract: Why are distributed systems hard? The short answer is because what we want to do is impossible. I’ll show you why. However, we are engineers, so we still need a solution. I’ll discuss strategies for developing and deploying distributed systems in today’s world. I’ll show ways that ZooKeeper can help, and explain why some of the design choices were made. Finally, I’ll also talk about some of the problems that we still need to watch out for when working with distributed systems.

Bio: Ben joined the Computer Science department at San José State University in Fall of 2018, where he teaches courses in networking, operating systems, and distributed computing. Before joining SJSU, he helped make the world more open and connected at Facebook for 5 years, working on the massively distributed systems there and developing cross-platform frameworks for mobile devices. A short stint at an awesome startup, 6 years at Yahoo! Research, 11 years at IBM Almaden Research, and a Ph.D. from UC Santa Cruz sum up the previous decades.

Sept. 18th

Title: ChronoLog: A Distributed Shared Tiered Log Store with Time-based Data Ordering

Abstract: Modern applications produce and process massive amounts of activity (or log) data. Traditional storage systems were not designed with an append-only data model, and a new storage abstraction aims to fill this gap: the distributed shared log store. However, existing solutions struggle to provide a scalable, parallel, and high-performance solution that can support a diverse set of conflicting log workload requirements. Finding the tail of a distributed log is a centralized point of contention. In this paper, we show how using physical time can help alleviate the need for centralized synchronization points. We present ChronoLog, a new, distributed, shared, and multi-tiered log store that can handle more than a million tail operations per second. Evaluation results show ChronoLog’s potential, outperforming existing solutions by an order of magnitude.

Keywords: distributed log, shared log, tiered storage

Bio: Dr. Anthony Kougkas is a Research Assistant Professor in the Department of Computer Science at the Illinois Institute of Technology. He is a faculty member and the director of I/O research development of the Scalable Computing Software laboratory at Illinois Tech. He recently received his PhD under Dr. Xian-He Sun with a dissertation titled “Accelerating I/O Using Data Labels: A Contention-aware, Multi-tiered, Scalable, and Distributed I/O Platform”. Dr. Kougkas is an ACM/IEEE member and is very active in the storage community, serving as a member of the technical program committees of several conferences. Before joining Illinois Tech, he worked for more than 12 years as a military officer. He holds a B.Sc. in Military Science, an MBA in Leadership, and an M.Sc. in Computer Science, all received in Athens, Greece. His research focuses on Parallel and Distributed systems, Parallel I/O optimizations, HPC storage, BigData analytics, I/O Convergence, and I/O Advanced Buffering. He is the recipient of the 2019 Karsten Schwan Best Paper Award for his work LABIOS at the 28th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’19). More information about Dr. Kougkas can be found at akougkas.com.

Sept. 25th

Title: Recomputing Provenance Graphs in Constrained Environments

Abstract: The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and thus is useful for porting applications to new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes inputs to computations and, if they change, propagates the changes, re-using partially memoized computations and data that are identical across the replay and the original run. We chose light-weight data structures for storing the provenance trace to maintain the invariant of a shareable and portable container runtime. To determine the effectiveness of change propagation and memoization, we compared popular container technology and incremental recomputation methods using published data analysis experiments.
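As a rough illustration of the memoization and change-propagation idea mentioned in the abstract, the sketch below caches each step's output under a hash of its inputs and re-runs a step only when that hash changes. It is a generic toy and not SciInc's implementation: the `MemoizedPipeline` class, the `replay` function, and the two-step "clean" and "sum" pipeline are all hypothetical names chosen for this example.

```python
import hashlib
import json


class MemoizedPipeline:
    """Toy memoization: cache each step's output by a digest of its inputs."""

    def __init__(self):
        self._cache = {}    # (step name, input digest) -> cached output
        self.executed = []  # names of steps that actually ran, in order

    @staticmethod
    def _digest(value) -> str:
        # Assumes inputs are JSON-serializable (an assumption of this sketch).
        payload = json.dumps(value, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run_step(self, name, func, inputs):
        key = (name, self._digest(inputs))
        if key not in self._cache:
            # Inputs are new or changed: recompute and record the run.
            self._cache[key] = func(inputs)
            self.executed.append(name)
        return self._cache[key]


def replay(pipeline, raw):
    """Hypothetical two-step analysis: drop missing values, then sum."""
    cleaned = pipeline.run_step(
        "clean", lambda xs: [x for x in xs if x is not None], raw
    )
    return pipeline.run_step("sum", lambda xs: sum(xs), cleaned)
```

Calling `replay` twice with the same input re-runs nothing the second time. Calling it with `[1, 2, None]` after `[1, None, 2]` re-runs only the "clean" step, because its output is unchanged and the cached "sum" result is reused.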

Bio: Tanu Malik is an assistant professor in the School of Computing and directs the Data Systems and Optimization Lab. Her research interests span topics in database systems, data provenance, distributed systems, and cyber-infrastructure for scientific data management. Her group is currently developing methods and systems for improving the conduct of reproducible science in computational and data science disciplines. Tanu received the 2019 NSF CAREER award for her work on computational reproducibility. She was also chosen as a 2019 Fellow for Better Scientific Software.

Tanu has actively collaborated with scientists across several institutions. Her research is funded by the National Science Foundation, the Department of Energy, the Sloan Foundation, and the Bloomberg Foundation. She can be reached at tanu at cdm dot depaul dot edu.

Oct. 2nd

Title: Evaluation of a process for novice debugging

Abstract: Debugging code is a complex task that requires knowledge about the mechanics of a programming language, the purpose of a given program, and an understanding of how the program achieves the purpose intended. It is generally accepted that prior experience with similar bugs improves the debugging process and that a systematic process is needed to be able to successfully move from the symptoms of a bug to the cause. Students who are learning to program may struggle with one or more aspects of debugging and, anecdotally, spend a lot of their time debugging faulty code.

In this work, we analyze student answers to questions designed to focus student attention on the symptoms of a bug and to use those symptoms to generate a hypothesis about the cause of a bug. To ensure students focus on the symptoms rather than the code, we use paper-based exercises that ask students to reflect on various bugs and to hypothesize about the cause. We analyze the students’ responses to the questions and find that, using our structured process, most students are able to generalize from a single failing test case to the likely problem in the code, but they are much less able to identify the appropriate location or an actual fix.

Bio: Amber Settle is a Professor in the School of Computing at DePaul University and has been on the full-time faculty since 1996. She earned a B.S. in mathematics and a B.A. in German from the University of Arizona, and an M.S. and Ph.D. in computer science from the University of Chicago. Her research interests include computer science and information technology education and theoretical computer science.

She has served on the Advisory Board for the ACM Special Interest Group for Computer Science Education (SIGCSE) since 2010 and is the immediate SIGCSE past chair. Dr. Settle has also served on the program and/or conference committees for RESPECT 2016, SIGITE/RIIT 2013, 2014, and 2015, and ITiCSE 2013. She has been a Distinguished Member of the ACM since 2019.

Oct. 9th

Title: Pomsets with Preconditions: A Simple Model of Relaxed Memory

Abstract: A memory model is a contract between a programmer and a system implementor which indicates the allowed outcomes of any given program. Some of the things allowed on your computer might surprise you!

In your first systems class you learned a simple model of virtual memory: a nice flat address space. You also learned that this is a lie! Memory systems are remarkably complicated. The situation is even more complex when you consider the aggressive optimization performed by current compilers. Although system designers are able to hide much of this complexity, they can’t hide it all without killing performance. For fifteen years, researchers have been looking for a model that is understandable to programmers while still allowing efficient implementation.

In this talk, we present a (relatively) simple model that does almost everything we want. The model combines an idea from the 60s (preconditions) with an idea from the 80s (pomsets). We show that the resulting model (1) supports compositional reasoning for temporal safety properties, (2) supports all reasonable sequential compiler optimizations, (3) allows programmers to use a simplistic model for race-free programs, and (4) compiles to x64 and ARMv8 microprocessors without requiring extra fences on relaxed accesses.

Bio: James Riely has taught at DePaul since 1999. He’s been working with memory models since 2009.

Oct. 16th

Title: Building an Efficient In-memory Index Data Structure for String Keys

Abstract: In-memory data management systems, such as key-value stores, have become an essential infrastructure in today’s big data processing and cloud computing. They rely on efficient index structures to access data. While unordered indexes, such as hash tables, can perform point queries in O(1) time, they cannot be used in many scenarios where range queries must be supported. Many ordered indexes, such as the B+-tree and the skip list, have an O(log N) lookup cost, where N is the number of keys in an index. For an ordered index hosting billions of keys, it may take more than thirty key comparisons in a lookup, which is an order of magnitude more expensive than that on a hash table. With the availability of large memory and fast networks in today’s data centers, this O(log N) time is taking a heavy toll on applications that rely on ordered indexes.
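The "more than thirty key comparisons" figure can be checked with a small sketch. It assumes a plain binary search over N sorted keys as a stand-in for an ordered index, which is a simplification and not the B+-tree or skip list code of any real system. The function names are hypothetical.

```python
def worst_case_comparisons(num_keys: int) -> int:
    """Worst-case comparisons for binary search over num_keys sorted keys.

    For num_keys >= 1 this equals floor(log2(num_keys)) + 1, which is the
    bit length of num_keys.
    """
    if num_keys < 1:
        raise ValueError("num_keys must be at least 1")
    return num_keys.bit_length()


def binary_search_steps(sorted_keys, target):
    """Lower-bound binary search that also counts key comparisons."""
    lo, hi = 0, len(sorted_keys)
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1  # one key comparison per loop iteration
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, steps


if __name__ == "__main__":
    for n in (10**6, 10**9, 4 * 10**9):
        print(n, worst_case_comparisons(n))
```

With one billion keys the bound is 30 comparisons, and with four billion it is 32, whereas a hash table lookup needs a roughly constant number of probes regardless of N.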

This talk will present Wormhole, a new index data structure for string keys that takes O(log L) worst-case time to look up a key of length L. The low cost is achieved by simultaneously leveraging the strengths of three indexing structures, namely the hash table, the prefix tree, and the B+-tree, to orchestrate a single fast ordered index.

Bio: Xingbo Wu is an Assistant Professor in the Department of Computer Science at the University of Illinois at Chicago (UIC). He received a PhD in Computer Engineering from the University of Texas at Arlington in 2018. The goal of his work is to build fast and efficient data management systems for data centers and clouds. His research spans multiple layers of computer systems and his work has addressed efficiency and performance issues in KV stores, filesystems, virtual block devices, and Flash SSDs.

Oct. 23rd

Title: STARE Towards Integrative Analysis Of Diverse Big Earth Science

Abstract: A major gap in Earth Science Data analysis is the integration of diverse un-gridded (Level 1 and 2) observations. Gridded data (Level 3) can bring data together on common spatiotemporal grids, but at the cost of interpolation and the loss of important physical characteristics of the original data. Furthermore, current methods of storing observations into arrays in files on computer storage systems break the observations’ geo-spatial (and/or temporal) alignment. The SpatioTemporal Adaptive Resolution Encoding (STARE) provides a universal way to encode and index geo-spatiotemporal regions for use on distributed parallel computing resources, scaling to both the diversity and the volume required for integrative Earth Science Data analysis. With STARE, aligning and combining un-gridded data at its finest resolution is dramatically improved compared to conventional approaches, enabling a host of important, but currently unfeasible, scientific analyses. In this presentation, we will show how STARE can be used to index, co-align, and integrate diverse low-level observations.

Bio: Dr. Michael Rilee, of Rilee Systems Technologies and NASA Goddard Space Flight Center (GSFC), is the Principal Investigator for the NASA/ACCESS-17 SpatioTemporal Adaptive Resolution Encoding (STARE) effort to develop a universal indexing scheme for integrative analysis of diverse data. He has incorporated STARE into SciDB. His systems experience includes the integration and test of multi-million-dollar supercomputing assets and applying them to a range of problems arising in aerospace science and engineering. He has co-authored a book on remote sensing from space, and his early career focused on the application of advanced computing technologies to plasma physics and autonomous science operations onboard spacecraft. He was awarded a Ph.D. in Astronomy from Cornell University for his research on solar flare energy release.

Oct. 30th

Title: The Real Logic of Drawing Graphs

Abstract: Computational problems in graph drawing (information visualization) often run into precision issues: when drawing a graph, how precisely do the vertices need to be placed? For some problems, it turns out, these precision issues are unavoidable, and the reason for that is related to the logic of the real numbers. In this talk, we give an introduction to the existential theory of the real numbers, which captures the complexity of problems as diverse as the rectilinear crossing number, tensor rank, the art gallery problem, Nash Equilibria, and tiling puzzles.

Bio: Marcus Schaefer came to CDM in 1999 as an Assistant Professor of computer science after finishing his PhD at the University of Chicago. Previously, he had obtained master's degrees in mathematics and computer science at the Universitaet Karlsruhe in Germany. He has published actively, in complexity theory and graph theory in particular, including numerous conference talks and journal publications.

Nov. 6th

Title: Trait-based aerial dispersal of symbiotic microbes.

Abstract: Dispersal is a fundamental process influencing both large-scale biogeographical patterns and local community assembly, but considerable knowledge gaps exist for dispersal of microbial fungi that form Earth’s most common symbioses. Variation in dispersal is predicted among species due to variation in spore traits, but these predictions have yet to be empirically tested. Furthermore, passive accumulation of fungal propagules in urban environments suggests rapid aerial dispersal. Using field experiments, functional traits, and high-throughput sequencing, we investigate the prevalence and patterns of aerial dispersal of fungi. These data inform microbial dispersal conceptual models and guide predictions about how stochastic processes drive microbial biogeography and community structure.

Bio: Dr. Bala Chaudhary is an Assistant Professor in the Department of Environmental Science and Studies at DePaul University in Chicago. She is a trained biological scientist who completed her undergraduate degree at the University of Chicago and her MS and PhD at Northern Arizona University. Research in her lab examines plant-soil-microbial ecology to address landscape-scale questions in natural and managed ecosystems from deserts to rain forests to cities. Bala has numerous publications in high-profile journals and, in 2019, she received a National Science Foundation CAREER Award to study microbial biogeography as well as ways to promote racial and ethnic diversity in STEM.

Nov. 13th

Title: Reproducible Distributed Systems Benchmarking With Container-native Workflows

Abstract: Experimenting with complex software systems requires practitioners to follow a list of quite onerous steps in order to get it right. Properly carrying out an experiment requires setting up infrastructure; deploying the software system under study; baselining the infrastructure; installing analysis utilities; running experiment commands and scripts iteratively; plotting and analyzing results; among others. Performing these steps manually can be cumbersome and error-prone. In this talk, I present our work on automating experimentation workflows using the Popper container-native workflow automation engine. We give an overview of workflow tools, container- and cloud-native workflow engines, and introduce Popper. To exemplify the utility of this approach, we show example workflows used in our laboratory for benchmarking and baselining Ceph, a storage management platform. Lastly, I briefly introduce and present benchmarking results for SkyhookDM, a Ceph extension that offloads computation and other data management tasks to the storage layer in order to reduce the client-side resources needed for data processing in terms of CPU, memory, IO, and network traffic.

Bio: Ivo is a Research Scientist at UC Santa Cruz; an Incubator Fellow at the UC Santa Cruz Center for Research on Open Source Software (CROSS); and an Adjunct Professor at the University of Sonora (Mexico). Ivo is interested in large-scale distributed data management systems, applied aspects of data science, and reproducibility. Ivo’s 2019 PhD dissertation focused on the practical aspects of the reproducible evaluation of systems research, work for which Ivo was awarded the 2018 Better Scientific Software Fellowship. Ivo is currently working on Popper, as part of the CROSS Incubator Program.
