Winter 2021 Schedule – CDM Colloquium

Jan. 8th

Title: Automatically Detecting, Mitigating and Fixing Software Vulnerabilities

Abstract: Software vulnerabilities are a severe threat to cybersecurity. Exploiting software vulnerabilities allows adversaries to mount large-scale security attacks and compromise protected computer systems, as evidenced by the recent Russian attacks against protected systems of top U.S. federal agencies. In this talk, I will demonstrate the need for automatic solutions to address software vulnerabilities and present approaches that I have developed to detect, mitigate and fix real-world software vulnerabilities. These approaches leverage novel program analysis techniques and deep learning to address three main challenges: 1) detecting software vulnerabilities with high accuracy; 2) mitigating a large number of software vulnerabilities rapidly and safely; and 3) generating correct security patches for complex software vulnerabilities.

Bio: Zhen Huang is an Assistant Professor in the School of Computing at DePaul University. He earned his Ph.D. in Computer Engineering from the University of Toronto in 2018. His research focuses on using program analysis techniques and machine learning to address software security issues.

Jan. 15th

Title: A Framework to Reverse Engineer Database Memory by Abstracting Memory Areas

Abstract: The contents of RAM in an operating system (OS) are a critical source of evidence for malware detection or system performance profiling. Digital forensics has focused on reconstructing OS RAM structures to detect malware patterns at runtime. In an ongoing arms race, these RAM reconstruction approaches must be designed for the attack they are trying to detect. Even though database management systems (DBMS) are collectively responsible for storing and processing most data in organizations, the equivalent problem of memory reconstruction has not been considered for DBMS-managed RAM.

In this talk, we describe a systematic approach to reverse engineer data structures and access patterns in DBMS RAM. We evaluate our approach with the four most common RAM areas in well-known DBMSes and describe the design of each area-specific query workload and the process to capture and quantify that area at runtime. We further evaluate our approach by observing the RAM data flow in the presence of built-in DBMS encryption and illustrate the practical data leak implications for the four major memory areas.

Bio: Dr. Alexander Rasin is an Associate Professor in the College of Computing and Digital Media (CDM) at DePaul University. He received his Ph.D. and M.Sc. in Computer Science from Brown University, Providence. He is a co-Director of the Data Systems and Optimization Lab at CDM, and his primary research interest is in cybersecurity problems of preventing data tampering and exfiltration, database forensic analysis, and fine-grained access control policies. Dr. Rasin’s other research projects focus on building and tuning the performance of domain-specific data management systems, including biomedical data integration, user-defined predicate query optimization, and physical database design and indexing. Several of his research projects are supported by NSF and NIST.

Jan. 22nd

Title: Exploration and Exploitation in Evolutionary Algorithms: Recent Developments

Abstract: It has long been acknowledged that achieving a balance between exploration and exploitation in Evolutionary Algorithms is of primary importance. However, how to measure exploration and exploitation directly has been an open problem, and a common belief is that clear identification of exploration and exploitation is not possible. In this talk, after a brief introduction to Evolutionary Algorithms, I will discuss our novel direct measure of exploration and exploitation, which is based on attraction basins: parts of a search space where each part has its own point, called an attractor, to which neighboring points tend to evolve. Each search point can be associated with a particular attraction basin. If a newly generated search point belongs to the same attraction basin as its parent, then the search process is identified as exploitation; otherwise, it is identified as exploration. In the last part, I will mention some open problems in the field of Evolutionary Algorithms, such as replicability of experiments and fairness in comparisons.
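
The basin-based classification described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the speaker's actual measure: the search space is a 1-D grid, fitness is minimized, the attractor of a point is the local minimum reached by steepest descent, and the names `attractor` and `classify_step` are hypothetical.

```python
from typing import Sequence


def attractor(fitness: Sequence[float], i: int) -> int:
    """Return the index of the local minimum reached from i by steepest descent.

    Assumption: a 1-D grid where each point's neighbors are i - 1 and i + 1.
    """
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(fitness)]
        if not neighbors:
            return i  # a single-point landscape is its own attractor
        best = min(neighbors, key=lambda j: fitness[j])
        if fitness[best] >= fitness[i]:
            return i  # no neighbor improves: i is the attractor
        i = best


def classify_step(fitness: Sequence[float], parent: int, child: int) -> str:
    """Label a parent-to-child move by comparing their attraction basins."""
    same_basin = attractor(fitness, parent) == attractor(fitness, child)
    return "exploitation" if same_basin else "exploration"


# Toy landscape with two basins: one around index 2, one around index 6.
landscape = [5, 3, 1, 4, 6, 2, 0, 3]
print(classify_step(landscape, parent=1, child=3))  # same basin (attractor 2)
print(classify_step(landscape, parent=3, child=4))  # different basins (2 vs. 6)
```

In higher-dimensional or continuous search spaces, identifying the basin of a point is considerably harder than in this grid example, so the sketch only conveys the classification rule itself.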

Bio: Marjan Mernik received the M.Sc. and Ph.D. degrees in computer science from the University of Maribor in 1994 and 1998, respectively. He is currently a professor at the University of Maribor, Faculty of Electrical Engineering and Computer Science. He was a visiting professor at the University of Alabama at Birmingham, Department of Computer and Information Sciences. His research interests include programming languages, compilers, domain-specific (modeling) languages, grammar-based systems, grammatical inference, and evolutionary computations. He is a member of the IEEE, ACM, and EAPLS. Dr. Mernik is the Editor-in-Chief of the Journal of Computer Languages, as well as an Associate Editor of the Applied Soft Computing journal, Information Sciences journal, and Swarm and Evolutionary Computation journal. He was named a Highly Cited Researcher for 2017 and 2018. More information about his work is available at https://lpm.feri.um.si/en/members/mernik/

Jan. 29th

Title: Structuring Notebooks Around Their Outputs

Abstract: Computational notebooks provide a setting where users can rapidly examine and evaluate intermediate outputs as solutions are explored through blocks of code named cells. However, output in notebook systems like Jupyter is multi-faceted and includes textual streams written during execution as well as rich output generated from the final expression in a cell. This talk will discuss approaches to structuring notebooks around named outputs, improving their display, and enhancing methods to use and recall them. Dataflow notebooks elevate outputs to define links between cells, making computations more traceable and reproducible. By making output displays more compact while allowing users to expand details on demand, this work also addresses navigation issues in notebooks. This work has been implemented as JupyterLab extensions: ipycollections and dfnotebook improve Jupyter’s display of rich output collections and help users recall and reproduce past outputs, respectively.

Bio: David Koop is an Assistant Professor in the Department of Computer Science at Northern Illinois University. His research interests include data visualization, reproducibility, and computational notebooks. A focus of his research is on methods that support users in data exploration, analysis, and visualization tasks so they can focus on important ideas and decisions. During his work, he has collaborated with scientists in the fields of climate science, quantum physics, and invasive species modeling.

Feb. 5th

Title: Parallelism and regex: Step 0: memchr.

Abstract: With the prophesied end of Moore’s law, manufacturers and programmers alike are turning to parallelism to boost performance. As one of the simplest and most widespread computational tasks, regular expression matching (regex) is an obvious candidate for studying how parallelism can be leveraged. Starting at the very bottom of this study, we concentrate on the regular expression “.*a.*”, that is, “the string contains an ‘a’”; this is traditionally implemented using the libc functions memchr or strchr. We study how this task can be sped up using instruction-level parallelism (SIMD instructions) and core-level parallelism (threads). In this talk, I will report on several bottlenecks and sweet spots that we identified in a variety of implementations of memchr.
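
As a rough illustration of the task in this abstract, the sketch below shows that matching “.*a.*” reduces to a single-byte search, and how that search might be split into chunks handled by separate threads. It is not the implementation studied in the talk: the function names are hypothetical, and in CPython the global interpreter lock means these threads will not necessarily run faster than a single `bytes.find` call, so the code only demonstrates the decomposition.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def contains_a_regex(data: bytes) -> bool:
    """Match the regular expression .*a.* against the whole input."""
    return re.fullmatch(rb".*a.*", data, re.DOTALL) is not None


def find_byte_chunked(data: bytes, byte: int, workers: int = 4) -> int:
    """Return the first index of `byte` in `data`, or -1 if it is absent.

    The buffer is split into roughly equal chunks that are searched
    concurrently; the smallest hit across chunks is the overall answer.
    """
    if not data:
        return -1
    chunk = -(-len(data) // workers)  # ceiling division
    needle = bytes([byte])

    def search(start: int) -> int:
        # bytes.find with start/end bounds returns an absolute index or -1.
        return data.find(needle, start, min(start + chunk, len(data)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = [h for h in pool.map(search, range(0, len(data), chunk)) if h != -1]
    return min(hits) if hits else -1


sample = b"x" * 1000 + b"a" + b"x" * 10 + b"a"
print(find_byte_chunked(sample, ord("a")))
print(contains_a_regex(sample) == (find_byte_chunked(sample, ord("a")) != -1))
```

Note that every chunk is searched to completion even when an earlier chunk already contains a hit; deciding when to stop early is one of the design questions a real threaded implementation has to face.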

Work in progress with K. Endres (DePaul) and C. Paperman (U. Lille, France).

Bio: Michaël joined DePaul in 2019 as an Assistant Professor. He specializes in the theoretical aspects of computer science, in particular automata theory, logic, and circuit complexity.

Feb. 12th

Title: How the NBA uses Data Science and Computer Science to Analyze Basketball Statistics

Abstract: Over the last decade, the available data to describe NBA basketball has grown exponentially. With that, the NBA has employed cutting-edge technologies to analyze basketball statistics to engage and educate fans while also providing NBA teams with the data needed for their research. This talk will review specific projects to demonstrate the technologies and techniques the NBA uses to make the most of the available data. There will also be a focus on the importance of subject matter expertise in solving data science problems.

Bio: Charlie Rohlf is the Associate Vice President of Stats Technology & Product Development at the NBA. His team is responsible for the software development and data science used to ingest, process and deliver the NBA’s basketball statistics to its products. He received his MS in Computer Science from DePaul in 2012.

Feb. 19th

Title: ApproxTuner: A Compiler and Runtime System for Adaptive Approximations

Abstract: Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware optimization of tensor-based applications while requiring only high-level end-to-end quality specifications. ApproxTuner implements and manages approximations in algorithms, system software, and hardware. The key contribution in ApproxTuner is a novel three-phase approach to approximation-tuning that consists of development-time, install-time, and run-time phases. Our approach decouples tuning of hardware-independent and hardware-specific approximations, thus providing retargetability across devices. We evaluate ApproxTuner across 10 convolutional neural networks (CNNs) and a combined CNN and image processing benchmark. For the evaluated CNNs, using only hardware-independent approximation choices we achieve a mean speedup of 2.1x (max 2.7x) on a GPU, and 1.3x mean speedup (max 1.9x) on the CPU, while staying within 1 percentage point of inference accuracy loss.

Bio: Hashim is a final-year PhD candidate working with Dr. Vikram Adve at the University of Illinois at Urbana-Champaign. His research interests lie at the intersection of Compilers, Approximate Computing, Deep Learning, Systems, and Static Analysis. His work focuses on building compiler infrastructure that improves performance and reduces energy usage on resource-constrained systems. He is also interested in developing abstractions, analyses, and techniques that enable the use of approximations with minimal programmer/user involvement.

Feb. 26th

Title: ZettaScale Computing on Exascale Platforms

Abstract: We outline the vision of “Learning Everywhere,” which captures the impact of learning methods coupled to traditional HPC methods. We present several examples of the “effective performance” improvements that coupling HPC with learning methods provides for traditional HPC simulations. We discuss how we are applying the “Learning Everywhere” paradigm to advance therapeutics for COVID-19 as part of the DOE’s Medical Therapeutics project under the umbrella of the National Virtual Biotechnology Laboratory. We will discuss performance challenges of scalable and integrated HPC & AI software infrastructure, and outline how RADICAL-Cybertools address these challenges.

Bio: Shantenu Jha is the Chair of the Computation & Data Driven Discovery Department at Brookhaven National Laboratory, and Professor of Computer Engineering at Rutgers University. His research interests are at the intersection of high-performance distributed computing and computational & data science. Shantenu leads the RADICAL-Cybertools project, a suite of middleware building blocks used to support large-scale science and engineering applications. He was appointed a Rutgers Chancellor’s Scholar (2015) and was the recipient of the inaugural Chancellor’s Excellence in Research (2016) for his cyberinfrastructure contributions to computational science. He is a recipient of the NSF CAREER Award (2013), the IEEE SCALE 2018 award, and the Gordon Bell Award (2020), as well as several other prizes at SC’xy and ISC’xy. More details can be found at: http://radical.rutgers.edu/shantenu

Mar. 5th

Title: See the World Through Network Cameras

Abstract: Millions of network cameras have been deployed worldwide. Real-time data from many network cameras can offer instant views of multiple locations with applications in public safety, transportation management, urban planning, agriculture, forestry, social sciences, atmospheric information, and many others. In this talk, I will discuss the real-time data available from worldwide network cameras and potential applications, such as using cameras to analyze social distancing and vehicular traffic during the COVID-19 pandemic via our database of more than 35,000 network cameras around the world. I will present specific technical strategies to address the challenge of discovering network cameras and creating a database of public network cameras, which is the subject of our recently accepted journal article in ACM Transactions on Internet Technology. I will explain our recent work to use open-source archiving tools we developed in our research group to collect and analyze a large visual dataset from our camera database. This dataset has grown to 100TB after collecting data since March 2020 (when the pandemic “began”) on a visualization-focused supercomputer at Argonne National Laboratory. We continue to collect data, since COVID-19 is far from over, in the hope that our methods will be able to help with future pandemics or other emergencies. Our use of supercomputers is interesting, because we primarily use these resources to address the data-intensive nature of this project, since they are among the few computing resources that have access to large-scale storage networks and computing nodes with multiple GPGPUs available. Nevertheless, these powerful computers enable us to run the object detectors and other computer vision methods in situ (in place) on the large visual dataset and get up-to-date information as various lockdown measures are imposed and relaxed (sometimes to be repeated).
As this talk is also aimed at research students and is an invitation to potential collaborators, I will close with a discussion of challenges and opportunities and offer remarks about how to do research with large, distributed, and changing software teams comprising mostly undergraduate students, and about the crucial role of developing leadership strategies and embracing software engineering practices.

Bio: George K. Thiruvathukal is a Professor of Computer Science at Loyola University Chicago. He is also the Co-Leader of the CAM2 Project at Purdue University for Software Engineering and HPC/Distributed Systems, and Director of the Software Systems Laboratory.

Mar. 12th

Title: Index-free reachability in massive graphs using random walks

Abstract: Reachability is a key primitive in graph processing across a range of applications. Traditionally this problem has been approached by computing reachability indexes. However, with the explosion in the size of graph data sets, this approach has hit a barrier. We show how to use random walks to answer reachability queries efficiently, albeit with a slight increase in query time, without the overhead of having to compute and store an index. The benefits of our technique are magnified in the labelled graph setting. Here we address reachability queries that have an additional constraint specified as a regular expression on the labels of the path. Such queries arise in many practical graph query languages such as SPARQL from W3C, Cypher of Neo4J, Oracle’s PGQL and LDBC’s G-CORE. There are no known practical index-based solutions that address a full range of regular expressions for such queries. We show how to handle such queries at a high level of generality using a random-walk based method.
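
To give a feel for index-free reachability, here is a minimal Monte Carlo sketch. It is not the method presented in the talk and it ignores edge labels entirely: it assumes a directed graph stored as an adjacency dictionary, launches independent forward random walks from the source, and reports success if any walk visits the target. The function name and all parameter values are hypothetical. A `True` answer is always correct, whereas a `False` answer only means that no sampled walk found a path.

```python
import random
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def probably_reachable(
    graph: Graph,
    source: Hashable,
    target: Hashable,
    walks: int = 200,
    max_steps: int = 50,
    seed: int = 0,
) -> bool:
    """Estimate reachability with random walks; no index is built or stored."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    for _ in range(walks):
        node = source
        # Check the current node, then move; allows up to max_steps moves.
        for _ in range(max_steps + 1):
            if node == target:
                return True
            successors = graph.get(node)
            if not successors:
                break  # dead end: abandon this walk
            node = rng.choice(successors)
    return False


toy_graph: Graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(probably_reachable(toy_graph, "a", "d"))
print(probably_reachable(toy_graph, "d", "a"))
```

The number of walks and the walk length trade query time against the chance of missing an existing path, which mirrors the trade-off against index-based approaches mentioned in the abstract.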

Bio: Amitabha Bagchi is a Professor of Computer Science and Engineering at IIT Delhi. A theoretician by training and temperament, Amitabha’s core interests are in algorithms, data structures, graphs and probability. He has collaborated with domain experts on areas as diverse as network measurement, information retrieval, graph data processing, logistics and OR, and social network analysis, among others.

Mar. 19th

Title: CyberGISX for Reproducible and Scalable Geospatial Research and Education

Abstract: Geospatial research and education have become increasingly computation-, data-, and collaboration-intensive. In this context, cyberGIS has emerged as a new generation of geographic information science and systems (GIS), seamlessly integrating advanced cyberinfrastructure, GIS, and spatial analysis and modeling capabilities to enable broad research and education advances. Through holistic integration of high-performance and distributed computing, data-driven knowledge discovery, visualization and visual analytics, and collaborative problem-solving capabilities, cyberGIS helps bridge the significant digital divide between advanced cyberinfrastructure and geospatial communities. This talk introduces recent advances in cyberGIS with a particular focus on CyberGISX, which uses Jupyter notebooks as the primary user interface for conducting reproducible geospatial research and education at scale based on cutting-edge cyberinfrastructure and cyberGIS capabilities.

Bio: Anand Padmanabhan is a Research Associate Professor in the Department of Geography and Geographic Information Science at the University of Illinois at Urbana-Champaign (UIUC). He received his Ph.D. in Computer Science from the University of Iowa, and his research interests include advanced cyberinfrastructure and cyberGIS; geospatial data science and big data; geographic information systems and science (GIS); high-performance, data-intensive and cloud computing; and parallel and distributed systems. He has developed a number of tools and algorithms to enable cyberinfrastructure environments and has worked with large national cyberinfrastructure projects like the Open Science Grid (OSG) and NSF Extreme Science and Engineering Discovery Environment (XSEDE), engaging them to solve geospatial problems.
