Research
Joining the lab
I welcome graduate and undergraduate researchers passionate about improving the quality of computing systems. Please review the lab description and the active projects to see whether you are interested in joining our work. If so, follow these instructions to check your qualifications and submit your application.
Introduction
This page describes the lab’s general research directions (first section) and some details about the lab itself (second section). For a full list of publications, see here.
Our lab conducts software engineering research with an eye to practical impact. We see two modes of impact:
- Empirical: Our tools have found many defects and security vulnerabilities. Our research results have been adopted by major companies (Microsoft, IBM, Google) and major software systems (Node.js, Python, Ruby).
- Theoretical: Software engineering is always changing. Part of software engineering theory is to articulate and organize concepts so that practicing engineers can make sense of them (ontology). We write papers about this, and convey key ideas to practitioners through a Medium blog (75,000 views and counting).
Research directions
Broadly speaking, our lab’s research is “systems and security”-oriented software engineering.
Web security, focused on Software Engineering for Domain-Specific Languages
This research topic is about domain-specific languages, with applications to web security. Domain-specific languages (DSLs) simplify the engineering of complex computing systems. DSLs let engineers express domain-specific information fluently, rather than struggling through an articulation in a general-purpose language.
Regular expressions (regexes) are a widely used, hard-to-master DSL for string-matching problems. They are a frequent source of software defects: regexes gone awry have caused Internet-scale outages and are a potent denial-of-service vector.
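To make the failure mode concrete, here is a minimal sketch in Python (not drawn from our papers): a regex with nested quantifiers can take time exponential in the input length when a match ultimately fails, so a short adversarial string can stall an entire service.

```python
import re
import time

# Nested quantifiers make the grouping of the 'a's ambiguous, so when the
# overall match fails, a backtracking engine retries exponentially many
# groupings before giving up.
VULNERABLE = re.compile(r'^(a+)+$')

for n in range(18, 25, 2):
    payload = 'a' * n + '!'            # the trailing '!' forces the match to fail
    start = time.perf_counter()
    VULNERABLE.match(payload)
    print(f'n={n:2d}: {time.perf_counter() - start:.2f}s')

# The time roughly doubles with each added 'a'; a modestly longer payload
# can pin a CPU core indefinitely -- the essence of Regex Denial of Service.
```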
In our regex investigations, we have measured the difficulties that practitioners experience, and guided programming language designers toward regex engines that reflect the needs of practitioners.
Here are some of the questions we have explored:
- How widespread a problem is Regex Denial of Service? (ESEC/FSE’18, ICSE’22)
- How hard are regexes to work with? (ESEC/FSE’19, ASE’19)
- How generalizable is regex research? (ASE’19)
- How might we address Regex Denial of Service? (IEEE S&P’21, IEEE S&P’23, arXiv’24)
This work is supported by NSF SaTC-2135156.
Software Engineering for Data-Centric Computing (SE4ML, pre-trained models)
Complex computing systems incorporate machine learning models – data-centric components that predict the future based on the past (e.g. data science, ML, deep learning). Getting these models right requires developing the models themselves and building analysis pipelines that automatically and repeatedly process batches of data. Engineering these models is a critical aspect of modern computing.
Some problems in this domain are traditional, e.g. documenting one’s code, promoting modularity, and porting concepts from one programming language (or ML framework) to another. Other problems are new, e.g. understanding the nature of software re-use in this context (pre-trained models).
Here are some of the questions we have explored:
- How might provenance be applied to assist data scientists? (SIGMOD’19 demo, VLDB’20)
- What are the challenges and practices for the reuse of machine learning models? (ESEC/FSE-IVR’22, ICSE’23, JVA’23, ESEM’24, arXiv’24)
- What challenges arise when replicating deep learning models? (CSE’20 poster, arXiv’21, arXiv’24, EMSE’24)
- What should go into a dataset for mining pre-trained model packages? (MSR-Dataset’23, MSR’24)
- What are the usage practices and challenges of deep learning interoperability software such as ONNX? (ISSTA’24)
This research is supported financially by Google, Cisco, and NSF OAC-2107230.
The Failure-Aware Software Development Lifecycle (FA-SDLC)
All engineered systems fail — they do not fulfill their purposes, deviating from their specification or expected performance. International standards therefore recommend that engineering organizations undertake two complementary activities to respond to failure: (1) proactively anticipating failures to mitigate them (e.g., during design and implementation), and (2) analyzing failures to find opportunities for improvement (e.g., during incident postmortems and retrospectives). For software, we call the resulting engineering process the Failure-Aware Software Development Lifecycle (FA-SDLC).
We are interested in understanding the technical and human/organizational/social factors that support the FA-SDLC.
Here are some of the questions we have explored:
- Are software engineering researchers consistent and coherent in their analysis of failures? (ESEC/FSE-IVR’22, ESEC/FSE-IVR’23)
- What are the characteristics of failures in IoT systems? (ASE-NIER’22)
- How do engineering students respond to lessons about failures? (SERP4IoT’23)
- Can large language models help us automate the analysis of failures in “open-source intelligence” such as the news? (SCORED’23, arXiv’24)
- How do standards and regulations influence software engineering practice? (FSE’24, ICSE-Poster’24, ESEM’24, USENIX Security’24)
Software Engineering in Cyber-Physical Systems (IoT)
Software influences the physical world one way or another. Unlike traditional business software, in which physical-world effects are mediated by humans, Internet of Things (IoT) systems allow software to directly interact with the physical world through interconnected devices.
Embedded systems are some of the oldest computing systems (e.g. avionics), and there are well-established engineering methods to reduce catastrophic failure. However, these methods are not being applied in many safety-sensitive contexts, such as medical devices.
Here are some of the questions we have explored:
- How do software engineers think about machine learning and cybersecurity for IoT products? (SERP4IoT’22)
- Can we apply traditional program analyses to embedded software applications? (DSN-Disrupt’23, LCTES-WIP’23)
- How do we achieve good performance in resource-constrained environments (e.g. for security, for deep learning, etc.)? (HotMobile’22, ISLPED’21, ASP-DAC’22, ISLPED’22, Computer’23)
- How effective are bounded systematic techniques in validating embedded network stacks? (ASE’23)
This research is supported financially by Cisco and Rolls Royce.
Software Infrastructure: Software Supply Chains
Many modern software applications are composed of business logic and external components. This structure is recursive — the external components themselves have external components. The result is called the software supply chain. Traditional validation techniques suffice for assessing the correctness of the resulting applications. However, the degree of trust placed in third-party component providers necessitates understanding and measuring the risks (notably security risks) of this practice.
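To make the recursion concrete, here is a minimal sketch (hypothetical package names, plain Python rather than any real package manager) of how a handful of direct dependencies expands into the full set of components an application must trust:

```python
# Hypothetical dependency graph: each package lists its direct dependencies.
DEPENDENCIES = {
    "my-app":          ["web-framework", "logging-lib"],
    "web-framework":   ["http-parser", "template-engine"],
    "logging-lib":     ["json-formatter"],
    "http-parser":     [],
    "template-engine": ["html-escaper"],
    "json-formatter":  [],
    "html-escaper":    [],
}

def transitive_dependencies(package, graph, seen=None):
    """Walk the graph to collect every component the package ultimately trusts."""
    if seen is None:
        seen = set()
    for dep in graph.get(package, []):
        if dep not in seen:
            seen.add(dep)
            transitive_dependencies(dep, graph, seen)
    return seen

print(sorted(transitive_dependencies("my-app", DEPENDENCIES)))
# -> ['html-escaper', 'http-parser', 'json-formatter',
#     'logging-lib', 'template-engine', 'web-framework']
# Two direct dependencies, but six components in the supply chain.
```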
Here are some of the questions we have explored:
- What are general principles of secure software supply chains? (SCORED’22a)
- What are the characteristics of software supply chains in machine learning? (SCORED’22b)
- How commonly do open-source and commercial software artifacts provide provenance via cryptographic signatures, and what factors influence them? (S&P’24, arXiv’24)
Some papers from other areas (notably my work on pre-trained models) also fall into this category.
This research is supported financially by Google, Cisco, and NSF POSE-22297403.
About the Duality Lab
Overview
The vision of the Duality Lab is to improve the quality of complex computing systems.
We believe that computing systems will eventually mediate many human interactions with other humans and with the surrounding world. We therefore seek to improve the human experience by improving the quality of computing systems. Two factors are foundational to our success:
- Our diverse team helps us understand the ways that computing systems are used and perceived by many kinds of humans. Computing systems will touch all of humanity, and so all of humanity is needed to develop them.
- Our data-driven and systems approach grounds our work in real-world computing systems, ensuring that our findings and proposals impact the quality of computing systems in the here-and-now, not in the what-might-be.
In order to improve the quality of software-intensive computing systems, we take a scientific engineering approach.
- We empirically study engineering failures to drive the development of tools and systems that reflect practitioners’ needs and address their misconceptions.
- We blend techniques from software engineering, systems, and security in order to understand, measure, and ameliorate the issues that computing practitioners face.
- We apply methodologies appropriate to the task at hand: static and dynamic program analysis, pattern recognition and machine learning, algorithm development, and plenty of system building and hacking.
What’s in a name?
The Duality Lab is an abbreviation of the Davis Quality Lab.
“Quality”
What do we mean by “quality”? Some of our projects focus on functional properties like correctness and security, while others consider engineering process and human perspectives.
Since we must understand engineering practice before we can improve it, our research often has an empirical bent — examining engineering artifacts (e.g. mining software repositories) and engineers themselves (e.g. surveys and interviews).
“Duality”
Quality is often approached dualistically — technical or social, but not both. We aim to unite these perspectives.
- We believe that designing a high-quality system requires technical sophistication.
- We also believe that designing a high-quality system requires considering how humans will use it.
Call this what you will: human-in-the-loop, a socio-technical perspective, etc. We believe it is the only way to achieve truly high-quality computing systems.
Lab members
I am delighted to supervise many hard-working and talented students. You could join them! Here are the instructions to get started.
PhD
- Wenxin Jiang
- Paschal C. Amusuo
- Dharun Anand
- Kelechi G. Kalu
- Purvish Jajal
- Berk Çakar
- Huiyun Peng
- Daniel “Hocka” Lugo, US Space Force
- Drew Rozema
MSc
Undergraduate
- The Pre-Trained Models research team (through Purdue’s VIP program)
- Charlie Sale
- Nathaniel Bielanski
- Owen Cochell
- Ethan Burmane
- Arav Tewari
- Sophie Chen
Alumni
- Taylor Schorlemmer, MSc 2024, will serve as a cyber-officer in the US Army
- Jason Jones, MSc 2024, BotDojo
- William “Trey” Maxam, MSc 2023, will serve as an instructor at the US Coast Guard Academy
- Geoffrey Cramer, MSc 2023, Boundless
- Matthew Campbell, BSc 2024, Cisco
- Kyle Robinson, BSc 2024, Lockheed-Martin
- Ananya Singh, BSc 2023, Google
- Evan Williams, BSc 2023 (transferred to Cornell), SWE at AWS and lab assistant at Stanford
- David Li, BSc 2022, Google
- Zach Ghera, BSc 2022, Google
- Allen Liu, BSc 2022, Amazon
- Feny Patel, BSc 2022, Meta
- Efe Barlas, BSc 2022, Amazon
- Xin Du, BSc 2022, Aviatrix
- Diego Montes, BSc 2022, SpaceX
- Naveen Vivek, BSc 2022, AMD
- Anirudh Vegesana, BSc 2021, Pursuing MSc in CS@Stanford
- Vishnu Banna, BSc 2021, Apple