Scaling sophisticated program analyses to large codebases has been a key challenge in program analysis research for at least a decade. This inability to scale is a major factor preventing analysis-based techniques (e.g., verification, model checking, and static bug detection) from being widely adopted in industry. Program analysis researchers typically tackle the problem by developing approximations, trading off analysis capability for scalability. However, approximations render analyses less useful, and even with approximations, most analyses still cannot scale to large software on the computing frontier, such as Hadoop, HDFS, or Spark. Prof. Xu's systems-building experience enables him to think from a different angle: now that efficient systems can be built to process datasets as large as the whole Internet or the human genome, why not shift our focus from improving algorithms and building better analysis tools to leveraging decades of experience in the systems community to develop efficient Big Code analysis systems (not just tools)? We hope that, through a series of systems-level efforts, we can unleash the power of program analysis on Big Code, helping realize the dream of developing fully verified, bug-free software.
Modern computing has entered the era of Big Data. Developing systems that can scale to massive amounts of data with relatively small amounts of resources is a key challenge faced by both researchers and practitioners. Conventional wisdom holds that the performance of a scalable system should increase proportionally with any added resource (e.g., CPU, memory, or network bandwidth). However, we believe this is not sufficient. An important yet often ignored aspect of scalability is that when the amount of one kind of resource decreases, the performance of the system should not degrade proportionally. In other words, the system should be designed so that resources of other kinds can be exploited to compensate for the lost resource. Driven by this insight, our group has made a number of attempts toward building highly adaptive Big Data systems that automatically adapt their behavior to the amount of available resources.
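To make the adaptivity principle concrete, the following is a minimal, hypothetical sketch (the class and its parameters are illustrative, not taken from our systems): a memoizing component that, when its memory budget shrinks, falls back to recomputation — spending CPU to compensate for lost memory — rather than failing or slowing down disproportionately.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Illustrative sketch of resource adaptivity: trade memory for CPU.
// When the cache budget is exhausted, results are recomputed instead of
// stored, so a smaller memory budget degrades performance gracefully.
public class AdaptiveMemoizer {
    private final IntUnaryOperator compute;   // the (possibly expensive) function
    private final int maxCachedEntries;       // memory budget, in entries
    private final Map<Integer, Integer> cache = new HashMap<>();

    public AdaptiveMemoizer(IntUnaryOperator compute, int maxCachedEntries) {
        this.compute = compute;
        this.maxCachedEntries = maxCachedEntries;
    }

    public int apply(int x) {
        Integer cached = cache.get(x);
        if (cached != null) return cached;    // fast path: spend memory
        int result = compute.applyAsInt(x);
        if (cache.size() < maxCachedEntries) {
            cache.put(x, result);             // within budget: remember it
        }
        // Over budget: return without caching; future calls recompute,
        // exploiting CPU to remedy the missing memory.
        return result;
    }
}
```

With a budget of zero entries the component still produces correct results — every call simply recomputes — which is the behavior a non-adaptive design (e.g., one that assumes the cache always fits) would not provide.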
Despite faster CPUs and larger memory systems, the level of inefficiency in real-world programs grows surprisingly fast, and there is an ever-increasing demand for performance optimization in modern software. Performance and scalability issues are becoming increasingly critical, partly due to the pervasive use of object-oriented programming languages. The inefficiencies inherent in the implementation of an object-oriented language, together with the design and implementation principles commonly adopted in the object-oriented community, often combine to hurt performance. The community-wide recognition of the importance of abstraction and reuse has led to increased emphasis on modular design, declaration of general interfaces, and use of models and patterns. Programmers are taught to focus first and foremost on these principles, taking it for granted that compilers and run-time systems will remove any resulting inefficiencies. In a large program, typically built on top of many layers of frameworks and libraries, a small set of inefficiencies can multiply and quickly be magnified to slow down the entire system. When the call stack grows deep, the dataflow analyses in a dynamic compiler lose precision and the optimizer can no longer remove these inefficiencies. As a result, many applications suffer from chronic run-time performance problems that significantly affect scalability and performance. This is a serious problem for real-world software systems used every day by thousands of businesses. The pressing need for new optimization techniques is especially evident as object orientation makes its way into systems of every size.
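The kind of layered inefficiency described above can be illustrated with a small, contrived Java example (the class and method names are hypothetical): each abstraction layer defensively copies its input — a common object-oriented pattern — producing short-lived containers and boxed integers that a JIT compiler often cannot eliminate once such calls are buried deep in a real call stack.

```java
import java.util.ArrayList;
import java.util.List;

// Contrived illustration: two "framework layers" that each make a
// defensive copy, plus boxing on every element. Each individual copy
// looks harmless; stacked across many layers, the allocations multiply.
public class LayeredCopies {
    // Layer 1: wraps the data in a defensive copy.
    static List<Integer> layer1(List<Integer> data) {
        return new ArrayList<>(data);            // temporary copy #1
    }
    // Layer 2: calls layer 1, then copies again "for safety".
    static List<Integer> layer2(List<Integer> data) {
        return new ArrayList<>(layer1(data));    // temporary copy #2
    }
    // Sums through both layers; unboxes an Integer per element.
    static long sumLayered(List<Integer> data) {
        long sum = 0;
        for (Integer v : layer2(data)) sum += v;
        return sum;
    }
    // The same computation with no intermediate copies or boxing.
    static long sumDirect(int[] data) {
        long sum = 0;
        for (int v : data) sum += v;
        return sum;
    }
    public static void main(String[] args) {
        int n = 1000;
        List<Integer> boxed = new ArrayList<>();
        int[] raw = new int[n];
        for (int i = 0; i < n; i++) { boxed.add(i); raw[i] = i; }
        // Identical results, but the layered version allocated two full
        // copies of the list on the way to the answer.
        System.out.println(sumLayered(boxed) == sumDirect(raw)); // true
    }
}
```

In a 30-line example a JIT compiler may well clean this up; the point of the paragraph above is that once such patterns are nested across many framework and library layers, the compiler's dataflow analyses no longer see far enough to do so.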
The extensive use of object-oriented languages in the development of memory-constrained applications such as smartphone apps (e.g., Java on Android and C# on Windows Phone) and data-intensive systems (e.g., Hadoop, Giraph, and Hyracks) introduces numerous research challenges: these systems have small memory spaces but large amounts of data to process, and their inefficiencies can be significantly exacerbated. The burden of reducing unnecessary work should not rest solely on the shoulders of hardware designers, especially in the modern era when Moore's dividend is becoming less obvious. There is a strong need for higher-level performance optimization techniques that can detect and remove inefficiencies across all categories of object-oriented applications. Our group has an established record of performance optimization for large-scale systems. Our recent efforts focus on the following projects: