This morning, I gave a talk at the DARPA MUSE Conference in Austin, Texas on behalf of the GitHub Semantic Code team. The following is a written account of my experiences and learnings from the event. In addition to my talk, which focused on work being done by the Semantic Code and Machine Learning teams, we met with groups in academia, government and industry to learn about emerging technologies in the fields of Program Analysis and Machine Learning. These discussions led to a better understanding of how we can form collaborations with these groups and incorporate their techniques into our own work.
The Defense Advanced Research Projects Agency (aka DARPA) is part of the United States Department of Defense. They develop emerging technologies used by the military. Fun fact: DARPA was created in the 50s by President Eisenhower during the Space Race. For fans of the era, this is a great LEGO kit. DARPA has grown to influence several non-military projects, commonly funding and collaborating with external groups. Significant developments have been produced in networking, robotics and GUI technology. In fact, DARPA partially funded a robotics project I worked on at the University of Waterloo.
Mining and Understanding Software Enclaves (aka MUSE) is a group within DARPA. A large proportion of their work overlaps with our own. Much like us, they acknowledge that software systems are increasingly complex. Sophisticated tooling to support software development remains deficient—especially at scale. To address these challenges, MUSE does exploratory research within the fields of code analysis, code synthesis, automation and security. One of their initiatives is to expand commercial adoption of their technologies, and connect research groups to industry. They also held a breakout session, which we attended, to explore such opportunities.
While GitHub’s broader vision was shared, there was naturally a lot of interest in the work we’re doing on the Semantic Code team.
Wait, what does Semantic Code do again?
Simply put—we want to make it easier to code.
Advancements in computation have coincided with the rise of abstractions, toolsets, standards and best practices for engineers to lean on. While this has helped mature software engineering as a discipline, software development itself lacks the intelligent automation that several other labor-intensive tasks have benefited from. Several heuristics and frameworks guide evaluation of code quality, but it is difficult to empirically determine whether a piece of software is “good”. The process suffers from a lack of quantitative and scientific rigor. Furthermore, the tools intended to support development are often scattered and unreliable. There is immense potential for intelligent automation to improve software development, classification and evaluation.
This is a humorous example, but these types of issues have far-reaching consequences (beyond just making npm install slow for a few seconds). As explained, code bloat slows down sites, eats up bandwidth and drains batteries. In aggregate, this results in enormous costs. Developers should not have to meticulously audit every dependency they introduce to their projects to guard against unwelcome surprises (malicious or benign). GitHub can help with that, and with several other challenges that cripple developer productivity.
The takeaway that struck me the most is the multiplicity of approaches being taken to do things like program analysis, code synthesis and security vulnerability repair. I’m excited to see if the connections made at this conference result in tighter collaboration between researchers and GitHub. I know this topic is top of mind for many folks at GitHub—particularly @robrix and @mijuhan.
If you’re interested in learning about some of the sessions, read on.
Takeaways from sessions
It is worth noting that most of the approaches were language-agnostic.
Two Six Labs, David Slater, Understanding source code functionality and structure using deep learning
David spoke about his work on source code summarization. Their goal is to auto-generate documentation (for better or worse) based on source code, in addition to providing rich annotation and search over a large corpus of code. To do this, they’ve developed three models:
- A convolutional neural network model that applies semantic tags (for example, “image processing” or “recursion”) to source code documents in arbitrary languages.
- A Machine Learning model that provides natural language summaries of source code using a sequence-to-sequence approach.
- A Machine Learning model that breaks up source code into logical segments, with the intention of making code annotations a lot more granular.
Within GitHub, @hamelsmu successfully prototyped auto-generated Issue summaries, also using a sequence-to-sequence model. Beyond Issue summarization, we have ML projects in flight capable of analyzing code blocks.
From the Semantic Code perspective, I thought their third approach was interesting because it looks directly at source code, which differs from our work, which relies primarily on Abstract Syntax Trees (ASTs).
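To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not code from Two Six Labs or the Semantic Code team; the helper names `token_strings` and `same_ast` are hypothetical, and Python's standard `tokenize` and `ast` modules stand in for whatever tooling either group actually uses. It shows that two snippets can differ as raw source text while sharing the same AST.

```python
import ast
import io
import tokenize

# Two hypothetical snippets that differ only by redundant parentheses.
SOURCE = "def add(a, b):\n    return a + b\n"
PARENS = "def add(a, b):\n    return (a + b)\n"

# Layout-only token types that carry no text of interest.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_strings(source):
    """Return the lexical tokens of the raw source text, in order."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _SKIP]


def same_ast(left, right):
    """Return True when both snippets parse to structurally equal ASTs."""
    # ast.dump omits line and column attributes by default,
    # so only the tree structure is compared.
    return ast.dump(ast.parse(left)) == ast.dump(ast.parse(right))


if __name__ == "__main__":
    print(token_strings(SOURCE))
    print(token_strings(PARENS))
    print("same AST:", same_ast(SOURCE, PARENS))
```

A model that reads source text sees the parentheses, comments and layout that an AST discards, which is one reason the two views can complement each other.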
Draper, Marc McConley, Machine Learning for Vulnerability, Classification and Repair
Draper’s DeepCode program focuses on using ML techniques for vulnerability classification and repair. They use a combination of deep learning and traditional techniques over a training corpus to distinguish code with and without vulnerabilities. Draper does this through an integrated tool chain that scrapes and builds open-source code. They extract features from this code and use them to train learning algorithms that locate patterns representing errors. Once identified, they repair these errors, with an accuracy rate of 90%. Once they have accumulated enough good and bad examples in the training corpus, they’re interested in training to repair for realistic applications. To do this, they developed a generative adversarial network (GAN) to train for repair so they would not have to rely on good-bad pairs.
They’ve been able to harvest ~200,000 repos and their total database has over 5 terabytes of training artifacts.
As you know, we shipped security alerts on GitHub in November, so it was exciting to see different approaches being taken to accomplish the same objective.
University of Pennsylvania, Mayur Naik, Hunting Software Bugs Using Machine Learning
This talk was interesting because it focused on a meta challenge: while there are several approaches to software reliability, how can we improve program analysis itself? Given that any non-trivial program property is undecidable, what does it mean to improve program analysis? Not only that, but the approach they used deviated from many of the strategies I’ve been exposed to on the Semantic Code team. The main premise of the talk was that most approaches that automate repair use logical reasoning. While logical reasoning techniques offer a lot of pros (correctness guarantees, easier to interpret), they lack the ability to handle the uncertainty and noise that arise in the wild (often from imprecision, missing code, or imperfect environment models). The Machine Learning approach they used integrates existing logical reasoning techniques with probabilistic reasoning. By incorporating probabilistic reasoning, you get the benefit of richer data, whereas the logical reasoning portion allows for more accurate program analysis.
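As a rough illustration of what mixing the two styles of reasoning can look like, here is a toy Python sketch. It is not the speaker's system: the data-flow edges, their probabilities, and the names `derivations` and `rank_alarms` are all assumptions made up for this example. The logical part enumerates every derivation (path) from an untrusted source to a sensitive sink; the probabilistic part attaches a confidence to each derivation so that alarms can be ranked instead of reported as equally certain.

```python
# Hypothetical data-flow edges. Each probability is an assumed confidence
# that the edge really exists (for example, because the analysis had to
# guess about missing code or an imprecise environment model).
EDGES = {
    ("user_input", "query"): 0.95,
    ("query", "db_exec"): 0.90,
    ("config", "log_msg"): 0.99,
    ("log_msg", "db_exec"): 0.20,
}


def derivations(edges, node, sink, seen=()):
    """Yield the confidence of each acyclic path from node to sink."""
    if node == sink:
        yield 1.0
        return
    path = seen + (node,)
    for (src, dst), prob in edges.items():
        if src == node and dst not in path:
            for rest in derivations(edges, dst, sink, path):
                # A derivation is only as likely as all of its steps together.
                yield prob * rest


def rank_alarms(edges, sources, sink):
    """Return (source, confidence) alarms, most confident first."""
    alarms = []
    for source in sources:
        confidences = list(derivations(edges, source, sink))
        if confidences:
            # Score each alarm by its most likely derivation.
            alarms.append((source, max(confidences)))
    return sorted(alarms, key=lambda alarm: alarm[1], reverse=True)


if __name__ == "__main__":
    for source, confidence in rank_alarms(EDGES, ["user_input", "config"], "db_exec"):
        print(f"{source} -> db_exec: confidence {confidence:.3f}")
```

A purely logical analysis would flag both flows into `db_exec`; with the assumed probabilities, the flow from `user_input` ranks well above the one from `config`, which is the kind of prioritization that noisy, real-world code calls for.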
This made me even more amped to see us starting to have more conversations with the Machine Learning team and investigating how we can best work with them to combine approaches (for example, doing deep learning over ASTs).
Kestrel, Eric Smith, Safely Using Code from the Internet
This team does program synthesis based on functional specifications. The goal is to find an existing program in a big code corpus that provides the desired functionality, and then to validate that program by proving its correctness. To find this code, they extract program features using static analysis and then use clustering and program similarity. They’ve also developed quite a few tools to help process and search the corpus, and to remove duplication.
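The feature-extraction and similarity steps can be sketched in a few lines of Python. This is a simplified stand-in, not Kestrel's tooling: the feature set (AST node types plus names of directly called functions), the Jaccard measure, and the names `extract_features`, `jaccard` and `most_similar` are all assumptions chosen for illustration, and the correctness-proving step is left out entirely.

```python
import ast


def extract_features(source):
    """Statically extract a set of coarse features from Python source."""
    features = set()
    for node in ast.walk(ast.parse(source)):
        features.add("node:" + type(node).__name__)
        # Record the names of functions that are called directly.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            features.add("call:" + node.func.id)
    return features


def jaccard(left, right):
    """Jaccard similarity of two feature sets, between 0.0 and 1.0."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def most_similar(query, corpus):
    """Return the name of the corpus entry most similar to the query."""
    query_features = extract_features(query)
    return max(
        corpus,
        key=lambda name: jaccard(query_features, extract_features(corpus[name])),
    )


# A tiny hypothetical corpus keyed by made-up program names.
CORPUS = {
    "sum_loop": "def total(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n",
    "greet": "def greet(name):\n    print('hello ' + name)\n",
}

QUERY = "def accumulate(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

if __name__ == "__main__":
    print(most_similar(QUERY, CORPUS))
```

Because the features ignore identifier names, the renamed query still matches `sum_loop`. The same property makes this kind of measure a plausible starting point for spotting duplicates in a corpus.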
Most of the work they’ve done so far was over Bitcoin data. I asked whether cryptographic algorithms had properties that made them suitable candidates for this type of exploration, but learned that their decision was purely based on the hype and trendiness of the blockchain.
Facebook did not give a presentation, but I had a discussion with their Machine Learning Lead. My conversation with him sparked a lot of curiosity about their work. He immodestly proclaimed they are making strides in developing an AI code-completion assistant. It is capable of predicting not just the next token, but also the rest of the line and the next couple of lines. They trained their language models over a large corpus of data and I’m curious to learn more.