GitHub Copilot: Can this ‘AI programmer’ really improve developer productivity?

GitHub has released a study showing that its recently released Copilot code completion tool really does correlate with improved developer productivity. 

GitHub Copilot, an AI pair-programming service, was made generally available a month ago at a cost of $10 per user per month or $100 per user per year. 

It's an extension for Microsoft's Visual Studio Code editor that suggests code to developers, which they can accept, reject or edit. The suggestions are generated by OpenAI's Codex, a natural-language AI model derived from GPT-3 that was trained on billions of lines of publicly available source code, including code published on GitHub. 

Copilot has courted some controversy, as not all developers are happy with their code being used to train it. But GitHub has now released a study that aimed to test its theory that Copilot leads to higher productivity among developers. 

Its researchers analyzed 2,631 survey responses from developers using Copilot and matched their responses to measurements collected from the IDE (integrated development environment). Part of the challenge was finding the best method to measure Copilot's effect on developer productivity. 

“We find that acceptance rate of shown suggestions is a better predictor of perceived productivity than the alternative measures,” the authors explain. 

In other words, GitHub is attempting to define how its service should be measured when assessing its impact on developer productivity. 

GitHub's method of measuring productivity differs from that of a non-GitHub study published in April, which measured completion times for repetitive tasks. That study's authors concluded that Copilot didn't necessarily improve task completion time or success rate, yet most of its 24 participants preferred to use Copilot because it often provided a useful starting point and saved the effort of searching online. 

One of the GitHub study's authors, Albert Ziegler, compares the service to “a pair programmer with a calculator attached”: something that's really good at fiddly work and trustworthy enough to close all brackets in the right order. He challenges the idea that developers mainly want to boost productivity by cutting down on online searches for reusable code snippets. 

“But the word ‘productivity’ in development contains a wide range of possible practical meanings. Do developers ideally want to save keyboard strokes or avoid searches on Google and StackOverflow?” asks Ziegler in a blog post. “Should GitHub Copilot help them stay in the flow by giving them highly accurate solutions on mechanical, calculator-like tasks? Or, should it inspire them with speculative stubs that might help unblock them when they're stuck?”

GitHub's study asked developers three key questions:

  1. Do people feel like GitHub Copilot makes them more productive?
  2. Is that feeling reflected in any objective usage measurements?
  3. Which usage measurements best reflect that feeling?

Ziegler notes that the study's findings show Copilot usage “correlates with improved developer productivity”. The strongest connection came from acceptance rate: the number of accepted suggestions divided by the number of shown suggestions. 

“This acceptance rate captures how many of the code suggestions GitHub Copilot produces are deemed promising enough to accept,” he notes. 

Developers who report the highest productivity gains with Copilot also accept the largest proportion of the suggestions shown to them.
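To make the metric concrete, here is a minimal sketch of how an acceptance rate could be computed from suggestion events. The `SuggestionEvent` shape is a hypothetical stand-in for IDE telemetry, not GitHub's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    """One completion shown to a developer (hypothetical telemetry shape)."""
    language: str
    accepted: bool

def acceptance_rate(events: list[SuggestionEvent]) -> float:
    """Acceptance rate = accepted suggestions / shown suggestions."""
    shown = len(events)
    if shown == 0:
        return 0.0
    accepted = sum(1 for e in events if e.accepted)
    return accepted / shown

# Example: 3 of 5 shown suggestions were accepted -> 60%
events = [
    SuggestionEvent("python", True),
    SuggestionEvent("python", False),
    SuggestionEvent("typescript", True),
    SuggestionEvent("typescript", True),
    SuggestionEvent("python", False),
]
print(f"acceptance rate: {acceptance_rate(events):.0%}")  # prints "acceptance rate: 60%"
```

Note that this counts every shown suggestion, including ones the developer never consciously reviewed; that is one reason defining the “right” productivity measure is harder than it looks.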

The study found different levels of acceptance rates for different languages. 

“We are aware that there are significant differences for how GitHub Copilot performs for different programming languages,” the authors note in the report. “The most common languages among our user base are TypeScript (24.7% of all shown completions in the observed time frame, 21.9% for users in survey), JavaScript (21.3%, 24.2%), and Python (14.1%, 14.5%). The latter two enjoy higher acceptance rates, possibly hinting at a relative strength of neural tooling versus deductive tooling for untyped languages. Regardless of language, survey participants had a slightly higher acceptance rate than the whole user base.” 

GitHub's authors also note that their measures of persistence, or how many suggestions were retained over time, weren't aligned with reported productivity. 

“In common with prior work we collected measurements about the acceptance of completions, but we also developed measures of persistence. This was based on the idea that for longer completions a developer might have to take more action after accepting a completion such as deleting or correcting an erroneous one.

“We were surprised to find that acceptance rate (number of acceptances normalized by the number of shown completions) was better correlated with reported productivity than our measures of persistence.”
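The study's persistence measures track whether accepted completions survive in the code over a series of time windows. As a rough, hypothetical illustration (not the study's actual implementation), one could score persistence as the similarity between an accepted suggestion and the same region of the file at a later checkpoint:

```python
import difflib

def persistence(accepted_text: str, buffer_text_later: str) -> float:
    """Fraction of an accepted completion still present at a later checkpoint.

    A hypothetical stand-in for the study's persistence measures: compare the
    accepted suggestion against the corresponding region of the file later on,
    using a similarity ratio (1.0 = unchanged, 0.0 = fully rewritten).
    """
    return difflib.SequenceMatcher(None, accepted_text, buffer_text_later).ratio()

# Example: a suggestion the developer lightly edited after accepting it.
suggested = "def mean(xs):\n    return sum(xs) / len(xs)\n"
edited    = "def mean(xs):\n    return sum(xs) / max(len(xs), 1)\n"
print(f"persistence after edit: {persistence(suggested, edited):.2f}")  # close to 1.0
```

Under a measure like this, a heavily reworked suggestion scores low even if it was exactly the starting point the developer needed, which is consistent with the authors' finding that persistence tracked reported productivity less well than acceptance rate.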

However, they argue that the lack of connection with persistence makes sense, because Copilot's value isn't in how many lines of code it automates correctly, but in giving users a template to tweak.

“But in hindsight, this makes sense. Coding is not typing, and GitHub Copilot's central value lies not in being the way the user enters the highest possible number of lines of code. Instead, it lies in helping the user to make the best progress towards their goals,” they note.

“A suggestion that serves as a useful template to tinker with may be as good or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes. This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for these kinds of tooling.”
