Jing Hu's AI Breakdown
AI Code Assistants Boost 26% of Productivity? Read The Small Print.

I read three 2024 AI code assistant papers for you. Free from jargon and hype.

Having worked closely with developers as a product lead, I want to address a few misconceptions in this post, especially for non-developers. I’ve had a CEO ask, "Why can't the team just focus on typing the code?” and heard some Big 4 consultants ask nearly identical questions. 

Guess what? It turns out that... a developer's job is more than typing code!

Think of constructing a luxury hotel. Beautiful rooms are essential, yes. But without a solid foundation, proper plumbing, reliable electricity, and thoughtful design, you’d end up with rooms stacked together, with no plumbing, no electricity, and so on.

Similarly, in software, developers need to ensure that all parts of the system work together, that the architecture can support future needs, and that everything is secure, just as a hotel must ensure the safety and comfort of its guests. Without this broader focus, you might have a lot of “rooms” and nothing else.

Many studies also seem to be making the same mistake, focusing on metrics like commits made (a commit is recorded each time a piece of code is written and saved), but that’s like measuring a luxury hotel’s progress by counting rooms or bricks laid each day. Are the walls soundproofed? Is the plumbing correctly installed?

See the issue? If those are the metrics in the real world, workers might focus on quantity, ignoring essential details.


The second issue here is hype. I talked about this before: I Studied 200 Years’ Of Tech Cycles. This Is How They Relate To AI Hype.

Hype is normally created by marketers, yes, but do not forget that the CEOs of the big companies are also great marketers themselves. These tech leaders sing the praises of AI in software development, emphasizing how these tools can significantly boost productivity.

Turning the Tide for Developers with AI-Powered Productivity: GenAI can boost developer efficiency by up to 20% and enhance operational efficiency.

Or Andy Jassy, CEO of Amazon, who noted that AI-assisted code transformation has saved considerable developer time and resources.

And the claim from Sundar Pichai, CEO of Google: a 25% boost using Google’s code generator.

I can imagine these endorsements have led many small business owners and managers to think they can replace developers with AI to cut costs. I've heard managers ask, "Why do we need more developers? Can’t AI handle this so we can expand the roadmap?"

The reality is that AI can effectively generate small, frequently used pieces of code. Even those CEOs who praise AI admit that it's most effective for handling simple coding tasks, while it struggles with larger, more complex projects. 


I’ve gathered multiple studies—some argue that AI helps, while others suggest it can do more harm than good.

I reached out to all the authors of these papers, and for those who responded, I’ve included their insights. I’ve also added quotes from CTOs with real-world experience using AI coding assistants. You’ll find all these comments at the end of the article.

Of course, do read the papers yourself and critically evaluate my points.

Shall we?

AI Code Assistants and Productivity Gains – The Good News (With Caution)

There’s no denying some level of productivity boost that AI tools like GitHub Copilot can bring. This study, The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers, highlights some promising results: developers using Copilot across Microsoft, Accenture, and another unnamed Fortune 100 company saw a 26% average increase in completed tasks.

For cost-saving purposes, that’s a headline worth celebrating.

While that is encouraging news, these results vary significantly depending on the company and context, and the details matter. Here’s what I found:

  • The productivity gains among Accenture teams are lower and far less certain, as shown by a high standard error of 18.72. Simply put, the estimate is so noisy that it says little about whether the gain is real or just chance. The data is also weight-adjusted, and I don’t know whether the weighting applied is a fair one.

Table 2: Experiment-by-Experiment Results (Weighted IV). From: The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers
  • The study didn’t discuss factors like team size, where the tasks fit in a wider tech roadmap, project complexity, and so on.

  • Junior developers using Copilot showed a 40% increase in pull requests compared to only a 7% increase among senior engineers. But do not mistake this for a true productivity boost. It might mean that Copilot gives junior developers more confidence to submit work frequently, but it could also mean they’re submitting smaller, incremental pieces, which is not the same as greater overall progress.

  • Additionally, junior developers committing their work more frequently may increase the review overhead for senior staff.

Figure 3: Heterogeneity of Copilot Effect by Tenure and Level (Weighted IV).
  • The 26% increase in completed tasks may not equal progress. This metric is broad and may include smaller or fragmented tasks that don’t require full code reviews or mark significant milestones. Without real-world development metrics, I can’t tell whether this boost reflects incremental, routine work or major progress.
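
To make the standard-error point concrete, here is a minimal sketch of how a rough 95% confidence interval follows from an estimate and its standard error. Only the 18.72 figure comes from the paper's table as quoted above. The point estimates fed in are hypothetical placeholders, and I assume the standard error is in the same units as the estimate (percentage points).

```python
# Rough 95% confidence interval from a point estimate and its standard error.
# The standard error (18.72) is the figure quoted for Accenture above; the
# point estimates used in the loop are HYPOTHETICAL, not values from the paper.

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def confidence_interval(estimate: float, std_error: float) -> tuple[float, float]:
    """Return (lower, upper) bounds of an approximate 95% interval."""
    margin = Z_95 * std_error
    return estimate - margin, estimate + margin


def distinguishable_from_zero(estimate: float, std_error: float) -> bool:
    """True when the interval excludes zero, so the gain is unlikely to be noise."""
    lower, upper = confidence_interval(estimate, std_error)
    return lower > 0 or upper < 0


ACCENTURE_STD_ERROR = 18.72
margin = Z_95 * ACCENTURE_STD_ERROR  # about 36.7 points in either direction

if __name__ == "__main__":
    print(f"margin of error: +/- {margin:.1f} points")
    for hypothetical_gain in (10.0, 26.0, 40.0):
        low, high = confidence_interval(hypothetical_gain, ACCENTURE_STD_ERROR)
        verdict = distinguishable_from_zero(hypothetical_gain, ACCENTURE_STD_ERROR)
        print(f"gain {hypothetical_gain:5.1f}: [{low:6.1f}, {high:6.1f}] real effect? {verdict}")
```

With a standard error that large, any estimated gain below roughly 37 points has an interval that includes zero, which is why the Accenture number alone cannot confirm a real gain.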

At least, what we know from this research is that using AI to assist work could help a junior developer lay bricks faster. However, this might give juniors false confidence (keep reading, you will see where this comes from), preventing them from learning and growing into senior roles — which isn’t just about years of experience.

As developers gain experience, they focus more on system design and long-term vision. This progression matters as we explore how developers at different levels use AI differently.

My concern about these studies is that the authors work with companies like Microsoft and Accenture, which are incentivized to champion AI as the ultimate productivity booster. Microsoft, of course, develops these tools, while Accenture is busy selling services like its GenWizard platform to help companies implement them.

Or, as Gary Marcus put it: Sorry, GenAI is NOT going to 10x computer programming.

As mentioned, I reached out to the authors. I didn’t expect a reply, but I heard from Professor Leon Musolff, to my surprise. Below are some of his replies to address my concerns:

Some coauthors are currently, and others were previously, employed by Microsoft, but others are independent researchers, and we would *never* have agreed to terms that only allowed for positive results… Had it looked as if Copilot was bad for productivity, we would certainly have published those results…

And to answer my question on whether this data is close to reality, he replied:

It’s difficult to assess whether an increase in pull requests and commits only reflects ‘incremental outputs’… Deeper productivity measures are just much noisier, which is why few papers investigate them.

My take: while AI coding assistants like Copilot can speed up certain tasks, these productivity boosts come with caveats. I would love to see research over a longer time frame that focuses on software project productivity; it is possible that we are looking too closely, and all we can see is noise.

AI Code Assistants and The Security Pitfalls

This study found that developers using AI assistants are more likely to write insecure code: Do Users Write More Insecure Code with AI Assistants?

Reason? It turned out that many developers trusted the AI’s output more than their own. 

See the figure below: those who used AI assistants and produced incorrect code still thought they had solved the task correctly. There are two more similar figures, one for the statement “I think I solved this task securely” and another for “I trusted the AI to produce secure code.” Both show the same pattern: those who used AI feel much more confident even when their code is wrong.

Figure 2: Participant responses (Likert-scale) to post-survey questions about belief in solution correctness, security, and, in the experiment group, the AI’s ability to produce secure code for each task. From: Do Users Write More Insecure Code with AI Assistants?

So, they assumed that code suggestions from the assistant were inherently correct. This assumption, however, comes with risks that are not immediately obvious.

My other highlights about this study: 

  • Higher Rates of Insecure Code: Developers using AI assistants wrote less secure code than the non-AI group on four out of five tasks, with 67% of the AI-assisted solutions deemed insecure compared to only 43% in the non-AI group.

  • Overconfidence in AI-Suggested Code: Over 70% of AI-assisted users believed their code was secure, even though they were more likely to produce insecure solutions than those coding independently.

  • Frequent Security Gaps: The AI-suggested code often contained vulnerabilities, such as improper handling of cryptographic keys or failure to sanitize inputs, that could lead to significant issues like data breaches. Yet, developers frequently accepted these outputs without a second look.
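
For non-developers wondering what “failure to sanitize inputs” looks like, here is a minimal sketch. It is my own illustration, not code from the study: the table, the function names, and the malicious input are all hypothetical. Both versions “work” for normal input, which is exactly why the insecure one is easy to accept without a second look.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Insecure: user input is pasted straight into the SQL text,
    # so crafted input can rewrite the query itself.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: a parameterized query keeps the input as data, never as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


# Hypothetical attacker input that turns the WHERE clause into "always true".
MALICIOUS_INPUT = "nobody' OR '1'='1"

if __name__ == "__main__":
    db = make_db()
    print("unsafe:", find_user_unsafe(db, MALICIOUS_INPUT))  # leaks every row
    print("safe:  ", find_user_safe(db, MALICIOUS_INPUT))    # matches nothing
```

The two functions differ by a single line, and a quick test with a normal name passes for both. That is the kind of gap that slips through when a suggestion merely “looks right.”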

Why is this happening?

  1. AI’s Confident Responses: AI assistants rarely (or never) signal when they’re uncertain, which can lead developers to adopt an “auto-pilot” mode. They accept suggestions, especially when the output “looks right.” This can quickly lead to vulnerabilities slipping through the cracks.

  2. Simplified Solutions Over Security: AI tools are optimized for fast, functional solutions rather than secure ones. In the study, AI often generated code that met the minimum functional requirements but ignored broader security practices. These security risks might be more costly than the time saved.

  3. Limited Prompt Flexibility: The study found that developers who didn’t tailor their prompts or adjust parameters like “temperature” (as in how creative the AI could be) often ended up with the most insecure code. Without specific instructions, the assistant might pick up less secure methods from its training data.
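
If “temperature” sounds abstract, here is a minimal sketch of how it is commonly applied when a language model picks its next word: candidate scores are divided by the temperature before being turned into probabilities. The scores below are hypothetical, and this is a generic illustration rather than the internals of the assistant used in the study.

```python
import math


def softmax_with_temperature(scores: list[float], temperature: float) -> list[float]:
    """Turn raw candidate scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# HYPOTHETICAL scores for three candidate completions (higher = more likely).
HYPOTHETICAL_SCORES = [2.0, 1.0, 0.5]

if __name__ == "__main__":
    for temp in (0.2, 1.0, 1.5):
        probs = softmax_with_temperature(HYPOTHETICAL_SCORES, temp)
        print(temp, [round(p, 3) for p in probs])
```

A low temperature piles almost all the probability onto the top candidate, so the output is predictable. A higher temperature spreads it out, so less likely (more “creative”) completions get picked more often.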

Note that this study mainly involves students and juniors, so I do not rule out the possibility that this group lacks experience and intuitively relies on AI code assistants. 

Table 1: Summary of Participant Demographics. From: Do Users Write More Insecure Code with AI Assistants?

That said, the authors in the next study found some practices that would reduce unwanted outcomes: devs should critically assess AI-generated code and tailor prompts. Proper AI usage practices and training could help increase devs’ productivity.

AI Code Assistants and The Bug Challenge

So far, we’ve talked about productivity and security, but here’s another hidden challenge: the bugs you don’t notice until they disrupt your workflow. That is the subject of a study on AI-generated “Simple, Stupid Bugs.”

The authors tested Codex by having it generate code completions for specific tasks. They then compared Codex's responses to a set of known correct solutions and known common bugs to see how often Codex produced accurate or flawed code.

So, if we continue with the hotel analogy, and you ask Codex to help with some of the construction tasks, here’s what happens:

  • Incorrect (54%): Most of Codex’s work is just plain wrong, like putting doors where windows should be. These completions don’t match a known bug, but they are not what you want either.

  • Bug (28.4% in total): Then there are mistakes, the equivalent of installing doorknobs backward or using the wrong type of screws in multiple places, so things don’t work.

The figure shows that only about 13% of Codex’s suggestions are correct “patches”: rooms that meet your standards right away. Over half of its work is plain wrong, and more than a quarter reproduces known bugs.

Fig. 3: Match rate of Codex Davinci (right). Completions that do not match a patch or SStuB are validated by hand (left). From: LLM- Simple, Stupid Bugs

Here are my highlights from the rest of this paper:

  • Codex Generated Twice as Many Errors as Fixes: Imagine an AI tool that helps you prepare reports, but for every polished section it produces, it leaves behind two errors.

  • Longer Clean-Up Times: AI-generated bugs took more than twice as long to fix as regular errors, like having to redo parts of a report multiple times. In this study, fixing these AI-created bugs took, on average, 265 revisions, compared to 106 for regular human-made mistakes.

  • AI might speed things up initially, but these hidden mistakes often take longer to clean up down the line.

  • High “Naturalness” of Mistakes: One reason these bugs were so tricky to catch is that they looked natural, blending into the code as if they belonged.
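
To show why a single-line bug can look so natural, here is a hypothetical example in the spirit of a “simple, stupid bug,” staying with the hotel theme. It is my own illustration and not taken from the study's dataset. The two functions differ by one character.

```python
def can_book_buggy(rooms_free: int, rooms_requested: int) -> bool:
    """Looks reasonable, but wrongly rejects a request for exactly all free rooms."""
    return rooms_free > rooms_requested  # bug: should be >=


def can_book_fixed(rooms_free: int, rooms_requested: int) -> bool:
    """Accepts any request that fits within the free rooms."""
    return rooms_free >= rooms_requested


if __name__ == "__main__":
    # Both versions agree on most inputs, so casual testing passes.
    print(can_book_buggy(5, 2), can_book_fixed(5, 2))
    # They only disagree on the edge case where the request equals availability.
    print(can_book_buggy(3, 3), can_book_fixed(3, 3))
```

Nothing about the buggy line looks out of place, and it behaves correctly most of the time. The mistake only surfaces at the edge case, often long after the code has been accepted.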

One thing devs can do to avoid part of the unwanted code is to add comments (essentially, be more specific). This helps reduce the number of bugs, regardless of model size or architecture.

Bug/Patch Ratio vs. Parameter Count. Three model families at various sizes. The largest difference is the addition of 1 comment prior to the SStuB. From: LLM- Simple, Stupid Bugs

Comments and interview videos with two authors:

In the simple stupid bugs database, these are all bugs that exist within a single line of code. But, of course, that’s not the only type of bug in the world. There are bugs that are conceptually more complex, like algorithms or multi-line dependencies. Our study doesn’t address those types, which are harder to track but crucial for real-world applications. — Professor Emily Morgan, interview link.

The primary cost of code is maintenance… we don’t really have much data on how code from LLMs fares in terms of maintenance costs. That’s what we were trying to do with the ‘simple, stupid bug’ study, but we really need more data on this. — Professor Premkumar Devanbu, interview link.

Let’s Hear What CTOs Say About AI Code Assistants.

That said, it’s not all doom and gloom.

We see a productivity boost from these tools in the development process. It just might be hard to measure.

Let’s hear some comments from real-world leaders.

Our front-end teams are all using GitHub Copilot … It’s become an essential tool, significantly speeding up code boilerplate, refactoring, and unit tests. Critically, they’re not asking it to create core algorithms, but as an assistant, it's a big boost.— Ian Mulvany, Chief Technology Officer at BMJ

AI tools are helpful for handling repetitive coding tasks and boilerplate code, boosting productivity in those areas. However, they still require human oversight; they’re not yet capable of fully understanding the deeper context of our projects. - Klaas Ardinois, Seasoned CTO, Strategic Consultant

For the time being, to get the most out of AI:

  • Use AI as a Starting Point: Let it handle the repetitive stuff, but always add your expertise. Relying on it this way is risky, however, for someone who has just started their career and has little expertise to add yet.

  • Stay Critical and Curious: Always question AI suggestions and verify correctness.

After seeing so many use cases in the last couple of years, my take is this: the more experience you have, the more likely you will benefit from AI.

Until Next Time… 

AI coding tools like GitHub Copilot have won much praise, especially from managers who see them as game-changers for productivity.

However, the productivity boosts come with some fine print. 

Spend some time on research, or subscribe and read my article to save yourself and your team an enormous amount of time. ;)

Next post: AI as a commodity like cloud computing. 
