AI Models from Google, OpenAI, Anthropic Solve 0% of ‘Hard’ Coding Problems

cm0002@lemmy.world · 8 months ago

Endmaker@ani.social · edit-2 8 months ago

In the ‘Medium’ difficulty category, OpenAI’s o4-mini-high model scored the highest at 53.5%.

This fits my observation of such models. o4-mini-high is able to help me with 80-90% of the problems at work. For the remaining problems, it would come up with a nonsensical solution and, no matter how much I prompted it, it would tunnel-vision on that specific approach. It could never second-guess itself, realise that its initial solution was completely off the mark, and try an entirely different approach. That’s where I usually step in and do the work myself.

It still saves me time with the trivial stuff though.

I can’t say the same for the rest of the LLMs. They are simply no good at coding and just waste my time.

yogsototh@programming.dev · 8 months ago

I didn’t see Claude 4 Sonnet in the tests, and this is the one I use. From my experience, it seems to be in about the same category as o4-mini.

It is a nice tool to have in my belt. But these LLM-based agents are still very far from being able to do advanced and hard tasks. To me, it is probably more important to communicate and learn about the limitations of these tools, so as not to lose time instead of gaining it.

In fact, I am not even sure they are good enough to be used to really generate production-ready code. But they are nice for pre-reviewing, building simple scripts that don’t need to be highly reliable, analysing a project, asking specific questions, etc. The game changer for me was using Clojure-MCP. Having a REPL available really enhances the quality of most answers.

Ugurcan@lemmy.world · 8 months ago

For me, Claude Code is where everything finally clicked. For advanced stuff, sure, they’re shit when they’re left alone. But as long as I approach it as a Junior Developer (breaking down the tasks into easy bites, having a clear plan all the time, steering away from pitfalls), I find myself enjoying other stuff while it’s doing the monkey work. Just be sure you provide it with tools, MCP, RAG and some patience.

technocrit@lemmy.dbzer0.com · 8 months ago

Search engines are able to help me with 100% of work.

Rikudou_Sage@lemmings.world · 8 months ago

I remember those times, too (well, some 99.9%, there are still the few issues I never found a solution to).

But these times are long past, search engines suck nowadays.

nieceandtows@programming.dev · 8 months ago

Not anymore. They’ve all made deals with each other, and search engines SUCK these days

🇨🇦 tunetardis@piefed.ca · 8 months ago

For instance, if an AI model could complete a one-hour task with 50% success, it only had a 25% chance of successfully completing a two-hour task. This indicates that for 99% reliability, task duration must be reduced by a factor of 70.
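A quick sanity check of the quoted figures, as a minimal sketch. It assumes success decays exponentially with task length (a constant "half-life" model), which is an assumption used here to reproduce the numbers, not something the quote spells out. The function names are made up for illustration.

```python
import math


def success_probability(task_hours, half_life_hours=1.0):
    """Chance of finishing a task, assuming success halves every half_life_hours."""
    return 0.5 ** (task_hours / half_life_hours)


def max_task_hours(reliability, half_life_hours=1.0):
    """Longest task that still succeeds with the requested probability."""
    return half_life_hours * math.log(reliability) / math.log(0.5)


# A model with a 50% success rate on one-hour tasks, as in the quote.
two_hour_success = success_probability(2.0)
shrink_factor = 1.0 / max_task_hours(0.99)

print(f"2-hour task success: {two_hour_success:.0%}")
print(f"Shrink factor for 99% reliability: {shrink_factor:.0f}x")
```

Under that assumption the two-hour task comes out at 25%, and the task has to shrink by a factor of roughly 69 to reach 99% reliability, which is in line with the "factor of 70" in the quote.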

This is interesting. I have noticed this myself. Generally, when an LLM boosts productivity, it shoots back a solution very quickly, and after a quick sanity check, I can accept it and move on. When it has trouble, that’s something of a red flag. You might get there eventually by probing it more and more, but there is good reason for pessimism if it’s taking too long.

In the worst case scenario where you ask it a coding problem for which there is no solution—it’s just not possible to do what you’re asking—it may nevertheless engage you indefinitely until you eventually realize it’s running you around in circles. I’ve wasted a whole afternoon with that nonsense.

Anyway, I worry that companies are no longer hiring junior devs. Today’s juniors are tomorrow’s elites and there is going to be a talent gap in a decade that LLMs—in their current state at least—seem unlikely to fill.

Modern_medicine_isnt@lemmy.world · 8 months ago

Sadly, the lack of junior devs means my job is probably safe until I am ready to retire. I have mixed feelings about that. On the one hand, yay for me. On the other, sad for the new grads. And sad for software as a whole. But software truly sucks, and has only been enshittifying worse and worse. Could a shake-up like this somehow help that? I don’t see how, but who knows.

Zexks@lemmy.world · 8 months ago

I’ve noticed this too and it’s even weirder when you compare it to a physics question. It very consistently tells me when my recent brain fart of an idea is just plain stupid. But it will try eternally to help me find a coding solution even if it just keeps going in circles.

otacon239@lemmy.world · 8 months ago

I think part of this comes down to the format. Physics can often be analogized and can be very conversational when it comes to demonstrating ideas.

Most code also looks pretty similar if you don’t know how to read it and unlike language, the syntax is absolute with no room for interpretation or translation.

I’ve found it’s consistently good if you treat it like a project specification: include all of your requirements in a list format in the very first message, have it pseudocode the draft along with a list of the libraries it wants to use, and make sure they work how you expect.

There’s some screening that goes into utilizing it well and that only comes with already knowing roughly how to code what you’re trying to make.

Beej Jorgensen@lemmy.sdf.org · 8 months ago

Sucks for today’s juniors, but that gap will bring them back into the fold with higher salaries eventually.

Schal330@lemmy.world · 8 months ago

In the worst case scenario where you ask it a coding problem for which there is no solution—it’s just not possible to do what you’re asking—it may nevertheless engage you indefinitely until you eventually realize it’s running you around in circles.

Exactly this, and it’s frustrating as a Jr dev to be fed this bs when you’re learning. I’ve had multiple scenarios where it blatantly told me wrong things. Like using string interpolation in a terraform file to try and set a dynamic source - what it was giving me looked totally viable. It wasn’t until I dug around some more that I found out that terraform init can’t use variables in the source field.
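Assuming the comment is about the `source` argument of a Terraform `module` block, that argument has to be a literal string, because it is resolved during `terraform init` before variables are evaluated. One possible workaround is to generate the file outside Terraform. The sketch below is only an illustration of that idea: the template, the `render_module` helper, and the module name and path are all hypothetical.

```python
from string import Template

# Hypothetical template for a module block; the source is filled in
# by this script so that Terraform only ever sees a literal string.
MODULE_TEMPLATE = Template(
    'module "$name" {\n'
    '  source = "$source"\n'
    '}\n'
)


def render_module(name: str, source: str) -> str:
    """Render a module block with a literal source path."""
    if "${" in source:
        # Interpolation would be rejected by `terraform init` anyway.
        raise ValueError("source must be a literal string, not an interpolation")
    return MODULE_TEMPLATE.substitute(name=name, source=source)


rendered = render_module("network", "./modules/network")
print(rendered)
```

The design choice here is to move the "dynamic" part into a pre-processing step, so the decision about which source to use is made before Terraform runs.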

On the positive side, it helps give me some direction when I don’t know where to start. I use it with a highly pessimistic and cautious approach. I understand that today is the worst it’s going to be, and that I will be required to use it as a tool in my job going forward, so I’m making an effort to get to grips with it.

katy ✨@piefed.blahaj.zone · 8 months ago

ai is basically just the worst answer on stackexchange

gens@programming.dev · 8 months ago

It’s a rubber ducky that talks back. If you don’t take it seriously, it can reach the level of usefulness just above a wheezing piece of yellow rubber.

Saledovil@sh.itjust.works · 8 months ago

They aren’t as cute as actual rubber ducks, though.

nieceandtows@programming.dev · 8 months ago

Actual rubber ducks don’t randomly spew bullshit either

daniskarma@lemmy.dbzer0.com · 8 months ago

The bullshit is good: it triggers Cunningham’s Law in my brain.

Sometimes it’s easier to come up with a solution correcting something blatantly wrong than doing it from scratch.

merc@sh.itjust.works · 8 months ago

It’s literally the most common answer on stackexchange.

FizzyOrange@programming.dev · 8 months ago

I don’t think that’s a surprise to anyone that has actually used them for more than a few seconds.

MTK@lemmy.world · 8 months ago

Please babe! Just one more parameter, then it will be AGI!

ulterno@programming.dev · 8 months ago

Just 1 more kiloton of Uranium.
It will be ready by the time that’s depleted.

atzanteol@sh.itjust.works · 8 months ago

The claims that AI will be surpassing humans in programming are pretty ridiculous. But let’s be honest - most programming is rather mundane.

ulterno@programming.dev · 8 months ago

Never have I had to implement any kind of ridiculous algorithm to pass tests with huge amounts of data in the least amount of memory, as the competitive websites show.

It has been mostly about:

- Finding the correct library for a job and understanding it well, to prevent footguns and blocking future features
- Design patterns for better build times
- Making sane UI options and deciding resource alloc/dealloc points that would match user interaction expectations
- cmake

But then again, I haven’t worked in FinTech or Big Data companies, neither have I made an SQL server.

magikmw@lemm.ee · 8 months ago

Because actually writing code is the least important part of programming.

Rikudou_Sage@lemmings.world · edit-2 8 months ago

I mean, not the least important, it is an important part. But way less than a common person thinks.

technocrit@lemmy.dbzer0.com · 8 months ago

Pretty sure that autocomplete would be terrible at these tasks too.

ulterno@programming.dev · 8 months ago

There are some times when I wish I were better at regexp and scripting.
Times when I am writing a similar kind of thing again and again, which is just different enough (and small enough number of repetitions) that it doesn’t seem viable to make the script.

At those times, I tend to think - maybe Cursor would have done this part well - but have no real idea since I have never used it.

On the other hand, if I had a scripting endpoint from clang [1], I would have used that to make a batch processor for even a repetition as small as 10 and wouldn’t have thought once about AI.

[1] One which would have taggified parts of code (in the same tone as “parts of speech”) like function declaration, return type, function name, type qualifier etc.

wetbeardhairs@lemmy.dbzer0.com · 8 months ago

Well, this kind of AI won’t ever be useful as a programmer. It doesn’t think. It doesn’t reason. It cannot make decisions besides using a ton of computational power and enormous deep neural networks to shit out a series of words that seem like they should follow your prompt. An LLM is just a really, really good next-word guesser.

So when you ask it to solve the Tower of Hanoi problem, great, it can do that, because it saw someone else’s answer. But if you ask it to solve it for a tower that is 20 disks high, it will fail because no one ever talks about going that far, and it flounders. It’s not actually reasoning to solve the problem - it’s regurgitating answers it has ingested from stolen internet conversations. It’s not even attempting to solve the general case because it’s not trying to solve the problem, it’s responding to your prompt.
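For reference, the general case mentioned above is a short recursion. The following is a minimal sketch; the peg names and the `solve_and_check` helper are made up for illustration. It simulates the pegs so that every move is checked for legality, and it works for 20 disks the same way it does for 3.

```python
def hanoi(n, source, target, spare, move):
    """Move n disks from source to target, calling move(src, dst) for each step."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, move)   # clear the way
    move(source, target)                        # move the largest remaining disk
    hanoi(n - 1, spare, target, source, move)   # stack the rest back on top


def solve_and_check(n):
    """Run the solver on simulated pegs and return the number of moves made."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    moves = 0

    def move(src, dst):
        nonlocal moves
        disk = pegs[src].pop()
        # A larger disk may never be placed on a smaller one.
        assert not pegs[dst] or pegs[dst][-1] > disk
        pegs[dst].append(disk)
        moves += 1

    hanoi(n, "A", "C", "B", move)
    assert pegs["C"] == list(range(n, 0, -1))
    return moves


print(solve_and_check(3))
```

The number of moves is 2^n - 1, so a 20-disk tower takes 1,048,575 moves; the algorithm is the same, only the output is long.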

That said - an LLM is also great as an interface to allow natural language and code as prompts for other tools. This is where the actually productive advancements will be made. Those tools are garbage today but they’ll certainly improve.

atzanteol@sh.itjust.works · 8 months ago

Well, this kind of AI won’t ever be useful as a programmer

It already is.

childOfMagenta@jlai.lu · 8 months ago

You mean useful to a programmer, or as useful as a programmer?

atzanteol@sh.itjust.works · 8 months ago

Ah - yeah I read that wrong. It’s useful to a programmer.

wetbeardhairs@lemmy.dbzer0.com · 8 months ago

I explicitly meant “as”. It’s great as autocomplete. Not as an agent to complete programming tasks.

childOfMagenta@jlai.lu · 8 months ago

Thanks for clarifying.

atzanteol@sh.itjust.works · 8 months ago

It’s great as autocomplete.

I love the weird need to downplay just how good AIs are by calling them “autocomplete”.

wetbeardhairs@lemmy.dbzer0.com · 8 months ago

Did you even read my earlier comment?

Ledivin@lemmy.world · 8 months ago

My productivity has at least tripled since I started using Cursor. People are actually underestimating the effects that AI will have in the industry

PushButton@lemmy.world · 8 months ago

It means the AI is very helpful to you. This also means you are as good as 1/3 of an AI in coding skills…

Which is not great news for you, mate.

atzanteol@sh.itjust.works · 8 months ago

Ah knock it off. Jesus you sound like people in the '90s mocking “intellisense” in the IDE as somehow making programmers “less real programmers”.

It’s all needless gatekeeping and purity test BS. Use tools that are useful. Don’t worry if it makes you less of a man.

Feyd@programming.dev · 8 months ago

It’s not gatekeeping, it’s true. I know devs that say AI tools are useful, but all the ones that say it makes them multiples more productive are actually doing negative work, because I have to deal with their terrible code that they don’t even understand.

atzanteol@sh.itjust.works · 8 months ago

The devs I know use it as a tool and check their work and fully understand the code they’ve produced.

So your experience vs. mine. I suspect you just work with shitty developers who would be producing shitty work whether they were using AI or not.

Ledivin@lemmy.world · edit-2 8 months ago

I literally don’t write code anymore, I write detailed specs, invest a lot of time into my guardrails and integrations, and review changes from my agents. My code quality has not fallen, in fact we’ve been able to be much more strict about our style guidelines.

My job has changed completely, but the results are the same - simply much, much faster. And to be clear, this is in code bases that are hundreds of thousands of lines deep, across multiple massive monorepos, and using context from several different documentation sites - both internal and external.

If anything, people are understating the effects this will have over the next year, let alone further. The entry-level IC dev is dead. If you aren’t producing at least twice as fast as you used to, you’re going to be left behind. I cannot possibly suggest strongly enough that you start learning how to use it.

Feyd@programming.dev · 8 months ago

Sure, Jan

technocrit@lemmy.dbzer0.com · 8 months ago

People are actually underestimating the effects that AI autocomplete will have in the industry

Rikudou_Sage@lemmings.world · 8 months ago

True, I use some local model by Jetbrains that only completes a single line and that’s my sweet spot, it usually guesses the line well and saves me some time without forcing me to read multiple lines of code I didn’t write.

AlecSadler@lemmy.blahaj.zone · 8 months ago

Tripled is an understatement for me. Cursor and Claude Code are a godsend for OE for me.

daniskarma@lemmy.dbzer0.com · 8 months ago

They have their uses. For instance, the other day I needed to read some assembly and decompiled C; you know how fun that can be. The LLM proved quite good at translating it to English, and it really sped up the process.

Writing it back wasn’t that good though, just good enough to point in a direction but I still ended up writing the patcher mostly by myself.

Lemminary@lemmy.world · 8 months ago

the other day I needed to read some assembly and decompiled C

As one casually does lol Jokes aside, that’s pretty cool. I wish I had the technical know-how and, most importantly, the patience for it.

FizzyOrange@programming.dev · edit-2 8 months ago

Assembly is very simple (at least RISC-V assembly is which I mostly work with) but also very tedious to read. It doesn’t help that the people who choose the instruction mnemonics have extremely poor taste - e.g. lb, lh, lw, ld instead of load8, load16, load32, load64. Or j instead of jump. Who needs to save characters that much?
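The renaming suggested above is easy to try out on a disassembly listing. This is a minimal sketch: the verbose names are the ones proposed in this comment (they are not real RISC-V mnemonics), and the `expand_mnemonic` helper is hypothetical.

```python
# Mapping from real RISC-V mnemonics to the verbose names proposed above.
# The right-hand names are hypothetical and exist only for readability.
VERBOSE_NAMES = {
    "lb": "load8",
    "lh": "load16",
    "lw": "load32",
    "ld": "load64",
    "j": "jump",
}


def expand_mnemonic(line: str) -> str:
    """Replace the leading mnemonic of one assembly line with its verbose name."""
    parts = line.split(None, 1)
    if not parts:
        return line  # blank line, nothing to rewrite
    mnemonic = parts[0]
    operands = parts[1] if len(parts) > 1 else ""
    expanded = VERBOSE_NAMES.get(mnemonic, mnemonic)
    return f"{expanded} {operands}".rstrip()


for asm in ["lw a0, 8(sp)", "ld t0, 0(a1)", "j loop", "addi a0, a0, 1"]:
    print(expand_mnemonic(asm))
```

Note that this simple version drops leading indentation and leaves labels, comments and every unlisted instruction untouched.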

The over-abbreviation is some kind of weird flaw that hardware guys all have. I wondered if it comes from labelling pins on PCB silkscreens (MISO, CLK etc)… Or maybe they just have bad taste.

I once worked on a chip that had nested acronyms.

Lemminary@lemmy.world · 8 months ago

The over-abbreviation is some kind of weird flaw that hardware guys all have

My bet is on the teaching methods in uni. From what I’ve seen, older teaching methods use terrible variable names for a production environment. I think it unfortunately sticks because students get used to it and find it easier & faster than typing things out.

amorpheus@lemmy.world · 8 months ago

Who needs to save characters that much?

Do you realize how old assembly language is?

It predates hard disks by ten years and coincided with the invention of the transistor.

FizzyOrange@programming.dev · 8 months ago

Do you realize how old assembly language is?

Do you? These instructions were created in 2011.

It predates hard disks by ten years and coincided with the invention of the transistor.

I’m not sure what the very first assembly language has to do with RISC-V assembly?

mormund@feddit.org · 8 months ago

If you’re interested in getting into it, download Ghidra and open an older program/game in it that you like. The decompiler is pretty amazing imo, so you rarely have to look at the assembly. But it also cross-references them so you can look at the decompiled C Code and the associated assembly. It’s pretty fun 😊

technocrit@lemmy.dbzer0.com · 8 months ago

Ok, but there’s no “AI” involved in this process.

Outsider9042@lemmynsfw.com · 8 months ago

About all they are good for is generating boilerplate code. Just far less efficiently than a snippet library.

Tony Bark@pawb.social · 8 months ago

Yup. All that effort just to be good at basic code scaffolding.

Glitchvid@lemmy.world · 8 months ago

I keep getting told that AI is good at boilerplate code, and like, so is eclipse – if you know the kb shortcuts to autogenerate method stubs, classes, etc.

FizzyOrange@programming.dev · 8 months ago

AI is good at more than just generating stubs, filling in enum fields, etc. I wouldn’t say it’s good at stuff beyond just “boilerplate” - it’s good at stuff that is not difficult but also isn’t so regular that it’s possible to automate using traditional tools like IDEs.

Writing tests is a good example. It’s not great at writing tests, but it is definitely better than the average developer when you take the probability of them writing tests in the first place into account.

Another example would be writing good error context messages (e.g. .with_context() in Rust). Again, I could write better ones than it does. But like most developers there’s a pretty high chance that I won’t bother at all. You also can’t automate this with an IDE.

I’m not saying you have to use AI, but if you don’t you’re pointlessly slowing yourself down. That probably won’t matter to lots of people - I mean I still see people wasting time searching for symbols instead of just using a proper IDE with go-to-definition.

Glitchvid@lemmy.world · 8 months ago

Writing tests is a good example. It’s not great at writing tests, but it is definitely better than the average developer when you take the probability of them writing tests in the first place into account.

Outside of everything else discussed here, this is something I disagree with on a fundamental level: flawed tests are worse than no tests, IMO.
Not to get too deep into the very contentious space of testing in development, but when it comes to automated testing, I think we’re better off with more rigorous [1] testing instead of just chasing test coverage metrics.

[1] Validating tests through chaos/mutagen testing, or model verification (e.g. Kani).
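To illustrate what validating tests through mutation can look like, here is a toy sketch. It is not how mutagen or Kani work internally; it only shows the idea. The `is_adult` function and both test suites are made up: the code is deliberately mutated (`>=` becomes `>`), and a test suite is judged by whether it notices.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGreaterEqual(ast.NodeTransformer):
    """Mutate `>=` into `>` to simulate an off-by-one bug."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its is_adult function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    """Covers every line but never checks the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Same checks plus the boundary case."""
    return weak_suite(fn) and fn(18) is True


original = load(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(FlipGreaterEqual().visit(ast.parse(SOURCE))))

print("weak suite kills mutant:", not weak_suite(mutant))
print("strong suite kills mutant:", not strong_suite(mutant))
```

Both suites pass on the original code and would report full line coverage, but only the one with the boundary check fails on the mutant. That is the gap a coverage metric alone does not show.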

FizzyOrange@programming.dev · 8 months ago

flawed tests are worse than no tests

I never said you should use flawed tests. You ask AI to write some tests. You READ THEM and probably tweak them a little. You think “this test is basic but better than nothing and it took me 30 seconds”. You commit it.

ulterno@programming.dev · 8 months ago

I guess, it would be useful as a kind of fuzzer with a high price to performance ratio.
Making tests to try and find vulnerabilities.

For normal functionality testing though, better off making it yourself.

ulterno@programming.dev · 8 months ago

instead of just using a proper IDE with go-to-definition

I have seen people use an IDE that has the functionality, but search for the symbol manually instead.
And then not find it because it is in an out-of-project header, even though the IDE provides an F1 help (documentation) for that symbol.

technocrit@lemmy.dbzer0.com · edit-2 8 months ago

It seems pretty obvious that crappy automated testing would be better than no testing. That’s not much of a challenge. Nor is it “AI”.

FizzyOrange@programming.dev · 8 months ago

It absolutely is a challenge. Before AI there weren’t any other systems that could do crappy automated testing.

I dunno what you mean by “it’s not AI”. You write the tests using AI. It’s AI.

Tony Bark@pawb.social · 8 months ago

Of course. We’re just pointing out a potential niche use.

atzanteol@sh.itjust.works · 8 months ago

Just far less efficiently than a snippet library.

Your snippet library can convert a large JSON file to a Java class using Java property naming conventions and including annotations for Jackson where the names differ from the JSON?

Glitchvid@lemmy.world · 8 months ago

https://transform.tools/json-to-java
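For a sense of how small a deterministic generator for this can be, here is a minimal sketch. It is not the code behind the linked tool. The `json_to_java` and `to_camel_case` helpers are hypothetical, only flat JSON objects are handled, and nested objects and arrays fall back to `Object`. `@JsonProperty` is Jackson's annotation for mapping a field to a differently named JSON key.

```python
import json
import re

# Simplified mapping from JSON value types to Java field types.
JAVA_TYPES = {str: "String", bool: "boolean", int: "long", float: "double"}


def to_camel_case(key):
    """Turn snake_case, kebab-case or spaced keys into a Java-style field name."""
    parts = re.split(r"[_\-\s]+", key)
    head, tail = parts[0], parts[1:]
    return head[:1].lower() + head[1:] + "".join(p[:1].upper() + p[1:] for p in tail)


def json_to_java(class_name, document):
    """Emit a Java class for a flat JSON object, annotating renamed fields."""
    data = json.loads(document)
    lines = [
        "import com.fasterxml.jackson.annotation.JsonProperty;",
        "",
        f"public class {class_name} {{",
    ]
    for key, value in data.items():
        field = to_camel_case(key)
        java_type = JAVA_TYPES.get(type(value), "Object")
        if field != key:
            # Only annotate when the Java name differs from the JSON key.
            lines.append(f'    @JsonProperty("{key}")')
        lines.append(f"    private {java_type} {field};")
    lines.append("}")
    return "\n".join(lines)


java_source = json_to_java("User", '{"user_name": "a", "age": 3, "isActive": true}')
print(java_source)
```

Because the output is a pure function of the input, it cannot invent or garble field names, though keys that start with a separator or a digit would need extra handling.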

technocrit@lemmy.dbzer0.com · edit-2 8 months ago

Thanks for posting. People really out here believing that the most basic scripts are Johnny 5.

atzanteol@sh.itjust.works · 8 months ago

I mean - the AI models do a lot more than just that one task you know. Or are you a “real programmer” who doesn’t need AI tools because “real men” don’t need help?

Outsider9042@lemmynsfw.com · 8 months ago

The keyword is efficiently.

I’m not anti-AI. I’m anti burning down the planet to solve problems that were already solved years ago with a few scripts.

django@discuss.tchncs.de · edit-2 8 months ago

Also: correctly, without inventing or garbling data.

Using an LLM for this task will require intensive proofreading, as it might have changed the data somewhere.

atzanteol@sh.itjust.works · 8 months ago

So no, your snippet library can’t do that.

technocrit@lemmy.dbzer0.com · edit-2 8 months ago

What does “AI” have to do with this? smh.

Stubb@lemmy.sdf.org · 8 months ago

I’ve found that AI is only good at solving programming problems that are relatively “small picture” — or if it has to do with the basics of a language — anything else that it provides a solution for you will have to re-write completely once you consult with the language’s standards and best practices.

Rikudou_Sage@lemmings.world · 8 months ago

Well, I recently did kind of an experiment, writing a kids’ game in Kotlin without ever having used the language before. And it was surprisingly easy to do. I guess it helps that I’m fluent in ~5 other programming languages because I could tell what looked obviously wrong.

My conclusion kinda is that it’s a really great help if you know programming in general.

Shanmugha@lemmy.world · 8 months ago

Come on, guys, any second now. Aany second…

Modern_medicine_isnt@lemmy.world · 8 months ago

Fortunately, 90% of coding is not hard problems. We write the same crap over and over. How many different create-an-account and sign-in flows do we really need? Yet there seem to be an infinite number, each with its own bugs.

xthexder@l.sw0.com · edit-2 8 months ago

The hard problems are the only reason I like programming. If 90% of my job was repetitive boilerplate, I’d probably be looking elsewhere.

I really dislike how LLMs are flooding the internet with a seemingly infinite amount of half-broken TODO-app style programs with no care at all for improving things or doing something actually unique.

Modern_medicine_isnt@lemmy.world · 8 months ago

A lot of people don’t realize how many times the problem they are solving has already been solved. But after being in the industry for 3 decades, very few things people are working on haven’t been done before. They just get put together in different combinations.

As for AI, I have found it decent at writing one-time scripts to gather information I need to make design decisions. And it’s a little quicker when I need to look up the syntax for a language or, say, a resource name for Terraform. But even with one-off scripts, I sometimes have to ask it whether a while loop wouldn’t be better and such.

danzania@infosec.pub · 8 months ago

Funny how I never see articles on Lemmy about improvements in LLM capabilities.

Nullagon@ani.social · 8 months ago

i would guess a lot of the pro ai stuff is from corpos given the fact good press is money to them.

Rayquetzalcoatl@lemmy.world · 8 months ago

Probably because nobody really wants to read absolute nonsense.

funkless_eck@sh.itjust.works · 8 months ago

there aren’t that many, if you’re talking specifically LLMs, but ML+AI is more than LLMs.

Not a defence or indictment of either side, just people tend to confuse the terms “LLM” and “AI”

I think there could be worth in AI for identification (what insect is this, find the photo I took of the receipt for my train ticket last month, order these chemicals from lowest to highest pH…) - but LLMs are only part of that stack - the input and output - which isn’t going to make many massive breakthroughs week to week.

Glitchvid@lemmy.world · 8 months ago

The recent boom in neural net research will have real applicable results that are genuine progress: signal processing (e.g. noise removal), optical character recognition, transcription, and more.

However, the biggest hype area, with what I see as the smallest real return, is the huge-model LLM space, which basically tries to portray AGI as just around the corner. LLMs will have real applications in summarization, but otherwise they largely just generate asymptotically plausible babble: very good for filling the Internet with slop, not actually useful for replacing all the positions OAI, et al, need it to (for their funding to be justified).

finitebanjo@lemmy.world · edit-2 8 months ago

Because Lemmy is more representative of scientists and underprivileged while other media is more representative of celebrities and people who can afford other media, like hedge funds or tech monopolies.

AI Models from Google, OpenAI, Anthropic Solve 0% of ‘Hard’ Coding Problems | AIM