Last week, I got an email from Anthropic announcing that Claude 3.5 Sonnet was available. According to the AI company, "Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations."
The company added: "Claude 3.5 Sonnet is ideal for complex tasks like code generation." I decided to see if that was true.
Also: How to use ChatGPT to create an app
I'll subject the new Claude 3.5 Sonnet model to my standard set of coding tests: tests I've run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot's coding ability – and you can too, which includes all the standard tests I apply, explanations of how they work, and what to look for in the results.
OK, let's dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.
1. Writing a WordPress plugin
At first, this seemed to have a lot of promise. Let's start with the user interface Claude 3.5 Sonnet created based on my test prompt.
This is the first time an AI has decided to place the two data fields side-by-side. The layout is clean and looks great.
Claude also decided to do something else I've never seen an AI do. This plugin can be created using just PHP code, which is the code running on the back end of a WordPress server.
Also: How I test an AI chatbot's coding ability – and you can too
But some AI implementations have also added JavaScript code (which runs in the browser to manage dynamic user interface features) and CSS code (which controls how the browser displays information).
In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that's a feature of PHP), or you can put the code in three separate files: one for PHP, one for JavaScript, and one for CSS.
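To make that first approach concrete, here's a minimal sketch of a plugin that keeps everything in a single PHP file. The handle names and snippets are mine, not from my test plugin or Claude's output:

```php
<?php
/*
 * Plugin Name: Inline Assets Sketch
 * A minimal, hypothetical example: the CSS and JavaScript live inside
 * the one PHP file and are attached as inline assets.
 */
add_action( 'wp_enqueue_scripts', function () {
    // Register an empty style handle, then attach a small stylesheet to it.
    wp_register_style( 'inline-sketch', false );
    wp_enqueue_style( 'inline-sketch' );
    wp_add_inline_style( 'inline-sketch', '.sketch-field { display: inline-block; margin-right: 1em; }' );

    // Do the same for a small script.
    wp_register_script( 'inline-sketch', false, array(), false, true );
    wp_enqueue_script( 'inline-sketch' );
    wp_add_inline_script( 'inline-sketch', 'console.log("inline assets loaded");' );
} );
```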
Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.
But Claude simply provided one PHP file and then, when it ran, auto-generated the JavaScript and CSS files into the plugin's home directory. That's both fairly impressive and somewhat wrong-headed. It's cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on how the OS is configured, and there's a very high chance it could fail.
I allowed it in my testing environment, but I would never allow a plugin to rewrite its own code in a production environment. That's a very serious security flaw.
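For illustration only (this is not Claude's actual output, and the file names are mine), the runtime-write pattern looks roughly like this; the writability check is exactly where it tends to break down on a locked-down server:

```php
<?php
// Hypothetical illustration of the pattern described above: a plugin that
// writes its own JavaScript and CSS files into its directory at runtime.
add_action( 'init', function () {
    $dir = plugin_dir_path( __FILE__ );

    // Whether this succeeds depends entirely on filesystem permissions;
    // on many hardened servers the web user cannot write here at all.
    if ( ! is_writable( $dir ) ) {
        return;
    }
    if ( ! file_exists( $dir . 'randomizer.js' ) ) {
        file_put_contents( $dir . 'randomizer.js', 'console.log("generated at runtime");' );
    }
    if ( ! file_exists( $dir . 'randomizer.css' ) ) {
        file_put_contents( $dir . 'randomizer.css', '.randomizer-field { margin: 1em 0; }' );
    }
} );
```

The convenience is also the risk: any code path that can rewrite files inside the plugin's folder can be abused to inject code.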
Also: How to use ChatGPT to write code: What it can and can't do for you
Despite the fairly creative nature of Claude's code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That's sad because, as I said, it had so much promise.
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Interface: good, functionality: fail
- ChatGPT GPT-4o: Interface: good, functionality: good
- Microsoft Copilot: Interface: adequate, functionality: fail
- Meta AI: Interface: adequate, functionality: fail
- Meta Code Llama: Complete failure
- Google Gemini Advanced: Interface: good, functionality: fail
- ChatGPT 4: Interface: good, functionality: good
- ChatGPT 3.5: Interface: good, functionality: good
2. Rewriting a string function
This test is designed to evaluate how the AI does rewriting code to work more appropriately for the given need; in this case, dollars and cents conversions.
The Claude 3.5 Sonnet revision correctly removed leading zeros, making sure that entries like "000123" are treated as "123". It correctly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it's smart enough to return "0" for any weird or unexpected input, which prevents the code from abnormally ending in an error.
Also: Can AI detectors save us from ChatGPT? I tried 6 online tools to find out
One failure is that it won't allow decimal values alone to be entered. So if the user entered 50 cents as ".50" instead of "0.50", it would fail the entry. Based on how the original text description for the test is written, it should have allowed this input form.
Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users wouldn't be able to enter inputs that contained only values for cents.
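For reference, here's a minimal sketch of a validator that accepts both forms. This is not the test's actual function, and the name and exact rules are mine; it just illustrates the cents-only case Claude missed:

```php
<?php
// Hypothetical sketch: validate a dollars-and-cents entry. Accepts integers
// and decimals with up to two places, including a bare-decimal entry like
// ".50"; rejects negatives and malformed input by returning "0"; strips
// leading zeros so "000123" is treated as "123".
function clean_amount( string $raw ): string {
    $raw = trim( $raw );
    if ( ! preg_match( '/^(?:\d+(?:\.\d{1,2})?|\.\d{1,2})$/', $raw ) ) {
        return '0';
    }
    // Casting through float drops leading zeros and normalizes ".50" to "0.5".
    return (string) (float) $raw;
}
```

With that pattern, "0.50" and ".50" normalize to the same value, while entries like "-5" and "1.234" still fall back to "0".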
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Succeeded
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
3. Finding an annoying bug
The big challenge of this test is that the AI is tasked with finding a bug that's not obvious and that requires knowledge of the WordPress platform to solve correctly. It's also a bug I didn't immediately see on my own and, initially, asked ChatGPT to solve (which it did).
Also: The best free AI courses in 2024 (and whether AI certificates are worth it)
Claude not only got this right, catching the subtlety of the error and correcting it, but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Succeeded
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
- Meta AI: Succeeded
- Meta Code Llama: Failed
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
So far, we're at two out of three fails. Let's move on to our final test.
4. Writing a script
This test is designed to see how far the AI's programming knowledge extends into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it's just one of many such apps on the Mac.
Still, when testing ChatGPT, it knew how to "speak" Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.
Also: From AI trainers to ethicists: AI may obsolete some jobs but generate new ones
Unfortunately, Claude doesn't have that knowledge. It did write an AppleScript that tried to talk to Chrome (that's part of the test parameters), but it ignored the essential Keyboard Maestro component.
Worse, it generated AppleScript code that produces a runtime error. In an attempt to ignore case for the match in the test, Claude generated the line:
if theTab's title contains input ignoring case then
That's pretty much a double error: AppleScript's "contains" comparison is already case insensitive, and "ignoring case" doesn't belong where it was placed (it has to wrap the statement as its own ignoring case ... end ignoring block). The line caused the script to error out with an "Ignoring can't go after this" syntax error message.
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded but with reservations
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Failed
- Google Gemini Advanced: Succeeded
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Failed
Overall results
Here are the overall results of the four tests:
I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It's not that it can't program. It just can't program correctly.
Also: I used ChatGPT to write the same routine in 12 top programming languages. Here's how it did
I keep looking for an AI that can best the ChatGPT offerings, especially as platform and programming environment vendors begin to integrate these other models directly into the programming process. But, for now, I'm going back to ChatGPT when I need programming help, and that's my advice to you as well.
Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.