GPT-5 for Developers(openai.com)
470 points by 6thbit 23 days ago | 51 comments
aliljet 23 days ago
Between Opus and GPT-5, it's not clear there's a substantial difference in software development expertise. The metric that I can't seem to get past in my attempts to use the systems is context awareness over long-running tasks. Producing a very complex, context-exceeding objective is a daily (maybe hourly) occurrence for me. All I care about is how these systems manage context and stay on track over extended periods of time.

What eval is tracking that? It seems like it's potentially the most important metric for real-world software engineering and not one-shot vibe prayers.

abossy 23 days ago
At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval to solve 10 real GitHub issues so that we could measure this against Claude Code, and the differences were surprisingly large. You can see our write-up here:

https://charlielabs.ai/research/gpt-5

Often, our tasks take 30-45 minutes and can handle massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.

While 10 issues isn't crazy comprehensive, we found it to be directionally very impressive and we'll likely build upon it to better understand performance going forward.

bartman 23 days ago
I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on various screens I use and made it impossible for me to read your article.

For better accessibility and a safer experience[1] I would recommend not animating the background, or at least making it easy to toggle off.

[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/G...

neom 23 days ago
Removed - sorry, and thank you for the feedback.
pxc 23 days ago
Love your responsiveness here!

Edited to add: I am, in fact, photosensitive (due to a genetic retinal condition), and for my eyes, your site as it is now is very easy to read, and the visualizations look great.

bartman 23 days ago
Thank you!

Love that you included the judge prompts in your article.

neom 23 days ago
Please let me know what you would like to see more of. Evals are something we take seriously; I think this post was OK given our constraints, but I'd like to produce content people find useful, and I think we can do a lot better.
jeanlucas 23 days ago
Nice.
MPSFounder 23 days ago
I concur. Awful UI
RyanHamilton 23 days ago
Did you sign any kind of agreement with a non-disparagement clause to get early access? I'm asking because if you did, your data point isn't useful. It would mean anyone else who tried it and got worse results wouldn't be able to post here. We would just be seeing the successful data points.
htrp 23 days ago
Even if they didn't, overly critical or negative commentary will mean their removal from the list of trusted testers
neom 23 days ago
They didn't say anything to us, nothing was approved, just eng <> eng discussion about the model. Also nothing was cherry picked etc etc - I don't care what OAI thinks, I care about producing the best product and showing you our findings.
TechDebtDevin 23 days ago
Waiting 30-45 minutes for code that you're still going to have to read from top to bottom to make sure it doesn't have anything dumb in it does not seem like a productivity enhancement. I would quit if I were an engineer told to do this.
rantallion 23 days ago
If you're doing nothing in that 30-45 minutes other than stare at a loading screen, you're doing it wrong.

I'm not sold on the efficacy of AI and I share your reservations about having to scrutinise their output, but I see great value in being able to offload a long-running task to someone/something else and only have to check back later. In the meantime, I can be doing something else - like sitting in those planning meetings we all enjoy!

abossy 22 days ago
I love sitting in those planning meetings, too. /s

This is exactly right. We've adapted our workflow to kick off a task and then kick off the next one and the next. Then we review the work of each as they come through. It's just CPU pipelining for human workflow.

The process is far from perfect but the throughput is very high. The limiting factor is review. I spend most of my time doing line-by-line review of AI output and asking questions about things I'm unsure of. It's a very different job from the way I historically operated, which involved tight code -> verify loops of manually written code.

1659447091 23 days ago
> Producing a very complex, context-exceeding objective is a daily (maybe hourly) occurrence for me. All I care about is how these systems manage context and stay on track over extended periods of time.

For whatever reason GitHub's Copilot is treated like the redheaded stepchild of coding assistants, even though there are Anthropic, OpenAI, and Google models to choose from. And there is a "spaces"[0] website feature that may be close to what you are looking for.

I got better results testing some larger tasks with it than I did through the IDE version, but I haven't used it much; maybe others have more experience with it. Trying to gather all the context and then review the results was taking longer than doing it myself; having the context gathered already, or building it up over time, is probably where its value is.

[0] https://docs.github.com/en/copilot/concepts/spaces

RobinL 23 days ago
Totally agree. At the moment I find that frontier LLMs are able to solve most of the problems I throw at them given enough context. Most of my time is spent working out what context they're missing when they fail. So the thing that would help me most is a much more focused ability to gather context.

For my use cases, this mostly means needing to really home in on relevant code files, issues, discussions, and PRs. I'm hopeful that GPT-5 will be a step forward in this regard that isn't fully captured in the benchmark results. It's certainly promising that it can achieve similar results more cheaply than e.g. Opus.

logicchains 23 days ago
>Between Opus and GPT-5, it's not clear there's a substantial difference in software development expertise.

If there's no substantial difference in software development expertise then GPT-5 absolutely blows Opus out of the water due to being almost 10x cheaper.

spiderice 23 days ago
Does OpenAI provide a $200/month option that lets me use as much GPT-5 as I want inside of Codex?

Because if not, I'd still go with Opus + Claude Code. I'd rather be able to tell my employer, "this will cost you $200/month" than "this might cost you less than $200/month, but we really don't know because it's based on usage"

mh- 23 days ago
To be clear, Claude doesn't provide that either. You can get "usage limited" off of Opus on the $200/mo plan.
konarkm 23 days ago
The ChatGPT paid subscriptions now come with Codex CLI usage included
t1amat 23 days ago
Is this actually true? Last I checked (a week ago?), the Codex agents were free at some tiers in a preview capacity (with future rate limits based on tier), but Codex CLI was not. With Codex CLI you can log in, but the purpose of that is to link it to an API key where you pay per use. The sub tiers give one-time credits you would burn through quickly.
Deradon 23 days ago
Found this in the GPT-5 Announcement:

> Availability and access
>
> GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week. Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI by signing in with ChatGPT.

mrheosuper 23 days ago
Does Claude?
user3939382 23 days ago
I just asked codex to copy a file and it took almost a minute to think about it and cost $0.05. This is something Claude Code would have done in seconds.
user3939382 23 days ago
Sorry if this is repetitive, but you have to break the problem down just like any complex computing task. The difference is how: you have to break the problem into context windows that you anticipate being able to stitch together later. It's not the same way you would break down a source code authoring task in its absence, but the theory is the same.
swader999 23 days ago
If GPT-5 truly has 400k context, that might be all it needs to meaningfully surpass Opus.
andrewmutz 23 days ago
Having a large context window is very different from being able to effectively use a lot of context.

To get great results, it's still very important to manage context well. It doesn't matter if the model allows a very large context window, you can't just throw in the kitchen sink and expect good results

dimal 23 days ago
Even with large contexts there are diminishing returns. Just having the ability to stuff more tokens in context doesn't mean the model can effectively use it. As far as I can tell, they always reach a point at which more information makes things worse.
Byamarro 23 days ago
The bigger question is its tendency toward context rot, not the size of its context :) LLMs are supposedly able to load 3 Bibles into their context, but they forget what they were about to do after loading 600 LoC of locales.
simonw 23 days ago
It's 272,000 input tokens and 128,000 output tokens.
dudeinhawaii 23 days ago
The website clearly lays them out as 400k input and 128k output [1]. I just updated my AI apps to support the new models. I routinely fill the entire context on large code calls. Input is not a "shared" context.

I found 100k was barely enough for a single project without spillover, so 4x allows for linking more adjacent codebases for large scale analysis.

[1] https://platform.openai.com/docs/models/gpt-5

6thbit 23 days ago
Oh, I had not grasped that the “context window” size advertised had to include both input and output.

But is it really 272k even if the output was, say, 10k? Because it does say “max output” in the docs, so I wonder.

simonw 23 days ago
This is the only model where the input limit and the context limit are different values. OpenAI docs team are working on updating that page.
zurfer 23 days ago
Woah that's really kind of hidden. But I think you can specify max output tokens. Need to test that!
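
For anyone wanting to run that test, a minimal sketch against the Responses API (assuming the max_output_tokens parameter as documented for the openai Python SDK; worth verifying against your SDK version):

  from openai import OpenAI

  client = OpenAI()

  # Cap the completion so the rest of the window stays available for input.
  response = client.responses.create(
      model="gpt-5",
      input="Summarize this design doc in ten bullet points: ...",
      max_output_tokens=10_000,
  )
  print(response.output_text)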
AS04 23 days ago
400k context with 100% on the fiction livebench would make GPT-5 the indisputably best model IMHO. Don't think it will achieve that though, sadly.
tekacs 23 days ago
Coupled with the humongous price difference...
joshmlewis 23 days ago
I've been testing it against Opus 4.1 the last few hours and it has done better and solved problems Claude kept failing at. I would say it's definitely better, at least so far.
nadis 23 days ago
It's pretty vague, but the OP had this callout:

>"GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals."

realusername 23 days ago
Personally, I think I'll wait for another 10x improvement for coding, because with the way it's currently going, they clearly need it.
fsloth 23 days ago
From my experience, when used through an IDE such as Cursor, the current-gen Claude model enables impressive speedruns over commodity tasks. My context is a CAD application I've been writing as a hobby. I used to work in that field for a decade, so I have a pretty good feel for how long I would expect tasks to take. I'm using mostly the same software stack as at my previous job, and I'm definitely getting stuff done much faster on holiday at home than at that previous work. Of course the codebase is also a lot smaller, intrinsic motivation, etc., but still.
realusername 23 days ago
I've done pretty much the same as you (Cursor/Claude) for our large Rails/React codebase at work and the experience has been horrific so far, I reverted back to vscode.
fsloth 23 days ago
Yeah! It's quite possible my scenario is in the "happy accident" valley.

I'm using it mostly for C#, WPF and OpenTK. The type system seems to help a lot.

The UI logic it recommends is mostly god awful. But at least for me when it's given a pattern it can apply, it does so pretty well.

42lux 23 days ago
How often do you have to build the simple scaffolding though?
fsloth 23 days ago
At a real job? Not that often! And it's miserable in large scale architecture.

However, at least for me, there is a lot of "small enough context" boilerplate that it can deal with.

Clearly this is not a tool in the sense of being predictable.

greymalik 23 days ago
> it's not clear there's a substantial difference in software development expertise

But GPT-5 is substantially cheaper[0].

[0] https://simonwillison.net/2025/Aug/7/gpt-5/#pricing-is-aggre...

andhuman 23 days ago
Is this because it's now a MoE? They now match price with Gemini 2.5 Pro, which is also a MoE.
ilaksh 23 days ago
The pricing for GPT-5 is dramatically better than Opus, since it's now comparable to Gemini 2.5 Pro.
altitudinous 23 days ago
Indeed, context awareness is the big difference here; GPT-5 is a vast improvement. It doesn't lose track (as easily).
cyanydeez 23 days ago
real context is a graph of objectives and results.

The power of these models has peaked, and they simply aren't going to manage the type of awareness being promised.

bdangubic 23 days ago
> context awareness over long-running tasks

don’t have long-running tasks, llms or not. break the problem down into small manageable chunks and then assemble it. neither humans nor llms are good at long-running tasks.

bastawhiz 23 days ago
> neither humans nor LLMs are good at long-running tasks.

That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.

novok 23 days ago
I think that is because you do implicit plan tracking, creation and modification of the plan in your head in light of new information and then follow that plan. I'm not sure these tools do that very well.

The long-running task, at its core, is composed of many smaller tasks, and you mostly focus on one task at a time per brain part. It's why you cannot read two streams of text simultaneously even if both are in your visual focus field.

raducu 23 days ago
> you do implicit plan tracking, creation and modification of the plan in your head in light of new information and then follow that plan. I'm not sure these tools do that very well.

I think the plan is not just words; if it were, you could learn to ride a bike by reading a book.

Because we communicate in language and because code output is also a language we think that the process is also language based, but I think it's not, especially when doing hard stuff.

I know for certain in my case it isn't -- while tackling a hard problem with a junior the other week, after 2 hours of pair programming I had to tell him to commit everything and just let me do some deep thinking/debugging, and I solved the problem myself. Sure, I explained my process to him in language as best I could, but it's clear it was not language, it was not linear, I did not think it step by step.

I wish I could explain it, but when figuring out a hard problem, for me it takes some time to take it all in, get used to the moving parts, play with them. I'm sure there are actual neurons/synapses formed then, actual new wires sprawling about in the brain, that's why it takes time. I think the solution is a hardware one, not a software one.

That's why we can sleep on it and get better the next day, and that's why we feel the problem. There are actually multiple parallel "threads" of thinking going at the same time in our heads, and we can FEEL the solution as almost there.

I think it simply is that hard problems can occur in a combination of code, state, models that simply cannot be solved incrementally and big jumps are necessary.

I'm not saying the problem cannot be solved incrementally, but it's possible that by going in small steps, you either reach the solution or a blocker that requires a big jump.

bdangubic 23 days ago
you making too much sense :)
bdangubic 23 days ago
I just finished my workday, 8hrs with Claude Code. No single task took more than 20 minutes total. Cleared context after each task and asked it to summarize for itself the previous task before I cleared context. If I ran this as a continuous 8hr task it would have died after 35-ish minutes. Just know the limitations (like with any other tool) and you’ll be good :)
0x457 23 days ago
I always find it wild that none of these tools use VCS: complete a logical unit of work, make a commit, drop the entire context related to that commit, then, while referencing said commit, continue on to the next stage; rinse and repeat.

Claude always misunderstands how the API exported by my service works, and after every compaction it forgets all over again and goes "oh, the API has changed since last time I used it, let me use different query parameters". My brother in Christ, nothing has changed, and you are the one who made this API.
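
Roughly the loop I have in mind, with a hypothetical Agent interface (an illustration of the idea, not how any current tool actually works):

  import subprocess

  class Agent:  # stand-in for a real coding agent
      def run(self, prompt: str) -> None:
          print("agent would work on:\n" + prompt)

  def commit_all(message: str) -> str:
      # Commit the completed unit of work and return its SHA.
      subprocess.run(["git", "add", "-A"], check=True)
      subprocess.run(["git", "commit", "-m", message], check=True)
      out = subprocess.run(["git", "rev-parse", "HEAD"],
                           check=True, capture_output=True, text=True)
      return out.stdout.strip()

  def run_stage(agent: Agent, task: str, prior_sha: str | None) -> str:
      # Fresh context per stage: reference the last commit instead of
      # dragging the whole conversation history along.
      prompt = f"Task: {task}\n"
      if prior_sha:
          prompt += f"Previous stage is commit {prior_sha}; `git show` it if needed.\n"
      agent.run(prompt)
      return commit_all(f"stage: {task}")

  sha = None
  for task in ["add endpoint", "add tests", "update docs"]:
      sha = run_stage(Agent(), task, sha)  # rinse and repeat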

bastawhiz 23 days ago
You can use cursor rules to tell cursor to update the project cursor rules with details about the API.
0x457 22 days ago
Yes, I can, and I do, I'm pointing out that compressing an entire conversation history into a single message is so lossy that I might as well start a new session.

Yes, I can also tell any agent to commit more often, but that's again not what I'm saying. I'm saying version control can be integrated way deeper into agent workflow.

bdangubic 23 days ago
that would mean someone actually tried to learn how to use it :)
jondwillis 23 days ago
Idk I had cursor/claude untangle and commit to two separate logical branches yesterday from a bunch of random working copy changes that I had made. You can prompt it to use git commands and it works well enough in my experience.
0x457 22 days ago
Yes, you can prompt it to; I find it confusing why that + worktrees (i.e. the agent working on its own working copy of the repo) isn't the standard.
bdangubic 23 days ago
I do exactly this - except I want control to define logical units of work
0x457 22 days ago
That's how Kiro works and how the agent that I'm working on works.

Kiro creates tasks from spec document, and you can revise tasks either by prompting or editing tasks.md file.

My thing works a bit differently, but essentially the same.

bahmboo 23 days ago
Roo Code does this
echelon 23 days ago
Humans can error correct.

LLMs multiply errors over time.

beoberha 23 days ago
A series of small manageable chunks becomes a long running task :)

If LLMs are going to act as agents, they need to maintain context across these chunks.

vaenaes 23 days ago
You're holding it wrong
chrismccord 23 days ago
I'm really bummed out by this release. I expected this to best sonnet, or at least match, given all the hype. But it has drastically under performed on agent based work for me so far, even underperforming gpt-4.1. It struggles with basic instruction following. Basic things like:

  - "don't nest modules'–nests 4 mods in 1 file
  - "don't write typespecs"–writes typespecs
  - "Always give the user design choices"– skips design choices.
gpt-4.1 way outperforms it with the same instructions. And Sonnet is a whole different league (remains my goto). gpt-5 Elixir code is syntactically correct, but weird in a lot of ways, junior-esque inefficient, and just odd: e.g. function arguments that aren't used yet are passed in from callers, duplicated if checks, duplicated queries in the same function. I imagine their chat and multimodal stuff strikes a nice balance with leaps in some areas, but for coding agents this is way behind any other SOTA model I've tried. Seems like this release was more about striking a capability balance b/w roflscale and costs than a gpt3-to-4 leap.
Jackson__ 23 days ago
Thankfully OAI will fix this by removing GPT-4.1 soon!
enraged_camel 23 days ago
Claude has always been noticeably better for Elixir for me. GPT very frequently outputs pure garbage, and as far as I can tell this release is not much different.
wsintra 23 days ago
Maybe it's become so intelligent it now wants to troll people as a way to create factions among the populace.
pamelafox 23 days ago
I am testing out gpt-5-mini for a RAG scenario, and I'm impressed so far.

I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.

Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...

I'll run formal evaluations next.
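
For the curious, the shape of that call (a sketch assuming the Responses API's reasoning-effort control; the grounding instruction here is illustrative, not the exact prompt I used):

  from openai import OpenAI

  client = OpenAI()

  retrieved_chunks = "..."  # top-k passages from the search index
  question = "What does a product manager do?"

  response = client.responses.create(
      model="gpt-5-mini",
      reasoning={"effort": "minimal"},
      instructions=("Answer ONLY from the provided sources. "
                    "If the sources lack the answer, say you don't know."),
      input=f"Sources:\n{retrieved_chunks}\n\nQuestion: {question}",
  )
  print(response.output_text)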

ralfd 23 days ago
Q: What does a product manager do?

GPT4: Collaborating with engineering, sales, marketing, finance, external partners, suppliers and customers to ensure …… etc

GPT5: I don't know.

Upon speaking these words, AI was enlightened.

ComputerGuru 23 days ago
That is genuinely nice to see. What are you using for the embeddings?
pamelafox 23 days ago
We use text-embedding-3-large, with both quantization and MRL reduction, plus oversampling on the search to compensate for the compression.
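
The MRL reduction corresponds to the dimensions parameter on the text-embedding-3 models; a sketch (the dimension below is illustrative, not necessarily what we ship, and the quantization/oversampling happen on the search-index side, not shown here):

  from openai import OpenAI

  client = OpenAI()

  # Matryoshka-style truncation: shorter vectors are cheaper to store
  # and search, at some cost in recall (hence the oversampling).
  emb = client.embeddings.create(
      model="text-embedding-3-large",
      input="example passage to index",
      dimensions=256,  # illustrative; the model's default is 3072
  )
  vector = emb.data[0].embedding
  assert len(vector) == 256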
siva7 23 days ago
This is huge news if we finally have a model that is able to say "I don't know".
jofzar 23 days ago
If a model doesn't "know" what a PM is, then I worry about any of its other outputs. That should be a dictionary lookup.
siva7 23 days ago
Why? It's honest, as it doesn't understand it without more context. A lookup could lead to wrong results.
dimal 23 days ago
Seriously. I have never seen this, even once. I had been wondering if it was impossible. If a model can really say “I don’t know” when it doesn’t know, that could change everything. How many pointless, dumb rabbit holes could be avoided?
jondwillis 23 days ago
My comment peers are really whooshing hard on this. Clearly they have worked with a different sort of PM than I ever have.

The correct answer is: “professional managerial class grift”

potatolicious 23 days ago
This feels like honestly the biggest gain/difference. I work on things that do a lot of tool calling, and the model hallucinating fake tools is a huge problem. Worse, sometimes the model will hallucinate a response directly without ever generating the tool call.

The new training rewards that suppress hallucinations and tool-skipping hopefully push us in the right direction.

0x457 23 days ago
I get the "good" result with phi-4 and gemma-3n in RAG scenario - i.e. it only used context provided to answer and couldn't answer questions if context lacked the answer without hallucination.
risho 23 days ago
over the last week or so I have put probably close to 70 hours into playing around with Cursor and Claude Code and a few other tools (it's become my new obsession). I've been blown away by how good and reliable it is now. That said, the reality is that in my experience the only models that actually work in any sort of reliable way are Claude models. I don't care what any benchmark says, because the only thing that actually matters is actual use. I'm really hoping that this new GPT model actually works for this use case, because competition is great and the price is also great.
rcarr 23 days ago
I think some of this might come down to stack as well. I watched a t3.gg video[1] recently about Convex[2] and how the nature of it leads to the AI getting it right first time more often. I've been playing around with it the last few days and I think I agree with him.

I think the dev workflow is going to fundamentally change, because to maximise productivity out of this you need multiple AIs working in parallel. So rather than jumping straight into coding, we're going to end up writing a bunch of tickets in a PM tool (Linear[3] looks like it's winning the race atm), then working out (or using the AI to work out) which ones can be run in parallel without causing merge conflicts, then pulling multiple tickets into your IDE/terminal, cycling through the tabs, and jumping in as needed.

Atm I'm still not really doing this but I know I need to make the switch and I'm thinking that Warp[4] might be best suited for this kind of workflow, with the occasional switch over to an IDE when you need to jump in and make some edits.

Oh also, to achieve this you need to use git worktrees[5,6,7]; a minimal example follows the links.

[1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k

[2]: https://www.convex.dev/

[3]: https://linear.app/

[4]: https://www.warp.dev/

[5]: https://docs.anthropic.com/en/docs/claude-code/common-workfl...

[6]: https://git-scm.com/docs/git-worktree

[7]: https://www.tomups.com/posts/git-worktrees/
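
A minimal sketch of the setup (branch names hypothetical):

  # one checkout per parallel ticket, so agents never stomp on each other
  git worktree add ../myapp-TICK-101 -b TICK-101
  git worktree add ../myapp-TICK-102 -b TICK-102

  git worktree list                        # see all active checkouts
  git worktree remove ../myapp-TICK-101    # clean up after merging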

rcarr 23 days ago
Seems like VSCode just added a lot of stuff for this in the latest update today, such as worktree support[1] and an agent session mode[2].

[1]: https://code.visualstudio.com/updates/v1_103#_git-worktree-s...

[2]: https://code.visualstudio.com/updates/v1_103#_chat-sessions-...

isoprophlex 23 days ago
Sure sounds interesting but... Where on earth do you actually find the time to sit through a 1.5 hour yt video?!
mceachen 23 days ago
On a desktop browser, tap YouTube's "show transcript" and "hide timecodes", then copy-paste the whole transcript into Claude or ChatGPT and tell it to summarize at whatever resolution you want: a couple of sentences, 400 lines, whatever. You can also tell it to focus on certain subject material.

This is a complete game changer for staying on top of what's being covered in local government meetings. Our local bureaucrats are astoundingly competent at talking about absolutely nothing for 95% of the time, but hidden in there is three minutes of "oh btw, we're planning on paving over the local open space preserve to provide parking for the local business".

theshrike79 23 days ago
Copy the url, tap cmd-t

Write '!sum ' hit cmd-v and enter

Then the Kagi summariser will do that :)

rcarr 23 days ago
Jump in and start coding entire backend with stack not best suited for job and modern AI tools: most likely future hours lost.

Spend 1.5 hours now to learn from an experienced dev on a stack that is better suited for job: most likely future hours gained.

burnished 23 days ago
1.5x and 2x speed help a lot, slow down or repeat segments as needed, don't be afraid to fast forward past irrelevant looking bits (just be eager to backtrack).
mafro 23 days ago
Ask an LLM to transcribe and give the overview and key points
davidw 23 days ago
If it can produce something you can read in 20 minutes, it means there was a lot of... 'fluff' isn't quite the right word, but material that could be removed without losing meaning.
v5v3 23 days ago
People find time for things that seem important to them.
theshrike79 23 days ago
But with an hour-long video, how do you know if the content is any good?

With text I can skim around the headings and images and see at a glance how deep the author is going into the subject.

In that specific video the first 30 minutes is related to everything but the new Web Scale[0] LLM native database the author is "moving to" from SQL.

Meanwhile PostgreSQL is just chugging along and outperforming all of them.

[0] https://www.youtube.com/watch?v=b2F-DItXtZs

rcarr 20 days ago
Adding yet another comment: you can also call agents from Linear directly, which will create pull requests in GitHub, but they seem pretty expensive for what they are. They don't seem to offer any real benefit over setting up the MCP server, opening a terminal window, and typing "create a PR for $TICKET_NUMBER in Linear", other than shaving off a few seconds.
neuronexmachina 23 days ago
> That said the reality is in my experience the only models that actually work in any sort of reliable way are claude models.

Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.

throwaway_2898 23 days ago
How much of the product were you able to build to say it was good/reliable? IME, 70 hours can get you to a PoC that "works"; building beyond the initial set of features — like, say, a first draft of all the APIs — does it do well once you start layering features?
petralithic 23 days ago
This has been my experience. The greenfield approach works up to a point, then it just breaks.
Maxion 23 days ago
It depends on how you use it. The "vibe-coding" approach, where you give the agent naive prompts like "make new endpoint", often doesn't work and fails.

When you break the problem of "create new endpoint" down into its sub-components (which you can do with the agent) and then work on one part at a time, with a new session for each part, you generally have more success.

The more boilerplate-y the part is, the better it is. I have not really found one model that can yet reliably one-shot things in real-life projects, but they do get quite close.

For many tasks the models are slower than I am, but IMO at this point they are helpful and definitely should be part of the toolset.

disgruntledphd2 22 days ago
> The more boilerplate-y the part is, the better it is. I have not really found one model that can yet reliably one-shot things in real-life projects, but they do get quite close.

This definitely feels right from my experience. Small tasks that are present in the training data = good output with little effort.

Infra tasks (something that isn't in the training data as often) = sad times and lots of spelunking (to be fair Gemini has done a good job for me eventually, even though it told me to nuke my database (which sadly, was a good solution)).

ralfd 23 days ago
Just replying to ask you next week what your assessment on GPT5 is.
risho 22 days ago
I've been trying it out with openai codex over the last day and a half and I have been incredibly impressed. It has been working quite well. I also had it look over some code that claude produced for me and it said that it would be better to approach it another way and it completely rewrote it in a way that actually was significantly better. The UX for codex is quite a bit worse than Claude Code, but the model has been good enough to justify the switch for now. I'm hopeful that cursor cli will eventually have a good enough ux such that I can switch to it and have access to all of the models rather than needing to use disparate tools for everything. I would strongly suggest you check out gpt 5 for agentic stuff if you are interested.
Centigonal 23 days ago
Ditto here, except I'm using Roo and it's Claude and Gemini pro 2.5 that work for me.
zarzavat 23 days ago
The magic is the prompting/tool use/finetuning.

I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude code is a much more useful product, even if the model itself is weaker.

croemer 23 days ago
> GPT‑5 also excels at long-running agentic tasks—achieving SOTA results on τ2-bench telecom (96.7%), a tool-calling benchmark released just 2 months ago.

Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.

tedsanders 23 days ago
I wrote that section and made the graphs, so you can blame me. We no doubt highlight the evals that make us look good, but in this particular case I think the emphasis on telecom isn't unprincipled cherry picking.

Telecom was made after retail & airline, and fixes some of their problems. In retail and airline, the model is graded against a ground truth reference solution. But in reality, there can be multiple solutions that solve the problem, and perfectly good answers can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why airline and retail scores haven't climbed with the latest generations of models and are stuck around 60% / 80%. Even a literal superintelligence would probably plateau here.

In telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.

Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that telecom is much better than airline/retail for measuring tool use.

Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if they trigger a quirk not present in the eval).

Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
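
To make the grading distinction concrete, a toy sketch (my own illustration, not tau2-bench's actual implementation):

  def grade_by_reference(agent_actions, reference_actions) -> bool:
      # Retail/airline-style: brittle, since a different-but-valid
      # sequence of tool calls scores zero.
      return agent_actions == reference_actions

  def grade_by_outcome(final_state, expected_state) -> bool:
      # Telecom-style: any path reaching the right end state passes.
      return all(final_state.get(k) == v for k, v in expected_state.items())

  expected = {"plan": "cancelled", "refund_issued": True}
  # A different tool sequence than the reference, but the right outcome:
  final = {"plan": "cancelled", "refund_issued": True, "notes": "via chat"}
  assert grade_by_outcome(final, expected)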

jama211 23 days ago
Thanks for your input!
jeffrwells 23 days ago
OpenAI hiring BCG alumni is all we need to know
jama211 21 days ago
No need to be like that mate.
Fogest 23 days ago
How does the cost compare though? From my understanding o3 is pretty expensive to run. Is GPT-5 less costly? If so if the performance is close to o3 but cheaper, then it may still be a good improvement.
low_tech_punk 23 days ago
I find it strange that GPT-5 is cheaper than GPT-4.1 on input tokens and only slightly more expensive on output tokens. Is it marketing, or does it actually reflect the underlying compute costs?
AS04 23 days ago
Very likely to be an actual reflection. That's probably their real achievement here and the key reason why they are actually publishing it as GPT-5. More or less the best or near to it on everything while being one model, substantially cheaper than the competition.
ComputerGuru 23 days ago
But it can’t do audio in/out or image out. Feels like an architectural step back.
conradkay 23 days ago
My understanding is that image output is pretty separate and if it doesn’t seem that way, they’re just abstracting several models into one name
bn-l 23 days ago
Maybe with the router mechanism (to mini or standard) they estimate the average cost will be a lot lower for chatgpt because the capable model won’t be answering dumb questions and then they pass that on to devs?
low_tech_punk 23 days ago
I think the router applies to chatgpt app. The developer APIs expose manual control to select the specific model and level of reasoning.
jstummbillig 23 days ago
I mean... they themselves included that information in the post. It's not exactly a gotcha.
jumploops 23 days ago
If the model is as good as the benchmarks say, the pricing is fantastic:

Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)

Output: $10 / 1M tokens

For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.
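
To make that concrete, a hypothetical task with 100k input tokens and 10k output tokens (ignoring caching):

  GPT-5:    100k x $1.25/1M + 10k x $10/1M = $0.125 + $0.10 ~ $0.23
  Opus 4.1: 100k x $15/1M   + 10k x $75/1M = $1.50  + $0.75 = $2.25

Roughly a 10x difference per call.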

The big question remains: how well does it handle tools? (i.e. compared to Claude Code)

Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.

joshmlewis 23 days ago
It does seem to be doing well compared to Opus 4.1 in my testing over the last few hours. I've been on the Claude Code 200 plan for a few months, and I've been really frustrated with its output as of late. GPT-5 seems to be a step forward so far.
wrcwill 23 days ago
how are you using it? codex-cli?
joshmlewis 23 days ago
Cursor
addaon 23 days ago
> Output: $10 / 1M tokens

It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other, and with user ability to at least influence (via prompt) if not choose which model is being used. I have to assume this pricing model is based on a predicted split between how often the underlying models get used; I wonder if that will hold up, if users will instead try to rouse the better model into action more than expected, or if the pricing is so padded that it doesn't matter.

mkozlows 23 days ago
That's how the browser-based ChatGPT works, but not the API.
simianwords 23 days ago
> that is explicitly made of (at least) two underlying models

what do you mean?

addaon 23 days ago
> a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).

From https://openai.com/index/gpt-5-system-card/

tedsanders 23 days ago
In the API, there’s no router. Developers just pick whether they use the reasoning model or non-thinking ChatGPT model.
leptons 23 days ago
Price is not the same as cost, and that price may get jacked up without much warning.

The price is what it is today because they are trying to become a dominant platform. It doesn't mean the price reflects what it actually costs to run.

I'd bet a lot of the $40 billion they got in March goes towards loss leaders.

redbell 23 days ago
The fact that they intentionally ignored competitors' models in benchmarks and were comparing GPT-5 only to their previous models reminds me of Apple. They never compare their latest iPhone with any phone from other brands, only with their previous iPhone(s).
iamsaitam 23 days ago
The artist way
mehmetoguzderin 23 days ago
Context-free grammar and regex support are exciting. I wonder whether there are differences from the Lark-like CFG of llguidance, which powers the JSON schema support of the OpenAI API, and if so, what they are [^1].

[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...
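
For readers who haven't seen the Lark-like syntax, a toy grammar in that general flavor (illustrative only; the exact dialect llguidance accepts is in the linked repo):

  start: "GET " path
  path: "/" SEGMENT ("/" SEGMENT)*
  SEGMENT: /[a-z0-9_]+/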

msp26 23 days ago
Yeah that was the only exciting part of the announcement for me haha. Can't wait to play around with it.

I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front.

chrisweekly 23 days ago
> "I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front."

This run-on sentence swerved at the end; I really can't tell what your point is. Could you reword it for clarity?

petercooper 23 days ago
I read it as "... from other companies, like Google, and OpenAI have been doing a great job on this front"
mehmetoguzderin 23 days ago
I'm not sure if it's due to experience with the aforementioned APIs, but I also read the same, “issues with APIs like ..., and (in contrast) OpenAI have been doing a great job”
joshmlewis 23 days ago
It's free in Cursor for the next few days; you should go try it out if you haven't. I've been an agentic coding power user since the day it came out, across several IDEs/CLI tools, and Cursor + GPT-5 seems to be a great combo.
raducu 23 days ago
Sample size of 1, but GPT-5 seems horrendous at coding?

My go-to benchmark is a 3D snake game that Claude does almost flawlessly (or at least in 3-4 iterations)

The prompt:

write a 3d snake game in js and html. you can use any libraries you want. the game still happens inside a single plane, left arrow turns the snake left, right arrow turns it right. the plane is black and there's a green grid. there are multiple rewards of random colors at a given time. each time a reward is eaten, it becomes the snake's new head. The camera follows the snake's head, it is above an a bit behind it, looking forward. When the snake moves right or left, the camera follows gradually left or right, no snap movements. write everything in a single html file.

EDIT: I'm not trying to shit on GPT-5; so many people here seem to be getting very good results. Am I doing something wrong with my prompt?

M4v3R 23 days ago
This is what I got from your prompt in one shot with GPT-5 Thinking:

Game: https://chatgpt.com/canvas/shared/6895f722f2708191ac4a6d1645...

Conversation: https://chatgpt.com/share/6895f74a-0c5c-8004-b349-69da096531...

The controls are inverted for some reason and it could be a bit faster, but I fixed both of these easily with one prompt and here's the corrected version: https://chatgpt.com/canvas/shared/6895f82759f88191ba41c9fcd5...

raducu 20 days ago
Thanks, the issue was indeed not explicitly using the thinking model, or they changed something over the weekend -- it's at least on par with Claude now.

EDIT: clearly better than Claude or any other model I tried before. I had a bonus benchmark -- add a narrow triangle on the head of the snake that indicates the direction of movement. After a single iteration GPT-5 fixed it, whereas Claude could never get the rotation of the triangle right, nor could o3 the last time I tried.

cncjchsue7 23 days ago
[dead]
Frieren 23 days ago
> My go to benchmark is a 3d snake game Claude does almost flawlessly (or at least in 3-4 iterations)

If you need to know how the snake game should look to get the code, then Claude is not doing the work; you are.

andrewmcwatters 23 days ago
I wonder how good it is compared to Claude Sonnet 4, and when it's coming to GitHub Copilot.

I almost exclusively wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday with GPT 4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.

fleebee 23 days ago
There is an option in GitHub Copilot settings to enable GPT-5 already.
low_tech_punk 23 days ago
The ability to specify a context-free grammar as an output constraint? This blows my mind. How do you control the autoregressive sampling to guarantee correct syntax?
evnc 23 days ago
I assume they're doing "Structured Generation" or "Guided generation", which has been possible for a while if you control the LLM itself e.g. running an OSS model, e.g. [0][1]. It's cool to see a major API provider offer it, though.

The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar and you've just generated a `{`, you can offer the model a choice of only the tokens that are valid JSON immediately after a `{`, etc.

[0] https://github.com/dottxt-ai/outlines

[1] https://github.com/guidance-ai/guidance
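
A toy sketch of that masking step (illustrative; real implementations like the libraries above compile the grammar into efficient per-step token masks):

  import math

  def constrained_step(logits: dict[str, float],
                       allowed: set[str]) -> dict[str, float]:
      # Drop every token the grammar forbids, then renormalize the
      # remaining probability mass.
      masked = {t: math.exp(s) for t, s in logits.items() if t in allowed}
      total = sum(masked.values())
      return {t: p / total for t, p in masked.items()}

  # After emitting "{", a JSON grammar might only allow '"' or '}':
  logits = {'"': 2.0, "}": 1.0, "hello": 3.5, "[": 0.5}
  print(constrained_step(logits, allowed={'"', "}"}))
  # all probability mass lands on '"' and '}'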

qsort 23 days ago
You sample only from tokens that could possibly result in a valid production for the grammar. It's an inference-only thing.
low_tech_punk 23 days ago
ah, thanks!
hrpnk 23 days ago
The GitHub issue shown in the livestream is getting lots of traction: https://github.com/openai/openai-python/issues/2472

It was solved (or at least attempted) by a human before, yet not merged... With all the great coding models OpenAI has access to, their SDK team still feels too small for the needs.

Iwan-Zotow 23 days ago
They hope next model will do SDK right
te_chris 23 days ago
https://platform.openai.com/docs/guides/latest-model

Looks like they're trying to lock us into using the Responses API for all the good stuff.

zaronymous1 23 days ago
Can anyone explain to me why they've removed parameter controls for temperature and top-p in reasoning models, including gpt-5? It strikes me that it makes it harder to build with these for small tasks requiring high levels of consistency, and in the API I really value the ability to set certain tasks to a low temperature.
catigula 23 days ago
I thought we were going to have AGI by now.
RS-232 23 days ago
No shot. LLMs are simple text predictors and they are too stupid to get us to real AGI.

To achieve AGI, we will need to be capable of high-fidelity whole-brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.

brookst 23 days ago
Are you saying that only (human?) biological brains can be GI, and that whatever intelligence is, it would emerge from a pure physics-based simulation?

Both of those seem questionable, multiplying them together seems highly unlikely.

jplusequalt 23 days ago
Are you arguing that intelligence is not physical? Could you name a single thing in existence that fundamentally cannot be linked to physics?
BoiledCabbage 23 days ago
I think the argument is simpler than that. I have a PC; if I want to emulate an old Nintendo system well enough to play, I don't have to emulate it from the physics upwards.

Even though every NES in existence is a physical system, you don't need physics-level simulation to create a playable NES system via emulation.

jplusequalt 23 days ago
You have no proof that emulating a brain is on the same level of complexity as emulating a gaming console. Tech bro reductionism is a plague.
aaronbaugher 23 days ago
I'll be borrowing "tech bro reductionism." That's the perfect term for something that's a scourge these days.
brookst 22 days ago
I’m saying that:

1) Intelligence and consciousness seem to be linked, and we don't understand consciousness. It may be physical but not purely chemistry and classical physics.

2) Even if it is purely chemistry and classical physics, the development process may matter. Whole-brain simulation may get you a simulated lump of flesh with no electrical activity.

jplusequalt 21 days ago
There are a hundred trillion synaptic connections. Is it even possible to model something of that scale?
catigula 22 days ago
Yes, consciousness is not explained by known physics.

This is a trivial question to answer.

jplusequalt 21 days ago
Not currently explained != not based in physics. Modern physics does not have an explanation for how gravity works at the quantum scale, yet anyone who'd seriously argue that quantum gravity is not physical would be laughed out of the building.
nawgz 23 days ago
I don't really see any relationship between being able to model/simulate the brain and being able to exceed the brain in intelligence, can you explain more about that? Simulations sound like more of a computational and analytic problem with regards to having an accurate model.

Maybe your point is that until we understand our own intelligence, which would be reflected in such a simulation, it would be difficult to improve upon it.

93po 23 days ago
in what way are human brains also not just predictors? our neural pathways are built and reinforced as we have repeated exposure to inputs through any of our senses. our brains are expert pattern-followers, to the point that it happens even when we strongly don't want it to (in the case of PTSD, for example, or people who struggle with impulse control and executive functioning).

whats the next sentence i'm going to type? is it not just based on the millions of sentences ive typed and read before? even the premise of me playing devils advocate here, that's a pattern i've learned over my entire life too.

your argument also falls apart a bit when we see emergent behavior, which has definitely happened

pinoy420 23 days ago
[dead]
evantbyrne 23 days ago
It will be interesting to see if humans can manage to bioengineer human-level general intelligence into another species before computers.
pinoy420 23 days ago
[dead]
machiaweliczny 23 days ago
[flagged]
bopbopbop7 23 days ago
“some twist” is doing a lot of heavy lifting in that statement.
AppleBananaPie 23 days ago
CS will define, design and implement human level intelligence before neuroscience has done even the first.

That's what I hear when people say stuff like this anyway.

Similar to CS folks throwing around physics 'theories'

JamesBarney 23 days ago
When we're being hunted down by nano-bots some of the last few survivors will still be surprised that a simple text predictor could do so much.
t0lo 23 days ago
How do you suggest I survive being hunted down by nanobots? It's part of my 10 year plan and I'd appreciate any tips.
_def 23 days ago
Microscopic markdown tattoos for prompt injection
IAmGraydon 23 days ago
Not going to happen any time soon, if ever. LLMs are extremely useful, but the intelligence part is an illusion that nearly everyone appears to have fallen for.
jonplackett 23 days ago
This POV is just the opposite extremity - and it's equally nuts. If you haven't seen any intelligence at all in an LLM, you just aren't looking.
nadis 23 days ago
"When producing frontend code for web apps, GPT‑5 is more aesthetically-minded, ambitious, and accurate. In side-by-side comparisons with o3, GPT‑5 was preferred by our testers 70% of the time."

That's really interesting to me. Looking forward to trying GPT-5!

attentive 23 days ago
> scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot

why isn't it on https://aider.chat/docs/leaderboards/?

"last updated August 07, 2025"

tedsanders 23 days ago
The 88% is our self-reported score on our internal implementation of Aider polyglot.

The leaderboard score would come from Aider independently running GPT-5 themselves. The score should be about the same.

(I work at OpenAI.)

low_tech_punk 23 days ago
Tried using the gpt-5 family with the Responses API and got the error "gpt-5 does not exist or you don't have access to it". I guess they are not rolling out in lockstep with the livestream and blog article?
low_tech_punk 23 days ago
Can confirm that they are rolling out. It's working for me.
diggan 23 days ago
Seems they're doing rollout over time, I'm not seeing it anywhere yet.
magnusga 22 days ago
Is anyone else experiencing extreme lag on responses from GPT-5 inside e.g. VS Code?

I use Claude Code as my daily driver, and it takes on avg. 4-5 seconds to respond, assuming no hard thinking or research. For the same prompt, VS Code with GPT-5 takes 4-5 minutes.

It is so slow it is rendered useless for us.

I am using a new MacBook Pro (M4 Pro).

Just wanted to know if I am the only one seeing this.

macawfish 22 days ago
Same for me. It seems to be doing alright at what it's doing, but definitely really slow.

Considering trying out Kimi K2 which is kind of a funny surprise... I didn't think the day after GPT-5 came out I'd be contemplating switching to Kimi K2.

6thbit 23 days ago
Can anyone share their experience with codex CLI? I feel like that’s not mentioned enough and gpt5 is already the default model there.
ed 23 days ago
I decided to check in on Codex after being a longtime Claude Code user. The experience was not great. GPT-5 is pretty solid, however!

- The permission system is broken (this is such an obvious one that I wonder if it's specific to GPT5 or my environment). If you tell Codex to ask permission before running commands, it can't ever write to files. It also runs some commands (e.g. `sed`) without asking. Once you skip sandbox mode, it's difficult to go back.

- You can't paste or attach images (helpful for design iteration)

- No built-in login flow so you have to mess with your shell config and export your OpenAI key to all terminal processes.

- Terminal width isn't respected. Model responses always wrap at some hard-coded value. Resizing the window doesn't correctly redraw the screen.

- Some keyboard shortcuts aren't supported, like option+delete to delete words (which I use often, apparently...)

This is on MacOS, iTerm2, Fish shell. I guess everyone uses Cursor or Windsurf?

wahnfrieden 23 days ago
They said images are coming soon to codex

There is a built-in login flow

ed 22 days ago
> There is a built-in login flow

Ah this doc needs to be updated: https://help.openai.com/en/articles/11096431-openai-codex-cl...

macawfish 23 days ago
Not good, sadly. Claude Code seems so much better in terms of overall polish, but also in how it handles context. I don't really want to throw the LLM into the deep end without proper tools and context, and I get the sense that this is what was happening in Codex.
ryukoposting 23 days ago
> Custom tools support constraining by developer-supplied context-free grammars.

This sounds like a really cool feature. I'm imagining giving it a grammar that can only output safe, well-constrained SQL queries. Would I actually point an LLM directly at my database in production? Hell no! It's nice to see OpenAI trying to solve that problem anyway.
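
For instance, a toy Lark-style grammar admitting only simple read-only queries might look like this (illustrative; whitespace handling and the exact dialect the API accepts are glossed over):

  start: "SELECT " columns " FROM " TABLE where?
  columns: COLUMN (", " COLUMN)*
  where: " WHERE " COLUMN " = " STRING
  TABLE: "users" | "orders"
  COLUMN: "id" | "email" | "total"
  STRING: /'[^']*'/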

joshmlewis 23 days ago
It does really well at using tool calls to gain as much context as it can to provide thoughtful answers. In this example it did 6 (!) tool calls in the first response, while 4.1 did 3 and o3 did one at a time.

https://promptslice.com/share/b-2ap_rfjeJgIQsG

wewewedxfgdf 23 days ago
Tried it on a tough problem.

GPT-5 solved the problem - which Gemini failed to solve - then failed 6 times in a row to write the code to fix it.

I then gave ChatGPT-5's problem analysis to Google Gemini and it immediately implemented the correct fix.

The lesson - ChatGPT is good at analysis and code reviews, not so good at coding.

Lionga 23 days ago
The real lesson is that these are just random results and all models fail at all kinds of things all the time and other times get things right in all kind of questions.

The problem is the models have zero idea whether they are right or wrong and always believe they are right. That makes them useful for anything where either you do not care if the answer is actually right, or where it is hard to come up with the right answer but very easy to verify whether it is right - and kind of useless for everything else.

wahnfrieden 23 days ago
No. ChatGPT and Codex are bad at applying patches.
cperkins 23 days ago
I have something that both Gemini (via GCA) and Copilot (Claude) analyzed, and they came up with the same diagnosis. Each of them made the exact same wrong fix, and when I pointed that out, got further wrong.

I haven't tried Chat GPT on it yet, hoping to do so soon.

cperkins 21 days ago
I used Cursor and Chat GPT 5 last night for the first time. Before I could even ask Chat GPT 5 about my issue it had scanned the .cpp file in question (because it was open in the editor) and had discovered some possible issues, one of which was the issue in the code. I confirmed that and gave it more description of the error behavior. It identified the problem in the code, and suggested two different CORRECT solutions (one simple, one more complex but "perfect"). I opted for the simple one. It did it. One tiny problem remained, I pointed it out, it fixed it.

This was much better than Gemini or CoPilot on the exact same issue and the exact same commit pointer in my repo. Both of them suggested the same wrong solution and got themselves further and further wrong as they went.

So, I guess as of today, Chat GPT 5 leads. YMMV

attentive 23 days ago
"Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest."

hmm, they should call it gpt-5-chat-nonreasoning or something.

weird-eye-issue 23 days ago
Setting "reasoning_effort" to "minimal" translates to zero reasoning tokens from what I've seen. So you can get non-reasoning from both "gpt-5" and "gpt-5-chat-latest"
attentive 23 days ago
fwiw, I asked chatgpt:

  "gpt-5-chat-latest is described by OpenAI as a non-reasoning GPT-5 variant—meaning it doesn’t engage in the extended “thinking token” process at all.

  gpt-5 with reasoning_effort="minimal" still uses some internal reasoning tokens—just very few—so it’s not truly zero-reasoning.

  The difference: "minimal" is lightweight reasoning, while non-reasoning is essentially no structured chain-of-thought beyond the basic generation loop."
weird-eye-issue 23 days ago
If it did any reasoning then it would be billed as part of the reasoning tokens
wahnfrieden 23 days ago
Never ask it about itself
6thbit 23 days ago
Seems they have quietly increased the context window up to 400,000

https://platform.openai.com/docs/models/gpt-5

ralfd 23 days ago
How does that compare to Claude/GPT4?
6thbit 23 days ago
4o - 128k
o3 - 200k
Opus 4.1 - 200k
Sonnet 4 - 200k

So, at least twice the context of any of those.

hrpnk 23 days ago
GPT-4.1 has 1M input and 32k output; Sonnet 4 is 200k/64k.
Iwan-Zotow 23 days ago
Input plus output?
simianwords 23 days ago
but is it for the model in chatgpt.com as well?
energy123 23 days ago
I've gotten 100% cache misses so far. Has anyone got a cache hit?
austinmw 23 days ago
Okay so say GPT-5 is better than Claude Opus 4.1. Then is GPT-5+Cursor better than Opus 4.1 + Claude Code? And if not, what's the best way to utilize GPT-5?
kristo 23 days ago
Apparently there is a cursor cli now… but I love the flat pricing of Claude’s Max plan and dislike having to worry about pricing and when to use “Max” mode in cursor.
felipemesquita 23 days ago
I’m not sure yet if it’s better than Claude, but the best way to use GPT-5 is https://github.com/charmbracelet/crush
worik 23 days ago
Diminishing returns?
jaflo 23 days ago
I just wish their realtime audio pricing would go down but it looks like GPT-5 does not have support for that so we’re stuck with the old models.
t1amat 23 days ago
The problem with OpenAI models is the lack of a Max-like subscription for a good agentic harness. Maybe OpenAI or Microsoft could fix this.

I just went through the agony of provisioning my team with new Claude Code 5x subs 2 weeks ago after reviewing all of the options available at that time. Since then, the major changes include a Cerebras sub for Qwen3 Coder 480B, and now GPT-5. I’m still not sure I made the right choice, but hey, I’m not married to it either.

If you plan on using this much at all then the primary thing to avoid is API-based pay per use. It’s prohibitively costly to use regularly. And even for less important changes it never feels appropriate to use a lower quality model when the product counts.

Claude Code won primarily because of the sub, and because they have a top tier agentic harness and models that know how to use it. Opus and Sonnet are fantastic agents and very good at our use case, and were our preferred API-based models anyways. We can use Claude Code basically all day with at least Sonnet after using up our Opus limits. Worth noting that Cline built a Claude Code provider that the derivatives aped, which is great, but I’ve found Claude Code to be as good or better anyways. The CLI interface is actually a bonus for ease of sharing state via copy/paste.

I’ll probably change over to Gemini Code Assist next, as it’s half the price with more context length, but I’m waiting for a better Gemini 2.5 Pro and for the gemini-cli/Code Assist extensions to get first-party planning support. You can get some form of that third-party through custom extensions with the CLI, but as an agent harness they are incomplete without it.

The Cerebras + Qwen3 Coder 480B with qwen3-cli is seriously tempting. Crazy generation speed. There’s some question about how big the rate limit really is, but it’s half the cost of Claude Code 5x. I haven’t checked, but I know qwen3-cli, which was introduced alongside the model, is a fork of gemini-cli with Qwen-focused updates; I wonder if they landed a planning tool?

I don’t really consider Cursor, Windsurf, Cline, Roo, Kilo et al as they can’t provide a flat rate service with the kind of rate limits you can get with the aforementioned.

GitHub Copilot could be a great offering if they were willing to really compete with a good unlimited premium plan, but so far their best offering has fewer premium requests than I make in a week, possibly even in a few days.

Would love to hear if I missed anything, or somehow missed some dynamic here worth considering. But as far as I can tell, given heavy use, you only have 3 options today: Claude Max, Gemini Code Assist, Cerebras Code.

energy123 23 days ago
> If you plan on using this much at all then the primary thing to avoid is API-based pay per use.

I find there's a niche where API pay-per-use is cost effective. It's for problems that require (i) small context and (ii) not much reasoning.

Coding problems with 100k-200k context violate (i). Math problems violate (ii) because they generate long reasoning streams.

Coding problems with 10k-20k context are well suited, because they generate only ~5k output tokens. That's $0.03-$0.04 per prompt to GPT-5 under flex pricing. The convenience is worth it, unless you're relying on a particular agentic harness that you don't control (I am not).
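Back-of-envelope for that figure (the rates are assumptions: flex at roughly half of GPT-5's listed $1.25/M input and $10/M output; check the current pricing page):

  # Assumed flex rates: $0.625 per 1M input tokens, $5 per 1M output tokens.
  in_tok, out_tok = 15_000, 5_000
  cost = in_tok / 1e6 * 0.625 + out_tok / 1e6 * 5.00
  print(f"${cost:.3f} per prompt")  # ~$0.034, i.e. in the $0.03-$0.04 range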

For large context questions, I send them to a chat subscription, which gives me a budget of N prompts instead of N tokens. So naturally, all the 100k-400k token questions go there.

NullifyNAN 23 days ago
OpenAI has answered your prayers.

16 hours ago the readme for Codex CLI was updated. Now Codex CLI supports OpenAI login like Claude does, with no API credits needed.

From the readme:

After you run codex select Sign in with ChatGPT. You'll need a Plus, Pro, or Team ChatGPT account, and will get access to our latest models, including gpt-5, at no extra cost to your plan. (Enterprise is coming soon.)

Important: If you've used the Codex CLI before, you'll need to follow these steps to migrate from usage-based billing with your API key:

1. Update the CLI with codex update and ensure codex --version is greater than 0.13.
2. Ensure that there is no OPENAI_API_KEY environment variable set (check that env | grep 'OPENAI_API_KEY' returns empty).
3. Run codex login again.

IanCal 22 days ago
Oh that’s fantastic news, thanks!
henriquegodoy 23 days ago
I don't think there's much difference between Opus 4.1 and GPT-5, probably just the context size. Waiting for Gemini 3.0.
macawfish 23 days ago
Claude 5 is the one I'm most excited about.
backscratches 23 days ago
gpt5 much cheaper
Awesomedonut 22 days ago
Used it with Cursor and I've been completely blown away! It solved a pretty complex task in under an hour
sebdufbeau 23 days ago
Has the API rollout started? It's not available in our org, even if we've been verified for a few months

EDIT: It's out now

spullara 23 days ago
it is out now. i poll the api for the models and update this GitHub repo hourly.

https://github.com/spullara/models

vivzkestrel 23 days ago
would be nice if we had some model out there with a context window of 1 billion tokens. i have about 25 .UNR files made with the LEAD engine (a heavily modified Unreal Engine 2.x) within which i want the AI to search for a string. Also got another 100 .utx files. Use case: game modding.
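(For the plain string-search part, a local byte scan needs no context window at all. A minimal sketch; the search string and directory are hypothetical:)

  from pathlib import Path

  needle = b"SomeActorName"  # hypothetical search target
  for path in Path("mods").rglob("*"):  # hypothetical mod directory
      # .UNR/.utx are binary formats, so compare raw bytes rather than text.
      if path.is_file() and path.suffix.lower() in {".unr", ".utx"}:
          if needle in path.read_bytes():
              print(path)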
guybedo 23 days ago
mwigdahl 23 days ago
Has anyone tried connecting up GPT-5 to Claude Code using the model environment variables?
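i.e., something along these lines (untested sketch: Claude Code reads ANTHROPIC_BASE_URL and ANTHROPIC_MODEL, but pointing it at GPT-5 would also need an Anthropic-compatible translating proxy, such as a local LiteLLM server; the URL and model name here are assumptions):

  import os
  import subprocess

  # Point Claude Code at a proxy that translates Anthropic-style
  # requests to an OpenAI backend serving gpt-5.
  env = dict(
      os.environ,
      ANTHROPIC_BASE_URL="http://localhost:4000",  # assumed proxy address
      ANTHROPIC_MODEL="gpt-5",                     # assumed model name
  )
  subprocess.run(["claude"], env=env)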
fatty_patty89 23 days ago
What the fuck? Nobody else saw the cursor ceo looking through the gpt5 generated code, mindlessly scrolling saying "this looks roughly correct, i would love to merge that" LOL

You can't make this up

bn-l 23 days ago
That explains a lot.
siva7 23 days ago
amazing time to be alive, if only for this clown show
throwawaybob420 23 days ago
if you’re not using an LLM to vibe code garbage then are you really a software developer?
isoprophlex 23 days ago
This is the ideal software engineer. You may not like it, but this is what peak software engineering looks like.

/s

weird-eye-issue 23 days ago
gpt-5-chat-latest is giving much better results for our use case compared to gpt-5. Which puts me in a tricky position since gpt-5-chat-latest is not pinned and can change at any time...
matchagaucho 23 days ago
It also lacks tool calling :-(
sberens 23 days ago
Interesting there doesn't seem to be benchmarking on codeforces
sigbottle 23 days ago
I'm a codeforces guy, and I've benchmarked o3 on several of my favorite problems of various difficulty, and concluded that o3 still isn't suitable for true reasoning. Mostly because it's unable to think from first principles, so if you throw a non-standard problem at it, it will brick. I think this will be a fundamental issue with any LLM.

I will say I would far more appreciate an AI that, when it faces these ambiguous problems, either provides sources for further reading, or just admits it doesn't know and, you know, actually tries to work together to find a solution instead of being trained to one-shot everything.

When generalizing these skills to, say, debugging, I will often just straight up ignore the AI slop output it concluded with and instead explore the sources it found. o3 is surprisingly good at this. But for hard niche debugging, the conclusions it comes to are not only wrong, but it phrases them in an arrogant way, and when you push back it's actually like talking to a narcissist (phrasing objections as "you feel", being excessively stubborn, word-dumping a bunch of phrases that sound correct but don't hold up to scrutiny, etc.).

belter 23 days ago
We were promised AGI and all we got was code generators...
esafak 23 days ago
LLMs are saturating every benchmark. AGI may not be all that. I am already impressed. Perhaps you need robots to be awed.
bmau5 23 days ago
It's a logical starting point, given there are pretty defined success/failure criteria
ehutch79 23 days ago
The hype is real. We were told that we'd have AGI and be out of jobs 2 years ago, let alone today.
brookst 23 days ago
We were also told that AGI would never happen, that it was 6 months away, that it is 20 years away.

I’m not sure of the utility of being so outraged that some people made wrong predictions.

rowanG077 23 days ago
By whom? I don't think anyone seriously said in 2023 that we'd have AGI in two years. Even now, no one reputable is claiming AGI in two years.
ehutch79 23 days ago
Randoms on YouTube, randoms here on Hacker News.

No, I don’t take them seriously, that was my point, which apparently I didn’t make clear enough.

rowanG077 23 days ago
The phrasing of your comment clearly implies an authoritative person or organisation telling us we would have AGI by now.

There are billions of people. You have people who think the earth is flat. You can probably find any insane take if you look for it. Best not to let them tell you anything, since you seem to have taken it to heart.

belter 23 days ago
> Even now, no one reputable is claiming AGI in two years.

LOL... You really underestimate the intellectual dishonesty of these people... It's all about the greenback...

AGI by 2025...

Sam Altman - "AGI by 2025, potentially during Trump's term" - https://firstmovers.ai/agi-by-2025/

AGI by 2026...

John Schulman (OpenAI Cofounder) - https://youtu.be/Wo95ob_s_NI?t=1040

Dario Amodei (Anthropic CEO) - "AGI by 2026-2027 based on capability progression trends" - https://cointelegraph.com/news/human-level-ai-as-early-as-20...

AGI by 2027...

Daniel Kokotajlo (Former OpenAI Researcher) - "Ex-OpenAI researcher predicts AGI by 2027" - https://www.theneuron.ai/explainer-articles/an-ex-openai-res...

Leopold Aschenbrenner (Former OpenAI Researcher) - AGI by 2027, superintelligence by 2028-2029 - https://www.lawfaremedia.org/article/ai-timelines-and-nation...

planet_1649c 23 days ago
Can we use this model on a fixed plan, like Claude Code, where we can pay $100/month?

Doesn't look like it. Unless they add fixed pricing, Claude imo would still be better from a developer POV.

matltc 23 days ago
Was looking for this too.

That said, Anthropic nerfed those plans pretty hard a few days ago. I imagine more to come, much like GH Copilot is basically useless now: literally having it write docs for a single function in an unfamiliar codebase spent 1% of my monthly "premium requests" allowance. The agent is terrible compared to Claude Code in my experience, even when using the same model (Sonnet 4).

celeritascelery 23 days ago
They actually have a very similar setup with their plus and pro plans. They don’t claim unlimited usage, but say it should be very high. You don’t need to pay per token.

https://x.com/embirico/status/1953590991870697896

spiderice 23 days ago
I just said something similar in another comment on this thread. I'm not interested in the mental aspect of getting charged per query. I feel like when I use pay-per-token tools, it's always in the back of my mind. Even if it's a bit more expensive to pay a flat rate, it's so worth it for the peace of mind.
timhigins 23 days ago
I opened up the developer playground and the model selection dropdown showed GPT-5 and then it disappeared. Also I don't see it in ChatGPT Pro. What's up?
Fogest 23 days ago
It's probably being throttled due to high usage.
IAmGraydon 23 days ago
Not showing in my Pro account either. As someone else mentioned, I’m sure it’s throttling due to high use right now.
brookst 23 days ago
Shipping something at the moment of announcement is always hell.
jodosha 23 days ago
Still no CLI like Claude Code?
Game_Ender 23 days ago
You are looking for Codex CLI [0].

0 - https://github.com/openai/codex

jodosha 23 days ago
Thank you!
mediaman 23 days ago
It works on Codex CLI, install it with npm.

That's been out for a while and used their 'codex' model, but they updated it today to default to gpt-5 instead.

jodosha 23 days ago
Oh nice, thanks!
jngiam1 23 days ago
I was a little bummed that there wasn't more about better MCP support in ChatGPT, hopefully soon.
cheema33 23 days ago
MCP is overhyped and most MCP servers are useless. What specific MCP server do you find critical in your regular use? And what functionality is missing that you wish to see in ChatGPT?
markr1 23 days ago
I really hoped GPT-5 would level up, but right now it feels like a step back, not forward.
skroumpelou 23 days ago
I tried it out with warp terminal (warp.dev)! It will be my coding buddy today!
ivape 23 days ago
Musk after GPT5 launch: "OpenAI is going to eat Microsoft alive"

https://x.com/elonmusk/status/1953509998233104649

Anyone know why he said that?

darylteo 23 days ago
It's not a hard logic path to follow: if AI becomes a digital necessity for modern society to function, Microsoft's relevance shrinks while OpenAI's relevance grows.

Once OpenAI breaks out of the "App" space and into the "OS" and "Device" space, Microsoft may get absorbed into the ouroboros.

OpenAI's dependence on Microsoft currently is purely financial (investment) and contractual (exclusivity, azure hosting).

brookst 23 days ago
He was high AF?
slowmotiony 23 days ago
Probably because of a mix of ketamine, magic mushrooms, ecstasy and adderall.
thomasfromcdnjs 23 days ago
I understood it as saying that their economic relationship is somehow going to leave Microsoft broke, be it in dollars and/or just the focus of the company.
czk 23 days ago
eventually traditional operating systems will cease to exist, you'll just have a model creating dynamic UX for you on the fly for whatever experience you want
tough 23 days ago
agi clause comes to mind?
skepticATX 23 days ago
This was really a bad release for OpenAI, if benchmarks are even somewhat indicative of how the model will perform in practice.
mediaman 23 days ago
I actually don't agree. Tool use is the key to successful enterprise product integration and they have done some very good work here. This is much more important to commercialization than, for example, creative writing quality (which it reportedly is not good at).
robterrell 23 days ago
In what ways?