I feel like I must have plateaued and don't know what to do next to level up. I'm currently on the $100/month codex plan and it seems fine using 5.5-xhigh all the time. I think of what to do next, have a chat session to determine exactly what to ask for up to the point of being ready to implement, and then codex churns on a commit-sized task whereupon I briefly check it on my local dev server. If necessary I ask for a change. Then I ask it to commit and recommend the next step based off the spec. Oftentimes I have to "approve" an out-of-sandbox request anyway.
I haven't found anything that requires running all night. I could tell it to one-shot a big plan but given how often I realize I want an intermediary thing to be slightly different it seems like a waste of effort.
I'm guessing the next thing I should probably look into is some sort of machine vm I can tunnel my codex-gui requests to so I don't have to deal with the sandbox approvals (I don't want to give it "dangerous" access to my entire mac).
I don't understand what people are doing with their side projects that is leading them to churn through tokens so quickly, to the point of requiring two $200/month subscriptions and a bunch of token charges besides.
I'm on $100 Claude. I have a setup with bespoke local services that mitigates some high token consumption scenarios with local LAN services. I screen MCPs and hooks for cache poisoning. I run 100% on Opus with max effort, and never came close to hitting 5 hour or weekly limits before the Fable release. I am in Claude Code at least 20hrs a week.
I see people just completely wasting tokens with ridiculous setups, 100% hitting cache misses as well as dumping huge files into context all the time.
Just learn how these things work, or pay the price I guess.
I have been on $100/mo claude and it has been churning out quite good software for months now. like i estimate what would have taken me three ish years, assuming i didn't burn out from failure (i would have). i only hit limits when i double fisted claude with my main project and my side project. just the other day i noticed i had been stuck on 4.5 because i failed to update the npm package.
Luckily I needed a new laptop and I bought an M1 Max secondhand from a friend quite cheaply because it was fast enough to recompile something else I am interested in.
So for me, there is no additional hardware cost; it was acquired in replacement.
I run the AI models at home on this kit because I want to; I'll use openrouter if I need to.
I accept the economics of this article are right. But I feel so incredibly sad about this outcome that we're now just to be people caretaking machines that do the job we loved that actually I am not sure that exercising this nuance is going to matter in the long term.
It turns out it is a mistake I have made in my life — now really unfixable because I am a bit too old — to believe that I will always find enough fulfilment in my work to offset the absence of personal fulfilment elsewhere; I have always enjoyed being able to help people directly by doing a thing I love and I am good at, and that has kept away the sadness of finding it difficult to build a conventional family life to enjoy.
I assumed I would always find some new way to find that enjoyment, but even the slim enjoyment from being able to explore this stuff on my own kit in my own terms will not be enough if the pendulum does not swing back towards human effort.
It is a dismal world we have made for ourselves. Lately I have found myself dreading growing too much older in it.
Also, I would anticipate at least a 5 year lifespan for a current generation card. The 3090 is still respectable simply because it has 24GB of RAM which, for years, has been the limiting factor for ML at home. If you got a 6000, sure it’s going to cost 7-8k, but the resale value is likely to be very good. Even the 3090 is 50%+ of RRP still. And if you’re not doing LLMs, it’s an interesting value proposition for “classic” CNN vision model training. You can fit enormous batch sizes on 96 GB. The biggest reason to upgrade is perf/watt has about doubled (eg 4000 pro Blackwell is half the 3090 for similar).
People tend to assume the capex is thrown away but as we’ve seen with RAM, don’t be so sure you won’t be able to flip it if you need to.
If you have solar, it is not, because you have battery and equipment degradation from cycle charging, c’mon man…
I would agree with you if you said it was vastly cheaper overall (with the initial equipment investment amortized over time) compared to The Power Company.
In many states, even if you are generating electricity and selling it back to the power company, they still gonna charge you normal rates of usage because greed.
If you go off grid, you have bigger things to worry about than how to power your AI cluster. It’s manageable enough if you have land but that’s in scarce supply.
> if you have solar, it is not, because you have battery and equipment degradation from cycle charging, c’mon man…
no, the rate of that is pretty independent of use. unless you live in a place where selling energy back rules are designed to screw the solar owner (California)
There's actually an interesting thought experiment here: if it takes you a full day to build something that AI would otherwise build in a day, do you end up using more power, or less? What is the break-even point, purely from a power consumption perspective?
If an identical task takes a day on both sides, then the human route uses less energy, surely.
Brains are thousands or maybe even millions of times more fuel-efficient than computers and you are alive for the whole day either way, right? You probably eat about the same even.
The reason executives think AI is more efficient is that it is more space-efficient than a human and doesn't demand to be paid or work only a set number of hours. Everything with computing is more efficient if you resent having to give money to other humans. If they could just not have you be alive when they don't need you, it'd possibly be different.
Even though I think at a typical British freelance rate and a truly unsubsidised token price, the AI is possibly more expensive than me. And as a freelancer, from their perspective I really am not alive until they need me. (This is what it often feels like)
The reality is the human and the AI aren't used to build the same things anyway so it's a comparison you can't really make.
Brains are efficient, but civilized humans aren't. In the USA, adults consume at a rate of about 10kW -- only 1-2% of that being the human's metabolism, the rest being HVAC, electrical devices, etc.
For comparison, a modern frontier model like Gemini 3.5 Pro consumes about 15kW -- so only about 1.5x the fully loaded human. In an 8h workday, that model would crank through ~80M tokens (~$5k at API prices). That's ~4 major refactors of a 10k LOC codebase, so probably not a very realistic comparison to a single human dev.
I think a more useful comparison, based on my experience, is that an engineer with AI support can get one 8h day's worth of unassisted work done in 1h. So, the 25 kWh consumed during collaboration (conservatively assuming I keep the GPU hot for the whole hour) frees up the remaining 70 kWh I'll draw down for the day to be spent in some other way.
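For anyone who wants to check the arithmetic in the comment above, here is a rough sketch. Every input is the commenter's own estimate (10 kW per person, 15 kW per model, 80M tokens for about $5k), and the function names are made up for illustration.

```python
# All inputs are the rough estimates quoted in the comment above, not measurements.
HUMAN_KW = 10.0  # claimed fully loaded draw of a US adult
MODEL_KW = 15.0  # claimed draw of a frontier model while serving


def day_energy_kwh(workday_hours: float, assisted_hours: float) -> dict:
    """Split one workday's energy into the AI-assisted part and the rest."""
    collaboration = (HUMAN_KW + MODEL_KW) * assisted_hours
    remainder = HUMAN_KW * (workday_hours - assisted_hours)
    return {"collaboration_kwh": collaboration, "remainder_kwh": remainder}


def implied_price_per_million(total_dollars: float, total_tokens: float) -> float:
    """Dollars per million tokens implied by a total spend and token count."""
    return total_dollars / (total_tokens / 1_000_000)


split = day_energy_kwh(workday_hours=8, assisted_hours=1)
price = implied_price_per_million(5_000, 80_000_000)
print(split)  # 25 kWh during the assisted hour, 70 kWh for the other seven
print(price)  # dollars per million tokens implied by "~80M tokens (~$5k)"
```

The 25 kWh and 70 kWh figures fall straight out of those assumptions, so the conclusion is only as good as the 10 kW and 15 kW guesses.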
Studies on grandmaster chess players indicate that at most you burn 10% more calories when engaged in deep thought than when you're at rest. So the energy "attributable" to an hour of knowledge work is like 10 calories (average sedentary calorie burn is like 80-100 per hour; add a max of 10% for the thinking gets you 8-10 calories). A pound of potatoes is like a buck and is about 320 calories. So you're looking at like 3 cents an hour at most to cover that energy burn. It's definitely even less; I certainly don't think as hard as a grandmaster chess player.
Then, assume power costs 20 cents per kilowatt-hour (US average). To match the human 3 cents per hour, you need an average draw of 150 watts. That's in the range of a budget graphics card, but not much past there.
However, if you sleep instead of sitting around, you can probably make AI cost competitive. Sleeping drops your metabolic rate by more, and lying down in bed (as opposed to sitting) also reduces calorie burn. Combined, you can reduce your burn by like 30 calories an hour. At the new 9 cents per hour human cost, you can afford to run a higher end graphics card drawing ~450 watts. That puts you in RTX 3090 range.
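The calorie-to-watt conversion above can be sketched in a few lines. This is only a back-of-the-envelope check using the comment's own numbers (320 food calories, i.e. kilocalories, per dollar of potatoes and 20 cents per kWh); `breakeven_watts` is a name invented here.

```python
def breakeven_watts(kcal_per_hour: float,
                    kcal_per_dollar: float = 320.0,
                    dollars_per_kwh: float = 0.20) -> float:
    """Continuous power draw whose electricity cost equals the food cost."""
    dollars_per_hour = kcal_per_hour / kcal_per_dollar  # cost of the food energy
    kwh_per_hour = dollars_per_hour / dollars_per_kwh   # energy that money buys
    return kwh_per_hour * 1000.0                        # kWh per hour -> watts


thinking = breakeven_watts(10)  # extra burn attributed to an hour of thinking
sleeping = breakeven_watts(30)  # burn saved by sleeping instead of sitting
print(round(thinking), round(sleeping))
```

Without rounding the food cost to whole cents, the two cases come out near 156 W and 469 W, which is consistent with the 150 W and 450 W quoted above.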
I'm assuming that you need to feed the human being (i.e. you) regardless of whether you use that human being for writing code or not. So, by this metric, there is simply no break-even point. The cost of human + AI is always going to be higher than the cost of human.
Speaking personally: yes. That's literally what I'm planning to do this afternoon because it's noon and I'm already done with the coding tasks I had on my plate today.
Luckily the future is absolutely going to be that star trek one where technological abundance means we are all wealthy and have free time to develop personally, and not the future where all the money bubbles up into the hands of a thin-skinned malignant narcissist who wants to play with launching rockets and provoking racial violence /s
I invested about $4,000 in an NVIDIA DGX Spark several months ago. 128 GB of unified RAM, and the NVIDIA GB10 chip. With the RAM, the several CPU cores, and the 4 TB NVMe SSD, it's a very capable ARM64 Linux computer even without the GPU, and so far I've mostly been using it as such. But I wonder, what's the most capable model, specifically for coding, that can run well on that hardware?
I find just going via Deepseek's platform API directly, using their V4 flash model, and hooking into a harness like Opencode more than acceptable. Think I've spent maybe $10 over a couple of weeks.
I did explore self-hosting models but hardware right now is just too expensive.
Directly at DeepSeek? It was my understanding (but I didn't check) that some other AI operators were providing (some of?) DeepSeek's model for cheaper prices.
Still, that's interesting. What do you get for that price? Only coding, or also e.g. image generation?
> Do that well and you can build what a team of twenty engineers would put out in a month for around a thousand dollars.
What does this look like after 6-12 months? Like, how much code are you trying to write total?
Maybe it just doesn’t click in my mind, but sometimes I wonder about how much work people are trying to do and how they actually have enough to get done so quickly in such a short amount of time.
For me, investing in hardware seems to be the way to go.
I learned coding nearly 24 years ago and am still learning new stuff all the time. At no point did I have to rely on a subscription model to learn and do new stuff.
If LLMs and agents are the default tools for coding and building software, at least for the next few years, it seems like a no-brainer to invest $2000-3000 in hardware, like a Strix Halo PC.
I wondered if there might be a no brainer "free" option on discarded hardware.
I have a GTX 1080 Ti, which I think is circa 2018. It's unused and has more than paid for itself over the years, so it owes me nothing at this point and the hardware is effectively free.
It runs Gemma e4b multimodal, qwen 3.5 8b or the qwen 4b embeddings models well enough (40+ t/s for the LLMs).
The machine consumes 350 watts at the wall when under load (3 watts when sleeping, 80 watts at idle). Electricity costs me £0.035/kWh, which is cheap for the UK (load shifting via house battery).
That works out to 144k output tokens for around 1 pence (and it takes an hour to do that, in theory).
It's only JUST cheaper to use than the far more capable deepseek v4 flash model despite the free hardware and ~10x cheaper than normal electricity.
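A quick sketch of how those numbers combine, in case anyone wants to plug in their own card and tariff. The inputs are the ones given above (40 t/s, 350 W under load, £0.035/kWh); `local_cost` is a hypothetical helper, and it assumes the card generates continuously for the full hour.

```python
def local_cost(tokens_per_second: float, watts: float,
               price_per_kwh: float, hours: float = 1.0):
    """Return (tokens generated, electricity cost, cost per million tokens)."""
    tokens = tokens_per_second * 3600 * hours
    kwh = watts / 1000 * hours
    cost = kwh * price_per_kwh
    return tokens, cost, cost / tokens * 1_000_000


tokens, cost, per_million = local_cost(40, 350, 0.035)
print(tokens, cost, per_million)
```

With these inputs that is 144,000 tokens for about £0.012, or roughly £0.085 per million output tokens. Comparing against a hosted price is then just a matter of putting both in the same currency per million tokens.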
Yes and no. Hardware does lock you in. Granted, I am happy with my 128gb of shared memory, but I am mildly concerned that it actually is more expensive now than when I bought mine. It does not bode well for the future; not when combined with recent WH admin moves on Anthropic and the reality that next batch of good models may require more than 128gb to run well.
edit: I am not dismissing local. I am one such user ( though I have subs too ), but one has to be clear eyed about the trade-offs.
$3k isn't getting you frontier model capability. It's barely getting you any capability if that's split into buying an entire PC rather than just GPUs.
With you here. I'm using my cheapo 16gig vram card I picked up a year or so ago, and I'm like -- yes, I perceive that you can pay for way more tokens per second than I can do at home.
But that feels like measuring productivity in lines of code. For what I'm doing, I'm not seeing the benefit in any subscription.
Sure, I can't one-prompt a whole new boring CRUD app, but oh well.
Can I run something comparable to Opus 4.6 locally yet? I keep hearing conflicting things. If I can spend 10k to do that I would cancel my subscription. The problem is I don’t wanna spend the money to find out myself.
If you want frontier-level, the economically reasonable option is OpenRouter or a direct sub to frontier-of-your-choice.
The reality is that they do not offer configurations that would allow a consumer to run that much VRAM on a single setup to protect datacenter margins. Apple used to, and they stopped, those devices are going for ~$20k+ each on ebay now.
You can get very, very capable models on a 3090/4090/5090/6000 series card. But if you want 'frontier level' you are investing ~22k at a bare minimum if you go new. Used you can probably build your own server for much cheaper up-front cost but it's likely going to be 4-6x+ electricity usage.
There are also significant economies of scale (namely: utilization and batching), which tend to make inference on a shared server more economical even after the operator takes a cut.
I truly think by 2028 we'll have integrated chip systems that'll be able to run opus 4.8 level models at ~500 watts at acceptable performance. Honestly I think now is the worst time to invest in AI hardware. Get your harness ready and processes perfected with hosted models, and wait a few years to buy hardware to transition to running models locally
Some benchmarks have shown Kimi K2.6 within error-bar distance of Opus 4.6, and you can run it on eight RTX6000s. Right now it's not possible to set up a machine like that from scratch for less than $100K... but right now it's also hard to put a price on autonomy.
Best you could do is connect two Mac Studio M3 Ultra 512G RAM each with Thunderbolt. Then theoretically you can run frontier Chinese models (but not Deepseek v4 Pro yet). That would be about $20k.
But - good luck finding them. Apple discontinued the model a few months ago. And more recently, even 256G model was discontinued. Big AI really really does not want people to get off their needle.
AI coding at home literally costs $100/month. I'm wondering where $400 is coming from? $100 is more than enough for "coding at home", IMO. I rarely face the limits, and when I do it's just a time for a quick walk anyway.
I recently made an AI Agent and surprisingly coding with DeepSeek V4 Flash is quite cheap. It probably has to do with the aggressive prompt caching. I'm using OpenRouter with Novita AI as the preferred provider.
I’m using zen because I have a Claude subscription and just like dabbling with the other models and I was shocked at how little flash cost but it was noticeably not at the level I’d like my model to be.
For me MiniMax 3 has really hit the sweet spot of being very cheap, though more than flash, but it's also very capable.
I've been thinking a lot about this and my personal take right now is that at some point in the near-to-medium future the models available to run at home and the hardware needed to run them will be enough.
My baseline is sonnet 4.6. I sincerely think it's good enough for most tasks. So, from what I see, we are already at a point where we don't need frontier models for serious coding and debugging. Give it a couple of years and that level will fit 120B models.
At the same time, we saw the rise of direct access memory systems like DGX or Strix Halo that will allow running models of this size for "cheap" in the medium term.
That's what I'm betting on: that in 2 years I can buy a system for about $2500 that will run a model that's similar to Sonnet 4.6 locally.
I might be spectacularly wrong though. But I'm willing to wait and use subscriptions/API calls for now.
Hardware and provider juggling is a way to go, although I think it is also worth mentioning that the cost is not only the price-per-token, but first of all, the amount of tokens used.
Depending on what one builds, comprehensive documentation and applicable skills and memory tools often allow for a substantial reduction of tokens previously used by the agent to comprehend and remember what is being built
What kind of usage chews through Claude Max x20? I use several agents with max effort in parallel and usually end up with something like 50% weekly usage. Fable almost allowed me to get to 70% but then they started resetting the limits mid-week and of course now ended the whole thing.
There are a lot of Xeon chips for $10 on eBay. Too bad there’s no drive for CPU-based inference. The data centers will need to swap out the older GPU clusters, so what does that do for hardware pricing on data center GPUs? H100s are cheap enough, but the power requirements make them a long-term net negative given how much you pay for power in California.
I think someone could find some way to use the smaller local models to write code. Some kind of framework or harness or language or something. But not too many people are working on that because the big models are pretty cheap and a lot better.
Maybe one possible path (to make weaker models highly capable) is making the job of the LLM as easy as possible.
I wonder if part of the solution is building/finding the right libraries, with the right documentation/language/API (one that plays well with LLMs) and maybe creating some synthetic data around them, to make it very easy for the LLM.
And maybe there could be a business model around creating those libraries.
I think as well there might be "algorithms" that can work with local LLMs. With local LLMs there is a small context window, but not that much cost per token. So perhaps there is a way to do lots of small prompts that work in a sequence to produce a result.
Like perhaps you could produce 5 versions of a piece of code, and then compare them to choose the best.
Also if the local LLMs can call tools, maybe you can use static analysis tools to catch errors and try again in a loop or process of some sort.
There also might be certain languages that work better because those languages have better static checks.
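To make the two ideas above concrete (check, retry in a loop, then pick the best of several candidates), here is a minimal sketch. It is an illustration and not any particular harness: `generate` stands in for whatever call reaches your local model, the "static analysis" is just Python's own parser via `ast.parse`, and the fake models exist only so the example runs without an LLM.

```python
import ast
from itertools import cycle
from typing import Callable, Optional


def static_check(source: str) -> Optional[str]:
    """Return None if the source parses as Python, else a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def generate_checked(generate: Callable[[str], str], prompt: str,
                     max_attempts: int = 3) -> Optional[str]:
    """Ask the model, run the checker, feed any error back, and retry."""
    current_prompt = prompt
    for _ in range(max_attempts):
        candidate = generate(current_prompt)
        error = static_check(candidate)
        if error is None:
            return candidate
        current_prompt = (f"{prompt}\n\nThe previous attempt failed a static "
                          f"check ({error}). Please fix it.")
    return None


def best_of_n(generate: Callable[[str], str], prompt: str, n: int,
              score: Callable[[str], float]) -> Optional[str]:
    """Produce up to n checked candidates and keep the highest-scoring one."""
    candidates = [generate_checked(generate, prompt) for _ in range(n)]
    valid = [c for c in candidates if c is not None]
    return max(valid, key=score, default=None)


# Hypothetical stand-in for a local model: broken code first, then a fix.
replies = iter(["def add(a, b) return a + b",
                "def add(a, b):\n    return a + b\n"])
fixed = generate_checked(lambda prompt: next(replies), "Write add(a, b).")

# Second stand-in that alternates between two valid candidates.
variants = cycle(["x = 1\n", "x = 1  # with a comment\n"])
best = best_of_n(lambda prompt: next(variants), "Set x.", n=2, score=len)
print(fixed)
print(best)
```

In a real setup the checker would be a type checker or linter for the target language and `score` would be something like a test pass count. Languages with stronger static checks give this loop more to work with, which is the point made above.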
I mean, this is what I'm doing. I'm guessing my process is very different because I'm holding the hand of the project way more along the way, but even that to me probably makes for a more enjoyable process.
Which is to say, I might use AI to do an outline/organizational pass, but I'm prompting every chunk of code "one-by-one" (e.g. at about the "function" level), which still feels lightyears ahead of what I used to do.
Yeah, although that is pushing every rate limit and no one knows what happens if you do that consistently? I think $4,000/mo is probably a good estimate for an individual dev doing synchronous coding agent work.
Yeah, I agree. I've been consistently getting about $1,000/month of value out of the $100/month subscription for OpenAI, and about the same for Anthropic.
Maybe today but it's not a law of nature. It seems inevitable that AI models and coding agents will be fully commoditized eventually, just like computers, game engines, compilers, web servers, and so many other technologies have been.
At the end of the day, AI models are relatively small files that we run little CUDA programs on.
Is spending (metered money) even worth it? I mean beyond something like 30 bucks a month. Perhaps for most it is, but for me I'm literally not spending more money beyond my very cheapo 16gb video card.
No clue what y'all are doing, perhaps because I'm hobbying, and also I'm old and can perhaps do more of this by hand.
But I'm basically just doing what I did before, plus ollama self hosted and sometimes gemini and I feel like I'm going lightspeed beyond what I've ever done.
And I suppose this is still very fine-grained. I have it make a draft, then just have it fix/change things step by step?
I tried one of the bigger boys that can one-shot apps, which I guess is cool, but I'm finding the result is just as hard to modify as if I had just grabbed someone else's repo on github.
Fixed-price monthly plans ought to be sufficient for most people who actually review their spec and code, for building production-grade software that stands the test of time. A careful spec+review+iteration cycle takes time, during which the usage quota resets. Granted, security audits use tokens too.
If you still need more tokens, odds are that you're vibecoding unmaintainable throwaway trash.
With access to view usage for my org and conversations with developers, I think much of the high token usage is a result of people not knowing how to right-size the model for the given task. The trend seems to be to pick the most powerful model and use it for everything. Based upon git metrics, I'm one of the top performing engineers at my org and I've yet to run into any overage or throttling on the $200/mo anthropic sub.
You can use opencode and switch between multiple providers on the fly based on the task you are doing: normal tasks use deepseek, for example, and hard ones use gpt5 or opus4. Track the usage with something like codexbar or similar. Openrouter seems to charge extra on top of the api costs, same with zen ide, so keep that in mind.
> The first is to self host. You buy the machine, run open source models locally, and pay nothing per token after that.
In the good ol' days, we bought machines not only to run stuff, but to experiment.
I understand today experiments are limited. Inference is reasonable, fine-tuning is either niche or a stretch, and base training is impossible.
*That is bound to change*, and when it does, there will be an avalanche of hobbyists and amateurs poking at base training. They'll find optimizations no one found before, synthesize data no one ever imagined synthesizing, and when that happens we'll start getting libre models.
So, yeah. Right now, buying the machine doesn't pay off that well, unless you want to pioneer this stuff in severe adverse conditions (hardware prices inflated, etc). Eventually, it will.