I was hoping that near the end the author would have tried to contribute new tests to the official testsuite to help catch these same errors elsewhere.
For the first one, Zero Is Not Null, maybe there is a missing test in the call_indirect test?
https://github.com/WebAssembly/testsuite/blob/main/call_indi...
On the "server side" (i.e. training) you can use the current gen models to improve the training data by running many parallel environments with a similar loop as above. Then incorporate the new data and repeat. Reminiscent of the old GAN approach, where the generator and discriminator are trained together in an adversarial regime. The end result should be safer code on "vanilla" prompts. "Write an API that does x y z" should now contain the learnings from this loop, and the models should produce better code.
Works really well for every verifiable scenario. And as the models become better, they can also more reliably create environments that closely match real-world scenarios. If you also have some data from human devs (say you run a subsidised coding model for a few months), even better.
An example of turning a "normal" repo into a verifiable environment that I read recently in the Cursor blog: take a repo, ask an LLM to remove a feature, verify that the app still works w/o the feature, verify that the tests for that feature fail. Ask a generator to "add feature x". Verify with the original tests. If pass -> give carrot :)
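The steps above can be sketched roughly as follows. This is only an illustration of the loop as described in the comment, not the actual pipeline from the blog post: `Repo`, `Check`, `Env`, `Reward`, and the demo helpers are all hypothetical names, and a repository is modelled as nothing more than a set of feature flags so that the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// Repo is a hypothetical stand-in for a checked-out repository,
// reduced to the set of features it currently implements.
type Repo struct {
	Features map[string]bool
}

// with returns a copy of the repo with one feature switched on or off.
func (r Repo) with(feature string, present bool) Repo {
	out := Repo{Features: make(map[string]bool, len(r.Features))}
	for k, v := range r.Features {
		out.Features[k] = v
	}
	out.Features[feature] = present
	return out
}

// Check is a hypothetical test: it reports whether the repo passes.
type Check func(r Repo) bool

// Env bundles the pieces of the loop. Every field is an assumption
// made for this sketch.
type Env struct {
	Remove       func(r Repo, feature string) Repo // "ask an LLM to remove a feature"
	Generate     func(r Repo, feature string) Repo // "ask a generator to add feature x"
	AppTests     []Check                           // tests unrelated to the feature
	FeatureTests []Check                           // original tests for the feature
}

func allPass(r Repo, checks []Check) bool {
	for _, c := range checks {
		if !c(r) {
			return false
		}
	}
	return true
}

// Reward returns 1 if the generator restores the feature and 0 if it does not.
// It returns an error when the environment itself is not valid.
func (e Env) Reward(orig Repo, feature string) (float64, error) {
	stripped := e.Remove(orig, feature)
	if !allPass(stripped, e.AppTests) {
		return 0, errors.New("app no longer works without the feature")
	}
	if allPass(stripped, e.FeatureTests) {
		return 0, errors.New("feature tests still pass, so nothing was removed")
	}
	candidate := e.Generate(stripped, feature)
	if allPass(candidate, e.AppTests) && allPass(candidate, e.FeatureTests) {
		return 1, nil // the "carrot"
	}
	return 0, nil
}

func demoRepo() Repo {
	return Repo{Features: map[string]bool{"login": true, "export": true}}
}

func demoEnv() Env {
	return Env{
		Remove:       func(r Repo, f string) Repo { return r.with(f, false) },
		Generate:     func(r Repo, f string) Repo { return r.with(f, true) },
		AppTests:     []Check{func(r Repo) bool { return r.Features["login"] }},
		FeatureTests: []Check{func(r Repo) bool { return r.Features["export"] }},
	}
}

func main() {
	reward, err := demoEnv().Reward(demoRepo(), "export")
	fmt.Printf("reward=%v err=%v\n", reward, err)
}
```

The two error returns correspond to the two verification steps: an environment only counts as usable if the app survives the removal and the feature's own tests start failing.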
The key is composition. Once you unlock a new capability, that gets implemented and incorporated into the next training run. Pretty neat, I would say, and the main driver for the recent increase in the breadth of capabilities for new models.
More like: give $$$, pass or not.
It's good if no LLMs can find a bug. It certainly does not mean there isn't one...
I've found LLMs to be very disappointing at identifying overly complex code (that they've written) and the correct architectural decisions to 1) make the code actually work, and 2) be simple, maintainable, and future proof.
They can certainly find some bugs, which definitely has value, but I've not had much success with them writing code that simply has no bugs...
That requires simplicity and architectural correctness, something LLMs are good at vaguely bullshitting, but not very good at getting correct.
I think this can be solved by feeding them the right metrics, but I haven't found prior art for how to algorithmically pinpoint: 1) what is actually complex in a bad way (there are a lot of ways to do this roughly), 2) where exactly the problem is most acute (less prior art here, but some), and 3) what the viable solutions are.
If you can get better at 1 and 2, the LLMs can get much better at 3.
Anybody who has ideas, I'd love to hear them, as this is what I'm working on now.
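As one crude starting point for points 1 and 2, here is a minimal sketch that counts branching constructs per function and ranks functions by that count. It assumes Go source as input and treats branch count as a stand-in for "complex in a bad way", which is exactly the kind of rough proxy the comment says is insufficient on its own. `FuncScore`, `ScoreFuncs`, and `sample` are hypothetical names; the parsing uses the standard `go/parser` and `go/ast` packages.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// FuncScore pairs a function with a rough branch count and its location.
type FuncScore struct {
	Name     string
	Line     int
	Branches int
}

// ScoreFuncs parses Go source and counts if, for, range, and
// case/select clauses inside each top-level function body.
// Results are sorted with the most branch-heavy function first.
func ScoreFuncs(src string) ([]FuncScore, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}
	var scores []FuncScore
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		n := 0
		ast.Inspect(fn.Body, func(node ast.Node) bool {
			switch node.(type) {
			case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt, *ast.CaseClause, *ast.CommClause:
				n++
			}
			return true
		})
		scores = append(scores, FuncScore{
			Name:     fn.Name.Name,
			Line:     fset.Position(fn.Pos()).Line,
			Branches: n,
		})
	}
	sort.SliceStable(scores, func(i, j int) bool {
		return scores[i].Branches > scores[j].Branches
	})
	return scores, nil
}

// sample is a made-up input used only to exercise ScoreFuncs.
const sample = `package demo

func simple(x int) int { return x + 1 }

func tangled(xs []int) int {
	total := 0
	for _, x := range xs {
		if x > 10 {
			total += x
		} else if x < 0 {
			total -= x
		}
	}
	return total
}
`

func main() {
	scores, err := ScoreFuncs(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range scores {
		fmt.Printf("%s (line %d): %d branches\n", s.Name, s.Line, s.Branches)
	}
}
```

A ranking like this only says where branching is dense. It says nothing about whether that branching is necessary, which is the harder part of the question.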
The upside of the lack of real constructors is less incidental complexity than having a constructor written for every object, which then has to be read and maintained.
Another option of course is to write constructors - there's nothing to stop you doing so in Go and using those when creating objects (e.g. foo.New() whenever you want one of these things), but it'd be a convention rather than something required.
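A minimal sketch of that convention, with a hypothetical `Server` type and `NewServer` function (the comment's `foo.New()` would be the same idea exposed from a package named `foo`):

```go
package main

import (
	"errors"
	"fmt"
)

// Server is a hypothetical type used only for illustration.
type Server struct {
	Addr    string
	Timeout int // seconds
}

// NewServer is a constructor by convention only: the language does not
// force callers to go through it.
func NewServer(addr string) (*Server, error) {
	if addr == "" {
		return nil, errors.New("addr must not be empty")
	}
	return &Server{Addr: addr, Timeout: 30}, nil
}

func main() {
	// Plain struct literal: omitted fields get their zero value,
	// so Timeout is 0 here.
	a := Server{Addr: "localhost:8080"}

	// Conventional constructor: validation and defaults live in one place.
	b, err := NewServer("localhost:8080")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a.Timeout, b.Timeout) // prints "0 30"
}
```

Both ways of creating a `Server` compile, which is the point being made: the constructor is something a team agrees to use, not something the compiler enforces.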
So throwing his own, apparently poorly written, creation under the bus will get him applause and promotions by the AI lunatics.
It is a currently popular strategy among AI boosters.
You think he cynically decided to boost his career by writing a detailed description of the exploits found in his own software?
Is there no room in your model of the world for someone to figure out something interesting using AI tools and then write about it just because they like sharing interesting information?
It’s a heuristic after all. There is no proof one way or the other.
Did you even take 2-3 minutes to click around his website and give a read to his other article "Something that I used to love"?
Disparaging comments like yours give others an impression of HN that it totally isn't. I don't know if you wrote your comment especially for engagement baiting.
Didn't know this is a thing... interesting that a company that's marketing their Mythos so hard doesn't allow security prompts.
I am also curious how the cheaper Chinese models do. I have an Opencode Go plan, so I'll let 'em rip over the weekend; hopefully I get to see a few bugs!
The whole point of Mythos/Glasswing is "our best models are scary good at security research, so much so that we won't let them help you find vulnerabilities unless you are a trusted partner".
Even if it is marketing, at least there are some positive side effects: security flaws get identified and closed.
We can save that dialogue for finding bugs in widely used projects.
Edit: Something I tried to reply to a now-dead top-level comment here: Whoever claims that a new account alone is a signal for submission-boosting comments etc. needs to update their heuristics.