Every time you find a runtime bug, ask the LLM if a static lint rule could be turned on to prevent it, or have it write a custom rule for you. Very few of us have time to deep dive into esoteric custom rule configuration, but now it's easy. Bonus: the error message for the custom rule can be very specific about how to fix the problem, including pointing to documentation that explains entire architectural principles, concurrency rules, etc. That guidance is tailored to your codebase and far more precise than a generic compiler/lint error.
It's kind of a best-case use case - linters are generally small and easy to test.
It’s also worth noting that linters now effectively have automagical autofix - just run an agent with “fix the lints”. Again, one of the best case scenarios, with a very tight feedback loop for the agent, sparing you a large amount of boring work.
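For anyone who hasn't written one: a custom ESLint rule is just an object with a create() function that visits AST nodes and calls context.report() with whatever message you want. A minimal sketch - the rule name, the banned constructor, and the docs path below are all made up for illustration:

  // eslint-plugin-local/rules/no-direct-db-client.js (hypothetical rule and names)
  module.exports = {
    meta: {
      type: "problem",
      schema: [],
      messages: {
        noDirectClient:
          "Do not construct DbClient directly; use createDbClient() from src/db/factory " +
          "so pooling and tracing are configured. See docs/architecture/database.md.",
      },
    },
    create(context) {
      return {
        // Flag `new DbClient(...)` anywhere in the codebase.
        NewExpression(node) {
          if (node.callee.type === "Identifier" && node.callee.name === "DbClient") {
            context.report({ node, messageId: "noDirectClient" });
          }
        },
      };
    },
  };

The interesting part is the message: it can carry project-specific instructions and doc links in a way a generic built-in rule never will.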
To find most runtime bugs (e.g. an incorrect regex, broken concurrency, an incorrect SQL statement, ...) you need to understand the mental model and logic behind the code - checks like "is variable XYZ unused?" or "does variable X shadow Y?" or other more "esoteric" lint rules will not catch them. The likelihood is high that the LLM just hallucinated some false-positive lint rule anyway, giving you a false sense of security.
I'm not sure if there's some subtlety of language here, but from my experience of javascript linting, it can often prevent runtime problems caused by things like variable scoping, unhandled exceptions in promises, misuse of functions etc.
I've also caught security issues in Java with static analysis.
But the author claims that you can catch runtime bugs by letting the LLM create custom lint rules, which is hyperbole at the very least, possibly just wrong, and at worst gives developers a false sense of security.
Context - I have a 200k+ LOC Python+React hobby project with a directory full of project-specific "guidelines for doing a good job" agent rules + skills.
Of course, agent rules are often ignored in whole or in part. So in practice those rules are often triggered in a review step pre-commit as a failsafe, rather than pulled in as context when the agent initially drafts the work.
I've only played for a few minutes, but converting some of these to custom lint rules looks quite promising!
Things like using my project's wrappers instead of direct calls to libs, preferences for logging/observability/testing, indicators of failure to follow optimistic update patterns, double-checking that frontend interfaces to specific capabilities are correctly guarded by owner/SKU access control…
Lots of use cases that aren't hard for an agent to fix accurately if pointed at them directly - and now that pointing can happen inline in the agent's work loop, without intervention, through normal lint cleanup, earlier in the process (and faster) than tests would catch it. This doesn't replace testing or other best practices. It feels like an additive layer that speeds up agent iteration and improves implementation consistency.
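For the "use my project's wrappers instead of direct calls to libs" case you may not even need a custom rule; ESLint's built-in no-restricted-imports covers it, with the message pointing at the wrapper. A sketch - the module and wrapper paths are illustrative, not from the comment:

  // .eslintrc fragment (hypothetical module/wrapper names)
  {
    "rules": {
      "no-restricted-imports": ["error", {
        "paths": [
          {
            "name": "axios",
            "message": "Import apiClient from src/lib/http instead; it adds auth and retry handling."
          },
          {
            "name": "winston",
            "message": "Use the logger from src/lib/observability so log fields stay consistent."
          }
        ]
      }]
    }
  }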
Thanks for the tip!
The detection is based on dumb string literal heuristics, but has proven rather effective. Example patterns:
  const hardcodedInfrastructure = {
    url: /^https?:\/\/(localhost|127\.0\.0\.1|192\.168\.\d+\.\d+|10\.\d+\.\d+\.\d+|172\.(1[6-9]|2\d|3[01])\.\d+\.\d+)(:\d+)?/i,
    dbUrl: /^(postgresql|postgres|mysql|mongodb):\/\/.*@(localhost|127\.0\.0\.1|192\.168\.\d+\.\d+|10\.\d+\.\d+\.\d+|172\.(1[6-9]|2\d|3[01])\.\d+\.\d+)/i,
    localhost: /^localhost$/i,
    localhostPort: /^localhost:\d+$/i,
  };
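To make the wiring concrete (this is my guess at the shape, not the commenter's actual checker): the patterns are simply tested against string contents found in source files, e.g. from a pre-commit script.

  // Rough sketch of running the patterns as a standalone check over a file.
  const fs = require("fs");

  const hardcodedInfrastructure = {
    localhost: /^localhost$/i,
    localhostPort: /^localhost:\d+$/i,
    // ...plus the url/dbUrl patterns shown above
  };

  function findHardcodedInfra(path) {
    const hits = [];
    fs.readFileSync(path, "utf8").split("\n").forEach((line, i) => {
      // Dumb heuristic: pull out quoted string contents and test each pattern against them.
      for (const m of line.matchAll(/["'`]([^"'`]+)["'`]/g)) {
        if (Object.values(hardcodedInfrastructure).some((re) => re.test(m[1]))) {
          hits.push({ file: path, line: i + 1, value: m[1] });
        }
      }
    });
    return hits;
  }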
Backpressure != feedback (the more general term). And in the agentic world, we use the term 'context' to describe information used to help LLMs make decisions, where the context data is not part of the LLM's training data. Then we have verifiable tasks (what he is really talking about), where RL is used in post-training, in a harness environment, with feedback signals to learn about type systems, programming language syntax/semantics, etc.
The term back pressure actually comes from mechanical engineering in the context of steam engines.
It first appeared in a dictionary 160 years ago.
Words are just words. Mathematicians understand very well that words mean nothing; what matters are definitions, and the author provides one.
E.g. natural numbers may or may not contain the number 0, but that's irrelevant, because what mathematicians care about are definitions, so they will state that natural numbers are a given set of positive whole numbers (including the number 0 or not) and avoid arguing about labels. You can call them funky numbers or neat numbers, it doesn't matter.
Same applies here. Your comment is pointless because the author does provide a definition for back pressure in the context of his blog post and what matters is discussing the concept he labels in the context of LLMs.
I'm not trying to discount any attempt to correct people, especially when it gets confusing (like here, I was also confused honestly), but we could phrase it more nicely IMHO.
Context is also a misnomer, where in fact it's just a part of prompt.
Prompt itself is also a misnomer, where in fact it's just part of model input.
Model input is also a misnomer, in fact it's just first input token + prefill for model output to generate more output.
Harness is also a misnomer, where it's just scaffold / tools around the model input/output.
see https://ghuntley.com/pressure
I have the pleasure of working with moss, and he came up with a way to explain what is in my head with ease.
The back pressure I need cannot come from automated testing or access to an LSP.
The back pressure I need comes from the agent following the rules it has been given, or from architectural or business-logic feedback.
On that, I still cannot make it work like I want. Going to give a simple example with Claude Code.
I have a frontend agent instructed to not use any class or style ever, only the design system components and primitives.
Not only will it ignore those very quickly, but when it proposes edits and I give feedback, the agent ignores it completely and instead keeps suggesting more edits.
Thus I had to revert to deleting the agent completely and relying on the main thread to do that work.
Same applies with any other agent.
E.g. compiler errors, unit tests, MCP, etc.
I've heard of these, but haven't tried them yet.
https://github.com/hmans/beans
https://github.com/steveyegge/gastown
Right now I spend a lot of “back pressure” on fitting the scope of the task into something that will fit in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.
I've found https://github.com/obra/superpowers very helpful for breaking the work up into logical chunks a subagent can handle.
The answer is not more natural-language guardrails; it is (progressive) formal specification of workflows and acceptance criteria. A task cannot be marked as complete if completion is only reachable through an API that rejects changes lacking proof that the acceptance criteria were met.
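A rough sketch of what such a gate could look like (the criteria, commands, and function names below are invented for illustration; the point is that "complete" becomes an API decision backed by evidence, not an agent's claim):

  // Hypothetical completion gate: "done" is only accepted when every
  // machine-checkable acceptance criterion passes, and the evidence is recorded.
  const { execSync } = require("child_process");

  function runCommand(cmd) {
    try { execSync(cmd, { stdio: "inherit" }); return true; }
    catch (e) { return false; }
  }

  const acceptanceCriteria = [
    { id: "lint", run: () => runCommand("npm run lint") },
    { id: "unit-tests", run: () => runCommand("npm test") },
    { id: "contract-tests", run: () => runCommand("npm run test:contracts") },
  ];

  function markTaskComplete(taskId) {
    const results = acceptanceCriteria.map((c) => ({ id: c.id, passed: c.run() }));
    const failing = results.filter((r) => !r.passed).map((r) => r.id);
    if (failing.length) {
      throw new Error(`Task ${taskId} cannot be completed; failing criteria: ${failing.join(", ")}`);
    }
    return { taskId, status: "complete", evidence: results };
  }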
What we do at https://minfx.ai (a Neptune/Wandb replacement) is use TONS of custom lints. Any time we see some undesirable, repeatable agent behavior, we add it as a prompt modification and a lint. This is relatively easy to do in Rust. The kinds of things I did are:
- Specify a maximum number of lines / tabs; otherwise the code must be refactored.
- Do not use unsafe or RefCells.
- Do custom formatting, where all code looks the same: order by mods, uses, constants, structs/enums, impls, etc. In particular, I added topological ordering (DAG-ordering) of structs, so when I review code, I build up an understanding of what the LLM actually did, which is faster than reading the intermediate outputs.
- Make sure there are no "dependency cycles": internal code does not use public re-exports, so whenever you click on definitions, you only go DEEPER in the code base or stay in the same file; you can't loop back.
- And more :-)
Generally I find that focusing on the code structure is super helpful for dev and for the LLM as well, it can find the relevant code to modify much faster.
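The commenter does this in Rust; as a language-agnostic illustration, the first two bullets (line limits, banned constructs) boil down to very small checks. A toy sketch in JavaScript with made-up thresholds:

  // Toy version of the "max lines" and "no unsafe/RefCell" lints; the limit is invented.
  const fs = require("fs");

  function lintFile(path) {
    const lines = fs.readFileSync(path, "utf8").split("\n");
    const problems = [];
    if (lines.length > 400) {
      problems.push(`${path}: ${lines.length} lines; refactor into smaller modules (limit 400).`);
    }
    lines.forEach((line, i) => {
      if (/\bunsafe\b/.test(line) || /\bRefCell\b/.test(line)) {
        problems.push(`${path}:${i + 1}: unsafe/RefCell is not allowed in this codebase.`);
      }
    });
    return problems;
  }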
I've recently discovered that if a model gets stuck in a loop on a tool call across many different runs, it's almost certainly because of a gap in expectations regarding what the available tools do in that context, not some random model failure mode.
For example, I had a tool called "GetSceneOverview" that was being called as expected and then devolved into looping. Once I counted how many times it was looping, I realized it was internally trying to pass per-item arguments in a way I couldn't see from outside the OAI API black box. I had never provided a "GetSceneObjectDetails" method (or an explanation for why it doesn't exist), so it tried the next best thing for each item returned in the overview.
I went one step further and asked the question "can the LLM just directly tell me what the tooling expectation gap is?" And sure enough it can. If you provide the model with a ReportToolIssue tool, you'll start to get these insights a lot more directly. Once I had cleared non-trivial reports of tool concerns, the looping issues all but vanished. It was catching things I simply couldn't see. The best insight was the fact that I hadn't provided parent ids for each scene object (I assumed not relevant for my test command), so it was banging its head on those tools trying to figure out the hierarchy. I didn't realize how big a problem this was until I saw it complaining about it every time I ran the experiment.
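For anyone who wants to try this: the extra tool is just one more function definition in the tools array, and the model fills in what it thinks is missing. A sketch against the OpenAI-style tools schema (the tool name and fields are my own invention, not a standard):

  // Hedged sketch of a "let the model complain about its tools" definition.
  const reportToolIssue = {
    type: "function",
    function: {
      name: "ReportToolIssue",
      description:
        "Call this when the available tools do not match what you need: a missing tool, " +
        "a missing parameter, or output lacking information you expected (e.g. parent ids).",
      parameters: {
        type: "object",
        properties: {
          tool: { type: "string", description: "The existing tool this issue relates to, if any." },
          expectation: { type: "string", description: "What you expected to be able to do." },
          gap: { type: "string", description: "What is actually missing or unclear." },
        },
        required: ["expectation", "gap"],
      },
    },
  };

Log the calls, acknowledge them in the tool result so the model can move on, and review the reports offline.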
Most of my feedback that can be automated is done either by this or by fuzzing. Would love to hear about other optimisations y'all have found.
There are also openapi spec validators to catch spec problems up front.
And you can use contract testing (e.g. https://docs.pact.io/) to replay your client tests (with a mocked server) against the server (with mocked clients) - never having to actually spin up both at the same time.
Together this creates a pretty widespread set of correctness checks that generate feedback at multiple points.
It's maybe overkill for the project I'm using it on, but as a set of AI handcuffs I like it quite a bit.
What we often like to do in a PR - look over the code and say "LGTM" - I call this "vibe testing", and I think it is a really bad pattern to use with AI. You can't commit your eyes to the git repo, and you are probably not doing as good a job as when you have actual test coverage. LGTM is just vibes. Automating tests removes manual work from you too, not just making the agent more reliable.
But my metaphor for tests is that they are the "skin" of the agent, allowing it to feel pain. The docs/specs are the "bones", giving it structure. The agent itself is the muscle and cerebellum, and the human in the loop is the PFC.
For complicated things, it helps to impose a TDD workflow: define the tests first. And of course you can get the LLM to write those as well. Cover enough edge cases that it can't take any shortcuts with the implementation. Review the tests before you let it proceed.
Finally, skills take a lot of the guesswork out of deciding which tools to run when. You can just tell it what to run, how to invoke it, etc., and it will do it. This can save a bit of time. Simple example: Codex seems to like running Python things a lot. I have uv installed, so there is no python on the path; you need to call python3. Codex will happily call python first before figuring that out. Every time. It will just randomly call tools, fall back to some node.js alternative, etc., until it finds some combination of tools to do whatever it needs to do. You can save a lot of time by just making it document what it is doing in skill form (no need to write those manually, though you might want to review and clean them up).
I've been iterating on a Hugo based static website. After I made it generate a little test suite, productivity has gone up a lot. I'm able to do fairly complex changes on this thing now and I end up with a working website every time. It doesn't stop until tests pass. It doesn't always do the right thing in one go but I usually get there in a few attempts. It takes a few seconds to run the tests. They prove that the site still builds and runs, things don't 404, and my tailwind styling survives the build. I also have a few checks for link and assets not 404ing. So it doesn't hallucinate image links that don't exist. I made it generate all those tests too. I have a handful of skills in the repository outlining how/when to run stuff.
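For reference, the "things don't 404" part of such a suite can be as small as walking the generated HTML and checking that every local href/src resolves to a file in the output directory. A rough sketch, assuming Hugo's default public/ output (this is not the commenter's actual test suite):

  // Minimal local-link check over a built Hugo site.
  const fs = require("fs");
  const path = require("path");

  function htmlFiles(dir) {
    return fs.readdirSync(dir, { withFileTypes: true }).flatMap((e) =>
      e.isDirectory() ? htmlFiles(path.join(dir, e.name))
        : e.name.endsWith(".html") ? [path.join(dir, e.name)] : []);
  }

  const out = "public";
  const broken = [];
  for (const file of htmlFiles(out)) {
    const html = fs.readFileSync(file, "utf8");
    // Only root-relative links/assets are checked; external URLs are ignored.
    for (const m of html.matchAll(/(?:href|src)="(\/[^"#?]+)"/g)) {
      const target = path.join(out, m[1]);
      // A link is fine if it points at a file, or at a directory containing index.html.
      if (!fs.existsSync(target) && !fs.existsSync(path.join(target, "index.html"))) {
        broken.push(`${file}: ${m[1]}`);
      }
    }
  }
  if (broken.length) {
    console.error("Broken internal links:\n" + broken.join("\n"));
    process.exit(1);
  }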
I did some major surgery on this website. I made it do a migration from Tailwind 3 to 4. I added a search feature using fuse.js and made it implement reciprocal rank fusion for that to get better ranking. Then I decided to consolidate all the javascript snippets and CDN links into a vite/typescript build. Each of these tasks was completed with pretty high-level prompts. Basically, technical debt just melts away if you focus it on addressing that. It won't do any of this by itself unless you tell it to. A lot depends on your input and direction. But if you get structured, this stuff is super useful.
I’ve been wondering why I can’t use it to generate electricity.