I don't understand the point of automating note taking. It never worked for me to copy paste text into my notes and now you can 100x that?
The whole point of taking notes for me is to read a source critically, fit it in my mental model, and then document that. Then sometimes I look it up for the details. But for me the shaping of the mental model is what counts
I thought this was parody at first as well for a redundant useless product as it was named after the redundant useless product of the same name from The Office (Wuphf.com)
First of all, this is more than just note taking. It appears to be a (yet another) harness for coordinating work between agents with minimal human intervention. And as such, shouldn’t part of the point be to not have to build that mental model yourself, but rather offload it to the shared LLM “brain”?
Highly debatable whether it’s possible to create anything truly valuable (valuable for the owner of the product that is) with this approach, though. I’m not convinced that it will ever be possible to create valuable products from just a prompt and an agent harness. At that point, the product itself can be (re)created by anyone, product development has been commodified, and the only thing of value is tokens.
My hypothesis is that “do things that don’t scale”[0] will still apply well into the future, but the “things that don’t scale” will change.
All that said, I’ve finally started using Obsidian after setting up some skills for note taking, researching, linking, splitting, and restructuring the knowledge base. I’ve never been able to spend time on keeping it structured, but I now have a digital secretary that can do all of the work I’m too lazy to do. I can just jot down random thoughts and ideas, and the agent helps me structure it, ask follow-up questions, relate it to other ongoing work, and so on. I’m still putting in the work of reading sources and building a mental model, but I’m also getting high-quality notes almost for free.
If you think of an agent harness as a tool which you use to build your product, then I think you might be absolutely right. I don't see it being easy for a harness to ever build a product.
I actually think that the harnesses which do end up building products, the harness will be the product.
As an example, I have a harness which I have my entire team use consistently. The harness is designed for one thing: to get the results I get with less nuanced understanding of why I get it.
Mind you, most of my team members are non-technical, or at least would be considered non-technical, two years ago.
These days, I spend most of my time fine-tuning the harness. What that gives me is a team which is producing at 5x their capacity from three months ago, and I get easier to review, more robust pull requests that I have more confidence in merging.
It's still a far cry from automating the entire process. I still think humans need to give the outcomes to even the harnesses to produce the results.
I think your take is right. This isn't going to help with the internalization of knowledge that note taking will get you. I do think that there is some value in the way we've set up blueprints of agents if you haven't set up a business before to either teach about role functions in a business or get a head start on business that doesn't create something new. At the very least it's a quick setup to getting to experiment.
To the part about note taking (and disclosure) - we are working on a context graph product that lessens the work of reading sources, especially over time and breath to help with a lot of the structure you've mentioned.
In my over a decade of experience as a software engineer, writing code was always a smaller fraction of my time compared to reading code, debating code with colleagues, and wrangling ops. Optimizing for velocity of writing code will inevity lead to spaghetti at best and vaporware at worst.
There is also discussion, ping pong with the agent, exploring parallel paths, quickly experimenting, analyzing code, researching things. A code agent can do more than "write me as much code as possible, go!".
The few scientific studies out there actually show a degradation of output quality when these markdown collections are fully LLM maintained (opposed to an increase when they’re human maintained), which I found fascinating.
I think the sweet spot is human curation of these documents, but unsupervised management is never the answer, especially if you don’t consciously think about debt / drift in these.
Are you referring to the one (1) study that showed that when cheaper LLM's auto-generated an AGENTS.md, it performed more poorly than human editted AGENTS.md? https://arxiv.org/abs/2602.11988
I'd love to see other sources that seek to academically understand how LLM's use context, specifically ones using modern frontier models.
My takeaway from these CLAUDE.md/AGENTS.md efforts isn't that agents can't maintain any form of context at all, rather, that bloated CLAUDE.md files filled with data that agents can gather on the spot very quickly are counter-productive.
For information which cannot be gathered on the spot quickly, clearly (to me) context helps improve quality, and in my experience, having AI summarize some key information in a thread and write to a file, and organize that, has been helpful and useful.
i've been running a variation of the _llm writes a wiki_ since late february. i run it on a sprite (sprites.dev from fly.io), it's public but i don't particularly advertise it. i completely vibe coded the shit out of it with claude. the app side and the content. the app side makes the content accessible to other agent instances, lists some documents at the root, provides search function, and let's me read it on a browser with nice typography if i want to, as opposed to raw markdown.
it's neat, i can create a new sprite/whatever, point claude at the root, and tell it to setup zswap and it will know exactly how to do so in that environment. if something changes, and there's some fiddling to make it work, i can ask it to write a report and send it in to fold into the existing docs.
Man, there's no point in replying. You are argueing with a non-human therefore the conversation is without meaning and impact and thus a waste of time and energy.
Put AI in your product name, make billion dollars. Put Karpathy in your blog article, get hired by Anthropic as Principal engineer. Milk money as long as fad last. No one is thinking about customer needs, everyone is trying to wash hands in the wave as it last.
Just like NFTs, just like the blockchain before that, in some ways kind of like the Web 2.0 craze (though we at least built some things then and the tight financing at the time kept a lid on it).
This LLM stuff at least has some real possibilities and value, and is very fun tech to learn about and play with.
I long ago accepted that there’s money to be made, as long as it’s not unethical, then get involved. Can build cool things that do have value, while enjoying the VC/PE money sloshing around
Hey man, if it works it works. There's a reason everyone is creating AI tools. We're all buying them. I'm still waiting for someone to make a world-class cli harness that can replace Claude Code but solves the memory and design problem. Web design is still a nightmare with LLMs.
The space of self building artefacts is interesting and is booming now because recent LLM versions are becoming good at it fast (in particular if they are of the "coding" kind).
I've also experimented recently with such a project [0] with minimal dependencies and with some emphasis on staying local and in control of the agent.
It's building and organising its own sqlite database to fulfil a long running task given in a prompt while having access to a local wikipedia copy for source data.
A very minimal set of harness and tools to experiment with agent drift.
Adding image processing tool in this framework is also easy (by encoding them as base64 (details can be vibecoded by local LLMs) and passing them to llama.cpp ).
It's a useful versatile tool to have.
For example, I used to have some scripts which processed invoices and receipts in some folders, extracting amount date and vendor from them using amazon textract, then I have a ui to manually check the numbers and put the result in some csv for the accountant every year. Now I can replace the amazon textract requests by a llama.cpp model call with the appropriate prompt while still my existing invoices tools, but now with a prompt I can do a lot more creative accounting.
I have also experimented with some vibecoded variation of this code to drive a physical robot from a sequence of camera images and while it does move and reach the target in the simple cases (even though the LLM I use was never explicitly train to drive a robot), it is too slow (10s to choose the next action) for practical use. (The current no deep-learning controller I use for this robot does the vision processing loop at 20hz).
LLM models and the agents that use them are probabilistic, not deterministic. They accomplish something a percentage of the time, never every time.
That means the longer an agent runs on a task, the more likely it will fail the task. Running agents like this will always fail and burn a ton of token cash in the process.
One thing that LLM agents are good at is writing their own instructions. The trick is to limit the time and thinking steps in a thinking model then evaluate, update, and run again. A good metaphor is that agents trip. Don't let them run long enough to trip. It is better to let them run twice for 5 minutes than once for 10 minutes.
Give it a few weeks and self-referencing agents are going to be at the top of everybody's twitter feed.
The BM25-first routing bet is interesting. You mention 85% recall@20 on 500 artifacts, but the heuristic classifier routing "short lookups to BM25 and narrative queries to cited-answer" raises a practical question: what does the classifier key on to decide a query is narrative vs short? Token count? Syntactic structure? The reason I ask is that in agent-generated queries, the boundary is often blurry - an agent doing a dependency lookup might issue a surprisingly long, well-formed sentence. If the classifier routes those to the more expensive cited-answer loop it could negate the latency advantage of BM25 being first.
Re classifier routing: text-shape signals (token count, syntactic markers) underspecify the boundary, especially for agent-generated queries. The signal that worked better in our policy-gated tool-call setting was the surrounding intent context the agent was operating under, not the query string itself. An agent in a "fact-check" context emits long, well-formed sentences that actually want exact-match retrieval; an agent in an "open research" context emits surprisingly short queries that need narrative retrieval. If the runtime can read the tool or skill context at query time, routing on that is less ambiguous than text shape. Doesn't help if the wiki is a black-box MCP server with no caller-side context, but it's worth offering an optional context hint in the lookup payload.
How do you anticipate teams deploying this? I’m wary of GitHub for sensitive business documents, and wonder what an easy secure agent friendly deployment looks like. Cloudflare or GCP are maybe good candidates
Right now this is setup to be run on your machine. Git is used to do versioning but we don't push that to GitHub, nor do we keep any insight into what you have or what you're doing.
If there is long term value people are getting out of Wuphf we'll be happy to build out a hosted business/enterprise compliant version.
Any particular reason for BM25? Why not just a table of contents or index structure (json, md, whatever) that is updated automatically and fed in context at query time? I know bag of words is great for speed but even at 1000s of documents, the index can be quite cheap and will maximise precision
do you want to pollute the context with blurbs for docs in disparate topics? cascade filtering, even with naïve bm25, helps reduce the amount of _noise_ that's pushed into the context window. if we reduce the amount of results to consider, further filtering or reranking, with more expensive options, becomes realistic. one could even put a cheaper model in front to further clean the results.
I read the durability thing as markdown files are very open, easy to find software for, simple and are widely used. All of this together almost guarantees that they will he viewable/usable in the far future.
Even though I did not know about Andrej Karpathy's tweet from earlier this month, I ended up converging on something very similar.
A couple of weeks ago I built a git-based knowledge base designed to run agents prompts on top of it.
I connected our company's ticketing system, wiki, GitHub, jenkins, etc, and spent several hours effectively "onboarding" the AI (I used Claude Opus 4.6). I explained where to find company policies, how developers work, how the build system operates, and how different projects relate to each other.
In practice, I treated it like onboarding a new engineer: I fed it a lot of context and had it organize everything into AI-friendly documentation (including an AGENTS.md). I barely wrote anything myself, mostly I just instructed the AI to write and update the files, while I guided the overall structure and refactored as needed.
The result was a git-based knowledge base that agents could operate on directly. Since the agent had access to multiple parts of the company, I could give high-level prompts like: investigate this bug (with not much context), produce a root cause analysis, open a ticket, fix it, and verify a build on Jenkins. I did not even need to have the repos locally, the AI would figure it out, clone them, analyze, create branches using our company policy, etc...
For me, this ended up working as a multi-project coordination layer across the company, and it worked much better than I expected.
It wasn't all smooth, though. When the AI failed at a task, I had to step in, provide more context, and let it update the documentation itself. But through incremental iterations, each failure improved the system, and its capabilities compounded very quickly.
I remember the personal wiki was a bit of trend 5 years ago but it kind of died because it had an unclear purpose for the most part. I kept one but never really referred to any of the notes and then just went back to a paper and to do list. I’m sure this is useful for those who kept up the habit.
We have been using it as a sounding board. I think that in its current state it's actually more useful for someone to learn about how to run a business - "what does a CEO vs PM do" and/or learn about the pros/cons of running a bunch of agents at once.
The problem with your comment is that the word "real" is just there to move the goalposts. There are people building high-quality stuff like this, yes.
I built a tiny utility like this that works very well yesterday: