This model is trained on a custom dataset of 280k examples, then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task, in this specific domain, with this specific JSON output format.
This is a reasonable hobby project and an interesting approach to synthetic data generation, but not impressive research.
At minimum you should test your model on other benchmarks with similar tasks, e.g. DocBench.
It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.
I don't know how you would even begin to make the same kind of observation for ML models, but it seems possible. The 2010s weren't exactly building out "trivial" models, but next to the architectures and optimizations out now, yeah, those models look like toys by comparison.
My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.
The real issue is they tested on data in their training set. *
* Incorrect. Edit: I misread the parent comment.
It's worth pointing out that that's technically not testing on the training set, but looking at how similar the examples in the dataset are, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.
The weights may not be published because using it for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this finetune performs outside of the synthetic data.
Hm, no.
They trained on a part of their synthetic set and tested on another part of the set. Or at least that's what they said they did:
> from which 1,000 were held out as a benchmark test set.
Emphasis mine.
> generation of 281,128 augmented examples
All examples are already correlated because they are generated in the same way.
All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…
The real question is whether or not the examples are representative of the broad “document information extraction” use-case.
During SFT, it uses the full training dataset[1]:
df = pd.read_csv('data/extraction_training_data.csv')
And during the evaluation, it uses the middle part of the same dataset[2]:
df = pd.read_csv('data/extraction_training_data.csv')
df = df[100000:100000+NUM_TEST_SAMPLES]
Also, you split train/test/val by chunk and not by document[3]. So the model "has seen" the documents that you're using to evaluate it (even if you're not evaluating it on the same chunks); a document-level split, sketched after the links below, would avoid that.
[1]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...
[2]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...
[3]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...
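For contrast, a minimal sketch of a document-level split that would avoid this kind of leakage. It assumes the CSV exposes something like a `document_id` column, which is an assumption about the schema rather than the repo's actual layout:

    import pandas as pd

    # Hypothetical sketch: split by document so no document appears in both
    # train and test. Assumes a `document_id` column exists in the CSV.
    df = pd.read_csv('data/extraction_training_data.csv')

    # Shuffle the unique document ids, then hold out ~5% of documents.
    doc_ids = df['document_id'].drop_duplicates().sample(frac=1.0, random_state=42)
    n_test_docs = int(0.05 * len(doc_ids))
    test_docs = set(doc_ids.iloc[:n_test_docs])

    test_df = df[df['document_id'].isin(test_docs)]    # chunks from unseen documents only
    train_df = df[~df['document_id'].isin(test_docs)]  # everything else

    train_df.to_csv('data/train_split.csv', index=False)
    test_df.to_csv('data/test_split.csv', index=False)

With this kind of split, every chunk in the test set comes from a document the model never saw during SFT, which is what the reported benchmark numbers implicitly claim.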
Everything past GPT-5 has been ... fine. It's better at chat (sort of, depending on your tone preference) and way better at coding/tool use. In our product (plan out a migration with AI), they've gotten worse, because they want to chat or code. I'd have expected the coding knowledge to generalize, but no! Claude especially really wants to change our code or explain the existing plan to me.
We're getting around it with examples and dynamic prompts, but it's pretty clear that fine-tuning is in our future. I suspect most of the broad-based AI success is going to look like that in the next couple years.
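As a rough illustration of the "examples and dynamic prompts" workaround (everything here is a hypothetical sketch, not our actual product code): assemble the prompt per request from task-specific few-shot examples so the model stays on the planning task instead of drifting into chat or code edits.

    # Hypothetical sketch of dynamic prompt assembly. Task names, examples,
    # and structure are made up for illustration.
    FEW_SHOT = {
        "plan_migration": [
            {"input": "Move auth from sessions to JWT.",
             "output": "1. Inventory session usage\n2. Add JWT issuance\n3. Dual-run\n4. Cut over"},
        ],
        "review_plan": [
            {"input": "Is step 3 safe to skip?",
             "output": "No: dual-running is what de-risks the cutover."},
        ],
    }

    def build_prompt(task: str, user_request: str) -> str:
        parts = ["You are a migration planner. Output a plan only; do not edit code."]
        for ex in FEW_SHOT.get(task, []):
            parts.append(f"Request: {ex['input']}\nPlan: {ex['output']}")
        parts.append(f"Request: {user_request}\nPlan:")
        return "\n\n".join(parts)

    print(build_prompt("plan_migration", "Migrate our Postgres 12 cluster to 16."))

It works, but every new failure mode means another example or another prompt branch, which is why fine-tuning starts to look attractive.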
They trained on synthetic extractions like "extract equations from arXiv papers" and "extract regulatory information from FDA documents," then tested on more synthetic extractions from the same sources. Essentially, "model trained on synthetic arXiv/PubMed/FDA extractions performs better on more synthetic arXiv/PubMed/FDA extractions than a model that never saw this distribution."
I'd like to see how it handles extraction from a real contract, a low-quality scan of a financial document, or a format it didn't see in training. o3 very likely handles these variations better, but we don't have the data to compare.
We need the model weights or tests on standard benchmarks to verify if this generalizes beyond documents that look like the training distribution.
But it's informative for engineers who need something right now, because it means taking the best general-purpose tool and specializing it will outperform the general tool, and you can sustain that if you are willing to keep hopping tools and respecializing. As we may.
Sure, you can throw more compute at it. But it costs a lot of money, and you hit resource limits.
We have been doing an end run around the bitter lesson with prompt engineering, by using different models for vision vs. text, and by getting (human coding) agents to "think" and run code.
The bitter lesson might be that you can't predict what will be most optimal tomorrow, and any player in the AI game can be innovated out of existence at any time.
Maybe anyone except TSMC.
This is something I've wanted for a year or two: a model that is, say, genuinely really good at SvelteKit, instead of a model that is decent at a lot of different things.
A model for SvelteKit, a model for React, one for general-purpose coding too, and preferably a website that makes it easy to find and run these models. Ollama comes to mind, but it has enshittified a bit since I was first thinking about this, so maybe a little competition on that side wouldn't hurt.
https://github.com/herniqeu/extract0
To quote Mulder: I want to believe.
There is so much research showing you can beat frontier models with very little investment. It's confusing that the industry at large hasn't caught up with that.
You need some serious resources to do this properly; think of IBM's Granite Docling model.
For LLMs: finetuning makes sense for light style adjustments with large models (e.g. customizing a chat assistant to sound a certain way) or for teaching some simple transformations (e.g. a new output format). You can get away with 100-1000 samples.
If you want to teach new behaviour you need a lot of data, likely too much to justify the investment for your average ChatGPT-wrapper AI company. The pragmatic choice is often to just prompt engineer, and maybe split your task and combine multiple prompts.
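As a concrete sketch of the "100-1000 samples for a new output format" case: most of the work is just shaping the pairs into a chat-style SFT file. File names and field names below are illustrative assumptions, not tied to any particular stack:

    import json

    # Hypothetical sketch: turn a few hundred (document, target fields) pairs
    # into a chat-format JSONL file that common SFT tooling accepts.
    SYSTEM = "Extract the fields and answer with JSON only."

    with open("samples.jsonl") as src, open("sft_train.jsonl", "w") as dst:
        for line in src:
            pair = json.loads(line)  # e.g. {"document": "...", "fields": {...}}
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": pair["document"]},
                    {"role": "assistant", "content": json.dumps(pair["fields"])},
                ]
            }
            dst.write(json.dumps(record) + "\n")

Anything beyond that, new reasoning or new behaviour rather than a new surface format, is where the data requirements blow up.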
Can't seem to see it on the arXiv site.
Open-Source style small players will actually solve problems with AI.
And the big money invested things are going to do stupid pointless bubbly things at best, or enshittify other good things at worst.
Govern yourselves accordingly.
OpenAI just announced something like $5bn in revenue for half a year, with $13bn projected by the end of the year. Doesn't seem so pointless now, does it?
Did you miss when I said "bubble"? Sigh, y'all are not serious.
I guess this is a small step forward, if nothing else, toward the day when I can actually teach a model something in situ on my personal machine (notice I said machine, not "machines") in a very short amount of time. I feel that until then, LLMs and similar technologies won't be maximally helpful. They're very useful, but not maximally helpful.