In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal models have lets them infer what the document actually meant: "this must be the order ID, because in the million invoices I've seen before, the order ID normally sits below the order date", and so on. My worry is that the same issue will show up in ASR too.
With OCR the risk is you get another xerox[1] incident where all your data looks plausible but is incorrect. Hope you kept the originals!
(This is why for my personal doc scans, I use OCR only for full text search, but retain the original raw scans forever)
[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, then when you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.
BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe
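To make the Caitlin/Kaitlyn idea concrete, here's a small sketch of assembling that kind of context prompt. The `build_context_prompt` helper and the contact list are my own illustration, not part of the tool above; the comment shows where the string would be handed to a transcription call.

```python
# Hypothetical helper: build a context string from a name/role list so a
# prompt-aware transcription model can disambiguate homophones like
# "Caitlin" vs. "Kaitlyn".
def build_context_prompt(contacts):
    """contacts: list of (name, role) tuples -> prompt string."""
    people = ", ".join(f"{name} ({role})" for name, role in contacts)
    return f"Names that may come up: {people}."

prompt = build_context_prompt([("Caitlin", "accountant"), ("Kaitlyn", "engineer")])
# You'd then pass `prompt` along with the audio to the transcription API
# (e.g. as the `prompt` field of a gpt-4o-transcribe request).
```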
Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow "unclear" output that is penalised less than an actually incorrect answer. I'd be surprised if nobody has done that.
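One way to sketch that tweak: run the usual word-level edit distance, but charge a reduced cost when the hypothesis emits an explicit `[unclear]` token instead of a confident wrong guess. The function, token name, and 0.5 cost below are all assumptions for illustration, not an established metric.

```python
def lenient_wer(ref, hyp, unclear="[unclear]", unclear_cost=0.5):
    """WER variant where admitting uncertainty (emitting `unclear`) is
    penalised less than a normal substitution or insertion."""
    r, h = ref.split(), hyp.split()
    # Weighted edit-distance DP table over words.
    d = [[0.0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        d[i][0] = float(i)  # deletions
    for j in range(1, len(h) + 1):
        d[0][j] = d[0][j - 1] + (unclear_cost if h[j - 1] == unclear else 1.0)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                sub = 0.0
            elif h[j - 1] == unclear:
                sub = unclear_cost  # saying "unclear" is cheaper than guessing wrong
            else:
                sub = 1.0
            ins = unclear_cost if h[j - 1] == unclear else 1.0
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitute
                          d[i - 1][j] + 1.0,      # word dropped from hypothesis
                          d[i][j - 1] + ins)      # word inserted into hypothesis
    return d[len(r)][len(h)] / max(len(r), 1)
```

With this scoring, "tell [unclear] to review my pr" beats "tell caitlin to review my pr" against the reference "tell kaitlyn to review my pr", which matches the intuition in the comment above.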
>Timestamps/Speaker diarization. The model does not feature either of these.
What a shame. Is whisperx still the best choice if you want timestamps/diarization?
My experiences with Google's Chirp have been horrendous: it sometimes skips sections of speech entirely, hallucinates speech where the audio contains noise, and produces unreliable word-level timestamps. And all of this is even with their new audio prefiltering feature enabled.
AWS works slightly better, but also has trouble with keeping word level timestamps in sync.
Whisper is nice but hallucinates regularly.
OpenAI’s new transcription models are delivering accurate output but do not support word level timestamps…
A lot of this could be worked around by sending the resulting transcripts through a few layers of post processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.
See the very bottom of the page for a transcription with timestamps.
It doesn't use an extra model (so it supports every language that works with Whisper out of the box and uses less memory); it works by applying Dynamic Time Warping to the cross-attention weights.
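For anyone unfamiliar with the trick: DTW finds the cheapest monotonic path through a (tokens × audio frames) cost matrix, where the cost comes from negated cross-attention weights; the path then tells you which frames each token aligns to, i.e. its timestamp. A minimal, simplified DTW sketch (my own illustration, not the actual implementation):

```python
import numpy as np

def dtw_path(cost):
    """Minimal Dynamic Time Warping: cheapest monotonic path through a
    (tokens x frames) cost matrix, as used to turn cross-attention
    weights into per-token timestamps."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    # Backtrack from the end to recover the (token, frame) alignment.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Because it's just a path search over attention the model already computed, no second alignment model is needed, which is where the memory and language-coverage benefits come from.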
So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All others performed badly for non native accents. The worst were whisper-based models because they hallucinate when they misunderstand and tend to come up with random phrases that have nothing to do with the topic.
It has the most crisp, steady P50 of any external service I've used in a long time.
My experience with Cohere and interacting with their sales engineers has been boring, and I say that in the most flattering way possible. Embeddings are a core service at this point, like VMs and DBs. They just need to work, and work well, and that's what they're selling.
And someone has already converted it to onnx format: https://huggingface.co/eschmidbauer/cohere-transcribe-03-202... - so it can be run on CPU instead of GPU.
This kind of makes sense because "compiling" (training) the model costs prohibitively much, and we can still benefit from the artifacts.
Accurate and fast model, very happy with it so far!
This is a good option. Will check it out.
Seems like it wouldn't be too difficult to find or create training code. So what you'd need is a pretty decent amount of high-quality training data (many hours of it), a few hours of high-end data-center GPU compute, and many iterations to get it right.