1. Curate a large representative subsample of tweets.
2. Feed all of them to an LLM in a single call with a prompt along the lines of "generate N unique labels and their descriptions for the tweets provided". This bounds the problem space.
3. Feed each tweet to an LLM along with the prompt "Here are labels and their corresponding descriptions: classify this tweet with up to X of those labels". This creates a synthetic dataset for training.
4. Encode each tweet as a vector as normal.
5. Then train a bespoke small model (e.g. an MLP) on the tweet embeddings to create a multilabel classification model that predicts, for each label, the probability that it applies.
The small MLP will be super fast and cost effectively nothing above what it takes to create the embedding. It saves the time and cost of performing a vector search or even maintaining a live vector database.
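To make step 5 concrete, here is a minimal sketch of a multilabel MLP in plain Python. It assumes the embeddings are already computed (the 2-d vectors below are made up), and the `TinyMLP` class, hidden size and learning rate are illustrative choices, not anything from the original post. In practice you would probably reach for a library instead.

```python
import math
import random

def sigmoid(x: float) -> float:
    # Clamp the input so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """One hidden ReLU layer and one sigmoid output per label (multilabel)."""

    def __init__(self, n_in, n_hidden, n_labels, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_labels)]
        self.b2 = [0.0] * n_labels

    def _forward(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
             for row, b in zip(self.w2, self.b2)]
        return h, p

    def predict_proba(self, x):
        """Return one independent probability per label."""
        return self._forward(x)[1]

    def train_step(self, x, y, lr=0.1):
        h, p = self._forward(x)
        # For sigmoid + binary cross-entropy, the gradient at each logit is (p - y).
        d_out = [pi - yi for pi, yi in zip(p, y)]
        # Backpropagate through the ReLU (gradient is 1 where the unit was active).
        d_hid = [
            sum(d_out[k] * self.w2[k][j] for k in range(len(d_out))) if h[j] > 0 else 0.0
            for j in range(len(h))
        ]
        for k, dk in enumerate(d_out):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * dk * hj
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj

# Toy stand-ins for tweet embeddings and their LLM-assigned label vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]

model = TinyMLP(n_in=2, n_hidden=8, n_labels=2)
for _ in range(500):
    for x, y in zip(X, Y):
        model.train_step(x, y)

probs = model.predict_proba([1.0, 0.0])  # one probability per label
```

Each output is an independent sigmoid rather than a softmax, which is what lets a tweet carry several labels at once.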
That way you can effectively handle open sets and train a more accurate MLP model.
With your approach I don't think you can get a representative list of N tweets which covers all possible categories. Even if you did, the LLM would be subject to context rot and token limits.
Just using embeddings, you can get really good classifiers very cheaply.
You can use small embedding models too, and can engineer different features to be embedded as well.
Additionally, with email at least, depending on the categories you need, you only need about 50-100 examples for 95-100% accuracy
And if you build a simple CLI tool to fetch/label emails, it’s pretty easy/fast to get the data
How big should my sample be to be representative? It’s a fairly large list of docs across several products and deployment options. I wanted to pick a number of docs per product. Maybe I’ll skip steps 4/5, as I only need to repeat it occasionally once I’ve labelled everything.
For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:
a) ensuring that each label has a few samples
b) at least N^2 data points total for N labels to avoid issues akin to the curse of dimensionality
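As a rough sketch, the two rules of thumb combine into a single lower bound. The function name and the default of 3 samples per label (my reading of "a few") are assumptions for illustration only.

```python
def min_dataset_size(n_labels: int, min_per_label: int = 3) -> int:
    """Lower bound on training examples implied by the two rules of thumb."""
    per_label_floor = n_labels * min_per_label  # rule (a): a few samples per label
    quadratic_floor = n_labels ** 2             # rule (b): at least N^2 points total
    return max(per_label_floor, quadratic_floor)

print(min_dataset_size(10))  # 100
print(min_dataset_size(2))   # 6
```

With multilabel data one example can count toward several labels, so treat the per-label floor as approximate.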
The OP has 6k labels and discusses time + cost, but what I found is:
- a small, good enough locally hosted embedding model can be faster than OpenAI's embedding models (provided you have a fast GPU available), and it doesn't cost anything
- for just 6k labels you don't need Pinecone at all; with Python it took me only a couple of seconds to do all the calculations in memory
For classification + embedding you can use locally hosted models; it's not a particularly complex task that requires huge models or huge GPUs. If you plan to do such classification tasks regularly, you can make a one-time investment (buy a GPU) and then you'll be able to run many experiments with your data without having to think about costs anymore.
This is sensitive to the initial candidate set of labels that the LLM generates.
Meaning if you ran this a few times over the same corpus, you’ll probably get different performance depending on the order in which you input the data and the classification tags the LLM ultimately decided upon.
Here’s an idea that is order invariant: embed first, take samples from clusters, and ask the LLM to label the 5 or so samples you’ve taken. The clusters are serving as soft candidate labels and the LLM turns them into actual interpretable explicit labels.
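A minimal sketch of that cluster-then-label idea, assuming the embeddings already exist. The plain k-means, the helper names and the prompt wording are all illustrative, and the actual LLM call is left out. Note that k-means depends on its initialization, so a fixed seed is used here.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns a cluster index for every vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def sample_per_cluster(texts, assign, n=5, seed=0):
    """Pick up to n texts from each cluster to show to the LLM."""
    rng = random.Random(seed)
    clusters = {}
    for text, c in zip(texts, assign):
        clusters.setdefault(c, []).append(text)
    return {c: rng.sample(members, min(n, len(members))) for c, members in clusters.items()}

def labeling_prompt(samples):
    # Prompt wording is illustrative only.
    joined = "\n".join(f"- {s}" for s in samples)
    return f"Give one short label that describes all of these texts:\n{joined}"

texts = ["cat pics", "dog pics", "gpu prices", "new cpu launch"]
vectors = [[1.0, 0.1], [0.8, 0.0], [0.0, 1.0], [0.1, 0.7]]  # made-up 2-d embeddings
assign = kmeans(vectors, k=2)
samples = sample_per_cluster(texts, assign)
prompts = [labeling_prompt(s) for s in samples.values()]
```

Each prompt would then go to the LLM once per cluster, so the number of LLM calls scales with the number of clusters rather than the number of texts.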
Reference: https://blog.invidelabs.com/how-invide-analyzes-deep-work/
I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote a description of each category, then converted the descriptions into embeddings.
Then I created embeddings for the call notes and matched each one to the closest category using cosine_similarity.
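Roughly, the matching step looks like the sketch below. The vectors are made-up stand-ins for real embeddings, and `closest_category` is a hypothetical helper name, not the script described above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_category(note_vec, category_vecs):
    """Return the category whose description embedding is most similar."""
    return max(category_vecs, key=lambda name: cosine_similarity(note_vec, category_vecs[name]))

# Made-up embeddings of two category descriptions and one call note.
category_vecs = {"billing": [0.9, 0.1, 0.0], "outage": [0.0, 0.2, 0.9]}
note_vec = [0.8, 0.3, 0.1]

print(closest_category(note_vec, category_vecs))  # billing
```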
If your categories are dynamic, the way OP handles it will be much cheaper as the number of tweets (or customer service calls in your case) grows, as long as the cache hit rate is >0%. Each tweet gets its own label, e.g. "joke_about_bad_technology_choices". Each of these labels gets put into a category, e.g. "tech_jokes". If you add/remove a category you still need to recalculate, but only the label-to-category mapping rather than every single tweet. Since similar tweets can share the same label, you end up with fewer labels than tweets. As the number of labels approaches the asymptotic ceiling mentioned in OP's post, the cost of re-embedding labels to categories levels off as well.
If the number of items you're categorizing is a couple thousand at most and you rarely add/remove categories, it's probably not worth the complexity. But in my case (and OP's) it's worth it, as the number of items grows without bound.
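Here is a small sketch of that two-level setup, with made-up vectors and hypothetical names (`assign_categories`, the `news` and `product_updates` categories). The point is that changing the category list only reruns the label-to-category step.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_categories(label_vecs, category_vecs):
    """Map every label to its nearest category; tweets are never touched."""
    return {
        label: max(category_vecs, key=lambda c: cosine(vec, category_vecs[c]))
        for label, vec in label_vecs.items()
    }

# Level 1: tweet -> label (computed once per tweet, shared by similar tweets).
tweet_labels = {
    "t1": "joke_about_bad_technology_choices",
    "t2": "joke_about_bad_technology_choices",
    "t3": "release_announcement",
}
# Made-up embeddings for labels and categories.
label_vecs = {
    "joke_about_bad_technology_choices": [0.9, 0.1],
    "release_announcement": [0.1, 0.9],
}
category_vecs = {"tech_jokes": [1.0, 0.0], "news": [0.0, 1.0]}

# Level 2: label -> category.
label_to_category = assign_categories(label_vecs, category_vecs)

# Adding a category reruns only level 2 (2 labels here, not 3 tweets).
category_vecs["product_updates"] = [0.2, 1.0]
label_to_category = assign_categories(label_vecs, category_vecs)

tweet_categories = {t: label_to_category[l] for t, l in tweet_labels.items()}
```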
[1] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
[2] https://huggingface.co/BAAI/bge-m3
In my recent project I used OpenAI's embedding model for that because of its convenient API and low cost.
Formatting the input text to have a consistent schema is optional but recommended to get better comparisons between vectors.
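One possible way to do that is to render every record through the same template before embedding it. The field names below are a hypothetical schema, not anything specific to a particular embedding model.

```python
def format_for_embedding(record: dict) -> str:
    """Render a record with a fixed field order so vectors are comparable."""
    fields = ("title", "author", "body")  # hypothetical schema
    return "\n".join(f"{name}: {str(record.get(name, '')).strip()}" for name in fields)

print(format_for_embedding({"body": "  hello world ", "title": "Greeting"}))
```

Missing fields come out as empty values rather than being dropped, so every input has the same shape.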
- Fetch a list of my unique tags to get a sense of my topics of interests
- Have the AI dig into those specific niches to see what people have been discussing lately
- Craft a few random tweets that are topic-relevant and present them to me to curate
This is a very powerful workflow that is hard to deliver on without the class labels.
There's certainly more tweaking that needs to be done but I've been pretty happy with the results so far.
1: jesterengine.com
Would it be any better if you sent a list of existing tags with each new text to the LLM, and asked it to classify to one of them or generate a new tag? Possibly even skipping embeddings and vector search altogether.
I was thinking of giving the LLM a tool `(query: string) => string[]` to retrieve a list of matching labels to check if they already exist.
But the above approach sounds similar to OP, where they use embeddings to achieve that.
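For what it's worth, such a lookup tool could be as simple as the sketch below. It uses plain word overlap instead of embeddings, and `find_labels` along with its parameters is a hypothetical name; how it gets registered as a tool depends on the LLM API you use.

```python
def find_labels(query: str, existing: list[str], limit: int = 5) -> list[str]:
    """Return existing labels that share at least one word with the query."""
    query_words = set(query.lower().replace("_", " ").split())
    scored = []
    for label in existing:
        overlap = len(query_words & set(label.lower().split("_")))
        if overlap:
            scored.append((overlap, label))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:limit]]

existing = ["tech_jokes", "cooking_tips", "tech_news"]
print(find_labels("jokes about tech", existing))  # ['tech_jokes', 'tech_news']
```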
I actually built a project for tagging posts exactly the way you described.
So the cache check tries to find if a previously existing text embedding has >0.8 match with the current text.
If you get a cache hit here, iiuc, you return the matched text's label right away. But do you also insert a text embedding of the current text in the text embeddings table? Or do you only insert it in case of a cache miss?
From reading the GitHub readme it seems you only "store text embedding for future lookups" in the case of a cache miss. Is this by design, to keep the text embedding table from getting too big?
For instance: I love McDonalds (1). I love burgers. (0.99) I love cheeseburgers with ketchup (?).
This is a bad example but in this case the last text could end up right at the boundary of the similarity to that 1st label if we did not store the 2nd, which could cause a cluster miss we don't want.
We only store the text on cache misses, though you could do both. I had not considered that idea but it makes sense. I'm not very concerned about the dataset size because vector storage is generally cheap (~ $2/mo for 1M vectors) and the savings in $$$ not spent generating tokens covers that expense generously.
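To illustrate the two options, here is a toy in-memory version of the cache check. The `LabelCache` class and its `store_on_hit` flag are hypothetical names for this sketch, not the project's actual code, and the 0.8 threshold is the one mentioned earlier in the thread.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class LabelCache:
    def __init__(self, threshold=0.8, store_on_hit=False):
        self.threshold = threshold
        self.store_on_hit = store_on_hit
        self.entries = []  # list of (embedding, label) pairs

    def lookup(self, vec, make_label):
        """Return a cached label if a stored embedding is similar enough."""
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) > self.threshold:
            if self.store_on_hit:
                # Optionally remember the new text too, widening the cluster.
                self.entries.append((vec, best[1]))
            return best[1]
        label = make_label()  # cache miss: this is where the LLM call would go
        self.entries.append((vec, label))
        return label

cache = LabelCache()
first = cache.lookup([1.0, 0.0], lambda: "fast_food")   # miss, label is stored
second = cache.lookup([0.99, 0.1], lambda: "unused")    # hit, reuses "fast_food"
```

With `store_on_hit=True` the second vector would also be stored, which is the variant that helps texts near the similarity boundary at the cost of a larger table.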
The idea is also that this would be a classification system used in production whereby you classify data as it comes, so the "rolling labels" problem still exists there.
In my experience though, you can dramatically reduce unwanted bias by tuning your cosine similarity filter.