That's a fascinating claim, and it does not align with my anecdotal experience using the web for many years.
edit: ok, I bothered to look this up: Microsoft had a guy do a study on Nigerian scams, the guys who wrote Freakonomics did a sequel referencing that study and drew absurd, unfounded conclusions, which have been repeated over and over. Business as usual for the fig-leaf salesmen.
For fuel, Google results were 90% scams; for coffee machines, closer to 75%. The scams are fairly elaborate: they clone some legitimate-looking sites, then offer prices that are very competitive -- between 50% and 75% of market prices -- that put them on top of SEO. It's only by looking in detail at the contact information that some things look off (one common sign is that they may encourage bank transfers, since there's no buyer protection there, but that's not always the case).
A price at 75% of the market rate is not a crazy "too good to be true" thing, it's in the realm of what a legitimate business can do, and with the prices of the items being in the 1000s, any hooked victim is a good catch. A particular example was a website copying the one for a massive discount appliance store chain in the Netherlands. They had a close domain name, even though the website looked different, so any Google search linked it to the legitimate business.
You really have to apply a high level of scrutiny, or understand that Google is basically a scam registry.
Why did you change the subject to scams?
I believe that detecting whether an ad is clickbait is a similar problem -- not exactly the same, but it suffers from the same issues:
- it's not well defined at all.
- any heuristic is constantly gamed by bad actors
- it requires a deeper, contextual analysis of the content that is served
- content analysis requires a notion of what is reputable or reasonable
If I take an LLM's definition of "clickbait", I get "sensationalized, misleading, or exaggerated headlines"; so scams would be a subset of it (it is misleading content that you need to click through). They do not provide their definition though.
So you have Google products (both the Products search and the general search) that recommend scams at an incredible rate, where the stakes are much higher. Is it reasonable that they're able to solve the general problem? How can anyone verify such a claim, or trust it?
Given that I often see the same fraudulent ad repeated, my anecdotal experience is that there are not many of them.
I can even talk to friends about the most boring fraudulent ads and they know them, e.g. the "Elon doubles your bitcoin" scams.
For normal ads, unless they are viral, there are millions out there that are never repeated or not even seen.
Because fraud ads have short lifetimes and get pulled out of 'production traffic', you can collect many of them for the training data.
I assume 'clickbait' is the safety word for 'fraud'
Specifically, post training you measure those on a holdout set and then you slice the results based on features. While these models tend to be more complex and potentially less understandable, we feel the pros outweigh the cons.
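To make the "slice the results based on features" step concrete, here is a minimal sketch. It assumes each holdout row is a dict with a true `label`, a model `pred` (1 = positive class) and some feature columns; the `language` feature and the row layout are made up for illustration, not taken from any particular pipeline.

```python
from collections import defaultdict


def sliced_metrics(rows, slice_key):
    """Group holdout rows by one feature and compute precision/recall per slice."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for row in rows:
        c = counts[row[slice_key]]
        if row["pred"] == 1 and row["label"] == 1:
            c["tp"] += 1
        elif row["pred"] == 1:
            c["fp"] += 1
        elif row["label"] == 1:
            c["fn"] += 1
        else:
            c["tn"] += 1

    report = {}
    for value, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        report[value] = {
            "n": sum(c.values()),
            # None when the slice has no predicted (or actual) positives.
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
        }
    return report


# Hypothetical holdout rows, sliced by a made-up "language" feature.
holdout = [
    {"label": 1, "pred": 1, "language": "en"},
    {"label": 0, "pred": 1, "language": "en"},
    {"label": 1, "pred": 0, "language": "nl"},
    {"label": 0, "pred": 0, "language": "nl"},
    {"label": 1, "pred": 1, "language": "nl"},
]
report = sliced_metrics(holdout, "language")
for value, metrics in sorted(report.items()):
    print(value, metrics)
```

A slice whose precision or recall is far from the overall number is where you go looking first.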
Additionally, giving your end users access to a confidence score is really useful for getting them to trust the predictions, and in case there is a non-zero cost for acting on false positives/negatives, you can try to come up with a strategy that minimizes the expected cost.
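A rough sketch of what such a cost-minimizing strategy could look like, assuming the confidence score is a calibrated probability and there are only two actions. The cost values and function names are illustrative assumptions.

```python
def expected_cost(p_positive, action, cost_fp, cost_fn):
    """Expected cost of an action given the probability that the item is positive.

    Acting on a benign item costs cost_fp; ignoring a bad item costs cost_fn.
    Correct decisions are assumed to cost nothing.
    """
    if action == "act":
        return (1 - p_positive) * cost_fp
    return p_positive * cost_fn


def best_action(p_positive, cost_fp, cost_fn):
    """Pick whichever action has the lower expected cost."""
    return min(
        ("act", "ignore"),
        key=lambda a: expected_cost(p_positive, a, cost_fp, cost_fn),
    )


def break_even_threshold(cost_fp, cost_fn):
    """Probability above which acting is cheaper than ignoring."""
    return cost_fp / (cost_fp + cost_fn)


# Illustrative costs: a missed positive is 9x worse than a false alarm.
threshold = break_even_threshold(cost_fp=1.0, cost_fn=9.0)
print(threshold, best_action(0.2, 1.0, 9.0), best_action(0.05, 1.0, 9.0))
```

With asymmetric costs like these, the break-even point moves well away from 0.5, which is the main reason to expose the score rather than only the hard label.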
> To find the most informative examples, we separately cluster examples labeled clickbait and examples labeled benign, which yields some overlapping clusters
How can you get overlapping clusters if the two sets of labelled examples are disjoint?
What's disjoint are the training labels and the classifier's output - not the values in high-dimensional space. For classification tasks, there can be neighboring items in the same cluster but separated by the hyperplane - and therefore placed in different classes despite the proximity.
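A toy illustration of that point: cluster the two label sets separately, then check which clusters from one set sit on top of clusters from the other. The overlap test used here (centroid distance smaller than the sum of the cluster radii) and the 2-D points are my own assumptions; the quoted post does not say how overlap is defined or which clustering is used.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def radius(points, center):
    """Distance from the center to the farthest member of the cluster."""
    return max(math.dist(p, center) for p in points)


def overlapping_pairs(clusters_a, clusters_b):
    """Index pairs (i, j) where cluster i of A and cluster j of B overlap.

    Assumed criterion: centroid distance < sum of the two radii.
    """
    pairs = []
    for i, a in enumerate(clusters_a):
        ca = centroid(a)
        ra = radius(a, ca)
        for j, b in enumerate(clusters_b):
            cb = centroid(b)
            rb = radius(b, cb)
            if math.dist(ca, cb) < ra + rb:
                pairs.append((i, j))
    return pairs


# Made-up 2-D "embeddings", already grouped into clusters per label.
clickbait_clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
benign_clusters = [[(0.5, 0.5), (1.5, 0.5)], [(-10.0, -10.0), (-11.0, -10.0)]]

pairs = overlapping_pairs(clickbait_clusters, benign_clusters)
print(pairs)
```

The label sets are disjoint, but the first clickbait cluster and the first benign cluster occupy the same region of the space, so they count as overlapping. Those are presumably the ambiguous, informative examples.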
Typically LLMs don't produce usable embeddings for clustering or retrieval, and embedding models trained with contrastive learning are used instead, but there seems to be no mention of any models other than LLMs.
I'm also curious about what type of clustering is used here.
The obfuscated part is probably the use of support vector machines, which are the go-to for selecting the support vectors and ignoring the outliers, with distance defined between embedding vectors.
I could be wrong; they could be using something different for clustering, or something fancier like a variant of DBSCAN.