▲An NSFW filter for Marginalia search(marginalia.nu)

53 points by speckx 4 hours ago | 4 comments

▲ChadNauseam 38 minutes ago

Does marginalia_nu not use embedding models as part of search? I guess I assumed it would. If you have embeddings anyway, decision trees on the embedding vector (e.g. CatBoost) tend to work pretty well. Fine-tuning ModernBERT works even better but probably won't meet the criteria of "really fast and run well on CPUs". That said, the approach described in the article seems to work well enough and obviously provides extremely cheap inference.

▲marginalia_nu 25 minutes ago

It does not use any transformer models right now. I've experimented with BERT-adjacent methods, but have not found them fast enough to be useful. Basically, whatever approach is used, it needs to do inference at ~10us latencies to make real-time result filtering viable, or at <1ms to avoid adding unreasonable overhead to processing-time result labeling.

▲marginalia_nu 2 hours ago

This was a very meandering project, and trying to corral it into some sort of coherent narrative was a bit of an undertaking on its own. Hopefully it makes some sense.

▲BrunoBernardino 1 hour ago

Hi Viktor! Really cool write-up, thanks! Uruky is already using the `nsfw` param, but set to `0` or `1`, and I see in your example this looks like a new value option (`2`) that's "better" than `1`? How "safe" is it to implement it as the value to send when someone wants SFW results?

▲marginalia_nu 1 hour ago

0 disables all filtering

1 filters 'harmful' sites per the UT1 blacklists

2 is 1 + the new NSFW filter.

The new filter works pretty well in my assessment. It's not infallible, but it gives significantly cleaner results.

And if you do find queries it fails to sanitize, I'd love to hear about them.
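
For client authors, the three levels above could be wrapped roughly like this. It is only a sketch: the `nsfw` parameter and its values 0, 1 and 2 come from this thread, while the `query` parameter name, the `NsfwLevel` enum and the `search_params` helper are made up for illustration and are not part of any documented API.

```python
from enum import IntEnum
from urllib.parse import urlencode


class NsfwLevel(IntEnum):
    """Values of the `nsfw` parameter as described in the comment above."""
    OFF = 0                 # no filtering at all
    BLACKLIST = 1           # 'harmful' sites per the UT1 blacklists
    BLACKLIST_AND_NSFW = 2  # level 1 plus the new NSFW filter


def search_params(query: str, safe: bool) -> str:
    """Build a query string; `query` is a hypothetical parameter name."""
    level = NsfwLevel.BLACKLIST_AND_NSFW if safe else NsfwLevel.OFF
    return urlencode({"query": query, "nsfw": int(level)})
```

A client that wants SFW results would call `search_params("cats", safe=True)` and append the result to whatever search endpoint it already uses.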

▲BrunoBernardino 1 hour ago

Thanks, already implemented and tested a couple of queries and it does look good!

▲IncreasePosts 35 minutes ago

Can you add 3, which only returns content flagged as NSFW?

So I can make sure I know what sites to stay away from, of course

▲marginalia_nu 9 minutes ago

Wouldn't work very well, in that you'd get awful recall.

The way the filter is implemented, it runs after the query has been executed. I'd have to run it at document processing time, code in a pseudo-keyword for the label, and then add that to the query.

It's doable, but I question whether the juice is worth the squeeze.
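
The difference between the two approaches can be shown with a toy inverted index. This is a sketch and not how Marginalia is actually built: `is_nsfw` is a stand-in classifier, and the `special:nsfw` pseudo-keyword, `build_index`, `search` and `post_filter` are hypothetical names.

```python
from collections import defaultdict

NSFW_LABEL = "special:nsfw"  # hypothetical pseudo-keyword for the label


def is_nsfw(text: str) -> bool:
    # Stand-in for the real classifier; assumption for illustration only.
    return "explicit" in text.split()


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index terms, and add the label as a pseudo-keyword at processing time."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
        if is_nsfw(text):
            index[NSFW_LABEL].add(doc_id)
    return index


def search(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Return documents containing every term (including pseudo-keywords)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def post_filter(doc_ids: set[int], docs: dict[int, str]) -> set[int]:
    """Filtering after the query: drop flagged documents from the results."""
    return {d for d in doc_ids if not is_nsfw(docs[d])}
```

With the label in the index, an "NSFW only" query is just `search(index, ["cat", NSFW_LABEL])`. With only a post-filter, you can only keep or drop whatever the query already returned, which is where the poor recall comes from.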

▲VectorLock 11 minutes ago

Or perhaps -2

▲8organicbits 1 hour ago

Have you seen many examples of websites labeling themselves, perhaps using rating meta tags (<meta name="rating" ...>)? Self-labeling seems valuable in some ways, but I don't think I've seen it catch on.

▲marginalia_nu 1 hour ago

Meta tags are almost universally garbage, but the presence of '18 USC 2257' (or U.S.C.) is a very strong NSFW signal.
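
A check for that phrase could look something like the following. It is a minimal sketch, assuming a plain regex over page text; the pattern and the `mentions_2257` helper are illustrative and not taken from the actual filter.

```python
import re

# Matches "18 USC 2257", "18 U.S.C. 2257" and "18 U.S.C. § 2257",
# case-insensitively. Illustrative pattern, not the filter's real rule.
USC_2257 = re.compile(
    r"\b18\s+U\.?\s?S\.?\s?C\.?\s+(?:§\s*)?2257\b",
    re.IGNORECASE,
)


def mentions_2257(text: str) -> bool:
    """Return True if the text contains a 18 U.S.C. 2257 style reference."""
    return USC_2257.search(text) is not None
```

A matcher like this fires on any page that merely quotes the phrase, so it would presumably be one signal among others rather than a verdict on its own.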

▲Wingy 1 hour ago

Does this comment make this page NSFW on Marginalia?
