> The internet is no longer a safe haven for software hobbyists
Maybe I've just had bad luck, but since I started hosting my own websites back around 2005 or so, my servers have always been attacked basically from the moment they come online. Even more so once you attach any sort of DNS name, and especially once you use TLS, I'm guessing because the certificates end up in a big, easily accessible index (the Certificate Transparency logs). Once you start sharing your website, it again triggers an avalanche of bad traffic, and the final boss is when you piss off some organization and (I'm assuming) they hire some bad actor to try to take you offline.
Dealing with crawlers, botnets, automation gone wrong, pissed-off humans and so on has been almost a yearly thing for me since I started deploying stuff to the public internet. But again, maybe I've had bad luck? I've hosted stuff across a wide range of providers, and it seems to happen on all of them.
My stuff used to get popped daily. A janky PHP guestbook I wrote just to learn back in the early 2000s? No HTML injection protection, and within days someone had turned my site into a spammy XSS hack. A WordPress installation I fell behind on patching? Turned into SEO spam in hours. A Redis instance I was using just to learn some of its data structures that got accidentally exposed to the web? Used to root my machine and install a botnet RAT. This was all before 2020.
I never felt this made the internet "unsafe". Instead, it just reminded me how I messed up. Every time, I learned how to do better, and I added more guardrails. I haven't gotten popped that obviously in a long time, but that's probably because I've acted to minimize my public surface area, used wildcard certs to avoid showing up in the certificate transparency logs, added basic auth whenever I can, and generally refused to _trust_ software that's exposed to the web. It's not unsafe if you take precautions, have backups, and are careful about what you install.
If you want to see unsafe, look at how someone who doesn't understand tech tries to interact with it: downloading any random driver or exe to fix a problem, installing apps when a website would do, giving Facebook or TikTok all of their information and access without recognizing that just maybe these multi-billion-dollar companies who give away all of their services don't have their best interests in mind.
Hosting WordPress with any number of third-party plugins written by script kiddies, without constant vigilance and regular updates, is a recipe for disaster. This makes it a job guarantee: hapless people paying for someone to set up a hopelessly over-complicated WP install, paying for lots of plugins, and paying for constant upkeep. Basically, that ecosystem feeds an entire community of "web developers" by pushing badly written software that then endlessly needs to be patched and maintained. Then the feature creep sets in and plugins stray from the path of doing one thing well, until even WP instance maintainers deem them too bloated and look for a simpler one. Then the cycle begins anew.
I really like how you take these situations and turn them into learning moments, but ultimately what you’re describing still sounds like an incredibly hostile space. Like yeah everyone should be a defensive driver on the road, but we still acknowledge that other people need to follow the rules instead of forcing us to be defensive drivers all the time.
The worst feeling I ever had was from exposing a samba share to the Internet in the 2000s and having that get popped and my dad’s company getting hacked because of the service I set up for him.
I have a personal domain that I have no reason to believe any other human visits. I selfhost a few services that only I use but that I expose to the internet so I can access them from anywhere conveniently and without having to expose my home network. Still I get a constant torrent of malicious traffic, just bots trying to exploit known vulnerabilities (loads of them are clearly targeting WordPress, for example, even though I have never used WordPress). And it has been that way for years. I remember the first time I read my access logs I had a heart attack, but it's just the way it is.
I've often thought about writing a script to use those bot attacks as a bit of a honeypot. The idea is that if someone is hitting a site with a brand-new SSL certificate, it can't be legitimate traffic, so just block that IP/subnet outright at the firewall, especially if they are looking for specific URLs like WordPress installations. There are a few good actors that also hit sites quickly (e.g. I've seen Bing indexing in that first wave of hits), but those are the exception.
Sadly, like many people, I just deal with the traffic as opposed to getting around to actually writing a tool to block it.
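If I ever get around to it, a rough sketch might look something like this (untested; the log path, probe URLs and the per-IP iptables drop are all assumptions on my part):

```python
#!/usr/bin/env python3
# Rough sketch: watch a freshly-certified vhost's access log and drop IPs that
# probe for WordPress paths. Log location and probe paths are illustrative.
import re
import subprocess
import time

LOG = "/var/log/nginx/access.log"                             # assumed location
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/wp-admin")   # illustrative list
banned = set()

def ban(ip):
    if ip in banned:
        return
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=False)
    banned.add(ip)
    print(f"banned {ip}")

with open(LOG) as f:
    f.seek(0, 2)                      # start at end of file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.5)
            continue
        m = re.match(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)', line)
        if m and any(m.group(2).startswith(p) for p in PROBE_PATHS):
            ban(m.group(1))
```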
And it has been that way for a long time. Hosting a service on the internet means someone is *constantly* knocking at your door. It would be unimaginable if someone were trying a key in my front door every 10–1000 milliseconds, but that's just what it is with an open port on the internet.
I recently provisioned a VPS for educational purposes. As part of teaching public/private network interfaces in Docker, and as a debug tool, I ran netstat pretty early on.
Minutes after the machine came into existence, it already had half a dozen connections to sshd from Chinese IP addresses.
Just put sshd on a nonstandard port, and 95% of the traffic goes away. Vandals can't be bothered with port-scanning, probably because the risk of getting banned before the scan is even complete is too high.
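Concretely, it's a one-line change (2222 is an arbitrary example):

```
# /etc/ssh/sshd_config — move sshd off port 22 (2222 is just an example)
Port 2222

# Then restart sshd and open the new port in your firewall *before* closing 22, e.g.:
#   systemctl restart sshd
#   iptables -A INPUT -p tcp --dport 2222 -j ACCEPT
```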
My first ever deployed project was breached on day 1 with my database dropped and a ransom note in there.
It was a beginner mistake on my part that allowed this, but it's pretty discouraging. It's not the internet that sucks, it's people that suck.
The Internet is not safe, and Let's Encrypt shows us this. They're a great service, but the moment you put something on the Internet and give it an SSL/TLS certificate, evil will hammer your site trying to find a WordPress admin page.
The public internet is an incredibly hostile infosec environment and you pretty much HAVE to block requests based on real-time threat data like
https://www.misp-project.org/feeds/
It is fun to create honeypots for things like SSH and RDP and automatically block the source IPs
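A rough sketch of wiring such a feed into the firewall (the feed URL and its one-entry-per-line format are assumptions; real MISP feeds come in several formats, so the parsing would need adapting):

```python
#!/usr/bin/env python3
# Rough sketch: pull an IP/CIDR blocklist feed and load it into an ipset that
# a single iptables rule can match on. Feed URL and format are placeholders.
import subprocess
import urllib.request

FEED_URL = "https://example.com/blocklist.txt"   # placeholder feed
SET_NAME = "threatfeed"

with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
    entries = [l.strip() for l in resp.read().decode().splitlines()
               if l.strip() and not l.startswith("#")]

# Create the set if it doesn't exist, then rebuild its contents.
subprocess.run(["ipset", "create", SET_NAME, "hash:net"], check=False)
subprocess.run(["ipset", "flush", SET_NAME], check=True)
add_lines = "\n".join(f"add {SET_NAME} {e}" for e in entries)
subprocess.run(["ipset", "restore"], input=add_lines.encode(), check=True)

# Then match the whole set with one rule, e.g.:
#   iptables -I INPUT -m set --match-set threatfeed src -j DROP
```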
I have been using zipbombs and they were effective to some extent. Then I had the smart idea to write about it on HN [0]. The result was a flood of new types of bots that overwhelmed my $6 server. For ~100k daily requests, it wasn't sustainable to serve 1 to 10MB payloads.
I've updated my heuristic to only serve the worst offenders, and created honeypots to collect IPs and respond with 403s. After a few months, and some other spam tricks I'll keep to myself this time, my traffic is back to something reasonable again.
I do not have a solution for a blog like this, but if you are self-hosting I recommend enabling mTLS on your reverse proxy.
I'm doing this for a dozen services hosted at home. The reverse proxy just drops the request if the client does not present a certificate. My devices, which have the cert installed, connect seamlessly. It's a one-time setup, but once done you can forget about it.
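For nginx (as one example of a reverse proxy), the relevant bits look roughly like this; paths, names and ports are placeholders, the point is just that requests without a valid client cert never reach the backend:

```nginx
# Minimal mTLS sketch for an nginx reverse proxy (paths/names are placeholders).
server {
    listen 443 ssl;
    server_name service.example.com;

    ssl_certificate         /etc/nginx/tls/server.crt;
    ssl_certificate_key     /etc/nginx/tls/server.key;

    ssl_client_certificate  /etc/nginx/tls/my-private-ca.crt;  # CA that signed the client certs
    ssl_verify_client       on;   # reject any request without a valid client certificate

    location / {
        proxy_pass http://127.0.0.1:8080;  # the self-hosted service
    }
}
```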
Wireguard is much better. Not only is it easier to set up/maintain, it even works on Android and iOS. I used to use client authentication for my private git server, but getting client certs installed on every client browser or app was a pain in the ass, and not even possible for some mobile browsers.
Today, my entire network of self hosted stuff exists in a personal wireguard VPN. My firewall blocks everything except the wireguard port (even SSH).
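A minimal sketch of that layout (keys, addresses and the port are placeholders); the firewall leaves nothing open to the internet except WireGuard's UDP port:

```
# /etc/wireguard/wg0.conf on the server (keys, addresses, port are placeholders)
[Interface]
Address    = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]                       # one block per device (laptop, phone, ...)
PublicKey  = <client-public-key>
AllowedIPs = 10.8.0.2/32

# Firewall: drop everything from the internet except WireGuard itself;
# SSH and the self-hosted services are then only reachable over 10.8.0.0/24.
#   iptables -A INPUT -p udp --dport 51820 -j ACCEPT
#   iptables -A INPUT -i wg0 -j ACCEPT
#   iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
#   iptables -P INPUT DROP
```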
That's fine if you're hosting stuff just for yourself but not really practical if you're hosting stuff you want others to be able to read, such as a blog.
You can mTLS to CloudFlare too, if you’re not one of the anti-CloudFlare people. Then all traffic drops besides traffic that passes thru CF and the mTLS handshake prevents bypassing CF.
Upvoted not because the internet has ever been a safe haven, but for simply taking a moment to document the issue. But then again, I can't even give away a feed of what's bouncing off of my walls, drowning in my moat.
(An Alibaba /16? I block not just 3/8, but every AWS range I can find.)
> Fail2ban was struggling to keep up: it ingests the Nginx access.log file to apply its rules but if the files keep on exploding…
> [...]
> But I don’t want to fiddle with even more moving components and configuration
My Gitea instance also encountered aggressive scraping some days ago, but with highly distributed IP & ASN & geolocation, each of which is well below the rate of a human visitor. I assume Anubis will not stop the massively funded AI companies, so I'm considering poisoning the scrapers with garbage code, only targeting blind scrapers, of course.
Sadly we're now seeing services that sell proxies which let you scrape from a wide variety of residential IPs; some even go so far as to label their IPs as "ethically sourced".
Anubis is definitely playing the cat-and-mouse game to some extent, but I like what it does because it forces bots to either identify themselves as such or face challenges.
That said, we can likely do better. Cloudflare does well in part because it handles so much traffic, so it has a lot of data from across the internet. Smaller operators just don't get enough traffic to deal with banning abusive IPs without banning entire ranges indefinitely, which isn't ideal. I hope to see a solution like Crowdsec where reputation data can be crowdsourced to block known bad bots (at least for a while, since they are likely borrowing IPs) while using low-complexity (potentially JS-free) challenges for IPs with no bad reputation. It's probably too much to ask of Anubis upstream, which is likely already busy enough dealing with the challenges of what it does at the scale it operates, but it does leave some room for further innovation for whoever wants to go for it.
In my opinion there's no reason a drop-in solution couldn't mostly solve these problems and make it easier for hobbyists to run services again.
Since I moved my DNS records to Cloudflare (that is, my nameserver is now Cloudflare's), I get tons of odd connections, most notably SYN packets to either 443 or 22 that never respond after the SYN-ACK. They hit me about once a second on average, with the source IPs spread over a /24 network.
I really don't understand why they do this, and it's mostly from shady origins, like VPS game-server hosters from Brazil and so on.
I'm at the point where I capture all the traffic, look for SYN packets, and check their RDAP records to decide whether to drop that organization's entire subnets, whitelisting things like Google.
Digital Ocean is notoriously a source of bad traffic, they just don't care at all.
These are spoofed packets for SYN-ACK reflection attacks. Your response traffic goes to the victim, and since network stacks are usually configured to retry the SYN-ACK a few times, they also get amplification out of it.
> like vps game server hoster from Brazil and so on.
Probably someone DDoSing a Minecraft server or something.
People in games do this where they DDoS each other. You can get access to a DDoS panel for as little as $5 a month.
Some providers allow spoofing the source IP; that's how they do these reflection attacks. So you're not actually dropping the sender of these packets, but the victims.
Consider setting the reverse path filter to strict as a basic anti-spoofing measure and see if it helps.
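On Linux that's the `rp_filter` sysctl (1 = strict, 2 = loose):

```
# Enable strict reverse path filtering on all interfaces.
sysctl -w net.ipv4.conf.all.rp_filter=1
sysctl -w net.ipv4.conf.default.rp_filter=1
# Persist it, e.g. in /etc/sysctl.d/99-rpfilter.conf:
#   net.ipv4.conf.all.rp_filter = 1
#   net.ipv4.conf.default.rp_filter = 1
```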
No but your electricity company will absolutely rat you out if your electricity usage skyrockets and the police will pop by to see if you’re running a grow op or something.
I wonder if you can have a chain of "invisible" links on your site that a normal person wouldn't see or click.
The links can go page A -> page B -> page C, where a request for C = instant IP ban.
I self-host and I have something like this, but more obvious: I wrote a web service that talks to my MikroTik via its API and adds the IP of the requester to the block list with a 30-day timeout (configurable, of course). Its hostname is "bot-ban-me.myexamplesite.com" and it sits in my reverse proxy like a normal site. So when I request a cert this hostname is in the cert, and in the first few minutes I can catch lots of bad apples. I do not expect anyone to ever type this. I do not mention the address anywhere, so the only way to land there is to watch the CT logs.
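Roughly, the shape of the service is something like this. It's only a sketch: it assumes RouterOS v7's REST API (where PUT creates a record) and a reverse proxy that sets X-Forwarded-For; the router address, credentials, list name and port are placeholders, not my real setup:

```python
#!/usr/bin/env python3
# Sketch of a "ban me" endpoint: any request that reaches it gets its source IP
# added to a RouterOS address list with a timeout. Assumes RouterOS v7 REST API.
import base64
import json
import ssl
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ROUTER = "https://192.168.88.1"                               # placeholder router
AUTH = base64.b64encode(b"apiuser:apipassword").decode()      # placeholder creds

CTX = ssl.create_default_context()
CTX.check_hostname = False          # router usually has a self-signed cert
CTX.verify_mode = ssl.CERT_NONE

def ban(ip: str) -> None:
    req = urllib.request.Request(
        f"{ROUTER}/rest/ip/firewall/address-list",
        data=json.dumps({"list": "bot-ban", "address": ip, "timeout": "30d"}).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {AUTH}"},
        method="PUT",               # PUT creates a new record in the REST API
    )
    urllib.request.urlopen(req, timeout=5, context=CTX)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Trust the proxy's X-Forwarded-For, fall back to the socket peer.
        ip = self.headers.get("X-Forwarded-For", self.client_address[0]).split(",")[0].strip()
        try:
            ban(ip)
        except Exception as e:
            print("ban failed:", e)
        self.send_response(403)
        self.end_headers()

HTTPServer(("127.0.0.1", 8099), Handler).serve_forever()
```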
Scrapers nowadays can use residential and mobile IPs, so banning by IP, even if actual malicious requests are coming from them, can also prevent actual unrelated people from accessing your service.
Unless you're running a very popular service, it's unlikely that a random residential IP would be both compromised by a malicious VPN and also trying to access your site legitimately.
Anyone who owns a chrome extension with 50k+ installs is regularly asked to sell it to people (myself included). The people who buy the extensions try to monetize them any way they can, like proxying traffic for malicious scrapers / attacks.
There was an article just yesterday which detailed doing this not to ban, but to waste their time. You can also zip bomb people, which is entertaining but probably not super effective.
We do something similar for ssh. If a remote connection tries to log in as "root" or "admin" or any number of other usernames that indicate a probe for vulnerable configurations, that's an insta-ban for that IP address (banned not only for SSH but for everything).
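Roughly, the mechanism looks like this (the log path, username list and the plain iptables ban here are illustrative assumptions, not our exact setup):

```python
#!/usr/bin/env python3
# Rough sketch: follow the auth log and ban IPs probing for usernames we never use.
import re
import subprocess

BAD_USERS = {"root", "admin", "oracle", "test", "ubuntu"}   # illustrative list
PATTERNS = [
    re.compile(r"Invalid user (\S+) from (\S+)"),
    re.compile(r"Failed password for (\S+) from (\S+)"),
]
banned = set()

proc = subprocess.Popen(["tail", "-F", "/var/log/auth.log"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    for pat in PATTERNS:
        m = pat.search(line)
        if m and m.group(1) in BAD_USERS and m.group(2) not in banned:
            ip = m.group(2)
            # Ban everywhere, not just SSH: drop all traffic from that address.
            subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                           check=False)
            banned.add(ip)
```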
I wonder if a proof of work protocol is a viable solution. To GET the page, you have to spend enough electricity to solve a puzzle. The question is whether the threshold could be low enough for typical people on their phones to access the site easily, but high enough that mass scraping is significantly reduced.
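The basic shape would be hashcash-style: the server hands out a random challenge and a difficulty, and the client must find a nonce whose hash has enough leading zero bits before the page is served. A toy sketch (the 20-bit difficulty is an illustrative guess, not a tuned threshold):

```python
#!/usr/bin/env python3
# Toy hashcash-style proof of work: find a nonce so that
# sha256(challenge + nonce) starts with `difficulty` zero bits.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)            # issued by the server, unique per request
nonce = solve(challenge, 20)          # ~1M hashes on average: real work for the client
assert verify(challenge, nonce, 20)   # verification is a single hash, nearly free
```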
Thanks for these references! I imagine the numbers would be entirely different in our context (20 years later, and web serving rather than email sending). And the idea of spammers using botnets (and therefore not paying for compute themselves) would be less relevant to LLM scraping. But I'll try to check for forward references on these.
> And the idea of spammers using botnets (and therefore not paying for compute themselves) would be less relevant to LLM scraping.
It's possible that the services that reward users for running proxies (or are bundled with mobile apps with a notice buried in the license) would also start rewarding/hiding compute services as well. There's currently no money in it because proof-of-work is so rare, but if it changes, their strategy might too.
I feel like it could work. If you think about it, you need the cost to the client to be greater than the cost to the server. As long as that is true the server shouldn't mind about increased traffic because it's making a profit!
Very crudely, if you think that a request costs the server ~10ms of compute time and a phone is 30x slower, then you'd need 300ms of client compute time to equal it, which seems very reasonable.
The only problem is you would need a cryptocurrency that a) lets you verify tiny chunks of work, b) can't be done faster on other hardware than on a phone, and c) lets a client mine money without being able to actually spend it ("homomorphic mining"?).
I don't know if anything like that exists but it would be an interesting problem to solve.
The problem is that the attacker isn't using a phone, they are using some type of specialized hardware.
I still think it is possible with some customized variant of RandomX. The server could even make a bit of money by acting as a mining pool, forcing the clients to mine a certain block template. It's just that it would need to be installed as a browser plugin or something; it wouldn't be efficient running within a page.
Also, the verification process for RandomX is still pretty intensive, so there is a high minimum bar for where it would be feasible.
I wonder why we're seeing an increase in these automated scrapers and attacks as of late (the last few years). Is there better (open-source?) technology that enables it? Is it because hosting infrastructure is cheaper for the attackers as well? Both? Something else?
Maybe the long-term solution for such attacks is to hide most of the internet behind some kind of Proof of Work system/network, so that mostly humans get to access to our websites, not machines.
What's missing is effective international law enforcement. This is a legal problem first and foremost. As long as it's as easy as it is to get away with this stuff by just routing the traffic through a Russian or Singaporean node, it's going to keep happening. With international diplomacy going the way it has been, odds of that changing aren't fantastic.
The web is really stuck between a rock and a hard place when it comes to this. Proof of work helps website owners, but makes life harder for all discovery tools and search engines.
An independent standard for request signing and some sort of reputation database for verified crawlers could be part of a solution, though that causes problems with websites feeding crawlers different content than users, and does nothing to fix the Sybil attack problem.
I don't want governments to have this level of control over the internet. It seems like you are paving over a technological problem with the way the internet is designed by giving some institution a ton of power over the internet.
The alternative to governments stopping misbehavior is every website hiding behind Cloudflare or a small number of competitors, which is a situation that is far more susceptible to abuse than having a law that says you can't DDoS people even if you live in Singapore.
It really can not be overstated how unsustainable the status quo is.
This is already kind of true with every global website. The idea of a single global internet is one of those fairy-tale fantasy things that maybe existed for a little bit before enough people used it. In many cases it isn't really ideal today.
I don't think this can be solved legally without compromising anonymity. You can block unrecognized clients and punish the owners of clients that behave badly, but then, for example, an oppressive government can (physically) take over a subversive website and punish everyone who accesses it.
Maybe pseudo-anonymity and “punishment” via reputation could work. Then an oppressive government with access to a subversive website (ignoring bad security, coordination with other hijacked sites, etc.) can only poison its clients’ reputations, and (if reputation is tied to sites, who have their own reputations) only temporarily.
> but then, for example, an oppressive government can (physically) take over a subversive website and punish everyone who accesses it.
Already happens. Oppressive governments already punish people for visiting the "wrong" websites. They already censor the internet.
There are no technological solutions to coordination problems. Ultimately, no matter what you invent, it's politics that will decide how it's used and by whom.
Good points; I would definitely vouch for an independent standard for request signing plus some kind of decentralized reputation system. With international law enforcement, I think there would be too many political issues for it not to become corrupt.
It's not necessarily going through a Russian or Singaporean node though, on the sites I'm responsible for, AWS, GCP, Azure are in the top 5 for attackers. It's just that they don't care _at all_ about that happening.
I don't think you need world-wide law-enforcement, it'll be a big step ahead if you make owners & operators liable. You can limit exposure so nobody gets absolutely ruined, but anyone running wordpress 4.2 and getting their VPS abused for attacks currently has 0 incentive to change anything unless their website goes down. Give them a penalty of a few hundred dollars and suddenly they do. To keep things simple, collect from the hosters, they can then charge their customers, and suddenly they'll be interested in it as well, because they don't want to deal with that.
The criminals are not held liable, and neither are their enablers. There's very little chance anything will change that way.
The big cloud providers need to step up and take responsibility. I understand that it can't be too easy to do, but we really do need a way to contact e.g. AWS and tell them to shut off a customer. I have no problem with someone scraping our websites, but I do care when they don't do so responsibly: slow down when we start responding slower, and don't assume that you can just go full throttle, crash our site, wait, and then do it again once we start responding again.
You're absolutely right: AWS, GCP, Azure and others, they do not care and especially AWS and GCP are massive enablers.
I'm very aware of that, yes. There needs to be a good process; the current situation, where AWS simply does not care or doesn't know, also isn't particularly good. One solution could be for victims to notify AWS that a number of specified IPs are generating an excessive amount of traffic. An operator could then verify with AWS traffic logs, notify the customer that they are causing issues, and only after a failure to respond could the customer be shut down.
You're not wrong that abuse would be a massive issue, but I'm on the other side of this and need Amazon to do something, anything.
I'm pretty sure it is the commercial demand for data from AI companies. It is certainly the popular conception among sysadmins that it is AI companies who are responsible for the wave of scrapers over the past few years, and I see no compelling alternative.
Another potential cause: it's way easier for pretty much any person connected to the internet to "create" their own automation software by using LLMs. I'd wager even the less capable LLMs could handle "Create a program that checks this website every second for any product updates on all pages" and give enough instructions for the average computer user to run it without thinking or considering much.
Multiply this by every person with access to an LLM who wants to "do X with website Y" and you'll get a magnitude increase in traffic across the internet. This has been possible since, what, sometime in 2023? Not sure if the patterns would line up, but it's just another guess at the cause(s).
Attached to IP address is easiest to grok, but wouldn't work well since addresses lack affinity. OK, so we introduce an identifier that's persistent, and maybe a user can even port it between devices. Now it's bad for privacy. How about a way a client could prove their reputation is above some threshold without leaking any identifying information? And a decentralized way for the rest of the internet to influence their reputation (like when my server feels you're hammering it)?
Do anti-DDoS intermediaries like Cloudflare basically catalog a spectrum of reputation at the ASN level (pushing anti-abuse onus to ISP's)?
This is basically what happened to email/SMTP, for better or worse :-S.
20+ years ago there were mail blacklists that basically blocked residential IP blocks, since there shouldn't be servers sending normal mail from there. Now you have to try the opposite and blacklist blocks where only servers, not end users, can come from, as there are potentially badly behaved scrapers in all major clouds and server-hosting platforms.
But then there are residential proxies that pay end users to route requests from misbehaving companies, so that door is also a bad mitigation.
It's interesting that along another axis, the inertia of the internet moved from a decentralized structure back toward something that resembles mainframes. I don't think those axes are orthogonal.
Reputation plus privacy is probably unsolvable; the whole point of reputation is knowing what people are doing elsewhere. You don’t need reputation, you need persistence. You don’t need to know if they are behaving themselves elsewhere on the Internet as long as you can ban them once and not have them come back.
Services need the ability to obtain an identifier that:
- Belongs to exactly one real person.
- That a person cannot own more than one of.
- That is unique per-service.
- That cannot be tied to a real-world identity.
- That can be used by the person to optionally disclose attributes like whether they are an adult or not.
Services generally don’t care about knowing your exact identity but being able to ban a person and not have them simply register a new account, and being able to stop people from registering thousands of accounts would go a long way towards wiping out inauthentic and abusive behaviour.
The ability to “reset” your identity is the underlying hole that enables a vast amount of abuse. It’s possible to have persistent, pseudonymous access to the Internet without disclosing real-world identity. Being able to permanently ban abusers from a service would have a hugely positive effect on the Internet.
A digital "Death penalty" is not a win for society, without considering a fair way to atone for "crimes against your digital identity".
It would be way too easy for the current regime (whomever that happens to be) to criminalize random behaviors (trans people? Atheists? A random nationality?) and ban their identity, and then they can't apply for jobs, get bus fare, purchase anything online, communicate with their lawyers, etc.
Describing “I don’t want to provide service to you and I should have the means of doing so” as a “digital death penalty” is a tad hyperbolic, don’t you think?
> It would be way too easy for the current regime (whomever that happens to be) to criminalize random behaviors (trans people? Atheists? A random nationality?) and ban their identity, and then they can't apply for jobs, get bus fare, purchase anything online, communicate with their lawyers, etc.
Authoritarian regimes can already do that.
I think perhaps you might’ve missed the fact that what I was suggesting was individual to each service:
> Reputation plus privacy is probably unsolvable; the whole point of reputation is knowing what people are doing elsewhere. You don’t need reputation, you need persistence. You don’t need to know if they are behaving themselves elsewhere on the Internet as long as you can ban them once and not have them come back.
I was saying don’t care about what people are doing elsewhere on the Internet. Just ban locally – but persistently.
If creating an identity has a cost, then why not allow people to own multiple identities? Might help on the privacy front and address the permadeath issue.
Of course everything sounds plausible when speaking at such a high level.
I agree and think the ability to spin up new identities is crucial to any sort of successful reputation system (and reflects the realities of how both good and bad actors would use it). Think back to early internet when you wanted an identity in one community (e.g. forums about games you play) that was separate from another (e.g. banking). But it means those reputation identities need to take some investment (e.g. of time / contribution / whatever) to build, and can't become usefully trusted until reaching some threshold.
Because of course what this world needs is for the wealthy to have even more advantages over the normies. (Hint: If you're reading this, and think you're one of the wealthy ones, you aren't)
I guess it is just because 1) they can, and 2) everyone wants some data. I think it would be interesting if every website out there started to push out BS pages just for scrapers. Not sure how much extra cost it would take if a website put up, say, 50% BS pages that only scrapers can reach, or BS material in extremely small fonts hidden in regular pages that ordinary people cannot see.
Why? It’s because of AI. It enables attacks at scale. It enables more people to attack, who previously couldn’t. And so on.
It’s very explainable. And somehow, like clockwork, there are always comments to say “there is nothing new, the Internet has always been like this since the 80s”.
You know, part of me wants to see AI proliferate into more and more areas, just so these people will finally wake up eventually and understand there is a huge difference when AI does it. When they are relentlessly bombarded with realistic phone calls from random numbers, with friends and family members calling about the latest hoax and deepfake, when their own specific reputation is constantly attacked and destroyed by 1000 cuts not just online but in their own trusted circles, and they have to put out fires and play whack-a-mole with an advanced persistent threat that only grows larger and always comes from new sources, anonymous and not.
And this is all before bot swarms that can coordinate and plan long-term, targeting specific communities and individuals.
And this is all before humanoid robots and drones proliferate.
Just try to fast-forward to when human communities online and offline are constantly infiltrated by bots and drones and sleeper agents, playing nice for a long time and amassing karma / reputation / connections / trust / whatever until finally doing a coordinated attack.
Honestly, people just don’t seem to get it until it’s too late. Same with ecosystem destruction — tons of people keep strawmanning it as mere temperature shifts, even while ecosystems around the world get destroyed. Kelp forests. Rainforests. Coral reefs. Fish. Insects. And they’re like “haha global warming by 3 degrees big deal. Temperature has always changed on the planet.” (Sound familiar?)
Look, I don’t actually want any of this to happen. But if they could somehow experience the movie It’s a Wonderful Life or meet the Ghost of Christmas Yet to Come, I’d wholeheartedly want every denier to have that experience. (In fact, a dedicated attacker can already give them a taste of this with current technology. I am sure it will become a decentralized service soon :-( )
Our tech overlords understand AI, especially any form of AGI, will basically be the end of humanity. That’s why they’re entirely focused on being the first and amassing as much wealth in the meanwhile, giving up on any sort of consideration whether they’re doing good for people or not.
From governments, of course. There were times when criticism of anything and everything was a common and safe practice online. There are very few places where it is possible to keep practicing this now.
You need to terminate the TLS connection yourself, so this rules out using a DNS proxy, e.g. Cloudflare. Then you have to run a server that has a module that computes the JA3/JA4 fingerprint, e.g. nginx. Even then, it's possible to set your client hello in Python/curl/etc. to exactly mirror the JA4 of your chosen browser like Chrome. So JA4 stops basic bots, but most seasoned scrapers already present valid JA4s/JA3s.
I very much relate to the author's sour mood and frustration. I also host a small hobby forum and have experienced the same attacks constantly, and it has gotten especially bad the last couple of years with the rise of AI.
In the early days I put Google Analytics on the site so I could observe traffic trends. Then, we were all forced to start adding certificates to our sites to keep them "safe".
While I think we're all doomed to continue that annual practice or get blocked by browsers, I have often considered removing Google Analytics. Ever since their redesign it is essentially unusable for me now. What benefit does it bring if I can't understand the product anymore?
Last year, in a fit of desperation, I added Cloudflare. This has a brute-force "under attack" mode that seems to stop all bots from accessing the site. It puts up a silly "hang on a second, are you human" page before the site loads, but it does seem to work. Is it great UX? No, but at least the site isn't getting hammered from various locations in Asia. Cloudflare also lets me block entire countries, although that seems to be easily fooled.
I also don't think a lot of the bots/AI crawlers honor the rules set in the robots.txt. It's all an honor system anyway, and they are completely lacking in it.
There need to be some hard and fast rules put in place, somehow, to stop the madness.
Cloudflare does work, but it often destroys the experience for legitimate users. On the website I manage, non-technical users were often getting stuck on the Cloudflare captcha, so I ended up removing it.
Then there's also the issue of dependence on US-based services, but that may not be an issue for you.
> Other things I’ve noticed is increased traffic with Referer headers coming from strange websites such as bioware.com, mcdonalds.com, and microsoft.com
I've been seeing this too, I guess scrapers think they can get through some blockers with a referrer?
sad but hosting static content like his site in a cloud would save him a headache. i know i know, "do it yourself" and all but if that is his path he knows the price. maybe i am wrong and do not understand the problem but it seems like he is asking for a headache.
I think the author would agree, and is in fact the point of his post.
The only way to solve these problems is using some large hosted platform where they have the resources to constantly be managing these issues. This would solve their problem.
But isn't it sad that we can't host our own websites anymore, like many of us used to? It was never easy, but it's nearly impossible now and this is only one reason.
i think it has been hard to host a site since about 2007. i stopped then because it is too much work to keep it safe. even worse now, but it has always been extra work since search engines came along. maybe the OP is just getting older and wants to spend time with his kids and not play with nginx haha.
Unpopular opinion: the real source of the problem is not scrapers, but your unoptimized web software. Gitea and Fail2ban are resource hogs in your case, either unoptimized or poorly configured.
My tiny personal web servers can withstand thousands of requests per second, barely breaking a sweat. As a result, none of the bots or scrapers cause any issues.
"The only thing that had immediate effect was sudo iptables -I INPUT -s 47.79.0.0/16 -j DROP" Well, by blocking an entire /16 range, it is this type of overzealous action that contributes to making the internet experience a bit more mediocre. This is the same thinking that lead me to, for example, not being able to browse homedepot.com from Europe. I am long-term traveling in Europe and like to frequent DIY websites with people posting links to homedepot, but no someone at HD decided that European IPs couldn't access their site, so I and millions of others are locked out. The /16 is an Alibaba AS, and you make the assumption that most of it is malicious, but in reality you don't know. Fix your software, don't blindly block.
The Internet has really been an interesting case study for what happens between people when you remove a varying number of layers of social control.
All the way back to the early days of Usenet really.
I would hate to see it but at the same time I feel like the incentives created by the bad actors really push this towards a much more centralized model over time, e.g. one where all traffic provenance must be signed and identified and must flow through a few big networks that enforce laws around that.
"Socialists"* argue for more regulations; "liberals" claim that there should be financial incentives to not do that.
I'm neither. I believe that we should go back to being "tribes"/communities. At least it's a time-tested way to – maybe not prevent, but somewhat alleviate – the tragedy of the commons.
(I'm aware that this is a very poor and naive theory; I'll happily ditch it for a better idea.)
Little would prevent attacks by APTs and other powerful groups. (This, btw., is one of the few facets of this problem that technology could help solve.) But a trivial change: a hard requirement to sign up (=send a human-composed message to one of the moderators) to be able to participate (or, in extreme cases, to read the contents) "automagically" stops almost all spam, scrapers (in the extreme case), vandalism, etc. (from my personal experience based on a rather large sample).
I think it's one of the multi-faceted problems where technology (a "moat", "palisade", etc. for your "tribe") should accompany social changes.
I run a dedicated firewall/dns box with netfilter rules to rate limit new connections per IP. It looks like I may need to change that to rate limit per /16 subnet...
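For reference, netfilter's hashlimit match can do the per-/16 grouping directly via `--hashlimit-srcmask`; the thresholds below are arbitrary examples, not tuned values:

```
# Rate limit new TCP connections per source /16 instead of per single IP.
iptables -A INPUT -p tcp --syn \
  -m hashlimit --hashlimit-above 60/minute \
  --hashlimit-mode srcip --hashlimit-srcmask 16 \
  --hashlimit-name new_per_16 -j DROP
```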
Some HNers already mentioned that the internet has not been a safe haven for a long time. All these vulnerability scanners and parsers were pinging my localhost servers even in the mid-2000s. It has just become worse, and even OSS and usually captcha-free places are installing things like Anubis [1].
All of this reminds me of some of Gibson's short stories I read recently and his description of Cyberspace: small corporate islands of protected networks in a hostile sea of sapient AIs ready to burn your brain.
Luckily, LLMs are not there yet, except you can still get your brain burnt from AI slop or polarizing short videos.
If you know that you don't have customers or users in the area, or very few, then go for it.
I worked in e-commerce previously; we reduced fraud to almost zero by banning non-local cards. It affected a few customers who had international credit cards, but not enough to justify dealing with the fraud. Sometimes you just need to limit your attack surface.
The Internet was a scene, and like all scenes it's done now that the corpos have moved in and taken over (because at that point it's just ads and rent extraction in perpetuity). I dunno what/where/when the next tech scene will be, but I do know it's not going to come from Big Tech. See: Metaverse.
The problem with anything, anything, without a centralized authority, is that friction overwhelms inertia. Bad actors exist and have no mercy, while good people downplay them until it’s too late. Entropy always wins. Misguided people assume the problem is powerful people, when the problem is actually what the powerful people use their authority to do, as powerful people will always exist. Accepting that and maintaining oversight is the historically successful norm; abolishing them has always failed.
As such, I don’t identify with the author of this post, about trying to resist CloudFlare for moral reasons. A decentralized system where everyone plays nice and mostly cooperates, does not exist any more than a country without a government where everyone plays nice and mostly cooperates. It’s wishful thinking. We already tried this with Email, and we’re back to gatekeepers. Pretending the web will be different is ahistorical.
The internet has made the world small, and that's a problem. Nation states typically had a limited range for broadcasting their authority in the more distant past. A bad ruler couldn't rule the entire world, nor could they cause trouble for the entire world. From nukes to the interconnected web, the worst of us with power can affect everyone else.
Power is a spectrum. Power differentials will always exist but we can absolutely strive to make them smaller and smaller.
1) Most of the civilized world no longer has hereditary dictators (such as "kings"). Because they were removed from power by the people and the power was distributed among many individuals. It works because malicious (anti-social) individuals have trouble working together. And yes, oversight helps.
But it's a spectrum and we absolutely can and should move the needle towards more oversight and more power distribution.
2) Corporate power structures are still authoritarian. We can change that too.
The magical times in the past have always been marked with being able to be part of an "exclusive club" that takes something from nothing to changing the world.
Because of the internet, magical times can never be had again. You can invent something new, but as soon as anyone finds out about it, everyone now finds out about it. The "exclusive club" period is no more.
Yes, they can. But we need to admit to ourselves that people are not equal. Not just in terms of skill but in terms of morality and quality of character. And that some people are best kept out.
Corporations, being amoral, should also be kept out.
---
The old internet was the way it was because of gatekeeping - the people on it were selected by the technical skill that was required. Generally, people who are builder types are more pro-social than redistributor types.
Any time I've been in a community which felt good, it was full of people who enjoyed building stuff.
Any time such a community died, it was because people who crave power and status took it over.
There is no "something new". Anything we invent will be able to be taken over by complex bots. Welcome to the future shock where humans aren't at the top of their domain.
Gopher still requires the Internet. I know it's pretty common to conflate "the Internet" with "the World Wide Web", but there are actually other protocols out there (like Gopher).
The internet hasn't been a safe haven since the 80s, or maybe earlier (that was before my time, and it's never been one since I got online in the early 90s).
The only real solution is to implement some sort of identity management system, but that has so many issues that make it a non-starter.
The governments like it that way. They want banks and tech companies to be intermediaries that are more difficult to hold accountable, because they can just say “we didn’t feel like doing business with this person”.
What do you mean by that? Web pages are the central mechanism we use to put information on the web. Of course many websites are shitty and could convey their information much more simply, without looking like crap at all. But the web page in general? Why would we ever get rid of something so useful? And what do you suggest as an alternative?
The common man never had a need for the internet or global connectedness. DARPA wanted to push technology to gain the upper hand in world matters. Universities pushed technology to show progress and sell research. Businesses pushed technologies to have more sales. It was a kind of acid rain caused by the establishment and sold as scented rain.
This sentiment - along the lines of "the world became too dependent on the Internet", "Internet wasn't a good thing to begin with", "Internet is a threat to national security" etc - has been popping up on HN too often lately, emerged too abruptly and correlates with the recent initiatives to crack down on the Internet too well.
If this is your own opinion and not a part of a psyop to condition people into embracing the death of the Internet as we know it, do you have any solution to propose?
You don't need to have a solution to explore a problem in my opinion. OP comment is problematic but for reasons other than not having a proposed solution.
> The common man never had a need for the internet or global connectedness
That's not how culture evolves. You don't necessarily need a problem before a solution is developed. You can very well have a technology developed for other purposes, or just for exploration's sake, and then, once the tech exists, uses for it start to pop up post hoc.
You therefore ignore the immense benefit of access to information that this technology brought, something that wasn't necessarily a problem for the common man, but once it's there, the popularization of access to information, they adapt and grow dependent on it. Just like electricity.
People with dial-up telephones never asked for a smartphone connected to the internet. They were just as happy back then, or even happier, because the phone didn't eat up their time or cause posture problems.
Sure, shopping was slower without the Amazon website, but it wasn't a less happy experience back then. In fact, homes had less junk and people saved more money.
Messaging? Sure, it makes you spend time in 100 WhatsApp groups where 99% of the people don't know you personally.
It helped companies to sell more of the junk more quickly.
It created bloggers and content creators who live in an imaginary world, thinking that someone really consumes their content.
It created karma beggars who beg globally for likes that are worth nothing.
It created more concentration of wealth at some weird internet companies, which don't solve any of the world problems or basic needs of the people.
And finally it created AI that pumps plastic sewage to fill the internet. There it is, your immensely useful internet.
As if the plastic pollution was not enough in the real world, the internet will be filled with plastic content.
What else did internet give that is immensely helpful?
You're blaming the hammer for people driving nails into other's heads instead of walls.
A friend of mine, who had a similar opinion on technology, once watched a movie that seemed to reinforce it in his eyes, and tried to persuade me as if it was the ultimate proof that all technology is evil.
The plot depicted a happy small tribe of indigenous people deep in the rainforest, who had never seen any artifacts of civilization. They never knew war, homicide, or theft. Basically, they knew no evil. Then, one day, a plane flies over and someone frivolously tosses an emptied bottle of Coca-Cola out of the window (sic!). A member of the tribe finds it in the forest and brings it back to the village. And, naturally, everyone else wants to get hold of the bottle, because it's so supernatural and attractive. But the guy decides he's the only owner, refuses, and then of course kills those who try to take it by force, and all hell breaks loose in no time.
"See", - concludes my friend triumphally, - "the technology brought evil into this innocent tribe!"
"But don't you think that evil already lurked in those people to start with, if they were ready to kill each other for shiny things?" - I asked, quite baffled.
"Oh, come on, so you're just supporting this shit!" was the answer...
You didn't actually refute any of the examples I gave. Show me the benefits of the internet that helped equal sharing of this planet's resources. Show me how the internet did not help the concentration of power and wealth. Show me how people's attention spans and physical spaces are not filled with junk thanks to the internet.
Why refute the examples, based on the false premise that it's the medium's fault that it's filled with plastic bullshit (which I totally agree it is, mind you)?
What's next, blaming the electromagnetic field and the devices that modulate it for being full of propaganda, violence and all the kinds of filth humankind is capable of creating? You find what you seek, and if not, keep turning that damn knob further.
But since you insist, some good frequencies to tune into:
1) Self-education in whatever field of practical or theoretical knowledge you're interested in;
2) Seeing a wider picture of the world than your local authorities would like you to (yes, basically seeing that all the world's kings are naked, which is the #1 reason why the Internet became such a major pain in the ass for the kings' trade union, so to say);
3) Being able to work from any location in the world with access to the Internet;
4) You mentioned selling trash en masse worldwide, but I know enough examples of wonderful things produced by independent people and sold worldwide.
The list could be longer, but I hate doing useless and thankless work.
Thanks for providing some positive examples. But these examples are dwarfed by the negative effects brought in by the internet, in my view. Sure, a modulated signal can be used for broadcasting a weather report or some propaganda. But the rush to push the technology was driven mostly by not talking about the negative effects. The same is happening with AI: sales prospects are the positive benefits driving it. No one wants to say that the tiger they are bringing back to life, because they can, is an enemy of humans.
I do agree with you that the negative aspects have been overwhelming any remaining good for quite some time, and that's a constant source of mourning for good things which keep succumbing to evil in this world for me.
It's not hard to build an internet that serves the people but nobody will pay you to do it, and if you are so brazen as to do it yourself then you will be investigated, harrassed, arrested, and beaten. Having been visited with every sorrow short of death, you will beg for death.