Lemmy.world update: Downtime today / Cloudflare

Lemmy.World Announcements@lemmy.world – 2311 points – 1 year ago

Today, like the past few days, we have had some downtime. Apparently some script kiddies are enjoying themselves by targeting our server (and others). Sorry for the inconvenience.

Most of these 'attacks' are targeted at the database, but some are more DDoS-like and can be mitigated by using a CDN. Some other Lemmy servers are using Cloudflare, so we know that works. Therefore we have chosen Cloudflare as our CDN / DDoS protection platform for now. We will look into other options, but we needed something implemented ASAP.

For the other attacks, we are using them to investigate and implement measures like rate limiting etc.
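For readers unfamiliar with rate limiting, one common technique is a token bucket. The sketch below is a minimal, hypothetical illustration in Python, assuming one bucket per client (for example per IP address); the class and parameter names are invented here, and the post does not say which limiter Lemmy.world actually uses.

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch, not Lemmy's code).

    A client may burst up to `capacity` requests; tokens then refill at
    `rate_per_sec`. Each allowed request spends one whole token.
    """

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)   # start full so a fresh client can burst
        self.clock = clock              # injectable clock makes testing easy
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical usage: keep one bucket per client IP and reject when allow() is False.
buckets = {}
bucket = buckets.setdefault("203.0.113.7", TokenBucket(rate_per_sec=2, capacity=3))
print(bucket.allow())
```

With rate_per_sec=2 and capacity=3, a client can burst three requests, is then refused, and regains one request every half second.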

Load balancing applications is significantly more complex than most people anticipate. In the naive implementation it typically increases database load and reduces site performance. Static content balancing is trivial, and Cloudflare will do that by default, but implementing the hard part will require careful software development to prevent a naive implementation from bringing down the database. Sticky sessions are just the beginning.

I mean...this take is naive. Putting a load balancer up in front of a few servers isn't going to do anything to their database. No idea where you're even getting that from, as they are completely unrelated.

The total number of application servers accessing the database is what would affect db performance in a negative way, and load balancing doesn't automatically mean "do something stupid like spin up 100 app servers when we normally use 3". All you've described is a need for a db proxy in the off chance that Lemmy code has horrible access patterns for db transactions.

You can take your uninformed nerd rage elsewhere now, thank you.

You obviously haven't written one.

Simple case, without sticky sessions:

Two app servers behind a naive load balancer. Assume an actually RESTful service. Also assume a reasonable single-app design with persistent db connections and db caching. Assume a single client. The client's first connection comes in to app server 1. App server 1 makes a db connection, grabs the relevant data out of the db, and caches it for the client, expecting a reconnect. The client makes a second call; the load balancer places it on app server 2, and app server 2 now makes a second connection and queries the same data.

The db has now done twice the work for a single client. This pattern is surprisingly common and as the user count grows this duplication significantly degrades cache performance and increases load on the db. It only gets worse as the user count increases.
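The scenario above can be reproduced with a toy simulation. It assumes, as the example does, that each app server keeps a private in-process cache, and compares naive round robin with sticky sessions. `Database`, `AppServer`, and `simulate` are hypothetical names invented for this sketch; integer client IDs are used so Python's `hash()` gives a stable server choice (a real load balancer would typically pin clients with a cookie or an IP hash).

```python
from itertools import count


class Database:
    """Stand-in database that only counts the queries it receives."""

    def __init__(self):
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return f"row:{key}"


class AppServer:
    """App server with a private, in-process cache (the design in the example)."""

    def __init__(self, db):
        self.db = db
        self.local_cache = {}

    def handle(self, client_id):
        if client_id not in self.local_cache:
            # Cache miss on *this* server: it must ask the database itself.
            self.local_cache[client_id] = self.db.fetch(client_id)
        return self.local_cache[client_id]


def simulate(num_servers, client_ids, rounds, sticky):
    """Return how many database queries the given traffic pattern causes."""
    db = Database()
    servers = [AppServer(db) for _ in range(num_servers)]
    next_request = count()  # global request counter used by round robin
    for _ in range(rounds):
        for client_id in client_ids:
            if sticky:
                # Sticky session: a client is always routed to the same server.
                server = servers[hash(client_id) % num_servers]
            else:
                # Naive round robin: requests rotate across servers regardless of client.
                server = servers[next(next_request) % num_servers]
            server.handle(client_id)
    return db.queries


print(simulate(2, [0], rounds=2, sticky=False))  # 2: both servers query the db
print(simulate(2, [0], rounds=2, sticky=True))   # 1: second request hits the cache
```

One client making two requests costs two database queries under round robin and one with sticky sessions, which is the doubling described above. The effect only appears when each server's cache is private; it goes away if the servers share a cache.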

It's a common scenario for someone who doesn't understand the point of putting a load balancer in front of a stateful application, perhaps. Not for anyone trying to solve a traffic problem.

No idea where you are getting your ideas from, but this is an absolutely uninformed example of how NOT to do something in an ideal way.

I'm really interested now which one of you is right. While the other person put some effort and gave a lot of actual information, you just come off as arrogant. Still, maybe you're right. Care to elaborate why?

I'm not one of these 2 arguing. But in general the app servers don't do caching or state handling.

You cache things in a third, external cache such as Redis or Memcached. So if a user connects to app server 1 and then to app server 2, they will both grab cached info from Redis. No extra db calls required. This has been the basic way of doing things, even with old-school WordPress sites, forever. You also store session data in there or in the db.
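The shared-cache pattern described here is usually called cache-aside: check the shared cache, fall back to the database on a miss, then populate the cache for everyone. In the sketch below a plain dictionary stands in for Redis or Memcached; `SharedCache` and its `get`/`set` methods are a hypothetical in-process interface that only mimics the shape of those calls, not a real client library.

```python
class Database:
    """Stand-in database that counts the queries it receives."""

    def __init__(self):
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return f"row:{key}"


class SharedCache:
    """In-memory stand-in for an external cache such as Redis or Memcached.

    Hypothetical interface: a real deployment would talk to the cache over
    the network, so every app server sees the same entries.
    """

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class AppServer:
    """Stateless app server: no private cache, everything goes through the shared one."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache

    def handle(self, key):
        value = self.cache.get(key)        # 1. try the shared cache first
        if value is None:
            value = self.db.fetch(key)     # 2. miss: read from the database
            self.cache.set(key, value)     # 3. populate it for every server
        return value


db = Database()
cache = SharedCache()
server1, server2 = AppServer(db, cache), AppServer(db, cache)
server1.handle("post:1")   # miss: goes to the database
server2.handle("post:1")   # hit: served from the shared cache
print(db.queries)          # prints 1
```

Because the app servers hold no state of their own, the load balancer can send a client to any of them without causing extra database reads.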

And even if you weren't caching externally like this, databases use up a lot of memory to cache tons of data. So even if the same query hits the db, the second hit would probably still be hot in memory and return super fast. It's not double the load. At least with Postgres this is the case, and it's what Lemmy uses.

Definitely this. I use PostgreSQL (which Lemmy uses on the backend) for an enterprise-grade system that has anywhere from 700-1k users at any given point in time, and it also takes in several million messages from external systems throughout the day. PostgreSQL is excellent at caching data in memory. I've got the code for that system up in another window while I write this.

At this point in time, it doesn't look like Lemmy is using any form of an L2 cache like Redis or Memcached. The only single point of failure (that's not horizontally scalable) looks like the pict-rs server that Lemmy is using for image hosting. If anything, that could easily be swapped over to use something S3-compatible and easily hosted using something like MinIO locally, or even directly off of B2 or Linode cloud storage (doesn't charge for requests).

Not trying to come off as arrogant, but definitely incensed when I catch armchair tech heroes throwing wildly inaccurate information out there as if it were fact. This person has a very basic understanding of some terminology here, and zero idea how it is applied in the real world. Hate to see it.

Putting a load balancer up in front of a few servers isn’t going to do anything to their database

Yes it is. Suddenly your database exists in more than one location, which is extremely difficult to do with reasonable performance.

load balancing doesn't automatically mean “do something stupid like spin up 100 app servers when we normally use 3”

Going from 3 to 100 is trivial. Going from one to any number greater than one is the challenge.

All you’ve described is a need for a db proxy in the off chance that Lemmy code has horrible access patterns for db transactions.

Define "horrible"?

When Lemmy, or any server-side software, is running on a single server, you generally upgrade the hardware before moving to multiple servers (because upgrading is cheaper). When that stops working and you need to move to another server, it's possible everything in the database that matters (possibly the entire database) will be in L4 cache in the CPU. Not even in RAM; a lot of it will be in the CPU.

When you move to multiple servers, suddenly a lot of frequent database operations are on another server, which you can only reach over a network connection. Even the fastest network connection is dog slow compared to L4 cache, and it doesn't really matter how well written your code is: if you haven't done extensive testing in production with real-world users (and actively malicious bots) placing your systems under high load, you will have to make substantial changes to deal with a database that is suddenly hundreds of millions of times slower.

The database might still be able to handle the same number of queries per second, but each individual query will take a lot longer, which will have unpredictable results.
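The point that throughput can stay the same while each query gets slower can be made concrete with some arithmetic. The sketch assumes a request that issues 20 queries one after another, and the timings are invented round numbers chosen for illustration, not measurements; real figures depend entirely on hardware, network, and query mix. It also applies Little's law (average concurrency equals arrival rate times time in the system) to show that the same queries-per-second load keeps more queries open at once when each one takes longer.

```python
# Illustrative figures only (assumptions, not measurements):
QUERY_EXEC_MS = 0.2      # time the database spends executing one query
LOCAL_RTT_MS = 0.02      # round trip over a local socket on the same machine
NETWORK_RTT_MS = 0.5     # round trip to a database on another machine


def request_latency_ms(sequential_queries, rtt_ms, exec_ms=QUERY_EXEC_MS):
    """Time one request spends waiting on the database when its queries run
    one after another (each must finish before the next is sent)."""
    return sequential_queries * (exec_ms + rtt_ms)


def in_flight_queries(queries_per_sec, rtt_ms, exec_ms=QUERY_EXEC_MS):
    """Little's law: average concurrency = arrival rate x time each query is open."""
    return queries_per_sec * (exec_ms + rtt_ms) / 1000.0


for label, rtt in (("same machine", LOCAL_RTT_MS), ("over the network", NETWORK_RTT_MS)):
    print(f"{label}: {request_latency_ms(20, rtt):.1f} ms per request, "
          f"{in_flight_queries(5000, rtt):.1f} queries in flight at 5000 q/s")
```

With these assumed numbers, the 20-query request goes from about 4.4 ms to about 14 ms of database wait, and at the same 5000 queries per second the database holds about 3.5 queries open at once instead of about 1.1. More open queries mean more connections, locks, and memory in use simultaneously, which is one source of the unpredictable behaviour mentioned above.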

The other problem is you need to make sure all of your servers have the same content. Being part of the Fediverse though, Lemmy probably already has a pretty good architecture for that.

Friend...you have zero idea what you're talking about. Database existing in multiple locations? What in the hell are you even talking about? Single db instance, multiple app servers, and single LB. You are absolutely not experienced with this type of work, and need to just stop because you're making an ass out of yourself with these wild ideas that have no basis in practical deployments. Stop embarrassing yourself.

What if your application has to know a state? Say for certain write requests, only one instance is allowed to process those as it needs a cache that it can somewhat consistently rely on?

(Granted, I wouldn't know why something like Lemmy needs that. But we had that problem at work, and it was a pain to solve while also supporting multiple app instances.)

In that case, I'd use a message queue: RabbitMQ, or Pulsar, which I use at work. Multiple subscribers (using the same subscription name) consume from one queue of messages that need to be processed. One worker picks a message up, processes it, and marks it as processed. The worker then either passes it into a different queue for further processing or persists it to the DB.
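The "one queue, several workers, each message handled once" setup is often called the competing-consumers pattern. The sketch below uses Python's standard `queue.Queue` and threads as an in-process stand-in; it is not the RabbitMQ or Pulsar API, and `run_workers` is a hypothetical helper. Real brokers add what this toy lacks: persistence, acknowledgements that survive crashes, and redelivery of unacknowledged messages.

```python
import queue
import threading


def run_workers(messages, num_workers):
    """Competing consumers: several workers pull from one shared queue and
    each message is taken by exactly one of them."""
    q = queue.Queue()
    for m in messages:
        q.put(m)

    processed = []            # stand-in for "persist to the DB"
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                msg = q.get_nowait()     # take the next message, if any
            except queue.Empty:
                return                   # queue drained: this worker stops
            with lock:
                processed.append((name, msg))
            q.task_done()                # loosely analogous to acknowledging

    threads = [threading.Thread(target=worker, args=(f"w{i}",)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


done = run_workers(range(10), num_workers=3)
print(len(done))  # prints 10: every message processed exactly once
```

Which worker handles which message varies from run to run; what the pattern guarantees is that no message is processed twice and none is skipped.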

The nice thing with this is when using the Pulsar paradigm, you can have multiple subscriptions to the same message queue, each one carrying its own state as to which messages are processed or not. So say I get one message from an external system, have one system that is processing it right now, and need to add a second system. In that case I just use a different subscription name for the second system, and it works independently of the first with no issues.
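The independent-subscriptions idea can be modelled as one shared log with a separate read position per subscription name. The toy below is not the Pulsar client API; `Topic`, `receive`, and `ack` are hypothetical names, and it assumes a new subscription starts at the beginning of the log (real brokers typically let you choose where a new subscription starts).

```python
class Topic:
    """Toy append-only topic where each named subscription keeps its own cursor.

    Hypothetical class for illustration; not a real broker client.
    """

    def __init__(self):
        self.log = []
        self.cursors = {}   # subscription name -> index of next unprocessed message

    def publish(self, msg):
        self.log.append(msg)

    def receive(self, subscription):
        """Return the next unprocessed message for this subscription, or None."""
        pos = self.cursors.get(subscription, 0)
        return self.log[pos] if pos < len(self.log) else None

    def ack(self, subscription):
        """Mark the current message as processed for this subscription only."""
        pos = self.cursors.get(subscription, 0)
        if pos < len(self.log):
            self.cursors[subscription] = pos + 1


topic = Topic()
for event in ("a", "b", "c"):
    topic.publish(event)

print(topic.receive("system-1"))  # a
topic.ack("system-1")
print(topic.receive("system-1"))  # b
print(topic.receive("system-2"))  # a: the second subscription has its own cursor
```

Adding a second system is just a new subscription name; acknowledging on one subscription never moves the other's position.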

A distributed lock of any form would work: Memcached, Redis, etcd, a read access mechanism in an MQ, etc. Only one process would work on whatever it is at a time. Simple.
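A common shape for such a lock is "set this key only if it does not already exist, with an expiry", which is what Redis's `SET key value NX PX <ms>` provides. The sketch below is a single-process, in-memory stand-in that mimics those semantics; `LockStore` is a hypothetical class, not a Redis client. In a real shared store the check-and-set must be atomic on the server side. Note that the expiry is a trade-off: it frees the lock if a worker crashes, but a worker that runs longer than the TTL can lose the lock while still working.

```python
import time
import uuid


class LockStore:
    """In-memory stand-in for a shared lock store such as Redis.

    acquire() mimics "set if not exists, with expiry": it succeeds only if
    the key is absent or its previous lock has expired.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._locks = {}  # key -> (owner token, expiry time)

    def acquire(self, key, ttl_sec):
        now = self.clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None                   # someone else holds a live lock
        token = uuid.uuid4().hex          # unique token identifying this owner
        self._locks[key] = (token, now + ttl_sec)
        return token

    def release(self, key, token):
        held = self._locks.get(key)
        # Only the owner may release. A worker whose lock already expired (and
        # was taken by someone else) must not delete the new owner's lock.
        if held is not None and held[0] == token:
            del self._locks[key]
            return True
        return False


store = LockStore()
token = store.acquire("process-pending-writes", ttl_sec=30)
if token is not None:
    try:
        pass  # only this process does the protected work here
    finally:
        store.release("process-pending-writes", token)
```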