Is it possible to achieve the 6 nines?

DayInProgress@lemmy.ko4abp.com to Selfhosted@lemmy.world – 21 points –

The 6 nines mean that an ideal service should have 99.9999% uptime, right?

That's almost 32 seconds of downtime in a year!
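As a quick sanity check on that figure, here is a minimal sketch of the downtime budget per year for a given number of nines. It assumes a 365-day year; the function name `downtime_seconds_per_year` is just an illustrative choice.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, assuming a non-leap year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines of availability (6 -> 99.9999%)."""
    unavailability = 10 ** -nines  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


# Print the yearly downtime budget for two through six nines.
for n in range(2, 7):
    print(f"{n} nines: {downtime_seconds_per_year(n):,.1f} s/year")
```

Six nines comes out at about 31.5 seconds per year, and each nine removed multiplies the budget by ten.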

If so, how much would it cost to do it? (Let's say it is a marketplace site with 1000 daily users.)


It’s really hard. And really expensive. I used to work in five-nines environments, life or death type use cases, and my rule of thumb was that you double your cost for every extra nine you add.
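To put that rule of thumb into numbers, here is a rough sketch. The doubling factor comes from the comment above; the three-nines baseline and the function name `relative_cost` are assumptions made purely for illustration, not real pricing data.

```python
def relative_cost(nines: int, baseline_nines: int = 3, factor: float = 2.0) -> float:
    """Cost multiplier relative to the baseline, assuming cost grows by
    `factor` for every extra nine (a rule of thumb, not a measured figure)."""
    return factor ** (nines - baseline_nines)


# Show how the multiplier grows from three to seven nines.
for n in range(3, 8):
    print(f"{n} nines: {relative_cost(n):.0f}x the cost of 3 nines")
```

Under these assumptions, going from three nines to six nines is an 8x cost increase for a 1000x smaller downtime budget.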

When we got to five nines it was multiple hot standbys with a custom control and orchestration plane - literally custom hardware we had to build. This was for local installations, so not modern cloud environments (it was over a decade ago), but many of the challenges are similar, like session handling, transmission replay and caching, locking, clashing, routing, jitter, latency etc.

It sounds very stressful to work where this level of availability is required, even more so given that a small error can have giant consequences.

Thanks for your answer!

I'm sorta surprised that it's only 2x per nine. Six or seven nines sounds ridiculous.

6 nines is really really difficult. It’s hard to estimate costs without specific requirements, but a marketplace site with 1000 daily users means you’re expecting about 1 user per minute, which isn’t a lot. I’d imagine you could get by with the cheapest cloud hosting.

The real problem is that most major cloud providers don’t offer 6 nines. Even AWS only offers credits below 99.5%, so you’d want to avoid locking yourself into a single provider. My best suggestion is to have a small/cheap server with each of the big names and load balance/round robin between them.
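Here is a minimal sketch of why spreading across providers helps on paper. It assumes provider outages are fully independent and that whatever sits in front (load balancer, DNS) never fails, which is optimistic; the 99.5% figure is only the number quoted above, and the function names are illustrative.

```python
import math


def combined_availability(availabilities):
    """Availability of redundant providers, assuming independent failures:
    the service is down only when every provider is down at the same time."""
    p_all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - p_all_down


def nines(availability: float) -> float:
    """Express an availability as a (fractional) number of nines."""
    return -math.log10(1.0 - availability)


# Compare one, two, and three providers at 99.5% each.
for count in (1, 2, 3):
    a = combined_availability([0.995] * count)
    print(f"{count} provider(s) at 99.5%: {a:.7%} (~{nines(a):.1f} nines)")
```

In practice outages are correlated (shared DNS, shared upstreams, your own deploys), so treat the result as an upper bound rather than something you would actually measure.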

This.

At some point, you need to be able to quantify the risk to your business before you can do this.

For instance, if your business earns $10 per transaction, and you perform 100 transactions per second, the difference between five and six nines (about 315 seconds vs 32 seconds of downtime per year) is roughly $284,000; nowhere near enough to justify the added investment.

However, if you perform ten thousand transactions per second, the difference is roughly $28.4M. Which, frankly, is still not enough for the added staff and infrastructure costs that would be required. Enough for other mitigations, sure, but not six nines.
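The arithmetic behind this kind of estimate can be sketched as follows. It assumes a 365-day year, a constant transaction rate, and that every transaction missed during downtime is lost for good; the function name `revenue_at_risk` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assuming a non-leap year


def revenue_at_risk(tps: float, revenue_per_txn: float,
                    nines_low: int, nines_high: int) -> float:
    """Yearly revenue difference between two availability levels, assuming
    a constant transaction rate and no recovery of missed transactions."""
    extra_downtime = SECONDS_PER_YEAR * (10 ** -nines_low - 10 ** -nines_high)
    return tps * revenue_per_txn * extra_downtime


# The two scenarios from the comment: 100 and 10,000 transactions per second.
for tps in (100, 10_000):
    print(f"{tps:>6} tps: ${revenue_at_risk(tps, 10.0, 5, 6):,.0f} per year")
```

Comparing that number against the extra staff and infrastructure cost is the quantification step described above.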

For reference, Visa is pretty widely quoted as doing 24,000 transactions per second. Suffice it to say, as Notorious stated, it is really really difficult.

This is such an open ended question that nobody can give you an accurate answer as there are so many factors that need to be considered.

How big is the site? How much data is being stored? What is the DB backend? How does it handle failover between DB servers in the event the primary goes down? Is it being hosted in a cloud service, your own DC or a cupboard? Does the hosting location already have redundant power and internet connectivity? Are they diverse, so if one provider fails the other one will remain online? You’d need to maintain separate sites in case one location goes down due to a major event like an earthquake, so you need to replicate data in real-time to your DR location.

There are so many factors to consider and I haven’t named them all. Regardless of what your answer is, it would be very expensive to maintain any server at 6 9’s level of availability. For a marketplace website with only 1000 visitors a day there is no need for that level of availability because there would be times in the day that nobody would be on it. Only a marketplace the size of Amazon would consider that level of availability.

Whoa, there are a lot more factors than I was thinking.

I am not planning to host a site; I was just thinking about how complex it would be to build such a website. It makes sense that only the biggest sites on the web have that level of availability.

Thanks for your answer and for your time!

Sure. It just depends on how flexible your definition of "downtime" is.

Exactly. I think about an email hosting service. What does downtime mean exactly? Is the service down when you can send but cannot receive mail? Or when the service receives mail but it can't deliver it to your inbox? Nailing this down can be difficult, especially (in my experience) with the ones who decide where $$$ goes.

Possible, yes. Cost-effective with a valid business case, probably not. Every extra 9 has diminishing returns: it'll cost you exponentially more than the previous 9, and the money saved from potential downtime is reduced. Like you said, six nines allows 32 seconds of downtime a year; how much money is that for the business?

You're pretty much looking at multiple geographically diverse Tier 4 datacenters with N+2 or even N+3 redundancy all the way up and down the stack, while also implementing diversity wherever possible so no single vendor of anything can cause you to not be operational.

Even with all that though, you'll eventually get wrecked by DNS somewhere somehow, because it's always DNS.