Amazon Web Services outage causing issues for many major apps, websites worldwide

MicroWave@lemmy.world · 1 day ago

despite_velasquez@lemmy.world · 1 day ago

I was going to reference this Medium article on how paying extra for “uptime” and reliability isn’t just a 50-100% premium, but often a seven- or eight-figure premium. Figures that size can make or break a business model.

The irony is that Medium, a site hosting mostly static content, is still down due to the AWS outage.

NuXCOM_90Percent@lemmy.zip · 1 day ago

I’ve (presumably) seen that article in the past. It is very much something that every company needs to evaluate for themselves but my experience? That premium (scaled for company size) is usually within discussion range of being worth it. In large part because… finding the kind of staff who can get within even a three nines range of uptime is a major undertaking and something you can generally only test when shit is hitting the fan.

So you tend to get analyses both ways. “If everything goes right, X is much cheaper than Y”. Which falls apart when you realize that it is someone else’s problem to make Y viable and you can always sue the fuck out of them if they screw up badly enough. So it ends up being “Well, our forecasts are that X would cost 4 million a year and Y would cost 6 million a year… but we save N on compensation and we don’t have to deal with staffing or HR… Eh, we’re probably out a million but our revenue can handle it and then we don’t have to deal with it”
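To make that back-of-the-envelope math concrete, here is a rough sketch in Python. The 4 million and 6 million figures come from the comment above; the 1 million in staffing savings is only an assumption, picked so the result matches the "probably out a million" conclusion. The function name and parameters are made up for illustration.

```python
def net_premium(x_cost: float, y_cost: float, staffing_savings: float) -> float:
    """Extra yearly cost of picking Y over X, after subtracting
    the compensation you no longer pay by offloading the work."""
    return y_cost - x_cost - staffing_savings


# X and Y costs are from the comment; staffing_savings (the "N") is assumed.
premium = net_premium(x_cost=4_000_000, y_cost=6_000_000, staffing_savings=1_000_000)
print(f"Net yearly premium for Y: {premium:,.0f}")
```

A positive result is what you pay for it being someone else's problem; a negative one would mean Y is cheaper outright once staffing is counted.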

despite_velasquez@lemmy.world · edited 1 day ago

I think most companies don’t have a three nines SLA with their customers, yet were sold the idea that cloud (… and then serverless) was the right decision for them.

When the initial cloud migration happened, I saw a handful of startups and scale-ups go bankrupt doing lift and shift.

Don’t get me wrong, I agree with what you’re saying. My point is more about the tribal consensus that was built in the tech community around 2016-2018 that the cloud is the future, for everyone, and that managing your own infrastructure is being a brute.

NuXCOM_90Percent@lemmy.zip · edited 24 hours ago

With ANY of the “nines” notation, a good rule of thumb is to move the decimal point two or three spots to the left. But it is more the mindset and planning built around that.
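For anyone who doesn't have the nines table memorized, a quick sketch of what each level allows in downtime per year. This is just the standard arithmetic (N nines means an unavailable fraction of 10^-N), assuming a 365-day year; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime in hours per year for an availability
    of `nines` nines (3 -> 99.9%, 5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Three nines works out to a bit under nine hours a year, and five nines to a little over five minutes.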

For MOST companies and products? “Shit broke, we’ll fix it in the morning” is 100% reasonable. But when you are big enough that you are on the front page of downdetector? EVERYONE comes out of the woodwork to insist you are horrible and mismanaged and blahdy blah blah. Which might actually have investor implications.

Which is the other aspect. If I am going to pay a hosting company (with my business hat on), I need some uptime metrics/guarantees. Violate those and I am expecting compensation. Violate those sufficiently and my bosses are going to have the lawyers see how much of our bad Q2 we can blame on the hosting company. And… there is a lot of value in the department head’s responsibility being sending angry emails to Amazon rather than figuring out what employee is getting fired… and if it is them.

But yeah. I saw someone else make the joke of “on -> off -> on -> off” prem cycles but… that is kind of reality.

When you are three people in a garage moonlighting in a way that you can pretend this all started after you all turn in your notice (seriously. One of my favorite goofing off activities is to check the repository of any company that actually has an open source project and laugh at how many MRs and commits were apparently done over the course of a month and TOTALLY weren’t rewritten for legal purposes)? Your very initial proof of concept might be a server in a closet but you very rapidly will shift to “the cloud” because you don’t have the resources for a full time IT person to even manage the VPS, let alone a rack.

Then, as you get bigger, you hire that sysadmin and either switch to a VPS or on prem to save money. Then you get bigger still and realize that sysadmin’s team is as big as engineering and start looking for ways to cut/offload costs… which tends to be The Cloud.

Then you get sufficiently large and have the kinds of customers where data protection is a full time job and start realizing it makes more sense to hire back the two or three competent sysadmins you had and rent some place in a data center. And THEN you get big enough that the entire world notices if you go down for 5 minutes and…

And… yeah. A lot of companies will fail at one of those points. Partially because they don’t run the numbers and factor in their runway. But also because those tend to be when work structures are most taxed. A whiteboard where people grab index cards works until you have teams that might not be fully staffed by people with double digit percentages of the company stocks and so forth.

