Amazon Web Services outage causing issues for many major apps, websites worldwide

MicroWave@lemmy.world · 19 hours ago


ieatpwns@lemmy.world · 19 hours ago

I love it when everyone’s web services rely on a single company that ppl hate

TheFogan@programming.dev · 18 hours ago

That’s why real web companies go multi cloud. Letting us divide up our dependency among the big 3 companies everyone hates (Google, Amazon and Microsoft).

CucumberFetish@lemmy.dbzer0.com · 15 hours ago

Ah, multi cloud. Before we had an outage when AWS went down, now we have an outage every time any cloud provider goes down

mic_check_one_two@lemmy.dbzer0.com · edit-2 14 hours ago

Yup.

“Oh hey, we have a partial outage right now due to AWS. Most of the site still works, but users can’t log in, because that is handled on AWS… Which means users can’t access the “most of the site” that still works… But at least we can say we weren’t completely down during the outage.”

ramble81@lemmy.zip · 14 hours ago

every time any cloud provider goes down

Um, you’re not doing it right if that’s the case. Multi-cloud redundancy reduces downtime, it doesn’t increase downtime unless you’re doing something stupid like dividing up your SPOFs
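To put rough numbers on that distinction, here is a minimal sketch. The 99.9% availability figure is a made-up placeholder, not any provider's real number, and the math assumes the providers fail independently, which real outages do not always respect.

```python
from math import prod


def serial_availability(*parts: float) -> float:
    """Every part must be up: each one is a single point of failure."""
    return prod(parts)


def redundant_availability(*replicas: float) -> float:
    """Up if at least one replica is up (assumes independent failures)."""
    return 1 - prod(1 - a for a in replicas)


# Hypothetical availability of a single cloud provider (assumption).
provider = 0.999

# Login on cloud A, everything else on cloud B: both must be up.
split_spofs = serial_availability(provider, provider)

# Full stack deployed on both clouds with working failover.
true_failover = redundant_availability(provider, provider)

print(f"one provider:      {provider:.6f}")
print(f"SPOFs split in two: {split_spofs:.6f}")
print(f"real redundancy:    {true_failover:.6f}")
```

Splitting dependencies across two providers multiplies the availabilities, so it ends up worse than a single provider. Running the whole stack on both multiplies the failure probabilities instead, which is where the improvement comes from.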

CucumberFetish@lemmy.dbzer0.com · 13 hours ago

Yes, if everything is configured correctly, then an outage can be either avoided completely or reduced to the time it takes to switch the traffic over (and scale up and so on). But this is not the case as can be seen from today’s events.

TheFogan@programming.dev · 8 hours ago

Well guess it depends which companies we are talking about, and which if any have multi cloud redundancy… and if they are configured correctly. Obviously if someone has a multi cloud environment, configured perfectly to be unaffected, we just wouldn’t know what they have in the cloud, because the lack of outage wouldn’t generate any news.

HeyJoe@lemmy.world · 13 hours ago

Lol, this has always been the thing we told them would happen. When the company pitched why it’s so good and how they have a lot of data centers all over so there will never be a large-scale outage, we all laughed and said ok, sure, not yet, but we’ll see. Well, here it is!

isgleas@lemmy.ml · 19 hours ago

Move to the cloud they said. More resilient services they said.

dan1101@lemmy.world · 18 hours ago

Not only that, but you’re at their mercy for the cost. My servers don’t suddenly become more expensive to run because Amazon or Microsoft want more money.

despite_velasquez@lemmy.world · 19 hours ago

- pay a premium for the same amount of CPU & RAM you could’ve gotten from your classic VPS provider
- fire your sysadmins and hire DevOps Engineers at 2x the salary
- raise a ticket with AWS and wait every time you need more than 5 instances of the same compute type
- oops, our biggest DC got knocked offline, here’s some compute time credits

the cloud has been the biggest scam in tech history

Infernal_pizza@lemmy.dbzer0.com · 19 hours ago

But when something breaks you can blame someone else and that’s all that really matters

Ironfist79@lemmy.world · 18 hours ago

Applications still need to be built to be fault tolerant across multiple AZs. Amazon even tells you this because things can and will go down.

NuXCOM_90Percent@lemmy.zip · edit-2 16 hours ago

fire your sysadmins and hire DevOps Engineers at 2x the salary

If you aren’t managing your own hardware you need far fewer sysadmins.

And while I was fortunate enough to work at a place where the sysadmins understood they were in the service industry, the vast majority of orgs do not have any meaningful communication between the departments which invariably becomes adversarial over time.

DevOps is inherently inefficient because you are paying people to do two jobs (which is why so many companies don’t and instead just add more and more responsibilities to the devs who are dumb enough to reveal they have basic linux skills…). But it is also, time and time again, one of, if not THE, most effective ways to actually have “IT” be aware of the needs and use cases of development.

raise a ticket with AWS and wait every time you need more than 5 instances of the same compute type

There is definitely a range where that can bite you and my experience is that the various cloud providers are very good about giving you special service if you are constantly hovering there. But the vast majority of companies either don’t need to scale past that or do it “once” during the initial deployment.

pay a premium for the same amount of CPU & RAM you could’ve gotten from your classic VPS provider (…) oops, our biggest DC got knocked offline, here’s some compute time credits

You’re paying extra for the stability and uptime as well as the customer service. And, speaking from experience, the vast majority of “traditional” VPS companies “guarantee” Five Nines by having a skeleton crew with a pager app on their phones who may or may not even be awake during their shifts. And the best you get is an acknowledgement and stalling until the main staff come back up.

Skimming downdetector? The worst of it was around 0300 east coast time with large mitigations by 0700. It looks like it is spiking again as of 1000 though.

By all means. Rake Bezos’s shitty face across the coals and get a massive credit on your bill. But if we are judging a company by their service at their worst? This is NOTHING compared to potentially multi-day outages and needing to manually migrate our own services because “We can’t get anyone out to the data center until Wednesday” and so forth.

VPSes are spectacular for hobbyist use and company websites even for places in The City. But if you are providing a nation or even world wide service? You want a proper data center with support staff which more or less means “the cloud”. And while I think a LOT of companies should take that into consideration? Pretty much everything on downdetector et al that actually impacts people has very good reasons to not just buy a few nodes and manage it themselves.

despite_velasquez@lemmy.world · 15 hours ago

I was going to reference this Medium article on how paying extra for “uptime” and reliability isn’t just a 50-100% premium, but many times a 7-8 figure premium. These are make-or-break-a-business-model type figures.

The irony is that Medium, a site hosting mostly static content, is still down due to the AWS outage.

NuXCOM_90Percent@lemmy.zip · 14 hours ago

I’ve (presumably) seen that article in the past. It is very much something that every company needs to evaluate for themselves but my experience? That (scaling for company size) premium is usually within discussion range of being worth it. In large part because… finding the kind of staff who gets within even a Three Nines range of uptime is a major undertaking and something you generally only can test when shit is hitting the fan.

So you tend to get analyses both ways. “If everything goes right, X is much cheaper than Y”. Which falls apart when you realize that it is someone else’s problem to make Y viable and you can always sue the fuck out of them if they screw up badly enough. So it ends up being “Well, our forecasts are that X would cost 4 million a year and Y would cost 6 million a year… but we save N on compensation and we don’t have to deal with staffing or HR… Eh, we’re probably out a million but our revenue can handle it and then we don’t have to deal with it”

despite_velasquez@lemmy.world · edit-2 13 hours ago

I think most companies don’t have a three nines SLA with their customers, yet were sold the idea that cloud (… and then serverless) should be the right decision for them.

During the initial cloud migration I saw a handful of startups and scale-ups go bankrupt doing lift and shift.

Don’t get me wrong, I agree with what you’re saying, my point is more towards the tribal consensus that was built in the tech community around 2016-2018 that the cloud is the future, for everyone, and that managing your own infrastructure is being a brute

NuXCOM_90Percent@lemmy.zip · edit-2 13 hours ago

With ANY of the “nines” notation, a good rule of thumb is to move the decimal point two or three spots to the left. But it is more the mindset and planning built around that.
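For reference, this is what the nominal “nines” figures work out to before applying any rule-of-thumb discount. It is a minimal sketch of the standard arithmetic only, assuming a 365-day year; the function name is just illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(nines: int) -> float:
    """Downtime budget per year for an availability of N nines.

    Three nines means 99.9% up, so 0.1% (10 ** -3) of the year may be down.
    """
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    hours = allowed_downtime_hours(n)
    print(f"{n} nines: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Three nines leaves a bit under nine hours a year, while five nines leaves only a few minutes, which is why the staffing behind the promise matters more than the number itself.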

For MOST companies and products? “Shit broke, we’ll fix it in the morning” is 100% reasonable. But when you are big enough that you are on the front page of downdetector? EVERYONE comes out of the woodwork to insist you are horrible and mismanaged and blahdy blah blah. Which might actually have investor implications.

Which is the other aspect. If I am going to pay a hosting company (with my business hat on), I need some uptime metrics/guarantees. Violate those and I am expecting compensation. Violate those sufficiently and my bosses are going to have the lawyers see how much of our bad Q2 we can blame on the hosting company. And… there is a lot of value in the department head’s responsibility being sending angry emails to Amazon rather than figuring out what employee is getting fired… and if it is them.

But yeah. I saw someone else make the joke of “on -> off -> on -> off” prem cycles but… that is kind of reality.

When you are three people in a garage moonlighting in a way that you can pretend this all started after you all turn in your notice (seriously. One of my favorite goofing off activities is to check the repository of any company that actually has an open source project and laugh at how many MRs and commits were apparently done over the course of a month and TOTALLY weren’t rewritten for legal purposes)? Your very initial proof of concept might be a server in a closet but you very rapidly will shift to “the cloud” because you don’t have the resources for a full time IT person to even manage the VPS, let alone a rack.

Then, as you get bigger, you hire that sysadmin and either switch to a VPS or on prem to save money. Then you get bigger still and realize that sysadmin’s team is as big as engineering and start looking for ways to cut/offload costs… which tends to be The Cloud.

Then you get sufficiently large and have the kinds of customers where data protection is a full time job and start realizing it makes more sense to hire back the two or three competent sysadmins you had and rent some place in a data center. And THEN you get big enough that the entire world notices if you go down for 5 minutes and…

And… yeah. A lot of companies will fail at one of those points. Partially because they don’t run the numbers and factor in their runway. But also because those tend to be when work structures are most taxed. A whiteboard where people grab index cards works until you have teams that might not be fully staffed by people with double digit percentages of the company stocks and so forth.

Sal@lemmy.world · 15 hours ago

Turns out when you put half of the entire internet at the whims of one billionaire who literally uses vibe coding on it, if anything fails, half of the fucking internet in the WORLD goes FUBAR.

I can’t even use my credit app that I use to buy things right now because it’s fucked.

Boozilla@lemmy.world · 19 hours ago

Meanwhile, AWS claims to host 7,500 different US government agencies (state, fed, local).

etchinghillside@reddthat.com · 18 hours ago

I want to see the email that goes out that says you still need to work even though you’re not getting paid and your applications aren’t functioning.

