An Amazon Web Services outage has been causing major disruptions around the world. The service provides remote computing services to many apps, websites, governments, universities and companies.

On Downdetector, a website that tracks online outages, users reported issues with Amazon Alexa, Amazon Prime, Snapchat, Ring, Roblox, Fortnite, online broker Robinhood, the McDonald’s app and many others.

  • despite_velasquez@lemmy.world · 1 day ago

    - pay a premium for the same amount of CPU & RAM you could’ve gotten from your classic VPS provider
    - fire your sysadmins and hire DevOps Engineers at 2x the salary
    - raise a ticket with AWS and wait every time you need more than 5 instances of the same compute type
    - oops, our biggest DC got knocked offline, here’s some compute time credits

    the cloud has been the biggest scam in tech history

    • Ironfist79@lemmy.world · 1 day ago

      Applications still need to be built to be fault tolerant across multiple AZs. Amazon even tells you this because things can and will go down.
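
      As a concrete illustration, here is a minimal boto3 sketch of what “built to be fault tolerant across multiple AZs” can mean in practice: an Auto Scaling Group whose subnets sit in different Availability Zones, so losing one AZ still leaves capacity running elsewhere. The group name, launch template name and subnet IDs are hypothetical placeholders, and this assumes the template and subnets already exist.

      ```python
      import boto3

      autoscaling = boto3.client("autoscaling", region_name="us-east-1")

      # Hypothetical launch template and subnet IDs; substitute your own.
      # Each subnet lives in a different Availability Zone, so the group
      # keeps serving traffic if a single AZ goes down.
      autoscaling.create_auto_scaling_group(
          AutoScalingGroupName="web-tier",
          LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
          MinSize=2,
          MaxSize=6,
          VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
      )
      ```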

    • NuXCOM_90Percent@lemmy.zip · 1 day ago

      “fire your sysadmins and hire DevOps Engineers at 2x the salary”

      If you aren’t managing your own hardware you need far fewer sysadmins.

      And while I was fortunate enough to work at a place where the sysadmins understood they were in the service industry, the vast majority of orgs don’t have any meaningful communication between the departments, and the relationship invariably becomes adversarial over time.

      DevOps is inherently inefficient because you are paying people to do two jobs (which is why so many companies don’t and instead just add more and more responsibilities to the devs who are dumb enough to reveal they have basic Linux skills…). But it is also, time and time again, one of, if not THE, most effective ways to actually have “IT” be aware of the needs and use cases of development.

      “raise a ticket with AWS and wait every time you need more than 5 instances of the same compute type”

      There is definitely a range where that can bite you and my experience is that the various cloud providers are very good about giving you special service if you are constantly hovering there. But the vast majority of companies either don’t need to scale past that or do it “once” during the initial deployment.
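
      For what it’s worth, the quota dance can at least be scripted rather than clicked through the console every time. A rough sketch using boto3’s Service Quotas API; the quota code below is the one I believe covers running On-Demand Standard instances (measured in vCPUs), so verify it before relying on it:

      ```python
      import boto3

      quotas = boto3.client("service-quotas", region_name="us-east-1")

      # Check the current limit for On-Demand Standard instances (value is in vCPUs).
      current = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
      print("Current vCPU limit:", current["Quota"]["Value"])

      # Ask for more. Small bumps are often approved automatically; bigger jumps
      # go to a human for review, which is where the "raise a ticket and wait"
      # experience comes from.
      quotas.request_service_quota_increase(
          ServiceCode="ec2",
          QuotaCode="L-1216C47A",
          DesiredValue=256.0,
      )
      ```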

      “pay a premium for the same amount of CPU & RAM you could’ve gotten from your classic VPS provider (…) oops, our biggest DC got knocked offline, here’s some compute time credits”

      You’re paying extra for the stability and uptime as well as the customer service. And, speaking from experience, the vast majority of “traditional” VPS companies “guarantee” Five Nines by having a skeleton crew with a pager app on their phones who may or may not even be awake during their shifts. And the best you get is an acknowledgement and stalling until the main staff come back in.

      Skimming Downdetector? The worst of it was around 0300 Eastern time with large mitigations by 0700. It looks like it is spiking again as of 1000 though.

      By all means. Rake Bezos’s shitty face across the coals and get a massive credit on your bill. But if we are judging a company by their service at their worst? This is NOTHING compared to potentially multi-day outages and needing to manually migrate our own services because “We can’t get anyone out to the data center until Wednesday” and so forth.

      VPSes are spectacular for hobbyist use and company websites even for places in The City. But if you are providing a nationwide or even worldwide service? You want a proper data center with support staff which more or less means “the cloud”. And while I think a LOT of companies should take that into consideration? Pretty much everything on Downdetector et al that actually impacts people has very good reasons to not just buy a few nodes and manage it themselves.

      • despite_velasquez@lemmy.world · 1 day ago

        I was going to reference this Medium article on how paying extra for “uptime” and reliability isn’t just a 50-100% premium, but many times a 7-8 figure premium. These are the kind of figures that make or break a business model.

        The irony is that Medium, a site hosting mostly static content, is still down due to the AWS outage.

        • NuXCOM_90Percent@lemmy.zip · 1 day ago

          I’ve (presumably) seen that article in the past. It is very much something that every company needs to evaluate for themselves but my experience? That (scaling for company size) premium is usually within discussion range of being worth it. In large part because… finding the kind of staff who gets within even a Three Nines range of uptime is a major undertaking and something you generally can only test when shit is hitting the fan.

          So you tend to get analyses both ways. “If everything goes right, X is much cheaper than Y”. Which falls apart when you realize that it is someone else’s problem to make Y viable and you can always sue the fuck out of them if they screw up badly enough. So it ends up being “Well, our forecasts are that X would cost 4 million a year and Y would cost 6 million a year… but we save N on compensation and we don’t have to deal with staffing or HR… Eh, we’re probably out a million but our revenue can handle it and then we don’t have to deal with it”
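
          To make the back-of-envelope concrete, the same trade-off as a few lines of Python, using the purely illustrative figures from the comment above (none of these are real prices):

          ```python
          # Illustrative yearly figures only, taken from the example above.
          self_hosted_infra = 4_000_000   # X: run it yourself (hardware / VPS / colo)
          cloud_bill        = 6_000_000   # Y: managed cloud hosting
          staff_savings     = 1_000_000   # compensation, recruiting and HR overhead avoided with Y

          effective_gap = (cloud_bill - staff_savings) - self_hosted_infra
          print(f"Cloud ends up ~${effective_gap:,} more per year once staffing is netted out")
          # -> Cloud ends up ~$1,000,000 more per year once staffing is netted out
          ```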

          • despite_velasquez@lemmy.world · 1 day ago

            I think most companies don’t have a three nines SLA with their customers, yet were sold the idea that the cloud (… and then serverless) was the right decision for them.

            When the initial cloud migration wave happened, I saw a handful of startups and scale-ups go bankrupt doing lift and shift.

            Don’t get me wrong, I agree with what you’re saying; my point is more about the tribal consensus that built up in the tech community around 2016-2018 that the cloud is the future, for everyone, and that managing your own infrastructure makes you a brute.

            • NuXCOM_90Percent@lemmy.zip · 24 hours ago

              With ANY of the “nines” notation, a good rule of thumb is to move the decimal point two or three spots to the left. But it is more the mindset and planning built around that.
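
              The arithmetic behind the “nines” is simple enough to sketch; this just prints the yearly downtime budget each level allows:

              ```python
              # Downtime allowed per year at each "nines" level of availability.
              MINUTES_PER_YEAR = 365 * 24 * 60

              for nines in range(2, 6):
                  availability = 1 - 10 ** -nines              # e.g. 3 nines -> 0.999
                  downtime_min = MINUTES_PER_YEAR * (1 - availability)
                  print(f"{availability:.4%} uptime -> ~{downtime_min:,.1f} min/year of downtime")
              # 99.0000% -> ~5,256.0 min/year (about 3.6 days)
              # 99.9000% -> ~525.6 min/year   (about 8.8 hours)
              # 99.9900% -> ~52.6 min/year
              # 99.9990% -> ~5.3 min/year
              ```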

              For MOST companies and products? “Shit broke, we’ll fix it in the morning” is 100% reasonable. But when you are big enough that you are on the front page of downdetector? EVERYONE comes out of the woodwork to insist you are horrible and mismanaged and blahdy blah blah. Which might actually have investor implications.

              Which is the other aspect. If I am going to pay a hosting company (with my business hat on), I need some uptime metrics/guarantees. Violate those and I am expecting compensation. Violate those sufficiently and my bosses are going to have the lawyers see how much of our bad Q2 we can blame on the hosting company. And… there is a lot of value in the department head’s responsibility being sending angry emails to Amazon rather than figuring out what employee is getting fired… and if it is them.
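
              The “compensation” part is usually codified as service credits in the SLA. A toy sketch of how those clauses are typically shaped (the tiers here are made up for illustration, not AWS’s actual terms):

              ```python
              def service_credit(monthly_uptime_pct: float, monthly_bill: float) -> float:
                  """Illustrative SLA credit: the further below the promised uptime,
                  the larger the share of the bill refunded as credit."""
                  if monthly_uptime_pct >= 99.99:
                      rate = 0.00
                  elif monthly_uptime_pct >= 99.0:
                      rate = 0.10
                  elif monthly_uptime_pct >= 95.0:
                      rate = 0.30
                  else:
                      rate = 1.00
                  return monthly_bill * rate

              print(service_credit(98.7, 120_000))   # -> 36000.0 back as credit
              ```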

              But yeah. I saw someone else make the joke of “on -> off -> on -> off” prem cycles but… that is kind of reality.

              When you are three people in a garage moonlighting in a way that you can pretend this all started after you all turn in your notice (seriously. One of my favorite goofing off activities is to check the repository of any company that actually has an open source project and laugh at how many MRs and commits were apparently done over the course of a month and TOTALLY weren’t rewritten for legal purposes)? Your very initial proof of concept might be a server in a closet but you very rapidly will shift to “the cloud” because you don’t have the resources for a full time IT person to even manage the VPS, let alone a rack.

              Then, as you get bigger, you hire that sysadmin and either switch to a VPS or on prem to save money. Then you get bigger still and realize that sysadmin’s team is as big as engineering and start looking for ways to cut/offload costs… which tends to be The Cloud.

              Then you get sufficiently large and have the kinds of customers where data protection is a full time job and start realizing it makes more sense to hire back the two or three competent sysadmins you had and rent some place in a data center. And THEN you get big enough that the entire world notices if you go down for 5 minutes and…

              And… yeah. A lot of companies will fail at one of those points. Partially because they don’t run the numbers and factor in their runway. But also because those tend to be when work structures are most taxed. A whiteboard where people grab index cards works until you have teams that might not be fully staffed by people with double digit percentages of the company stocks and so forth.