Google Cloud Global Loadbalancer Outage (cloud.google.com)
299 points by brian-armstrong on July 17, 2018 | 131 comments



Why can't Google just use UTC timestamps? Or at least include them alongside their "US/Pacific" timestamps.

I don't want to remember if "US/Pacific" currently has daylight savings or not.

It's a very strange decision, especially considering that GCP has numerous regions outside of "US/Pacific".


Just another case of their idiotic culture leaking into their products. An early "design" decision led to all their production kit being set to US/Pacific, triggering frequent DST bugs, and last I heard (many years ago) it was still the case.

Coordinating a tz change over a network of that size is probably infeasible, so may as well push the pain on to customers


Slack's status page does the same thing, and perhaps more insultingly, has a link that says "See in your timezone" that just links to timeanddate.com. Really? None of the 1000s of JS devs at Slack could figure out how to import a date/time library into their status page?


I was quite amused when I clicked that link once; I thought the same thing.


Just because you can import a library to do something doesn’t mean you should. Why bloat the software even more for a function that is used rarely (i.e. during outages) and by very few people?


Slack is not particularly known for caring about bloat in their pages.


Or apps for that matter. Having used the app for 6 months, I just keep a browser tab open these days; no use in feeding all the RAM to Slack.


We are going to find out that Hynix is a major stockholder in Slack. Mystery solved.


Or for rarely having outages...


That’s why I said “why should they bloat it even more”. I mean it’s already bloated and you want them to add more stuff just for an obscure little thing?


Because it's the right thing to do for the customer.


Seems reckless to use non-UTC time for anything as an engineering-focused team, but regardless, they should definitely add some code to convert those times back to UTC for the rest of us.


Those were decisions made a long time ago, when there was only a single region/datacenter, since that was where you were running your servers. And then they remained.

AWS has a similar issue internally, IIRC (oncall around DST switching times was fun with teams in three continents), but they are making baby steps to fix it. It's harder to fix than it sounds once in place though...


> Those were decisions made a long time ago, when there was only a single region/datacenter, since that was where you were running your servers. And then they remained.

Yes, and it's still a <bad adjective here> decision. Actually, there should be no decisions on those things. It doesn't matter if it is your personal blog. Store events in UTC, no ifs or buts. Then convert for display. You better be able to fully articulate why you are not using UTC (scheduling real world events?), otherwise use UTC. Make it a tattoo if that helps. Again, UTC. Yes I know you are a single person and you don't have more than one server today, still use UTC. Thank your former self later on when you have to match events across timezones...

Text representation is another example. Use Unicode for strings unless you can clearly explain why that's the wrong representation. (Which Unicode? Pick UTF-8, unless you have a reason not to, in which case you'll probably know what the reason is).
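
For the record, the pattern costs almost nothing to follow. A rough Python sketch of "store UTC, convert for display" (America/Los_Angeles standing in for US/Pacific; purely illustrative):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # Store: always record the event in UTC.
    event_utc = datetime.now(timezone.utc)

    # Display: convert to whatever zone the viewer actually cares about.
    for tz in ("America/Los_Angeles", "Europe/Berlin", "UTC"):
        print(event_utc.astimezone(ZoneInfo(tz)).isoformat())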


You are talking about a decision that was made 18-20 years ago. It was a bad decision in hindsight, but back then I'd imagine they were likely optimizing for a very different problem.

Even the UTF-8/Unicode example doesn't hold 18-20 years ago. UTF-8 was completely niche in 2003, and there were a number of flavors of Unicode, not all of which fell back to ASCII as nicely as UTF-8 does. If I were distributing software in 2003, I would have stuck to ASCII unless I was sure I needed Unicode, and even then I would likely have opted for UTF-16, because that was what was supported at the time.

I wouldn't question anyone's decision to use UTF-16 in 2018 for software that has been supported since 2003.


> You are talking about a decision that was made 18-20 years ago.

Hang on.. what are we talking about?

According to Wikipedia, AWS was launched in 2006, which is only 12 years ago, and GCP was launched in 2008, only 10 years ago.

You could argue that, even in 2006, UTF-8 wasn't firmly enough established as the "winner" to be a clear best choice, but defending a decision to avoid Unicode entirely at that point, for a company with global ambitions, just seems disingenuous.

For the question of UTC, claiming naivete also lacks credibility, if only because Y2K brought the topic of date/time representation in computers to the forefront of everyone's minds, and kept it there, starting around 20 years ago.


Shock! Wikipedia is wrong! The first AWS service was AWIS, still going strong today, although only usable by root accounts, not IAM.

This misinformation is perpetuated by AWS evangelists who claim SQS was the first service.

https://aws.amazon.com/about-aws/whats-new/2004/10/04/introd...


Although what you point out may be technically (and marketingly) correct, it's not a stretch to consider the current, cloud-computing AWS not to have started until the availability of S3 and/or EC2.

More importantly, for these services to run roughshod, two years later, over the technical assumptions of something like AWIS also seems reasonable to expect. If AWIS isn't integrated under IAM, that's pretty telling.


GCP inherited the decision from Google's existing way of setting up servers, and they'd been setting up servers that way for a long time... or are you suggesting GCP should have run its servers with a different tz configuration than the rest of the company, resulting in even worse issues?


Comments like yours really piss me off. You have absolutely no self-awareness or empathy, and come off sounding like a know it all.

Yes, we all know that storing date/time data is hard; they should use a standard timezone (without DST), and use time-locales to localize the timestamps to the user. We get it.

The reason this knowledge is so well known is because previous programmers (without this helpful guidance) made many different technical decisions for many reasons, and learned the hard way. They then publicized their failures for our benefit.

What was the last technical decision you made that turned out to be the wrong one? Did you make the wrong decision maliciously? Or were you doing the best you could with the information available? Would you really find it helpful for somebody to come along a decade later with the attitude of "these guys have no idea what they were doing, I learned about these antipatterns years ago!" (usually followed by "better rewrite everything")?

I'll go out on a limb and say most people here haven't started from a single datacenter and grown to a global service. These engineers deserve the benefit of the doubt unless and until more information is available. They certainly don't need armchair analysis from some randos on Hacker News.

/rant


>> we all know that storing date/time data is hard

But... it's not hard. That's the point. This isn't a hard decision, and it has nothing to do with regional vs. global. The lesson was learned by the entire industry decades before the company existed, so a modern engineering team making that mistake, and never fixing it, is definitely open to criticism.


Is it really believable that this engineering team started out by not thinking that their service (GCP) would become a global service, irrespective of how many datacenters they started with and where?


> Comments like yours really piss me off. You have absolutely no self-awareness or empathy, and come off sounding like a know it all.

Oh dear. This kind of personal attack is not ok on HN and we ban accounts that do this. Could you please not do it again, regardless of how un-self-aware another comment seems or how you feel in response?

Your comment would be fine without the first and last sentences. It's hard never to write such things but it's always possible to edit them out. That's what I do when something like that slips out.

https://news.ycombinator.com/newsguidelines.html

Edit: unfortunately it looks like you've been uncivil before, e.g. https://news.ycombinator.com/item?id=16928159. Please make sure not to in the future.


I see where you're coming from on the first sentence. I should have left it at saying his comment seemed to lack empathy. The last sentence I still think is fine though (I include myself in the list of people who don't have the perspective to criticize their decisions, I don't have the context to be constructive). I won't defend the other comment.

It's easy to be unnecessarily rude in comments, but I would have said the same were we talking in person. I don't believe in hiding behind anonymity to say something you wouldn't say otherwise.

All the same, I appreciate the input.


Sure, but we're just talking about the status site, it's not hard at all to just convert the timestamps to UTC for display using any server-side or JS library. Even better to show UTC and the current browser reported timezone alongside.


This goes deep into the UX of their systems - some of the most frequent actions in Stackdriver, like jumping to a time within the logs, are a mess. There are no quick actions (like "jump to now"), and THEN I have to switch to a non-intuitive "World / Greenwich Mean Time" config first every time (the default is Pacific, there is no UTC), and THEN I have to twist my brain into the weird US date format and AM/PM times across two different controls. None of that is changeable or configurable; that interface is just a pain to use. My bug report apparently also didn't cut it.


"Idiotic culture" is a bit harsh.


You clearly haven't spent much time with App Engine


What’s the source of this? It sounds intriguing; I wonder what led them to that decision.


I wasn't around at the time, but I suspect it's mainly because the first datacenters were on the West Coast. See e.g. the story at http://www.dodgycoder.net/2013/02/googles-fiber-leeching-cap...


Just ask any current/former employee, it's practically folklore by now (i.e. knowledge older than 6 months)


Ten years ago it was jokingly called GST: not Gulf Standard Time, but Google Standard Time. I don't remember outages, but the monitoring graphs when switching to DST, where plots go back in time, did look very funky. Something like this:

https://imgur.com/a/ZRtt9

The leap second troubles in 2005 (GFS?) that led to NTP smearing were more memorable.


Google Standard Time

Wow. Just... wow. Thanks for that. That phrase brought back a flood of old memories I'd forgotten.


This is fairly common for large tech companies that started in that timezone. The decisionmaking process goes something like this:

1. You MUST use a single, consistent timezone. Using the local timezone is a mess: is it the user's local? The host's local? What about server logs, which are typically text files where the timezone can't be displayed dynamically? What if you aggregate server logs across timezones? What if you ssh into a machine in a different timezone?

2. The logical timezone to use is the one in which the vast majority of your employees work, since having people subtract 7 or 8 hours all the time is annoying.

You could argue that for customer-facing status updates like this, Google should use a dynamic timezone. That's fair, but I'm sure Google internally uses that status dashboard, so it could be very confusing and complicate coordination to mitigate the problem. I'd argue that customers would prefer that the problem get fixed slightly sooner over having to do a once-a-year-ish timezone conversion.


> 1. You MUST use a single, consistent timezone.

UTC was adopted 51 years ago.


As I said: UTC is a reasonable choice, except for the fact that most of your employees do not live in UTC and doing mental translation is annoying.


Store UTC, convert on display to any timezone, either automatically based on geolocation or using a stored preference. This is application design 101 at this point.

Converting old logs isn't hard either since ISO timestamp formats include timezones, or just pick a date for the switchover. The only scenario that gets tricky is with scheduling where users expect local times that carry over daylight-savings boundaries, but it doesn't apply to most apps.
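
To make that concrete, here's a rough Python sketch of the conversion, assuming the stored timestamps at least carry an offset (the log line is hypothetical):

    from datetime import datetime, timezone

    # Hypothetical stored timestamp with a US/Pacific offset (-07:00 during DST).
    raw = "2018-07-17T12:15:00-07:00"

    # fromisoformat (Python 3.7+) understands the offset, so normalizing is one call.
    parsed = datetime.fromisoformat(raw)
    print(parsed.astimezone(timezone.utc).isoformat())  # 2018-07-17T19:15:00+00:00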


Who is doing mental translation and why? We have computers for a reason. Are we talking about logs, software applications? Storage should be in UTC, data can be converted for display. I agree that logs can be a bit annoying when they are in UTC, but at least you have a consistent value across all servers, and you know what the offset is.

Also, if your office happens to be in a timezone which observes DST, you are still screwed. Now you think your times are in localtime, but in fact they are offset by one hour. This can lead to very "fun" debugging sessions and time wasted.

It can be a minor annoyance, but you know what can be an even greater annoyance? Undoing a bad decision which has percolated across several data stores.


> Storage should be in UTC, data can be converted for display

Sometimes we need to speak about events (verbally), or share screenshots of graphs, dashboards, or other data. We can make a habit of always stating the time zone when we talk, and make sure every dashboard/graph contains a time offset.

Or we can just pick a single arbitrary time zone and always use that.

There are a few internal tools at Google that assume your local time zone (I'm in EDT) -- those are far more confusing.

> Also, if your office happens to be in a timezone which observes DST, you are still screwed.

Has this really been a problem? A date+time is unambiguous. I can only see this being confusing if California follows through with abolishing DST.


>> Or we can just pick a single arbitrary time zone and always use that.

...so pick UTC then?

Why is that any different than how it is right now for anyone not in the Pacific timezone?


I think they forgot to tell the PC kids. I seem to recall years of Unix/Linux installs prompting (tempting) me not to use UTC. I also seem to remember dual booting Windows and Linux and the two doing a time tug-of-war, with one OS preferring the Real Time Clock (RTC) to be local time.


Windows likes local time on the RTC. Way back when DOS 1.0 was released (every story about Windows begins with this) there was one way to set the system clock: the user would type in the time. They would use the local time, of course. Consistency with other systems didn't mean diddly squat on a PC/XT with 256 KB of RAM, a floppy drive, and a keyboard. Fast forward a million tech years and we all live with the legacy.


Windows uses the local time zone (for historical reasons, explained in the sibling comment), so if you dual boot it makes sense to keep the same behavior; otherwise you will have time issues when booting the other OS. If you only use Linux, just go for UTC.


1) hah, Yahoo used US/Pacific for US servers, GB/London for European servers, and I never had access to Asia Pacific, but I assume something different there. Also, the ads system and related graphing tools ran on US/Eastern, except without DST (because 24-hour days are important).

There was some talk of changing to UTC everywhere, but I'm not sure if that happened. The next two big companies I went to at least managed to run US/Pacific everywhere. But the problem is always, by the time someone who knows better comes along, it's a PITA to change.


> it's a PITA to change

Pain/annoyance is a recurring theme in this sub-thread as far as reasons go for not switching or not using UTC in the first place.

I strongly suspect, however, that this is one of those situations where the negative aspect is over-estimated.

Other times, merely debating/discussing it draws attention to the pain and serves to amplify it (or its perception). Just quietly implementing it risks offending key players, but the vast majority won't even know the difference.


I'm glad they even included the timezone. An alarming number of tech posts don't even bother with that.


Perhaps they tend to think the US/Pacific time zone is the most significant timezone?


It's probably not outrageous to assert that there are more devs looking up GCP information from that time zone than any other.


The global load balancer is one of the best offerings from GCP but we are always concerned about the single point of failure it causes.

Unfortunately this isn't the first time it's broken, and it's starting to look like a bad choice. We use a CDN in front, so we were able to switch traffic around, but it seems it's better to do the load balancing ourselves too instead of using GLB.


Any load balancer is always a single point of failure. This is why lots of folks go multi-cloud.

You should also be looking at multi-CDN.


> Any load balancer is always a single point of failure.

Yes, but Google takes this to another level, all their load balancers are one single giant point of failure.

> This is why lots of folks go multi-cloud.

Or, just use AWS, where each region is almost entirely independent (some specific APIs are global by their nature, like creating a new S3 bucket), but you shouldn't be using those APIs on your run-path.

Yes, it is less convenient (and it must be less convenient for AWS to build things this way than for Google who seem to prioritise their development experience over customer availability), but new features from AWS are making it easier to deploy to multiple regions and use cross-region failover.


CDNs are single vendors but usually have a resilient and distributed system without a SPOF within, meaning traffic can always route through even with major network outages. This has been a core part of their design for a long time and I can't remember the last time an entire CDN has been down.

GLB is also distributed but clearly is a single "service" within GCP with weaknesses that can take down the entire thing, probably because of the complexity and integration involved in its architecture. It's fast and convenient but the reliability just isn't there yet.


I have yet to read solid case studies of real multi-cloud at scale. E.g. Active-Active load-balanced between multiple providers. Plenty of companies use multiple clouds, but, it tends to be a line of business decision. Team A likes AWS, Team B like Azure etc.


Agreed, multi-cloud is rather costly and means only using infrastructure services like VMs and storage.


It's fairly common. No need to call it multi cloud, companies have had multiple datacenters in multiple countries for a long time.

The only challenge is that you need global geographic load balancers, and that means F5 and at least a million dollars.

Also, you will find out later that some dependencies were only running in a single location and services failed with the datacenter.


I think that’s quite an exaggeration - there are quite a few DNS providers who can do intelligent DNS LB for you and don’t care what your backend is (gcp/AWS/azure/onprem). Won’t even cost you a million bucks.


I'm not sure I agree with that; it's trivial to deploy 2 or more independent ALBs in parallel in AWS if it's a concern for you, with health checks on each from Route 53 DNS.


And multi-DNS provider. :)

Everyone always forgets about DNS, until Dyn dies.


Fricken Dyn... I pay them too... I'd move my DNS services to Google, but I'd have put them behind a load balancer... heh


DNS as a load balancer doesn't have any SPOFs if you do it right (but if you need to coordinate advanced load balancing between multiple providers, that's a bit painful)


I wonder if floating IPs can offer a solution here, combined with multi-cloud LBs?


Grrr. So much for global redundancy.

What is going to be faster: updating DNS records with TTL 3600 to point to a single data center, or Google fixing their problem?

We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?


AWS engineer here, I was lead for Route 53.

We generally use 60 second TTLs, and as low as 10 seconds is very common. There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable. We actually see faster convergence times with DNS failover than with BGP/IP Anycast. That's probably because DNS TTLs decrement concurrently on every resolver holding the record, while BGP advertisements have to propagate serially, network by network.

The way DNS failover works is that the health checks are integrated directly with the Route 53 name servers. In fact, every name server is checking the latest healthiness status every single time it gets a query. Those statuses are basically a bitset, being updated /all/ of the time. The system doesn't "care" or "know" how many health statuses change each time; it's not delta-based. That's made it very very reliable over the years. We use it ourselves for everything.

Of course the downside of low TTLs is more queries, and we charge by the query unless you ALIAS to an ELB, S3, or CloudFront (then the cost of the queries is on us).
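
For anyone wanting to try this, a rough boto3 sketch of a PRIMARY/SECONDARY failover pair with a 60 second TTL (the zone ID, health check ID, names, and addresses are placeholders, not real values):

    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z123EXAMPLE"          # placeholder
    HEALTH_CHECK_ID = "hc-primary-example"  # placeholder

    def upsert_failover(name, role, ip, health_check_id=None):
        """Create/update one half of a PRIMARY/SECONDARY failover pair."""
        record = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,            # "PRIMARY" or "SECONDARY"
            "TTL": 60,                   # low TTL, per the comment above
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={"Changes": [{"Action": "UPSERT",
                                      "ResourceRecordSet": record}]},
        )

    upsert_failover("www.example.com.", "PRIMARY", "203.0.113.10", HEALTH_CHECK_ID)
    upsert_failover("www.example.com.", "SECONDARY", "198.51.100.20")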


_most_ of the traffic will move in response to DNS changes, but there's always a group of resolvers that keep your old IPs for an unreasonable amount of time. I've taken machines out of DNS rotations with short TTLs (I think 5 minutes, but maybe 1 hour) and had some amount of traffic on them for weeks. After a reasonable amount of time, too bad for them, but when I can work behind a 'real' load balancer it's nice to be able to actually turn off the traffic.


Interesting, thank you. So a potential mitigation strategy could look like this:

- Route 53 failover record
  * primary record: Google global load balancer IP
  * secondary record: Route 53 Geolocation set (really need that latency)
- Elastic Load Balancer record per region
  * routes to the mirror region's GCP IP address (ELB's application load balancer seems to be able to point to AWS external IPs)
  * optionally spin up mirror infrastructure in AWS

Seems brittle. Does Azure support global load balancing with external IPs?

Does anyone have such (or similar) setup actually in production? How did it work today?


That would work, and Azure Traffic Manager does support external IPs. CDNs like Cloudflare and Fastly also have built-in load-balancing where they use their internal routing tables for faster propagation.


I haven't been able to make an ELB target be an external IP. What did you mean by "ELB's application load balancer seems to able to point to AWS external IPs"?


https://aws.amazon.com/elasticloadbalancing/details/#details

IP addresses as Targets

You can load balance any application hosted in AWS or on-premises using IP addresses of the application backends as targets. This allows load balancing to an application backend hosted on any IP address and any interface on an instance. You can also use IP addresses as targets to load balance applications hosted in on-premises locations (over a Direct Connect or VPN connection), peered VPCs and EC2-Classic (using ClassicLink). The ability to load balance across AWS and on-prem resources helps you migrate-to-cloud, burst-to-cloud or failover-to-cloud.

Looks like you need an active VPN connection to access external IPs.


That feature requires you to use a private IP address, so if you have a VPN or Direct Connect to another location you could load balance across locations. In the case of the global load balancers those will be public addresses though.

"The IP addresses that you register must be from the subnets of the VPC for the target group, the RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), and the RFC 6598 range (100.64.0.0/10). You cannot register publicly routable IP addresses."

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...
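
As an illustration of that restriction, a boto3 sketch of registering an IP target (the ARN and address are placeholders); it only works because the address is private:

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholder ARN; the target group must have been created with TargetType="ip".
    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/abc123"

    # A private (RFC 1918) address reachable over VPN/Direct Connect is accepted;
    # "all" marks the target as living outside the load balancer's VPC.
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": "10.0.42.7", "Port": 443, "AvailabilityZone": "all"}],
    )

    # A publicly routable address (e.g. a GCP global load balancer IP) would be
    # rejected by the API, per the documentation quoted above.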


> Of course the downside of low TTLs is more queries

I was diagnosing a networking issue with one of our service providers last Friday. For whatever indeterminate reason, DNS responses from R53 took upwards of 10-15 seconds to return. While I appreciate that the non-configurable default TTL of 60 seconds for ELB is not plucked out of thin air, and that the actual issue seemed to be on the service provider's side, that limit seems far too low for medium/high-latency networks. I wish it were configurable.

What's worse is it looks like it's our site that is the issue, so we get the complaints and I have to dig through wireshark logs.


If you have a very high latency network, say a satellite link, make sure that your near-side resolver supports pre-fetching! Unbound is a good choice.


I run unbound on my own workstations. It's so lightweight, you'd never even notice it, but it definitely makes browsing a little more snappy.


>There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable

I've done a few unplanned DNS failovers, and I agree with this. What can be real trouble, though, is if you're running a B2B app and your customers' corporate networks can be configured in any strange way. I've met real network admins who think they need high TTLs everywhere in order to protect themselves from root DNS DDoSes.


There really are locations where DNS resolvers don't honor TTL.

For example, the public wifi in the last Hackspace in Munich I visited did not honour my 10 second TTL.

But in my opinion there aren't enough of them to justify not using short TTLs. It's their problem after all if they don't honour websites' settings: Then they will see downtime when nobody else does.


Do you mean it was cached for longer than 10 seconds? Was it Freifunk? It might be worth writing to them to ask what their caching setup is.


I've always thought TTL less than 60 seconds should be avoided, as some upstream DNS resolvers will ignore values less than 60 seconds and use a default long value. You are saying this is not true and a TTL of 10 seconds can safely be used?


I think it's safe, based on a lot of experiments. We use 5 seconds for S3 ...

    ;; ANSWER SECTION:
    s3.us-east-1.amazonaws.com. 5	IN	A	52.216.165.117
One of the biggest, highest-traffic systems on the internet!


Traffic is coming back. Looks like Google fixed their load balancer problem within 28 minutes.


Hmm, this comment says it’s been happening for hours (below). Maybe their status page isn’t accurate

https://news.ycombinator.com/item?id=17552693


I'd wait for further details from the status page, but as a GCP employee (for whatever that claim's worth on the internet), I'm not seeing evidence of an issue earlier than 12:15 PDT.


I'm sure it was a cascading event, similar to the one Amazon had yesterday on their own site. It started small until it snowballed and affected everyone.


> We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?

Well, I would avoid any of GCP's 'Global' features, they are an availability risk.

AWS's approach is to rather have inter-region replication, and there are lots of new features that support this.


Our services have just become available again, initial downtime starting at 20:25 GMT, uptime returning at 20:56 GMT

EDIT: App Engine and Kubernetes environments in our EU region appeared to go down, Compute Engine was okay.


This has been causing issues for us for the last couple of hours. We migrated a web application over to GAE last month. It's been rock solid and this is the first issue we've had. Happy to know we can sit back and let them sort it out.


Still having issues as of 20:20 UTC - Firebase Database messages are not being delivered to clients. It's been a full hour so far. The delivery rate also had a minor drop earlier, around 03:00 AM to 05:00 AM, but no incident was listed.


I suspect this is what caused the technical difficulties with HQ Trivia.


Could have sworn I was going crazy last night; was attempting to write a scraper to download attachments / tickets from pivotal tracker for a yearly accounting report. Everything was working fine, then suddenly not. At around 2am I gave up and just put it down to the 'gods' telling me to go to bed.

This was the first article I saw after I woke up, somewhat refreshed. /feeling-vindicated


Wow, 2 AM, some persistence. But as someone who's very sensitive to sleep deprivation (I get violently ill if I don't sleep well for more than 2 days in a row), I'd insist on not making it a habit. Especially if you are in your 30s, it has bad after-effects. PM for more info.


Can't PM with no contact info in your profile - email is not public - this is Hacker News.


It's interesting and scary, mostly scary, that we are now almost at the point where Google going down equals the Internet going down.


There are several large cloud platforms like Azure, Google Cloud and AWS. Major sites run on these platforms, but have good reasons to do so.

It's like airplane crashes vs. car crashes. The chance that you get in an accident with your car is significantly higher, but it's usually the airplane accident that ends up as "BREAKING NEWS". It's just a matter of impact.


> The chance that you get an accident with your car is significantly higher

There's a difference between fatalities and accidents. Fatalities are surprisingly low. Plus, air travel is only significantly safer per km; per journey it's not at all safer [1].

> Its just a matter of impact.

Poor choice of words.

[1]: https://en.wikipedia.org/wiki/Aviation_safety#Transport_comp...


> Air travel is only significantly safer per-km, per-journey it's not at all safer

Of course there are far more car journeys taken than plane ones, so the GP's point stands (and since the point was just to make an analogy, unless you disagree that plane crashes get far more news coverage than car crashes, I'm not sure what the point of disagreement was).

(note those numbers also come from the UK, 1990-2000. If you took the numbers from the US, the car fatalities would be significantly higher (per capita or per km) and if they were from the last decade the air fatalities far lower)


From what I've seen, these were all down: Spotify, Twitch, Discord, Snapchat, Pivotal Tracker, Workana, Codementor


I was watching Twitch through the entirety of this outage, so I'm pretty sure Twitch was fine.


Wasn’t Twitch bought by Amazon years ago? I'd be surprised if they're using GCP.


Pokemon Go, Rocket League's MP servers.


I think AWS going down globally would be even worse.


Agreed. But while this outage was global, saying "GCP went down globally" overstates it.

AWS doesn't have a comparable service to the global version of GCP's load balancer service, which is what went down. Other GCP services were only affected to the extent that they use global load balancers for ingress, which varies by service.

For most GCP services, the ingress method is under the user's control and a switch to regional load balancers in one or more regions (whether split through GeoDNS or through round-robin) would have been a workaround.

Admittedly one point of global load balancers is to be able to mitigate a lot of other outages... I guess the secondary lesson here is to keep a short TTL on top-level DNS entries which point to the load balancers, and ideally have two DNS providers in the mix too.


> AWS doesn't have a comparable service to the global version of GCP's load balancer service

Intentionally, because then customers relying on a service like this would have regular global outages.

This is the 3rd global outage GCP has had in less than a year. As far as I remember, AWS hasn't had one in many years (but, too many people put everything in us-east-1, so a short single-region outage - which happens very seldom - seems to take down half the internet).


To be honest, I hadn't even noticed. What popular sites went down?


If you check the other comments you'll see that Spotify, Snapchat, Discord, etc. were all affected. We're not talking about "sites" but about any application or site built on Google's infrastructure, even partly.


Not "every application," just applications making use of the GLB or services built upon it.


Pokemon Go


I'm not sure if it is related but Google developer console (for Android apps) stopped displaying total number of installs today


Snapchat, Spotify, Discord,...


Thanks for the information. I see now why I hadn't noticed. I don't Snapchat anymore, I switched to Google Music, and I haven't adopted Discord. I guess I'm not in with the cool crowd anymore!


Apparently Snapchat, Spotify, and some game servers were affected too.


Definitely noticed the issues on SnapChat.


Thought my TV was playing up (YouTube).


discord as well


Is there an easy way to see what cloud based resources sites are using? For instance, PushBullet is down now and this might explain why. Several other sites affected, too.


https://stackshare.io/ is decent for crowd-sourced info on what the stack for a given company/product is comprised of. Not guaranteed to be accurate, so take it with a grain of salt, of course.


If the web front-end or any external web-accessible APIs are being served directly via App Engine, you can see it in the HTTP response headers. Look for x-google-appengine-


Generally, look up the IPs (dig <hostname>) and then attempt a host/whois on that IP. For PushBullet it's hidden behind Cloudflare, so it's hard to see easily; you could try to find an exposed endpoint (which they won't have if they've done things well).
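
If you want to script the check, something like this works with just the Python standard library (the hostname is an example; PTR records and headers are heuristics, not guarantees):

    import socket
    import urllib.request

    host = "www.example.com"  # substitute the site you're checking

    # Forward-resolve, then reverse-resolve: cloud providers' PTR records often
    # give the game away (e.g. *.bc.googleusercontent.com, *.amazonaws.com).
    for ip in sorted({ai[4][0] for ai in socket.getaddrinfo(host, 443)}):
        try:
            ptr = socket.gethostbyaddr(ip)[0]
        except OSError:
            ptr = "(no PTR record)"
        print(ip, ptr)

    # Response headers can also hint at the platform (Server, Via,
    # x-google-appengine-*, x-amz-*) unless a CDN in front hides them.
    with urllib.request.urlopen(f"https://{host}") as resp:
        for key in ("Server", "Via"):
            print(key, ":", resp.headers.get(key))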


My Chrome extension (IPvFoo) displays all the subresource IP addresses in a table:

https://chrome.google.com/webstore/detail/ipvfoo/ecanpcehffn...

(There's a right-click option to look up each address on bgp.he.net, but that doesn't happen automatically, for privacy.)


curl -I www.domain.com

This might give you a clue, PushBullet seems to run behind CloudFlare though.


I got a "Could not connect to POP server 'pop.gmail.com:995' (SSL=ssl): SSL connect attempt failed because of handshake problems" error.


Is there a single cloud that can match uptime of a single top tier DC?


Uhm, no, because a cloud service spans several data centers and features a lot more moving parts than a single solid data center does.

Let's turn this around: could a typical data center + server uptime + service uptime equal that of a major cloud provider?


In my experience, yes, but I have too few data points.


Yup, discord impacted for me


Yes, yes we are.


time to spread resources across clouds? or would that be too expensive?


This one appears to be just global LBs, so a load balancer in AWS/Azure that hits the actual backends over a VPN or something would have worked - but that's just this case.


Even Google Cloud's non-global load balancers, which live within a single region, wouldn't have hit this particular outage.


Yep, we had non-global LBs in europe-west-3 and I could not detect any outage.


I sure wish we hosted on AWS today. Work sucked big time.


Something no one has mentioned yet, could it be that the engineering force at Google is no longer what it used to be?


Because of one data point? Seriously?


I think OP was jovially referring to a similar comment regarding Amazon because of their recent Prime day outage.



