Google Cloud Global Loadbalancer Outage (cloud.google.com)
299 points by brian-armstrong on July 17, 2018 | 131 comments



Why can't Google just use UTC timestamps? Or at least include them alongside their "US/Pacific" timestamps.

I don't want to remember if "US/Pacific" currently has daylight savings or not.

It's a very strange decision, especially considering that GCP has numerous regions outside of "US/Pacific".


Just another case of their idiotic culture leaking into their products. An early "design" decision led to all their production kit being set to US/Pacific, triggering frequent DST bugs, and last I heard (many years ago) it was still the case.

Coordinating a tz change over a network of that size is probably infeasible, so may as well push the pain on to customers


Slack's status page does the same thing, and perhaps more insultingly, has a link that says "See in your timezone" that just links to timeanddate.com. Really? None of the 1000s of JS devs at Slack could figure out how to import a date/time library into their status page?


I was quite amused when I clicked that link once; I thought the same thing.


Just because you can import a library to do something doesn’t mean you should. Why bloat the software even more for a function that is used rarely (i.e. during outages) and by very few people?


Slack is not particularly known for caring about bloat in their pages.


Or apps for that matter. Having used the app for 6 months, I just keep a browser tab open these days; no use in feeding all the RAM to Slack.


We are going to find out that Hynix is a major stockholder in Slack. Mystery solved.


Or for rarely having outages...


That’s why I said “why should they bloat it even more”. I mean it’s already bloated and you want them to add more stuff just for an obscure little thing?


Because it's the right thing to do for the customer.


Seems reckless to use non-UTC time for anything as an engineering-focused team, but regardless, they should definitely add some code to convert those times back to UTC for the rest of us.


Those were decisions made a long time ago, when there was only a single region/datacenter, since that was where you were running your servers. And then they remained.

AWS has a similar issue internally, IIRC (oncall around DST switching times was fun with teams in three continents), but they are making baby steps to fix it. It's harder to fix than it sounds once in place though...


> Those were decisions made a long time ago, when there was only a single region/datacenter, since that was where you were running your servers. And then they remained.

Yes, and it's still a <bad adjective here> decision. Actually, there should be no decisions on those things. It doesn't matter if it is your personal blog. Store events in UTC, no ifs or buts. Then convert for display. You better be able to fully articulate why you are not using UTC (scheduling real world events?), otherwise use UTC. Make it a tattoo if that helps. Again, UTC. Yes I know you are a single person and you don't have more than one server today, still use UTC. Thank your former self later on when you have to match events across timezones...

Text representation is another example. Use Unicode for strings unless you can clearly explain why that's the wrong representation. (Which Unicode? Pick UTF-8, unless you have a reason not to, in which case you'll probably know what the reason is).
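
For the record, the pattern costs almost nothing to follow. A rough Python sketch of "store UTC, convert for display" (America/Los_Angeles standing in for US/Pacific; purely illustrative):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # Store: always record the event in UTC.
    event_utc = datetime.now(timezone.utc)

    # Display: convert to whatever zone the viewer actually cares about.
    for tz in ("America/Los_Angeles", "Europe/Berlin", "UTC"):
        print(event_utc.astimezone(ZoneInfo(tz)).isoformat())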


You are talking about a decision that was made 18-20 years ago. It was a bad decision in hindsight, but back then I'd imagine they were likely optimizing for a very different problem.

Even the UTF-8/Unicode example doesn't hold 18-20 years ago. UTF-8 was completely niche in 2003, and there were a number of flavors of Unicode, not all of which fell back to ASCII as nicely as UTF-8 does. If I were distributing software in 2003, I would have stuck to ASCII unless I was sure I needed Unicode, and even then I would likely have opted for UTF-16, because that was what was supported at the time.

I wouldn't question anyone's decision to use UTF-16 in 2018 for software that has been supported since 2003.


> You are talking about a decision that was made 18-20 years ago.

Hang on.. what are we talking about?

According to Wikipedia, AWS was launched in 2006, which is only 12 years ago, and GCP was launched in 2008, only 10 years ago.

You could argue that, even in 2006, UTF-8 wasn't firmly enough established as the "winner" to be a clear best choice, but defending a decision to avoid Unicode entirely at that point, for a company with global ambitions, just seems disingenuous.

For the question of UTC, claiming naivete also lacks credibility, if only because Y2K brought the topic of date/time representation in computers to the forefront of everyone's minds, and kept it there, starting around 20 years ago.


Shock! Wikipedia is wrong! The first AWS service was AWIS, still going strong today, although only usable by root accounts, not IAM.

This misinformation is perpetuated by AWS evangelists who claim SQS was the first service.

https://aws.amazon.com/about-aws/whats-new/2004/10/04/introd...


Although what you point out may be technically (and marketingly) correct, it's not a stretch to consider the current, cloud-computing AWS not to have started until the availability of S3 and/or EC2.

More importantly, for these services to run roughshod, two years later, over the technical assumptions of something like AWIS also seems reasonable to expect. If AWIS isn't integrated under IAM, that's pretty telling.


GCP inherited the decision from Google's existing way of setting up servers, and they'd been setting up servers that way for a long time... or are you suggesting GCP should have run its servers with a different tz configuration than the rest of the company, resulting in even worse issues?


Comments like yours really piss me off. You have absolutely no self-awareness or empathy, and come off sounding like a know it all.

Yes, we all know that storing date/time data is hard; they should use a standard timezone (without DST), and use time-locales to localize the timestamps to the user. We get it.

The reason this knowledge is so well known is because previous programmers (without this helpful guidance) made many different technical decisions for many reasons, and learned the hard way. They then publicized their failures for our benefit.

What was the last technical decision you made that turned out to be the wrong one? Did you make the wrong decision maliciously? Or were you doing the best you could with the information available? Would you really find it helpful for somebody to come along a decade later with the attitude of "these guys have no idea what they were doing, I learned about these antipatterns years ago!" (usually followed by "better rewrite everything")?

I'll go out on a limb and say most people here haven't started from a single datacenter and grown to a global service. These engineers deserve the benefit of the doubt unless and until more information is available. They certainly don't need armchair analysis from some randos on Hacker News.

/rant


>> we all know that storing date/time data is hard

But... it's not hard. That's the point. This isn't a hard decision, and it has nothing to do with regional vs. global. The lesson was learned by the entire industry decades before the company existed, so a modern engineering team making that mistake, and never fixing it, is definitely open to criticism.


Is it really believable that this engineering team started out by not thinking that their service (GCP) would become a global service, irrespective of how many datacenters they started with and where?


> Comments like yours really piss me off. You have absolutely no self-awareness or empathy, and come off sounding like a know it all.

Oh dear. This kind of personal attack is not ok on HN and we ban accounts that do this. Could you please not do it again, regardless of how un-self-aware another comment seems or how you feel in response?

Your comment would be fine without the first and last sentences. It's hard never to write such things but it's always possible to edit them out. That's what I do when something like that slips out.

https://news.ycombinator.com/newsguidelines.html

Edit: unfortunately it looks like you've been uncivil before, e.g. https://news.ycombinator.com/item?id=16928159. Please make sure not to in the future.


I see where you're coming from on the first sentence. I should have left it at saying his comment seemed to lack empathy. The last sentence I still think is fine though (I include myself in the list of people who don't have the perspective to criticize their decisions, I don't have the context to be constructive). I won't defend the other comment.

It's easy to be unnecessarily rude in comments, but I would have said the same were we talking in person. I don't believe in hiding behind anonymity to say something you wouldn't say otherwise.

All the same, I appreciate the input.


Sure, but we're just talking about the status site, it's not hard at all to just convert the timestamps to UTC for display using any server-side or JS library. Even better to show UTC and the current browser reported timezone alongside.


This goes deep into the UX of their systems - some of the most frequent actions in Stackdriver, like jumping to a time within the logs, are a mess. There are no quick actions (like "jump to now"), and THEN I have to switch to a non-intuitive "World / Greenwich Mean Time" config first every time (the default is Pacific, there is no UTC), and THEN I have to twist my brain into the weird US date format and AM/PM times across two different controls. None of that is changeable or configurable; that interface is just a pain to use. My bug report apparently also didn't cut it.


"Idiotic culture" is a bit harsh.


You clearly haven't spent much time with App Engine


What’s the source of this? It sounds intriguing; I wonder what led them to that decision.


I wasn't around at the time, but I suspect it's mainly because the first datacenters were on the West Coast. See e.g. the story at http://www.dodgycoder.net/2013/02/googles-fiber-leeching-cap...


Just ask any current/former employee, it's practically folklore by now (i.e. knowledge older than 6 months)


Ten years ago it was jokingly called GST: not Gulf Standard Time, but Google Standard Time. I don't remember outages, but the monitoring graphs when switching to DST, where plots go back in time, did look very funky. Something like this:

https://imgur.com/a/ZRtt9

The leap second troubles in 2005 (GFS?) that led to NTP smearing were more memorable.


Google Standard Time

Wow. Just... wow. Thanks for that. That phrase brought back a flood of old memories I'd forgotten.


This is fairly common for large tech companies that started in that timezone. The decisionmaking process goes something like this:

1. You MUST use a single, consistent timezone. Using the local timezone is a mess: is it the user's local? The host's local? What about server logs, which are typically text files where the timezone can't be displayed dynamically? What if you aggregate server logs across timezones? What if you ssh into a machine in a different timezone?

2. The logical timezone to use is the one in which the vast majority of your employees work, since having people subtract 7 or 8 hours all the time is annoying.

You could argue that for customer-facing status updates like this, Google should use a dynamic timezone. That's fair, but I'm sure Google internally uses that status dashboard, so it could be very confusing and complicate coordination to mitigate the problem. I'd argue that customers would prefer that the problem get fixed slightly sooner over having to do a once-a-year-ish timezone conversion.


> 1. You MUST use a single, consistent timezone.

UTC was adopted 51 years ago.


As I said: UTC is a reasonable choice, except for the fact that most of your employees do not live in UTC and doing mental translation is annoying.


Store UTC, convert on display to any timezone, either automatically based on geolocation or using a stored preference. This is application design 101 at this point.

Converting old logs isn't hard either since ISO timestamp formats include timezones, or just pick a date for the switchover. The only scenario that gets tricky is with scheduling where users expect local times that carry over daylight-savings boundaries, but it doesn't apply to most apps.
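
To make that concrete, here's a rough Python sketch of the conversion, assuming the stored timestamps at least carry an offset (the log line is hypothetical):

    from datetime import datetime, timezone

    # Hypothetical stored timestamp with a US/Pacific offset (-07:00 during DST).
    raw = "2018-07-17T12:15:00-07:00"

    # fromisoformat (Python 3.7+) understands the offset, so normalizing is one call.
    parsed = datetime.fromisoformat(raw)
    print(parsed.astimezone(timezone.utc).isoformat())  # 2018-07-17T19:15:00+00:00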


Who is doing mental translation and why? We have computers for a reason. Are we talking about logs, software applications? Storage should be in UTC, data can be converted for display. I agree that logs can be a bit annoying when they are in UTC, but at least you have a consistent value across all servers, and you know what the offset is.

Also, if your office happens to be in a timezone which observes DST, you are still screwed. Now you think your times are in localtime, but in fact they are offset by one hour. This can lead to very "fun" debugging sessions and time wasted.

It can be a minor annoyance, but you know what can be an even greater annoyance? Undoing a bad decision which has percolated across several data stores.


> Storage should be in UTC, data can be converted for display

Sometimes we need to speak about events (verbally), or share screenshots of graphs, dashboards, or other data. We can make a habit of always stating the time zone when we talk, and make sure every dashboard/graph contains a time offset.

Or we can just pick a single arbitrary time zone and always use that.

There are a few internal tools at Google that assume your local time zone (I'm in EDT) -- those are far more confusing.

> Also, if your office happens to be in a timezone which observes DST, you are still screwed.

Has this really been a problem? A date+time is unambiguous. I can only see this being confusing if California follows through with abolishing DST.


>> Or we can just pick a single arbitrary time zone and always use that.

...so pick UTC then?

Why is that any different than how it is right now for anyone not in the Pacific timezone?


I think they forgot to tell the PC kids. I seem to recall years of Unix/Linux installs prompting (tempting) me not to use UTC. I also seem to remember dual booting Windows and Linux and the two doing a time tug-of-war, with one OS preferring the Real Time Clock (RTC) to be local time.


Windows likes local time on the RTC. Way back when DOS 1.0 was released (every story about Windows begins with this) there was one way to set the system clock: the user would type in the time. They would use the local time, of course. Consistency with other systems didn't mean diddly squat on a PC/XT with 256 KB of RAM, a floppy drive, and a keyboard. Fast forward a million tech years and we all live with the legacy.


Windows uses the local time zone (for historical reasons, explained in the sibling comment), so if you dual boot it makes sense to keep the same behavior; otherwise you will have time issues when booting the other OS. If you only use Linux, just go for UTC.


1) hah, Yahoo used US/Pacific for US servers, GB/London for European servers, and I never had access to Asia Pacific, but I assume something different there. Also, the ads system and related graphing tools ran on US/Eastern, except without DST (because 24-hour days are important).

There was some talk of changing to UTC everywhere, but I'm not sure if that happened. The next two big companies I went to at least managed to run US/Pacific everywhere. But the problem is always, by the time someone who knows better comes along, it's a PITA to change.


> it's a PITA to change

Pain/annoyance is a recurring theme in this sub-thread as far as reasons go for not switching or not using UTC in the first place.

I strongly suspect, however, that this is one of those situations where the negative aspect is over-estimated.

Other times, merely debating/discussing it draws attention to the pain and serves to amplify it (or its perception). Just quietly implementing it risks offending key players, but the vast majority won't even know the difference.


I'm glad they even included the timezone. An alarming number of tech posts don't even bother with that.


Perhaps they tend to think the US/Pacific time zone is the most significant timezone?


It's probably not outrageous to assert that there are more devs looking up GCP information from that time zone than any other.


The global load balancer is one of the best offerings from GCP but we are always concerned about the single point of failure it causes.

Unfortunately this isn't the first time it's broken, and it's starting to look like a bad choice. We use a CDN in front, so we were able to switch traffic around, but it seems it's better to do the load balancing ourselves too instead of using GLB.


Any load balancer is always a single point of failure. This is why lots of folks go multi-cloud.

You should also be looking at multi-CDN.


> Any load balancer is always a single point of failure.

Yes, but Google takes this to another level, all their load balancers are one single giant point of failure.

> This is why lots of folks go multi-cloud.

Or, just use AWS, where each region is almost entirely independent (some specific APIs are global by their nature, like creating a new S3 bucket), but you shouldn't be using those APIs on your run-path.

Yes, it is less convenient (and it must be less convenient for AWS to build things this way than for Google who seem to prioritise their development experience over customer availability), but new features from AWS are making it easier to deploy to multiple regions and use cross-region failover.


CDNs are single vendors but usually have a resilient and distributed system without a SPOF within, meaning traffic can always route through even with major network outages. This has been a core part of their design for a long time and I can't remember the last time an entire CDN has been down.

GLB is also distributed but clearly is a single "service" within GCP with weaknesses that can take down the entire thing, probably because of the complexity and integration involved in its architecture. It's fast and convenient but the reliability just isn't there yet.


I have yet to read solid case studies of real multi-cloud at scale. E.g. Active-Active load-balanced between multiple providers. Plenty of companies use multiple clouds, but, it tends to be a line of business decision. Team A likes AWS, Team B like Azure etc.


Agreed, multi-cloud is rather costly and means only using infrastructure services like VMs and storage.


It's fairly common. No need to call it multi cloud, companies have had multiple datacenters in multiple countries for a long time.

The only challenge is that you need global geographic load balancers, and that means F5 and at least a million dollars.

Also, you will find out later that some dependencies were only running in a single location and services failed with the datacenter.


I think that’s quite an exaggeration - there are quite a few DNS providers who can do intelligent DNS LB for you and don’t care what your backend is (gcp/AWS/azure/onprem). Won’t even cost you a million bucks.


I'm not sure I agree with that; it's trivial to deploy 2 or more independent ALBs in parallel in AWS if it's a concern for you, with health checks on each from Route 53 DNS.


And multi-DNS provider. :)

Everyone always forgets about DNS, until Dyn dies.


Fricken Dyn... I pay them too... I'd move my DNS services to Google, but I'd have put them behind a load balancer... heh


DNS as a load balancer doesn't have any SPOFs if you do it right (but if you need to coordinate advanced load balancing between multiple providers, that's a bit painful)


I wonder if floating IPs can offer a solution here, combined with multi-cloud LBs?


Grrr. So much for global redundancy.

What is going to be faster: updating DNS records with TTL 3600 to point to a single data center, or Google fixing their problem?

We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?


AWS engineer here, I was lead for Route 53.

We generally use 60 second TTLs, and as low as 10 seconds is very common. There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable. We actually see faster convergence times with DNS failover than with BGP/IP Anycast. That's probably because DNS TTLs decrement concurrently on every resolver holding the record, while BGP advertisements have to propagate serially, network by network.

The way DNS failover works is that the health checks are integrated directly with the Route 53 name servers. In fact, every name server is checking the latest healthiness status every single time it gets a query. Those statuses are basically a bitset, being updated /all/ of the time. The system doesn't "care" or "know" how many health statuses change each time; it's not delta-based. That's made it very very reliable over the years. We use it ourselves for everything.

Of course the downside of low TTLs is more queries, and we charge by the query unless you ALIAS to an ELB, S3, or CloudFront (then the cost of the queries is on us).
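
For anyone wanting to try this, a rough boto3 sketch of a PRIMARY/SECONDARY failover pair with a 60 second TTL (the zone ID, health check ID, names, and addresses are placeholders, not real values):

    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z123EXAMPLE"          # placeholder
    HEALTH_CHECK_ID = "hc-primary-example"  # placeholder

    def upsert_failover(name, role, ip, health_check_id=None):
        """Create/update one half of a PRIMARY/SECONDARY failover pair."""
        record = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,            # "PRIMARY" or "SECONDARY"
            "TTL": 60,                   # low TTL, per the comment above
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={"Changes": [{"Action": "UPSERT",
                                      "ResourceRecordSet": record}]},
        )

    upsert_failover("www.example.com.", "PRIMARY", "203.0.113.10", HEALTH_CHECK_ID)
    upsert_failover("www.example.com.", "SECONDARY", "198.51.100.20")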


_most_ of the traffic will move in response to DNS changes, but there's always a group of resolvers that keep your old IPs for an unreasonable amount of time. I've taken machines out of DNS rotations with short TTLs (I think 5 minutes, but maybe 1 hour) and had some amount of traffic on them for weeks. After a reasonable amount of time, too bad for them, but when I can work behind a 'real' load balancer it's nice to be able to actually turn off the traffic.


Interesting, thank you. So a potential mitigation strategy could look like this:

- Route 53 failover record
  * primary record: Google global load balancer IP
  * secondary record: Route 53 Geolocation set (really need that latency)
- Elastic Load Balancer record per region
  * routes to the mirror region's GCP IP address (ELB's application load balancer seems to be able to point to AWS external IPs)
  * optionally spin up mirror infrastructure in AWS

Seems brittle. Does Azure support global load balancing with external IPs?

Does anyone have such (or similar) setup actually in production? How did it work today?


That would work, and Azure Traffic Manager does support external IPs. CDNs like Cloudflare and Fastly also have built-in load-balancing where they use their internal routing tables for faster propagation.


I haven't been able to make an ELB target be an external IP. What did you mean by "ELB's application load balancer seems to able to point to AWS external IPs"?


https://aws.amazon.com/elasticloadbalancing/details/#details

IP addresses as Targets

You can load balance any application hosted in AWS or on-premises using IP addresses of the application backends as targets. This allows load balancing to an application backend hosted on any IP address and any interface on an instance. You can also use IP addresses as targets to load balance applications hosted in on-premises locations (over a Direct Connect or VPN connection), peered VPCs and EC2-Classic (using ClassicLink). The ability to load balance across AWS and on-prem resources helps you migrate-to-cloud, burst-to-cloud or failover-to-cloud.

Looks like you need an active VPN connection to access external IPs.


That feature requires you to use a private IP address, so if you have a VPN or Direct Connect to another location you could load balance across locations. In the case of the global load balancers those will be public addresses though.

"The IP addresses that you register must be from the subnets of the VPC for the target group, the RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), and the RFC 6598 range (100.64.0.0/10). You cannot register publicly routable IP addresses."

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...
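
As an illustration of that restriction, a boto3 sketch of registering an IP target (the ARN and address are placeholders); it only works because the address is private:

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholder ARN; the target group must have been created with TargetType="ip".
    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/abc123"

    # A private (RFC 1918) address reachable over VPN/Direct Connect is accepted;
    # "all" marks the target as living outside the load balancer's VPC.
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": "10.0.42.7", "Port": 443, "AvailabilityZone": "all"}],
    )

    # A publicly routable address (e.g. a GCP global load balancer IP) would be
    # rejected by the API, per the documentation quoted above.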


> Of course the downside of low TTLs is more queries

I was diagnosing a networking issue with one of our service providers last Friday. For whatever indeterminate reason, DNS responses from R53 took upwards of 10-15 seconds to return. While I appreciate that the non-configurable default TTL of 60 seconds for ELB is not plucked out of thin air, and that the actual issue seemed to be on the service provider's side, that limit seems far too low for medium/high-latency networks. I wish it were configurable.

What's worse is it looks like it's our site that is the issue, so we get the complaints and I have to dig through wireshark logs.


If you have a very high latency network, say a satellite link, make sure that your near-side resolver supports pre-fetching! Unbound is a good choice.


I run unbound on my own workstations. It's so lightweight, you'd never even notice it, but it definitely makes browsing a little more snappy.


>There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable

I've done a few unplanned DNS failovers, and I agree with this. What can be real trouble, though, is if you're running a B2B app and your customers' corporate networks can be configured in any strange way. I've met real network admins who think they need high TTLs everywhere in order to protect themselves from root DNS DDoSes.


There really are locations where DNS resolvers don't honor TTL.

For example, the public wifi in the last Hackspace in Munich I visited did not honour my 10 second TTL.

But in my opinion there aren't enough of them to justify not using short TTLs. It's their problem after all if they don't honour websites' settings: Then they will see downtime when nobody else does.


Do you mean it was cached for longer than 10 seconds? Was it Freifunk? It might be worth writing to them to ask what their caching setup is.


I've always thought TTL less than 60 seconds should be avoided, as some upstream DNS resolvers will ignore values less than 60 seconds and use a default long value. You are saying this is not true and a TTL of 10 seconds can safely be used?


I think it's safe, based on a lot of experiments. We use 5 seconds for S3 ...

    ;; ANSWER SECTION:
    s3.us-east-1.amazonaws.com. 5	IN	A	52.216.165.117
One of the biggest, highest-traffic systems on the internet!


Traffic is coming back. Looks like Google fixed their load balancer problem within 28 minutes.


Hmm, this comment says it’s been happening for hours (below). Maybe their status page isn’t accurate

https://news.ycombinator.com/item?id=17552693


I'd wait for further details from the status page, but as a GCP employee (for whatever that claim's worth on the internet), I'm not seeing evidence of an issue earlier than 12:15 PDT.


I'm sure it was a cascading event, similar to the one Amazon had yesterday on their own site. It started small until it snowballed and affected everyone.


> We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?

Well, I would avoid any of GCP's 'Global' features, they are an availability risk.

AWS's approach is to rather have inter-region replication, and there are lots of new features that support this.


Our services have just become available again, initial downtime starting at 20:25 GMT, uptime returning at 20:56 GMT

EDIT: App Engine and Kubernetes environments in our EU region appeared to go down, Compute Engine was okay.


This has been causing issues for us for the last couple of hours. We migrated a web application over to GAE last month. It's been rock solid and this is the first issue we've had. Happy to know we can sit back and let them sort it out.


Still having issues as of 20:20 UTC - Firebase Database messages are not being delivered to clients. It's been a full hour so far. The delivery rate also had a minor drop earlier, around 03:00 AM to 05:00 AM, but no incident was listed.


I suspect this is what caused the technical difficulties with HQ Trivia.


Could have sworn I was going crazy last night; was attempting to write a scraper to download attachments / tickets from pivotal tracker for a yearly accounting report. Everything was working fine, then suddenly not. At around 2am I gave up and just put it down to the 'gods' telling me to go to bed.

This was the first article I saw after I woke up, somewhat refreshed. /feeling-vindicated


Wow, 2 AM, some persistence. But as someone who's very sensitive to sleep deprivation (I get violently ill if I don't sleep well for more than 2 days in a row), I'd insist on not making it a habit. Especially if you are in your 30s, it has bad after-effects. PM for more info.


Can't PM with no contact info in your profile - email is not public - this is Hacker News.


It's interesting and scary, mostly scary, that we are now almost at the point where Google going down equals the Internet going down.


There are several large cloud platforms like Azure, Google Cloud and AWS. Major sites run on these platforms, but have good reasons to do so.

It's like airplane crashes vs. car crashes. The chance that you get in an accident with your car is significantly higher, but it's usually the airplane accident that ends up as "BREAKING NEWS". It's just a matter of impact.


> The chance that you get an accident with your car is significantly higher

There's a difference between fatalities and accidents. Fatalities are surprisingly low. Plus, air travel is only significantly safer per km; per journey it's not at all safer [1].

> Its just a matter of impact.

Poor choice of words.

[1]: https://en.wikipedia.org/wiki/Aviation_safety#Transport_comp...


> Air travel is only significantly safer per-km, per-journey it's not at all safer

Of course there are far more car journeys taken than plane ones, so the GP's point stands (and since the point was just to make an analogy, unless you disagree that plane crashes get far more news coverage than car crashes, I'm not sure what the point of disagreement was).

(note those numbers also come from the UK, 1990-2000. If you took the numbers from the US, the car fatalities would be significantly higher (per capita or per km) and if they were from the last decade the air fatalities far lower)


From what I've seen, these were all down: Spotify, Twitch, Discord, Snapchat, Pivotal Tracker, Workana, Codementor


I was watching Twitch through the entirety of this outage, so I'm pretty sure Twitch was fine.


Wasn’t Twitch bought by Amazon years ago? I'd be surprised if they're using GCP.


Pokemon Go, Rocket League's MP servers.


I think AWS going down globally would be even worse.


Agreed. But while this outage was global, saying "GCP went down globally" overstates it.

AWS doesn't have a comparable service to the global version of GCP's load balancer service, which is what went down. Other GCP services were only affected to the extent that they use global load balancers for ingress, which varies by service.

For most GCP services, the ingress method is under the user's control and a switch to regional load balancers in one or more regions (whether split through GeoDNS or through round-robin) would have been a workaround.

Admittedly one point of global load balancers is to be able to mitigate a lot of other outages... I guess the secondary lesson here is to keep a short TTL on top-level DNS entries which point to the load balancers, and ideally have two DNS providers in the mix too.


> AWS doesn't have a comparable service to the global version of GCP's load balancer service

Intentionally, because then customers relying on a service like this would have regular global outages.

This is the 3rd global outage GCP has had in less than a year. As far as I remember, AWS hasn't had one in many years (but, too many people put everything in us-east-1, so a short single-region outage - which happens very seldom - seems to take down half the internet).


To be honest, I hadn't even noticed. What popular sites went down?


If you check the other comments you'll see that Spotify, Snapchat, Discord, etc. were all affected. We're not talking about "sites" but about any application or site built on Google's infrastructure, even partly.


Not "every application," just applications making use of the GLB or services built upon it.


Pokemon Go


I'm not sure if it is related but Google developer console (for Android apps) stopped displaying total number of installs today


Snapchat, Spotify, Discord,...


Thanks for the information. I see now why I hadn't noticed. I don't Snapchat anymore, I switched to Google Music, and I haven't adopted Discord. I guess I'm not in with the cool crowd anymore!


Apparently Snapchat, Spotify, and some game servers were affected too.


Definitely noticed the issues on SnapChat.


Thought my TV was playing up (YouTube).


discord as well


Is there an easy way to see what cloud based resources sites are using? For instance, PushBullet is down now and this might explain why. Several other sites affected, too.


https://stackshare.io/ is decent for crowd-sourced info on what the stack for a given company/product is comprised of. Not guaranteed to be accurate, so take it with a grain of salt, of course.


If the web front-end or any external web-accessible APIs are being served directly via App Engine, you can see it in the HTTP response headers. Look for x-google-appengine-


Generally, look up the IPs (dig <hostname>) and then attempt a host/whois on that IP. For PushBullet it's hidden behind Cloudflare, so it's hard to see easily; you could try to find an exposed endpoint (which they won't have if they've done things well).
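
If you want to script the check, something like this works with just the Python standard library (the hostname is an example; PTR records and headers are heuristics, not guarantees):

    import socket
    import urllib.request

    host = "www.example.com"  # substitute the site you're checking

    # Forward-resolve, then reverse-resolve: cloud providers' PTR records often
    # give the game away (e.g. *.bc.googleusercontent.com, *.amazonaws.com).
    for ip in sorted({ai[4][0] for ai in socket.getaddrinfo(host, 443)}):
        try:
            ptr = socket.gethostbyaddr(ip)[0]
        except OSError:
            ptr = "(no PTR record)"
        print(ip, ptr)

    # Response headers can also hint at the platform (Server, Via,
    # x-google-appengine-*, x-amz-*) unless a CDN in front hides them.
    with urllib.request.urlopen(f"https://{host}") as resp:
        for key in ("Server", "Via"):
            print(key, ":", resp.headers.get(key))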


My Chrome extension (IPvFoo) displays all the subresource IP addresses in a table:

https://chrome.google.com/webstore/detail/ipvfoo/ecanpcehffn...

(There's a right-click option to look up each address on bgp.he.net, but that doesn't happen automatically, for privacy.)


curl -I www.domain.com

This might give you a clue, PushBullet seems to run behind CloudFlare though.


I got a "Could not connect to POP server 'pop.gmail.com:995' (SSL=ssl): SSL connect attempt failed because of handshake problems" error.


Is there a single cloud that can match uptime of a single top tier DC?


Uhm, no, because a cloud service spans several data centers and features a lot more moving parts than a single solid data center does.

Let's turn this around: could a typical data center + server uptime + service uptime equal that of a major cloud provider?


In my experience, yes, but I have too few data points.


Yup, discord impacted for me


Yes, yes we are.


time to spread resources across clouds? or would that be too expensive?


This one appears to be just global LBs, so a load balancer in AWS/Azure that hits the actual backends over a VPN or something would have worked - but that's just this case.


Even Google Cloud's non-global load balancers, which live within a single region, wouldn't have hit this particular outage.


Yep, we had non-global LBs in europe-west-3 and I could not detect any outage.


I sure wish we hosted on AWS today. Work sucked big time.


Something no one has mentioned yet, could it be that the engineering force at Google is no longer what it used to be?


Because of one data point? Seriously?


I think OP was jovially referring to a similar comment regarding Amazon because of their recent Prime day outage.



