Google App Engine and Cloud Datastore Outages (cloud.google.com)
231 points by tachion on Nov 11, 2019 | 74 comments



Hey everyone - Seth from Google here. We’re currently investigating this incident and hope to have it resolved soon. Check the status dashboard for more information (https://status.cloud.google.com/incident/cloud-datastore/190...), and I’ll try to answer questions on this thread when I’m off mobile this morning.


So I should be expecting Cloud Storage to not be working right now?


Unfortunately yes. Most products are currently affected and may experience partially degraded service. See https://status.cloud.google.com/ for the full impact matrix.


We had some issues with Google Chat and Meet; we saw some 503 errors. Is it related?


There are many downstream impacts. Without more information, it's hard to say for certain, but "most likely".


https://status.cloud.google.com/summary

Google App Engine seems to be a very fragile service. It has been going down every month in 2019: 10+ hour outages in July, Sept., and Oct.

For the premium they charge for App Engine, one would expect the service to be more reliable.


All customers must migrate to AWS.


Google App Engine doesn't have many users, and isn't a focus for future engineering effort.

Either people need to start using it for serious projects (rather than just demo guestbook projects), or it'll be shut down in a future round of closures.


This is fundamentally incorrect. GAE has many users and is actively being developed.


Would you choose GAE or GKE for a new project?


That's a big question :). I don't think they're mutually exclusive. GKE provides more flexibility but requires more configuration. GAE is less flexible but more "serverless". GKE is probably more expensive for a single app, so I'd probably pick GAE for a single app. For a _project_, I'd have to understand more about what I'm building and what I need.

I'd also probably use Cloud Run over GAE, but that's a personal preference because I've been working closely with that product lately.


Our app is down. I can't even access any pages in Google Cloud Console. Timing out. Sometimes it completes showing all our clusters gone, then a timeout error. This is brutal. This isn't just GKE either...

Edit: Things are working for us now

Edit: Still getting timeouts and service unavailable

Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the status page indicating there's any issue.

Edit: Seems our Cloud SQL instance was restarted as well

Edit: Multiple restarts of our production database

Edit: Dashboard finally updated to reflect growing # of services affected

Edit: This wasn't an "App Engine" incident. It was a very wide-ranging incident. Just change the title to "Google Cloud Incident" and be done with it

Edit: Things have seemed to stabilize for us

Was supposed to have today off with my family (Remembrance Day in Canada), but now I have to deal with support issues all day. Thanks Google!


I’m sorry this incident is taking time away from your family. Our team is working to mitigate as quickly as possible. Initially the scope of the issue was unclear, but now the title and dashboard are updated, sorry about that.


I have been fucking with this all morning thinking it was something we did (we actually were changing some permissions last night / today). How about an email? There was no push of this info; I shouldn't have to opt in to that.


(Google Customer, not Employee)

We have two Slack channels here. One is where our internal monitoring agents post, along with an RSS subscription to the GCP status board. We have a separate channel for less critical things (like GitHub).

Unfortunately, it does depend on GCP updating their dashboard. However, to date, we’ve only been impacted by one of these major outages. This morning, I saw the alert in our channel, but all our things were still operating (fortunately!).

YMMV, but I’ve found this very helpful.
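
For anyone who wants to reproduce that setup without a hosted RSS-to-Slack integration, here is a minimal sketch in Python. The status feed URL and the Slack incoming webhook URL are assumptions/placeholders, not an official integration:

    # Minimal sketch: poll the GCP status dashboard's Atom feed and post new
    # entries to a Slack incoming webhook. Both URLs below are assumptions /
    # placeholders, not an official integration.
    import json
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    STATUS_FEED = "https://status.cloud.google.com/feed.atom"      # assumed feed URL
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
    ATOM = "{http://www.w3.org/2005/Atom}"

    def fetch_entries():
        """Yield (id, title) pairs for every entry currently in the feed."""
        with urllib.request.urlopen(STATUS_FEED, timeout=10) as resp:
            root = ET.fromstring(resp.read())
        for entry in root.findall(ATOM + "entry"):
            yield entry.find(ATOM + "id").text, entry.find(ATOM + "title").text

    def post_to_slack(text):
        """Send a plain-text message to the Slack incoming webhook."""
        body = json.dumps({"text": text}).encode()
        req = urllib.request.Request(
            SLACK_WEBHOOK, data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    seen = set()
    while True:
        for entry_id, title in fetch_entries():
            if entry_id not in seen:  # only announce entries we haven't seen yet
                seen.add(entry_id)
                post_to_slack("GCP status update: " + title)
        time.sleep(300)  # poll every 5 minutes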


FWIW, we're exploring ways to make this better. We know it's a point of pain for customers. One of the challenges is that the "project owner" isn't always the right person to receive these kinds of alerts, since often that's someone in the finance department or central IT team. There are a few ideas being thrown around about more proactive notification right now.


Ouch when something as basic as this fails:

    11:17 $ gcloud container clusters list
    WARNING: The following zones did not respond: us-east1, us-east1-b, us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.

The GCP Web Console is also really struggling - e.g. the Cloud Functions page for our project spins for a minute and then says the API is not enabled (it sure is).

Ah there it is... https://status.cloud.google.com/incident/cloud-datastore/190...


One of our GKE clusters suddenly went missing today (from the console as well as kubectl), and we were scared for some time, panicking about how the cluster could have been deleted.

Google should have put up some kind of alert dialog in the console saying that some services are experiencing downtime.


Or how about an email? I'd been trying to solve a non-existent problem for a few hours before seeing this on HN.


I'm also seeing issues with GCE and GCS. Getting permissions errors and timeouts.

GKE cluster API endpoints have high error rates or timeouts too.

"Multiple services reporting issues" on https://status.cloud.google.com/ now. Can we update the title?


This is pretty bad - on a regional cluster:

    $ kubectl get nodes
    The connection to the server XXX was refused - did you specify the right host or port?

Bigtable has also not been responding for some time now.

EDIT: This is us-east1. Responding again now.


What region is this? The GCP Console doesn't function properly (API errors), but kubectl and all apps work. (europe-north1)


us-east1. Our cluster in europe-west1 seems unaffected.


Correct, this outage is affecting a few (but not all) regions.


Will Google please consider explaining to us why we continue to experience multi-region failures and what will be done so that we can build reliable systems on top of GCP?

I have been taught by AWS that we should expect occasional cross-AZ failures and almost no cross-region failures. This does not appear to be the case at GCP. I would like to have GCP as a cloud option - some of your tech is very impressive - but I have no idea how to design infrastructure on GCP so that I can be confident it won't fail due to a GCP problem that I cannot fix.


Aside from GKE, chat.google.com and calendar.google.com are acting weird, while hangouts.google.com is working just fine here. What's also interesting: the GCP dashboard shows this issue as being a few days old now.

EDIT: now the dashboard shows multiple services having issues, across the board.


YouTube had some issues too for a dozen or so minutes; it seems fine now.


Unfortunately I feel like Google has one of these every 6 months; I really hope they resolve it. I've been an App Engine user since 2008, and there are many mission-critical apps that are heavily impacted by any downtime. It usually ends up being a networking configuration issue on their end in the US East region? A strange repeating pattern.


If the Google pattern holds, they will decide that users are the problem because they are using GCP incorrectly, then start cutting the support budget to show how much contempt they have for all these misbehaving users, and then eventually cancel the project.


I can't tell if it's a coincidence, but I've had all of our GCE pull queues failing with "transient failures" for pretty much the entire time GKE has been reporting this issue.

Have they only _just realised_ this is affecting GCE after all this time, or has it only _just started_ to affect GCE?


GKE runs on GCE. It's affecting both.


[edit] Incident logged on GAE https://status.cloud.google.com/incident/appengine/19013

Our Google App Engine Flex app is not working :(. We are just getting 502 errors. Locally everything works fine, but the deployed service is not working. The instance stays in restarting mode and shows the message "This VM is being restarted".

As per this status page, the issue was supposed to be resolved on Nov 1: https://status.cloud.google.com/incident/appengine/19012


I'm really waiting for the postmortem. The first services down were networking/datastore, and some minutes later all the others started to fail. My hypothesis is that network failures prevented Paxos, a CP algorithm, from making progress, blocking writes.
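
For readers unfamiliar with the quorum argument: a majority-quorum (CP) system stops accepting writes when a partition leaves fewer than a majority of replicas reachable. A toy illustration of that arithmetic (not Google's actual replication code):

    # Toy illustration of majority quorum, not Google's implementation:
    # a write can only commit if a strict majority of replicas acknowledge it,
    # so a partition that isolates the writer from most replicas blocks all
    # writes until connectivity returns.
    def can_commit(total_replicas, reachable_replicas):
        majority = total_replicas // 2 + 1
        return reachable_replicas >= majority

    print(can_commit(5, 3))  # True: 3 of 5 is a majority, writes proceed
    print(can_commit(5, 2))  # False: only 2 of 5 reachable, writes stall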


The root cause for the outage in June was bad configuration in the network control plane: https://status.cloud.google.com/incident/cloud-networking/19...


The real root cause was building redundant systems that share a single point of automated failure, creating a massive blast radius.

> Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations


The idea of a single root cause is of course fiction in most failures in complex / redundant systems.


It feels like GCP has not done a very good job of reducing blast radius in its services. Each time there is an outage there are so many downstream Google services affected.

It's unbelievable that this is the second multi-region outage this year.


I'm overall pretty happy with GCP, but I wish they would better isolate their availability zones. I have yet to see a single-AZ problem; it's often a full region, or global, and that is not great in a cloud world.


It's a function of dependencies. Things are so complex with so many inter-dependent services that it's not humanly possible to understand the consequences of a single failure, let alone a cascading failure.

This is a really tough problem at scale, and it's made worse by every layer of the stack trying to become smarter, scale further, and still be simpler to manage, by simply "reusing" some other smartness elsewhere in the stack. But surprise, every single one of those moving parts is moving faster and faster, as churn and software fads and a hundred thousand SWEs need something to do. Disrupt!


But it is possible, and AWS has both proven that and demonstrated strategies that work well. The strong separation of regions (i.e. no automated system that manages multiple regions) is a simple technique that is very effective at reducing blast radius. There are other high availability tricks like shuffle sharding that I assume/hope GCP is already using heavily. AWS also has different tiers of services with different uptime expectations, and their Tier 1 services have generally had exceptional uptime, whereas the last two major GCP outages seem to have been problems in their Tier 1 services (too early to say for sure on this outage, but the widespread downstream effects make it seem likely that the failure occurred in a foundation-level service).
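
For anyone unfamiliar with the term, here is a rough sketch of the shuffle-sharding idea mentioned above; it illustrates the general technique only and is not how AWS or GCP actually implement it. Each customer is assigned a small, deterministic, pseudo-random subset of workers, so a failure triggered by one customer is contained to that customer's shard, and two customers rarely share their entire shard:

    # Rough sketch of shuffle sharding (the general technique, not any vendor's
    # actual implementation): map each customer to a small deterministic
    # pseudo-random subset of the worker fleet. A "poison pill" from one
    # customer can only take down the workers in that customer's shard, and
    # two customers rarely share all of their workers, limiting blast radius.
    import hashlib
    import random

    WORKERS = ["worker-%d" % i for i in range(16)]  # hypothetical fleet
    SHARD_SIZE = 4

    def shard_for(customer_id):
        """Return a deterministic pseudo-random subset of workers for a customer."""
        seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
        return sorted(random.Random(seed).sample(WORKERS, SHARD_SIZE))

    print(shard_for("customer-a"))  # always the same 4 workers for customer-a
    print(shard_for("customer-b"))  # very likely a different combination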

Although I am loath to judge technical quality/decisions from the outside, it feels like there might be a deeper problem at play here. AWS does an excellent job of aligning development priorities with business requirements by watching availability metrics religiously (the CEO looks at availability metrics every single week) and having a pager culture where, if you built it, you maintain it, so you're properly incentivized to build fundamentally reliable services/features. My understanding is that GCP relies on the SRE model, and I question whether that is as effective, as the incentive structure is far more complex.


> Each time there is an outage there are so many downstream Google services affected.

I think that happens because Google shares infrastructure between external "hosting" and internal projects.


Very unlikely that GKE is the root cause if Google Calendar is also affected.

Google isn't using GKE internally for much.


As reflected on the status dashboard, we believe a primary cause is related to datastore. Our SREs continue to investigate and attempt mitigation.

Also, we do use GKE internally :)


For which services? Why aren't you talking more about this?

I'd really like to know what Google is using Kubernetes for.


Where does it say so and how up to date is that source?


One tiny part could be enough to render the service unusable if it wasn't set up to handle GKE being down.


App Engine Flex, Cloud Storage, Cloud SQL, and networking seem okay in Europe (west-1).

Our app is still up.


calendar.google.com is down for Google for Business customers. I wonder how Google will compensate their paying customers.


if we get to skip some meetings today because of this, I think we should be thankful to Google for the increased productivity


Is this worldwide? Seeing it okay in Australia.


I am in Europe and hosting here and I see failed requests on the calendar API.


It’s days like this I miss having data centers to manage. At least it was my fault the service went down. Nowadays I have to create redundancy across two different cloud providers to keep my business running. Thanks Google!

For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.

Am I right to believe my sysadmin friends when they say GCP is like Gmail back in the day: still in beta?


Well, Google wasn't the first to introduce this; you may want to blame Amazon and AWS. Clever marketing made people believe in the benefits of building applications this way, and here we are now.

I am scared to think that traditional VPS/VDS offerings and leased servers will cease to exist and we will all have to deal with GCP/AWS/whatever.

Hope this won't happen during my lifetime.


I know this is no way related, but there was this other submission which I found excellent, "Taking too much slack out of the rubber band" [0], and it just made me wonder...

[0]: https://news.ycombinator.com/item?id=21502292


We are still having issues with Google Cloud Print. Is anyone else still seeing issues?


Really wonder what it's like being in Google teams when this happens. Must be pretty intense

Also - karma gods rewarding Google for Manifest V3 :p


That's why there are SREs and on-calls. In big events like this, there will likely be a war room with hundreds of on-calls checking in, either remotely or onsite. There could be an "on-call leader" as well, or a special-ops team. Bad days for on-calls.

Honestly, SDEs at Google are quite lucky since they have SREs to back them. Elsewhere it might go to the dev team directly.

EDIT: if you know of any big company that pays on-calls more, let me know and I'll seriously consider joining!


Netflix should do a docu about it. I'd watch it


> Bad days for on-calls

It sounds snarky, but it's honestly very hard to have sympathy for people who earn what these on-calls earn.


Hmm, which of the FAANG companies pay extra for on-call? At least not A.

On-call normally means a regular engineer who's on the on-call rotation. You can at least get dinner reimbursed, I think.


Google pays on-call engineers extra. Other companies often pay ENOCs in time off. I know a few only compensate IF you’re paged off-hours (e.g. paged at 3am, incident resolves at 6am, you get 3 hours of vacation time).


> I know a few only compensate IF you’re paged off-hours

Yeah, that seems like it would lead to misaligned incentives (as well as under-pricing the opportunity cost of being on-call).


Google compensates for on-call outside of normal business hours.

(I'm a Googler)


Not too surprised, but still, Google treats their people pretty damn well.


The timing too. Every minute that goes by, more US people are waking up and the usage pattern goes up.


As with any incident, it can be stressful. However, it’s not only the SREs that feel that stress. Our comms, support, and developer relations teams also mobilize to help customers as best we can. It’s a team effort across the board.


Hangouts was also affected. Seems ok now. Our business is in Europe.


There were problems in MS's infrastructure too about the same time (where I sit it manifested as failures with OneDrive sync and TFS access). Perhaps there was a more general routing or DDoS issue in Europe that affected both (and, if so, presumably many other services)?


Analytics was down for around 15-20 minutes too, it seems.


I think my single droplet has better uptime than GAE


Billing is down, making almost any operation in the GCP dashboard fail :|


> Incident began at 2019-11-04 11:46

Google can't fix something in 7 days.

Oh my~


This is an unrelated incident.

> Mitigation work is currently underway by our engineering team and is expected to be complete by Wednesday, 2019-11-13.

The fix/mitigation is being rolled out.

> Workaround: Users seeing this issue can downgrade to a previous release (not listed in the affected versions above).

There's a workaround for end users (downgrading to a version which is not affected).

Also:

> the number of projects actually affected is very low.

(disclaimer: I'm an SRE at Google)


Hmmm, this will be expensive.

I wonder what the 'lost revenue' costs will add up to. Also, I surely hope there aren't any medical/transportation/other critical things depending on this.



