Data Wrangling at Slack (slack.engineering)
118 points by dianamp on Dec 8, 2016 | 71 comments



For what it's worth, every company I have worked for - and almost every company I know - builds their own bizarre stats system. With each presentation I attend (the last one being Uber's), the ideas for storing columnar data get even nuttier. Frankly, I gave up. Now I just install New Relic Insights and I can run queries, have dashboards, and scale effectively without limit. I understand that Slack has scale - but why on earth hook together 30 random technologies and become an analytics company too?


Warning: I build bizarre stats systems for a living :)

I totally get where you are coming from. Right now I'm thinking about a web API that feeds data into Kafka, to be processed (in Python, maybe Go?), stored in Cassandra, and later become the target of large Spark jobs. Oh, and by the way, I need to present this info through pretty graphs and tables - Pandas will come in handy!

Sometimes it's better to just use what someone else has built and let them think about the implementation, the storage, the traffic and the maths... Here is where a third-party solution falls apart: a) Costs. Data analysis is stupidly expensive. b) ...and this is the important one: your sales/consumer-facing teams want some extra numbers, literally the sort of thing that only fits your business. The solution you decided on doesn't support that use case, and you are now stuck with an inflexible solution.

New Relic Insights is OK for some use cases, but completely useless for the majority of the analytics I need to serve. If it fits your bill, great! Save yourself A LOT of time and life span... Just keep everyone else in the business away from it, or they will start asking for things you can't give :)


I am super curious. Most analytic questions I run into - give me a month-over-month comparison, which test won, why is x happening, etc. - could be solved with just some SQL queries. What questions do you run into where you need Kafka + Pig + Fig + Hive, all messaged with Scribe + Redshift? Doesn't it just make it more difficult to answer questions?


It's not so much the implementation details that worry me. Well, I do get worried if we end up building a vastly complex beast, but what I REALLY worry about is having data available for whatever eventual scenario might pop up. It's true that tools (like New Relic) answer a lot of questions, but the data in these systems isn't usually available for you to play with; you're constrained to their sandbox. Even when it is available, a lot of the time the data is built and stored in a way that only makes sense when used through their system (with good reason, performance being the best one).

A lot of the time these systems are built not only to serve business insights and stats. One of our main systems needed to answer two requirements: a) better/faster analysis for us; b) serving as a machine learning platform that delivers better content to our users.

a) complements b) perfectly: as we collect data for analysis, that same data feeds into other areas of the business that help our users, on the fly.

You could argue there are solutions out there that satisfy a) perfectly, but the learning from doing a) is what made b) possible.

Even if you're happy with a solution like New Relic (and by all means, I'm sure it's a good product - we use New Relic a lot!), what happens when someone has an idea like... oh, I don't know... "can we build something that looks at the past 7 days' worth of data and flags up any metric that moves away from the standard deviation line? Also, can you then match that against historic data and identify patterns/catch false positives?"... That's an actual, factual example I'm working on as well.
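A rough sketch of that kind of check in pandas - the DataFrame, column names and the 2-sigma threshold are all made up for illustration:

```python
import pandas as pd

# Hypothetical daily metric series indexed by date; names are assumptions.
metrics = pd.DataFrame(
    {"value": [120, 118, 125, 122, 119, 121, 117, 180]},
    index=pd.date_range("2016-12-01", periods=8, freq="D"),
)

# Rolling 7-day mean and standard deviation, excluding the current point.
rolling = metrics["value"].rolling(window=7)
mean = rolling.mean().shift(1)
std = rolling.std().shift(1)

# Flag anything more than 2 standard deviations away from the 7-day mean.
metrics["flagged"] = (metrics["value"] - mean).abs() > 2 * std
print(metrics[metrics["flagged"]])
```

The "match against historic data / catch false positives" part is where it stops being a one-liner, which is exactly why canned tools rarely cover it.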


This is a valid question. In some cases it has to do with the amount of data you're working with. Most database management systems have made progress in aggregating large amounts of data, but in many cases it is still necessary to distribute the workload, which in turn creates the need to build out the rest of the distribution system.

With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.


> With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

Redshift takes SQL queries.


I've actually been running an analytics company for a few years (http://parse.ly), and as a result of seeing what you're describing and thinking it was pretty strange -- that many companies have "not invented here" syndrome about analytics -- we actually turned our data collection and event enrichment infrastructure into a fully-managed cloud service. It's called Parse.ly Data Pipeline, and is described here: http://parse.ly/data-pipeline.

Together with cloud SQL tools like BigQuery or Redshift, it gets rid of the need to build a "full analytics stack" on your own. You can license the data collection/enrichment from us (we've already scaled it to billions of monthly events), you can use our clean starting schema (over 100 enriched fields per event), and then you can pipe the data into a fully-managed analytics warehouse, or just analyze it in raw form. Then you can actually spend all your time focusing on insights, rather than fussing about data collection clusters, pipelines, ETLs, etc.

I would love to hear what you think of the idea; it was launched just a few months ago.


I'm surprised too. I work at companies that have their own data centers, so we can't use New Relic, Datadog, etc. I'm really surprised there aren't more free open-source analytics platforms for small projects. I'm going to start one when I "get some spare time". lol.

Anyone know of anything out there?


Using paid tools has no relation to having or not having your own datacenter.

If you have a small project (what is "small"?), you just deal with Google Analytics or direct SQL queries to the single database you have. You don't need fancy tools.

The two free options I can think of are Piwik and Snowplow Analytics. They clearly suffer from being "free open source" when compared to the paid tools out there.


Actually, I forgot that I played with Nagios vs Graphite - if anyone knows of other backends similar to those, that would be appreciated.


The kind of "stats" collected by New Relic are only one of many inputs to a data warehouse like the one Slack is describing. You can't import your MySQL databases into New Relic, for example.


Sometimes, looking at people's stacks, I wonder if we've made computing so complicated that most of the time is spent dealing with stuff that is broken, and little time is left to do anything useful. Data science seems even more prone to this than programming in general, and sometimes you wonder if the result is actually worth all the pain.


I feel like this happens because Data Science can only work when two professional areas clash and mix: Programming and Maths. The two are very well connected, of course, but the concepts behind the maths of Data Science are much deeper than what the typical programmer is used to. Programmers need Mathematicians as much as Mathematicians need Programmers. This is where it gets hard: Programmers find it hard to implement these concepts. On the other hand, Mathematicians don't understand what good software is.

Good data analytics software can only come when these two areas learn to teach each other. Programmers need to learn maths to the point where they are comfortable enough to implement a valid solution; mathematicians need to learn about building software that others can use.


It is not my impression that Data Science mixes programming and maths, except perhaps in limited fields like finance where the data and analysis are maths-heavy.


I felt the same when our stats were based on simple arithmetic: "sum those revenue figures", "divide that by the total number of users", "percentage of returning members"...

It can easily spiral into "Pearson's correlation" or "give me the linear regression of the bastard".
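Neither of which is scary once the data is sitting in one place - e.g., with SciPy (all the numbers below are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical example: daily ad spend vs. revenue.
spend = np.array([100, 150, 200, 250, 300, 350], dtype=float)
revenue = np.array([420, 610, 790, 1010, 1180, 1420], dtype=float)

# Pearson's correlation coefficient and its p-value.
r, p_value = stats.pearsonr(spend, revenue)

# Simple linear regression: revenue ~ slope * spend + intercept.
slope, intercept, r_value, reg_p, stderr = stats.linregress(spend, revenue)

print(f"Pearson r = {r:.3f}; revenue ~ {slope:.2f} * spend + {intercept:.2f}")
```

The hard part is rarely the formula, it's getting clean data to feed into it.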


Still not hard maths. If all you have to do is apply a simple, standard, well-documented algorithm, there is really no obstacle to your success =)

That being said, I guess that having had maths classes in my engineering degree skews my point of view, combined with working with quants at times, who do analysis way more advanced than that.


If you are familiar with those concepts, I would count that as a big step over what I typically see in "data science". Surely a big step over what a lot of people think data analytics to be.

Like yourself, I had quite a bit of contact with maths during my engineering degree - whether I took most of it in is a different question :) (Financial Calculus nearly destroyed me).

Developers aren't typically aware of concepts outside basic statistics, and even though a lot of algorithms are readily available for everyone to implement and benefit from, how can you use what you don't know conceptually?

I guess everyone has a different experience, depends where you're working, really. I do know of quite a few shops where the push for analytics came from the tech people, mostly because companies don't employ people with the math knowledge to identify these business gains.


They are really just maths algorithms, seen in maths courses or found with a quick Google search.

The typical reddit developer who got a job without a degree is unaware of many many things.

The typical developer who got a job through a hardcore interview at a random financial company and is surrounded by other master's and PhD holders? Not so much.

The typical tech company doesn't need much advanced analysis. If they could figure out how many recurring users and revenues they have, that would be a good start :D


Seems like a pretty typical set of problems. Dependency conflicts hard. Schema evolution hard. Upgrades hard.

The big data space still feels like an overengineered, fractured, buggy mess to me. I was hoping spark would simplify the user experience but it's as much of a clusterf*ck as anything else.

How hard can fast, reliable distributed computation and storage for petabytes of data be? He said ironically.


Trivial. Just put everything in BigQuery. Use Looker for visualization. Job done.

Any single other component you may try to add will increase the complexity factorially. Better stick to the basics ;)
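If you go that route, the "job done" part really is a handful of lines - a sketch with the google-cloud-bigquery client, where the project, dataset and table names are hypothetical:

```python
from google.cloud import bigquery

# Assumes credentials are configured via GOOGLE_APPLICATION_CREDENTIALS;
# `my_project.analytics.events` is a made-up table.
client = bigquery.Client()

query = """
    SELECT DATE(event_time) AS day, COUNT(DISTINCT user_id) AS daily_actives
    FROM `my_project.analytics.events`
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY day
    ORDER BY day
"""

# Run the query and stream the result rows back.
for row in client.query(query).result():
    print(row["day"], row["daily_actives"])
```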


IMO one major problem is integration between different projects. Like you said, it's a hard problem, and any solution typically depends on many, many different open source projects because of the scope of the challenges. All of those projects move forward without much coordination between the teams because they're open source. Then we end up in this fun, fun clusterfuck.


There is some sort of hope: Apache Arrow is (in my opinion) a step in the right direction - a common in-memory data layer for storage and data analysis systems? Yes please. It's important to start thinking about how all these big data storage/analytics tools can bridge the gap between themselves; hopefully projects like Apache Arrow will help... as long as there is adoption.


A common in-memory columnar data layer would make a lot of sense because a) columnar is generally better for analytics, and b) converting from one columnar format to another can theoretically be done without decompression, because columnar data is typically compressed using standard algorithms (vocabulary compression, RLE, etc.). I wrote a few suggestions for such an open-source data layer here: http://bi-review.blogspot.ca/2015/06/the-world-needs-open-so...
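As a toy illustration of why compressed columnar data travels well - dictionary (vocabulary) encoding plus run-length encoding of the codes, with all values made up:

```python
from itertools import groupby

# A toy string column: low-cardinality data that compresses very well.
column = ["chrome", "chrome", "chrome", "firefox", "firefox", "safari", "chrome", "chrome"]

# Dictionary (vocabulary) encoding: store each distinct value once, keep integer codes.
vocabulary = sorted(set(column))
codes = [vocabulary.index(v) for v in column]

# Run-length encoding of the codes: (code, run length) pairs.
rle = [(code, len(list(run))) for code, run in groupby(codes)]

print(vocabulary)  # ['chrome', 'firefox', 'safari']
print(rle)         # [(0, 3), (1, 2), (2, 1), (0, 2)]
```

Two systems that agree on this kind of representation can hand the column over as-is, without decoding back to rows in between.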


Have you seen the Apache Arrow project? https://arrow.apache.org/


Good read!


I've looked at the code and messed around with Arrow. It seems like a performance optimization that solves a small sliver of the problem. It could help with the Parquet/Thrift version issues they mentioned, but I don't see any guarantee it won't introduce its own version and compatibility problems. If the initial implementations are as buggy as described in TFA, it could actually be a lot worse.

In general, I've learned to be skeptical of any new big data solution. Hadoop and Hive are clumsy, but as someone on my team said, "they've found and fixed the tens of thousands of bugs".

It seems to take five years before any significant new solution is stable and reliable enough to be used on large, complex workloads.

Which makes me really uncertain how we get out of this situation. Maybe something like arrow is a silver bullet that fixes everything with minimal complexity and thus few bugs. But I'm skeptical.


We actually have a pretty similar architecture at https://rakam.io: we use Presto for ad-hoc analysis, Avro for hot data, and ORC as columnar storage. Similar to Slack, we have an append-only schema (stored in MySQL instead of Hive). Since Avro has field ordering, the parser uses the latest schema, and if it hits EOF in the middle of the buffer it fills the unread columns with null. We modified the Presto engine and built a real-time data warehouse: Avro is used when pushing data to Kafka, and the consumers fetch the data in micro-batches, process and convert it to ORC format, and save it to both local SSD and AWS S3.
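Conceptually, the "fill the unread columns with null" trick looks something like this (a toy Python sketch of the idea, not our actual Avro parser; the field names are made up):

```python
def read_record(values, latest_schema_fields):
    """Pad a record written with an older (shorter) schema out to the latest schema.

    Because fields are only ever appended, a record that ends early simply lacks
    the newest columns; treat those as null.
    """
    record = dict(zip(latest_schema_fields, values))
    for field in latest_schema_fields[len(values):]:
        record[field] = None
    return record

# The latest schema has four fields; this event was written when there were only two.
fields = ["user_id", "event", "country", "referrer"]
print(read_record(["u_42", "login"], fields))
# {'user_id': 'u_42', 'event': 'login', 'country': None, 'referrer': None}
```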


Are you using Avro because of your own choices or Confluent's toolset (which uses Avro on Kafka)?


We tried Avro, Thrift and Protobuf, and Avro was our choice. The schema of collections in Rakam is dynamic, and with both Thrift and Protobuf schema evolution is not that easy at runtime. Avro is easier to use in Java and doesn't enforce code generation, and its dynamic classes are optimized for performance, so it's a better option for us.
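For anyone curious what "no code generation" buys you, here's a minimal sketch with the Python fastavro library (we're on the JVM, but the idea carries over; the schema and records below are made up):

```python
from io import BytesIO
from fastavro import parse_schema, reader, writer

# A schema defined at runtime as plain data -- no codegen step needed.
schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "event", "type": "string"},
        {"name": "value", "type": ["null", "double"], "default": None},
    ],
})

records = [
    {"user_id": "u_1", "event": "login", "value": None},
    {"user_id": "u_2", "event": "purchase", "value": 9.99},
]

buf = BytesIO()
writer(buf, schema, records)   # serialize with the runtime schema
buf.seek(0)
for record in reader(buf):     # the reader recovers the schema from the container
    print(record)
```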


I had a very similar experience with Parquet and cross-system pains. Pretty much the whole big data space is a giant clusterfuck of poorly documented and ever-so-slightly incompatible technologies... with hidden config flags you need to find to get it to work the way you want, classpath issues, tiny incompatibilities between data storage formats and SQL dialects, and so on.

Hoping someone on this thread could answer a related question: how do you store data in Parquet when the schema is not known ahead of time? Currently we create an RDD and use Spark to save as Parquet (which I believe has an encoder/decoder for Rows), but this is a problem because we can't stream each record as it comes, and we use a lot of memory buffering before writing to disk.
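One approach (not necessarily what Slack does) is to infer the schema from the first batch and then append row groups with pyarrow's ParquetWriter, so only one batch lives in memory at a time. A sketch with made-up field names, assuming a reasonably recent pyarrow and that later batches fit the inferred schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_in_batches(batches, path):
    """Write an iterable of record batches (lists of dicts) to one Parquet file.

    The schema is inferred from the first batch and reused for every subsequent
    row group, so only one batch is held in memory at a time. If later records
    introduce new fields, you have to evolve the schema yourself.
    """
    pq_writer = None
    try:
        for batch in batches:
            table = pa.Table.from_pylist(batch)
            if pq_writer is None:
                pq_writer = pq.ParquetWriter(path, table.schema)
            pq_writer.write_table(table)
    finally:
        if pq_writer is not None:
            pq_writer.close()

write_parquet_in_batches(
    [
        [{"user_id": "u_1", "event": "login"}, {"user_id": "u_2", "event": "logout"}],
        [{"user_id": "u_3", "event": "login"}],
    ],
    "events.parquet",
)
```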


We're actually having a debate now, as we're starting to process larger datasets, about whether we should keep everything on S3 or start using HDFS with Hive. I'm curious whether you guys considered HDFS and why you decided to go strictly with S3, and additionally, whether there are any issues you've encountered with S3.


We considered HDFS, but we really liked the idea of having compute-only clusters and keeping our data completely separate. Cluster failures happen, and having data on S3 makes us worry less if a cluster goes down. Just spin up a new one and you're good to go.

There is a bit more latency when using S3 compared to HDFS, but it's not bad and the benefits outweigh it. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3.

We encountered a few issues with S3 at the beginning, mostly around eventual consistency, but nothing that could not be fixed.
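For what it's worth, the pattern looks roughly like this in PySpark (the paths and column names are made up; it assumes the cluster can reach both HDFS and S3 via the s3a connector):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

# Raw events live on S3, independent of any particular cluster.
events = spark.read.parquet("s3a://analytics-bucket/events/date=2016-12-08/")

# Intermediate result stays on the cluster's HDFS: fast, disposable, cheap to rebuild.
sessions = events.groupBy("user_id", "session_id").count()
sessions.write.mode("overwrite").parquet("hdfs:///tmp/daily-rollup/sessions/")

# Final output lands back on S3, so it survives the cluster being torn down.
daily = (
    spark.read.parquet("hdfs:///tmp/daily-rollup/sessions/")
    .groupBy("user_id")
    .count()
    .withColumnRenamed("count", "sessions")
)
daily.write.mode("overwrite").parquet("s3a://analytics-bucket/rollups/daily-sessions/date=2016-12-08/")
```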


Netflix, I think, said they see about a 10% perf hit using S3 instead of HDFS with EMR, where they launch temporary clusters that do a job and shut down, and that the performance cost was well worth the flexibility of being able to launch independent clusters whenever they need.


We're also using S3, but we have a hybrid approach to the problem. The event data is immutable, so we use instance stores with EC2 to cache the data on local SSDs and use S3 as the backup. The throughput of HDFS is better than S3 or EFS, but I would prefer EFS in this case since it also utilizes caching under the hood and is the cheaper alternative.


Oh great, thanks for the reply. I think that's about where we'll land... keep S3 as the primary source, but use HDFS for intermediate jobs.


Good luck and have fun! :D


Depends on your definition of 'larger' -- if this data is on S3 currently, I can't imagine we're talking multi-TB working sets here?

Generally speaking, HDFS is going to be a clusterfuck to support unless you give a load of cash to Cloudera (actually, it will be regardless, just slightly better with the bill) -- even then you'll get the typical DB vendor line of 'not running -some patchset ver-? Then upgrade', which is really risky on a large cluster that pretty much works as you want.

Also, unless you've got a load of hardware you can dedicate to this environment, you're going to be spending a lot of money on IaaS bills and your performance is probably not going to be very good. (Yeah, sure, you can virtualize HDFS, but generally I pass local storage through to the VMs, and only run demos on AWS etc.)

There has been a push towards such mental complexity, with folks convincing themselves they needed to solve their problems in this manner, and now a bit of an ebb backwards (at least in the general space) now that your average deployer has found out how hard it is to do this stuff even with good support. Massive data ingestion and huge batch jobs might be a solution to a given problem you have, but it's probably not the only one, whereas it's almost certainly going to be the most difficult and expensive.

Personally, I'd avoid HDFS, Flume, HFS, ZooKeeper and all the rest of the nightmares until you're absolutely sure that you need them (and if you're not already, then you probably don't).

Also: check out Manta from Joyent. :}


S3 is ideal for a multi-TB working set.

That should be the de facto standard at TB scale. In fact, don't bother comparing other products if you're at TB scale, just use S3.


Really?

Say you're going to ETL or Map/Reduce over all that data a lot of times - you're telling me that reading it all for processing over S3's REST API (which is the only method?) instead of, say, a local array of 15k SAS drives over PCIe HBAs is ideal?

It's pretty expensive and inefficient to my eyes; what am I missing?

In what way would S3 be better than running this on your own gear if cost and perf are clearly not going to be better (which are really the big factors in this decision)?


You're missing that S3 is the storage system for Redshift and EMR (EMR = managed Hadoop on AWS).

They are pretty cheap, efficient and simple to use ;)


I would recommend S3.

Using S3 with EMR in production was a breeze for us. It's even cost-effective, since you can play with spot instances depending on your jobs, and you also improve utilization of your resources.

With the recent Athena it is also possible to do ad hoc queries directly :) Before, it required starting a "QA" cluster.
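For the curious, an ad hoc Athena query from code is just a few boto3 calls - a minimal sketch with made-up database, table and result-bucket names:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Athena runs asynchronously and writes query output to an S3 location.
execution = athena.start_query_execution(
    QueryString="SELECT event, COUNT(*) AS n FROM events WHERE dt = '2016-12-08' GROUP BY event",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```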


I'm curious about how much time is spent moving data back and forth from S3. It sounds like they don't currently have an ETL per se.


Pick one solution among:

- alooma.io (SaaS queuing and transformation pipeline that saves to S3)

- segment.io (SaaS analytics platform that can save to S3)

- Snowplow Analytics (clusterfuck of an open source, self-hosted analytics pipeline)


We are implementing a very similar architecture, and have decided to use Avro for schema validation / serialization, rather than Parquet.

Does anyone have experience with both who can speak to their strengths / weaknesses?


Avro is row-oriented, as said before; you should see it in the same category as Thrift and Protobuf, albeit a lot more flexible. The gist of it is that it's a serialization format rather than a storage format, which Parquet is. Usually, when using Kafka or the Confluent platform, I'd use Avro; for long-term storage and analytics, Avro isn't really suited. Instead use Parquet, or ORC if you're using Hive. With things like Spark, Impala or Presto, aggregation queries for ad hoc analytics are an order of magnitude more efficient and faster with Parquet than with Avro.


Parquet is a columnar storage format, whereas Avro is a row-oriented serialization framework. If you have lots of columns and want to perform ad-hoc analysis, Parquet will be better than Avro due to the mechanics of columnar storage.
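To make that concrete, a small hedged pyarrow example: an aggregation over two columns of a wide Parquet file only has to read those two column chunks, not the whole row set (the file and column names are hypothetical):

```python
import pyarrow.parquet as pq

# Only the 'event' and 'revenue' column chunks are read from disk;
# the other (possibly hundreds of) columns are never touched.
table = pq.read_table("wide_events.parquet", columns=["event", "revenue"])

revenue_by_event = table.to_pandas().groupby("event")["revenue"].sum()
print(revenue_by_event)
```

With a row-oriented format like Avro, the same query would have to decode every field of every record.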


Parquet may also consume less space because it uses encoding enhancements like delta encoding, run-length encoding and dictionary encoding. There is also a large number of tools that support Parquet as a format, whereas Avro is Java- and Hadoop-centric.


The other way around: Avro is supported by pretty much any language out there, while you can't even write a Parquet file in Python, and even reading one is pretty hard.


Data Engineering is about developing technology for data management. Data management/analysis is about using this technology to produce results.

So this is not about data engineering, but data management/analysis.


We (adtech) use a very similar approach. We're consuming a ton of data through Kafka and then using Secor to store it on S3 as Parquet files. We then use Spark for both aggregations as well as ad-hoc analyses.

One thing that sounds very interesting and worked surprisingly well when I played around with it was Amazon's Athena (https://aws.amazon.com/athena/), which lets you query Parquet data directly without relying on Spark, which can get expensive quickly. I wouldn't trust it for production use cases just yet, and it ties you more and more into the AWS ecosystem, but it might be worth exploring as a simple way to do basic queries on top of Parquet data. I suspect it's simply a managed service on top of Apache Drill (https://drill.apache.org/).


Not Drill - it's on top of Presto. Presto is quite good, but the open source S3 support is definitely second class because FB doesn't use it; hopefully AWS is contributing their connector back. Likewise, FB uses ORC, and Parquet is more externally supported.

Since S3 listing is so awful, and because of the huge number of partitions we needed, we had to write a custom connector that was aware of the file structure on S3, instead of using the Hive metastore, which has lots of limitations - so I'm a little wary of Athena. CREATE TABLE AS SELECT is amazing too: write SQL to generate temporary Parquet/ORC files back to S3 to query later. I hope Athena will support this if it doesn't already.


With Qubole you can offload data engineering to their platform. Cluster management is super simple. Hand-rolled solutions in my experience are a pain, and elastic cloud features take time to build. Qubole's offering provides an out-of-the-box experience for most big data engines out there. Presto, Spark, Hive, Pig - what have you - all work with your data living in S3 (or any other object storage). I believe they have offerings in other clouds too.

Some amount of S3 listing optimisation has been done by Qubole's engineering team: https://www.qubole.com/blog/product/optimizing-s3-bulk-listi...

They also have features that allow you to auto-provision for additional capacity in your compute clusters as your query processing times increase.


When Amazon Athena actually matures, wouldn't it solve at least the interactive query needs, probably at a much lower/elastic price point than Qubole?


True, I've tried Athena and it's great in terms of cost, performance and ease of use. However, most data engineering teams have lots of custom tweaks they need, and a certain level of control to add jars, applications and UDFs to their queries. I don't see this available through Athena today.


Apparently the concept of sampling has been lost to time.


I think that many people don't trust sampling.

I like sampling for figuring out how something works, it allows me to iterate much, much quicker.

However, if you need individual level predictions, sampling probably isn't going to help.
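Something like this is usually enough for the "figuring out how it works" phase (pandas, with a made-up file and columns):

```python
import pandas as pd

# Hypothetical event export. Everything downstream of the sample runs on ~1%
# of the rows, which makes each analysis iteration much faster.
events = pd.read_csv("events_2016-12-08.csv")
sample = events.sample(frac=0.01, random_state=42)

print(sample.groupby("event")["user_id"].nunique())
```

Once the question is nailed down, you rerun the final numbers on the full dataset.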


Isn't moving data back and forth from s3 rather expensive?


AWS doesn't charge to put data into S3. It's free to pull data out to any AWS service within the same region. It can get expensive to pull data out across regions or out of AWS infrastructure (i.e. to your private data center).


AWS does indeed. They charge $0.005 per 1,000 PUT requests (which is 12.5x more expensive than GET requests), and then you're immediately paying for storage space as well.


Wow, I hadn't noticed that before. Storing data in S3 costs less than a third of what it costs to pull it across the wire (2.3¢/GB to store, 9¢/GB over the wire, in us-east-1).
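A quick back-of-the-envelope using the figures in this thread (assuming 1 TB written as ~10 MB objects; the prices are the late-2016 us-east-1 numbers quoted above):

```python
# $0.005 per 1,000 PUTs, $0.023/GB-month storage, $0.09/GB transferred out of AWS.
gigabytes = 1_000          # 1 TB of event data
objects = 100_000          # written as ~10 MB objects

put_cost = objects / 1_000 * 0.005
storage_cost = gigabytes * 0.023          # per month
egress_cost = gigabytes * 0.09            # one full read out to the internet

print(f"PUTs: ${put_cost:.2f}, storage/month: ${storage_cost:.2f}, egress: ${egress_cost:.2f}")
# PUTs: $0.50, storage/month: $23.00, egress: $90.00
```

The request charges are noise compared to storage, and storage is noise compared to pulling the data out of AWS.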


This is off-topic, but I can't help myself:

Slack, Hive, Presto, Spark, Sqooper, Kafka, Secor, Thrift, Parquet.

I sometimes can't tell the difference between real Silicon Valley product names and parodies. I'm starting to miss the days when it was all just letters and numbers.


There's a game "Pokemon or Big Data?"

https://pixelastic.github.io/pokemonorbigdata/


This is AMAZING. Thank you.


Yeah can't we go back to naming companies with a color and an animal like we did in the glory days?


I give up. Which one's the real one?


All of them! And they're all mentioned in this post! I was considering adding a fake one to the list (I was thinking "Paprika"), but I felt that would dilute the point.


Busily registers paprika.newfangledtld....


Well, for what it's worth, my experience interviewing for the data team there was terrible. A long coding exercise that, when submitted, resulted in a 7-day wait and a two-line email. Wouldn't recommend.


What surprises me the most about Slack's jobs page is that most - if not all - of the positions are on-site. It surprises me because most of the companies that I know to be remote-friendly use Slack as their main communication method, so I would expect Slack itself to have some remote positions just for the dogfooding [1]. I have applied 3 times for a regular SDE position there: two times I was rejected because I was not (permanently) living in the US, and the 3rd time I got no response while staying in NYC.

[1] https://en.wikipedia.org/wiki/Eating_your_own_dog_food


This article talks about that specifically: http://readwrite.com/2014/11/06/slack-office-communication-p...

An excerpt...

Which raises the question: With such a good tool for team communication, why does Slack need an office? Why not do all your work virtually?

[Photo caption: Slack CEO Stewart Butterfield gives product manager Mat Mullen advice, and a ukulele serenade.] “There are some conversations that are much easier in person,” says Brady Archambo, Slack’s head of iOS engineering.


Revealing.



