Building a versatile analytics pipeline on top of Apache Spark

Grammarly, like most growing companies, strives to make data-driven decisions. That means we need a reliable way to collect, analyze, and query data about our users. We started out with third-party tools like Mixpanel, but our needs soon outgrew their capabilities: we wanted to control the pre-aggregation and enrichment of data, generate more customized reports, and have higher confidence in the accuracy of the data. So we developed our own in-house analytics engine and application on top of Apache Spark. I recently gave a talk at Spark Summit sharing some of the lessons we learned along the way. The talk covered:

  • Outputting data to several storage systems in a single Spark job
  • Dealing with the Spark memory model and building a custom spillable data structure for data traversal
  • Implementing a custom query language with parser combinators on top of the Spark SQL parser
  • Building a custom query optimizer and analyzer
  • Storing flexible-schema data and querying across multiple schemas with schema conflicts
  • Writing custom aggregation functions in Spark SQL
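
To make the spillable data structure idea above concrete, here is a minimal, self-contained Python sketch. This is not Grammarly's actual implementation (which integrates with Spark's memory model on the JVM); it only illustrates the core trick: an append-only collection that holds items in memory up to a budget, spills full batches to a temporary file, and streams everything back in insertion order.

```python
import pickle
import tempfile

class SpillableList:
    """Append-only collection that spills to disk past an in-memory limit.

    Illustrative sketch only: a real Spark spillable collection estimates
    object sizes and coordinates with the task memory manager; here we
    simply count elements against a fixed budget.
    """

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._memory = []
        self._spill_file = None

    def append(self, item):
        # Spill the current in-memory batch before exceeding the budget.
        if len(self._memory) >= self.max_in_memory:
            self._spill()
        self._memory.append(item)

    def _spill(self):
        # Serialize the in-memory batch to a temp file and clear it.
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        for item in self._memory:
            pickle.dump(item, self._spill_file)
        self._memory.clear()

    def __iter__(self):
        # Stream spilled items first (they were appended earliest),
        # then the items still resident in memory.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            while True:
                try:
                    yield pickle.load(self._spill_file)
                except EOFError:
                    break
        yield from self._memory

# 2500 appends against a 1000-item budget trigger two spills,
# yet iteration still sees every element exactly once, in order.
data = SpillableList(max_in_memory=1000)
for i in range(2500):
    data.append(i)
print(sum(data))  # prints 3123750, i.e. 0 + 1 + ... + 2499
```

In a real Spark job the equivalent structure would track estimated memory usage and request or release execution memory from Spark rather than counting elements, but the spill-and-replay traversal pattern is the same.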

Here is the video of the talk:


Check out the slides as well: