Building a versatile analytics pipeline on top of Apache Spark

Grammarly, like most growing companies, strives to make data-driven decisions. That means we need a reliable way to collect, analyze, and query data about our users. We started out with third-party tools like Mixpanel, but our needs soon outgrew their capabilities: we wanted to control the pre-aggregation and enrichment of data, generate more customized reports, and have higher confidence in the accuracy of the data. So we developed our own in-house analytics engine and application on top of Apache Spark. I recently gave a talk at Spark Summit sharing some of the lessons we learned along the way. The talk covered:

  • Outputting data to several storage systems in a single Spark job
  • Dealing with the Spark memory model and building a custom spillable data structure for data traversal
  • Implementing a custom query language with parser combinators on top of the Spark SQL parser
  • Building a custom query optimizer and analyzer
  • Storing flexible-schema data and querying across multiple schemas with schema conflicts
  • Writing custom aggregation functions in Spark SQL
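
To make the spillable data structure idea above concrete, here is a minimal, self-contained Python sketch. This is not Grammarly's actual implementation (which integrates with Spark's memory model on the JVM); it only illustrates the core trick: an append-only collection that holds items in memory up to a budget, spills full batches to a temporary file, and streams everything back in insertion order.

```python
import pickle
import tempfile

class SpillableList:
    """Append-only collection that spills to disk past an in-memory limit.

    Illustrative sketch only: a real Spark spillable collection estimates
    object sizes and coordinates with the task memory manager; here we
    simply count elements against a fixed budget.
    """

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._memory = []
        self._spill_file = None

    def append(self, item):
        # Spill the current in-memory batch before exceeding the budget.
        if len(self._memory) >= self.max_in_memory:
            self._spill()
        self._memory.append(item)

    def _spill(self):
        # Serialize the in-memory batch to a temp file and clear it.
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        for item in self._memory:
            pickle.dump(item, self._spill_file)
        self._memory.clear()

    def __iter__(self):
        # Stream spilled items first (they were appended earliest),
        # then the items still resident in memory.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            while True:
                try:
                    yield pickle.load(self._spill_file)
                except EOFError:
                    break
        yield from self._memory

# 2500 appends against a 1000-item budget trigger two spills,
# yet iteration still sees every element exactly once, in order.
data = SpillableList(max_in_memory=1000)
for i in range(2500):
    data.append(i)
print(sum(data))  # prints 3123750, i.e. 0 + 1 + ... + 2499
```

In a real Spark job the equivalent structure would track estimated memory usage and request or release execution memory from Spark rather than counting elements, but the spill-and-replay traversal pattern is the same.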

Here is the video of the talk:


Check out the slides as well: