As discussed in the first part of this series, we were very excited when we figured out how to properly build Docker images, until we realized that we had no idea how to run them in production. You might have already guessed that we were pondering building our own tool.
Today, the industry is saturated with discussions about containers. Many companies are looking for ways they can benefit from running an immutable infrastructure or simply boost development performance by making repeatable builds between environments simpler. However, sometimes by simplifying the user experience we end up complicating the implementation. On our journey to a usable, containerized infrastructure, we faced a number of daunting challenges, the solutions to which are the subject of this post. Welcome to the bleeding edge!
At Grammarly, we have long used Amazon EMR with Hadoop and Pig in support of our big data processing needs. However, we were really excited about the improvements that the maturing Apache Spark offers over Hadoop and Pig, and so set about getting Spark to work with our petabyte text data set. In this post, we describe the challenges we encountered along the way and the scalable, working Spark setup we arrived at as a result.
At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.
We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provide several unique, albeit obscure, capabilities for supporting production systems (for the impatient, they are described in the final chapter).
In this post, we are going to discuss a common evolution of server-side architecture that many growing companies face. It is the now-legendary transition from a monolithic application to a microservices architecture. And although decoupling is a sound software development concept, there are a number of risks and pain points associated with it. This write-up covers some of the issues we faced while scaling Grammarly's server backend and the solutions and insights we arrived at in the process.
We are excited to announce summer internship opportunities in the Grammarly Core Team in both of our offices: Kyiv and San Francisco. Our team is responsible for all the text processing and error correction that happens under the hood of Grammarly. So, we're involved in a handful of research and engineering activities encompassing a large part of the modern NLP landscape. Unfortunately, we can't publish or disclose the most interesting pieces, but some of our results can be found on our blog.
For a period of up to three months starting on June 1st, each office is ready to invite one person to work in one of the following three areas:
- Computational Linguistics
- Data Science/Machine Learning
- Lisp Programming
The task of comparing constituency parsers is not a trivial one. Parsers vary in the types of mistakes they make, the kinds of texts they parse well, their speed, and all sorts of implementation-specific features and quirks. We set out to understand what stands behind the vague F-measure numbers lurking around 90% and what kind of issues to expect from different parsers, regardless of their overall quality.
In this article, which is the first in our series about exploration of syntactic parsing, we describe approaches to parser evaluation and use them to analyze common parser fail cases. We notice a serious fallacy in all of the standard metrics, which leads to overly optimistic evaluation results. Thus, we propose several improved metrics suitable for different use cases. Their implementation is published as a small open-source project.
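To make the F-measure discussion concrete, here is a minimal sketch of how labeled-bracket evaluation (the PARSEVAL-style scoring behind those ~90% numbers) is typically computed. This is an illustration under the assumption that parses are represented as sets of `(label, start, end)` spans; it is not the improved metrics from the article.

```python
def bracket_f1(gold, predicted):
    """Precision, recall, and F1 over labeled constituent brackets."""
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold parse of "the dog barks": NP over tokens 0-2, VP over 2-3, S over 0-3.
gold = {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)}
# A hypothetical parser that mislabels the NP as ADJP loses one bracket.
pred = {("S", 0, 3), ("ADJP", 0, 2), ("VP", 2, 3)}

p, r, f = bracket_f1(gold, pred)
```

Note how a single mislabeled node costs both a precision and a recall point, while saying nothing about *why* the parser failed — which is exactly the kind of opacity the article digs into.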
At Grammarly, we use a lot of off-the-shelf core NLP technologies to help us make some sense of the mess that is natural-language text (English in particular). The issue with all these technologies is that even small errors in their output are often amplified by downstream algorithms. So, when a sophisticated mistake-detection algorithm is supposed to work on individual sentences but instead receives a sentence fragment or a couple of sentences merged together, it may find all sorts of funny things inside.
In this post, we analyze the problem of sentence splitting for English texts and evaluate some of the approaches to solving it. As usual, good data is key, and we discuss the usage of OntoNotes and MASC corpora for this task.
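To illustrate why sentence splitting is harder than it looks, here is a toy sketch (not the approach evaluated in the post) showing how a naive split on sentence-final punctuation trips over abbreviations:

```python
import re

# Naive rule: a sentence boundary is any ., !, or ? followed by whitespace.
NAIVE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def naive_split(text):
    """Split text on punctuation followed by whitespace (deliberately naive)."""
    return NAIVE_BOUNDARY.split(text)

text = "Dr. Smith arrived. He was late."
# The abbreviation "Dr." triggers a spurious boundary, yielding three
# pieces instead of two:
# ["Dr.", "Smith arrived.", "He was late."]
pieces = naive_split(text)
```

Real splitters have to disambiguate such periods using abbreviation lists, capitalization cues, or statistical models, which is where annotated corpora like OntoNotes and MASC come in.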