At Grammarly, we have long used Amazon EMR with Hadoop and Pig for our big-data processing needs. However, excited by the improvements the maturing Apache Spark offers over Hadoop and Pig, we set about getting Spark to work with our petabyte-scale text data set. This presentation describes the challenges we faced along the way and the scalable Spark setup we arrived at as a result.
1. Petabyte-Scale Text
Processing with Spark
Oleksii Sliusarenko, Grammarly Inc.
E-mail: aliaxey90 (at) gmail (dot) com
Read the full article in Grammarly tech blog
5. Typical processing step example
Processing example: count each n-gram frequency
Input data example: <sentence> <tab> <frequency>
My name is Bob. 12
Kiev is a capital. 25
Output data example: <n-gram> <tab> <frequency>
name is 12
is 37
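The step above maps naturally onto Spark's flatMap + reduceByKey pattern: each n-gram inherits the frequency of every sentence it occurs in, and those contributions are summed per n-gram. A minimal pure-Python sketch of the same logic (the function names are illustrative, not from the talk):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngram_frequencies(lines, n):
    """Each input line is "<sentence>\t<frequency>"; an n-gram's output
    frequency is the sum of the frequencies of every sentence containing it.
    In Spark this would be a flatMap emitting (ngram, freq) pairs followed
    by a reduceByKey(add); here a Counter plays the role of the reducer."""
    counts = Counter()
    for line in lines:
        sentence, freq = line.rsplit("\t", 1)
        for gram in ngrams(sentence.split(), n):
            counts[gram] += int(freq)
    return dict(counts)
```

With the slide's sample input, the bigram "name is" gets frequency 12 and the unigram "is" gets 12 + 25 = 37, matching the output rows shown above.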
22. Was It All Worth It?
◈ We spent the same amount of money
◈ Further experiments will be cheaper
◈ You can save three months!
23. Take-aways
◈ Don’t reinvent the wheel
◈ New technology will eat a lot of time
◈ Don’t be afraid to dive into code
◈ Look at problems from various angles
◈ Use spot instances