At Grammarly, we have long used Amazon EMR with Hadoop and Pig for our big-data processing needs. However, excited by the improvements the maturing Apache Spark offers over Hadoop and Pig, we set about getting Spark to work with our petabyte-scale text data set. This presentation describes the challenges we faced along the way and the scalable Spark setup we arrived at as a result.
1. Petabyte-Scale Text
Processing with Spark
Oleksii Sliusarenko, Grammarly Inc.
E-mail: aliaxey90 (at) gmail (dot) com
Read the full article in Grammarly tech blog
5. Typical processing step example
Processing example: count each n-gram frequency
Input data example: <sentence> <tab> <frequency>
My name is Bob. 12
Kiev is a capital. 25
Output data example: <n-gram> <tab> <frequency>
name is 12
is 37
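The step above maps naturally onto Spark's flatMap + reduceByKey pattern: each n-gram inherits the frequency of every sentence it occurs in, and those contributions are summed per n-gram. A minimal pure-Python sketch of the same logic (the function names are illustrative, not from the talk):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngram_frequencies(lines, n):
    """Each input line is "<sentence>\t<frequency>"; an n-gram's output
    frequency is the sum of the frequencies of every sentence containing it.
    In Spark this would be a flatMap emitting (ngram, freq) pairs followed
    by a reduceByKey(add); here a Counter plays the role of the reducer."""
    counts = Counter()
    for line in lines:
        sentence, freq = line.rsplit("\t", 1)
        for gram in ngrams(sentence.split(), n):
            counts[gram] += int(freq)
    return dict(counts)
```

With the slide's sample input, the bigram "name is" gets frequency 12 and the unigram "is" gets 12 + 25 = 37, matching the output rows shown above.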
22. Was It All Worth It?
◈ We spent the same amount of money
◈ Further experiments will be cheaper
◈ You can save three months!
23. Take-aways
◈ Don’t reinvent the wheel
◈ New technology will eat a lot of time
◈ Don’t be afraid to dive into code
◈ Look at problems from various angles
◈ Use spot instances