Effectively Contributing to
Apache Spark
Beyond the Code Toronto 2016
This talk (as with all my talks) represents my own personal views and may not reflect those of the project.
I am not a Spark committer - but I’ve been contributing for 3 years
Who am I?
Holden
● Preferred pronouns: she/her
● Co-author of the Learning Spark & High Performance Spark books
● Software Engineer at IBM’s Spark Technology Center
● 100+ Spark Commits
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What we are going to explore together!
Getting a change into Apache Spark & the components
involved:
● Different ways to contribute
● Places to find things to contribute
● Tooling around code & doc contributions
How can we contribute to Spark?
● Direct code in the Apache Spark code base
● Code in packages built on top of Spark
● Yak shaving (aka fixing things that Spark uses)
● Documentation improvements & examples
● Books, Talks, and Blogs
● Answering questions (mailing lists, Stack Overflow, etc.)
Which is right for you?
● Direct code in the Apache Spark code base
○ High visibility, some things can only really be done here
○ Can take a lot longer to get changes in
● Code in packages built on top of Spark
○ No real review (+/-)
○ Really great for things like formats or standalone features
● Yak shaving (aka fixing things that Spark uses)
○ Super important to do sometimes - can take even longer to get in
Which is right for you? (continued)
● Documentation improvements & examples
○ Lots of places to contribute - mixed visibility - large impact
● Books, Talks, and Blogs
○ The documentation version of Spark Packages (e.g. no need for
community review)
○ Can be high visibility
○ Talk to me if you are thinking of writing a technical book :)
But before we get too far:
● Spark wishes to maintain compatibility between releases
● 2.0 just shipped - so most APIs should be stable
○ Notable exceptions include Structured Streaming
● It’s very important to talk about large code changes with
the key members before doing them
○ dev list is the simplest way of reaching out
○ Wonder who the key members are? Check the component maintainers on
https://cwiki.apache.org/confluence/display/SPARK/Committers
Adventure path 1: Direct to Spark
● Maybe we encountered a bug we want to fix
● Maybe we’ve got a feature we want to add
● Either way we should see if other people are doing it
● And if what we want to do is complex, it might be better
to find something simple to start with
● It’s dangerous to go alone - take this
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Getting the code
This step can take some time - especially over conference
WiFi so let's get it started now :)
Conference WiFi isn’t working out for you? Ask me and I can
make a copy of the repo to a USB stick for you :)
Exercise time
Photo by recastle
Spark’s Github (Exercise 1)
● https://github.com/apache/spark
● Make a fork of it
● Clone it locally
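A minimal sketch of the git side (your-username is a placeholder for your GitHub account):

# Clone your fork locally
git clone https://github.com/your-username/spark.git
cd spark
# Track the Apache repo as "upstream" so you can sync with it later
git remote add upstream https://github.com/apache/spark.git
git fetch upstream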
JIRA - Issue tracking funtimes
● It’s like Bugzilla or FogBugz
● There is an Apache JIRA for all Apache projects
● You can (and should) sign up for an account
● All changes in Spark (now) require a JIRA
● https://www.youtube.com/watch?v=ca8n9uW3afg
● Check it out at:
○ https://issues.apache.org/jira/browse/SPARK
The different pieces of Spark
[Diagram: the Apache Spark core surrounded by SQL & DataFrames, Streaming, language APIs (Scala, Java, Python, & R), graph tools (Bagel & GraphX), Spark ML, MLlib, and community packages]
The different pieces of Spark: 2.0+
[Diagram: the same pieces as above, with Structured Streaming added alongside SQL & DataFrames and Streaming]
What can we do with ASF JIRA?
● Search for issues (remember to filter to Spark project)
● Create new issues
○ search first to see if someone else has reported it
● Comment on issues to let people know we are working on it
● Ask people for clarification or help
○ e.g. “Reading this I think you want the null values to be replaced by
a string when processing - is that correct?”
○ @mentions work here too
What can’t we do with ASF JIRA?
● Assign issues (to ourselves or other people)
○ In lieu of assigning we can “watch” & comment
● Post long design documents (create a Google Doc & link to
it from the JIRA)
● Tag issues
○ While we can add tags, they often get removed
Finding a good “starter” issue:
● There are explicit starter tags in JIRA we can search for
● But often the starter tag isn’t applied
● Read through and look for simple issues
● Pick something in the same component you eventually want
to work in
○ And/or consider improving the non-Scala language APIs for the
component(s) you want to work on.
● Look at the reporter and commenters - is there a
committer or someone whose name you recognize?
● Leave a comment that says you are going to start working
on this
Exercise 2: Find an issue you want to work on
https://issues.apache.org/jira/browse/SPARK
Also grep for TODO in components you are interested in (e.g.
grep -r TODO ./python/pyspark or grep -R TODO ./core/src)
Look between language APIs and see if anything is missing
that you think is interesting -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
http://spark.apache.org/docs/latest/api/python/index.html
Feel free to work in groups :)
Exercise 3a: Building Spark
./build/sbt
or
./build/mvn
Working in Python? Make sure to build the package target so
your Python code will run :)
You can quickly verify the build with the Spark Shell :)
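A minimal build sketch (the flags shown are common choices, not the only ones):

# Maven: compile everything, skipping tests for speed
./build/mvn -DskipTests clean package
# Or sbt: the package target builds the jars PySpark needs
./build/sbt package
# Quick sanity check once the build finishes
./bin/spark-shell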
What about documentation changes?
● Still use JIRAs to track
● We can’t edit the wiki :(
● But a lot of documentation lives in docs/*.md
Exercise 3b: Building Spark’s docs
./docs/README.md has a lot of info - but quickly:
SKIP_API=1 jekyll build
SKIP_API=1 jekyll serve --watch
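Note that these run from inside the docs/ directory, and Jekyll needs to be installed first. A minimal sketch (assuming a working Ruby install; see docs/README.md for the exact gem list):

cd docs
gem install jekyll
SKIP_API=1 jekyll serve --watch
# Preview the generated site at http://localhost:4000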
Finding your way around the code
● Organized into sub-projects by directory
● IntelliJ is very popular with Spark developers
○ The free version is fine
● Some people like using Emacs + ENSIME or Magit too
● Language specific code is in each sub directory
Testing the issue
The spark-shell can often be a good way to verify the issue
reported in the JIRA is still occurring and come up with a
reasonable test.
Once you’ve got a handle on the issue in the spark-shell (or
if you decide to skip that step) check out
./[component]/src/test for Scala or doctests for Python
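For example, a hypothetical null-handling bug could be checked like this in the spark-shell (the data and expected behaviour here are made up for illustration):

scala> val df = Seq(("a", null), ("b", "ok")).toDF("k", "v")
scala> df.na.fill("unknown").show()
// If the nulls aren't replaced the way the JIRA expects, this snippet
// is a natural seed for a regression test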
After we get our code working
(or even better while we work on it)
● Remember to follow the style guides
○ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
● Please always add tests
○ For development we can run ScalaTest with “sbt [module]/testOnly”
○ In Python we can pick a module with ./python/run-tests (see the sketch after this list)
● ./dev/lint-scala & ./dev/lint-python check for some style
● Changing the API? Make sure we pass MiMa!
○ Sometimes it’s OK to make breaking changes, and MiMa can be a bit
overzealous, so adding exceptions is common
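A few concrete command sketches (the suite and module names are just examples):

# Run a single Scala suite
./build/sbt "core/testOnly org.apache.spark.rdd.RDDSuite"
# Run the Python tests for one module
./python/run-tests --modules=pyspark-core
# Style checks before pushing
./dev/lint-scala
./dev/lint-python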
A bit more on MiMa
● Spark wishes to maintain binary compatibility
○ in non-experimental components
● MiMa exclusions can be added if we verify (and document
how we verified) the compatibility - see the sketch below
● Often MiMa is a bit oversensitive, so don’t feel stressed
- feel free to ask for help if confused
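For reference, exclusions live in project/MimaExcludes.scala and look roughly like this (the class and problem-type names are illustrative - copy the style of the existing entries):

// SPARK-XXXXX: document the JIRA and why the change is binary-safe
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.SomeClass.someChangedMethod")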
Exercise 4: Open your editors
No arguing about which editor please - kthnx
Making a doc change? Look inside docs/*.md
Making a code change? grep, IntelliJ, or GitHub’s in-project
code search can all help you find what you’re looking for.
Yay! Let’s make a PR :)
● Push to your branch
● Visit github
● Create the PR (put the JIRA number and component in the title - see the sketch below)
○ Components control where our PR shows up on
https://spark-prs.appspot.com/
● If you’ve been whitelisted, tests will run automatically
● Otherwise it will wait for someone to verify it
● Tag it “WIP” if it’s a work in progress (but maybe wait)
[puamelia]
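A minimal sketch of the git side (the JIRA number, component, and branch name are placeholders):

# Work on a topic branch named after the JIRA
git checkout -b SPARK-XXXXX-my-fix
git commit -am "Describe the fix"
git push origin SPARK-XXXXX-my-fix
# Then open the PR on GitHub with a title like:
#   [SPARK-XXXXX][SQL] Describe the fix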
Code review time
● Note: this is after the pull request creation
● I believe code reviews should be done in the open
○ With an exception of when we are deciding if we want to try and
submit a change
○ Even then should have hopefully decided that back at the JIRA stage
● My personal beliefs & your org’s may not align
Mitchell Joyce
And now onto the actual code review...
● Most often committers will review your code (eventually)
● Other people can help too
● People can be very busy (check the release schedule)
● If you don’t get traction try pinging people
○ Me ( @holdenkarau - I can’t merge your code but I can take a look)
○ The author of the JIRA (even if not a committer)
○ The shepherd of the JIRA (if applicable)
○ The person who wrote the code you are changing (git blame)
○ Active committers for the component
Mitchell Joyce
What does the review look like?
● LGTM - Looks good to me
○ Individual thinks the code looks good - ready to merge (sometimes
LGTM pending tests or LGTM but check with @[name]).
● SGTM - Sounds good to me (normally in response to a
suggestion)
● Sometimes get sent back to the drawing board
● Not all PRs get in - it’s OK!
○ Don’t feel bad & don’t get discouraged.
● Mixture of in-line comments & general comments
That’s a pretty standard small PR
● It took some time to get merged in
● It was fairly simple
● Review cycles are long - so move on to other things
● Only two reviewers
● Apache Spark Jenkins comments on build status :)
○ “Jenkins retest this please” is great
● Big PRs - like making PySpark pip installable - can have
>10 reviewers and take a long time
● Sometimes it can be hard to find reviewers - tag your PRs
& ping people on GitHub
Don’t get discouraged
David Martyn Hunt
It is normal to not get every pull request accepted
Sometimes other people will “scoop” you on your
pull request
Sometimes people will be super helpful with your
pull request
So who was that “Spark QA”?
● Automated pull request builder
● Jenkins based
● Runs all of the tests & style checks
● Lives in Berkeley
● Test logs live on, artifacts not so much
● https://amplab.cs.berkeley.edu/jenkins
Some changes require even more testing
● spark-perf (common for ML changes)
● spark-sql-perf (common for SQL changes)
● spark-integration-tests (integration testing)
While we are waiting:
● Keep merging in master when we get out of sync (sketch below)
● If we don’t, Jenkins can’t run :(
● We get out of sync surprisingly quickly!
● If our pull request gets older than 30 days it might get
auto-closed
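A minimal sync sketch (assuming "upstream" points at apache/spark and my-fix-branch is a placeholder):

git checkout my-fix-branch
git fetch upstream
git merge upstream/master   # merge rather than rebase once review has started
git push origin my-fix-branch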
So, to review: where do we get started?
● Search for “starter” on JIRA
● Look on the mailing list for problems
● Stack Overflow - lots of questions, some of which are bugs
● grep for TODO, “broken”, and FIXME
● Compare APIs between languages
● Customer/user reports?
But what about when we want to make big changes?
● Talk with the community
○ Developer mailing list dev@spark.apache.org
○ User mailing list user@spark.apache.org
● Create a public design document (normally a Google Doc)
● Consider if it can be published as a spark-package
instead
Other resources:
● “Contributing to Apache Spark” -
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc,
etc.)
○ http://spark.apache.org/docs/latest/
What about creating a package?
● Relatively simple - you need to publish to Maven Central
● Listed on http://spark-packages.org
● Cross-building (across Spark versions) is not super easy
● If you’re building with sbt, check out
https://github.com/databricks/sbt-spark-package to make
it easy to publish (sketch at the end of this list)
● Used to do API compatibility checks
● Sometimes flaky - just republish if it doesn’t go
through
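A rough build.sbt sketch using sbt-spark-package (the key names come from that plugin’s README; the package name and versions are placeholders):

// build.sbt, with sbt-spark-package added to project/plugins.sbt
spName := "your-org/your-package"   // the name on spark-packages.org
sparkVersion := "2.0.0"             // Spark version to build against
sparkComponents += "sql"            // adds spark-sql as a provided dependency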
Signing your packages
● Required
● Can be a bit odd (the sbt-pgp plugin sometimes has
issues with keys that have passphrases)
What things can be good Spark packages?
● Input formats (especially Spark SQL, Streaming)
● Machine learning pipeline components & algorithms
● Testing support
● Monitoring data sinks
● Deployment tools
How about writing a book?
● Can be lots of fun
● Can also take up 100% of your “free” time
● Can get you invited to more nerd parties
● Most of the publishers are looking to improve/broaden
their Spark book line-ups
● Like an old book that hasn’t been updated? Talk to the
publisher about updating it.
Spark Videos
● Apache Spark YouTube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Spark books:
● Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● In Early Release: High Performance Spark
And the next book…..
First five chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks:
● September
○ This meetup (yay)!
○ Toronto - Beyond the Code (Contributing to Spark)
○ New York City Strata Conf (Structured Streaming & Machine Learning)
● October
○ PyData DC - Making Spark go fast in Python (vroom vroom)
○ Salt Lake City Spark Meetup - TBD
○ London - OSCON - Getting Started Contributing to Spark
● December
○ Strata Singapore (Introduction to Datasets)
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark users: have some
simple UDFs you wish ran faster
that you’re willing to share?
http://bit.ly/pySparkUDF
Pssst: Have feedback on the workshop? Give me a shout
(holden@pigscanfly.ca) if you feel comfortable doing so :)