Between (a mobile app for couples, downloaded 20M times globally) runs everything from daily batch jobs for metric extraction to analysis and dashboards on Spark. Spark is widely used by engineers and data analysts at Between; thanks to its performance and extensibility, data operations have become extremely efficient. The entire team, including business development, global operations, and designers, makes use of the results, so Spark empowers the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader at Between, will present how things are going at Between. After this presentation, listeners will know how a small, agile team lives with data (how we build our organization, culture, and technical base).
5. 2011: 100 beta users
2012: 1.0 release, 2M downloads
2013: 5M downloads, global launches
2014: Between 2.0, 10M downloads
2015: Between 3.0
2016: Monetization starts, 20M downloads
2017: Global expansion, new business, team of 60
Kevin Kim
• From Seoul, South Korea
• Co-founder; formerly a product developer
• Now a data analyst, engineer, and team leader
• Founder of the Korea Spark User Group
• Committer and PMC member of Apache Zeppelin
8. Intro to the Between Data Team
• Data engineers × 4
– Manager: an engineer with a broad stack of knowledge and experience
– Junior engineer: formerly a server engineer
– Senior engineer: with deep experience and skills
– Data engineer: formerly a top-level Android developer
• Hiring a data analyst and a machine learning expert
9. What the Between Data Team Does
• Analysis
– Service monitoring
– Analyzing usage of new features and building product strategies
• Data Infrastructure
– Building and managing infrastructure
– Spark, Zeppelin, AWS, BI tools, etc
• Third-Party Management
– Mobile attribution tools for marketing (Kochava, Tune, AppsFlyer, etc)
– Google Analytics, Firebase, etc
– Ad networks
10. What the Between Data Team Does
• Machine Learning Study & Research
– For the next business model
• Team Support
– Helping build business, product, and monetization strategies
• Performance Marketing Analysis
– Monitoring the effectiveness of marketing budgets
• Product Development
– Improving client performance, server architecture, etc
14. Requirements
• Big Data
– 2TB/day of log data from millions of DAU
– 20M users
• Small Team
– A team of 4 that needs to support 50 people
• Tiny Budget
– The company is just past break-even (BEP)
• We need a very efficient tech stack!
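A quick scale check on the numbers above. The deck only says "millions of DAU", so the DAU figure below is a hypothetical round number chosen for illustration, not Between's real metric:

```python
# Back-of-envelope: how much log data is that per active user?
DAILY_LOG_BYTES = 2 * 10**12   # 2 TB/day of logs (from the slide)
ASSUMED_DAU = 2 * 10**6        # hypothetical: "millions of DAU"

bytes_per_user = DAILY_LOG_BYTES // ASSUMED_DAU
print(bytes_per_user)          # 1_000_000 bytes, i.e. about 1 MB of logs per user per day
```

At that rate even a single day of logs is far beyond what a single-machine workflow handles comfortably, which is what motivates the Spark-plus-cloud stack described next.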
15. Way We Work
• Use Apache Spark as a general processing engine
• Scriptify everything with Apache Zeppelin
• Heavy utilization of AWS and Spot instances to cut cost
• Proper selection of BI Dashboard Tools
16. Apache Spark, the General Engine
• Definitely the best way to deal with big data (as you all know!)
• Its performance and agility exactly meet startup requirements
– We have used Spark since 2014
• A great match for cloud services, especially spot instances
– Utilizing the bursty nature of cloud workloads
17. Scriptify Everything with Zeppelin
• We do everything on Zeppelin!
• Daily batch tasks as Spark scripts (using the Zeppelin scheduler)
• Ad hoc analysis
• Cluster control scripts
• The world's first user of Zeppelin!
• More than 200 Zeppelin notebooks
18. AWS Cloud
• Spot instances are my friend!
– We mostly use spot instances for analysis
– Only 10–20% of the cost of on-demand instances
• Dynamic cluster launch with Auto Scaling
– Launch clusters automatically for batch analysis
– Manually launch extra clusters from Zeppelin with an Auto Scaling script
– Automatically shrink clusters when idle
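The saving quoted above is easy to make concrete. The prices and cluster size below are hypothetical; only the 10–20% spot fraction comes from the slide:

```python
# Back-of-envelope for the spot-instance saving (spot at ~10-20% of on-demand).
ON_DEMAND_HOURLY = 1.00    # assumed on-demand price, $/hour (hypothetical)
SPOT_FRACTION = 0.15       # midpoint of the 10-20% range above
CLUSTER_NODES = 10         # hypothetical cluster size
BATCH_HOURS = 3            # cluster only lives while the batch runs

on_demand_cost = ON_DEMAND_HOURLY * CLUSTER_NODES * BATCH_HOURS
spot_cost = on_demand_cost * SPOT_FRACTION
print(f"on-demand: ${on_demand_cost:.2f}, spot: ${spot_cost:.2f}")
```

Combined with tearing clusters down when idle, the per-batch cost is a small fraction of keeping an always-on on-demand cluster.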
19. BI Dashboard Tools
• Use Zeppelin as a dashboard with Spark SQL, shared via ZEPL
• Holistics (holistics.io) or Dash (plot.ly/products/dash/)
21. RDD API or DataFrame API?
• Spark now has two very different styles of API
– The programmatic RDD API
– The SQL-like DataFrame / Dataset API
• For many simple ad hoc queries
– DataFrames work well
• For more complex, deep-dive analytic questions
– RDDs work better
• For now, we mostly use RDDs, with DataFrames for ML or simple ad hoc tasks
22. Sushi or Cooked Data?
• Keep data in as raw a form as possible!
– ETL usually causes trouble and increases management cost
– The Sushi Principle (Joseph & Robert at Strata)
– Drastically reduces operation & management cost
– Apache Spark is a great tool for extracting insight from raw data
fresh data!
23. To Hire a Data Analyst or Not?
• For a data analyst, the expected skill set is..
– Excel, SQL, R, ..
• Skills that are not expected..
– Programmatic APIs like Spark RDD
– Cooking raw data
• We prefer data engineers with analytic skills
• Working with a data analyst may require adding some ETL tasks
24. Better, Faster Team Support?
• Better - Zeppelin is great for analyzing data, but not enough for sharing data with the team
– We have very few alternatives
– Use BI dashboard tools more?
– Still looking for a good way
• Faster - Launching a Spark cluster takes a few minutes
– Not bad, but we want it faster
– Google BigQuery or AWS Athena
– A SQL database with ETL
25. Future Plans?
• Prepare for an exploding # of data operations!
– The team is growing, the business is growing
– # of tasks
– # of 3rd-party data products
– Communication cost
• Operations with machine learning & deep learning
– A better way to manage task & data flow
27. What Matters for Us
• Team Support
– Each team should see the right data and make good decisions from it
– Regular meetings, fast responses to ad hoc data requests
– Ultimately, everything we do should relate to the company's business
• Technical Leadership
– Technical investment in the competence of both the company and individuals
– Working at Between should be the best experience for each individual
• Social Impact
– Does our work have a valuable impact on society?
– Open source, activity in the community
28. How Apache Spark is Powering a Startup
• One great general-purpose tool
– Daily batch tasks
– Agile, ad hoc analysis
– Drawing dashboards
– Many more..
• Helps save time and reduce the cost of data operations
• A great experience for engineers and analysts
• Sharing know-how to and from the community
29. Working as a Data Engineer at a Startup
• Fascinating, fast evolution of tech
• Requires hard work and labor
• Data work shines only when it is understood and used by teammates
[Closing slide images: Two Peasants Digging, Vincent van Gogh; Two Men Digging, Jean-François Millet]