17. Zeppelin Current Status
1 Release
63 Contributors worldwide
689 Stars on GH
~300/900 Emails at users/dev
@i.a.o
http://zeppelin.incubator.apache.org
23. An Idea
Would not it be cool if …
you could have your own Google
Analytics?
24. An Idea
Would not it be cool if …
you could have your own Google
Analytics?
sorry, we already saw it in eCG
talk..
ok, let’s pick something else
25. An Idea
Would not it be cool if …
you could be the first to know when
there is a new interesting*
open-source project
27. Data: Github archive
https://www.githubarchive.org
• GitHub logs, hosted in the cloud
• Collaboration between GitHub and Google engineers
• 20+ event types, 250+ GB since 2012
• Proprietary software
• Available on BigQuery
28. But what if you could:
analyse this data independently, without asking
permission or paying anybody?
29. But what if you could:
analyse this data independently, without asking
permission or paying anybody?
well, with ASF you can!
32. We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters PublicEvent records
• Joins logs w/ more data from GitHub API calls
• Shows simple HTML template, to visualise the
list
• Sends email notifications
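
A minimal sketch of the first three steps in the list above (download, read, filter), written in plain Python rather than as a Zeppelin paragraph. Several things here are assumptions and not from the slides: the `data.githubarchive.org` host, the hourly file naming `YYYY-MM-DD-H.json.gz`, and a top-level `type` field on every JSON line. The helper names `archive_url` and `public_events` are hypothetical.

```python
import gzip
import io
import json
from datetime import datetime

# Assumption: hourly dumps are published under this host and naming scheme.
ARCHIVE_URL = "http://data.githubarchive.org/{day}-{hour}.json.gz"


def archive_url(ts: datetime) -> str:
    """Build the (assumed) URL of the hourly dump that covers `ts`."""
    return ARCHIVE_URL.format(day=ts.strftime("%Y-%m-%d"), hour=ts.hour)


def public_events(gz_bytes: bytes):
    """Yield events of type PublicEvent from one gzipped JSON-lines dump."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            event = json.loads(line)
            # Assumption: each record carries its event type in "type".
            if event.get("type") == "PublicEvent":
                yield event
```

The filter works on bytes already in memory, so it can be tried on a small hand-made sample before pointing it at a real download.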
33. We are going to build a Notebook that:
basically, sends you digest emails:
42. We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters PublicEvent records
• Joins in external information through a remote API
call
• Shows simple HTML template to visualise the
list
• Sends email notifications
49. We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters PublicEvent records
• Joins in external information through a remote API call
• Shows simple HTML template to visualise the
list
• Sends email notifications
50. Call Github API
Getting more information about the repository
A GitHub personal access token raises the rate limit:
github.com/<username> => Edit Profile => Personal access tokens => Generate new token
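
A hedged sketch of such a call, using only the Python standard library. It assumes the public `GET /repos/{owner}/{repo}` endpoint and the `Authorization: token …` header form; the helper names `repo_request` and `fetch_repo` and the `User-Agent` string are hypothetical and not from the talk.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_request(full_name: str, token: str) -> urllib.request.Request:
    """Prepare an authenticated request for one repository ("owner/name")."""
    return urllib.request.Request(
        f"{API_ROOT}/repos/{full_name}",
        headers={
            # The personal access token is what raises the rate limit.
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "User-Agent": "zeppelin-notebook-sketch",  # hypothetical name
        },
    )


def fetch_repo(full_name: str, token: str) -> dict:
    """Perform the call and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(repo_request(full_name, token)) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes it possible to check the URL and headers without spending any of the rate limit.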
52. We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters PublicEvent records
• Joins in external information through a remote API
call
• Shows simple HTML template
• Sends email notifications
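
The last two steps in the list above can be sketched as follows: render the repository list into a small HTML fragment, then wrap it into an email. This is an illustration under assumptions, not the notebook from the talk: the input is taken to be a list of dicts with `name` and `description` keys, and `render_digest`, `build_email`, `send` and the SMTP host are hypothetical.

```python
import html
import smtplib
from email.message import EmailMessage


def render_digest(repos) -> str:
    """Render repositories as an HTML list, escaping all dynamic text."""
    rows = "\n".join(
        '<li><a href="https://github.com/{0}">{0}</a>: {1}</li>'.format(
            html.escape(r["name"]), html.escape(r.get("description") or "")
        )
        for r in repos
    )
    return "<h3>New public repositories</h3>\n<ul>\n" + rows + "\n</ul>"


def build_email(sender: str, recipient: str, repos) -> EmailMessage:
    """Compose a digest with a plain-text part and an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = "New open-source projects: {} found".format(len(repos))
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(r["name"] for r in repos))
    msg.add_alternative(render_digest(repos), subtype="html")
    return msg


def send(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Hand the message to an SMTP server (the host is an assumption)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

The same HTML fragment can serve both for on-screen display and as the email body, which keeps the digest and the notebook view in sync.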
short intro
designed as a workshop - go home with some practical skills
Go through some history of the Zeppelin project
and then I’ll guide you through a hands-on example of one data product
any “product”, based on data
Easy, because of Apache & BigData
ASF is a leader in the BigData ecosystem
one of the biggest foundations, 200+ projects
gives you the power to crunch the data that only big companies have
interesting engineering challenges, job
a lot of computing options
some of them do have a web UI to interact with, but are project-specific
with Z - you have visualisation options, agnostic to the backend
allows you to be flexible with your stack
Spark is the most advanced interpreter now - historical reasons
* what is Zeppelin? and some history
started at NFLabs in Seoul, South Korea
open-source (free as in beer) since 2013
when mature enough to provide value to the community -> ASF
commercial app
prototype
OSS, free as in beer
visualisations
2 different backends: Hive and Shark
with pluggable backend systems, through ‘interpreters’
so, what exactly is Zeppelin
that’s how it looks after 10 months under ASF
abstracting the interpreters was very important
if you want to build a data product - you better be prepared for scale
quality of the product depends on the amount of data
* plan for both: a cluster (cloud) and a laptop (dev prod) for free
* …and you have those, thanks to ASF!
good example of data product using web logs!
talked about Apache a lot (Kylin, Tez..), we all love open source
GitHub is the biggest open-source code repository
problem: you cannot follow an organisation on GitHub!
OSS is awesome, so many things happen - hard to keep up
say eBay, or any other tech company whose OSS work you are interested in
now we have software/hardware and an idea for a data product. Something is missing
GitHub open-sources its log data
many companies are log-generating machines, but GH pushes it further
* available to crunch, using tons of closed-source engineering effort
Would not that be cool?
Hands-on part
break-down: 8 steps
build a missing feature
definition of “interesting” may vary
i.e. not just “new from big companies”, but intelligent suggestions based on
going to be using different interpreters: shell, spark, kylin
sequential download, does not utilise full channel bandwidth
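
One possible way around the sequential download, sketched in Python with a thread pool so that several files are in flight at once. This swaps in a generic technique and is not necessarily what the notebook does; `fetch`, `download_all` and the worker count are hypothetical.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> bytes:
    """Download one file fully into memory (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_all(urls, fetch_one=fetch, workers=8) -> dict:
    """Fetch all URLs concurrently and return a {url: payload} mapping."""
    urls = list(urls)  # materialise once, since it is iterated twice
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its data.
        return dict(zip(urls, pool.map(fetch_one, urls)))
```

Downloads are I/O-bound, so threads are enough here; `fetch_one` is a parameter so the function can be exercised with a stub instead of real network calls.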