4. Building a Data Product using Apache Zeppelin - Apache Kylin Meetup @Shanghai

1. Building a Data Product using Apache Zeppelin (incubating) NFLabs for ApacheCon ’15 EU

2. Alexander Bezzubov Data engineer @NFLabs based in Seoul, South Korea bzz@apache.org

3. So you want to build a data product… What do you need?

4. To build a data product we need: • Idea • Data • Software (to process the data) • Hardware (to run the software) • …and a brain

5. To build a data product we need: • Idea • Data • Software (to process the data) • Hardware (to run the software) • …and a brain

6. Software

7. Software

8. Software … …

9. Software … …

10. Zeppelin

11. Zeppelin: an open-source analytical environment with pluggable backend data-processing systems and a notebook-style GUI for visualisations

12. Zeppelin History • 12.2012 Commercial App using AMP Lab Shark 0.5 • 08.2013 NFLabs Internal project Hive/Shark • 10.2013 Prototype Hive/Shark • 12.2014 ASF Incubation • http://zeppelin.incubator.apache.org

13. Zeppelin History

17. Zeppelin Current Status • 1 Release • 63 Contributors worldwide • 689 Stars on GH • ~300/900 Emails at users/dev @i.a.o • http://zeppelin.incubator.apache.org

18. Zeppelin Architecture

19. Hardware

20. Hardware . . . . . .

21. Idea

22. An Idea: Wouldn't it be cool if …

23. An Idea: Wouldn't it be cool if … you could have your own Google Analytics?

24. An Idea: Wouldn't it be cool if … you could have your own Google Analytics? Sorry, we already saw it in the eCG talk. OK, let’s pick something else.

25. An Idea: Wouldn't it be cool if … you could be the first to know when there is a new interesting* open-source project?

26. Data

27. Data: GitHub Archive https://www.githubarchive.org • GitHub logs, hosted in the cloud • Collaboration between GitHub and Google engineers • 20+ event types, 250+Gb since 2012 • Proprietary software • Available on BigQuery

28. But what if you could: analyse this data independently, without asking permission or paying anybody?

29. But what if you could: analyse this data independently, without asking permission or paying anybody? Well, with ASF you can!

30. Let’s build a data product for ourselves!

31. Building a product

32. We are going to build a Notebook that: • Downloads the latest data from GitHub Archive • Reads & explores the dataset • Imports and filters the PublicEvent events • Joins the logs with more data from GitHub API calls • Shows a simple HTML template to visualise the list • Sends email notifications

33. We are going to build a Notebook that: basically, sends you digest emails:

34. Start Zeppelin with ./bin/zeppelin-daemon.sh start, then create a new notebook. Download: http://zeppelin.incubator.apache.org/download.html

35. Zeppelin Architecture

37. Load Dependency

39. Download Data In serial, sample, using shell interpreter

40. Download Data In serial, whole day, using shell interpreter (not needed here, as we have the data prepared)
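
The slides do not show the download commands themselves, so the following is only a minimal sketch of how the per-hour file names for a sample hour or a whole day could be generated. The host name and the file naming scheme (one gzipped JSON file per hour, hour not zero-padded) are assumptions that are not stated on the slides, and `hourly_urls` is a hypothetical helper.

```python
from datetime import date

# Assumption (not on the slides): archives are published as one gzipped
# JSON file per hour under this host, named YYYY-MM-DD-H.json.gz.
BASE_URL = "http://data.githubarchive.org"


def hourly_urls(day, hours=range(24)):
    """Return one archive URL per requested hour of the given day."""
    return [f"{BASE_URL}/{day.isoformat()}-{hour}.json.gz" for hour in hours]


# A single-hour sample, and the full day (24 files).
sample_urls = hourly_urls(date(2015, 1, 1), hours=[15])
whole_day_urls = hourly_urls(date(2015, 1, 1))
```

Downloading these one after another corresponds to the "in serial" variant on the slides.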

41. Download Data In parallel, using Spark interpreter
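
The talk performs the parallel download with the Spark interpreter. As a rough stand-in that runs without a cluster, the same idea can be sketched with a plain thread pool. This is not the code from the slides: `fetch` and `download_all` are hypothetical helpers, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one archive file and return its raw (still gzipped) bytes."""
    with urlopen(url) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, workers=8):
    """Fetch all URLs concurrently; results are returned in the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Passing `fetch_one` as a parameter keeps the network call replaceable, which makes the function easy to try out with a fake fetcher before pointing it at real archive URLs.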

42. We are going to build a Notebook that: • Downloads the latest data from GitHub Archive • Reads & explores the dataset • Imports and filters the PublicEvent events • Joins on external information through a remote API call • Shows a simple HTML template to visualise the list • Sends email notifications

43. Read Data
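
The slide's own code is not in the transcript. As a minimal sketch, assuming each downloaded file is gzip-compressed text with one JSON event per line, reading it could look like this. `read_events` is a hypothetical helper, and the two demo events are made up.

```python
import gzip
import json


def read_events(raw_gz):
    """Decompress one hourly archive and parse each non-empty line as a JSON event."""
    text = gzip.decompress(raw_gz).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Tiny in-memory stand-in for a downloaded file (made-up events).
demo_file = gzip.compress(b'{"type": "PushEvent"}\n{"type": "PublicEvent"}\n')
parsed_events = read_events(demo_file)
```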

44. Explore Data #1 Pie Chart of Event types
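
The numbers behind such a pie chart are simply event counts per type. A minimal sketch of that aggregation, assuming every event carries a `type` field as in the GitHub events format (the demo events are made up):

```python
from collections import Counter


def count_by_type(events):
    """Count how many events there are of each type."""
    return Counter(event["type"] for event in events)


demo_type_events = [
    {"type": "PushEvent"},
    {"type": "PushEvent"},
    {"type": "PublicEvent"},
]
type_counts = count_by_type(demo_type_events)
```

In the notebook itself the equivalent aggregation would normally be a group-by query whose result Zeppelin renders as a chart.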

45. Explore Data #2 Top 10 organisations by event type
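
A ranking like this can be sketched as a filtered count. The sketch assumes that events raised in an organisation's repository carry an `org` object with a `login` field, and that events without it should be ignored. `top_orgs` is a hypothetical helper and the demo events are made up.

```python
from collections import Counter


def top_orgs(events, event_type, n=10):
    """Return the n organisations with the most events of the given type."""
    counts = Counter(
        event["org"]["login"]
        for event in events
        if event["type"] == event_type and "org" in event
    )
    return counts.most_common(n)


demo_org_events = [
    {"type": "PushEvent", "org": {"login": "apache"}},
    {"type": "PushEvent", "org": {"login": "apache"}},
    {"type": "PushEvent", "org": {"login": "nflabs"}},
    {"type": "PushEvent"},  # personal repository, no organisation
    {"type": "WatchEvent", "org": {"login": "nflabs"}},
]
push_ranking = top_orgs(demo_org_events, "PushEvent")
```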

47. Filter: cleanup Only organisations open sourcing repositories

48. Filter: interesting companies Only organisations open sourcing repositories
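
The two filter steps can be sketched as follows: first keep only PublicEvent records that belong to an organisation, then narrow them to a watchlist of organisations. The watchlist and both helper names are hypothetical, and the field names assume the GitHub events format.

```python
def public_org_events(events):
    """Keep only PublicEvent records raised in an organisation's repository."""
    return [e for e in events if e.get("type") == "PublicEvent" and e.get("org")]


def by_watchlist(events, watchlist):
    """Keep only events whose organisation login is on the watchlist (case-insensitive)."""
    wanted = {name.lower() for name in watchlist}
    return [e for e in events if e["org"]["login"].lower() in wanted]


demo_filter_events = [
    {"type": "PublicEvent", "org": {"login": "Apache"}, "repo": {"name": "apache/x"}},
    {"type": "PublicEvent", "repo": {"name": "someone/y"}},
    {"type": "PushEvent", "org": {"login": "apache"}, "repo": {"name": "apache/w"}},
    {"type": "PublicEvent", "org": {"login": "acme"}, "repo": {"name": "acme/z"}},
]
open_sourced = public_org_events(demo_filter_events)
interesting = by_watchlist(open_sourced, ["apache"])
```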

49. We are going to build a Notebook that: • Downloads the latest data from GitHub Archive • Reads & explores the dataset • Imports and filters the PublicEvent events • Joins on external information through a remote API • Shows a simple HTML template to visualise the list • Sends email notifications

50. Call GitHub API: getting more information about a repository. Use a GitHub personal access token to raise the rate limit: github.com/<username> => Edit Profile => Personal access tokens => Generate new token
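
As a minimal sketch of this step, the request for one repository can be built against the public `https://api.github.com/repos/<owner>/<repo>` endpoint, attaching the personal access token when one is available. The helper names and the `User-Agent` string are made up for illustration, and the slides' actual code may differ.

```python
import json
from urllib.request import Request, urlopen


def repo_request(full_name, token=None):
    """Build a GitHub API request for one repository, e.g. 'owner/repo'."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "zeppelin-digest-sketch",  # made-up client name
    }
    if token:
        # Authenticated requests get a higher rate limit than anonymous ones.
        headers["Authorization"] = f"token {token}"
    return Request(f"https://api.github.com/repos/{full_name}", headers=headers)


def fetch_repo(full_name, token=None):
    """Send the request and return the repository description as a dict."""
    with urlopen(repo_request(full_name, token)) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers without touching the network or spending rate limit.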

51. Join orgs and repos
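
The join itself is a lookup of each filtered event's repository in the details fetched from the API. A minimal sketch, assuming `repo_details` maps a repository's full name to its API response (using the `description` and `stargazers_count` fields of that response); the helper name and demo data are hypothetical.

```python
def join_events_with_repos(events, repo_details):
    """Attach API details to each event; events without details are skipped."""
    rows = []
    for event in events:
        name = event["repo"]["name"]
        details = repo_details.get(name)
        if details is None:
            continue
        rows.append({
            "org": event["org"]["login"],
            "repo": name,
            "description": details.get("description") or "",
            "stars": details.get("stargazers_count", 0),
        })
    return rows


demo_join_events = [
    {"org": {"login": "apache"}, "repo": {"name": "apache/x"}},
    {"org": {"login": "acme"}, "repo": {"name": "acme/z"}},
]
demo_details = {"apache/x": {"description": None, "stargazers_count": 5}}
joined_rows = join_events_with_repos(demo_join_events, demo_details)
```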

52. We are going to build a Notebook that: • Downloads the latest data from GitHub Archive • Reads & explores the dataset • Imports and filters the PublicEvent events • Joins on external information through a remote API call • Shows a simple HTML template • Sends email notifications

53. HTML Preview To generate a template

54. HTML Preview To output the results
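
A minimal sketch of generating such a template from the joined rows, escaping any text that comes from the API. The markup and the `render_digest` helper are made up for illustration. In a Zeppelin paragraph, printing the result prefixed with `%html` makes the notebook render it as HTML rather than plain text.

```python
from html import escape


def render_digest(rows):
    """Render repository rows as a small HTML list."""
    items = "\n".join(
        '<li><a href="https://github.com/{0}">{0}</a>: {1}</li>'.format(
            escape(row["repo"], quote=True), escape(row["description"])
        )
        for row in rows
    )
    return f"<h3>Newly open-sourced repositories</h3>\n<ul>\n{items}\n</ul>"


digest_html = render_digest([{"repo": "apache/x", "description": "a <b> tool"}])
```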

56. Send Email

57. Send Email

58. Send Email
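
The mail library used on the slides is not visible in the transcript. As a minimal sketch with Python's standard library, the digest can be wrapped in a message with a plain-text fallback and an HTML alternative. The subject line, the addresses, and the SMTP host and port are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_digest_email(html_body, sender, recipient):
    """Create a multipart message: plain-text fallback plus the HTML digest."""
    msg = EmailMessage()
    msg["Subject"] = "New open-source projects digest"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("This digest is best viewed as HTML.")
    msg.add_alternative(html_body, subtype="html")
    return msg


def send(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (placeholder host and port)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Building the message separately from sending it makes it possible to inspect the result in the notebook before any mail leaves the machine.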

60. Schedule

61. Kylin: interpreter setup

62. Kylin: consume data

63. Alexander Bezzubov bzz@apache.org http://s.apache.org/zeppelin-workshop
