Build your open source data science platform

2,434 views

Architecture for an enterprise-grade, free open source data science platform, given available components as of Fall 2017.

Published in: Software

  1. BUILD YOUR OWN OPEN SOURCE DATA SCIENCE PLATFORM. Dr. David Talby, CTO, Pacific AI
  2. LET’S BUILD A PLATFORM: 1. Ground Rules 2. Components 3. Putting It All Together
  3. AT THE BEGINNING, THERE WAS SEARCH
  4. SCOPE
       • Integrate Data: ETL, Streaming, Quality, Enrichment, Dataflows
       • Discover & Visualize: SQL, Search, Visualization, Dashboards, Real-Time Alerts
       • Train Models: ML, DL, DM, NLP, …; Explore & Visualize; Train & Optimize; Collaboration; Workflows
       • Productize Models: Deploy APIs, Publish APIs, CI & CD for Models, Measurement, Feedback
       • Infrastructure: Deployment, Orchestration, Security, Monitoring, Single Sign-On, Backup, Scaling
       • Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  5. GOALS: Enterprise Grade; Scales from GB to PB; Unified & Modular; Cutting Edge
  6. CONSTRAINTS: No Commercial Software; No Copyleft; No SaaS; Built It
  7. LET’S BUILD A PLATFORM: 1. Ground Rules 2. Components 3. Putting It All Together
  8. SCOPE: Integrate Data, Discover & Visualize, Train Models, Productize Models, Infrastructure. Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  9. APACHE NIFI
  10. NIFI FEATURES
       • Web-based dataflow user interface: Seamless experience between design, control, feedback, and monitoring
       • Highly configurable: Loss tolerant vs guaranteed delivery; Low latency vs high throughput; Dynamic prioritization; Flow can be modified at runtime; Back pressure
       • Data Provenance: Track dataflow from beginning to end
       • Designed for extension: Build your own processors and more (120+ available out-of-the-box); Enables rapid development and effective testing
       • Secure: SSL, SSH, HTTPS, encrypted content, etc.; Multi-tenant authorization and internal authorization/policy management
  11. APACHE SPARK
  12. SCOPE: Integrate Data, Discover & Visualize, Train Models, Productize Models, Infrastructure. Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  13. SPARK SQL FEATURES
       • Distributed SQL Engine: Seamless integration with Spark DataFrames
       • Standards Compliant: ANSI SQL 2003 support; All 99 queries of TPC-DS supported as of Spark 2.0
       • High performance: New “Catalyst” cost-based optimizer in Spark 2.2; Project Tungsten: “Joining a Billion Rows per Second on a Laptop”; 2.5x performance gains between 1.6 and 2.0
       • Accessible & Extensible: Python, R, Scala, Java, Hive direct APIs + UDF support
  14. KIBANA
  15. TIMELION
  16. KIBANA FEATURES
       • Full-text and faceted search
          - Full text query language: Boolean operators, proximity, boosting
          - Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
          - Time series analysis: aggregates, windowing, offsetting, trending, comparisons
          - Geospatial search: Search by shape, bounding box, polygon, by distance or range
       • Visualizations & Dashboards
          - All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
          - Drag & drop creation and editing
          - Organize visualizations into dashboards
          - Dashboards can be dynamically filtered by time, queries, filters
          - Publish, embed and share dashboards
          - Real-time updates
       • Performant
          - Fast interactive queries, faceting and filtering
          - REST API and clients in all major languages
  17. SCOPE: Integrate Data, Discover & Visualize, Train Models, Productize Models, Infrastructure. Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  18. JUPYTER LAB
  19. JUPYTER HUB
  20. ANACONDA
  21. SCOPE: Integrate Data, Discover & Visualize, Train Models, Productize Models, Infrastructure. Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  22. OPEN SCORING
  23. MLEAP
  24. KONG API GATEWAY
       • API Gateway on nginx: Scalable; Modular with plugins
       • Authentication: Basic Auth, Open ID, OAuth, HMAC, LDAP, JWT
       • Security: ACL, CORS, IP Restriction, Bot Detection, SSL
       • Traffic Control: Proxy Caching, Rate limit, Size limits, terminations
       • Logging & Analytics: Galileo, Datadog, Runscope; TCP, HTTP, File, Syslog, StatsD
  25. COLLABORATION, CI & CD
       • Plan: Projects, Boards, Issues, Milestones, Teams
       • Create: Merge, Preview, Commit, Branch, Lock, Discuss
       • Verify: Automated pipelines, graphs, history, scaling
       • Package: Built-in container registry
       • Release: Continuous integration & continuous deployment
       • Configure & Monitor
  26. SCOPE: Infrastructure, Integrate Data, Discover & Visualize, Train Models, Productize Models. Roles: Data Analyst, Data Scientist, App Developer, Data Engineer
  27. KUBERNETES
       • Portable Containers: Public, Private, Hybrid, or Multi-Cloud
       • Deployment: Automation, Co-Location, Storage Mounting, Secrets
       • Auto-*: -Scaling, -Healing, -Restart, -Placement, -Replication
       • Rolling Updates
       • Load Balancing
       • Service Discovery
       • Monitoring Resources
       • Accessing & Ingesting Logs
  28. PROMETHEUS & GRAFANA
  29. KEYCLOAK
  30. LET’S BUILD A PLATFORM: 1. Ground Rules 2. Components 3. Putting It All Together
  31. The Big Picture
       • This is a complex, major enterprise platform
       • It’s far from free: Cost is in integration, training & ops
       • Why open source?
          1. Often, outright better technology
          2. Faster innovation
          3. More native integrations
          4. More books, talks, tutorials, posts & answers
          5. Cheaper, both to begin and to scale
  32. Common Questions
       Q: Do I need it all on Day One?
       A: No. Use what you need, know where it fits later.
       Q: What if I already have another tool in place?
       A: Keep it. Architecture is about incremental evolution.
       Q: What if I don’t have the in-house knowledge?
       A: Outsource, but require training & onboarding.
       Q: What often gets overlooked?
       A: Keeping components continuously up to date.
  33. Summary: If you remember one thing… Build the simplest platform that serves everyone required to turn science into $$$: Data Analyst, Data Scientist, App Developer, Data Engineer
  34. THANK YOU! david@pacific.ai, @davidtalby, in/davidtalby
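
The search features listed on the Kibana slide (Boolean operators, proximity, boosting, filtering by field and date range, sort, limit, pagination, facets) correspond to request bodies sent to Elasticsearch, the engine Kibana sits on. The sketch below only builds such a body as a Python dict and prints it; nothing is sent to a cluster. The field names (`body`, `title`, `tag`, `published`) and the helper `build_search_body` are hypothetical and not taken from the slides.

```python
import json


def build_search_body(text_query, tag=None, since=None, page=0, page_size=10):
    """Build an Elasticsearch search body as a plain dict.

    All field names (body, tag, published) are hypothetical examples.
    """
    filters = []
    if tag is not None:
        # Facet selection: keep only documents with this exact field value.
        filters.append({"term": {"tag": tag}})
    if since is not None:
        # Date range filter: documents published on or after `since`.
        filters.append({"range": {"published": {"gte": since}}})

    return {
        "query": {
            "bool": {
                # Full-text part: query_string accepts AND/OR,
                # "phrase"~N for proximity and term^N for boosting.
                "must": [
                    {"query_string": {"query": text_query, "default_field": "body"}}
                ],
                "filter": filters,
            }
        },
        # Facet counts: the five most frequent values of the tag field.
        "aggs": {"by_tag": {"terms": {"field": "tag", "size": 5}}},
        "sort": [{"published": {"order": "desc"}}],
        # Pagination: offset and page size.
        "from": page * page_size,
        "size": page_size,
    }


body = build_search_body(
    '(spark OR nifi) AND "data platform"~3 AND title:kibana^2',
    tag="open-source",
    since="2017-01-01",
    page=2,
)
print(json.dumps(body, indent=2))
```

The resulting JSON is the kind of payload a search request would carry; whether the fields are analyzed text or keywords depends on the index mapping, which is outside this sketch.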

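Several items on the Kubernetes slide (replication, rolling updates, auto-healing, secrets) are expressed declaratively in a Deployment manifest. As a rough sketch, the code below generates one as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes a cluster that serves the `apps/v1` API; the service name, image, `/health` path, secret name, and the helper `deployment_manifest` are all hypothetical.

```python
import json


def deployment_manifest(name, image, replicas=3, port=8080, secret_name=None):
    """Return a Kubernetes Deployment manifest as a dict.

    The /health endpoint and all names passed in are hypothetical.
    """
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Auto-healing: the kubelet restarts the container if this probe fails.
        "livenessProbe": {
            "httpGet": {"path": "/health", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 15,
        },
    }
    if secret_name is not None:
        # Secrets: expose every key of the named Secret as an env variable.
        container["envFrom"] = [{"secretRef": {"name": secret_name}}]

    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Replication: desired number of identical pods.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Rolling updates: add one new pod at a time, never drop below
            # the desired replica count while updating.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = deployment_manifest(
    "model-api",
    "registry.example.com/model-api:1.0",
    replicas=2,
    secret_name="model-api-secrets",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it with `kubectl apply -f` would be the usual next step; load balancing and service discovery need a separate Service object, which this sketch leaves out.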