Enterprise Metadata Integration, Cloudera

GraphConnect Europe 2017
Mirko Kämpf, Cloudera Inc

Published in: Technology

  1. Enterprise Metadata Integration. Mirko Kämpf | Cloudera. GraphConnect 2017 – London. © Cloudera, Inc. All rights reserved.
  2. Who is speaking? Solutions Architect @ Cloudera
     - time series analysis, network analysis, data enrichment pipelines
     - personal interest: QA systems and semantic search
     Data Science Activities:
     - The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks (PLOS ONE, 2015)
     - Hadoop.TS (IJCA, 2013)
     - Fluctuations in Wikipedia Access-Rate and Edit-Event Data (Physica A, 2012)
  3. Our Approach: Multilayer Metadata Integration … Integration of facts to gain business knowledge
     • Status dashboards are provided per use case.
     • Each dashboard offers facts from multiple layers:
       - (L1) technical layer
       - (L2) operational metadata (Hadoop-specific only)
       - (L3) application-specific operational metadata
       - (L4) quality metrics (second-order metadata)
     • Our Achievements:
       - A graph database (Neo4j) allows context exploration.
       - Cluster-spanning metadata exploration is now possible.
       - Exposing inherent but sometimes hidden facts becomes as easy as writing an email.
  4. Intro
  5. People have been mining … for centuries! Gold & diamonds, ore & coal, minerals, oil … The outcome drives the whole economy. http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
  6. People have been using computers … for decades! 1938: Z1, the world’s first freely programmable device, created by Konrad Zuse (http://www.horst-zuse.homepage.t-online.de/z1.html). 2015: U.S. Department of Energy uses Intel supercomputer at Argonne National Laboratory (http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png).
  7. DATA MINING. Blog: About Learning Data Mining & Data Analysis (http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/)
  8. If data is the new oil … metadata are the nuggets and brilliants of our age. Screenshot taken from: https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil
  9. Diamonds: beautiful even as raw material. Brilliant: result of an expert’s work. Even more exciting in combination with other material and skills …
  10. Success Factors: • Idea & Vision • Material • Skills / Methods • Tools. http://www.burkhard-beyer.net/Reportage_Goldschmied.html
  11. Be very careful with initial success … work towards a professional level! High quality and reproducibility are results of professional management. It is hard to believe what you can get and which options arise … Manage overwhelming excitement! Do not start new activities at random …
  12. Let’s Think Data Driven!
      • Build a mid-term or, better, a long-term strategy.
      • Try to stay independent of a particular technology or tool. What matters most is the data, not the fancy toolset.
      • After initial success, slow down and control the speed of expansion.
      • Focus on maximized accessibility of data. Google’s goal was to make the data of the internet accessible. You should become your own Google!
  13. Dataset Profiles / Flow Descriptors
      • Our material is data & metadata:
        - Data about data: descriptive data, Dublin Core metadata model, …
        - Derived data: statistics extracted from processes, documents, …
        - Results of ML/AI procedures: extracted structure and learned models
        - Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history
  14. Knowledge Extraction for Better Data Science
  15. Science, according to Wikipedia: Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. https://en.wikipedia.org/wiki/Science
  16. Data Science, my observation: Commercial Data Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market / business context. https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif
  17. Details: Look into nature …
  18. Context: Look into nature …
  19. Result: Visualization of Facts
      • An image shows what the text says. > Multi-channel communication
      • Data Science benefits from such an approach. > Today we still use infographics.
      Difference: The biologist who created the one on the left observed by eye. Today, we use more and more data analysis methods.
  20. Process: Knowledge Extraction is a Natural Process
      • Combine multiple sources
      • Repeat observation
      • Incorporate context to explain differences/variation
      • Cross-check to identify anomalies
  21. Process: Knowledge Extraction is a Natural Process (diagram labels: Knowledge, Facts, Data)
  22. How did we implement EMDM?
      - Hadoop-based: for scalability
      - Open graph data model: for flexibility and connectivity
      - Data-centric: following the Big Data paradigm
  23. Big Data Processing: e.g., with Hadoop
  24. Big Graph Processing on Hadoop: e.g., with Giraph
  25. Project name should stand for: Graphs, Hadoop, and the ecosystem …
  26. Project name should stand for: Graphs, Hadoop, and the ecosystem …
  27. Data Science Process Model (DSPM): Toolbox and Management Methods
      • DSPM defines core artifacts for knowledge management
      • Describes analysis / transformation context
      • Allows repeatable execution
      • Process properties become measurable
      • Supports comparison of results from multiple procedures
      • All those facts are essential ingredients for business optimization.
      • But: Logging & tracking should never block creativity!
      • Remember: Scientists often act like artists.
  28. Data Science Process Model (DSPM), diagram labels:
      • Representation of domain knowledge (in our case it is data science in general)
      • Human Interaction
      • Ontology
      • Toolbox and Management Methods
      • Ability to solve a problem using IT and data
      • Technology Aspects - represent and interact with facts & data
      • Data Governance
      • Certified QM
  29. Semantic Logging
      • Property with a name: (K,V), a key-value pair
      • Property of a thing: S => (K,V). (S,P,O) is a triple: K becomes P; V becomes O.
      • Many of those triples in one common context with name G: G => (S,P,O) is called a quad or named graph.
      • Log4J is the logging standard we build on.
      • Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
      • Triple representation avoids specific parsing and makes log data part of the linked data graph.
  30. Etosha Toolbox (Concept)
      • Data extractors
      • Data transformers
      • Ontology-based orchestration
      • People and machines contribute facts
      • Iterative approach with closed feedback loops
      • Scalable environment …
  31. Initial Implementation
      • Multi-layer metadata capturing
      • Operational metrics
      • Metrics about fast & static data
      • Business metrics
      • Contextualized presentation
      • Ad-hoc queries for exploration
      • Graph analytics
      > Knowledge exposure
      > Self-service
      DS and BI can speak the same language.
  32. Results: Access Facts & Context of Critical Processes. DEMO of context exploration: https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be
  33. Results: Better Collaboration for (Hadoop) Knowledge Workers
      • Our Achievements:
        - The open graph model is language-, OS-, and hardware-independent.
        - Merging of knowledge partitions enables cluster-spanning metadata exploration.
        - Query beans expose facts from multiple stores to a web-based interface.
      • Next Steps:
        - Improve implicit triplification (query the Solr index and get RDF data).
        - Standardize the process and integrate with existing ontologies.
        - Grow a community … and enter the Apache Incubator.
  34. Thank you! mirko@cloudera.com @semanpix
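
The Semantic Logging slide (29) describes how a key-value property of a thing becomes a triple, and how triples sharing one context form a quad or named graph; slide 33 adds that merging knowledge partitions enables cluster-spanning exploration. The sketch below illustrates that idea in Python and is only an illustration. The talk builds on Log4J, which is not used here. The names `Quad`, `to_quads`, `to_nquads_line`, and `merge_partitions`, and every `urn:example:` identifier, are hypothetical. Modeling a merge as a plain set union is an assumption, not the author's implementation.

```python
from collections import namedtuple

# A quad is a triple (subject, predicate, object) plus the named graph G it belongs to.
Quad = namedtuple("Quad", ["graph", "subject", "predicate", "object"])


def to_quads(graph, subject, properties):
    """Map each key-value pair (K, V) of `subject` to a quad: K becomes P, V becomes O."""
    return [Quad(graph, subject, key, value) for key, value in properties.items()]


def to_nquads_line(quad):
    """Render a quad as an N-Quads-style line: IRI terms plus a plain string literal."""
    literal = str(quad.object).replace("\\", "\\\\").replace('"', '\\"')
    return '<{0}> <{1}> "{2}" <{3}> .'.format(
        quad.subject, quad.predicate, literal, quad.graph
    )


def merge_partitions(*partitions):
    """Merge sets of quads (assumed semantics: set union, duplicates collapse)."""
    return set().union(*partitions)


if __name__ == "__main__":
    # Hypothetical log facts captured on two clusters.
    cluster_a = to_quads(
        "urn:example:graph/cluster-a",
        "urn:example:job/import",
        {"urn:example:status": "OK", "urn:example:rows": "1200"},
    )
    cluster_b = to_quads(
        "urn:example:graph/cluster-b",
        "urn:example:job/import",
        {"urn:example:status": "FAILED"},
    )
    merged = merge_partitions(set(cluster_a), set(cluster_b))
    for line in sorted(map(to_nquads_line, merged)):
        print(line)
```

Because every fact carries its graph name, the merged set still records which cluster reported which status, so conflicting values from different partitions can sit side by side instead of overwriting each other.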
