Incorporating the Data Lake into Your Analytic Architecture

  1. @joe_Caserta Incorporating the Data Lake Into your Analytic Architecture Joe Caserta President Caserta Concepts @joe_Caserta
  2. @joe_Caserta Caserta Timeline: 1986–1996, OLTP data modeling and reporting; 1996, dedicated to dimensional data warehousing; milestones from 2001 through 2015 include: Caserta Concepts founded; co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); web log analytics solution published in Intelligent Enterprise; launched Big Data practice; partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho and more; launched Training practice, teaching and mentoring data warehousing concepts worldwide; laser focus on extending Data Warehouses with Big Data solutions; launched the Big Data Warehousing (BDW) Meetup in NYC (3,000 members); established best practices for big data ecosystem implementation in Healthcare, Finance, and Insurance; launched Data Science, Data Interaction, and Cloud practices; dedicated to Data Governance techniques on Big Data (Innovation); awarded for getting data out of SAP for enterprise data analytics; named one of the Top 20 Most Powerful Big Data Companies; ranked #740 among America's Fastest Growing Private Companies.
  3. @joe_Caserta About Caserta Concepts • Consulting firm focused on Data Innovation, Modern Data Engineering approach to solve highly complex business data challenges • Award-winning company • Internationally recognized work force • Mentoring, Training, Knowledge Transfer • Strategy, Architecture, Implementation • Innovation Partner • Transformative Data Strategies • Modern Data Engineering • Advanced Architecture • Leader in architecting and implementing enterprise data solutions • Data Warehousing • Business Intelligence • Big Data Analytics • Data Science • Data on the Cloud • Data Interaction & Visualization • Strategic Consulting • Technical Design • Build & Deploy Solutions
  4. @joe_Caserta Client Portfolio: Retail/eCommerce & Manufacturing; Digital Media/AdTech; Education & Services; Finance, Healthcare & Insurance
  5. @joe_Caserta The Future is Today As a Mindful Cyborg, Chris Dancy utilizes up to 700 sensors, devices, applications, and services to track, analyze, and optimize as many areas of his existence. Data quantification enables him to see the connections of otherwise invisible data, resulting in dramatic upgrades to his health, productivity, and quality of life.
  6. @joe_Caserta The Progression of Data Analytics: Descriptive Analytics (What happened?), Diagnostic Analytics (Why did it happen?), Predictive Analytics (What will happen?), Prescriptive Analytics (How can we make it happen?). Sophistication and business value rise together along this progression. Source: Gartner Reports. Correlations → Predictions → Recommendations
  7. @joe_Caserta The Progression of Data Analytics, continued: Cognitive Computing / Cognitive Data Analytics. Source: Gartner Reports. Correlations → Predictions → Recommendations
  8. @joe_Caserta Innovation is the only sustainable competitive advantage a company can have Innovations may fail, but companies that don’t innovate will fail
  9. @joe_Caserta
  10. @joe_Caserta The Evolution of Modern Data Engineering: the traditional EDW (Enrollments, Claims, Finance sources; ETL; ad-hoc and canned reporting; traditional BI) now sits alongside a Data Lake, a horizontally scalable environment optimized for analytics, built on the Hadoop Distributed File System (HDFS) across many nodes and running Spark, MapReduce, and Pig/Hive, feeding Big Data Analytics and Data Science, with NoSQL databases, ETL, and ad-hoc/canned reporting, and others.
  11. @joe_Caserta “…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler Think Ecosystem, Not Tech Stack
  12. @joe_Caserta Proven Methods for Building Analytics Platforms • Requirements Gathering: Business Interviews • Design: Top Down / Bottom Up • Data Profiling: Data quality assessment • Data Modeling: Create Facts and Dimensions • Extract Transform Load: From source to a Data Warehouse • BI Tool: Semantic Layer, Dashboards • Reporting: Develop Reports and distribution • Data Governance: Mostly up front • Analytics: Prepare data for SAS, predictive modeling
  13. @joe_Caserta The New Conversation • Do we need a Data Warehouse at all? • If we do, does it need to be relational? • Should we leverage Hadoop or NoSQL? • Can we get to Machine Learning faster? • Which platform and language are we going to code in? • Which Apache project should we put in production?
  14. @joe_Caserta Why Change? New technologies are great and all… but what drives our adoption of new technologies and techniques? • Data has changed: semi-structured, unstructured, sparse and evolving schemas • Volumes have changed: GB to TB to PB workloads • Cracks in the armor of the traditional Data Warehousing approach! Most importantly: companies that innovate to leverage their data win!
  15. @joe_Caserta Cracks in the Data Warehouse Armor • Onboarding new data is difficult! • Data structures are rigid! • Data Governance is slow! • Disconnected from business needs. New requirement: "Hey, I need to munge some new data to see if it has value." Wait! We have to… profile, analyze, and conform the data; change data models and load it into dimensional models; build a semantic layer that nobody is going to use; create a dashboard we hope someone will notice… and then you can have at it 3–6 months later to see if it has value!
  16. @joe_Caserta Is Traditional Data Warehousing All Wrong? NO! The concept of a Data Warehouse is sound: • Consolidating data from disparate source systems • Clean and conformed reference data • Clean and integrated business facts • Data governance (a more pragmatic version) We can be more successful by acknowledging the EDW can’t solve all problems.
  17. @joe_Caserta So what's missing? The Data Lake: a storage and processing layer for all data • Store anything: source data, semi-structured, unstructured, structured • Keep it as long as needed • Support a number of processing workloads • Scale out …and here is where Hadoop can help us!
  18. @joe_Caserta Hadoop (Typically) Powers the Data Lake. Hadoop provides us: • Distributed storage → HDFS • Resource management → YARN • Many workloads, not just MapReduce
  19. @joe_Caserta Data Governance for the Data Lake • Organization: the 'people' part; establishing an Enterprise Data Council, Data Stewards, etc. • Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: identify and control sensitive data; regulatory compliance • Data Quality and Monitoring: data must be complete and correct; measure, improve, certify • Business Process Integration: policies around data frequency, source availability, etc. • Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
  20. @joe_Caserta Data Governance for the Data Lake extends the same framework (Organization; Metadata; Privacy/Security; Data Quality and Monitoring; Business Process Integration; Master Data Management; Information Lifecycle Management) with lake-specific practices: • Add Big Data to the overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower-latency services required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home-grown; Drools?) • Quality checks not only in SQL: machine learning, Pig, and MapReduce • Acting on large-dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with the Hive Metastore, HCatalog, home-grown tables • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is a regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, a core component of business operations
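At scale, quality checks like these would run on Spark or MapReduce rather than in SQL alone, as the slide notes. The core "measure, improve, certify" loop can be sketched in plain Python; the record fields and threshold below are hypothetical, not from the deck:

```python
# Minimal sketch of a data quality completeness check of the kind the
# slide suggests distributing across a cluster. Field names and the
# certification threshold are illustrative assumptions.

REQUIRED_FIELDS = ["member_id", "provider_id", "claim_amount"]

def completeness(records, required=REQUIRED_FIELDS):
    """Fraction of records in which every required field is present and non-null."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required)
    )
    return complete / len(records)

def certify(records, threshold=0.99):
    """'Measure, improve, certify': pass only if completeness meets the threshold."""
    score = completeness(records)
    return score >= threshold, score

records = [
    {"member_id": 1, "provider_id": "P1", "claim_amount": 120.0},
    {"member_id": 2, "provider_id": None, "claim_amount": 75.5},
]
ok, score = certify(records)  # ok == False, score == 0.5
```

In a real lake, the same per-record predicate would be applied inside a Spark or MapReduce job, and the aggregated score fed into the monitoring and certification workflow.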
  21. @joe_Caserta The Big Data Pyramid. Layers, bottom to top: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Data Integration; Fully Governed (trusted). Data Governance deepens up the pyramid: Data Catalog; Metadata, ILM, Security; Data Quality and Monitoring. The usage pattern ranges from arbitrary/ad-hoc queries at the bottom to governed queries and reporting at the top.
  22. @joe_Caserta Your Likely Future Landscape: Data Providers feed a Landing queue in near real-time and batch; data flows into the Data Lake and the BDW; Data Science clusters and a Data Science API sit alongside, together with the EDW, a Graph database, an RDS, and the Metastore.
  23. @joe_Caserta Peeling back the layers… The Landing Area • Source data in its full fidelity • Programmatically loaded • Partitioned for data processing • No governance other than catalog and ILM (security and retention) • Consumers: Data Scientists, ETL processes, Applications
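"Programmatically loaded" and "partitioned for data processing" usually mean files land under a predictable, partitioned directory layout. A minimal sketch, assuming a source-system/ingest-date layout (the layout and names are illustrative, not a standard the deck prescribes):

```python
# Sketch of programmatic loading into a partitioned landing area.
# The source=.../dt=... path convention is an assumed example of
# Hive-style partitioning, not a prescribed layout.
from datetime import date
from pathlib import PurePosixPath

def landing_path(root, source, ingest_date, filename):
    """Build a partitioned landing path for a raw file."""
    return str(
        PurePosixPath(root)
        / f"source={source}"
        / f"dt={ingest_date.isoformat()}"
        / filename
    )

p = landing_path("/landing", "claims", date(2015, 6, 1), "batch_001.json")
# p == "/landing/source=claims/dt=2015-06-01/batch_001.json"
```

A layout like this lets downstream jobs (and the catalog) prune by source and ingest date without reading the files themselves.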
  24. @joe_Caserta Data Lake • Enriched, lightly integrated • Data is accessible in the Hive Metastore, either processed into tabular relations or exposed via Hive SerDes directly over the raw data • Partitioned for data access • Governance additionally includes a guarantee of completeness • Consumers: Data Scientists, ETL processes, Applications, Data Analysts
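Exposing raw data through the Hive Metastore via a SerDe typically means registering an external table over the landing files. A sketch that generates such a DDL statement; the table name, columns, and location are hypothetical (the JSON SerDe class is Hive's `org.apache.hive.hcatalog.data.JsonSerDe`):

```python
# Sketch: generate a Hive external-table DDL that exposes raw JSON in
# the landing area through the Metastore. Table, columns, and location
# are illustrative assumptions.
def external_table_ddl(table, columns, location,
                       serde="org.apache.hive.hcatalog.data.JsonSerDe"):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY (dt STRING)\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "lake.claims_raw",
    [("member_id", "BIGINT"), ("claim_amount", "DOUBLE")],
    "/landing/source=claims",
)
```

Because the table is external, dropping it removes only the Metastore entry; the raw files stay in place, which is exactly the "schema on read over full-fidelity data" behavior the lake layer wants.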
  25. @joe_Caserta Data Science Workspace • No barrier for onboarding and analysis of new data • Blending of new data with entire Data Lake, including the Big Data Warehouse • Data Scientists enrich data with insight • Consumers: Data Scientists
  26. @joe_Caserta Big Data Warehouse • Data is fully governed • Data is structured • Partitioned/tuned for data access • Governance includes a guarantee of completeness and accuracy • Consumers: Data Scientists, ETL processes, Applications, Data Analysts, and Business Users (the masses)
  27. @joe_Caserta Polyglot Warehouse. We promote the concept that the Big Data Warehouse may live in one or more platforms: • Full Hadoop solutions • Hadoop plus MPP or relational. Supplemental technologies: • NoSQL: columnar, key-value, time-series, graph • Search technologies
  28. @joe_Caserta Hadoop is the Data Warehouse? • Hadoop can be the entire data pyramid platform, including the landing area, the Data Lake, and the Big Data Warehouse • It especially serves as the Data Lake and "Refinery" • Query engines such as Hive and Impala provide SQL support
  29. @joe_Caserta The Refinery • The feedback loop between Data Science and the Data Warehouse is critical • Successful work products of science must graduate into the appropriate layers of the Data Lake
  30. @joe_Caserta Data Analytics on the Cloud. AWS and other cloud providers present a very powerful design pattern: • S3 serves as the storage layer for the Data Lake • EMR (Elastic Hadoop) provides the Refinery; most clusters can be ephemeral • The active set is stored in Redshift, MPP, or relational platforms • Eliminates a massive on-premises appliance footprint
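"Most clusters can be ephemeral" maps to an EMR cluster that terminates when its steps finish. A sketch of such a request as a plain dictionary of the kind passed to boto3's `emr.run_job_flow()`; the cluster name, instance types, and bucket are hypothetical:

```python
# Sketch of an ephemeral EMR "refinery" cluster request. In practice
# this dict would be passed to boto3's emr.run_job_flow(); all names,
# sizes, and the S3 bucket here are illustrative assumptions.
def refinery_cluster_request(name, log_bucket, workers=4):
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 1 + workers,  # one master plus the workers
            # Ephemeral: tear the cluster down when its steps complete,
            # leaving the data at rest in S3.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

req = refinery_cluster_request("nightly-refinery", "my-data-lake")
```

Because S3, not HDFS, holds the lake, compute can come and go nightly while storage persists, which is what eliminates the always-on appliance footprint.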
  31. @joe_Caserta Summary Data Warehousing is not dead for analytics • The principles of Data Warehousing still make sense • Recognize gaps in feature/functionality of the Relational Database and traditional Data Warehousing • Extend your data ecosystem with a Data Lake • Accept Tunable Governance • Think Polyglot and use the right tool for the job
  32. @joe_Caserta Thank You / Q&A Joe Caserta President, Caserta Concepts (914) 261-3648 @joe_Caserta

Editor's Notes

  1. Inc. 5000: top 6% of all IT companies in the US, and #5 of 42 IT companies in NYC. The DG Pyramid was introduced at Strata 2015.
  2. Reports → correlations → predictions → recommendations
  3. JOE: Throwing technology at it does not solve the problem; we need architecture, engineering, and innovation. In fact, previous attempts to forklift existing processes and thinking onto Hadoop did not improve things. They needed a new way of thinking about data. We needed to build a framework to dynamically ingest somewhat unstructured data and turn it into digestible information.
  4. JOE: With the exception of Finance, no use case required a relational database. Hadoop and the various flavors of NoSQL satisfied all data needs except to "keep the books". The reality is, the data lake and its ecosystem are evolving to become the core data system of the enterprise. So data organization, data governance, data integrity, and data security are more important than ever… and these aspects of the big data paradigm are getting better every day, making adoption more attainable. The overall solution architecture that makes all the puzzle pieces fit together and work in unity is the key element that keeps the ecosystem alive.