Incorporating the Data Lake into Your Analytic Architecture

Joe Caserta, President of Caserta Concepts, presented "Incorporating the Data Lake into Your Analytics Architecture" at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.

For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.

Incorporating the Data Lake into Your Analytic Architecture

  1. Incorporating the Data Lake into Your Analytic Architecture. Joe Caserta, President, Caserta Concepts (@joe_Caserta)
  2. Caserta Timeline
     Years marked on the timeline: 2001, 2004, 2009, 2010, 2012, 2013, 2014, 2015. Milestones shown:
     • 1986 - 1996: OLTP Data Modeling and Reporting
     • 1996: Dedicated to Dimensional Data Warehousing
     • Caserta Concepts founded
     • Web log analytics solution published in Intelligent Enterprise
     • Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
     • Launched Training practice, teaching and mentoring data warehousing concepts world-wide
     • Launched Big Data practice
     • Laser focus on extending Data Warehouses with Big Data solutions
     • Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, more…
     • Launched Big Data Warehousing (BDW) Meetup - NYC, 3,000 Members
     • Established best practices for big data ecosystem implementation: Healthcare, Finance, Insurance
     • Dedicated to Data Governance Techniques on Big Data (Innovation)
     • America's Fastest Growing Private Companies - Ranked #740
     • Top 20 Most Powerful Big Data Companies
     • Launched Data Science, Data Interaction and Cloud practices
     • Awarded for getting data out of SAP for enterprise data analytics
  3. About Caserta Concepts
     • Consulting firm focused on Data Innovation, Modern Data Engineering approach to solve highly complex business data challenges
     • Award-winning company
     • Internationally recognized work force
     • Mentoring, Training, Knowledge Transfer
     • Strategy, Architecture, Implementation
     • Innovation Partner
     • Transformative Data Strategies
     • Modern Data Engineering
     • Advanced Architecture
     • Leader in architecting and implementing enterprise data solutions
     • Data Warehousing
     • Business Intelligence
     • Big Data Analytics
     • Data Science
     • Data on the Cloud
     • Data Interaction & Visualization
     • Strategic Consulting
     • Technical Design
     • Build & Deploy Solutions
  4. Client Portfolio: Retail/eCommerce & Manufacturing; Digital Media/AdTech; Education & Services; Finance, Healthcare & Insurance
  5. The Future is Today
     As a Mindful Cyborg, Chris Dancy utilizes up to 700 sensors, devices, applications, and services to track, analyze, and optimize as many areas of his existence as possible. Data quantification enables him to see the connections of otherwise invisible data, resulting in dramatic upgrades to his health, productivity, and quality of life.
  6. The Progression of Data Analytics (chart; axes: Data Analytics Sophistication vs. Business Value; source: Gartner)
     • Descriptive Analytics: What happened?
     • Diagnostic Analytics: Why did it happen?
     • Predictive Analytics: What will happen?
     • Prescriptive Analytics: How can we make it happen?
     Reports → Correlations → Predictions → Recommendations
  7. The Progression of Data Analytics (chart; source: Gartner)
     Reports → Correlations → Predictions → Recommendations
     Cognitive Computing / Cognitive Data Analytics
  8. Innovation is the only sustainable competitive advantage a company can have.
     Innovations may fail, but companies that don’t innovate will fail.
  9. (Slide contains no text.)
  10. The Evolution of Modern Data Engineering (diagram; labels only)
     • Sources: Enrollments, Claims, Finance, Others…
     • Traditional side: ETL, Traditional EDW, Ad-Hoc/Canned Reporting, Traditional BI
     • Horizontally Scalable Environment - Optimized for Analytics: Data Lake, Hadoop Distributed File System (HDFS) with nodes N1 to N5, Spark, MapReduce, Pig/Hive, NoSQL Databases, ETL
     • Uses: Ad-Hoc Query, Canned Reporting, Big Data Analytics, Data Science
  11. Think Ecosystem, Not Tech Stack
     “…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler
  12. Proven Methods for Building Analytics Platforms
     • Requirements Gathering: Business Interviews
     • Design: Top Down / Bottom Up
     • Data Profiling: Data quality assessment
     • Data Modeling: Create Facts and Dimensions
     • Extract Transform Load: From source to a Data Warehouse
     • BI Tool: Semantic Layer, Dashboards
     • Reporting: Develop Reports and distribution
     • Data Governance: Mostly up front
     • Analytics: Prepare data for SAS, predictive modeling
  13. The New Conversation
     • Do we need a Data Warehouse at all?
     • If we do, does it need to be relational?
     • Should we leverage Hadoop or NoSQL?
     • Can we get to Machine Learning faster?
     • Which platform and language are we going to code in?
     • Which Apache Project should we put in production?
  14. Why Change?
     New technologies are great and all… But what drives our adoption of new technologies and techniques?
     • Data has changed: semi-structured, unstructured, sparse and evolving schema
     • Volumes have changed: GB to TB to PB workloads
     • Cracks in the armor of the traditional Data Warehousing approach!
     Most importantly: companies that innovate to leverage their data win!
  15. Cracks in the Data Warehouse Armor
     • Onboarding new data is difficult!
     • Data structures are rigid!
     • Data Governance is slow!
     • Disconnected from business needs
     New Requirement: “Hey - I need to munge some new data to see if it has value”
     Wait! We have to…
     • Profile, analyze and conform the data
     • Change data models and load it into dimensional models
     • Build a semantic layer that nobody is going to use
     • Create a dashboard we hope someone will notice
     …and then you can have at it 3-6 months later to see if it has value!
  16. Is Traditional Data Warehousing All Wrong? NO!
     The concept of a Data Warehouse is sound:
     • Consolidating data from disparate source systems
     • Clean and conformed reference data
     • Clean and integrated business facts
     • Data governance (a more pragmatic version)
     We can be more successful by acknowledging the EDW can’t solve all problems.
  17. So what’s missing? The Data Lake
     A storage and processing layer for all data
     • Store anything: source data, semi-structured, unstructured, structured
     • Keep it as long as needed
     • Support a number of processing workloads
     • Scale-out
     …and here is where Hadoop can help us!
  18. Hadoop (Typically) Powers the Data Lake
     Hadoop provides us:
     • Distributed storage → HDFS
     • Resource management → YARN
     • Many workloads, not just MapReduce
  19. Data Governance for the Data Lake
     • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
     • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
     • Privacy/Security: Identify and control sensitive data, regulatory compliance
     • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
     • Business Process Integration: Policies around data frequency, source availability, etc.
     • Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
     • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
  20. Data Governance for the Data Lake
     • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
     • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
     • Privacy/Security: Identify and control sensitive data, regulatory compliance
     • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
     • Business Process Integration: Policies around data frequency, source availability, etc.
     • Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
     • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
     Callouts added for Big Data:
     • Add Big Data to overall framework and assign responsibility
     • Add data scientists to the Stewardship program
     • Assign stewards to new data sets (twitter, call center logs, etc.)
     • Graph databases are more flexible than relational
     • Lower latency service required
     • Distributed data quality and matching algorithms
     • Data Quality and Monitoring (probably home grown, drools?)
     • Quality checks not only SQL: machine learning, Pig and MapReduce
     • Acting on large dataset quality checks may require distribution
     • Larger scale
     • New datatypes
     • Integrate with Hive Metastore, HCatalog, home grown tables
     • Secure and mask multiple data types (not just tabular)
     • Deletes are more uncommon (unless there is regulatory requirement)
     • Take advantage of compression and archiving (like AWS Glacier)
     • Data detection and masking on unstructured data upon ingest
     • Near-zero latency, DevOps, core component of business operations
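
The governance slide mentions "data detection and masking on unstructured data upon ingest". A minimal Python sketch of that idea follows. The two patterns (email address, US Social Security number format), the `PATTERNS` table, and the `mask_sensitive` function are illustrative assumptions, not something the deck specifies; real detection would need far broader and more careful rules.

```python
import re

# Hypothetical detection rules (assumptions for illustration only).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_sensitive(text):
    """Replace each detected value with a <LABEL> token.

    Returns the masked text and a dict of label -> number of values masked,
    so the ingest job can log what it found without logging the values.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        text, found = pattern.subn(f"<{label}>", text)
        counts[label] = found
    return text, counts


if __name__ == "__main__":
    note = "Call log: member jane@example.com gave SSN 123-45-6789."
    masked, counts = mask_sensitive(note)
    print(masked)
    print(counts)
```

Returning the counts alongside the masked text is one way to feed the Data Quality and Monitoring area without exposing the sensitive values themselves.
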
  21. The Big Data Pyramid (diagram; labels only)
     • Column headings: Usage Pattern, Data Governance
     • Labels: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Data Quality and Monitoring; Metadata, ILM, Security; Data Catalog; Data Integration; Fully Governed (trusted); Arbitrary/Ad-hoc Queries and Reporting
  22. Your Likely Future Landscape (diagram; labels only)
     Data Providers; API; Queue; Near Real-time; Batch; Landing; Data Lake; Data Science; Data Science Clusters; BDW; EDW; Graph; RDS; Metastore
  23. Peeling back the layers… The Landing Area
     • Source data in its full fidelity
     • Programmatically loaded
     • Partitioned for data processing
     • No governance other than catalog and ILM (Security and Retention)
     • Consumers: Data Scientists, ETL Processes, Applications
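
The Landing Area bullets (full fidelity, programmatically loaded, partitioned for processing) can be sketched in a few lines of Python. The directory convention `source=…/dataset=…/load_date=…` and the function names are assumptions made for this example; the deck does not prescribe a layout, and a real landing area would typically sit on HDFS or object storage rather than a local file system.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def landing_dir(root, source, dataset, load_date):
    """Build a partitioned landing directory (hypothetical naming scheme)."""
    return (
        Path(root)
        / f"source={source}"
        / f"dataset={dataset}"
        / f"load_date={load_date.isoformat()}"
    )


def sha256_of(path):
    """Hash a file in chunks so large source files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def land_file(src_file, root, source, dataset, load_date):
    """Copy a source file byte for byte into its partition and verify the copy."""
    target_dir = landing_dir(root, source, dataset, load_date)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(src_file).name
    shutil.copyfile(src_file, target)
    if sha256_of(src_file) != sha256_of(target):
        raise OSError(f"checksum mismatch after landing {src_file}")
    return target
```

The file is copied unchanged and checked by hash, which is one simple reading of "full fidelity": nothing is parsed, cleaned, or conformed at this layer.
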
  24. Data Lake
     • Enriched, lightly integrated
     • Data is accessible in the Hive Metastore
       • Either processed into tabular relations
       • Or via Hive SerDes directly upon raw data
     • Partitioned for data access
     • Governance additionally includes a guarantee of completeness
     • Consumers: Data Scientists, ETL Processes, Applications, Data Analysts
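
The "guarantee of completeness" at the Data Lake layer could be checked in many ways; the deck does not say how. One simple possibility, sketched here in Python under that assumption, is to compare per-partition row counts reported by the source against the counts actually loaded. The function names and the dictionary shapes are hypothetical.

```python
def completeness_report(expected_counts, loaded_counts):
    """Compare row counts per partition.

    expected_counts: partition name -> rows the source says it sent
    loaded_counts:   partition name -> rows found in the Data Lake
    """
    report = {}
    for partition, expected in expected_counts.items():
        loaded = loaded_counts.get(partition, 0)  # a missing partition counts as 0 rows
        report[partition] = {
            "expected": expected,
            "loaded": loaded,
            "complete": loaded == expected,
        }
    return report


def is_complete(report):
    """True only if every expected partition arrived with the expected row count."""
    return all(entry["complete"] for entry in report.values())


if __name__ == "__main__":
    expected = {"2015-01-30": 1200, "2015-01-31": 950}
    loaded = {"2015-01-30": 1200, "2015-01-31": 948}
    report = completeness_report(expected, loaded)
    print(report["2015-01-31"])
    print(is_complete(report))
```
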
  25. Data Science Workspace
     • No barrier for onboarding and analysis of new data
     • Blending of new data with entire Data Lake, including the Big Data Warehouse
     • Data Scientists enrich data with insight
     • Consumers: Data Scientists
  26. Big Data Warehouse
     • Data is fully governed
     • Data is structured
     • Partitioned/tuned for data access
     • Governance includes a guarantee of completeness and accuracy
     • Consumers: Data Scientists, ETL Processes, Applications, Data Analysts, and Business Users (the masses)
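
The four layers described on the preceding slides differ mainly in how much governance they carry and who consumes them. A small Python sketch can make that tiering explicit. The consumer lists and guarantees are taken from the slides; the `Layer` type, the `layers_for` helper, the empty governance tuple for the Data Science Workspace (the slide states none), and the assumption that catalog and ILM carry forward into the higher layers are illustrative choices, not the author's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    governance: tuple      # guarantees attached to the layer
    consumers: frozenset   # who is expected to read from it


LAYERS = (
    Layer("Landing Area",
          ("catalog", "ILM"),
          frozenset({"Data Scientists", "ETL Processes", "Applications"})),
    Layer("Data Lake",
          ("catalog", "ILM", "completeness"),  # catalog/ILM carry-over is assumed
          frozenset({"Data Scientists", "ETL Processes", "Applications",
                     "Data Analysts"})),
    Layer("Data Science Workspace",
          (),  # no guarantees are stated on the slide; assumed none
          frozenset({"Data Scientists"})),
    Layer("Big Data Warehouse",
          ("catalog", "ILM", "completeness", "accuracy"),  # carry-over is assumed
          frozenset({"Data Scientists", "ETL Processes", "Applications",
                     "Data Analysts", "Business Users"})),
)


def layers_for(consumer):
    """List the layers a given kind of consumer is expected to use."""
    return [layer.name for layer in LAYERS if consumer in layer.consumers]


if __name__ == "__main__":
    print(layers_for("Data Analysts"))
    print(layers_for("Business Users"))
```

Reading the table this way shows the pattern the deck later calls tunable governance: the wider the audience, the stronger the guarantees.
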
  27. Polyglot Warehouse
     We promote the concept that the Big Data Warehouse may live in one or more platforms
     • Full Hadoop solutions
     • Hadoop plus MPP or relational
     Supplemental technologies:
     • NoSQL: columnar, key-value, time series, graph
     • Search technologies
  28. Hadoop is the Data Warehouse?
     • Hadoop can be the entire data pyramid platform, including landing, data lake and the Big Data Warehouse
     • Especially serves as the Data Lake and “Refinery”
     • Query engines such as Hive and Impala provide SQL support
  29. The Refinery
     • The feedback loop between Data Science and Data Warehouse is critical
     • Successful work products of science must graduate into the appropriate layers of the Data Lake
  30. Data Analytics on the Cloud
     AWS and other cloud providers present a very powerful design pattern:
     • S3 serves as the storage layer for the Data Lake
     • EMR (Elastic Hadoop) provides the Refinery; most clusters can be ephemeral
     • The Active Set is stored in Redshift (MPP) or relational platforms
     Eliminate massive on-premises appliance footprint
  31. Summary
     Data Warehousing is not dead for analytics
     • The principles of Data Warehousing still make sense
     • Recognize gaps in feature/functionality of the Relational Database and traditional Data Warehousing
     • Extend your data ecosystem with a Data Lake
     • Accept Tunable Governance
     • Think Polyglot and use the right tool for the job
  32. Thank You / Q&A
     Joe Caserta, President, Caserta Concepts
     joe@casertaconcepts.com
     (914) 261-3648
     @joe_Caserta
