Datalake Architecture

TechYugadi IT Solutions & Consulting
21 Oct 2016
  1. DATA LAKE ARCHITECTURE  Monojit Basu, Founder & Director, TechYugadi IT Solutions & Consulting  OSI DAYS 2016, BANGALORE
  2. Data Never Sleeps  Every minute  Facebook users share 216,302 photos  Dropbox users upload 833,333 new files  YouTube users share 400 hours of new video  Twitter users send 350,000 tweets  A Boeing 737 aircraft in flight generates 40 TB of data
  3. EDW vs Data Lake  A Data Lake is built on the premise that every drop of data is valuable  It's a place for capturing and exploring huge volumes of raw data that a business generates  Explorers are diverse: business analysts, data scientists, …  even business managers (using self-service)  Goals of exploration may be loosely defined
  4. EDW vs Data Lake  An EDW stores filtered and processed data  For premeditated usage scenarios  Traditionally structured in the form of ‘cubes’  Analogy  The difference between a college library (focused on curriculum) and the US Library of Congress
  5. EDW vs Data Lake  Schema-on-Read  Schema-on-Write  [Diagram: A Data Lake ingests XML, JSON, CSV and PDF files, trading-partner REST APIs, and invoicing/orders databases in raw form; consumers such as CRM analytics, SCM analytics and a recommendation engine apply a schema only on read/extract. An Enterprise Data Warehouse takes the same sources through ETL (schema-on-write) before serving sales, operations and marketing.]
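To make the schema-on-read contrast concrete, here is a minimal PySpark sketch. The paths, field names and the orders example are hypothetical; the point is that raw data lands in the lake untouched, and each consumer imposes its own schema only when it reads.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Ingest: raw JSON is landed in the lake exactly as received -- no schema enforced.
spark.read.text("file:///staging/incoming/orders.json") \
     .write.mode("append").text("hdfs:///datalake/raw/orders/")

# Exploration: an analyst imposes a schema only at read time;
# another consumer could read the same files with a different schema.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StringType()),
    StructField("amount", DoubleType()),
])
orders = spark.read.schema(order_schema).json("hdfs:///datalake/raw/orders/")
orders.groupBy("customer").sum("amount").show()
```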
  6. Why Think of a Data Lake  Business Drivers  Diverse sources of data: transactions, interactions, human and machine-generated  Routine analysis is not enough – deeper insights lead to differentiation  Agile and Adaptive Business Models  Technology Drivers  Fast, cheap and scalable storage (e.g. HDFS)  Diverse data-processing engines (e.g. NoSQL)  Elastic processing power (clusters of commodity servers)
  7. Application Domains  Healthcare  IoT  E-Governance  Insurance
  8. What Features Should It Support  Scalable Storage Layer  Handling the 3 V’s (volume, velocity, variety) of Data Inflow  Data Discovery  Data Governance  Pluggable and Extensible Analytics  Elastic Processing Power  Multi-stakeholder and Multi-tenant Access
  9. Building It On Top Of Hadoop  A Data Lake doesn’t have to be Hadoop  But Hadoop has proven its prowess on planet-scale data, in terms of:  Data Volumes  Elastic Data Processing Power  The idea of a Data Lake was probably inspired by Hadoop  So, most often, a Data Lake architecture is built around Hadoop
  10. Storage Capacity: Metrics  Normally HDFS scales even with one NameNode  Unless you have hundreds of petabytes of data  But you need to monitor the usage pattern  Are you creating too many small files (what’s the average number of blocks per file)?  How much RAM would you need for the NameNode? (a high value could mean longer GC pauses)  Internal load (heartbeats and block reports) vs external Get and Create requests
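The RAM question can be estimated with the common rule of thumb that each file, directory and block object costs the NameNode on the order of 150 bytes of heap. A back-of-the-envelope sketch (all input numbers are hypothetical):

```python
# Rough NameNode heap estimate using the ~150 bytes-per-object rule of thumb.
# Every number below is hypothetical; plug in values from your own cluster.
BYTES_PER_OBJECT = 150              # approximate heap cost per file/dir/block

files = 100_000_000                 # files in the namespace
avg_blocks_per_file = 1.1           # a value near 1.0 hints at many small files
blocks = int(files * avg_blocks_per_file)

heap_bytes = (files + blocks) * BYTES_PER_OBJECT
print(f"Estimated NameNode heap: {heap_bytes / 2**30:.1f} GiB")
# ~29 GiB here -- a heap large enough that GC pauses deserve monitoring.
```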
  11. Storage Capacity: HDFS Federation  Single NameNode  NameNode Federation  [Diagram: With a single NameNode, MR clients’ Get/Create requests and the internal load from DataNodes 1..N all converge on one NameNode. With NameNode Federation, NameNode1 and NameNode2 each manage their own block pool (Block Pool1, Block Pool2) over the same set of DataNodes.]
  12. Storage Capacity: Availability  NameNode Federation does not ensure HA  Even if you don’t go for Federation, configuring high availability is recommended  Essentially, set up a Standby NameNode  The Active NameNode shares state with the Standby  Using a shared Journal Manager, or  Simply using an NFS-mounted shared file directory  Synchronization frequency is configurable
  13. Compute Capacity  Hadoop 1.0 supported one type of job (MapReduce)  MR jobs were scheduled by a ‘JobTracker’ process  Hadoop 2.0 offers a Resource Manager (YARN)  It replaces the JobTracker and raises the practical Hadoop cluster size limit from about 3,000 to 10,000 nodes  But more important: YARN lets different types of jobs, not just MR, run on Hadoop  Hence a Data Lake should preferably be built on YARN
  14. Compute Capacity: YARN  [Diagram: YARN architecture. MR and Spark clients submit jobs to a single Resource Manager, which coordinates a Node Manager on each node; Node 1 runs an MR Application Master and a Spark task, while Node 2 runs a Spark Application Master and an MR task.]
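As an illustration of YARN hosting non-MR workloads, a Spark job targets the same cluster simply by selecting the yarn master. A minimal sketch; the HDFS path and the word-count workload are placeholders:

```python
from pyspark.sql import SparkSession

# Submit a Spark (non-MapReduce) job to the same YARN-managed cluster;
# YARN starts a Spark Application Master alongside any running MR jobs.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-spark-demo")
         .getOrCreate())

lines = spark.read.text("hdfs:///datalake/raw/logs/")      # hypothetical path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
words.groupBy("word").count().show()
```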
  15. Data Inflow  The goal is to build a pipeline into Hadoop-native data stores  HDFS, mandatorily  Hive and HBase, preferably  Considering the variety of data formats that a Data Lake must accommodate:  A general-purpose Data Integration tool must be chosen  For example, Pentaho Data Integration (PDI)
  16. Data Inflow  Pipelines specialized for specific data formats may also be plugged in  [Diagram: a PDI pipeline where a Flat File Input connector (.txt) and a Web Service Input connector (.json) feed an HDFS Output connector; alongside, Sqoop ingests from databases and Flume from logs directly into HDFS.]
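PDI, Sqoop and Flume would normally implement these pipelines through their own connectors; purely to show the shape of the DB-to-HDFS batch pattern, here is a hedged PySpark equivalent (connection details are hypothetical, and a JDBC driver is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-inflow").getOrCreate()

# Pull a table from the orders database (all connection details hypothetical).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://ordersdb:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "ingest")
          .option("password", "secret")
          .load())

# Land it in the lake in raw form for later exploration.
orders.write.mode("append").parquet("hdfs:///datalake/raw/orders/")
```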
  17. Data Inflow: Streaming Data  Streaming data may be processed in two ways  Simply store it in the Data Lake for future analysis  e.g. interesting tweets for building a sentiment analysis model  Store and forward it to a Real-time Analytics Engine  Even as real-time processing occurs, the source data in raw format may be useful in the future  To build / update machine learning models, for example in fraud analytics  [Diagram: streaming sources either store raw data into HDFS, or store into HDFS and forward to a real-time engine.]
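A minimal store-and-forward sketch, assuming the kafka-python and hdfs (WebHDFS) client libraries and hypothetical hosts, topics and paths; a production pipeline would batch writes and handle file creation and rolling rather than appending per event:

```python
from kafka import KafkaConsumer, KafkaProducer   # kafka-python
from hdfs import InsecureClient                  # 'hdfs' WebHDFS client

consumer = KafkaConsumer("raw_tweets", bootstrap_servers="kafka:9092")
producer = KafkaProducer(bootstrap_servers="kafka:9092")
lake = InsecureClient("http://namenode:50070")   # WebHDFS endpoint

for msg in consumer:
    # STORE: keep the raw event for future analysis / model building.
    lake.write("/datalake/raw/tweets/events.log",
               data=msg.value + b"\n", append=True)
    # FORWARD: pass the same event on to the real-time analytics engine.
    producer.send("analytics_in", msg.value)
```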
  18. Data Analytics  A Data Lake built on HDFS will most likely use a Hadoop cluster to analyze data  Sometimes the result of the analysis may be stored back into HDFS (or possibly Hive / HBase)  But Data Visualization and Reporting / Dashboards may work only on structured data cubes  Hence, on the analytics side, a Data Lake may need outflow paths from HDFS into structured data stores
  19. Plugging In a Data Analytics Engine  Jaspersoft Reporting with HDFS  [Diagram: analyzed data in HDFS flows through a Jaspersoft ETL job (via an HDFS Input connector) into an OLAP cube, which the Jaspersoft Reporting Engine queries.]
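The same outflow pattern can be sketched at the code level: read analyzed results from the lake and push them into a relational store that the reporting / OLAP layer understands. All names and connection details below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-outflow").getOrCreate()

# Read analysis results that an earlier job stored back into the lake.
results = spark.read.parquet("hdfs:///datalake/analyzed/sales_summary/")

# Push them into a structured store for cubes, reporting and dashboards.
(results.write.format("jdbc")
        .option("url", "jdbc:postgresql://martdb:5432/reporting")
        .option("dbtable", "sales_summary")
        .option("user", "etl")
        .option("password", "secret")
        .mode("overwrite")
        .save())
```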
  20. Data Governance  A Data Lake does not conform to a schema  Data Governance makes it possible for both analysts and administrators to make sense of the data  Data Governance is a fairly open-ended subject  Vendors offer different techniques to solve each governance use case  But common patterns are emerging across the landscape
  21. Data Governance: Analyst Use Cases  To search and retrieve ‘relevant’ data for analysis  Common Techniques  Metadata Management  Data tagging  Text Search  Data Classification  Metadata can include technical as well as business information (linked to a Business Glossary)  Data tags are often created by users collaboratively
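A toy sketch of how tag and text search might work against such a metadata repository (record layout and fields are illustrative only, not any vendor's format):

```python
# A toy metadata repository: records like these would be created at ingest.
catalog = [
    {"path": "/datalake/raw/orders/2016-12.csv",
     "tags": ["sales", "holiday_sales"],
     "description": "December retail orders, EU region"},
    {"path": "/datalake/raw/clickstream/2016-12.json",
     "tags": ["web"],
     "description": "site clickstream events"},
]

def search(catalog, tag=None, text=None):
    """Retrieve records by data tag and/or free-text match on descriptions."""
    hits = catalog
    if tag:
        hits = [r for r in hits if tag in r["tags"]]
    if text:
        hits = [r for r in hits if text.lower() in r["description"].lower()]
    return hits

print(search(catalog, tag="holiday_sales", text="retail"))
```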
  22. Data Governance: Admin Use Cases  Track data flow from source to end applications (Lineage)  Retain, replicate and archive based on usage (Data Life-cycle Management)  Track access and usage information for compliance (Auditing)
  23. Automated Metadata Generation  As data is ingested, suitable attributes are extracted and stored into a metadata repository  Data type (XML, PDF, text, etc.)  Data size  Creation and last access time, etc.  Even data tags can be inserted at the time of ingest  Unconditionally, e.g. ‘sales’  Conditionally, e.g. ‘holiday_sales’
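A minimal sketch of metadata extraction and conditional tagging at ingest time, assuming the file is staged locally before upload; the field names and tagging rule are illustrative:

```python
import os
import time

def extract_metadata(path, conditional_tags=None):
    """Build a metadata record for a newly ingested file.
    Field names and tagging rules are illustrative, not a standard."""
    stat = os.stat(path)
    record = {
        "path": path,
        "data_type": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "size_bytes": stat.st_size,
        "created": time.ctime(stat.st_ctime),
        "last_accessed": time.ctime(stat.st_atime),
        "tags": ["sales"],                       # unconditional tag
    }
    for tag, predicate in (conditional_tags or {}).items():
        if predicate(record):                    # conditional tag
            record["tags"].append(tag)
    return record

# e.g. tag December files as holiday sales before storing the record
meta = extract_metadata("orders_2016-12-24.csv",
                        {"holiday_sales": lambda r: "-12-" in r["path"]})
print(meta)
```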
  24. Apache Atlas For Data Governance  [Diagram: Apache Atlas architecture]  Source: http://atlas.incubator.apache.org/Architecture.html
  25. Data Access And Security  By default HDFS is secured using  Kerberos for authentication, and  Unix-style file permissions for authorization  In a large data repository with diverse stakeholders you may need more control  If so, a couple of products may be considered for augmenting Data Security:  Apache Knox  Apache Ranger
  26. Data Access And Security  [Diagram: Knox provides perimeter security in front of the cluster; within HDFS, Kerberos handles authentication and Unix-style (rwx) file permissions handle authorization; Ranger layers federated access control across Nodes 1..N.]
  27. Why Use Ranger  Supports Federated Access Control  Can fall back on default HDFS file permissions  Manages access control over several Hadoop-based components, like Hive, Storm, etc.  Advanced fine-grained access control, like  Deny policies for a user or group  Tag-based access control, where a collection of resources shares a common access tag  For example, a few columns in a Hive table and certain files in HDFS could share a tag: ‘internal_audit’
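To illustrate the tag-based idea, here is a conceptual sketch; it models the concept only and is not Ranger's actual API or policy format:

```python
# Conceptual model only: resources in different systems share a tag,
# and a single tag policy governs them all. Not Ranger's API or format.
resource_tags = {
    "hive:finance.ledger.amount":     {"internal_audit"},
    "hdfs:/datalake/raw/audit/2016/": {"internal_audit"},
}
tag_policies = {
    "internal_audit": {"allowed_groups": {"auditors"}},
}

def is_allowed(resource, user_groups):
    """Grant access if any tag on the resource admits one of the user's groups."""
    for tag in resource_tags.get(resource, ()):
        if tag_policies[tag]["allowed_groups"] & user_groups:
            return True
    return False  # a real system would fall back to HDFS file permissions

print(is_allowed("hdfs:/datalake/raw/audit/2016/", {"auditors"}))   # True
print(is_allowed("hdfs:/datalake/raw/audit/2016/", {"marketing"}))  # False
```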
  28. Steps To Build A Data Lake  Set up a scalable data storage layer  Set up a Compute Cluster capable of running a diverse mix of Jobs  Create data flow pipeline(s) for batch jobs  Create data flow pipeline(s) for streaming data
  29. Steps To Build A Data Lake  Plug in one or more Analytics Engine(s)  Set up mechanisms for efficient data discovery and data governance  Implement Data Access Controls  Design a Monitoring Infrastructure for Jobs and Resources (not covered today)
  30. Building A Data Lake: Starting Points  Set up a scalable data storage layer: HDFS  Set up a Compute Cluster capable of running a diverse mix of Jobs: YARN  Create data flow pipeline(s) for batch jobs: Pentaho HDFS Connector  Create data flow pipeline(s) for streaming data: Pentaho Messaging Connector
  31. Building A Data Lake: Starting Points  Plug in one or more Analytics Engine(s): Pentaho Reporting and Spark MLlib  Set up mechanisms for efficient data discovery and data governance: Apache Atlas  Implement Data Access Controls: Apache Ranger  Design a Monitoring Infrastructure for Jobs and Resources: Apache Ambari
  32. Taking The Plunge  Do you need to plan for and build a Data Lake?  Ask yourself: what fraction of your data are you analyzing today?  What value might the unused data offer?  For marketing campaigns  For product lifecycle management  For regulatory compliance, and so on …  Engage your stakeholders from different LoBs (lines of business)  Is decision making being hampered by lack of data?
  33. Taking The Plunge  Start small: there is a learning curve  Storing data is not enough – stewarding the data is all-important  Design for extensibility and pluggability  Minimize vendor lock-in  Be open to change as you scale your infrastructure
  34. monojit@techyugadi.com