This document discusses Big Data solutions in Microsoft Azure. It introduces Azure cloud services and provides an overview of Big Data and how it differs from traditional databases. It then outlines Microsoft's Big Data solutions built on the Hortonworks Data Platform, including HDInsight, which runs Hadoop on Azure. HDInsight supports various data storage options and processing tools such as Hive, Pig, and Storm. The document also covers designing HDInsight clusters and Azure Data Lake for unlimited storage of structured and unstructured data.
3. What is Big Data?
• Analyzing extremely large datasets computationally to
reveal patterns, trends, and associations.
• Characterized by the 3 Vs (Volume, Velocity, and Variety).
• Enables enhanced insight and decision making.
5. Microsoft Big Data solutions
• Microsoft supports Hadoop-based Big Data solutions.
• Built on top of the Hortonworks Data Platform (HDP)
• Three distinct solutions based on HDP:
• HDInsight
• HDP for Windows
• Microsoft Analytics Platform
7. Hadoop
• Hadoop – A framework for solving Big Data problems using a scale-out “divide
and conquer” approach
• HDFS – Hadoop Distributed File System. Allows data to be split across
multiple nodes.
• MapReduce – Enables distributed processing.
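The MapReduce model above can be sketched in plain Python (a conceptual illustration only, not the Hadoop API): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here, a word count).
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "data at scale"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```

In Hadoop the same three phases run distributed across the cluster, with the shuffle moving data between nodes.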
8. Hadoop Components
• Cluster – A collection of server nodes that stores data using HDFS and processes it.
• Datastore – The data store on each server is part of a distributed storage service (HDFS or equivalent).
• Query – Big Data processing queries run as MapReduce jobs.
9. HDInsight
• An implementation of Hadoop that runs on the Azure platform
• Pay only for what you use
• Dynamic allocation of nodes in the cluster
• Integrated with Azure storage
10. HDInsight - Data Storage
• HDInsight supports the following storage types:
• HDFS (Standard Hadoop)
• Azure Storage Blob
• HBase
11. HDInsight – Data Processing
• Run jobs directly on the cluster using MapReduce
• Use external programs to connect to the cluster:
• Pig – Execute queries by writing scripts in a high-level language
• Hive – Run SQL-like queries on the data
• Mahout – A machine learning library for data mining queries
• Storm – Real-time computation for processing fast, large streams of data
13. Designing for HDInsight
• Determine the analytical goals and source data
• Plan and configure the infrastructure
• Obtain data and submit it to HDInsight
• Process the data
• Evaluate the results
• Tune the solution
14. Azure Data Lake
• A single place to store all structured and semi-structured data in its native format
• Unlimited data size
• Compatible with HDFS
16. Summary
• Hadoop – The de facto solution to the Big Data problem
• Windows Azure HDInsight Service
• Native Hadoop implementation
• Managed Hadoop Service for Windows Azure
Editor's notes
HDP is 100% compatible with Apache Hadoop.
Open Enterprise Hadoop Data Platform – Enterprise ready
HDInsight – Available to Azure subscribers. Runs HDP on Azure clusters and integrates with Azure storage.
HDP for Windows – On-premises solution for running Hadoop on Windows Server, on either physical or virtual machines.
Microsoft Analytics Platform – Massively Parallel Processing (MPP) in the Microsoft Parallel Data Warehouse (PDW), combined with Hadoop.
Cluster - The cluster is managed by a server called the name node that has knowledge of all the cluster servers and the parts of the data files stored on each one. To store incoming data, the name node server directs the client to the appropriate data node server. The name node also manages replication of data files across all the other cluster members that communicate with each other to replicate the data.
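The name node's block-placement role described above can be roughly illustrated in Python (a simplified sketch; real HDFS placement also considers racks, node load, and health, and the function and node names here are hypothetical):

```python
def place_blocks(num_blocks, data_nodes, replication=3):
    # Simplified name-node logic: assign each block of a file to
    # `replication` distinct data nodes, rotating round-robin through
    # the cluster so replicas spread across different servers.
    placement = {}
    n = len(data_nodes)
    for block in range(num_blocks):
        placement[block] = [data_nodes[(block + r) % n] for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(num_blocks=3, data_nodes=nodes)
print(plan[0])  # ['dn1', 'dn2', 'dn3']
```

The real name node also tracks which blocks each data node actually holds and re-replicates blocks when a node fails.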
DataStore – Key/value stores, document stores (XML or JSON), binary stores, column stores, graph stores.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
HCatalog™: A table and storage management service for data created using Apache Hadoop.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.
When you store your data using HDFS, it's contained within the nodes of the cluster and it must be called through the HDFS API. When the cluster is decommissioned, the data is lost as well.
Azure Storage provides several advantages: you can load the data using standard tools, retain the data when you decommission the cluster, the cost is lower, and other processes in Azure can access the data.
HBase is a NoSQL wide-column data store implemented as distributed system that provides data processing and storage over multiple nodes in a Hadoop cluster. It provides a random, real-time, read/write data store designed to host tables that can contain billions of rows and millions of columns.
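To make the wide-column idea concrete, here is a minimal in-memory model (not the HBase client API; the table, family, and qualifier names are hypothetical): each row key maps to column families, and each family holds arbitrarily many qualifier/value cells, so different rows can have entirely different columns.

```python
# Minimal in-memory model of an HBase-style wide-column table:
# {row_key: {column_family: {qualifier: value}}}
table = {}

def put(row_key, family, qualifier, value):
    # Write a single cell; families and qualifiers are created on demand.
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

def get(row_key, family, qualifier):
    # Read a single cell, or None if it does not exist.
    return table.get(row_key, {}).get(family, {}).get(qualifier)

put("user#1001", "info", "name", "Alice")
put("user#1001", "metrics", "visits", 42)
print(get("user#1001", "info", "name"))  # Alice
```

In real HBase, rows are kept sorted by row key and split into regions served by different nodes, which is what lets tables scale to billions of rows.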
Pig – Pig is another framework. Using the Pig Latin language, you can create MapReduce jobs more easily than by coding the Java yourself: the language has simple statements for loading data, storing it in intermediate steps, and computing over it, and it is MapReduce-aware.
Hive - When you want to work with the data on your cluster in a relational-friendly format. Hive allows you to create a data warehouse on top of HDFS or other file systems and uses a language called HiveQL, which has a lot in common with the Structured Query Language (SQL).
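HiveQL's SQL-like flavor can be illustrated with Python's built-in sqlite3 (an analogy only: the table and column names are hypothetical, and Hive would execute a query like this as MapReduce jobs over HDFS rather than against a local database file):

```python
import sqlite3

# Build a small table standing in for data Hive might expose over HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (page TEXT, hits INTEGER)")
conn.executemany("INSERT INTO weblogs VALUES (?, ?)",
                 [("/home", 10), ("/about", 3), ("/home", 5)])

# A summarization query of the kind Hive is typically used for;
# the equivalent HiveQL statement reads almost identically.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM weblogs GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('/about', 3), ('/home', 15)]
```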
Mahout is a machine learning library, which allows you to perform data mining queries that examine data files to extract specific types of information. For example, it supports recommendation mining (finding user’s preferences from their behavior), clustering (grouping documents with similar topic content), and classification (assigning new documents to a category based on existing categorization).
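A toy version of the recommendation-mining idea is sketched below (a co-occurrence scoring sketch with hypothetical data, not Mahout's actual algorithms): items a user has not seen are scored by how often they appear alongside the user's items in other users' histories.

```python
from collections import Counter

# Items each user has interacted with (hypothetical data).
histories = {
    "alice": {"book", "film"},
    "bob": {"book", "game"},
    "carol": {"book", "film", "music"},
}

def recommend(user):
    # Score unseen items by co-occurrence with the user's items in
    # other users' histories; return them best-scored first.
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other != user and seen & items:
            for item in items - seen:
                scores[item] += 1
    return [item for item, _ in scores.most_common()]

print(recommend("bob"))  # 'film' ranks first: it co-occurs with 'book' twice
```

Mahout runs this kind of computation as distributed jobs over much larger interaction datasets.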
Storm is a distributed real-time computation system for processing fast, large streams of data. It allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data items using a user-defined number of parallel tasks. It can be used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
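The spout/bolt topology idea can be sketched with Python generators (a conceptual illustration; in Storm each stage runs as a user-defined number of parallel tasks across the cluster, and the names here are hypothetical):

```python
from collections import Counter

def spout(events):
    # Spout: the stream source, emitting tuples one at a time.
    yield from events

def split_bolt(stream):
    # Bolt: transform each tuple (here, split sentences into words).
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: stateful aggregation over the stream (a running word count).
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

events = ["storm processes streams", "storm is fast"]
topology = count_bolt(split_bolt(spout(events)))
results = dict(topology)  # keep each word's latest running count
print(results["storm"])  # 2
```

Chaining the generators mirrors how a topology wires spouts into bolts into further bolts, forming the DAG the note describes.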
Interactive data ingestion – Small amounts of data.
Automated batch upload to HDInsight – Using SSIS to gather disparate data sources and push them to HDInsight.
Relational data – Sqoop, for loading relational data into HDInsight.
Web log files – Flume.
Built on top of Azure’s hyperscale network; supports single files that can be multiple petabytes in size.