Big data technology
foundations
Exploring the Big Data Stack
• Big data architecture is the foundation for big data analytics.
• Designing any kind of data architecture begins with creating a model that
gives a complete view of all the required elements.
• Designing the model may take extra time, but the subsequent
implementation of the model can save a significant amount of time, effort, and
rework.
• The configuration of the model/architecture may vary depending upon the specific
needs of the organisation.
• But for any data architecture, the basic layers and components remain
more or less the same.
• To design a big data architecture model we need to think of Big Data as a strategy
and not a project.
Do I Need Big Data Architecture?
• Not everyone needs to leverage a big data architecture.
• Single computing tasks rarely top more than 100GB of data, which does
not require a big data architecture.
• Unless you are analyzing terabytes and petabytes of data – and doing it
consistently -- look to a scalable server instead of a massively scale-out
architecture like Hadoop.
• If you need analytics, then consider a scalable array that offers native
analytics for stored data.
Do I Need Big Data Architecture? (cont.)
You probably do need big data architecture if any of the following applies to you:
• You want to extract information from extensive networking or web logs.
• You process massive datasets over 100GB in size. Some of these computing tasks run 8
hours or longer.
• You are willing to invest in a big data project, including third-party products to optimize your
environment.
• You store large amounts of unstructured data that you need to summarize or transform into a
structured format for better analytics.
• You have multiple large data sources to analyze, including structured and unstructured.
• You want to proactively analyze big data for business needs, such as analyzing store sales by
season and advertising, applying sentiment analysis to social media posts, or investigating
email for suspicious communication patterns – or all the above.
Big Data Architecture
The strategy includes the design principles related to the creation of an environment to
support Big Data. The principles deal with storage of data, analytics, reporting, and
applications.
• During the creation of a Big Data architecture, consideration is required of the hardware, software
infrastructure, operational software, management software, APIs, and software developer tools.
• The architecture of Big Data environment must fulfill all fundamental requirements to perform the
following functions:
 Capturing data from different sources
 Cleaning and integrating data of different types and formats
 Sorting and organising data
 Analysing data
 Identifying relationships and patterns
 Deriving conclusions based on the data analysis results.
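The fundamental functions above can be sketched end to end in a few lines of plain Python. This is an illustrative toy pipeline only; the record formats, field names, and sources are hypothetical, not part of any real system.

```python
# Toy sketch of the fundamental functions: capture, clean/integrate,
# sort, analyse, and derive a conclusion. All data is illustrative.

# Capture: records arriving from two hypothetical sources in different formats
source_a = [{"city": "Pune", "sales": "120"}, {"city": "Delhi", "sales": "95"}]
source_b = [("Pune", 80), ("Delhi", None)]  # tuple format, with a missing value

# Clean and integrate: normalise both formats, dropping invalid records
records = [{"city": r["city"], "sales": int(r["sales"])} for r in source_a]
records += [{"city": c, "sales": s} for c, s in source_b if s is not None]

# Sort and organise
records.sort(key=lambda r: r["city"])

# Analyse: total sales per city
totals = {}
for r in records:
    totals[r["city"]] = totals.get(r["city"], 0) + r["sales"]

# Identify a pattern / derive a conclusion: the best-performing city
best = max(totals, key=totals.get)
print(totals, best)
```

A real Big Data architecture performs these same steps, but distributed over many machines and data formats.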
Stack of Layers in Big Data Architecture
Big Data architecture comprises the following basic layers and components:
 Data Sources Layer
 Ingestion Layer
 Storage Layer
 Physical Infrastructure Layer
 Platform Management Layer
 Data Processing Layer
 Data Query Layer
 Security Layer
 Monitoring Layer
 Analytics Engine
 Visualization Layer
 Big Data Applications
(Figure: arrangement of the various layers in the Big Data architecture)
Stack of Layers in Big Data Architecture
Data Sources Layer
 Data sources for big data architecture are all over the map. The bottom layer of the stack is
the foundation and is known as the data layer.
 Data can come through from company servers and sensors, or from third-party data
providers.
 The big data environment can ingest data in batch mode or real-time.
 The basic function of the data sources layer is to absorb and integrate the data coming from
various sources with different formats at varying velocity.
 Before being sent to the data stack for logical use, the data has to be validated, sorted, and cleaned.
 This layer feeds Hadoop distributions, NoSQL databases, and other relational
databases.
 A few data source examples include enterprise applications like ERP or CRM, MS Office
docs, data warehouses and relational database management systems (RDBMS), databases,
mobile devices, sensors, social media, and email.
Ingestion Layer
 This layer is the first step in the journey of data coming from various sources.
Here, data is prioritised and categorised, which makes it flow smoothly through the later layers.
 In this layer we plan how to ingest data flowing from hundreds or thousands of sources
into the data centre.
 Data ingestion means taking data coming from multiple sources and putting it
somewhere it can be accessed.
 It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
 This layer separates noise from relevant information.
 Data can be streamed in real time or ingested in batches. When data is ingested in real
time, each item is ingested as soon as it arrives. When data is ingested in
batches, items are ingested in chunks at periodic intervals. Ingestion
is the process of bringing data into the data processing system.
Ingestion Layer (cont.)
In the ingestion layer the data passes
through the following stages:
 Identification
 Filtration
 Validation
 Noise reduction
 Transformation
 Compression
 Integration
Ingestion Layer (cont.)
 Identification: Data is categorised into various known data formats, or unstructured data is assigned
default formats.
 Filtration: The information relevant for the enterprise is filtered on the basis of the Enterprise Master
Data Management (MDM) repository.
 Validation: The filtered data is analysed against the MDM metadata.
 Noise reduction: Data is cleaned by removing the noise and minimising the related disturbances.
 Transformation: Data is split or combined on the basis of its type, contents, and the requirement of the
organisation.
 Compression: The size of the data is reduced without affecting its relevance for the required process. It
should be remembered that compression does not affect the analysis results.
 Integration: The refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop
Distributed File System (HDFS) and NoSQL databases.
 Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to
ETL (Extract, Transform and Load) in traditional warehouses.
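The ingestion stages can be sketched as a chain of small functions over a batch of raw records. This is a hedged toy model: the stage logic, field names, and "MDM-style" validation rule are illustrative placeholders, not a real ingestion framework.

```python
import json
import zlib

# Toy ingestion pipeline: each stage is a function over a batch of records.
def identify(raw):
    # Identification: tag each record with a (here, assumed) detected format.
    return [{"fmt": "json", "data": r} for r in raw]

def filtrate(recs):
    # Filtration: keep only records relevant to the enterprise (toy rule).
    return [r for r in recs if "user" in r["data"]]

def validate(recs):
    # Validation: check records against expected (MDM-style) fields.
    return [r for r in recs if set(json.loads(r["data"])) >= {"user", "event"}]

def reduce_noise(recs):
    # Noise reduction: drop exact duplicates.
    seen, out = set(), []
    for r in recs:
        if r["data"] not in seen:
            seen.add(r["data"])
            out.append(r)
    return out

def compress(recs):
    # Compression: shrink payloads without losing content.
    return [zlib.compress(r["data"].encode()) for r in recs]

raw_batch = [
    '{"user": "u1", "event": "click"}',
    '{"user": "u1", "event": "click"}',   # duplicate -> removed as noise
    '{"event": "view"}',                  # irrelevant -> filtered out
]
batch = compress(reduce_noise(validate(filtrate(identify(raw_batch)))))
print(len(batch))   # only the valid, deduplicated record remains
```

Transformation and integration (writing the refined batch into HDFS/NoSQL) would follow the same pattern as additional stages.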
Storage Layer
 Storage becomes a challenge when the size of the data you are dealing with becomes large.
 Finding a storage solution is very important when your data grows large.
This layer focuses on "where to store such large data efficiently."
 Hadoop is an open-source framework commonly used to store high volumes of data in a distributed
manner across multiple machines.
 There are two major components of Hadoop: a scalable Hadoop Distributed File System
(HDFS) that can support petabytes of data, and a MapReduce engine that computes
results in batches.
 Hadoop has its own database, known as HBase; others, including Amazon's
DynamoDB, MongoDB, and Cassandra (used by Facebook), all based on the NoSQL
architecture, are popular too.
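The core HDFS idea, splitting a large file into fixed-size blocks and replicating each block across machines, can be sketched in a few lines. The block size and replication factor below are common HDFS defaults; the node names and round-robin placement are illustrative simplifications (real HDFS placement is rack-aware).

```python
# Sketch of HDFS-style storage: split a file into fixed-size blocks and
# place replicas of each block on different machines.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # default: each block stored on 3 nodes

def plan_blocks(file_size, nodes):
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    layout = []
    for i in range(n_blocks):
        # Simplified round-robin replica placement over the available nodes
        replicas = [nodes[(i + k) % len(nodes)] for k in range(REPLICATION)]
        layout.append({"block": i, "replicas": replicas})
    return layout

nodes = ["node1", "node2", "node3", "node4"]
layout = plan_blocks(file_size=300 * 1024 * 1024, nodes=nodes)  # 300 MB file
print(len(layout))   # 300 MB / 128 MB -> 3 blocks
```

Because every block lives on several nodes, the loss of a single machine costs no data, which is the redundancy property the physical infrastructure layer below depends on.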
Digging into Big Data Technology Components
Physical Infrastructure Layer
 As big data is all about high-velocity, high-volume, and high-data variety, the physical
infrastructure will literally “make or break” the implementation.
 Most big data implementations need to be highly available, so the networks, servers, and
physical storage must be both resilient and redundant. Resiliency and redundancy are
interrelated.
 An infrastructure, or a system, is resilient to failure or changes when sufficient redundant
resources are in place, ready to jump into action.
 Redundancy ensures that a single malfunction won't cause an outage. Resiliency helps to
eliminate single points of failure in your infrastructure.
 This means that the technical and operational complexity is masked behind a collection of
services, each with specific terms for performance, availability, recovery, and so on. These
terms are described in service-level agreements (SLAs) and are usually negotiated between the
service provider and the customer, with penalties for noncompliance.
Physical Infrastructure Layer (cont.)
 A prioritized list of big data principles should include statements about the following:
 Performance: How responsive do you need the system to be? Performance, often measured as latency,
is evaluated end to end, based on a single transaction or query request.
 Availability: Do you need a 100 percent uptime guarantee of service? How long can your
business wait in the case of a service interruption or failure?
 Scalability: How big does your infrastructure need to be? How much disk space is needed
today and in the future? How much computing power do you need? Typically, you need to
decide what you need and then add a little more scale for unexpected challenges.
 Flexibility: How quickly can you add more resources to the infrastructure? How quickly can
your infrastructure recover from failures?
 Cost: What can you afford? Because the infrastructure is a set of components, you might be
able to buy the “best” networking and decide to save money on storage. You need to establish
requirements for each of these areas in the context of an overall budget and then make trade-
offs where necessary.
A. PHYSICAL REDUNDANT NETWORKS
 Networks should be redundant and must have enough capacity to accommodate the
anticipated volume and velocity of the inbound and outbound data in addition to the
“normal” network traffic experienced by the business.
 As you begin making big data an integral part of your computing strategy, it is
reasonable to expect volume and velocity to increase.
 Infrastructure designers should plan for these expected increases and try to create
physical implementations that are “elastic.”
 As network traffic ebbs and flows, so too does the set of physical assets associated with
the implementation.
 Your infrastructure should offer monitoring capabilities so that operators can react when
more resources are required to address changes in workloads.
B. MANAGE HARDWARE: STORAGE AND SERVERS
 The hardware (storage and server) assets must have sufficient speed and capacity to
handle all expected big data capabilities.
 It’s of little use to have a high-speed network with slow servers because the servers will
most likely become a bottleneck.
 However, a very fast set of storage and compute servers can overcome variable network
performance.
 Of course, nothing will work properly if network performance is poor or unreliable.
C. INFRASTRUCTURE OPERATIONS
 Another important design consideration is infrastructure operations management.
 The greatest levels of performance and flexibility will be present only in a well-managed
environment.
 Data center managers need to be able to anticipate and prevent catastrophic failures so
that the integrity of the data, and by extension the business processes, is maintained. IT
organizations often overlook and therefore underinvest in this area.
Platform Management Layer
 The main role of this layer is to provide tools and query languages for
accessing NoSQL (Not only SQL) databases and for using the HDFS storage file
system that sits on top of the Hadoop physical infrastructure layer.
 It manages the core components of Hadoop, such as HDFS and MapReduce, along with other tools
to store, access, and analyse large amounts of data, including real-time analysis.
 These technologies address the fundamental problem of processing data in a timely,
efficient, and cost-effective manner.
Platform Management Layer (cont.)
 Key building blocks of the Hadoop platform management layer are:
 MapReduce: A combination of the map and reduce functions. Map is a component that
distributes a problem across a large number of systems. After the distributed tasks complete, the
reduce function combines all the elements back together to provide an aggregate result.
 Hive: Provides an SQL-like query language, named Hive Query Language (HQL), for
querying data stored in a Hadoop cluster.
 Pig: A scripting language used for batch processing of huge amounts of data, allowing
parallel processing over HDFS.
 HBase: A column-oriented database that provides fast handling of big data.
 Sqoop: A command-line tool that helps import individual tables, specific columns, or
entire database files directly into the distributed file system.
 ZooKeeper: Helps coordinate and keep multiple Hadoop instances and nodes in
synchronisation, and protects every node from failing.
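The MapReduce idea can be sketched in plain Python with the classic word-count example: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. A real MapReduce engine runs these same phases in parallel across many machines; here everything runs in one process for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in real MapReduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big stack", "data stack"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 2, 'data': 2, 'stack': 2}
```

In Hadoop, each mapper would process a different HDFS block, and the shuffle would move data over the network between mapper and reducer nodes.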
Data Processing Layer
 In this Layer, data collected in the previous layer is processed and made ready
to route to different destinations.
 Batch processing system - A pure batch processing system for offline
analytics (Sqoop).
 Near real time processing system - A pure online processing system for
online analytics (Storm).
 In-memory processing engine - Efficiently execute streaming, machine
learning or SQL workloads that require fast iterative access to datasets
(Spark)
 Distributed stream processing - Provides results that are accurate, even in
the case of out-of-order or late-arriving data (Flink)
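The difference between batch and near-real-time (micro-batch) processing can be sketched in plain Python. The numbers and chunk size are illustrative; engines such as Spark and Storm distribute this work across a cluster.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]   # an illustrative stream of measurements

# Batch processing: the whole dataset is available before processing starts.
batch_result = sum(events)

# Micro-batch processing: process fixed-size chunks as they arrive, keeping a
# running result that is up to date after every chunk.
micro_batches = [events[i:i + 3] for i in range(0, len(events), 3)]
running = 0
partials = []
for chunk in micro_batches:
    running += sum(chunk)
    partials.append(running)

print(batch_result, partials)
```

Both approaches converge on the same final answer; the trade-off is latency (how soon a partial result is available) versus throughput and simplicity.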
Data Query Layer
 This is the layer where strong analytic processing takes place. Data analytics
is an essential step that addresses the inefficiencies of traditional data platforms
in handling large amounts of data for interactive queries,
ETL (Extract, Transform & Load), storage, and processing.
 Tools – Hive, Spark SQL, Presto, Redshift
 Data Warehouse - Centralized repository that stores data from multiple
information sources and transforms them into a common, multidimensional
data model for efficient querying and analysis.
 Data Lake - Cloud-based enterprise architecture that structures data in a more
scalable way, making it easier to experiment with. All data is retained.
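A warehouse-style, multidimensional aggregation can be illustrated with a short SQL query. Here `sqlite3` stands in for engines such as Hive, Spark SQL, or Presto (the HQL/Spark SQL syntax for this query would be essentially the same); the `sales` table and its columns are made up for the example.

```python
import sqlite3

# In-memory database standing in for a data warehouse / query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, season TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "summer", 100.0), ("east", "winter", 40.0),
     ("west", "summer", 70.0), ("west", "winter", 90.0)],
)

# Multidimensional-style query: total sales per region and season.
rows = conn.execute(
    "SELECT region, season, SUM(amount) FROM sales "
    "GROUP BY region, season ORDER BY region, season"
).fetchall()
print(rows)
```

The point of tools like Hive and Presto is that the same declarative query runs unchanged whether the table holds kilobytes or petabytes.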
Security Layer
 It provides the mechanisms for securing data while it is analysed over multiple
distributed systems.
 Privacy preservation, auditing, and role-based access mechanisms provide security
for data both at rest and in transit.
 Secure frameworks allow organisations to publish and use analytics safely, based on
authentication mechanisms such as one-time passwords (OTP), multilevel
authentication, and role-based access mechanisms.
 It maintains user privacy and security: confidentiality, integrity, and authentication
mechanisms validate the users.
 It must ensure secure communication between nodes by using the Secure Sockets
Layer (SSL).
 The security layer handles the basic security principles that Big Data architecture should
follow.
Monitoring Layer
 It consists of a number of monitoring systems.
 These systems automatically discover the configurations and functions of
different operating systems and hardware.
 It also facilitates machine-to-machine communication for monitoring tools using XML
(Extensible Markup Language) over high-level protocols.
 All monitoring systems provide tools for data storage and visualisation.
Analytics Engine
 Along the transformation path from big data to information to knowledge, lie a host of
analytics techniques and approaches.
 The role of the analytics engine is to analyze huge amounts of unstructured data.
 It is useful to look at the range of big data analytics through the following four categories:
 Exploration including visualization
 Explanation
 Prediction
 Prescription
 Different types of engines are used for analysing big data:
 Search engine: Required because the data loaded from various sources has to be
indexed and searched for Big Data analytics processing.
 Real-time engine: Required to analyse data generated by real-time applications.
Visualization Layer
 This layer focuses on Big Data visualization. We need
something that will grab people's attention, pull
them in, and make the findings well understood. This is
where the value of the data is perceived by the user.
 Dashboards – Save, share, and communicate
insights. It helps users generate questions by
revealing the depth, range, and content of their data
stores.
– Tools - Tableau, AngularJS, Kibana, React.js
 Recommenders - Recommender systems focus on
the task of information filtering, which deals with the
delivery of items selected from a large collection
that the user is likely to find interesting or useful.
Big Data Applications
 Big data management strategies and best practices are still evolving, but joining the
big data movement has become an imperative for companies across a wide variety
of industries.
 Different types of tools and applications are used to implement Big Data stack
architecture.
 The applications can be categorised as:
 Horizontal: Applications used to address problems that are common
across industries.
 Vertical: Applications used to solve industry-specific problems.
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 

Dernier (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 

Lecture 4: Big data technology foundations

Exploring the Big Data Stack
• Big data architecture is the foundation for big data analytics.
• Designing any kind of data architecture is the process of creating a model that gives a complete view of all the required elements.
• Designing the model may consume more time up front, but subsequently implementing it can save a significant amount of time, effort, and rework.
• The configuration of the model/architecture may vary depending upon the specific needs of the organisation.
• But for any data architecture, the basic layers and components remain more or less the same.
• To design a big data architecture model, we need to think of Big Data as a strategy and not a project.
Do I Need Big Data Architecture?
• Not everyone needs to leverage big data architecture.
• Single computing tasks rarely top 100 GB of data, which does not require a big data architecture.
• Unless you are analyzing terabytes and petabytes of data – and doing it consistently – look to a scalable server instead of a massively scale-out architecture like Hadoop.
• If you need analytics, then consider a scalable array that offers native analytics for stored data.
Do I Need Big Data Architecture? (cont.)
You probably do need big data architecture if any of the following applies to you:
• You want to extract information from extensive networking or web logs.
• You process massive datasets over 100 GB in size. Some of these computing tasks run 8 hours or longer.
• You are willing to invest in a big data project, including third-party products to optimize your environment.
• You store large amounts of unstructured data that you need to summarize or transform into a structured format for better analytics.
• You have multiple large data sources to analyze, including structured and unstructured ones.
• You want to proactively analyze big data for business needs, such as analyzing store sales by season and advertising, applying sentiment analysis to social media posts, or investigating email for suspicious communication patterns – or all of the above.
Big Data Architecture
• The strategy includes the design principles related to the creation of an environment to support Big Data. The principles deal with the storage of data, analytics, reporting, and applications.
• During the creation of a Big Data architecture, consideration is required of hardware, software infrastructure, operational software, management software, APIs, and software developer tools.
• The architecture of a Big Data environment must fulfil all fundamental requirements to perform the following functions:
– Capturing data from different sources
– Cleaning and integrating data of different types and formats
– Sorting and organising data
– Analysing data
– Identifying relationships and patterns
– Deriving conclusions based on the data analysis results
Stack of Layers in Big Data Architecture
Big Data architecture comprises the following basic layers and components:
– Data Sources Layer
– Ingestion Layer
– Storage Layer
– Physical Infrastructure Layer
– Platform Management Layer
– Data Processing Layer
– Data Query Layer
– Security Layer
– Monitoring Layer
– Analytics Engine
– Visualization Layer
– Big Data Applications
Arrangement of the various layers in the Big Data architecture is shown in the figure.
  • 7. Stack of Layers in Big Data Architecture
Data Sources Layer
• Data sources for big data architecture are all over the map. The bottom layer of the stack is the foundation and is known as the data layer.
• Data can come from company servers and sensors, or from third-party data providers.
• The big data environment can ingest data in batch mode or in real time.
• The basic function of the data sources layer is to absorb and integrate data coming from various sources, in different formats and at varying velocity.
• Before being sent to the data stack for logical use, data has to be validated, sorted, and cleaned.
• This layer feeds the Hadoop distributions, NoSQL databases, and other relational databases.
• A few data source examples include enterprise applications like ERP or CRM, MS Office docs, data warehouses and relational database management systems (RDBMS), databases, mobile devices, sensors, social media, and email.
Ingestion Layer
• This layer is the first step of the journey for data coming from variable sources. Here data is prioritised and categorised, which makes the data flow smoothly through the further layers.
• In this layer we plan the way to ingest data flows from hundreds or thousands of sources into the data center.
• Data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed.
• It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
• This layer separates noise from relevant information.
• Data can be streamed in real time or ingested in batches. When data is ingested in real time, it is ingested immediately as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals of time.
• Ingestion is the process of bringing data into the data processing system.
Ingestion Layer (cont.)
In the ingestion layer the data passes through the following stages:
– Identification
– Filtration
– Validation
– Noise reduction
– Transformation
– Compression
– Integration
Ingestion Layer (cont.)
• Identification: Data is categorised into various known data formats, or unstructured data is assigned default formats.
• Filtration: The information relevant to the enterprise is filtered on the basis of the Enterprise Master Data Management (MDM) repository.
• Validation: The filtered data is analysed against the MDM metadata.
• Noise reduction: Data is cleaned by removing the noise and minimising related disturbances.
• Transformation: Data is split or combined on the basis of its type, its contents, and the requirements of the organisation.
• Compression: The size of the data is reduced without affecting its relevance for the required process. It should be remembered that compression does not affect the analysis results.
• Integration: The refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop Distributed File System (HDFS) and NoSQL databases.
• Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform and Load) in traditional warehouses.
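The staged pipeline above can be sketched as a chain of functions over raw records. This is a minimal, illustrative sketch: the stage names follow the slide, but the record fields, the `relevant_sources` set, and the logic inside each stage are assumptions, not any real ingestion framework. Transformation, compression, and integration are omitted for brevity.

```python
# Minimal sketch of an ingestion pipeline: each stage takes a list of raw
# records and returns a refined list. All field names are illustrative; a
# real pipeline (e.g. on Hadoop) would run these stages in a distributed way.

def identify(records):
    # Identification: tag records with a detected format; unknown ones get a default.
    return [dict(r, fmt=r.get("fmt", "unstructured")) for r in records]

def filter_relevant(records, relevant_sources):
    # Filtration: keep only records whose source is known to the (hypothetical) MDM repository.
    return [r for r in records if r["source"] in relevant_sources]

def validate(records):
    # Validation: drop records missing a payload.
    return [r for r in records if r.get("payload") is not None]

def reduce_noise(records):
    # Noise reduction: strip stray whitespace from the payload.
    return [dict(r, payload=r["payload"].strip()) for r in records]

def ingest(records, relevant_sources):
    # Chain the stages in the order the slide lists them.
    for stage in (identify,
                  lambda rs: filter_relevant(rs, relevant_sources),
                  validate,
                  reduce_noise):
        records = stage(records)
    return records

raw = [
    {"source": "crm", "payload": "  order #12  "},
    {"source": "spam", "payload": "buy now"},
    {"source": "erp", "payload": None},
]
clean = ingest(raw, relevant_sources={"crm", "erp"})
print(clean)  # only the cleaned CRM record survives
```

Each stage is a pure list-to-list function, so stages can be reordered, tested, or parallelised independently.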
Storage Layer
• Storage becomes a challenge when the size of the data you are dealing with becomes large.
• Finding a storage solution is very important when your data grows large. This layer focuses on "where to store such large data efficiently."
• Hadoop is an open-source framework normally used to store high volumes of data in a distributed manner across multiple machines.
• There are two major components of Hadoop: a scalable Hadoop Distributed File System (HDFS) that can support petabytes of data, and a MapReduce engine that computes results in batches.
• Hadoop has its own database, known as HBase, but others, including Amazon's DynamoDB, MongoDB, and Cassandra (used by Facebook) – all based on the NoSQL architecture – are popular too.
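The core HDFS idea, splitting a file into fixed-size blocks and replicating each block on several machines, can be illustrated at toy scale. The block size and replication factor below mirror HDFS defaults (128 MB blocks, 3 copies) scaled down to a few bytes; the round-robin placement is a simplification, since real HDFS placement is rack-aware.

```python
# Toy illustration of HDFS-style storage: a file is split into fixed-size
# blocks, and each block is replicated on several nodes so one node failure
# loses no data. Sizes are shrunk for the example.

BLOCK_SIZE = 4    # bytes here; 128 MB by default in real HDFS
REPLICATION = 3   # HDFS default replication factor

def split_blocks(data, block_size=BLOCK_SIZE):
    # Cut the byte string into consecutive fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` nodes, round-robin.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_blocks(b"hello big data")
layout = place_blocks(blocks, nodes=["n1", "n2", "n3", "n4"])
print(len(blocks), layout[0])
```

With 14 bytes and 4-byte blocks the file becomes four blocks, each stored on three of the four nodes.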
  • 13. Digging into Big Data Technology Components
Physical Infrastructure Layer
• As big data is all about high velocity, high volume, and high data variety, the physical infrastructure will literally "make or break" the implementation.
• Most big data implementations need to be highly available, so the networks, servers, and physical storage must be both resilient and redundant. Resiliency and redundancy are interrelated.
• An infrastructure, or a system, is resilient to failure or changes when sufficient redundant resources are in place, ready to jump into action.
• Redundancy ensures that such a malfunction won't cause an outage. Resiliency helps to eliminate single points of failure in your infrastructure.
• This means that the technical and operational complexity is masked behind a collection of services, each with specific terms for performance, availability, recovery, and so on. These terms are described in service-level agreements (SLAs) and are usually negotiated between the service provider and the customer, with penalties for noncompliance.
Physical Infrastructure Layer (cont.)
A prioritized list of big data principles should include statements about the following:
• Performance: How responsive do you need the system to be? Performance, also called latency, is often measured end to end, based on a single transaction or query request.
• Availability: Do you need a 100 percent uptime guarantee of service? How long can your business wait in the case of a service interruption or failure?
• Scalability: How big does your infrastructure need to be? How much disk space is needed today and in the future? How much computing power do you need? Typically, you need to decide what you need and then add a little more scale for unexpected challenges.
• Flexibility: How quickly can you add more resources to the infrastructure? How quickly can your infrastructure recover from failures?
• Cost: What can you afford? Because the infrastructure is a set of components, you might be able to buy the "best" networking and decide to save money on storage. You need to establish requirements for each of these areas in the context of an overall budget and then make trade-offs where necessary.
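The availability question above has simple arithmetic behind it: an uptime percentage translates directly into permitted downtime per year. A quick worked example (the percentages are the usual SLA tiers, not figures from the slides):

```python
# How much downtime per year does a given uptime guarantee actually allow?
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(uptime_pct):
    # Fraction of the year the service may be down, in hours.
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime -> {downtime_hours(pct):.2f} h downtime/year")
```

So "three nines" (99.9%) still permits almost nine hours of outage a year, which is why SLAs spell out the exact figure and the penalties for missing it.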
A. PHYSICAL REDUNDANT NETWORKS
• Networks should be redundant and must have enough capacity to accommodate the anticipated volume and velocity of the inbound and outbound data, in addition to the "normal" network traffic experienced by the business.
• As you begin making big data an integral part of your computing strategy, it is reasonable to expect volume and velocity to increase.
• Infrastructure designers should plan for these expected increases and try to create physical implementations that are "elastic."
• As network traffic ebbs and flows, so too does the set of physical assets associated with the implementation.
• Your infrastructure should offer monitoring capabilities so that operators can react when more resources are required to address changes in workloads.
B. MANAGE HARDWARE: STORAGE AND SERVERS
• The hardware (storage and server) assets must have sufficient speed and capacity to handle all expected big data capabilities.
• It's of little use to have a high-speed network with slow servers, because the servers will most likely become a bottleneck.
• However, a very fast set of storage and compute servers can overcome variable network performance.
• Of course, nothing will work properly if network performance is poor or unreliable.
C. INFRASTRUCTURE OPERATIONS
• Another important design consideration is infrastructure operations management.
• The greatest levels of performance and flexibility will be present only in a well-managed environment.
• Data center managers need to be able to anticipate and prevent catastrophic failures so that the integrity of the data, and by extension the business processes, is maintained. IT organizations often overlook and therefore underinvest in this area.
Platform Management Layer
• The main role of this layer is to provide different tools and query languages for accessing NoSQL (Not only SQL) databases and for using the HDFS storage file system that sits on top of the Hadoop physical infrastructure layer.
• It manages the core components of Hadoop, such as HDFS and MapReduce, and other tools to store, access, and analyse large amounts of data using real-time analysis.
• These technologies handle the fundamental problem of processing data timely, efficiently, and cost-effectively.
Platform Management Layer (cont.)
Key building blocks of the Hadoop platform management layer are:
• MapReduce: A combination of the map and reduce functions. Map is a component that distributes a problem across a large number of systems. After the distributed tasks complete, the reduce function combines all the elements back together to provide an aggregate result.
• Hive: Provides an SQL-like query language named Hive Query Language (HQL) for querying data stored in a Hadoop cluster.
• Pig: A scripting language used for batch processing of huge amounts of data, allowing parallel processing in HDFS.
• HBase: A column-oriented database that provides fast handling of big data.
• Sqoop: A command-line tool that helps import individual tables, specific columns, or entire database files directly into a distributed file system.
• ZooKeeper: Helps coordinate and keep multiple Hadoop instances and nodes in synchronization, and provides protection to every node from failing.
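The map/reduce combination described above fits in a few lines of plain Python. This is a single-machine sketch of the classic word-count job, with an explicit shuffle step between the two phases; in real Hadoop the map and reduce tasks run in parallel across the cluster and the framework performs the shuffle.

```python
from collections import defaultdict

# MapReduce word count in miniature: map emits (word, 1) pairs, a shuffle
# step groups the pairs by key, and reduce sums each group.

def map_phase(line):
    # Map: one (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big stack", "data lake"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'stack': 1, 'lake': 1}
```

Because map works line by line and reduce works key by key, both phases parallelise naturally, which is exactly what the distributed engine exploits.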
Data Processing Layer
• In this layer, the data collected in the previous layer is processed and made ready to route to different destinations.
• Batch processing system: a pure batch processing system for offline analytics (Sqoop).
• Near-real-time processing system: a pure online processing system for online analytics (Storm).
• In-memory processing engine: efficiently executes streaming, machine learning, or SQL workloads that require fast iterative access to datasets (Spark).
• Distributed stream processing: provides results that are accurate even in the case of out-of-order or late-arriving data (Flink).
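The core idea behind the stream-processing engines named above can be sketched with a tumbling-window aggregation: events are bucketed into fixed-length time windows and counted per window. This is a single-process toy, the timestamps, keys, and window size are invented for the example, and real engines like Storm, Spark Streaming, or Flink additionally handle distribution, out-of-order events, and fault tolerance.

```python
from collections import defaultdict

# Tumbling-window aggregation: each event falls into exactly one fixed-size
# time window, and events are counted per key within each window.

def tumbling_window_counts(events, window_seconds):
    # events: iterable of (timestamp_seconds, key) pairs.
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - ts % window_seconds   # left edge of the window
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(1, "click"), (4, "click"), (11, "buy"), (13, "click")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)  # {0: {'click': 2}, 10: {'buy': 1, 'click': 1}}
```

Each incoming event updates only its own window, so results for a window can be emitted as soon as the window closes, which is what makes near-real-time output possible.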
Data Query Layer
• This is the layer where strong analytic processing takes place. Data analytics is an essential step which solves the inefficiencies of traditional data platforms in handling large amounts of data related to interactive queries, ETL (Extract, Transform & Load), storage, and processing.
• Tools: Hive, Spark SQL, Presto, Redshift
• Data Warehouse: a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.
• Data Lake: a cloud-based enterprise architecture that structures data in a more scalable way that makes it easier to experiment with it. All data is retained.
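The kind of warehouse-style aggregate query these tools run can be shown in miniature with Python's built-in sqlite3; the table name, columns, and rows are invented for illustration, and the same SQL idea applies at cluster scale in Hive, Spark SQL, or Presto.

```python
import sqlite3

# A warehouse-style GROUP BY query at toy scale: total sales per season.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, season TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "winter", 120.0),
    ("north", "summer", 80.0),
    ("south", "winter", 200.0),
])
rows = conn.execute(
    "SELECT season, SUM(amount) FROM sales GROUP BY season ORDER BY season"
).fetchall()
print(rows)  # [('summer', 80.0), ('winter', 320.0)]
```

The query layer's job is to make exactly this kind of interactive aggregation fast even when the `sales` table holds billions of rows spread across many machines.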
Security Layer
• It has the mechanisms for providing security while data is analyzed over multiple distributed systems.
• Privacy preservation, auditing, and role-based access mechanisms provide security to the data both at rest and in transit.
• It involves developing secure frameworks that allow organizations to publish and use analytics safely, based on several authentication mechanisms such as one-time passwords (OTP), multilevel authentication, and role-based access mechanisms.
• It maintains users' privacy and security: confidentiality, integrity, and authentication mechanisms to validate the users.
• It must ensure secure communication between nodes by using the Secure Sockets Layer (SSL).
• The security layer handles the basic security principles that a Big Data architecture should follow.
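For the SSL point above, Python's standard-library ssl module shows what "secure communication between nodes" means in practice: a properly configured context requires a valid certificate and checks the hostname before any data flows. This sketch only builds and inspects the context; wiring it to an actual socket between nodes is deployment-specific.

```python
import ssl

# Build a TLS context with secure defaults: certificate verification and
# hostname checking are enabled, so a node cannot be impersonated silently.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A client would then wrap its socket with `ctx.wrap_socket(sock, server_hostname=...)` so the handshake verifies the peer before any query or result crosses the wire.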
Monitoring Layer
• It consists of a number of monitoring systems.
• These systems are automatically aware of the configurations and functions of different operating systems and hardware.
• It also provides the facility of machine-to-machine communication, with monitoring tools exchanging data in XML (Extensible Markup Language) over a high-level protocol.
• All monitoring systems provide tools for data storage and visualisation.
Analytics Engine
• Along the transformation path from big data to information to knowledge lie a host of analytics techniques and approaches.
• The role of the analytics engine is to analyze huge amounts of unstructured data.
• It is useful to look at the range of big data analytics through the following four categories:
– Exploration, including visualization
– Explanation
– Prediction
– Prescription
• Different types of engines are used for analysing big data:
– Search engine: required because the data loaded from various sources has to be indexed and searched for Big Data analytics processing.
– Real-time engine: required to analyse data generated by real-time applications.
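The indexing step the search engine depends on can be sketched as an inverted index: a map from each term to the set of documents containing it, so a lookup never has to scan every document. The documents and terms below are invented for the example; production engines add tokenisation, ranking, and distribution on top of this core structure.

```python
# Minimal inverted index: term -> set of document ids containing that term.

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        # set() deduplicates terms within a document.
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "big data stack", 2: "data lake storage", 3: "big lake"}
index = build_inverted_index(docs)
print(sorted(index["data"]), sorted(index["big"]))  # [1, 2] [1, 3]
```

A query for several terms then reduces to intersecting the corresponding sets, which stays fast no matter how many documents the collection holds.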
Visualization Layer
• This layer focuses on Big Data visualization. We need something that will grab people's attention, pull them in, and make the findings well understood. This is where the data's value is perceived by the user.
• Dashboards: save, share, and communicate insights. They help users generate questions by revealing the depth, range, and content of their data stores.
– Tools: Tableau, AngularJS, Kibana, React.js
• Recommenders: recommender systems focus on the task of information filtering, which deals with the delivery of items, selected from a large collection, that the user is likely to find interesting or useful.
Big Data Applications
• Big data management strategies and best practices are still evolving, but joining the big data movement has become an imperative for companies across a wide variety of industries.
• Different types of tools and applications are used to implement the Big Data stack architecture.
• The applications can be categorised as:
– Horizontal: applications used to address problems that are common across industries.
– Vertical: applications used to solve industry-specific problems.