Big data technology
foundations
Exploring the Big Data Stack
• Big data architecture is the foundation for big data analytics.
• Designing any kind of data architecture begins with creating a model that
gives a complete view of all the required elements.
• Designing the model may take extra time, but the subsequent
implementation of the model can save a significant amount of time, effort, and
rework.
• The configuration of the model/architecture may vary depending upon the specific
needs of the organisation.
• But for any data architecture, the basic layers and components remain
more or less the same.
• To design a big data architecture model we need to think of Big Data as a strategy
and not a project.
Do I Need Big Data Architecture?
• Not everyone needs to leverage a big data architecture.
• Single computing tasks rarely top more than 100GB of data, which does
not require a big data architecture.
• Unless you are analyzing terabytes and petabytes of data – and doing it
consistently -- look to a scalable server instead of a massively scale-out
architecture like Hadoop.
• If you need analytics, then consider a scalable array that offers native
analytics for stored data.
Do I Need Big Data Architecture? (cont.)
You probably do need big data architecture if any of the following applies to you:
• You want to extract information from extensive networking or web logs.
• You process massive datasets over 100GB in size. Some of these computing tasks run 8
hours or longer.
• You are willing to invest in a big data project, including third-party products to optimize your
environment.
• You store large amounts of unstructured data that you need to summarize or transform into a
structured format for better analytics.
• You have multiple large data sources to analyze, including structured and unstructured.
• You want to proactively analyze big data for business needs, such as analyzing store sales by
season and advertising, applying sentiment analysis to social media posts, or investigating
email for suspicious communication patterns – or all the above.
Big Data Architecture
The strategy includes the design principles related to the creation of an environment to
support Big Data. The principles deal with storage of data, analytics, reporting, and
applications.
• During the creation of a Big Data architecture, consideration is required of the hardware, software
infrastructure, operational software, management software, APIs, and software developer tools.
• The architecture of Big Data environment must fulfill all fundamental requirements to perform the
following functions:
 Capturing data from different sources
 Cleaning and integrating data of different types and formats
 Sorting and organising data
 Analysing data
 Identifying relationships and patterns
 Deriving conclusions based on the data analysis results.
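The fundamental functions above can be sketched end to end in a few lines of plain Python. This is an illustrative toy pipeline only; the record formats, field names, and sources are hypothetical, not part of any real system.

```python
# Toy sketch of the fundamental functions: capture, clean/integrate,
# sort, analyse, and derive a conclusion. All data is illustrative.

# Capture: records arriving from two hypothetical sources in different formats
source_a = [{"city": "Pune", "sales": "120"}, {"city": "Delhi", "sales": "95"}]
source_b = [("Pune", 80), ("Delhi", None)]  # tuple format, with a missing value

# Clean and integrate: normalise both formats, dropping invalid records
records = [{"city": r["city"], "sales": int(r["sales"])} for r in source_a]
records += [{"city": c, "sales": s} for c, s in source_b if s is not None]

# Sort and organise
records.sort(key=lambda r: r["city"])

# Analyse: total sales per city
totals = {}
for r in records:
    totals[r["city"]] = totals.get(r["city"], 0) + r["sales"]

# Identify a pattern / derive a conclusion: the best-performing city
best = max(totals, key=totals.get)
print(totals, best)
```

A real Big Data architecture performs these same steps, but distributed over many machines and data formats.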
Stack of Layers in Big Data Architecture
Big Data architecture comprises the following basic layers and components:
 Data Sources Layer
 Ingestion Layer
 Storage Layer
 Physical Infrastructure Layer
 Platform Management Layer
 Data Processing Layer
 Data Query Layer
 Security Layer
 Monitoring Layer
 Analytics Engine
 Visualization Layer
 Big Data Applications
(Figure: arrangement of the various layers in the Big Data architecture)
Stack of Layers in Big Data Architecture
Data Sources Layer
 Data sources for big data architecture are all over the map. The bottom layer of the stack is
the foundation and is known as the data layer.
 Data can come through from company servers and sensors, or from third-party data
providers.
 The big data environment can ingest data in batch mode or real-time.
 The basic function of the data sources layer is to absorb and integrate the data coming from
various sources with different formats at varying velocity.
 Before being sent to the data stack for logical use, the data has to be validated, sorted, and cleaned.
 This layer feeds Hadoop distributions, NoSQL databases, and other relational
databases.
 A few data source examples include enterprise applications like ERP or CRM, MS Office
docs, data warehouses and relational database management systems (RDBMS), databases,
mobile devices, sensors, social media, and email.
Ingestion Layer
 This layer is the first step in the journey of data coming from various sources.
Here, data is prioritised and categorised, which makes it flow smoothly through the later layers.
 In this layer we plan how to ingest data flowing from hundreds or thousands of sources
into the data centre.
 Data ingestion means taking data coming from multiple sources and putting it
somewhere it can be accessed.
 It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
 This layer separates noise from relevant information.
 Data can be streamed in real time or ingested in batches. When data is ingested in real
time, each item is ingested as soon as it arrives. When data is ingested in
batches, items are ingested in chunks at periodic intervals. Ingestion
is the process of bringing data into the data processing system.
Ingestion Layer (cont.)
In the ingestion layer the data passes
through the following stages:
 Identification
 Filtration
 Validation
 Noise reduction
 Transformation
 Compression
 Integration
Ingestion Layer (cont.)
 Identification: Data is categorised into various known data formats, or unstructured data is assigned
default formats.
 Filtration: The information relevant for the enterprise is filtered on the basis of the Enterprise Master
Data Management (MDM) repository.
 Validation: The filtered data is analysed against the MDM metadata.
 Noise reduction: Data is cleaned by removing the noise and minimising the related disturbances.
 Transformation: Data is split or combined on the basis of its type, contents, and the requirement of the
organisation.
 Compression: The size of the data is reduced without affecting its relevance for the required process. It
should be remembered that compression does not affect the analysis results.
 Integration: The refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop
Distributed File System (HDFS) and NoSQL databases.
 Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to
ETL (Extract, Transform and Load) in traditional warehouses.
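The ingestion stages can be sketched as a chain of small functions over a batch of raw records. This is a hedged toy model: the stage logic, field names, and "MDM-style" validation rule are illustrative placeholders, not a real ingestion framework.

```python
import json
import zlib

# Toy ingestion pipeline: each stage is a function over a batch of records.
def identify(raw):
    # Identification: tag each record with a (here, assumed) detected format.
    return [{"fmt": "json", "data": r} for r in raw]

def filtrate(recs):
    # Filtration: keep only records relevant to the enterprise (toy rule).
    return [r for r in recs if "user" in r["data"]]

def validate(recs):
    # Validation: check records against expected (MDM-style) fields.
    return [r for r in recs if set(json.loads(r["data"])) >= {"user", "event"}]

def reduce_noise(recs):
    # Noise reduction: drop exact duplicates.
    seen, out = set(), []
    for r in recs:
        if r["data"] not in seen:
            seen.add(r["data"])
            out.append(r)
    return out

def compress(recs):
    # Compression: shrink payloads without losing content.
    return [zlib.compress(r["data"].encode()) for r in recs]

raw_batch = [
    '{"user": "u1", "event": "click"}',
    '{"user": "u1", "event": "click"}',   # duplicate -> removed as noise
    '{"event": "view"}',                  # irrelevant -> filtered out
]
batch = compress(reduce_noise(validate(filtrate(identify(raw_batch)))))
print(len(batch))   # only the valid, deduplicated record remains
```

Transformation and integration (writing the refined batch into HDFS/NoSQL) would follow the same pattern as additional stages.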
Storage Layer
 Storage becomes a challenge when the size of the data you are dealing with becomes large.
 Finding a storage solution is very important when your data grows large.
This layer focuses on "where to store such large data efficiently."
 Hadoop is an open-source framework commonly used to store high volumes of data in a distributed
manner across multiple machines.
 There are two major components of Hadoop: a scalable Hadoop Distributed File System
(HDFS) that can support petabytes of data, and a MapReduce engine that computes
results in batches.
 Hadoop has its own database, known as HBase; others, including Amazon's
DynamoDB, MongoDB, and Cassandra (used by Facebook), all based on the NoSQL
architecture, are popular too.
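The core HDFS idea, splitting a large file into fixed-size blocks and replicating each block across machines, can be sketched in a few lines. The block size and replication factor below are common HDFS defaults; the node names and round-robin placement are illustrative simplifications (real HDFS placement is rack-aware).

```python
# Sketch of HDFS-style storage: split a file into fixed-size blocks and
# place replicas of each block on different machines.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # default: each block stored on 3 nodes

def plan_blocks(file_size, nodes):
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    layout = []
    for i in range(n_blocks):
        # Simplified round-robin replica placement over the available nodes
        replicas = [nodes[(i + k) % len(nodes)] for k in range(REPLICATION)]
        layout.append({"block": i, "replicas": replicas})
    return layout

nodes = ["node1", "node2", "node3", "node4"]
layout = plan_blocks(file_size=300 * 1024 * 1024, nodes=nodes)  # 300 MB file
print(len(layout))   # 300 MB / 128 MB -> 3 blocks
```

Because every block lives on several nodes, the loss of a single machine costs no data, which is the redundancy property the physical infrastructure layer below depends on.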
Digging into Big Data Technology Components
Physical Infrastructure Layer
 As big data is all about high-velocity, high-volume, and high-data variety, the physical
infrastructure will literally “make or break” the implementation.
 Most big data implementations need to be highly available, so the networks, servers, and
physical storage must be both resilient and redundant. Resiliency and redundancy are
interrelated.
 An infrastructure, or a system, is resilient to failure or changes when sufficient redundant
resources are in place, ready to jump into action.
 Redundancy ensures that a single malfunction won't cause an outage. Resiliency helps to
eliminate single points of failure in your infrastructure.
 This means that the technical and operational complexity is masked behind a collection of
services, each with specific terms for performance, availability, recovery, and so on. These
terms are described in service-level agreements (SLAs) and are usually negotiated between the
service provider and the customer, with penalties for noncompliance.
Physical Infrastructure Layer (cont.)
 A prioritized list of big data principles should include statements about the following:
 Performance: How responsive do you need the system to be? Performance, often measured as latency,
is evaluated end to end, based on a single transaction or query request.
 Availability: Do you need a 100 percent uptime guarantee of service? How long can your
business wait in the case of a service interruption or failure?
 Scalability: How big does your infrastructure need to be? How much disk space is needed
today and in the future? How much computing power do you need? Typically, you need to
decide what you need and then add a little more scale for unexpected challenges.
 Flexibility: How quickly can you add more resources to the infrastructure? How quickly can
your infrastructure recover from failures?
 Cost: What can you afford? Because the infrastructure is a set of components, you might be
able to buy the “best” networking and decide to save money on storage. You need to establish
requirements for each of these areas in the context of an overall budget and then make trade-
offs where necessary.
A. PHYSICAL REDUNDANT NETWORKS
 Networks should be redundant and must have enough capacity to accommodate the
anticipated volume and velocity of the inbound and outbound data in addition to the
“normal” network traffic experienced by the business.
 As you begin making big data an integral part of your computing strategy, it is
reasonable to expect volume and velocity to increase.
 Infrastructure designers should plan for these expected increases and try to create
physical implementations that are “elastic.”
 As network traffic ebbs and flows, so too does the set of physical assets associated with
the implementation.
 Your infrastructure should offer monitoring capabilities so that operators can react when
more resources are required to address changes in workloads.
B. MANAGE HARDWARE: STORAGE AND SERVERS
 The hardware (storage and server) assets must have sufficient speed and capacity to
handle all expected big data capabilities.
 It’s of little use to have a high-speed network with slow servers because the servers will
most likely become a bottleneck.
 However, a very fast set of storage and compute servers can overcome variable network
performance.
 Of course, nothing will work properly if network performance is poor or unreliable.
C. INFRASTRUCTURE OPERATIONS
 Another important design consideration is infrastructure operations management.
 The greatest levels of performance and flexibility will be present only in a well-managed
environment.
 Data center managers need to be able to anticipate and prevent catastrophic failures so
that the integrity of the data, and by extension the business processes, is maintained. IT
organizations often overlook and therefore underinvest in this area.
Platform Management Layer
 The main role of this layer is to provide tools and query languages for
accessing NoSQL (Not only SQL) databases and for using the HDFS storage file
system that sits on top of the Hadoop physical infrastructure layer.
 It manages the core components of Hadoop, such as HDFS and MapReduce, along with other tools
to store, access, and analyse large amounts of data, including real-time analysis.
 These technologies address the fundamental problem of processing data in a timely,
efficient, and cost-effective manner.
Platform Management Layer (cont.)
 Key building blocks of the Hadoop platform management layer are:
 MapReduce: A combination of the map and reduce functions. Map is a component that
distributes a problem across a large number of systems. After the distributed tasks complete, the
reduce function combines all the elements back together to provide an aggregate result.
 Hive: Provides an SQL-like query language, named Hive Query Language (HQL), for
querying data stored in a Hadoop cluster.
 Pig: A scripting language used for batch processing of huge amounts of data, allowing
parallel processing over HDFS.
 HBase: A column-oriented database that provides fast handling of big data.
 Sqoop: A command-line tool that helps import individual tables, specific columns, or
entire database files directly into the distributed file system.
 ZooKeeper: Helps coordinate and keep multiple Hadoop instances and nodes in
synchronisation, and protects every node from failing.
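The MapReduce idea can be sketched in plain Python with the classic word-count example: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. A real MapReduce engine runs these same phases in parallel across many machines; here everything runs in one process for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in real MapReduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big stack", "data stack"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 2, 'data': 2, 'stack': 2}
```

In Hadoop, each mapper would process a different HDFS block, and the shuffle would move data over the network between mapper and reducer nodes.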
Data Processing Layer
 In this Layer, data collected in the previous layer is processed and made ready
to route to different destinations.
 Batch processing system - A pure batch processing system for offline
analytics (Sqoop).
 Near real time processing system - A pure online processing system for
online analytics (Storm).
 In-memory processing engine - Efficiently execute streaming, machine
learning or SQL workloads that require fast iterative access to datasets
(Spark)
 Distributed stream processing - Provides results that are accurate, even in
the case of out-of-order or late-arriving data (Flink)
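The difference between batch and near-real-time (micro-batch) processing can be sketched in plain Python. The numbers and chunk size are illustrative; engines such as Spark and Storm distribute this work across a cluster.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]   # an illustrative stream of measurements

# Batch processing: the whole dataset is available before processing starts.
batch_result = sum(events)

# Micro-batch processing: process fixed-size chunks as they arrive, keeping a
# running result that is up to date after every chunk.
micro_batches = [events[i:i + 3] for i in range(0, len(events), 3)]
running = 0
partials = []
for chunk in micro_batches:
    running += sum(chunk)
    partials.append(running)

print(batch_result, partials)
```

Both approaches converge on the same final answer; the trade-off is latency (how soon a partial result is available) versus throughput and simplicity.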
Data Query Layer
 This is the layer where strong analytic processing takes place. Data analytics
is an essential step that addresses the inefficiencies of traditional data platforms
in handling large amounts of data for interactive queries,
ETL (Extract, Transform & Load), storage, and processing.
 Tools – Hive, Spark SQL, Presto, Redshift
 Data Warehouse - Centralized repository that stores data from multiple
information sources and transforms them into a common, multidimensional
data model for efficient querying and analysis.
 Data Lake - Cloud-based enterprise architecture that structures data in a more
scalable way, making it easier to experiment with. All data is retained.
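A warehouse-style, multidimensional aggregation can be illustrated with a short SQL query. Here `sqlite3` stands in for engines such as Hive, Spark SQL, or Presto (the HQL/Spark SQL syntax for this query would be essentially the same); the `sales` table and its columns are made up for the example.

```python
import sqlite3

# In-memory database standing in for a data warehouse / query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, season TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "summer", 100.0), ("east", "winter", 40.0),
     ("west", "summer", 70.0), ("west", "winter", 90.0)],
)

# Multidimensional-style query: total sales per region and season.
rows = conn.execute(
    "SELECT region, season, SUM(amount) FROM sales "
    "GROUP BY region, season ORDER BY region, season"
).fetchall()
print(rows)
```

The point of tools like Hive and Presto is that the same declarative query runs unchanged whether the table holds kilobytes or petabytes.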
Security Layer
 It provides the mechanisms for securing data while it is analysed over multiple
distributed systems.
 Privacy preservation, auditing, and role-based access mechanisms provide security
for data both at rest and in transit.
 Secure frameworks allow organisations to publish and use analytics safely, based on
authentication mechanisms such as one-time passwords (OTP), multilevel
authentication, and role-based access mechanisms.
 It maintains user privacy and security: confidentiality, integrity, and authentication
mechanisms validate the users.
 It must ensure secure communication between nodes by using the Secure Sockets
Layer (SSL).
 The security layer handles the basic security principles that Big Data architecture should
follow.
Monitoring Layer
 It consists of a number of monitoring systems.
 These systems automatically discover the configurations and functions of
different operating systems and hardware.
 It also facilitates machine-to-machine communication for monitoring tools using XML
(Extensible Markup Language) over high-level protocols.
 All monitoring systems provide tools for data storage and visualisation.
Analytics Engine
 Along the transformation path from big data to information to knowledge, lie a host of
analytics techniques and approaches.
 The role of the analytics engine is to analyze huge amounts of unstructured data.
 It is useful to look at the range of big data analytics through the following four categories:
 Exploration including visualization
 Explanation
 Prediction
 Prescription
 Different types of engines are used for analysing big data:
 Search engine: Required because the data loaded from various sources has to be
indexed and searched for Big Data analytics processing.
 Real-time engine: Required to analyse data generated by real-time applications.
Visualization Layer
 This layer focuses on Big Data visualization. We need
something that will grab people's attention, pull
them in, and make the findings well understood. This is
where the value of the data is perceived by the user.
 Dashboards – Save, share, and communicate
insights. It helps users generate questions by
revealing the depth, range, and content of their data
stores.
– Tools - Tableau, AngularJS, Kibana, React.js
 Recommenders - Recommender systems focus on
the task of information filtering, which deals with the
delivery of items selected from a large collection
that the user is likely to find interesting or useful.
Big Data Applications
 Big data management strategies and best practices are still evolving, but joining the
big data movement has become an imperative for companies across a wide variety
of industries.
 Different types of tools and applications are used to implement Big Data stack
architecture.
 The applications can be categorised as:
 Horizontal: Applications used to address problems that are common
across industries.
 Vertical: Applications used to solve industry-specific problems.
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 

Dernier (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 

Lecture 4: Big data technology foundations

Exploring the Big Data Stack
• Big data architecture is the foundation for big data analytics.
• Designing any kind of data architecture is the process of creating a model that gives a complete view of all the required elements.
• Designing the model may consume more time up front, but subsequently implementing it can save a significant amount of time, effort, and rework.
• The configuration of the model/architecture may vary depending upon the specific needs of the organisation.
• But for any data architecture, the basic layers and components remain more or less the same.
• To design a big data architecture model, we need to think of Big Data as a strategy and not a project.
Do I Need Big Data Architecture?
• Not everyone needs to leverage big data architecture.
• Single computing tasks rarely top 100 GB of data, which does not require a big data architecture.
• Unless you are analyzing terabytes and petabytes of data – and doing it consistently – look to a scalable server instead of a massively scale-out architecture like Hadoop.
• If you need analytics, then consider a scalable array that offers native analytics for stored data.
Do I Need Big Data Architecture? (cont.)
You probably do need big data architecture if any of the following applies to you:
• You want to extract information from extensive networking or web logs.
• You process massive datasets over 100 GB in size. Some of these computing tasks run 8 hours or longer.
• You are willing to invest in a big data project, including third-party products to optimize your environment.
• You store large amounts of unstructured data that you need to summarize or transform into a structured format for better analytics.
• You have multiple large data sources to analyze, including structured and unstructured ones.
• You want to proactively analyze big data for business needs, such as analyzing store sales by season and advertising, applying sentiment analysis to social media posts, or investigating email for suspicious communication patterns – or all of the above.
Big Data Architecture
• The strategy includes the design principles related to the creation of an environment to support Big Data. The principles deal with the storage of data, analytics, reporting, and applications.
• During the creation of a Big Data architecture, consideration is required of hardware, software infrastructure, operational software, management software, APIs, and software developer tools.
• The architecture of a Big Data environment must fulfil all fundamental requirements to perform the following functions:
– Capturing data from different sources
– Cleaning and integrating data of different types and formats
– Sorting and organising data
– Analysing data
– Identifying relationships and patterns
– Deriving conclusions based on the data analysis results
Stack of Layers in Big Data Architecture
Big Data architecture comprises the following basic layers and components:
– Data Sources Layer
– Ingestion Layer
– Storage Layer
– Physical Infrastructure Layer
– Platform Management Layer
– Data Processing Layer
– Data Query Layer
– Security Layer
– Monitoring Layer
– Analytics Engine
– Visualization Layer
– Big Data Applications
Arrangement of the various layers in the Big Data architecture is shown in the figure.
  • 7. Stack of Layers in Big Data Architecture
Data Sources Layer
• Data sources for big data architecture are all over the map. The bottom layer of the stack is the foundation and is known as the data layer.
• Data can come from company servers and sensors, or from third-party data providers.
• The big data environment can ingest data in batch mode or in real time.
• The basic function of the data sources layer is to absorb and integrate data coming from various sources, in different formats and at varying velocity.
• Before being sent to the data stack for logical use, data has to be validated, sorted, and cleaned.
• This layer feeds the Hadoop distributions, NoSQL databases, and other relational databases.
• A few data source examples include enterprise applications like ERP or CRM, MS Office docs, data warehouses and relational database management systems (RDBMS), databases, mobile devices, sensors, social media, and email.
Ingestion Layer
• This layer is the first step of the journey for data coming from variable sources. Here data is prioritised and categorised, which makes the data flow smoothly through the further layers.
• In this layer we plan the way to ingest data flows from hundreds or thousands of sources into the data center.
• Data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed.
• It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
• This layer separates noise from relevant information.
• Data can be streamed in real time or ingested in batches. When data is ingested in real time, it is ingested immediately as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals of time.
• Ingestion is the process of bringing data into the data processing system.
Ingestion Layer (cont.)
In the ingestion layer the data passes through the following stages:
– Identification
– Filtration
– Validation
– Noise reduction
– Transformation
– Compression
– Integration
Ingestion Layer (cont.)
• Identification: Data is categorised into various known data formats, or unstructured data is assigned default formats.
• Filtration: The information relevant to the enterprise is filtered on the basis of the Enterprise Master Data Management (MDM) repository.
• Validation: The filtered data is analysed against the MDM metadata.
• Noise reduction: Data is cleaned by removing the noise and minimising related disturbances.
• Transformation: Data is split or combined on the basis of its type, its contents, and the requirements of the organisation.
• Compression: The size of the data is reduced without affecting its relevance for the required process. It should be remembered that compression does not affect the analysis results.
• Integration: The refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop Distributed File System (HDFS) and NoSQL databases.
• Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform and Load) in traditional warehouses.
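The staged pipeline above can be sketched as a chain of functions over raw records. This is a minimal, illustrative sketch: the stage names follow the slide, but the record fields, the `relevant_sources` set, and the logic inside each stage are assumptions, not any real ingestion framework. Transformation, compression, and integration are omitted for brevity.

```python
# Minimal sketch of an ingestion pipeline: each stage takes a list of raw
# records and returns a refined list. All field names are illustrative; a
# real pipeline (e.g. on Hadoop) would run these stages in a distributed way.

def identify(records):
    # Identification: tag records with a detected format; unknown ones get a default.
    return [dict(r, fmt=r.get("fmt", "unstructured")) for r in records]

def filter_relevant(records, relevant_sources):
    # Filtration: keep only records whose source is known to the (hypothetical) MDM repository.
    return [r for r in records if r["source"] in relevant_sources]

def validate(records):
    # Validation: drop records missing a payload.
    return [r for r in records if r.get("payload") is not None]

def reduce_noise(records):
    # Noise reduction: strip stray whitespace from the payload.
    return [dict(r, payload=r["payload"].strip()) for r in records]

def ingest(records, relevant_sources):
    # Chain the stages in the order the slide lists them.
    for stage in (identify,
                  lambda rs: filter_relevant(rs, relevant_sources),
                  validate,
                  reduce_noise):
        records = stage(records)
    return records

raw = [
    {"source": "crm", "payload": "  order #12  "},
    {"source": "spam", "payload": "buy now"},
    {"source": "erp", "payload": None},
]
clean = ingest(raw, relevant_sources={"crm", "erp"})
print(clean)  # only the cleaned CRM record survives
```

Each stage is a pure list-to-list function, so stages can be reordered, tested, or parallelised independently.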
Storage Layer
• Storage becomes a challenge when the size of the data you are dealing with becomes large.
• Finding a storage solution is very important when your data grows large. This layer focuses on "where to store such large data efficiently."
• Hadoop is an open-source framework normally used to store high volumes of data in a distributed manner across multiple machines.
• There are two major components of Hadoop: a scalable Hadoop Distributed File System (HDFS) that can support petabytes of data, and a MapReduce engine that computes results in batches.
• Hadoop has its own database, known as HBase, but others, including Amazon's DynamoDB, MongoDB, and Cassandra (used by Facebook) – all based on the NoSQL architecture – are popular too.
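The core HDFS idea, splitting a file into fixed-size blocks and replicating each block on several machines, can be illustrated at toy scale. The block size and replication factor below mirror HDFS defaults (128 MB blocks, 3 copies) scaled down to a few bytes; the round-robin placement is a simplification, since real HDFS placement is rack-aware.

```python
# Toy illustration of HDFS-style storage: a file is split into fixed-size
# blocks, and each block is replicated on several nodes so one node failure
# loses no data. Sizes are shrunk for the example.

BLOCK_SIZE = 4    # bytes here; 128 MB by default in real HDFS
REPLICATION = 3   # HDFS default replication factor

def split_blocks(data, block_size=BLOCK_SIZE):
    # Cut the byte string into consecutive fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` nodes, round-robin.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_blocks(b"hello big data")
layout = place_blocks(blocks, nodes=["n1", "n2", "n3", "n4"])
print(len(blocks), layout[0])
```

With 14 bytes and 4-byte blocks the file becomes four blocks, each stored on three of the four nodes.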
  • 13. Digging into Big Data Technology Components
Physical Infrastructure Layer
• As big data is all about high velocity, high volume, and high data variety, the physical infrastructure will literally "make or break" the implementation.
• Most big data implementations need to be highly available, so the networks, servers, and physical storage must be both resilient and redundant. Resiliency and redundancy are interrelated.
• An infrastructure, or a system, is resilient to failure or changes when sufficient redundant resources are in place, ready to jump into action.
• Redundancy ensures that such a malfunction won't cause an outage. Resiliency helps to eliminate single points of failure in your infrastructure.
• This means that the technical and operational complexity is masked behind a collection of services, each with specific terms for performance, availability, recovery, and so on. These terms are described in service-level agreements (SLAs) and are usually negotiated between the service provider and the customer, with penalties for noncompliance.
Physical Infrastructure Layer (cont.)
A prioritized list of big data principles should include statements about the following:
• Performance: How responsive do you need the system to be? Performance, also called latency, is often measured end to end, based on a single transaction or query request.
• Availability: Do you need a 100 percent uptime guarantee of service? How long can your business wait in the case of a service interruption or failure?
• Scalability: How big does your infrastructure need to be? How much disk space is needed today and in the future? How much computing power do you need? Typically, you need to decide what you need and then add a little more scale for unexpected challenges.
• Flexibility: How quickly can you add more resources to the infrastructure? How quickly can your infrastructure recover from failures?
• Cost: What can you afford? Because the infrastructure is a set of components, you might be able to buy the "best" networking and decide to save money on storage. You need to establish requirements for each of these areas in the context of an overall budget and then make trade-offs where necessary.
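The availability question above has simple arithmetic behind it: an uptime percentage translates directly into permitted downtime per year. A quick worked example (the percentages are the usual SLA tiers, not figures from the slides):

```python
# How much downtime per year does a given uptime guarantee actually allow?
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(uptime_pct):
    # Fraction of the year the service may be down, in hours.
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime -> {downtime_hours(pct):.2f} h downtime/year")
```

So "three nines" (99.9%) still permits almost nine hours of outage a year, which is why SLAs spell out the exact figure and the penalties for missing it.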
A. PHYSICAL REDUNDANT NETWORKS
• Networks should be redundant and must have enough capacity to accommodate the anticipated volume and velocity of the inbound and outbound data, in addition to the "normal" network traffic experienced by the business.
• As you begin making big data an integral part of your computing strategy, it is reasonable to expect volume and velocity to increase.
• Infrastructure designers should plan for these expected increases and try to create physical implementations that are "elastic."
• As network traffic ebbs and flows, so too does the set of physical assets associated with the implementation.
• Your infrastructure should offer monitoring capabilities so that operators can react when more resources are required to address changes in workloads.
B. MANAGE HARDWARE: STORAGE AND SERVERS
• The hardware (storage and server) assets must have sufficient speed and capacity to handle all expected big data capabilities.
• It's of little use to have a high-speed network with slow servers, because the servers will most likely become a bottleneck.
• However, a very fast set of storage and compute servers can overcome variable network performance.
• Of course, nothing will work properly if network performance is poor or unreliable.
C. INFRASTRUCTURE OPERATIONS
• Another important design consideration is infrastructure operations management.
• The greatest levels of performance and flexibility will be present only in a well-managed environment.
• Data center managers need to be able to anticipate and prevent catastrophic failures so that the integrity of the data, and by extension the business processes, is maintained. IT organizations often overlook and therefore underinvest in this area.
Platform Management Layer
• The main role of this layer is to provide different tools and query languages for accessing NoSQL (Not only SQL) databases and for using the HDFS storage file system that sits on top of the Hadoop physical infrastructure layer.
• It manages the core components of Hadoop, such as HDFS and MapReduce, and other tools to store, access, and analyse large amounts of data using real-time analysis.
• These technologies handle the fundamental problem of processing data timely, efficiently, and cost-effectively.
Platform Management Layer (cont.)
Key building blocks of the Hadoop platform management layer are:
• MapReduce: A combination of the map and reduce functions. Map is a component that distributes a problem across a large number of systems. After the distributed tasks complete, the reduce function combines all the elements back together to provide an aggregate result.
• Hive: Provides an SQL-like query language named Hive Query Language (HQL) for querying data stored in a Hadoop cluster.
• Pig: A scripting language used for batch processing of huge amounts of data, allowing parallel processing in HDFS.
• HBase: A column-oriented database that provides fast handling of big data.
• Sqoop: A command-line tool that helps import individual tables, specific columns, or entire database files directly into a distributed file system.
• ZooKeeper: Helps coordinate and keep multiple Hadoop instances and nodes in synchronization, and provides protection to every node from failing.
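The map/reduce combination described above fits in a few lines of plain Python. This is a single-machine sketch of the classic word-count job, with an explicit shuffle step between the two phases; in real Hadoop the map and reduce tasks run in parallel across the cluster and the framework performs the shuffle.

```python
from collections import defaultdict

# MapReduce word count in miniature: map emits (word, 1) pairs, a shuffle
# step groups the pairs by key, and reduce sums each group.

def map_phase(line):
    # Map: one (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big stack", "data lake"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'stack': 1, 'lake': 1}
```

Because map works line by line and reduce works key by key, both phases parallelise naturally, which is exactly what the distributed engine exploits.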
Data Processing Layer
• In this layer, the data collected in the previous layer is processed and made ready to route to different destinations.
• Batch processing system: a pure batch processing system for offline analytics (Sqoop).
• Near-real-time processing system: a pure online processing system for online analytics (Storm).
• In-memory processing engine: efficiently executes streaming, machine learning, or SQL workloads that require fast iterative access to datasets (Spark).
• Distributed stream processing: provides results that are accurate even in the case of out-of-order or late-arriving data (Flink).
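The core idea behind the stream-processing engines named above can be sketched with a tumbling-window aggregation: events are bucketed into fixed-length time windows and counted per window. This is a single-process toy, the timestamps, keys, and window size are invented for the example, and real engines like Storm, Spark Streaming, or Flink additionally handle distribution, out-of-order events, and fault tolerance.

```python
from collections import defaultdict

# Tumbling-window aggregation: each event falls into exactly one fixed-size
# time window, and events are counted per key within each window.

def tumbling_window_counts(events, window_seconds):
    # events: iterable of (timestamp_seconds, key) pairs.
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - ts % window_seconds   # left edge of the window
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(1, "click"), (4, "click"), (11, "buy"), (13, "click")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)  # {0: {'click': 2}, 10: {'buy': 1, 'click': 1}}
```

Each incoming event updates only its own window, so results for a window can be emitted as soon as the window closes, which is what makes near-real-time output possible.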
Data Query Layer
• This is the layer where strong analytic processing takes place. Data analytics is an essential step which solves the inefficiencies of traditional data platforms in handling large amounts of data related to interactive queries, ETL (Extract, Transform & Load), storage, and processing.
• Tools: Hive, Spark SQL, Presto, Redshift
• Data Warehouse: a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.
• Data Lake: a cloud-based enterprise architecture that structures data in a more scalable way that makes it easier to experiment with it. All data is retained.
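The kind of warehouse-style aggregate query these tools run can be shown in miniature with Python's built-in sqlite3; the table name, columns, and rows are invented for illustration, and the same SQL idea applies at cluster scale in Hive, Spark SQL, or Presto.

```python
import sqlite3

# A warehouse-style GROUP BY query at toy scale: total sales per season.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, season TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "winter", 120.0),
    ("north", "summer", 80.0),
    ("south", "winter", 200.0),
])
rows = conn.execute(
    "SELECT season, SUM(amount) FROM sales GROUP BY season ORDER BY season"
).fetchall()
print(rows)  # [('summer', 80.0), ('winter', 320.0)]
```

The query layer's job is to make exactly this kind of interactive aggregation fast even when the `sales` table holds billions of rows spread across many machines.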
Security Layer
• It has the mechanisms for providing security while data is analyzed over multiple distributed systems.
• Privacy preservation, auditing, and role-based access mechanisms provide security to the data both at rest and in transit.
• It involves developing secure frameworks that allow organizations to publish and use analytics safely, based on several authentication mechanisms such as one-time passwords (OTP), multilevel authentication, and role-based access mechanisms.
• It maintains users' privacy and security: confidentiality, integrity, and authentication mechanisms to validate the users.
• It must ensure secure communication between nodes by using the Secure Sockets Layer (SSL).
• The security layer handles the basic security principles that a Big Data architecture should follow.
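For the SSL point above, Python's standard-library ssl module shows what "secure communication between nodes" means in practice: a properly configured context requires a valid certificate and checks the hostname before any data flows. This sketch only builds and inspects the context; wiring it to an actual socket between nodes is deployment-specific.

```python
import ssl

# Build a TLS context with secure defaults: certificate verification and
# hostname checking are enabled, so a node cannot be impersonated silently.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A client would then wrap its socket with `ctx.wrap_socket(sock, server_hostname=...)` so the handshake verifies the peer before any query or result crosses the wire.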
Monitoring Layer
• It consists of a number of monitoring systems.
• These systems are automatically aware of the configurations and functions of different operating systems and hardware.
• It also provides the facility of machine-to-machine communication, with monitoring tools exchanging data in XML (Extensible Markup Language) over a high-level protocol.
• All monitoring systems provide tools for data storage and visualisation.
Analytics Engine
• Along the transformation path from big data to information to knowledge lie a host of analytics techniques and approaches.
• The role of the analytics engine is to analyze huge amounts of unstructured data.
• It is useful to look at the range of big data analytics through the following four categories:
– Exploration, including visualization
– Explanation
– Prediction
– Prescription
• Different types of engines are used for analysing big data:
– Search engine: required because the data loaded from various sources has to be indexed and searched for Big Data analytics processing.
– Real-time engine: required to analyse data generated by real-time applications.
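The indexing step the search engine depends on can be sketched as an inverted index: a map from each term to the set of documents containing it, so a lookup never has to scan every document. The documents and terms below are invented for the example; production engines add tokenisation, ranking, and distribution on top of this core structure.

```python
# Minimal inverted index: term -> set of document ids containing that term.

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        # set() deduplicates terms within a document.
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "big data stack", 2: "data lake storage", 3: "big lake"}
index = build_inverted_index(docs)
print(sorted(index["data"]), sorted(index["big"]))  # [1, 2] [1, 3]
```

A query for several terms then reduces to intersecting the corresponding sets, which stays fast no matter how many documents the collection holds.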
Visualization Layer
• This layer focuses on Big Data visualization. We need something that will grab people's attention, pull them in, and make the findings well understood. This is where the data's value is perceived by the user.
• Dashboards: save, share, and communicate insights. They help users generate questions by revealing the depth, range, and content of their data stores.
– Tools: Tableau, AngularJS, Kibana, React.js
• Recommenders: recommender systems focus on the task of information filtering, which deals with the delivery of items, selected from a large collection, that the user is likely to find interesting or useful.
Big Data Applications
• Big data management strategies and best practices are still evolving, but joining the big data movement has become an imperative for companies across a wide variety of industries.
• Different types of tools and applications are used to implement the Big Data stack architecture.
• The applications can be categorised as:
– Horizontal: applications used to address problems that are common across industries.
– Vertical: applications used to solve industry-specific problems.