Review of Big Data Analytics (BDA)
Architecture: Trends and Analysis
Keh Kok Yong, Mohamad Syazwan Shafei, Pek Yin Sian, Meng Wei Chua
Accelerative Technology Lab
Mimos Berhad
Kuala Lumpur, Malaysia
kk.yong@mimos.my, syazwan.shafei@mimos.my, py.sian@mimos.my, mw.chua@mimos.my
Abstract—Constructing a big data analytics (BDA) system is challenging: it must not only ingest large volumes of data, but also simultaneously compute over vast volumes and varieties of data, as driven by the required analytics use cases. This challenges the data architects, data engineers and data scientists in an organization who must discover insights in data and turn them into value. The promise of adopting a strategic big data architecture is that it maximizes the capabilities of the technologies in automating decision making and drives value through innovation. A successful solution adopts a highly flexible and scalable architectural design that best fits the organization's BDA system. This paper surveys and discusses BDA and its architectures for applying the appropriate technologies.
Keywords—Big Data Analytics, BDA, AI, IoT, Data Management
I. INTRODUCTION
Big Data Analytics is neither a novel nor a unique phenomenon; it is the result of a long evolution of capturing and processing collected data. Since ancient times, humans have developed methods to keep the results of calculations in some kind of permanent format. Data analysis is rooted in statistics; ancient Egypt used it for building the pyramids. Many of the foundations that big data is built on were laid long ago.
BDA and High-Performance Computing (HPC) share the characteristic of distributing tasks across many servers. They have common objectives: optimizing algorithms, executing parallel processes efficiently, automating computation, and building high-performance networks. HPC adopts Message Passing Interface (MPI), OpenMP, Partitioned Global Address Space (PGAS), OpenSHMEM, Lustre, GlusterFS and other high-performance technologies [1]. BDA, on the other hand, materializes through a combination of techniques and methods chained from one workflow to the next. It is not enough to run a forecasting model; the data has to be collected, adjusted into an acceptable model, the model executed, and the results visualized. Hadoop was introduced in the early 2000s as an open source distributed framework. A series of papers was published describing innovations in systems for providing reliable storage (Google File System), processing with MapReduce, and low-latency random-access queries (Bigtable) on hundreds to thousands of potentially unreliable servers [2].
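To illustrate the MapReduce programming model mentioned above, the following is a minimal sketch in plain Python rather than the Hadoop API: the map phase emits key-value pairs, a shuffle groups values by key, and the reduce phase aggregates each group. The word-count task and the function names are purely illustrative.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Aggregate all counts observed for a single word.
    return key, sum(values)

documents = ["big data analytics", "big data architecture"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1, 'architecture': 1}
```

In a real Hadoop or Spark deployment, the shuffle and the distribution of map and reduce tasks across servers are handled by the framework; only the two user functions change per analytics use case.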
Deciding on the best-fit BDA architecture involves identifying the data sources, the features and the analytics use cases. The general concept of big data is to extract insights, correlations and value from data. It starts with the three-Vs definition of big data: Volume, Variety and Velocity. Veracity and Value were added subsequently. These "Vs" attempt to formalize the definition of the "big" aspect of this phenomenon.
1 BCE refers to Before the Common Era.
The features of BDA then address the required functionalities. The capabilities of a particular platform depend on certain important factors, such as data size, throughput and model development. These can be located in "Data at Rest" (batch) or "Data in Motion" (real-time) processing.
In this paper, we briefly survey and discuss the design of BDA architectures, which may help in adopting various open source technologies. The remainder of this paper is structured as follows. Section II introduces the background of Big Data. Section III describes the big compute features. Section IV surveys the design of architectures for BDA. Section V discusses the trends and analysis of BDA. Finally, Section VI concludes the paper.
II. BACKGROUND OF BIG DATA
A. Evolution of Data Analytics
Ancient Paleolithic ('Old Stone Age') people marked notches into sticks and bones to keep track of trading activity and supplies. These marks were subsequently used to carry out simple calculations and predictions of food supplies. In c. 2400 BCE1, the first calculation device, the abacus, was constructed. Libraries also appeared around this time, representing the first mass data storage; these came into use in Babylon. Around 300 BC, the largest library of the ancient world, the Library of Alexandria, was built. This long history of revolution and innovation has led us to the dawn of the data age [3].
In 1880, the US Census Bureau faced a serious problem as the population exploded and the count turned into an administrative nightmare. The work of measuring and recording population records was maddeningly slow and expensive; it was estimated that crunching all the collected data would take eight years. Hollerith [4] realized the need for a better way to count the results. Data was entered on a machine-readable medium, punched cards, and tabulated by machine. This reduced the processing time from eight years for the 1880 census to six years for the 1890 census. Hollerith, who went on to found the company that became IBM, thus established modern automated computation for handling incredibly large data processing.
In 1970, the framework for a "relational database" was introduced by the IBM mathematician Edgar Codd [5]. This model of data services for storing and accessing information remains popular and is still used today. The Material Requirements Planning [6] system represents the first mainstream commercial computerized system used to accelerate daily data processes. Subsequently, "Business Intelligence" became a popular emerging tool alongside database systems for analyzing commercial and operational performance. Erik wrote an article for Harper's Magazine using the term "Big Data": "The keepers of big
data say they are doing it for the consumer's benefit. But data have a way of being used for purposes other than originally intended" [7].
The early 1990s saw the birth of the interconnected web of data, accessible to anyone from anywhere, known as the Internet. Digital storage became more cost effective than manually printing documents. Michael [8] described that, including sounds and images, there exist thousands of petabytes of information, and that the existence of 12,000 petabytes is not an unreasonable guess. The web was increasing in size ten-fold each year; without analysis, however, this data would never reveal its value and would yield no insight. By the mid 1990s the Internet was extremely popular, but structured relational databases could not cope with the variety of data types originating outside relational databases. Thus, NoSQL systems were created to handle different languages and formats in a highly flexible way. Larry Page and Sergey Brin implemented Google's search engine, which responds within a few seconds with the desired results by processing and analyzing Big Data in a distributed manner [9]. Richard Hamming commented that the purpose of computing is insight, not just numbers. In 1999, Kevin Ashton introduced the term "Internet of Things" to describe the growing number of devices online that automatically communicate with each other without human interference; it also utilizes the Internet to empower computers to sense the world for themselves [10].
With the advent of Industrial Revolution 4.0, which developed in Germany in 2013 and has rapidly spread throughout Europe and the world as a whole, BDA has become one of the key adoptions and a pillar of IoT initiatives to improve decision making [11]. It requires processing a large amount of data on the fly and storing the data in various scalable storage technologies. Such lightning-fast analytics implementations allow industries to gain rapid insights, provide predictions for machinery, and share information. Intrinsically, this requires a unified architecture that caters for the common operations enabling innovative applications.
B. Big Data General ‘Vs’ Concept
To understand the Big Data concept, one always considers the simple building blocks of the data model and how they relate to one another. In 2001, the Gartner analyst Doug Laney introduced the 3Vs concept in the dimension of data management, consisting of controlling data volume, variety and velocity [12]. It characterizes the creation of data, its storage, retrieval and analysis. A decade later, IBM coined two more worthy Vs, namely Veracity and Value. The following gives a brief description of the 5Vs:
Volume: the enormous quantity of data that is generated.
Velocity: the staggering speed at which data is created and processed.
Variety: the types of content and data under analysis.
Veracity: the quality and trustworthiness of the captured data and its variability.
Value: the significance of the data, delivering insights and creating useful models that answer sophisticated queries.
Inspired by the comprehensive discussion and relevant comments on the IBM Big Data & Analytics Hub website, the 5Vs can be clustered into three groups [13]:
Volume & Velocity: these translate into hardware and software requirements for dealing with the data; a large-scale distributed data processing framework such as Hadoop is required.
Veracity & Velocity: these translate into the urgency of real-time processing; the detection of possible data corruption or manipulation is crucial and demands high-speed processing ability.
Value: this translates into the necessity of interdisciplinary cooperation, which raises the most difficult challenge for the industrial use of big data.
C. “Data at Rest” vs “Data in Motion”
Gaining insights from big data is no small task. Firstly, "Data at Rest" refers to historical data collected from various sources. The analytics is performed after the event occurs; thus, it is commonly used to discover behaviors and patterns from past records. It is also referred to as the "batch processing" method. To automate these tasks, a scheduler application is put in place to execute them automatically. Secondly, "Data in Motion" refers to processing and analyzing data in real time as the event happens. Latency is a key consideration, as a lag in processing can result in the loss of opportunities. Furthermore, hybrids of "Data at Rest" and "Data in Motion" are common in industry.
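The contrast can be made concrete with a small sketch. The following uses PySpark purely as an illustration and assumes hypothetical inputs (an events.parquet file and a Kafka topic named events on localhost:9092, with the Spark Kafka connector available): the "Data at Rest" job reads and aggregates the historical data once, while the "Data in Motion" job maintains the same aggregation continuously as events arrive.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-vs-motion").getOrCreate()

# "Data at Rest": a scheduled batch job over the collected historical records.
batch_df = spark.read.parquet("/data/events.parquet")              # hypothetical path
(batch_df.groupBy("device_id").count()
         .write.mode("overwrite").parquet("/views/device_counts"))

# "Data in Motion": the same aggregation computed continuously as events arrive.
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
             .option("subscribe", "events")
             .load())
query = (stream_df.groupBy("key").count()
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```

A hybrid deployment typically runs both: the batch view captures the full history, while the streaming view keeps the most recent window up to date.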
III. BIG COMPUTE FEATURES
For data-intensive computing [14], the system should encapsulate sophisticated design technologies for storing, managing and processing big data. There are two key focus areas: applications and frameworks. These embody the concepts of data parallelism and task/application parallelism. With data parallelism, the data is distributed among servers and can therefore be processed in parallel. It has been claimed that, as opposed to task parallelism, it is often the simpler way to craft a parallel application [15].
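A minimal illustration of data parallelism, using only Python's standard multiprocessing module: the data is partitioned into chunks, the same function is applied to every chunk on a separate worker, and the partial results are combined afterwards. The computation itself (a sum of squares) is only a placeholder.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # The same computation is applied to every partition of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Partition the data and process the chunks in parallel on four workers.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)
    print(sum(partial_sums))  # combine the partial results
```

Task parallelism, by contrast, would run different functions concurrently; here every worker runs the same code on a different slice of the data.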
The following describes the generic features of Big Compute:
• Being efficient in pre-processing raw data and combining relevant data from multiple sources, commonly known as ETL (Extract, Transform and Load); a minimal ETL sketch is given after this list
• Being flexible in applying various aggregation functions and performing ad-hoc queries over a large number of sources to discover high-level insights from the data
• Being cost effective, extending functionality at minimum cost and minimizing the maintenance cost of keeping the system running smoothly
• Being low latency, harnessing real-time data for analytics by optimizing high-volume operations with minimal delay
• Being highly scalable, accommodating the growth of compute resources and storage with easy plug-in support
• Being robust and fault tolerant, with the ability to cope with erroneous input and to keep running through failures
• Being systematically governed, ensuring data availability, usability, integrity and security in use
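As referenced in the first item above, here is a minimal ETL sketch using pandas. The two sources (orders.csv and customers.json), their columns and the output file are hypothetical: raw data is extracted, transformed (cleaned and joined), and loaded into a curated analytical view.

```python
import pandas as pd

# Extract: pull raw data from two hypothetical sources.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_json("customers.json")

# Transform: drop invalid rows, join the sources, and aggregate.
orders = orders.dropna(subset=["customer_id", "amount"])
orders = orders[orders["amount"] > 0]
enriched = orders.merge(customers, on="customer_id", how="left")
daily = (enriched.groupby([enriched["order_date"].dt.date, "region"])["amount"]
         .sum()
         .reset_index(name="revenue"))

# Load: write the curated view to the analytical store (a Parquet file here).
daily.to_parquet("daily_revenue.parquet", index=False)
```

In a production BDA system, the same three steps would typically run on a distributed engine and be triggered by a workflow scheduler rather than executed as a single script.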
Identifying the required features for a specific domain can be difficult. In general, different application domains may need different types of systems, and it is hard to meet all stakeholder needs with a single design. As such, Cigdem [16] attempts to use a feature modelling technique [17]. It drills down by distinguishing domain scoping, which determines the domain of interest, the stakeholders and their goals, from domain modelling, which aims to derive the features using a commonality analysis. Figure 1 shows the feature model diagram. This work provides insight into the overall feature space of a BDA system and further assists in deriving the BDA architecture.
Figure 1: Feature Model
IV. REVIEW OF BIG DATA ARCHITECTURE FRAMEWORK
A reference architecture helps to build a blueprint of the ultimate BDA system. It is based on a collection of characteristics and features that are common to a given set of problems. The design of the architecture has to produce a fluent orchestration workflow that executes either synchronously or asynchronously between the application and its data. In many cases, it includes support for a hybrid mode of batch and real-time processing. The following review of architecture frameworks broadens the perspective and enables solving the problem with the right tools.
A. Lambda ‘λ’ Architecture
In 2011, one of the popular reference BDA architecture designs was posted by Marz [18]. It is named the "Lambda (λ) Architecture" and is designed to combine the batch and real-time processing paradigms in parallel. This method is capable of solving many BDA use cases. In addition, it is robust, with a fault-tolerant strategy for serving a wide range of workloads. Technically, it is now feasible to run ad-hoc queries against Big Data, but querying a petabyte dataset every time a result is needed is prohibitively expensive. Figure 2 shows the λ architecture with its three major layers.
Figure 2: λ Architecture
The batch layer pre-computes over the master dataset and processes it into batch views so that queries can be resolved with low latency. This requires striking a balance between pre-computation and the execution time needed to complete a query. By doing a little computation on the fly at query time, the process is saved from needing to pre-compute excessively large batch views. In addition, the views are not expected to be updated frequently. The batch views may be a set of flat files, depending on the chosen technologies. The key is to precompute just enough information so that the query can be completed quickly.
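As a hedged sketch of this idea, assume the master dataset is an immutable list of page-view events (the field names are illustrative). The batch layer periodically recomputes, from scratch, a compact view of counts per URL per hour that the serving layer can then index.

```python
from collections import Counter
from datetime import datetime, timezone

def build_batch_view(master_dataset):
    """Recompute the batch view from scratch over the immutable master dataset."""
    view = Counter()
    for event in master_dataset:  # each event: {"url": ..., "timestamp": ...}
        hour = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc) \
                       .strftime("%Y-%m-%d %H:00")
        view[(event["url"], hour)] += 1
    return view  # just enough precomputation for queries to complete quickly

master_dataset = [
    {"url": "/home", "timestamp": 1590811200},
    {"url": "/home", "timestamp": 1590811260},
    {"url": "/docs", "timestamp": 1590814800},
]
batch_view = build_batch_view(master_dataset)
print(batch_view)
```

Because the whole view is rebuilt from the immutable master dataset on every run, a bug in the computation can be fixed simply by correcting the code and recomputing.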
The serving layer indexes the views and provides interfaces so that the pre-computed data can be queried speedily. The batch and speed layers execute the same processing logic, and the results are then reconciled in the serving layer. It is designed to be distributed among many servers for scalability. There is a long-standing problem whereby data that is too normalized requires some information to be stored redundantly to improve response times; however, denormalizing the data may create huge complexity in keeping it consistent. Thus, this view needs to be constructed carefully [19].
The speed layer is similar to the batch layer. The objective is to construct views that can be queried efficiently. It mainly uses an incremental approach and handles real-time views, which are updated directly when new data arrives. It compensates for the high latency of the batch layer to enable up-to-date results for queries. However, incremental computation introduces various new challenges and is significantly more complex than batch computation, especially when it must be performed in a resource-efficient manner with millisecond-level latencies. Data must be indexed, which requires the use of random-read/random-write databases.
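A minimal sketch of how the three layers fit together at query time, under the simplifying assumption that both views are plain dictionaries keyed by (url, hour): the speed layer folds each new event into its real-time view incrementally, and the serving layer answers a query by reconciling the precomputed batch view with the real-time view.

```python
batch_view = {("/home", "2020-05-30 04:00"): 120}  # rebuilt periodically by the batch layer
realtime_view = {}                                 # updated incrementally by the speed layer

def speed_layer_update(event):
    # Fold a new event into the real-time view as soon as it arrives.
    key = (event["url"], event["hour"])
    realtime_view[key] = realtime_view.get(key, 0) + 1

def query_pageviews(url, hour):
    # Serving layer: reconcile the batch view (complete but stale)
    # with the real-time view (recent events only).
    key = (url, hour)
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

speed_layer_update({"url": "/home", "hour": "2020-05-30 04:00"})
print(query_pageviews("/home", "2020-05-30 04:00"))  # 121
```

Once the next batch run has absorbed the recent events, the corresponding entries in the real-time view can be discarded, which keeps the speed layer small.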
Contenu connexe

Tendances

Mi health care - multi-tenant health care system
Mi health care - multi-tenant health care systemMi health care - multi-tenant health care system
Mi health care - multi-tenant health care system
Conference Papers
 
Architectural design of IoT-cloud computing integration platform
Architectural design of IoT-cloud computing integration platformArchitectural design of IoT-cloud computing integration platform
Architectural design of IoT-cloud computing integration platform
TELKOMNIKA JOURNAL
 
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
IJCNCJournal
 
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
Conference Papers
 
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
Ncct   Ieee Software Abstract Collection Volume 1   50+ AbstNcct   Ieee Software Abstract Collection Volume 1   50+ Abst
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
ncct
 
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
Istabraq M. Al-Joboury
 
Cooperative hierarchical based edge-computing approach for resources allocati...
Cooperative hierarchical based edge-computing approach for resources allocati...Cooperative hierarchical based edge-computing approach for resources allocati...
Cooperative hierarchical based edge-computing approach for resources allocati...
IJECEIAES
 
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATIONCIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
ijcsit
 

Tendances (20)

Ck34520526
Ck34520526Ck34520526
Ck34520526
 
Mi health care - multi-tenant health care system
Mi health care - multi-tenant health care systemMi health care - multi-tenant health care system
Mi health care - multi-tenant health care system
 
Architectural design of IoT-cloud computing integration platform
Architectural design of IoT-cloud computing integration platformArchitectural design of IoT-cloud computing integration platform
Architectural design of IoT-cloud computing integration platform
 
IRJET - Development of Cloud System for IoT Applications
IRJET - Development of Cloud System for IoT ApplicationsIRJET - Development of Cloud System for IoT Applications
IRJET - Development of Cloud System for IoT Applications
 
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
FAST PACKETS DELIVERY TECHNIQUES FOR URGENT PACKETS IN EMERGENCY APPLICATIONS...
 
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
Performance analysis of enhanced delta sampling algorithm for ble indoor loca...
 
15CS81 Module1 IoT
15CS81 Module1 IoT15CS81 Module1 IoT
15CS81 Module1 IoT
 
IRJET- Secure Data Access Control with Cipher Text and It’s Outsourcing in Fo...
IRJET- Secure Data Access Control with Cipher Text and It’s Outsourcing in Fo...IRJET- Secure Data Access Control with Cipher Text and It’s Outsourcing in Fo...
IRJET- Secure Data Access Control with Cipher Text and It’s Outsourcing in Fo...
 
Disambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities ResearchersDisambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities Researchers
 
A Comparative Study: Taxonomy of High Performance Computing (HPC)
A Comparative Study: Taxonomy of High Performance Computing (HPC) A Comparative Study: Taxonomy of High Performance Computing (HPC)
A Comparative Study: Taxonomy of High Performance Computing (HPC)
 
IoTReport
IoTReportIoTReport
IoTReport
 
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
Ncct   Ieee Software Abstract Collection Volume 1   50+ AbstNcct   Ieee Software Abstract Collection Volume 1   50+ Abst
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
 
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
Performance Analysis of Internet of Things Protocols Based Fog/Cloud over Hig...
 
Open Source Platforms Integration for the Development of an Architecture of C...
Open Source Platforms Integration for the Development of an Architecture of C...Open Source Platforms Integration for the Development of an Architecture of C...
Open Source Platforms Integration for the Development of an Architecture of C...
 
In-Network Distributed Analytics on Data-Centric IoT Network for BI-Service A...
In-Network Distributed Analytics on Data-Centric IoT Network for BI-Service A...In-Network Distributed Analytics on Data-Centric IoT Network for BI-Service A...
In-Network Distributed Analytics on Data-Centric IoT Network for BI-Service A...
 
Cooperative hierarchical based edge-computing approach for resources allocati...
Cooperative hierarchical based edge-computing approach for resources allocati...Cooperative hierarchical based edge-computing approach for resources allocati...
Cooperative hierarchical based edge-computing approach for resources allocati...
 
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATIONCIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
CIRCUIT BREAK CONNECT MONITORING TO 5G MOBILE APPLICATION
 
IRJET- Cost Effective Scheme for Delay Tolerant Data Transmission
IRJET- Cost Effective Scheme for Delay Tolerant Data TransmissionIRJET- Cost Effective Scheme for Delay Tolerant Data Transmission
IRJET- Cost Effective Scheme for Delay Tolerant Data Transmission
 
Prediction Based Efficient Resource Provisioning and Its Impact on QoS Parame...
Prediction Based Efficient Resource Provisioning and Its Impact on QoS Parame...Prediction Based Efficient Resource Provisioning and Its Impact on QoS Parame...
Prediction Based Efficient Resource Provisioning and Its Impact on QoS Parame...
 
An Event-based Middleware for Syntactical Interoperability in Internet of Th...
An Event-based Middleware for Syntactical Interoperability  in Internet of Th...An Event-based Middleware for Syntactical Interoperability  in Internet of Th...
An Event-based Middleware for Syntactical Interoperability in Internet of Th...
 

Similaire à Review of big data analytics (bda) architecture trends and analysis

Similaire à Review of big data analytics (bda) architecture trends and analysis (20)

Moving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesMoving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and Perspectives
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
IRJET- A Scenario on Big Data
IRJET- A Scenario on Big DataIRJET- A Scenario on Big Data
IRJET- A Scenario on Big Data
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data Analytics
 
Big data
Big dataBig data
Big data
 
Big Data
Big DataBig Data
Big Data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
A Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremA Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE Theorem
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
Big Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and ChallengesBig Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and Challenges
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing Platforms
 
Big data survey
Big data surveyBig data survey
Big data survey
 
Big data security and privacy issues in the
Big data security and privacy issues in theBig data security and privacy issues in the
Big data security and privacy issues in the
 
BIG DATA SECURITY AND PRIVACY ISSUES IN THE CLOUD
BIG DATA SECURITY AND PRIVACY ISSUES IN THE CLOUD BIG DATA SECURITY AND PRIVACY ISSUES IN THE CLOUD
BIG DATA SECURITY AND PRIVACY ISSUES IN THE CLOUD
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 

Plus de Conference Papers

Ai driven occupational skills generator
Ai driven occupational skills generatorAi driven occupational skills generator
Ai driven occupational skills generator
Conference Papers
 
Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...
Conference Papers
 
Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...
Conference Papers
 
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Conference Papers
 
A deployment scenario a taxonomy mapping and keyword searching for the appl...
A deployment scenario   a taxonomy mapping and keyword searching for the appl...A deployment scenario   a taxonomy mapping and keyword searching for the appl...
A deployment scenario a taxonomy mapping and keyword searching for the appl...
Conference Papers
 
Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...
Conference Papers
 
Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...
Conference Papers
 
Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...
Conference Papers
 
An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution
Conference Papers
 
An analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deploymentAn analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deployment
Conference Papers
 
Validation of early testing method for e government projects by requirement ...
Validation of early testing method for e  government projects by requirement ...Validation of early testing method for e  government projects by requirement ...
Validation of early testing method for e government projects by requirement ...
Conference Papers
 
The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...
Conference Papers
 
Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...
Conference Papers
 
Towards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaTowards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysia
Conference Papers
 
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
Conference Papers
 
Searchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsSearchable symmetric encryption security definitions
Searchable symmetric encryption security definitions
Conference Papers
 
Super convergence of autonomous things
Super convergence of autonomous thingsSuper convergence of autonomous things
Super convergence of autonomous things
Conference Papers
 
Study on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorStudy on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistor
Conference Papers
 
Stil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designStil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal design
Conference Papers
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
Conference Papers
 

Plus de Conference Papers (20)

Ai driven occupational skills generator
Ai driven occupational skills generatorAi driven occupational skills generator
Ai driven occupational skills generator
 
Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...
 
Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...
 
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
 
A deployment scenario a taxonomy mapping and keyword searching for the appl...
A deployment scenario   a taxonomy mapping and keyword searching for the appl...A deployment scenario   a taxonomy mapping and keyword searching for the appl...
A deployment scenario a taxonomy mapping and keyword searching for the appl...
 
Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...
 
Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...
 
Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...
 
An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution
 
An analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deploymentAn analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deployment
 
Validation of early testing method for e government projects by requirement ...
Validation of early testing method for e  government projects by requirement ...Validation of early testing method for e  government projects by requirement ...
Validation of early testing method for e government projects by requirement ...
 
The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...
 
Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...
 
Towards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaTowards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysia
 
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
 
Searchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsSearchable symmetric encryption security definitions
Searchable symmetric encryption security definitions
 
Super convergence of autonomous things
Super convergence of autonomous thingsSuper convergence of autonomous things
Super convergence of autonomous things
 
Study on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorStudy on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistor
 
Stil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designStil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal design
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Review of big data analytics (bda) architecture trends and analysis

  • 1. XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE Review of Big Data Analytics (BDA) Architecture: Trends and Analysis Keh Kok Yong, Mohamad Syazwan Shafei, Pek Yin Sian, Meng Wei Chua Accelerative Technology Lab Mimos Berhad Kuala Lumpur, Malaysia kk.yong@mimos.my, syazwan.shafei@mimos.my, py.sian@mimos.my, mw.chua@mimos.my Abstract—The problem of constructing a big data analytics capabilities system, it is not only ingesting large volume of data, but also simultaneously computes vast volume and variety types of data, which is driven by required analytics use cases. This affects data architects, data engineers and data scientists in an organization to discover insight of data and turn to value. The promise of adopting a big data strategic architecture, it maximizes technologies capabilities in automate decision making and drives these values through innovation. A successful solution adopts a highly flexible and scalable architectural design to best-fit the organization BDA system. This paper surveys and discusses BDA and its architecture for applying the appropriate technologies. Keywords—Big Data Analytics, BDA, AI, IOT, Data Management I. INTRODUCTION Big Data Analytics is neither a novel nor unique phenomenon. It has been a long evolution of capturing and processing the collected data. During ancient times, human has developed methods to keep the results of those calculations in some kind of permanent format. Data analysis is rooted in statistics, ancient Egypt uses it for building the pyramid. There are many big data foundations has been built on and were laid long ago. Both of BDA and High-Performance Computing (HPC) has a similarity characteristic in distributing tasks across many servers. They share the common objectives in optimizing algorithms, executing parallel processes efficiently, automating computation, and building high performance networks. HPC adopts Message Passing Interface (MPI), OpenMP, Partitioned Global Address Space (PGAS), OpenSHMEM, Lustre, GlusterFS and others high performance technologies [1]. On the other hand, BDA is materializing through a combination of techniques and methods from one workflow with the other. It is not only to run a forecasting model, there have to collect the data, adjust into an acceptable model, execute the model and visualize the results. Hadoop has introduced in early 2000s, as an open source distributed framework. A series of papers have been published in describing innovations in systems for producing reliable storage (Google File System), processing in MapReduce, and low-latency random-access queries (Bigtable) on hundreds to thousands of potentially unreliable servers [2]. For deciding best fit BDA architecture, it involves identifying sources, features and analytics use cases. The general concept of big data is to extract insights, correlations and value from data. It starts with three Vs definition of big data; Volume, Variety and Velocity. Subsequently, it added Veracity and Value. These “Vs” is attempted to formalize the definition of the big aspect of this phenomenon. 1 BCE refer to Before the Common Era Subsequently, the features of BDA address the required functionalities. The capabilities of particular platform depend on certain important factors, such as data size, throughput and model development. These can be located in “Data at Rest” (batch) or “Data in Motion” (real-time). In this paper, we briefly survey and discuss the design of BDA architectures that might help adopting various open source technologies. 
The remainder of this paper is structure as follows. Section II introduces the background of Big Data. Section III describes the big compute features. Section IV surveys the design of architecture for BDA. Section V discusses the trends and analysis of BDA. Finally, Section VI concludes the paper. II. BACKGROUND OF BIG DATA A. Evolution of Data Analytics The ancient Paleolithic (‘Old Stone Age’) people mark notches into sticks/bones to keep track of trading activity of supplies. Subsequently, it uses to carry out simple calculations and food supplies predictions. In c. 2400 BCE1 , the first calculation device is constructed, known as abacus. Furthermore, libraries also appear around this time, it represents that the first mass data storage is built too. These have been coming into the use in Babylon. During 300 BC, the largest library is built in the ancient world, Library of Alexandria. This long history of revolution and innovation has led us to the dawn of the data age [3]. In 1880, US Census Bureau has faced a series problem as the population exploded, it turns into an administrative nightmare. The work of measuring and recording the population records are maddeningly slow and expensive. It estimates to take eight years to crunch all collected data. Hollerith [4] realizes the need for a better way to count results. Data is entered on a machine readable medium, punched cards, and tabulated by machine. It reduces the time required to process the census from eight years for the 1880 census to six years for the 1890 census. This revolutionized modern automated computation for handling incredible big data processing was founded by Hollerith, as father of the company, IBM. In 1970, the framework for a “relational database” introduces by IBM mathematician, Edgar [5]. This model data services, store and access information is still being popular and used today. Material Requirement Planning [6] system represent the first mainstream commercial computerized system uses to accelerating daily data processes. Subsequently, “Business Intelligence” becomes a popular emerging tool with database system for analyzing commercial and operational performance. Erik writes an article for Harper Magazine using of the term “Big Data”, “The keepers of big ,(((RQIHUHQFHRQ2SHQ6VWHPV,26
  • 2. k,((( 34 Authorized licensed use limited to: University of Waterloo. Downloaded on May 30,2020 at 04:01:40 UTC from IEEE Xplore. Restrictions apply.
  • 3. data say they are doing it for the consumer’s benefit. But data have a way of being used for purposes other originally intended” [7]. Early of 1990s, the birth of the interconnected web of data and accessible to anyone from anywhere, known as Internet. The digital storage become more cost effective than manual printing documents. Michael [8] describes that including the sounds and images there are thousands of petabytes information, the existence of 12,000 petabytes is not an unreasonable guest. The web is increasing in size of 10-fold each year, however, data will never be discovered values and yield no insight. During the mid 1990s, the internet is extremely popular, but structure relational databases cannot cope with the variety of data types from different non- relational databases. Thus, NoSQL system is created to handle different languages and formats in a great flexible way. Larry Page and Sergey Brin implement Google’s search engine that can respond in a few seconds by returning desired results, which processing and analyzing Big Data in distributed method [9]. Richard comments that the purpose of computing is insight, and not just numbers. In 1999, Kevin introduces the term of “Internet of Things” to describe the growing number of devices online to automated the communication each other without a human interference; Also, it utilizes the Internet to empower computers to sense the world for themselves [10]. In the advent of Industry Revolution 4.0, which developing in Germany 2013; it has been rapidly spread in Europe and the world as a while. BDA is one of the key adoptions and pillar for IoT initiative to improve decision making [11]. It requires to process a large amount of data on the fly and storing the data in various scalable storage technologies. This lighting fast analytics implementation allows the industries to gain rapid insights, provide prediction for machinery, and share information. Intrinsically, it requires a unified architecture to cater common operation for enabling innovative applications. B. Big Data General ‘Vs’ Concept For understanding the Big Data concept, it always considers the simple building block of data model which is effectively communicating each and others. In 2001, Gartner analyst, Doug introduces the 3Vs concept in the dimension of data management, it consists of controlling data volume, variety and velocity [12]. It characterizes the creation of data, storage, retrieval and analysis. After a decade, IBM has been coined two more worthy of Vs, which are Veracity and Value. The following shows the brief description of 5Vs: Volume: It implies to the enormous quantity of data is generated. Velocity: It refers to the speed at the data is created and processed at staggering rate. Variety: It defines as type of content of data analysis. Veracity: It focuses on the quality and trust-worthiness of the variability in the captured data. Value: It raises to the significance of the data, which delivering the insights and creating useful model that answers sophisticated queries. Inspired by the comprehensive discussion and relevant comments on IBM website of Big Data Analytics hub, it clusters the 5Vs into three groups [13]: Volume Velocity: These translate into requirements of hardware and software to deal with data. Large scale distributed data processing framework is required such as Hadoop. Veracity Velocity: These translate into urgency of real-time processing. 
The detection of possible data corruption or manipulation is crucial with high speed processing ability. Value: This translates into the necessity of interdisciplinary cooperation. This raise the most difficult challenge for industrial use of big data. C. “Data at Rest” vs “Data in Motion” There is no small task in gaining the insights of big data. Firstly, “Data at Rest” refers the collected historical data from various sources. It performs the analytics after the event occurs. Thus, it is commonly used to discover behaviors and patterns from the past records. Also, it refers to “batch processing” method. To automate these tasks, there is a scheduler application in place for executing the tasks automatically. Secondly, “Data in Motion” refers to processing and analyzing data in real-time as the event happens. The latency is a key consideration, as a lag of processing can be resulted the loss of opportunities. Furthermore, hybrid of “Data at Rest” and “Data in Motion” are common in the industries. III. BIG COMPUTE FEATURES For data intensive computing [14], the system should encapsulate the sophisticated design technologies in storing, managing and processing big data. There are two focus of key areas, which are application and frameworks. These consists the concept of data parallelism and task/application parallelism. Data parallelism is distributed among servers, and therefore can be processed in parallel. It has been claimed that it opposes to task parallelism, furthermore, it is often the simpler method to craft a parallel application [15]. The followings describe the generic features for Big Compute: • Being efficient in pre-processing raw data and combining relevant data from multiple sources, commonly known as ETL (Extract, Transform and Load) • Being flexible to apply various aggregation functions and perform ad-hoc queries to compute large amount of sources in discovering the high-level insights of data • Being cost effective to extends functionalities with minimum costs and minimize maintenance cost for keeping the system running smoothly • Being low latency in harnessing real-time data for analytics by optimizing the high volume operation with minimal delay • Being highly scalable to enlarge the growth of the compute resources and storages with support easily plug-in • Being robustness and fault tolerance to have ability to cope with erroneous input and without down any failures ,(((RQIHUHQFHRQ2SHQ6VWHPV,26
  • 4. 35 Authorized licensed use limited to: University of Waterloo. Downloaded on May 30,2020 at 04:01:40 UTC from IEEE Xplore. Restrictions apply.
  • 5. • Being systematically governance to ensure data availability, usability, integrity and security in used Identifying the required features for a specific domain can be difficult. In general, different application domains might need different type of system. It is hard to meet all stockholder needs with a singular design. As such, Cigdem [16] attempts to use feature modelling technique [17]. It performs drill down by distinguish domain scoping, which determining the domain interest, the stockholders and their goals; and domain modelling, which aiming to derive using a commodity analysis. Figure 1 shows the feature model diagram. This work provides insight in the overall feature space of BDA system. It further assists for deriving the BDA architecture. Figure 1: Feature Model IV. REVIEW OF BIG DATA ARCHITECTURE FRAMEWORK A reference architecture helps to build a blueprint of the ultimate BDA system. It is based on a collection of characteristics and features from common for a given set of problems. The design of the architecture has to emerge the fluent orchestration workflow to execute either in a synchronous or asynchronous manner between the application and its data. In many cases, it includes the support for the hybrid mode of batch and real-time processing. The following reviews of architecture frameworks broaden the perspective and enabling problem solving with the right tools. A. Lambda ‘λ’ Architecture In 2011, one of the popular reference BDA architecture design has been posted by Marz [18]. It is named as “Lambda λ Architecture”. It is designed to combine of batch and real- time processing paradigm in a parallel form. This method is capable to solve many BDA use cases. In addition, it has the robustness with fault tolerant strategy for serving wide range of workloads. Technically, it is now feasible to run ad-hoc queries against Big Data, but querying a petabyte dataset every time you want to compute. Figure 2 shows the λ architecture with three major layers. Figure 2: λ Architecture The batch layer pre-computes the master dataset, and processes into batch views so that queries can be resolved with low latency. This requires striking a balance of job between pre-computation and execution time to complete the query. By doing a little bit of computation on the fly to complete queries, there save the process from needing to pre-compute large batch views. In addition, it is not expected to update the views frequently. The batch views may be a set of flat files and it depends on chosen technologies. The key is to precompute just enough information so that the query can be completed quickly. The serving layer indexes the views and provides interfaces, thus, the pre-computed data can be speedily queried. Both of the batch and speed layers are executed the same processing logic, and then reconciles the results in serving layer. It designates to be distributed among many servers for scalability. There is a long-standing problem where data is too normalized, there is a need to store some information redundantly to improve response times. However, denormalized the data may create huge complexity of keeping it consistent. Thus, it need to be carefully construct this view [19]. The speed layer is similar to batch layers. The objective is to construct views that can be efficiently queried. It mainly uses an incremental approach and handling real-time views. These views are updated directly when new data arrives. 
It compensates for the high latency of the batch layer and enables up-to-date query results. However, incremental computation introduces new challenges and is significantly more complex than batch computation, especially when it must be resource-efficient and deliver millisecond-level latencies. The data must also be indexed, which typically calls for random-read/random-write databases.
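As an illustration of this split, the following minimal Python sketch (not taken from the surveyed papers; the page-view events, the data layout and the hourly batch cut-off are assumptions for illustration only) pre-computes a batch view over an immutable master dataset, maintains an incremental real-time view in the speed layer, and merges both at query time:

from collections import Counter, defaultdict

# Immutable master dataset: (user_id, url) page-view events (assumed schema).
master_dataset = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]

# Batch layer: periodically recompute the batch view from the full dataset.
def compute_batch_view(events):
    view = Counter()
    for _, url in events:
        view[url] += 1
    return view

batch_view = compute_batch_view(master_dataset)

# Speed layer: incrementally update a real-time view for events that arrived
# after the last batch run.
realtime_view = defaultdict(int)

def on_new_event(event):
    _, url = event
    realtime_view[url] += 1

on_new_event(("u3", "/home"))  # a recent event not yet in the batch view

# Serving layer: answer queries by merging the two pre-computed views.
def query_page_views(url):
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query_page_views("/home"))  # -> 3

In a production deployment the batch view would typically live in a bulk-loadable store and the real-time view in a random-write database, as described above, but the merge-at-query-time principle is the same.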
B. Kappa 'κ' Architecture
Kreps [20] argues that alternatives to the λ architecture are worth exploring. He highlights the burden of maintaining code in two complex distributed systems, in both development and operations, especially across distributed components such as Storm and Hadoop. The κ architecture was introduced to address this. In this approach, re-processing is executed only when the processing code has changed and the result sets therefore need to be recomputed. The job doing the re-computation is simply an improved version of the same code, running on the same framework and taking the same input data. The κ architecture is thus a simplification of the λ architecture in which the entire batch layer is removed, leaving only the speed layer and the serving layer. Figure 3 shows the κ architecture. The workflow handles real-time data processing and continuous data re-processing in a single stream-computation model. A streaming job reads the data and processes it. When re-processing is required, a second instance of the streaming job is started; it processes the data from the beginning of the retained log and redirects its output to a separate table. Once the second job has caught up with the entire dataset, the application is switched to read from the new view, the first job is stopped and its view is deleted [21], as sketched below. Multiple consumers can also be spun up in parallel, each consuming an individual partition of the data.

Figure 3: κ Architecture

Another pillar of the κ architecture is the immutable data log. It is similar in concept to the immutable master dataset in the λ architecture, but instead of technologies such as Hadoop/HDFS, the immutable data log is usually Kafka (originally developed at LinkedIn and later contributed to the open source community through the Apache Software Foundation). Kafka retains the full log of the data that may need to be re-processed; the data is persisted to disk and replicated for fault tolerance. Growth of the data in Kafka does not slow the system down, as it supports cluster deployments distributed across servers with over a petabyte of storage.
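The following minimal sketch shows this re-processing pattern under stated assumptions: it uses the kafka-python client, a hypothetical retained topic named "events", and two in-memory stand-ins for the output tables; job management and the read-path cut-over are omitted.

from kafka import KafkaConsumer
import json

def run_streaming_job(output_table, from_beginning=False):
    """Run one instance of the streaming job, writing results to output_table."""
    consumer = KafkaConsumer(
        "events",                                    # assumed retained topic
        bootstrap_servers="localhost:9092",          # assumed broker address
        auto_offset_reset="earliest" if from_beginning else "latest",
        enable_auto_commit=False,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        event = message.value
        # The (improved) processing logic goes here; each job writes to its
        # own output table so the two views never overwrite each other.
        output_table[event["key"]] = output_table.get(event["key"], 0) + 1

# Job 1 serves live queries from table_v1. When the code changes, Job 2 is
# started from the beginning of the log and fills table_v2; once it has caught
# up, readers are switched to table_v2 and Job 1 and its view are retired [21].
table_v1, table_v2 = {}, {}
# run_streaming_job(table_v1)                        # current production job
# run_streaming_job(table_v2, from_beginning=True)   # re-processing job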
C. Microservices Architecture
Fully built and deployed BDA solutions often combine components from multiple vendors with open source software, running on physical servers, virtual machines and Docker containers. Application programming interfaces (APIs) are the common method for integrating these functions, which are stitched together into a working pipeline for each data source. A container is similar to a very lightweight virtual machine, and microservices are lighter still. Following current trends in BDA, most analytics pipelines can readily be deployed as immutable microservices. Each microservice executes in its own process or container and communicates autonomously, without depending on other services or on the application as a whole. Microservices commonly adopt open source technologies such as Spark, Cassandra and Kafka [22]. Figure 4 shows a generic microservices architecture diagram as given in [23]; the batch, speed and serving layers can be built on demand as needed.

Figure 4: Microservices Architecture

D. IOT Architecture
With the rise of Industry Revolution 4.0, the combination of IOT and BDA with artificial intelligence is being driven to optimize and automate industrial production. IOT is a data-driven paradigm that uses real-time, pervasively connected sensors, simulations and event logs to deliver intelligent manufacturing analytics over the Internet/Intranet for every area of the factory [24]. IOT devices are deployed in daily operations to deliver operational efficiency, process innovation and environmental benefits, but they also present challenges in terms of large-scale data management, processing and analysis [25]. The architecture consists of four major building blocks: Time Series Store/Database (TSDB), Streaming Message Queue (SMQ), Workflow Orchestration Engine (WOE) and Distributed File System (DFS).

Time Series Store/Database (TSDB): A TSDB is a data management system optimized for time-stamped or time-series data. To process a time-series query, the relevant time-series segment is first located and then retrieved based on a combination of one or more metadata values, which are commonly stored in a relational database such as SQLite, PostgreSQL or MySQL. This mechanism gives a TSDB low-latency access for tracking, monitoring, downsampling and aggregating over time (a minimal downsampling sketch is given after the SMQ discussion below). A TSDB typically supports auto-sharding and horizontal scaling through a store-specific API or a purpose-built connector. There are various open source TSDBs, such as Apache Druid (https://druid.apache.org/), InfluxDB (https://www.influxdata.com/), OpenTSDB (http://opentsdb.net/) and others.

Streaming Message Queue (SMQ): Machine-to-machine communication is established with publish-subscribe messaging protocols to the servers, such as MQTT (Message Queue Telemetry Transport), XMPP (Extensible Messaging and Presence Protocol), DDS (Data Distribution Service) and others. The SMQ also handles filtering, extraction and simple or complex calculations during stream processing.
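A minimal publish-subscribe sketch of this pattern, assuming the widely used paho-mqtt client (1.x callback API), a local Mosquitto-style broker and a hypothetical sensor topic, is shown below; the filtering threshold stands in for the simple calculations mentioned above.

import json
import paho.mqtt.client as mqtt  # assumes the paho-mqtt package (1.x callback API)

BROKER = "localhost"                    # assumed local MQTT broker
TOPIC = "factory/line1/temperature"     # hypothetical sensor topic

def on_connect(client, userdata, flags, rc):
    # Subscribe once the connection to the broker is established.
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Simple filtering/extraction performed inside the streaming layer.
    reading = json.loads(msg.payload.decode("utf-8"))
    if reading["celsius"] > 80.0:
        print("over-temperature alert:", reading)

subscriber = mqtt.Client()
subscriber.on_connect = on_connect
subscriber.on_message = on_message
subscriber.connect(BROKER, 1883, keepalive=60)
subscriber.loop_start()

# The sensor (publisher) side pushes readings to the same topic.
publisher = mqtt.Client()
publisher.connect(BROKER, 1883, keepalive=60)
publisher.loop_start()
publisher.publish(TOPIC, json.dumps({"sensor": "s-42", "celsius": 85.2}))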
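Returning to the TSDB role described above, the sketch below uses pandas purely as a stand-in to illustrate downsampling and aggregation over time; the sensor readings and the one-minute window are assumptions, and a production system would push this work down to the TSDB itself.

import pandas as pd

# Hypothetical raw sensor readings, indexed by timestamp (the time-series segment).
readings = pd.DataFrame(
    {"celsius": [71.2, 71.9, 84.8, 85.3, 72.0, 71.5]},
    index=pd.to_datetime([
        "2020-05-30 04:00:05", "2020-05-30 04:00:35",
        "2020-05-30 04:01:10", "2020-05-30 04:01:40",
        "2020-05-30 04:02:15", "2020-05-30 04:02:45",
    ]),
)

# Downsample to one-minute buckets and aggregate, as a TSDB would do when
# serving a monitoring or dashboard query.
per_minute = readings["celsius"].resample("1min").agg(["mean", "max"])
print(per_minute)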
Workflow Orchestration Engine (WOE): The WOE is designed to orchestrate enterprise-level data processing operations, providing flow-based control, scheduling and data provenance that are secure and durable for IOT and data analytics tasks. The orchestration framework supports distributed clusters and can be extended through plug-ins, and it offers diagrammatic views whose behaviour can be modified from a web browser. Two popular open source orchestration workflow systems are Apache NiFi/MiNiFi (https://nifi.apache.org), written in Java, and Node-RED (https://nodered.org), written in JavaScript on top of the Node.js platform.

Figure 5: IOT Architecture

Distributed File System (DFS): The DFS is designed to store very large data sets reliably and to stream them at high bandwidth to applications. It is highly fault tolerant: the file system replicates, or copies, each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. As a result, data on nodes that crash can be found elsewhere in the cluster, ensuring that processing can continue while the data is recovered. The choice of DFS technology depends on the "brotherhood" of surrounding applications; the most famous open source big data ecosystem is Apache Hadoop [26], and alternatives include Ceph (https://ceph.io), Alluxio (https://github.com/Alluxio/alluxio), OpenIO (https://www.openio.io) and others.

E. NIST Big Data Reference Architecture (NBD-RA)
The National Institute of Standards and Technology (NIST) has taken responsibility within the United States Federal Government for the Big Data Research and Development Initiative. It develops open standards and BDA architecture to accelerate the adoption of the most secure and effective big data techniques and technologies. The White House announced this initiative on March 28, 2012 [27], starting with six federal departments and agencies and involving more than 80 projects. The NBD-RA is an elastic BDA architecture design whose conceptual model can be vendor-neutral, technology-neutral and infrastructure-agnostic. The system consists of five logical functional components: System Orchestrator, Data Provider, Big Data Application Provider, Big Data Framework Provider and Data Consumer. Two dimensions, "Management" and "Security and Privacy", overlay these five components and provide services and functionality for BDA-specific tasks. Figure 6 shows the NBD-RA architecture as given in [28].

Figure 6: NBD-RA Architecture

V. TRENDS AND ANALYSIS
The architectures discussed above provide a structure to be filled with a set of generic tools; however, choosing the technologies to use and integrate involves considerable complexity. Firstly, the BDA system may be deployed on-premise, in the cloud or in a hybrid of the two. Secondly, the data processing, analytics, security and governance technologies to be developed may be open source, commercial or a hybrid. Finally, the return on investment (ROI) of the big data system is driven by valuable AI use cases such as descriptive, predictive and prescriptive analytics. An on-premise BDA system provides high transfer bandwidth and more flexibility in accessing the system, but it requires a large capital outlay and high maintenance costs. Alternatively, big data in the cloud, or in a hybrid cloud, offers high availability ranging from 99.9% to 99.99999% (roughly 8.8 hours of downtime per year at the low end versus only about three seconds at the high end).
Cloud offerings also promise storage expandability from gigabytes to petabytes [29]. Native Hadoop options are available in public clouds such as AWS, Google, Oracle, AliCloud and others, but they may not be the best fit for every application, because virtualized Hadoop runs intensive workloads more slowly [30] [31]. In general, all of these considerations require a comprehensive requirements analysis and cost budgeting. Hadoop is one BDA ecosystem, but it is not the only choice. Elasticsearch (the "Elastic" stack) is an alternative BDA solution specialized for web search, network traffic and log analysis; it is based on Apache Lucene for low-level indexing and analysis [32] [33]. NoSQL document-oriented data stores are popular and in demand nowadays; MongoDB is one of the most widely used and provides durability through its write-ahead logging techniques [34] [35], as illustrated in the sketch below. Apache Cassandra is a popular wide-column store that enables continuous availability, tremendous scale and data distribution across multiple data centers and cloud availability zones [36]; it has been deployed at technology giants such as Facebook, Netflix, Twitter, eBay and others.
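As a small illustration of the MongoDB durability point above, the following hedged PyMongo sketch requests journaled (write-ahead-logged) acknowledgement for a write; the connection string, database and collection names are assumptions for illustration only.

from pymongo import MongoClient, WriteConcern

# Assumed local MongoDB instance; database and collection names are illustrative.
client = MongoClient("mongodb://localhost:27017")
collection = client["bda_demo"]["sensor_events"]

# j=True asks the server to acknowledge the write only after it has been
# committed to the on-disk journal (write-ahead log), trading latency for
# durability.
journaled = collection.with_options(write_concern=WriteConcern(w=1, j=True))
result = journaled.insert_one({"sensor": "s-42", "celsius": 85.2})
print(result.acknowledged)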
Beyond these, there is a variety of cloud computing data technologies, such as Google Bigtable, Amazon S3 object storage, Azure Cosmos DB, Alibaba Cloud ApsaraDB and others.

AI analytics is important to every aspect of the organization because it can improve ROI at every level. Implemented analytics use cases need to be built around clearly defined issues and the problems businesses face today, improving efficiency, effectiveness and specific outcomes such as customer satisfaction [37]. PwC reports that 59% of executives say big data at their company would be improved through the use of AI [38]. Developing best practices for quick ROI and momentum at scale is critical: building AI models and reusable building blocks of data sets, and working across organizational boundaries, drives more valuable AI use cases [39].

VI. CONCLUSION
Nowadays, data is the fuel of an organization's vehicle for driving business transformation, and we are witnessing the growing importance of the hidden value of data. This paper therefore contributes to several important aspects of exploring BDA: the "V" concepts, the feature model, and the key architectural components with their trade-offs. BDA has become one of the main pillars of Industry Revolution 4.0, as data analytics and AI play crucial algorithmic roles in producing accurate results.

REFERENCES
[1] H. Asaadi, D. Khaldi, and B. Chapman, "A Comparative Survey of the HPC and Big Data Paradigms: Analysis and Experiments," in 2016 IEEE International Conference on Cluster Computing (CLUSTER), 2016, pp. 423–432.
[2] J. Yang, "From Google File System to Omega: A Decade of Advancement in Big Data Management at Google," in 2015 IEEE First International Conference on Big Data Computing Service and Applications, 2015, pp. 249–255.
[3] B. Marr, Big Data in Practice. John Wiley & Sons, Inc., 2016.
[4] F. W. Kistermann, "The Invention and Development of the Hollerith Punched Card: In Commemoration of the 130th Anniversary of the Birth of Herman Hollerith and for the 100th Anniversary of Large Scale Data Processing," Ann. Hist. Comput., vol. 13, no. 3, pp. 245–259, Jul. 1991.
[5] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks," Commun. ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970.
[6] J. Peeters, "Early MRP Systems at Royal Philips Electronics in the 1960s and 1970s," IEEE Ann. Hist. Comput., vol. 31, no. 2, pp. 56–69, Apr. 2009.
[7] R. Brueckner, "Where Did Big Data Come From?," insideBIGDATA, 2013. [Online]. Available: https://insidebigdata.com/2013/02/03/where-did-big-data-come-from/. [Accessed: 12-Aug-2019].
[8] M. Lesk, "How Much Information Is There In the World?," 1997. [Online]. Available: http://www.lesk.com/mlesk/ksg97/ksg.html. [Accessed: 12-Aug-2019].
[9] B. Stone, "The Education of Google's Larry Page," Bloomberg Businessweek, Apr. 2012.
[10] K. Ashton, "That Internet of Things," RFID J., 2009.
[11] A. Petrillo, "Fourth Industrial Revolution: Current Practices, Challenges, and Opportunities," in Digital Transformation in Smart Manufacturing, R. Cioffi and F. De Felice, Eds. IntechOpen, 2018.
[12] D. Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety," 2001.
[13] S. Yin and O. Kaynak, "Big Data for Modern Industry: Challenges and Trends [Point of View]," Proc. IEEE, vol. 103, no. 2, pp. 143–146, Feb. 2015.
[14] S. Jha, J. Qiu, A. Luckow, P. K. Mantha, and G. C. Fox, "A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures," CoRR, vol. abs/1403.1, 2014.
[15] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," ACM SIGOPS Oper. Syst. Rev., pp. 59–72, 2007.
[16] C. A. Salma, B. Tekinerdogan, and I. N. Athanasiadis, "Feature Driven Survey of Big Data Systems," in Proceedings of the International Conference on Internet of Things and Big Data, 2016, pp. 348–355.
[17] K. C. Kang, S. G. Cohen, J. A. Hess, W. E. Novak, and A. S. Peterson, "Feature-Oriented Domain Analysis (FODA) Feasibility Study," Pittsburgh, 1990.
[18] N. Marz, "How to Beat the CAP Theorem," 2011. [Online]. Available: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html. [Accessed: 13-Aug-2019].
[19] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications, 2015.
[20] J. Kreps, "Questioning the Lambda Architecture," O'Reilly Media, 2014. [Online]. Available: https://www.oreilly.com/ideas/questioning-the-lambda-architecture. [Accessed: 13-Aug-2019].
[21] A. Kumar, Architecting Data-Intensive Applications. Packt Publishing, 2018.
[22] G. Vetticaden, "Building Secure and Governed Microservices with Kafka Streams," Cloudera, 2018. [Online]. Available: https://blog.cloudera.com/building-secure-and-governed-microservices-with-kafka-streams/. [Accessed: 12-Aug-2019].
[23] J. Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases. Cisco Press, 2018.
[24] J. Davis, T. Edgar, J. Porter, J. Bernaden, and M. Sarli, "Smart Manufacturing, Manufacturing Intelligence and Demand-Dynamic Performance," Comput. Chem. Eng., vol. 47, pp. 145–156, 2012.
[25] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, "Deep Learning for IoT Big Data and Streaming Analytics: A Survey," IEEE Commun. Surv. Tutorials, vol. 20, no. 4, pp. 2923–2960, 2018.
[26] Z. Li and H. Shen, "Measuring Scale-Up and Scale-Out Hadoop with Remote and Local File Systems and Selecting the Best Platform," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3201–3214, Nov. 2017.
[27] T. Kalil, "The White House Office of Science and Technology Policy: Big Data is a Big Deal," Office of Science and Technology Policy (OSTP) Blog, 2012. [Online]. Available: https://obamawhitehouse.archives.gov/blog/2012/03/29/big-data-big-deal. [Accessed: 27-Aug-2019].
[28] "NIST Big Data Interoperability Framework: Volume 8, Reference Architecture Interfaces," Gaithersburg, MD, Jun. 2018.
[29] A. Zarrabi, E. K. Karuppiah, C. H. Ngo, K. K. Yong, and S. See, "Gravitational Search Algorithm using CUDA," in IEEE Parallel and Distributed Computing, Applications and Technologies (PDCAT 2014), 2014, pp. 193–198.
[30] D. Nuñez, I. Agudo, and J. Lopez, "Delegated Access for Hadoop Clusters in the Cloud," in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, 2014, pp. 374–379.
[31] M. E. Wendt, "Cloud-based Hadoop Deployments: Benefits and Considerations," 2014.
[32] J. Rosenberg, J. B. Coronel, J. Meiring, S. Gray, and T. Brown, "Leveraging Elasticsearch to Improve Data Discoverability in Science Gateways," in Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), 2019, pp. 19:1–19:5.
[33] B. Dageville et al., "The Snowflake Elastic Data Warehouse," in Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 215–226.
[34] R. R. Shetty, A. M. Dissanayaka, S. Mengel, L. Gittner, R. Vadapalli, and H. Khan, "Secure NoSQL Based Medical Data Processing and Retrieval: The Exposome Project," in Companion Proceedings of the 10th International Conference on Utility and Cloud Computing, 2017, pp. 99–105.
[35] B. Sendir, M. Govindaraju, R. Odaira, and P. Hofstee, "Low Latency and High Throughput Write-Ahead Logging Using CAPI-Flash," IEEE Trans. Cloud Comput., p. 1, 2019.
[36] A. Lakshman and P. Malik, "Cassandra: A Decentralized Structured Storage System," SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010.
[37] S. Earley, "Executive Roundtable Series: Driving Higher ROI and Organizational Change," IT Prof., vol. 17, no. 6, pp. 60–64, Nov. 2015.
[38] "2018 AI Predictions: 8 Insights to Shape Business Strategy," PwC, 2018.
[39] "2019 AI Predictions: Six AI Priorities You Can't Afford to Ignore," PwC, 2019.