March Towards Big Data
Nasir Raheem
PMP, CSM, OCP, Certified Big Data Analyst
ISBN 978-1-64316-293-5
• Introduction to Big Data
o Rapid Growth of Big Data
o Big Data Definition
o Big Data Projects
o Business Value of Big Data
• Big Data Implementation
o Hadoop Infrastructure
o Hadoop Eco System
o Hadoop JVM Framework
o Hadoop Distributed File System
o Tutorial: Hadoop Distributed File System Commands
o Tutorial: Distribution of input data in HDFS
o MapReduce Software
o MapReduce Processing
o Tutorial: Pseudo Code for Map Reduce JAVA Classes
Table of Contents
• Big Data Migration
o Apache SQOOP
o SQOOP commands
o Hive arguments used by SQOOP
o SQOOP Architecture
o Tutorial: Data Migration from Oracle to HIVE database
• Big Data Ingestion / Big Data Management
o Tools & Techniques
o Cloudera HIVE Environment
o High Level Tasks to set up Informatica BDM, Cloudera HIVE and TABLEAU
o Integrated Use of Data Ingestion, Data Management and Data Visualization Tools
Table of Contents
• Big Data Visualization
o Success Factors for Big Data Analytics
o TABLEAU - Step forward in Data Analytics
o 3-D Dashboards (Fast, Wide and Deep)
o TABLEAU Architecture
• Cloud Computing
o Cloud Technology at the rescue of Big Data
o Cloud Computing versus Hadoop Processing
o Cloud Service most suited for Big Data
• End of Course Quiz
Table of Contents
MODULE 1
Introduction to Big Data
• With the advent of Cloud architecture that provides a high level of cyber-security, clustered servers that use powerful
Compute, Network and Storage hardware, and IoT (Internet of Things) devices that generate massive amounts of click-stream
data, the stage is set for Big Data (see the chart below for its rapid growth projections).
• The health care industry tops most others in big data adoption. Why? Big data analytics technology plays an important role in
health care. By quickly analyzing large amounts of information, both structured and unstructured, health care providers
can deliver lifesaving diagnoses or treatment options almost immediately.
Rapid Growth of Big Data
• Big Data is a term that describes the large volume of data – whether structured (e.g. relational databases), semi-structured (e.g. key-value
pairs as in a Mongo database) or unstructured (e.g. emails, text files, log files, videos, etc.) – that inundates a business on a day-to-day
basis. But it is not the amount of data that is important; it is what organizations do with that data that matters. Big data can be analyzed
for insights that lead to better decisions and strategic business moves.
• In summary, Raw Big Data = Structured Data + Semi-structured Data + Unstructured Data
• Collecting Raw Big Data into a Data Lake is one thing, but finding the business value hidden in heaps of structured and unstructured data
is quite another. In today’s computing world, data is information that has been translated into a form that is efficient for movement or
processing. In other words, data is information converted into binary digital form.
• In summary, Information = Raw Big Data + Business Rules + Quality
Big Data Definition
Figure 1: Big Data Implementation – Information DRIVES Business Intelligence; the implementation building blocks are
Infrastructure (Hadoop Distributed File System), Processing (MapReduce) and Tools & Techniques (Hadoop Eco System).
• Big Data is data whose scale, diversity and complexity require new architecture, techniques and algorithms to manage it and
extract the hidden knowledge from it. The majority of big data projects fall into one of two broad categories:
o Storage driven
• For organizations running corporate internal applications in a private data center that are now faced with large volumes of structured
and unstructured data, an in-house Hadoop infrastructure is well suited to what would primarily be data-storage-driven
projects.
o Application driven
• For organizations where broad accessibility and frequent use are not necessary, as in healthcare applications, big data analytics
offers promise and potential where processing speed, such as that provided by MapReduce processing, is the primary
concern. These projects will integrate with cloud providers’ security measures – start by integrating the big data
platform with the encryption and security practices inherent in the cloud environment. Here, it will be important to offer a
centralized control panel to manage data governance. Organizations can take further steps to minimize access issues by
opting for Single Sign-On (SSO) when setting up new accounts.
• It is important to use the right tool for the job at hand. 76% of organizations actively leverage at least three big data open
source engines for data preparation, machine learning, and reporting and analysis workloads (e.g. Apache Hadoop/Hive,
Apache Spark, and Presto). Greater productivity and automation are priorities. The data shows that as the size of
implementations grows, data-driven companies constantly seek to improve the ratio of users per administrator, thereby reducing
the incremental cost of growth and increasing user self-sufficiency. -1/
• Source: Qubole Big Data Activation Report -1/
Big Data Projects
• We need to understand that earlier systems catered to small to mid-size structured data that lent itself well to
Business Intelligence, such as Data Mining and Ad-hoc Queries or Batch Reporting. With the addition of new tools
and techniques, the Big Data Eco system now also supports Optimizations and Predictive Analysis, including complex statistical
analysis.
Business Value of Big Data
Figure 2: Business Value of Big Data – the chart plots Complexity (Low to High) against Business Value (up to High):
Structured Data delivers Business Intelligence at the lower end, while Big Data delivers Predictive Analysis, Machine Learning
and 3-D Dashboards at the high end of both axes.
• There are 9 ways to extract maximum value from Big Data -2/
o The primary path to business value is through analytics. Note that this involves advanced forms of analytics
such as those based on data mining, statistical analysis, natural language processing, and extreme SQL.
Unlike reporting and OLAP, these enable data exploration and discovery analytics with big data. Even so,
reporting and OLAP won't go away because they are still valuable elsewhere.
o Explore big data to discover new business opportunities. After all, many sources of big data are new to you,
and many represent new channels for interacting with your customers and partners. As with any new
source, big data merits exploration. Data exploration leads to patterns and new facts your business didn't
know, such as new customer base segments, customer behaviors, forms of churn, and root causes for
bottom line costs.
o Start analyzing the big data you've already hoarded. Yes, it's true: many firms have "squirreled away" large
datasets because they sensed business value yet didn't know how to get value out of big data. Depending
on your industry, you probably have large datasets of Web site logs, which can be "sessionized" and
analyzed to understand Web site visitor behavior. Likewise, quality assurance data from manufacturing leads
to more reliable products and better leverage with suppliers, and RFID data can solve the mysteries of
product movement through supply chains.
Business Value of Big Data
o Focus on analyzing the type of big data that's valuable to your industry. The type and content of big data
can vary by industry and thus have different value propositions for each industry. This includes call detail
records (CDRs) in telecommunications, RFID in retail, manufacturing, and other product-oriented industries
as well as sensor data from robots in manufacturing (especially automotive and consumer electronics).
o Make a leap of faith into unstructured big data. This information is largely text expressing human language,
which is very different from the relational data you work with most, so you'll need new tools for natural
language processing, search, and text analytics. These can provide visibility into text-laden business
processes, such as the claims process in insurance, medical records in healthcare, call-center and help-desk
applications in any industry, and sentiment analysis in customer-oriented businesses.
o Expand your existing customer analytics with social media data. Customers can influence each other by
commenting on brands, reviewing products, reacting to marketing campaigns, and revealing shared
interests. Social big data can come from social media Web sites as well as from your own channels that
enable customers to voice opinions and facts. It's possible to use predictive analytics to discover patterns
and anticipate product or service issues. You might likewise measure share of voice, brand reputation,
sentiment drivers, and new customer segments.
Business Value of Big Data
o Complete your customer views by integrating big data into them. Big data (when integrated with older
enterprise sources) can broaden 360-degree views of customers and other business entities (products,
suppliers, partners), from hundreds of attributes to thousands. The added granular detail leads to more
accurate customer base segmentation, direct marketing, and other customer analytics.
o Improve older analytic applications by incorporating big data. Big data can enlarge and broaden data
samples for older analytic applications. This is especially the case with analytic technologies that depend on
large samples – such as statistics or data mining – when applied to fraud detection, risk management, or
actuarial calculations.
o Accelerate the business into real-time operation by analyzing streaming big data. Applications for real-time
monitoring and analysis have been around for many years in businesses that offer an energy utility,
communication network, or any grid, service, or facility that demands 24x7 operation. More recently, a
wider range of organizations are tapping streaming big data for applications ranging from surveillance (cyber
security, situational awareness, fraud detection) to logistics (truck or rail freight, mobile asset management,
just-in-time inventory). Big data analytics is still mostly executed in batch and offline today, but it will move
into real time as users and technologies mature.
• Source: Philip Russom, Ph.D., Transforming Data With Intelligence – Research Paper -2/
Business Value of Big Data
MODULE 2
Big Data Implementation
• The Hadoop Infrastructure is based on an architecture of clusters. A cluster is a configuration of nodes that interact to
perform a specific task. In the past, such clusters have provided high-performance processing solutions for aerospace and
defense.
• Hadoop Distributed File System (HDFS) supports the Velocity, Variety, Volume and Veracity of Big Data as it is laid
across the NameNode and DataNodes. We cannot over-emphasize the importance of Veracity, as it relates to
the accuracy of data in predicting business value. A forward-looking business may build a Smart Data Engine that
analyzes massive amounts of data in real time, quickly assesses the value of a customer and identifies the potential to
provide additional offers to that customer, i.e. make the Point of Service the Point of Sale.
• With the advent of the commercial Internet of Things (IoT), clusters have formed the backbone of major
commercial entities such as the social media powerhouse Facebook, whose clusters spread over thousands of nodes.
• With the advent of the Cloud, it is now possible to have multi-layer clusters such as:
o Virtual Clusters (clusters of Virtual Machines, such as the Java Virtual Machine (JVM), that can run on multiple operating systems
such as Microsoft Windows, Linux and Mac). To minimize expenses, install Vagrant or Puppet on Red Hat or Ubuntu to set
up a cluster of Virtual Machines.
o Cloud Clusters that are massively scalable.
o Real Clusters that deploy a core application stack, such as the Amazon S3 platform.
Hadoop Infrastructure
Figure 3: Cluster Design Considerations
COMPUTE: Multi-core CPUs, Parallel Processing Algorithms, Load Balancing, Synchronous Multi-master Cluster, Cache Memory
NETWORK: Data Switch bandwidth of 100 Gbps Ethernet, TCP/IP protocol, TCP packet routing, High-speed fiber backbone
STORAGE: Low latency / Fast / High Volume, Terabyte RAID drives, Flash-based Solid State Drives
• Hadoop is open-source software written in JAVA. It provides a framework for a scalable, fault-tolerant, distributed system
to store and process data across a cluster of commodity servers called nodes. The underlying goals are Scalability,
Reliability and Economics.
• Hadoop comprises:
o A set of operations & interfaces for distributed file systems
o The Hadoop Distributed File System (HDFS), which facilitates running MapReduce jobs over a cluster of commodity
servers
o The Mahout tool to facilitate machine learning
o Processing tools contained in the Cloudera Software Distribution bundle:
• HIVE: Uses an interpreter to transform SQL queries into MapReduce code
• PIG: Scripting language that creates MapReduce code
• SQOOP: Serves as the data exchange between an RDBMS and the Hive database
• IMPALA: Uses SQL to access data directly from HDFS
o In the Hadoop Version 2 infrastructure, a master NameNode where the Resource Manager runs and which holds the
metadata about the underlying slave DataNodes
o In the Hadoop Version 2 infrastructure, several DataNodes where the actual fault-tolerant application data resides,
along with a Node Manager that keeps track of the jobs running on each DataNode and provides feedback to the
Resource Manager running on the NameNode.
Hadoop Eco System
Figure 4: Hadoop JVM Framework – in the JAVA Compile Time Environment, JAVA source code is compiled into byte code;
in the JAVA Run Time Environment, that byte code runs on a Windows JVM, Linux JVM or Mac JVM, yielding the
corresponding Windows, Linux or Mac executable.
• Hadoop Distributed File System (HDFS) supports the Velocity, Variety, Volume and Veracity of Big Data. A thorough
understanding of how this data is laid out and managed is of paramount importance. The metadata and actual data reside on
specific directories on the HDFS server. An example of one NameNode directory (also called the Master Node, which stores metadata)
and ten DataNode directories (also called Slave Nodes, which store the actual application data) is given below:
o Metadata: /home/osboxes/lab/hdfs/namenodep
o Actual Data: /home/osboxes/lab/hdfs/datan1, /home/osboxes/lab/hdfs/datan2,
/home/osboxes/lab/hdfs/datan3, /home/osboxes/lab/hdfs/datan4,
/home/osboxes/lab/hdfs/datan5, /home/osboxes/lab/hdfs/datan6,
/home/osboxes/lab/hdfs/datan7, /home/osboxes/lab/hdfs/datan8,
/home/osboxes/lab/hdfs/datan9, /home/osboxes/lab/hdfs/datan10
• The DataNode directories shown above on the HDFS server use symbolic links to point to the directories on the Slave Nodes where the
application data resides. Again, metadata about files on HDFS is stored and managed on the NameNode. This includes information
such as permissions and ownership of those files, the location of the various blocks of each file on the DataNodes, etc.
• Common file-handling HDFS commands:
o List all files under the present working directory: hadoop fs -ls
o Create a new directory under the root directory: hadoop fs -mkdir /new-directory
o Create a new directory under the present working directory: hadoop fs -mkdir ./new-directory
o Show the aggregate size of the new directory: hadoop fs -du -s /new-directory
o Count the directories, files and bytes under the /etc directory: hadoop fs -count /etc
Hadoop Distributed File System
Remove the “sample” file from the HDFS directory: hadoop fs -rm /hdfs/cloudera/sample
System Response: Deleted sample
Remove the “bigdata” directory from the HDFS directory: hadoop fs -rm -r /hdfs/cloudera/bigdata
System Response: Deleted bigdata
Copy a file from the Local File System to HDFS
The syntax is: hadoop fs -copyFromLocal <local file> URI
where URI is a string of characters used to identify a resource over the network
Example: hadoop fs -copyFromLocal /home/cloudera/hadooptraining /hdfs/cloudera
where local file = /home/cloudera/hadooptraining and URI = /hdfs/cloudera
Check if the file copy to the URI worked: hadoop fs -ls /hdfs/cloudera/hadooptraining
Copy a complete directory from the Local File System to HDFS
The syntax is: hadoop fs -put <local directory> <hdfs directory>
Example: hadoop fs -put /home/cloudera/bigdata /hdfs/cloudera
where local directory = /home/cloudera/bigdata and hdfs directory = /hdfs/cloudera
Check if the ‘bigdata’ directory was copied: hadoop fs -ls /hdfs/cloudera
Copy a complete directory from HDFS to the Local File System
The syntax is: hadoop fs -get -crc <HDFS directory> <local directory>
Example: hadoop fs -get -crc /hdfs/cloudera/bigdata /home/cloudera
where HDFS directory = /hdfs/cloudera/bigdata and local directory = /home/cloudera
Check if the ‘bigdata’ directory was copied: hadoop fs -ls /home/cloudera
Tutorial: Hadoop Distributed File System Commands
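Note: the same file-handling operations can also be performed programmatically through the Hadoop FileSystem JAVA API. The sketch below is for illustration only; it reuses the example paths from the tutorial above (/hdfs/cloudera, /home/cloudera/hadooptraining), which you would adjust to your own cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOperations {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath to locate the NameNode
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop fs -mkdir /new-directory
    fs.mkdirs(new Path("/new-directory"));

    // Equivalent of: hadoop fs -copyFromLocal /home/cloudera/hadooptraining /hdfs/cloudera
    fs.copyFromLocalFile(new Path("/home/cloudera/hadooptraining"), new Path("/hdfs/cloudera"));

    // Equivalent of: hadoop fs -ls /hdfs/cloudera
    for (FileStatus status : fs.listStatus(new Path("/hdfs/cloudera"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // Equivalent of: hadoop fs -rm -r /hdfs/cloudera/bigdata (second argument true = recursive)
    fs.delete(new Path("/hdfs/cloudera/bigdata"), true);

    fs.close();
  }
}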
• In order to support fast and efficient parallel processing and built-in fault tolerance provided by HDFS, the input
data is split into ‘n’ partitions and replicated based on the HDFS Replication Factor.
Assumption:
Large Input data: sample.txt
Number of partitions = 4 (a,b,c,d)
HDFS Replication Factor = 3
Figure 5: Input Data Distribution across Data Nodes
Note:
MapReduce runs in parallel on many nodes but runs on each node separately
Tutorial: Distribution of input data in HDFS across Data Nodes
Figure 5 (referenced above) depicts sample.txt being split into four partitions (a, b, c, d), with three replicas of each
partition distributed across the six DataNodes DN1–DN6.
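To verify how HDFS has actually partitioned and replicated a file such as sample.txt, the block layout can also be inspected programmatically. The minimal sketch below uses the Hadoop FileSystem JAVA API; the path /hdfs/cloudera/sample.txt is only an assumed example path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutInspector {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/hdfs/cloudera/sample.txt"); // assumed example path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());

    // One BlockLocation per block (partition); getHosts() lists the DataNodes holding its replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", replicas on: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}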
• MapReduce is the Distributed Data Processing model that runs on a large cluster of servers.
• It uses a parallel algorithm that allows linear scaling and runs in batch-job mode. It suits business
intelligence applications that run on the “Write Once, Read Many” (WORM) paradigm.
• MapReduce software is included in the Cloudera distribution package, which contains a number of other
Hadoop tools that are designed to run on clustered nodes under HDFS.
Figure 6: Cloudera Software Distribution Bundle – HIVE, PIG, SQOOP and IMPALA sit on top of MapReduce, which runs on HDFS.
MapReduce Software
• MapReduce is a programming model and an associated implementation for processing and generating big data sets with
a parallel, distributed algorithm that runs on a cluster of nodes.
• Using the Oracle VM VirtualBox Manager, install Cloudera in Oracle VM VirtualBox.
• Download URL: https://download.virtualbox.org/virtualbox/5.2.8/Virtualbox-5.2.8-121009.Win.exe
• The system requirements are as follows:
o 64-bit Windows Operating System
o VM Box / Player 4.x or higher
o Host Operating System must have at least 8 GB memory
o VM needs at least 4 GB RAM
• Note:
Usage of MapReduce in Business Intelligence: in terms of processing, Business Intelligence runs on the WORM
paradigm, which is consistent with MapReduce processing. The Informatica Big Data Management tool therefore uses an ETL
connector for HIVE that runs MapReduce to deliver the industry’s first and most comprehensive solution to natively
ingest, integrate, clean, govern, and secure big data workloads on the Hadoop platform.
• As we know, Apache Hadoop stores and processes data in a distributed fashion. In Hadoop Version 2, the
namenode and resourcemanager daemons are master daemons, whereas the datanode and nodemanager
daemons are slave daemons. This framework of NameNode and underlying DataNodes takes care of scheduling
tasks, monitoring them and re-executing the failed tasks.
• A MapReduce job usually splits the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to
the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
• The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing
by marshalling the distributed nodes, running the various tasks in parallel, managing all communications and data
transfers between the various parts of the system, and providing for redundancy and fault tolerance as defined
by the Replication Factor in HDFS configuration.
• A MapReduce program is composed of a mapper (JAVA class), which performs filtering and sorting (such as
sorting students by first name into queues, one queue for each name), and a reducer (JAVA class), which
performs a summary operation (such as counting the number of students in each queue, yielding name
frequencies et cetera).
MapReduce Processing
• Mapper Class
o Maps input key / value pairs to a set of intermediate key / value pairs
o Exchanges intermediate key / value pairs to where they are required by the reducers. This process is called
Shuffling
• E.g. after mapping and shuffling the “# of Walmart stores” input, we have the following key/value pairs in the
Mappers (a JAVA sketch of this example follows the Driver Class description below):
M1: Los Angeles/100, M2: New York/110, M3: San Jose/50, M4: Chicago/80, M5: Houston/70, M6: Miami/75,
M7: Princeton/75, M8: Minneapolis/70
• Reducer Class
o Reduces a set of intermediate values which share just one key. All of the intermediate values for this key are
presented to the Reducer.
• E.g. after reduction, we have the following values in the Reducers:
R1 (West): 150, R2 (East): 185, R3 (South): 145, R4 (North): 150
• Driver Class
o Configures and runs Mapper and Reducer classes
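The Walmart-stores example above can be expressed as a Mapper and Reducer pair. The sketch below is illustrative only: it assumes the input records are plain text lines such as "Los Angeles,100" and uses a hypothetical cityToRegion lookup table to derive the region key; after shuffling, all store counts for one region arrive at a single Reducer, which sums them.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StoreCountByRegion {

  // Mapper: turns "Los Angeles,100" into (West, 100). cityToRegion is a hypothetical lookup for this sketch.
  public static class RegionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Map<String, String> cityToRegion = new HashMap<>();
    static {
      cityToRegion.put("Los Angeles", "West");  cityToRegion.put("San Jose", "West");
      cityToRegion.put("New York", "East");     cityToRegion.put("Princeton", "East");
      cityToRegion.put("Houston", "South");     cityToRegion.put("Miami", "South");
      cityToRegion.put("Chicago", "North");     cityToRegion.put("Minneapolis", "North");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      String region = cityToRegion.get(fields[0].trim());
      if (region != null && fields.length > 1) {
        context.write(new Text(region), new IntWritable(Integer.parseInt(fields[1].trim())));
      }
    }
  }

  // Reducer: after the shuffle, all counts for one region key are summed,
  // e.g. West = 100 (Los Angeles) + 50 (San Jose) = 150.
  public static class RegionReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

A Driver class that configures and submits such a Mapper/Reducer pair as a job is shown in the JAVA sketch following the pseudo-code tutorial below.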
• Given: address, zip, city, house_value. Calculate the average house value for each zip code.
• The pseudo code is as follows (the first four statements are the equivalent Pig Latin script, followed by pseudo code for the MAPPER, SHUFFLER and REDUCER; a possible JAVA translation follows this tutorial):
inpt = load 'data.txt' as (address:chararray, zip:chararray, city:chararray, house_value:long);
grp = group inpt by zip;
average = foreach grp generate FLATTEN(group) as (zip), AVG(inpt.house_value) as average_price;
dump average;
MAPPER(record):
  zip_code_key = record['zip'];
  value = {1, record['house_value']};
  emit(zip_code_key, value);
SHUFFLER(zip_code_key, value_list):
  record_num = 0;
  value_sum = 0;
  foreach (value : value_list) {
    record_num += value[0];
    value_sum += value[1];
  }
  value_out = {record_num, value_sum};
  emit(zip_code_key, value_out);
REDUCER(zip_code_key, value_list):
  record_num = 0;
  value_sum = 0;
  foreach (value : value_list) {
    record_num += value[0];
    value_sum += value[1];
  }
  avg = value_sum / record_num;
  emit(zip_code_key, avg);
Tutorial: Pseudo Code for Map Reduce JAVA Classes
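A possible JAVA translation of the pseudo code above is sketched below. It is a minimal, illustrative implementation rather than the author's original code: it assumes the input is a comma-separated text file with the fields address, zip, city and house_value; the Mapper emits (zip, house_value) pairs, the Reducer averages them, and the Driver configures and submits the job. The SHUFFLER step of the pseudo code is handled automatically by the MapReduce framework; a combiner is omitted here because averaging partial averages would require carrying (count, sum) pairs between stages.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageHouseValue {

  // Mapper: parses each record "address,zip,city,house_value" and emits (zip, house_value)
  public static class ValueMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text zip = new Text();
    private final DoubleWritable houseValue = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 4) {
        return; // skip malformed records; house_value is assumed to be numeric
      }
      zip.set(fields[1].trim());
      houseValue.set(Double.parseDouble(fields[3].trim()));
      context.write(zip, houseValue);
    }
  }

  // Reducer: averages all house values that arrive for the same zip code key
  public static class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable average = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      long count = 0;
      double sum = 0.0;
      for (DoubleWritable v : values) {
        sum += v.get();
        count++;
      }
      average.set(sum / count);
      context.write(key, average);
    }
  }

  // Driver: wires the Mapper and Reducer into a Job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "average house value by zip");
    job.setJarByClass(AverageHouseValue.class);
    job.setMapperClass(ValueMapper.class);
    job.setReducerClass(AverageReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS path to data.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}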

Contenu connexe

Tendances

Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data BSP Media Group
 
Supply chain management
Supply chain managementSupply chain management
Supply chain managementmuditawasthi
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data Spain
 
Big data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingBig data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingMartyn Richard Jones
 
Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesEditor IJCATR
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 

Tendances (20)

Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data
 
5 Big Data Use Cases for 2013
5 Big Data Use Cases for 20135 Big Data Use Cases for 2013
5 Big Data Use Cases for 2013
 
Big data
Big dataBig data
Big data
 
Supply chain management
Supply chain managementSupply chain management
Supply chain management
 
Big data analysis
Big data analysisBig data analysis
Big data analysis
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Unit 2
Unit 2Unit 2
Unit 2
 
Unlocking big data
Unlocking big dataUnlocking big data
Unlocking big data
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
 
Big data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingBig data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data Warehousing
 
National Conference - Big Data - 31 Jan 2015
National Conference - Big Data - 31 Jan 2015National Conference - Big Data - 31 Jan 2015
National Conference - Big Data - 31 Jan 2015
 
Big data
Big dataBig data
Big data
 
Introduction to BigData
Introduction to BigData Introduction to BigData
Introduction to BigData
 
Big data-analytics-ebook
Big data-analytics-ebookBig data-analytics-ebook
Big data-analytics-ebook
 
Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New Challenges
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Big data
Big dataBig data
Big data
 

Similaire à March Towards Big Data - Big Data Implementation, Migration, Ingestion, Management, & Visualization

Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
Guide to big data analytics
Guide to big data analyticsGuide to big data analytics
Guide to big data analyticsGahya Pandian
 
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docx
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docxHow Analytics Has Changed in the Last 10 Years (and How It’s Staye.docx
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docxpooleavelina
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data miningEmran Hossain
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 
Getting down to business on Big Data analytics
Getting down to business on Big Data analyticsGetting down to business on Big Data analytics
Getting down to business on Big Data analyticsThe Marketing Distillery
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataPrakalp Agarwal
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notesMohit Saini
 
Keyrus US Information
Keyrus US InformationKeyrus US Information
Keyrus US InformationJulian Tong
 
TDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWTDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWJeannette Browning
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptxShambhavi Vats
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunitiesBigdata Meetup Kochi
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperImpetus Technologies
 

Similaire à March Towards Big Data - Big Data Implementation, Migration, Ingestion, Management, & Visualization (20)

Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
Guide to big data analytics
Guide to big data analyticsGuide to big data analytics
Guide to big data analytics
 
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docx
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docxHow Analytics Has Changed in the Last 10 Years (and How It’s Staye.docx
How Analytics Has Changed in the Last 10 Years (and How It’s Staye.docx
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data mining
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Getting down to business on Big Data analytics
Getting down to business on Big Data analyticsGetting down to business on Big Data analytics
Getting down to business on Big Data analytics
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Keyrus US Information
Keyrus US InformationKeyrus US Information
Keyrus US Information
 
Keyrus US Information
Keyrus US InformationKeyrus US Information
Keyrus US Information
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
TDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWTDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DW
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptx
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunities
 
Big data
Big dataBig data
Big data
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 

Plus de Experfy

Predictive Analytics and Modeling in Life Insurance
Predictive Analytics and Modeling in Life InsurancePredictive Analytics and Modeling in Life Insurance
Predictive Analytics and Modeling in Life InsuranceExperfy
 
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...Experfy
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep LearningExperfy
 
Apache HBase Crash Course - Quick Tutorial
Apache HBase Crash Course - Quick Tutorial Apache HBase Crash Course - Quick Tutorial
Apache HBase Crash Course - Quick Tutorial Experfy
 
Machine Learning in AI
Machine Learning in AIMachine Learning in AI
Machine Learning in AIExperfy
 
A Gentle Introduction to Genomics
A Gentle Introduction to GenomicsA Gentle Introduction to Genomics
A Gentle Introduction to GenomicsExperfy
 
A Comprehensive Guide to Insurance Technology - InsurTech
A Comprehensive Guide to Insurance Technology - InsurTechA Comprehensive Guide to Insurance Technology - InsurTech
A Comprehensive Guide to Insurance Technology - InsurTechExperfy
 
Health Insurance 101
Health Insurance 101Health Insurance 101
Health Insurance 101Experfy
 
Financial Derivatives
Financial Derivatives Financial Derivatives
Financial Derivatives Experfy
 
AI for executives
AI for executives AI for executives
AI for executives Experfy
 
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...Cloud Native Computing Foundation: How Virtualization and Containers are Chan...
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...Experfy
 
Microsoft Azure Power BI
Microsoft Azure Power BIMicrosoft Azure Power BI
Microsoft Azure Power BIExperfy
 
Sales Forecasting
Sales ForecastingSales Forecasting
Sales ForecastingExperfy
 
Uncertain Knowledge and Reasoning in Artificial Intelligence
Uncertain Knowledge and Reasoning in Artificial IntelligenceUncertain Knowledge and Reasoning in Artificial Intelligence
Uncertain Knowledge and Reasoning in Artificial IntelligenceExperfy
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Experfy
 
Introduction to Healthcare Analytics
Introduction to Healthcare Analytics Introduction to Healthcare Analytics
Introduction to Healthcare Analytics Experfy
 
Blockchain Technology Fundamentals
Blockchain Technology FundamentalsBlockchain Technology Fundamentals
Blockchain Technology FundamentalsExperfy
 
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...Experfy
 
Apache Spark SQL- Installing Spark
Apache Spark SQL- Installing SparkApache Spark SQL- Installing Spark
Apache Spark SQL- Installing SparkExperfy
 
Econometric Analysis | Methods and Applications
Econometric Analysis | Methods and ApplicationsEconometric Analysis | Methods and Applications
Econometric Analysis | Methods and ApplicationsExperfy
 

Plus de Experfy (20)

Predictive Analytics and Modeling in Life Insurance
Predictive Analytics and Modeling in Life InsurancePredictive Analytics and Modeling in Life Insurance
Predictive Analytics and Modeling in Life Insurance
 
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...
Predictive Analytics and Modeling in Product Pricing (Personal and Commercial...
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep Learning
 
Apache HBase Crash Course - Quick Tutorial
Apache HBase Crash Course - Quick Tutorial Apache HBase Crash Course - Quick Tutorial
Apache HBase Crash Course - Quick Tutorial
 
Machine Learning in AI
Machine Learning in AIMachine Learning in AI
Machine Learning in AI
 
A Gentle Introduction to Genomics
A Gentle Introduction to GenomicsA Gentle Introduction to Genomics
A Gentle Introduction to Genomics
 
A Comprehensive Guide to Insurance Technology - InsurTech
A Comprehensive Guide to Insurance Technology - InsurTechA Comprehensive Guide to Insurance Technology - InsurTech
A Comprehensive Guide to Insurance Technology - InsurTech
 
Health Insurance 101
Health Insurance 101Health Insurance 101
Health Insurance 101
 
Financial Derivatives
Financial Derivatives Financial Derivatives
Financial Derivatives
 
AI for executives
AI for executives AI for executives
AI for executives
 
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...Cloud Native Computing Foundation: How Virtualization and Containers are Chan...
Cloud Native Computing Foundation: How Virtualization and Containers are Chan...
 
Microsoft Azure Power BI
Microsoft Azure Power BIMicrosoft Azure Power BI
Microsoft Azure Power BI
 
Sales Forecasting
Sales ForecastingSales Forecasting
Sales Forecasting
 
Uncertain Knowledge and Reasoning in Artificial Intelligence
Uncertain Knowledge and Reasoning in Artificial IntelligenceUncertain Knowledge and Reasoning in Artificial Intelligence
Uncertain Knowledge and Reasoning in Artificial Intelligence
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
 
Introduction to Healthcare Analytics
Introduction to Healthcare Analytics Introduction to Healthcare Analytics
Introduction to Healthcare Analytics
 
Blockchain Technology Fundamentals
Blockchain Technology FundamentalsBlockchain Technology Fundamentals
Blockchain Technology Fundamentals
 
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...
Data Quality: Are Your Data Suitable For Answering Your Questions? - Experfy ...
 
Apache Spark SQL- Installing Spark
Apache Spark SQL- Installing SparkApache Spark SQL- Installing Spark
Apache Spark SQL- Installing Spark
 
Econometric Analysis | Methods and Applications
Econometric Analysis | Methods and ApplicationsEconometric Analysis | Methods and Applications
Econometric Analysis | Methods and Applications
 

Dernier

Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 

Dernier (20)

Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 

March Towards Big Data - Big Data Implementation, Migration, Ingestion, Management, & Visualization

  • 1.
  • 2. March Towards Big Data Nasir Raheem PMP, CSM, OCP, Certified Big Data Analyst ISBN 978-1-64316-293-5
  • 3. • Introduction to Big Data o Rapid Growth of Big Data o Big Data Definition o Big Data Projects o Business Value of Big Data • Big Data Implementation o Hadoop Infrastructure o Hadoop Eco System o Hadoop JVM Framework o Hadoop Distributed File System o Tutorial: Hadoop Distributed File System Commands o Tutorial: Distribution of input data in HDFS o MapReduce Software o MapReduce Processing o Tutorial: Pseudo Code for Map Reduce JAVA Classes Table of Contents
  • 4. • Big Data Migration o Apache SQOOP o SQOOP commands o Hive arguments used by SQOOP o SQOOP Architecture o Tutorial: Data Migration from Oracle to HIVE database • Big Data Ingestion / Big Data Management o Tools & Techniques o Cloudera HIVE Environment o High Level Tasks to set up Informatica BDM, Cloudera HIVE and TABLEAU o Integrated Use of Data Ingestion, Data Management and Data Visualization Tools Table of Contents
  • 5. • Big Data Visualization o Success Factors for Big Data Analytics o TABLEAU - Step forward in Data Analytics o 3-D Dashboards (Fast, Wide and Deep) o TABLEAU Architecture • Cloud Computing o Cloud Technology at the rescue of Big Data o Cloud Computing versus Hadoop Processing o Cloud Service most suited for Big Data • End of Course Quiz Table of Contents
  • 7. • With the advent of Cloud architecture that provides a high level of Cyber-security, clustered servers that use powerful Compute, Network and Storage hardware and IoT (Internet of Things) that generates massive amounts of Click-stream data, the stage is all set for Big Data (see below chart on its rapid growth projections). • Health care industry tops most others in big data adoption. Why? Big data analytics technology plays important role in heath care. By quickly analyzing large amounts of information, both structured and unstructured, health care providers can provide lifesaving diagnoses or treatment options almost immediately. Rapid Growth of Big Data
  • 8. • Big Data is a term that describes the large volume of data – whether structured (E.g. Relational databases) or semi-structured (E.g. Key – value pairs as in Mongo database) or unstructured (E.g. emails, text files, log files, videos etc.) that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It is what organizations do with that data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. • In summary, Raw Big Data = Structured Data + Semi Structured Data + Un-structured Data • Collecting Raw Big Data into a Data Lake is one thing, but finding the business value hidden in heaps of structured and unstructured data is quite another. In today’s computing world, data is information that has been translated into a form that is efficient for movement or processing. In other words, data is information converted into binary digital form. • In summary, Information = Raw Big Data + Business Rules + Quality Big Data Definition DRIVES Business IntelligenceInformation Figure 1: Big Data Implementation Tools & Techniques (Hadoop Eco System) Infrastructure (Hadoop Distributed File System) Processing (MapReduce)
  • 9. • Big Data is such data whose scale, diversity and complexity require new architecture, techniques and algorithms to manage and extract hidden knowledge from it. The majority of big data projects fall into one of two broad categories: o Storage driven • For organizations running corporate internal applications in a private data center now faced with large volumes of structured and unstructured data, an in-house Hadoop infrastructure is well suited to what would primarily be data storage driven projects o Application driven • For organizations where broad accessibility and frequent use are not necessary as in healthcare applications. These provide promise and potential involved in big data analytics where processing speed such as provided by MapReduce processing is of primary concern. These projects will Integrate with cloud providers’ security measures – start by integrating the big data platform with the encryption and security practices inherent in the cloud environment. Here, it will be important to offer a centralized control panel to manage data governance. Organizations can take further steps to minimize access issues by opting for Single Sign-On (SSO) when setting up new accounts. • It is important to use the right tool for the job at hand. 76% of organizations actively leverage at least three big data open source engines for data preparation, machine learning, and reporting and analysis workloads (e.g. Apache Hadoop/Hive, Apache Spark, and Presto). Greater productivity and automation is a priority. The data shows that as the size of implementations grow, data-driven companies constantly seek to improve the ratio of users per administrator, thereby reducing the incremental cost of growth and increasing user self-sufficiency. -1/ • Qubole Big Data Activation Report -1/ Big Data Projects
  • 10. • We need to understand that earlier systems catered to small to mid-size structured data that lent itself well to Business Intelligence such as Data Mining and Adhoc Query or Batch Reporting. Due to the addition of new tools and techniques, Big Data Eco system supports Optimizations and Predictive Analysis including complex statistical analysis. Business Value of Big Data Complexity Business Value Low High High Big Data - Predictive Analysis - Machine Learning - 3-D Dashboards Structured Data - Business Intelligence Figure 2: Business Value of Big Data
  • 11. • There are 9 ways to extract maximum value from Big Data -2/ o #1: The primary path to business value is through analytics. Note that this involves advanced forms of analytics such as those based on data mining, statistical analysis, natural language processing, and extreme SQL. Unlike reporting and OLAP, these enable data exploration and discovery analytics with big data. Even so, reporting and OLAP won't go away because they are still valuable elsewhere. o #2: Explore big data to discover new business opportunities. After all, many sources of big data are new to you, and many represent new channels for interacting with your customers and partners. As with any new source, big data merits exploration. Data exploration leads to patterns and new facts your business didn't know, such as new customer base segments, customer behaviors, forms of churn, and root causes for bottom line costs. o #3: Start analyzing the big data you've already hoarded. Yes, it's true: many firms have "squirreled away" large datasets because they sensed business value yet didn't know how to get value out of big data. Depending on your industry, you probably have large datasets of Web site logs, which can be "sessionized" and analyzed to understand Web site visitor behavior. Likewise, quality assurance data from manufacturing leads to more reliable products and better leverage with suppliers, and RFID data can solve the mysteries of product movement through supply Business Value of Big Data
  • 12. • There are 9 ways to extract maximum value from Big Data -2/ o The primary path to business value is through analytics. Note that this involves advanced forms of analytics such as those based on data mining, statistical analysis, natural language processing, and extreme SQL. Unlike reporting and OLAP, these enable data exploration and discovery analytics with big data. Even so, reporting and OLAP won't go away because they are still valuable elsewhere. o Explore big data to discover new business opportunities. After all, many sources of big data are new to you, and many represent new channels for interacting with your customers and partners. As with any new source, big data merits exploration. Data exploration leads to patterns and new facts your business didn't know, such as new customer base segments, customer behaviors, forms of churn, and root causes for bottom line costs. o Start analyzing the big data you've already hoarded. Yes, it's true: many firms have "squirreled away" large datasets because they sensed business value yet didn't know how to get value out of big data. Depending on your industry, you probably have large datasets of Web site logs, which can be "sessionized" and analyzed to understand Web site visitor behavior. Likewise, quality assurance data from manufacturing leads to more reliable products and better leverage with suppliers, and RFID data can solve the mysteries of product movement through supply chains. Business Value of Big Data
  • 13. o Focus on analyzing the type of big data that's valuable to your industry. The type and content of big data can vary by industry and thus have different value propositions for each industry. This includes call detail records (CDRs) in telecommunications, RFID in retail, manufacturing, and other product-oriented industries as well as sensor data from robots in manufacturing (especially automotive and consumer electronics). o Make a leap of faith into unstructured big data. This information is largely text expressing human language, which is very different from the relational data you work with most, so you'll need new tools for on natural language processing, search, and text analytics. These can provide visibility into text-laden business processes, such as the claims process in insurance, medical records in healthcare, call-center and help-desk applications in any industry, and sentiment analysis in customer-oriented businesses. o Expand your existing customer analytics with social media data. Customers can influence each other by commenting on brands, reviewing products, reacting to marketing campaigns, and revealing shared interests. Social big data can come from social media Web sites as well as from your own channels that enable customers to voice opinions and facts. It's possible to use predictive analytics to discover patterns and anticipate product or service issues. You might likewise measure share of voice, brand reputation, sentiment drivers, and new customer segments. Business Value of Big Data
• 14. o #7: Complete your customer views by integrating big data into them. Big data, when integrated with older enterprise sources, can broaden 360-degree views of customers and other business entities (products, suppliers, partners) from hundreds of attributes to thousands. The added granular detail leads to more accurate customer base segmentation, direct marketing, and other customer analytics.
o #8: Improve older analytic applications by incorporating big data. Big data can enlarge and broaden data samples for older analytic applications. This is especially true for analytic technologies that depend on large samples, such as statistics or data mining, when applied to fraud detection, risk management, or actuarial calculations.
o #9: Accelerate the business into real-time operation by analyzing streaming big data. Applications for real-time monitoring and analysis have been around for many years in businesses that operate an energy utility, communication network, or any grid, service, or facility that demands 24x7 operation. More recently, a wider range of organizations are tapping streaming big data for applications ranging from surveillance (cyber security, situational awareness, fraud detection) to logistics (truck or rail freight, mobile asset management, just-in-time inventory). Big data analytics is still mostly executed in batch and offline today, but it will move into real time as users and technologies mature.
• Source: Philip Russom, Ph.D., TDWI (Transforming Data with Intelligence) research paper.
Business Value of Big Data
  • 15. MODULE 2 Big Data Implementation
• 16. • Hadoop infrastructure is based on an architecture of clusters. A cluster is a configuration of nodes that interact to perform a specific task. Historically, clusters have provided high-performance processing solutions for aerospace and defense.
• The Hadoop Distributed File System (HDFS) supports the Velocity, Variety, Volume and Veracity of Big Data as the data is laid out across the NameNode and DataNodes. We cannot over-emphasize the importance of Veracity, as it relates to the accuracy of data in predicting business value. A forward-looking business may build a Smart Data Engine that analyzes massive amounts of data in real time, quickly assessing the value of a customer and the potential to make additional offers to that customer, i.e. make the point of service the point of sale.
• With the advent of the commercial Internet of Things (IoT), clusters have formed the backbone of major commercial entities such as the social media powerhouse Facebook, whose clusters span thousands of nodes.
• With the advent of the Cloud, it is now possible to have multi-layer clusters such as:
o Virtual clusters: clusters of virtual machines, including the Java Virtual Machine (JVM), which can run on multiple operating systems such as Microsoft Windows, Linux and Mac. To minimize expenses, install Vagrant or Puppet on Red Hat or Ubuntu to set up a cluster of virtual machines.
o Cloud clusters, which are massively scalable.
o Real (physical) clusters that deploy a core application stack, such as the Amazon S3 platform.
Hadoop Infrastructure
• 17. Figure 3: Cluster Design Considerations
o Compute: Multi-core CPUs, parallel processing algorithms, load balancing, synchronous multi-master cluster, cache memory
o Network: Data switch bandwidth of 100 Gbps Ethernet, TCP/IP protocol, TCP packet routing, high-speed fiber backbone
o Storage: Low-latency / fast / high-volume terabyte RAID drives, flash-based solid state drives
• 18. • Hadoop is an open-source framework written in Java. It provides a scalable, fault-tolerant, distributed system to store and process data across a cluster of commodity servers called nodes. The underlying goals are scalability, reliability and economics.
• Hadoop comprises:
o A set of operations and interfaces for distributed file systems
o The Hadoop Distributed File System (HDFS), which facilitates running MapReduce jobs over a cluster of commodity servers
o The Mahout tool to facilitate machine learning
o Processing tools contained in the Cloudera software distribution bundle:
• HIVE: Uses an interpreter to transform SQL queries into MapReduce code
• PIG: Scripting language that generates MapReduce code
• SQOOP: Exchanges data between relational databases (RDBMS) and the Hive database
• IMPALA: Uses SQL to access data directly from HDFS
o Hadoop Version 2 infrastructure, with a master NameNode where the Resource Manager runs and where the metadata about the underlying slave DataNodes is kept
o Several DataNodes, where the actual fault-tolerant application data resides, each running a Node Manager that keeps track of the jobs running on that DataNode and reports back to the Resource Manager on the NameNode
Hadoop Eco System
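• For illustration only (not part of the original deck): a minimal Java sketch that submits a HiveQL query through the standard JDBC interface. HiveServer2 is assumed to be reachable on localhost:10000, the 'houses' table is hypothetical, and Hive's interpreter compiles the GROUP BY query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver shipped with the Hive client libraries
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint and credentials; adjust host, port, database and user as needed
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "cloudera", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table; Hive turns this SQL into MapReduce code
             ResultSet rs = stmt.executeQuery(
                     "SELECT zip, AVG(house_value) FROM houses GROUP BY zip")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}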
• 19. Figure 4: Hadoop JVM Framework
o Java compile-time environment: Java source code is compiled into platform-independent byte code.
o Java run-time environment: the byte code runs on a platform-specific JVM (Windows JVM, Linux JVM or Mac JVM), which executes it on Windows, Linux or Mac respectively.
• 20. • The Hadoop Distributed File System (HDFS) supports the Velocity, Variety, Volume and Veracity of Big Data. A thorough understanding of how this data is laid out and managed is of paramount importance. The metadata and the actual data reside in specific directories on the HDFS server. An example with one NameNode directory (the master node, which stores the metadata) and ten DataNode directories (the slave nodes, which store the actual application data) is given below:
o Metadata: /home/osboxes/lab/hdfs/namenodep
o Actual data: /home/osboxes/lab/hdfs/datan1, /home/osboxes/lab/hdfs/datan2, /home/osboxes/lab/hdfs/datan3, /home/osboxes/lab/hdfs/datan4, /home/osboxes/lab/hdfs/datan5, /home/osboxes/lab/hdfs/datan6, /home/osboxes/lab/hdfs/datan7, /home/osboxes/lab/hdfs/datan8, /home/osboxes/lab/hdfs/datan9, /home/osboxes/lab/hdfs/datan10
• The DataNode directories shown above on the HDFS server use symbolic links to point to the directories on the slave nodes where the application data resides. The metadata about files on HDFS is stored and managed on the NameNode; it includes information such as the permissions and ownership of each file and the location of its blocks on the DataNodes.
• Common file-handling HDFS commands:
o List all files under the present working directory: hadoop fs -ls
o Create a new directory under the root directory: hadoop fs -mkdir /new-directory
o Create a new directory under the present working directory: hadoop fs -mkdir ./new-directory
o Find file sizes under the new directory: hadoop fs -du -s /new-directory
o Count the number of directories under the /etc directory: hadoop fs -count /etc
Hadoop Distributed File System
• 21. • Remove the "sample" file from an HDFS directory: hadoop fs -rm /hdfs/cloudera/sample
System response: Deleted sample
• Remove the "bigdata" directory from an HDFS directory: hadoop fs -rm -r /hdfs/cloudera/bigdata
System response: Deleted bigdata
• Copy a file from the local file system to HDFS. The syntax is: hadoop fs -copyFromLocal <local file> URI, where URI is a string of characters used to identify a resource over the network.
Example: hadoop fs -copyFromLocal /home/cloudera/hadooptraining /hdfs/cloudera, where local file = /home/cloudera/hadooptraining and URI = /hdfs/cloudera
Check that the copy to the URI worked: hadoop fs -ls /hdfs/cloudera/hadooptraining
• Copy a complete directory from the local file system to HDFS. The syntax is: hadoop fs -put <local directory> <hdfs directory>
Example: hadoop fs -put /home/cloudera/bigdata /hdfs/cloudera, where local directory = /home/cloudera/bigdata and hdfs directory = /hdfs/cloudera
Check that the 'bigdata' directory was copied: hadoop fs -ls /hdfs/cloudera
• Copy a complete directory from HDFS to the local file system. The syntax is: hadoop fs -get -crc <HDFS directory> <local directory>
Example: hadoop fs -get -crc /hdfs/cloudera/bigdata /home/cloudera, where hdfs directory = /hdfs/cloudera/bigdata and local directory = /home/cloudera
Check that the 'bigdata' directory was copied: hadoop fs -ls /home/cloudera
Tutorial: Hadoop Distributed File System Commands
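• As a complement to the shell commands above, here is a minimal, hedged Java sketch using the Hadoop FileSystem API to perform the same operations programmatically; the paths reuse the tutorial's example locations and are assumed to already exist on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default file system (HDFS)

        // Equivalent of: hadoop fs -ls /hdfs/cloudera
        for (FileStatus status : fs.listStatus(new Path("/hdfs/cloudera"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Equivalent of: hadoop fs -mkdir /new-directory
        fs.mkdirs(new Path("/new-directory"));

        // Equivalent of: hadoop fs -copyFromLocal /home/cloudera/hadooptraining /hdfs/cloudera
        fs.copyFromLocalFile(new Path("/home/cloudera/hadooptraining"), new Path("/hdfs/cloudera"));

        // Equivalent of: hadoop fs -get /hdfs/cloudera/bigdata /home/cloudera
        fs.copyToLocalFile(new Path("/hdfs/cloudera/bigdata"), new Path("/home/cloudera"));

        // Equivalent of: hadoop fs -rm -r /hdfs/cloudera/bigdata
        fs.delete(new Path("/hdfs/cloudera/bigdata"), true);   // true = recursive delete

        fs.close();
    }
}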
• 22. • To support the fast, efficient parallel processing and built-in fault tolerance provided by HDFS, the input data is split into 'n' partitions, and each partition is replicated according to the HDFS replication factor.
Assumptions: large input file sample.txt; number of partitions = 4; HDFS replication factor = 3.
Figure 5: Input data distribution across DataNodes. sample.txt is split into four partitions, and the three replicas of each partition (labelled a, b and c in the figure) are spread across the six DataNodes DN1 to DN6, so the failure of any single DataNode never loses a partition.
Note: MapReduce runs in parallel on many nodes, but runs on each node separately.
Tutorial: Distribution of input data in HDFS across Data Nodes
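• To see the partition and replica layout described above on a real cluster, the hedged Java sketch below (not from the original deck) asks the NameNode where each block of a file lives; the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/hdfs/cloudera/sample.txt");   // hypothetical path to the input file
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; getHosts() lists the DataNodes holding that block's replicas
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}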
• 23. • MapReduce is the distributed data processing model that runs on a large cluster of servers.
• It uses a parallel algorithm that allows linear scaling and runs in batch mode. It suits business intelligence applications that follow the "Write Once, Read Many" (WORM) paradigm.
• The MapReduce software is included in the Cloudera distribution package, which bundles a number of other Hadoop tools designed to run on clustered nodes under HDFS.
Figure 6: Cloudera Software Distribution Bundle (HIVE, PIG, SQOOP and IMPALA layered on top of MapReduce and HDFS)
MapReduce Software
• 24. • MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm that runs on a cluster of nodes.
• To experiment with it, install Cloudera in Oracle VM VirtualBox using the Oracle VM VirtualBox Manager.
• Download URL: https://download.virtualbox.org/virtualbox/5.2.8/Virtualbox-5.2.8-121009.Win.exe
• The system requirements are as follows:
o 64-bit Windows operating system
o VM Box / Player 4.x or higher
o Host operating system with at least 8 GB of memory
o At least 4 GB of RAM allocated to the VM
• Note on the use of MapReduce in Business Intelligence: Business Intelligence runs on the WORM paradigm, which is consistent with MapReduce processing. The Informatica Big Data Management tool therefore uses an ETL connector for HIVE that runs MapReduce to deliver a comprehensive solution for natively ingesting, integrating, cleaning, governing and securing big data workloads on the Hadoop platform.
• 25. • Apache Hadoop stores and processes data in a distributed fashion. In Hadoop Version 2, the NameNode and ResourceManager daemons are master daemons, whereas the DataNode and NodeManager daemons are slave daemons. This framework of a NameNode and its underlying DataNodes takes care of scheduling tasks, monitoring them and re-executing failed tasks.
• A MapReduce job usually splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system.
• The "MapReduce system" (also called the "infrastructure" or "framework") orchestrates the processing by marshalling the distributed nodes, running the various tasks in parallel, managing all communication and data transfers between the parts of the system, and providing redundancy and fault tolerance as defined by the replication factor in the HDFS configuration.
• A MapReduce program is composed of a mapper (a Java class), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reducer (a Java class), which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
MapReduce Processing
• 26. • Mapper class
o Maps input key/value pairs to a set of intermediate key/value pairs.
o The framework then moves each intermediate key/value pair to the reducer that needs it; this exchange is called shuffling.
• E.g. after mapping and shuffling the "# of Walmart stores" data set, the mappers hold the following key/value pairs: M1: Los Angeles/100, M2: New York/110, M3: San Jose/50, M4: Chicago/80, M5: Houston/70, M6: Miami/75, M7: Princeton/75, M8: Minneapolis/70
• Reducer class
o Reduces the set of intermediate values that share a single key; all of the intermediate values for that key are presented to the reducer.
• E.g. after reduction by region, the reducers hold the following values: R1 (West) = 150, R2 (East) = 185, R3 (South) = 145, R4 (North) = 150
• Driver class
o Configures and runs the Mapper and Reducer classes (a hedged Java sketch of all three classes is shown below).
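• The following minimal Java sketch shows one way to express the three classes for the store-count example above. It is illustrative only: the "city,count" input format and the city-to-region lookup table are assumptions, and the mapper emits the region as the key so that the framework's shuffle groups all cities of a region onto the same reducer.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StoresByRegion {

    // Mapper class: turns each "city,count" line into an intermediate (region, count) pair
    public static class RegionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Map<String, String> CITY_TO_REGION = new HashMap<>();
        static {   // hypothetical lookup table; a real job might load this from a side file
            CITY_TO_REGION.put("Los Angeles", "West");  CITY_TO_REGION.put("San Jose", "West");
            CITY_TO_REGION.put("New York", "East");     CITY_TO_REGION.put("Princeton", "East");
            CITY_TO_REGION.put("Houston", "South");     CITY_TO_REGION.put("Miami", "South");
            CITY_TO_REGION.put("Chicago", "North");     CITY_TO_REGION.put("Minneapolis", "North");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            String region = CITY_TO_REGION.get(parts[0].trim());
            if (region != null) {
                context.write(new Text(region), new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    // Reducer class: sums all store counts that share the same region key
    public static class RegionReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text region, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(region, new IntWritable(sum));
        }
    }

    // Driver class: configures and runs the Mapper and Reducer classes
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "stores by region");
        job.setJarByClass(StoresByRegion.class);
        job.setMapperClass(RegionMapper.class);
        job.setReducerClass(RegionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}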
• 27. • Given: address, zip, city, house_value. Calculate the average house value for each zip code.
• The equivalent Pig Latin script is:
inpt = load 'data.txt' as (address:chararray, zip:chararray, city:chararray, house_value:long);
grp = group inpt by zip;
average = foreach grp generate FLATTEN(group) as (zip), AVG(inpt.house_value) as average_price;
dump average;
• The MapReduce pseudo code is as follows (a hedged Java translation appears after this slide):
MAPPER(record):
    zip_code_key = record['zip'];
    value = {1, record['house_value']};   // {record count, house value}
    emit(zip_code_key, value);
SHUFFLER(zip_code_key, value_list):       // combiner-style pre-aggregation before the reduce
    record_num = 0;
    value_sum = 0;
    foreach (value : value_list) {
        record_num += value[0];
        value_sum += value[1];
    }
    value_out = {record_num, value_sum};
    emit(zip_code_key, value_out);
REDUCER(zip_code_key, value_list):
    record_num = 0;
    value_sum = 0;
    foreach (value : value_list) {
        record_num += value[0];
        value_sum += value[1];
    }
    avg = value_sum / record_num;
    emit(zip_code_key, avg);
Tutorial: Pseudo Code for Map Reduce JAVA Classes
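• A hedged Java translation of the pseudo code above (not from the original deck): the mapper emits (zip, house_value) and the reducer computes the average. The combiner-style SHUFFLER step is omitted for brevity, since carrying the (count, sum) pair between mapper and reducer would require a custom Writable.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageHouseValue {

    // Mapper: input line "address,zip,city,house_value" -> emit (zip, house_value)
    public static class ZipMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 4) {
                context.write(new Text(fields[1].trim()),
                              new DoubleWritable(Double.parseDouble(fields[3].trim())));
            }
        }
    }

    // Reducer: averages all house values that share the same zip code key
    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text zip, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            long recordNum = 0;
            double valueSum = 0.0;
            for (DoubleWritable v : values) {
                recordNum++;
                valueSum += v.get();
            }
            context.write(zip, new DoubleWritable(valueSum / recordNum));
        }
    }

    // Driver: configures and runs the Mapper and Reducer classes
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average house value by zip");
        job.setJarByClass(AverageHouseValue.class);
        job.setMapperClass(ZipMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the HDFS directory holding data.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}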