INTRODUCTION TO HADOOP
What is Big Data?
 Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world
today has been created in the last two years alone.
 Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and decision
making.
 According to IBM, 80% of data captured today is unstructured, from sensors used to gather
climate information, posts to social media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.
Big data spans three dimensions: Volume, Velocity and Variety.
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes - even
petabytes - of information.
 Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
 Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data
must be used as it streams into your enterprise in order to maximize its value.
 Scrutinize 5 million trade events created each day to identify potential fraud
 Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio,
video, click streams, log files and more. New insights are found when analyzing these data types
together.
 Monitor hundreds of live video feeds from surveillance cameras to target points of interest
 Exploit the 80% data growth in images, video and documents to improve customer satisfaction
What does Hadoop solve?
 Organizations are discovering that important predictions can be made by sorting through and
analyzing Big Data.
 However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way
that makes it suitable for data mining and subsequent analysis.
 Hadoop is the core platform for structuring Big Data, and it solves the problem of making that
data useful for analytics; a minimal MapReduce sketch follows below.
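
To make Hadoop's structuring role concrete, here is a minimal MapReduce sketch, tied to the meter-reading example above. It assumes a hypothetical raw-log format of whitespace-separated lines such as "meter42 2013-11-02T10:15 3.7" (meter ID, timestamp, kWh); the mapper extracts a typed (meterId, kWh) record from each unstructured line, and the reducer totals consumption per meter. The input format, class names and job name are illustrative assumptions, not part of the original material.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MeterTotals {

      // Mapper: turn one raw log line into a structured (meterId, kWh) pair.
      public static class ReadingMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().trim().split("\\s+");
          if (fields.length == 3) {                     // skip malformed lines
            try {
              double kwh = Double.parseDouble(fields[2]);
              ctx.write(new Text(fields[0]), new DoubleWritable(kwh));
            } catch (NumberFormatException ignored) { } // skip bad readings
          }
        }
      }

      // Reducer: sum all readings for one meter.
      public static class SumReducer
          extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text meterId, Iterable<DoubleWritable> readings,
                              Context ctx) throws IOException, InterruptedException {
          double total = 0.0;
          for (DoubleWritable r : readings) total += r.get();
          ctx.write(meterId, new DoubleWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "meter-totals");
        job.setJarByClass(MeterTotals.class);
        job.setMapperClass(ReadingMapper.class);
        job.setCombinerClass(SumReducer.class); // safe: summation is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The particular job matters less than the pattern it shows: map turns raw bytes into typed records, and reduce aggregates those records into a structured result that downstream analytics can consume.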
The Importance of Big Data and What You Can Accomplish
The real issue is not that you are acquiring large amounts of data. It's what you do with the data that
counts. The hopeful vision is that organizations will be able to take data from any source, harness
relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new
product development and optimized offerings, and 4) smarter business decision making. For instance,
by combining big data and high-powered analytics, it is possible to:
 Determine root causes of failures, issues and defects in near-real time, potentially saving billions
of dollars annually.
 Optimize routes for many thousands of package delivery vehicles while they are on the road.
 Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
 Generate retail coupons at the point of sale based on the customer's current and past
purchases.
 Send tailored recommendations to mobile devices while customers are in the right area to take
advantage of offers.
 Recalculate entire risk portfolios in minutes.
 Quickly identify customers who matter the most.
 Use clickstream analysis and data mining to detect fraudulent behavior.
Challenges
Many organizations are concerned that the amount of amassed data is becoming so large that it is
difficult to find the most valuable pieces of information.
 What if your data volume gets so large and varied you don't know how to deal with it?
 Do you store all your data?
 Do you analyze it all?
 How can you find out which data points are really important?
 How can you use it to your best advantage?
Until recently, organizations were limited to using subsets of their data, or constrained to simplistic
analyses, because the sheer volumes of data overwhelmed their processing platforms. But what is the
point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to
wait hours or days to get results? On the other hand, not all business questions are better answered by
bigger data. You now have two choices:
 Incorporate massive data volumes in analysis. If the answers you're seeking will be better
provided by analyzing all of your data, go for it. High-performance technologies that extract
value from massive amounts of data are here today. One approach is to apply high-performance
analytics using technologies such as grid computing, in-database processing and in-memory
analytics.
 Determine upfront which data is relevant. Traditionally, the trend has been to store everything
(some call it data hoarding) and to discover what is relevant only when you query the data. We
now have the ability to apply analytics on the front end to determine relevance based on
context. This type of analysis determines which data should be included in analytical processes
and what can be placed in low-cost storage for later use if needed; a sketch of such front-end
routing follows this list.
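
As a concrete illustration of the second choice, here is a minimal sketch of front-end relevance routing in plain Java. All source names, thresholds and the Reading record are invented for the example: each incoming record is scored against simple contextual rules, and only relevant records are routed to the (expensive) analytics tier, while the rest are parked in low-cost storage.

    import java.util.List;

    public class RelevanceRouter {

      // A hypothetical incoming record.
      record Reading(String source, double value, long ageHours) {}

      // Toy context rule: data from watched sources, unusually large values,
      // or very fresh data is relevant now; everything else can wait.
      static boolean isRelevant(Reading r, List<String> watchedSources) {
        return watchedSources.contains(r.source) || r.value > 100.0 || r.ageHours < 1;
      }

      public static void main(String[] args) {
        List<String> watched = List.of("fraud-feed", "sensor-7");
        List<Reading> incoming = List.of(
            new Reading("fraud-feed", 12.0, 5),
            new Reading("clickstream", 3.0, 48),
            new Reading("clickstream", 250.0, 2));

        for (Reading r : incoming) {
          if (isRelevant(r, watched)) {
            System.out.println("analytics tier <- " + r); // process now
          } else {
            System.out.println("cold storage   <- " + r); // keep for later
          }
        }
      }
    }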
Technologies
A number of recent technology advancements enable organizations to make the most of big data and
big data analytics:
 Cheap, abundant storage.
 Faster processors.
 Affordable open source, distributed big data platforms, such as Hadoop.
 Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity
and high throughputs.
 Cloud computing and other flexible resource allocation arrangements.
The goal of all organizations with access to large data collections should be to harness the most relevant
data and use it for better decision making.
Three Enormous Problems Big Data Tech Solves
But what’s less commonly talked about is why Big Data is such a problem beyond size and computing
power. The reasons behind the conversation are the truly interesting part and need to be understood.
Here they are: three trends that are driving the discussion and should be made painfully clear
instead of lost in all the hype:
 We’re digitizing everything. This is big data’s volume, and it comes from unlocking hidden data
from common things all around us that were known before but weren’t quantified, stored,
compared and correlated. Suddenly, there’s enormous value in the patterns of what was
recently hidden from our view. Patterns offer understanding and a chance to predict what will
happen next. Each of these is important, and together they are remarkably powerful.
 There’s no time to intervene. This is big data’s velocity. All of that digital data creates massive
historical records but also rich streams of information that are flowing constantly. When we
take the patterns discovered in historical information and compare them to everything happening
right now, we can either make better things happen or prevent the worst. This is revenue
generating and life saving and all of the other wonderful things we hear about, but only if we
have the systems in place to see it happening in the moment and do something about it. We
can’t afford enough human watchers to do this, so building big data systems is the
only way to get to better things when the data gives humans insufficient time to intervene (a
toy sketch of such a watcher follows this list).
 Variation creates instability. This is big data’s variety. Data was once defined by what we could
store and relate in tables of columns and rows. A world that’s digitized ignores those boundaries
and is instead full of both structured and unstructured data. That creates a very big problem for
systems that were built upon the old definition, which comprise just about everything around
us. Suddenly, there’s data available that can’t be consumed or generated by a database. We
either ignore that information or it ends up in places and formats that are unreadable to older
systems. Gone is the ability to correlate unstructured information with that vast historical (but
highly structured) data. When we can’t analyze and correlate well, we introduce instability into
our world. We’re missing the big picture unless we build systems that are flexible and don’t
require reprogramming the logic for every unexpected (and there will be many) change.
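
As a concrete illustration of acting in the moment, here is a toy stream watcher (the window size, threshold and data are all made up): it keeps a sliding window of recent values as the "historical pattern" and flags any new value that deviates from the window mean by more than three standard deviations, the kind of check a system must run continuously when no human has time to intervene.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class StreamWatcher {
      private final Deque<Double> window = new ArrayDeque<>();
      private final int windowSize;

      StreamWatcher(int windowSize) { this.windowSize = windowSize; }

      // Returns true if the value is anomalous relative to recent history.
      boolean observe(double value) {
        boolean anomaly = false;
        if (window.size() == windowSize) {
          double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
          double var = window.stream()
              .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
          double std = Math.sqrt(var);
          anomaly = std > 0 && Math.abs(value - mean) > 3 * std;
          window.removeFirst();             // slide the window forward
        }
        window.addLast(value);
        return anomaly;
      }

      public static void main(String[] args) {
        StreamWatcher watcher = new StreamWatcher(20);
        for (int i = 0; i < 60; i++) {
          double v = (i == 45) ? 500.0 : 10.0 + Math.sin(i); // inject one spike
          if (watcher.observe(v)) {
            System.out.println("t=" + i + ": intervene, value " + v + " is anomalous");
          }
        }
      }
    }

In production this logic would live in a streaming engine rather than a single JVM, but the shape is the same: maintain the historical pattern incrementally and test every new event against it as it arrives.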
There you have it… the underlying reasons that big data matters and isn’t just hype (though there’s
plenty of that). The digitization, the lack of time for intervention and the instability that big data creates
lead us to develop whole new ways of managing information that go well beyond Hadoop and distributed
computing. It’s why big data presents such an enormous challenge and opportunity for software vendors
and their customers, but only if these three challenges are the drivers and not opportunism.
BI vs. Big Data vs. Data Analytics By Example
Business Intelligence (BI) encompasses a variety of tools and methods that can help organizations make
better decisions by analyzing “their” data. Therefore, Data Analytics falls under BI. Big Data, if used for
the purpose of Analytics, falls under BI as well.
Let’s say I work for the Centers for Disease Control and Prevention (CDC), and my job is to analyze the
data gathered from around the country to improve our response time during flu season. Suppose we
want to know about the geographical spread of flu for the last winter (2012). We run some BI reports,
and they tell us that the state of New York had the most outbreaks. Knowing that, we might want to
better prepare the state for the next winter. These types of queries examine past events, are the most
widely used, and fall under the Descriptive Analytics category; a minimal sketch of such a query follows.
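
A descriptive question like "which state had the most outbreaks last winter" is, at bottom, a group-and-count over past records. Here is a minimal in-memory sketch; the record shape and the data are invented for illustration.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DescriptiveReport {
      record FluCase(String state, String week) {}  // hypothetical record shape

      public static void main(String[] args) {
        List<FluCase> winter2012 = List.of(
            new FluCase("NY", "2012-W50"), new FluCase("NY", "2012-W51"),
            new FluCase("NY", "2012-W52"), new FluCase("CA", "2012-W50"),
            new FluCase("TX", "2012-W51"));

        // Group past cases by state and count them: classic descriptive analytics.
        Map<String, Long> casesByState = winter2012.stream()
            .collect(Collectors.groupingBy(FluCase::state, Collectors.counting()));

        casesByState.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .ifPresent(top -> System.out.println(
                "Most outbreaks: " + top.getKey() + " (" + top.getValue() + " cases)"));
      }
    }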
Now suppose we have just purchased an interactive visualization tool, and I am looking at a map of the
United States depicting the concentration of flu in different states for the last winter. I click a button to
display the vaccine distribution. There it is: I visually detect a direct correlation between the intensity
of the flu outbreak and the late shipment of vaccines. I notice that the shipments of vaccine for the state
of New York were delayed last year. This gives me a clue to investigate further and determine whether
the correlation is causal. This type of analysis falls under Diagnostic Analytics (discovery); the sketch
below shows the kind of correlation computation behind such a finding.
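
The diagnostic step quantifies what the eye spotted on the map. Here is a minimal sketch computing the Pearson correlation between per-state shipment delay and outbreak intensity; the numbers are fabricated for illustration, and a coefficient near +1 would support, but not prove, the suspected link.

    public class DiagnosticCorrelation {

      // Pearson correlation coefficient of two equally sized samples.
      static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
          sx += x[i]; sy += y[i];
          sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;   // n * covariance
        double vx = sxx - sx * sx / n;    // n * variance of x
        double vy = syy - sy * sy / n;    // n * variance of y
        return cov / Math.sqrt(vx * vy);
      }

      public static void main(String[] args) {
        // One entry per state: shipment delay (days), outbreak intensity (cases per 100k).
        double[] delayDays = {2, 5, 9, 14, 21};
        double[] intensity = {40, 55, 80, 120, 190};
        System.out.printf("r = %.3f%n", pearson(delayDays, intensity));
      }
    }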
We move to the next phase, which is Predictive Analytics. PA is what most people in the industry refer to
as Data Analytics. It gives us the probability of different outcomes, and it is future-oriented. US banks,
for example, have been using it for fraud detection. The process of distilling intelligence here is more
complex, requiring techniques like statistical modeling.
Back to our example: I hire a Data Scientist to help me create a model and apply the data to it in order
to identify causal relationships and correlations as they relate to the spread of flu for the winter of
2013. Note that we are now talking about the future. I can use my visualization tool to play with
variables such as demand, vaccine production rate, quantity… to weigh the pluses and minuses of
different decisions about how to prepare for and tackle the potential problems in the coming months. A
toy sketch of what such a fitted model produces follows.
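
As an illustration of what a fitted predictive model outputs, here is a logistic scoring function with made-up coefficients (a real model would be fitted by the Data Scientist from historical data): given this season's shipment delay and last season's case load, it returns a probability of a severe outbreak.

    public class PredictiveScore {

      // Hypothetical fitted coefficients; a real model would learn these from data.
      static final double INTERCEPT = -4.0;
      static final double W_DELAY   = 0.15;     // per day of shipment delay
      static final double W_CASES   = 0.00002;  // per prior-season case

      // Logistic model: probability of a severe outbreak next season.
      static double severeOutbreakProbability(double shipmentDelayDays,
                                              double priorSeasonCases) {
        double z = INTERCEPT + W_DELAY * shipmentDelayDays + W_CASES * priorSeasonCases;
        return 1.0 / (1.0 + Math.exp(-z));
      }

      public static void main(String[] args) {
        System.out.printf("14-day delay, 120000 prior cases: p = %.2f%n",
            severeOutbreakProbability(14, 120_000));
        System.out.printf("on-time,      120000 prior cases: p = %.2f%n",
            severeOutbreakProbability(0, 120_000));
      }
    }

Comparing the two printed probabilities (delayed versus on-time shipments) is exactly the kind of what-if the visualization tool lets us play with.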
The last phase is Prescriptive Analytics, which integrates our tried-and-true predictive models into
repeatable processes in order to yield desired outcomes. An automated risk reduction system based on
real-time data received from sensors in a factory is a good example of a use case; a bare-bones sketch
of such a loop follows.
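
Here is a bare-bones version of such an automated risk reduction loop, with invented sensor readings, model and thresholds: the predictive model's output is wired directly into a repeatable action instead of a report a human reads.

    public class PrescriptiveLoop {

      // Stand-in for a predictive model: risk rises with temperature and vibration.
      static double failureRisk(double tempC, double vibrationMm) {
        return Math.min(1.0, Math.max(0.0, 0.01 * (tempC - 60) + 0.2 * vibrationMm));
      }

      public static void main(String[] args) {
        // Each row: temperature (°C), vibration amplitude (mm).
        double[][] sensorFeed = { {65, 0.5}, {78, 1.2}, {92, 2.8}, {70, 0.9} };
        for (double[] reading : sensorFeed) {
          double risk = failureRisk(reading[0], reading[1]);
          // Prescriptive step: the system acts on the prediction automatically.
          if (risk > 0.8)      System.out.println("risk " + risk + ": shut machine down");
          else if (risk > 0.4) System.out.println("risk " + risk + ": throttle to 50% load");
          else                 System.out.println("risk " + risk + ": normal operation");
        }
      }
    }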
Finally, here is an example of Big Data.
Suppose it’s December 2013, and it happens to be a bad year for the flu epidemic. A new strain of the
virus is wreaking havoc, and a drug company has produced a vaccine that is effective in combating the
virus. The problem is that the company can’t produce doses fast enough to meet the demand, so the
Government has to prioritize its shipments. Currently the Government has to wait a considerable
amount of time to gather the data from around the country, analyze it, and take action. The process is
slow and inefficient. The contributing factors include: not having computer systems fast enough to
gather and store the data as it arrives (velocity), not having computer systems that can accommodate
the volume of data pouring in from all of the medical centers in the country (volume), and not having
computer systems that can process images, e.g., X-rays (variety).
Big Data technology changed all of that. It solved the velocity-volume-variety problem. We now have
computer systems that can handle “Big Data”. The CDC may receive the data from hospitals and
doctors’ offices in real time, and Data Analytics software that sits on top of the Big Data computer
systems can generate actionable items that give the Government the agility it needs in times of crisis.