Contents
What is Big Data
Challenges of Big Data
Conventional Approaches
Problems with Conventional Approaches
Welcome to the world of Big Data
What is Big Data
Every day, the world creates 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
Gartner defines Big Data as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
What is Big Data
According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is also Big Data.
Why Big Data
Huge Competition in the Market:
◦ Retail – customer analytics
◦ Travel – travel patterns of customers
◦ Websites – understand users’ navigation patterns, interests, conversions, etc.
Sensor, satellite and geospatial data
Military and intelligence
What are the challenges at large scale
Single Point of Failure: In a single-machine environment, failure is not something that program designers explicitly worry about very often: if the machine has crashed, then there is no way for the program to recover anyway.
Distributed environment:
o Networks can experience partial or total failure if switches and routers break down.
o Data may be corrupted, or maliciously or improperly transmitted.
o The distributed system should be able to recover from component failures and transient error conditions and continue to make progress (see the sketch below).
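To make the last point concrete, here is a minimal, hypothetical Java sketch (not part of any real framework): a transferred block is verified with a CRC-32 checksum and re-requested a bounded number of times if it arrives corrupted, so a transient error does not stop progress. The fetchBlock simulation and the retry limit are assumptions made purely for illustration.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.zip.CRC32;

    // Hypothetical sketch: detect a corrupted transfer and recover by retrying.
    public class ChecksummedFetch {

        // Simulated unreliable network transfer: sometimes flips a bit in the block.
        static byte[] fetchBlock(byte[] source) {
            byte[] copy = source.clone();
            if (ThreadLocalRandom.current().nextInt(3) == 0) {
                copy[0] ^= 0x1;                    // corrupted in transit
            }
            return copy;
        }

        static long crc(byte[] data) {
            CRC32 checksum = new CRC32();
            checksum.update(data);
            return checksum.getValue();
        }

        public static void main(String[] args) {
            byte[] block = "some replicated data block".getBytes();
            long expected = crc(block);
            for (int attempt = 1; attempt <= 5; attempt++) {
                byte[] received = fetchBlock(block);
                if (crc(received) == expected) {   // verified: the transient error did not stop progress
                    System.out.println("block verified on attempt " + attempt);
                    return;
                }
            }
            System.out.println("giving up after 5 attempts");
        }
    }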
What are the challenges at large scale (contd.)
Individual machines typically only have a few gigabytes of memory.
Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But intermediate datasets generated while performing a large-scale computation can easily fill up several times more space than what the original input dataset had occupied.
Bandwidth is a scarce resource even on an internal network.
Synchronization: The biggest challenge
If 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node.
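A minimal, hypothetical Java sketch of that idea (the node names and task lists are invented for illustration): when one node fails, its unfinished tasks are redistributed round-robin across the survivors, so each remaining node absorbs only a small, roughly proportional share of the lost work.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: re-assign the work of a failed node to the surviving nodes.
    public class Rescheduler {

        static Map<String, List<Integer>> reassign(Map<String, List<Integer>> plan, String failedNode) {
            List<Integer> orphaned = plan.remove(failedNode);   // work lost on the unavailable node
            if (orphaned == null) {
                return plan;                                     // node was not in the plan: nothing to do
            }
            List<String> survivors = new ArrayList<>(plan.keySet());
            for (int i = 0; i < orphaned.size(); i++) {
                // Round-robin: each survivor absorbs roughly 1/(N-1) of the orphaned tasks.
                plan.get(survivors.get(i % survivors.size())).add(orphaned.get(i));
            }
            return plan;
        }

        public static void main(String[] args) {
            Map<String, List<Integer>> plan = new LinkedHashMap<>();
            for (int n = 0; n < 4; n++) {
                plan.put("node-" + n, new ArrayList<>(List.of(n * 2, n * 2 + 1)));
            }
            // node-2 crashes; its two tasks are re-computed on the remaining three nodes.
            System.out.println(reassign(plan, "node-2"));
        }
    }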
Volume
Today we are living in a world of data, and multiple factors are contributing to its growth.
Huge volumes of data are generated from various sources:
Transaction-based data (stored over the years)
Text, Images, Videos from Social Media
Increased amounts of data generated by sensors
Volume
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
Convert 350 billion annual meter readings to better predict power consumption
Analyze billions of customer complaints to find the root causes of customer churn
Velocity
According to Gartner, velocity "means both how fast data is being produced and how fast the data must be processed to meet demand."
Scrutinize 5 million trade events created each day to identify potential fraud
Analyze customers’ search and buying patterns and show them advertisements for attractive offers in real time
Velocity (example)
Take Google’s example of data processing:
As soon as a blog post is published, it appears in the search results.
If we search for travel, shopping (electronics, apparel, shoes, watches, etc.), jobs and so on, it shows us relevant advertisements while we browse.
Even ads in email are highly content-driven.
Variety
Data today comes in all types of formats – from traditional databases, to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions.
Veracity
Big Data veracity refers to the biases, noise and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed?
Veracity is the biggest challenge of data analysis when compared to things like volume and velocity. Keep your data clean, and put processes in place to keep ‘dirty data’ from accumulating in your systems.
Why Big Data Technologies
Conventional Approaches/Technologies are not able to solve current problems
They are good for certain use-cases
But they cannot handle data in the range of petabytes
Problems with Conventional Approaches
1. Limited Storage capacity
2. Limited Processing capacity
3. No scalability
4. Single point of failure
5. Sequential Processing
6. RDBMSs can handle only structured data
7. Requires preprocessing of data
8. Information is collected according to current business needs
Limited Storage capacity
Installed on a single machine
Has specified storage limits
Requires archiving the data again and again
Problems of reloading data back into the repository according to business needs
Can only process the data that can be stored on a single machine
Limited Processing capacity
Installed on a single machine
Has specified processing limits
Has a certain number of processing elements (CPUs)
Not able to process large amounts of data efficiently
No scalability
One of the biggest limitations of conventional RDBMSs is the lack of scalability
We cannot add more resources on the fly
What makes Hadoop Unique?
Simplified Programming Model (see the word-count sketch below)
Efficient
Automatic Distribution of Data
Automatic distribution of work across machines
Fault Tolerance
Grid scheduling of computers can be done with existing systems such as Condor. But Condor does not automatically distribute data: a separate SAN must be managed in addition to the compute cluster.
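The simplified programming model referred to above is Hadoop MapReduce. Below is a sketch of the classic word-count example (essentially the one shipped with the Hadoop documentation), assuming a recent Hadoop 2.x/3.x MapReduce client library on the classpath; the class names and input/output paths are illustrative. The programmer writes only a map and a reduce function, while Hadoop distributes the data, schedules the work across machines and re-runs failed tasks.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Classic Hadoop MapReduce word count (illustrative).
    public class WordCount {

        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation to save bandwidth
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A typical invocation would look something like: hadoop jar wordcount.jar WordCount /input /output, where the jar name and HDFS paths are placeholders.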