Big data introduction

  1. Big Data: A Soft Introduction
  2. Contents: What is Big Data; Challenges of Big Data; Conventional Approaches; Problems with Conventional Approaches; Welcome to the world of Big Data
  3. What is Big Data: Every day, the world creates 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Gartner defines Big Data as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  4. What is Big Data (contd): According to IBM, 80% of the data captured today is unstructured: from sensors used to gather climate information, to posts on social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is also Big Data.
  5. Why Big Data: Huge competition in the market: ◦ Retail: customer analytics ◦ Travel: travel patterns of customers ◦ Websites: understanding users' navigation patterns, interests, conversions, etc. Also: sensor, satellite, and geospatial data; military and intelligence.
  6. What are the challenges at large scale? Single point of failure: in a single-machine environment, failure is not something program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway. Distributed environment: ◦ Networks can experience partial or total failure if switches and routers break down. ◦ Data may be corrupted, or maliciously or improperly transmitted. ◦ The distributed system should be able to recover from component failures or transient error conditions and continue to make progress.
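The recovery requirement in that last bullet is easy to sketch. Below is a minimal Python illustration of tolerating partial failure by retrying against replicas; the fetch callable and the replica list are hypothetical stand-ins for a real distributed read path, not part of the original deck.

```python
def fetch_with_retry(fetch, replicas, attempts=3):
    """Try replicas in turn so that one failed node does not halt progress."""
    last_error = None
    for i in range(attempts):
        node = replicas[i % len(replicas)]
        try:
            return fetch(node)      # may raise on partial network failure
        except (ConnectionError, TimeoutError) as err:
            last_error = err        # tolerate the failure, try the next node
    raise RuntimeError("all replicas unavailable") from last_error
```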
  7. What are the challenges at large scale (contd): Individual machines typically only have a few gigabytes of memory. Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But intermediate datasets generated while performing a large-scale computation can easily fill up several times more space than what the original input dataset had occupied. Bandwidth is a scarce resource even on an internal network.
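A back-of-envelope calculation makes the storage point concrete; all the sizes below are assumed purely for illustration.

```python
input_tb = 10           # assumed input dataset size, in terabytes
spill_factor = 3        # intermediate data can be several times the input
disk_per_node_tb = 4    # assumed disk capacity of one commodity machine

intermediate_tb = input_tb * spill_factor               # 30 TB of intermediate data
nodes_needed = -(-intermediate_tb // disk_per_node_tb)  # ceiling division: 8
print(nodes_needed, "machines just to hold the intermediate output")
```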
  8. Synchronization: the biggest challenge. If 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node.
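A toy sketch of that policy in Python (the task bookkeeping here is hypothetical): when a node fails, its unfinished tasks are simply re-queued on the survivors.

```python
def reassign_on_failure(tasks_by_node, failed_node):
    """Spread the failed node's unfinished tasks across the surviving nodes."""
    orphaned = tasks_by_node.pop(failed_node)   # work that must be re-computed
    survivors = sorted(tasks_by_node)
    for i, task in enumerate(orphaned):
        # Round-robin the orphaned work; with 100 nodes this costs ~1% capacity.
        tasks_by_node[survivors[i % len(survivors)]].append(task)
    return tasks_by_node
```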
  9. Essence of Big Data
  10. Volume: Today we are living in a world of data, and multiple factors contribute to data growth. Huge volumes of data are generated from various sources: transaction-based data (stored through the years); text, images, and videos from social media; increasing amounts of data generated by sensors.
  11. Volume (contd): Turn 12 terabytes of Tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption. Turn billions of customer complaints into a root-cause analysis of customer churn.
  12. Velocity: According to Gartner, velocity "means both how fast data is being produced and how fast the data must be processed to meet demand." Scrutinize 5 million trade events created each day to identify potential fraud. Analyze customers' searching and buying patterns and show them advertisements for attractive offers in real time.
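As a sketch of what "processing fast enough to meet demand" can look like, here is a hypothetical streaming filter that flags trades far outside the recent norm. The window size and threshold are illustrative assumptions, not a real fraud model.

```python
from collections import deque

def flag_outliers(trade_amounts, window=1000, z=4.0):
    """Yield trade amounts that deviate sharply from a rolling baseline."""
    recent = deque(maxlen=window)
    for amount in trade_amounts:
        if len(recent) == window:
            mean = sum(recent) / window
            std = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
            if std > 0 and abs(amount - mean) > z * std:
                yield amount        # candidate for fraud review
        recent.append(amount)
```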
  13. Velocity (example): Take Google's processing of data as an example: as soon as a blog is posted, it appears in the search results. If we search for travel, shopping (electronics, apparel, shoes, watches, etc.), jobs, and so on, it shows us relevant advertisements while we browse. Even ads in email are highly content-driven.
  14. Variety: Data today comes in all types of formats: from traditional databases, to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data, and financial transactions.
  15. Veracity: Big Data veracity refers to the biases, noise, and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed? Veracity is the biggest challenge in data analysis when compared to things like volume and velocity. Keep your data clean, and keep processes in place to stop 'dirty data' from accumulating in your systems.
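A minimal sketch of that kind of hygiene in Python, assuming hypothetical sensor records with a "reading" field and an assumed plausible physical range:

```python
def drop_dirty(records, low=-50.0, high=60.0):
    """Filter out records that are obviously noise before they accumulate."""
    for rec in records:
        reading = rec.get("reading")
        if reading is None:
            continue                        # missing value: skip
        if not (low <= reading <= high):
            continue                        # physically implausible: skip
        yield rec
```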
  16. Conventional Approaches: Storage: ◦ RDBMS (Oracle, DB2, MySQL, etc.) ◦ OS filesystem. Processing: ◦ SQL queries ◦ Custom frameworks (C/C++, Python/Perl).
  17. Why Big Data Technologies: Conventional approaches and technologies are not able to solve current problems. They are good for certain use cases, but they cannot handle data in the range of petabytes.
  18. Problems with Conventional Approaches: 1. Limited storage capacity 2. Limited processing capacity 3. No scalability 4. Single point of failure 5. Sequential processing 6. RDBMSs can handle only structured data 7. Data requires preprocessing 8. Information is collected according to current business needs
  19. Limited storage capacity: Installed on a single machine, with specified storage limits. Requires archiving the data again and again, and reloading archived data back into the repository as business needs change is problematic. Can only process the data that can be stored on a single machine.
  20. Limited processing capacity: Installed on a single machine, with specified processing limits and a fixed number of processing elements (CPUs). Not able to process large amounts of data efficiently.
  21. No scalability: One of the biggest limitations of conventional RDBMSs is the lack of scalability: we cannot add more resources on the fly.
  22. SOLUTION: Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. The theoretical 1000-CPU machine described earlier would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core machines. Hadoop will tie these smaller and more reasonably priced machines together into a single cost-effective compute cluster.
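The cost argument is easy to make concrete. The prices below are purely illustrative assumptions, not vendor quotes:

```python
big_iron_cost = 4_000_000   # assumed price of one monolithic 1000-CPU machine
commodity_box = 3_000       # assumed price of one quad-core commodity machine
boxes = 250                 # 250 quad-core boxes give roughly 1000 cores

cluster_cost = boxes * commodity_box    # 750,000
print(big_iron_cost / cluster_cost)     # the monolith costs several times more
```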
  23. What makes Hadoop unique? Simplified programming model. Efficient automatic distribution of data. Automatic distribution of work across machines. Fault tolerance. Grid scheduling of computers can be done with existing systems such as Condor. But Condor does not automatically distribute data: a separate SAN must be managed in addition to the compute cluster.
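To give a feel for the simplified programming model, here is the classic word-count example written as two small Hadoop Streaming scripts in Python. This is a generic sketch rather than code from the deck: Hadoop handles distribution, sorting, and fault tolerance, and the programmer writes only the mapper and reducer.

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

The reducer receives the mapper output sorted by key, so it only has to sum runs of identical words:

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per word; Hadoop delivers input sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current and current is not None:
        print(f"{current}\t{total}")
        total = 0
    current = word
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

These would typically be launched through the hadoop-streaming jar with its -mapper and -reducer options; the cluster then runs many mapper copies in parallel over the input splits.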
  24. Questions?