Big data analytics

14 Jun 2018

  1. BIG DATA ANALYTICS
  2. CONTENTS 1. Big Data 2. Data vs Big Data 3. Examples 4. Challenges 5. Big Data Analytics 6. Traditional vs Big Data analytics 7. Hadoop 8. Application
  3. WHAT IS BIG DATA Big data is a collection of data sets that are large and complex in nature. They comprise both structured and unstructured data, and they grow so large so fast that they cannot be managed by traditional relational database systems or conventional statistical tools.
  4. DATA VS BIG DATA Big data is just data with: • More volume • Faster data generation (velocity) • Multiple data formats (variety) The world's data volume is projected to grow 40% per year and 50-fold by 2020 [1]. Data comes from various human and machine activities.
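The growth figures on slide 4 can be sanity-checked with compound-growth arithmetic. The following is a minimal sketch: the slide does not state the base year, so the code only computes how many years of 40% annual growth a 50-fold increase would imply. The function names are illustrative and not taken from the deck.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Total multiplier after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


def years_to_reach(annual_rate: float, target_factor: float) -> float:
    """Years of compounding needed to multiply the starting volume by `target_factor`."""
    return math.log(target_factor) / math.log(1 + annual_rate)


if __name__ == "__main__":
    # 40% per year, 50-fold target: both numbers come from slide 4.
    print(f"Years needed for 50x at 40%/year: {years_to_reach(0.40, 50):.1f}")
    print(f"Multiplier after 10 years at 40%/year: {growth_factor(0.40, 10):.1f}x")
```

Under these assumptions, a 50-fold increase corresponds to a little under 12 years of 40% annual compounding, while ten years of such growth yields roughly a 29-fold increase.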
  5. BIG DATA ANALYTICS IN PRACTICE 1. The New York Stock Exchange generates about one terabyte of new trade data per day. 2. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes. 3. Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, comments, etc.
  6. CHALLENGES More data = more storage space • More storage = more money to spend (RDBMS servers need very costly storage) Data arrives faster • Speed up data processing or a backlog builds up Need to handle various data structures • How do we put JSON data into a standard RDBMS? • We also receive XML from other sources • Other systems give us compressed data in gzip format Agile business requirements • In the initial discussion the business needed only 10 fields; now it asks for 25. Can we deliver that when only those 10 are in our database?
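The variety challenge on slide 6 (JSON, XML, and gzip-compressed input landing in one relational store) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a method from the deck: the record layout, the field names (`id`, `name`, `email`), and the helper names are all hypothetical.

```python
import gzip
import json
import xml.etree.ElementTree as ET
from typing import Optional, Sequence, Tuple


def from_json(text: str) -> dict:
    """Parse one JSON object into a plain dict."""
    return json.loads(text)


def from_xml(text: str) -> dict:
    """Parse a flat XML element: each child tag becomes a key."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def from_gzip_json(blob: bytes) -> dict:
    """Decompress a gzip payload, then parse it as JSON."""
    return from_json(gzip.decompress(blob).decode("utf-8"))


def to_row(record: dict, columns: Sequence[str]) -> Tuple[Optional[str], ...]:
    """Project a record onto fixed table columns; missing fields become None."""
    return tuple(record.get(column) for column in columns)


if __name__ == "__main__":
    # Hypothetical sample records, one per source format.
    records = [
        from_json('{"id": "1", "name": "Ada"}'),
        from_xml("<user><id>2</id><name>Grace</name></user>"),
        from_gzip_json(gzip.compress(b'{"id": "3", "name": "Linus"}')),
    ]
    for record in records:
        print(to_row(record, ["id", "name", "email"]))
```

The `to_row` projection also shows the agile-requirements problem in miniature: a newly requested column such as `email` comes back as `None` for every record whose source never supplied it.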
  7. TYPES OF BIG DATA • Structured data: any data that can be stored, accessed, and processed in a fixed format. • Unstructured data: any data whose form or structure is unknown. • Semi-structured data: data that contains elements of both forms, such as JSON or XML.
  8. BENEFITS OF BIG DATA PROCESSING • Businesses can use outside intelligence when making decisions: access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies. • Improved customer service: traditional customer feedback systems are being replaced by new systems designed with 'Big Data' technologies. In these systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
  9. • Early identification of risks to products or services, if any • Better operational efficiency: 'Big Data' technologies can be used to create a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of 'Big Data' technologies and the data warehouse helps organizations offload infrequently accessed data.
  10. BIG DATA ANALYTICS Big data analytics is the process of examining large and varied data sets (i.e., big data) to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make better-informed business decisions.
  11. TRADITIONAL VS BIG DATA ANALYTICS • Traditional analytics: works on known, well-understood data; built on the relational database model. • Big data analytics: works on data whose format is not well understood, because it is largely unstructured or semi-structured; the data comes in various forms and formats from multiple disconnected systems and is almost flat, with no relationships.
  12. 4 TYPES OF ANALYTICS 1. Descriptive: what happened? 2. Diagnostic: why did it happen? 3. Predictive: what is likely to happen? 4. Prescriptive: what should I do about it?
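The four questions on slide 12 can be made concrete with a toy example. This is a minimal sketch on invented data: the regional sales figures, the stock level, and the reorder rule are all hypothetical, and the forecasting step is a plain least-squares trend line standing in for whatever model a real project would use.

```python
from statistics import mean

# Hypothetical toy data: units sold per month, split by region.
sales = {
    "north": [100, 110, 120, 130],
    "south": [80, 80, 70, 60],
}


def descriptive(series):
    """What happened? Summarize the history."""
    return {"total": sum(series), "mean": mean(series)}


def diagnostic(by_region):
    """Why did it happen? Attribute the first-to-last change to each region."""
    return {region: s[-1] - s[0] for region, s in by_region.items()}


def predictive(series):
    """What is likely to happen? Extrapolate a least-squares line one step ahead."""
    n = len(series)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(series)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar + slope * (n - x_bar)


def prescriptive(forecast, stock):
    """What should I do about it? Reorder whatever the forecast exceeds stock by."""
    return max(0, forecast - stock)


if __name__ == "__main__":
    print(descriptive(sales["north"]))
    print(diagnostic(sales))
    forecast = predictive(sales["north"])
    print(forecast)
    print(prescriptive(forecast, stock=125))
```

Each function answers one question and feeds the next: the summary shows the overall level, the per-region change shows where the movement came from, the trend line projects the next month, and the rule turns that projection into an action.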
  13. APPROACH TO ANALYTICS 1. Identify the data sources. 2. Select the right tools and technology to collect, store, and aggregate the data. 3. Understand the business domain. 4. Identify tools and technology to process the data. 5. Build mathematical models for the analytics. 6. Visualize. 7. Validate your results. 8. Learn, adapt, and rebuild your analytical model.
  14. ANALYTICS TOOLS The most widely used statistical programming tools are: • IBM SPSS • SAS • R • MATLAB. Of these, R and MATLAB have the most comprehensive support for statistical functions.
  15. HADOOP Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model. • Software framework that supports distributed applications, licensed under the Apache License 2.0. • Derived from Google's MapReduce and Google File System papers. • Yahoo is the largest contributor to the project. • Written in the Java programming language.
  16. HADOOP : MAPREDUCE
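The MapReduce model named on slide 16 can be illustrated with the classic word-count example. The sketch below is a single-process simulation in plain Python, not Hadoop's Java API: it only mirrors the three conceptual phases (map, shuffle, reduce), and all function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; here everything happens in one process so the data flow is easy to follow.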
  17. WHY USE HADOOP? • Need to compress data • Nodes fail every day • Common infrastructure: efficient, easy to use, open source
  18. COMMON USES • Searches • Log processing • Recommendation systems • Analytics (Facebook, LinkedIn) • Image and video processing (NASA) • Data retention
  19. TECHNOLOGIES AND TOOLS Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are based on relational databases oriented to structured data sets. As a result, many organizations that collect, process and analyze big data turn to NoSQL databases as well as Hadoop and its companion tools, including:
  20. • MapReduce: a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. • YARN: a cluster management technology and one of the key features in second-generation Hadoop. • Spark: an open-source parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.
  21. • HBase: a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). • Hive: an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. • Kafka: a distributed publish-subscribe messaging system designed to replace traditional message brokers. • Pig: an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.
  22. BIG DATA ANALYTICS BENEFITS • Driven by specialized analytics systems and software, big data analytics can point the way to various business benefits, including new revenue opportunities, more effective marketing, better customer service, improved operational efficiency and competitive advantages over rivals.
  23. • Big data analytics applications enable data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional business intelligence (BI) and analytics programs. • On a broad scale, data analytics technologies and techniques provide a means of analyzing data sets and drawing conclusions about them to help organizations make informed business decisions.
  24. BIG DATA ANALYTICS APPLICATION • Government: The use and adoption of big data within governmental processes enables efficiencies in terms of cost, productivity, and innovation, but does not come without its flaws. • Manufacturing: According to the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing.
  25. • Information Technology: Especially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). • Education: A McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers, and a number of universities, including the University of Tennessee and UC Berkeley, have created master's programs to meet this demand.
  26. THANK YOU

Editor's notes

  1. [1] http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/