02 a holistic approach to big data


  1. 1. Raul F. Chong Senior Big Data and Cloud Program Manager Big Data University Community Leader @raulchong A holistic approach to Big Data © 2013 BigDataUniversity.com
  2. 2. Agenda • Introduction to Big Data • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  4. 4. What is Big Data? Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. Source: Wikipedia
  5. 5. Big Data Characteristics Information is growing at a phenomenal rate: 44x as much data and content over the coming decade, from 800,000 petabytes in 2009 to 35 zettabytes in 2020 (= 4 trillion 8 GB iPods). Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
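The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which is an assumption on our part since the slide does not say which convention IDC used; the variable names are purely illustrative.

```python
# Assumption: decimal (SI) storage units; the slide does not state the convention.
GIGABYTE = 10**9
PETABYTE = 10**15
ZETTABYTE = 10**21

data_2009 = 800_000 * PETABYTE  # 0.8 zettabytes
data_2020 = 35 * ZETTABYTE

# Ratio between the 2020 projection and the 2009 figure.
growth_factor = data_2020 / data_2009

# How many 8 GB devices would be needed to hold the 2020 projection.
ipods_8gb = data_2020 / (8 * GIGABYTE)

print(f"growth factor: {growth_factor:.2f}x")
print(f"8 GB iPods needed: {ipods_8gb:.3e}")
```

Under these assumptions the ratio comes out just under 44 and the device count a little over 4 trillion, which is consistent with the rounded numbers quoted on the slide.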
  6. 6. Big Data Characteristics • About 80% of the world’s data is unstructured • It may be data we have been collecting all along but could not process
  7. 7. Types of Big Data • Data in movement - streams • Twitter / Facebook comments • Stock market data • Sensors: Vital signs of a newly-born • Data at rest - oceans • Collection of what has streamed • Web logs, emails, social media • Unstructured documents: forms, claims • Structured data from disparate systems
  8. 8. Traditional vs. big data business approaches • Traditional Approach (Structured & Repeatable Analysis): Business users determine what question to ask; IT structures the data to answer that question. Examples: monthly sales reports, profitability analysis, customer surveys • Big Data Approach (Iterative & Exploratory Analysis): IT delivers a platform to enable creative discovery; Business explores what questions could be asked. Examples: brand sentiment, product strategy, maximum asset utilization
  9. 9. Applications for Big Data Analytics • Homeland Security • Finance • Smarter Healthcare • Multi-channel sales • Telecom • Manufacturing • Traffic Control • Trading Analytics • Fraud and Risk • Log Analysis • Search Quality • Retail: Churn, NBO
  10. 10. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  11. 11. Big Data Adoption Phases
  12. 12. Use of Big Data globally and in the financial sector Multiple responses accepted
  13. 13. Big Data: An In-Demand, Well-Paying Skill • Skills are in demand • Pays well “If you can claim to be a data scientist and have the chops to back that up, you can pretty much write your own ticket even in this tough job market.” Source: Gigaom http://gigaom.com/cloud/big-data-skills-bring-big-dough/
  14. 14. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  15. 15. KTH Swedish Royal Institute of Technology: Reducing Traffic Congestion • Deployed real-time Smarter Traffic system to predict and improve traffic flow. • Analyzes streaming real-time data gathered from cameras at entry/exit to city, GPS data from taxis and trucks, and weather information. • Predicts best time and method to travel, such as when to leave to catch a flight at the airport. Results • Enables ability to analyze and predict traffic faster and more accurately than ever before • Provides new insight into mechanisms that affect a complex traffic system • Smarter, more efficient, and more environmentally friendly traffic
  16. 16. University of Southern California Innovation Lab Monitors Political Debates Benefits • Real-time display of public sentiment as candidates respond to questions • Debate winner prediction based on public opinion instead of solely political analysts
  17. 17. Big Data – A holistic approach Big Data is Not Only Hadoop! • Examples where Hadoop is not entirely applicable: – Cyber security, stock market, traffic control, sensor information, monitoring trends in social media – What if your company has many silos of information that are difficult to move to HDFS? – What about governance? Can we trust the source of this data?
  18. 18. Big data holistic approach: A platform Layers shown: Solutions, Big Data Platform, Analytics and Decision Management, Big Data Infrastructure
  19. 19. Big data holistic approach: A platform The IBM Big Data Platform • Data Warehouse: Delivers deep insight with advanced in-database analytics & operational analytics
  20. 20. Big data holistic approach: A platform • Stream Computing: Analyze streaming data and large data bursts for real-time insights
  21. 21. Big data holistic approach: A platform • Hadoop System: Cost-effectively analyze petabytes of unstructured and structured data
  22. 22. Big data holistic approach: A platform • Information Integration & Governance: Govern data quality and manage the information lifecycle
  23. 23. Big data holistic approach: A platform • Accelerators: Speed time to value with analytic and application accelerators
  24. 24. Big data holistic approach: A platform • Visualization & Discovery (shown with Systems Management and Application Development): Discover, understand, search, and navigate federated sources of big data
  25. 25. Big data holistic approach: A platform • Process any type of data – Structured, unstructured, in-motion, at-rest, in-place • Built-for-purpose engines – Designed to handle different requirements • Manage and govern data in the ecosystem • Enterprise data integration • Grow and evolve on current infrastructure • The whole is greater than the sum of parts • Integrated components • Out of the box, standards-based services • Start small (value is additive)
  26. 26. An example of the big data platform in practice • Ingestion and Real-time Analytic Zone: Streams, Connectors • Landing and Analytics Sandbox Zone: Hadoop (MapReduce, Hive/HBase, Col Stores, documents in a variety of formats) • Warehousing Zone: Enterprise Warehouse, Data Marts • Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery • Metadata and Governance Zone: ETL, MDM, Data Governance
  27. 27. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  28. 28. The 5 High Value Big Data Use Cases • Big Data Exploration: Find, visualize, understand all big data to improve business knowledge • Enhanced 360º View of the Customer: Achieve a true unified view, incorporating internal and external sources • Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time • Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency • Operations Analysis: Analyze a variety of machine data for improved business results
  29. 29. Big Data Exploration: Illustrated Find, visualize and understand all big data to improve business knowledge • Greater efficiencies in business processes • New insights from combining and analyzing data types in new ways • Develop new business models with resulting increased market presence and revenue. Diagram labels: sources (CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems); Data Explorer (Connector Framework, App Builder, UI / User); Streams; Hadoop; Warehouse; Integration & Governance
  30. 30. Big Data Exploration: Example in Practice Airplane Manufacturer (blinded for confidentiality) • Exploring 4 TB to drive point business solutions (supplier portal, call center, etc.) • Single point of data fusion for all employees to use • Reduced costs & improved operational performance for the business. Is Big Data Exploration Right for You? • How do you separate the “noise” from useful content? • How do you perform data exploration on large and complex data? • How do you find insights in new or unstructured data types (e.g. social media and email)? • How do you enable employees to navigate and explore enterprise and external content? Can you present this in a single user interface? • How do you identify areas of data risk before they become a problem? • What is the starting point for your big data initiatives? Big Data Platform Component Starting Point: Data Explorer
  31. 31. Enhanced 360º View of the Customer: Illustrated SOURCE SYSTEMS • CRM: Name: J Robertson; Address: 35 West 15th; Address: Pittsburgh, PA 15213 • ERP: Name: Janet Robertson; Address: 35 West 15th St.; Address: Pittsburgh, PA 15213 • Legacy: Name: Jan Robertson; Address: 36 West 15th St.; Address: Pittsburgh, PA 15213. 360 View of Party Identity (Master Data Management) • First: Janet; Last: Robertson; Address: 35 West 15th St; City: Pittsburgh; State/Zip: PA / 15213; Gender: F; Age: 48; DOB: 1/4/64. Unified View of Party’s Information: Hadoop, Streams, Warehouse
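The consolidation of the three source records into one party identity can be illustrated with a toy survivorship rule. This is only a sketch under stated assumptions: the record values are copied from the slide, while the merge rules (longest first name wins, most common normalized street wins) and the helper names `normalize_street` and `merge_records` are hypothetical and are not how any particular Master Data Management product works.

```python
from collections import Counter

# Source records as shown on the slide (CRM, ERP, Legacy).
records = [
    {"source": "CRM", "name": "J Robertson", "street": "35 West 15th",
     "city_state_zip": "Pittsburgh, PA 15213"},
    {"source": "ERP", "name": "Janet Robertson", "street": "35 West 15th St.",
     "city_state_zip": "Pittsburgh, PA 15213"},
    {"source": "Legacy", "name": "Jan Robertson", "street": "36 West 15th St.",
     "city_state_zip": "Pittsburgh, PA 15213"},
]


def normalize_street(street):
    """Lowercase, drop periods, and strip a trailing 'st' so variants compare equal."""
    cleaned = street.lower().replace(".", "").strip()
    if cleaned.endswith(" st"):
        cleaned = cleaned[:-3]
    return cleaned


def merge_records(recs):
    """Hypothetical survivorship rules: longest first name, majority vote elsewhere."""
    first_names = [r["name"].split()[0] for r in recs]
    last_names = Counter(r["name"].split()[-1] for r in recs)
    streets = Counter(normalize_street(r["street"]) for r in recs)
    locations = Counter(r["city_state_zip"] for r in recs)
    return {
        "first": max(first_names, key=len),
        "last": last_names.most_common(1)[0][0],
        "street": streets.most_common(1)[0][0],
        "city_state_zip": locations.most_common(1)[0][0],
    }


golden = merge_records(records)
print(golden)
```

A real system would first have to decide that the three records refer to the same person, typically with probabilistic matching; the sketch skips that step and only shows how conflicting attribute values might be reconciled.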
  32. 32. Security/Intelligence Extension: Illustrated • Traditional Security Operations and Technology: Logs; Events; Alerts; Configuration information; System audit trails; External threat intelligence feeds; Network flows and anomalies; Identity context • Big Data Analytics: Web page text; Video/audio surveillance; E-mail and social activity; Business process data; Customer transactions • New Considerations – Collection, Storage and Processing: Collection and integration; Size and speed; Enrichment and correlation – Analytics and Workflow: Visualization; Unstructured analysis; Learning and prediction; Customization; Sharing and export
  33. 33. “Reconstructing Events” – Integrating Multimedia from Diverse Sources • Correlate multimedia content across a wide diversity of sources and dynamic topology of cameras • Exploit partial overlaps in field of view, re-identification of objects/people and contextual information • Obtain real-time operational picture across diverse content. Sources: • 100K security cameras (static cameras, slowly changing topology) • 10M mobile photos/day (limited knowledge about locations) • 50M social media photos/video (uncertain geo-temporal context) • Moving vehicles (patrol cars), overhead drones, broadcast, retail, 311, etc. Diagram labels: Overhead, Social Media, Mobile Cameras, Security Cameras
  34. 34. Security/Intelligence Extension: Customer Example Captured and analyzed 42 TB of daily traffic in real time for tracking persons of interest to take suitable action and reduce risk. Would the Security / Intelligence Extension benefit you? • What are your plans to enrich your security or intel system with unused or underleveraged data sources (video, audio, smart devices, network, Telco, social media)? • How will you address the need for sub-second detection, identification, and resolution of physical or cyber threats? • How do you intend to follow activities of criminals, terrorists, or persons on a blacklist? • How do you plan to enhance your surveillance system with real-time data from video, acoustic, thermal or other security sensors? • Do you want to correlate lots of technical or human intel data and sources looking for associations or patterns (big data forensics)? • How are you going to deal with unstructured data (email, social, etc.) in your Security Information & Event Management (SIEM) solution to improve cyber threat detection & remediation? Big Data Platform Component Starting Point: Streams, Hadoop
  35. 35. Operations Analysis: Illustrated Diagram labels: Raw Logs and Machine Data; Machine Data Accelerator; Indexing, Search; Statistical Modeling; Root Cause Analysis; Federated Navigation & Discovery; Real-time Analysis; Only store what is needed
  36. 36. Operations analysis is a Business Imperative Cost of System Down Time – 49% of Fortune 500 companies > 80 hrs down time/year [1] • Cost of down time: $90,000/hr to $6.48 million/hr • 80 hours * $6.48M = approx. $500M per year – System downtime costs North American businesses $26.5 billion a year in lost revenue [2]. Sources: [1] http://www.information-management.com/infodirect/2009_133/downtime_cost-10015855-1.html [2] http://www.itchannelplanet.com/business_news/article.php/3916786/IT-System-Downtime-Costs-265-Billion-A-Year-Study-Finds.htm
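The annual cost estimate on this slide is a simple product of hours and hourly cost. The short sketch below reproduces it for both ends of the quoted range; the constant names are illustrative, and the inputs are taken directly from the slide.

```python
# Inputs quoted on the slide.
HOURS_DOWN_PER_YEAR = 80
COST_PER_HOUR_LOW = 90_000       # USD per hour, low end of the range
COST_PER_HOUR_HIGH = 6_480_000   # USD per hour, high end of the range

# Annual cost at each end of the range.
annual_cost_low = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_LOW
annual_cost_high = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_HIGH

print(f"low estimate:  ${annual_cost_low:,}")
print(f"high estimate: ${annual_cost_high:,}")
```

The high end works out to $518.4 million, which the slide rounds to roughly $500M; the low end of the same range would be $7.2 million per year.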
  37. 37. Operations Analysis: Customer Example • Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management • Optimized building energy consumption with centralized monitoring; automated preventive and corrective maintenance • Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos. Would Operations Analysis benefit you? • Do you deal with large volumes of machine data? • How do you access and search that data? • How do you perform root cause analysis? • How do you perform complex real-time analysis to correlate across different data sets? • How do you monitor and visualize streaming data in real time and generate alerts? Big Data Platform Component Starting Point: Hadoop, Streams
  38. 38. Data Warehouse Augmentation: Needs Integrate big data and data warehouse capabilities to increase operational efficiency. Need to leverage variety of data: • Structured, unstructured, and streaming data sources required for deep analysis • Low latency requirements (hours, not weeks or months) • Required query access to data. Extend warehouse infrastructure: • Optimized storage, maintenance and licensing costs by migrating rarely used data to Hadoop • Reduced storage costs through smart processing of streaming data • Improved warehouse performance by determining what data to feed into it
  39. 39. Filter and summarize big data for the warehouse Hadoop Data Warehouse Augmentation: Illustrated
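As a rough illustration of "filter and summarize" before loading a warehouse, the sketch below drops non-business events from a raw feed and aggregates the rest into one row per day and store. Everything here is hypothetical: the event fields, the sample values, and the `summarize` helper are made up for illustration, and in practice this step would typically run as a job on the Hadoop cluster rather than in a single Python process.

```python
from collections import defaultdict

# Hypothetical raw events landing in the big data store.
raw_events = [
    {"day": "2013-05-01", "store": "A", "type": "sale", "amount": 20.0},
    {"day": "2013-05-01", "store": "A", "type": "heartbeat", "amount": 0.0},
    {"day": "2013-05-01", "store": "A", "type": "sale", "amount": 5.0},
    {"day": "2013-05-01", "store": "B", "type": "sale", "amount": 7.5},
    {"day": "2013-05-02", "store": "A", "type": "sale", "amount": 12.5},
]


def summarize(events):
    """Keep only sale events and aggregate them per (day, store)."""
    totals = defaultdict(lambda: {"sales": 0, "revenue": 0.0})
    for event in events:
        if event["type"] != "sale":  # filter: discard noise such as heartbeats
            continue
        key = (event["day"], event["store"])
        totals[key]["sales"] += 1
        totals[key]["revenue"] += event["amount"]
    # One compact row per key is what would be loaded into the warehouse.
    return [
        {"day": day, "store": store, **agg}
        for (day, store), agg in sorted(totals.items())
    ]


warehouse_rows = summarize(raw_events)
for row in warehouse_rows:
    print(row)
```

The point of the pattern is that the warehouse receives a few summarized rows instead of every raw event, while the raw data stays available in the cheaper store.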
  40. 40. Hadoop as a query-ready archive for a data warehouse Hadoop Data Warehouse Augmentation: Illustrated
  41. 41. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  42. 42. Open Source Hadoop Layers shown: Visualization & Discovery; Connectors; Workload Optimization; Runtime; Advanced Engines; File System; Data Store; Development Tools; Systems Management. Open source components shown: Flume, MapReduce, HDFS, HBase, Eclipse Plug-ins, Jaql, Pig, ZooKeeper, Lucene, Oozie, Hive, Mahout, Whirr, Sqoop, Hue, HCatalog, R
  43. 43. IBM InfoSphere BigInsights v2.1 Enterprise Edition: Added Value on Top of Open Source Hadoop Layers shown: Visualization & Discovery; Integration; Workload Optimization; Runtime; Advanced Analytic Engines; File System; Data Store; Applications & Development; Administration. Components shown (legend: IBM, Open Source): Streams, Netezza, Flume, DB2, DataStage, MapReduce, HDFS, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Text Analytics, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Hive, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow, Monitoring, Management, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, GPFS, High Availability, Big SQL, HCatalog, Whirr, Mahout, Hue
  44. 44. InfoSphere BigInsights Added Value InfoSphere BigInsights Administration & Security Workload Optimization (MapReduce/SQL) Connectors Development Tools IBM tested & supported open source components Accelerators Open source based components Workload Management Security Development Environment Analytics/Extractors Analytics Extraction engine (System T) Visualization & Exploration Extractors and APIs SQL API
  45. 45. InfoSphere BigInsights Added Value: Accelerators Social Data Analytics Accelerator Architecture • Input: Social Media Data • Online flow (data-in-motion analysis), Stream Computing and Analytics: Data Ingest and Prep; Extract Buzz, Intent, Sentiment; Entity Analytics: Profile Resolution; Dashboard with real-time analytics, pre-defined views and charts • Offline flow (data-at-rest analysis), BigInsights System and Analytics: Extract Buzz, Intent, Sentiment and Consumer Profiles; Entity Analytics and Integration; Comprehensive Social Media Customer Profiles; Pre-defined Workbooks and Dashboards • Optional: Indexed Search (index using Push API), ad hoc access through Data Explorer
  46. 46. InfoSphere BigInsights Added Value: BigSheets BigSheets Visualization and Exploration • Web-based analysis and visualization for users • Familiar spreadsheet-like interface • Define and manage long-running data collection jobs
  47. 47. InfoSphere BigInsights Added Value: BigSheets No programming knowledge needed! How it works • Model “big data” collected from various sources as collections • Filter and enrich content with built-in functions • Combine data in different collections • Visualize results through spreadsheets, charts • Export data into common formats (if desired)
  48. 48. InfoSphere BigInsights Added Value: Dev Tools Development Environment • Eclipse-based dev environment • Developer tools and a set of analytic extractors for fast adoption and reduction in coding and debugging time • Plugins for Text Analytics, MapReduce programming, Jaql development, Hive query development, and more
  49. 49. InfoSphere BigInsights Added Value: Dev Tools How it works • Built-in Apps make it easy to run Big Data applications & tasks: – Import and export data from a database or files – Import and export Web and social data – Perform Text Analytics on specified content – Query HBase content – Query content stored in BigInsights using Big SQL – Execute Pig or Jaql applications • Extensible! Build your own applications and make them easy to execute from an appealing Application launcher © 2013 IBM Corporation
  50. 50. InfoSphere BigInsights Added Value: Dev Tools
  51. 51. InfoSphere BigInsights Added Value: Text Analytics Advanced Text Analytics Engine: automatically identify and understand key information in text. Example text: “Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.” Diagram labels: InfoSphere BigInsights (Administration & Security; Workload Optimization; Connectors; Advanced Engines; Visualization & Exploration; Development Tools; open source Hadoop components)
  52. 52. Sentiments for movie Ra.One :-(
  53. 53. Architecture Diagram • AQL: rule language with familiar SQL-like syntax; specify annotator semantics declaratively • Text Analytics Optimizer: chooses an efficient execution plan that implements the semantics • Compiled Operator Graph (.aog) • Text Analytics Runtime: highly scalable, embeddable Java runtime; takes an Input Document Stream and produces an Annotated Document Stream
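The overall shape of this architecture (declarative rules, a compile step, then a runtime that turns a document stream into an annotated document stream) can be mimicked in a few lines of Python. To be clear, this is an analogy and not AQL or the BigInsights runtime: the rules are ordinary regular expressions, "compiling" just means calling `re.compile`, and the names `RULES`, `compile_rules`, and `annotate` are hypothetical. The sample sentence is taken from the Text Analytics slide earlier in the deck.

```python
import re

# Declarative "rules": annotation type -> pattern (a stand-in for a rule language).
RULES = {
    "Score": r"\b\d+-\d+\b",
    "Team": r"\b(?:Netherlands|Spain)\b",
}


def compile_rules(rules):
    """'Compile' the rules once, before any documents are processed."""
    return {name: re.compile(pattern) for name, pattern in rules.items()}


def annotate(documents, compiled):
    """Runtime: consume a stream of documents, yield annotated documents."""
    for doc in documents:
        annotations = [
            (name, match.group(0), match.start())
            for name, regex in compiled.items()
            for match in regex.finditer(doc)
        ]
        # Order annotations by their position in the text.
        annotations.sort(key=lambda item: item[2])
        yield {"text": doc, "annotations": annotations}


docs = [
    "One team distinguished themselves well, losing to the eventual champions "
    "1-0 in the Final. Early in the second half, Netherlands' striker, Arjen "
    "Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the "
    "save. Winger Andres Iniesta scored for Spain for the win."
]

annotated = list(annotate(docs, compile_rules(RULES)))
for name, text, start in annotated[0]["annotations"]:
    print(f"{name:5s} {text!r} at offset {start}")
```

Separating rule definition from execution is the design idea worth noticing: the rules say what to extract, and the compile step is free to choose how, which is where an optimizer would fit.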
  54. 54. InfoSphere BigInsights – Added Value: Connectors Connectors • Databases: DB2, Netezza, Oracle, Teradata. Integrations • InfoSphere DataStage (data collection and integration) • InfoSphere Streams (real-time streams processing) • InfoSphere Guardium (security and monitoring) • Cognos Business Intelligence (Business Intelligence capabilities) • IBM Platform Computing (cluster/grid infrastructure and management) and more…
  55. 55. BigInsights – Added Value: Workload optimization Hadoop System Scheduler • Identifies small and large jobs from prior experience • Sequences work to reduce overhead. Adaptive MapReduce • Drop-in replacement for Hadoop batch scheduler • Dramatic performance gains for latency-sensitive application workloads • Agile scheduling, dynamically adjust priorities at run-time
  56. 56. BigInsights – Added Value: Web Console Web Console • Start / stop services • Run / monitor jobs (applications) • Explore / modify file system • Built-in Apps simplify common tasks
  57. 57. BigInsights – Added Value: Security Security • LDAP authentication • Support for PAM & Flat File configuration • Administrators restrict access to authorized users • HTTPS support for the InfoSphere BigInsights console, and reverse proxy • Role-based access
  58. 58. How Streams Works • Continuous ingestion • Continuous analysis. Achieve scale: • By partitioning applications into software components • By distributing across stream-connected hardware hosts. Infrastructure provides services for: • Scheduling analytics across hardware hosts • Establishing streaming connectivity. Operators shown: Transform, Filter / Sample, Classify, Correlate, Annotate. Where appropriate, elements can be fused together for lower communication latency
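The idea of partitioning an application into small operators connected by streams can be sketched with Python generators. This is only an in-process analogy, not InfoSphere Streams or its SPL language: the operator functions, the heart-rate readings (echoing the newborn vital-signs example earlier in the deck), and the alert thresholds are all hypothetical.

```python
def source(readings):
    """Continuous ingestion: emit readings one at a time."""
    for reading in readings:
        yield reading


def filter_valid(stream):
    """Filter operator: drop readings with no measurement."""
    for reading in stream:
        if reading["bpm"] is not None:
            yield reading


def transform(stream):
    """Transform operator: normalize the measurement to a float."""
    for reading in stream:
        yield {**reading, "bpm": float(reading["bpm"])}


def classify(stream, low=100.0, high=180.0):
    """Classify operator: label readings outside the (assumed) normal range."""
    for reading in stream:
        in_range = low <= reading["bpm"] <= high
        yield {**reading, "label": "normal" if in_range else "alert"}


# Hypothetical sensor readings; thresholds above are illustrative, not clinical.
readings = [
    {"id": 1, "bpm": 140},
    {"id": 2, "bpm": None},
    {"id": 3, "bpm": 195},
    {"id": 4, "bpm": "90"},
]

# Operators are chained so each tuple flows through the whole pipeline as it arrives.
pipeline = classify(transform(filter_valid(source(readings))))
results = list(pipeline)
for result in results:
    print(result)
```

Because each operator only sees a stream in and a stream out, operators could in principle be placed on different hosts, or combined into a single function where the hop between them costs more than it is worth, which is roughly what "fusing" refers to on the slide.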
  59. 59. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  60. 60. The Future of Big Data and Cloud • SQL for Hadoop support improvements, towards full ANSI support: Hive; Impala (Cloudera); Big SQL (IBM); Stinger (Hortonworks); Drill (MapR); HAWQ (Pivotal); SQL-H (Teradata) • Improvements in Multimedia Analytics • Growth in usage and adoption of the R programming language • Cloud: bare metal support helping with Hadoop workloads; private network; full support with APIs
  61. 61. Big SQL overview Big SQL fully integrates with SQL applications and BI tooling, with benefits including: • Existing queries run with no or few modifications • Existing JDBC and ODBC compliant tools can be leveraged • Applications do not have to compensate for constraints of Hive QL, which may result in: – more statements – potentially moving more data over the network to the application. Diagram labels: Application (SQL Language, JDBC / ODBC Driver); BigInsights (JDBC / ODBC Server, BigSQL Engine); Data Sources (Hive Tables, HBase Tables, CSV Files). Try it out! Big SQL 3.0 Technology Preview: bigsql.imdemocloud.com
  62. 62. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  63. 63. BigInsights on the Cloud - Making Learning Hadoop Easy and Fun • M2M Demos (using Streams) • The Connected Car Demo – http://ausgsa.ibm.com/projects/c/connected_car/index.html – http://m2m.demos.ibm.com/ • YouTube IBM Big Data Channel – http://www.youtube.com/user/ibmbigdata Big Data University (bigdatauniversity.com)
  64. 64. Agenda • The state of Big Data adoption • Big Data – A holistic approach • The 5 high value Big Data use cases • Technical details of key Big Data components • The future of Big Data and Cloud • Demos • Resources
  65. 65. Big Data University (bigdatauniversity.com) • Flexible on-line delivery allows learning @your place and @your pace • Free courses, free study materials • Cloud-based sandbox for exercises – zero setup with Robust Course Management System and Content Distribution infrastructure • 169,000 registered students • Free IBM Hadoop, BigInsights Publications
  66. 66. BigInsights on the Cloud - Making Learning Hadoop Easy and Fun Quick Start Editions available (free, non-production, no time bomb): – IBM InfoSphere BigInsights (IBM’s Hadoop Distribution) ibm.co/QuickStart – IBM InfoSphere Streams ibm.co/streamsqs Big Data University (bigdatauniversity.com)
  67. 67. My contact information Twitter: @raulchong Facebook: facebook.com/raul.f.chong LinkedIn: linkedin.com/pub/raul-f-chong/8/aa2/b63
  68. 68. Thank You! © 2013 BigDataUniversity.com
