2. Agenda: State of the Data; What is Hadoop; Hadoop Ecosystem; References
3. State of the data: Data-driven businesses (businesses have been collecting information all the time); Mine more == Collect more (and vice versa); Challenges (application complexities, data growth, infrastructure, economics); Need of the day
5. Data-driven business: Applications (searches, message posts, comments, emails, blogs, photos, video clips, product listings; ERP, CRM, databases, internal applications, customer/consumer-facing products); Mobile; Context (web, customers, products, business systems, processes, services); Support Systems (CRM, SOA, recommendation systems/processes, data warehouses, business intelligence, BPM)
7. Mine more: Drivers (ROI, customer retention, product affinity, market trends, research analysis, customer/consumer analytics); Process (clustering, classification, build relationships, regression); Types (structured, semi-structured, unstructured)
9. Challenges: Complex Applications (data integration is a worthwhile but complex problem to solve); Data Growth (growth is exponential); Infrastructure (availability, unscalable hardware); Economics (managing high data volume comes at a price; failures are very costly)
10. Need of the day: a system that can handle high-volume data; a system that can perform complex operations; scalable; robust; highly available; fault tolerant; cheap
11. Top-level Apache project; open source; inspired by Google’s white papers on Map/Reduce (MR) and the Google File System (GFS); originally developed to support the Apache Nutch search engine; software framework written in Java; designed for sophisticated analysis and for dealing with complex structured and unstructured data
12. Why Hadoop? Runs on commodity hardware; shared-nothing architecture; scale hardware whenever you want; the system compensates for hardware scaling and issues (if any); runs large-scale, high-volume data processes; scales well with complex analysis jobs; handles failures; ideal for consolidating data from both new and legacy data sources; value to the business
14. Hadoop Ecosystem: HDFS (Hadoop Distributed File System); Map/Reduce (software framework for clustered, distributed data processing); ZooKeeper (coordination service for distributed applications); Avro (data serialization); Chukwa (data collection system for monitoring distributed systems); HBase (data storage for large distributed tables); Hive (data warehousing infrastructure); Pig (high-level query language)
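The Map/Reduce entry above can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch in plain Python, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the map and reduce steps would run in parallel on different nodes, and the framework would do the grouping between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for an input split that a separate node would map.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because each map call looks at only one line and each reduce call at only one key, the work can be spread across machines that share nothing, which is the property the framework relies on.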
15. HDFS – Hadoop Distributed File System: master/slave architecture; runs on commodity hardware; fault tolerant; handles large volumes of data; provides high throughput; streaming data access; simple file coherency model; portable across heterogeneous hardware and software; robust (handles disk failures, replication and re-replication; performs cluster rebalancing and data integrity checks)
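The replication and re-replication behaviour listed above can be sketched as a toy simulation. This is an illustration under simplifying assumptions, not how HDFS is implemented: real HDFS placement is rack-aware and managed by the master node, whereas this sketch uses simple round-robin placement. The node names, block names, and the functions `place_blocks` and `re_replicate` are all hypothetical.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin placement is a stand-in for the real policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def re_replicate(placement, failed_node, live_nodes, replication=3):
    """Drop replicas on the failed node, then copy under-replicated
    blocks to other live nodes until the target count is restored."""
    for holders in placement.values():
        holders[:] = [node for node in holders if node != failed_node]
        for candidate in live_nodes:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]       # hypothetical data nodes
blocks = ["blk_0", "blk_1", "blk_2"]       # hypothetical block ids

placement = place_blocks(blocks, nodes)
live = [node for node in nodes if node != "dn2"]
placement = re_replicate(placement, "dn2", live)
print(placement)
```

After the simulated loss of `dn2`, every block is back to three replicas on the surviving nodes, which is the outcome re-replication is meant to guarantee.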