Hw09 Rethinking The Data Warehouse With Hadoop And Hive
1. Rethinking Data Warehousing & Analytics Ashish Thusoo, Facebook Data Infrastructure Team
2.
3.
4. Trends Leading to More Data Free or low cost of user services Realization that more insights are derived from simple algorithms on more data
5. Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Closed and Proprietary Systems Limited Scalability does not support trends towards more data
14. Data Flow Architecture at Facebook Web Servers Scribe MidTier Filers Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Scribe-Hadoop Cluster Adhoc Hive-Hadoop Cluster Hive replication
Cost of training people is high – have to reduce cost by making system easy to use.
Why Hive? Petabytes of structured data User base familiar with SQL and Python/Perl/PHP Commercial Warehousing Software .. Does not scale, very expensive, inflexible Closed source, not programmable using Python/Perl/PHP Solution: SQL layer on top of scalable storage and map-reduce (Hadoop) Openness: Use any data format, embed any programming language
Nomenclature: Core switch and Top of Rack
1GB connectivity within a rack, 100MB across racks? Are all disks 7200 SATA?