2. Papers
- LCI: a social channel analysis platform for live customer intelligence
- Bistro data feed management system
- Apache Hadoop goes realtime at Facebook
- Nova: continuous Pig/Hadoop workflows
- A Hadoop-based distributed loading approach to parallel data warehouses
- A batch of PNUTS: experiences connecting cloud batch and serving systems
3. Papers (Continued)
- Turbocharging DBMS buffer pool using SSDs
- Online reorganization in read-optimized MMDBS
- Automated partitioning design in parallel database systems
- Oracle database filesystem
- Emerging trends in enterprise data analytics: connecting Hadoop and DB2 warehouse
- Efficient processing of data warehousing queries in a split execution environment
- SQL Server column store indexes
- An analytic data engine for visualization in Tableau
5. Workload Types
- Facebook Messaging
  - High write throughput
  - Large tables
  - Data migration
- Facebook Insights (realtime analytics)
  - High-throughput increments
- Facebook Metrics System (ODS)
  - Automatic sharding
  - Fast reads of recent data, plus table scans
6. Why Hadoop & HBase Elasticity High write throughput Efficient and low-latency strong consistency semantics within a data center Efficient random reads from disk High Availability and Disaster Recovery Fault Isolation Atomic read-modify-write primitives Range Scans Tolerance of network partitions within a single data center Zero Downtime in case of individual data center failure Active-active serving capability across different data centers
7. Realtime HDFS
- High availability: AvatarNode
  - Hot standby AvatarNode
  - Enhancements to HDFS transaction logging
  - Transparent failover: DAFS (client enhancement + ZooKeeper)
- Hadoop RPC compatibility
- Block availability: a pluggable block placement policy
8. Realtime HDFS (Cont.)
- Performance improvements for a realtime workload
  - RPC timeout
  - Recover file lease (HDFS-append, recoverLease)
  - Reads from local replicas
- New features
  - HDFS sync (see the sketch after this list)
  - Concurrent readers (last chunk of data)
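HDFS sync is what lets HBase persist its write-ahead log without closing the file. A minimal sketch of the durability call using the current FileSystem API; hflush() is the successor of the paper-era sync(), and the log path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical log path; in HBase this would be a WAL file.
        try (FSDataOutputStream out = fs.create(new Path("/logs/wal.00001"))) {
            out.writeBytes("put row1 cf:col=value\n");
            // Push the buffered bytes to every datanode in the pipeline so
            // concurrent readers can see them and they survive a client
            // crash -- without closing the still-open file.
            out.hflush();
        }
    }
}
```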
10. Deployment and Operational Experiences
- Testing: auto testing tool, HBase Verify
- Monitoring and tools: HBCK, more metrics
- Manual versus automatic splitting: add new RegionServers rather than splitting regions
- Dark launch (gradual rollout)
- Dashboards / ODS integration
- Backups at the application layer
- Schema changes
- Importing data
  - LZO & zip compression
  - Reducing network I/O
  - Major compaction
12. Nova Overview
- Scenarios
  - Ingesting and analyzing user behavior logs
  - Building and updating a search index from a stream of crawled web pages
  - Processing semi-structured data feeds
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features
13. Abstract Workflow Model
- Workflow
  - Two kinds of vertices: tasks (processing steps) and channels (data containers)
  - Edges connect tasks to channels and channels to tasks
  - Edge annotations (all, new, B, and Δ); see the toy model after this list
- Four common patterns of processing
  - Non-incremental (template detection)
  - Stateless incremental (shingling)
  - Stateless incremental with lookup table (template tagging)
  - Stateful incremental (de-duping)
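To make the vertex/edge vocabulary concrete, here is a hypothetical, minimal Java rendering of the model; the type and annotation names are mine, not Nova's API:

```java
import java.util.ArrayList;
import java.util.List;

// Two kinds of vertices: tasks transform data, channels hold it.
interface Vertex {}
record Task(String name) implements Vertex {}
record Channel(String name) implements Vertex {}

// Edge annotations: how a task reads from / writes to a channel.
enum Mode { ALL, NEW, BASE, DELTA }

record Edge(Vertex from, Vertex to, Mode mode) {}

class Workflow {
    final List<Edge> edges = new ArrayList<>();

    // Stateless incremental pattern (e.g., shingling): consume only
    // NEW deltas from the input channel, produce a DELTA on the output.
    static Workflow statelessIncremental() {
        Workflow w = new Workflow();
        Channel pages = new Channel("crawled-pages");
        Task shingle = new Task("shingling");
        Channel shingles = new Channel("shingles");
        w.edges.add(new Edge(pages, shingle, Mode.NEW));
        w.edges.add(new Edge(shingle, shingles, Mode.DELTA));
        return w;
    }
}
```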
14. Abstract Workflow Model (Cont.)
- Data and update model
  - Blocks: base blocks and delta blocks
  - Channel functions: merge, chain, and diff (see the merge sketch after this list)
- Task/data interface
  - Consumption mode: all or new
  - Production mode: B or Δ
- Workflow programming and scheduling
- Data compaction and garbage collection
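A hedged sketch of the merge channel function's semantics, assuming key/value blocks where a delta's value overwrites the base's. Nova's actual channel functions are user-supplied; this shows only the common last-writer-wins upsert case:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChannelFunctions {
    // merge(base, deltas): fold delta blocks into the base block,
    // assuming last-writer-wins upsert semantics per key.
    static Map<String, String> merge(Map<String, String> base,
                                     List<Map<String, String>> deltas) {
        Map<String, String> merged = new LinkedHashMap<>(base);
        for (Map<String, String> delta : deltas) {
            merged.putAll(delta); // each delta overwrites earlier values
        }
        return merged;
    }
    // chain would concatenate deltas without folding them into a base;
    // diff would compute the delta between two base blocks.
}
```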
17. Introduction
- Two approaches
  - Starting with a parallel database system and adding some MapReduce features
  - Starting with MapReduce and adding database system technology
  - HadoopDB follows the second approach
- Two heuristics behind HadoopDB's optimizations
  - Database systems process data at a faster rate than Hadoop, so push as much work as possible into them
  - Minimize the number of MapReduce jobs in the SQL execution plan
18. HadoopDB
- HadoopDB architecture: Database Connector, Data Loader, Catalog, Query Interface
- VectorWise/X100 database (SIMD) vs. PostgreSQL
- HadoopDB query execution
  - Selection, projection, and partial aggregation pushed into the Map and Combine phases (see the sketch after this list)
  - Co-partitioned tables joined inside the database system
  - MapReduce used for redistributing data
  - SideDB (a "database task done on the side")
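"Partial aggregation in Map and Combine" is standard Hadoop practice: the reducer class doubles as a combiner so each map task pre-aggregates before the shuffle. A minimal sketch; the pipe-delimited record layout and column positions are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PartialAgg {
    // Map: project (group key, measure) out of each input record.
    public static class ProjectMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split("\\|");
            // Hypothetical layout: cols[0] = group key, cols[3] = measure.
            ctx.write(new Text(cols[0]),
                      new LongWritable(Long.parseLong(cols[3])));
        }
    }

    // Used both as Combiner (partial aggregation on the map side)
    // and as Reducer (final aggregation).
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    static void configure(Job job) {
        job.setMapperClass(ProjectMapper.class);
        job.setCombinerClass(SumReducer.class); // partial aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
    }
}
```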
19. Split Query Execution
- Referential partitioning
  - Foreign-key joins execute locally inside the database engine when tables are co-partitioned
- Split MR/DB joins
  - Directed join: one of the tables is already partitioned by the join key
  - Broadcast join: the small table is shipped to every node
  - Adding specialized joins to the MR framework: map-side join (see the sketch after this list); tradeoff: a temporary table for the join
  - Another option: MR redistributes the data, then a directed join runs in the database
- Split MR/DB semijoin
  - Rewritten as a selection like 'foreignKey IN (listOfValues)'
  - Can be split into two MapReduce jobs
  - SideDB eliminates the first MapReduce job
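For the broadcast join, a hedged sketch of the standard Hadoop map-side technique: the small table is shipped to every node (e.g., via the distributed cache), loaded into a hash map in setup(), and probed in map(), so no reduce phase is needed. The file name and record layout are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side broadcast join: the small dimension table is replicated to
// every node; the big fact table streams through the mappers.
public class BroadcastJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> dim = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // "dim.tbl" is assumed to have been shipped to each task via
        // job.addCacheFile(...); layout: key|payload per line.
        try (BufferedReader r = new BufferedReader(new FileReader("dim.tbl"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split("\\|", 2);
                dim.put(f[0], f[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\\|", 2);   // f[0] = join key
        String payload = dim.get(f[0]);
        if (payload != null) {                            // inner join
            ctx.write(new Text(f[0]), new Text(f[1] + "|" + payload));
        }
    }
}
```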
20. Split Query Execution (Cont.)
- Post-join aggregation
  - Naive plan: two MapReduce jobs
  - Hash-based partial aggregation saves significant I/O
  - A similar technique applies to TOP N selections (see the sketch after this list)
- Pre-join aggregation
  - Applies to MR-based joins
  - Pays off when the cardinality of the group-by and join-key columns is smaller than the cardinality of the entire table
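A hedged sketch of the TOP N analogue of partial aggregation: each map task keeps only its local top N in a bounded min-heap and emits it in cleanup(), so the final reducer merges m small lists instead of scanning the whole table. N and the record layout are assumptions:

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Partial TOP N: bound per-mapper state to N entries so only about
// m*N records reach the reducer that produces the final TOP N.
public class TopNMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final int N = 10;                    // assumed N
    private record Scored(long score, String row) {}
    // Min-heap ordered by score: the root is the weakest kept record.
    private final PriorityQueue<Scored> heap =
        new PriorityQueue<>(Comparator.comparingLong(Scored::score));

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
        String[] f = value.toString().split("\\|");
        long score = Long.parseLong(f[1]);              // assumed layout
        heap.add(new Scored(score, value.toString()));
        if (heap.size() > N) heap.poll();               // drop the weakest
    }

    @Override
    protected void cleanup(Context ctx)
            throws IOException, InterruptedException {
        // Emit only this task's local top N (unordered; the reducer sorts).
        for (Scored s : heap) {
            ctx.write(new LongWritable(s.score()), new Text(s.row()));
        }
    }
}
```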
26. Introduction
- Why Hadoop for Teradata EDW
  - More disk space, and space can be added easily
  - HDFS as a storage layer
  - MapReduce
- Assigning distributed HDFS blocks to Teradata EDW nodes is an assignment problem
  - Parameters: n blocks, k copies, m nodes
  - Goal: assign HDFS blocks to nodes evenly while minimizing network traffic
27. Block Assignment Problem
- HDFS file F on a cluster of P nodes (each node is uniquely identified by an integer i, 1 ≤ i ≤ P)
- The problem is defined by assignment(X, Y, n, m, k, r)
  - X is the set of n blocks of F: X = {1, ..., n}
  - Y is the set of m nodes running the PDBMS (the PDBMS nodes): Y ⊆ {1, ..., P}
  - k is the number of copies (replicas) of each block
  - r is the mapping recording the replica locations: r(i) returns the set of nodes holding a copy of block i
- An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means block i is assigned to node j
28. Block Assignment Problem (Cont.)
- The problem is defined by assignment(X, Y, n, m, k, r)
- An even assignment g is an assignment such that ∀i ∈ Y, ∀j ∈ Y: | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1, i.e., any two nodes receive block counts that differ by at most one
- The cost of an assignment g is cost(g) = |{i | 1 ≤ i ≤ n ∧ g(i) ∉ r(i)}|, the number of blocks assigned to remote nodes
- We write |g| for the number of blocks assigned to local nodes by g, so |g| = n − cost(g)
- The optimal assignment problem: find an even assignment with the smallest cost (see the greedy sketch after this list)
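To make the definitions concrete, a hedged greedy sketch: place each block on a local node from r(i) while capacity allows, then top up the least-loaded nodes, which keeps the assignment even. This is only a heuristic to illustrate the problem, not the paper's optimal algorithm:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Greedy sketch of assignment(X, Y, n, m, k, r).
public class GreedyBlockAssignment {
    static Map<Integer, Integer> assign(int n, List<Integer> nodes,
                                        Map<Integer, Set<Integer>> r) {
        int m = nodes.size();
        int cap = n / m;                           // floor(n/m); pass 2 tops up to ceil
        Map<Integer, Integer> load = new HashMap<>();
        Map<Integer, Integer> g = new HashMap<>(); // block -> node
        // Pass 1: local placement (g(i) ∈ r(i)) where capacity allows.
        for (int block = 1; block <= n; block++) {
            for (int node : r.getOrDefault(block, Set.of())) {
                if (nodes.contains(node) && load.getOrDefault(node, 0) < cap) {
                    g.put(block, node);
                    load.merge(node, 1, Integer::sum);
                    break;
                }
            }
        }
        // Pass 2: remaining blocks go to the least-loaded node (remote,
        // cost 1 each); always filling a current minimum keeps all final
        // loads within one of each other, i.e. the assignment stays even.
        for (int block = 1; block <= n; block++) {
            if (g.containsKey(block)) continue;
            int best = nodes.get(0);
            for (int node : nodes) {
                if (load.getOrDefault(node, 0) < load.getOrDefault(best, 0)) {
                    best = node;
                }
            }
            g.put(block, best);
            load.merge(best, 1, Integer::sum);
        }
        return g;
    }
}
```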