Íñigo Goiri of Microsoft presents the state of running Hadoop on Azure, Microsoft's cloud computing platform. He discusses some of Azure's advanced features for running offline workloads cheaply, and the modifications made to Hadoop to take advantage of them.
This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
2. Commercial options
• Azure offers Hadoop and Spark:
• HDInsight
• Azure Databricks
• Our target:
• “Raw” VMs
• Pure Hadoop OSS
• Fast creation and scaling
3. Building OSS Hadoop on Azure
• Azure DevOps for building
• Periodic sync to trunk
• Build on VM with OSS Docker image
• Output ‘tgz’ to Azure Blob Storage
4. Deploying a cluster
• Azure Resource Manager (ARM) template
• JSON file describing resources
• Main resources:
• Virtual Machine Scale Sets (VMSS)
• Virtual Network
• Network Security Group
• Load Balancer
• Public IP
• Internal DNS
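As a rough illustration of what "a JSON file describing resources" can look like, the sketch below builds a bare ARM template skeleton in Python and serializes it. The resource type strings are the standard Azure ones; the resource names, the `build_template` helper, and the API version placeholder are assumptions made for this example, and all `properties` blocks are omitted, so the output is not deployable as-is.

```python
import json

# Placeholder only: a real template pins a concrete API version per resource type.
API_VERSION_PLACEHOLDER = "<api-version>"


def resource(rtype, name, depends_on=()):
    """Return a minimal resource entry (properties intentionally omitted)."""
    return {
        "type": rtype,
        "apiVersion": API_VERSION_PLACEHOLDER,
        "name": name,
        "dependsOn": list(depends_on),
    }


def build_template(cluster):
    """Build a skeleton template listing the main resources from the slide.

    All names derived from `cluster` are hypothetical.
    """
    vnet = f"{cluster}-vnet"
    nsg = f"{cluster}-nsg"
    lb = f"{cluster}-lb"
    pip = f"{cluster}-ip"
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            resource("Microsoft.Network/networkSecurityGroups", nsg),
            resource("Microsoft.Network/virtualNetworks", vnet, [nsg]),
            resource("Microsoft.Network/publicIPAddresses", pip),
            resource("Microsoft.Network/loadBalancers", lb, [pip]),
            resource("Microsoft.Network/privateDnsZones", f"{cluster}.internal"),
            resource("Microsoft.Compute/virtualMachineScaleSets",
                     f"{cluster}-workers", [vnet, lb]),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_template("hadooptest"), indent=2))
```

Generating the template from code rather than editing JSON by hand makes it easier to stamp out one scale set per VM role.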
5. VM creation and startup
• Cloud-init script
• YAML syntax similar to Docker
• Kubernetes (AKS) does not add much
• Download code and install
• Hadoop, Docker, ZooKeeper, scripts,…
• Set up environment variables
• Discover other services (e.g., ZooKeeper)
• Start services
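The startup steps above can be sketched as a generated cloud-init file. This is a minimal sketch, not the speaker's script: `write_files` and `runcmd` are standard cloud-config keys, but the tarball URL, the `/opt/scripts/start-role.sh` script, and the `CLUSTER_ROLE` and `ZK_QUORUM` variables are hypothetical. It relies on the fact that JSON is accepted by YAML parsers, which avoids a YAML dependency.

```python
import json


def build_cloud_config(role, tarball_url, zk_hosts):
    """Render a cloud-config document for one VM role (illustrative only)."""
    quorum = ",".join(zk_hosts)
    env_file = (
        "export HADOOP_HOME=/opt/hadoop\n"
        f"export CLUSTER_ROLE={role}\n"   # hypothetical variable
        f"export ZK_QUORUM={quorum}\n"    # hypothetical variable
    )
    config = {
        # Set up environment variables and record the discovered services.
        "write_files": [
            {"path": "/etc/profile.d/hadoop-env.sh", "content": env_file},
        ],
        "runcmd": [
            # Download code and install.
            ["mkdir", "-p", "/opt/hadoop"],
            ["sh", "-c",
             f"curl -fsSL '{tarball_url}' | tar -xz -C /opt/hadoop --strip-components=1"],
            # Start services (hypothetical per-role start script).
            ["/opt/scripts/start-role.sh", role],
        ],
    }
    # cloud-init identifies the format by the first line.
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    print(build_cloud_config("worker",
                             "https://example.invalid/hadoop.tgz",
                             ["zk0", "zk1", "zk2"]))
```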
6. VM roles (VMSS)
• 3 x NameNodes
• ZooKeeper
• Journal Nodes
• Routers (RBF)
• 2 x Resource Managers
• N x Workers
• DataNode
• Node Manager
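One way to read the role list is as a mapping from each scale set to the services it runs. The sketch below assumes that ZooKeeper, the Journal Nodes, and the Routers are co-located on the three NameNode VMs, and that each worker runs a DataNode and a Node Manager; that grouping is inferred from the slide order, and the role keys and `plan_cluster` helper are hypothetical.

```python
# Role -> instance count and services; None means "scaled by the caller".
ROLES = {
    "namenode": {
        "count": 3,
        "services": ["NameNode", "ZooKeeper", "JournalNode", "Router"],
    },
    "resourcemanager": {"count": 2, "services": ["ResourceManager"]},
    "worker": {"count": None, "services": ["DataNode", "NodeManager"]},
}


def plan_cluster(num_workers):
    """Return (vm_name, services) pairs for a cluster with num_workers workers."""
    plan = []
    for role, spec in ROLES.items():
        count = num_workers if spec["count"] is None else spec["count"]
        for i in range(count):
            plan.append((f"{role}{i}", list(spec["services"])))
    return plan


if __name__ == "__main__":
    for name, services in plan_cluster(4):
        print(name, services)
```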
7. Network
• Virtual Network for all VMs
• Load Balancer
• isActive servlet (HADOOP-15707)
• Public IPs
• External DNS
• Firewall
• Internal DNS
• Locate components (e.g., nn0, zk2, and rm1)
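With predictable internal DNS names such as nn0, zk2, and rm1, configuration values can be derived rather than hard-coded. The sketch below builds a ZooKeeper connection string of the kind used for Hadoop's `ha.zookeeper.quorum` setting. The helper names are hypothetical, the prefix-plus-index naming follows the slide's examples, and 2181 is ZooKeeper's default client port.

```python
def hostnames(prefix, count, domain=""):
    """Return names like zk0, zk1, ... optionally qualified with a domain."""
    suffix = f".{domain}" if domain else ""
    return [f"{prefix}{i}{suffix}" for i in range(count)]


def zookeeper_quorum(count, domain="", port=2181):
    """Build a comma-separated host:port list for a ZooKeeper ensemble."""
    return ",".join(f"{host}:{port}" for host in hostnames("zk", count, domain))


if __name__ == "__main__":
    print(zookeeper_quorum(3))
    print(hostnames("nn", 3, "hadooptest.com"))
```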
8. Worker nodes
• Node Manager (YARN)
• Docker for long-running services
• DataNode (HDFS)
• Use VM local disks
• Leverage PROVIDED storage
• Mount external storage (S3, ADLS, HDFS,…)
• Local HDFS as caching
10. Low priority VMs
• ~80% price discount
• Can be evicted at any time
• Larger VMs more likely to be evicted
• 30-second eviction notice
• Possible to decommission (NM and DN)
• Ideal for worker nodes
• Mix of low-priority and reserved VMs
[Diagram: cluster of Managers and worker VMs, each worker labeled Low Priority or Reserved]
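To make the pricing trade-off concrete, the arithmetic for a mixed pool can be sketched as follows. The 80% figure is the approximate discount quoted on the slide; the function names and the unit price are assumptions for illustration, and the model ignores the cost of work lost to evictions.

```python
def hourly_cost(reserved, low_priority, price_per_vm, discount=0.80):
    """Hourly cost of a pool mixing reserved and low-priority VMs of one size."""
    low_price = price_per_vm * (1.0 - discount)
    return reserved * price_per_vm + low_priority * low_price


def savings_fraction(reserved, low_priority, discount=0.80):
    """Fraction saved versus running every VM at the reserved price."""
    total = reserved + low_priority
    if total == 0:
        return 0.0
    return low_priority * discount / total


if __name__ == "__main__":
    # Example: half of an 8-VM worker pool on low-priority VMs.
    print(hourly_cost(4, 4, price_per_vm=1.0))
    print(savings_fraction(4, 4))
```

With half the workers on low-priority VMs, the pool costs roughly 40% less than an all-reserved pool of the same size.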
11. Proposed changes to OSS Hadoop
• Hadoop Registry to find managers
• Improve PROVIDED storage (HDFS)
• Improve Dynamic Resource for NMs (YARN)
12. Hadoop Registry to find Managers
• Currently:
• Script to set DNS names (e.g., nn2.hadooptest.com, rm0.hadooptest.com)
• Configuration file with hard-coded values
• Possible to use DNS resolution (HDFS-14118)
• YARN Registry to find YARN services
• Moved to Hadoop Registry
• New approach:
• Managers (e.g., NN or RM) register when starting
• Workers (e.g., DN or NM) use registry to find managers
• Dynamic subclusters (RBF)
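The register-then-lookup pattern in the new approach can be sketched with a toy in-memory registry. This is not the Hadoop Registry API (which is backed by ZooKeeper); the `ToyRegistry` class, its methods, and the addresses are hypothetical and only illustrate the flow of managers registering at startup and workers resolving them later.

```python
class ToyRegistry:
    """In-memory stand-in for a service registry (illustration only)."""

    def __init__(self):
        self._entries = {}  # service name -> {instance id: address}

    def register(self, service, instance, address):
        """Called by a manager (e.g., NN or RM) when it starts."""
        self._entries.setdefault(service, {})[instance] = address

    def unregister(self, service, instance):
        """Called when a manager stops; unknown entries are ignored."""
        self._entries.get(service, {}).pop(instance, None)

    def lookup(self, service):
        """Called by a worker (e.g., DN or NM) to find its managers."""
        return sorted(self._entries.get(service, {}).values())


if __name__ == "__main__":
    registry = ToyRegistry()
    registry.register("namenode", "nn0", "nn0.hadooptest.com:8020")
    registry.register("namenode", "nn1", "nn1.hadooptest.com:8020")
    print(registry.lookup("namenode"))
```

Because workers resolve managers at run time, adding or replacing a manager does not require regenerating a configuration file with hard-coded values.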
13. Improve PROVIDED storage
• Currently:
• Generate FS image at start time
• Propagate alias map to DNs
• New approach:
• Dynamic mount points
• HA support
• Lazy loading of replica metadata on DNs
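The difference between propagating the full alias map up front and loading replica metadata lazily can be sketched as a fetch-on-first-use cache. This is a generic illustration, not HDFS code: the `LazyReplicaMap` class, the `fetch` callback, and the shape of the returned metadata are all hypothetical.

```python
class LazyReplicaMap:
    """Fetch block metadata on first access instead of loading it all at startup."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: block_id -> metadata (hypothetical)
        self._cache = {}

    def get(self, block_id):
        """Return metadata for a block, fetching it only the first time."""
        if block_id not in self._cache:
            self._cache[block_id] = self._fetch(block_id)
        return self._cache[block_id]

    def loaded(self):
        """Number of blocks whose metadata has been fetched so far."""
        return len(self._cache)


if __name__ == "__main__":
    def fake_fetch(block_id):
        # Stand-in for a lookup against the external store's alias map.
        return {"path": f"/remote/file-{block_id}", "offset": 0, "length": 128}

    replicas = LazyReplicaMap(fake_fetch)
    print(replicas.loaded())
    print(replicas.get(42)["path"])
    print(replicas.loaded())
```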
14. Improve Dynamic Resource Config for NMs
• VMs can change size (CPU)
• Harvesting [OSDI’16]
• Leverage Resource Options (YARN-291, YARN-996)
• Container preemption
• Container priorities (OPPORTUNISTIC)
• Extend current interfaces
• Integrate with Resource Monitor
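When a VM shrinks, the Node Manager may be running more than its new capacity allows. The sketch below shows one plausible way to choose which containers to preempt, taking OPPORTUNISTIC containers before guaranteed ones. It is a simplified illustration under that assumption, not YARN's actual policy; the `Container` type and `containers_to_preempt` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Container:
    cid: str
    vcores: int
    opportunistic: bool


def containers_to_preempt(containers, new_vcores):
    """Pick containers to preempt so total usage fits within new_vcores.

    Opportunistic containers are chosen first; within each class the
    input order is preserved (sorted() is stable).
    """
    excess = sum(c.vcores for c in containers) - new_vcores
    victims = []
    for c in sorted(containers, key=lambda c: not c.opportunistic):
        if excess <= 0:
            break
        victims.append(c)
        excess -= c.vcores
    return victims


if __name__ == "__main__":
    running = [
        Container("g1", 4, opportunistic=False),
        Container("o1", 2, opportunistic=True),
        Container("o2", 2, opportunistic=True),
    ]
    # The VM shrinks from 8 to 5 vcores.
    print([c.cid for c in containers_to_preempt(running, new_vcores=5)])
```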
15. Future work
• Improve Security
• Currently network rules
• Integration with Azure Active Directory
• Delegation token propagation
• Changes to OSS
• Hadoop Registry
• PROVIDED storage
• NM Dynamic Resource
• Open source scripts?