Talk given by Ravi Kishore Valeti, Software Engineering LMTS at Salesforce, at GIDS in April 2016
Most enterprises are considering running Big Data as a Service (BDaaS), and some already do, performing analytics over their Big Data to inform key business decisions. This talk covers what it takes to operationalize BDaaS and the challenges of successfully running large-scale Big Data clusters.
Operationalizing Big Data as a Service
1. Ravi Kishore Valeti
Lead Member of Technical Staff
rvaleti@salesforce.com
2. Forward-Looking Statements
Statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any
of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking
statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or
service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for
future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer
contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our
service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth,
interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible
mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our
employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com
products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of
salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most
recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information
section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not
be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available.
Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
3. Agenda
● What is a Service?
● Big Data as a Service (BDaaS)
● Operational Challenges
● Operational Excellence
7. Security using Kerberos
● Third party authentication service
● Provides authentication; authorization is enforced separately by each service (for example, HDFS permissions & ACLs)
● Authenticates User to Application and Application to Application
● Each Cluster should be configured with multiple KDC servers in Master/Slave Mode for HA
BDaaS = SECURITY + Multi-tenancy + HA + DR + MONITORING at Scale
9. HA - High Availability
● HA for all Services in the stack
  NameNode, ResourceManager (RM), JobHistory Server (JHS)
  HBase
  Hive
  Hue
  Spark Master
● Fault Tolerance & Mean Time to Recover
● Multi-Rack architecture & Services are Rack aware
● Continuous synthetic ("synth") tests
● Rolling restarts whenever possible
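Mean time to recover only helps if it is measured the same way for every incident. The sketch below is one minimal way to compute it from a list of (failure detected, service restored) timestamps. The `Incident` type, the function name, and the sample timestamps are illustrative assumptions, not taken from the talk.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical record shape: (failure detected, service restored).
Incident = Tuple[datetime, datetime]


def mean_time_to_recover(incidents: List[Incident]) -> timedelta:
    """Average outage duration across the given incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (restored - failed for failed, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up sample data: one 10-minute and one 20-minute outage.
incidents = [
    (datetime(2016, 4, 1, 10, 0), datetime(2016, 4, 1, 10, 10)),
    (datetime(2016, 4, 9, 2, 30), datetime(2016, 4, 9, 2, 50)),
]
print(mean_time_to_recover(incidents))  # 0:15:00
```

Tracking this number per service (NameNode, HBase, Hive, and so on) rather than per cluster makes it easier to see which component dominates recovery time.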
10. DR - Disaster Recovery
● Namenode Metadata Backups
● Namenode Snapshots
● Hive Metadata Backups
● HBase Backups
  Configure Replication to a Buddy Cluster
  Daily/Continuous Backups using Snapshots/WAL
● Switch to DR site when ready
11. DR - Disaster Recovery
Site Switching Checklist
● All the last known check-pointed data is available in the DR site
● For HBase, make sure the HBase Replication queue is empty - everything is replicated to DR
● Make sure Data checksums (fsck) & synthetic tests pass
● Enable traffic to DR site
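The site switching checklist can be treated as a gate: traffic is enabled only when every precondition holds. The following is a minimal sketch of that idea. The `Check` class, the `ready_to_switch` function, and the lambda probes are hypothetical placeholders; real probes would query the DR cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the condition holds


def ready_to_switch(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check and report the names of the ones that failed.

    All checks are evaluated (no short-circuit) so the operator sees
    the full list of blockers in one pass.
    """
    failed = [c.name for c in checks if not c.probe()]
    return (not failed, failed)


# Hypothetical probes with hard-coded results, for illustration only.
checks = [
    Check("check-pointed data available in DR", lambda: True),
    Check("HBase replication queue empty", lambda: False),
    Check("fsck reports healthy", lambda: True),
    Check("synthetic tests pass", lambda: True),
]

ok, blockers = ready_to_switch(checks)
print(ok, blockers)  # False ['HBase replication queue empty']
```

Only when `ok` is true would the final step, enabling traffic to the DR site, be carried out.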
12. Monitoring & Alerting
● Centralized Visualization & Alerting
● Monitor User Quotas
● Monitor Resource Utilization - Memory/CPU
● Should be a mix of Logs & Metrics
● Should be extensible to on-board the monitoring needs of newly added services
● Ability to quickly incorporate new rules to alert on newly observed issues
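One way to keep alerting extensible is to make each rule a small registered predicate, so a newly observed issue becomes one new function. This is a sketch under assumptions: the metric names (`quota_used_bytes`, `quota_limit_bytes`, `memory_used_pct`) and the 90% and 85% limits are invented for illustration and do not come from the talk.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Rule = Callable[[Metrics], bool]

RULES: Dict[str, Rule] = {}


def rule(name: str) -> Callable[[Rule], Rule]:
    """Register a predicate that returns True when an alert should fire."""
    def register(fn: Rule) -> Rule:
        RULES[name] = fn
        return fn
    return register


@rule("user quota above 90%")
def quota_high(m: Metrics) -> bool:
    # A missing limit is treated as unlimited, so the rule never fires.
    limit = m.get("quota_limit_bytes", float("inf"))
    return m.get("quota_used_bytes", 0.0) > 0.9 * limit


@rule("memory utilization above 85%")
def memory_high(m: Metrics) -> bool:
    return m.get("memory_used_pct", 0.0) > 85.0


def evaluate(m: Metrics) -> List[str]:
    """Return the names of all rules that fire for one metrics sample."""
    return [name for name, fn in RULES.items() if fn(m)]


sample = {
    "quota_used_bytes": 95.0,
    "quota_limit_bytes": 100.0,
    "memory_used_pct": 40.0,
}
print(evaluate(sample))  # ['user quota above 90%']
```

Adding a rule for a newly on-boarded service then needs no change to `evaluate`, only another decorated function.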
13. Monitoring & Alerting
Monitoring success metrics include, but are not limited to:
● Resource Utilization by jobs & trends
● Job Waiting times, run times & amount of data processed
● Unique users per day (or week or month)
● Daily queries (HBase)
● Daily read bytes
● Daily written bytes, etc.
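Several of these success metrics (unique users per day, daily read bytes, daily written bytes) reduce to a per-day aggregation over job records. A minimal sketch follows, assuming a hypothetical `JobRecord` shape; the field names and sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class JobRecord(NamedTuple):
    day: str            # e.g. "2016-04-01"
    user: str
    read_bytes: int
    written_bytes: int


def daily_summary(records: Iterable[JobRecord]) -> Dict[str, Dict[str, int]]:
    """Aggregate unique users and read/written bytes per day."""
    users = defaultdict(set)
    totals = defaultdict(lambda: {"read_bytes": 0, "written_bytes": 0})
    for r in records:
        users[r.day].add(r.user)
        totals[r.day]["read_bytes"] += r.read_bytes
        totals[r.day]["written_bytes"] += r.written_bytes
    return {
        day: {"unique_users": len(users[day]), **totals[day]}
        for day in sorted(totals)
    }


# Made-up sample records.
records = [
    JobRecord("2016-04-01", "alice", 100, 10),
    JobRecord("2016-04-01", "bob", 200, 20),
    JobRecord("2016-04-01", "alice", 50, 5),
    JobRecord("2016-04-02", "bob", 70, 7),
]
summary = daily_summary(records)
print(summary["2016-04-01"])
```

The same grouping key can be swapped for a week or month string to produce the weekly or monthly variants.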
15. Operational Challenges
● Zero downtime
● Mean time to recover from failures
● Optimum utilization of resources
● Capacity Planning
● On-Demand capacity adds/removals
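For capacity planning, a simple starting point is a linear runway estimate: how many days remain before usage crosses a chosen headroom limit. The sketch below assumes linear growth and an 80% headroom target; both are illustrative assumptions rather than recommendations from the talk, and the function name is hypothetical.

```python
import math


def days_until_limit(used_tb: float, capacity_tb: float,
                     daily_growth_tb: float, headroom: float = 0.8) -> float:
    """Days until usage reaches headroom * capacity, assuming linear growth."""
    if daily_growth_tb <= 0:
        return math.inf  # usage is flat or shrinking: no projected limit
    remaining = headroom * capacity_tb - used_tb
    if remaining <= 0:
        return 0.0  # already at or past the limit
    return remaining / daily_growth_tb


# Made-up numbers: 600 TB used of 1000 TB, growing 5 TB per day.
print(days_until_limit(600, 1000, 5))
```

If the resulting runway is shorter than the lead time for adding hardware, that is a signal to start an on-demand capacity add.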
17. Operational Excellence - Shipping bits
● Maintain “Light” forks for the key services that you run
● Choose an appropriate packaging model - Bigtop
● Make sure your production services are as close as possible to stable versions in open source
18. Operational Excellence - Shipping bits
● Continuous Integration & Deployment pipeline!
● Almost Zero Downtime* - Rack by Rack Rolling Upgrades
  Block placement policy - placing all replicas on different racks** can mitigate the risk of service disruptions during Rack by Rack Rolling Upgrades
● Auto-Restart bots#
* Except some planned major upgrades where downtime cannot be avoided.
** Faster network links are usually preferred so that service SLAs are not breached due to this special block placement policy. Extensive performance testing might be required.
# Caution! May cause more damage than healing if not configured properly.
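The effect of the block placement policy on rack-by-rack upgrades can be checked with a small simulation: for each rack taken down, list the blocks that would fall below a minimum number of live replicas. This is a sketch only. The block IDs, rack names, and the `min_live` value of 2 are hypothetical, and the code models replica locations as plain lists rather than calling any HDFS API.

```python
from typing import Dict, List


def live_replicas(replica_racks: List[str], down_rack: str) -> int:
    """Count replicas that stay reachable while down_rack is offline."""
    return sum(1 for rack in replica_racks if rack != down_rack)


def at_risk_blocks(blocks: Dict[str, List[str]], racks: List[str],
                   min_live: int = 2) -> Dict[str, List[str]]:
    """For each rack, list blocks that drop below min_live if it goes down."""
    return {
        rack: [blk for blk, locs in blocks.items()
               if live_replicas(locs, rack) < min_live]
        for rack in racks
    }


# blk_a has two replicas on one rack; blk_b has every replica on its own rack.
blocks = {
    "blk_a": ["rack1", "rack2", "rack2"],
    "blk_b": ["rack1", "rack2", "rack3"],
}
print(at_risk_blocks(blocks, ["rack1", "rack2", "rack3"]))
# {'rack1': [], 'rack2': ['blk_a'], 'rack3': []}
```

With all replicas on different racks, taking any single rack offline still leaves two live replicas of each block, which is the mitigation the slide describes.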
19. Operational Excellence - Tuning
● Always keep an eye on user resource requirements vs. reality
  Update User Quotas/resource configurations based on actual usage
● Automated Daily reports on important events/metrics
● Dynamic Thresholds for Alerting & continuous tuning to keep the alerts meaningful & non-noisy
● Performance testing & configuration tuning of all services
● Choosing the right GC settings
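A dynamic alert threshold can be derived from recent history instead of a fixed number. The sketch below uses mean plus k standard deviations, which is one common choice and an assumption here; the talk does not say which method was used. The function name, the sample history, and k = 3 are illustrative.

```python
import statistics
from typing import Sequence


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean + k * population standard deviation of history."""
    if len(history) < 2:
        raise ValueError("need at least two samples")
    return statistics.mean(history) + k * statistics.pstdev(history)


def should_alert(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """True when value exceeds the threshold computed from history."""
    return value > dynamic_threshold(history, k)


# Made-up recent samples of some metric, e.g. job wait time in minutes.
history = [10, 12, 11, 9, 13, 10, 11]
print(round(dynamic_threshold(history), 2))
print(should_alert(14, history), should_alert(20, history))  # False True
```

Tuning `k` and the length of the history window is the "continuous tuning" part: a larger `k` gives fewer, quieter alerts, while a smaller one catches deviations earlier at the cost of noise.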
Forbes’ Most innovative company 4 years in a row (2011 to 2014)
On Fortune’s list of 100 best places to work for 7 years in a row (2015 - 8th)
Fortune 500 company
Key Takeaway: We are a publicly traded company. Please make your buying decisions only on the products commercially available from Salesforce.
Talk Track:
Before I begin, just a quick note that when considering future developments, whether by us or with any other solution provider, you should always base your purchasing decisions on what is currently available.
Link to the blog - https://www.cloudera.com/content/dam/cloudera/Resources/PDF/whitepaper/Multitenancy_and_the_Enterprise_Data_Hub.pdf
HDFS Quotas
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
Understanding HDFS quotas reported by fsck, du & count -q
http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/
HDFS permissions & ACLs
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html