4. Abstract
Cluster management & recovery for an enterprise-grade global search infrastructure is non-trivial.
• Serving Hundreds of Millions of Documents & Queries
• Multi-Tenant
• Geographically distributed Data Centers
• Custom SolrCloud Components - Analysis, Ranking and Faceting
• Dynamic Ranking Elements – Collection/Cluster Level
At BloomReach, we have built an innovative search architecture aimed at reliable Cluster Management and
Recovery.
• The Infrastructure is data center based.
• Discovery service: DC Metadata, roles, tenants.
• Real-time Active Monitoring: Robust failure detection.
• Recovery Service: One step Recovery, Rollback and Backup.
This presentation describes the infrastructure in detail and shows how it achieves availability and
performance while keeping platform management simple.
5. About us
BloomReach is a Cloud Marketing Platform. We have developed a personalized discovery platform that
features applications which analyze big data to make our customers’ digital content more discoverable,
relevant and profitable.
Nitin: I work on scaling the search platform for BloomReach’s big data. My relevant experience and
background include scaling real-time services for latency-sensitive applications and building performance
and search-quality metrics infrastructure for personalization platforms.
Li: I am a member of the technical staff on BloomReach's platform team. My background includes working on
virtualization management platforms, building search performance infrastructure, and scaling distributed
services.
6. BloomReach’s Applications
All three applications are built on a shared content-understanding foundation:
• Organic Search
  What it does: content optimization, management and measurement
  Benefit: enhanced discoverability and customer acquisition in organic search
• SNAP
  What it does: personalized onsite search and navigation across devices
  Benefit: relevant and consistent onsite experiences for new and known users
• Compass
  What it does: merchandising tool that understands products and identifies opportunities
  Benefit: prioritize and optimize online merchandising
7. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
9. SC2 – Solr Compute Cloud Infrastructure – Large Scale Pipelines
[Architecture diagram: the indexing and analysis pipelines provision elastic SolrCloud clusters on demand, coordinated by a backend Solr cluster and Zookeeper; indexes are replicated via HAFT to the serving Solr cluster, which handles the read/write traffic.]
10. SC2 HAFT – Open Source
github.com/bloomreach/solrcloud-haft
12. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
14. Real Time Serving – Dynamic Elements
Zookeeper holds the dynamic elements of the serving stack:
• Global entities – custom SolrCloud components & analyzers, global ranking configurations
• Collection-level entities – ranking files, ranking configs
The Serving SolrCloud sits behind a Load Balancer.
15. Real Time Serving – Scaling & Availability Challenges
Global entities in Zookeeper (custom SolrCloud components & analyzers, global ranking configurations):
• Query parsers, analyzers
• Ranking & scoring components
Challenges:
• Non-optimal performance (latency, memory usage)
• Memory & file-handle leaks
• Stale searchers left open
• Non-conforming configurations
• Size limits
• Solr startup issues
16. Real Time Serving – Scaling & Availability Challenges
Collection-level entities in Zookeeper:
• Ranking elements loaded once per core
• Collection reloads
• Non-optimal performance (latency, memory usage)
• Files not versioned to support rollback
• External files not sharded
Ranking files & ranking configs:
• Load when the core initializes; a misconfiguration crashes cores
• Hot swapping configs requires dynamic loading
17. Real Time Serving – Recovery Challenges
Bad jar deployed (global entity):
• The cluster goes down
• Restoring an older release takes time; restarting 1000s of collections is unstable and could take hours
• Serving is affected
Large ranking file (collection-level entity):
• The large ranking file is unsharded and increases the per-core memory requirement
• Auto-rollback to the previous version is non-trivial (if the new ranking file is produced by pipelines)
• Serving is affected; no longer highly available
18. Real Time Serving – Multi Tenancy Challenges
What is a tenant?
• A tenant is a unique <app, collection> pair in SolrCloud
• Each app has a unique collection type: <ZkConfig, Ranking, Query Patterns>
• Index, config, and cluster management strategies vary drastically between tenants
Recovery:
• Tenant 1: no dynamic config, static large index
• Tenant 2: external ranking files, aggressive index refresh & customer-generated data
• Tenant N: …
19. Real Time Serving – Multi DC Challenges
With a common Zookeeper ensemble shared across data centers (EU, East, West):
• Every data center hosts only part of the tenants, based on geo
• Adding a new geo-based DC requires sharing the ZK ensemble
• Selective collection placement is not possible
• HA and latency guarantees are non-trivial
20. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
22. Multi DC / Multi-Tenant Architecture
Terminology:
• Solr Data Center: a logical group of Solr nodes with metadata. The metadata contains placement, role, replication factor, apps, etc.
• Solr Cluster: a logical grouping of Solr Data Centers.
• Replication API: replicates an index from the elastic clusters onto all data centers.
• Ranking File Management API: uploads ranking files to the serving data centers.
23. Cluster Topology / Data Center Definition
• Where does the DC live?
• How many nodes?
• Name and type of DC
• Role of DC
  • Serve – behind the LB for API requests
  • Replicate – gets indexing updates
• LB endpoint
• Tenants/apps
• …
(A sketch of such a definition follows this list.)
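The exact schema of the datacenter definition is not shown in the deck, so the following is only a minimal sketch with illustrative field names mirroring the bullets above (location, size, name/type, role, LB endpoint, tenants):

```python
import json

# Hypothetical datacenter definition; all field names and values are illustrative only.
dc_definition = {
    "name": "serving-east-1",
    "type": "serving",
    "region": "us-east-1",          # where the DC lives
    "num_nodes": 12,                # how many Solr nodes
    "role": "serve",                # "serve" = behind LB for API requests, "replicate" = gets indexing updates
    "lb_endpoint": "https://search-east.example.com",
    "tenants": ["app1", "app2"],
    "replication_factor": 2,
}

print(json.dumps(dc_definition, indent=2))
```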
24. Automated Cluster Management Suite
[Architecture diagram: Cluster Ops drives a Cluster Management API that reads/writes the Cluster Metadata and fronts the Replication API, Ranking Mgmnt API, and Deployment/Recovery API; the APIs run in an HA-mode setup and operate on the Solr Serving 1, Solr Serving 2, and Solr Backup data centers behind the Load Balancer.]
25. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
26. Replication Management API
[Diagram: elastic indexers build the index; the Replication API asks the Cluster Management API / Cluster Metadata "operation: Replicate? App: app1" to find the target data centers, then replicates the index onto Solr Serving 1, Solr Serving 2 and Solr Backup behind the Load Balancer. A sketch of this flow follows.]
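This is not BloomReach's implementation, only a rough sketch of the lookup-and-replicate flow, assuming a hypothetical in-memory metadata list and Solr's standard replication handler (`/replication?command=fetchindex`):

```python
import requests

# Hypothetical cluster metadata; in the real system this comes from the
# Cluster Management API (operation: Replicate?, app: app1).
CLUSTER_METADATA = [
    {"name": "serving1", "role": "replicate", "solr_url": "http://serving1:8983/solr", "apps": ["app1"]},
    {"name": "backup",   "role": "replicate", "solr_url": "http://backup:8983/solr",   "apps": ["app1", "app2"]},
]

def replication_targets(app):
    """Return the data centers that should receive index updates for an app."""
    return [dc for dc in CLUSTER_METADATA if dc["role"] == "replicate" and app in dc["apps"]]

def replicate_index(app, core, source_url):
    """Ask each target DC to pull the freshly built index from the elastic indexer.
    fetchindex works per core; a sharded collection would repeat this per shard replica."""
    for dc in replication_targets(app):
        resp = requests.get(
            f"{dc['solr_url']}/{core}/replication",
            params={"command": "fetchindex", "masterUrl": f"{source_url}/{core}/replication"},
            timeout=30,
        )
        resp.raise_for_status()
        print(f"Triggered fetchindex on {dc['name']} for {core}")

if __name__ == "__main__":
    replicate_index("app1", "app1_collection", "http://elastic-indexer:8983/solr")
```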
27. Ranking File Management API
[Diagram: ranking pipelines produce ranking files; the Ranking Mgmnt API versions the files and stores them in S3, asks the Cluster Management API / Cluster Metadata "operation: Serve? App: app2" for the serving data centers, and pushes the files to Solr Serving 1, Solr Serving 2 and Solr Backup. A sketch of the S3 versioning step follows.]
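As a rough sketch of the versioning step only, assuming boto3 and a hypothetical bucket/prefix layout (the real paths and API are not shown in the deck):

```python
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "cluster"  # illustrative bucket name

def upload_ranking_files(app, files):
    """Store ranking files under a timestamped S3 prefix so that older versions
    remain available for rollback. Serving DCs are later pointed at the returned prefix."""
    version = time.strftime("%Y%m%d%H%M%S")
    prefix = f"production/{app}/ranking/{version}"
    for path in files:
        key = f"{prefix}/{path.rsplit('/', 1)[-1]}"
        s3.upload_file(path, BUCKET, key)   # one object per ranking file
    return prefix

# Usage (hypothetical files):
# upload_ranking_files("app2", ["/tmp/boosts.txt", "/tmp/synonyms.txt"])
```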
28. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
29. Deployment/Recovery Service (Launch New Datacenter)
• Adding a new DC to the cluster (geo-based)
• Adding temporary capacity for increased traffic
• Expanding cluster capacity permanently
30. Deployment/Recovery Service (Launch New Datacenter)
[Diagram: the Deployment/Recovery Service takes the datacenter definition data (JSON), stores the config in Cluster Metadata, installs Zookeeper and Solr on the new Serving3 (app1) hosts with a multi-threaded installation, replicates the index, runs the smoke test, and adds the new DC to the Load Balancer alongside Serving1, Serving2, Backup and the other DCs of the SolrCloud production cluster.]
Smoke test (a sketch of this check follows):
1) Every collection is queryable
2) Every collection has the config it is supposed to have
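A minimal sketch of such a smoke test, assuming a hypothetical Solr endpoint and the standard Collections API CLUSTERSTATUS call; the actual checks run by the Deployment/Recovery Service may differ:

```python
import requests

def smoke_test(solr_base, expected_configs):
    """1) every collection answers a query, 2) every collection uses the expected config set."""
    status = requests.get(
        f"{solr_base}/admin/collections",
        params={"action": "CLUSTERSTATUS", "wt": "json"},
        timeout=30,
    ).json()
    collections = status["cluster"]["collections"]

    failures = []
    for name, info in collections.items():
        # 1) collection is queryable
        r = requests.get(f"{solr_base}/{name}/select", params={"q": "*:*", "rows": 0}, timeout=30)
        if r.status_code != 200:
            failures.append(f"{name}: not queryable ({r.status_code})")
        # 2) collection points at the expected config set
        expected = expected_configs.get(name)
        if expected and info.get("configName") != expected:
            failures.append(f"{name}: config {info.get('configName')} != {expected}")
    return failures

# Usage (hypothetical endpoint and expectations):
# print(smoke_test("http://serving3:8983/solr", {"app1_collection": "app1_conf"}))
```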
31. Example of an additional DC
• Where does the DC live?
• How many nodes?
• Name and type of DC
• Role of DC
  • Serve – behind the LB for API requests
  • Replicate – gets indexing updates
• LB endpoint
• Tenants/apps
• …
32. Deployment/Recovery Service (Hard Recovery)
• One or more hosts in a datacenter are down
• There are network issues with one datacenter
• AWS decides to retire some instances in a datacenter
33. Deployment/Recovery Service (Hard Recovery)
[Diagram: Serving2 (app1) fails. The Deployment/Recovery Service retrieves Serving2's config from Cluster Metadata, provisions new hosts using that config (same as creating a new DC), runs the smoke test, updates the config of the new Serving2, and adds it back to the Load Balancer next to Serving1 in the SolrCloud production cluster.]
34. Deployment/Recovery Service (Soft Recovery)
• One or more data centers have high memory or CPU usage and don’t respond to requests
• Several collections are down in a DC due to Zookeeper state or other issues
• Code deployments: our customized components need a Solr restart to take effect
35. Snapshots
• A snapshot service takes a snapshot of a serving DC every 24 hours
• The snapshot contains global files (customized jars, Zookeeper configs) and per-tenant files (ranking files, synonyms, etc.). This is done through the HAFT API
• The index is never snapshotted
• The snapshot is timestamped and stored in S3
[Diagram: the Deployment/Recovery Service uses HAFT to take snapshots of Serving1/Serving2 (app1) and stores them in S3 with a timestamp.]
Base: s3://cluster/production/20151008155637
s3://cluster/production/20151008/jar
s3://cluster/production/20151008/zkconfig
s3://cluster/production/20151008/tenant1/ranking
…
(A sketch of this layout follows.)
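A minimal sketch of how the timestamped layout above could be produced, assuming boto3 and that the global and per-tenant files have already been gathered (via the HAFT API in the real system); paths and helpers are illustrative:

```python
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "cluster"  # as in s3://cluster/production/<timestamp>/...

def snapshot_dc(global_files, per_tenant_files):
    """Store a timestamped snapshot of global files (jars, ZK configs) and
    per-tenant files (ranking, synonyms). The index itself is never snapshotted."""
    ts = time.strftime("%Y%m%d%H%M%S")
    base = f"production/{ts}"
    for kind, path in global_files:               # e.g. ("jar", "/tmp/custom.jar"), ("zkconfig", "/tmp/solrconfig.xml")
        s3.upload_file(path, BUCKET, f"{base}/{kind}/{path.rsplit('/', 1)[-1]}")
    for tenant, kind, path in per_tenant_files:   # e.g. ("tenant1", "ranking", "/tmp/boosts.txt")
        s3.upload_file(path, BUCKET, f"{base}/{tenant}/{kind}/{path.rsplit('/', 1)[-1]}")
    return f"s3://{BUCKET}/{base}"
```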
36. Revert to Snapshots
When reverting a DC to a snapshot point, we use HAFT to replicate the index from the backup datacenter,
and restore all external files, global config files and per-tenant files from the snapshot's S3 location.
37. Code Deploy Mode – Soft Recovery
The Deployment/Recovery Service runs the following against a serving DC of the SolrCloud production cluster (a sketch follows this list):
1) Take a global lock on that DC
2) Get the DC's config, ZK and Solr hosts, etc. from Cluster Metadata
3) Take the DC out of the LB
4) Deploy code
5) Release configs
6) Run post-deployment tests
7) Release the lock
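A sketch of this sequence; every helper (lock service, LB client, deployer) is a hypothetical stand-in rather than the actual Deployment/Recovery API:

```python
def deploy_code_to_dc(dc_name, cluster_metadata, lock_service, lb, deployer):
    """Code-deploy (soft recovery) sequence for one data center; all collaborators are hypothetical."""
    lock = lock_service.acquire(dc_name)            # 1) global lock on the DC
    try:
        dc = cluster_metadata.get(dc_name)          # 2) DC's config, ZK and Solr hosts
        lb.remove(dc["lb_endpoint"])                # 3) take the DC out of the LB
        deployer.deploy_jars(dc["solr_hosts"])      # 4) deploy code (custom components need a Solr restart)
        deployer.release_configs(dc["zk_hosts"])    # 5) release configs
        if not deployer.post_deployment_tests(dc):  # 6) post-deployment tests
            raise RuntimeError(f"post-deployment tests failed for {dc_name}")
        lb.add(dc["lb_endpoint"])                   # put the DC back behind the LB
    finally:
        lock_service.release(lock)                  # 7) release the lock
```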
38. Disaster Recovery Mode – Soft Recovery
Against a serving DC (Serving1/Serving2, app1) of the SolrCloud production cluster (a sketch of this branch follows):
1) Get the global lock
2) Take the DC out of the LB
3) Rolling-restart Solr
4) Check host and collection health
5) Pass? Add the DC back to the LB
6) If not, delete all files of Zookeeper and Solr, wipe out everything
7) Install Zookeeper and Solr
8) Set up global files using current files or snapshot files
9) Use HAFT to replicate all collections, with current or snapshot per-tenant files and indexes, from backup
10) Smoke test
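A sketch of the try-restart-then-rebuild branch above; every helper is a hypothetical stand-in for the Deployment/Recovery Service and the HAFT API:

```python
def disaster_recover_dc(dc, lock_service, lb, ops, haft):
    """Try a rolling restart first; wipe and rebuild from backup/snapshot only if health checks still fail."""
    lock = lock_service.acquire(dc["name"])
    try:
        lb.remove(dc["lb_endpoint"])
        ops.rolling_restart_solr(dc["solr_hosts"])
        if ops.hosts_and_collections_healthy(dc):          # Pass?
            lb.add(dc["lb_endpoint"])
            return
        ops.wipe_zookeeper_and_solr(dc)                     # delete everything
        ops.install_zookeeper_and_solr(dc)
        ops.setup_global_files(dc, source="snapshot")       # current or snapshot files
        haft.replicate_all_collections(dc, source="backup") # per-tenant files and indexes from backup
        ops.smoke_test(dc)
        lb.add(dc["lb_endpoint"])
    finally:
        lock_service.release(lock)
```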
39. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
40. Active Monitor Service
Node-level monitor:
• Runs every five minutes
• Checks that each Zookeeper node is accessible through the HAFT API; if more than half of the Zookeeper nodes are down, page
• Checks that every Solr node is accessible and every collection on that node can be queried; if a node is unhealthy, page
• Checks all data centers
• We also use Sematext SPM to monitor JVM usage and CPU load on each Solr host
[Diagram: the Active Monitor Service and the Auto Recovery Service watch Serving1, Serving2 and the other DCs of the SolrCloud production cluster behind the Load Balancer.]
(A sketch of the node-level checks follows.)
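A minimal sketch of the node-level checks, assuming direct Zookeeper 'ruok' probes and plain Solr queries rather than the HAFT API used by the real service, with a hypothetical paging hook:

```python
import socket
import requests

def zk_node_up(host, port=2181):
    """Zookeeper liveness via the standard 'ruok' four-letter command."""
    try:
        with socket.create_connection((host, port), timeout=5) as s:
            s.sendall(b"ruok")
            return s.recv(16) == b"imok"
    except OSError:
        return False

def solr_node_healthy(node_url, collections):
    """Check that the Solr node answers and each collection on it is queryable."""
    for coll in collections:
        try:
            r = requests.get(f"{node_url}/{coll}/select", params={"q": "*:*", "rows": 0}, timeout=10)
            if r.status_code != 200:
                return False
        except requests.RequestException:
            return False
    return True

def run_checks(zk_nodes, solr_nodes, page):
    """One five-minute monitoring pass; `page` is a hypothetical alerting callback."""
    down_zk = [h for h in zk_nodes if not zk_node_up(h)]
    if len(down_zk) > len(zk_nodes) // 2:       # more than half of ZK down -> page
        page(f"Zookeeper quorum at risk: {down_zk}")
    for node_url, colls in solr_nodes.items():  # {"http://solr1:8983/solr": ["app1_collection"], ...}
        if not solr_node_healthy(node_url, colls):
            page(f"Unhealthy Solr node: {node_url}")
```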
41. Active Monitor Service
Cluster-level monitor:
• Runs every five minutes
• Checks that serving data centers with the same app have the same config files
• Checks that all datacenters have the same index
• Either check failing will page us
[Diagram: the Active Monitor Service compares the Zookeeper cluster state of Serving1 (app1) across data centers via HAFT; the Auto Recovery Service acts on differences.]
Example cluster state (identical copies are expected in every serving DC for the app):
"test-tenant":{
  "shards":{"shard1":{
    "range":"80000000-7fffffff",
    "state":"active",
    "replicas":{"core_node1":{
      "state":"active",
      "base_url":"http://10.99.99.99:8983/solr",
      "core":"test-tenant_shard1_replica1",
      "node_name":"10.99.99.99:8983_solr",
      "leader":"true"}}}},
  "maxShardsPerNode":"500",
  "router":{"name":"compositeId"},
  "replicationFactor":"1"},
(A sketch of the cross-DC comparison follows.)
42. Agenda
• History - Elastic Search Infrastructure
• Real Time Serving - Scaling & Availability Challenges.
• Highly Available Multi DC/Multi-Tenant Architecture.
• Cluster Management Suite
• Replication & Ranking Config Mgmnt Service
• Deployment & Recovery Service
• Active Monitor Service
• Auto Recovery Service
43. Auto Recovery Service (In Progress)
The Active Monitor Service detects, and the Auto Recovery Service reacts:
• One ZK node is down → restart ZK, check if ZK is accessible
• One Solr node is down → restart Solr
• Serving1 and Serving2 have different numbers of documents → replicate from the backup datacenter
• Serving1 and Serving2 have different versions of config files → use versioned config files and the HAFT API to recreate the files
• Serving1 JVM usage is high → soft recovery or rollback
• Serving1's machines are not accessible → hard recovery
Page us only when automation fails. (A sketch of this mapping follows.)
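A sketch of the detection-to-action mapping as a simple dispatch table; all action hooks are hypothetical stand-ins for the recovery services, and a page is sent only when the automated action fails:

```python
RECOVERY_ACTIONS = {
    "zk_node_down":         lambda ctx: ctx.restart_zk_and_verify(),
    "solr_node_down":       lambda ctx: ctx.restart_solr(),
    "doc_count_mismatch":   lambda ctx: ctx.replicate_from_backup_dc(),
    "config_version_drift": lambda ctx: ctx.recreate_files_from_versions(),  # versioned configs + HAFT API
    "high_jvm_usage":       lambda ctx: ctx.soft_recovery_or_rollback(),
    "hosts_unreachable":    lambda ctx: ctx.hard_recovery(),
}

def auto_recover(failure, ctx, page):
    """Run the automated action for a detected failure; page only if automation fails."""
    try:
        RECOVERY_ACTIONS[failure](ctx)
    except Exception as exc:
        page(f"Auto recovery for '{failure}' failed: {exc}")
```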