Configuring Elasticsearch for performance and scale
1. Configuring Elasticsearch for Performance and Scale
Based on knowledge gained from
attending an Elasticsearch webinar on
30th September 2014
Prepared By:
Bharvi Narayan Dixit
Software Engineer,
Orkash Services Pvt. Ltd.
2. Contents
The Elasticsearch Open Source Model
The Popularity of Elasticsearch
Insights across The Guardian
Ophan - The real time analytics tool
Datadog’s Elasticsearch Story
What Datadog’s event dashboards look like
Elasticsearch use @ Captora
Captora dashboard and its architecture
Webinar poll on the types of infrastructure used for Elasticsearch
4. The Popularity of Elasticsearch
10M downloads in 2 years, and counting.
5. Insights across the Guardian
• A large portion of The Guardian’s business relies on
Elasticsearch to understand how their content is being
consumed.
• Before Ophan, The Guardian used a traditional analytics package
that had a four-hour lag and many restrictions.
• ~40M documents are processed per day, and 360M documents
can be queried easily.
• Real-time traffic analysis of each piece of content lets the
organization see audience engagement.
• The cluster is easy to scale (by adding more capacity) whenever
a new feature puts stress on Elasticsearch.
6. Ophan - The real-time analytics tool created by
The Guardian, based on Elasticsearch
7. Datadog’s Elasticsearch Story
• Elasticsearch is used as Datadog’s primary data store for
events/logs.
• Before Elasticsearch, Postgres was used.
• Event data is always structured, with the flexibility to
add or remove fields as needed.
• Hundreds of millions of full-text events across 12+ indices.
• ~10M documents/day, with the volume doubling every 4-5 months.
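As a rough illustration of what "doubling every 4-5 months" implies for capacity planning, the sketch below projects the daily document volume one year ahead. It assumes steady exponential growth, which the slide does not state; the function name `projected_daily_docs` and the 12-month horizon are illustrative choices.

```python
def projected_daily_docs(current_per_day, months_ahead, doubling_months):
    """Project daily volume, assuming steady exponential growth (an assumption)."""
    return current_per_day * 2 ** (months_ahead / doubling_months)


# ~10M documents/day, doubling every 4 to 5 months (figures from the slide).
fast = projected_daily_docs(10_000_000, 12, 4)  # doubling every 4 months
slow = projected_daily_docs(10_000_000, 12, 5)  # doubling every 5 months
print(f"After 12 months: {slow / 1e6:.0f}M to {fast / 1e6:.0f}M documents/day")
```

Under these assumptions the volume after a year lands somewhere between roughly 53M and 80M documents per day.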
8. First version of the Elasticsearch cluster at Datadog
• One node per AZ (availability zone) handling HTTP and data.
• One large index storing all events from all time.
• Writing to a pool of all nodes in the cluster.
• Worked well for 1-1.5 years.
9. Faster and more scalable cluster
• Split cluster into head and data nodes.
• Head nodes act as a load balancer, accepting the HTTP requests.
• Data nodes interact only with head nodes and other data nodes.
• Use rolling indices, each holding one month of event data.
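A monthly rolling index means each event is routed to an index chosen by its timestamp, and a query only needs to touch the indices that overlap its time range. The sketch below shows one way this could be done; the `events-YYYY.MM` naming scheme and both function names are hypothetical, since the slides do not say how Datadog names its indices.

```python
from datetime import datetime, timezone


def monthly_index_name(event_time, prefix="events"):
    """Return the monthly index an event belongs to (hypothetical naming scheme)."""
    return f"{prefix}-{event_time:%Y.%m}"


def indices_for_range(start, end, prefix="events"):
    """List the monthly indices that a query over [start, end] has to cover."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        # Advance one calendar month, rolling over into the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


event = datetime(2014, 9, 30, 12, 0, tzinfo=timezone.utc)
print(monthly_index_name(event))
print(indices_for_range(datetime(2014, 11, 15), datetime(2015, 1, 3)))
```

With this layout, old data can be dropped or archived one index at a time instead of deleting documents from a single large index.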
10. What Datadog’s engineers learned
• Spend some planning time on sizing before settling on a data format.
– With a bit of planning, they could have avoided migrating to a rolling index
later on.
– But you can’t plan for everything, so architect deployments with
migration in mind.
• Monitor your Elasticsearch cluster from the beginning.
• Tooling for backup and restore should be in place almost from
your first deployment.
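The advice to monitor from the beginning can start very small. The sketch below turns an already parsed response from Elasticsearch's `GET /_cluster/health` endpoint into a list of alert messages. The function name `health_alerts`, the alert thresholds, and the sample response are assumptions made for illustration; fetching the response over HTTP is left out.

```python
def health_alerts(health, expected_nodes):
    """Turn a parsed `GET /_cluster/health` response into alert messages."""
    alerts = []
    status = health.get("status", "unknown")
    if status != "green":
        alerts.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        alerts.append(f"only {nodes} of {expected_nodes} nodes are in the cluster")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        alerts.append(f"{unassigned} shards are unassigned")
    return alerts


# Made-up response for illustration; a real one contains more fields.
sample_health = {
    "cluster_name": "events",
    "status": "yellow",
    "number_of_nodes": 2,
    "unassigned_shards": 5,
}
for message in health_alerts(sample_health, expected_nodes=3):
    print(message)
```

A check like this can run on a schedule and feed whatever alerting channel is already in use.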
15. Elasticsearch use @ Captora
• Captora is the first marketing cloud solution to automatically
expand and optimize marketing campaigns to engage and
convert thousands of new buyers.
• Its Adaptive Marketing approach covers market discovery,
engagement, and conversion of new buyers by intelligently
and automatically scaling content-driven campaigns across
multiple channels (search, advertising, and social).
• Read more at http://www.captora.com/technology/
16. Elasticsearch use @ Captora
At Captora, Elasticsearch is primarily used for:
• Indexing all textual data (e.g. crawled multi-channel content streams, user
generated documents etc.)
• Powering the textual search, rankings, and relevance calculation of the content
recommendation engine.
• Powering the user portal search of the content stream.
Elasticsearch stats @captora
• Mostly semi-structured data (e.g. web pages, white papers, metadata of videos
from YouTube, LinkedIn updates, blogs, Tweets etc.)
• ~200M documents, ~300GB of data.
• Partitioned across ~1200 indices, 2300 shards, with a replication factor of 4.
• 6 EC2 nodes (c3.2xlarge, provisioned SSD), two AWS availability zones, ELB
balanced.
• Index rate: 10 to 500 requests/sec.
• Query rate: 100 to 2000 requests/sec.
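To put these figures in perspective, the sketch below works out simple averages from the numbers on the slide. It assumes the data is spread evenly and that the document, size, and shard counts all refer to the same set of shards; the slide does not say whether replicas are included, so the results are only a rough guide. The function name `average_sizes` is illustrative.

```python
def average_sizes(docs, data_gb, indices, shards, nodes):
    """Compute rough per-index, per-shard, and per-node averages (even spread assumed)."""
    return {
        "docs_per_index": docs / indices,
        "mb_per_shard": data_gb * 1024 / shards,
        "shards_per_node": shards / nodes,
    }


# Figures from the slide: ~200M documents, ~300GB, ~1200 indices, 2300 shards, 6 nodes.
stats = average_sizes(200_000_000, 300, 1200, 2300, 6)
for name, value in stats.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions, an average index holds about 167 thousand documents and an average shard is about 134 MB, with roughly 383 shards per node.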