You know, for search
querying 24 000 000 000 Records in 900ms

@jodok
source: www.searchmetrics.com
First Iteration
The anatomy of a tweet
                         http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
c1.xlarge
- bash
  - find
  - zcat
  - curl

5 x m2.2xlarge, EBS: ES as document store
- 5 instances
- weekly indexes
- 2 replicas
- EBS volume
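
The first iteration above (find, zcat, curl against five instances with weekly indexes) can be sketched roughly as follows. This is an illustration in Python rather than the bash the slide lists. The directory layout, the `tweets-` index prefix, and the `id` and `created_at` field names are assumptions, not taken from the talk. The sketch only builds the newline-delimited body that the Elasticsearch `_bulk` endpoint accepts and leaves out the HTTP POST (the curl step).

```python
import gzip
import json
from datetime import datetime, timezone
from pathlib import Path


def weekly_index(ts: int, prefix: str = "tweets-") -> str:
    """Name of the weekly index for a Unix timestamp (the prefix is hypothetical)."""
    year, week, _ = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
    return f"{prefix}{year}-w{week:02d}"


def read_docs(root: Path):
    """The 'find' and 'zcat' steps: yield one JSON document per line of every .gz file."""
    for path in sorted(root.rglob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def bulk_body(docs) -> str:
    """Build a _bulk body: one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        # Assumed fields: 'created_at' is a Unix timestamp, 'id' is the tweet id.
        action = {"index": {"_index": weekly_index(doc["created_at"]), "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

Calling `bulk_body(read_docs(Path("/data/crawls")))` (a made-up path) returns a string that could be sent to `_bulk` in a single request. Older Elasticsearch releases also expect a `_type` in the action line.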
• Map/Reduce to push to Elasticsearch
• via NFS to HDFS storage
• no dedicated nodes

[Diagram labels: HDFS, ES, MAPRED]
- Disk IO
- concatenated gzip files
- compression
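
The slide does not spell out what went wrong with the concatenated gzip files. As background only: several gzip streams appended to one another still form a valid gzip file, but a reader has to keep going past the end of the first member to see all the data. A minimal Python sketch of that member-by-member decoding, using made-up sample data:

```python
import gzip
import zlib


def gunzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in a concatenated stream."""
    while data:
        # wbits=31 selects gzip framing; the decompressor stops at the end of one member.
        decoder = zlib.decompressobj(31)
        payload = decoder.decompress(data)
        payload += decoder.flush()
        yield payload
        # Bytes after the member's trailer belong to the next member.
        data = decoder.unused_data


# Two independently compressed chunks appended to each other, as 'cat a.gz b.gz' would produce.
blob = gzip.compress(b"tweet 1\n") + gzip.compress(b"tweet 2\n")
members = list(gunzip_members(blob))
```

A tool that stops after the first member would see only `tweet 1` here, which is one way such files can silently lose records.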
Hadoop Storage - Index “Driver”

  Namenode         Jobtracker       Hive
  Datanode         Secondary NN     Datanode
  2 Tasktracker    Datanode         2 Tasktracker
  6x 500 TB HDFS   2 Tasktracker    6x 500 TB HDFS
                   6x 500 TB HDFS

  Datanode         Datanode         Datanode
  4 Tasktracker    4 Tasktracker    4 Tasktracker
  6x 500 TB HDFS   6x 500 TB HDFS   6x 500 TB HDFS

                   Tasktracker
                   Spot Instances
Adding S3 / External Tables to Hive

-- Temporary table listing the input files (size, path), one file per line
create external table $tmp_table_name
        (size bigint, path string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
          INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        location 's3n://...';
SET ...
-- Stream every (size, path) row through the importer script and load the result
from (
     select transform (size, path)
        using './current.tar.gz/bin/importer transform${max_lines}' as
        (crawl_ts int, screen_name string, ... num_tweets int)
        ROW FORMAT DELIMITED
             FIELDS TERMINATED BY '\001'
             COLLECTION ITEMS TERMINATED BY '\002'
             MAP KEYS TERMINATED BY '\003'
             LINES TERMINATED BY '\n'
     from $tmp_table_name
     ) f
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
     select
            crawl_ts, ... user_json, tweets,
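
To make the transform step above more concrete: Hive passes the selected columns to the script as tab-separated lines on standard input and parses whatever the script prints according to the ROW FORMAT clause that follows the column list. The sketch below is not the importer from the talk, whose code is not shown in the deck. It is a hypothetical stand-in in Python with an invented `load_records` callback and dummy data, and it only shows the shape of such a script.

```python
import io

FIELD_SEP = "\x01"  # matches FIELDS TERMINATED BY '\001' in the query


def transform(rows, load_records, out):
    """Read 'size<TAB>path' rows and write one line per record, fields joined by FIELD_SEP.

    load_records is a hypothetical callback that opens a path and yields tuples
    of column values; the real importer's logic is not part of the slides.
    """
    for row in rows:
        size, path = row.rstrip("\n").split("\t", 1)
        for record in load_records(path, int(size)):
            out.write(FIELD_SEP.join(str(value) for value in record) + "\n")


def fake_loader(path, size):
    # Stand-in data so the sketch runs without S3 access.
    yield (1322000000, "example_user", 42)


# In a real job 'rows' would be sys.stdin and 'out' would be sys.stdout.
out = io.StringIO()
transform(["1024\ts3n://example-bucket/crawl-0001.gz\n"], fake_loader, out)
```

Keeping the I/O handles as parameters makes the script easy to exercise locally before shipping it to the cluster.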
https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en
packages:
 - puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
 rsa_private: | ${SSH_RSA_PRIVATE_KEY}
 rsa_public: ${SSH_RSA_PUBLIC_KEY}
 dsa_private: | ${SSH_DSA_PRIVATE_KEY}
 dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
 - [ swap, null ]
 - [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
- source: "lovely-public"
  name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
  filename: lovely-public.repo
  enabled: 1
  gpgcheck: 1
  key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
  baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
 - [ hostname, "${HOST}" ]
 - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
 - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}    ${HOST_NAME}" >> /etc/hosts ]

 - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

 - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
${PUPPET_PRIVATE_KEY}
 - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
 - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
 - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
 - [ /etc/init.d/puppet, start ]
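
The `${HOST}`-style placeholders above suggest that the user-data file is rendered per instance before launch. The deck does not say which tool does the rendering. As one possible approach, Python's `string.Template` uses the same placeholder syntax. The host name and Puppet master below are invented values, and the template is a shortened excerpt.

```python
from string import Template

# A shortened excerpt of the user-data template; the full file is shown above.
USER_DATA = Template("""\
runcmd:
 - [ hostname, "${HOST}" ]
 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
""")


def render_user_data(host: str, puppet_master: str) -> str:
    """Substitute per-instance values; raises KeyError if a placeholder is left unfilled."""
    return USER_DATA.substitute(HOST=host, PUPPET_MASTER=puppet_master)


# Hypothetical values for illustration only.
rendered = render_user_data("es-node-01", "puppet.example.internal")
```

The full template also contains a shell `$(...)` command substitution, which `Template.substitute` rejects as an invalid placeholder. With this approach it would have to be written as `$$(...)`, or the rendering would have to use `safe_substitute`.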
- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits
EBS performance
http://blog.dt.org
• Shard allocation
• Avoid rebalancing (Discovery Timeout)
• Uncached Facets
  https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
  Rework of the TermInfosReader class to remove the
  Terms[], TermInfos[], and the index pointer long[] and create
  a more memory efficient data structure.
3 AP server / MC
c1.xlarge

6 ES Master Nodes
c1.xlarge

6 Node Hadoop Cluster
+ Spot Instances

40 ES nodes per zone
m1.large
8 EBS Volumes
Everything fine?
Cutting the cost
• Reduce the amount of data:
  use a Hadoop/MapReduce transform to
  eliminate spam, irrelevant languages, ...
• no more time-based indices
• Dedicated Hardware
• SSD Disks
• Share Hardware for ES and Hadoop
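
The first bullet above, dropping spam and irrelevant languages before indexing, amounts to a filter applied in the map phase. A minimal sketch of such a filter in Python follows. The `lang` and `is_spam` field names and the set of languages to keep are assumptions, since the talk names neither.

```python
from typing import Iterable, Iterator

# Hypothetical policy values; the talk does not list the languages that were kept.
KEEP_LANGUAGES = {"en", "de"}


def keep_tweet(tweet: dict) -> bool:
    """Return True for tweets worth indexing (field names are assumptions)."""
    if tweet.get("is_spam", False):
        return False
    return tweet.get("lang") in KEEP_LANGUAGES


def filter_tweets(tweets: Iterable[dict]) -> Iterator[dict]:
    """Map-side filter: drop unwanted records before they reach Elasticsearch."""
    return (tweet for tweet in tweets if keep_tweet(tweet))


sample = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "en", "is_spam": True},
    {"id": 3, "lang": "xx"},
]
kept = [tweet["id"] for tweet in filter_tweets(sample)]
```

Every record removed at this stage is one that never has to be stored, replicated, or searched.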
Jenkins for Workflows

[Diagram: distcp between S3 and HDFS, followed by a transform step]

https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop
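
As a rough illustration of the copy step that such a Jenkins job might run, the sketch below builds the command line for `hadoop distcp <source> <target>`. The bucket name, directory layout, and day value are hypothetical, and the copy direction (S3 into HDFS) is an assumption. The actual job definitions are not part of the slides.

```python
import shlex


def distcp_command(source: str, target: str) -> list:
    """Build the argv for 'hadoop distcp <source> <target>'."""
    return ["hadoop", "distcp", source, target]


def day_paths(day: str):
    """Hypothetical S3 and HDFS locations for one crawl day."""
    return (f"s3n://example-bucket/crawls/{day}/", f"hdfs:///crawls/{day}/")


argv = distcp_command(*day_paths("2011-11-17"))
# Printable form, e.g. for a Jenkins 'Execute shell' build step.
command_line = shlex.join(argv)
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute the copy on a machine with the Hadoop client installed. The transform step would then run as a separate build step.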
That's thirty minutes away.
I'll be there in ten.

@jodok

Editor's notes

  10. How do I work? Agile leader; I say what I do; I do what I say; hands-on; quality over speed; responsibility to the team; attract specialists; not trying to sell something, but DO IT. DELIVER.
  16. How I feel: I undertake things.
  23. How I feel: I undertake things.