You know, for search. Querying 24 Billion Documents in 900ms

You know, for search
querying 24 000 000 000 Records in 900ms

@jodok

The anatomy of a tweet
http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php

c1.xlarge 5 x m2.2xlarge

EBS
ES as document store

- bash
- find
- zcat
- curl

- 5 instances
- weekly indexes
- 2 replicas
- EBS volume
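
The find / zcat / curl pipeline and the weekly indexes listed above can be sketched in Python. This is only an illustration, not the script used in the talk: the index naming scheme (`tweets-<ISO year>-w<ISO week>`) and the `id` and `created_at` field names are assumptions. The output is the newline-delimited body that the Elasticsearch bulk endpoint accepts.

```python
import json
from datetime import datetime, timezone


def weekly_index(ts, prefix="tweets"):
    """Return a weekly index name for a Unix timestamp (hypothetical naming scheme)."""
    year, week, _ = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
    return f"{prefix}-{year}-w{week:02d}"


def bulk_body(docs):
    """Build a bulk request body: one action line followed by one source line per doc."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": weekly_index(doc["created_at"]), "_id": str(doc["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string could be sent with curl or any HTTP client to the `_bulk` endpoint. Older Elasticsearch releases also expect a `_type` entry in the action line.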

• Map/Reduce to push to Elasticsearch
• via NFS to HDFS
storage HDFS

ES

• no dedicated nodes

MAPRED

- Disk IO
- concatenated gzip files
- compression
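
Concatenated gzip files work because the gzip format allows several compressed members back to back, and a reader treats them as one stream. A small sketch of that property, making no assumption about the actual file layout used in the talk:

```python
import gzip


def concat_gzip(chunks):
    """Compress each chunk separately and concatenate the resulting gzip members."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def read_lines(blob):
    """Decompress a (possibly multi-member) gzip blob and return its lines."""
    return gzip.decompress(blob).splitlines()
```

This is what lets many small compressed files be merged into fewer large ones without recompressing them.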

Hadoop Storage - Index “Driver”

- Namenode, Datanode, 2 Tasktracker, 6x 500 TB HDFS
- Jobtracker, Secondary NN, Datanode, 2 Tasktracker, 6x 500 TB HDFS
- Hive, Datanode, 2 Tasktracker, 6x 500 TB HDFS
- Datanode, 4 Tasktracker, 6x 500 TB HDFS (three nodes)

Tasktracker
Spot Instances

Adding S3 / External Tables to Hive

create external table $tmp_table_name
(size bigint, path string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as
INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
location s3n://...;
SET ...
from (
select transform (size, path)
using './current.tar.gz/bin/importer transform${max_lines}' as
(crawl_ts int, screen_name string, ... num_tweets int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
from $tmp_table_name
) f
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
select
crawl_ts, ... user_json, tweets,
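
The `select transform ... using` clause streams each (size, path) row to an external script on stdin, tab-separated, and reads delimited rows back from its stdout. The real `importer` is not shown in the slides; the following is a hypothetical skeleton of such a script that only demonstrates the streaming contract, with the output separator matching the 0x01 field delimiter declared in the query.

```python
import sys

FIELD_SEP = "\x01"  # the ^A byte declared as the output field delimiter


def transform_line(line):
    """Parse one tab-separated (size, path) input row."""
    size, path = line.rstrip("\n").split("\t", 1)
    return int(size), path


def format_row(fields):
    """Join output columns with the ^A separator and end the row with a newline."""
    return FIELD_SEP.join(str(f) for f in fields) + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        size, path = transform_line(line)
        # A real importer would open `path` and emit one row per crawl record;
        # this placeholder just echoes the input columns.
        stdout.write(format_row([size, path]))
```

To use it as a script, call `main()` at the bottom of the file.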

https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en

packages:
- puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
  rsa_private: |
    ${SSH_RSA_PRIVATE_KEY}
  rsa_public: ${SSH_RSA_PUBLIC_KEY}
  dsa_private: |
    ${SSH_DSA_PRIVATE_KEY}
  dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
- [ swap, null ]
- [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
  - source: "lovely-public"
    name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
    filename: lovely-public.repo
    enabled: 1
    gpgcheck: 1
    key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
    baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
- [ hostname, "${HOST}" ]
- [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
- [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
- [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST} ${HOST_NAME}" >> /etc/hosts ]

- [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

- [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
- [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
- [ mkdir, -p, /var/lib/puppet/ssl/certs ]
${PUPPET_PRIVATE_KEY}
- [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
- [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
- [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

- [ sh, -c, echo " server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
- [ sh, -c, echo " certname = ${HOST}" >> /etc/puppet/puppet.conf ]
- [ /etc/init.d/puppet, start ]
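
The `${HOST}`, `${PUPPET_MASTER}` and similar placeholders in the cloud-init file above have to be filled in per instance before the user data is handed to EC2. How the talk's tooling did this is not shown; one possible sketch uses Python's `string.Template`, whose `${NAME}` syntax happens to match. The host name below is a made-up example value.

```python
from string import Template


def render_user_data(template_text, **values):
    """Fill ${NAME} placeholders; shell constructs such as $(...) are left untouched."""
    return Template(template_text).safe_substitute(**values)


snippet = (
    "runcmd:\n"
    ' - [ hostname, "${HOST}" ]\n'
    ' - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST}" >> /etc/hosts ]\n'
)
rendered = render_user_data(snippet, HOST="es-node-01")  # hypothetical host name
```

`safe_substitute` is used because the template contains `$(...)` shell substitutions that the stricter `substitute` would reject. The trade-off is that a missing value is silently left as `${NAME}`, so a final check for leftover placeholders is advisable.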

- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits

EBS performance

http://blog.dt.org

• Shard allocation
• Avoid rebalancing (Discovery Timeout)
• Uncached Facets
https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[]
and create a more memory efficient data structure.

3 AP server / MC
c1.xlarge

6 ES Master Nodes
c1.xlarge

6 Node Hadoop Cluster
+ Spot Instances

40 ES nodes per zone
m1.large
8 EBS Volumes

Cutting the cost
• Reduce the amount of data: use a Hadoop/MapRed transform to eliminate spam, irrelevant languages, ...
• No more time-based indexes
• Dedicated hardware
• SSD disks
• Share hardware for ES and Hadoop
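
The first point, dropping spam and irrelevant languages before indexing, amounts to a filter step in the transform. A minimal sketch, assuming JSON records with a `lang` field and a pluggable spam check; both the field name and the language whitelist are hypothetical, not taken from the talk:

```python
import json

KEEP_LANGS = {"en", "de"}  # hypothetical whitelist of relevant languages


def keep(record, is_spam=lambda r: False):
    """Return True if the record is in a wanted language and not flagged as spam."""
    return record.get("lang") in KEEP_LANGS and not is_spam(record)


def filter_lines(lines, is_spam=lambda r: False):
    """Yield only the JSON lines whose record passes keep()."""
    for line in lines:
        if keep(json.loads(line), is_spam):
            yield line
```

Every record removed here saves storage and memory on all replicas of the index it would have landed in.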

S3 → distcp → HDFS → transform

https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop

That's thirty
minutes away.
I'll be there in ten.

@jodok
