How Rackspace Uses MapReduce and Hadoop
to Query Terabytes of Log Data

    Case Study, by Schubert Zhang 2009-04-30
Rackspace
• Rackspace has more than 50K devices and 7 data
  centers.

• The mail system and logging servers are currently in 3 of
  the Rackspace data centers.

• The system stores over 800 million objects (an object = a
  user event such as receiving an email or logging into
  IMAP) within Solr and roughly 9.6 billion records within
  Hadoop, which equals 6.3 TB compressed.

• Several hundred gigabytes of email log data are
  generated each day (about 140 GB/day after cleanup).
Background on Mailtrust
• Email hosting company
• Founded in 1999, merged with Rackspace in 2007,
  previous name: Webmail.us
• 80K business customers, 700K mailboxes.
• 2 hosted mail products: Noteworthy, MS Exchange
• The Noteworthy System:
   – Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds,
     shared calendaring, Outlook sync, Blackberry sync.
   – ~600 servers, commodity hardware, designed to work around
     frequent failures.
• The MS Exchange System:
   – MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
   – ~100 servers, higher-end hardware, SAN & DAS storage.
Problems
•   Hundreds of gigabytes of new data each day streaming in from over 600 hyperactive
    servers.
•   Log processing system evolution:
     –   (1) Flat text files stored on each machine.
           •   Had to be manually searched by engineers logging into each individual machine.
     –   (2) Relational database solution that just couldn't compete. MySQL.
           •   Inserts quickly became the bottleneck.
           •   A lot of index churn.
           •   Data was then broken into Merge Tables based on time so index updates weren't a problem.
           •   Load and operational problems.
      –   (3) Hadoop-based solution that works well and has virtually unlimited scalability potential.
           •   Hadoop
           •   Lucene and Solr.
•   The familiar problem, now at scale: lots and lots of data streaming in.
     –   Where do you store all that data?
     –   How do you do anything useful with it?
     –   How do you retrieve the data you want from that sea of data?

•   Examine mail logs in order to troubleshoot problems for our customers.
•   The query/search should be fast and accurate.
Now the new system
• The advantage of their new system is that they can now
  look at their data in any way they want:
   – Nightly MapReduce jobs collect statistics about their mail system
     such as spam counts by domain, bytes transferred and number
     of logins (a sketch of one such job follows this slide).
   – When they wanted to find out which part of the world their
     customers logged in from, a quick MapReduce job was created
     and they had the answer within a few hours. Not really possible
     in your typical ETL system.
• "Now whenever we think of complex question about our
  customers’ usage patterns, we can pull the answer from
  our logs within hours via MapReduce. This is powerful
  stuff."
The Platform
•   Hadoop MapReduce
•   Hadoop Distributed File System (HDFS)
•   Lucene
•   Solr
•   Tomcat
The Architecture
• Raw logs get streamed from hundreds of mail servers to
  the Hadoop Distributed File System (”HDFS”) in real time.

• MapReduce jobs are scheduled to index the new
  data using Apache Lucene and Solr.

• Once the indexes have been built, they are compressed
  and stored away in HDFS.

• Each Hadoop datanode runs a Tomcat servlet container,
  which hosts a number of Solr instances that pull and
  merge the new indexes, and provide really fast search
  results to our support team (a query sketch follows).
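
Because search is served by ordinary Solr instances inside Tomcat, a support tool needs nothing more than an HTTP client. A minimal sketch; the host, port, handler path and "sender" field are all assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

// Query one of the Solr instances hosted in a datanode's Tomcat.
// Host, port, handler path and field name are hypothetical.
public class LogSearch {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("sender:alice@example.com", "UTF-8");
        URL url = new URL("http://datanode1:8080/solr/select?q=" + q + "&rows=20");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // raw XML response from Solr
            }
        }
    }
}
```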
The System Evolution
              Logging v1.0
• Logs were stored in flat text files on the local disk of
  each mail server and were kept for 14 days.

• Our support techs did not have login access to the
  servers, so in order to search the logs they would have
  to escalate a ticket to our engineers. The engineers
  would then have to ssh into each mail server and grep
  /var/log/maillog.

• Problems: Once we grew much past a dozen servers,
  this manual process of logging into each server became
  too time consuming for our engineers.
Logging v1.1
•   Sped up the search process by writing a script that would search
    multiple servers via one command run from a centralized server.

•   Still grep, just run remotely.

•   Problems: The support techs still had to escalate a ticket to the
    engineers in order to perform a search. As the number of customers
    and servers increased, this began to take too much of our
    engineers' scarce time. Also, storing and searching the logs on a
    live server was negatively affecting the performance of the servers.
    To make matters worse, the engineering team had grown and we
    started running into the problem where two engineers would perform
    a search at the same time, which really slowed things down.
Logging v2.0
•   A web-based tool where support techs could search the logs.
•   It allowed searching by the sender or recipient's email address, domain name or IP
    address.
•   All of these were indexed fields in a MySQL database on the centralized log server.

•   Each day's logs were stored in a separate table, so that we could clean up old data by
    simply dropping and recreating MySQL tables (see the sketch below).
•   Log data was only kept for 3 days in order to keep the MySQL database down to a
    reasonable size.
•   Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the
    data set was very large and these queries would be horribly slow.
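
A hedged JDBC sketch of this per-day table scheme; the table layout, names and credentials are all invented for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// v2.0 retention scheme: one MyISAM table per day, indexed on the
// supported search keys, expired by dropping whole tables.
public class DailyLogTables {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://loghost/maillogs", "logger", "secret");
             Statement st = db.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE IF NOT EXISTS log_20090430 (" +
                "  ts DATETIME, sender VARCHAR(255), recipient VARCHAR(255)," +
                "  domain VARCHAR(255), ip VARCHAR(45), line TEXT," +
                "  INDEX (sender), INDEX (recipient), INDEX (domain), INDEX (ip)" +
                ") ENGINE=MyISAM");
            // Expiring a whole day is a metadata operation, not a DELETE scan.
            st.executeUpdate("DROP TABLE IF EXISTS log_20090427");
        }
    }
}
```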

•   Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As
    the tables grew, indexing each entry as it was inserted became slow. Within the first
    hours of testing, the inserts began slowing and could not keep up with the rate at
    which data was received. Version 2.0 of the logging system was never used in
    production.
Logging v2.1
•   Fixed the MySQL INSERT bottleneck by queuing up the log entries
    in local text files on the centralized log server and periodically bulk
    loading them into the database. As syslog-ng received logs on its 6
    ports, the data would be streamed to 6 separate text files. Every 10
    minutes a script would rotate those text files and execute a MySQL
    LOAD to load the data into the database. This was orders of magnitude
    faster than inserting the log data one record at a time (see the sketch
    after this slide).

•   Problems: The LOADs would get progressively slower as the
    database grew because MySQL indexing performance decreases as
    the table you are inserting into gets larger. This version was fast
    enough to be released into production, but we knew the system
    would not scale too far without additional work.
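
A minimal sketch of that 10-minute bulk load via JDBC; the file path, table name and connection details are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Bulk-load one rotated syslog-ng output file into the current table.
// allowLoadLocalInfile must be enabled for LOAD DATA LOCAL via Connector/J.
public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://loghost/maillogs?allowLoadLocalInfile=true",
                "logger", "secret");
             Statement st = db.createStatement()) {
            // One LOAD per file updates the indexes in bulk, which is far
            // cheaper than maintaining them on every single-row INSERT.
            st.executeUpdate(
                "LOAD DATA LOCAL INFILE '/var/spool/logs/port1.rotated' " +
                "INTO TABLE log_20090430 " +
                "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
        }
    }
}
```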
Logging v2.2
•   Introduced MySQL MERGE tables in order to speed up loading the log data into the database.
•   Every 10 minutes our script would create a new database table and then load the text
    logs into the empty table.
•   After the data was loaded, the script would modify a set of MERGE tables that
    combined all of the 10-minute tables together (see the sketch below).
•   The web search tool was modified to allow searching within different time ranges.
    Corresponding MERGE tables existed for each of those time ranges, and were
    modified every 10 minutes as new tables were created.
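
The MERGE-table update itself is a single ALTER statement. A sketch with invented table names; the underlying 10-minute tables must all be MyISAM with identical schemas:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Repoint the "last hour" MERGE table at the six newest 10-minute tables.
public class MergeTableUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://loghost/maillogs", "logger", "secret");
             Statement st = db.createStatement()) {
            st.executeUpdate(
                "ALTER TABLE log_last_hour " +
                "UNION=(log_1210, log_1220, log_1230," +
                " log_1240, log_1250, log_1300)");
        }
    }
}
```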

•   Problems: The database LOAD operations would take 2-3 minutes to run, and the server
    was now always under heavy CPU and disk I/O load.
•   Searches were being performed more frequently and were becoming slow. We
    started to see some strange problems such as random errors while trying to create
    new tables or modify the Merge Tables. These errors progressively became more
    frequent, resulting in missing log data. The support team began to lose confidence in
    the system's accuracy.
•   The logging system had no redundancy.

•   We needed a new solution that would be fast, reliable and could scale indefinitely
    with our growth. We needed something truly scalable.
Logging v3+
• Avoid limiting our abilities to build new features down the
  road.
• For example, we wanted to build a tool that would allow
  our customers to search their logs directly.

• It scales its workload out horizontally by adding servers
  and distributing the data and MapReduce jobs amongst
  the servers.

• In about 3 months we built a fresh new log processing
  system using Hadoop, Lucene and Solr.

• Put the log search tool in the hands of our customers.
Stu Hood’s Detailed Comments
•   The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it
    reaches a size just below the block size, or until it times out, and then we close it and move it
    to where it will be processed (sketched below).
•   Our processing jobs run every 10 minutes or so, meaning that the logs become available for
    Customer Care after about 15 minutes. We’ve executed around 150K jobs on this cluster with 3 restarts.
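
A sketch of that collector loop under stated assumptions: the paths, the 64 MB block-size budget, the 10-minute timeout and the stubbed log source are all invented.

```java
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Append log lines to an HDFS file until it nears the block size or a
// timeout passes, then seal it and move it where the indexing jobs look.
public class HdfsLogWriter {
    private static final long MAX_BYTES = 60L << 20;        // stay under a 64 MB block
    private static final long MAX_AGE_MS = 10 * 60 * 1000;  // 10-minute timeout

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path open = new Path("/logs/incoming/current.log");
        FSDataOutputStream out = fs.create(open);
        long openedAt = System.currentTimeMillis();
        long written = 0;
        for (String line : linesFromSyslog()) {              // hypothetical source
            byte[] b = (line + "\n").getBytes("UTF-8");
            out.write(b);
            written += b.length;
            boolean full = written >= MAX_BYTES;
            boolean stale = System.currentTimeMillis() - openedAt > MAX_AGE_MS;
            if (full || stale) {
                out.close();                                 // seal the file
                fs.rename(open, new Path("/logs/ready/" + openedAt + ".log"));
                out = fs.create(open);                       // start the next one
                openedAt = System.currentTimeMillis();
                written = 0;
            }
        }
        out.close();
    }

    private static Iterable<String> linesFromSyslog() {      // stub for the sketch
        return Collections.emptyList();
    }
}
```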

•   We create the indexes on local disk in our reducer, and compress them into HDFS after they are
    complete.
•   When we pull the index to make it available for search, we decompress it to local disk and merge
    it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance
    (a sketch follows below). The Nutch project created an IndexReader that can do read-only access
    on HDFS, but for speed reasons, we decided not to take that approach.
•   Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the
    reducer.
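
A sketch of the pull-and-merge step. It uses current Lucene signatures (the 2009 system used the 2.x-era equivalents), the paths and Solr URL are invented, and the compressed index is assumed to have already been fetched from HDFS and unpacked to local disk.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge a freshly built index into the live one, then tell the co-located
// Solr instance to commit so its searcher picks up the new documents.
public class IndexMerger {
    public static void main(String[] args) throws Exception {
        try (Directory live = FSDirectory.open(Paths.get("/solr/data/index"));
             Directory fresh = FSDirectory.open(Paths.get("/tmp/new-index"));
             IndexWriter writer = new IndexWriter(live,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addIndexes(fresh);   // the IndexWriter.addIndexes merge
        }
        // Trigger the commit over HTTP; the URL shape is an assumption.
        try (InputStream in = new URL(
                "http://localhost:8080/solr/update?commit=true").openStream()) {
            in.read();   // drain the response
        }
    }
}
```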

•   We have 10 Hadoop data nodes, with 3.5 TB of disk each (35 TB in total).
•   We are currently indexing an average of 140 GB per day.

•   The merged indexes are not replicated at all… only one Solr node has a copy of each index, so
    failover involves a brief downtime for queries. If we lose a node, other nodes (consistent hashing)
    become responsible and merge the indexes from the copies we always have in Hadoop.
Future
• Creating reports and running ad-hoc queries.
• Writing more MapReduce jobs to pull whatever
  else is wanted from the logs.
References
• How Rackspace Now Uses MapReduce
  and Hadoop to Query Terabytes of Data
• MapReduce at Rackspace
