1. Next Revolution
Toward Open Platform
Terapot: Massive Email Archiving
with Hadoop & Friends
- Commercial Hadoop Application
Jaesun Han
Founder & CEO of NexR
jshan@nexrcorp.com
2. #2
About NexR
Offering Hadoop & Cloud Computing Platform and Services
Hadoop & Cloud Computing Services
- Hadoop Provisioning & Management
- Academic Support Program
Massive Data Storage & Processing Platform
- Massive Email Archiving
- MapReduce Workflow
Cloud Computing Platform (Compatible with Amazon AWS)
- icube-cc (Compute)
- icube-sc (Storage)
3. #3
What is Email Archiving?
The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: Litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
4. #4
The Architecture of Email Archiving
Data Acquisition: Journaling, Mailbox Crawling
Data Processing: Indexing, Filtering
Data Access: Search, Discovery
[Diagram: Email Servers feed the Email Archiving Server through Journaling and Crawling. The server keeps email data in Archival Storage and builds Indexes through Indexing. Employees use Search; auditors and administrators use Discovery.]
5. #5
The Challenges of Email Archiving
Explosive growth of digital data
- 6 times more data in 2010 (988 EB) than in 2006
- 95% (939 EB) is unstructured data, including email
- Increasing the cost and complexity of archiving
Requiring scalable & low cost archiving
Stricter data retention regulations
- Retention, Disposal, e-Discovery, Security
- HIPAA (Healthcare) 21-23 yrs, SEC 17a-4 (Trading) 6 yrs,
OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
Requiring scalable archiving & fast discovery
Need for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
Requiring integration with intelligent system
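A minimal sketch of how the retention periods listed above could drive a disposal date. The dictionary keys, the function name, and the choice of the upper bound for the HIPAA range are assumptions for illustration, not part of Terapot.

```python
from datetime import date

# Retention periods in years, as listed on the slide. HIPAA is given as a
# 21-23 year range; this sketch assumes the upper bound to stay conservative.
RETENTION_YEARS = {
    "HIPAA": 23,
    "SEC17a-4": 6,
    "OSHA": 30,
    "SOX": 5,
}

def disposal_date(received: date, regulations: list[str]) -> date:
    """Earliest date an email may be disposed of under all given regulations."""
    years = max(RETENTION_YEARS[r] for r in regulations)
    try:
        return received.replace(year=received.year + years)
    except ValueError:
        # received was Feb 29 and the target year is not a leap year
        return received.replace(year=received.year + years, day=28)

print(disposal_date(date(2010, 3, 15), ["SOX", "SEC17a-4"]))  # 2016-03-15
```

When several regulations apply to the same email, the longest retention period wins.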
6. #6
New Requirements of Email Archiving
High Scalability
Low Cost
High Performance
Intelligence
7. #7
Terapot: When Hadoop Met Email Archiving…
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
[Diagram components: Email Servers; Distributed Crawling; Journaling via a Journaling Server; Hadoop MapReduce (Crawling, Indexing, etc.); Hadoop HDFS (Archiving); Distributed Search & Discovery]
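The indexing step can be pictured as a map phase and a reduce phase. The sketch below simulates both in plain Python rather than on Hadoop. The email fields, function names, and whitespace tokenization are assumptions for illustration; the actual Terapot jobs build Lucene indexes.

```python
from collections import defaultdict

def map_email(email):
    """Map phase: emit ((user, term), message_id) for every term in the body."""
    for term in set(email["body"].lower().split()):
        yield (email["user"], term), email["id"]

def reduce_postings(key, message_ids):
    """Reduce phase: collapse all message ids for one (user, term) key."""
    return key, sorted(message_ids)

def run_job(emails):
    # Shuffle: group mapper output by key, as the framework would do.
    groups = defaultdict(list)
    for email in emails:
        for key, value in map_email(email):
            groups[key].append(value)
    return dict(reduce_postings(k, v) for k, v in groups.items())

emails = [
    {"id": 1, "user": "alice", "body": "Quarterly audit report"},
    {"id": 2, "user": "alice", "body": "Audit schedule"},
    {"id": 3, "user": "bob", "body": "Lunch schedule"},
]
index = run_job(emails)
print(index[("alice", "audit")])  # [1, 2]
```

Keying on (user, term) keeps each user's postings separate, which matches the per-user index files described later in the deck.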
8. #8
Features of Terapot
Distributed Massive Email Archiving
High Scalability by Shared-Nothing Architecture
- Thousands of servers, billions of emails
Low Cost by Inexpensive Hardware
- Entry servers under $5,000
High Performance by Parallelism
- Search within 1-2 seconds per user account
- Fast discovery in parallel with MapReduce
Intelligence by Data Mining
- Contact network analysis, content analysis, statistics
Support for Both On-premise and Cloud (Hosted) Versions
Development with Various Open Source Software
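As an illustration of the contact network analysis mentioned above, the sketch below counts sender-to-recipient edges. The record layout and addresses are made up for the example; how Terapot computes this is not described in the slides.

```python
from collections import Counter

def contact_edges(emails):
    """Count how often each sender wrote to each recipient."""
    edges = Counter()
    for email in emails:
        for recipient in email["to"]:
            edges[(email["from"], recipient)] += 1
    return edges

emails = [
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"]},
]
edges = contact_edges(emails)
print(edges.most_common(1))  # [(('alice@example.com', 'bob@example.com'), 2)]
```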
9. #9
The Architecture of Terapot
Terapot Clients: HTTP/SOAP, REST, JSON
Email Sources: POP3 Server, Mail Server, NAS/NFS, FTP/SFTP Server
Terapot Frontend: MR Workflow Manager, MailServer, Search Gateway, Analyzer
4 key components
- Batch processing: Crawling, Indexing, Merging
- Real-Time Indexing
- Searching
- Analysis: ETL, Mining
Hadoop MapReduce, Lucene, & Hive
Storage: HDFS (email), Local (index)
10. #10
Batch Processing Component
Crawling (MR): Email Sources to HDFS
- Archiving policies: an archive file per user, several archive files per configured crawling period
- An archive file per user (sequence file)
Indexing (MR)
- A temporary index file per user (Lucene index file)
Merging
- A merged index file (for backing up)
- Index shards (3-copy replication)
Search
- Index shards (shard 0, shard 1) on the local file system
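The merging step can be sketched as folding per-user temporary indexes into a fixed number of shards. The hash-based shard assignment, the shard count, and the dictionary-based index are assumptions for illustration; the slides do not state how Terapot assigns users to shards.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # assumption; the slide shows "shard 0" and "shard 1"

def shard_for(user: str) -> int:
    """Stable shard assignment (an assumed policy, not stated in the slides)."""
    return zlib.crc32(user.encode("utf-8")) % NUM_SHARDS

def merge_into_shards(temp_indexes):
    """Merge per-user temporary indexes into one merged index per shard.

    temp_indexes: iterable of (user, {term: [message_id, ...]}) pairs; several
    entries per user are allowed, one for each crawling period.
    """
    shards = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for user, index in temp_indexes:
        for term, ids in index.items():
            shards[shard_for(user)][user][term].update(ids)
    return shards

temp = [
    ("alice", {"audit": [1]}),
    ("alice", {"audit": [2], "report": [2]}),
    ("bob", {"lunch": [3]}),
]
shards = merge_into_shards(temp)
merged = shards[shard_for("alice")]["alice"]
print(sorted(merged["audit"]))  # [1, 2]
```

A stable hash keeps every index for one user in the same shard, so a per-user search only needs to touch one shard.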
11. #11
Real-Time Indexing Component
[Diagram components: Journaling Server (Forwarding); Real-Time Indexing with Database and Memory (Indexing, Archiving); Real-Time Index (Flushing); HDFS (archive, index); Batch Processing Component (Crawling)]
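A minimal sketch of in-memory indexing with periodic flushing, the two ideas named on this slide. The class name, the flush threshold, and the `flush_fn` callback (a stand-in for writing to HDFS) are hypothetical. The sketch also assumes that flushed emails are served by the batch component afterwards.

```python
class RealTimeIndexer:
    """Keeps recent emails searchable in memory and flushes them in batches."""

    def __init__(self, flush_threshold, flush_fn):
        self.flush_threshold = flush_threshold
        self.flush_fn = flush_fn          # stand-in for writing to HDFS
        self.pending = []                 # emails not yet flushed
        self.index = {}                   # term -> set of message ids

    def add(self, email):
        self.pending.append(email)
        for term in email["body"].lower().split():
            self.index.setdefault(term, set()).add(email["id"])
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def search(self, term):
        return sorted(self.index.get(term.lower(), ()))

    def flush(self):
        # Hand the batch to durable storage, then reset the in-memory state;
        # from here on the batch component is assumed to serve these emails.
        self.flush_fn(list(self.pending))
        self.pending.clear()
        self.index.clear()

flushed = []
indexer = RealTimeIndexer(flush_threshold=2, flush_fn=flushed.append)
indexer.add({"id": 1, "body": "Audit report"})
print(indexer.search("audit"))  # [1]
indexer.add({"id": 2, "body": "Audit schedule"})
print(len(flushed), indexer.search("audit"))  # 1 []
```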
12. #12
Search & Discovery Component
[Diagram components: Search Gateway (Locating index shards, Distributed Search); Zookeeper (Assigning shards, Updating shard status); Search Nodes (copy index shards to local file system); Real-Time Indexing Nodes; HDFS (index shards)]
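The gateway's role (locate shards, query nodes in parallel, merge results) can be sketched as follows. The in-process dictionaries stand in for ZooKeeper shard status and for each node's local index; node names, scores, and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the shard status kept in ZooKeeper: shard id -> serving node.
SHARD_REGISTRY = {0: "node-a", 1: "node-b"}

# Stand-in for each node's local index shards: node -> {term: [(score, id)]}.
NODE_INDEXES = {
    "node-a": {"audit": [(0.9, 1), (0.4, 2)]},
    "node-b": {"audit": [(0.7, 5)], "lunch": [(0.8, 3)]},
}

def search_node(node, term):
    """Query one search node; a real system would make a remote call here."""
    return NODE_INDEXES[node].get(term, [])

def gateway_search(term, top_k=10):
    """Locate shards, query their nodes in parallel, and merge by score."""
    nodes = sorted(set(SHARD_REGISTRY.values()))
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: search_node(n, term), nodes))
    hits = [hit for partial in partials for hit in partial]
    return [msg_id for _, msg_id in sorted(hits, reverse=True)[:top_k]]

print(gateway_search("audit"))  # [1, 5, 2]
```

Each node returns only its own partial result, so the gateway does the final ranking across shards.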