2. About me
• Data Warehouse Architect
• Move Inc (realtor.com)
• Pluralsight Author
• Passion for data and technology
Ahmad Alkilani
linkedin.com/in/ahmadalkilani
EASkills.com
3. Topics…
• History of Move’s enterprise data warehouse
• Why Hadoop found a home at Move
• High level architecture
• Where we are now
• Where we’re heading in the future
• Q & A
4. Move Inc.
• Leader in online real estate and operator of realtor.com
• Over 410 million minutes per month on Move websites
• Over 300 million user engagement events per day on realtor.com and mobile apps
• Connecting consumers and customers requires lots of data
8. • Bigger servers
• 8 processors, 10 cores each
• 2 TB of RAM!
• Solid state drives
• Fusion IO cards
• 10 terabytes per server
• Worked Great!
• Until we realized we could only store 50 days' worth of data!
reactive …
[Chart: Raw Events, Move Inc. (Realtor.com and Mobile); vertical axis in billions, scale up to 7]
9. • Started with 13 nodes at a fraction of the cost of our SSD monster servers – Cost
• Plan to continue to scale out – Ease of scalability
• Current capacity is ~125 TB – Good starting point
proactive …
11. In more detail…
Hive over HCatalog
Transferred to HDFS and then the Hive Warehouse
HDFS
External Tables against data in HDFS
Data moves to Hive Warehouse
Dynamic Partition Inserts
• Partition Pruning
• Snappy Compression
• Dynamic Tables with Maps and Arrays
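The two partitioning ideas on this slide can be illustrated outside Hive. The following is a minimal plain-Python sketch, not Hive code and not Move's pipeline: a dynamic partition insert takes the partition value from each row instead of naming it up front, and partition pruning lets a filtered read skip partitions that cannot match. The field names (`event_date`, `event`, `props`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, partition_key):
    """Route each row to a partition chosen from the row's own data."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition value is read from the row, not declared in advance;
        # that is what makes the insert "dynamic".
        value = row[partition_key]
        # The partition column lives in the partition name, not in the stored row.
        stored = {k: v for k, v in row.items() if k != partition_key}
        partitions[f"{partition_key}={value}"].append(stored)
    return dict(partitions)


def scan_with_pruning(partitions, wanted_values):
    """Read only the partitions whose value is wanted; skip the rest unread."""
    scanned, result = [], []
    for name, rows in partitions.items():
        value = name.split("=", 1)[1]
        if value not in wanted_values:
            continue  # pruned: this partition's rows are never touched
        scanned.append(name)
        result.extend(rows)
    return scanned, result


# Hypothetical events; "props" stands in for a map-typed column.
events = [
    {"event_date": "2013-06-01", "event": "search", "props": {"state": "CA"}},
    {"event_date": "2013-06-01", "event": "view", "props": {"listing": "123"}},
    {"event_date": "2013-06-02", "event": "search", "props": {"state": "TX"}},
]

warehouse = dynamic_partition_insert(events, "event_date")
scanned, rows = scan_with_pruning(warehouse, {"2013-06-02"})
print(sorted(warehouse))
print(scanned, len(rows))
```

In this sketch a query for a single day reads one partition out of two; on a warehouse partitioned by date the saving grows with the number of days stored.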
12. ETL & Querying Hive…
Hive Warehouse
Aggregates
SQL Server (EDW)
Multi-Inserts
Single Pass
Details
Stats
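Hive's multi-insert form (one `FROM` clause feeding several `INSERT` clauses) reads the source once and writes several outputs. The sketch below shows the same single-pass idea in plain Python; it is an illustration under assumptions, not the ETL described on the slide. The three outputs are named after the slide labels (details, aggregates, stats), and the `user` and `event` fields are hypothetical.

```python
from collections import Counter


def multi_insert(rows):
    """Fill three outputs from one scan of the source rows."""
    details = []            # row-level output
    aggregates = Counter()  # event counts
    users = set()
    row_count = 0
    for row in rows:  # the only loop over the source
        details.append((row["user"], row["event"]))
        aggregates[row["event"]] += 1
        users.add(row["user"])
        row_count += 1
    stats = {"rows": row_count, "distinct_users": len(users)}
    return details, dict(aggregates), stats


def event_stream():
    # A generator can be consumed only once, so a second pass is impossible.
    yield {"user": "u1", "event": "search"}
    yield {"user": "u2", "event": "search"}
    yield {"user": "u1", "event": "listing_view"}


details, aggregates, stats = multi_insert(event_stream())
print(aggregates)
print(stats)
```

Running three separate queries would scan the source three times; feeding all outputs from one loop is the point of the single-pass approach.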
13. ETL & Querying Hive…
Separate files for different keys of a Map
• Resort to MapReduce instead of Hive and use the MultipleOutputs class
• Dynamic Partition Inserts again & hadoop fs -getmerge
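The goal of writing a separate file for each key of a map can be sketched in plain Python. This is an analogy for what a MapReduce job using `MultipleOutputs` would do, not the Hadoop API itself; the record layout (`id`, `props`) and the one-file-per-key naming are assumptions made for illustration.

```python
import os
import tempfile


def write_by_map_key(records, out_dir):
    """Write one tab-separated file per map key found in the records."""
    handles = {}
    try:
        for record in records:
            for key, value in record["props"].items():
                if key not in handles:
                    # Assumes keys are safe to use as file names.
                    path = os.path.join(out_dir, f"{key}.txt")
                    handles[key] = open(path, "w")
                handles[key].write(f"{record['id']}\t{value}\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Hypothetical records with a map-typed "props" field.
records = [
    {"id": 1, "props": {"state": "CA", "beds": "3"}},
    {"id": 2, "props": {"state": "TX"}},
]

with tempfile.TemporaryDirectory() as out_dir:
    keys = write_by_map_key(records, out_dir)
    with open(os.path.join(out_dir, "state.txt")) as f:
        state_lines = f.read().splitlines()

print(keys)
print(state_lines)
```

The alternative on the slide reaches the same result inside Hive: a dynamic partition insert keyed on the map key produces one directory per key, and `hadoop fs -getmerge` then concatenates each directory's files into a single local file.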
14. Some lessons learned…
• Our ETLs are still expensive
• Putting our data loads and cluster at the mercy of our analysts is not a very good idea
• Use Queues to guarantee room for ETLs to do their job
• Default queue is for users
• Specialized queue is for ETL
• Keep an eye on the slots available
• Use .hiverc file to automatically control behavior
15. Where we’re headed
• Re-evaluate tool selection
• Talend/Pentaho
• Real-time analytics
• Kafka/Honu/Flume/Storm/StreamInsight
• Hive Geospatial
• Integrating different technologies is OK
16. D3.js with ASP.NET SignalR
Visualizing search activity and active listings in different states