SlideShare une entreprise Scribd logo
1  sur  12
淘宝Hadoop数据分析实践 淘宝 数据平台与产品部 周敏(周忱)
数据分析选型历程 Hadoop简介 系统架构 集群介绍 近期对Hadoop的改造实践 主要内容
淘宝数据分析选型历程 webalizer awstat 般若 & Oracle Atpanel & Oracle RAC  日志最高达250GB/天 最高达约50道作业 每天运行20小时以上 Oracle RAC集群最多20个节点 Hadoop Hive
Hadoop是什么
目前架构 天网调度系统 Oracle 备库 爬虫数据 MySQL备库 日志系统 TimeTunnel DataExchange DataSync Gateway Servers Hadoop Cluster:云梯1 Map Reduce Java Jobs Streaming Jobs Hive Jobs 数据平台 搜索 支付宝 B2B 云梯2 口碑 广告 BI 数据魔方 量子统计 淘数据 推荐系统 搜索排行 …
规模 总容量27.79PB, 利用率51.06% 总共1600+台机器 约6.6千万个文件 每台机器12 TB/24TB 约40000道作业/天 扫描数据约1.7PB/天 产生数据约255 TB/天 用户数820人, 用户组67个
JobTracker优化 ,[object Object]
Heartbeat锁粒度降低
JobHistory页面分离
Log4j配置及使用优化
MapReduce模拟器,[object Object]
存储优化 极限存储 采用增量存储表数据 建立聚簇索引定位某天/某段时间内的快照 压缩核心表在云梯的存储空间, 平均比率1/30 已经节省3PB空间 压缩 历史数据采用BZip2压缩 已经开发LZMA2压缩, 等待上线 Hadoop RAID 源于Facebook的版本, 添加Placement Mover 正在上线, 预计可再节省3PB空间

Contenu connexe

Tendances

Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
Jazz Yao-Tsung Wang
 
Hadoop与数据分析
Hadoop与数据分析Hadoop与数据分析
Hadoop与数据分析
George Ang
 

Tendances (20)

2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
 
大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術
 
Data Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouseData Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouse
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
 
Java Concurrent Optimization: Concurrent Queue
Java Concurrent Optimization: Concurrent QueueJava Concurrent Optimization: Concurrent Queue
Java Concurrent Optimization: Concurrent Queue
 
Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來
 
Hadoop 介紹 20141024
Hadoop 介紹 20141024Hadoop 介紹 20141024
Hadoop 介紹 20141024
 
ClickHouse北京Meetup ClickHouse Best Practice @Sina
ClickHouse北京Meetup ClickHouse Best Practice @SinaClickHouse北京Meetup ClickHouse Best Practice @Sina
ClickHouse北京Meetup ClickHouse Best Practice @Sina
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub
 
Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結
 
Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
 
Life of Big Data Technologies
Life of Big Data TechnologiesLife of Big Data Technologies
Life of Big Data Technologies
 
When R meet Hadoop
When R meet HadoopWhen R meet Hadoop
When R meet Hadoop
 
Log collection
Log collectionLog collection
Log collection
 
Hadoop 0.20 程式設計
Hadoop 0.20 程式設計Hadoop 0.20 程式設計
Hadoop 0.20 程式設計
 
用Python实现hadoop任务调度管理
用Python实现hadoop任务调度管理用Python实现hadoop任务调度管理
用Python实现hadoop任务调度管理
 
Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...
Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...
Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...
 
Mesos-based Data Infrastructure @ Douban
Mesos-based Data Infrastructure @ DoubanMesos-based Data Infrastructure @ Douban
Mesos-based Data Infrastructure @ Douban
 
Hadoop与数据分析
Hadoop与数据分析Hadoop与数据分析
Hadoop与数据分析
 
Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計
 

En vedette

En vedette (20)

空望 推荐系统@淘宝
空望 推荐系统@淘宝空望 推荐系统@淘宝
空望 推荐系统@淘宝
 
鹰眼下的淘宝_EagleEye with Taobao
鹰眼下的淘宝_EagleEye with Taobao鹰眼下的淘宝_EagleEye with Taobao
鹰眼下的淘宝_EagleEye with Taobao
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered Journey
 
Taobao
TaobaoTaobao
Taobao
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Rebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for ScaleRebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for Scale
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud Storage
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.
 
SEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile gamesSEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile games
 
The truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on HadoopThe truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on Hadoop
 
Scaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC IsilonScaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC Isilon
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
 

Similaire à 淘宝Hadoop数据分析实践

淘宝分布式数据处理实践
淘宝分布式数据处理实践淘宝分布式数据处理实践
淘宝分布式数据处理实践
isnull
 
05 杨志丰
05 杨志丰05 杨志丰
05 杨志丰
锐 张
 
Hadoop系统及其关键技术
Hadoop系统及其关键技术Hadoop系统及其关键技术
Hadoop系统及其关键技术
冬 陈
 

Similaire à 淘宝Hadoop数据分析实践 (20)

選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲
 
Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)
 
查礼 -大数据技术如何用于传统信息系统
查礼 -大数据技术如何用于传统信息系统查礼 -大数据技术如何用于传统信息系统
查礼 -大数据技术如何用于传统信息系统
 
Hadoop
HadoopHadoop
Hadoop
 
Hdfs introduction
Hdfs introductionHdfs introduction
Hdfs introduction
 
淘宝分布式数据处理实践
淘宝分布式数据处理实践淘宝分布式数据处理实践
淘宝分布式数据处理实践
 
Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里
 
05 杨志丰
05 杨志丰05 杨志丰
05 杨志丰
 
Easier and Faster for hbase in HadoopCon 2014
Easier and Faster for hbase in HadoopCon 2014Easier and Faster for hbase in HadoopCon 2014
Easier and Faster for hbase in HadoopCon 2014
 
Hadoop con 2015 hadoop enables enterprise data lake
Hadoop con 2015   hadoop enables enterprise data lakeHadoop con 2015   hadoop enables enterprise data lake
Hadoop con 2015 hadoop enables enterprise data lake
 
Hdfs
HdfsHdfs
Hdfs
 
Hdfs
HdfsHdfs
Hdfs
 
《数据库发展研究报告-解读(2023年)》.pdf
《数据库发展研究报告-解读(2023年)》.pdf《数据库发展研究报告-解读(2023年)》.pdf
《数据库发展研究报告-解读(2023年)》.pdf
 
Hbase
HbaseHbase
Hbase
 
Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势
 
Ocean base 千亿级海量数据库-日照
Ocean base 千亿级海量数据库-日照Ocean base 千亿级海量数据库-日照
Ocean base 千亿级海量数据库-日照
 
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
 
高科技產業資料分析解決方案 Hare DB
高科技產業資料分析解決方案 Hare DB高科技產業資料分析解決方案 Hare DB
高科技產業資料分析解決方案 Hare DB
 
Hadoop系统及其关键技术
Hadoop系统及其关键技术Hadoop系统及其关键技术
Hadoop系统及其关键技术
 
Paas研究介绍
Paas研究介绍Paas研究介绍
Paas研究介绍
 

Plus de Min Zhou (6)

Big Data Analytics Infrastructure
Big Data Analytics InfrastructureBig Data Analytics Infrastructure
Big Data Analytics Infrastructure
 
Java trouble shooting
Java trouble shootingJava trouble shooting
Java trouble shooting
 
Hive
HiveHive
Hive
 
Java程序员也需要了解CPU
Java程序员也需要了解CPUJava程序员也需要了解CPU
Java程序员也需要了解CPU
 
Anthill: A Distributed DBMS Based On MapReduce
Anthill: A Distributed DBMS Based On MapReduceAnthill: A Distributed DBMS Based On MapReduce
Anthill: A Distributed DBMS Based On MapReduce
 
Redpoll
RedpollRedpoll
Redpoll
 

淘宝Hadoop数据分析实践