Parallel Data Mining Platform in Telecom Industry -- Big Cloud based Parallel Data Mining Platform. Friday, Oct 2, 2009, NYC. Research Institute of China Mobile Communication Corporation. Feng Cao
Outline
Large scale data in China Mobile Communication Corporation (CMCC)
Large Scale Data Applications and current solution. The Requirements. Current solution: Clementine, Enterprise Miner, Intelligent Miner
What’s BASS
Challenges and limitations of BASS
What is the BC-PDM
BC-PDM Architecture (diagram labels: DE, DT, Data mining App)
Features of BC-PDM (I)
Features of BC-PDM (II)
Case I – MapReduce-based ETL
Input Data: define the target fields (one or all).
MapTasker 1 ... n: set the target fields as Key and the other fields as Value.
ReduceTasker 1 ... m: reduce on the same key, read from the value list, and write once.
Output Data.
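Key technical solution - Parallel ETL - Duplicate removal
Function: the duplicate-removal operation deletes records that are exactly identical to another record anywhere in the data set, keeping only one copy of each.
Targets: 1) parallelize duplicate removal on data tables; 2) results fully consistent with the serial result; 3) near-linear speedup, with TB-scale data processed in on the order of a thousand seconds.
Reference solution: serial duplicate removal in a database.
Our solution: 1) map splits the data to be processed into blocks, one block per processing node. The map input key is the default (the offset of each line) and the value is the text of that line, so each line of a block is read in turn. The map task emits intermediate <key,value> pairs in which the key is the whole line text and the value is empty text. 2) For data with the same key, reduce outputs the whole line as the key with an empty value, so identical records are kept only once. The reduce output is stored to the distributed file system.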
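The duplicate-removal steps above can be sketched in a few lines of plain Python. This is a single-process illustration of the map and reduce roles, not the BC-PDM or Hadoop implementation; the function names `map_phase`, `reduce_phase`, and `deduplicate` are hypothetical, and the list index stands in for the per-line offset.

```python
from itertools import groupby


def map_phase(lines):
    # Input key is the line position (an assumption standing in for the
    # offset); it is ignored. Emit (whole line text, empty value).
    for _offset, line in enumerate(lines):
        yield line, ""


def reduce_phase(pairs):
    # Simulated shuffle: sort so identical keys are adjacent, then emit
    # each distinct key exactly once with an empty value.
    for key, _group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, ""


def deduplicate(lines):
    # Keep only the keys, i.e. one copy of every distinct record.
    return [key for key, _value in reduce_phase(map_phase(lines))]


records = ["a,1", "b,2", "a,1", "c,3", "b,2"]
unique = deduplicate(records)
print(unique)
```

Because the whole record is the key, the framework's grouping step does the deduplication; the reducer only has to write each key once. In this sketch the output comes back sorted by record text rather than in input order.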
Case II – MapReduce-based DM Algorithm
Input Data.
MapTasker 1 ... n: set the frequent (k-1)-length item sets as Key and their occurrence counts as Value.
ReduceTasker 1 ... m: reduce on the same key, read from the value list, and sum.
Output Data: output the rules that satisfy both the minimum support value and the minimum confidence value.
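Key technical solution - Parallel association rule algorithm - PApriori
Function: Apriori discovers associations between attributes using a strategy based on counting frequent itemsets.
Targets: 1) parallelize the search for frequent k-itemsets; 2) results fully consistent with the serial result; 3) good scalability, with TB-scale data processed in on the order of a thousand seconds.
Reference solution: the serial Apriori algorithm.
Our solution: 1) use Map/Reduce with level-by-level iteration to find frequent itemsets, parallelizing the search for the frequent k-itemsets at each level. 2) Transform the data into intermediate Key/Value pairs, where the key is a candidate k-itemset and the value is the itemset count. The outputs of the processing nodes are merged, and the itemsets that meet the minimum support threshold become the frequent k-itemsets. 3) Generate strong association rules from the frequent itemsets and output the rules that meet the minimum confidence threshold.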
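The level-wise scheme above can be illustrated with a small single-process Python sketch. It is not the PApriori code from BC-PDM: the names `map_count`, `reduce_sum`, `frequent_itemsets`, and `rules` are hypothetical, supports are absolute counts, and the candidate-generation step is the textbook Apriori join and prune, which the slides do not spell out.

```python
from collections import Counter
from itertools import combinations


def map_count(transactions, candidates):
    # Map: emit (candidate itemset, 1) for every transaction containing it.
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:
                yield c, 1


def reduce_sum(pairs, min_support):
    # Reduce: sum the counts per key, keep keys meeting minimum support.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return {k: v for k, v in counts.items() if v >= min_support}


def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        level = reduce_sum(map_count(transactions, candidates), min_support)
        if not level:
            break
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into size-k candidates, then prune
        # any candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    # confidence(lhs -> rhs) = count(lhs | rhs) / count(lhs)
    out = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = count / frequent[lhs]
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"],
                ["b", "c"], ["a", "b", "c"]]
freq = frequent_itemsets(transactions, min_support=3)
strong = rules(freq, min_confidence=0.75)
```

Each pass of the `while` loop corresponds to one Map/Reduce job; in a cluster setting the transactions would be split across map tasks and only the per-candidate counts would travel to the reducers.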
Experiment Environment: Software, Hardware
Evaluation of BC-PDM (Phase I)
Conclusions
Future works
People (cloud computing team from CMRI)
Collaborations are welcome! Thanks and Questions? fengcao@chinamobile.com, luozhiguo@chinamobile.com. Cloud Computing E-Channel (in Chinese): http://labs.chinamobile.com/cloud

Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
