Large-Scale Log Analysis for Marketing
Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation
Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
Company Overview
NTT Group, NTT Communications Corporate Structure
Nippon Telegraph and Telephone, ownership share and revenue of group companies:
  100%: US$ 12.9B revenue, global data, Internet access, voice, IT
  100%: US$ 24.4B revenue, local telecom
  100%: US$ 21.9B revenue, local telecom
  66.4%: US$ 52.8B revenue, mobile
  54.2%: US$ 14.5B revenue, system integration
NTT Communications divisions: First Sales Division, Second Sales Division, Global Sales Division, ...; Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division; Customer Services Division, Service Infrastructure Division, Systems Division; Corporate Planning Division, Finance Division, ...; Innovative IP Architecture Center
(Diagram group labels: Staff, Operation, Product, R&D)
NTT Group, NTT Communications Corporate Structure (continued)
Technical Support, SI Partnership
BizCITY: Cloud Services provided by NTT Communications
Network: high-speed backbone between datacenters; global network; secure connectivity (Internet / IP phone, VPN service); guaranteed, burst, and best-effort classes; domestic and international
ICT outsourcing: firewall; BizHosting (virtual server hosting); BizMail (webmail, scheduler); SaaS (CRM/SFA)
Mobile access: mobile thin client; ubiquitous office (remote access, mobile access, IP phone)
Big data analysis: BizStorage (online storage); multi-layer analysis; BizMarketing (big data: user logs)
Big Data in BizCITY
BizStorage (online storage): secure, high-capacity storage service for private data
BizMarketing (multi-layer analysis): mining data for marketing from user logs (access log, CGM log, query log)
Use Hadoop for "enormous" user log analysis
Next target: private data analysis (natural language processing, statistics) on application data
We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
Hadoop in BizMarketing
Web access analysis: many join operations
CGM analysis: increasing data!! (Chart: tweets per day, Jan 2009 to July 2011)
Requirement for scalability -> Hadoop!!
CGM Analysis in BizMarketing
"BuzzFinder" supports marketing activity using customers' feedback in social media.
BuzzFinder crawls and collects tweets and blogs, and makes them searchable.
Users: marketer, advertiser, promoter, R&D
Uses: branding, ad results, company reputation, differences from other companies
Data Flow in BuzzFinder
PostgreSQL -> Hadoop cluster (NLP and statistics by Map/Reduce) -> PostgreSQL
Map/Reduce in BuzzFinder
CGM data: the size per record is large; the number of records is small (x million/day); the map phase is costly (mainly because of NLP).
Linguistic & user data -> Map (NLP) -> features: keywords, sentiment, locations, topics, index data
Features + customer keywords -> Map (data extraction) -> keyword, sentiment, location, topic, search index
-> Reduce (statistics) -> keyword count, topic count, sentiment count, location count
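The extract-then-count flow on this slide can be illustrated with a minimal Python sketch. It is not BuzzFinder's code: the function names, the sample posts, and the use of plain substring matching in place of the real NLP step are all assumptions made for illustration.

```python
from collections import defaultdict

def map_keywords(post, tracked_keywords):
    """Map step: emit (keyword, 1) for each tracked keyword found in a post.

    Substring matching is a stand-in for the NLP step (an assumption).
    """
    text = post.lower()
    return [(kw, 1) for kw in tracked_keywords if kw.lower() in text]

def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical sample posts and customer keywords
posts = [
    "Earthquake felt in Tokyo this morning",
    "News about the nuclear power plant and the earthquake",
    "Lunch was great today",
]
tracked = ["earthquake", "nuclear power plant"]

emitted = [pair for post in posts for pair in map_keywords(post, tracked)]
keyword_counts = reduce_counts(emitted)
print(keyword_counts)
```

The same reduce function would serve for topic, sentiment, and location counts, since each is a per-key sum over whatever the map step emits.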
Output of BuzzFinder: Keyword Trend
Trends of specified keywords in Twitter: "Nuclear Power Plant" and "Earthquake"
(Chart: tweets per day for each keyword, axis marked at 50,000 and 100,000; annotated values 18,565 tweets/day and 65,642 tweets/day)
Many tweets about "Earthquake" on the 11th of each month
Peak annotation: "Heavy white smoke from Fukushima No.1 nuclear power plant." 95,271 tweets
Output of BuzzFinder: Topic Analysis
Popular topics about specified keywords in Twitter
Topics about "Nuclear Power Plant" in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda
Output of BuzzFinder: Location Analysis
Location analysis of "Nuclear Power Plant"
(Map: tweet volume by location on a scale from few to many; the disaster area and the Tokyo area are marked)
Many tweets come from big cities and the disaster area.
Output of BuzzFinder: Sentiment Analysis
Sentiment analysis of "Nuclear Power Plant"
            Positive   Negative
APR 2011    48.4%      51.6%
AUG 2011    47.5%      52.5%
The sentiment of "Nuclear Power Plant" became more negative from April (one month after the earthquake) to August. It is also more negative than the average sentiment (70% positive).
Hadoop in BizMarketing
Web access analysis: many join operations
CGM data analysis: increasing data!! (Chart: tweets per day, Jan 2009 to July 2011)
Requirement for scalability -> Hadoop!!
Web Access Analysis
Click stream based analysis
Goal: find out how Internet users behave inside the site
e.g. Why did users leave without converting?
Visualization of Internet users' behavior
Click stream based analysis: statistics and click stream analysis (OLAP)
e.g. Why did users leave without converting?
Hadoop for PaaS Services
Want to reduce the cost!
Normal Hadoop cluster -> high-speed Hadoop cluster: fewer servers at the same speed
Map/Reduce speeding-up techniques for: 1. Summation 2. OLAP (multi-join processing)
Elephant in Cloud runs FAST!!
(Illustration: our cluster vs. normal cluster)
Strategies for Cost Reduction
Statistics (summation): Map Multi-Reduce *
  Record reduce: HashMap-based pre-combining before the combiner. Advantages: 1) efficient combining by HashMap; 2) fewer spill operations.
  Local reduce: combining mapper outputs on the same server. Advantage: less data to shuffle.
OLAP (join): Pjoin **
  Join with pre-partitioning and semi-join. Advantage: efficient for multi-table joins.
*, ** "Map Multi-Reduce" and "Pjoin" were developed in NTT labs; the source code is currently closed.
Map Multi-Reduce / Record Reduce
Normal map/reduce: Input -> Map -> MapOutputBuffer -> sort & spill -> spill files -> mergeParts -> Output
Map/reduce with record reduce: Input -> Map + record reduce -> MapOutputBuffer -> sort & spill -> spill files -> mergeParts -> Output
Record reduce: a pre-combining function that runs before the combiner. Pre-combining in the map function reduces the number of spill operations.
Smaller output buffer
(Diagram legend: Map Task, Reduce Task, Server, Process, File)
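The idea of pre-combining inside the map function can be sketched in Python as follows. This is a generic in-mapper combining example, not NTT's closed-source record reduce: the function names, the `max_keys` flush threshold, and the sample records are assumptions.

```python
def map_plain(records):
    """Plain mapper: emit one (key, 1) pair per input record."""
    return [(key, 1) for key in records]

def map_with_precombine(records, max_keys=1000):
    """Mapper that pre-combines counts in a dict (a hash map) before emitting.

    The dict is flushed whenever it holds more than max_keys distinct keys,
    so memory use stays bounded. max_keys is a hypothetical tuning knob.
    """
    emitted = []
    partial = {}
    for key in records:
        partial[key] = partial.get(key, 0) + 1
        if len(partial) > max_keys:
            emitted.extend(partial.items())
            partial = {}
    emitted.extend(partial.items())  # flush what is left at end of input
    return emitted

# Hypothetical page paths from an access log
records = ["/top", "/blog", "/top", "/top", "/blog", "/search"]
plain = map_plain(records)
combined = map_with_precombine(records)
print(len(plain), len(combined))
```

Fewer emitted pairs mean less data entering the map output buffer, which is where the reduction in spill operations would come from.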
Map Multi-Reduce / Local Reduce
(Diagram: the user program forks a master and workers. The master assigns map, local reduce, and reduce tasks. Map workers read input splits 0 to 4 and write locally; a local reduce task runs on each server; reduce workers read remotely, sort, and write output files 0 and 1. Legend: Server, Process, File.)
Pre-reduce data on the same server before the combiner function
Twice as fast as the normal cluster
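A minimal Python sketch of the local reduce idea, under the assumption that each server merges the outputs of its own mappers before anything is shuffled. The server names, data, and function names are hypothetical, and this is not the NTT implementation.

```python
from collections import defaultdict

def local_reduce(mapper_outputs_by_server):
    """Merge the outputs of all mappers that ran on the same server."""
    merged = {}
    for server, mapper_outputs in mapper_outputs_by_server.items():
        totals = defaultdict(int)
        for output in mapper_outputs:
            for key, value in output:
                totals[key] += value
        merged[server] = dict(totals)
    return merged

def final_reduce(per_server_totals):
    """Global reduce over the already-merged per-server totals."""
    totals = defaultdict(int)
    for server_totals in per_server_totals.values():
        for key, value in server_totals.items():
            totals[key] += value
    return dict(totals)

# Two hypothetical servers, each running two mappers
mapper_outputs_by_server = {
    "server-a": [[("/top", 2), ("/blog", 1)], [("/top", 1)]],
    "server-b": [[("/top", 4)], [("/top", 1), ("/search", 2)]],
}

records_before = sum(len(o) for outs in mapper_outputs_by_server.values() for o in outs)
per_server = local_reduce(mapper_outputs_by_server)
records_after = sum(len(t) for t in per_server.values())
totals = final_reduce(per_server)
print(records_before, records_after, totals)
```

In this toy input the number of records that would cross the network drops from 6 to 4 while the final totals are unchanged.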
OLAP in Click Stream Based Analysis
click_stream is joined with page info, location info, user info, and click info.
The number of unique keys is large, so a scalable join is required!
Join using Map/Reduce
                Memory-backed join   Reduce-side join   Map-side join
Scalability     NG                   Good               Good
Shuffle cost    Low                  High               Low
Disk space      Good                 Good               Bad
Combine map-side join and reduce-side join to reduce shuffle cost and disk space while keeping scalability.
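To make the trade-off in the table concrete, here is a small Python sketch of two of the join styles. The table contents and function names are hypothetical; the sketch only shows where the join happens, not how Hadoop or Hive implement it.

```python
from collections import defaultdict

# Hypothetical tables: clicks are (click_id, page_id), pages are (page_id, url)
clicks = [(1, "p1"), (2, "p2"), (3, "p1"), (4, "p9")]
pages = [("p1", "/top"), ("p2", "/blog")]

def memory_backed_join(clicks, pages):
    """Join inside the mapper: the whole page table is held in a dict.

    Nothing is shuffled, but the page table must fit in memory.
    """
    page_by_id = dict(pages)
    return [(cid, pid, page_by_id[pid]) for cid, pid in clicks if pid in page_by_id]

def reduce_side_join(clicks, pages):
    """Tag both inputs, group every record by join key, join per group.

    Grouping by key stands in for the shuffle: every record moves.
    """
    groups = defaultdict(lambda: {"click": [], "page": []})
    for cid, pid in clicks:
        groups[pid]["click"].append(cid)
    for pid, url in pages:
        groups[pid]["page"].append(url)
    joined = []
    for pid, group in groups.items():
        for cid in group["click"]:
            for url in group["page"]:
                joined.append((cid, pid, url))
    return joined

in_memory = sorted(memory_backed_join(clicks, pages))
via_reduce = sorted(reduce_side_join(clicks, pages))
print(in_memory == via_reduce, in_memory)
```

Both functions return the same rows; they differ in which resource they stress, memory for the first and shuffle volume for the second.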
Pjoin / Join using Semi-Join View
Join on the map side using pre-partitioning, and do only the rest of the join on the reduce side.
Pre-processing: from pageinfo (site description data) and click_strm, build redundant data for multiple joins, holding the pageinfo primary key & foreign key (click_strm primary key): pageinfo_click_strm 1 ... pageinfo_click_strm n
Query execution:
  Mapper: click_strm processing + semi-join, reading click_strm 1 ... click_strm n and pageinfo_click_strm 1 ... pageinfo_click_strm n from the DFS
  Reducer: joining with pageinfo (pageinfo a, pageinfo b, ..., pageinfo z)
(Diagram labels: hash(x), hash(y), DFS read, shuffle)
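Pjoin itself is closed source, so the following Python sketch only shows the general semi-join idea the slide names: filter the large table by the join keys that survive on the small side, then join what is left. The table contents, the `example.com` URL, and the function name are assumptions for illustration.

```python
def semi_join(clicks, wanted_page_ids):
    """Keep only click rows whose page_id appears in the key set."""
    return [row for row in clicks if row[1] in wanted_page_ids]

# Hypothetical tables: pages are (page_id, url), clicks are (click_id, page_id)
pages = [
    ("p1", "http://blog.goo.ne.jp/a"),
    ("p2", "http://example.com/b"),
    ("p3", "http://blog.goo.ne.jp/c"),
]
clicks = [(1, "p1"), (2, "p2"), (3, "p3"), (4, "p2"), (5, "p1")]

# Pre-step: apply the URL filter to the page table and keep only matching keys
matching_pages = {pid: url for pid, url in pages if "blog.goo.ne.jp" in url}

# Semi-join: drop click rows that cannot match before the remaining join
reduced_clicks = semi_join(clicks, matching_pages.keys())

# Remaining join: attach the page URL to each surviving click
joined = [(cid, pid, matching_pages[pid]) for cid, pid in reduced_clicks]
print(len(clicks), len(reduced_clicks), joined)
```

Rows discarded by the semi-join never reach the later join stage, which is the source of the shuffle savings in a distributed setting.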
Experimental Evaluation (Pjoin)
1TB access log join processing using Pjoin to verify the effectiveness
(Chart: Pjoin vs. Hive (reduce-side join); processing time (min) by number of servers; series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers))
50 servers (normal Hadoop cluster) = 23 servers (Pjoin-applied cluster): same speed!!
HiveQL:
  insert overwrite table q1_result
  select count(distinct s_sessionseqid)
  from clckstrm c
  join page p
    on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%'
  join session_info s
    on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
Other Verification on Hadoop
Hadoop cluster (250 cores): NameNode; Rack 1 (LOC1), Rack 2 (LOC1), Rack 3 (LOC2); WAN (30 miles), 300Mb; LACP 4GB
(Chart: processing time by number of servers, with WAN)
No significant loss over WAN
Conclusions
Contacts

Contenu connexe

Similaire à Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the dataDataWorks Summit
 
Edge Computing risks and Opportunities for Telco and hyperscalers
Edge Computing risks and Opportunities for Telco and hyperscalersEdge Computing risks and Opportunities for Telco and hyperscalers
Edge Computing risks and Opportunities for Telco and hyperscalersPatrick Lopez
 
Intra mart accel platform 2021winter-en
Intra mart accel platform 2021winter-enIntra mart accel platform 2021winter-en
Intra mart accel platform 2021winter-enNTTDATA INTRAMART
 
UC2010_BRS1280_Eastman_Chemical_Johnston
UC2010_BRS1280_Eastman_Chemical_JohnstonUC2010_BRS1280_Eastman_Chemical_Johnston
UC2010_BRS1280_Eastman_Chemical_JohnstonH Eddie Newton
 
How Changing Mobile Technology Is Changing The Way We Do Business
How Changing Mobile Technology Is Changing The Way We Do Business How Changing Mobile Technology Is Changing The Way We Do Business
How Changing Mobile Technology Is Changing The Way We Do Business Osaka University
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMatillion
 
SenchaCon 2015 - The advanced operation portal built sencha ExtJs
SenchaCon 2015 - The advanced operation portal built sencha ExtJsSenchaCon 2015 - The advanced operation portal built sencha ExtJs
SenchaCon 2015 - The advanced operation portal built sencha ExtJs直樹 益子
 
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven Alerts
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven AlertsA DevOps Tutorial to Set-up Intelligent Machine Learning Driven Alerts
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven AlertsDevOps.com
 
Microsoft and aspect, transforming customer contact management
Microsoft and aspect, transforming customer contact managementMicrosoft and aspect, transforming customer contact management
Microsoft and aspect, transforming customer contact managementUnified Communications Online
 
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Nicola Sandoli
 
Device to Intelligence, IOT and Big Data in Oracle
Device to Intelligence, IOT and Big Data in OracleDevice to Intelligence, IOT and Big Data in Oracle
Device to Intelligence, IOT and Big Data in OracleJunSeok Seo
 
What's New with Windows Phone - FoxCon Talk
What's New with Windows Phone - FoxCon TalkWhat's New with Windows Phone - FoxCon Talk
What's New with Windows Phone - FoxCon TalkSam Basu
 
The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013RightScale
 
Effective IoT System on Openstack
Effective IoT System on OpenstackEffective IoT System on Openstack
Effective IoT System on OpenstackTakashi Kajinami
 
The impact of IOT - exchange cala - 2015
The impact of IOT - exchange cala - 2015The impact of IOT - exchange cala - 2015
The impact of IOT - exchange cala - 2015Eduardo Pelegri-Llopart
 

Similaire à Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications (20)

Accel series 2021_summer en
Accel series 2021_summer enAccel series 2021_summer en
Accel series 2021_summer en
 
The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the data
 
Edge Computing risks and Opportunities for Telco and hyperscalers
Edge Computing risks and Opportunities for Telco and hyperscalersEdge Computing risks and Opportunities for Telco and hyperscalers
Edge Computing risks and Opportunities for Telco and hyperscalers
 
Intra mart accel platform 2021winter-en
Intra mart accel platform 2021winter-enIntra mart accel platform 2021winter-en
Intra mart accel platform 2021winter-en
 
UC2010_BRS1280_Eastman_Chemical_Johnston
UC2010_BRS1280_Eastman_Chemical_JohnstonUC2010_BRS1280_Eastman_Chemical_Johnston
UC2010_BRS1280_Eastman_Chemical_Johnston
 
How Changing Mobile Technology Is Changing The Way We Do Business
How Changing Mobile Technology Is Changing The Way We Do Business How Changing Mobile Technology Is Changing The Way We Do Business
How Changing Mobile Technology Is Changing The Way We Do Business
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
SenchaCon 2015 - The advanced operation portal built sencha ExtJs
SenchaCon 2015 - The advanced operation portal built sencha ExtJsSenchaCon 2015 - The advanced operation portal built sencha ExtJs
SenchaCon 2015 - The advanced operation portal built sencha ExtJs
 
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven Alerts
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven AlertsA DevOps Tutorial to Set-up Intelligent Machine Learning Driven Alerts
A DevOps Tutorial to Set-up Intelligent Machine Learning Driven Alerts
 
Microsoft and aspect, transforming customer contact management
Microsoft and aspect, transforming customer contact managementMicrosoft and aspect, transforming customer contact management
Microsoft and aspect, transforming customer contact management
 
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
 
Device to Intelligence, IOT and Big Data in Oracle
Device to Intelligence, IOT and Big Data in OracleDevice to Intelligence, IOT and Big Data in Oracle
Device to Intelligence, IOT and Big Data in Oracle
 
Spotfire
SpotfireSpotfire
Spotfire
 
What's New with Windows Phone - FoxCon Talk
What's New with Windows Phone - FoxCon TalkWhat's New with Windows Phone - FoxCon Talk
What's New with Windows Phone - FoxCon Talk
 
The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013
 
Effective IoT System on Openstack
Effective IoT System on OpenstackEffective IoT System on Openstack
Effective IoT System on Openstack
 
Soma_Chakraborty (1)
Soma_Chakraborty (1)Soma_Chakraborty (1)
Soma_Chakraborty (1)
 
The impact of IOT - exchange cala - 2015
The impact of IOT - exchange cala - 2015The impact of IOT - exchange cala - 2015
The impact of IOT - exchange cala - 2015
 
Accel series 2019_winter_en
Accel series 2019_winter_enAccel series 2019_winter_en
Accel series 2019_winter_en
 
Industrial IoT bootcamp
Industrial IoT bootcampIndustrial IoT bootcamp
Industrial IoT bootcamp
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Dernier

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

  • 1. Large-Scale Log Analysis for Marketing Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved. Kenji Hara/ Yukio Uematsu Innovative IP Architecture Center NTT Communications Corporation
  • 2.
  • 3. NTT Group, NTT Communications Corporate Structure 100% 100% US$ 12.9B revenue Global data, Internet Access, Voice, IT US$ 24.4B revenue, Local Telecom Nippon Telegraph and Telephone 100% US$ 21.9B revenue, Local Telecom 66.4% US$ 52.8B revenue, Mobile 54.2% US$ 14.5B revenue, System Integration Second Sales Division First Sales Division Global Sales Division ... Video & Voice Division Network Services Division Cloud Services Division Applications and Content Division Solutions Division Customer Services Division Service Infrastructure Division Systems Division Corporate Planning Division Finance Division ... Innovative IP Architecture Center Staff Operation Product R&D
  • 4. NTT Group, NTT Communications Corporate Structure 100% 100% US$ 12.9B revenue Global data, Internet Access, Voice, IT US$ 24.4B revenue, Local Telecom Nippon Telegraph and Telephone 100% US$ 21.9B revenue, Local Telecom 66.4% US$ 52.8B revenue, Mobile 54.2% US$ 14.5B revenue, System Integration Technical Support, SI Partnership
  • 5. BizCITY: Cloud Services provided by NTT Communications High-Speed Backbone between Datacenters Global NW Secure Connectivity Internet/IP Phone VPN Service ICT Outsourcing Firewall Guaranteed Burst Best Effort Domestic International BizHosting Virtual Server Hosting BizMail WebMail, Scheduler SaaS CRM/SFA Internet BizStorage Online Storage Multi Layer Analysis BizMarketing Big Data (user log) Mobile Access Mobile Thin Client Ubiquitous Office Remote Access Mobile Access IP Phone Big Data Analysis BizStorage Online Storage Multi Layer Analysis BizMarketing Big Data (user log)
  • 6. Big Data in BizCITY Private Data Analysis Natural Language Processing Statistics Secure & High-Capacity Storage Service Mining Data for Marketing User Log Private Data BizStorage Online Storage Multi Layer Analysis BizMarketing Access Log Use Hadoop for “enormous” user log analysis CGM Log Query Log B Application Data Feature Next target BizMarketing
  • 7. We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
  • 8. Hadoop in BizMarketing Web Access Analysis CGM Analysis Hadoop!! Many Join Operations Increasing Data!! Requirement for scalability (chart: Tweets Per Day, Jan 2009 to July 2011)
  • 9. CGM Analysis in BizMarketing “BuzzFinder” supports marketing activity using customers’ feedback in social media Crawl Crawl Marketer Advertiser Promoter R&D Branding Ads’ Result Company Reputations Difference with other companies Tweet Blog Search Collect BuzzFinder Blog
  • 10. Data Flow in BuzzFinder PostgreSQL Hadoop Cluster PostgreSQL NLP and Statistics by Map/Reduce
  • 11. Map/Reduce in BuzzFinder CGM: data size per record is large; small number of records (x mil/day); Map is costly (mainly due to NLP). Keywords Customer Keywords Sentiment Locations Topics Index Data Keywords Sentiment Locations Topics Index Data Keyword Sentiment Location Topic Search Index Map (Data Extract) Keyword Count Topic Count Sentiment Count Location Count Reduce (Statistics) Features Map (NLP) Linguistic & User Data
  • 12. Output of BuzzFinder: Keyword Trend Trends of “Nuclear Power Plant” and “Earthquake” in Twitter 100,000 50,000 Earthquake Nuclear Power Plant 18565 tweets / day 65642 tweets / day Many tweets about “Earthquake” on the 11th of each month Trends of specified keywords in Twitter Heavy white smoke from Fukushima No.1 nuclear power plant. 95,271 tweets
  • 13. Output of BuzzFinder: Topic Analysis Topics about “Nuclear Power Plant” in September Popular topics about specified keywords in Twitter Topics about “Nuclear Power Plant” Tokyo Electric Power Japan Nuclear Accident Fukushima Noda
  • 14. Output of BuzzFinder: Location Analysis Location analysis of “Nuclear Power Plant” Disaster Area Tokyo Area Many Few Many tweets from big cities and the disaster area
  • 15. Output of BuzzFinder: Sentiment Analysis Sentiment analysis of “Nuclear Power Plant” APR 2011 AUG 2011 48.4% 51.6% 47.5% 52.5% Positive Negative The sentiment of “Nuclear Power Plant” became more negative from April (1 month after the earthquake) to August. The sentiment is more negative than the average sentiment (70% positive)
  • 16. Hadoop in BizMarketing Web Access Analysis CGM Data Analysis Hadoop!! Increasing Data!! (chart: Tweets Per Day, Jan 2009 to July 2011) Many Join Operations Requirement for scalability
  • 17. Web Access Analysis e.g., Why did users leave without converting? To find out Internet users’ behavior inside the site Click-stream-based analysis
  • 18.
  • 19. Hadoop for PaaS Services At the same speed Server reduction Speeding-up technique 1. Summation 2. OLAP (multi-join processing) Want to reduce the cost! Normal Hadoop Cluster High Speed Hadoop Cluster Map/Reduce speeding-up technique
  • 20. Our Cluster Normal Cluster Elephant in Cloud runs FAST!!
  • 21. Strategies for Cost Reduction. Statistics (summation): Map Multi-Reduce*. Record reduce: HashMap-based pre-combining before the combiner; advantages: 1) efficient combining by HashMap, 2) fewer spill operations. Local reduce: combining mapper outputs on the same server; advantage: less data to shuffle. OLAP (join): Pjoin**: join with pre-partitioning and semi-join; advantage: efficient for multi-table joins. *, ** “Map Multi-Reduce” and “Pjoin” were developed in NTT labs; the source code is currently closed.
  • 22. Map Multi-Reduce/Record Reduce. Normal map/reduce: Input → Map → MapOutputBuffer → sort & spill → Spill files → mergeParts → Output. Map/reduce with record reduce: the same stages (Input, Map, MapOutputBuffer, sort & spill, Spill files, mergeParts, Output) plus Record reduce, a pre-combining function before the combiner. Pre-combining in the map function reduces the number of spill operations. Smaller output buffer. Legend: Map Task, Reduce Task, Server, Process, File.
  • 23. Map Multi-Reduce/Local Reduce User Program worker worker worker Input Data fork fork fork Master worker worker assign map assign reduce local write remote read, sort Output File 0 Output File 1 Split 1 Split 0 Split 2 Split 3 Split 4 read worker worker worker worker worker assign local reduce Server Process File Pre-reduce data on the same server before the combiner function Local Reduce task Local Reduce task Local Reduce Twice as fast as the normal cluster
  • 24.
  • 25.
  • 26. Pjoin/Join using Semi-Join View Query execution pageinfo z Pre-processing pageinfo click_strm pageinfo primary key & foreign key (click_strm primary key) Site description data Pre-processing redundant data for multiple joins Join on the map side using pre-partitioning; only the rest of the join runs on the reduce side click_strm processing + semi-join mapper … click_strm processing + semi-join pageinfo a pageinfo_click_strm 1 … pageinfo_click_strm n click_strm n click_strm 1 Joining with pageinfo reducer … Joining with pageinfo … pageinfo b pageinfo a pageinfo z click_strm 1 click_strm n pageinfo_click_strm n pageinfo_click_strm 1 … hash(x) hash(y) hash(y) DFS read shuffle
  • 27. Experimental Evaluation (Pjoin). Join processing of a 1 TB access log using Pjoin to verify its effectiveness. Chart: Pjoin vs. Hive (reduce-side join); axes: No. of servers, Processing time (min); series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers). 50 servers (normal Hadoop cluster) = 23 servers (Pjoin-applied cluster): same speed!! HiveQL: insert overwrite table q1_result select count(distinct s_sessionseqid) from clckstrm c join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%' join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
  • 28.
  • 29.
  • 30.
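
Slide 11 describes a Map step that extracts features (keyword, sentiment, location, topic) from each CGM record and a Reduce step that counts them. The following is a minimal single-process sketch of that shape, not the BuzzFinder implementation. The lexicons, the record fields (`text`, `location`) and the function names are hypothetical, and the "NLP" here is a plain word lookup standing in for the real, much costlier analysis.

```python
from collections import Counter

# Hypothetical toy lexicons; a real system would use a proper NLP pipeline.
KEYWORDS = {"earthquake", "nuclear"}
NEGATIVE_WORDS = {"afraid", "danger"}


def map_nlp(post):
    """Map: yield ((feature_type, value), 1) for each feature found in one post."""
    words = post["text"].lower().split()
    for word in words:
        if word in KEYWORDS:
            yield ("keyword", word), 1
    is_negative = any(word in NEGATIVE_WORDS for word in words)
    yield ("sentiment", "negative" if is_negative else "positive"), 1
    yield ("location", post["location"]), 1


def reduce_counts(pairs):
    """Reduce: sum the counts per (feature_type, value) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


posts = [
    {"text": "Earthquake again I am afraid", "location": "Tokyo"},
    {"text": "Nuclear plant restart", "location": "Fukushima"},
    {"text": "earthquake drill today", "location": "Tokyo"},
]
stats = reduce_counts(pair for post in posts for pair in map_nlp(post))
```

Because each post is mapped independently, the expensive per-record work parallelizes across mappers, which matches the slide's note that the Map side dominates the cost.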
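
Slides 21 and 22 describe "record reduce" as HashMap-based pre-combining inside the map function, so that fewer records reach the output buffer and fewer spills occur. The NTT code is closed, so the sketch below only illustrates the general idea of in-mapper combining for a summation job. The function names and the `max_keys` flush threshold are assumptions made for this example.

```python
from collections import defaultdict


def naive_map(records):
    """Emit one (key, 1) pair per input record, as a plain counting mapper would."""
    return [(record, 1) for record in records]


def record_reduce_map(records, max_keys=1000):
    """Pre-combine counts in a hash map and emit them only when flushing.

    max_keys is a hypothetical bound on the in-memory map; when reached,
    the partial sums are emitted and the map is cleared.
    """
    buffer = defaultdict(int)
    emitted = []
    for record in records:
        buffer[record] += 1
        if len(buffer) >= max_keys:
            emitted.extend(buffer.items())
            buffer.clear()
    emitted.extend(buffer.items())  # final flush at end of input
    return emitted


urls = ["/top", "/blog", "/top", "/top", "/blog", "/search"]
plain = naive_map(urls)              # 6 pairs, one per record
combined = record_reduce_map(urls)   # 3 pairs, one per distinct key
```

With many repeated keys per mapper, the combined output is far smaller than the raw output, which is where a reduction in sort and spill work would come from. A small `max_keys` trades memory for some duplicate keys, which the downstream combiner or reducer still sums correctly.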
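
Slides 21 and 26 characterize Pjoin as a join with pre-partitioning and semi-join. Since its source is closed, the following is only a generic sketch of those two textbook ideas, using the column names from the HiveQL on slide 27 (`c_pageseqid`, `p_pageseqid`, `p_url`). The function names, the partition count and the sample rows are hypothetical.

```python
def partition(rows, key, n_parts):
    """Pre-partitioning: bucket rows by the hash of their join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def partitioned_semi_join(clicks, pages, url_fragment, n_parts=4):
    """Join clicks to pages whose URL contains url_fragment.

    Semi-join step: filter the small page table first, so that only click
    rows with a matching page key survive. Both sides use the same hash
    partitioning, so each bucket can be joined locally without a shuffle.
    """
    wanted_pages = [p for p in pages if url_fragment in p["p_url"]]
    page_parts = partition(wanted_pages, "p_pageseqid", n_parts)
    click_parts = partition(clicks, "c_pageseqid", n_parts)

    joined = []
    for page_part, click_part in zip(page_parts, click_parts):
        lookup = {p["p_pageseqid"]: p for p in page_part}
        for click in click_part:
            page = lookup.get(click["c_pageseqid"])
            if page is not None:
                joined.append({**click, **page})
    return joined


pages = [
    {"p_pageseqid": 1, "p_url": "http://blog.goo.ne.jp/a"},
    {"p_pageseqid": 2, "p_url": "http://example.com/"},
]
clicks = [
    {"c_clckstrmseqid": 10, "c_pageseqid": 1},
    {"c_clckstrmseqid": 11, "c_pageseqid": 2},
    {"c_clckstrmseqid": 12, "c_pageseqid": 1},
]
result = partitioned_semi_join(clicks, pages, "blog.goo.ne.jp")
```

Here the buckets are joined in a loop within one process; in a cluster each bucket pair would sit on one node, which is the intuition behind joining on the map side and leaving only the remainder for the reducers.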