NTT DATA has been providing Hadoop professional services to enterprise customers for years. In this talk we categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for deploying Hadoop clusters into existing infrastructure and services. We also present enhancement cases motivated by customer demand, including GPUs for big math and HDFS-capable storage systems.
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
1. Hadoop’s Life in Enterprise Systems. Y Masatani, OSS Professional Services, System Platform Sector, NTT DATA CORPORATION. Hadoop World 2011, Nov 8th.
4. Size of IT Services Market by Sectors <FY ended March 31, 2011> [Moderate Case] <2010>
[Figure: two pie charts, "IT Services Market in Japan" and "NTT DATA’s Consolidated Net Sales", with "Our Shares in Markets" shown per sector.]
IT Services Market in Japan (JPY 9.83 trillion): Government and healthcare-related 15.2%; Financial 23.4%; Enterprise, services, etc. 61.5%.
NTT DATA’s Consolidated Net Sales (JPY 1.16 trillion): Government and healthcare 20.4%; Financial 42.2%; Enterprise, services, etc. 31.7%; Other 5.7%.
Our Shares in Markets: Government and healthcare approx. 15.9%; Financial approx. 21.3%; Enterprise, services, etc. approx. 6.1%.
Percent of our net sales accounted for by each customer field/service when results are totaled using the criteria below. Government and healthcare: Central Government and Related Agencies, Overseas Public Institutions, etc. / Local Government and Community-based Business / Healthcare. Financial: Banks / Financial Unions / Insurance, Security and Credit Corporations / Settlement Services. Enterprise, services, etc.: Global IT Services Company. Other: Sales not included in the above.
Source: Gartner, "Forecast: IT Services Japan by Industry, 1Q 2011", Tsuyoshi Ebina, 20 May 2011. Note: Chart created by NTT Data based on Gartner data.
8. Popularity of Hadoop ~ 2011 Fall. [Chart: attendees by length of Hadoop experience: none; < 3 months; 3 to 6 months; 6 to 12 months; 1 to 3 years; 3+ years.] About 50% of attendees are still at the research stage; about 30% started within the last 6 months.
18. Archetype of Integration between Engines. [Diagram: data size (GB, TB, PB) plotted against latency (sec, min, hour, day). Processing domains: Online Processing; Query & Search Processing; Online Batch Processing; Enterprise Batch Processing; Big Data Processing. Customer cases marked by sector: financial, media, public, telecom. Engines: RDBMS; Low-Latency Serving Systems; DWH, Search Engine, etc.; Hadoop. Flows: Raw Data Source Input; Reduction; Coherent Import and Export.]
27. Enhanced Storage Architecture (Copyright 2011 FUJITSU LIMITED). Established storage management technology (memory caching and disk I/O scheduling) and an enhanced dedicated network boost HDFS performance. [Diagram: nodes labeled Local FS / Mem / CPU and nodes labeled Mem / CPU, joined by a meshed network (40 Gb b/w). Callouts: Extract disk I/O bandwidth as of locality; Enhanced bandwidth between nodes and storage; Storage file system supports HDFS APIs.]
Pros: Achieves 5x read and 10x write performance compared to local-disk HDFS, based on a financial enterprise batch benchmark case.
Cons: Limited scalability (up to 40 to 50 nodes in the prototype configuration; to be extended to about 120).
Y Masatani, Senior Specialist, NTT DATA. Masatani is a senior specialist in the System Platforms Sector at NTT DATA Corporation. He has more than 15 years of experience in software engineering and Internet services. He has directed the OSS professional services unit since 2006, delivering technical services and developing platform solutions. The team first became acquainted with Hadoop in late 2007 and started operational support services in mid-2008.
Who we are. The situation of Hadoop in Japan. Our experience: what we have learned and what we have observed in our customers and their clusters. More cases than can be counted on the fingers and toes of both hands and both feet.
First, we introduce who we are: 11.6 B, SI, consulting, outsourcing.
Left, middle, right: an all-rounder in the Japanese IT services market.
Nov 2009: there was one session from Cloudera. The 2nd one took 15 months. The 3rd one came sooner, after 7 months; Cloudera, Horton, and MapR came from the US. We hope to have a Hadoop World Japan or Asia in the near future.
Regarding the popularity and deployment of Hadoop: adoption is not yet wide or mature enough, but APPARENTLY it has accelerated this year.
Let’s look at the landscape first…
Let’s look at “data processing domains” and “applicable engines”. Transition of data flow, data change, and processing content. Data warehouse servers. Mid-tier servers.
Let’s talk about our experience. Parallel processing based on “data locality” is beneficial for large amounts of data and for repetitive sweeps over the data. Example: receipt processing in healthcare / insurance.
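The idea of sweeping data where it already resides can be sketched in a few lines. This is only an illustration under stated assumptions, not Hadoop's API: the receipt records, the `BLOCKS` layout, and the function names are all hypothetical, and a thread pool stands in for the nodes of a cluster. Each "node" aggregates its own block, and only the small partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical receipt records (provider, amount), pre-split into blocks.
# Each inner list stands for one block stored on one node.
BLOCKS = [
    [("clinic_a", 1200), ("clinic_b", 800)],
    [("clinic_a", 300), ("clinic_c", 450)],
    [("clinic_b", 700), ("clinic_c", 50)],
]


def map_block(block):
    """Runs where the block lives; returns a small partial sum per provider."""
    partial = Counter()
    for provider, amount in block:
        partial[provider] += amount
    return partial


def reduce_partials(partials):
    """Merges the partial sums; only these small results move between nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds amounts key by key
    return dict(total)


def sweep(blocks):
    """One full pass over the data: one worker per block, then one merge."""
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        return reduce_partials(pool.map(map_block, blocks))


if __name__ == "__main__":
    print(sweep(BLOCKS))
```

Because every sweep reads each block in place, repeating the pass with a different aggregation costs another local scan rather than another bulk transfer of the raw data.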
So, the landscape changes from here to here..
Transition of data flow, data change, and processing content. According to our customers’ cases... Data warehouse servers. Mid-tier servers.
We have supported customers for over 3 years, and the oldest clusters are now about to be renewed and expanded. Last year we called these areas “Frontiers” and “Establishment”. After some more reasoning, we now call them “Involvement” and “Expansion”. Here is the story.
These groups differ not only in processing domain but also in life cycle. We haven’t seen huge cases yet. Let’s talk about our experience. Parallel processing based on “data locality” is beneficial for large amounts of data and for repetitive sweeps over the data. Example: receipt processing in healthcare / insurance.
Many clusters, or one big cluster? A Hadoop cluster itself has good scalability and expandability.
Do we have flexible / useful scalability? According to our customers’ cases... Data warehouse servers. Mid-tier servers.
Parallel processing based on “data locality” is beneficial for large amounts of data and for repetitive sweeps over the data. Example: receipt processing in healthcare / insurance. PostgreSQL is more polu
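Advantages: Fast. Low load on the DB server (WAL and shared buffers can be bypassed). Data can be loaded into the RDB while skipping records that cause errors. The records that caused errors can be identified from the log. Even if an error occurs, no garbage is left in the export destination table.
Disadvantages: Hard to use without DB administrator privileges. Each Map task creates and drops a temporary table. The pg_bulkload log is written on the DB server side. pg_bulkload must be installed on all slave nodes.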
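The per-task temporary table, the skipping of bad records, and the clean destination table mentioned in these notes can be illustrated with a small staging-table sketch. It is an assumption-laden stand-in, not pg_bulkload: Python's built-in `sqlite3` replaces PostgreSQL, and the `export_split` function, the `staging_<task_id>` naming, and the `sales` table are hypothetical names chosen for the example.

```python
import sqlite3


def export_split(conn, task_id, rows, target="sales"):
    """Load one map task's rows through a per-task staging table.

    Returns the rejected rows with their error messages (this task's error log).
    sqlite3 is a stand-in here; pg_bulkload itself is not called.
    """
    staging = f"staging_{task_id}"  # hypothetical naming scheme
    rejected = []
    cur = conn.cursor()
    cur.execute(
        f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    try:
        for row in rows:
            try:
                cur.execute(f"INSERT INTO {staging} (id, amount) VALUES (?, ?)", row)
            except sqlite3.Error as exc:
                rejected.append((row, str(exc)))  # skip the bad record, keep going
        # Publish to the destination only after the whole split is staged.
        cur.execute(
            f"INSERT INTO {target} (id, amount) SELECT id, amount FROM {staging}"
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # destination table is left untouched
        raise
    finally:
        cur.execute(f"DROP TABLE IF EXISTS {staging}")  # each task cleans up
        conn.commit()
    return rejected


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
    bad = export_split(conn, 0, [(1, 100), (2, None), (1, 999), (3, 300)])
    print(bad)
    print(conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall())
```

Creating and dropping a table per task is also why the notes list administrator privileges as a drawback: every map task needs the right to run DDL on the database.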