Data analysis has evolved significantly from its origins in statistics. The evolution included developments in database management technology, business intelligence and analytic platforms, and statistical processing technologies. The key pillars of modern data analysis systems are big data, open source software, and cloud computing. These have transformed data analysis by enabling massive and fast distributed processing of both structured and unstructured data at large scale and low costs. Emerging trends in data analysis include the rise of citizen data scientists, algorithm marketplaces, and smarter data discovery tools that make analytics more accessible.
3. Page 2
A 10-Second Story
Data analysis is rooted in statistics, which has a long history.
Statistics is said to have begun in ancient Egypt, which took a periodic census for building the pyramids.
Throughout history, statistics has played an important role for governments all across the world in the creation of censuses, which were used for various governmental planning activities (including, of course, taxation).
4. Page 3
Contents
I. Evolution of Data Analysis
II. Data Analysis System – 3 Pillars
III. Big Data, Open Source, Cloud Computing, Data Analysis
IV. New Era – Data Analysis, Chaos
V. Data Consumer’s Needs
VI. New Trend – Citizen Data Scientist / Smart Data Discovery
5. Page 4
Understanding the Evolution of Data Analysis Systems
The evolution of data analysis systems, viewed from three perspectives
Three perspectives: Database Management Technology; Development of Business Intelligence & Analytic Platform; Technologies and Packages for Statistical Processing

1960s: Routinization
• Database management: Flat file based. Tape-based storage, batch reporting.
• BI & analytic platform: Query modules & report generators. Batch querying & reporting, report generators.
• Statistical processing: Niche statistical subroutines. Social science, clinical trials, agriculture.
• Stage labels: Querying & Reporting; Statistical Computation

1970s: Modularization
• Database management: Navigational DBMS. RDBMS emerged in the late 1970s.
• BI & analytic platform: Early DSS tools. Commercial tools for building DSS.
• Statistical processing: Statistical software. Pharma & social science; SPSS/SAS incorporated.
• Stage labels: Decision Support & Modeling; 1st Gen Statistical Processing

1980s: Abstraction
• Database management: Relational DBMS. RDBMS solutions matured; personal databases for PC.
• BI & analytic platform: DSS & 4GL environments. 4GL, EIS, spreadsheet, descriptive analytics.
• Statistical processing: PC-based statistical packages. Other industries; PC-based, graphics, expert systems.
• Stage labels: Analytical Processing; 2nd Gen Statistical Processing

1990s: Scaling & Distribution
• Database management: Distributed DBMS. Distributed architecture (clustering).
• BI & analytic platform: Data warehouse & BI. BI tool market grew rapidly; web-based analytics.
• Statistical processing: Early data mining tools. Vendors & solutions.
• Stage labels: Enterprise Performance Management; Data Mining

2000s: Specialization & Extension
• Database management: Post-relational DBMS. Unstructured data, non-relational data model, large-scale distributed data.
• BI & analytic platform: Data processing & analytic platform. Large-scale data processing; unstructured, real-time analytics; big data analytics.
• Statistical processing: Data processing & analytics platforms. Open source R-based statistical platforms; NLP, text analysis.
• Stage labels: Next Gen Data Processing; Next Gen Data Processing

Also marked on the timeline: AI hyped, ML started, new ML invented.
* Compiled from Max Kanaskar’s “BIG DATA TECHNOLOGY SERIES”
6. Page 5
Understanding the Evolution of Data Analysis Systems
The evolution of data analysis systems, viewed from three perspectives
Technologies and Packages for Statistical Processing
7. Page 6
Understanding the Evolution of Data Analysis Systems
The evolution of data analysis systems, viewed from three perspectives
Database Management Technology
8. Page 7
Understanding the Evolution of Data Analysis Systems
The evolution of data analysis systems, viewed from three perspectives
Development of Business Intelligence & Analytic Platform
9. Page 8
Data Analysis Systems: The 3 Pillars
1. The emergence of big data: recent definitions characterized by the 5 Vs
VOLUME: Data Object Size (small to big); Data Object Quantity (few to many)
VARIETY: Data Sources (few to many); Contents Types (few to many); Structure Types (structured to unstructured); Semantic Diversity (low to high)
VELOCITY: Acquisition Rate (slow to fast); Update Rate (slow to fast)
VERACITY: Known Data Sources; Provenance; Data Integrity; Governance
VALUE: Objectives (Findings, Recommend, Decisions, Predictive, Prescriptive)
* NIST, 2014
too big (volume), arrives too fast (velocity), changes too fast (variability), contains too much noise (veracity), too diverse (variety) to be processed within a local computing structure using traditional approaches and techniques
* ISO, 2014
10. Page 9
Data Analysis Systems: The 3 Pillars
Big data is no longer an emerging technology
Big data was dropped from Gartner's Hype Cycle of August 2015
Machine Learning and Citizen Data Science* appeared for the first time
(new trends related to data analysis replaced big data)
* people on the business side that may have some data skills, possibly from a math or even social science degree
Figure callout: Big Data was at this position in 2014
11. Page 10
Data Analysis Systems: The 3 Pillars
2. The shift to cloud environments
Asking previously un-askable questions is the emerging power of the cloud.
Cloud computing is a transformative force addressing size, speed, and scale, with a low cost of entry and very high potential benefits.
Example workloads: large-scale image processing, sensor data correlation, social network analysis, encryption/decryption, data mining, simulations, and pattern recognition
* Source: Booz Allen Hamilton
12. Page 11
Data Analysis Systems: The 3 Pillars
Massive Data Analytics and the Cloud
Figure labels: HDFS, commercial hardware, resilience, elasticity, scalability, multi-tenancy, virtualization
Data Cloud
• Computing architecture for large-scale data processing and analytics
• Designed to operate at trillions of operations/day, petabytes of storage
• Designed for performance, scale, and data processing
• Characterized by run-time data models and simplified development models
Utility Cloud
• Computing services for outsourced IT operations
• Concurrent, independent, multi-tenant user population
• Service offerings such as SaaS, PaaS, and IaaS
• Characterized by data segmentation, hosted applications, low cost of ownership, and elasticity
* Source: Booz Allen Hamilton
13. Page 12
Data Analysis Systems: The 3 Pillars
Various cloud-based “as a Service” models
Data Analytics as a Service
Database as a Service
Storage as a Service
Backup as a Service
…
Insights as a Service
14. Page 13
Data Analysis Systems: The 3 Pillars
Open source software has brought revolutionary disruption to the data analysis market
Traditional (proprietary software)
15. Page 14
Data Analysis Systems: The 3 Pillars
Open source software has brought revolutionary disruption to the data analysis market
Types of open source software for data analysis and big data
Category: Open source software (50)
• Big Data Analysis Platforms and Tools: Hadoop, MapReduce, GridGain, HPCC, Storm
• Databases/Data Warehouses: CouchDB, OrientDB, Terrastore, FlockDB, Hibari, Riak, Hypertable, BigData, Hive, InfoBright Community Edition, Infinispan, Redis, Cassandra, HBase, MongoDB, Neo4j
• Business Intelligence: Talend, Jaspersoft, Palo BI Suite/Jedox, Pentaho, SpagoBI, KNIME, BIRT/Actuate
• Data Mining: RapidMiner/RapidAnalytics, Mahout, Orange, Weka, jHepWork, KEEL, SPMF, Rattle, Gluster, Hadoop Distributed File System
• Programming Languages: Pig/Pig Latin, R, ECL
• Big Data Search: Lucene, Solr
• Data Aggregation and Transfer: Sqoop, Flume, Chukwa
• Miscellaneous Big Data Tools: Terracotta, Avro, Oozie, Zookeeper
The Apache Foundation has about 230 projects as of October.
16. Page 15
Big Data, Open Source, Cloud Computing, Data analysis
The combination of ‘data analysis’ and 'big data-open source-cloud computing' opens up a new
universe of opportunities at many levels and in many places.
Traditional Data Analysis vs. Data Analysis New Era
• Big data processing: slow processing vs. massive, fast, distributed processing
• Computing power: scale up, on premise vs. scale out, off premise (cloud)
• S/W: proprietary s/w vs. open source s/w
• Data: structured data vs. structured & unstructured data, graph data
• Analysis: statistical analysis vs. ML, data mining, network analysis, text mining, etc.
• Value: limited value & insight vs. quick discovery of knowledge and value
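To make "massive/fast/distributed processing" and "scale out" a little more concrete, here is a minimal sketch of the map, shuffle, and reduce steps popularized by MapReduce, written in plain Python. It is an illustration only, not how Hadoop is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example, and threads merely stand in for the separate machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions):
    """Group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions, workers=2):
    # Hypothetical driver: each partition is mapped independently, which is
    # what lets the work be spread over more workers (scale out) instead of
    # relying on one bigger machine (scale up).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, partitions))
    return reduce_phase(shuffle(mapped))


# Made-up sample input, split into two partitions.
partitions = [
    ["big data open source", "cloud computing"],
    ["open source software", "big data analysis"],
]
counts = word_count(partitions)
print(counts)
```

Because each map task only sees its own partition, adding partitions and workers does not change the logic, which is the property the "scale out" column relies on.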
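17. Page 16
New Era – Data Analysis, Chaos
A wide range of companies are entering the field: SaaS specialists, traditional data analysis vendors, BI vendors, and others
Findings from a survey of 10 SaaS vendors:
• Specialist vendors focus on simple analysis and visualization
• Emphasis on lightweight data analysis rather than large-volume data analysis
• Delivered as SaaS
• Apart from two or three vendors, they do not apply a wide range of analysis techniques
• User-centered UI/UX
• Microsoft, IBM, and Amazon are singled out, and their three services are compared separately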
18. Page 17
New Era – Data Analysis, Chaos
Cloud machine learning is intensifying new competition in the big data analysis market
- Comparison of IBM Watson Analytics, Microsoft Azure ML, and Amazon ML
IBM Watson Analytics
Algorithms:
• Decision Tree
• Classification
• Correlation
Features:
• couldn't handle enterprise-scale data
• focused more on data visualization and exploration
• uses natural language (plain English questions)
• automates some tasks
Microsoft Azure ML
Algorithms:
• Anomaly Detection (2), Classification (14), Clustering (1), Regression (8), Feature selection (3), Evaluate (3), Score (4), Train (4), Statistical function (7), Text Analytics (4)
Features:
• user-friendly, GUI
• requires knowledge of the characteristics of machine learning algorithms
• targeted to developers, data scientists and very advanced business users
Amazon ML
Algorithms:
• Binary classification (predicting one of two possible outcomes)
• Multiclass classification (predicting one of more than two outcomes)
• Regression (predicting a numeric value)
Features:
• narrower in scope
• data acquisition is effortless
• no infrastructure management required
• does not require data science expertise
Key characteristics
• Effort goes into providing an easy user environment (GUI, no data scientist needed)
• Still inadequate for big data processing
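The three task types named in the comparison (binary classification, multiclass classification, regression) differ mainly in the kind of target being predicted. The sketch below shows one simple way to tell them apart from a column of target values. It is a hypothetical rule of thumb written for this illustration, not the logic used by Amazon ML or any other service; the function name `infer_task_type` and its rules are assumptions.

```python
from numbers import Real


def infer_task_type(targets):
    """Guess the ML task type from target values (illustrative rule only).

    Assumed rules: numeric, non-boolean targets mean regression; otherwise
    two distinct labels mean binary classification and more than two mean
    multiclass classification.
    """
    values = list(targets)
    if not values:
        raise ValueError("targets must not be empty")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return "regression"  # predicting a numeric value

    distinct = set(values)
    if len(distinct) == 2:
        return "binary"  # predicting one of two possible outcomes
    if len(distinct) > 2:
        return "multiclass"  # predicting one of more than two outcomes
    raise ValueError("need at least two distinct labels to classify")


print(infer_task_type(["spam", "ham", "spam"]))
print(infer_task_type(["red", "green", "blue"]))
print(infer_task_type([1.5, 2.0, 3.7]))
```

A real service would need more care, for example class labels stored as integers would be misread as regression targets under this rule.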
19. Page 18
Data Consumer’s Needs
Demand for a data analysis system that can scale at an economical cost, is easy to access anytime and anywhere, and handles diverse, massive data so that users can discover insights and act on them
Data consumer groups: C-level, LoB user, data scientist, data engineer
Data Analysis Use Case Framework (customers, product, organization): 360-degree customer view; understand the market; find new market; personalized website/offering; improve service; co-create & innovate; reduce risk/fraud; better organize company; understand competition
Requirements: accessibility, easy to use, elastic sharing, security, scalability, cost effective
C-level: CEO, COO, CIO, CTO, CMO…
LoB: Line of Business
20. Page 19
New Trend
The Rise of the Citizen Data Scientist
Gartner defines a "citizen data scientist"
Citizen Data Scientist:
• Line of business
• Not a trained data scientist or developer
• Focused on business problems
• Driven to pull together the right data, now
• Iterative workflow: one question leads to the next
• Creates or generates models
• Not typically a member of an analytics
Alexander Linden, Research Director at Gartner, predicts that through 2017, the number of “Citizen Data Scientists,” i.e. analytical amateurs¹, will grow five times faster than the number of highly skilled Data Scientists.
¹ At the end of his 2007 classic, Competing on Analytics, Tom Davenport predicted the rise of “analytical amateurs.”
21. Page 20
New Trend
• Analytical Professionals (5-10%): can create algorithms
• Analytical Semi-Professionals (15-20%): can use visual tools, create simple models
• Analytical Amateurs (70-80%): can use spreadsheets
Source: Competing on Analytics, Tom Davenport
22. Page 21
New Trend
Algorithm Marketplaces Are Bringing the App Economy
to Analytics
Source: Gartner (October 2015)
23. Page 22
New Trend
Easier-to-use analytics tools: Smart data discovery
“Smart data discovery is a next-generation data discovery capability that provides insights
from advanced analytics to business users or citizen data scientists without requiring them to
have traditional data scientist expertise.”
Source: Gartner (June 2015)
24. Page 23
New Trend
Easier-to-use analytics tools: Smart data discovery
Figure: Current Data Discovery Analytics Workflow vs. Emerging Smart Data Discovery Analytics Workflow
Source: Gartner (June 2015)
25. Page 24
New Trend
Figure labels: Business User, Citizen Data Scientist, Algorithms, Smart Data Discovery, DAaaS functional elements
“ ~ make new sources of information accessible, consumable and meaningful to organizations of all sizes, even ones that don't have extensive advanced analytics skills or in-house resources.”
This material will continue to be updated every month.