SlideShare une entreprise Scribd logo
1  sur  57
Télécharger pour lire hors ligne
Big-data Analytics: Challenges and 
Opportunities 
Chih-Jen Lin 
Department of Computer Science 
National Taiwan University 
Talk at ðcÇ™Ñx}t, August 30, 2014 
Chih-Jen Lin (National Taiwan Univ.) 1 / 54
Everybody talks about big data now, but it's not easy to 
have an overall picture of this subject 
In this talk, I will give some personal thoughts on 
technical developments of big-data analytics. Some are 
very pre-mature, so your comments are very welcome 
Chih-Jen Lin (National Taiwan Univ.) 2 / 54
Outline 
1 From data mining to big data 
2 Challenges 
3 Opportunities 
4 Discussion and conclusions 
Chih-Jen Lin (National Taiwan Univ.) 3 / 54
From data mining to big data 
Outline 
1 From data mining to big data 
2 Challenges 
3 Opportunities 
4 Discussion and conclusions 
Chih-Jen Lin (National Taiwan Univ.) 4 / 54
From data mining to big data 
From Data Mining to Big Data 
In early 90's, a buzzword called data mining 
appeared 
Many years after, we have another one called big 
data 
Well, what's the dierence? 
Chih-Jen Lin (National Taiwan Univ.) 5 / 54
From data mining to big data 
Status of Data Mining and Machine 
Learning 
Over the years, we have all kinds of eective 
methods for classi
cation, clustering, and regression 
We also have good integrated tools for data mining 
(e.g., Weka, R, Scikit-learn) 
However, mining useful information remains dicult 
for some real-world applications 
Chih-Jen Lin (National Taiwan Univ.) 6 / 54
From data mining to big data 
What's Big Data? 
 Though many de
nitions are 
available, I am considering 
the situation that data are 
larger than the capacity of a 
computer 
 I think this is a main 
dierence between data 
mining and big data 
 So in a sense we are talking 
about distributed data 
mining or machine learning 
(a), (b): distributed 
systems 
Image from Wikimedia 
Chih-Jen Lin (National Taiwan Univ.) 7 / 54
From data mining to big data 
From Small to Big Data 
Two important dierences: 
Negative side: 
Methods for big data analytics are not quite ready, 
not even mentioned to integrated tools 
Positive side: 
Some (Halevy et al., 2009) argue that the almost 
unlimited data make us easier to mine information 
I will discuss the
rst dierence 
Chih-Jen Lin (National Taiwan Univ.) 8 / 54
Challenges 
Outline 
1 From data mining to big data 
2 Challenges 
3 Opportunities 
4 Discussion and conclusions 
Chih-Jen Lin (National Taiwan Univ.) 9 / 54
Challenges 
Possible Advantages of Distributed Data 
Analytics 
Parallel data loading 
Reading several TB data from disk is slow 
Using 100 machines, each has 1/100 data in its 
local disk ) 1/100 loading time 
But having data ready in these 100 machines is 
another issue 
Fault tolerance 
Some data replicated across machines: if one fails, 
others are still available 
Chih-Jen Lin (National Taiwan Univ.) 10 / 54
Challenges 
Possible Advantages of Distributed Data 
Analytics (Cont'd) 
Work
ow not interrupted 
If data are already distributedly stored, it's not 
convenient to reduce some to one machine for 
analysis 
Chih-Jen Lin (National Taiwan Univ.) 11 / 54
Challenges 
Possible Disadvantages of Distributed 
Data Analytics 
More complicated (of course) 
Communication and synchronization 
Everybody says moving computation to data, but 
this isn't that easy 
Chih-Jen Lin (National Taiwan Univ.) 12 / 54
Challenges 
Going Distributed or Not Isn't Easy to 
Decide 
Quote from Yann LeCun (KDnuggets News 14:n05) 
I have seen people insisting on using Hadoop for 
datasets that could easily
t on a 
ash drive and 
could easily be processed on a laptop. 
Now disk and RAM are large. You may load several 
TB of data once and conveniently conduct all 
analysis 
The decision is application dependent 
We will discuss this issue again later 
Chih-Jen Lin (National Taiwan Univ.) 13 / 54
Challenges 
Distributed Environments 
Many easy tasks on one computer become dicult 
in a distributed environment 
For example, subsampling is easy on one machine, 
but may not be in a distributed system 
Usually we attribute the problem to slow 
communication between machines 
Chih-Jen Lin (National Taiwan Univ.) 14 / 54
Challenges 
Challenges 
Big data, small analysis 
versus 
Big data, big analysis 
If you need a single record from a huge set, it's 
reasonably easy 
For example, accessing your high-speed rail 
reservation is fast 
However, if you want to analyze the whole set by 
accessing data several time, it can be much harder 
Chih-Jen Lin (National Taiwan Univ.) 15 / 54
Challenges 
Challenges (Cont'd) 
Most existing data mining/machine learning 
methods were designed without considering data 
access and communication of intermediate results 
They iteratively use data by assuming they are 
readily available 
Example: doing least-square regression isn't easy in 
a distributed environment 
Chih-Jen Lin (National Taiwan Univ.) 16 / 54
Challenges 
Challenges (Cont'd) 
So we are facing many challenges 
methods not ready 
no convenient tools 
rapid change on the system side 
and many others 
What should we do? 
Chih-Jen Lin (National Taiwan Univ.) 17 / 54
Opportunities 
Outline 
1 From data mining to big data 
2 Challenges 
3 Opportunities 
4 Discussion and conclusions 
Chih-Jen Lin (National Taiwan Univ.) 18 / 54
Opportunities 
Opportunities 
Looks like we are in the early stage of a research 
topic 
But what is our chance? 
Chih-Jen Lin (National Taiwan Univ.) 19 / 54
Opportunities Lessons from past developments in one machine 
Outline 
3 Opportunities 
Lessons from past developments in one machine 
Successful examples? 
Design of big-data algorithms 
Chih-Jen Lin (National Taiwan Univ.) 20 / 54
Opportunities Lessons from past developments in one machine 
Algorithms for Distributed Data Analytics 
This is an on-going research topic. 
Roughly there are two types of approaches 
1 Parallelize existing (single-machine) algorithms 
2 Design new algorithms particularly for distributed 
settings 
Of course there are things in between 
Chih-Jen Lin (National Taiwan Univ.) 21 / 54
Opportunities Lessons from past developments in one machine 
Algorithms for Distributed Data Analytics 
(Cont'd) 
Given the complicated distributed setting, we 
wonder if easy-to-use big-data analytics tools can 
ever be available? 
I don't know either. Let's try to think about the 
situation on one computer
rst 
Indeed those easy-to-use analytics tools on one 
computer were not there at the
rst day 
Chih-Jen Lin (National Taiwan Univ.) 22 / 54
Opportunities Lessons from past developments in one machine 
Past Development on One Computer 
The problem now is we take many things for 
granted on one computer 
On one computer, have you ever worried about 
calculating the average of some numbers? 
Probably not. You can use Excel, statistical 
software (e.g., R and SAS), and many things else 
We seldom care internally how these tools work 
Can we go back to see the early development on 
one computer and learn some lessons/experiences? 
Chih-Jen Lin (National Taiwan Univ.) 23 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product 
Consider the example of matrix-matrix products 
C = A  B; A 2 Rnd ; B 2 Rdm 
where 
Cij = 
Xd 
k=1 
AikBkj 
This is a simple operation. You can easily write your 
own code 
Chih-Jen Lin (National Taiwan Univ.) 24 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
A segment of C code (assume n = m here) 
for (i=0;in;i++) 
for (j=0;jn;j++) 
{ 
c[i][j]=0; 
for (k=0;kn;k++) 
c[i][j] += a[i][k]*b[k][j]; 
} 
For 3; 000  3; 000 matrices 
$ gcc -O3 mat.c 
$ time ./a.out 
3m24.843s 
Chih-Jen Lin (National Taiwan Univ.) 25 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
But on Matlab (single-thread mode) 
$ matlab -singleCompThread 
 tic; c = a*b; toc 
Elapsed time is 4.095059 seconds. 
Chih-Jen Lin (National Taiwan Univ.) 26 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
How can Matlab be much faster than ours? 
The fast implementation comes from some deep 
research and development 
Matlab calls optimized BLAS (Basic Linear Algebra 
Subroutines) that was developed in 80's-90's 
Our implementation is slow because data are not 
available for computation 
Chih-Jen Lin (National Taiwan Univ.) 27 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
CPU 
# 
Registers 
# 
Cache 
# 
Main Memory 
# 
Secondary storage (Disk) 
: increasing in speed 
#: increasing in 
capacity 
Optimized BLAS: try 
to make data available 
in a higher level of 
memory 
You don't waste time 
to frequently move 
data 
Chih-Jen Lin (National Taiwan Univ.) 28 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
Optimized BLAS uses block algorithms 
A  B = 
2 
4 
A11    A14 
... 
A41    A44 
3 
5 
2 
4 
B11    B14 
... 
B41    B44 
3 
5 
= 
 
A11B11 +    + A14B41    
... 
. . . 
 
If we compare the number of page faults (cache 
misses) 
Ours: much larger 
Block: much smaller 
Chih-Jen Lin (National Taiwan Univ.) 29 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
I like this example because it involves both 
mathematical operations (matrix products), 
and 
computer architecture (memory hierarchy) 
Only if knowing both, you can make breakthroughs 
Chih-Jen Lin (National Taiwan Univ.) 30 / 54
Opportunities Lessons from past developments in one machine 
Example: Matrix-matrix Product (Cont'd) 
For big-data analytics, we are in a similar situation 
We want to run mathematical algorithms 
(classi
cation and clustering) in a complicated 
architecture (distributed system) 
But we are like at the time point before optimized 
BLAS was developed 
Chih-Jen Lin (National Taiwan Univ.) 31 / 54
Opportunities Lessons from past developments in one machine 
Algorithms and Systems 
To have technical breakthroughs for big-data 
analytics, we should know both algorithms and 
systems well, and consider them together 
Indeed, if you are an expert on both topics, 
everybody wants you now 
Many machine learning Ph.D. students don't know 
much about systems. But this isn't the case in the 
early days of computer science 
Chih-Jen Lin (National Taiwan Univ.) 32 / 54
Opportunities Lessons from past developments in one machine 
Algorithms and Systems (Cont'd) 
At that time, every numerical analyst knows 
computer architecture well. 
That's how they successfully developed 

oating-point systems and IEEE 754/854 standard 
Chih-Jen Lin (National Taiwan Univ.) 33 / 54
Opportunities Lessons from past developments in one machine 
Example: Machine Learning Using Spark 
Recently we developed a classi
er on Spark 
Spark is an in-memory cluster-computing platform 
Beyond algorithms we must take details of 
Spark 
Scala 
into account 
For example, you want to know 
the dierence between mapPartitions and 
map in Spark, and 
the slower for loop than while loop in Scala 
Chih-Jen Lin (National Taiwan Univ.) 34 / 54
Opportunities Lessons from past developments in one machine 
Example: Machine Learning Using Spark 
(Cont'd) 
During our development, Spark was signi
cantly 
upgraded from version 0.9 to 1.0. We must learn 
their changes 
It's like when you write a code on a computer, but 
the compiler or OS is actively changed. We are in a 
stage just like that. 
Chih-Jen Lin (National Taiwan Univ.) 35 / 54
Opportunities Successful examples? 
Outline 
3 Opportunities 
Lessons from past developments in one machine 
Successful examples? 
Design of big-data algorithms 
Chih-Jen Lin (National Taiwan Univ.) 36 / 54
Opportunities Successful examples? 
Example of Distributed Machine Learning 
I don't think we have many successful examples yet 
Here I will show one: CTR (Click Through Rate) 
prediction for computational advertising 
Many companies now run distributed classi
cation 
for CTR problems 
Chih-Jen Lin (National Taiwan Univ.) 37 / 54
Opportunities Successful examples? 
Example: CTR Prediction 
De
nition of CTR: 
CTR = 
# clicks 
# impressions 
. 
A sequence of events 
Not clicked Features of user 
Clicked Features of user 
Not clicked Features of user 
      
A binary classi
cation problem. 
Chih-Jen Lin (National Taiwan Univ.) 38 / 54
Opportunities Successful examples? 
Example: CTR Prediction (Cont'd) 
Chih-Jen Lin (National Taiwan Univ.) 39 / 54
Opportunities Design of big-data algorithms 
Outline 
3 Opportunities 
Lessons from past developments in one machine 
Successful examples? 
Design of big-data algorithms 
Chih-Jen Lin (National Taiwan Univ.) 40 / 54
Opportunities Design of big-data algorithms 
Design Considerations 
Generally you want to minimize the data access and 
communication in a distributed environment 
It's possible that 
method A better than B on one computer 
but 
method A worse than B in distributed environments 
Chih-Jen Lin (National Taiwan Univ.) 41 / 54
Opportunities Design of big-data algorithms 
Design Considerations (Cont'd) 
Example: on one computer, often we do batch 
rather than online learning 
Online and streaming learning may be more useful 
for big-data applications 
Example: very often we design synchronous parallel 
algorithms 
Maybe asynchronous ones are better for big data? 
Chih-Jen Lin (National Taiwan Univ.) 42 / 54
Opportunities Design of big-data algorithms 
Work
ow Issues 
Data analytics is often only part of the work
ow of 
a big-data application 
By work
ow, I mean things from raw data to
nal 
use of the results 
Other steps may be more complicated than the 
analytics step 
In one-computer situation, the focus is often on the 
analytics step 
Chih-Jen Lin (National Taiwan Univ.) 43 / 54
Opportunities Design of big-data algorithms 
How to Get Started? 
In my opinion, we should start from applications 
Applications ! programming frameworks and 
algorithms ! general tools 
Now almost every big-data application requires 
special settings of algorithms, but I believe general 
tools will be possible 
Chih-Jen Lin (National Taiwan Univ.) 44 / 54

Contenu connexe

Tendances

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...SlideTeam
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
HML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningYan Xu
 
Machine Learning Approaches for Crime Pattern Detection
Machine Learning Approaches for Crime Pattern DetectionMachine Learning Approaches for Crime Pattern Detection
Machine Learning Approaches for Crime Pattern DetectionAPNIC
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringHadi Fadlallah
 
[Data Lake + Arquitetura Lambda] na prática
 [Data Lake + Arquitetura Lambda] na prática [Data Lake + Arquitetura Lambda] na prática
[Data Lake + Arquitetura Lambda] na práticaFelipe Santos
 
Machine learning vs deep learning
Machine learning vs deep learningMachine learning vs deep learning
Machine learning vs deep learningUSM Systems
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Science Use cases in Banking
Data Science Use cases in BankingData Science Use cases in Banking
Data Science Use cases in BankingArul Bharathi
 
Anomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleAnomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleImpetus Technologies
 
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackMachine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackAlistair Gillespie
 
NOC-Back Office-Front Office Design-eTOM
NOC-Back Office-Front Office Design-eTOMNOC-Back Office-Front Office Design-eTOM
NOC-Back Office-Front Office Design-eTOMphf_sobhan
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 

Tendances (20)

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
 
Three Big Data Case Studies
Three Big Data Case StudiesThree Big Data Case Studies
Three Big Data Case Studies
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
HML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep Learning
 
Machine Learning Approaches for Crime Pattern Detection
Machine Learning Approaches for Crime Pattern DetectionMachine Learning Approaches for Crime Pattern Detection
Machine Learning Approaches for Crime Pattern Detection
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
[Data Lake + Arquitetura Lambda] na prática
 [Data Lake + Arquitetura Lambda] na prática [Data Lake + Arquitetura Lambda] na prática
[Data Lake + Arquitetura Lambda] na prática
 
Machine learning vs deep learning
Machine learning vs deep learningMachine learning vs deep learning
Machine learning vs deep learning
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Data Science Use cases in Banking
Data Science Use cases in BankingData Science Use cases in Banking
Data Science Use cases in Banking
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Anomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleAnomaly detection with machine learning at scale
Anomaly detection with machine learning at scale
 
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackMachine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
 
NOC-Back Office-Front Office Design-eTOM
NOC-Back Office-Front Office Design-eTOMNOC-Back Office-Front Office Design-eTOM
NOC-Back Office-Front Office Design-eTOM
 
Data mining
Data miningData mining
Data mining
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Computer forensic
Computer forensicComputer forensic
Computer forensic
 

En vedette

機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體鍾誠 陳鍾誠
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用Person Lin
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science OrientationRyan Chung
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studioR Ladies Taipei
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介Person Lin
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室台灣資料科學年會
 
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務台灣資料科學年會
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析台灣資料科學年會
 
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學台灣資料科學年會
 
初學R語言的60分鐘
初學R語言的60分鐘初學R語言的60分鐘
初學R語言的60分鐘Chen-Pan Liao
 
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 HadooperFred Chiang
 
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會台灣資料科學年會
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies台灣資料科學年會
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程台灣資料科學年會
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務台灣資料科學年會
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨台灣資料科學年會
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 

En vedette (20)

機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體機率統計 -- 使用 R 軟體
機率統計 -- 使用 R 軟體
 
R統計軟體 -安裝與使用
R統計軟體 -安裝與使用R統計軟體 -安裝與使用
R統計軟體 -安裝與使用
 
資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation資料科學的第一堂課 Data Science Orientation
資料科學的第一堂課 Data Science Orientation
 
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
不會寫程式的人友善上手機器學習-淺談 Azure machine learning studio
 
新手村-資料探索
新手村-資料探索新手村-資料探索
新手村-資料探索
 
R統計軟體簡介
R統計軟體簡介R統計軟體簡介
R統計軟體簡介
 
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
吳齊軒/漫談 R 的學習挑戰與 R 語言翻轉教室
 
第一場預測
第一場預測第一場預測
第一場預測
 
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
[DSC 2016] 系列活動:許懷中 / R 語言資料探勘實務
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
 
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學
曾韵/沒有大數據怎麼辦 ? 會計師事務所的小數據科學
 
初學R語言的60分鐘
初學R語言的60分鐘初學R語言的60分鐘
初學R語言的60分鐘
 
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper
那些你知道的,但還沒看過的 Big Data 風景 ─ 致 Hadooper
 
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
 
李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist李育杰/The Growth of a Data Scientist
李育杰/The Growth of a Data Scientist
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務
 
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
手把手教你 R 語言資料分析實務/張毓倫&陳柏亨
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 

Similaire à Big-data analytics: challenges and opportunities

Twdatasci cjlin-big data analytics - challenges and opportunities
Twdatasci cjlin-big data analytics - challenges and opportunitiesTwdatasci cjlin-big data analytics - challenges and opportunities
Twdatasci cjlin-big data analytics - challenges and opportunitiesAravindharamanan S
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxjuliennehar
 
chalenges and apportunity of deep learning for big data analysis f
 chalenges and apportunity of deep learning for big data analysis f chalenges and apportunity of deep learning for big data analysis f
chalenges and apportunity of deep learning for big data analysis fmaru kindeneh
 
Some thoughts on large scale data classification
Some thoughts on large scale data classificationSome thoughts on large scale data classification
Some thoughts on large scale data classificationXiang Li
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami PresentationGreg Werner
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language ModelsDataScienceConferenc1
 
challenges of bigdata
challenges of bigdatachallenges of bigdata
challenges of bigdataRavi Vaniya
 
Demystifying Ml, DL and AI
Demystifying Ml, DL and AIDemystifying Ml, DL and AI
Demystifying Ml, DL and AIGreg Werner
 
Dialogue system②
Dialogue system②Dialogue system②
Dialogue system②Kent T
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataTech Triveni
 
Reasoning over big data
Reasoning over big dataReasoning over big data
Reasoning over big dataOSTHUS
 
Lecture_1_-_Course_Overview_(Inked).pdf
Lecture_1_-_Course_Overview_(Inked).pdfLecture_1_-_Course_Overview_(Inked).pdf
Lecture_1_-_Course_Overview_(Inked).pdfRTEFGDFGJU
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation sourcebutest
 

Similaire à Big-data analytics: challenges and opportunities (20)

Twdatasci cjlin-big data analytics - challenges and opportunities
Twdatasci cjlin-big data analytics - challenges and opportunitiesTwdatasci cjlin-big data analytics - challenges and opportunities
Twdatasci cjlin-big data analytics - challenges and opportunities
 
Vitriol
VitriolVitriol
Vitriol
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
 
Learning for Big Data-林軒田
Learning for Big Data-林軒田Learning for Big Data-林軒田
Learning for Big Data-林軒田
 
chalenges and apportunity of deep learning for big data analysis f
 chalenges and apportunity of deep learning for big data analysis f chalenges and apportunity of deep learning for big data analysis f
chalenges and apportunity of deep learning for big data analysis f
 
Some thoughts on large scale data classification
Some thoughts on large scale data classificationSome thoughts on large scale data classification
Some thoughts on large scale data classification
 
How to crack down big data?
How to crack down big data? How to crack down big data?
How to crack down big data?
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami Presentation
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
 
challenges of bigdata
challenges of bigdatachallenges of bigdata
challenges of bigdata
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Demystifying Ml, DL and AI
Demystifying Ml, DL and AIDemystifying Ml, DL and AI
Demystifying Ml, DL and AI
 
Dialogue system②
Dialogue system②Dialogue system②
Dialogue system②
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text Data
 
Reasoning over big data
Reasoning over big dataReasoning over big data
Reasoning over big data
 
Lecture_1_-_Course_Overview_(Inked).pdf
Lecture_1_-_Course_Overview_(Inked).pdfLecture_1_-_Course_Overview_(Inked).pdf
Lecture_1_-_Course_Overview_(Inked).pdf
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 

Plus de 台灣資料科學年會

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用台灣資料科學年會
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告台灣資料科學年會
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇台灣資料科學年會
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 台灣資料科學年會
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵台灣資料科學年會
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用台灣資料科學年會
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告台灣資料科學年會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人台灣資料科學年會
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維台灣資料科學年會
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察台灣資料科學年會
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰台灣資料科學年會
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳台灣資料科學年會
 

Plus de 台灣資料科學年會 (20)

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
 
台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
 

Dernier

1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 

Dernier (20)

1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 

Big-data analytics: challenges and opportunities

  • 1. Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at ðcÇ™Ñx}t, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.) 1 / 54
  • 2. Everybody talks about big data now, but it's not easy to have an overall picture of this subject In this talk, I will give some personal thoughts on technical developments of big-data analytics. Some are very pre-mature, so your comments are very welcome Chih-Jen Lin (National Taiwan Univ.) 2 / 54
  • 3. Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 3 / 54
  • 4. From data mining to big data Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 4 / 54
  • 5. From data mining to big data From Data Mining to Big Data In early 90's, a buzzword called data mining appeared Many years after, we have another one called big data Well, what's the dierence? Chih-Jen Lin (National Taiwan Univ.) 5 / 54
  • 6. From data mining to big data Status of Data Mining and Machine Learning Over the years, we have all kinds of eective methods for classi
  • 7. cation, clustering, and regression We also have good integrated tools for data mining (e.g., Weka, R, Scikit-learn) However, mining useful information remains dicult for some real-world applications Chih-Jen Lin (National Taiwan Univ.) 6 / 54
  • 8. From data mining to big data What's Big Data? Though many de
  • 9. nitions are available, I am considering the situation that data are larger than the capacity of a computer I think this is a main dierence between data mining and big data So in a sense we are talking about distributed data mining or machine learning (a), (b): distributed systems Image from Wikimedia Chih-Jen Lin (National Taiwan Univ.) 7 / 54
  • 10. From data mining to big data From Small to Big Data Two important dierences: Negative side: Methods for big data analytics are not quite ready, not even mentioned to integrated tools Positive side: Some (Halevy et al., 2009) argue that the almost unlimited data make us easier to mine information I will discuss the
  • 11. rst dierence Chih-Jen Lin (National Taiwan Univ.) 8 / 54
  • 12. Challenges Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 9 / 54
  • 13. Challenges Possible Advantages of Distributed Data Analytics Parallel data loading Reading several TB data from disk is slow Using 100 machines, each has 1/100 data in its local disk ) 1/100 loading time But having data ready in these 100 machines is another issue Fault tolerance Some data replicated across machines: if one fails, others are still available Chih-Jen Lin (National Taiwan Univ.) 10 / 54
  • 14. Challenges Possible Advantages of Distributed Data Analytics (Cont'd) Work ow not interrupted If data are already distributedly stored, it's not convenient to reduce some to one machine for analysis Chih-Jen Lin (National Taiwan Univ.) 11 / 54
  • 15. Challenges Possible Disadvantages of Distributed Data Analytics More complicated (of course) Communication and synchronization Everybody says moving computation to data, but this isn't that easy Chih-Jen Lin (National Taiwan Univ.) 12 / 54
  • 16. Challenges Going Distributed or Not Isn't Easy to Decide Quote from Yann LeCun (KDnuggets News 14:n05) I have seen people insisting on using Hadoop for datasets that could easily
  • 17. t on a ash drive and could easily be processed on a laptop. Now disk and RAM are large. You may load several TB of data once and conveniently conduct all analysis The decision is application dependent We will discuss this issue again later Chih-Jen Lin (National Taiwan Univ.) 13 / 54
  • 18. Challenges Distributed Environments Many easy tasks on one computer become dicult in a distributed environment For example, subsampling is easy on one machine, but may not be in a distributed system Usually we attribute the problem to slow communication between machines Chih-Jen Lin (National Taiwan Univ.) 14 / 54
  • 19. Challenges Challenges Big data, small analysis versus Big data, big analysis If you need a single record from a huge set, it's reasonably easy For example, accessing your high-speed rail reservation is fast However, if you want to analyze the whole set by accessing data several time, it can be much harder Chih-Jen Lin (National Taiwan Univ.) 15 / 54
  • 20. Challenges Challenges (Cont'd) Most existing data mining/machine learning methods were designed without considering data access and communication of intermediate results They iteratively use data by assuming they are readily available Example: doing least-square regression isn't easy in a distributed environment Chih-Jen Lin (National Taiwan Univ.) 16 / 54
  • 21. Challenges Challenges (Cont'd) So we are facing many challenges methods not ready no convenient tools rapid change on the system side and many others What should we do? Chih-Jen Lin (National Taiwan Univ.) 17 / 54
  • 22. Opportunities Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 18 / 54
  • 23. Opportunities Opportunities Looks like we are in the early stage of a research topic But what is our chance? Chih-Jen Lin (National Taiwan Univ.) 19 / 54
  • 24. Opportunities Lessons from past developments in one machine Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 20 / 54
  • 25. Opportunities Lessons from past developments in one machine Algorithms for Distributed Data Analytics This is an on-going research topic. Roughly there are two types of approaches 1 Parallelize existing (single-machine) algorithms 2 Design new algorithms particularly for distributed settings Of course there are things in between Chih-Jen Lin (National Taiwan Univ.) 21 / 54
  • 26. Opportunities Lessons from past developments in one machine Algorithms for Distributed Data Analytics (Cont'd) Given the complicated distributed setting, we wonder if easy-to-use big-data analytics tools can ever be available? I don't know either. Let's try to think about the situation on one computer
  • 27. rst Indeed those easy-to-use analytics tools on one computer were not there at the
  • 28. rst day Chih-Jen Lin (National Taiwan Univ.) 22 / 54
  • 29. Opportunities Lessons from past developments in one machine Past Development on One Computer The problem now is we take many things for granted on one computer On one computer, have you ever worried about calculating the average of some numbers? Probably not. You can use Excel, statistical software (e.g., R and SAS), and many things else We seldom care internally how these tools work Can we go back to see the early development on one computer and learn some lessons/experiences? Chih-Jen Lin (National Taiwan Univ.) 23 / 54
  • 30. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product Consider the example of matrix-matrix products C = A B; A 2 Rnd ; B 2 Rdm where Cij = Xd k=1 AikBkj This is a simple operation. You can easily write your own code Chih-Jen Lin (National Taiwan Univ.) 24 / 54
  • 31. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) A segment of C code (assume n = m here) for (i=0;in;i++) for (j=0;jn;j++) { c[i][j]=0; for (k=0;kn;k++) c[i][j] += a[i][k]*b[k][j]; } For 3; 000 3; 000 matrices $ gcc -O3 mat.c $ time ./a.out 3m24.843s Chih-Jen Lin (National Taiwan Univ.) 25 / 54
  • 32. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) But on Matlab (single-thread mode) $ matlab -singleCompThread tic; c = a*b; toc Elapsed time is 4.095059 seconds. Chih-Jen Lin (National Taiwan Univ.) 26 / 54
  • 33. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) How can Matlab be much faster than ours? The fast implementation comes from some deep research and development Matlab calls optimized BLAS (Basic Linear Algebra Subroutines) that was developed in 80's-90's Our implementation is slow because data are not available for computation Chih-Jen Lin (National Taiwan Univ.) 27 / 54
  • 34. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) CPU # Registers # Cache # Main Memory # Secondary storage (Disk) : increasing in speed #: increasing in capacity Optimized BLAS: try to make data available in a higher level of memory You don't waste time to frequently move data Chih-Jen Lin (National Taiwan Univ.) 28 / 54
  • 35. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) Optimized BLAS uses block algorithms A B = 2 4 A11 A14 ... A41 A44 3 5 2 4 B11 B14 ... B41 B44 3 5 = A11B11 + + A14B41 ... . . . If we compare the number of page faults (cache misses) Ours: much larger Block: much smaller Chih-Jen Lin (National Taiwan Univ.) 29 / 54
  • 36. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) I like this example because it involves both mathematical operations (matrix products), and computer architecture (memory hierarchy) Only if knowing both, you can make breakthroughs Chih-Jen Lin (National Taiwan Univ.) 30 / 54
  • 37. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) For big-data analytics, we are in a similar situation We want to run mathematical algorithms (classi
  • 38. cation and clustering) in a complicated architecture (distributed system) But we are like at the time point before optimized BLAS was developed Chih-Jen Lin (National Taiwan Univ.) 31 / 54
  • 39. Opportunities Lessons from past developments in one machine Algorithms and Systems To have technical breakthroughs for big-data analytics, we should know both algorithms and systems well, and consider them together Indeed, if you are an expert on both topics, everybody wants you now Many machine learning Ph.D. students don't know much about systems. But this isn't the case in the early days of computer science Chih-Jen Lin (National Taiwan Univ.) 32 / 54
  • 40. Opportunities Lessons from past developments in one machine Algorithms and Systems (Cont'd) At that time, every numerical analyst knows computer architecture well. That's how they successfully developed oating-point systems and IEEE 754/854 standard Chih-Jen Lin (National Taiwan Univ.) 33 / 54
  • 41. Opportunities Lessons from past developments in one machine Example: Machine Learning Using Spark Recently we developed a classi
  • 42. er on Spark Spark is an in-memory cluster-computing platform Beyond algorithms we must take details of Spark Scala into account For example, you want to know the dierence between mapPartitions and map in Spark, and the slower for loop than while loop in Scala Chih-Jen Lin (National Taiwan Univ.) 34 / 54
  • 43. Opportunities Lessons from past developments in one machine Example: Machine Learning Using Spark (Cont'd) During our development, Spark was signi
  • 44. cantly upgraded from version 0.9 to 1.0. We must learn their changes It's like when you write a code on a computer, but the compiler or OS is actively changed. We are in a stage just like that. Chih-Jen Lin (National Taiwan Univ.) 35 / 54
  • 45. Opportunities Successful examples? Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 36 / 54
  • 46. Opportunities Successful examples? Example of Distributed Machine Learning I don't think we have many successful examples yet Here I will show one: CTR (Click Through Rate) prediction for computational advertising Many companies now run distributed classi
  • 47. cation for CTR problems Chih-Jen Lin (National Taiwan Univ.) 37 / 54
  • 48. Opportunities Successful examples? Example: CTR Prediction De
  • 49. nition of CTR: CTR = # clicks # impressions . A sequence of events Not clicked Features of user Clicked Features of user Not clicked Features of user A binary classi
  • 50. cation problem. Chih-Jen Lin (National Taiwan Univ.) 38 / 54
  • 51. Opportunities Successful examples? Example: CTR Prediction (Cont'd) Chih-Jen Lin (National Taiwan Univ.) 39 / 54
  • 52. Opportunities Design of big-data algorithms Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 40 / 54
  • 53. Opportunities Design of big-data algorithms Design Considerations Generally you want to minimize the data access and communication in a distributed environment It's possible that method A better than B on one computer but method A worse than B in distributed environments Chih-Jen Lin (National Taiwan Univ.) 41 / 54
  • 54. Opportunities Design of big-data algorithms Design Considerations (Cont'd) Example: on one computer, often we do batch rather than online learning Online and streaming learning may be more useful for big-data applications Example: very often we design synchronous parallel algorithms Maybe asynchronous ones are better for big data? Chih-Jen Lin (National Taiwan Univ.) 42 / 54
  • 55. Opportunities Design of big-data algorithms Work ow Issues Data analytics is often only part of the work ow of a big-data application By work ow, I mean things from raw data to
  • 56. nal use of the results Other steps may be more complicated than the analytics step In one-computer situation, the focus is often on the analytics step Chih-Jen Lin (National Taiwan Univ.) 43 / 54
  • 57. Opportunities Design of big-data algorithms How to Get Started? In my opinion, we should start from applications Applications ! programming frameworks and algorithms ! general tools Now almost every big-data application requires special settings of algorithms, but I believe general tools will be possible Chih-Jen Lin (National Taiwan Univ.) 44 / 54
  • 58. Discussion and conclusions Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 45 / 54
  • 59. Discussion and conclusions Risk of This Topic It's unclear how successful we can be Two problems: Technology limits Applicability limits Chih-Jen Lin (National Taiwan Univ.) 46 / 54
  • 60. Discussion and conclusions Risk: Technology limits It's possible that we cannot get satisfactory results because of the distributed con
  • 61. guration Recall that parallel programming or HPC (high performance computing) wasn't very successful in early 90's. But there are two dierences this time 1 We are using commodity machines 2 Data become the focus Well, every area has its limitation. The degree of success varies Chih-Jen Lin (National Taiwan Univ.) 47 / 54
  • 62. Discussion and conclusions Risk: Technology Limits (Cont'd) Let's compare two matrix products: Dense matrix products: very successful as the
  • 63. nal outcome (optimized BLAS) is much better than what ordinary users wrote Sparse matrix products: not as successful. My code is about as good as those provided by Matlab For big data analytics, it's too early to tell We never know until we try Chih-Jen Lin (National Taiwan Univ.) 48 / 54
  • 64. Discussion and conclusions Risk: Applicability Limits What's the percentage of applications that need big-data analytics? Not clear. Indeed some think the percentage is small (so they think big-data analytics is a hype) One main reason is that you can always analyze a random subest on one machine But you may say this is a chicken and egg problem { because of no available tools, so no applications?? Chih-Jen Lin (National Taiwan Univ.) 49 / 54
  • 65. Discussion and conclusions Risk: Applicability Limits (Cont'd) Another problem is the mis-understanding Until recently, few universities or companies can access data center environments. They therefore think those big ones (e.g., Google) are doing big-data analytics for everything In fact, the situation isn't like that Chih-Jen Lin (National Taiwan Univ.) 50 / 54
  • 66. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  • 67. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  • 68. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  • 69. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  • 70. Discussion and conclusions Open-source Developments Open-source developments are very important for big data analytics How it works: The company must do an application X. They consider an open-source tool Y. But Y is not enough for X. Then their engineers improve Y and submit pull requests Through this process, core developers of a project are formed. They are from various companies Chih-Jen Lin (National Taiwan Univ.) 52 / 54
  • 71. Discussion and conclusions Open-source Developments (Cont'd) For Taiwanese data-science companies, I think we should actively participate in such developments Indeed industry rather than schools are in a better position to do this Chih-Jen Lin (National Taiwan Univ.) 53 / 54
  • 72. Discussion and conclusions Conclusions Big-data analytics is in its infancy It's challenging to development algorithms and tools in a distributed environment To start, we should take both algorithms and systems into consideration Hopefully we will get some breakthroughs in the near future Chih-Jen Lin (National Taiwan Univ.) 54 / 54