
JavaDayKiev'15 Java in production for Data Mining Research projects

Alexey Zinoviev presented this talk at the JavaDayKiev'15 conference: http://javaday.org.ua/kyiv/#schedule

The talk covers the following topics: Java, Spark, Hadoop, Mahout, MLlib, Weka, Machine Learning, Data Mining


  1. Java in production for Data Mining Research projects. Alexey Zinovyev, Java Trainer at EPAM
  2. JDD conference. About: I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist, but I'm a <Java, NoSQL, Hadoop, Spark> programmer
  3. In this talk: a lot of strange pictures and technologies from a crazy zoo. We talk about • Data Mining • Hadoop ecosystem • Spark and its friends • Machine Learning libraries
  4. Are you a Hadoop developer?
  5. Let’s do THIS!
  6. The Good Old Days
  7. One of these fine days...
  8. We need a Python dev 'cause Data Mining
  9. No, you are only a JavaEE developer, continue …
  10. Write your backends, dude!
  11. Let’s talk about it, Java-boy...
  12. Can a Java programmer be a Data Scientist?
  13. Sexy Data Scientist
  14. Real Data Scientist
  15. And what I tell you, young man
  16. And what I tell you, young man
  17. WHAT IS DATA MINING?
  18. Statistics?
  19. A tag cloud from a B2B conference?
  20. Not OLAP, 100%
  21. Hey, man, predict something!
  22. Hey, man, predict something!
  23. Man or sofa?
  24. SUBJECT AREA
  25. Typical questions for DM • Which loan applicants are high-risk?
  26. Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud?
  27. Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year?
  28. Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year? • Can you recommend music for users?
  29. It’s time for a Java Superhero, yeah!
  30. Before discovering patterns you should: • Select small pieces • Define default values for missing data • Remove strange signals from data • Merge some tables into one if required
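Two of the preparation steps on slide 30 (default values for missing data, removing strange signals) could look roughly like this in plain Java. This is only a sketch: the class and method names are hypothetical, missing values are assumed to be encoded as NaN, the mean is just one possible default, and "strange signals" are assumed to be values outside a range the analyst picks by hand.

```java
import java.util.ArrayList;
import java.util.List;

class Preprocessing {

    /** Replaces NaN entries (assumed to mark missing data) with the mean of the observed values. */
    static double[] fillMissingWithMean(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        // If nothing was observed at all, fall back to 0.0 as an arbitrary default.
        double mean = count == 0 ? 0.0 : sum / count;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Double.isNaN(values[i]) ? mean : values[i];
        }
        return out;
    }

    /** Keeps only the values inside [low, high]; everything else is treated as a strange signal. */
    static List<Double> dropOutOfRange(double[] values, double low, double high) {
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            if (v >= low && v <= high) {
                kept.add(v);
            }
        }
        return kept;
    }
}
```

In a real project the default value and the accepted range depend on the subject area, which is why slide 30 puts these steps before any pattern discovery.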
  31. TARGET DATA & PERSONAL DATA
  32. Targeting by …
  33. Pay with your personal data: all your personal data (PD) are being deeply mined
  34. Pay with your personal data: the industry of collecting, aggregating, and brokering PD is “database marketing.”
  35. Pay with your personal data: 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer at Acxiom
  36. RTB
  37. DATA
  38. Datasets • Facebook users, tweets • Trade transactions • Government • Medicine (genomic data) • Telecommunications
  39. Data Sources • Relational databases • Data warehouses (historical data) • Files in CSV or in binary format • Internet or e-mail • Scientific and research tools (R, Octave, Matlab)
  40. PATTERN MINING
  41. Association rule learning
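Slide 41 only names association rule learning. The two standard measures behind a rule "A implies B" are support (the share of transactions containing all items of the rule) and confidence (support of A and B together divided by support of A). A minimal Java sketch, with a hypothetical class name and made-up basket items:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationRules {

    /** Fraction of transactions that contain every item of the given item set. */
    static double support(List<Set<String>> transactions, Set<String> items) {
        int hits = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(items)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    /**
     * Confidence of the rule antecedent -> consequent:
     * support(antecedent and consequent together) / support(antecedent).
     * The result is NaN when the antecedent never occurs.
     */
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent,
                             Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(transactions, both) / support(transactions, antecedent);
    }
}
```

For the four baskets {bread, milk}, {bread, butter}, {bread, milk, butter}, {milk}, the rule bread -> milk has support 2/4 and confidence (2/4) / (3/4) = 2/3. Real miners such as Apriori or FP-Growth add a search over candidate item sets on top of these measures.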
  42. What is Cluster Analysis? It is the process of grouping objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups; no class labels are known in advance.
  43. Different algorithms – different results
  44. Regression
  45. Classification • Training set of classified examples (supervised learning)
  46. Classification • Training set of classified examples (supervised learning) • Test set of non-classified items
  47. Classification • Training set of classified examples (supervised learning) • Test set of non-classified items • Main goal: find a function (classifier) that maps input data to a category (class)
  48. Decision trees
  49. Cruel Tree
  50. kNN (k-nearest neighbors): is the green circle a blue square or a red triangle? Let’s ask its neighbors!
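The "ask its neighbors" idea from slide 50 can be sketched in a few lines of Java. The names below are hypothetical, Euclidean distance is assumed as the similarity measure, and ties are broken in favour of the label that reaches the top vote count first among the closest neighbors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Knn {

    /** One labelled training example. */
    static final class Sample {
        final double[] features;
        final String label;

        Sample(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the majority label among the k training samples closest to the query. */
    static String classify(List<Sample> training, final double[] query, int k) {
        List<Sample> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble((Sample s) -> distance(s.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(s.label, 1, Integer::sum);
            if (v > bestVotes) {
                bestVotes = v;
                best = s.label;
            }
        }
        return best;
    }
}
```

Sorting the whole training set is fine for a demo; on large data one would typically use a spatial index or an approximate neighbor search instead.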
  51. Hit parade of algorithms
  52. FASHION LANGUAGES
  53. Octave
  54. Why not Octave? • A small number of ML algorithms • All your matrixes are belong to us! • Single-threaded model • Java support • Octave in Java?
  55. Do you like this GUI?
  56. Why not R? • 25% of R packages are written in Java • Syntax is too sweet • You should read 1000 lines of docs to write 1 line of code • Single-threaded model for 95% of algorithms
  57. Why not Python? Now Python is an idol for young scientists due to the low barrier to entry
  58. Why not Python? • High-level language • Have you ever heard of Jython? • Long way to real high-load production • We are not Python developers
  59. DM libraries
  60. Let’s run on JVM!
  61. JAVA ECOSYSTEM
  62. Family
  63. Spring Data
  64. HADOOP
  65. Hadoop
  66. Hadoop and Data Knights
  67. MapReduce for WordCount
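The slide image for WordCount is not part of this transcript, so here is a dependency-free Java sketch of the same idea. It is not Hadoop API code: the class and its methods are hypothetical and only imitate the three phases (map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group) inside a single JVM.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every token of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key (sorted, as Hadoop sorts keys). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, shuffle and reduce over all lines and returns word -> count. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }
}
```

On a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network, which is where most of the cost of a MapReduce job comes from.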
  68. How to make features from a Hadoop cluster?
  69. Pig & Hive
  70. Hive
  71. PIG
  72. PIG (Triangle count)
  73. Pig • User Defined Functions (UDF)
  74. Pig • User Defined Functions (UDF) • FOREACH
  75. Pig • User Defined Functions (UDF) • FOREACH • Pipeline-style
  76. Pig • User Defined Functions (UDF) • FOREACH • Pipeline-style • Easy parallelization
  77. Pig
  78. Why do we need a special graph approach?
  79. HOW TO MAKE GRAPH FEATURES
  80. SNA
  81. MapReduce for iterative calculations • High complexity of reducing a graph problem to the key-value model • Iterative algorithms become multiple chained M/R jobs, with each state fully saved and read back. Think like a vertex…
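"Think like a vertex" usually refers to the vertex-centric (Pregel-style) model: in each superstep a vertex reads its incoming messages, updates its own state and sends messages to its neighbors, and the state stays in memory between steps. The sketch below is a hypothetical single-JVM illustration, not the API of any graph framework. It finds connected components by letting every vertex adopt the smallest vertex id it has heard of, and it assumes that every vertex appears as a key in the adjacency map and that undirected edges are listed in both directions.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VertexCentricSketch {

    /** Returns vertex -> component label, where the label is the smallest id in the component. */
    static Map<Integer, Integer> connectedComponents(Map<Integer, List<Integer>> adjacency) {
        Map<Integer, Integer> label = new HashMap<>();
        for (Integer v : adjacency.keySet()) {
            label.put(v, v); // initially every vertex is its own component
        }
        // Superstep 0: every vertex sends its label to all of its neighbors.
        Map<Integer, Integer> inbox = send(adjacency, label, adjacency.keySet());
        // Later supersteps: only vertices whose label changed stay active and send again.
        while (!inbox.isEmpty()) {
            Set<Integer> changed = new HashSet<>();
            for (Map.Entry<Integer, Integer> message : inbox.entrySet()) {
                if (message.getValue() < label.get(message.getKey())) {
                    label.put(message.getKey(), message.getValue());
                    changed.add(message.getKey());
                }
            }
            inbox = send(adjacency, label, changed);
        }
        return label;
    }

    /** Delivers the labels of the active vertices; the inbox keeps only the minimum per receiver. */
    private static Map<Integer, Integer> send(Map<Integer, List<Integer>> adjacency,
                                              Map<Integer, Integer> label,
                                              Collection<Integer> active) {
        Map<Integer, Integer> inbox = new HashMap<>();
        for (Integer v : active) {
            for (Integer neighbor : adjacency.get(v)) {
                inbox.merge(neighbor, label.get(v), Integer::min);
            }
        }
        return inbox;
    }
}
```

Expressed as chained MapReduce jobs, every pass of the `while` loop would be a separate job that writes the full label table to disk and reads it back, which is the overhead slide 81 complains about.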
  82. Data vs Graph
  83. Messaging
  84. TRAIN MODEL
  85. JDM: Java API for Data Mining, JSR 73 and JSR 247 • javax.datamining.supervised defines the supervised function-related interfaces • javax.datamining.algorithm contains all mining algorithm subclass packages • JDM 2.0 adds text mining, time series and so on
  86. Who knows Weka?
  87. Weka • Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases • Source code of all algorithms in Java • Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining
  88. Weka
  89. Weka + Hadoop
  90. SPMF • It’s a codebase of algorithms in the pattern mining field • It has cool examples and implementations of 109 algorithms • Cool performance results in a specific area • Codebase grows very fast • Not many classification algorithms are covered
  91. Mahout • Scalable machine learning with Samsara • Advanced implementations of Java’s Collections Framework for better performance • New algorithms will be built on the Spark platform • Collaborative Filtering, Classification, Clustering, Dimensionality Reduction and miscellaneous algorithms are supported
  92. Collaborative Filtering
  93. Code sample Mahout (K-Means)
        // Read the point values and generate vectors from the input data
        final List vectors = vectorize(points);
        // Write the data to Hadoop sequence files
        writePointsToFile(configuration, vectors);
        // Write the initial centers for the clusters
        writeClusterInitialCenters(configuration, vectors);
        // Run the K-means algorithm
        final Path inputPath = new Path(POINTS_PATH);
        final Path clustersPath = new Path(CLUSTERS_PATH);
        final Path outputPath = new Path(OUTPUT_PATH);
        HadoopUtil.delete(configuration, outputPath);
        // Arguments after the paths: convergence delta, max iterations, run clustering,
        // cluster classification threshold, run sequentially
        KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false);
        // Read and print the output values
        readAndPrintOutputValues(configuration);
  94. Hadoop ecosystem
  95. HADOOP IS NOT SEXY
  96. Whaaaat?
  97. Map Reduce Job Writing
  98. Hadoop Jobs
  99. Hadoop Jobs
  100. YARN?
  101. SPARK: the bloody son of MR • MapReduce in memory • Up to 50x faster than Hadoop • RDD is the basic building block (an immutable distributed collection of objects) • Pipeline API (no need for Pig)
  102. GC & Spark • > 100 GB for Spark apps • big pauses as a result • Garbage-First GC • play with spark.storage.memoryFraction (cached data/heap for transformation) • no single recipe
  103. Mahout’s killer?
  104. MLlib supports • Classification and regression • Collaborative filtering • Clustering • Dimensionality reduction • Optimization
  105. Code sample MLlib (K-Means)
        // Cluster the data into two classes using KMeans
        int numClusters = 2;
        int numIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
        // Save and load model
        clusters.save(sc.sc(), "myModelPath");
        KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
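To make clear what the MLlib sample on slide 105 computes, here is a small single-machine Java sketch of k-means (Lloyd's algorithm) together with the Within Set Sum of Squared Errors. It is not MLlib code: the class is hypothetical, it needs no Spark, and it simply takes the first k points as initial centers, whereas MLlib by default uses the k-means|| initialization.

```java
class KMeansSketch {

    /** Runs a fixed number of Lloyd iterations and returns the k cluster centers. */
    static double[][] train(double[][] points, int k, int iterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone(); // simplification: first k points as initial centers
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach every point to its nearest center.
            for (double[] p : points) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move every non-empty cluster's center to the mean of its points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centers;
    }

    /** Index of the center closest to the point. */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(centers[c], p) < squaredDistance(centers[best], p)) {
                best = c;
            }
        }
        return best;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Within Set Sum of Squared Errors: squared distance of every point to its nearest center. */
    static double wssse(double[][] centers, double[][] points) {
        double total = 0;
        for (double[] p : points) {
            total += squaredDistance(centers[nearest(centers, p)], p);
        }
        return total;
    }
}
```

A lower WSSSE means tighter clusters, but it always shrinks as k grows, so it is mostly useful for comparing runs with the same number of clusters.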
  106. MLlib • .. builds on ideas from scikit-learn (Python lib) and Mahout • .. runs fully on Spark and supports Spark’s Pipeline API • .. represents a dataset as Spark SQL’s SchemaRDD • .. supports Hive as an external data source • .. works well for large datasets and parallelized algorithms
  107. It solves all problems!
  108. In conclusion • Think about your data
  109. In conclusion • Think about your data • Make friends with a DevOps engineer
  110. In conclusion • Think about your data • Make friends with a DevOps engineer • Run Spark
  111. In conclusion • Think about your data • Make friends with a DevOps engineer • Run Spark • Learn algorithms
  112. In conclusion • Think about your data • Make friends with a DevOps engineer • Run Spark • Learn algorithms • Write Java code
  113. Think Java
  114. Contacts • E-mail: Alexey_Zinovyev@epam.com • Twitter: @zaleslaw @BigDataRussia • LinkedIn: https://www.linkedin.com/in/zaleslaw