Machine learning may be overhyped nowadays, but there is still a strong belief that this area is exclusively for data scientists with a deep mathematical background who leverage the Python (scikit-learn, Theano, TensorFlow, etc.) or R ecosystems and use specific tools like RStudio, MATLAB, or Octave. There is some truth to this, but Java engineers can also take the best of the machine-learning world from an applied perspective by using our native language and familiar frameworks like Apache Spark. Taras Matyashovsky explains how to use Apache Spark MLlib to build a supervised learning NLP pipeline that distinguishes pop music from heavy metal, and have fun in the process. Along the way, Taras offers an overview of the simplest machine-learning tasks and algorithms, like regression and classification.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
6. “I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
15. Date & time
Conference name
Speaker
Talk name
Track
Duration
Type
Overall impression
Overall rating
Number of slides
Time spent on live coding
Number of jokes
Etc.
32. Collect data set of lyrics:
Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create training set, i.e. label (0|1) + features
Train logistic regression (or other classification algorithm)
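To make "label (0|1) + features" and "train logistic regression" concrete, here is a minimal sketch in plain Java with no Spark dependency. It is not the talk's implementation: the class name, the two hand-made features (counts of "pop-like" and "metal-like" words), and the choice of 1 for metal and 0 for pop are all assumptions made for illustration.

```java
/** Toy logistic regression trained with batch gradient descent (illustrative only). */
final class TinyLogisticRegression {
    private final double[] weights; // one weight per feature
    private double bias;

    TinyLogisticRegression(int numFeatures) {
        this.weights = new double[numFeatures];
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Estimated probability that the sample belongs to class 1. */
    double predictProbability(double[] features) {
        double z = bias;
        for (int j = 0; j < weights.length; j++) {
            z += weights[j] * features[j];
        }
        return sigmoid(z);
    }

    /** Returns 1 or 0 using a 0.5 probability threshold. */
    int predict(double[] features) {
        return predictProbability(features) >= 0.5 ? 1 : 0;
    }

    /** Fits weights and bias by minimizing log loss with plain gradient descent. */
    void fit(double[][] features, int[] labels, int iterations, double learningRate) {
        int n = features.length;
        for (int it = 0; it < iterations; it++) {
            double[] gradW = new double[weights.length];
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                // Gradient of log loss w.r.t. the linear score is (prediction - label).
                double error = predictProbability(features[i]) - labels[i];
                for (int j = 0; j < weights.length; j++) {
                    gradW[j] += error * features[i][j];
                }
                gradB += error;
            }
            for (int j = 0; j < weights.length; j++) {
                weights[j] -= learningRate * gradW[j] / n;
            }
            bias -= learningRate * gradB / n;
        }
    }
}
```

With a handful of rows such as {3, 0} labeled 0 and {0, 3} labeled 1, a few hundred iterations are enough for the model to separate the two groups. Real lyrics need far richer features, which is what the rest of the pipeline is about.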
38.
Verse                                  Cosine Distance
baby one more time                     0.482028
crazy for you                          0.437875
show me the meaning of being lonely    0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876
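The scores on slide 38 fall between -1 and 1, which is the range of cosine similarity, although the slide labels the column as a distance. The slide does not say what the verses are compared against. As a rough sketch of how such a score can be computed between two vectors (the class and method names are hypothetical, and returning 0 for a zero vector is an assumption of this sketch):

```java
/** Cosine similarity between two dense vectors of equal length. */
final class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat a zero vector as unrelated to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.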
51. Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
52. Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Lets you invoke various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html
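The difference between dense and sparse vectors can be sketched without Spark. A sparse vector stores only the non-zero entries as (index, value) pairs plus the total size, which is the same shape of arguments that Spark's `Vectors.sparse(size, indices, values)` takes. The class below is a hypothetical stand-in, not Spark's own type.

```java
/** Minimal sparse vector: stores only non-zero entries (illustrative, not Spark's class). */
final class SparseVectorSketch {
    final int size;        // logical length of the vector
    final int[] indices;   // positions of the non-zero entries
    final double[] values; // values at those positions

    SparseVectorSketch(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        this.size = size;
        this.indices = indices;
        this.values = values;
    }

    /** Expands to the equivalent dense array, filling the gaps with zeros. */
    double[] toDense() {
        double[] dense = new double[size];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }

    /** Dot product with a dense array, touching only the stored entries. */
    double dot(double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }
}
```

For text features, where a verse uses only a few words out of a large dictionary, the sparse form saves both memory and work.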
54. Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more
55. "All" spark.mllib features plus:
• Pipelines
• Persistence
• Model selection and tuning:
  • Train-validation split
  • K-fold cross-validation
http://spark.apache.org/docs/latest/ml-guide.html
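K-fold cross-validation splits the data into k parts, trains on k-1 of them, validates on the remaining one, and repeats so that each part serves as the validation set once. In Spark ML this is handled by `CrossValidator`. The sketch below only illustrates the index bookkeeping in plain Java; the class name is hypothetical, and assigning samples to folds round-robin without shuffling is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits sample indices into k folds (round-robin, no shuffling; illustrative only). */
final class KFoldSplitter {
    /** Fold f contains every index i with i % k == f. */
    static List<List<Integer>> folds(int numSamples, int k) {
        if (k < 2 || k > numSamples) {
            throw new IllegalArgumentException("k must be between 2 and the number of samples");
        }
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < numSamples; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    /** Indices used for training when the given fold is held out for validation. */
    static List<Integer> trainingIndices(List<List<Integer>> folds, int validationFold) {
        List<Integer> training = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != validationFold) {
                training.addAll(folds.get(f));
            }
        }
        return training;
    }
}
```

A train-validation split is the cheaper variant: one fixed split instead of k rounds.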
59. I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
63. Im a rolling thunder a pouring rain
Im comin on like a hurricane
My lightnings flashing across the sky
Youre only young but youre gonna die
I wont take no prisoners wont spare no lives
Nobodys putting up a fight
I got my bell Im gonna take you to hell
Im gonna get you Satan get you
65. im a rolling thunder a pouring rain
im comin on like a hurricane
my lightnings flashing across the sky
youre only young but youre gonna die
i wont take no prisoners wont spare no lives
nobodys putting up a fight
i got my bell im gonna take you to hell
im gonna get you satan get you
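The first two cleaning steps shown on slides 63 and 65, stripping punctuation and then lowercasing, can be sketched in a few lines of plain Java. The class and method names are hypothetical, and the regular expression (keep letters, digits, and whitespace) is one reasonable choice rather than the one used in the talk's repository.

```java
import java.util.Locale;

/** Basic text cleaning for lyrics: strip punctuation, then lowercase (illustrative only). */
final class LyricsCleaner {
    /** Removes every character that is not a letter, a digit, or whitespace. */
    static String stripPunctuation(String line) {
        return line.replaceAll("[^\\p{L}\\p{Nd}\\s]", "");
    }

    /** Strips punctuation, lowercases, and trims surrounding whitespace. */
    static String clean(String line) {
        return stripPunctuation(line).toLowerCase(Locale.ROOT).trim();
    }
}
```

For example, `LyricsCleaner.clean("I'm a rolling thunder, a pouring rain")` yields `im a rolling thunder a pouring rain`, matching the first line on slide 65.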
67. im rolling thunder pouring rain
im comin like hurricane
lightnings flashing across sky
youre young youre gonna die
wont take prisoners wont spare lives
nobodys putting fight
got bell im gonna take hell
im gonna get satan get
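Slide 67 shows the verse after stop word removal. Spark ML ships a `StopWordsRemover` transformer for this step. The plain Java sketch below only illustrates the idea, using a hypothetical stop word list that is just large enough to cover the words dropped from this verse.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

/** Drops stop words from an already cleaned, lowercased line (illustrative only). */
final class StopWordFilter {
    // Hypothetical list: only the words that disappear between slides 65 and 67.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "on", "my", "the", "only", "but", "i", "no", "up", "you", "to"));

    /** Splits on whitespace, removes stop words, and joins the rest with single spaces. */
    static String removeStopWords(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(word -> !word.isEmpty() && !STOP_WORDS.contains(word))
                .collect(Collectors.joining(" "));
    }
}
```

For example, `i got my bell im gonna take you to hell` becomes `got bell im gonna take hell`, as on the slide.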
69. Lines per verse: 4
verse1:
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
verse2:
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
70. Lines per verse: 8
verse1:
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
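Slides 69 and 70 appear to group the eight stemmed lines into verses: two verses when the group size is 4 and a single verse when it is 8. Assuming that reading is right, the grouping step might look like the sketch below. The class name and the `linesPerVerse` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups consecutive lyric lines into verses of a fixed size (illustrative only). */
final class VerseGrouper {
    /** Joins every linesPerVerse consecutive lines into one verse string. */
    static List<String> group(List<String> lines, int linesPerVerse) {
        if (linesPerVerse < 1) {
            throw new IllegalArgumentException("linesPerVerse must be at least 1");
        }
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerVerse) {
            // The last verse may be shorter if the line count is not a multiple.
            int end = Math.min(start + linesPerVerse, lines.size());
            verses.add(String.join(" ", lines.subList(start, end)));
        }
        return verses;
    }
}
```

The group size is a tuning knob: larger verses give each training sample more words, but leave fewer samples overall.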
81. • Other feature extractors:
  • Term Frequency – Inverse Document Frequency (TF-IDF), token counts (TF), etc.
• Other classification algorithms:
  • Naive Bayes, Random Forest, Support Vector Machines (SVM), etc.
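As a rough illustration of TF-IDF, one of the alternative feature extractors named on slide 81: a term's weight in a document is its count in that document multiplied by the log of how rare the term is across all documents. The sketch uses the smoothed form log((N + 1) / (df + 1)), which is the formula given in the Spark documentation. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Minimal TF-IDF over tokenized documents (illustrative only). */
final class TfIdfSketch {
    /** Raw term counts within one document. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Smoothed IDF per term: log((N + 1) / (df + 1)), N = number of documents. */
    static Map<String, Double> inverseDocumentFrequencies(List<List<String>> documents) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(document)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        int n = documents.size();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            idf.put(e.getKey(), Math.log((n + 1.0) / (e.getValue() + 1.0)));
        }
        return idf;
    }

    /** TF-IDF weight of every term in one document; unknown terms get weight 0. */
    static Map<String, Double> tfIdf(List<String> document, Map<String, Double> idf) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(document).entrySet()) {
            weights.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return weights;
    }
}
```

A word that appears in every document, pop or metal, gets an IDF of 0 and so carries no weight, while genre-specific words keep theirs.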
93. ML is not as complex as it seems from an applied perspective
Existing libraries and frameworks reduce a lot of tedious work
For instance, Spark MLlib can help to build nice ML pipelines
Number of jokes used. Whether you liked the speaker or not.
Bag of words: each word is a one-hot encoded vector whose length equals the size of the dictionary. The result is a lot of sparse vectors.
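A small sketch of the bag-of-words note above, in plain Java: each word maps to a vector with a single 1 at its dictionary position, and a verse is the sum of its words' vectors. The class name and the three-word dictionary used in the example are made up for illustration.

```java
import java.util.List;

/** One-hot encoding and bag of words over a fixed dictionary (illustrative only). */
final class OneHotSketch {
    /** Vector of dictionary size with a 1 at the word's index, or all zeros if unknown. */
    static double[] oneHot(List<String> dictionary, String word) {
        double[] vector = new double[dictionary.size()];
        int index = dictionary.indexOf(word);
        if (index >= 0) {
            vector[index] = 1.0;
        }
        return vector;
    }

    /** Sum of the one-hot vectors of all tokens, i.e. per-word counts. */
    static double[] bagOfWords(List<String> dictionary, List<String> tokens) {
        double[] counts = new double[dictionary.size()];
        for (String token : tokens) {
            double[] encoded = oneHot(dictionary, token);
            for (int i = 0; i < counts.length; i++) {
                counts[i] += encoded[i];
            }
        }
        return counts;
    }
}
```

With the dictionary `[hell, love, thunder]`, the word `love` becomes `[0, 1, 0]` and the tokens `hell hell thunder` become `[2, 0, 1]`. With a dictionary of thousands of words, almost every entry is zero, hence the sparse vectors.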
Behind the scenes, Word2Vec is a two-layer neural net that processes text.
It captures semantic and morphological similarity, so similar words are close to each other in the vector space.
Similar words cluster together on the high-dimensional sphere.
If two words are nearly synonymous, you would expect them to show up in similar contexts, and indeed synonymous words tend to be close.
For two completely random words, the similarity is close to 0.
At the opposite end there is usually not an antonym, just noise.
Used the pretrained Google News model (GoogleNews-vectors-negative300).
My corpus: 8316 words.
Let's finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms and feature extractors out of the box.