1. Clova Music:
Smart AI-DJ Assistant
Jung-Woo Ha
Leader, Clova AI Research (CLAIR)
Chanju Kim
Tech Leader, Clova Music
2. Clova: Cloud-based Virtual Assistant
Clova: General-purpose AI platform for empowering user values
Clova-inside
3. Vision of Clova
• Jarvis-like AGI platform for H.E.M (Human, Environment, Machine) augmentation
• Real-time, Real-world, Real-life
• Tackling enormous and challenging technical hurdles
• Fundamental and advanced AI
(Diagram: CLOVA at the center of Human, Environment, and Machine)
4. CLAIR: Clova AI Research
• Team responsible for Clova-oriented advanced AI research
• Open, Collaborative, and Self-motivated
• Outstanding global team (working language: English)
• Position: Research scientists, PostDoc, AI SW engineers, Internship researchers
• Research infrastructure and support: NSML (presented by Nako Sung)
• Your KPI may include research publication
• Advisory members: Active research involvement + authorships
조경현(NYU) 임재환(USC) 김성훈(HKUST) 박혜원(MIT) 신진우(KAIST) 주재걸(고려대)
…
5. Contents
1. Recommendation in online music services
2. Recommendation in Clova Music
- Playing log analysis
- Musical semantic embedding
- MF-based personalization
3. Music content modeling for Clova Music
- Highlight extraction via CRAN
- Emotion recognition via MCRN
4. Discussion & Future work
7. How to Recommend
Collaborative Filtering
• More similar users, more similar items
• Focuses on relationships between entities
Content-based Filtering
• Represents the contents of users and items
• Focuses on the contents of each entity
9. Collaborative Filtering
Methods
• Latent factor model
• Assumption
• There exist latent features representing user and item characteristics
• Matrix (Tensor) factorization:
• SVD, ALS, NMF, pMF, etc.
• Topic model: LDA, HDP, etc.
• NN-based CF [Dziugaite & Roy, 2015; Wang et al., 2015; Kim et al., 2016; Seo et al., 2017]
R ≈ U × Vᵀ (user latent-factor matrix U, item latent-factor matrix V)
10. Collaborative Filtering
Challenges
• Sparsity: most entries of the user-item matrix are unknown
• Scalability: real-world user and item set sizes
• Synonyms: same or similar items under different indices
• Gray sheep: inconsistent users who do not reflect the true distribution
• Shilling attack: unfair ratings
• Diversity and long-tail: unreliable features for items with few logs (long-tail data)
• Cold-start problem
11. Hybrid with Content-based Methods
How to hybridize
• Content-based methods: user / item / etc.
• The hybrid concept is not new; deep learning now makes it valuable
• Text, image, video, audio, music, webtoon, etc.
• CNN, RNN, DNN
• Requires safety logic
• Descriptions are valuable
13. Remaining Challenges
• Ambiguous objective function
• Changing user preferences
• Gap between model metrics and user satisfaction
• What is correct now may have been wrong then
• Still requires post-processing
14. 2. Recommendation in Clova Music
- Playing log analysis
- Musical semantic embedding
- MF-based personalization
- Challenges
15. AI Speaker vs Conventional Platforms
• WAVE: Clova-inside AI speaker
• Playing music is the key feature of AI speakers
• Compare usage patterns of the AI speaker with those of conventional platforms
16. Top Music Sentences
• 노래 틀어줘 (Play a song)
• 자장가 틀어줘 (Play a lullaby)
• 동요 틀어줘 (Play children's songs)
• 신나는 노래 틀어줘 (Play upbeat songs)
• 조용한 노래 틀어줘 (Play quiet songs)
• 핑크퐁 노래 틀어줘 (Play Pinkfong songs)
• 아이유 노래 틀어줘 (Play IU songs)
• 클래식 틀어줘 (Play classical music)
• 분위기 좋은 음악 틀어줘 (Play music with a nice mood)
• 잔잔한 음악 틀어줘 (Play calm music)
• 발라드 틀어줘 (Play ballads)
• 팝송 틀어줘 (Play pop songs)
• Artists rather than tracks
• Genres, moods, and themes rather than artists
• "Just play" rather than genres
17. Playing Pattern: Hour of Day
(Chart: playing ratio by hour of day, 0 to 25, for NAVER_APP, NAVER_PC, WAVE, and CLOVA_APP)
18. Playing Pattern: Genre
(Chart: playing ratio by genre for NAVER_APP vs. WAVE: 가요 (K-pop), 기능성음악 (functional music), 팝 (pop), 동요 (children's songs), OST, 클래식 (classical), 재즈 (jazz), 종교음악 (religious music), 일렉트로니카 (electronica), 락 (rock), 힙합 (hip-hop), 기타 (others))
19. Playing Pattern: Genre
(Chart: playing ratio by genre for NAVER_APP vs. WAVE, repeated with annotation)
May be caused by home environments
20. Playing Pattern: Long-tail Distribution
• 'Artists' vs. 'play-count ratio' on a log-log scale graph
• Long-tail distribution
• No difference in distribution, but…
21. Playing Pattern: Long-tail Distribution
Top Artists
WAVE | NAVER MUSIC APP
핑크퐁 | EXO
아이유 | 아이유
동요 | 젝스키스
동요 | 방탄소년단
EXO | 뉴이스트(NU`EST)
윤종신 | Wanna One
별하나 동요 | 윤종신
이루마 | 우원재
오르골뮤직 | 볼빨간사춘기
볼빨간사춘기 | 뉴이스트 W
젝스키스 | 황치열
트니트니 | 헤이즈
헤이즈 | 선미
성시경 | WINNER
힐링 피아노 | 자장가
22. Implication
• Paradigm shift of music consumption via AI speakers
• New market
• Kids
• Lean-out music
• Classical / jazz
• Music recommendation will play more important roles for AI assistant platforms
23. Clova Music Recommendation
Two main challenging issues
• Lack of well-refined metadata
  • "~~한 노래 틀어줘" (Play ~~ songs)
  • 신나는 노래 (upbeat songs)
  • 혼자 듣기 좋은 노래 (songs good for listening alone)
• Personalized playlists
  • 발화에 부합하고 다양한 (matching the utterance, and diverse)
24. Musical Semantic Embedding
Goal & Usages
• Map musical items (tracks, artists, words) into the same semantic space as vectors of real numbers
• Word2vec
• Feature learning
• Usages
  • Item similarities
  • Input to deep neural networks
25. Musical Semantic Embedding
Word2vec with tagged playlists
• Embed keywords and tracks (artists) into the same space
• JAMM data
  • User-created playlists with hashtags in Naver Music
  • About 72,000 playlists
• Treat a track ID as a symbol carrying semantics, like a textual word
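The playlist-as-document idea above can be sketched without a word2vec library: below, a toy co-occurrence embedding (standing in for word2vec; the data and names are hypothetical) places hashtags and track IDs in one space, so tag-to-track similarity falls out of cosine distance.

```python
from collections import Counter
from itertools import combinations
import math

def cooccurrence_embeddings(playlists):
    """Represent each item (track ID, artist, or hashtag) by the counts
    of items it shares playlists with, a crude stand-in for word2vec."""
    vecs = {}
    for playlist in playlists:
        for a, b in combinations(set(playlist), 2):
            vecs.setdefault(a, Counter())[b] += 1
            vecs.setdefault(b, Counter())[a] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tagged playlists mixing hashtags and track IDs.
playlists = [
    ["#신나는", "track_1", "track_2"],
    ["#신나는", "track_1", "track_3"],
    ["#잔잔한", "track_4", "track_5"],
]
vecs = cooccurrence_embeddings(playlists)
# track_1 sits closer to the "#신나는" tag than track_4 does.
print(cosine(vecs["#신나는"], vecs["track_1"]) > cosine(vecs["#신나는"], vecs["track_4"]))  # → True
```

With ~72,000 real playlists one would train an actual skip-gram model instead, but the principle of sharing one space between tags and track IDs is the same.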
29. Multimodal Semantic Embedding
Embedding tracks with session data
• Regard user music playing sequences as documents
• To capture multiple track meanings, use multimodal word distributions formed from Gaussian mixtures
Ben Athiwaratkun and Andrew Gordon Wilson, Multimodal Word Distributions, 2017
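In the cited formulation, each symbol w (word or track) is a Gaussian mixture, and similarity between two symbols can be measured with the log expected likelihood kernel, which is closed-form for Gaussian mixtures (a sketch of the paper's measure, not necessarily this system's exact objective):

```latex
f_w(x) = \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\!\left(x;\, \mu_{w,i}, \Sigma_{w,i}\right)
\qquad
\log E(f, g) = \log \int f(x)\, g(x)\, dx
             = \log \sum_{i,j} p_{f,i}\, p_{g,j}\, \mathcal{N}\!\left(\mu_{f,i};\, \mu_{g,j},\, \Sigma_{f,i} + \Sigma_{g,j}\right)
```

Each mixture component can capture one distinct meaning of a track, which is what the '밤편지' example that follows illustrates.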
30. Examples
밤편지 / 아이유 (Through the Night / IU)
• Similar tracks for the two different '밤편지' embeddings: <밤편지_1> and <밤편지_2>
31. MF for Personalized Recommendation
Matrix Factorization
• Select tracks and artists the user prefers when generating a playlist
• Simple but hard to apply well
  • Evaluation
  • Overfitting / underfitting
  • Combining with other models
32. MF for Personalized Recommendation
Tips
• Relatively huge item set
• Long-term batch learning for a compact model
• Short-term incremental learning
• Proper regularization factors
• Consider the item distribution when performing negative sampling
• Remove abusing users (e.g., Top 100 users)
Fundamental problem of collaborative filtering
• Assumption: users who have had similar preferences are likely to keep having similar preferences
• What if a user violates the assumption?
• Familiarity vs. discovery
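A minimal sketch of the batch MF step, assuming SGD on observed entries with the L2 regularization factor mentioned in the tips above (the toy matrix and all hyperparameters are illustrative, not the production setup):

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.05, reg=0.02, epochs=2000, seed=0):
    """SGD matrix factorization of the observed entries (mask == 1) of R
    into U @ V.T, with L2 regularization strength `reg`."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = R[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Toy 4x4 preference matrix; column 2 is fully unobserved (a cold item).
R = np.array([[5., 4., 0., 1.],
              [4., 5., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])
mask = (R > 0).astype(int)
U, V = factorize(R, mask)
mae = np.abs(U @ V.T - R)[mask == 1].mean()
print(f"observed-entry MAE: {mae:.3f}")  # should be well below 1.0
```

The unobserved column also gets predictions from U @ V.T, which is exactly where the cold-start and sparsity caveats above bite: those predictions rest on nothing but the prior.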
33. MF for Personalized Recommendation
(Sparkline chart: Precision@30 for the Top 10,000 users; see https://en.wikipedia.org/wiki/Sparkline)
34. Remaining Challenges
Conventional problems
• Interactive recommendation
• Lean-in vs. lean-out
• More personalized and context-aware recommendation
• Familiarity vs. discovery
Music recommendation for AI speakers
• Sparsity
• Harry Potter effect (Top 100)
• Cold-start problems
• Explainable recommendation
35. 3. Music content modeling for Clova Music
- Highlight extraction via CRAN (Ha et al., ICML ML4MD Workshop 2017)
- Emotion recognition via MCRN (Jeon et al., RECSYS 2017)
36. Music Highlight Extraction (MHE)
Motivation
• Why only the first 1 minute?
• Improved user experiences and recommendations
• Increase in potential customers
• Valuable data sets
Definition
• Extract a representative snippet from a given track
37. Main Task in MHE
Task
• Finding significant snippets within a track is an interesting and valuable task
• Given a track x, where should a 'highlight' start?
• Input: mel-spectrogram of x
• Output: starting frame (H) of a highlight of length S (spanning frames H to H+S)
38. Structure of MHE
Track file → Mel-spectrogram → Deep learning model → Highlight extraction → Highlight clip [Ha et al. 2017]
• Deep learning model: convolutional & pooling layers → recurrent layers → attention layer → fully connected & output layers
• Highlight extraction: candidate generation → candidate selection
41. Highlight Extraction via CRAN
• Attention-weighted energy of frame n: the frame's mel energy scaled by its attention weight
• Highlight score of frame n: cumulative sum of the attention-weighted energies of the S frames starting at n, together with the speed/acceleration of energy change
• The highlight of length S starts at the frame with the maximum score
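The scoring rule above (without the speed/acceleration terms) can be sketched as follows, assuming the attention weights are already given; in CRAN they come from the attention layer, but here any per-frame weight vector works:

```python
import numpy as np

def highlight_start(mel, attention, S):
    """Return the start frame n maximizing the sum of attention-weighted
    mel energy over the window [n, n + S)."""
    energy = mel.sum(axis=0)                  # total mel energy per frame
    weighted = attention * energy             # attention-weighted energy
    csum = np.concatenate([[0.0], np.cumsum(weighted)])
    scores = csum[S:] - csum[:-S]             # window sum for each start frame
    return int(np.argmax(scores))

# Toy example: 128 mel bins x 100 frames with a loud 'chorus' at frames 40-59.
rng = np.random.default_rng(0)
mel = rng.random((128, 100))
mel[:, 40:60] += 5.0
print(highlight_start(mel, np.ones(100), S=20))  # → 40
```

The cumulative-sum trick makes scoring every candidate start frame O(number of frames), which matters when running over a full catalog.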
43. Data Description
• 32,083 full tracks with genres tagged as multi-labels
  • Chosen by play frequency, December 2016 – January 2017
• 10 representative genres
• The Korean music market has a strong bias toward specific genres (ballad, dance, hip-hop, and R&B)
• Idol musicians make up the majority of K-pop
(Chart: genre ratios, 0 to 0.35, for Total, Popular (Top 10%), and Newly Released (Top 10%) tracks)
44. Extraction Results
• Quantitative evaluation
  • Ground-truth highlights of 280 tracks from 8 experts
  • Metric definition: overlap length with the ground truth
• Qualitative evaluation
  • Explicit scoring from 8 experts on a [1, 5] scale
• Results
  • Web demo
47. MCRN: Multimodal Convolutional Recurrent Networks
Overview: the model directly predicts a track's polarity (pos/neg) from audio and lyrics; it consists of an audio branch and a lyrics branch.
• Layer (1), input layer: audio = (128, 1024) mel-spectrograms; lyrics = (27496, 400) padded word vectors
• Layer (2), conv layers in the audio branch: ELU activation
• Layer (3), RNN layers in the audio branch: output dim 64
• Layer (4), conv layers in the lyrics branch: ELU activation, output dim 64
• Layer (5), output layer: merge the two branches by concatenation, ReLU activation, final output
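A shape-level sketch of the merge step (layer 5) under the dimensions listed above; the conv/RNN branches are replaced by random stand-in features, and the hidden size of 32 is an assumption, not a stated part of the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

# Stand-ins for the branch outputs: a 64-d audio feature (layer 3, from
# conv+RNN over the (128, 1024) mel-spectrogram) and a 64-d lyrics feature
# (layer 4, from conv layers over (27496, 400) padded word vectors).
audio_feat = rng.standard_normal(64)
lyrics_feat = rng.standard_normal(64)

# Layer 5: concatenate the branches, ReLU hidden layer (size 32 assumed),
# then a 2-way softmax over (positive, negative) polarity.
merged = np.concatenate([audio_feat, lyrics_feat])          # shape (128,)
W_h, b_h = 0.1 * rng.standard_normal((32, 128)), np.zeros(32)
W_o, b_o = 0.1 * rng.standard_normal((2, 32)), np.zeros(2)
hidden = relu(W_h @ merged + b_h)
logits = W_o @ hidden + b_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (2,)
```

Late fusion by concatenation keeps each modality's branch independent until the very end, which matches the per-modality accuracies reported in the results table.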
48. Data Description
• Naver Music Polarity Emotion Dataset
  • 7,484 tracks (pos : neg = 1 : 1)
  • Polarity emotion labels (pos or neg)
  • Lyrics: (27496, 400) word vectors (|V| = 27,496)
  • Mel-spectrograms: (128 mels, 1,024 time slots = 1 min)
• Dataset creation process
  • Separate pos/neg emotion tags among Naver JAMM editors' tags using a polarity emotion word dictionary
  • Editors' tags are more reliable than users' tags
  • Filter out tracks whose tags include both positive and negative words
  • Reject tracks shorter than a minute or with lyrics of fewer than 30 words
  • Use the first minute of each mel-spectrogram
  • Use only nouns, verbs, adjectives, and adverbs from the lyrics
49. MER Results
Data   | Model     | Accuracy
Audio  | CNN       | 0.6479
Audio  | RNN       | 0.6303
Audio  | MCRN      | 0.6619
Lyrics | CNN       | 0.7815
Lyrics | RNN       | 0.7716
Both   | MCRN, CNN | 0.8046
53. A Smarter AI DJ Assistant?
• More proactive interaction
• Human-in-the-loop
• Truly context-aware
• Truly personalized
• Balance between familiarity and novelty
54. Future Direction
• More coverage for users’ intent
• Proactive interaction
• More visualization and explanation
• Rich intelligent services
• Discovery
• From smart to touching
55. Concluding Remarks
• Importance of human evaluation
  • Numerical model metrics can be unreliable
  • A/B tests
  • Proper evaluation design
• Do not trust user logs blindly
• Rich curation data by human experts
• Maybe reinforcement learning?
• Conferences: KDD, RECSYS, CIKM, AAAI, IJCAI, ICDM, …