1. Clova Music:
Smart AI-DJ Assistant
Jung-Woo Ha
Leader, Clova AI Research (CLAIR)
Chanju Kim
Tech Leader, Clova Music
2. Clova: Cloud-based Virtual Assistant
Clova: General-purpose AI platform for empowering user values
Clova-inside
3. Vision of Clova
• Jarvis-like AGI platform for H.E.M (Human, Environment, Machine) augmentation
• Real-time, Real-world, Real-life
• Tackling enormous and challenging technical hurdles
• Fundamental and advanced AI
(Diagram: CLOVA at the center of Human, Environment, and Machine)
4. CLAIR: Clova AI Research
• Team responsible for Clova-oriented advanced AI research
• Open, Collaborative, and Self-motivated
• Outstanding global team (working language: English)
• Position: Research scientists, PostDoc, AI SW engineers, Internship researchers
• Research infrastructure and support: NSML (presented by Nako Sung)
• Your KPI may include research publication
• Advisory members: Active research involvement + authorships
조경현(NYU) 임재환(USC) 김성훈(HKUST) 박혜원(MIT) 신진우(KAIST) 주재걸(고려대)
…
5. Contents
1. Recommendation in online music services
2. Recommendation in Clova Music
- Playing log analysis
- Musical semantic embedding
- MF-based personalization
3. Music content modeling for Clova Music
- Highlight extraction via CRAN
- Emotion recognition via MCRN
4. Discussion & Future work
7. How to Recommend
Collaborative Filtering
• More similar users, more similar items
• Focuses on relationships between entities
Content-based Filtering
• Represents the contents of users and items
• Focuses on the contents of each entity
9. Collaborative Filtering
Methods
• Latent factor model
• Assumption
• There exist latent features representing user and item characteristics
• Matrix (Tensor) factorization:
• SVD, ALS, NMF, pMF, etc.
• Topic model: LDA, HDP, etc.
• NN-based CF [Dziugaite & Roy, 2015; Wang et al., 2015; Kim et al., 2016; Seo et al., 2017]
R ≈ U × Vᵀ (user latent-factor matrix U, item latent-factor matrix V)
10. Collaborative Filtering
Challenges
• Sparsity: most entries of the user-item matrix are unknown
• Scalability: real-world user and item set sizes
• Synonyms: same or similar items under different indices
• Gray sheep: inconsistent users who do not reflect the true distribution
• Shilling attack: unfair ratings
• Diversity and long-tail: unreliable features for items with few logs (long-tail data)
• Cold-start problem
11. Hybrid with Content-based Methods
How to hybridize
• Content-based methods: user / item / etc.
• The hybrid concept is not new; deep learning now makes it valuable
• Text, image, video, audio, music, webtoon, etc.
• CNN, RNN, DNN
• Requires safety logic
• Descriptions are valuable
13. Remaining Challenges
• Ambiguous objective function
• Changing user preferences
• Gap between model metrics and user satisfaction
• What is correct now may have been wrong then
• Still requires post-processing
14. 2. Recommendation in Clova Music
- Playing log analysis
- Musical semantic embedding
- MF-based personalization
- Challenges
15. AI Speaker vs Conventional Platforms
• WAVE: Clova-inside AI speaker
• Playing music is the key feature of AI speakers
• Compare usage patterns of the AI speaker with those of conventional platforms
16. Top Music Sentences
• 노래 틀어줘 (Play a song)
• 자장가 틀어줘 (Play a lullaby)
• 동요 틀어줘 (Play children's songs)
• 신나는 노래 틀어줘 (Play upbeat songs)
• 조용한 노래 틀어줘 (Play quiet songs)
• 핑크퐁 노래 틀어줘 (Play Pinkfong songs)
• 아이유 노래 틀어줘 (Play IU songs)
• 클래식 틀어줘 (Play classical music)
• 분위기 좋은 음악 틀어줘 (Play music with a nice mood)
• 잔잔한 음악 틀어줘 (Play calm music)
• 발라드 틀어줘 (Play ballads)
• 팝송 틀어줘 (Play pop songs)
• Artists rather than tracks
• Genres, moods, and themes rather than artists
• "Just play" rather than genres
17. Playing Pattern: Hour of Day
(Chart: playing ratio by hour of day, 0 to 25, for NAVER_APP, NAVER_PC, WAVE, and CLOVA_APP)
18. Playing Pattern: Genre
(Chart: playing ratio by genre for NAVER_APP vs. WAVE: 가요 (K-pop), 기능성음악 (functional music), 팝 (pop), 동요 (children's songs), OST, 클래식 (classical), 재즈 (jazz), 종교음악 (religious music), 일렉트로니카 (electronica), 락 (rock), 힙합 (hip-hop), 기타 (others))
19. Playing Pattern: Genre
(Chart: playing ratio by genre for NAVER_APP vs. WAVE, repeated with annotation)
May be caused by home environments
20. Playing Pattern: Long-tail Distribution
• 'Artists' vs. 'play-count ratio' on a log-log scale graph
• Long-tail distribution
• No difference in distribution, but…
21. Playing Pattern: Long-tail Distribution
Top Artists
WAVE | NAVER MUSIC APP
핑크퐁 | EXO
아이유 | 아이유
동요 | 젝스키스
동요 | 방탄소년단
EXO | 뉴이스트(NU`EST)
윤종신 | Wanna One
별하나 동요 | 윤종신
이루마 | 우원재
오르골뮤직 | 볼빨간사춘기
볼빨간사춘기 | 뉴이스트 W
젝스키스 | 황치열
트니트니 | 헤이즈
헤이즈 | 선미
성시경 | WINNER
힐링 피아노 | 자장가
22. Implication
• Paradigm shift of music consumption via AI speakers
• New market
• Kids
• Lean-out music
• Classical / jazz
• Music recommendation will play more important roles for AI assistant platforms
23. Clova Music Recommendation
Two main challenging issues
• Lack of well-refined metadata
  • "~~한 노래 틀어줘" (Play ~~ songs)
  • 신나는 노래 (upbeat songs)
  • 혼자 듣기 좋은 노래 (songs good for listening alone)
• Personalized playlists
  • 발화에 부합하고 다양한 (matching the utterance, and diverse)
24. Musical Semantic Embedding
Goal & Usages
• Map musical items (tracks, artists, words) into the same semantic space as vectors of real numbers
• Word2vec
• Feature learning
• Usages
  • Item similarities
  • Input to deep neural networks
25. Musical Semantic Embedding
Word2vec with tagged playlists
• Embed keywords and tracks (artists) into the same space
• JAMM data
  • User-created playlists with hashtags in Naver Music
  • About 72,000 playlists
• Treat a track ID as a symbol carrying semantics, like a textual word
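The playlist-as-document idea above can be sketched without a word2vec library: below, a toy co-occurrence embedding (standing in for word2vec; the data and names are hypothetical) places hashtags and track IDs in one space, so tag-to-track similarity falls out of cosine distance.

```python
from collections import Counter
from itertools import combinations
import math

def cooccurrence_embeddings(playlists):
    """Represent each item (track ID, artist, or hashtag) by the counts
    of items it shares playlists with, a crude stand-in for word2vec."""
    vecs = {}
    for playlist in playlists:
        for a, b in combinations(set(playlist), 2):
            vecs.setdefault(a, Counter())[b] += 1
            vecs.setdefault(b, Counter())[a] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tagged playlists mixing hashtags and track IDs.
playlists = [
    ["#신나는", "track_1", "track_2"],
    ["#신나는", "track_1", "track_3"],
    ["#잔잔한", "track_4", "track_5"],
]
vecs = cooccurrence_embeddings(playlists)
# track_1 sits closer to the "#신나는" tag than track_4 does.
print(cosine(vecs["#신나는"], vecs["track_1"]) > cosine(vecs["#신나는"], vecs["track_4"]))  # → True
```

With ~72,000 real playlists one would train an actual skip-gram model instead, but the principle of sharing one space between tags and track IDs is the same.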
29. Multimodal Semantic Embedding
Embedding tracks with session data
• Regard user music playing sequences as documents
• To capture multiple track meanings, use multimodal word distributions formed from Gaussian mixtures
Ben Athiwaratkun and Andrew Gordon Wilson, Multimodal Word Distributions, 2017
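In the cited formulation, each symbol w (word or track) is a Gaussian mixture, and similarity between two symbols can be measured with the log expected likelihood kernel, which is closed-form for Gaussian mixtures (a sketch of the paper's measure, not necessarily this system's exact objective):

```latex
f_w(x) = \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\!\left(x;\, \mu_{w,i}, \Sigma_{w,i}\right)
\qquad
\log E(f, g) = \log \int f(x)\, g(x)\, dx
             = \log \sum_{i,j} p_{f,i}\, p_{g,j}\, \mathcal{N}\!\left(\mu_{f,i};\, \mu_{g,j},\, \Sigma_{f,i} + \Sigma_{g,j}\right)
```

Each mixture component can capture one distinct meaning of a track, which is what the '밤편지' example that follows illustrates.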
30. Examples
밤편지 / 아이유 (Through the Night / IU)
• Similar tracks for the two different '밤편지' embeddings: <밤편지_1> and <밤편지_2>
31. MF for Personalized Recommendation
Matrix Factorization
• Select tracks and artists the user prefers when generating a playlist
• Simple but hard to apply well
  • Evaluation
  • Overfitting / underfitting
  • Combining with other models
32. MF for Personalized Recommendation
Tips
• Relatively huge item set
• Long-term batch learning for a compact model
• Short-term incremental learning
• Proper regularization factors
• Consider the item distribution when performing negative sampling
• Remove abusing users (e.g., Top 100 users)
Fundamental problem of collaborative filtering
• Assumption: users who have had similar preferences are likely to keep having similar preferences
• What if a user violates the assumption?
• Familiarity vs. discovery
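A minimal sketch of the batch MF step, assuming SGD on observed entries with the L2 regularization factor mentioned in the tips above (the toy matrix and all hyperparameters are illustrative, not the production setup):

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.05, reg=0.02, epochs=2000, seed=0):
    """SGD matrix factorization of the observed entries (mask == 1) of R
    into U @ V.T, with L2 regularization strength `reg`."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = R[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Toy 4x4 preference matrix; column 2 is fully unobserved (a cold item).
R = np.array([[5., 4., 0., 1.],
              [4., 5., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])
mask = (R > 0).astype(int)
U, V = factorize(R, mask)
mae = np.abs(U @ V.T - R)[mask == 1].mean()
print(f"observed-entry MAE: {mae:.3f}")  # should be well below 1.0
```

The unobserved column also gets predictions from U @ V.T, which is exactly where the cold-start and sparsity caveats above bite: those predictions rest on nothing but the prior.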
33. MF for Personalized Recommendation
(Sparkline chart: Precision@30 for the Top 10,000 users; see https://en.wikipedia.org/wiki/Sparkline)
34. Remaining Challenges
Conventional problems
• Interactive recommendation
• Lean-in vs. lean-out
• More personalized and context-aware recommendation
• Familiarity vs. discovery
Music recommendation for AI speakers
• Sparsity
• Harry Potter effect (Top 100)
• Cold-start problems
• Explainable recommendation
35. 3. Music content modeling for Clova Music
- Highlight extraction via CRAN (Ha et al., ICML ML4MD Workshop 2017)
- Emotion recognition via MCRN (Jeon et al., RECSYS 2017)
36. Music Highlight Extraction (MHE)
Motivation
• Why only the first 1 minute?
• Improved user experiences and recommendations
• Increase in potential customers
• Valuable data sets
Definition
• Extract a representative snippet from a given track
37. Main Task in MHE
Task
• Finding significant snippets within a track is an interesting and valuable task
• Given a track x, where should a 'highlight' start?
• Input: mel-spectrogram of x
• Output: starting frame (H) of a highlight of length S (spanning frames H to H+S)
38. Structure of MHE
Track file → Mel-spectrogram → Deep learning model → Highlight extraction → Highlight clip [Ha et al. 2017]
• Deep learning model: convolutional & pooling layers → recurrent layers → attention layer → fully connected & output layers
• Highlight extraction: candidate generation → candidate selection
41. Highlight Extraction via CRAN
• Attention-weighted energy of frame n: the frame's mel energy scaled by its attention weight
• Highlight score of frame n: cumulative sum of the attention-weighted energies of the S frames starting at n, together with the speed/acceleration of energy change
• The highlight of length S starts at the frame with the maximum score
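The scoring rule above (without the speed/acceleration terms) can be sketched as follows, assuming the attention weights are already given; in CRAN they come from the attention layer, but here any per-frame weight vector works:

```python
import numpy as np

def highlight_start(mel, attention, S):
    """Return the start frame n maximizing the sum of attention-weighted
    mel energy over the window [n, n + S)."""
    energy = mel.sum(axis=0)                  # total mel energy per frame
    weighted = attention * energy             # attention-weighted energy
    csum = np.concatenate([[0.0], np.cumsum(weighted)])
    scores = csum[S:] - csum[:-S]             # window sum for each start frame
    return int(np.argmax(scores))

# Toy example: 128 mel bins x 100 frames with a loud 'chorus' at frames 40-59.
rng = np.random.default_rng(0)
mel = rng.random((128, 100))
mel[:, 40:60] += 5.0
print(highlight_start(mel, np.ones(100), S=20))  # → 40
```

The cumulative-sum trick makes scoring every candidate start frame O(number of frames), which matters when running over a full catalog.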
43. Data Description
• 32,083 full tracks with genres tagged as multi-labels
  • Chosen by play frequency, December 2016 – January 2017
• 10 representative genres
• The Korean music market has a strong bias toward specific genres (ballad, dance, hip-hop, and R&B)
• Idol musicians make up the majority of K-pop
(Chart: genre ratios, 0 to 0.35, for Total, Popular (Top 10%), and Newly Released (Top 10%) tracks)
44. Extraction Results
• Quantitative evaluation
  • Ground-truth highlights of 280 tracks from 8 experts
  • Metric definition: overlap length with the ground truth
• Qualitative evaluation
  • Explicit scoring from 8 experts on a [1, 5] scale
• Results
  • Web demo
47. MCRN: Multimodal Convolutional Recurrent Networks
Overview: the model directly predicts a track's polarity (pos/neg) from audio and lyrics; it consists of an audio branch and a lyrics branch.
• Layer (1), input layer: audio = (128, 1024) mel-spectrograms; lyrics = (27496, 400) padded word vectors
• Layer (2), conv layers in the audio branch: ELU activation
• Layer (3), RNN layers in the audio branch: output dim 64
• Layer (4), conv layers in the lyrics branch: ELU activation, output dim 64
• Layer (5), output layer: merge the two branches by concatenation, ReLU activation, final output
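A shape-level sketch of the merge step (layer 5) under the dimensions listed above; the conv/RNN branches are replaced by random stand-in features, and the hidden size of 32 is an assumption, not a stated part of the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

# Stand-ins for the branch outputs: a 64-d audio feature (layer 3, from
# conv+RNN over the (128, 1024) mel-spectrogram) and a 64-d lyrics feature
# (layer 4, from conv layers over (27496, 400) padded word vectors).
audio_feat = rng.standard_normal(64)
lyrics_feat = rng.standard_normal(64)

# Layer 5: concatenate the branches, ReLU hidden layer (size 32 assumed),
# then a 2-way softmax over (positive, negative) polarity.
merged = np.concatenate([audio_feat, lyrics_feat])          # shape (128,)
W_h, b_h = 0.1 * rng.standard_normal((32, 128)), np.zeros(32)
W_o, b_o = 0.1 * rng.standard_normal((2, 32)), np.zeros(2)
hidden = relu(W_h @ merged + b_h)
logits = W_o @ hidden + b_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (2,)
```

Late fusion by concatenation keeps each modality's branch independent until the very end, which matches the per-modality accuracies reported in the results table.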
48. Data Description
• Naver Music Polarity Emotion Dataset
  • 7,484 tracks (pos : neg = 1 : 1)
  • Polarity emotion labels (pos or neg)
  • Lyrics: (27496, 400) word vectors (|V| = 27,496)
  • Mel-spectrograms: (128 mels, 1,024 time slots = 1 min)
• Dataset creation process
  • Separate pos/neg emotion tags among Naver JAMM editors' tags using a polarity emotion word dictionary
  • Editors' tags are more reliable than users' tags
  • Filter out tracks whose tags include both positive and negative words
  • Reject tracks shorter than a minute or with lyrics of fewer than 30 words
  • Use the first minute of each mel-spectrogram
  • Use only nouns, verbs, adjectives, and adverbs from the lyrics
49. MER Results
Data   | Model     | Accuracy
Audio  | CNN       | 0.6479
Audio  | RNN       | 0.6303
Audio  | MCRN      | 0.6619
Lyrics | CNN       | 0.7815
Lyrics | RNN       | 0.7716
Both   | MCRN, CNN | 0.8046
53. A Smarter AI DJ Assistant?
• More proactive interaction
• Human-in-the-loop
• Truly context-aware
• Truly personalized
• Balance between familiarity and novelty
54. Future Direction
• More coverage for users’ intent
• Proactive interaction
• More visualization and explanation
• Rich intelligent services
• Discovery
• From smart to touching
55. Concluding Remarks
• Importance of human evaluation
  • Numerical model metrics can be unreliable
  • A/B tests
  • Proper evaluation design
• Do not trust user logs blindly
• Rich curation data by human experts
• Maybe reinforcement learning?
• Conferences: KDD, RECSYS, CIKM, AAAI, IJCAI, ICDM, …