
Smart AI DJ Assistant: Clova Music

  1. Clova Music: Smart AI-DJ Assistant • Jung-Woo Ha, Leader, Clova AI Research (CLAIR) • Chanju Kim, Tech Leader, Clova Music
  2. Clova: Cloud-based Virtual Assistant • Clova: a general-purpose AI platform for empowering user values • Clova-inside
  3. Vision of Clova • Jarvis-like AGI platform for H.E.M. (Human, Environment, Machine) augmentation • Real-time, real-world, real-life • Tackling enormous technical hurdles • Fundamental and advanced AI
  4. CLAIR: Clova AI Research • Team responsible for Clova-oriented advanced AI research • Open, collaborative, and self-motivated • Outstanding global team (working language: English) • Positions: research scientists, postdocs, AI SW engineers, internship researchers • Research infrastructure and support: nsml (presented by Nako Sung) • Your KPI may include research publications • Advisory members, with active research involvement and authorship: 조경현 (NYU), 임재환 (USC), 김성훈 (HKUST), 박혜원 (MIT), 신진우 (KAIST), 주재걸 (Korea Univ.), …
  5. Contents 1. Recommendation in online music services 2. Recommendation in Clova Music: playing log analysis, musical semantic embedding, MF-based personalization 3. Music content modeling for Clova Music: highlight extraction via CRAN, emotion recognition via MCRN 4. Discussion & future work
  6. 1. Recommendation in Online Music Services
  7. How to Recommend • Collaborative filtering: more similar users, more similar items; focuses on relationships • Content-based filtering: represents the contents of users and items; focuses on the contents of each entity
  8. Collaborative Filtering • Co-occurrence-based model: recommend the items consumed by users who played the same items (figure: user-item matrix with example weights 0.5, 0.3, 0.2) [Choi, 2017]
  9. Collaborative Filtering Methods • Latent factor model • Assumption: there exist latent features representing user and item characteristics • Matrix (tensor) factorization, V ≈ W × H: SVD, ALS, NMF, pMF, etc. • Topic models: LDA, HDP, etc. • NN-based CF [Dziugaite & Roy, 2015; Wang et al., 2015; Kim et al., 2016; Seo et al., 2017]
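A minimal sketch of the latent factor idea (illustrative only, not Clova's implementation): approximate a sparse user-item matrix V as the product W × H of low-rank user and item factors, fitting only the observed entries by stochastic gradient descent. All numbers below are a toy example.

```python
import numpy as np

def factorize(V, k=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Approximate V (users x items) as W @ H with k latent factors.

    Only observed entries (V > 0) contribute to the squared-error loss;
    unknown slots are skipped, which is what makes latent factor models
    usable on sparse user-item matrices.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = V.shape
    W = 0.1 * rng.standard_normal((n_users, k))   # user factors
    H = 0.1 * rng.standard_normal((k, n_items))   # item factors
    rows, cols = np.nonzero(V)
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            wu = W[u].copy()                      # keep old value for H's update
            err = V[u, i] - wu @ H[:, i]
            W[u] += lr * (err * H[:, i] - reg * wu)
            H[:, i] += lr * (err * wu - reg * H[:, i])
    return W, H

# Toy user-item matrix: 0 means unobserved.
V = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
W, H = factorize(V)
pred = W @ H  # dense reconstruction also fills in the unknown slots
```

The filled-in entries of `pred` are the model's predictions for the unobserved user-item pairs, which is exactly what a recommender ranks.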
  10. Collaborative Filtering Challenges • Sparsity: most slots of the user-item matrix are unknown • Scalability: real-world user and item counts • Synonyms: same or similar items under different indices • Gray sheep: inconsistent users who do not reflect the true distribution • Shilling attacks: unfair ratings • Diversity and the long tail: features are untrustworthy for items with few logs (long-tail data) • Cold-start problem
  11. How to Hybridize • Hybridize with content-based methods (user / item / etc.) • The hybrid concept is not new, but deep learning makes it valuable • Text, image, video, audio, music, webtoon, etc. • CNN, RNN, DNN • Requires a safety logic • Descriptions are valuable
  12. Neural Hybrid Models • ConvMF [Kim et al., 2016] • More and more NN-based hybrid models
  13. Remaining Challenges • Ambiguous objective function • Changing user preferences • Gap between model metrics and user satisfaction: correct now, wrong at another moment • Still requires post-processing
  14. 2. Recommendation in Clova Music - Playing log analysis - Musical semantic embedding - MF-based personalization - Challenges
  15. AI Speaker vs. Conventional Platforms • WAVE: a Clova-inside AI speaker • Playing music is the key feature of AI speakers • Compare usage patterns of the AI speaker with those of conventional platforms
  16. Top Music Sentences • 노래 틀어줘 (Play a song) • 자장가 틀어줘 (Play a lullaby) • 동요 틀어줘 (Play children's songs) • 신나는 노래 틀어줘 (Play an upbeat song) • 조용한 노래 틀어줘 (Play a quiet song) • 핑크퐁 노래 틀어줘 (Play Pinkfong songs) • 아이유 노래 틀어줘 (Play IU songs) • 클래식 틀어줘 (Play classical music) • 분위기 좋은 음악 틀어줘 (Play music with a good mood) • 잔잔한 음악 틀어줘 (Play calm music) • 발라드 틀어줘 (Play ballads) • 팝송 틀어줘 (Play pop songs) • Artists rather than tracks • Genres, moods, and themes rather than artists • Just "play" rather than genres
  17. Playing Pattern: Hour of Day (figure: playing ratio by hour of day for NAVER_APP, NAVER_PC, WAVE, and CLOVA_APP)
  18. Playing Pattern: Genre (figure: playing ratio by genre, NAVER_APP vs. WAVE, over 가요 (K-pop), 기능성음악 (functional music), 팝 (pop), 동요 (children's songs), OST, 클래식 (classical), 재즈 (jazz), 종교음악 (religious music), 일렉트로니카 (electronica), 락 (rock), 힙합 (hip-hop), 기타 (others))
  19. Playing Pattern: Genre (same figure as slide 18) • The differences may be caused by home environments
  20. Playing Pattern: Long-tail Distribution • Artists vs. play-count ratio on a log-log scale graph • Long-tail distribution • No difference in the overall distribution, but…
  21. Playing Pattern: Long-tail Distribution (table: top artists by playing ratio; WAVE is led by kids' and calm content such as 핑크퐁, 동요, 별하나 동요, 자장가, 트니트니, 이루마, 오르골뮤직, 힐링 피아노, alongside 아이유, EXO, 윤종신, 볼빨간사춘기, 젝스키스, 헤이즈, 성시경; the NAVER Music app is led by idol and chart artists such as EXO, 아이유, 젝스키스, 방탄소년단, 뉴이스트(NU`EST), Wanna One, 윤종신, 우원재, 볼빨간사춘기, 뉴이스트 W, 황치열, 헤이즈, 선미, WINNER)
  22. Implications • Paradigm shift in music consumption via AI speakers • New markets: kids; lean-out music; classical / jazz • Music recommendation will play a more important role on AI assistant platforms
  23. Clova Music Recommendation • Two main challenges • Lack of well-refined metadata for requests like "~~한 노래 틀어줘" ("Play a ~~ song"), e.g. 신나는 노래 (upbeat songs), 혼자 듣기 좋은 노래 (songs good for listening alone) • Personalized playlists that match the utterance and stay diverse
  24. Musical Semantic Embedding: Goal & Usages • Map musical items (tracks, artists, words) into the same semantic space as vectors of real numbers • Word2vec-style feature learning • Usages: item similarities; input to deep neural networks
  25. Musical Semantic Embedding: Word2Vec with Tagged Playlists • Embed keywords and tracks (artists) into the same space • JAMM data: user-created playlists with hashtags in NAVER Music, about 72,000 playlists • Treat a track ID as a symbol with semantics, just like a textual word
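A minimal sketch of the shared-space idea, with invented playlist data: treat each playlist as a "sentence" whose tokens mix hashtags and track IDs, so both land in one vector space. Co-occurrence counting plus a truncated SVD stands in here for the actual word2vec training; the tags and track IDs are hypothetical.

```python
import numpy as np
from itertools import combinations

# Hypothetical tagged playlists: hashtags and track IDs mixed as tokens.
playlists = [
    ["#autumn", "track:1001", "track:1002", "#calm"],
    ["#autumn", "track:1001", "track:1003"],
    ["#party", "track:2001", "track:2002", "#upbeat"],
    ["#upbeat", "track:2001", "track:2003"],
]

vocab = sorted({tok for pl in playlists for tok in pl})
idx = {tok: i for i, tok in enumerate(vocab)}

# Symmetric co-occurrence counts within each playlist.
C = np.zeros((len(vocab), len(vocab)))
for pl in playlists:
    for a, b in combinations(pl, 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Truncated SVD of the damped co-occurrence matrix yields embeddings
# in which tags and tracks are directly comparable.
U, S, _ = np.linalg.svd(np.log1p(C))
emb = U[:, :4] * S[:4]          # 4-dimensional vectors

def sim(a, b):
    """Cosine similarity between any two tokens (tag or track)."""
    va, vb = emb[idx[a]], emb[idx[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)
```

With this setup, a tag and a track that share playlists (e.g. `#autumn` and `track:1001`) come out closer than a tag and an unrelated track, which is what powers the "similar keywords and tracks" examples on the following slides.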
  26. Examples: '가을' (autumn) • Similar keywords and tracks
  27. Examples: '신나는' (upbeat) • Similar tracks
  28. Examples: 벚꽃엔딩 / 버스커 버스커 ('Cherry Blossom Ending' by Busker Busker) • Similar keywords and tracks
  29. Multimodal Semantic Embedding: Embedding Tracks with Session Data • Regard users' music-playing sequences as documents • To capture a track's multiple meanings, use multimodal word distributions formed from Gaussian mixtures [Ben Athiwaratkun and Andrew Gordon Wilson, Multimodal Word Distributions, 2017]
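The mixture idea can be sketched as follows, with all numbers invented: represent a track as a mixture of diagonal Gaussians so each component can capture one usage context, and compare two tracks with the expected likelihood kernel (the closed-form integral of the product of two Gaussians), as in the cited Multimodal Word Distributions paper.

```python
import numpy as np

def gauss_overlap(m1, v1, m2, v2):
    """Expected likelihood kernel of two diagonal Gaussians:
    the integral of N(x; m1, v1) * N(x; m2, v2) dx = N(m1; m2, v1 + v2)."""
    v = v1 + v2
    d = m1 - m2
    return np.exp(-0.5 * np.sum(d * d / v)) / np.sqrt(np.prod(2 * np.pi * v))

def mixture_sim(comps_a, comps_b):
    """Similarity of two items, each a list of (weight, mean, var) components."""
    return sum(wa * wb * gauss_overlap(ma, va, mb, vb)
               for wa, ma, va in comps_a
               for wb, mb, vb in comps_b)

# Toy 2-D example: track A has two senses (components); one sits near a
# lullaby-like track, the other near nothing in particular.
track_a = [(0.5, np.array([0.0, 0.0]), np.array([0.2, 0.2])),
           (0.5, np.array([5.0, 5.0]), np.array([0.2, 0.2]))]
lullaby = [(1.0, np.array([0.1, 0.0]), np.array([0.2, 0.2]))]
faraway = [(1.0, np.array([10.0, -10.0]), np.array([0.2, 0.2]))]
```

Because each component has its own neighborhood, a track like '밤편지' on the next slide can have two distinct similar-track lists, one per sense.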
  30. Examples: 밤편지 / 아이유 ('Through the Night' by IU) • Similar tracks for two different embeddings of '밤편지' (밤편지_1 and 밤편지_2)
  31. MF for Personalized Recommendation: Matrix Factorization • Select tracks and artists that the user prefers when generating a playlist • Simple but hard to apply well: evaluation; overfitting / underfitting; combining with other models
  32. MF for Personalized Recommendation: Tips • Relatively huge item set • Long-term batch learning for a compact model, plus short-term incremental learning • Proper regularization factors • Consider the item distribution when performing negative sampling • Remove abusive users (e.g. Top 100-only users) • Fundamental problem of collaborative filtering: the assumption that users with similar past preferences will keep having similar preferences; what happens when a user violates it? • Familiarity vs. discovery
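The tip about considering the item distribution during negative sampling can be sketched like this (illustrative; the 0.75 damping exponent is the word2vec convention, not a stated Clova choice): draw negatives with probability proportional to a damped play count, so chart hits are sampled as negatives often but the long tail is not ignored.

```python
import numpy as np

def negative_sampler(play_counts, power=0.75, seed=0):
    """Return a function that draws negative item indices with probability
    proportional to play_count ** power (damped popularity)."""
    p = np.asarray(play_counts, dtype=float) ** power
    p /= p.sum()
    rng = np.random.default_rng(seed)
    def draw(n, exclude=frozenset()):
        out = []
        while len(out) < n:
            i = int(rng.choice(len(p), p=p))
            if i not in exclude:   # never sample the user's observed items
                out.append(i)
        return out
    return draw

# Toy long-tail play counts: item 0 is a Top-100 hit, the rest are tail items.
counts = [10000, 100, 50, 10, 5]
draw = negative_sampler(counts)
negs = draw(1000)
```

Damping matters for the "Harry Potter effect" noted later: without it, the hit would absorb nearly every negative sample and tail items would never be contrasted against.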
  33. MF for Personalized Recommendation • Precision@30 for the top 10,000 users, shown as a sparkline chart (https://en.wikipedia.org/wiki/Sparkline)
  34. Remaining Challenges • Conventional problems: interactive recommendation; lean-in vs. lean-out; more personalized and context-aware recommendation; familiarity vs. discovery • Music recommendation for AI speakers: sparsity; the Harry Potter effect (Top 100); cold-start problems; explanatory recommendation
  35. 3. Music Content Modeling for Clova Music - Highlight extraction via CRAN (Ha et al., ICML ML4MD Workshop 2017) - Emotion recognition via MCRN (Jeon et al., RecSys 2017)
  36. Music Highlight Extraction (MHE) • Definition: extract a representative snippet from a given track • Motivation: why play only the first minute? Improved user experience and recommendations; more potential customers; valuable data sets
  37. Main Task in MHE • Finding significant snippets within a track is an interesting and valuable task • Given a track x, where should a highlight start? • Input: mel-spectrogram of x • Output: starting frame H of a highlight spanning frames H to H+S
  38. Structure of MHE [Ha et al., 2017] • Track file → mel-spectrogram → deep learning model (convolutional & pooling layers, recurrent layers, attention layer, fully connected & output layer) → candidate generation → candidate selection → highlight clip
  39. CRAN: Convolutional Recurrent Attention Networks • CRAN is the deep learning model in the MHE pipeline (track file → mel-spectrogram → CRAN → candidate generation → candidate selection → highlight clip)
  40. CRAN: Convolutional Recurrent Attention Networks • Architecture: multiple mel-channel 1D convolution & max-pooling layers over the input, feature concatenation with time separation, multiple bidirectional LSTM layers, an attention layer (softmax) multiplied element-wise with the LSTM outputs, channel summation, and a fully connected output layer • Input: mel-spectrogram with 128 mel bins, up to 4,000 frames, 0.061 s per frame • Output: 10 genres • Training: 32,083 tracks (80/10/10 split), Adam (LR = 0.005), 100 epochs
  41. Highlight Extraction via CRAN • Attention-weighted mel energy from frame n • Cumulative sum over S energy values • Speed / acceleration of the energy change • Highlight score of frame n • The highlight of length S starts at the frame with the highest score
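The scoring above can be sketched as follows (a simplified reading of the slide that omits the energy-change speed/acceleration term, with made-up attention and energy values): weight each frame's mel energy by its attention, take the sum over a window of S frames as that start frame's highlight score, and start the highlight at the best window.

```python
import numpy as np

def highlight_start(attention, mel_energy, S):
    """Pick the starting frame H of an S-frame highlight.

    The score of start frame n is the sum of attention-weighted energy
    over frames n .. n+S-1; a cumulative-sum trick makes every window
    sum available in O(1) per frame.
    """
    weighted = attention * mel_energy            # attention-weighted energy
    csum = np.concatenate([[0.0], np.cumsum(weighted)])
    scores = csum[S:] - csum[:-S]                # window sum for every start
    return int(np.argmax(scores))

# Toy track: 100 frames with an energetic, highly-attended chorus at 40-59.
rng = np.random.default_rng(0)
energy = rng.random(100) * 0.1
energy[40:60] += 1.0
attn = np.full(100, 0.01)
attn[40:60] = 0.05
H = highlight_start(attn, energy, S=20)          # lands on the chorus
```

The attention here would come from the genre-trained CRAN, so frames the classifier attends to act as a proxy for "representative" audio.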
  42. Highlight Extraction via CRAN (figure: attention and mel-energy curves over a track, with the selected highlight start H)
  43. Data Description • 32,083 full tracks with genres tagged as multi-labels, chosen by play frequency from December 2016 to January 2017 • 10 representative genres • The Korean music market has a strongly biased taste toward specific genres (ballad, dance, hip-hop, and R&B) • Idol musicians make up the majority of K-pop (figure: genre distribution for all tracks, the popular top 10%, and the newly released top 10%)
  44. Extraction Results • Quantitative evaluation: ground-truth highlights of 280 tracks from 8 experts; metric: overlap length with the ground truth • Qualitative evaluation: explicit scoring by 8 experts on a [1, 5] scale • Results • Web demo
  45. Genre Similarity Results • Confusion matrix of classifying 3 genres per track
  46. Musical Emotion Recognition • Definition: classify the emotional characteristics of a given track • Motivation
  47. MCRN: Multimodal Convolutional Recurrent Networks • Overview: the model directly predicts a track's polarity (positive / negative) from audio and lyrics, using an audio branch and a lyrics branch • Layer (1), input: audio as (128, 1024) mel-spectrograms; lyrics as (27496, 400) padded word vectors • Layer (2): conv layers in the audio branch (ELU activation) • Layer (3): RNN layers in the audio branch (output dim 64) • Layer (4): conv layers in the lyrics branch (ELU activation, output dim 64) • Layer (5), output: merge the two branches by concatenation (ReLU activation) to produce the final output
  48. Data Description • NAVER Music Polarity Emotion Dataset: 7,484 tracks (pos : neg = 1 : 1) with polarity emotion labels (positive or negative) • Lyrics: (27496, 400) word vectors (|V| = 27,496) • Mel-spectrograms: 128 mel bins × 1,024 time slots (= 1 min) • Dataset creation: separate positive / negative emotion tags among NAVER JAMM editors' tags using a polarity emotion word dictionary (editors' tags are more reliable than users' tags); filter out tracks whose tags include both positive and negative words; reject tracks shorter than one minute or with lyrics of fewer than 30 words; use the first minute of each mel-spectrogram; keep only nouns, verbs, adjectives, and adverbs from the lyrics
  49. MER Results (accuracy) • Audio: CNN 0.6479, RNN 0.6303, MCRN 0.6619 • Lyrics: CNN 0.7815, RNN 0.7716 • Both (audio + lyrics): MCRN + CNN 0.8046
  50. MER Results
  51. MER Results: Lyrics Cloud • Word clouds of positive and negative lyrics
  52. 4. Discussion and Future Work
  53. A Smarter AI DJ Assistant? • More proactive interaction • Human-in-the-loop • Truly context-aware • Truly personalized • Balance between familiarity and novelty
  54. Future Directions • More coverage of users' intents • Proactive interaction • More visualization and explanation • Rich intelligent services • Discovery • From smart to touching
  55. Concluding Remarks • Importance of human evaluation: numerical model metrics alone are untrustworthy • A/B tests with a proper evaluation design • Do not trust user logs blindly • Rich curation data from human experts • Maybe reinforcement learning? • Conferences: KDD, RecSys, CIKM, AAAI, IJCAI, ICDM, …
  56. Q & A
  57. Thank you
