An up-to-date overview of our recent research on music/audio and AI. It contains four parts:
* AI Listener: source separation (ICMLA'18a) and sound event detection (IJCAI'18)
* AI DJ: music thumbnailing (TISMIR'18) and music sequencing (AAAI'18a)
* AI Composer: melody generation (ISMIR'17), lead sheet generation (ICMLA'18b), multitrack pianoroll generation (AAAI'18b), and instrumentation generation (arXiv)
* AI Performer: CNN-based score-to-audio generation (AAAI'19)
Machine Learning for Creative AI Applications in Music (Nov 2018)
1. Machine Learning for
Creative AI Applications
in Music
Music and AI Lab,
Research Center for IT Innovation,
Academia Sinica
Yi-Hsuan Yang Ph.D.
http://www.citi.sinica.edu.tw/pages/yang/
yang@citi.sinica.edu.tw
2. About Us
• Academia Sinica
The national academy of Taiwan, founded in 1928
About 1,000 full/associate/assistant researchers
Located in Nangang District, Taipei City
• Music and AI Lab (musicai)
Members: research assistants and (co-advised) PhD/master students
Application-oriented research: machine learning & music
3. ML in Music: “Music Info Retrieval/Analysis”
Music transcription (audio2score): from the audio of an existing song to its score
• audio → note (pitch, onset, offset)
• audio → instrument (flute, cello)
• audio → meter (4/4)
• audio → key (E-flat major)
4. ML in Music: “Music Info Retrieval/Analysis”
Music semantic labeling: from the audio of an existing song to labels
• audio → genre (classical)
• audio → emotion (yearning)
• audio → other attributes (slow/fast)
Transcription and semantic labeling together form the “AI listener,” with applications in music retrieval, education, archival, etc.
5. ML in Music: “Music Generation/Synthesis”
• AI composer: random seed → score (a new song)
• AI performer (score2audio): score → audio
6. ML in Music: “Music Generation/Synthesis”
• AI DJ: from the audio features of existing songs (extracted by the AI listener) to the audio of a new song: remix, mashup, etc.
(image from the Internet)
7. Recap
• ML in Music
Music information retrieval/analysis
AI listener
Music transcription (audio → score)
Music semantic labeling (audio → label)
For analyzing and indexing existing songs
Music generation/synthesis
AI composer (random seed → score)
AI performer (score → audio)
AI DJ (existing songs → new song)
For creating new music
8. ML for Creative AI Applications in Music
• AI Listener
• AI DJ
• AI Composer
• AI Performer
9. AI Listener: Source Separation
• “Demix” the music signal
input: audio mixture
output: individual tracks
(image from the Internet)
11. AI Listener: Source Separation
• “Demix” the music signal
• Applications
Music production, DJ-related skills
Singing voice processing, karaoke, soundtracks for movies
Smart headsets, smart loudspeakers
Education
• Extensions
Multi-instrument separation, speech separation
Melody extraction, beat estimation
13. Algorithm 2/4: Main Idea
• Train one denoising autoencoder (DAE) per target source
DAE1: mixture → vocal
DAE2: mixture → drum
DAE3: mixture → others
• Training data
Demixing Secrets Dataset (DSD): 100 Western pop songs with multi-track versions (vocals, drums, bass, others)
No Chinese or Japanese pop songs at all in DSD
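The per-source idea above can be sketched as spectrogram masking: each trained model estimates its source's magnitude, and Wiener-style soft masks recombine the estimates so the separated sources stay consistent with the mixture. A minimal numpy sketch, with toy scaling lambdas standing in for the trained DAEs (the shapes and stand-in models are illustrative, not the actual system):

```python
import numpy as np

def soft_masks(source_estimates, eps=1e-8):
    """Wiener-style soft masks from per-source magnitude estimates.

    source_estimates: array of shape (n_sources, freq, time).
    Returns masks of the same shape that sum to ~1 at every TF bin."""
    total = source_estimates.sum(axis=0, keepdims=True) + eps
    return source_estimates / total

def separate(mixture_mag, estimators):
    """Apply each per-source estimator (e.g. a trained DAE) to the
    mixture spectrogram, then mask the mixture with the soft masks."""
    est = np.stack([f(mixture_mag) for f in estimators])
    return soft_masks(est) * mixture_mag   # (n_sources, freq, time)

# Toy stand-ins for trained DAEs (hypothetical; real ones are neural nets)
rng = np.random.default_rng(0)
mix = rng.random((513, 100))           # |STFT| of the mixture
dae_vocal  = lambda m: 0.5 * m          # pretend vocals take half the energy
dae_drums  = lambda m: 0.3 * m
dae_others = lambda m: 0.2 * m

sources = separate(mix, [dae_vocal, dae_drums, dae_others])
print(np.allclose(sources.sum(axis=0), mix))  # masked sources sum to mixture
```

Because the masks sum to one per time-frequency bin, the separated tracks always add back up to the input mixture, which is one reason masking is preferred over predicting each source freely.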
15. Algorithm 4/4: Architecture
• U-net
Encoder: Conv2D
Decoder: Deconv2D
Skip connections
• allow low-level information to flow directly from the high-resolution input to the high-resolution output (at the corresponding hierarchy)
[1] “U-Net: Convolutional networks for biomedical image segmentation,” arXiv 2015
[2] “Singing voice separation with deep u-net convolutional networks,” ISMIR 2017
(figure from [2])
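A shapes-only sketch of the skip-connection idea, with average pooling and nearest-neighbour upsampling standing in for the Conv2D/Deconv2D layers (no learning involved; `down` and `up` are illustrative helpers, not the papers' code):

```python
import numpy as np

def down(x):                       # 2x2 average pooling (encoder step)
    f, t = x.shape
    return x.reshape(f // 2, 2, t // 2, 2).mean(axis=(1, 3))

def up(x):                         # nearest-neighbour upsampling (decoder step)
    return x.repeat(2, axis=0).repeat(2, axis=1)

x0 = np.random.rand(64, 64)        # input spectrogram patch
x1 = down(x0)                      # 32x32 encoder activation
x2 = down(x1)                      # 16x16 bottleneck

d1 = up(x2)                        # decode back to 32x32
d1 = np.stack([d1, x1])            # skip connection: concat encoder activation
d0 = up(d1.mean(axis=0))           # decode back to 64x64 ("conv" = mean here)
d0 = np.stack([d0, x0])            # skip connection at full resolution

print(d0.shape)  # (2, 64, 64): decoder output carries the high-res input
```

The key point is the `np.stack` lines: at each resolution the decoder sees the encoder's activation of the same size, so fine detail never has to squeeze through the bottleneck.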
16. Evaluation Campaign: SiSEC 2018
(figure: SiSEC 2018 results comparing our submissions, the Sony team’s, and the oracle)
The “Sony” team: Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji, Stefan Uhlich, et al.
17. AI Listener: Sound Event Detection
• Applications
Surveillance
Self-driving cars
Industry 4.0
Healthcare
AIoT
Smart city
• Strength
Sound (ears) is complementary to vision (eyes)
Can work well even in a dim environment, or when the event is far from the camera
19. ML for Creative AI Applications in Music
• AI Listener
• AI DJ
• AI Composer
• AI Performer
(diagram: the AI listener extracts audio features from existing songs; the AI DJ turns them into the audio of a new song)
20. AI DJ
• Smart speaker + recommendation + DJ skills
21. DJ Skill #1: Music Thumbnailing
• Extract music highlights, e.g., a 30-sec highlight of a song
• Applications: music browsing, ringtone generation
• Related papers published by NAVER Corp
“Automatic DJ mix generation using highlight detection,” ISMIR 2017
“Automatic music highlight extraction using convolutional recurrent attention networks,” arXiv:1712.05901
22. Algorithm
• CNN for emotion prediction + attention (predicting the weights of different parts of a song)
• Transfer learning: no need for structural (chorus) labels
TISMIR’18: “Pop music highlighter: Marking the emotion keypoints.” Open source!
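Once attention scores are predicted, extracting the highlight reduces to a sliding-window maximization: pick the 30-second span with the largest total score. A hedged sketch with made-up scores (`pick_highlight` is an illustrative helper, not the released code):

```python
import numpy as np

def pick_highlight(scores, length=30):
    """Return (start, end) of the `length`-second window maximizing the
    summed attention score, via a cumulative-sum sliding window."""
    c = np.concatenate([[0.0], np.cumsum(scores)])
    window_sums = c[length:] - c[:-length]   # sum of scores in each window
    start = int(np.argmax(window_sums))
    return start, start + length

scores = np.zeros(240)          # a 4-minute song, one score per second
scores[95:125] = 1.0            # pretend the chorus drew all the attention
print(pick_highlight(scores))   # → (95, 125)
```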
24. DJ Skill #2: Music Sequencing
• Find an ordering of music pieces
“Automatic playlist sequencing and transitions,” Proc. ISMIR 2017 (from )
25. DJ Skill #2: Music Sequencing
• Find an ordering of music pieces
“Automatic playlist sequencing and transitions,” Proc. ISMIR 2017 (from )
“Generating music medleys via playing music puzzle games by unsupervised similarity embedding” (from MAC Lab)
► Demo: https://remyhuang.github.io/
26. Algorithm 1/2: Music Puzzle Games
• Divide a song into non-overlapping chunks
• Learn to order them with a Siamese CNN
Positive pairs: R1R2, R2R3
Negative pairs: R2R1, R3R2, R1R3, R3R1
Unsupervised (self-supervised) learning
AAAI’18a: “Generating music medleys via playing music puzzle games”
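The pair construction above can be sketched directly: adjacent in-order chunks are positives, every other ordered pair is a negative, and the original song order is the only supervision needed. Chunk names follow the slide; `make_pairs` is an illustrative helper, not the paper's code:

```python
def make_pairs(chunks):
    """Adjacent in-order chunk pairs are positives; every other ordered
    pair is a negative. No human labels are needed: the song's own
    chunk order is the supervision signal."""
    pos = [(chunks[i], chunks[i + 1]) for i in range(len(chunks) - 1)]
    all_pairs = [(a, b) for a in chunks for b in chunks if a != b]
    neg = [p for p in all_pairs if p not in pos]
    return pos, neg

pos, neg = make_pairs(["R1", "R2", "R3"])
print(pos)  # [('R1', 'R2'), ('R2', 'R3')]
print(neg)  # [('R1', 'R3'), ('R2', 'R1'), ('R3', 'R1'), ('R3', 'R2')]
```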
27. Algorithm 2/2: Similarity Embedding Net
• Divide a song into non-overlapping chunks
• Siamese CNN + similarity embedding, e.g., recovering the correct order [a b c d] from a shuffled [d c a b]
AAAI’18a: “Generating music medleys via playing music puzzle games.” Open source!
► Demo: https://remyhuang.github.io/
28. Result
• For 8-piece puzzle games, our model reaches 99.0% pairwise accuracy (PA) and 96.1% overall accuracy (OA)
• For medleys, we reach 75.0% OA
(figure: our method vs. two baselines)
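A sketch of how pairwise accuracy might be computed; the definition below (fraction of chunk pairs placed in the correct relative order) is a common one and may differ in detail from the paper's exact metric:

```python
from itertools import combinations

def pairwise_accuracy(pred, truth):
    """Fraction of chunk pairs whose relative order in `pred` matches
    their relative order in `truth` (a hypothetical PA definition)."""
    rank = {c: i for i, c in enumerate(pred)}
    pairs = list(combinations(truth, 2))    # (earlier, later) in true order
    good = sum(rank[a] < rank[b] for a, b in pairs)
    return good / len(pairs)

truth = [1, 2, 3, 4]
print(pairwise_accuracy([1, 2, 3, 4], truth))  # 1.0: perfect ordering
print(pairwise_accuracy([2, 1, 3, 4], truth))  # 5/6: one swapped pair
```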
32. AI Composer
• Create music
• Why?
Make musicians’ lives easier
Create copyright-free music (for films, ads, games)
A classic AI problem
(demo: Eminem - When I’m Gone)
33. Our Research on “AI Composer”
• Collaboration with musicians & producers from KKBOX
• Projects
MidiNet: melody generation (ISMIR’17) --- already cited 53 times
Melody harmonization
Lead sheet generation
Drum VAE
MuseGAN: multitrack music generation (AAAI’18) --- 206 GitHub stars
Multi-track music generation using binary neurons (ISMIR’18)
Lead sheet arrangement and interpolation (ICMLA’18)
Automatic instrumentation arrangement
Emotion-based music generation
34. Lead Sheet Generation
• A lead sheet = melody + chords
• Given the chords, generate a melody
• Given a melody, generate the chords (a.k.a. harmonization)
• Or generate both from scratch
35. Melody Generation by RNN

                          Google      C-RNN-GAN   Song from PI    DeepBach       Google
                          MelodyRNN                                              WaveNet
core model                RNN         RNN         RNN             RNN            CNN
data type                 symbolic    symbolic    symbolic        symbolic       audio
genre specificity         ─           ─           ─               Bach chorale   ─
mandatory prior           priming     ─           music scale &   melody of      priming
knowledge                 melody                  melody profile  one part       wave
follow a priming melody   V           ─           ─               V              V
follow a chord sequence   ─           ─           ─               ─              ─
generate from scratch     ─           V           ─               ─              ─
generate multi-part music ─           ─           V               V              V
open source               V           ─           ─               V              ─
36. Melody Generation by CNN+GAN

                          Google      MidiNet     Google
                          MelodyRNN               WaveNet
core model                RNN         CNN         CNN
data type                 symbolic    symbolic    audio
genre specificity         ─           ─           ─
mandatory prior           priming     ─           priming
knowledge                 melody                  wave
follow a priming melody   V           V           V
follow a chord sequence   ─           V           ─
generate from scratch     ─           V           ─
generate multi-part music ─           V           V
open source               V           V           ─
• Google MelodyRNN: by Google; an RNN trained with thousands of melodies
• MidiNet (ISMIR’17): by MAC Lab; a CNN trained on 526 tabs (4,208 bars) with one GPU (GTX 1080) in <30 mins
37. Algorithm: Desired Output
• Generate the melody one bar at a time
• Use a matrix to represent the music of a bar: 84 notes × 96 time steps
• Condition on the previous bar (the history)
(figure: piano-roll matrices for the previous, current, and next bars)
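The 84 × 96 bar matrix can be sketched as follows; the pitch offset and helper name are illustrative assumptions, not the authors' code:

```python
import numpy as np

N_PITCH, N_STEP = 84, 96
PITCH_OFFSET = 24   # assume MIDI pitches 24..107 map to rows 0..83

def bar_to_matrix(notes):
    """Build the per-bar piano-roll matrix described above.

    notes: list of (midi_pitch, onset_step, offset_step) within one bar;
    cells are 1 while a note is active, 0 otherwise."""
    m = np.zeros((N_PITCH, N_STEP), dtype=np.int8)
    for pitch, on, off in notes:
        m[pitch - PITCH_OFFSET, on:off] = 1
    return m

# a C4 quarter note followed by an E4 quarter note (24 steps each)
bar = bar_to_matrix([(60, 0, 24), (64, 24, 48)])
print(bar.shape, int(bar.sum()))  # (84, 96) 48
```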
38. Algorithm: Main Idea
• Generative adversarial nets (GAN)
Discriminator: tell real from fake (“real or fake?”)
Generator: fool the discriminator
• Generate from scratch
• Or, given the chords, generate a melody
40. Algorithm: Temporal Model
• Conditioner: provides 2-D conditions
Uses the same filter shapes as the generator CNN, so that their intermediate outputs are “compatible”
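A shapes-only sketch of why matching filter shapes make the conditioner's intermediate outputs "compatible" with the generator's: at every depth the two maps have the same spatial size and can be concatenated. Pooling and upsampling stand in for the actual conv layers; all names and sizes here are illustrative:

```python
import numpy as np

def halve(x):   # stand-in for one conv layer that halves each dimension
    p, t = x.shape
    return x.reshape(p // 2, 2, t // 2, 2).mean(axis=(1, 3))

prev_bar = np.random.rand(84, 96)      # condition: the previous bar
cond_maps = [prev_bar]
for _ in range(2):                     # conditioner intermediate outputs:
    cond_maps.append(halve(cond_maps[-1]))   # (84,96) -> (42,48) -> (21,24)

# The generator works upward from a coarse map; at each depth it can
# concatenate the conditioner map of the same spatial size.
gen = np.random.rand(21, 24)           # coarsest generator map
for c in reversed(cond_maps[:-1]):     # (42,48), then (84,96)
    gen = gen.repeat(2, axis=0).repeat(2, axis=1)   # upsample
    gen = np.stack([gen, c]).mean(axis=0)           # "concat + conv" stand-in

print(gen.shape)  # (84, 96): full-resolution bar, conditioned at every level
```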
41. Algorithm
• Generative adversarial nets (GAN)
• We don’t know what the “desired output” should be (for example, what should be played next)
We only know whether it “sounds real”
The model learns the mapping between the two spaces
46. Algorithm: Data
• LPD dataset: 128K MIDI piano-rolls derived from the Lakh MIDI Dataset (LMD)
http://colinraffel.com/projects/lmd/
https://salu133445.github.io/musegan/dataset
47. Algorithm: Intra- & Inter-track
• Multi-track: piano, guitar, bass, strings, drums
• Hybrid model
one “shared” (inter-track) z
five “private” (intra-track) zi
five generators
one discriminator
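The shared/private latent design can be sketched in a few lines: each track's generator input combines one z common to all tracks (inter-track coherence) with a private z of its own (intra-track variety). Dimensions are illustrative, and the generators themselves are omitted:

```python
import numpy as np

TRACKS = ["piano", "guitar", "bass", "strings", "drums"]
DIM = 32
rng = np.random.default_rng(0)

z_shared = rng.normal(size=DIM)                        # one "shared" (inter) z
z_private = {t: rng.normal(size=DIM) for t in TRACKS}  # five "private" (intra) z_i

# Each of the five generators sees [z_shared ; z_private[t]]: the shared
# part keeps the tracks harmonically coherent, the private part lets each
# instrument vary on its own.
track_inputs = {t: np.concatenate([z_shared, z_private[t]]) for t in TRACKS}

print(len(track_inputs), track_inputs["drums"].shape)  # 5 (64,)
```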
55. AI Composer: Open Source Code
• MidiNet:
https://github.com/RichardYang40148/MidiNet
• DrumVAE:
https://github.com/vibertthio/drum_vae_server
• MuseGAN: https://github.com/salu133445/musegan
• BMuseGAN: https://salu133445.github.io/bmusegan/
• Pypianoroll:
https://github.com/salu133445/pypianoroll
• Lead sheet arrangement:
https://github.com/liuhaumin/LeadsheetArrangement
56. Music and AI
• AI Listener
• AI DJ
• AI Composer
• AI Performer
57. AI Performer
• Generate expressive music audio from a score
• Performance “brings music to life”
• Existing work focuses mainly on the piano
58. AI Performer
• Direct score-to-waveform synthesis (2-D score → 1-D waveform) is hard
• Another idea: learn a score-to-spectrogram correspondence (2-D → 2-D)
60. ContourNet
• Challenge 1: different input/output dimensions
Use an asymmetric U-net
• Challenge 2: hard to control note duration
Encode additional onset/offset information
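The onset/offset idea can be sketched as extra input channels: besides the sustain piano-roll, the model receives explicit onset and offset markers, so note boundaries are unambiguous. The exact channel layout below is an illustrative assumption, not ContourNet's actual I/O:

```python
import numpy as np

def encode(notes, n_pitch=84, n_step=96):
    """notes: list of (pitch_row, onset_step, offset_step).
    Returns a 3-channel tensor: [sustain, onset, offset]."""
    x = np.zeros((3, n_pitch, n_step), dtype=np.float32)
    for p, on, off in notes:
        x[0, p, on:off] = 1        # sustain: 1 while the note is held
        x[1, p, on] = 1            # onset marker
        x[2, p, off - 1] = 1       # offset marker
    return x

# Two consecutive same-pitch notes: in the sustain roll alone they would
# merge into one long note; the onset channel keeps them distinct.
x = encode([(40, 0, 24), (40, 24, 48)])
print(x[0, 40, :48].sum(), x[1, 40].sum())  # 48.0 2.0
```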
65. Wrap-Up
• AI Listener
Create the singing-only version of songs
Sound event detection
• AI DJ
Create DJ skills such as thumbnailing and sequencing
• AI Composer
Create lead sheets or multi-track piano-rolls
• AI Performer
From score to audio
66. Conclusion
(diagram: the complete picture: the AI listener analyzes existing songs via music transcription (audio2score) and music semantic labeling; the AI composer (from a random seed), the AI performer (score2audio), and the AI DJ create new songs)