SEMINARS
OF
SEMESTER – II
[ YEAR 2013-2014 ]
NAME: SHITAL KATKAR
TOPIC : Query By Humming
SIGNATURE:________________
INDEX
1 Introduction
1.1 Query By Humming
2 Basic Architecture
2.1 Extraction
2.2 Transcription
2.3 Comparison
3 Applications
3.1 Shazam
3.2 SoundHound
3.3 Midomi
3.4 Musipedia
4 The art of Singing
4.1 Challenges
5 File Formats
5.1 Wav File format
5.2 MIDI File format
6 System Architecture
6.1 Wav to MIDI conversion
7 Parsons Code algorithm
7.1 Rules
7.2 Advantages
8 Benchmarking MIR System
8.1 Online MIR System
8.1.1 CatFind
8.1.2 MelDex
8.1.3 MelodyHound
8.1.4 ThemeFinder
8.1.5 Music Retrieval Demo
8.2 Comparison of MIR System
8.3 Evaluation Issues
8.4 Subjective and objective testing
9 Conclusion
1. INTRODUCTION
Many people remember a short tidbit of a song but fail to recall the song's name. If
you can remember lyrics from the song you are trying to recall, finding the
song is as easy as performing a text query on a web search engine. A query by humming
system allows a user to find a song even if he merely knows the tune of part of the
melody.
• “I don’t know the name. I don’t know who does it.
• But I can’t get this song out of my head.”
• Well, why not just hum it?
Query by humming System
It is a music retrieval technology in which users can hum or sing a melody to retrieve the
song.
The user simply sings or hums the tune into a computer microphone, and the system
searches through a database of songs for melodies containing the tune and returns a ranked
list of search results. The user can then find the desired song by listening to the results.
A Query by Humming (QBH) system enables a user to hum a melody into a microphone
connected to a computer in order to retrieve a list of possible song titles that match the
query melody. The system analyzes the melodic and rhythmic information of the input
signal. The extracted data set is used as a database query. The result is presented as a list of
e.g. ten best matching results.
Generally, a QBH system is a Music Information Retrieval (MIR) system. An MIR system
provides several means of music retrieval: the query can be a hummed audio signal, but also
a music genre classification or text information about the artist or title.
2. BASIC ARCHITECTURE
Fig- Basic System Architecture
The basic architecture of the system is depicted in the above figure. A microphone takes the
hummed input and sends it as a PCM signal to the extraction block. The extracted
information is passed to the transcription block, which forms a melody
contour to be compared with all contours residing in the database. A result list is finally
presented to the user.
Extraction
The extraction block is also referred to as the acoustic front end. After the signal is recorded
with a computer sound card, it is band-pass filtered to reduce environmental noise
and distortion. In this system a sampling rate of 8000 Hz is used. The signal is band-limited
to 80 to 800 Hz, which is sufficient for sung input. This frequency range corresponds roughly
to a musical note range of D2–G5.
Transcription
The transcription block transcribes the extracted information into the representation that is
needed for comparison. The main task is to segment the input stream into single notes. This
can be done using the Parsons code algorithm.
Comparison
The transcription result is used as a database query. Several distance measures can be used
to find a similar piece of music. The database contains a collection of already transcribed
melodies formatted according to the MelodyContourType.
The result is finally presented to the user.
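One concrete distance measure that tolerates humming errors is the edit (Levenshtein) distance between contour strings. The sketch below is a generic illustration, not any particular system's matcher:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# A short hummed contour against a longer stored contour of the same tune:
print(edit_distance("*RURURDD", "*RURURDDRDRDRD"))  # 6 (six trailing symbols missing)
```

Stored melodies can then be ranked by this distance, smallest first.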
3. APPLICATIONS
These are some examples of QBH Systems.
Shazam
Shazam is a commercial mobile phone-based music identification service. The company was
founded in 1999 by Chris Barton, Philip Inghelbrecht, Avery Wang and Dhiraj Mukherjee.
Shazam uses a mobile phone's built-in microphone to gather a brief sample of music being
played. An acoustic fingerprint is created based on the sample, and is compared against a
central database for a match. If a match is found, information such as the artist, song title,
and album are relayed back to the user.
Shazam can identify prerecorded music being broadcast from any source, such as a radio,
television, cinema or club, provided that the background noise level is not high enough to
prevent an acoustic fingerprint being taken, and that the song is present in the software's
database.
SoundHound
SoundHound (known as Midomi until December 2009) is a mobile device service that allows
users to identify music by humming, singing or playing a recorded track. The service was
launched by Melodis Corporation (now SoundHound Inc), under Chief Executive Keyvan
Mohajer in 2007 and has received funding from Global Catalyst Partners, TransLink Capital
and Walden Venture Capital.
SoundHound is a music search engine available on the Apple App Store, Google Play, and the
Windows Phone Store, and, as of June 5, 2013, on the BlackBerry 10 platform. It
enables users to identify music by playing, singing or humming a piece. It is also possible to
speak or type the name of the artist, composer, song and piece. Unlike competitor Shazam,
SoundHound can recognise tracks from singing, humming, speaking, or typing, as well as
from a recording. Sound matching is achieved through the company's 'Sound2Sound'
technology, which can match even poorly-hummed performances to professional
recordings.
Midomi
Midomi is a music search tool: sing, hum, or whistle to instantly find your
favorite music and connect with a community that shares your musical interests.
At midomi you can create your own profile, sing your favorite songs, share them with
your friends, and get discovered by other midomi users. You can listen to and rate other
users' musical performances, see their pictures, send them messages, buy original music,
and more.
midomi features an extensive digital music store with a growing collection of more than two
million legal music tracks. You can listen to samples of original recordings, buy the full studio
versions directly from midomi, and play them on your Windows computer or compatible
music players.
Musipedia
Musipedia is a search engine for identifying pieces of music. This can be done by whistling a
theme, playing it on a virtual piano keyboard, tapping the rhythm on the computer
keyboard, or entering the Parsons code. Anybody can modify the collection of melodies and
enter MIDI files, bitmaps with sheet music, lyrics or some text about the piece, or the
melodic contours as Parsons Code.
Musipedia's search engine works differently from that of search engines such as Shazam.
The latter can identify short snippets of audio (a few seconds taken from a recording), even
if it is transmitted over a phone connection. Shazam uses Audio Fingerprinting for that, a
technique that makes it possible to identify recordings. Musipedia, on the other hand, can
identify pieces of music that contain a given melody. Shazam finds exactly the recording that
contains a given snippet, but no other recordings of the same piece.
4. THE ART OF SINGING
It is obvious that people have imperfect memories for melodies or may lack any formal
singing practice.
1. People sing any part of the melody. A repetitive melodic passage in a song may represent
the ’hook line’ of the song that ’gets stuck in people’s heads’.
2. People sing in the wrong key. People choose a random pitch to start their singing. Only for
their most favorite songs are people thought to have a latent ability of absolute pitch.
3. People sing at a reasonably correct global tempo. People knew, or had a feeling from
previous hearings, what the correct tempo was and were able to approach this tempo
reasonably accurately. Even so, it is not possible to sing at exactly the correct tempo.
4. People sing too many or too few notes. Human memory is too imperfect to recall all
pitches in the right order. People sang just the line they remembered. They also added all
kinds of ornaments (e.g., grace notes or filler notes) to beautify their singing or to
ease the muscular motor processes involved in singing.
5. People sing the wrong intervals or confuse some with others. People sang about 59% of
the intervals correctly, though there were differences due to singing experience, song
familiarity, and recent song exposure. Interval confusion seems to be symmetric:
interchanging an interval with another was found to be equally likely as the other way
around. Large intervals (thirds and larger) tend to be more easily interchanged for one another.
6. People sing the contour reasonably accurately. People largely knew when to go up and
when to go down in pitch when singing; they did so correctly about 80% of the time.
7. People with singing experience sing better in some respects than people without singing
experience. The non-experienced and experienced singers did not differ in singing the
contour of a melody accurately. However, experienced singers reproduced proportionally
more correct intervals and sang with better timing.
8. People sing familiar melodies better than less familiar ones. Less familiar melodies were
reproduced with fewer notes and had proportionally fewer correct intervals than familiar
melodies. Also, both experienced and non-experienced singers improved their singing of
intervals when they had heard the melody very recently.
4.1 CHALLENGES
Building such a system, however, presents some significantly greater challenges than
creating a conventional text-based search engine. Unlike lyrical content, there exists no
intuitively obvious way to represent and store melodic content in a database. The chosen
representation must be indexable for efficient searching. Furthermore, several issues
unique to query by humming systems pose significant challenges to creating an efficient and
accurate music search system.
1. Users may not make perfect queries. Even if a user has a perfect memory of a particular
tune, he may start in the wrong key, or he may hum a few notes off-pitch throughout the
course of the tune. Sometimes he may even drop some notes entirely or add notes that did
not exist in the original melody. Additionally, no user is expected to be able to perfectly
hum at the same tempo as the songs stored in the database. Finally, since none of these
errors are mutually exclusive, a humming query may contain any combination of these
errors.
2. Accurately capturing pitches and notes from user hums is difficult, even if the user
manages to submit a perfect query. Currently existing software for converting raw audio
data into discrete pitch information is mediocre at best and oftentimes will introduce a
great deal of noise when extracting the pitches from a user’s hum.
3. Similarly, accurately capturing melodic information from a pre-recorded music file is
difficult. Properly extracting the melody from a given song is a field of study of its own, but
it is absolutely critical: an accurate query by humming system would be of little use if the
database contains inaccurate representations of the target songs.
5. FILE FORMATS
Wav File Format
WAVE or WAV is short for the Waveform Audio File Format (sometimes referred to as
Audio for Windows). The WAV format is compatible with Windows, Macintosh, and Linux.
Although a WAV file can hold compressed audio, the most common use is to
store uncompressed audio in linear PCM (LPCM). The standard Audio CD
format, for example, is LPCM audio with 2 channels, a sampling frequency of 44,100 Hz, and
16 bits per sample.
As a format derived from the Resource Interchange File Format (RIFF), WAV files can carry
metadata (tags) in the INFO chunk. In addition, WAV files can contain metadata in the
Extensible Metadata Platform (XMP) standard.
Uncompressed WAV files are quite large in size, so, as file sharing over the Internet has
become popular, the WAV format has declined in popularity. However, it is still a widely
used, relatively "pure", i.e. lossless, file type, suitable for retaining "first generation"
archived files of high quality, or use on a system where high fidelity sound is required and
disk space is not restricted.
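These LPCM parameters can be inspected with Python's standard `wave` module. The sketch below writes a short synthetic tone (the file name `tone.wav` is arbitrary) and reads the header fields back:

```python
import math
import struct
import wave

RATE = 8000  # the sampling rate used by the QBH front end described earlier

# Write one second of a 440 Hz sine wave as 16-bit mono LPCM.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 2 bytes = 16 bits per sample
    w.setframerate(RATE)
    frames = b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * 440 * n / RATE)))
        for n in range(RATE))
    w.writeframes(frames)

# Read the RIFF/WAVE header fields back.
with wave.open("tone.wav", "rb") as w:
    params = (w.getnchannels(), w.getsampwidth() * 8, w.getframerate())
print(params)  # (1, 16, 8000)
```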
MIDI File Format
The term MIDI stands for Musical Instrument Digital Interface and is essentially a
communications protocol for computers and electronic musical instruments.
Although MIDI files are not the same as the typical digital audio formats we use to listen to
music (such as MP3, AAC, or WMA), MIDI files can still be thought of as digital music.
Rather than an actual audio recording stored as binary data, a MIDI file in its simplest form
is made up of information that describes which musical notes are to be played, along with
the types of instruments to be used.
MIDI files therefore do not contain any 'real world' recordings such as voice (e.g., audio
books) or live performances.
However, MIDI files are very small and can be played on a wide range of devices that
support the MIDI protocol, including cell phones, smartphones, and computers with the
right software. Monophonic and polyphonic ringtones are examples of the MIDI file
format in use.
In the QBH system, the database of songs is built from songs in the MIDI file format,
because the MIDI representation already discretizes the notes, making it easier to extract
the pitch and timing information necessary for song matching. Alternative music file
formats such as WAV, MP3, or AIFF would require complicated waveform and signal
processing that could lead to many inaccuracies. Each song is also mapped to a set
of metadata attributes, such as song name and artist, for eventual display in the GUI
result list.
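To see why the MIDI representation makes pitch and timing easy to extract, consider a simplified stand-in for MIDI note messages: tuples of (delta ticks, on/off, pitch). The event list below is invented for illustration; converting it to absolute note triples is straightforward:

```python
# Simplified stand-in for MIDI note-on/note-off messages:
# (delta_ticks since previous event, "on"/"off", MIDI pitch).
events = [(0, "on", 72), (96, "off", 72), (0, "on", 79), (96, "off", 79)]

def events_to_notes(events):
    """Turn delta-time note events into (pitch, start_tick, duration) triples."""
    notes, active, now = [], {}, 0
    for delta, status, pitch in events:
        now += delta                  # delta times accumulate to absolute time
        if status == "on":
            active[pitch] = now       # remember when this pitch started
        else:
            start = active.pop(pitch)
            notes.append((pitch, start, now - start))
    return notes

print(events_to_notes(events))  # [(72, 0, 96), (79, 96, 96)]
```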
6. SYSTEM ARCHITECTURE
The architecture is illustrated in the above figure. Operation of the system is
straightforward. Queries are hummed into a microphone, digitized, and fed into a
pitch-tracking module. The result, a contour representation of the hummed melody, is fed
into the query engine, which produces a ranked list of matching melodies. The database of
melodies is acquired by processing public-domain MIDI songs and is stored as a flat-file
database. Hummed queries may be recorded in a variety of formats, and pitch tracking is
performed on the recorded query. The query engine uses an approximate pattern matching
algorithm in order to tolerate humming errors. The melody database is essentially an
indexed set of soundtracks. The acoustic query, typically a few notes hummed by the user,
is processed to detect its melody line, and the database is searched to find the songs that
best match the query.
While the overall task is one that is easily performed by humans, many challenging
problems arise in the implementation of an automatic system. These include the signal
processing needed for extracting the melody from the stored audio and from the acoustic
query, and the pattern matching algorithms needed to achieve properly ranked retrieval.
Further, a robust system must be able to account for inaccuracies in the user’s singing.
6.1 WAV TO MIDI CONVERSION
To create a MIDI file for a song recorded in WAV format, a musician must determine the
pitch, velocity, and duration of each note being played and record these parameters as a
sequence of MIDI events. The MIDI file created represents the basic melody and chords of
the recognized music. The difference between the WAV and MIDI formats lies in the
representation of sound and music: WAV is a digital recording of any sound (including
speech), while MIDI is principally a sequence of notes (MIDI events). Here we produce an
Output File (.mid) from an Input File (.wav) that contains musical data, and a Tone File
(.wav) that consists of monotone data. An advantage of such a structure is that the query is
prepared on the client side of the system, so the query is very short. Besides, there is a
possibility to evaluate its quality before sending it to the server: the system provides
playback of the recognized melody notes in MIDI format, which allows the user to listen to
a query and decide either to send it to the server or to sing it once again.
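The note-detection step can be sketched with a crude autocorrelation pitch estimator restricted to the 80 to 800 Hz band used earlier. This is an illustrative toy on a synthetic tone, not the system's actual converter:

```python
import math

RATE = 8000   # sampling rate of the hummed query
# 0.1 s of a synthetic 440 Hz "hum":
signal = [math.sin(2 * math.pi * 440 * n / RATE) for n in range(800)]

def estimate_pitch(signal, rate, fmin=80.0, fmax=800.0):
    """Pick the lag in the 80-800 Hz band whose autocorrelation is largest."""
    lo, hi = int(rate / fmax), int(rate / fmin)
    best_lag = max(range(lo, hi + 1),
                   key=lambda lag: sum(signal[n] * signal[n - lag]
                                       for n in range(lag, len(signal))))
    return rate / best_lag

print(round(estimate_pitch(signal, RATE)))  # close to 440 (limited by integer lags)
```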
7. PARSONS CODE ALGORITHM
The Parsons code, formally named the Parsons Code for Melodic Contours, is a simple
notation used to identify a piece of music through melodic motion—the motion of
the pitch up and down. Denys Parsons developed this system for his 1975 book, The
Directory of Tunes and Musical Themes. Representing a melody in this manner makes it easy
to index or search for particular pieces.
User input to the system (humming) is converted into a sequence of relative pitch
transitions. The first note is written as '*' and serves as the reference tone.
Each subsequent note is classified in one of three ways:
1. U = "up," if the note is higher than the previous note
2. D = "down," if the note is lower than the previous note
3. R = "repeat," if the note is the same pitch as the previous note
The first note is C (MIDI note 72). We make it the reference note and write '*'. The second
note is also C; since it repeats, we write R. The next note, G, is higher than C, so we write U.
For the second G we write R, and so on.
This textual pattern is stored in the database for comparison.
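The classification rules above can be written as a short function; this is a generic sketch, not any particular system's code:

```python
def parsons_code(midi_notes):
    """Build the Parsons contour string from a list of MIDI note numbers."""
    code = "*"                           # first note is the reference
    for prev, note in zip(midi_notes, midi_notes[1:]):
        if note > prev:
            code += "U"                  # up
        elif note < prev:
            code += "D"                  # down
        else:
            code += "R"                  # repeat
    return code

# C C G G A A G (the opening of "Twinkle, Twinkle, Little Star"):
print(parsons_code([72, 72, 79, 79, 81, 81, 79]))  # *RURURD
```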
Advantages
1. The pattern remains the same even if the user hums the tune in a different key, and even
if the user hums some notes off-key.
2. It requires less space, since it is stored as text.
8. BENCHMARKING MUSIC INFORMATION RETRIEVAL SYSTEMS
Research Paper: Benchmarking Music Information Retrieval Systems
Josh Reiss and Mark Sandler, Department of Electronic Engineering, Queen Mary, University
of London, Mile End Road, London E1 4NS, UK
(josh.reiss@elec.qmul.ac.uk, mark.sandler@elec.qmul.ac.uk)
The goal of this research paper is to create an accurate and effective benchmarking system
for music information retrieval (MIR) systems. This serves the multiple purposes of inspiring
the MIR community to add features and speed to existing projects, and of measuring the
performance of their work and incorporating the ideas of other works. To date, there has
been no systematic, rigorous review of the field, and thus there is little knowledge of when
an MIR implementation might fail in a real-world setting.
ONLINE MIR SYSTEMS
For the purposes of this work, we considered five online MIR systems. The systems
considered all have certain properties in common. They may all be used online via the World
Wide Web. They all are used by entering a query concerning a piece of music, and all may
return information about music that matches that query. However, these systems differ
greatly in their features, goals and implementation. These differences are discussed in detail
below.
CatFind
CatFind allows one to search MIDI files using either a musical transcription or a melodic
profile based on the Parsons Code. It has minimal features, and was intended primarily for
demonstration. Although it seems unlikely that this system will be extended, it is still useful
here as a system for comparison.
MelDex
This allows searching of the New Zealand Digital Library. The MELody inDEX system is
designed to retrieve melodies from a database on the basis of a few notes sung into a
microphone. It accepts acoustic input from the user, transcribes it into common music
notation, then searches a database for tunes that contain the sung pattern, or patterns
similar to it. Thus the query is audio although the retrieved files are in symbolic
representation. Retrieval is ranked according to the closeness of the match. A variety of
different mechanisms are provided to control the search, depending on the precision of the
input.
MelodyHound
This melody recognition system was developed by Rainer Typke in 1997. It was originally
known as "Tuneserver" and hosted by the University of Karlsruhe. It searches directly on the
Parsons Code and was designed initially for query by whistling; that is, it returns the song in
the database that most closely matches a whistled query.
ThemeFinder
Themefinder, created by David Huron et al., allows one to identify common themes in
Western classical music, folk songs, and Latin motets of the sixteenth century. Themefinder
provides a web-based interface to the Humdrum thema command, which in turn allows
searching of databases containing musical themes or incipits (opening note sequences).
Themes and incipits available through Themefinder are first encoded in the Kern music data
format. Groups of incipits are assembled into databases; currently there are three:
Classical Instrumental Music, European Folksongs, and Latin Motets from the sixteenth
century. Matched themes are displayed on-screen in graphical notation.
Music Retrieval Demo
The Music Retrieval Demo is notably different from the other MIR systems considered
herein. The Music Retrieval Demo performs similarity searches on raw audio data (WAV
files). No transcription of any kind is applied. It works by calculating the distance between
the selected file and all other files in the database. The other files can then be displayed in a
list ranked by their similarity, such that the more similar files are nearer the top. Distances
are computed between templates, which are representations of the audio files, not the
audio itself. The waveform is Hamming-windowed into overlapping segments; each segment
is processed into a spectral representation of Mel-frequency cepstral coefficients. This is a
data-reducing transformation that replaces each 20 ms window with 12 cepstral coefficients
plus an energy term, yielding a 13-valued vector. The next step is to quantize each vector
using a specially designed quantization tree. This recursively divides the vector space into
bins, each of which corresponds to a leaf of the tree. Any MFCC vector will fall into one and
only one bin. Given a segment of audio, the distribution of the vectors in the various bins
characterize that audio. Counting how many vectors fall into each bin yields a histogram
template that is used in the distance measure. For this demonstration, the distance
between audio files is the simple Euclidean distance between their corresponding templates
(or rather 1 minus the distance, so closer files have larger scores). Once scores have been
computed for each audio clip, they are sorted by magnitude to produce a ranked list like
other search engines.
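The template-and-distance idea can be sketched as follows. For brevity this toy quantizes scalar features into uniform bins rather than 13-dimensional MFCC vectors into a quantization tree, and the function names are ours:

```python
import math

def histogram_template(values, bins, lo=0.0, hi=1.0):
    """Normalized histogram of feature values over `bins` equal-width bins."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def template_distance(t1, t2):
    """Euclidean distance between two histogram templates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(t1, t2)))

a = histogram_template([0.10, 0.12, 0.50, 0.52], bins=4)  # clip A's features
b = histogram_template([0.11, 0.13, 0.51, 0.55], bins=4)  # similar clip
c = histogram_template([0.90, 0.92, 0.95, 0.97], bins=4)  # dissimilar clip
print(template_distance(a, b) < template_distance(a, c))  # True
```

Ranking a whole database then amounts to sorting all files by this distance to the selected file.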
COMPARISON OF MIR SYSTEMS
In Table 1, we present a comparison of the features of the various MIR systems under
investigation. Note first that each of these systems was designed for a different purpose,
and none of them can be considered a finished product. This table allows one to get an
overview of the state of the available MIR systems, the features one may wish to include in
an MIR system, and the areas where improvement is most needed. It also highlights the
need for a standardized testbed. Each of the MIR systems uses a different database of files
for audio retrieval. Both CatFind and the Music Retrieval Demo have databases with fewer
than 500 files; thus any benchmarking estimates, such as retrieval times and efficiency, are
rendered useless. MelDex, MelodyHound, and ThemeFinder have databases containing over
10,000 files, which should be sufficient for estimating search efficiency and scalability.
EVALUATION ISSUES
Table 1 listed and compared the features available in existing online MIR systems. However,
this is not sufficient for effective benchmarking and evaluation of possible music
information retrieval systems that may appear in the near future and be used with large file
collections. The question of what features to evaluate is determined by what we can
measure that will reflect the ability of the system to satisfy the user. In a landmark paper,
Cleverdon [21] listed six main measurable quantities; this has become known as the
Cranfield model of information retrieval evaluation. Here, those properties are listed and
modified as applicable to MIR.
1. The coverage of the collection, that is, the extent to which the system includes relevant
matter.
2. The time lag, that is, the average interval between the time the search request is made
   and the time an answer is given. Consideration should also be given to worst-case or
   close-to-worst-case scenarios. Certain genres or formats of music, as well as certain
   types of queries (e.g., query and retrieval of polyphonic transcription-based audio), may
   require far more time than other queries. Furthermore, if the testbed is particularly
   large, dispersed, or unindexed, as with peer-to-peer based internet systems, then
   bandwidth limitations and scalability may greatly reduce efficiency while maximizing the
   collection size.
3. The form of presentation of the output. For MIR systems this not only means having the
option of retrieving various formats, symbolic and audio, but it also implies identifying
multiple performances of the same composition.
4. The effort involved on the part of the user in obtaining answers to his search requests. So
far, MIR research has been dominated by audio engineers, computer scientists,
musicologists and librarians. As the field expands to include developers and user
interface experts this issue will acquire more significance.
5. The recall of the system, that is, the proportion of relevant material actually retrieved in
answer to a search request;
6. The precision of the system, that is, the proportion of retrieved material that is actually
relevant.
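Both quantities follow directly from the sets of retrieved and relevant items; the song identifiers below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# A query retrieves 4 songs, 3 of which are among the 6 truly relevant ones:
p, r = precision_recall(["s1", "s2", "s3", "s9"],
                        ["s1", "s2", "s3", "s4", "s5", "s6"])
print(p, r)  # 0.75 0.5
```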
9. CONCLUSION
Music retrieval is becoming more natural, simple, and user friendly with the advancement of
QBH, and this technology will open broader application prospects for music retrieval.
Using the Parsons code algorithm, it becomes easy to implement the query matching system.
In this work, we have laid down a framework for benchmarking of future MIR systems. At
the moment, this field is in its infancy. There are only a handful of MIR systems available
online, each of which is quite limited in scope. Still, these benchmarking techniques were
applied to five online systems. Proposals were made concerning future benchmarking of full
online audio retrieval systems. It is hoped that these recommendations will be considered
and expanded upon as such systems become available.
10. REFERENCES
1. Josh Reiss and Mark Sandler, "Benchmarking Music Information Retrieval Systems,"
Department of Electronic Engineering, Queen Mary, University of London, Mile End Road,
London E1 4NS, UK (josh.reiss@elec.qmul.ac.uk, mark.sandler@elec.qmul.ac.uk).
2. Jan-Mark Batke, Gunnar Eisenberg, Philipp Weishaupt, and Thomas Sikora, "A Query by
Humming System Using MPEG-7 Descriptors," Communication Systems Group, Technical
University of Berlin (batke@nue.tu-berlin.de).
3. Edmond Lau, Annie Ding, and Calvin On, "MusicDB: A Query by Humming System,"
6.830 Database Systems Final Project Report, Massachusetts Institute of Technology
({edmond, annie_d, calvinon}@mit.edu).