2. Video has become one of the most popular multimedia artefacts used on PCs and the Internet. In most videos, the sound track holds an important place. It therefore appears essential to make the spoken content of a video understandable both for people with hearing impairments and for people with gaps in the spoken language. The most natural way to do so is through subtitles.
However, manual subtitle creation is a long and tedious activity that requires the constant presence of the user. Consequently, the study of automatic subtitle generation appears to be a valid subject of research.
PROBLEM STATEMENT...
3. The system takes a video file as input and generates a subtitle file (.srt/.txt) as output. The three modules are:
Audio Extraction:
The audio extraction routine is expected to return the audio in a format suitable for the speech recognition module. It must handle a defined list of video and audio formats, verify the input file in order to evaluate the feasibility of extraction, and return the audio track in the most reliable format.
INTRODUCTION...
4. Speech Recognition:
The speech recognition routine is the key part of the system, as it directly affects performance and the evaluation of results. First, it determines the type of the input file; if the type is provided, an appropriate processing method is chosen, otherwise a default configuration is used. It must be able to recognize silences so that text delimitations can be established.
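Silence recognition of this kind can be sketched with a simple energy threshold; the following is a minimal illustration (the class name and threshold value are hypothetical, not the system's actual detector):

```java
public class SilenceDetector {
    // Mark a frame as silence when its mean absolute amplitude falls
    // below a threshold (samples assumed normalized to [-1, 1]).
    static boolean isSilence(double[] frame, double threshold) {
        double sum = 0;
        for (double s : frame) {
            sum += Math.abs(s);
        }
        return (sum / frame.length) < threshold;
    }
}
```

Runs of consecutive silent frames then mark the boundaries between utterances.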
Subtitle Generation:
The subtitle generation routine creates a file and writes into it chunks of text corresponding to utterances delimited by silences, together with their respective start and end times. Time synchronization is of major importance.
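An SRT file numbers each utterance and gives its start and end timecodes before the text. A minimal sketch of such a writer (class and method names are hypothetical):

```java
import java.util.Locale;

public class SrtWriter {
    // Format a time in seconds as the SRT timecode HH:MM:SS,mmm
    static String timecode(double seconds) {
        int h = (int) (seconds / 3600);
        int m = (int) ((seconds % 3600) / 60);
        int s = (int) (seconds % 60);
        int ms = (int) Math.round((seconds - Math.floor(seconds)) * 1000);
        return String.format(Locale.US, "%02d:%02d:%02d,%03d", h, m, s, ms);
    }

    // Build one numbered SRT entry: index, time range, text, blank separator
    static String entry(int index, double start, double end, String text) {
        return index + "\n" + timecode(start) + " --> " + timecode(end)
                + "\n" + text + "\n\n";
    }
}
```

Calling `entry(1, 0.0, 2.25, "Hello")` yields the block `1`, `00:00:00,000 --> 00:00:02,250`, `Hello`, separated by newlines.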
5. BENEFITS OF USING SUBTITLES....
The major benefit is that viewers do not need to download subtitles from the
Internet in order to watch the video with subtitles.
Captions help children with word identification, meaning, acquisition, and
retention.
Captions can help children establish a systematic link between the written word
and the spoken word.
Captioning has been linked to higher comprehension skills compared with
watching the same media without captions.
6. Captions provide missing information for individuals who have difficulty
processing speech and auditory components of the visual media (regardless of
whether this difficulty is due to a hearing loss).
Captioning is essential for children who are deaf and hard of hearing, can be
very beneficial to those learning English as a second language, can help those
with reading and literacy problems, and can help those who are learning to
read.
CONTINUED....
13. AUDIO EXTRACTION…
14. SPEECH RECOGNITION…
15. SUBTITLE GENERATION…
17. FFMPEG…
The FFmpeg libraries handle most of our multimedia tasks quickly and easily,
such as audio compression, audio/video format conversion, and extracting
images from a video. Developers can use them for transcoding, streaming, and
playback. FFmpeg is a very stable framework for transcoding video and audio.
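In practice the extraction step can also be driven through the ffmpeg command-line tool (`-vn` drops the video stream, `-acodec libmp3lame` selects MP3 encoding). A sketch that merely assembles such a command from Java, with placeholder file names:

```java
import java.util.Arrays;
import java.util.List;

public class AudioExtractCommand {
    // Build an ffmpeg command line that discards the video stream (-vn)
    // and encodes the audio track as MP3. File names are placeholders;
    // running it requires ffmpeg on the PATH (e.g. via ProcessBuilder).
    static List<String> build(String videoIn, String audioOut) {
        return Arrays.asList("ffmpeg", "-i", videoIn,
                "-vn",                   // no video in the output
                "-acodec", "libmp3lame", // encode audio as MP3
                audioOut);
    }
}
```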
18. JAVA SPEECH API…
It allows developers to incorporate speech technology into user
interfaces for their Java applets and applications. This API specifies a
cross-platform interface to support command-and-control recognizers,
dictation systems, and speech synthesizers. Sun has also developed JSGF
(the Java Speech Grammar Format) to provide a cross-platform grammar
format for speech recognizers.
19. CURRENT PROBLEMS…
Robustness.
Automatic generation of word lexicons.
Finding the theoretical limit for FSM implementations of ASR systems.
Optimal utterance verification-rejection algorithms.
Accuracy and Word Error Rate.
Filling up missing offset samples with silence.
Synchronization between tracks.
21. All MPEG standard audio/video formats, such as MP2 and MP3, are
supported.
Audio of any format can be extracted, but speech recognition is performed
only in English.
The text extracted from the audio/video is stored in the .srt format and
displayed in a readable form.
Captions appear on-screen long enough to be read; it is preferable to
limit on-screen captions to no more than two lines. Captions are
synchronized with the spoken words.
The user can convert the extracted audio into any suitable format
supported under the MPEG standards.
23. System Requirements – The software is compatible with all operating
systems. The user needs to install the software's .exe file on their PC.
Security – The system has no security constraints.
Performance – The text is synchronized with the song.
Maintainability – The software is easy to maintain.
Reliability – The software will provide a good level of precision.
Modifiability – The software cannot be modified by external users.
Scalability – The software is scalable, as a number of users can utilize it
simultaneously.
25. MP3 ALGORITHM…
1. Initialize i = 0, j = 1.
2. tincr = 1.0 / sample_rate
3. dstp = dst; c = 2 * M_PI * 440.0
4. Generate a sine tone with a 440 Hz frequency and duplicated channels.
5. Check if i < nb_samples. If true, generate the sine wave and store it: *dstp = sin(c * *t)
6. Check if j < nb_channels.
7. Store the packets in the destination buffer.
8. Increment dstp += nb_channels and t += tincr.
9. Repeat until the dst buffer is filled with nb_samples, generated starting from t.
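The steps above can be sketched in Java as follows, using interleaved `double` samples in place of FFmpeg's packet buffers (the class name is illustrative):

```java
public class SineTone {
    // Generate nbSamples of a 440 Hz sine at the given sample rate,
    // duplicating each sample across nbChannels (interleaved), as in
    // the walkthrough above.
    static double[] generate(int sampleRate, int nbSamples, int nbChannels) {
        double tincr = 1.0 / sampleRate;         // step 2
        double c = 2 * Math.PI * 440.0;          // step 3
        double[] dst = new double[nbSamples * nbChannels];
        double t = 0;
        for (int i = 0; i < nbSamples; i++) {    // step 5: i < nb_samples
            double v = Math.sin(c * t);          // generate the sine wave
            for (int j = 0; j < nbChannels; j++) { // step 6: j < nb_channels
                dst[i * nbChannels + j] = v;     // step 7: store in buffer
            }
            t += tincr;                           // step 8
        }
        return dst;
    }
}
```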
26. MFCC (MEL FREQUENCY CEPSTRAL COEFFICIENT)
Check deltaFreq, the ratio between the sample rate and the number of FFT points:
if (deltaFreq == 0) {
    Print "deltaFreq has zero value"; }
Check if the left and right boundaries of the filter are too close:
if ((Math.round(rightEdge - leftEdge) == 0) || (Math.round(centerFreq - leftEdge) == 0)
        || (Math.round(rightEdge - centerFreq) == 0)) {
    throw new IllegalArgumentException("Filter boundaries too close"); }
Find how many frequency bins we can fit in the current frequency range:
numberElementsWeightField = (int) Math.round((rightEdge - leftEdge) / deltaFreq + 1);
Initialize the weight field:
if (numberElementsWeightField == 0) {
    throw new IllegalArgumentException("Number of elements in mel is zero."); }
weight = new double[numberElementsWeightField];
27. CONTINUED…
filterHeight = 2.0f / (rightEdge - leftEdge);
Now compute the slopes based on the height:
leftSlope = filterHeight / (centerFreq - leftEdge);
rightSlope = filterHeight / (centerFreq - rightEdge);
Now compute the weight for each frequency bin:
for (currentFreq = initialFreq, indexFilterWeight = 0; currentFreq <= rightEdge;
        currentFreq += deltaFreq, indexFilterWeight++) {
    if (currentFreq < centerFreq) {
        weight[indexFilterWeight] = leftSlope * (currentFreq - leftEdge);
    } else {
        weight[indexFilterWeight] = filterHeight + rightSlope * (currentFreq - centerFreq);
    }
}
Convert linear frequency to mel frequency:
private double linToMelFreq(double inputFreq) {
    return (2595.0 * (Math.log(1.0 + inputFreq / 700.0) / Math.log(10.0)));
}
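As a quick check of the linear-to-mel formula above, here is the same computation in standalone form; by construction of the mel scale, 1000 Hz maps to roughly 1000 mel:

```java
public class MelScale {
    // Same formula as linToMelFreq above: mel = 2595 * log10(1 + f / 700)
    static double linToMel(double inputFreq) {
        return 2595.0 * Math.log10(1.0 + inputFreq / 700.0);
    }
}
```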
36. Risk ID | Classification | Description of Risk | Risk Area | Probability | Impact | RE (P*I)
1. | Product Engineering | Word Error Rate | Performance | L | H | M
2. | Product Engineering | Aliasing | Performance | M | M | M
3. | Development Environment | Bitrate of extracted audio more than that of input audio | Testing Environment | L | L | L
4. | Product Engineering | Accuracy and Speed | Performance | L | H | M
5. | Program Constraint | Format not recognized | External Input | L | H | M
38. Risk ID | Description of Risk | Risk Area | Mitigation
1. | Word Error Rate | Performance | Having an efficient database (training set).
2. | Aliasing | Performance | Resampling the samples at a fixed frequency.
3. | Bitrate of extracted audio more than that of input audio | Testing Environment | Encode and decode audio at the bitrate of the input audio.
4. | Accuracy and Speed | Performance | Synchronization.
5. | Format not recognized | External Input | Input audio/video supported by MPEG standard formats.
40. Test Case ID | Input | Expected Output | Status
1.1 | File.mp3 | File.mp3 | Pass
1.2 | File.mp4 | File.mp3 | Pass
1.3 | File.mp2 | File.mp3 | Pass
1.4 | File.au | File.au | Pass
1.5 | File.aac | File.aac | Pass
1.6 | File.wav | File.wav | Pass
1.7 | File.flac | File.flac | Pass
1.8 | File.wma (format not supported by MPEG standards) | File.wma | Fail
1.9 | File.als (format not supported by MPEG standards) | File.als | Fail
41. Test Case ID | Input | Expected Output | Status
2.1 | File.wav (Words present in the Dictionary) | Speech Recognized. Text Printed. | Pass
2.2 | File.mp3 (not a .wav file) | Speech Recognized. Text Printed. | Fail
2.3 | File.au (not a .wav file) | Speech Recognized. Text Printed. | Fail
2.4 | File.flac (not a .wav file) | Speech Recognized. Text Printed. | Fail
2.5 | File.wav (Words not found in the Dictionary) | Speech Recognized. Text Printed. | Fail
3.1 | File.srt (Incorrect Timecode) | Subtitles generated and synchronized with the video | Fail
3.2 | File.srt (Correct Timecode), File.avi | Subtitles generated and synchronized with the video file File.avi | Pass
3.3 | File.txt (not containing the Timecode) | Subtitles generated and synchronized with the video | Fail
3.4 | File.srt (Correct Timecode), File.mp4 | Subtitles generated and synchronized with the video file File.mp4 | Pass
3.5 | File.srt (Correct Timecode), File.wma | Subtitles generated and synchronized with the video file | Pass
43. AUDIO EXTRACTION…
49. Test Case ID | Input | Expected Output | Status
1.8 | File.au (format supported by MPEG standards) | File.au | Pass
1.9 | File.mp4 (format supported by MPEG standards) | File.mp3 | Pass
2.2 | File.wav | Speech Recognized. Text Printed. | Pass
2.3 | File.wav | Speech Recognized. Text Printed. | Pass
2.4 | File.wav | Speech Recognized. Text Printed. | Pass
2.5 | File.wav (Words found in the Dictionary) | Speech Recognized. Text Printed. | Pass
3.1 | File.srt (Correct Timecode) | Subtitles generated and synchronized with the video | Pass
3.3 | File.srt | Subtitles generated and synchronized with the video | Pass
51. DETAILED STUDY OF INPUT AND EXTRACTED FILES…
S.No. | Input File | Size Before Extraction (MB) | Bitrate Before Extraction (kbps) | Size After Extraction (MB) | Bitrate After Extraction (kbps) | Length of the input/output file (min:sec) | Time Taken for Extraction (in ms) | Reduction Rate
1 | Despicable.avi | 10.8 | 1628 | 8.24 | 1411 | 00:49 | 0.6 | 24%
2 | Time.mp4 | 48.1 | 1663 | 44.4 | 1536 | 04:02 | 3.12 | 8%
3 | Florida.mp4 | 76 | 2723 | 39.3 | 1411 | 03:54 | 1.08 | 48%
4 | International.mp4 | 79.1 | 2673 | 41.7 | 1411 | 04:08 | 1.3 | 47%
5 | Justin.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 1.54 | 5%
6 | Love.mp4 | 67.1 | 2112 | 44.8 | 1411 | 04:26 | 1.98 | 33%
7 | Jojo.avi | 61.8 | 2183 | 39.9 | 1411 | 03:57 | 1.86 | 35%
8 | Baby.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 3.34 | 5%
9 | Never.mp4 | 52.5 | 1657 | 48.5 | 1536 | 04:25 | 2.15 | 8%
10 | Beep.avi | 51.4 | 1628 | 38.4 | 1411 | 03:48 | 01:58 | 25%
Average | | 53.3 | 1950 | 38.7 | 1461 | 03:41 | 1.71 | 24%
52. COMPARISON BETWEEN THE SIZE OF THE INPUT FILE AND THE
EXTRACTED FILE
[Bar chart: size of file (in MB) before and after extraction for each input file (.mp4/.avi)]
From the graph we can observe that the size of each input file is reduced once
the audio has been extracted from the input video. The maximum reduction rate
of the file size is 0.48 and the minimum is 0.05, giving an average reduction
rate of 24%.
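The reduction rate reported here is simply the relative decrease in file size; a tiny illustrative helper (the class name is hypothetical):

```java
public class ReductionRate {
    // Reduction rate = (size before - size after) / size before
    static double rate(double beforeMB, double afterMB) {
        return (beforeMB - afterMB) / beforeMB;
    }
}
```

For Despicable.avi, (10.8 - 8.24) / 10.8 gives roughly 0.24, matching the 24% in the table.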
53. COMPARISON BETWEEN THE BITRATE OF THE INPUT FILE AND THE
EXTRACTED FILE
[Bar chart: bitrate (in kbps) before and after extraction for each input file (.mp4/.avi)]
The bitrates of the input files range from 1615 kbps to 2723 kbps, and the
bitrates of the extracted files reduce to between a minimum of 1411 kbps and a
maximum of 1536 kbps, giving an average bitrate of 1461 kbps.
54. TIME TAKEN FOR EXTRACTION OF INPUT FILE
[Bar chart: time taken for extraction (in ms) for each input file (.mp4/.avi)]
The time taken to extract each file varies from 0.6 ms to 3.34 ms, with an
average extraction time of 1.71 ms.
56. The ASG system aims at automatically generating the subtitle text for the
input audio/video.
It supports all the MPEG standards.
The video and subtitles are synchronized.
The user can extract audio in any MPEG standard format.
Audio of any format can be extracted, but speech recognition is done only in
English.
57. [1] B. H. Juang and L. R. Rabiner, "Hidden Markov Models for Speech Recognition",
Technometrics, Vol. 33, No. 3, Aug. 1991.
[2] Hong Zhou and Changhui Yu, "Research and design of the audio coding scheme",
International Conference on Multimedia Technology (ICMT), 2011.
[3] Seymour Shlien, "Guide to MPEG-1 Audio Standard", IEEE Transactions on
Broadcasting, December 1994.
[4] Justin Burdick, "Building a Regionally Inclusive Dictionary for Speech Recognition",
Computer Science and Linguistics, Spring 2004.
[5] Anand Vardhan Bhalla and Shailesh Khaparkar, "Performance Improvement of Speaker
Recognition System", International Journal of Advanced Research in Computer Science
and Software Engineering, Volume 2, Issue 3, March 2012.
[6] Petr Pollak and Martin Behunek, "Accuracy of MP3 Speech Recognition Under
Real-World Conditions", Czech Technical University in Prague.
REFERENCES…
58. [7] Yu Li and LingHua Zhang, "Implementation and Research of Streaming Media System
and AV Codec Based on Handheld Devices", 12th IEEE International Conference on
Communication Technology (ICCT), 2010.
[8] Ibrahim Patel and Y. Srinivas Rao, "Speech Recognition Using HMM with MFCC: An
Analysis Using Frequency Spectral Decomposition Technique", Signal & Image
Processing: An International Journal (SIPIJ), Vol. 1, No. 2, December 2010.
[9] Jorge Martinez, Hector Perez, Enrique Escamilla, and Masahisa Mabo Suzuki, "Speaker
recognition using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization
(VQ) Techniques", 22nd International Conference on Electrical Communications and
Computers (CONIELECOMP), 2012.
[10] Sadaoki Furui, Li Deng, Mark Gales, Hermann Ney, and Keiichi Tokuda, "Fundamental
Technologies in Modern Speech Recognition", IEEE Signal Processing Society,
November 2012.
[11] Youhao Yu, "Research on Speech Recognition Technology and Its Application",
International Conference on Computer Science and Electronics Engineering, 2012.
CONTINUED…
59. Abhinav Mathur, Tanya Saxena, “Generating Subtitles Automatically using
Audio Extraction and Speech Recognition”, 7th International Conference on
Contemporary Computing (IC3), 2014. (Under Review).
PUBLICATION…