Presentation of my bachelor's thesis in Information Science. It gives an overview of my attempt to use parsimonious language models on parliamentary proceedings to derive characteristic words for left-wing and right-wing parties, and to compare the occurrences of these words in subtitles of programmes broadcast by Dutch public broadcasting organizations.
3. Gentzkow & Shapiro (2010)
Econometrical research: compare
language use of news outlets to political
language
Conclusion: ‘An economically significant
demand for news slanted towards one’s
own political ideology exists.’
Gentzkow, M. and Shapiro, J. M. (2010). What drives media slant? Evidence from U.S. daily newspapers. Econometrica, 78(1):35–71.
4. Operationalization
Find characteristic words for Republicans and
Democrats in Congress Proceedings.
Count relative frequencies of these words in newspapers
Compare occurrence of words between newspapers
5. Differences
Dutch versus English
Television instead of newspapers
More political parties
Other technique to derive characteristic words
Other comparison method(s)
6. Television
Subtitles for the hearing impaired (http://tt888.nl)
Data complete from January 2008 to February 2011
Problem: Hardly any useful metadata
7. Television
                          Before             After
Broadcast with title      16.995             32.491
Unique titles             4.560 --> 2.702    2.238
Broadcast frequency > 2   1.104              1.064
Solution: TV guide
8. Television
Nova
362.844 words
Pauw & Witteman
895.935 words
DWDD
1.626.929 words
EenVandaag
1.556.642 words
Nos Journaal
12.609.620 words
Goedemorgen Nederland
760.658 words
Netwerk
879.635 words
NOS Jeugdjournaal
1.383.728 words
Buitenhof
DWDD
EenVandaag
Goedemorgen Nederland
Het Elfde Uur
Holland Doc
Knevel en Van den Brink
Netwerk
Nieuwsuur
NOS Jeugdjournaal
Nos Journaal
Nova
Ochtendspits
Pauw & Witteman
PowNews
SchoolTV Weekjournaal
Sinterklaasjournaal
Tegenlicht
Uitgesproken
Vragenuurtje
Zembla
9. Political groups
Parliamentary period with greatest overlap on TV data set:
Balkenende IV
Ideology: government - opposition, not left - right
(Hirst et al., 2010)
Hirst, G., Riabinin, Y., Graham, J., and Boizot-Roche, M. Text to Ideology
or Text to Party Status?
10. Political groups
Government (CDA, PvdA and ChristenUnie)
Left-wing opposition (GroenLinks, SP)
Right-wing opposition (PVV, VVD)
11. Parsimonious language models
E-step:
\[ e_t = tf(t, D) \cdot \frac{\lambda P(t \mid D)}{(1 - \lambda) P(t \mid C) + \lambda P(t \mid D)} \]
M-step:
\[ P(t \mid D) = \frac{e_t}{\sum_t e_t} \]
Hiemstra, D., Robertson, S., and Zaragoza, H. (2004). Parsimonious language models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, pages 178–185, New York, NY, USA. ACM.
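The two update steps can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function name `parsimonious_lm`, the default values for `lam`, `iterations` and `threshold`, the pruning step, and the fallback probability for terms missing from the collection model are all assumptions made for this example.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, collection_probs, lam=0.1,
                    iterations=50, threshold=1e-4):
    """Estimate P(t|D) with the E-step and M-step shown above.

    collection_probs maps a term to P(t|C). Terms missing from it get
    a tiny fallback probability (an assumption of this sketch).
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum likelihood estimate tf(t, D) / |D|.
    p_doc = {t: count / total for t, count in tf.items()}

    for _ in range(iterations):
        # E-step: e_t = tf(t,D) * lam*P(t|D) / ((1-lam)*P(t|C) + lam*P(t|D))
        e = {}
        for t, p in p_doc.items():
            p_coll = collection_probs.get(t, 1e-9)
            e[t] = tf[t] * (lam * p) / ((1 - lam) * p_coll + lam * p)

        # M-step: P(t|D) = e_t / sum_t e_t
        norm = sum(e.values())
        p_doc = {t: value / norm for t, value in e.items()}

        # Assumed pruning step: drop terms whose probability falls below
        # the threshold, then renormalise so the model sums to 1.
        kept = {t: p for t, p in p_doc.items() if p >= threshold}
        if kept:
            norm = sum(kept.values())
            p_doc = {t: p / norm for t, p in kept.items()}

    return p_doc


# Toy data (invented): 'voorzitter' is common in the whole collection,
# 'hypotheekrenteaftrek' is rare, so the latter should gain probability mass.
document = ["voorzitter"] * 8 + ["hypotheekrenteaftrek"] * 2
collection = {"voorzitter": 0.05, "hypotheekrenteaftrek": 0.0001}
model = parsimonious_lm(document, collection, lam=0.1)
for term, prob in sorted(model.items(), key=lambda item: -item[1]):
    print(term, round(prob, 3))
```

In this toy run the term that is frequent in the collection loses probability mass relative to its raw relative frequency, while the term that is rare in the collection gains mass. A smaller `lam` pushes the model harder towards terms that distinguish the document from the collection.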
12. Parsimonious language models
Probability distribution from word frequencies per
document
Compare distribution with collection of documents
Choose terms that are substantially more frequent than
expected
13. Parsimonious language models
Filter out corpus-specific stopwords (‘voorzitter’)
Remove noise
17. Comparison
Two methods: estimated probability and Kullback-Leibler
divergence
‘For each political group, estimate the probability that an
arbitrary word in a tv-programme is one of their
characteristic words’
‘Calculate the risk of returning a document to the query’
\[ \hat{P}(q \mid TV) = \sum_{t \in q} \frac{tf_{t,TV}}{|TV|} \]
\[ KL(M_d \,\|\, M_q) = \sum_{t \in V} P(t \mid M_q) \cdot \log \frac{P(t \mid M_q)}{P(t \mid M_d)} \]
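Both comparison measures can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code used for the thesis: the function names, the `epsilon` floor for terms missing from the programme model, and the toy data are invented for this example. The sum in the second function runs over the terms with non-zero probability in the query model, since all other terms contribute nothing to the sum.

```python
import math
from collections import Counter


def estimated_probability(characteristic_words, tv_tokens):
    """Probability that an arbitrary word in the programme is one of the
    group's characteristic words: sum over t in q of tf(t, TV) / |TV|."""
    if not tv_tokens:
        return 0.0
    tf = Counter(tv_tokens)
    return sum(tf[t] for t in set(characteristic_words)) / len(tv_tokens)


def kl_divergence(query_model, doc_model, epsilon=1e-9):
    """Sum over t of P(t|Mq) * log(P(t|Mq) / P(t|Md)).

    Terms absent from doc_model get a small floor probability (epsilon),
    an assumption of this sketch, to avoid division by zero.
    """
    return sum(
        p_q * math.log(p_q / doc_model.get(t, epsilon))
        for t, p_q in query_model.items()
        if p_q > 0
    )


# Toy data (invented for illustration only).
subtitles = ["de", "zorg", "is", "duur"]
group_words = {"zorg", "belasting"}
print(estimated_probability(group_words, subtitles))

group_model = {"zorg": 0.6, "belasting": 0.4}
programme_model = {"zorg": 0.5, "belasting": 0.1, "de": 0.4}
print(kl_divergence(group_model, programme_model))
```

A higher estimated probability and a lower divergence both indicate that a programme's language is closer to that of the political group.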
18. Results
Right never wins
Casual inspection does not reveal any ‘strange’ right-wing words
Government and left results are close
Comparison with regular Dutch does suggest a slight
preference for left-wing words
19. Conclusions
Language in Dutch public broadcasting is not
particularly left (only a slight preference was found)
Descriptive right-wing words are used less
Might be PVV influence; further investigation is needed