A Quantitative Study of Forum Spamming Using Context-Based Analysis
Yi-Min Wang^ Ming Ma^ Yuan Niu* Hao Chen* Francis Hsu*
*UC Davis, ^Microsoft Research
Transition better to 2nd slide. More motivation: why SEO is important -- screenshot of search results -- More than spam, possible exploits -- more coherent story about comment spam -- Moderation nightmare
Users want to see useful information. They want to participate in forums, blog, and shop without being bombarded by irrelevant ads. And of course, everyone has the right to surf the web without fear of being attacked by this or that exploit. Search engines try to point users to quality pages through good search results. They’re also partially motivated by money earned through ads.
More reasons here… Define web forum
More trackbacks or pingbacks (how do they work? Why do they exist?) -- similarity based on layout COLOR backgrounds -- CAPTCHA can’t be used. -- more difficult to moderate trackbacks/pingbacks
Content-based analysis. We get all the doorway pages + the destination. End game is to direct traffic to the destination. Why we chose context-based analysis over content-based -- Define -- Related
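The "doorway pages + destination" idea can be illustrated with a small sketch that groups posted doorway URLs by the host they finally land on. This is only an illustration under assumptions, not the tooling used in the study: the `(doorway_url, final_url)` input format, the helper names `host_of` and `group_by_destination`, and every hostname below are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def group_by_destination(visits):
    """Map each destination host to the sorted doorway URLs that led there.

    `visits` is an iterable of (doorway_url, final_url) pairs: the URL that
    was posted in a forum and the URL the browser ended up on (assumed format).
    """
    groups = defaultdict(set)
    for doorway_url, final_url in visits:
        # Only count a visit when the browser left the doorway's own host.
        if host_of(doorway_url) != host_of(final_url):
            groups[host_of(final_url)].add(doorway_url)
    return {dest: sorted(urls) for dest, urls in groups.items()}


# Hypothetical recorded visits (all hosts are made up).
visits = [
    ("http://door-a.example.com/p1", "http://target.example.net/landing"),
    ("http://door-b.example.org/x", "http://target.example.net/landing?id=2"),
    ("http://plain.example.org/post", "http://plain.example.org/post"),
]
print(group_by_destination(visits))
```

Many doorways collapsing onto one destination host is the pattern the slide is meant to convey; comparing full hostnames rather than registered domains is a simplification made here to keep the sketch short.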
Thumbnails. Define 3rd-party domain here.
More detail on the process of recording pages. 3rd-party domain: define. “seeded known spammer domains”. Mention the double funnel -- blacklist, whitelist, spam policies
Also do picture for crawler-browser
1st image: Konqueror masquerading as Wget (which doesn’t deal with JavaScript). The 2nd image shows Konqueror sending the correct user-agent ID.
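The two screenshots compare what a site returns for different user-agent IDs. A rough sketch of that comparison follows, assuming a caller-supplied `fetch(url, user_agent)` function; the user-agent strings, `normalize`, `serves_different_content`, and the fake server are all illustrative and not taken from the study.

```python
# Illustrative user-agent strings (assumed, not the exact ones from the slides).
WGET_UA = "Wget/1.10.2"
BROWSER_UA = "Mozilla/5.0 (compatible; Konqueror/3.5; Linux)"


def normalize(body):
    """Collapse runs of whitespace so trivial formatting changes are ignored."""
    return " ".join(body.split())


def serves_different_content(fetch, url, ua_a=WGET_UA, ua_b=BROWSER_UA):
    """Return True if the page body differs between two user-agent IDs.

    `fetch(url, user_agent)` must return the response body as a string.
    It is injected so the check can run without network access.
    """
    return normalize(fetch(url, ua_a)) != normalize(fetch(url, ua_b))


def fake_fetch(url, user_agent):
    """Stand-in server: one URL hides its script from crawler-like agents."""
    if url == "http://cloaked.example.com/" and user_agent.startswith("Wget"):
        return "<html><body>Harmless forum post</body></html>"
    if url == "http://cloaked.example.com/":
        return "<html><script>/* redirect here */</script></html>"
    return "<html><body>Same for everyone</body></html>"


print(serves_different_content(fake_fetch, "http://cloaked.example.com/"))
print(serves_different_content(fake_fetch, "http://honest.example.com/"))
```

A difference between the two bodies is only a crude signal, since dynamic pages can legitimately change between requests; in practice one would want repeated fetches or a comparison of redirect targets rather than raw bodies.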
Use circles/emphasize current graph. Shrink.
First we look at the extent to which web forums are spammed, from the perspective of the web user. Presumably, this is because the spammer has been very busy leaving his URLs all over the web. Again, the URLs being left are doorway pages, which are more expendable than actual domains.
WWWBoard, HyperNews, Ikonboard, ezboard, Bravenet, Invision Board, phpBB, Phorum, and vBulletin. A mix of languages (Perl, PHP), hosted/non-hosted. 9 different software packages: highlight differences rather than names -- list all, but more readable (maybe red circles & graphically)
Top 5 numbers. Show more non-spammy words -- .edu & .gov sites (why web forums as well). Why is this bad? (from every perspective)
Expand the graph. Growth keeps continuing. Spammers are still visiting. Exponential growth seen on all 3
Change colors. Sum 3 lines -- Shift the number. Mark the important dates -- 2nd graph to show rate of change -- mention length of experiment
Include percentages
Put numbers here -- Google has resources
Blogspoint + blogstudio share spammers *** numbers!! Graph/table showing all 4 webhosts Why isn’t spam consistent across Consistent metrics
Why are .edu/.gov redirects troublesome?
Less time on this. Don’t read out loud. Highlight how ours differs/relates, their shortcomings (cloaking).
Move WWW paper info to: APPLICATION/FUTURE WORK/IMPACT. Explain how useful results are to search engines/forum.