HP Tech Forum 2009 presentation covering some of the ways spammers harvest email addresses on the Internet (and how you can prevent it), including an in-depth look at three commonly used software packages.
2. Overview
• What is email address harvesting?
• How do spammers do it?
• What can you do about it?
• Examples of harvesting software
3. Mandatory Definition Slide
• Email address harvesting is the process used by
spammers to extract email addresses from public
sources.
• Common sources:
− Web sites
− Newsgroups
− Mailing lists
− Chat rooms
4. Mandatory “How Bad Is It?” Slide
• FTC: 86% of all email addresses posted on web
pages receive spam.
• FTC: 93% of all email addresses used in
newsgroups receive spam.
• PSC honeypot record: Address received spam 4
minutes after being included in a newsgroup post.
5. Address Lists
• Spammers use address harvesting to build giant
lists of addresses to send spam to.
• Most lists have 1-20 million addresses.
• Spammers sell/share their lists, so being on even
just one list will get you a lot of spam.
6. Evolution Of The Address List
• Somebody (probably not even a spammer)
harvests addresses from various sources.
• A “good” harvester scrubs the list.
• The harvester sells the list to lots of spammers.
• Once your address is on a list, it’s going to be
on one or more lists forever.
7. Harvesting From Web Sites
• Spammers usually use a spider program to
scrape addresses off of web pages.
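To see what a spider sees, you can scan your own page source the way a harvester might. The sketch below is a simplified assumption about how such matching works, not the logic of any particular spamware. The names `ADDRESS_RE` and `find_exposed_addresses` are illustrative only.

```python
import re

# Deliberately simple pattern: local part, "@", dotted domain.
# Real address syntax is much broader; this is only an approximation.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_exposed_addresses(page_source: str) -> list[str]:
    """Return the unique plain-text addresses found in page_source, in order."""
    seen = []
    for match in ADDRESS_RE.findall(page_source):
        if match not in seen:
            seen.append(match)
    return seen


if __name__ == "__main__":
    html = '<p>Contact <a href="mailto:jdoe@hp.com">jdoe@hp.com</a></p>'
    print(find_exposed_addresses(html))
```

Running a check like this against your own pages gives a rough idea of which addresses are trivially harvestable.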
11. Usenet Newsgroups
• Spider programs exist to extract these addresses
as well.
• Email addresses are splattered all over:
− Message headers
− Signatures
− Attributions
12. Mailing Lists
• Lots of list manager software provides a list of
every email address on a list.
• Spammers are happy to join a mailing list
temporarily to get access to a list of subscribers.
• Some clever spammers repost an innocuous newbie
question from the list archives with a read-receipt
request; each receipt that comes back reveals a
subscriber's address.
13. 3rd Party Mailing Lists
• People you’ve provided your address to provide it
to 3rd parties (usually for profit).
• Example: Auto insurance quote
• Initial sale of list might be aboveboard, but lists
have a way of trickling down to less desirable
senders.
14. Web Browser Holes
• Newer browsers have eliminated most of these,
but they’re still common in older browsers.
• Extraction of email address from HTTP_FROM
header that browser sends to web server.
• JavaScript to extract email address from
browser’s configuration.
15. Web Browser Holes
• Force browser to fetch an image on a page by
anonymous FTP.
− Most browsers use the configured email address as the
password.
• JavaScript action that sends an email message in
the background on page load.
16. Chat Rooms
• Web bots monitor chat rooms and extract user
names.
• Lots of providers (AOL, Yahoo) use the same
profile names for both chat rooms and email.
• IRC used to be fertile harvesting ground, but less
savvy users have largely stopped using it.
17. Domain Contacts
• Every registered domain name has one or more
contact addresses.
• Addresses are publicly accessible (WHOIS)
• Addresses are almost always valid and read by a
real person on a regular basis.
18. Guessing
• Spammers “guess together” a list of email
addresses.
• The addresses are tested against one or more
email servers.
• Valid addresses are added to a list of addresses
to be spammed.
• Usually referred to as directory harvesting.
19. CAN-SPAM
• The federal CAN-SPAM Act explicitly makes email
address harvesting illegal.
• Some providers of the harvesting software
have ceased and desisted, but harvesting has
actually increased.
• Like most legal solutions, CAN-SPAM is
severely constrained by jurisdictional
boundaries.
20. Harvesting Prevention
• The harder it is for spammers to get your
address, the harder it is for them to spam you.
• “I don’t care – my spam filter is awesome. Bring
it on!”
• No filter is 100% accurate
• Filtering still places load on filtering system
and/or email server.
21. Prevention Methods
• Reformatting addresses
• Web forms
• JavaScript-generated mailto links
• Graphical addresses
• Throwaway addresses
22. Reformatting Addresses
• Prevents harvesting from web pages and
newsgroups.
• Simple examples include inserting bogus strings
into the address to make it invalid:
jdoe@NOSPAM.hp.com
jdoeREMOVEME@hp.com
23. Reformatting Addresses
• Writing the address out longhand can prevent
harvesters from recognizing it as an email
address:
jdoe at hp dot com
• Inserting extra whitespace can also help:
jdoe @ hp.com
jdoe @ hp.com
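The longhand form can be produced mechanically. This is a minimal sketch, assuming a hypothetical helper `munge` and a simplified pattern (`NAIVE_RE`) standing in for what a naive harvester might match. Neither is taken from a real tool.

```python
import re

# Simplified stand-in for a harvester's pattern (an assumption).
NAIVE_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def munge(address: str) -> str:
    """Rewrite user@host.tld as 'user at host dot tld'."""
    local, _, domain = address.partition("@")
    return f"{local} at {domain.replace('.', ' dot ')}"


if __name__ == "__main__":
    shown = munge("jdoe@hp.com")
    print(shown)
    # The naive pattern finds nothing in the munged text.
    print(NAIVE_RE.findall(shown))
```

A harvester written specifically to look for "at" and "dot" could undo this, so it is a speed bump rather than a guarantee.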
24. Reformatting Addresses
• ASCII-encoded characters in the address are
decoded by most web clients, but not by most
spamware:
jdoe@p&#114;ocess&#046;com
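The encoding step is easy to script. The sketch below assumes a hypothetical helper `entity_encode` that encodes every character as a decimal character reference, whereas the slide's example encodes only a few. `html.unescape` from the Python standard library stands in for the decoding a web client performs.

```python
import html


def entity_encode(address: str) -> str:
    """Encode every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch):03d};" for ch in address)


if __name__ == "__main__":
    encoded = entity_encode("jdoe@hp.com")
    print(encoded)
    # A browser decodes the references when rendering; unescape mimics that.
    print(html.unescape(encoded))
```

The encoded text contains no literal "@", so a harvester that matches on raw page source without decoding entities would not see an address.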
25. Web Forms
• Provide an HTML form for web site visitors to
enter a message.
• When the form is submitted, the CGI script mails
the message to the appropriate recipient.
• Avoids displaying the actual address anywhere
on the site.
• Can still be abused, but it’s relatively difficult to
do.
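The server-side half of such a form might look like the following sketch. It is an illustration under assumptions, not the script the presentation had in mind: the `RECIPIENTS` table, the field names `to`, `subject`, and `body`, and the sender `webform@example.com` are all hypothetical. The form submits only a recipient key, and the real address lives on the server.

```python
from email.message import EmailMessage

# Hypothetical mapping from a public form key to the private address.
RECIPIENTS = {"sales": "jdoe@hp.com"}


def build_message(form: dict[str, str]) -> EmailMessage:
    """Build the outgoing mail from submitted form fields.

    The form carries only a recipient key, never an address.
    """
    recipient = RECIPIENTS[form["to"]]  # raises KeyError for unknown keys
    # Drop CR/LF so a visitor cannot smuggle extra headers into the subject.
    subject = form.get("subject", "").replace("\r", " ").replace("\n", " ")

    msg = EmailMessage()
    msg["From"] = "webform@example.com"  # placeholder sender (assumption)
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(form.get("body", ""))
    return msg


if __name__ == "__main__":
    print(build_message({"to": "sales", "subject": "Question", "body": "Hello"}))
```

The resulting message could then be handed to `smtplib.SMTP.send_message` for delivery. Restricting recipients to a fixed table is one way to keep the form from being used as an open relay.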
27. JavaScript Generated mailtos
• Use JavaScript to dynamically generate mailto:
link when the link is clicked.
<A HREF='javascript:void(window.location="mail"+"to:"+"jdoe"+"@"+"hp"+"."+"com")'>
Click here to mail John Doe</A>
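If a site has many addresses, links like this can be generated rather than written by hand. The following is a sketch under assumptions: `mailto_link` is a hypothetical helper, and wrapping the assignment in `void(...)` is one way to keep the browser from replacing the page with the expression's value.

```python
import json
from html import escape


def mailto_link(address: str, label: str) -> str:
    """Return an <a> tag whose href assembles the mailto: URL when clicked."""
    local, _, domain = address.partition("@")
    # Split the URL so neither "mailto:" nor the address appears intact.
    pieces = ["mail", "to:", local, "@"]
    for i, part in enumerate(domain.split(".")):
        if i:
            pieces.append(".")
        pieces.append(part)
    # json.dumps produces a correctly quoted JavaScript string literal.
    expr = "+".join(json.dumps(p) for p in pieces)
    href = f"javascript:void(window.location={expr})"
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


if __name__ == "__main__":
    print(mailto_link("jdoe@hp.com", "Click here to mail John Doe"))
```

The generated markup never contains the address as a contiguous string, which is the property the technique relies on.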
28. Graphical Addresses
• Displaying all or part of an email address as a
graphical image will throw off most harvesting
software.
• No known harvesting software is OCR-capable.
− Anecdotal reports of at least one large spam
organization trying to develop accurate OCR harvesters
29. Graphical Address Complexity
• Graphical @ sign:
− Probably sufficient to throw off most harvesters.
− Username and hostname are still in close proximity.
− Works easily for multiple users/multiple domains.
jdoe [image of @ sign] hp.com
30. Graphical Address Complexity
• Graphical @hostname:
− Should prevent any harvester from working.
− Requires a different image for each email domain.
jdoe [image of @hostname]
31. Graphical Address Complexity
• Graphical everything:
− For the truly paranoid.
− Completely unreadable by harvesters unless they’re
OCR-enabled.
− Requires either a lot of images or a script that can
dynamically generate them.
32. Throwaway Addresses
• Many people create an email account that they
use only for web pages and newsgroups.
• Some software products go further and let you
create an alias for every occasion.
• You still need a static address for business cards,
resumes, etc.
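One way to approximate "an alias for every occasion" is to derive a distinct address per site. The sketch below assumes plus-addressing (`user+tag@domain`), which the presentation does not mention and which not every mail system supports. `make_alias` and its parameters are hypothetical names.

```python
import hashlib
import hmac


def make_alias(user: str, domain: str, site: str, secret: bytes) -> str:
    """Derive a per-site alias of the form user+<tag>@domain.

    Plus-addressing is an assumption; check that your mail system accepts it.
    """
    digest = hmac.new(secret, site.lower().encode("utf-8"), hashlib.sha256)
    tag = digest.hexdigest()[:8]  # short tag that is hard to guess
    return f"{user}+{tag}@{domain}"


if __name__ == "__main__":
    print(make_alias("jdoe", "hp.com", "forum.example.com", b"my-secret"))
```

If spam later arrives at one alias, the tag indicates which site the address was taken from, and that alias alone can be filtered or retired.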
33. Harvesting Software
• Tons of specialized software (spamware) is used
by spammers to harvest addresses.
• Most spamware developed in Eastern Europe
and Asia.
• We’re going to look at several of the most popular
packages.
34. List Harvester
• Harvests addresses from web sites.
• “Targeted” harvesting - in theory, the harvested
email addresses have something in common.
• Appears to be based in China.
• http://www.listharvester.com
• Price: $699 US
35. List Harvester - Method
• Performs a search for one or more keywords on
the user’s choice of search engine.
• Parses every site returned by the search engine
in order, looking for addresses and links.
• Follows links to other pages and parses them for
addresses as well.
43. Atomic Email Hunter
• Harvests addresses from web sites.
• Either scans an entire web site for addresses or
performs a “targeted search” like List Harvester.
• Based in Russia, most likely Moscow.
• http://www.massmailsoftware.com/
• Price: $79.85 US
49. Fast Newsgroups Extractor
• Harvests addresses from newsgroups.
• Has a companion web site extractor that’s very
similar to Atomic Email Hunter.
• Based in Russia, most likely Moscow.
• http://www.lencom.com
• Price: $79.00 US
50. Fast Newsgroups Extractor - Method
• Lets user select one or more newsgroups to
extract content from.
• Downloads multiple messages simultaneously
from the NNTP server.
• Extracts addresses from the downloaded
messages.
• Has the ability to limit downloaded messages to
those that contain certain text in the subject.
57. Quick Review
• We talked about:
− What email address harvesting is
− What data sources are harvested
− How you can protect your addresses
− 3 software packages used by spammers to harvest
addresses