A Plan for Spam (The Power of Statistical Filtering)

Paul Graham's Website | August 2002 | Paul Graham

Posted on 08/16/2002 1:41:01 PM PDT by E. Pluribus Unum

August 2002

(This article describes the spam-filtering techniques used in the new spamproof web-based mail reader we're building to exercise Arc.)

I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

_ _ _

To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process?

I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss only 5 per 1000 spams, with 0 false positives.

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will take a big bite out of your incoming spam. Merely looking for the word "click" will catch 79.7% of the emails in my spam corpus, with only 1.2% false positives.

I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as spams. In the spam filtering business, false positives are your biggest worry. For most users, missing legitimate email is an order of magnitude worse than receiving spams, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.

The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.

_ _ _

Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.
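The counting step described above could be sketched roughly as follows, in Common Lisp like the article's other examples. The function names (token-char-p, tokenize, count-tokens) and the choice to represent a corpus as a list of strings are assumptions made for illustration; only the token rules themselves come from the text.

```lisp
;; Illustrative sketch; the names here are hypothetical, not the author's.

(defun token-char-p (ch)
  "True for characters the article treats as part of a token:
alphanumerics, dashes, apostrophes, and dollar signs."
  (or (alphanumericp ch)
      (member ch '(#\- #\' #\$))))

(defun tokenize (text)
  "Split TEXT into lowercase tokens; any other character is a separator."
  (let ((tokens '())
        (start nil))
    (dotimes (i (length text))
      (if (token-char-p (char text i))
          (unless start (setf start i))          ; a token begins here
          (when start                            ; a token just ended
            (push (string-downcase (subseq text start i)) tokens)
            (setf start nil))))
    (when start                                  ; token running to end of text
      (push (string-downcase (subseq text start)) tokens))
    (nreverse tokens)))

(defun count-tokens (messages)
  "Return a hash table mapping each token to its number of occurrences
across MESSAGES, a list of strings (one string per email)."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (message messages table)
      (dolist (token (tokenize message))
        (incf (gethash token table 0))))))
```

For instance, (tokenize "Dear Friend, win $7500 at mx-05!") yields the tokens dear, friend, win, $7500, at and mx-05. Calling count-tokens once on the spam corpus and once on the nonspam corpus would give the two tables mentioned above.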

Next I create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which I calculate as follows [1]:

    (let ((g (* 2 (or (gethash word good) 0)))
          (b (or (gethash word bad) 0)))
      (unless (< (+ g b) 5)
        (max .01
             (min .99 (float (/ (min 1 (/ b nbad))
                                (+ (min 1 (/ g ngood))
                                   (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables I created in the first step, and ngood and nbad are number of nonspam and spam messages respectively.

I've expressed this as code to show a couple of important details. I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur at least five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
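To see how these details play out, here is a sketch that wraps the expression above in a function and uses it to build the third hash table. The wrapper names (word-probability, build-probability-table, make-counts) and all the counts in the example are made up for illustration; only the body of word-probability is taken from the article.

```lisp
(defun word-probability (word good bad ngood nbad)
  "Spam probability for WORD, or NIL if it has been seen too rarely."
  (let ((g (* 2 (or (gethash word good) 0)))   ; nonspam counts are doubled
        (b (or (gethash word bad) 0)))
    (unless (< (+ g b) 5)
      (max .01
           (min .99
                (float (/ (min 1 (/ b nbad))
                          (+ (min 1 (/ g ngood))
                             (min 1 (/ b nbad))))))))))

(defun build-probability-table (good bad ngood nbad)
  "Hypothetical helper: map every sufficiently common token to its probability."
  (let ((probs (make-hash-table :test #'equal)))
    (flet ((consider (word n)
             (declare (ignore n))
             (let ((p (word-probability word good bad ngood nbad)))
               (when p (setf (gethash word probs) p)))))
      (maphash #'consider good)
      (maphash #'consider bad))
    probs))

(defun make-counts (alist)
  "Hypothetical helper: build a token -> count table from an alist."
  (let ((table (make-hash-table :test #'equal)))
    (loop for (word . n) in alist
          do (setf (gethash word table) n))
    table))

;; Made-up counts, assuming 4000 messages in each corpus.
(defparameter *good*
  (make-counts '(("offers" . 10) ("though" . 3) ("qves" . 1))))
(defparameter *bad*
  (make-counts '(("offers" . 90) ("madam" . 7) ("qves" . 1))))
(defparameter *probs*
  (build-probability-table *good* *bad* 4000 4000))
```

With these invented numbers, "offers" comes out at 9/11 (about .82), "though" is clamped to .01 because it never appears in spam, "madam" is clamped to .99 because it never appears in nonspam mail, and "qves" gets no entry at all because its doubled total of 3 falls below the cutoff of 5.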

The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.

When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. Bayes' Rule says that if probs is a list of the fifteen individual probabilities, you calculate the combined probability thus:

    (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                         probs)))))

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .2 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.

There are examples of this algorithm being applied to actual emails in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range.
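Putting the last few steps together, a classifier along these lines might look like the sketch below. It assumes the mail has already been split into a list of tokens and that table maps tokens to spam probabilities; the function and variable names are hypothetical, and the constants (.2 for unknown words, fifteen tokens, a .9 threshold) are the ones quoted in the text.

```lisp
(defparameter *unknown-word-probability* .2
  "Probability assigned to a token that is not in the table.")

(defun interestingness (p)
  "Distance of probability P from a neutral .5."
  (abs (- p .5)))

(defun most-interesting (tokens table &optional (n 15))
  "Probabilities of the N tokens furthest from .5.
Repeated tokens are counted each time they occur."
  (let ((probs (mapcar (lambda (token)
                         (gethash token table *unknown-word-probability*))
                       tokens)))
    (subseq (sort probs #'> :key #'interestingness)
            0 (min n (length probs)))))

(defun combine (probs)
  "Combined probability, using the formula given in the article."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs))))))

(defun spam-p (tokens table &optional (threshold .9))
  "True if the combined probability of TOKENS exceeds THRESHOLD."
  (> (combine (most-interesting tokens table)) threshold))
```

Keeping the threshold as an optional argument makes it easy to confirm the point made above: for most mail, moving it around changes very little.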

_ _ _

One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
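As a check on that last figure, here is the combining formula from earlier in the article applied to just these two probabilities, with intermediate values rounded:

```latex
P = \frac{0.97 \times 0.99}{0.97 \times 0.99 + (1 - 0.97)(1 - 0.99)}
  = \frac{0.9603}{0.9603 + 0.0003}
  \approx 0.9997
```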

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered a lot of words I hadn't thought of.

When I said at the start that our filters let through only 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.
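One way the two delete buttons might feed the per-user corpora is sketched below. The corpus structure and the names add-message, delete-as-spam and file-as-nonspam are assumptions for illustration, and each message is assumed to arrive already tokenized as a list of strings.

```lisp
(defstruct corpus
  (counts (make-hash-table :test #'equal))   ; token -> number of occurrences
  (messages 0))                              ; number of emails added so far

(defun add-message (corpus tokens)
  "Record one email, given as a list of TOKENS, in CORPUS."
  (incf (corpus-messages corpus))
  (dolist (token tokens corpus)
    (incf (gethash token (corpus-counts corpus) 0))))

(defvar *spam* (make-corpus))      ; this user's spam corpus
(defvar *nonspam* (make-corpus))   ; this user's nonspam corpus

(defun delete-as-spam (tokens)
  "Handler for the delete-as-spam button."
  (add-message *spam* tokens))

(defun file-as-nonspam (tokens)
  "Handler for everything else, including ordinary delete."
  (add-message *nonspam* tokens))
```

Tracking the message count alongside the token counts matters because the probability calculation divides by the number of emails in each corpus rather than by their length.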

You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.

Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash.

I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, ip addresses, and mailer versions and protocols.

_ _ _

If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

Still, anyone who proposes a plan for spam filtering has to be able to answer the question: if the spammers knew exactly what you were doing, how well could they get past you? For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.

To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:

    Hey there. Thought you should check out the following:
    http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who don't even try to conceal their identities, to guys who hijack mail servers to send out spams promoting porn sites. If we use filtering to whittle their options down to mails like the one above, that should pretty much put the spammers on the "legitimate" end of the spectrum out of business, because they feel obliged by various state laws to include boilerplate about why their spam is not spam, and how to cancel your "subscription," and that kind of text is easy to recognize.

(I used to think it was naive to believe that stricter laws would decrease spam. Now I think that while stricter laws may not decrease the amount of spam that spammers send, they can certainly help filters to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (maybe 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing. The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that. Even so, sending spam does cost the spammer something, so the lower we can get the response rate, the fewer businesses will find it worth their while to send spam.

The reason the spammers use the kinds of sales pitches that they do is to increase response rates. This is possibly even more disgusting than getting inside the mind of a spammer, but let's take a quick look inside the mind of someone who responds to a spam. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them. The spammers wouldn't say these things if they didn't sound exciting. And "thought you should check out the following" is just not going to have nearly the pull with the spam recipient as the kinds of things that spammers say now. Result: if it can't contain exciting sales pitches, spam becomes less effective as a marketing vehicle, and fewer businesses want to use it.

That is the big win in the end. I started writing spam filtering software because I didn't want to have to look at the stuff anymore. But if we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.

_ _ _

Of all the approaches to fighting spam, from software to laws, I believe Bayesian filtering will be the single most effective. But I also think that the more different kinds of antispam efforts we undertake, the better, because any measure that constrains spammers will tend to make filtering easier. And even within the world of content-based filtering, I think it will be a good thing if there are many different kinds of software being used simultaneously. The more different filters there are, the harder it will be for spammers to tune spams to get through them.

Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this article. The fifteen most interesting words (note that some occur multiple times) in this spam are:

    indira mx-05 intimail qves qvp0045 qves 3779 platinum indira mx-05 intimail $7500 freeyankeedom cdo unsecured

The words are a mix of stuff from the headers and from the message body, which is typical of spam. Also typical of spam is that every one of these words has a spam probability, in my database, of .99. In fact there are more than fifteen words with probabilities of .99, and these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of Bayes' Rule. To see an interesting variety of probabilities we have to look at this actually quite atypical spam.

The fifteen most interesting words in this spam, with their probabilities, are:

    263               0.99
    263               0.99
    madam             0.99
    promotion         0.99
    republic          0.99
    republic          0.99
    shortest          0.047225013
    mandatory         0.047225013
    standardization   0.07347802
    2600              0.0813768
    sorry             0.08221981
    supported         0.09019077
    people's          0.09019077
    people's          0.09019077
    enter             0.9075001

This time the evidence is a mix of good and bad. A word like "shortest" is almost as much evidence for innocence as a word like "madam" or "promotion" is for guilt. But still the case for guilt is stronger. If you combine these numbers according to Bayes' Rule, the resulting probability is .9999281.
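The combined figure can be checked with a few lines of Common Lisp. This sketch multiplies together the odds p/(1-p) of each word, which is algebraically the same as the combining formula given earlier in the article; the names odds, combine-by-odds and *atypical-spam* are illustrative, not the author's.

```lisp
(defun odds (p)
  "Odds in favour of spam corresponding to probability P."
  (/ p (- 1 p)))

(defun combine-by-odds (probs)
  "Multiply the individual odds, then convert back to a probability."
  (let ((o (reduce #'* (mapcar #'odds probs))))
    (/ o (+ 1 o))))

;; The fifteen probabilities listed above, in the same order.
(defparameter *atypical-spam*
  '(0.99 0.99 0.99 0.99 0.99 0.99
    0.047225013 0.047225013 0.07347802 0.0813768 0.08221981
    0.09019077 0.09019077 0.09019077 0.9075001))

;; Should agree with the .9999281 quoted above, up to float rounding.
(combine-by-odds *atypical-spam*)
```

Looking at the odds also makes the balance of evidence easier to see: each probability above .5 multiplies the running odds by a factor greater than one, and each probability below .5 by a factor less than one.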

I was curious about where these probabilities come from. It turns out that I've had a number of spams from 263.net, not enough that I recognized it, but enough for 263 to look bad to the filters.

"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam emails, and also occurs once or twice in spams referring to Korea and South Africa. You might say that it's an accident that it thus helps identify this spam. But I've found when examining spam probabilities that there are a lot of these accidents, and they have an uncanny tendency to push things in the right direction rather than the wrong one. In this case, it is not entirely a coincidence that the word "Republic" occurs in Nigerian scam emails and this spam. There is a whole class of dubious business propositions involving less developed countries, and these in turn are more likely to have names that specify explicitly (because they aren't) that they are republics.[2]

On the other hand, "enter" is a genuine miss. It occurs mostly in unsubscribe instructions, but here is used in a completely innocent way. Fortunately the statistical approach is fairly robust, and can tolerate quite a lot of misses before the results start to be thrown off.

Finally, here is an innocent email. Its fifteen most interesting words are as follows:

    continuation      0.01
    continuation      0.01
    describe          0.01
    continuations     0.01
    example           0.033600237
    programming       0.05214485
    programming       0.05214485
    i'm               0.055427782
    examples          0.07972858
    color             0.9189189
    localhost         0.09883721
    localhost         0.09883721
    paulgraham        0.10752564
    hi                0.116539136
    california        0.84421706

Most of the words here indicate the mail is an innocent one. There are two bad smelling words, "color" (spammers love colored fonts) and "California" (which occurs in testimonials and also in menus in forms), but they are not enough to outweigh obviously innocent words like "continuation" and "example".

It's interesting that "describe" rates as so thoroughly innocent. A probability of .01 means it hasn't occurred in a single one of my 4000 spams. The data turns out to be full of such surprises. One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in. It's that fact, together with the equally characteristic vocabulary of any individual user's mail, that makes Bayesian filtering a good bet.

Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in my current database, the word "offers" has a probability of .96. If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.
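A word-pair version would only need a different tokenizing step; the counting and probability calculations could stay as they are. The sketch below is one possible way to form the pairs, assuming each message has already been split into a list of tokens. The names word-pairs and count-pairs are hypothetical, and since the article says this idea is untried, this is not a description of the author's software.

```lisp
(defun word-pairs (tokens)
  "Adjacent pairs of TOKENS, each joined with a single space."
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

(defun count-pairs (token-lists)
  "Hash table mapping each word pair to its number of occurrences
across TOKEN-LISTS, a list of token lists (one per email)."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (tokens token-lists table)
      (dolist (pair (word-pairs tokens))
        (incf (gethash pair table 0))))))
```

For example, (word-pairs '("this" "approach" "offers")) gives "this approach" and "approach offers", so "approach offers" and "special offers" would be counted as separate entries.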

The reason I haven't done this is that filtering based on individual words already works so well. But it does mean that there is room to tighten the filters if spam gets harder to detect. (Curiously, a filter based on word pairs would be in effect a Markov-chaining text generator running in reverse.)

Another thing I haven't done is to focus extra attention on specific parts of the email. About 95% of current spam, for example, includes the url of a site they want you to visit. (The remaining 5% want you to reply by email or to a US mail address, or in a few cases to buy a certain stock.) The url is in such cases practically enough by itself to determine whether the email is spam.

It might be a good idea to have a cooperatively maintained list of urls promoted by spammers. We'd need a trust metric of the type studied by Raph Levien to prevent malicious or incompetent submissions, but if we had such a thing it would provide a boost to any filtering software. It would also be a convenient basis for boycotts.

Another way to test dubious urls would be to send out a crawler to look at the site before the user looked at the email mentioning it. You could use a Bayesian filter to rate the site just as you would an email, and whatever was found on the site could be included in calculating the probability of the email being a spam.

One cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam. A large, clean corpus is the key to making Bayesian filtering work well. Bayesian filters could actually use the corpus as input. But such a corpus would be useful for other kinds of filters too, because it could be used to test them.

Creating such a corpus poses some technical problems. We'd need trust metrics to prevent malicious or incompetent submissions, of course. We'd also need ways of erasing personal information (not just to-addresses and ccs, but also e.g. the arguments to unsubscribe urls, which often encode the to-address) from mails in the corpus. If anyone wants to take on this project, it would be a good thing for the world.

Appendix: Defining Spam

I think there is a rough consensus on what spam is, but it would be useful to have an explicit definition. We'll need to do this if we want to establish a central corpus of spam, or even to compare spam filtering rates meaningfully.

To start with, spam is not unsolicited commercial email. If someone in my neighborhood heard that I was looking for an old Raleigh three-speed in good condition, and sent me an email offering to sell me one, I'd be delighted, and yet this email would be both commercial and unsolicited. The defining feature of spam (in fact, its raison d'etre) is not that it is unsolicited, but that it is automated.

It is merely incidental, too, that spam is usually commercial. If someone started sending mass email to support some political cause, for example, it would be just as much spam as email promoting a porn site.

I propose we define spam as unsolicited automated email. This definition thus includes some email that many legal definitions of spam don't. Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient. But buying something from a company, for example, does not imply that you have solicited ongoing email from them. If I order something from an online store, and they then send me a stream of spam, it's still spam.

Companies sending spam often give you a way to unsubscribe, or ask you to go to their site and change your "account preferences" if you want to stop getting spam. This is not enough to stop the mail from being spam. Not opting out is not the same as opting in. Unless the recipient explicitly checked a clearly labelled box (whose default was no) asking to receive the email, then it is spam.

In some business relationships, you do implicitly solicit certain kinds of mail. When you order online, I think you implicitly solicit a receipt, and notification when the order ships. I don't mind when Verisign sends me mail warning that a domain name is about to expire (at least, if they are the actual registrar for it). But when Verisign sends me email offering a FREE Guide to Building My E-Commerce Web Site, that's spam.[3]

Notes:

[1] The examples in this article are translated into Common Lisp for, believe it or not, greater accessibility. The application described here is one that we wrote in order to test a new Lisp dialect called Arc that is not yet released.

[2] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

[3] I've been gradually transferring all my domains from Verisign to EasyDNS, and I am very happy with them. They're cheaper, much more responsive, their site works, and they have never spammed me.

Thanks to Sarah Harlin for reading drafts of this; Dan Giffin (who is also writing the production Arc interpreter) for several good ideas about filtering and for creating our mail infrastructure; Robert Morris and Trevor Blackwell for many discussions about spam; Raph Levien for advice about trust metrics; and Chip Coldwell and Sam Steingold for advice about statistics.

More Info:

Spam Conference. Cambridge, MA, January 2003.

TOPICS: Culture/Society; News/Current Events; Technical

1 posted on 08/16/2002 1:41:01 PM PDT by E. Pluribus Unum

To: E. Pluribus Unum

"The Achilles heel of the spammers is their message."

No, the Achilles heel of spammers is that they can't receive replies to their spam ads.

Any email program that sends a ping reply to every unrecognized (i.e. first-time) "From" address and watches for bounces can stop spam dead.

But the content of spam can be changed on the fly by automated software and a thesaurus.

Attack them where they live. Spammers can't receive direct e-mail replies to their ads, and that's patently easy to detect.

2 posted on 08/16/2002 1:46:40 PM PDT by Southack

To: E. Pluribus Unum

I will bookmark this for later reading. I'm sure there are a lot of good ideas in here.

Another approach that I have heard of is to create phony mail accounts and then spread those email addresses on message boards, web sites, etc., wherever spammers are likely to pick 'em up. Since the email accounts are really not associated with any real person, any email that comes to these accounts is almost by definition spam. The thing that makes spam so attractive to spammers is that it's automated, that basically I get the same spam messages that you do. So if you have a bunch of spam collectors (like ant or roach hotels) then you just look for email in real accounts that is not that different from what you collected in the "spam hotels".

Obviously some work would go into determining what "not that different" means, and this approach would be better deployed at the enterprise or the ISP level (and not at the individual user level), but I think this approach would have a lot of merit.

3 posted on 08/16/2002 1:49:07 PM PDT by 2 Kool 2 Be 4-Gotten

To: PatrickHenry; general_re; VadeRetro; Junior; jennyp; longshadow; Gumlegs

Fixin' my spam filter right this minute!

4 posted on 08/16/2002 1:52:00 PM PDT by balrog666

To: E. Pluribus Unum

Kewl! There's hope.

I especially like the concept of a "delete as spam" button that helps set the rules for what I think is spam...

5 posted on 08/16/2002 2:00:35 PM PDT by null and void

To: Southack

Attack them where they live. Spammers can't receive direct e-mail replies to their ads, and that's patently easy to detect.

You can't really verify that unless you actually try and send them an e-mail. Most mail servers will give a positive response to any VRFY request, or not respond at all. Some spam comes with a valid FROM: address, except it isn't theirs. Some commercial spammers will run their own servers, and have a server to accept and bit-bucket the bounces. Personally, I think the best approach is to have a bounty on spammers.

6 posted on 08/16/2002 2:01:19 PM PDT by tacticalogic

To: E. Pluribus Unum

that is an AWESOME article! Thanks for posting it! I just might start coding one of those this weekend. CC'd that one to all my geek friends

7 posted on 08/16/2002 2:02:04 PM PDT by WindMinstrel

To: null and void

The really cool thing is that it is adaptive - the filter changes in response to statistical changes in spammer tactics.

8 posted on 08/16/2002 2:02:08 PM PDT by E. Pluribus Unum

To: 2 Kool 2 Be 4-Gotten

Another approach that I have heard of is to create phony mail accounts and then spread those email addresses on message boards, web sites, etc.,

Do a search on wpoison. It's a script that you add to a website that will randomly generate thousands of bogus e-mail addresses to feed the spiders. Every address is also a hyperlink, and following the link gets the spider to another pageful of freshly generated addresses.

9 posted on 08/16/2002 2:05:05 PM PDT by tacticalogic

[ Post Reply | Private Reply | To 3 | View Replies]

To: WindMinstrel

The key concept is:

Essentially, each user should have two delete buttons, ordinary delete and delete- as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.

When you delete something as spam, the program automatically runs statistics on each word and combination of words and adds them to the spam profile database.

Why didn't I think of this?
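A rough sketch of the two-button bookkeeping, under some loud assumptions: the class and method names are invented, it counts single words only (not combinations), and the probability is a plain ratio of per-message frequencies rather than the article's tweaked formula. Unseen words get a neutral 0.5, which is also an assumption.

```python
import re
from collections import Counter

class SpamTrainer:
    """Hypothetical two-corpus word counter driven by two delete buttons."""

    def __init__(self):
        self.spam_counts = Counter()
        self.good_counts = Counter()
        self.spam_messages = 0
        self.good_messages = 0

    @staticmethod
    def tokenize(text):
        # Lowercase runs of letters, digits, dollar signs, apostrophes, dashes.
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def delete_as_spam(self, text):
        """The 'delete as spam' button: add the message to the spam corpus."""
        self.spam_counts.update(self.tokenize(text))
        self.spam_messages += 1

    def delete(self, text):
        """The ordinary delete button: add the message to the nonspam corpus."""
        self.good_counts.update(self.tokenize(text))
        self.good_messages += 1

    def spam_probability(self, word):
        """Estimate how spammy a word is; 0.5 if it has never been seen."""
        spam_rate = self.spam_counts[word] / max(self.spam_messages, 1)
        good_rate = self.good_counts[word] / max(self.good_messages, 1)
        if spam_rate + good_rate == 0:
            return 0.5
        return spam_rate / (spam_rate + good_rate)
```

Because the counts update on every delete, the word probabilities shift as the mail you receive changes, which is where the adaptive behavior comes from.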

10 posted on 08/16/2002 2:06:43 PM PDT by E. Pluribus Unum

[ Post Reply | Private Reply | To 7 | View Replies]

To: tacticalogic

Every address is also a hyperlink, and following the link gets the spider to another pageful of freshly generated addresses.

So the spammers will have to use statistics to filter out bogus addresses. Too cool.

11 posted on 08/16/2002 2:08:05 PM PDT by E. Pluribus Unum

[ Post Reply | Private Reply | To 9 | View Replies]

To: tacticalogic; Lazamataz

"You can't really verify that unless you actually try and send them an e-mail."

There's no barrier on sending them an email, so it isn't like that's a big hurdle.

Every email that you receive should come from a previously verified email address (e.g. friends, businesses, family). Emails that are NOT from already verified addresses SHOULD be verified automatically by your spam filter.

That's easy to do. Send a real email to any "unknown" sender, and ask that they make a certain type of reply (on-line, web, email, message) if they want their message to be passed on to you. If there is a real person who wants to communicate with you, then they will verify themselves, but stolen email addresses and invalid email addresses (which is what spammers use) clearly won't be able to get onto your list of verified email addresses.

And people only have to get verified (to send email to you) one time. After that they are then on your list so that all future emails from them to you go through normally.
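The scheme described above could look something like this. It is only a sketch: the class name, the token mechanism, and the return values are assumptions, and actually sending the challenge email is left out.

```python
import secrets

class ChallengeResponseFilter:
    """Hypothetical whitelist filter: unknown senders must answer a challenge once."""

    def __init__(self, address_book=()):
        # Addresses already in the address book count as verified.
        self.verified = {a.lower() for a in address_book}
        self.pending = {}  # token -> (sender, list of held messages)

    def receive(self, sender, message):
        """Return ('deliver', None), ('challenge', token), or ('held', token)."""
        sender = sender.lower()
        if sender in self.verified:
            return ("deliver", None)
        # Reuse an outstanding challenge rather than sending a second one.
        for token, (held_sender, held) in self.pending.items():
            if held_sender == sender:
                held.append(message)
                return ("held", token)
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender, [message])
        return ("challenge", token)

    def confirm(self, token):
        """Sender answered the challenge: whitelist them and release held mail."""
        sender, held = self.pending.pop(token, (None, []))
        if sender is not None:
            self.verified.add(sender)
        return held
```

A forged or dead return address never calls confirm, so its messages simply stay in the pending pile, while a real correspondent has to do this only once.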

12 posted on 08/16/2002 2:09:30 PM PDT by Southack

[ Post Reply | Private Reply | To 6 | View Replies]

To: E. Pluribus Unum

from the notes

[2] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

True enough.

13 posted on 08/16/2002 2:13:51 PM PDT by 2 Kool 2 Be 4-Gotten

[ Post Reply | Private Reply | To 1 | View Replies]

To: Southack

All well and good from the client side. It isn't going to do much for my mail servers, and will probably increase the load on them. (Just the network curmudgeon's .02).

14 posted on 08/16/2002 2:16:54 PM PDT by tacticalogic

[ Post Reply | Private Reply | To 12 | View Replies]

To: E. Pluribus Unum; Southack

Very Good read E. Pluribus Unum...

Your comment is very worthy of reading too, Southack. A pinging program that flagged sender addresses that were not able to receive email, and blocked the incoming spam from them, would be nice.

But all these processes still leave the public with downloading the email and wasted bandwidth. But as the author mentioned, if the spam was not effective there would not be any spam. My suggestion is to never ever respond to spam. If people did that, presto, no more spam.

15 posted on 08/16/2002 2:20:43 PM PDT by LowOiL

[ Post Reply | Private Reply | To 1 | View Replies]

To: tacticalogic

"All well and good from the client side. It isn't going to do much for my mail servers, and will probably increase the load on them. (Just the network curmudgeon's .02)."

True, but your users would perceive that you were blocking out all of their spam (since my filter as described above blocks it from their In Boxes), and you would get the credit (and hopefully, the corresponding raises).

16 posted on 08/16/2002 2:26:58 PM PDT by Southack

[ Post Reply | Private Reply | To 14 | View Replies]

To: Southack

That's easy to do. Send a real email to any "unknown" sender, and ask that they make a certain type of reply (on-line, web, email, message) if they want their message to be passed on to you.

You'll also have to make arrangements for manually whitelisting addresses. Listservers and autoresponders would be examples of mail you'd want to receive, but there wouldn't be any way for the sender to respond to the verification request without re-programming and having standards for the verification response. If you did that, the spammers would set up their own auto-verify servers.

17 posted on 08/16/2002 2:29:37 PM PDT by tacticalogic

[ Post Reply | Private Reply | To 12 | View Replies]

To: tacticalogic

Yes, whatever addresses are already in your email address book will go through, so to whitelist an automated email address, simply add it to your email address book.

18 posted on 08/16/2002 2:33:39 PM PDT by Southack

[ Post Reply | Private Reply | To 17 | View Replies]

To: 2 Kool 2 Be 4-Gotten

Union of Soviet Socialist Republics.
People's Republic of China.
Democratic People's Republic of Korea.
Republic of Iraq
Islamic Republic of Iran
Federal Democratic Republic of Ethiopia
Arab Republic of Egypt
Syrian Arab Republic
French Republic (Sorry... I had to.)
Somali Democratic Republic
Islamic Republic of Pakistan
Republic of South Africa
Federal Republic of Nigeria
Republic of Zimbabwe
Socialist People's Libyan Arab Jamahiriya
Republic of the Sudan
Republic of Yemen
Democratic and Popular Republic of Algeria

That's a pretty good rule...

19 posted on 08/16/2002 2:37:42 PM PDT by jae471

[ Post Reply | Private Reply | To 13 | View Replies]

To: E. Pluribus Unum

"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.

"Dear Sir or Madam" is the appropriate opening to a business correspondence when you are unsure as to the gender of the person you are corresponding with. For example, if you are replying to an employment ad where the reply address is the HR department, you would frame your email like a business letter and the greeting would be "Dear Sir or Madam".

I hope this guy isn't waiting on any resumes.

20 posted on 08/16/2002 2:39:32 PM PDT by Cable225

[ Post Reply | Private Reply | To 1 | View Replies]

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794