The Language of E-Mail Spam
I was struck today by an effort of spammers to defeat Gmail's excellent spam detector. This morning I was greeted by a bunch of spam e-mails which were variations of
Grand message. You need to read.
There were three instances of
Essential note. You have to read.
Each of these was ostensibly from a different sender. The other subject lines were:
Weighty message. You require to read.
Momentous note. You must to read.
Grand note. You require to read.
They all have a largely incomprehensible, very ungrammatical message which is quite comical and very much worth your reading, given that laughter is good for the soul:
The great prognosis are drawn up.
The increase is up to 70% lately.
(MXXR) is the profitable deal and those who knows it is making money.
The drilling achivements of this highly capable oil partnership exceeded all its expectations.
One time this data hits the outdoors there will be no stopping this one.
Right now it's about 0.025 but we are waiting it to triple.
Once the news is made and the PR gets into full brandish.
Don't waste time and miss out. We counsel you to buy today.
The key is getting in early and the time is pressing. They say that Monday is the day this one will shoot. Find your position before that happens.
There are minor variations in the first sentence of the mail. In addition to the first sentence of the quoted passage, one finds:
The great predictions are made.
The great anticipations are drawn up.
The great prognosis are made.
The great anticipations are made.
I would venture to guess that this e-mail was generated by either a fairly recent immigrant to the US or someone in Asia who speaks a language that doesn't trouble itself with number agreement between subject and verb. One example of the odd grammar is "You require to read," this being an incredibly mangled version of "You are required to read this."
I checked Gmail's identified spam and found many instances of this class of e-mail, including
Very important message. You require to read.
Weighty note. You must to read.
Some of these had a message differing from the one quoted above.
Unfortunately, I discovered that Gmail is identifying some nonspam as spam. It is unfortunate because recently I have been banishing detected spam without looking at it. Being able to do that is the goal of spam detection algorithms, of course. Sadly, my main Ohio State University e-mail account and my main Roadrunner e-mail account collect most of the spam, and these are accounts I can't get rid of.
One other class of spam e-mail that got through had the Spanish title "CURSO COMPLETO DE ALEMÁN," which I think offers a complete course in German along with a bilingual dictionary. At least this one offered something of potential interest, though how I would use the product is difficult to see given my poor knowledge of both German and Spanish.
Clearly, this morning's English-language spam that slipped through was generated by an algorithm that combined plausible, garden-variety English-language names. One of the sender names combined an English first name ("Ethan") with an English rendering of a Chinese last name ("Chan"), but that too is plausibly American. The algorithm evidently used a thesaurus to combine "synonyms" of "weighty," or some other word in its loosely defined semantic class, with the noun "message," and "synonyms" of "need/require" with the infinitive phrase "to read." The problem is that some of these "synonyms" don't take infinitive phrases like this one.
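To make the guess above concrete, here is a minimal Python sketch of that kind of thesaurus-driven template. It is only my reconstruction: the word lists are lifted from the subject lines quoted in this post, and the names ADJECTIVES, NOUNS, VERBS, and TAKES_TO_INFINITIVE are made up for illustration. I have no idea what the spammer's actual program looks like.

```python
from itertools import product

# Hypothetical word lists, taken from the subject lines quoted in this post.
ADJECTIVES = ["Grand", "Essential", "Weighty", "Momentous", "Very important"]
NOUNS = ["message", "note"]
VERBS = ["need", "have", "require", "must"]

# Assumption: of these "synonyms", only these accept a bare "to read"
# after them in standard English ("You need to read", "You have to read").
TAKES_TO_INFINITIVE = {"need", "have"}


def subject_lines():
    """Return every adjective/noun/verb combination of the template."""
    return [
        f"{adj} {noun}. You {verb} to read."
        for adj, noun, verb in product(ADJECTIVES, NOUNS, VERBS)
    ]


def ungrammatical_verbs():
    """Return the verbs that the blind substitution gets wrong."""
    return [verb for verb in VERBS if verb not in TAKES_TO_INFINITIVE]


lines = subject_lines()
bad_verbs = ungrammatical_verbs()
```

With these lists the template yields 5 x 2 x 4 = 40 distinct subject lines, among them "Weighty message. You require to read." and "Momentous note. You must to read." A generator this naive swaps words without checking what each verb can be followed by, which is the failure described above.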
Why Gmail identified some of these messages as spam but not the rest, I don't know. The most amusing sender name of all was Mr. "Debt Help." This e-mail was generated by Blogger when Mr. or Mrs. or Ms. Debt Help commented on my "He's my Bitch" blog. When I went to look at that blog's comments, however, I couldn't find Mr. Debt Help's two comments. Perhaps the beta version of Blogger I am using has a nice spam filter. That would be a good thing, as Martha would say.
Identifying phony e-mails involves a language-based algorithm that checks out the sender name, the subject line, and the first sentence, at a bare minimum. That is why, I think, the first sentences of the spam mail were varied by the spammer. The same tools used to identify spam are also used in machine translation and involve the literal meanings of subject lines and the first line or sentence of the body. Unfortunately, Blogger does not have even a rudimentary grammar checker like the one in Microsoft Word; if it did, it would detect the subject-verb agreement failures and the crazy infinitive constructions. An interesting failure of the Gmail spam filter is that it hasn't taken the hint that Spanish-language e-mails are unwanted. Perhaps it regards my insensitivity to Spanish-language e-mails as a moral failure on my part. I offer up this lightweight blog to let you know that as you slog through your e-mail trying to sort out the real messages from the spam, you are not alone.