Friday, October 20, 2006

The Language of E-Mail Spam

I was struck today by an effort of spammers to defeat Gmail's excellent spam detector. This morning I was greeted by a bunch of spam e-mails whose subject lines were variations of
Grand message. You need to read.
There were three instances of
Essential note. You have to read.
Each of these was ostensibly from a different sender. The other subject lines were:
Weighty message. You require to read.
Momentous note. You must to read.
Grand note. You require to read.
They all have a largely incomprehensible, very ungrammatical message which is quite comical and very much worth your reading, given that laughter is good for the soul:
The great prognosis are drawn up.
The increase is up to 70% lately.
(MXXR) is the profitable deal and those who knows it is making money.
The drilling achivements of this highly capable oil partnership exceeded all its expectations.
One time this data hits the outdoors there will be no stopping this one.
Right now it's about 0.025 but we are waiting it to triple.
Once the news is made and the PR gets into full brandish.
Don't waste time and miss out. We counsel you to buy today.
The key is getting in early and the time is pressing. They say that Monday is the day this one will shoot. Find your position
before that happens.
There are minor variations in the first sentence of the mail. In addition to the first sentence of the quoted passage, one finds:
The great predictions are made.
The great anticipations are drawn up.
The great prognosis are made.
The great anticipations are made.
I would venture to guess that this e-mail was generated by either a fairly recent immigrant to the US or someone in Asia who speaks a language that doesn't trouble itself with number agreement between subjects and verbs. One example of the mangled English is "You require to read," this being an incredibly mangled version of "You are required to read this." I checked Gmail's identified spam and found many instances of this class of e-mail, including
Very important message. You require to read.
Weighty note. You must to read.
Some of these had a message differing from the one quoted above.

Unfortunately, I discovered that Gmail is identifying some nonspam as spam. It is unfortunate because recently I have been banishing detected spam without looking at it. Being able to do that is the goal of spam detection algorithms, of course. Sadly, my main Ohio State University e-mail account and the main e-mail address on my Roadrunner account collect most of the spam, and these are accounts I can't get rid of.

One other class of spam e-mail that got through had the Spanish title "CURSO COMPLETO DE ALEMÁN," which I think offers a complete course in German along with a bilingual dictionary. At least this one offered something of potential interest, though how I would use the product is difficult to see, given my poor knowledge of both German and Spanish.

Clearly, this morning's English-language spam that slipped through was generated by an algorithm that combined plausible garden-variety English-language names. One sender name combined an English first name ("Ethan") with an English rendering of a Chinese last name ("Chan"), but that too is plausibly American. The algorithm clearly used a thesaurus to combine "synonyms" of "weighty" or some other word in its semantic class (loosely defined) with the noun "message," as well as "synonyms" of "need/require" along with the infinitive phrase "to read." The problem is that some of these "synonyms" don't take infinitive phrases like this one.
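Here is a minimal sketch of how such a thesaurus-driven generator might work. This is my guess at the approach, not the spammer's actual code; the word lists are simply lifted from the subject lines quoted above, and the names `ADJECTIVES`, `NOUNS`, `VERBS`, and `subject_lines` are hypothetical.

```python
import itertools

# Hypothetical word lists, taken from the subject lines quoted in this post.
ADJECTIVES = ["Grand", "Essential", "Weighty", "Momentous"]
NOUNS = ["message", "note"]
VERBS = ["need", "have", "require", "must"]

def subject_lines(adjectives, nouns, verbs):
    """Yield every '<Adjective> <noun>. You <verb> to read.' combination."""
    for adj, noun, verb in itertools.product(adjectives, nouns, verbs):
        yield f"{adj} {noun}. You {verb} to read."

lines = list(subject_lines(ADJECTIVES, NOUNS, VERBS))
print(len(lines))  # 4 adjectives * 2 nouns * 4 verbs = 32 subject lines
print(lines[0])    # Grand message. You need to read.
```

Because the generator slots every "synonym" into the same frame without checking whether the verb can take an infinitive, it produces "You need to read" and "You must to read" with equal confidence.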

Why Gmail identified some of these messages as spam but not the rest, I don't know. The most amusing sender name of all was Mr. "Debt Help." This e-mail was generated by Blogger when Mr. or Mrs. or Ms. Debt Help commented on my "He's my Bitch" blog. When I went to look at that blog's comments, however, I couldn't find Mr. Debt Help's two comments. Perhaps the beta version of Blogger I am using has a nice spam filter. That would be a good thing, as Martha would say.

Identifying phony e-mails involves a language-based algorithm that checks out the sender name, the subject line, and the first sentence, at a bare minimum. That is why, I think, the first sentences of the spam mail were varied by the spammer. The same tools used to identify spam are also used in machine translation and involve the literal meanings of subject lines and the first line/sentence of the body. Unfortunately, Blogger does not have even a rudimentary grammar checker like the one in Microsoft Word, because if it did, it would detect the subject-verb agreement failures and the crazy infinitive constructions. An interesting failure of the Gmail spam filter is that it hasn't taken the hint that Spanish-language e-mails are unwanted. Perhaps it regards my insensitivity to Spanish-language e-mails as a moral failure on my part. I offer up this lightweight blog to let you know that as you slog through your e-mail trying to sort out the real messages from the spam, you are not alone.
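To show what I mean by catching the crazy infinitive constructions, here is a rough sketch of the simplest possible check. It is not how Gmail or Word works; it is a crude pattern match I am assuming for illustration. It relies on the fact that English modal verbs take a bare infinitive, so a modal followed directly by "to" is suspect. The "require to" rule and the names `BAD_INFINITIVE` and `has_bad_infinitive` are my own additions.

```python
import re

# Modal verbs take a bare infinitive, so "must to read" is ungrammatical.
# "require to" is added as a hypothetical extra rule for this spam run.
BAD_INFINITIVE = re.compile(
    r"\b(?:must|should|can|could|may|might|will|would|shall|require)\s+to\b",
    re.IGNORECASE,
)

def has_bad_infinitive(text):
    """Return True if a modal (or 'require') is directly followed by 'to'."""
    return BAD_INFINITIVE.search(text) is not None

for subject in [
    "Serious letter. You must to read.",
    "Significant letter. You need to read.",
    "You are required to read this.",
]:
    print(subject, "->", has_bad_infinitive(subject))
```

A check this simple will misfire on perfectly good English such as "the will to win," where "will" is a noun, so a real filter would need actual grammatical analysis and not just a word list.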



Anonymous said...

Recently, I've been receiving porn spam with a similarly generated structure. The titles usually run something like the following:

[adjective] [synonym for women] [sex act]

The end result reads like x-rated Mad Libs.

11:22 AM

The Language Guy said...


Unfortunately, Gmail hasn't learned from my identifying the type of spam messages I wrote about. I got two more early this afternoon. The language is getting worse. One Martina Pike sent me an email with the subject line "Grand note. You should to read." A third arrived as I was writing this.

1:58 PM

The Language Guy said...

The spam has morphed into a new pattern as of late this afternoon. The body of the messages is the same. The subject lines are

"Significant letter. You need to read."

"Serious letter. You must to read."

I love "You must to read." I bought a Yamaha 250 off a friend back in the late 60's. Its manual had the memorable sentence, "Tachometer tells the moment to do." Clearly Yamaha had a Japanese employee whose opinion of his/her English language skills was a bit inflated.

Gmail still has not cottoned on to this phenomenon. The spammers try to defeat anti-spam software by using different but very credible American names, varying the subject lines, and varying the first lines of the body of the message. This gives them an incredible number of variations. I will tell the Gmail people about their problem. They need to build in a grammar checker.

5:50 PM

Mark said...

I dearly love the language of some spammers. I got one a few years ago which opened with "Bernadine!", contained a link to a website, and was signed "Ginger Sutherland".

I would be very interested to find out how spam filters detect spam. I find this cross-over between maths and linguistics fascinating.

8:14 PM

Mark said...

PS: persevere with marking emails as spam. I had a similar problem where a few bits of spam were slipping through each day, but Gmail has learnt from my submissions.

It's possible that spammers have begun trying new tricks that Gmail hasn't yet encountered. The key is perseverance. If the number of spams being submitted manually rises enough, it will attract the attention of a human!

I would imagine, also, that Google processes emails salvaged from spam as a matter of urgency. It's annoying, but relatively harmless, if spams get through. It's clearly unacceptable if genuine emails are marked as spam.

8:23 PM

Ripple said...

It is refreshing to see you blogging more on linguistics than politics.

12:51 AM

The Language Guy said...

The last two were linguistic and most have a connection to language, however tenuous. However, we are in the political season and I am pissed and won't take it (the Republicans) any more. (Insert Smiley)

There was the stuff on the Olmec slab too, as well as the Brain Lady stuff. So it hasn't been politics 24/7/365.

The Supremes should offer up some juicy morsels after a time. Congress has done nothing so has done nothing stupid of linguistic interest. I have just gotten the new DirecTV DVR and will tape some of the current campaign commercials and have a go at them. They are usually deceptive linguistically, whether Democratic or Republican in origin.

7:58 AM

IbaDaiRon said...

LG, shouldn't that be 24/7/52? :)

I use a (client-side) program called SpamSieve (Mac-only, I think, but there should be Winapps using the same ideas) that relies on Bayesian filtering. Haven't tried to wrap my head around the math since all that matters is that (in combination with server-side filtering) it works REALLY well.

In identifying the spam, that is. I still have to glance through the spam mail folder to make sure there are no false positives; address book filtering makes sure nothing important from people I know gets zapped.
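
For anyone curious about the math, the core idea of Bayesian filtering can be sketched in a few lines. This is only a toy naive Bayes classifier under my own assumptions, not SpamSieve's actual method; the function names `train` and `spam_probability` and the training messages are made up for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per label from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Naive Bayes estimate of P(spam | words), with add-one smoothing."""
    vocab_size = len(set(counts[True]) | set(counts[False]))
    spam_words = sum(counts[True].values())
    ham_words = sum(counts[False].values())
    # Start from the prior odds of spam, then add evidence word by word.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        p_spam = (counts[True][word] + 1) / (spam_words + vocab_size)
        p_ham = (counts[False][word] + 1) / (ham_words + vocab_size)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Made-up training data, for illustration only.
training = [
    ("grand note you must to read", True),
    ("weighty message you require to read", True),
    ("lunch on friday", False),
    ("notes from the linguistics seminar", False),
]
counts, totals = train(training)
print(spam_probability("essential note you must to read", counts, totals))
print(spam_probability("seminar on friday", counts, totals))
```

The first message scores well above 0.5 and the second well below it, because the filter has learned which words turn up in each pile. Marking messages as spam or not spam amounts to adding rows to the training data.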

6:27 AM

MK. Lina said...

Well, I don't think Google is blocking spam efficiently.

Spam Filter Email!

2:13 PM

