Monday, May 29, 2006

Data Mining

Back in the mid-1960s, a certain US agency wanted a company I worked for while attending graduate school to help them solve a huge problem. The agency was being overwhelmed by audio taped material from potential or actual enemies and wanted some way of analyzing the tapes so that they could, to use a more modern term, "mine" them for information. My first reaction was that we linguists had no solution to offer them.

The most obvious way to solve this problem was through some sort of key word search, assuming that the "bad guys" would tend to talk about different subjects when talking to each other than when talking to their wives and social friends. But this required speech recognition, and that was a pipe dream at the time. Today, speech recognition software works better than it did back then, but the problem, at least with home software like Dragon NaturallySpeaking, is that the software must be trained for each particular voice and one must articulate carefully and fairly slowly. Maybe spy agencies have better speech recognition software than is available to you and me, just as our spy satellites have better cameras than I do and their photo enhancement software is better than anything you or I can get. But I doubt that the technology has reached the point where speech recognition plus content analysis will help these spy agencies distinguish terrorists from others.

The need for some sort of serious content analysis can be illustrated by the following two sentences.
(1) That song is the bomb.
(2) Put a bomb on the first floor.
Clearly just finding the word "bomb" in a conversation isn't going to solve the terrorist identification problem. One is going to need more information from the conversation and something about who is doing the talking.
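To make the point concrete, here is a minimal sketch of a naive key word search run over the two sentences above. It assumes the speech has already been transcribed to text, which is the hard part, and the function name `flag_keyword` is invented for this illustration.

```python
import re

# The two example sentences from the post.
SENTENCES = [
    "That song is the bomb.",
    "Put a bomb on the first floor.",
]

def flag_keyword(sentences, keyword):
    """Return the sentences that contain `keyword` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Hypothetical usage: both sentences come back, the harmless one included.
flagged = flag_keyword(SENTENCES, "bomb")
print(flagged)
```

The search cannot tell sentence (1) from sentence (2); it flags both, which is why some further content analysis would be needed.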

Another approach to the problem bypasses the need for speech recognition and content analysis. My morning paper (see the title link) had an article on data mining by Brian Bernstein. A similar version of his article at the Seattle Times is more accessible than the one at the Columbus Dispatch.

Bernstein's article concerns the use of social network theory to try to figure out who may be linked up in a terrorist plot. This is a very different problem than the linguistic one and in one way is simpler -- the primary data consists of who calls whom, not what they say. But it presents problems of its own. The graphic that appears above illustrates the sort of structures of telephone calling the NSA hopes to find.

The NSA's problem is to identify first who calls whom and then to analyze the pattern of calling and from that to determine who is the hub of the enterprise, the boss terrorist. The question is how do patterns of calling among sets of terrorists differ from the patterns of other social organizations?
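As a rough illustration of what "finding the hub" could mean, the sketch below counts how many distinct people each person is in contact with and picks the one with the most. This is only the simplest possible measure, not a claim about what the NSA actually does; the call records and the names `contact_counts` and `likely_hub` are invented for the example.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs; invented for illustration.
CALLS = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "A"), ("E", "A"), ("C", "D"),
]

def contact_counts(calls):
    """Count the distinct people each person has called or been called by."""
    neighbors = defaultdict(set)
    for caller, callee in calls:
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)
    return {person: len(others) for person, others in neighbors.items()}

def likely_hub(calls):
    """Return the person with the most distinct contacts."""
    counts = contact_counts(calls)
    return max(counts, key=counts.get)

print(likely_hub(CALLS))
```

Note that this measure says nothing about what kind of organization the hub belongs to. Anyone who phones a lot of people to organize something would score just as high.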

My wife is in several tennis playing groups. In some groups, there is a team captain who calls the others to say when they need to show up at the court. Team members who are unable to play on a given day have to call the captain so she can go to the alternates. In other groups my wife is in, each team member is responsible for finding a substitute. In summers, the situation is much more fluid and there is less organization. I imagine it would be very hard to tell these tennis social networks from terrorist social networks simply on the basis of the pattern of the calls. I can't say that it would be impossible because I don't know enough about social network analysis to know just how difficult the problem is. But these tennis networks illustrate the point being made in the article that it seems to be necessary to have an entry point -- a known terrorist -- to work out which social networks are terrorist in nature and which are merely social. Once you have found him or her, you might employ social network analysis to separate those who are simply the entry terrorist's friends from those who are a part of any conspiracy he or she might be involved in.
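The entry point idea can be sketched as a breadth-first search outward from one known person, recording how many calls away everyone else is. This is a hedged toy example: the call records, the names in them, and the function `within_hops` are all made up for illustration.

```python
from collections import deque

# Hypothetical call records; every name here is invented for illustration.
CALL_LOG = [
    ("suspect", "friend"), ("suspect", "contact"),
    ("contact", "third"), ("third", "fourth"),
    ("x", "y"),
]

def within_hops(calls, start, max_hops):
    """Map each person reachable from `start` in at most `max_hops` calls to their distance."""
    neighbors = {}
    for a, b in calls:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for other in neighbors.get(person, ()):
            if other not in hops:
                hops[other] = hops[person] + 1
                queue.append(other)
    return hops

print(within_hops(CALL_LOG, "suspect", 2))
```

The search finds everyone within two calls of the starting person, but it treats the friend and the contact identically. Deciding who is merely a friend and who is a co-conspirator still takes more than the pattern of calls.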

This situation reminds me of the mathematics that is used on the CBS show "Numbers," in which a math genius helps out his FBI agent brother. Our math prof seems to know every application of mathematics to the real world, which is pretty implausible, but this is right up his alley. The NSA should have consulted him.

There has been quite a stir over the President's authorizing surveillance of us all to find out which of us are the bad guys. It is a perfect example of his lack of respect for American civil liberties -- we must invade the people's privacy in order to preserve it. However, I am inclined to think your secrets are safe from George given the difficulty posed by the problem the NSA faces. The more people whose communications are surveilled, the less the NSA will learn.

I expect this blog will get picked up by the NSA since the word "bomb" appears twice. So, if my blogs stop, you will know why.


Blogger Mister Pregunto said...

Oh smart, L_G! You just had to use the B-word, didn't you?

Do you realize that you are now going to be dragging down everybody who has ever been remotely associated with your social network?

[For myself, it's obvious I'm pretty much doomed since I have so many aliases, but at least I can be glad I'm not Hugh!]


12:32 AM

Blogger IbaDaiRon said...

Um, hello, NSA? Six-degrees, hello?

If the current bunch of bozos are the government we deserve, what does that say about us?

[Mr P, isn't that an old Alan Parsons Project song title or chorus? I wouldn't wanna be like Hugh...?]

11:40 AM

