Wednesday, March 28, 2007

Did you mean: correctly-spelled words?

I think one of the most fascinating things I've worked on since I started programming for Tribune Interactive is a search feature that most people take for granted nowadays. You might notice it when you type something into Google and get no results. Before, you would just have to scratch your head, wondering why you couldn't find any webpages. But now, search engines are so smart, they'll tell you what you actually meant to search for!

It's actually not as difficult as it sounds. For our purposes, there's a great tool out there called Aspell that will look words up in the dictionary, and come back with a list of spelling suggestions. Since we're developing in Rails, I use the Raspell gem (You still have to have Aspell there, the gem is nothing more than hooks to the C code. And you have to install your dictionaries if you want it to work. You could create your own dictionary, if you need to check against proper names not in the dictionary. I'll get to that later maybe...probably not, though).

There's a surprising depth when considering what people want to search for. Search engines are obviously big business, so it's worth quite a bit to making users happy. One report I read said that people misspell words in searches about 15-20% of the time. Google posted how many queries they had over a three-month period of users looking for Britney Spears, and how many of each spelling variation there were. Which raises a good point: "Britney" usually doesn't come up in the English dictionary. How then, does Google know that "brentley spears" is actually "Britney Spears"?

The answer: I have no idea. Google's quite a bit smarter than I am. But there are some strategies to start with.

First, it's important to recognize that the words are spelled incorrectly. Two of the most common spelling errors are switching two letters around, and leaving off the end letter. The Damerau-Levenshtein distance algorithm can tell you "how far" one word is from another, that is, how many changes you have to make in one word to turn it into the second. It's a pretty nice tool. "Wierd"? How about "weird". "Theif"? How about "thief".

While that's good for catching a lot of errors, there's still more. Consider "Caesar", one of the most popular historical figures in history, yet a man whose name is misspelled pretty often. "Ceasar" is a pretty common one. But if we run our distance algorithm on "Ceasar", could anything else come back? How about "Cesar"?

Both have an "edit-distance" of 1 when it comes to swapping out the word. It's tough to determine which word the user is actually looking for. If the search query is just a single word, you simply might have to go with nothing more than a good guess. Maybe more people search for "Caesar" than "Cesar", so we'll choose option 1.

But if the user is looking for something more specific, we might be able to point them in the right direction. If the entry is "ceasar salad", they probably want "caesar salad", and if the entry is "ceasar chavez", they probably want "cesar chavez". Now, if you're Google or Yahoo, and you have access to trillions of search queries, it's probably easy to figure out which phrase they mean. However, I am neither, but I still catch flack when somebody says "How come my misspelling didn't return the 'did you mean' that I wanted?" While the answer is probably that my suggestion is much "closer" to yours than the suggestion you were hoping for, the ultimate goal is that the user is happy, so I just grumble and say that I'll get to work on fixing it.

So where do I start? My thought is to create a probability model. For example, if I look at "ceasar", and see that both "caesar" and "cesar" are good matches, and the second word is "salad", I'll assign a high priority to "caesar" being before salad, and perhaps a lesser priority for it coming after, as people like to search for things out of order. Perhaps the same if they're searching for "augustus", "brutus", "crossing the rubicon", "Little Caesar's", etc. While this is probably rather inefficient for Google, it will suit my needs fine.

There's a lot more that you can do with this. If you're trying to set up a search engine on your website (and who isn't these days?), and you're worried about people leaving your site in disgust because they can't find what they're looking for, you could go get a PhD in lexical analysis. But if you're only kinda worried, or curious as to how Google knows what you're thinking, hopefully this will get you on your feet.

Hapi saercheeng!

No comments: