20 November, 2008

Spell Checking The Right Way

In my fun-as-a-rock database class I recently got an assignment to correct misspellings in a file full of city names.

Now, there are two ways to do spell checking: the Microsoft way, and the Google way. Care to guess which is the wrong way to do it? Yup, you got it: the Microsoft way sucks! OK, maybe it didn't suck back in the 17th century when Spanish monks were doing all the spell checking known to mankind (which I think consisted of 3 or 4 individuals who actually knew how to read, or cared about spelling for that matter).

So, if the Microsoft spell checker and the Google spell checker could talk, what would they say?

Microsoft would say: Listen buster! My dictionary contains all the correct words in the universe; either you comply or you don't. Got it?

Google would say: What do I know about spelling? I'm just trying to figure out a way to make more money from all this content I just indexed. Oh, and by the way, that word you just typed, it looks awfully close to this other word I see a lot in my index. Is that what you meant?

The problem with the Microsoft approach should be obvious, but it's important to point out that the Google approach is not without faults either.

The biggest problem with the Google approach is that to some extent it's a form of crowdsourcing. If your crowd can't spell, then you're toast.
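
To make that failure concrete, here's a tiny Python sketch. The data is made up for illustration (it's not from my assignment): if most of the crowd types the wrong spelling, a "most popular wins" rule crowns the misspelling.

```python
from collections import Counter

# Hypothetical input: the crowd misspells "Cincinnati" more often than not.
seen = ["Cincinatti", "Cincinatti", "Cincinatti", "Cincinnati"]

# Pick the most frequently seen spelling and treat it as "correct".
most_popular = Counter(seen).most_common(1)[0][0]
print(most_popular)  # the misspelling wins, 3 votes to 1
```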

Last, but not least, I'd just like to show you some pseudocode for how I implemented my spell checker:

  • Read all the city names in the file while keeping track of every variation we've seen and how many times we've seen it (in a hash, dictionary, etc). Take the most popular spelling for each city, and call that the correct spelling.
  • To correct word X, calculate its edit distance to all the correct spellings. Chances are word X is really the "correct spelling" it most resembles.
  • Figure out what to do if you've never seen X before.
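
For the curious, the steps above could be sketched in Python roughly like this. It's a minimal sketch, not my actual assignment code, and it leans on a few assumptions of my own: the function names and the `max_distance` cutoff are hypothetical, spellings are grouped into "the same city" simply by being within that cutoff of a more popular spelling, and a word that resembles nothing we know is returned unchanged (just one possible answer to the last bullet).

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def learn_correct_spellings(names, max_distance=2):
    """Step 1: count every spelling, then keep the most popular one per city.

    Assumption: a spelling within max_distance of an already accepted,
    more popular spelling is treated as a variation of that same city.
    """
    counts = Counter(names)
    correct = []
    # most_common() yields spellings from most to least frequent.
    for spelling, _ in counts.most_common():
        if all(edit_distance(spelling, c) > max_distance for c in correct):
            correct.append(spelling)
    return correct


def correct_word(word, correct, max_distance=2):
    """Steps 2 and 3: snap word to the closest known correct spelling."""
    best = min(correct, key=lambda c: edit_distance(word, c), default=None)
    if best is not None and edit_distance(word, best) <= max_distance:
        return best
    return word  # assumption: leave never-seen words alone


names = ["Chicago"] * 3 + ["Chicgo"] + ["Boston"] * 2 + ["Bostn"]
correct = learn_correct_spellings(names)
print(correct_word("Chcago", correct))
print(correct_word("Zanzibar", correct))
```

The cutoff matters: set it too high and two genuinely different cities with similar names get merged, set it too low and the sloppier typos go uncorrected.
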
And that concludes today's post. Now if I could just get Google to write grammatically correct sentences for me, I'd never have to worry about proofreading my posts ever again.

Disclaimer: I would just like it to be known that I'm in no way an MS hater; in fact, I'm somewhat of an MS fan. I'd also like it to be known that I'm not a Google fanboy; in fact, I'm a little afraid of them - they read my email, and I'm sure they're the new federal agency that's in charge of spying on citizens.
