Internet Problem: Building a better search engine
The current leader in Internet search is undoubtedly Google. Pretty much anyone else who wants to enter the search engine market is trying to figure out a way to beat Google. The problem with trying to beat Google is that no one really knows what it is that makes Google the best search engine. Maybe it’s the algorithm, or maybe it’s the content that people really want? No one knows for sure.
Does Google give better search results? “Better” is a relative term that only has meaning when used in relation to something else. What are Google’s search results better than? Yahoo!? Ask? Calling up mom and asking what she thinks? When trying to create something that is better than something else, you need to have a measuring stick to know how close or far away you are. It seems like most people magically deem Google’s search results the best without being able to explain what makes them so. If all the other search engines are trying to have their results match Google’s, then all anyone is doing is trying to recreate what Google does right now, and that isn’t building a better search engine. So the question becomes this: how do you know if your results are good, let alone better than Google’s?
To start, you need to know what determines success for a search engine user. Is it that the first result is the one they’re looking for? Or maybe it’s a success so long as the link the person is looking for is on the first page of results? Let’s assume the latter for simplicity. The next problem is figuring out what someone is really interested in when they type in a specific query.
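To make the first-page assumption concrete, here is a minimal sketch in Python of what that measuring stick might look like. Everything in it is an assumption for illustration: the page size of ten, the function names `is_success` and `success_rate`, and above all the idea that we already know which link each searcher wanted. That last part is exactly what is hard to come by in practice.

```python
from typing import Sequence, Tuple

# Assumption: a results page shows ten links. The article never gives a number.
RESULTS_PER_PAGE = 10


def is_success(ranked_urls: Sequence[str], wanted_url: str,
               page_size: int = RESULTS_PER_PAGE) -> bool:
    """Return True if the wanted link appears anywhere on the first page."""
    return wanted_url in ranked_urls[:page_size]


def success_rate(trials: Sequence[Tuple[Sequence[str], str]],
                 page_size: int = RESULTS_PER_PAGE) -> float:
    """Fraction of (ranked_urls, wanted_url) trials that count as a success."""
    if not trials:
        return 0.0
    hits = sum(is_success(ranked, wanted, page_size) for ranked, wanted in trials)
    return hits / len(trials)


# Hypothetical data: two searches, one where the wanted link was returned
# on the first page and one where it was not returned at all.
trials = [
    (["https://example.com/a", "https://example.com/b"], "https://example.com/b"),
    (["https://example.com/a"], "https://example.com/z"),
]
rate = success_rate(trials)
```

The arithmetic is trivial. The sketch only works because each trial arrives with the wanted link already filled in, and nothing in a real query tells you what that link is.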
Most search queries are pretty short, probably a handful of words (including common ones that are ignored like “the”). There was a time when it looked like search engines would move in the direction of natural language processors where you’d ask what you wanted (hey, isn’t that what Ask started out as???), but it turns out that it’s harder to form reasonable questions than it is to type in a few words. For instance, if I want the score of the Patriots game, I’ll type “Patriots score”. If I had to type a full question, it would be something like, “what was the score of the Patriots game?” It actually took me about five minutes to figure that out, as opposed to the two-word version, which I thought of quickly and easily. It’s too bad because forcing people to type in questions like that would actually make coming up with relevant search results easier. But I digress.
If I search for “New England Patriots”, what is the best result for that? Could it be the Patriots web site? This is pretty easy since the New England Patriots are a specific organization with a unique name that happens to have a web site. What about a search for “apple”? All of a sudden, things aren’t so easy. I could be interested in buying an iPod or an iMac, but I could also be interested in learning about the fruit. And for both of these searches, how relevant is the Wikipedia article?
Building a better search engine is a daunting task because there really is no measuring stick. Most tasks in computing involve starting out with a set of inputs and a known set of desired outputs and then trying to figure out the most efficient way to process the former to achieve the latter. If you’re trying to build a better search engine, what is your desired output? What are you measuring against? How do you know that you’ve succeeded? I don’t know the answer; I just know that the question exists and is being pondered by many people who are much smarter than I am.
Disclaimer: Any viewpoints and opinions expressed in this article are those of Nicholas C. Zakas and do not, in any way, reflect those of my employer, my colleagues, Wrox Publishing, O'Reilly Publishing, or anyone else. I speak only for myself, not for them.