Sunday, June 01, 2008

Answering Some Comments on the Parse Post

Digger Jones writes:
Ha! That would most likely be yours truly. While I still use XP a lot, I've found Linux to be infinitely better, safer and more stable as a browsing platform.

You don't have to convince me of that. From 1996 to 2005, I ran a dual-boot desktop at home that hosted various Linux distributions, including Slackware, SuSE, and Red Hat. My work laptop was a dual-boot Red Hat and Windows XP machine.

My work laptop, where I do most of my surfing now, runs Mac OS X, but I do all my development in two Linux virtual machines that host the web services that power our blogging app. Unfortunately, those VMs are configured as servers, so they don't have X11 or any GUI apps like Firefox. The only way to interact with them is through SSH.

sixdegrees writes:
So here is a bigger question. Is there something fundamentally different about the way a technically trained human brain will approach a specific problem vs. the way a human-constructed algorithm does? In your case, if the algorithm had given due weight to the terms in the parentheses as opposed to just looking for OS2 anywhere in the string, the algorithm probably would have gotten the correct answer. Was this just a question of speed vs. accuracy in writing the algorithm, or just laziness on the part of the person who wrote that Sitemeter algorithm?

I suspect it was a combination of two things, which I believe are at the root of most software bugs:

  1. An incomplete understanding of the problem being solved by the algorithm.

  2. Time constraints and larger priorities that deemphasized the creation of a robust solution.


Some aspects of the web are well defined by published specifications, and some aren't. And then there are things that do get specified sooner or later, but not every implementer adheres to the standard. The User-Agent field of the HTTP request header falls into this last category.

People's efforts to get a handle on the real state of affairs on the web have been published only in bits and pieces, usually as web pages or blog posts.

Not all software developers will do the depth of research needed to understand the problem well, so they wind up coding a naive solution that works in most cases but misses the edge cases. Sometimes this is laziness, but in other cases the developer has so much on their plate that they cannot meet the overall goals of the project without making some tradeoffs.
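
To make that concrete, here is a minimal Python sketch of the difference sixdegrees is pointing at. The User-Agent string and the token rules are invented for illustration; this is not Sitemeter's actual code.

    import re

    # A hypothetical User-Agent string: "OS2" appears in an extension token,
    # but the parenthesized comment says the platform is really Windows.
    ua = "Mozilla/5.0 (Windows NT 5.1; en-US) SomeToolbar/2.0 OS2Bridge/1.1"

    # Naive check: look for the substring anywhere in the header.
    naive_says_os2 = "OS2" in ua  # True -- misclassifies this Windows browser

    # More careful check: weight the tokens inside the first parenthesized
    # comment, which is where the platform conventionally lives.
    match = re.search(r"\(([^)]*)\)", ua)
    platform_tokens = [t.strip() for t in match.group(1).split(";")] if match else []
    careful_says_os2 = any(t.startswith(("OS/2", "OS2")) for t in platform_tokens)

    print(naive_says_os2, careful_says_os2)  # True False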

I ran into this issue recently. With the rise of blogs came numerous "ping" services that accept notification messages when a blog is updated. Some of these services, like Google Blog Search and Technorati, use the pings to keep their search indices fresh. The granddaddy of them all, weblogs.com, simply publishes rolling lists of updated blogs.

Being the first of these services, weblogs.com ended up defining two different ways to access its ping server. One uses XML-RPC, a way of invoking functions remotely via an HTTP POST with an XML message in the request body. The other is REST-style, which simply submits the parameters through an HTTP GET (the data goes in the URL) or a POST (the data goes in the request body).
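
Here is a minimal Python sketch of the two styles, assuming the historically documented weblogs.com endpoints (rpc.weblogs.com/RPC2 for XML-RPC, rpc.weblogs.com/pingSiteForm for the REST form) and the standard weblogUpdates.ping call; other services may use different URLs and parameter names.

    import xmlrpc.client
    from urllib.parse import urlencode
    from urllib.request import urlopen

    BLOG_NAME = "My Example Blog"
    BLOG_URL = "http://blog.example.com/"

    # XML-RPC style: the ping is a remote function call, carried as an XML
    # document in the body of an HTTP POST.
    server = xmlrpc.client.ServerProxy("http://rpc.weblogs.com/RPC2")
    result = server.weblogUpdates.ping(BLOG_NAME, BLOG_URL)
    print(result)

    # REST style: the same two values travel as ordinary query parameters on
    # an HTTP GET (they could just as easily be form-encoded in a POST).
    query = urlencode({"name": BLOG_NAME, "url": BLOG_URL})
    response = urlopen("http://rpc.weblogs.com/pingSiteForm?" + query)
    print(response.read()[:200])

The interesting part is that both requests carry the same two pieces of information; only the envelope differs.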

One of my tasks was not only understanding the original specs, but also looking at how other services implemented them and whether they introduced any variations, so that my code would be forgiving of such deviations.

When I was doing the research, I looked at how other blogging services supported this kind of configuration, and I didn't see the level of flexibility needed to support multiple protocols. So I made sure my design handled that well, because it would make our product more valuable to our customers.
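
As a rough illustration of what I mean by that flexibility (the target list, endpoints, and field names here are hypothetical, not our product's actual configuration), each ping target can declare which protocol it speaks, so adding a new service becomes a configuration change rather than a code change:

    import xmlrpc.client
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Hypothetical ping-target configuration; endpoints are illustrative.
    PING_TARGETS = [
        {"name": "weblogs.com", "protocol": "xmlrpc",
         "endpoint": "http://rpc.weblogs.com/RPC2"},
        {"name": "Google Blog Search", "protocol": "rest",
         "endpoint": "http://blogsearch.google.com/ping"},
    ]

    def notify_all(blog_name: str, blog_url: str) -> None:
        """Ping every configured target using whichever protocol it expects."""
        for target in PING_TARGETS:
            if target["protocol"] == "xmlrpc":
                proxy = xmlrpc.client.ServerProxy(target["endpoint"])
                proxy.weblogUpdates.ping(blog_name, blog_url)
            else:
                query = urlencode({"name": blog_name, "url": blog_url})
                urlopen(target["endpoint"] + "?" + query)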