Thursday, March 04, 2010

Information Retrieval for Real-world Tasks

Paul Thomas (CSIRO and ANU) presented a seminar on "Information Retrieval for Real-world Tasks" at the ANU CSIT Seminar Room, N101, today. He argued that web search engines are historically related to the document search systems sponsored by the US DARPA (through TREC). The original task was typically to find a ranked list of documents relevant to a question, such as documents on smuggling plutonium out of the Soviet Union. There was an unintended pun in this, as Paul talked about these being "atomic" documents. He argued that returning a list of documents does not suit real-world tasks, such as choosing an espresso machine to buy. I was not convinced by the examples he gave, which showed Google products listing web pages about espresso machines. The Google products search returns a list of espresso machines, with the assumption that the first in the list is best (Paul missed another pun here by not bringing up the Atomic Coffee Machine). He then changed tack to show examples of searches for biomedical data, which identified specific items within documents.

It seemed to me that Paul was conflating two distinct topics: information retrieval and task support. Information retrieval can be used to support a task, such as selecting a coffee machine. But retrieving information about coffee machines is not the same as purchasing one. Real-world search engines, such as Google, use heuristics to shortcut this process. If people searching for coffee machines are really looking to buy one, then the search is modified to answer the question the user meant to ask, not the one they actually asked. This has proved lucrative for Google, as it results in people buying products and Google being paid for helping with that process.

Returning to the original example Paul used, of identifying plutonium smuggling, the real task is to detect and stop it, not just to find documents. What the users of Web2 War systems, such as US Army Knowledge Online (AKO), the US intelligence Intellipedia and the Tactical Ground Reporting System (TIGR), would ideally be directed to are not just historical documents, but live systems such as General Dynamics Mediaware's JPEG2000 for Wide Area Airborne Surveillance, with data from Predator UAVs. The system could then offer to issue the relevant tasking order to produce a kinetic response, in real time.

ps: One of the side tracks this seminar took was the origin of the 10-document goal of the TREC information retrieval tasks. One theory was that this was as many as could be displayed on an old green screen. My theory was that if more than that many were displayed, the user would have to take their socks off to count them. ;-)

