Like most IR systems, search engines are document-focused animals. When I go searching for information using Google or AV, I generally get pages back; when I search in a bibliographic database, I get citations; when I search the OPAC, I get bibliographic information. Unfortunately, many of our information sources aren't so discrete.
The other day I was searching the digital archives of the New York Times. ProQuest had conveniently broken the entire paper up into individual articles and provided them as PDF files. Newspapers, however, don't work this way. When looking at the front page I'm interested in the entire gestalt: the articles, the headlines, the sidebars, and the pull quotes. What's far more interesting than the individual articles is the proximity of their headlines. A weekend paper with 12 related articles crammed onto the front page would likely be a far better information source than a paper with 13 related articles spread over 10 sections.
To this end, Salton's concept of passage searching seems quite intriguing. While Salton suggests ignoring the bounds of a document by constructing weighting vectors on individual paragraphs, I suggest ignoring the bounds of a document by constructing vectors based on page proximity. With the New York Times we should forget about the concept of individual articles and index everything as 3-word n-grams as things appear on a page. Forget punctuation and gutters. Focus on the words.
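The idea above could be sketched roughly like this: treat the whole page as one stream of words and count overlapping 3-word n-grams, so trigrams freely cross article boundaries. The tokenization rule and function name here are my own assumptions, not anything from Salton or ProQuest.

```python
import re
from collections import Counter

def page_trigrams(page_text):
    """Index a page as overlapping 3-word n-grams, ignoring
    punctuation and article boundaries. A minimal sketch: the
    whole page is one word stream, so trigrams span headlines,
    sidebars, and adjacent articles alike."""
    # Keep only word characters; drop punctuation and case.
    words = re.findall(r"[a-z']+", page_text.lower())
    return Counter(tuple(words[i:i + 3]) for i in range(len(words) - 2))

# Two hypothetical "articles" sharing a front page become one stream,
# so the trigram ('sharply', 'investors', 'fled') crosses the gutter.
page = "MARKETS FALL SHARPLY. Investors fled stocks. FED HOLDS RATES steady."
index = page_trigrams(page)
```

Proximity on the page then falls out for free: related headlines that sit near each other contribute overlapping trigrams, which a vector built over these counts would reflect.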
It seems that many of our information sources are beginning to ignore the principles of the document. Are the individual documents of a discussion group important or is the entire thread important? What about the files on an SMTP server? How about a random walk of blogs? Should the blogs be counted as individual documents or should the link path of the walk count?
That said, I'm still unsure of a few comments. Clarke et al., for example, talk of "shallow (finite-state) patterns" to support a "context-free grammar and parser." What are they talking about? Another interesting concept is part-of-speech parsing. How is this supposed to work? What exactly is WordNet and why is it so revered?