Sunday, October 03, 2004

Some Notes for a Friend

Hey Paul:

I've been meaning to write you for quite a while. I've been thinking about how you guys have handled all the hurricanes. My parents' place down in Port Charlotte lost its roof and they've been down there trying to find contractors to fix it!

Your problem regarding sales material is an interesting one. It's basically the same one that we faced at i2 and I've had a bit of time to think about it over the past few years.

Basically, your client is looking for a library. Amazon works because it uses a lot of the infrastructure developed for libraries. Early on Bezos recognized just how well structured the descriptive information for books is (e.g., authors, titles, publishers, subject headings, ISBN numbers, etc.) and how easily this information could be exploited online for both e-commerce and recommender systems.

Unfortunately, sales materials just aren't well structured. When we're making PowerPoint slides or writing whitepapers we don't worry about standardized titles or author names. For the KM system at i2, we struggled with standardizing some pretty basic issues such as geographies (EMEA, Americas, etc.), sales territories, products, verticals, etc. Trying to append this sort of information to various documents was really tough! Amazon works because all of this standardization has been done by a government institution: the Library of Congress!

When cataloging materials (books or sales ephemera) to put into an "Information Retrieval" (IR) system (that's librarian talk for accessible database!) you basically have three options. The first option is to use the words in the document for indexing. Most of the major web search engines (e.g., Google) use this type of approach. Amazon's new search engine--A9--applies this approach to the text of books. The second approach is the one traditionally used by librarians and involves the creation of "bibliographic surrogates". Basically, think of cards in an old card catalogue. Each card is a bibliographic surrogate for a particular book. We now do this sort of cataloguing by creating a database with fields for things like author, title, etc. and a pointer to the location of the actual document. We tried to do this type of thing at i2 and it didn't work, primarily because we didn't have a librarian devoted to the creation of the records! The third approach involves the use of meta-data embedded in documents. Although we use key words and other meta-information in web pages, structured meta-data is typically a bit more involved. The current standard (supported by a number of office productivity applications) is called the Dublin Core. It was designed to act as a kind of hybrid between the really rigorous cataloging codes used for the creation of formal bibliographic surrogates--they all have very arcane names like the Anglo-American Cataloguing Rules, Second Edition (AACR2) and Machine Readable Cataloguing (MARC)--and the completely unstructured approach used by commercial IR systems.
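To make the "bibliographic surrogate" idea a bit more concrete, here is a minimal Python sketch of one such record. The field names title, creator, subject and identifier are borrowed from the Dublin Core element set. The class name, the sample values and the file path are made up for illustration, and this is not how we actually built anything at i2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Surrogate:
    """A catalogue-card-style record that stands in for one document."""
    title: str
    creator: str
    subject: List[str] = field(default_factory=list)
    identifier: str = ""  # pointer to where the actual document lives


def find_by_creator(catalogue, name):
    """Return every surrogate whose creator matches, ignoring case."""
    return [card for card in catalogue if card.creator.lower() == name.lower()]


# Hypothetical record and path, for illustration only.
card = Surrogate(
    title="Supply Chain Planner overview",
    creator="George Goodall",
    subject=["supply chain planning"],
    identifier="/sales/decks/scp-overview.ppt",
)
catalogue = [card]
matches = find_by_creator(catalogue, "george goodall")
```

The search runs against the records, not the documents themselves, which is why somebody has to sit down and create a record for every single document.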

Here's the rub. None of these approaches work well. The unstructured approach can work well depending on the technology (think Google) and the redundancy of the collection (think the Web) but can really suck for assisting users in exploring collections with some sort of inherent order (like a collection of sales material). Do you remember the search function on the homepage of the Intranet at i2? It would always cough up incredibly odd documents in response to queries. The problem with the bibliographic surrogate approach is that few organizations have the budget or demeanor to employ a cataloguer... especially a knowledgeable one. The problem with the meta-data approach is that few authors are motivated to include meta-data in their documents. A sales guy creates a PowerPoint presentation to sell to a particular prospect and get comped for it. Including meta-data isn't part of this formula!

While your problem is a challenging one, it's not an impossible one. Here's my suggestion: forget about the meta-data approach and forget about the bibliographic surrogate approach. Use a conventional IR tool to index the sales documents. Some people say that they need Google for their Intranet but Google probably won't work well because there's not enough redundancy in the collection. Instead you have to inject a bit of structure into the documents. The full-blown meta-data approach won't work but a hybrid one might. Your client could, for example, include a standard page at the back of their PowerPoint and whitepaper templates where authors can fill in limited information such as their name (actually their email address is better since every person has a unique email address), the name of their client, and the name of the products they're selling. If there are standardized product codes, even better. Now, at the back of a PowerPoint deck you could have the following lines:

"gg345g#client:taylor made"
"gg345g#product:supply chain planner"

When the IR tool indexes these documents it will create entries for each of these expressions. The "gg345g#" part is meaningless but serves the purpose of making each entry identifiable as a controlled field.
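Here is a rough Python sketch of what treating those lines as controlled fields could look like. It is a toy that assumes the text has already been extracted from each file; the function names and the sample file names are hypothetical, and a real IR tool will tokenize and index in its own way.

```python
from collections import defaultdict

PREFIX = "gg345g#"  # the meaningless marker from the template lines


def controlled_fields(text):
    """Pull (field, value) pairs out of lines like gg345g#client:taylor made."""
    fields = []
    for line in text.splitlines():
        line = line.strip().strip('"')
        if line.startswith(PREFIX) and ":" in line:
            name, value = line[len(PREFIX):].split(":", 1)
            fields.append((name.strip().lower(), value.strip().lower()))
    return fields


def build_index(docs):
    """Map each full tagged expression to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for name, value in controlled_fields(text):
            index[f"{PREFIX}{name}:{value}"].add(doc_id)
    return index


# Hypothetical documents, already reduced to plain text.
docs = {
    "deck1.ppt": "Quarterly pitch\n"
                 "gg345g#client:taylor made\n"
                 "gg345g#product:supply chain planner",
    "paper2.doc": "Whitepaper body text\n"
                  "gg345g#product:supply chain planner",
}
index = build_index(docs)
```

Looking up "gg345g#product:supply chain planner" in this index returns both sample documents, while ordinary body text containing the words "supply chain" never collides with the tagged entry.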

So now you've got this whole pile of mixed documents indexed by a standard IR tool (htdig is a good open source option). You can then create a user interface that creates the search string people are looking for. If somebody was looking for the documents written by George Goodall on Supply Chain Planner, the interface could build the search string: "'' 'gg345g#product:supply chain planner'".
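The front end only has to glue the prefix, the field name and the value together. A minimal Python sketch, assuming single-quoted phrases separated by spaces as in the string above: the "author" field name and the example.com address are placeholders I made up, and the exact phrase syntax will depend on whichever search tool you pick.

```python
PREFIX = "gg345g#"  # same marker the document templates use


def search_string(**fields):
    """Turn form values into quoted controlled-field phrases."""
    terms = []
    for name, value in fields.items():
        if value:  # skip form fields the user left blank
            terms.append(f"'{PREFIX}{name}:{value.strip().lower()}'")
    return " ".join(terms)


# Hypothetical form input: placeholder address, product from the example.
query = search_string(author="author@example.com",
                      product="Supply Chain Planner")
```

The user picks values from drop-downs and never sees the prefix, so you can change the marker later without retraining anybody.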

Well, that was a very long answer that may or may not have made sense. With the type of structure provided by either controlled data or bibliographic surrogates, you can then do the fancy stuff that Amazon is famous for like clustering items based on similarity.

The one feature you mentioned that I haven't touched on is the whole recommendation feature, i.e., "Customers that bought x, also bought..." This type of feature is commonly referred to as "collaborative filtering" and it's a hot topic. The process of collaborative filtering is fairly straightforward, if a bit algorithmically complicated. By allowing users to assign ratings to particular documents it exploits the vector space model commonly used by IR systems. This discussion can get boring very quickly so I'm going to give you a few links.
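If you want a feel for it before clicking through, here is a small Python sketch of one common flavour: item-based filtering with cosine similarity. Each document becomes a vector of the ratings users gave it, and documents whose vectors point the same way count as similar. The users, documents and ratings are invented, and production recommenders do a lot more than this.

```python
from math import sqrt

# Hypothetical ratings: user -> {document: rating from 1 to 5}
ratings = {
    "ann": {"deck_a": 5, "deck_b": 4, "paper_c": 1},
    "bob": {"deck_a": 4, "deck_b": 5},
    "cat": {"deck_a": 1, "paper_c": 5, "paper_d": 4},
}


def item_vector(item):
    """Represent a document as {user: rating} over the users who rated it."""
    return {user: r[item] for user, r in ratings.items() if item in r}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def similar_items(item, top_n=2):
    """Rank the other documents by similarity to the given one."""
    items = {i for r in ratings.values() for i in r}
    target = item_vector(item)
    scores = [(other, cosine(target, item_vector(other)))
              for other in items if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


recommendations = similar_items("deck_a")
```

With these made-up numbers deck_b comes out as the closest match to deck_a, because the two people who liked one also liked the other.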

A dated but good collection of links:
A public domain recommender tool:

In short, my suggestion is to use a standard search tool with modified document templates and a fancy front end for constructing search strings. This approach will get you the most bang for your buck. Of course, it all depends on what your client is looking for. Do they want something off the shelf? Do they want to roll their own? What sort of enterprise environment are they running? etc. As for whitepapers and analyst reports... ummm... I guess start with the standard vendors and go from there: Siebel, Salesforce, Net Perceptions maybe.

If you have any questions, please give me a shout. I'm home most days since I'm studying for a week-long set of exams that start Oct. 24.

I hope this helps.


