
Knowledge Management vocabularies for tacit information processing; creation issues, scalability and auto-categorization September 14, 2006

Posted by Cyril Brookes in General, Tacit (soft) information for BI, Taxonomies, Tags, Corporate Vocabularies.

My August 5 post introduced the topic of KM vocabularies and their essential role in building a balanced BI reporting environment – one that delivers both hard and tacit (soft) information. This post offers more detail on vocabulary construction and practical use. These guidelines and suggestions are based on my experience building hundreds of vocabularies in KM implementations.

The vocabulary is important since it controls the categorization, retrieval and dissemination of documents. Without it there is little prospect of meaningful collaboration on important issues in the enterprise.

Creation Issues:

Automatic creation of a prototype vocabulary, using widely available text analysis systems, is a common starting point. These systems process the content of many documents and compile a categorization list based on keywords or on more sophisticated contextual analysis.

These automatic systems have varying degrees of success in creating the hierarchies a KM vocabulary requires, in identifying synonyms and, especially, in determining context. An example is differentiating between alternative meanings of “heat”:

  • A batch of steel being produced;
  • The agent increasing temperature; and
  • Something female dogs exhibit.
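
To make the context problem concrete, here is a minimal sketch of one simple way a text analysis system might choose between the senses of “heat”: counting cue words that appear near the term. The sense labels, the cue word lists and the function name `disambiguate` are all hypothetical, invented for illustration; commercial systems use far richer contextual analysis.

```python
import re

# Hypothetical cue words for each sense of "heat"; a real system would
# curate or learn these for the enterprise concerned.
SENSE_CUES = {
    "Heat (steel production batch)": {"steel", "furnace", "ladle", "tapped", "melt"},
    "Heat (thermal energy)": {"temperature", "thermal", "boiler", "cooling", "degrees"},
    "Heat (canine estrus)": {"dog", "bitch", "breeding", "kennel", "litter"},
}


def disambiguate(text, senses=SENSE_CUES):
    """Return the sense whose cue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no cue word present the context is unknown, so do not guess.
    return best if scores[best] > 0 else None
```

A document mentioning a furnace would resolve to the steel sense, while text with no cue words returns `None`, which is the case where manual review is still needed.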

Orphan topics also introduce problems for indexation, classification and subsequent retrieval. Orphans are terms that are unrelated to others in a hierarchy; that is, they have no parents or children. They ought to be avoided.

As highlighted in my earlier post, synonym processing within the vocabulary is not desirable. All terms in the vocabulary should be “preferred” terms, and should become the universal identifiers for the subject matter of documents, messages, etc. Synonyms should be handled in the auto-classification stage, where the various common usage terms are converted to the preferred term for use in retrieval, etc.

Automatic analysis of newly arrived documents (after the initial vocabulary is created), without reprocessing and reorganizing the entire collection of documents, may be difficult. So, the automatic procedure for building a vocabulary will often be useful only once, at the start of the exercise. Thereafter, updates will most probably have to be manual.

To facilitate vocabulary navigation it is necessary to embed higher level parent terms in the hierarchy. Manual editing is required here since the automatic process will not do this and, in any case, the vernacular appropriate to the enterprise ought to be used, not some standard industry or linguistic term.
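
The orphan and synonym points above can be sketched in a few lines. This is only an illustration under stated assumptions: the vocabulary is held as a simple term-to-parent mapping, and the terms, the synonym table and the function names `find_orphans` and `to_preferred` are hypothetical.

```python
# Hypothetical vocabulary fragment: each preferred term maps to its parent
# (None means the term has no parent).
PARENT = {
    "Customer": None,
    "Problem customer": "Customer",
    "Acme Corp": "Problem customer",
    "Competitor": None,
    "Globex": "Competitor",
    "Weather": None,  # no parent and no children: an orphan
}

# Hypothetical common-usage variants. They live outside the vocabulary and
# are resolved to preferred terms at classification time.
SYNONYMS = {
    "acme": "Acme Corp",
    "acme corporation": "Acme Corp",
    "rival": "Competitor",
}


def find_orphans(parent):
    """Return the terms that have neither a parent nor any children."""
    has_children = {p for p in parent.values() if p is not None}
    return sorted(t for t, p in parent.items() if p is None and t not in has_children)


def to_preferred(phrase, parent=PARENT, synonyms=SYNONYMS):
    """Map a phrase to its preferred term, or None if it is not recognised."""
    key = phrase.strip().lower()
    if key in synonyms:
        return synonyms[key]
    for term in parent:
        if term.lower() == key:
            return term
    return None
```

Running `find_orphans(PARENT)` flags "Weather" for manual attention, and `to_preferred("ACME Corporation")` yields the single identifier "Acme Corp" that retrieval then relies on.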

A general purpose vocabulary may be a useful testing platform, particularly if the information sources are news and other external sources. Internal documents tend to require a more enterprise and industry specific topic list. Similarly, a test platform can be built easily from an industry oriented set of KPIs, metrics or measures. These will be a subset of a complete vocabulary.

This is often the most effective starting point for creating an enterprise specific vocabulary and the associated rules for auto-classification.

All business oriented KM vocabularies are organic and will, therefore, evolve as the business interests and issues change. Normally, in a well designed vocabulary, these changes only involve the third and fourth levels of the hierarchy (new customers, competitors, mergers with same, new products, technologies, etc.), and occasionally the first and second (a takeover creates a new business segment, a new class of customers is created to assist vocabulary navigation, etc.). Evolution is normally best achieved by manual adjustment, with suggestions being made by users as they encounter inadequate terminology or rules.

Templates:

Notwithstanding the industry focus on automatic construction of vocabularies by analysis of a pool of documents, my preference is to build a set of industry template vocabularies, and to create new versions by modifying earlier ones. This is because the terminology relevant to KM is very similar across organizations in the same industry, and similar, especially at the higher levels, across all businesses with similar operations, irrespective of industry.

Therefore, banks will have almost identical vocabularies, except at the third and fourth levels of detail, and an insurance company vocabulary will be quite similar to that of a bank.
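
As a rough sketch of the template idea, the following keeps the upper two levels of an existing vocabulary and attaches a new enterprise's own lower-level terms. The template contents, the company names and the helper names `level` and `derive_vocabulary` are hypothetical, and the fixed cut at level two is an assumption made for the example, not a rule.

```python
# Hypothetical banking template: term -> parent (None marks a top-level term).
BANK_TEMPLATE = {
    "Customer": None,
    "Corporate customer": "Customer",
    "First National Widgets": "Corporate customer",  # level 3, enterprise specific
    "Competitor": None,
    "Retail bank competitor": "Competitor",
    "Rival Bank A": "Retail bank competitor",  # level 3, enterprise specific
}


def level(term, parent):
    """Depth of a term in the hierarchy; top-level terms are level 1."""
    depth = 1
    while parent[term] is not None:
        term = parent[term]
        depth += 1
    return depth


def derive_vocabulary(template, additions, keep_levels=2):
    """Keep the template's upper levels and attach enterprise-specific terms.

    `additions` maps new term -> parent and must list parents before children.
    """
    derived = {t: p for t, p in template.items() if level(t, template) <= keep_levels}
    for term, parent_term in additions.items():
        if parent_term not in derived:
            raise ValueError(f"unknown parent {parent_term!r} for {term!r}")
        derived[term] = parent_term
    return derived
```

A second bank would call `derive_vocabulary(BANK_TEMPLATE, {...})` with its own customers and competitors, reusing the shared upper levels unchanged.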

Issues of scalability:

400 to 2,000 topics is the common size range for useful BI vocabularies. Fewer than 400 terms is unlikely to provide sufficient granularity in categorization to satisfy inquiries. More than 2,000 terms will compromise navigation because of the added complexity.

Multiple vocabularies, one for each different community of interest – e.g. marketing, research, executives, etc. – are often required for large businesses. Multiple vocabularies covering similar subjects, but in different languages, are common in large corporations. If multiple vocabularies are used, the BI system needs to support cross-community browsing and alerting, with exchange of relevant documents and collaboration.


Vocabularies of a useful scale almost certainly require auto-classification, since it is not practical to allocate categories manually for an individual corporation.

Auto-classification means the mechanical assignment of vocabulary terms to documents, messages, news items, reports, etc. when they become accessible to the enterprise network. It involves matching new documents to the appropriate preferred topics in the vocabulary using classification rules or inference techniques.
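
A minimal sketch of the rule-based variant follows, assuming the simplest possible rule form: a preferred term is assigned when one of its trigger phrases appears in the text. The rule table and the function name `classify` are hypothetical; production systems typically support richer rules or inference techniques.

```python
import re

# Hypothetical classification rules: preferred term -> trigger phrases.
RULES = {
    "Acme Corp": ["acme corp", "acme corporation"],
    "Globex": ["globex"],
    "Mortgage product": ["home loan", "mortgage"],
}


def classify(text, rules=RULES):
    """Return the sorted preferred terms whose trigger phrases occur in text."""
    # Lower-case and collapse punctuation so "home-loan" matches "home loan".
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    matched = set()
    for term, phrases in rules.items():
        # Pad with spaces so only whole words and phrases match.
        if any(f" {phrase} " in padded for phrase in phrases):
            matched.add(term)
    return sorted(matched)
```

Note that the trigger phrases double as the synonym handling: several common usage variants all resolve to one preferred term.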

Rules tend to be set for the narrower topics. Therefore, the selected terms are narrow concepts – such as a customer name, supplier, competitor, or a product or service. The higher level, parent terms, such as customer, problem customer, etc. are then added using inheritance provisions. Normally this is done as soon as a document comes within the scope of the KM system. Reclassification is required whenever the vocabulary changes significantly.
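
The inheritance step can be sketched as a walk up the hierarchy from each narrow term that the rules selected. The term-to-parent mapping and the function name `with_ancestors` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical vocabulary fragment: term -> parent (None marks a top-level term).
PARENT = {
    "Customer": None,
    "Problem customer": "Customer",
    "Acme Corp": "Problem customer",
    "Product": None,
    "Mortgage product": "Product",
}


def with_ancestors(terms, parent=PARENT):
    """Expand narrow terms with all of their ancestors, without duplicates."""
    result = []
    for term in terms:
        current = term
        # Stop at the top of the hierarchy, or at a term already recorded
        # (its ancestors were added at the same time).
        while current is not None and current not in result:
            result.append(current)
            current = parent.get(current)
    return result
```

A document classified under "Acme Corp" would therefore also be retrievable under "Problem customer" and "Customer". Because the parent terms are derived from the hierarchy rather than stored in the rules, a significant change to the vocabulary means the expansion has to be run again over existing documents.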

If you are interested in more detail on merging hard and tacit information, you can see some examples of KPI templates, a subset of a complete business vocabulary, in the download for my BI Pathfinder project at www.bipathfinder.com