Text Mining

Automated Classification of Text

Classification of text records is an example of supervised learning that seeks to build a probabilistic model of a function that maps documents to classifications. In supervised learning of text, where an entire record represents one example to be classified, a learning algorithm is presented with a set of already classified, or labeled, examples. This set is called the training set. A number of classified records from the training set are removed prior to model building to be used for testing the model’s performance. This set is known as the testing set.

To better measure the classification accuracy of a model, often several models are built from different partitions of the examples to training and testing sets. The classification error is then averaged over each model. This process is called n-fold cross validation where “n” is the number of times the example set is partitioned. For example, producing ten models for evaluation using this process would be an example of 10-fold cross validation.

Once a model has been constructed, it can be used to predict the classification of unlabeled records . The accuracy of such models is largely dependent on the “representativeness” of the training data with respect to newly acquired data to be classified. In general, the more representative the training data, the better the performance. A larger number of training examples is often better, because a larger sample is likely to be more reflective of the actual distribution of the data as a whole.

Text Transformation

Classification machine learning algorithms operate on numerical quantities as inputs. A labeled example is often a vector of numeric attribute values with one or more attached labels. For some methods, like Naïve Bayes, integer counts of the attribute values are all that is required. Attributes in these cases can be nominal types but they are still ultimately converted to numeric counts before processing.

The conversion of text records to vectors of numeric attributes is a multi-staged process of feature construction, which employs several key transformations. There are many approaches to this process – one of the most common is to treat each text record as a “bag of words.” Each unique word constitutes an attribute (a position in our example vector). The number of occurrences of a word in a record (frequency of occurrence) is the attribute’s value (in our vector) for that document. Records are therefore represented as vectors of numeric attributes where each attribute value is the frequency of occurrence of a distinct term. This set of document vectors is often referred to as a vector space. Algorithms that operate on such representations are said to be using vector space models of the data.

For even modest sized text collections the number of distinct terms across the entire collection can be quite large – often in the tens of thousands. For any single record, however, the number of distinct terms is often quite small in comparison. As a result, the majority of attribute values in a record’s vector representation will be set to zero (conceptually empty) because those terms do not appear in the record. A table of vectors whose attribute values are predominantly empty is commonly referred to as sparse. Because of the nature of sparse tables a special table implementation is used that only stores the non-empty data values thereby saving a large amount of memory space and greatly reducing the run-time for many algorithms that access the data. In most instances, if this special implementation were not used, the table would be too large to be stored in conventional memory.

Often, simply extracting out all unique terms, or tokens, is not the most effective way to acquire attributes for text mining. Many tokens may not be words at all — possibly constituting noise in the process that should be filtered out. Furthermore, some types of words, or series of words, may be preferable for learning. For example, often nouns or noun phrases are preferred. Part-of-speech identification algorithms and lexical/semantic dictionaries are typically used to provide additional information about terms. Also, very common words like “and” and “the” are often filtered out to improve performance. All of these transformations are performed before any learning takes place. The number of steps involved in this pre-processing can be quite numerous and often constitute the bulk of the overall model building process.

Transformation Steps

Sentence Splitting

Identifying sentence boundaries in a document is not as trivial a process as it may seem. SEASR has components that achieve sentence splitting either using rules or statistical models (or both). Once sentences are identified they are recorded as annotations in their own annotation set.


Tokenization, simply put, is basically labeling individual words or sometimes word parts. This is important because many down stream components need the tokens to be clearly identified of analysis. Tokens are recorded as annotations in their own annotation set.

Part-of-Speech (POS) Tagging

Such components typically assign a POS tag to a token (the Penn Treebank project has provided a set of codes for this purpose that is widely used). Other data such as lemma, lexemes, and synonyms (to name a very few) may also be identified at this stage. POS information is stored as features of the token annotation.

Stop Word Filtering

Very common words like “and” and “the” are often filtered out to improve performance. This process is called stop word removal. SEASR has a components to perform this process. One basic approach is to remove all words that appear on a list of common words. Another approach is to remove words that occur with high frequency across most documents — these types of terms create “noise” that makes text records less distinguishable. The stop word filters will remove token annotations from a documents token annotation set or they can also mark such annotations as “stopped.”

POS Filtering

This component reads a document object as input and filters the tokens for that document based on part of speech tag information. A document object is taken as input. The token list is retrieved from the document and only those tokens with part of speech tags that match at least one value in the selected tag list are retained. The filtered list of tokens is placed into the document (replacing the old list) and the document is output. In the SEASR component the PoS Tags is a comma-delimited list of the part-of-speech tags that we want to retain. Tokens that do not possess one of these values will be removed from the annotation list or flagged as “stopped”.

Some components use regular expressions of POS tags to identify “chunks” of tokens (perhaps all noun phrases for example). This process is sometimes called chunking.

Unsupervised Text Mining: Clustering

Unsupervised approaches do not rely on a training set of labeled examples for building a model. Instead, these approaches use a similarity or difference heuristic that is a function of only the text records’ content. The heuristic function is used to construct classifications of clusters that are said to be emergent from the data.

Cluster formation techniques vary depending on the type and number of cases being analyzed. One of the most prevalent approaches uses a group of methods known as hierarchical clustering algorithms. In the most basic applications, these methods form a tree of classifications whose root is the set of N objects and whose leaves are the individual objects. The intervening levels of the tree define various hierarchical partitions of the set of N objects. The classification process can be a top-down, “divide and conquer” approach or, alternatively, a bottom-up, agglomerative process. The heuristic for grouping (or partitioning) is based on some similar function applied to the values of the p variables for each individual object.

Hierarchical Agglomerative Clustering (HAC)

The type of cluster analysis that is employed in SEASR for defining text concepts is a hierarchical agglomerative (bottom-up) technique (HAC) that models individual text items as points in vector space. This vector space is sometimes referred to as term space because each unique term that appears in any text document defines a dimension in this space. Not all words in a document are considered terms. Terms are primarily nouns and/or noun phrases. Common words that have little semantic content, such as prepositions and conjunctions, are routinely discarded, as are most verbs. For each term, a scalar weight value is computed as the normalized frequency of occurrence of that term. Terms are further weighted based on their uniqueness for that document using the following formula:

term weight=normalized term freq*log (inverse document freq).

The vector of term weights for individual documents defines the point in term space. The measure of similarity between two documents is therefore the Euclidean distance between their respective representative points in space. The validity of this measure of “similarity” hypothesizes that like documents share many of the same terms.

The agglomerative process will begin by placing each individual response in its own cluster and computing the similarity values for all possible cluster pairings. The two most similar clusters will then be combined and the process repeated until either a finite number of clusters are formed or the similarity between any two clusters does not rise above some threshold value.

The similarity between two items is not the same as the similarity between two clusters of items. In fact, there are several ways of computing the similarity between two clusters. One can measure the distance between the members of each cluster that are the closest (since link) or farthest (complete link) apart. We will calculate the group average similarity between all pairs of items in the two groups which recent research suggests is a more accurate approach.

This is how clusters are formed using HAC components. The cluster hierarchy is often represented as a binary tree that is called a dendrogram. SEASR provides a dendrogram visualization component for cluster analysis.

Text Information Retrieval

Information Retrieval (IR) is a classic field that pre-dates the computer. A particular field in IR is the problem of search. Google and Yahoo! are just some of the companies trying to tackle this problem. The problem is conceptually simple: given a collection of documents and a user query, find a subset in the collection that is relevant to the user query. The problem sounds easy but there are many subtleties. Both the documents and the query are often ambiguous and lack well-defined semantics. Unlike the rigid structures found in traditional relational databases, text has barely any structure. A search for “football” is ambiguous across different cultures. Does the user actually mean “soccer?” Furthermore, what if the word “football” never actually appears in a document that is about football? Should the article be discarded as being non-relevant? The judgment for relevance is even hard for human beings. Different backgrounds can result in vastly different judgments.

The presentation of the results is another problem as well. Studies have shown that most users prefer a ranked list of the relevant documents with the more relevant ones higher up. The comparison of document A and B in regards to relevance has no answers other than empirical testing. In most cases, a simple dot product or cosine similarity metric is used to measure the similarity between a document and a query. These similarity values are then used to rank the list of results.

Applications of Text Mining

Text classification has many direct applications such as:

  • Classification of email documents to filter spam or to automatically sort messages into mail folders.
  • Classification in streams of documents, like news feeds, to identify particular items of interest.

Supervised learning in the text domain is also often used to support higher-level objectives like:

  • Information extraction: the process of extracting bits of specific information from unstructured or semi-structured text records. Classification methods are often used to identify the parts of records that qualify for extraction. For example, extracting the date, time, location, speaker, and topic from seminar announcements.
  • Unsupervised techniques like clustering are often used as tools for exploration and discovery in large text collections. Such techniques can greatly reduce the time required for human inspection or search.