Frequent pattern analysis (FPGrowth) allows a researcher to systematically identify patterns that emerge from text and other forms of data. It can also be used to identify changes and the emergence of new topics or patterns over time.
Humanities researchers could use frequent pattern analysis in many ways. Examples include finding recurrent patterns, tracing the characteristics of authors or topics through time, and detecting unusual events. Of course, each scholar’s research questions or hypotheses will drive the focus of such an analysis.
Frequent pattern mining seeks to discover significant relationships among variables (e.g., people, places, and things) in a dataset (e.g., sentences, chapters, books, or images). The output of frequent pattern mining can be a rule or hypothesis set, or a visual representation. Significant relationships are represented at two levels, structural and quantitative. At the structural level, the model will indicate which variables are locally dependent on one another. At the quantitative level, the model will offer some numeric measure of support and confidence for these relationships.
At the structural level, frequent pattern analysis:
- Finds all rules that correlate the presence of one set of items X (person and mode of transportation) with another item Y (location). For example, if a man travels by bus, then he travels to Java Jane’s coffee shop 85% of the time. Or one item set of keywords can correlate with another keyword. For example, (wizard, redbird, and stealing) implies (St. Louis Cardinals) 90% of the time.
At the quantitative level, the researcher can direct the rule discovery by the following two parameters:
- Support is the percentage of all records (e.g., sentences or documents) that contain both X and Y. A rule must have some minimum user-specified support to show its value.
- Confidence is the percentage of the records containing X that also contain Y. A rule must have some minimum user-specified confidence to show its value.
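The two measures above can be made concrete with a small calculation. The following is a minimal sketch, not the flow's own code: the records, the item strings, and the function names `support` and `confidence` are all invented for illustration.

```python
# Toy records (hypothetical): each record is the set of items observed together.
records = [
    {"person=man", "transport=bus", "location=coffee shop"},
    {"person=man", "transport=bus", "location=coffee shop"},
    {"person=man", "transport=bus", "location=library"},
    {"person=woman", "transport=car", "location=coffee shop"},
    {"person=man", "transport=car", "location=library"},
]

def support(records, antecedent, consequent):
    """Percentage of all records that contain every item of X and of Y."""
    both = antecedent | consequent
    hits = sum(1 for r in records if both <= r)
    return 100.0 * hits / len(records)

def confidence(records, antecedent, consequent):
    """Percentage of the records containing X that also contain Y."""
    with_x = [r for r in records if antecedent <= r]
    if not with_x:
        return 0.0  # X never occurs, so the rule has no evidence
    hits = sum(1 for r in with_x if consequent <= r)
    return 100.0 * hits / len(with_x)

x = {"person=man", "transport=bus"}
y = {"location=coffee shop"}
print(support(records, x, y))     # 2 of the 5 records contain X and Y
print(confidence(records, x, y))  # 2 of the 3 records containing X also contain Y
```

Note that the two measures share a numerator (records containing both X and Y) and differ only in the denominator: all records for support, records containing X for confidence.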
This flow loads the data, bins the data (if necessary), generates rules, and displays the rules in a graphical representation.
==Data Input and Manipulation==
The data are loaded by passing the URL of the data location, so this flow can be easily modified to load your data from your collection. This data has two header rows at the top that indicate the labels row and the types row. These parameters can be modified by adjusting the values in “Create Delimited File Parser.”
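To illustrate the two-header-row layout described above, here is a minimal sketch of a parser that treats one row as labels and another as types. It is not the “Create Delimited File Parser” component itself; the sample text, the function name `parse_delimited`, and its parameters are assumptions made for the example, and the data is read from an in-memory string rather than a URL so that the sketch stays self-contained.

```python
import csv
import io

# Hypothetical sample; in practice the text would come from a file or URL.
raw = """author,genre,decade
String,String,String
Austen,novel,1810s
Dickens,novel,1850s
"""

def parse_delimited(text, delimiter=",", labels_row=0, types_row=1):
    """Split delimited text into labels, types, and data rows keyed by label."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    labels = rows[labels_row]
    types = rows[types_row]
    header_rows = {labels_row, types_row}
    data = [
        dict(zip(labels, row))
        for i, row in enumerate(rows)
        if i not in header_rows and row  # skip the header rows and blank lines
    ]
    return labels, types, data

labels, types, data = parse_delimited(raw)
print(labels)
print(data[0])
```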
When executing the flow, the “Choose Attributes” web user interface will prompt the user to identify the input and output attributes. Use the Shift key to select a range of attributes. Use Control to select and/or deselect an attribute. Select (highlight) the attributes that should be used for input and the output attribute. The File menu also allows for different sorting options. When selections are complete, click the Done button. Note: For this application, we use input and output selections to choose the attribute-value pairs for the rule antecedent and the rule consequent values, respectively.
For this flow, the data is all categorical, so we do not need to bin the data. However, if we had numerical data, we might need to bin the data into discretized groups to reduce the number of possible combinations.
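As an illustration of what binning does, the sketch below maps numbers to equal-width interval labels so they can be treated as categorical values. Equal-width binning is only one possible scheme and is an assumption here; the function name `equal_width_bins` and the sample ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each number to a categorical label naming its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    labels = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it in.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"{lo + idx * width:g}-{lo + (idx + 1) * width:g}")
    return labels

ages = [18, 22, 35, 41, 58, 78]
print(equal_width_bins(ages, 3))
# ['18-38', '18-38', '18-38', '38-58', '58-78', '58-78']
```

Six distinct numbers collapse into three categorical values, which is what keeps the number of attribute-value combinations manageable.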
==Execution of Analysis==
An itemset is a collection of items, and an item is an attribute-value pair that exists in the dataset. A data table can be used to build multiple rule tables with different combinations of attributes or with different support or items per itemset values. A rule has two parts: the rule antecedent (X) and the rule consequent (Y) – X implies Y.
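The terminology above can be sketched in a few lines: each table row becomes a set of (attribute, value) items, and any subset of those items is an itemset. The rows and the helper name `row_to_items` are hypothetical.

```python
# Hypothetical table rows, one dictionary per example.
rows = [
    {"author": "Austen", "genre": "novel", "decade": "1810s"},
    {"author": "Dickens", "genre": "novel", "decade": "1850s"},
]

def row_to_items(row):
    """Turn one table row into a frozenset of (attribute, value) items."""
    return frozenset(row.items())

transactions = [row_to_items(r) for r in rows]

# The items that both rows share form an itemset occurring in every example.
shared = transactions[0] & transactions[1]
print(shared)  # frozenset({('genre', 'novel')})
```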
The “fpgrowth” component implements the FPGrowth algorithm to generate frequent itemsets consisting of items that occur in a sufficient number of examples to satisfy the minimum support criteria.
- Minimum Support % is the percent of all examples that must contain a given set of items before an association rule will be formed containing those items. This value must be greater than 0 and less than or equal to 100.
- Maximum Items per Rule is the maximum number of items to include in any rule. This setting does not impact performance for this algorithm as it does for the Apriori algorithm. This setting cannot be less than 2.
- Generate Verbose Output should be set to TRUE if the module should report progress information to the console.
- Generate Debug Output should be set to TRUE if the module should write verbose status information to the console.
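To show what the two main parameters control, the sketch below enumerates frequent itemsets by brute force. This is deliberately not the FPGrowth algorithm, which avoids candidate enumeration by building a prefix tree; it only produces the same kind of output on a tiny example. The function name, parameter names, and transactions are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support_pct, max_items):
    """Brute-force search for itemsets meeting the minimum support percentage."""
    if not 0 < min_support_pct <= 100:
        raise ValueError("minimum support must be in (0, 100]")
    if max_items < 2:
        raise ValueError("maximum items per rule cannot be less than 2")
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_items + 1):
        found = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            pct = 100.0 * count / n
            if pct >= min_support_pct:
                result[itemset] = pct
                found = True
        if not found:
            break  # no frequent itemset of this size, so none can exist at larger sizes
    return result

# Hypothetical keyword transactions.
transactions = [
    {"wizard", "redbird", "Cardinals"},
    {"wizard", "redbird", "Cardinals"},
    {"wizard", "stealing"},
    {"redbird", "Cardinals"},
]
found = frequent_itemsets(transactions, min_support_pct=50, max_items=3)
for itemset, pct in found.items():
    print(sorted(itemset), pct)
```

With a 50% threshold, "stealing" (present in one of four transactions) is dropped, and with it every itemset that contains it.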
The “Compute Confidence” component works in conjunction with other components that implement the Apriori or FPGrowth rule association algorithms to generate association rules satisfying a minimum confidence threshold.
- Minimum Confidence % is the percent of the examples containing a rule antecedent that must also contain the rule consequent before a potential association rule is accepted. This value must be greater than 0 and less than or equal to 100.
- Report Module Progress should be set to TRUE if the component should report progress information to the console.
- Generate Verbose Output should be set to TRUE if the component should write verbose status information to the console.
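The confidence step can be sketched as follows, assuming the frequent itemsets and their support percentages are already available. The support table, the function name `rules_from_itemsets`, and the tuple layout of a rule are all hypothetical and are not the component's actual interface.

```python
from itertools import combinations

# Hypothetical support percentages for frequent itemsets. Every subset of a
# frequent itemset is itself frequent, so each antecedent can be looked up here.
supports = {
    frozenset({"wizard"}): 75.0,
    frozenset({"redbird"}): 75.0,
    frozenset({"Cardinals"}): 75.0,
    frozenset({"redbird", "Cardinals"}): 75.0,
    frozenset({"wizard", "Cardinals"}): 50.0,
    frozenset({"wizard", "redbird"}): 50.0,
    frozenset({"wizard", "redbird", "Cardinals"}): 50.0,
}

def rules_from_itemsets(supports, min_confidence_pct):
    """Split each frequent itemset into X -> Y and keep rules meeting the threshold."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one antecedent and one consequent item
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                x = frozenset(antecedent)
                # confidence(X -> Y) = support(X and Y) / support(X)
                conf = 100.0 * sup / supports[x]
                if conf >= min_confidence_pct:
                    rules.append((x, itemset - x, sup, conf))
    return rules

for x, y, sup, conf in rules_from_itemsets(supports, min_confidence_pct=100):
    print(sorted(x), "->", sorted(y), f"support={sup} confidence={conf}")
```

Lowering the threshold admits weaker rules: at 100% this table yields four rules, while at 60% all twelve candidate rules pass.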
==Visualization of Results==
Once execution has been completed, the console window contains information about the results of this analysis. The resulting visualization will open in a browser as an applet. This visualization presents a graphical representation of the result of the association rule algorithm. The main region of the display contains a matrix that visually depicts the rules. Each numbered column in the matrix corresponds to an association rule that met the minimum support and confidence requirements specified by the user in the rule-discovery modules. The items used in the rules (that is, attribute-value pairs) are listed along the left side of the matrix. Note that some items in the original dataset may not be included in any rule because there was insufficient support and/or confidence to consider the item significant.
An icon in the matrix cell indicates that an item is included in a rule. If the matrix cell icon is a box, then the item is part of the rule antecedent. If the icon is a check mark, then the item is part of the rule consequent.
Above the main matrix are two rows of bars labeled Confidence and Support. These bars align with the corresponding rule columns in the main matrix. For any given rule, the confidence and support values are represented by the degree to which the bars above the rule column are filled in. Brushing the mouse on a confidence or support bar displays the exact value that is graphically represented by the bar height.
The rules can be ordered by confidence or by support. To sort the rules, click either the support or confidence label; these labels are clickable radio buttons. If support is selected, the rules will be sorted using support as the primary key and confidence as the secondary key. Conversely, if the confidence button is chosen, confidence is the primary sort key and support is the secondary key.
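The primary and secondary sort keys described above behave like a tuple sort. The sketch below assumes that higher values come first, which the visualization's documentation does not state explicitly; the rule records and the function name `sort_rules` are hypothetical.

```python
# Hypothetical rules with their support and confidence percentages.
rules = [
    {"rule": "A -> B", "support": 40.0, "confidence": 90.0},
    {"rule": "C -> D", "support": 55.0, "confidence": 70.0},
    {"rule": "E -> F", "support": 40.0, "confidence": 95.0},
]

def sort_rules(rules, primary):
    """Sort by the chosen measure, breaking ties with the other measure."""
    secondary = "confidence" if primary == "support" else "support"
    # Descending order is an assumption made for this sketch.
    return sorted(rules, key=lambda r: (r[primary], r[secondary]), reverse=True)

print([r["rule"] for r in sort_rules(rules, "support")])
# ['C -> D', 'E -> F', 'A -> B']
print([r["rule"] for r in sort_rules(rules, "confidence")])
# ['E -> F', 'A -> B', 'C -> D']
```

The first call shows the secondary key at work: two rules tie on support at 40%, so confidence decides their order.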
Directly above the confidence and support display is a toolbar that provides additional functionality. On the left side of the toolbar are two buttons that allow the rows of the table to be displayed according to different sorting schemes. One of the buttons is active at all times. The Alphabetize button sorts the attribute-value combinations alphabetically. The Rank button sorts the rows based on the current Confidence/Support selection, moving the consequents and antecedents of the highest-ranking rules to the top of the attribute-value list.
On the right side of the toolbar are four additional buttons. Restore Original reverts to the original table that was displayed before any sorting was done. Filter provides an interface that allows the user to display a subset of the generated rules. Filtering is not part of this release. Print prints a screen capture of the visual display. The print output contains only the cells that are visible in the display window, not all the cells in the rule table. Printing is also accessible via the Options menu. Help displays information describing the visualization.
Every attribute-value combination is compared, so a dataset with a large number of attribute-value pairs can take a long time to process; there is also a chance that you will get an “Out of Memory” error.
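A rough sense of this growth can be had by counting the itemsets that could be formed from a given number of distinct attribute-value pairs. The sketch below is only an upper bound on the search space, not a measure of what the algorithm actually examines; the function name `possible_itemsets` is hypothetical.

```python
from math import comb

def possible_itemsets(n_items, max_items):
    """Count the itemsets of size 1..max_items that n_items distinct items allow."""
    return sum(comb(n_items, k) for k in range(1, max_items + 1))

print(possible_itemsets(20, 3))   # 1350
print(possible_itemsets(100, 3))  # 166750
```

Multiplying the number of items by five multiplies the number of possible itemsets by more than a hundred, which is why binning and a higher minimum support help on large datasets.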
Han, J., J. Pei, and Y. Yin. “Mining frequent patterns without candidate generation.” ACM SIGMOD Record 29.2 (2000): 1-12.