Friday, October 26, 2012

Using Orange to generate association rules for a dataset.


by James M. Rogers on Friday, October 26, 2012 at 3:59am
MIS 420 Data Warehousing
26 Oct 2012

Everything I know about association rule mining in Orange comes from this web page: http://orange.biolab.si/doc/ofb/assoc.htm. Let's discuss the incoming data first, then how the data is processed into rules, and finally what we can do with the rules once we have them.


In the examples we used the datafile imports-85.tab. If you click on that link you will see the data file is set up in a tabular format, with tabs between each attribute or field and a newline between each data tuple or record. Once the data is set up in the tabular format you can bring the data into the program to be processed:

import Orange
import orange, orngAssoc  # legacy modules used by the calls that follow
data = Orange.data.Table("imports-85.tab")
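
To make the tabular layout concrete, here is a minimal sketch in plain Python (no Orange needed) that reads a tab-delimited sample from a string. The sample rows are made up and are not taken from imports-85.tab. The three header rows (names, types, flags) are my assumption about how Orange's .tab files are laid out, so check that against your own file.

```python
# Made-up sample: tabs between fields, one record per line.
# Assumed header layout: row 1 = attribute names, row 2 = types,
# row 3 = optional flags (for example "class").
sample = (
    "make\tfuel-type\tprice\n"
    "discrete\tdiscrete\tcontinuous\n"
    "\t\tclass\n"
    "audi\tgas\t13950\n"
    "bmw\tgas\t16430\n"
)

# Split into lines and drop the empty string left after the final newline.
lines = [line for line in sample.split("\n") if line != ""]

names = lines[0].split("\t")
types = lines[1].split("\t")

# Everything after the three header rows is one record per line.
records = [dict(zip(names, line.split("\t"))) for line in lines[3:]]
```

Every value comes back as a string here; converting the continuous columns to numbers would be a separate step.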
The rule inducer works on discrete attributes, so a column must not hold continuous values. If it does, you have to discretize (categorize) the column first. You can use the following command to convert continuous fields into discrete sets of three equally populated intervals:
data = orange.Preprocessor_discretize(data, method=orange.EquiNDiscretization(numberOfIntervals=3))
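
To show what "three equally populated intervals" means, here is a small sketch of equal-frequency binning in plain Python. It is not Orange's implementation, only an illustration of the idea. The function names and the price values are hypothetical.

```python
def equal_frequency_cutoffs(values, n_intervals=3):
    """Return n_intervals - 1 cut points that split the sorted values
    into groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for k in range(1, n_intervals):
        idx = k * n // n_intervals
        # Place the cut halfway between the two neighbouring values.
        cutoffs.append((ordered[idx - 1] + ordered[idx]) / 2.0)
    return cutoffs


def discretize(value, cutoffs):
    """Map a number to the index of the interval it falls into."""
    for i, cut in enumerate(cutoffs):
        if value <= cut:
            return i
    return len(cutoffs)


# Hypothetical price column, not taken from imports-85.tab.
prices = [5000, 6000, 7000, 9000, 11000, 13000, 16000, 22000, 35000]
cuts = equal_frequency_cutoffs(prices)
labels = [discretize(p, cuts) for p in prices]
```

With nine values and three intervals, each interval ends up with three values. Columns with many repeated values will not split this evenly.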
To select a subset of the data, such as the first ten columns, you can specify a range like so:
data = data.select(range(10))
At this point you can convert the data into a set of rules and calculate the confidence, support and lift of every rule using orange.AssociationRulesInducer like so:

rules = orange.AssociationRulesInducer(data, support=0.78)
The data argument is just the data set we imported, possibly converted to discrete values and possibly narrowed to a range of columns. The support argument is the minimum support we will accept for a rule: if a rule's support is below this value, the rule is dropped. The induction is based on the Apriori algorithm from Agrawal et al.'s "Fast discovery of association rules", a chapter in Advances in Knowledge Discovery and Data Mining, published in 1996. This algorithm is optimized to work on the tabular format we imported. Class variables are treated like attributes by this function.
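
For readers who want to see the three measures spelled out, here is a sketch in plain Python using the usual definitions: support is the fraction of records that match both sides of a rule, confidence is that count divided by the number of records matching the left side, and lift is confidence divided by the fraction of records matching the right side. This is not the Apriori search itself and not Orange's code. It only scores two hand-picked candidate rules on made-up records and applies a minimum support cutoff.

```python
# Each record is a dict of attribute -> discrete value (made-up data).
records = [
    {"fuel": "gas", "doors": "four", "drive": "fwd"},
    {"fuel": "gas", "doors": "four", "drive": "fwd"},
    {"fuel": "gas", "doors": "two", "drive": "rwd"},
    {"fuel": "diesel", "doors": "four", "drive": "rwd"},
    {"fuel": "gas", "doors": "two", "drive": "fwd"},
]


def matches(record, conditions):
    """True if the record has every attribute=value pair in conditions."""
    return all(record.get(attr) == val for attr, val in conditions.items())


def rule_measures(records, left, right):
    """Return (support, confidence, lift) for the rule left -> right."""
    n = float(len(records))
    n_left = sum(1 for r in records if matches(r, left))
    n_right = sum(1 for r in records if matches(r, right))
    n_both = sum(1 for r in records if matches(r, left) and matches(r, right))
    support = n_both / n
    confidence = n_both / float(n_left) if n_left else 0.0
    lift = confidence / (n_right / n) if n_right else 0.0
    return support, confidence, lift


min_support = 0.4  # hypothetical threshold for this toy data
candidates = [
    ({"fuel": "gas"}, {"drive": "fwd"}),
    ({"doors": "four"}, {"fuel": "diesel"}),
]
kept = []
for left, right in candidates:
    support, confidence, lift = rule_measures(records, left, right)
    # Rules below the minimum support are dropped.
    if support >= min_support:
        kept.append((left, right, support, confidence, lift))
```

Only the first candidate survives: three of the five records have both fuel=gas and drive=fwd, while only one record has both doors=four and fuel=diesel.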

Now that we have the rules in the list named rules, we can work with it just like any other Python list. The orngAssoc module is already written to work with the rules and includes two functions for working with the rule set directly, from http://orange.biolab.si/doc/modules/orngAssoc.htm:
printRules(rules, ms = []) Prints out the rules. If ms is left empty, only the rules are printed. If ms contains rules' attributes, e.g. ["support", "confidence"], these are printed out as well.
sort(rules, ms = ["support"]) Sorts the rules according to the given criteria. The default key is "support"; you can list multiple keys.
You can also select() parts of the rule list, del() rules, and append() rules.
Once you have the rules, you can filter on confidence and lift using a Python lambda, which is a way to write an unnamed function:
conf = 0.8; lift = 1.1
print "\nRules with confidence>%5.3f and lift>%5.3f" % (conf, lift)
rulesC=rules.filter(lambda x: x.confidence>conf and x.lift>lift)
rulesC now contains only the rules with confidence above 0.8 and lift above 1.1. Then you can sort by confidence and print out the confidence, support, lift and the rule:
orngAssoc.sort(rulesC, ['confidence'])
orngAssoc.printRules(rulesC, ['confidence','support','lift'])
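
The same filter, sort, and print steps can be sketched with an ordinary Python list, which may help if you want to experiment without Orange installed. The Rule tuple is a hypothetical stand-in for Orange's rule objects, the rules and their numbers are made up, and sorting highest first is my own choice for this example.

```python
from collections import namedtuple

# Stand-in for a rule object; the field names mirror the attributes used above.
Rule = namedtuple("Rule", ["text", "support", "confidence", "lift"])

# Made-up rules and measures, for illustration only.
rules = [
    Rule("fuel=gas -> drive=fwd", 0.60, 0.75, 1.25),
    Rule("doors=two -> fuel=gas", 0.40, 1.00, 1.25),
    Rule("drive=rwd -> doors=four", 0.20, 0.50, 0.83),
    Rule("fuel=gas -> doors=four", 0.40, 0.50, 0.83),
]

conf = 0.7
lift = 1.1

# Keep only the rules that pass both thresholds.
selected = [r for r in rules if r.confidence > conf and r.lift > lift]

# Sort by confidence, then by support, highest first.
selected.sort(key=lambda r: (r.confidence, r.support), reverse=True)

# One line per rule: confidence, support, lift, then the rule text.
lines = ["%5.3f  %5.3f  %5.3f  %s" % (r.confidence, r.support, r.lift, r.text)
         for r in selected]
print("\n".join(lines))
```

The lambda passed to sort() returns a tuple, which is how you get more than one sort key from a single function.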
These Orange functions make generating association rules from tabular data almost trivially easy. You can import the data, convert continuous columns to discrete columns, and finally generate rules with confidence, support and lift. Once the rules are generated it is easy to sort, filter, and select subsets of the rules and then print out the results you require.
