Tuesday, 11 October 2016

BAYESIAN CLASSIFICATION :


Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve  Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered “naïve.” Bayesian belief networks are graphical models, which unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

4.2.1. BAYES’ THEOREM :

P(H|X) is the posterior probability, or a posteriori probability, of Hconditioned on X.

In contrast, P(H) is the prior probability, or a priori probability, of H.

The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of conditioned on H.

P(X) is the prior probability of X.

P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is


4.2.2.Naïve Bayesian Classification :

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let be a training set of tuples and their associated class labels. As usual, each tuple

is represented by an n-dimensional attribute vector, = (x1, x2, ….. , xn), depicting measurements made on the tuple from attributes, respectively,A1, A2, ….. , An.

2. Suppose that there are classes, C1, C2, …. , Cm. Given a tuple, X, the classifier will predict that belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple belongs to the class Ci if and only if


Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem


3. As P(X) is constant for all classes, only P(X|CiP(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) =…. =P(Cm), and we would therefore maximize P(X|Ci).

4. Given data sets with many attributes, it would be extremely computationally expensive

to compute P(X|Ci). In order to reduce computation in evaluating P(XjCi), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,


For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:

(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ciin having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation s, defined by


5. In order to predict the class label of XP(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple is the classCi if and only if


4.2.3.Bayesian Belief Networks :

The naïve Bayesian classifier makes the assumption of class conditional independence,

that is, given the class label of a tuple, the values of the attributes are assumed to be conditionally

independent of one another. This simplifies computation.

Bayesian belief networks specify joint conditional probability distributions. They allow class conditional independencies to be defined between subsets of variables. They provide a graphical model of causal relationships, on which learning can be performed. Trained Bayesian belief networks can be used for classification. Bayesian belief networks are also known as belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to them as belief networks.

A belief network is defined by two components—a directed acyclic graph and a set of conditional  probabability tables.

The variables may be discrete or continuous-valued. They may correspond to actual attributes given in the data or to “hidden variables” believed to form a relationship (e.g., in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that, together, characterize a specific disease). Each arc represents a probabilistic dependence. If an arc is drawn from a node to a node Z, then is a parent or immediate predecessor of Z, and is a descendant of YEach variable is conditionally independent of its non descendants in the graph, given its parents.

 


Figure 6.11 A simple Bayesian belief network: (a) A proposed causal model, represented by a directed acyclic graph. (b) The conditional probability table for the values of the variable Lung Cancer (LC) showing each possible combination of the values of its parent nodes, Family History (FH)  andSmoker (S).

A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable specifies the conditional distribution P(Y|Parents (Y)), where Parents(Y) are the parents of Y.


4.2.4.Training Bayesian Belief Networks :

The network topology (or “layout” of nodes and arcs) may be given in advance or inferred from the data. The network variables may be observableor hidden in all or some of the training tuples. The case of hidden data is also referred to as missing values or incomplete data.

If the network topology is known and the variables are observable, then training the network is straightforward. It consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naive Bayesian classification. When the network topology is given and some of the variables are hidden, there are various methods to choose from for training the belief network.

gradient descent strategy is used to search for the wi jk values that best model the data, based on the assumption that each possible setting of wi jk is equally likely.

The gradient descent method performs greedy hill-climbing in that, at each iteration or step along the way, the algorithm moves toward what appears to be the best solution at the moment, without backtracking. The weights are updated at each iteration. Eventually, they converge to a local optimum solution.

1.    Compute the gradients: For each ijk, compute,


2.    Take a small step in the direction of the gradient: The weights are updated by


where is the learning rate representing the step size and
 is computed.

3. Renormalize the weights: Because the weights wi, jk are probability values, they must

be between 0.0 and 1.0 and 

          must equal 1 for all ik. Algorithms that follow this form of learning are called Adaptive Probabilistic Networks.

4.3. RULE-BASED CLASSIFICATION :

Rule-based classifiers, where the learned model is represented as a set of IF-THEN rules.

4.3.1. Using IF-THEN Rules for Classification :

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,


The “IF”-part (or left-hand side)of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests(such as age youth, and student yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a  computer). R1 can also be written as


If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple,we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

A rule can be assessed by its coverage and accuracy. Given a tuple,X, from a class labeled data set,D, let ncovers be the number of tuples covered by Rncorrect be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of as


That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.

Let’s see how we can use rule-based classification to predict the class label of a given tuple, X. If a rule is satisfied by X, the rule is said to betriggered. For example, suppose we have


If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.

If more than one rule is triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction toX. There are many possible strategies ,

ü  Size ordering and

ü  Rule ordering.

The size ordering scheme assigns the highest priority to the triggering rule that has the “toughest” requirements, where toughness is measured by the rule antecedent size.That is, the triggering rule with the most attribute tests is fired.

The rule ordering scheme prioritizes the rules beforehand. The ordering may be class based or rule-based.

With class-based ordering, the classes are sorted in order of decreasing “importance,” such as by decreasing order of prevalence.

With rule-based ordering, the rules are organized into one long priority list, according to some measure  of rule quality such as accuracy, coverage, or size (number of attribute tests in the rule antecedent), or based on advice from domain experts.

4.3.2. Rule Extraction from a Decision Tree :

To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).


 


A disjunction (logical OR) is implied between each of the extracted rules. Because the rules are extracted directly from the tree, they aremutually exclusive and exhaustive. By mutually exclusive, this means that we cannot have rule conflicts here because no two rules will be triggered for the same tuple. (We have one rule per leaf, and any tuple can map to only one leaf.) By exhaustive, there is one rule for each possible attribute-value combination, so that this set of rules does not require a default rule. Therefore, the order of the rules does not matter—they are unordered.

The training tuples and their associated class labels are used to estimate rule accuracy.

Other problems arise during rule pruning, however, as the rules will no longer be mutually exclusive and exhaustive. For conflict resolution, C4.5 adopts a class-based ordering scheme. It groups all rules for a single class together, and then determines a ranking of these class rule sets. Within a rule set, the rules are not ordered. C4.5 orders the class rule sets so as to minimize the number of false-positive errors (i.e., where a rule predicts a class, C, but the actual class is not C). The class rule set with the least number of false positives is examined first. Once pruning is complete, a final check is done to

remove any duplicates.

4.3.3.Rule Induction Using a Sequential Covering Algorithm :

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm.

Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules, and form the topic of this subsection.

A basic sequential covering algorithm is shown as ,


Here, rules are learned for one class at a time. Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of class and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy. The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so that different rules may cover different tuples within the same class. The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the “best” rule for the current class, given the current set of training tuples.

Typically, rules are grown in a general-to-specific manner, The classifying attribute is loan decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class “accept,” we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:



Figure shows a general-to-specific search through rule space.

IF income = high THEN loan decision accept.

Each time we add an attribute test to a rule, the resulting rule should cover more of the “accept” tuples. During the next iteration, we again consider the possible attribute tests and end up selecting credit rating = excellent. Our current rule grows to become


The process repeats, where at each step, we continue to greedily grow rules until the

resulting rule meets an acceptable quality level.

4.3.4.Rule Quality Measures :

Learn One Rule needs a measure of rule quality. Every time it considers an attribute test,

it must check to see if appending such a test to the current rule’s condition will result in an improved rule.

Choosing between two rules based on accuracy. Consider the two rules as illustrated in Figure 6.14. Both are for the class loan decision = accept. We use “a” to represent the tuples of class “accept” and “r” for the tuples of class “reject.” Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better rule because of its small coverage.


Figure shows Rules for the class loan decision = accept, showing accept (a) and reject (r) tuples.

Another measure is based on information gain and was proposed inFOIL (First Order Inductive Learner), a sequential covering algorithm that learns first-order logic rules.

Learning first-order rules is more complex because such rules contain variables, whereas the rules we are concerned with in this section are propositional.

FOIL assesses the information gained by extending condition as


4.3.5.Rule Pruning :

The rules may perform well on the training data, but less well on subsequent data. To compensate for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule, R, if the pruned version of has greater quality, as assessed on an independent set of tuples. FOIL uses a simple yet effective method. Given a rule, R,


where pos and neg are the number of positive and negative tuples covered byR, respectively.

This value will increase with the accuracy of on a pruning set. Therefore, if the FOIL Prune value is higher for the pruned version of R, then we prune R.

No comments:

Post a Comment