CLASSIFICATION AND CLUSTERING
Classification and Prediction, Issues, Decision Tree Induction, Bayesian Classification, Association Rule Based, Other Classification Methods, Prediction, Classifier Accuracy, Cluster Analysis, Types of data, Categorisation of methods, Partitioning methods, Outlier Analysis.
4.1. Classification and Prediction:
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous valued functions.
A model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning.
A predictor is a model that predicts a continuous-valued function, or ordered value, as opposed to a categorical label.
Classification and numeric prediction are the two major types of prediction problems.
DATA CLASSIFICATION:
Data classification is a two-step process.
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, …, An.
The figure shows the data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.
The individual tuples making up the training set are referred to as training tuples and are selected from the database under analysis.
This first step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs).
It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
This first step of the classification process can also be viewed as the learning of a mapping or function, y = f (X), that can predict the associated class label y of a given tuple X.
This mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
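A minimal sketch of such a learned mapping y = f(X) expressed as classification rules, using hypothetical attribute names (`income`, `has_debt`) and thresholds for the loan-application example above:

```python
# A toy rule set standing in for a learned classifier f(X).
# Attribute names and thresholds are illustrative, not from real data.
def classify_loan(applicant):
    """Return a loan-decision label ('safe' or 'risky') for a tuple X."""
    # Rule 1: low income combined with existing debt -> risky
    if applicant["income"] < 30000 and applicant["has_debt"]:
        return "risky"
    # Rule 2: high income -> safe
    if applicant["income"] >= 50000:
        return "safe"
    # Default rule when no other rule fires
    return "risky"

X = {"income": 60000, "has_debt": False}
print(classify_loan(X))  # safe
```

In practice the rules (or an equivalent decision tree or formula) are induced from the training tuples rather than written by hand; the point here is only that the learned model is a function from an attribute vector X to a class label y.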
In the second step,
The model is used for classification. First, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the accuracy of the classifier, this estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during learning it may incorporate some particular anomalies of the training data that are not present in the general data set overall). Therefore, a test set is used, made up of test tuples and their associated class labels. These tuples are randomly selected from the general data set.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier’s class prediction for that tuple.
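The accuracy computation described above can be sketched directly: compare each test tuple's known label with the classifier's prediction and take the fraction that match. The classifier and test tuples below are hypothetical.

```python
# accuracy = (correctly classified test tuples) / (total test tuples)
def accuracy(classifier, test_set):
    correct = sum(1 for X, label in test_set if classifier(X) == label)
    return correct / len(test_set)

# Toy classifier and labeled test tuples (illustrative data).
clf = lambda X: "safe" if X["income"] >= 50000 else "risky"
test = [({"income": 60000}, "safe"),
        ({"income": 20000}, "risky"),
        ({"income": 55000}, "risky")]  # this tuple will be misclassified

print(accuracy(clf, test))  # 2 of 3 correct -> 0.666...
```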
DATA PREDICTION :
Data prediction is a two-step process, similar to that of data classification.
However, for prediction, we lose the terminology of “class label attribute” because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as the predicted attribute.
Note that prediction can also be viewed as a mapping or function, y = f (X), where X is the input (e.g., a tuple describing a loan applicant), and the output y is a continuous or ordered value (such as the predicted amount that the bank can safely loan the applicant). That is, we wish to learn a mapping or function that models the relationship between X and y.
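As a sketch of such a continuous-valued mapping, the snippet below fits a simple least-squares line on one attribute; the income and loan-amount figures are invented for illustration, not taken from any real data set.

```python
# Prediction as y = f(X): a one-attribute linear predictor fitted
# by ordinary least squares (illustrative data).
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope b = cov(x, y) / var(x); intercept a = mean(y) - b * mean(x)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x  # the learned function f

incomes = [20, 40, 60, 80]   # annual income, in thousands
loans   = [5, 10, 15, 20]    # observed safe loan amounts, in thousands
f = fit_linear(incomes, loans)
print(f(50))  # predicted safe loan amount for income 50 -> 12.5
```

Unlike the classifier above, the output here is an ordered numeric value, so accuracy is measured by prediction error rather than by the percentage of exact matches.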
4.2.ISSUES REGARDING CLASSIFICATION AND PREDICTION :
4.2.1. Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
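The missing-value treatment mentioned above (replace a missing value with the attribute's most commonly occurring value) can be sketched in a few lines; the attribute values here are hypothetical.

```python
# Fill missing values of a categorical attribute with its mode,
# i.e., the most commonly occurring non-missing value.
from collections import Counter

def fill_with_mode(values, missing=None):
    counts = Counter(v for v in values if v != missing)
    mode = counts.most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

ages = ["young", None, "young", "senior", None]
print(fill_with_mode(ages))
# ['young', 'young', 'young', 'senior', 'young']
```

Replacing missing values with the most probable value given other attributes (e.g., via a Bayesian estimate) follows the same pattern but conditions the replacement on the rest of the tuple.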
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
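One common correlation measure for two numeric attributes is the Pearson coefficient; a value near +1 or −1 suggests that one of the attributes is redundant. A minimal sketch, with made-up measurement data (the same height recorded in two units):

```python
# Pearson correlation coefficient between two numeric attributes:
# r = cov(x, y) / (stddev(x) * stddev(y)), in [-1, 1].
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

height_cm = [150, 160, 170, 180]
height_in = [59.1, 63.0, 66.9, 70.9]  # essentially the same attribute
print(pearson(height_cm, height_in))  # very close to 1.0 -> redundant
```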
Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
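Min-max normalization, one standard way to scale an attribute into such a range, can be sketched as follows (the income values are illustrative):

```python
# Min-max normalization: map each value of an attribute into
# [new_min, new_max] (here the default range 0.0 to 1.0).
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 35000, 50000, 80000]
print(min_max(incomes))  # [0.0, 0.25, 0.5, 1.0]
```

This keeps large-magnitude attributes (such as income) from dominating small-magnitude ones in distance computations.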
The data can also be transformed by generalizing it to higher-level concepts.
Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.
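Of the discretization techniques listed, equal-width binning is the simplest to sketch: split the attribute's range into k intervals of equal width and replace each value by its bin index. The age values below are illustrative.

```python
# Discretization by equal-width binning: divide [min, max] into k
# equal-width intervals and map each value to its bin index 0..k-1.
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [18, 22, 35, 47, 60]
print(equal_width_bins(ages, 3))  # [0, 0, 1, 2, 2]
```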
4.2.2 Comparing Classification and Prediction Methods:
Classification and prediction methods can be compared and evaluated according to the following criteria:
Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
Speed: This refers to the computational costs involved in generating and using the given classifier or predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor.