Data Mining Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances. It predicts categorical class labels (discrete or nominal): it constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data.
The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
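
As a rough illustration of these terms, the sketch below separates each case into predictors and target. The field names and values are invented for this example and do not come from any real data set.

```python
# Hypothetical loan-applicant records; every field name and value is invented.
cases = [
    {"years_employed": 8, "owns_home": True, "years_at_residence": 10, "credit_risk": "low"},
    {"years_employed": 1, "owns_home": False, "years_at_residence": 1, "credit_risk": "high"},
    {"years_employed": 4, "owns_home": False, "years_at_residence": 3, "credit_risk": "medium"},
]

TARGET = "credit_risk"  # the attribute whose class label we want to predict


def split_case(case, target=TARGET):
    """Separate one case into its predictor values and its known class label."""
    predictors = {name: value for name, value in case.items() if name != target}
    return predictors, case[target]


# One tuple of predictor dicts and one tuple of class labels, in case order.
predictors, labels = zip(*(split_case(c) for c in cases))
print(sorted(set(labels)))  # ['high', 'low', 'medium']
```

Each dictionary is one case; the three distinct labels are the classes the model would learn to predict.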

Classification versus Prediction

Classification:
  • Predicts categorical class labels.
  • Constructs a model from the training set and the values (class labels) of a classifying attribute, then uses it to classify new data.
Prediction:
  • Models continuous-valued functions, i.e., predicts unknown or missing numeric values.
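
The difference can be sketched with two tiny models: one returns a discrete label, the other a continuous number. The nearest-neighbor rule and the least-squares line are stand-ins chosen for brevity, and the data is invented.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the class label of the closest training point."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return closest_label


def fit_line(points):
    """Prediction: fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: years employed paired with a risk label or a credit limit.
labeled = [(1, "high"), (4, "medium"), (8, "low")]
valued = [(1, 1000.0), (4, 2500.0), (8, 4500.0)]

label = nearest_neighbor_label(labeled, 7)  # a category, here "low"
slope, intercept = fit_line(valued)
limit = slope * 7 + intercept               # a number, roughly 4000
```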

Classification is a two-step process
Classification process: Model Construction and Model Usage
a) Model construction: Describing a set of predetermined classes. Each tuple/sample/record is assumed to belong to a predefined class, as determined by the class label attribute. The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulas.
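
As one possible illustration of a model represented as classification rules, the sketch below builds a single rule per attribute value by taking the most common class label in the training set. This is a deliberately simple scheme, not the only way to construct a model, and the attribute names and training tuples are invented.

```python
from collections import Counter, defaultdict


def learn_rules(training_set, attribute, label="credit_risk"):
    """Build one rule per attribute value: value -> most common class label."""
    counts = defaultdict(Counter)
    for row in training_set:
        counts[row[attribute]][row[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Invented training set: each tuple carries a known class label.
training_set = [
    {"owns_home": "yes", "credit_risk": "low"},
    {"owns_home": "yes", "credit_risk": "low"},
    {"owns_home": "yes", "credit_risk": "high"},
    {"owns_home": "no", "credit_risk": "high"},
    {"owns_home": "no", "credit_risk": "high"},
]

rules = learn_rules(training_set, "owns_home")
for value, predicted in rules.items():
    print(f"IF owns_home = {value} THEN credit_risk = {predicted}")
```

The resulting dictionary is the model: a set of IF-THEN rules learned from the labeled tuples.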

b) Model usage: Classifying future or unknown objects. First estimate the accuracy of the model by comparing the known label of each test sample with the model's classification of it. The accuracy rate is the percentage of test set samples that the model classifies correctly. The test set must be independent of the training set; otherwise the estimate will be overly optimistic and over-fitting will go undetected. If the accuracy is acceptable, use the model to classify data objects whose class labels are not known.
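
A minimal sketch of this step, assuming a model built earlier from a separate training set. The `model` function, the test samples, and the 70% acceptance threshold are all assumptions made for illustration.

```python
def accuracy_rate(model, test_set):
    """Percentage of test samples whose predicted label matches the known label."""
    correct = sum(1 for features, known in test_set if model(features) == known)
    return 100.0 * correct / len(test_set)


def model(features):
    # Stand-in for a model constructed from a training set (hypothetical rule).
    return "low" if features["owns_home"] == "yes" else "high"


# Invented test set, kept separate from whatever data built the model.
test_set = [
    ({"owns_home": "yes"}, "low"),
    ({"owns_home": "yes"}, "high"),
    ({"owns_home": "no"}, "high"),
    ({"owns_home": "no"}, "high"),
]

ACCEPTABLE = 70.0  # assumed threshold; what counts as acceptable depends on the task

rate = accuracy_rate(model, test_set)  # 3 of 4 correct -> 75.0
new_label = None
if rate >= ACCEPTABLE:
    # Only now classify an object whose class label is not known.
    new_label = model({"owns_home": "no"})
```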