Advanced Business Analytics
Data Mining: Classification
Advanced Business Analytics – Majid Karimi
Data Mining Revisited

Data Mining
The process of discovering patterns in large data sets for prediction and classification.

Classification
The process of determining the future values of a qualitative variable(s).

Approaches for predicting a qualitative variable(s):
• Logistic Regression
• k-Nearest Neighbors (kNN)
• Artificial Neural Networks (ANN)
• …

Data Mining: Classification (©2019 Cengage) 2 – 23
Before we begin: Data Sampling, Preparation, and Partitioning
• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.
• A sample is representative if the analyst can make the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data mining algorithms typically are more effective given more data.
Data Sampling, Preparation, and Partitioning: Continued
• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
• Data mining applications deal with an abundance of data that simplifies the process of assessing the accuracy of data-based estimates of variable effects.
Overfitting
• Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
• We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
  • The training set.
  • The validation set.
  • The test set.
Data Partitioning
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which the promising subset of models is applied to identify which model is the most accurate at predicting observations that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
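Outside of Excel, the three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's Analytics Solver workflow; the 50/30/20 fractions and the seed are arbitrary choices:

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=42):
    """Randomly split records into training, validation, and test sets."""
    shuffled = records[:]                 # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # the remaining ~20%
    return train, valid, test

train, valid, test = partition(list(range(1000)))
print(len(train), len(valid), len(test))  # 500 300 200
```

Shuffling before slicing is what makes each partition (approximately) representative of the whole data set.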
Data Partitioning Visualized
Data Partitioning Continued
• There are no definite rules for the size of the three partitions, but the training set is typically the largest.
• For estimation tasks, a rule of thumb is to have at least 10 times as many observations as variables.
• For classification tasks, a rule of thumb is to have at least 6 × m × q observations, where m is the number of outcome categories and q is the number of variables.
Data Partitioning with Oversampling

• When we are interested in predicting a rare event, such as a click-through on an advertisement posted on a web site or a fraudulent credit card transaction, it is recommended that the training set oversample the number of observations corresponding to the rare events to provide the data-mining algorithm sufficient data to "learn" about the rare events.
Clicks
If only one out of every 10,000 users clicks on an advertisement posted on a web site, we would not have sufficient information to distinguish between users who do not click through and those who do if we constructed a representative training set consisting of one observation corresponding to a click-through and 9,999 observations with no click-through. In these cases, the training set should contain equal or nearly equal numbers of observations corresponding to the different values of the outcome variable.
Data Partitioning with Oversampling
Note that we do not oversample the validation and test sets; these samples should be representative of the overall population so that accuracy measures evaluated on these data sets appropriately reflect the future performance of the data-mining model.
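As a rough sketch of the idea, the training set can be balanced by resampling the rare class with replacement, while the validation and test sets are left untouched. The data and helper names below are made up for illustration:

```python
import random

def oversample_training(train, label_of, seed=7):
    """Balance a training set by resampling the rare class with replacement.
    Only the training set is balanced; validation and test sets must stay
    representative of the overall population."""
    rng = random.Random(seed)
    positives = [r for r in train if label_of(r) == 1]
    negatives = [r for r in train if label_of(r) == 0]
    rare, common = ((positives, negatives) if len(positives) < len(negatives)
                    else (negatives, positives))
    # draw from the rare class with replacement until it matches the common class
    resampled = [rng.choice(rare) for _ in range(len(common))]
    return common + resampled

# Hypothetical training data: 5 rare events (label 1) out of 100 records
train = [(i, 1 if i < 5 else 0) for i in range(100)]
balanced = oversample_training(train, label_of=lambda r: r[1])
counts = (sum(1 for r in balanced if r[1] == 1),
          sum(1 for r in balanced if r[1] == 0))
print(counts)  # (95, 95)
```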
Data Partitioning in Excel
Credit Scores
• Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test with a 50%, 30%, 20% split, respectively.
• What is the chance of a customer defaulting on their loan?
• What can you conclude with regard to the partitioning of the data?
Data Partitioning in Excel Using Analytics Solver
Partitioning with Over-sampling
We cover the implementation of partitioning with oversampling during the synchronous classes.
Who is going to win the Oscars?
Using Oscar nominations to predict Oscar winners.
Download the OscardDemo file from the Classification module on Cougar Courses, and fit a regression equation to predict Winning Oscars using the independent variable of Oscars Nominations.
Who is going to win the Oscars? (continued)
• Does this make sense?
• Why can't we apply linear regression to classify a categorical variable?
• We should be estimating the "probability" of winning the Oscars.
Logistic Regression: The Idea
• Logistic regression attempts to classify a binary categorical outcome as a linear function of explanatory variables.
• A linear regression model fails to appropriately explain a categorical outcome variable.
• Odds is a measure related to probability.
• If an estimate of the probability of an event is p̂, then the equivalent odds measure is p̂ / (1 − p̂).
• The odds metric ranges between zero and positive infinity.
• We eliminate the fit problem by using the logit, ln( p̂ / (1 − p̂) ), which ranges over the entire real line.
Logistic Regression: The Procedure
Logistic Regression Model:

ln( p̂ / (1 − p̂) ) = b0 + b1x1 + · · · + bnxn

Given a set of explanatory variables, a logistic regression algorithm determines values of b0, b1, · · · , bn that best estimate the log odds.

To calculate the estimated probability, we can use the logistic function:

p̂ = 1 / (1 + e^(−(b0 + b1x1 + · · · + bnxn)))
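The relationships among probability, odds, and the logit are easy to check numerically. A small sketch (the function names are mine, not from the text):

```python
import math

def odds(p):
    """Convert a probability estimate to odds: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log odds: ln(p / (1 - p)); maps (0, 1) onto the whole real line."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

p_hat = 0.8
print(round(odds(p_hat), 4))             # 4.0  (an 80% chance is 4-to-1 odds)
print(round(logistic(logit(p_hat)), 4))  # 0.8  (round trip back to probability)
```

Because the logit is unbounded in both directions, a linear function of the explanatory variables can target it without the fit problem a bounded probability would cause.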
Back to the Oscars Example.
• If we apply logistic regression to the Oscars example, we get:

p̂ = 1 / (1 + e^(−(−6.214 + 0.596x)))

• For example, a movie with five nominations has a 3.8% chance of winning the Oscars:

p̂ = 1 / (1 + e^(−(−6.214 + 0.596(5)))) = 0.038.
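The five-nomination calculation can be reproduced directly from the fitted coefficients on the slide:

```python
import math

# Fitted coefficients from the slide's single-variable Oscars model
b0, b1 = -6.214, 0.596

def win_probability(nominations):
    """Estimated probability of winning, via the logistic function."""
    return 1 / (1 + math.exp(-(b0 + b1 * nominations)))

print(round(win_probability(5), 3))  # 0.038
```

Because b1 is positive, the estimated probability of winning rises with the number of nominations.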
Logistic Regression in Excel
Oscars
Assume we have been given the task to construct a logistic regression model to classify winners of the Best Picture Oscar, using Winner as the output variable and OscarNominations, GoldenGlobeWins, and Comedy as input variables.
Can we use our linear regression model to fit a logistic regression equation for this data?
Logistic Regression in Analytics Solver

We cover the implementation of logistic regression and the above practice questions during the synchronous classes.
k-Nearest Neighbors
• k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to predict a continuous outcome.
• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
• A nearest-neighbor classifier is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.
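A bare-bones version of the classification step can be sketched as follows. The customer data are hypothetical, and in practice the features should be standardized first so that a large-scale variable such as balance does not dominate the Euclidean distance:

```python
import math
from collections import Counter

def knn_classify(new_point, training_set, k):
    """Classify new_point by majority vote among its k nearest neighbors.
    training_set is a list of (features, label) pairs; similarity is
    measured with Euclidean distance, as on the slide."""
    by_distance = sorted(training_set,
                         key=lambda rec: math.dist(new_point, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, balance) -> default (1) or no default (0)
train = [((25, 800), 1), ((30, 950), 1), ((45, 4000), 0),
         ((50, 3500), 0), ((35, 3000), 0)]
print(knn_classify((28, 900), train, k=3))  # 1
```

Note that the prediction can change with k: with k = 5 all five customers vote, and the majority class flips to 0.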
k-Nearest Neighbors Example

Loan Default
Consider the following customer information.
What is the chance that a 28-year-old customer with an average balance of 900 defaults on their loan?
k-Nearest Neighbors Example Continued
k = 1: Classified as a loan default (Class 1) because its nearest neighbor (Observation 2) is in Class 1.
k = 2: The two nearest neighbors are Observation 2 (Class 1) and Observation 7 (Class 0). Because at least half of the k = 2 neighbors are in Class 1, the new observation is classified as Class 1.
k = 3: The three nearest neighbors are Observation 2 (Class 1), Observation 7 (Class 0), and Observation 6 (Class 0). Because only 1/3 of the neighbors are in Class 1, the new observation is classified as Class 0.
k-Nearest Neighbors for Prediction
• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set.
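A minimal sketch of this prediction variant, with hypothetical data:

```python
import math

def knn_predict(new_point, training_set, k):
    """Predict a continuous outcome as the average outcome of the
    k nearest training observations (Euclidean distance)."""
    by_distance = sorted(training_set,
                         key=lambda rec: math.dist(new_point, rec[0]))
    return sum(y for _, y in by_distance[:k]) / k

# Hypothetical data: (age, balance) -> amount owed at default
train = [((25, 800), 1200.0), ((30, 950), 1500.0), ((45, 4000), 300.0)]
print(knn_predict((28, 900), train, k=2))  # average of the two nearest: 1350.0
```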
kNN in Excel
Loan Default
Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test. Apply the k-NN algorithm to answer the following question.
• What is the chance that the following customer defaults on their loan?
• Average Balance: $1500, Age: 25, Employed, Married, and College Student.
kNN in Analytics Solver
We cover the implementation of kNN and the above practice questions during the synchronous classes.