Advanced Business Analytics
Data Mining: Classification
Advanced Business Analytics – Majid Karimi
Data Mining Revisited

Data Mining
The process of discovering patterns in large data sets for prediction and classification.

Classification
The process of determining the future values of a qualitative variable(s).

Approaches for predicting a qualitative variable(s):
• Logistic Regression
• k-Nearest Neighbors (kNN)
• Artificial Neural Networks (ANN)
• …

Data Mining: Classification (©2019 Cengage) 2 – 23
Before we begin: Data Sampling, Preparation, and Partitioning
• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.
• A sample is representative if the analyst can make the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data mining algorithms typically are more effective given more data.
Data Sampling, Preparation, and Partitioning: Continued
• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
• Data mining applications deal with an abundance of data that simplifies the process of assessing the accuracy of data-based estimates of variable effects.
Overfitting
• Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
• We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
  • The training set.
  • The validation set.
  • The test set.
Data Partitioning
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which the promising subset of models is applied to identify which model is the most accurate at predicting observations that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
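Outside of Excel, the three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's Analytics Solver workflow; the 50/30/20 fractions and the seed are arbitrary choices:

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=42):
    """Randomly split records into training, validation, and test sets."""
    shuffled = records[:]                 # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # the remaining ~20%
    return train, valid, test

train, valid, test = partition(list(range(1000)))
print(len(train), len(valid), len(test))  # 500 300 200
```

Shuffling before slicing is what makes each partition (approximately) representative of the whole data set.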
Data Partitioning Visualized
Data Partitioning Continued
• There are no definite rules for the size of the three partitions, but the training set is typically the largest.
• For estimation tasks, a rule of thumb is to have at least 10 times as many observations as variables.
• For classification tasks, a rule of thumb is to have at least 6 × m × q observations, where m is the number of outcome categories and q is the number of variables.
Data Partitioning with Oversampling

• When we are interested in predicting a rare event, such as a click-through on an advertisement posted on a web site or a fraudulent credit card transaction, it is recommended that the training set oversample the number of observations corresponding to the rare events to provide the data-mining algorithm sufficient data to "learn" about the rare events.
Clicks
If only one out of every 10,000 users clicks on an advertisement posted on a web site, we would not have sufficient information to distinguish between users who do not click through and those who do if we constructed a representative training set consisting of one observation corresponding to a click-through and 9,999 observations with no click-through. In these cases, the training set should contain equal or nearly equal numbers of observations corresponding to the different values of the outcome variable.
Data Partitioning with Oversampling
Note that we do not oversample the validation and test sets; these samples should be representative of the overall population so that accuracy measures evaluated on these data sets appropriately reflect the future performance of the data-mining model.
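As a rough sketch of the idea, the training set can be balanced by resampling the rare class with replacement, while the validation and test sets are left untouched. The data and helper names below are made up for illustration:

```python
import random

def oversample_training(train, label_of, seed=7):
    """Balance a training set by resampling the rare class with replacement.
    Only the training set is balanced; validation and test sets must stay
    representative of the overall population."""
    rng = random.Random(seed)
    positives = [r for r in train if label_of(r) == 1]
    negatives = [r for r in train if label_of(r) == 0]
    rare, common = ((positives, negatives) if len(positives) < len(negatives)
                    else (negatives, positives))
    # draw from the rare class with replacement until it matches the common class
    resampled = [rng.choice(rare) for _ in range(len(common))]
    return common + resampled

# Hypothetical training data: 5 rare events (label 1) out of 100 records
train = [(i, 1 if i < 5 else 0) for i in range(100)]
balanced = oversample_training(train, label_of=lambda r: r[1])
counts = (sum(1 for r in balanced if r[1] == 1),
          sum(1 for r in balanced if r[1] == 0))
print(counts)  # (95, 95)
```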
Data Partitioning in Excel
Credit Scores
• Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test with a 50%, 30%, 20% split, respectively.
• What is the chance of a customer defaulting on their loan?
• What can you conclude with regard to the partitioning of the data?
Data Partitioning in Excel Using Analytics Solver
Partitioning with Over-sampling
We cover the implementation of partitioning with oversampling during the synchronous classes.
Who is going to win the Oscars?
Using Oscar nominations to predict Oscar winners.
Download the OscardDemo file from the Classification module on Cougar Courses, and fit a regression equation to predict Winning Oscars using the independent variable of Oscars Nominations.
Who is going to win the Oscars? (continued)
• Does this make sense?
• Why can't we apply linear regression to classify a categorical variable?
• We should be estimating the "probability" of winning the Oscars.
Logistic Regression: The Idea
• Logistic regression attempts to classify a binary categorical outcome as a linear function of explanatory variables.
• A linear regression model fails to appropriately explain a categorical outcome variable.
• Odds is a measure related to probability.
• If an estimate of the probability of an event is p̂, then the equivalent odds measure is p̂ / (1 − p̂).
• The odds metric ranges between zero and positive infinity.
• We eliminate the fit problem by using the logit, ln( p̂ / (1 − p̂) ), which ranges over the entire real line.
Logistic Regression: The Procedure
Logistic Regression Model:

ln( p̂ / (1 − p̂) ) = b0 + b1x1 + · · · + bnxn

Given a set of explanatory variables, a logistic regression algorithm determines values of b0, b1, · · · , bn that best estimate the log odds.

To calculate the estimated probability, we can use the logistic function:

p̂ = 1 / (1 + e^(−(b0 + b1x1 + · · · + bnxn)))
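The relationships among probability, odds, and the logit are easy to check numerically. A small sketch (the function names are mine, not from the text):

```python
import math

def odds(p):
    """Convert a probability estimate to odds: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log odds: ln(p / (1 - p)); maps (0, 1) onto the whole real line."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

p_hat = 0.8
print(round(odds(p_hat), 4))             # 4.0  (an 80% chance is 4-to-1 odds)
print(round(logistic(logit(p_hat)), 4))  # 0.8  (round trip back to probability)
```

Because the logit is unbounded in both directions, a linear function of the explanatory variables can target it without the fit problem a bounded probability would cause.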
Back to the Oscars Example.
• If we apply logistic regression to the Oscars example, we get:

p̂ = 1 / (1 + e^(−(−6.214 + 0.596x)))

• For example, a movie with five nominations has a 3.8% chance of winning the Oscars:

p̂ = 1 / (1 + e^(−(−6.214 + 0.596(5)))) = 0.038.
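The five-nomination calculation can be reproduced directly from the fitted coefficients on the slide:

```python
import math

# Fitted coefficients from the slide's single-variable Oscars model
b0, b1 = -6.214, 0.596

def win_probability(nominations):
    """Estimated probability of winning, via the logistic function."""
    return 1 / (1 + math.exp(-(b0 + b1 * nominations)))

print(round(win_probability(5), 3))  # 0.038
```

Because b1 is positive, the estimated probability of winning rises with the number of nominations.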
Logistic Regression in Excel
Oscars
Assume we have been given the task to construct a logistic regression model to classify winners of the Best Picture Oscar, using Winner as the output variable and OscarNominations, GoldenGlobeWins, and Comedy as input variables.
Can we use our linear regression model to fit a logistic regression equation for this data?
Logistic Regression in Analytics Solver

We cover the implementation of logistic regression and the above practice questions during the synchronous classes.
k-Nearest Neighbors
• k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to predict a continuous outcome.
• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
• A nearest-neighbor classifier is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.
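A bare-bones version of the classification step can be sketched as follows. The customer data are hypothetical, and in practice the features should be standardized first so that a large-scale variable such as balance does not dominate the Euclidean distance:

```python
import math
from collections import Counter

def knn_classify(new_point, training_set, k):
    """Classify new_point by majority vote among its k nearest neighbors.
    training_set is a list of (features, label) pairs; similarity is
    measured with Euclidean distance, as on the slide."""
    by_distance = sorted(training_set,
                         key=lambda rec: math.dist(new_point, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, balance) -> default (1) or no default (0)
train = [((25, 800), 1), ((30, 950), 1), ((45, 4000), 0),
         ((50, 3500), 0), ((35, 3000), 0)]
print(knn_classify((28, 900), train, k=3))  # 1
```

Note that the prediction can change with k: with k = 5 all five customers vote, and the majority class flips to 0.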
k-Nearest Neighbors Example

Loan Default
Consider the following customer information.
What is the chance that a 28-year-old customer with an average balance of 900 defaults on their loan?
k-Nearest Neighbors Example Continued
k = 1: Classified as a loan default (Class 1) because its nearest neighbor (Observation 2) is in Class 1.
k = 2: The two nearest neighbors are Observation 2 (Class 1) and Observation 7 (Class 0). Because at least half of the k = 2 neighbors are in Class 1, the new observation is classified as Class 1.
k = 3: The three nearest neighbors are Observation 2 (Class 1), Observation 7 (Class 0), and Observation 6 (Class 0). Because only 1/3 of the neighbors are in Class 1, the new observation is classified as Class 0.
k-Nearest Neighbors for Prediction
• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set.
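A minimal sketch of this prediction variant, with hypothetical data:

```python
import math

def knn_predict(new_point, training_set, k):
    """Predict a continuous outcome as the average outcome of the
    k nearest training observations (Euclidean distance)."""
    by_distance = sorted(training_set,
                         key=lambda rec: math.dist(new_point, rec[0]))
    return sum(y for _, y in by_distance[:k]) / k

# Hypothetical data: (age, balance) -> amount owed at default
train = [((25, 800), 1200.0), ((30, 950), 1500.0), ((45, 4000), 300.0)]
print(knn_predict((28, 900), train, k=2))  # average of the two nearest: 1350.0
```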
kNN in Excel
Loan Default
Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test. Apply the k-NN algorithm to answer the following question.
• What is the chance that the following customer defaults on their loan?
• Average Balance: $1500, Age: 25, Employed, Married, and College Student.
kNN in Analytics Solver
We cover the implementation of kNN and the above practice questions during the synchronous classes.