Build a predictive model using Python

Introduction

I came across this strategic principle of Sun Tzu recently:

(Image: 6 strategic principles by Sun Tzu)

What does this have to do with a data science blog? This is the essence of how competitions and hackathons are won. You arrive at the competition better prepared than your competitors, you execute quickly, and you learn and iterate to bring out your best.

Last week, we published “Perfect way to build a predictive model in less than 10 minutes using R". Anyone could guess a quick follow-up to that post. Given the rise of Python in recent years and its simplicity, it makes sense to have this toolkit ready for Pythonistas in the data science world. I will follow a structure similar to that of the previous post, with my additional contributions at different stages of model building. These two posts will help you build your first predictive model faster and with more power. Most of the best data scientists and Kagglers build their first effective model quickly and submit it. This not only gives them a head start on the leaderboard, but also provides them with a benchmark solution to improve upon.


Breakdown of the predictive modeling procedure

I always focus on investing quality time during the initial stage of model building, such as hypothesis generation, brainstorming sessions, discussions or domain understanding. All of these activities help me relate to the problem, which eventually leads me to design more powerful business solutions. There are good reasons why you should spend this time up front:

  1. You have enough time to invest and you are fresh (it has an impact)
  2. You are not yet biased by other data points or thoughts (I always suggest you generate hypotheses before digging into the data)
  3. At a later stage, you would be in a rush to complete the project and would not be able to dedicate quality time.

This stage will need quality time, so I am not mentioning the timeline here. I would recommend that you make it a standard practice. It will help you build better predictive models and result in fewer iterations of work at later stages. Let's look at the remaining stages of the first model build, with timelines:

  1. Descriptive analysis of the data: 50% of the time
  2. Data treatment (missing value and outlier correction): 40% of the time
  3. Data modeling: 4% of the time
  4. Performance estimation: 6% of the time

P.S.: this is the split of the time dedicated only to the first model build

Let's go through the procedure step by step (with estimates of the time spent in each step):

Stage 1: Descriptive analysis / Data exploration:

In my early days as a data scientist, data exploration used to take a lot of time. Over time, I have automated many of these data operations. Since data preparation takes up 50% of the work of building a first model, the benefits of automation are obvious. You can check the “7 data exploration steps” to see the most common data exploration operations.

Tavish has already mentioned in his post that with advanced machine learning tools on the rise, the time required to perform this task has been significantly reduced. Since this is our first benchmark model, we skip any kind of feature engineering. Therefore, the time needed for descriptive analysis is restricted to finding missing values and the big features that are directly visible. In my methodology, you will need 2 minutes to complete this step (assumption: 100,000 observations in the data set).

The operations I perform for my first model include (a quick sketch follows this list):

  1. Identify ID, input and target features
  2. Identify categorical and numerical features
  3. Identify columns with missing values
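
As a quick sketch of these checks (using pandas on a toy data set, with hypothetical column names such as ID and target, not the data used later in this post):

import numpy as np
import pandas as pd

# Toy data standing in for a real training set
df = pd.DataFrame({'ID': [1, 2, 3],
                   'age': [25, np.nan, 40],
                   'gender': ['M', 'F', np.nan],
                   'target': [0, 1, 0]})

ID_col = ['ID']              # identifier feature
target_col = ['target']      # target feature

# Categorical vs. numerical features, split by dtype
cat_cols = [c for c in df.select_dtypes(include=['object']).columns if c not in ID_col + target_col]
num_cols = [c for c in df.select_dtypes(include=['number']).columns if c not in ID_col + target_col]

# Columns with at least one missing value
missing_cols = df.columns[df.isnull().any()].tolist()
print(cat_cols, num_cols, missing_cols)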

Stage 2: Data treatment (treatment of missing values):

There are several ways to deal with missing values. For our first model, we will focus on smart and fast techniques to build your first effective model (these were already discussed by Tavish in his post; I am adding a few methods). A short sketch follows the list below.

  • Create dummy indicators for missing values: it works; sometimes the missing values themselves carry a good amount of information.
  • Impute missing values with the mean / median / any other simple method: mean and median imputation work fine, and most people prefer to impute with the mean value, but in case of a skewed distribution I suggest you go with the median. Other smart methods are imputing values from similar cases, median imputation using other relevant features, or building a model. For example, in the Titanic survival challenge, you can impute the missing Age values using the salutation in the passenger's name (“Mr.”, “Miss”, “Mrs.”, “Master” and others), and this has shown a good impact on model performance.
  • Impute the missing values of categorical variables: create a new level so that all missing values are encoded as a single value, say “New_Cat”, or you can look at the frequency mix and impute the missing value with the most frequent value.
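
Here is a minimal sketch of these three techniques with pandas, using a toy data set with hypothetical Age and Occupation columns:

import numpy as np
import pandas as pd

# Toy data with missing values
df = pd.DataFrame({'Age': [22, np.nan, 35, np.nan],
                   'Occupation': ['clerk', np.nan, 'nurse', 'driver']})

# 1. Dummy indicator flagging where a value was missing
df['Age_NA'] = df['Age'].isnull().astype(int)

# 2. Mean / median imputation for a numerical feature (median is safer for skewed data)
df['Age'] = df['Age'].fillna(df['Age'].median())

# 3. New level for a missing categorical value
df['Occupation'] = df['Occupation'].fillna('New_Cat')
print(df)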

With such simple data treatment methods, you can reduce the data treatment time to 3-4 minutes.

Stage 3. Data modeling:

I recommend using either of the GBM / Random Forest techniques, depending on the business problem. These two techniques are extremely effective for creating a benchmark solution. I have seen data scientists often use these two methods as their first model and, in some cases, as the final model as well. This will take the maximum amount of time (~4-5 minutes).
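
As a rough sketch (not the exact code used later in this post), fitting baseline Random Forest and GBM models with scikit-learn on toy data could look like this:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Toy feature matrix and binary target standing in for the treated data set
rng = np.random.RandomState(100)
X = rng.rand(200, 5)
y = rng.randint(0, 2, 200)

# Baseline models with near-default settings; tuning can come later
rf = RandomForestClassifier(n_estimators=500, random_state=100).fit(X, y)
gbm = GradientBoostingClassifier(n_estimators=200, random_state=100).fit(X, y)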

Stage 4. Performance estimation:

There are several methods to validate the performance of your model. I suggest you split your train data set into Train and Validate (ideally 70:30) and build a model on the 70% portion of the train data set. Now, validate it on the 30% hold-out data set and evaluate performance using your evaluation metric. All in all, this takes 1-2 minutes to execute and document.
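
For illustration, a 70:30 split followed by an AUC check with scikit-learn might look like this sketch on toy data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for your training set
rng = np.random.RandomState(100)
X = rng.rand(1000, 5)
y = rng.randint(0, 2, 1000)

# 70:30 split of the train data into Train and Validate
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=100)

model = RandomForestClassifier(n_estimators=500, random_state=100).fit(X_train, y_train)
val_probs = model.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, val_probs))   # AUC as the evaluation metric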

The intention of this post is not to win the competition, but to establish a benchmark for ourselves. Let's look at the Python code to perform the above steps and build your first model with the highest impact.

Let's start putting this into action

I assume you have done all the hypothesis generation first and are comfortable with basic data science using Python. I'm illustrating this with an example of a data science challenge. Let's look at the structure:

Step 1: Import the required libraries and read the test and training data sets. Append both.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:/Users/DataPeaker/Desktop/challenge/Train.csv')
test=pd.read_csv('C:/Users/DataPeaker/Desktop/challenge/Test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set

Step 2: Step 2 of the framework is not required in Python. We move on to the next step.

Step 3: View column names / summary of the data set

fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function


Step 4: Identify the a) ID variables b) Target variables c) Categorical variables d) Numerical variables e) Other variables

ID_col = ['REF_NO']
target_col = ["Account.Status"]
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns))-set(cat_cols)-set(ID_col)-set(target_col)) #Remaining columns are treated as numerical
other_col=['Type'] #Test and Train Data set identifier

Step 5: Identify the variables with missing values and create a flag for those

fullData.isnull().any() #Returns True for every feature that has missing values, else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables

#Create a new variable for each variable having missing value with VariableName_NA 
# and flag missing value with 1 and other with 0

for var in num_cat_cols:
    if fullData[var].isnull().any()==True:
        fullData[var+'_NA']=fullData[var].isnull()*1

Step 6: Impute missing values

#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)

Step 7: Create label encoders for categorical variables and split the data set into train and test; further split the train data set into Train and Validate

#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

#The target variable is also categorical, so convert it as well
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))

train=fullData[fullData['Type']=='Train'].copy() #.copy() so new columns can be added without a SettingWithCopyWarning
test=fullData[fullData['Type']=='Test'].copy()

train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]

Step 8: Pass the imputed and dummy (missing value indicator) variables to the modeling procedure. I'm using a random forest to predict the class.

features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
np.random.seed(100) #seed NumPy's RNG, which scikit-learn draws from, for reproducibility
rf = RandomForestClassifier(n_estimators = 1000)
rf.fit(x_train, y_train)

Step 9: Check performance and make predictions

from sklearn.metrics import roc_curve, auc #metrics needed for the AUC check

status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)

final_status = rf.predict_proba(x_test)
test["Account.Status"]=final_status[:,1]
test.to_csv('C:/Users/DataPeaker/Desktop/model_output.csv',columns=['REF_NO','Account.Status'])

And submit!

Final notes

Hopefully, this post will enable you to start creating your own scoring code in 10 minutes. Most Kaggle Masters and the top data scientists in our hackathons have this code ready and fire off their first submission before doing a detailed analysis. Once they have an estimate of the benchmark, they start to improvise further. Share your full code in the comment box below.

Has this post been useful to you? Share your opinions / thoughts in the comment section below.

If you liked what you just read and want to continue learning about analytics, subscribe to our emails, follow us on Twitter, or like our Facebook page.
