SCORING BANCAIRE (Prédiction bon et mauvais clients pour des emprunts)¶

No description has been provided for this image
Lorsqu'une banque prête de l'argent à une personne, elle prend le risque que cette dernière ne rembourse pas cet argent dans le délai convenu. Ce risque est appelé Risque de Crédit. Alors avant d'octroyer un crédit, les banques vérifient si le client (ou la cliente) qui demandent un prêt sera capable ou pas de le rembourser.¶
Cette vérification se fait grâce à l'analyse de plusieurs paramètres tels que les revenus, les biens, les dépenses actuelles du client, etc. Cette analyse est encore effectuée manuellement par plusieurs banques. Ainsi, elle est très consommatrice en temps et en ressources financières.¶
Grâce au Machine Learning, il est possible d'automatiser cette tâche et de pouvoir prédire avec plus de précision les clients qui seront en défaut de paiement.¶

Technologies Utilisées : Python, Machine Learning, Traitement de Données.¶

Dans ce projet, j’ai :¶

  • Nettoyer et structurer des données brutes pour faciliter leur exploitation.
  • Appliquer des algorithmes d’apprentissage automatique afin de tirer des insights précis et pertinents des données analysées.
  • Visualiser les résultats sous forme de graphiques et rapports exploitables pour aider à la prise de décision.
  • Développer un modèle de Machine Learning pour la prédiction.

CODING TIME ...¶

1- CHARGEMENT DES PACKAGES NECESSAIRES¶

In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder,RobustScaler
from sklearn.pipeline import make_pipeline

2- NETTOYAGE DE LA BASE DE DONNEES¶

ANALYSE DE LA FORME DES DONNEES¶
In [74]:
url='banking.csv'
data= pd.read_csv(url)
In [76]:
data.head()
Out[76]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y
0 44 blue-collar married basic.4y unknown yes no cellular aug thu ... 1 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1 0
1 53 technician married unknown no no no cellular nov fri ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.021 5195.8 0
2 28 management single university.degree no yes no cellular jun thu ... 3 6 2 success -1.7 94.055 -39.8 0.729 4991.6 1
3 39 services married high.school no no no cellular apr fri ... 2 999 0 nonexistent -1.8 93.075 -47.1 1.405 5099.1 0
4 55 retired married basic.4y no yes no cellular aug fri ... 1 3 1 success -2.9 92.201 -31.4 0.869 5076.2 1

5 rows × 21 columns

In [78]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp_var_rate    41188 non-null  float64
 16  cons_price_idx  41188 non-null  float64
 17  cons_conf_idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr_employed     41188 non-null  float64
 20  y               41188 non-null  int64  
dtypes: float64(5), int64(6), object(10)
memory usage: 6.6+ MB
In [80]:
for col in data.select_dtypes('object'):
    status = data[col]
    print(status.value_counts())
job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: count, dtype: int64
marital
married     24928
single      11568
divorced     4612
unknown        80
Name: count, dtype: int64
education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: count, dtype: int64
default
no         32588
unknown     8597
yes            3
Name: count, dtype: int64
housing
yes        21576
no         18622
unknown      990
Name: count, dtype: int64
loan
no         33950
yes         6248
unknown      990
Name: count, dtype: int64
contact
cellular     26144
telephone    15044
Name: count, dtype: int64
month
may    13769
jul     7174
aug     6178
jun     5318
nov     4101
apr     2632
oct      718
sep      570
mar      546
dec      182
Name: count, dtype: int64
day_of_week
thu    8623
mon    8514
wed    8134
tue    8090
fri    7827
Name: count, dtype: int64
poutcome
nonexistent    35563
failure         4252
success         1373
Name: count, dtype: int64
In [82]:
df = data.copy()
In [84]:
df.dtypes.value_counts().plot.pie()
Out[84]:
<Axes: ylabel='count'>
No description has been provided for this image
In [86]:
taux_valeurs_manquantes =(df.isna().sum()/df.shape[0]).sort_values(ascending=True)
taux_valeurs_manquantes
Out[86]:
age               0.0
euribor3m         0.0
cons_conf_idx     0.0
cons_price_idx    0.0
emp_var_rate      0.0
poutcome          0.0
previous          0.0
pdays             0.0
campaign          0.0
nr_employed       0.0
duration          0.0
month             0.0
contact           0.0
loan              0.0
housing           0.0
default           0.0
education         0.0
marital           0.0
job               0.0
day_of_week       0.0
y                 0.0
dtype: float64
IMPUTATION¶
In [88]:
# b. Drop unrepresentative features
#df.drop(columns = ["month", "previous", "day_of_week", "pdays"],inplace = True)
#df.head()
#Numerical features: ["age", "balance", "duration", "campaign"]
#Categorical Features: ["job", "marital", "education", "default", "housing", "loan", "poutcome", "y"]
def imputation(df):
    df = df.dropna(axis=0)
    df.drop(columns = ["month", "previous", "day_of_week", "pdays","emp_var_rate","cons_price_idx","cons_conf_idx",
                       "euribor3m","nr_employed","contact"],inplace = True)
    return  df
SUPPRESSION DES INCOHERENCES¶
In [90]:
def deleteInconherenteVal(df):
    for col in df.select_dtypes('object'):
        df[col].replace(["unknown"],df[col].mode(),inplace = True)
    return df
ENCODAGE¶
In [92]:
def encodage(df):
    for col in df.select_dtypes('object'):
        encoder = LabelEncoder()
        encode = encoder.fit_transform(df[col].unique())
        cl=df[col].unique()
        code = {a:b for a,b in zip(cl,encode)}
        df[col] = df[col].map(code)
        #print(encode)
        #print(cl)
        #print(code)
    return df
DEFINITION DE LA FONCTION DE NETTOYAGE DE LA BASE DE DONNEES¶
In [94]:
def apurement(df):
    df=imputation(df)
    df=deleteInconherenteVal(df)
    df=encodage(df)
    return df
EXECUTION DE LA FONCTION DE NETTOYAGE DE LA BASE DE DONNEES¶
In [96]:
df_apurer=apurement(df)
df_apurer.head()
C:\Users\lucre\AppData\Local\Temp\ipykernel_22932\4192318323.py:3: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].replace(["unknown"],df[col].mode(),inplace = True)
Out[96]:
age job marital education default housing loan duration campaign poutcome y
0 44 1 1 0 0 1 0 210 1 1 0
1 53 9 1 6 0 0 0 138 1 1 0
2 28 4 2 6 0 1 0 339 3 2 1
3 39 7 1 3 0 0 0 185 2 1 0
4 55 5 1 0 0 1 0 137 1 2 1
MATRICE DE CORRELATION DES VARIABLES¶
In [98]:
plt.figure(figsize=(20,10))
sns.heatmap(df_apurer.corr(),annot = True)
Out[98]:
<Axes: >
No description has been provided for this image

3- Analyses Descriptives (Duration,âge,job,default)¶

Duration Feature¶
In [115]:
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True) # gridspec_kw= {"height_ratios": (0.3, 1)}
sns.set(rc={'figure.figsize':(11,8)}, font_scale=1.5, style='whitegrid')
mean=df_apurer['duration'].mean()
median=df_apurer['duration'].median()
mode=df_apurer['duration'].mode().values[0]

duration = sns.boxplot(data=df_apurer, x="duration", y="y", ax=ax_box, order = df_apurer["y"].value_counts().index)
duration.set(xscale="log")
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

sns.histplot(data=df_apurer, x="duration", ax=ax_hist, kde=True)
ax_hist.axvline(mean, color='r', linestyle='--', label="Mean")
ax_hist.axvline(median, color='g', linestyle='-', label="Median")
ax_hist.axvline(mode, color='b', linestyle='-', label="Mode")
ax_hist.legend()
ax_box.set(xlabel='')
plt.show()
No description has been provided for this image
Age Feature¶
In [203]:
mean=df_apurer['age'].mean()
median=df_apurer['age'].median()
mode=df_apurer['age'].mode().values[0]

sns.boxplot(data=df_apurer, x="age", y="y")
sns.histplot(data=df_apurer, x="age",kde=True)

plt.show()
No description has been provided for this image
Job feature¶
In [146]:
sns.countplot(x="job", data = df_apurer, hue = "y", order = df_apurer["job"].value_counts().index)
plt.title("Analyse de la relation entre 'job' et 'y'")
plt.show()
No description has been provided for this image
Default feature¶
In [148]:
default = sns.countplot(x="default", data = df_apurer, hue = "y", order = df_apurer["default"].value_counts().index)
plt.title("Analyse bivariée entre 'default' et 'y'")
plt.show()
No description has been provided for this image

4- Modélisation¶

In [150]:
trainset, testset = train_test_split(df_apurer, test_size=0.2, random_state=0)
In [152]:
trainset['y'].value_counts()
Out[152]:
y
0    29223
1     3727
Name: count, dtype: int64
In [154]:
testset['y'].value_counts()
Out[154]:
y
0    7325
1     913
Name: count, dtype: int64
In [156]:
def preprocessing(df):
    
    df = encodage(df)
    df = deleteInconherenteVal(df)
    #df = imputation(df)
    
    X = df.drop('y', axis=1)
    y = df['y']
    
    print(y.value_counts())
    
    return X, y
In [158]:
X_train, y_train = preprocessing(df_apurer)
y
0    36548
1     4640
Name: count, dtype: int64
In [160]:
X_test, y_test = preprocessing(testset)
y
0    7325
1     913
Name: count, dtype: int64
In [162]:
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)
Out[162]:
0.8983976693372178
In [164]:
def evaluation(model):
    
    model.fit(X_train, y_train)
    ypred = model.predict(X_test)
    
    print(confusion_matrix(y_test, ypred))
    print(classification_report(y_test, ypred))

5- EVALUATION DU MODELE¶

In [166]:
evaluation(model)
[[7217  108]
 [ 729  184]]
              precision    recall  f1-score   support

           0       0.91      0.99      0.95      7325
           1       0.63      0.20      0.31       913

    accuracy                           0.90      8238
   macro avg       0.77      0.59      0.63      8238
weighted avg       0.88      0.90      0.87      8238