[Machine Learning] Logistic Regression
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
hr_df = pd.read_csv('/content/drive/MyDrive/8. 머신러닝 딥러닝/hr.csv')  # load the HR dataset
hr_df.head()
✔️ Result (table omitted)
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 employee_id 54808 non-null int64
1 department 54808 non-null object
2 region 54808 non-null object
3 education 52399 non-null object
4 gender 54808 non-null object
5 recruitment_channel 54808 non-null object
6 no_of_trainings 54808 non-null int64
7 age 54808 non-null int64
8 previous_year_rating 50684 non-null float64
9 length_of_service 54808 non-null int64
10 awards_won? 54808 non-null int64
11 avg_training_score 54808 non-null int64
12 is_promoted 54808 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB
- employee_id: arbitrary employee ID
- department: department
- region: region
- education: education level
- gender: gender
- recruitment_channel: recruitment channel
- no_of_trainings: number of trainings attended
- age: age
- previous_year_rating: previous year's performance rating
- length_of_service: years of service
- awards_won?: whether the employee has won an award
- avg_training_score: average training score
- is_promoted: whether the employee was promoted (target)
hr_df.describe()
✔️ Result (summary statistics table omitted)
sns.barplot(x='previous_year_rating', y='is_promoted', data=hr_df)
✔️ Result (bar plot omitted)
sns.lineplot(x='previous_year_rating', y='is_promoted', data=hr_df)
✔️ Result (line plot omitted)
sns.lineplot(x='avg_training_score', y='is_promoted', data=hr_df)
✔️ Result (line plot omitted)
sns.barplot(x='recruitment_channel', y='is_promoted', data=hr_df)
✔️ Result (bar plot omitted)
hr_df['recruitment_channel'].value_counts()
✔️ Result
other 30446
sourcing 23220
referred 1142
Name: recruitment_channel, dtype: int64
sns.barplot(x='gender', y='is_promoted', data=hr_df)
✔️ Result (bar plot omitted)
hr_df['gender'].value_counts()
✔️ Result
m 38496
f 16312
Name: gender, dtype: int64
sns.barplot(x='department', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)  # rotate labels when long names would overlap
✔️ Result (bar plot omitted)
hr_df['department'].value_counts()
Sales & Marketing 16840
Operations 11348
Technology 7138
Procurement 7138
Analytics 5352
Finance 2536
HR 2418
Legal 1039
R&D 999
Name: department, dtype: int64
plt.figure(figsize=(14, 10))
sns.barplot(x='region', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)
✔️ Result (bar plot omitted)
hr_df.isna().mean()
employee_id 0.000000
department 0.000000
region 0.000000
education 0.043953
gender 0.000000
recruitment_channel 0.000000
no_of_trainings 0.000000
age 0.000000
previous_year_rating 0.075244
length_of_service 0.000000
awards_won? 0.000000
avg_training_score 0.000000
is_promoted 0.000000
dtype: float64
hr_df['education'].value_counts()
Bachelor's 36669
Master's & above 14925
Below Secondary 805
Name: education, dtype: int64
hr_df['previous_year_rating'].value_counts()
3.0 18618
5.0 11741
4.0 9877
1.0 6223
2.0 4225
Name: previous_year_rating, dtype: int64
hr_df = hr_df.dropna()
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48660 entries, 0 to 54807
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 employee_id 48660 non-null int64
1 department 48660 non-null object
2 region 48660 non-null object
3 education 48660 non-null object
4 gender 48660 non-null object
5 recruitment_channel 48660 non-null object
6 no_of_trainings 48660 non-null int64
7 age 48660 non-null int64
8 previous_year_rating 48660 non-null float64
9 length_of_service 48660 non-null int64
10 awards_won? 48660 non-null int64
11 avg_training_score 48660 non-null int64
12 is_promoted 48660 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.2+ MB
for i in ['department', 'region', 'education', 'gender', 'recruitment_channel']:
    print(i, hr_df[i].nunique())
department 9
region 34
education 3
gender 2
recruitment_channel 3
hr_df = pd.get_dummies(hr_df, columns=['department', 'region', 'education', 'gender', 'recruitment_channel'])
hr_df.head(3)
✔️ Result (table omitted)
pd.set_option('display.max_columns', 60)  # raise the display limit so all dummy columns are shown
hr_df.head(3)
✔️ Result (table omitted)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(hr_df.drop('is_promoted', axis=1), hr_df['is_promoted'], test_size=0.2, random_state=10)
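Since only about 8.7% of the remaining employees are promoted, a stratified split is worth considering; a minimal variant sketch (the stratify argument is a standard train_test_split option, not used in the original post, and would slightly change the numbers below):

# Variant: keep the promoted/non-promoted ratio identical in train and test
x_train, x_test, y_train, y_test = train_test_split(
    hr_df.drop('is_promoted', axis=1), hr_df['is_promoted'],
    test_size=0.2, random_state=10, stratify=hr_df['is_promoted'])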
2. Logistic Regression
- A representative algorithm for problems that decide between one of two outcomes (binary classification)
- Documentation
- For three or more classes, classification uses the OvR (One-vs-Rest) or OvO (One-vs-One) strategy;
  OvR is usually preferred, with OvO used when the data is heavily skewed toward one side (see the sketch below)
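As a minimal sketch of the two strategies on a toy 3-class dataset (illustrative only, not part of the original post):

# OvR fits one binary classifier per class; OvO fits one per pair of classes
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)
print(ovr.predict(X_iris[:3]), ovo.predict(X_iris[:3]))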
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(x_train, y_train)
✔️ Result
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression()
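The warning itself points to the two standard remedies: raise max_iter or scale the features. A minimal sketch of both (not in the original post; the evaluation below keeps the original unscaled lr):

# Scale features and allow more iterations so lbfgs can converge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scaled.fit(x_train, y_train)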
pred = lr.predict(x_test)
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test,pred)
0.9114262227702425
hr_df['is_promoted'].value_counts()
0 44428
1 4232
Name: is_promoted, dtype: int64
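Note the class imbalance: with 44428 of 48660 employees not promoted, a model that always predicts 0 would already reach about 0.911 accuracy on this test set (8869 of 9732 samples), so the 0.9114 above is barely better than the trivial baseline. This is exactly why the confusion matrix below matters.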
3. Confusion Matrix
- The basis for evaluation metrics built on precision and recall (sensitivity)
TN(8869) FP(0)
FN(862)  TP(1)
- TN: not promoted, and predicted as not promoted
- FP: not promoted, but predicted as promoted
- FN: promoted, but predicted as not promoted
- TP: promoted, and predicted as promoted
confusion_matrix(y_test,pred)
array([[8869, 0],
[ 862, 1]])
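For binary problems, sklearn lays the matrix out as [[TN, FP], [FN, TP]], so the four counts can be unpacked directly:

# Unpack the four cells of the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(tn, fp, fn, tp)  # 8869 0 862 1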
sns.heatmap(confusion_matrix(y_test, pred), annot=True, cmap='Blues')
✔️ Result (heatmap omitted)
3-1. Precision
- TP / (TP + FP)
- Of the samples predicted as positive (1), the fraction that are actually positive
3-2. Recall
- TP / (TP + FN)
- The proportion of actual positive samples that were correctly detected
- Of the samples that are actually 1, how many did the model get right?
- Also called sensitivity or TPR (True Positive Rate)
3-3. F1 Score
- The harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{FN + FP}{2}}$$
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_test, pred)
1.0
recall_score(y_test,pred)
0.0011587485515643105
f1_score(y_test,pred)
0.0023148148148148147
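Plugging the confusion-matrix counts above into the formulas reproduces these scores exactly:

# TP=1, FP=0, FN=862 from the confusion matrix above
precision = 1 / (1 + 0)                              # 1.0
recall = 1 / (1 + 862)                               # ≈ 0.0011587
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.0023148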
lr.coef_  # slopes (coefficients) for the 58 feature columns
array([[-5.42682567e-06, -2.11566320e-01, -1.24739314e-01,
4.04217840e-01, 8.39462548e-02, 1.19382822e-01,
1.24469097e-02, -4.53116409e-02, -1.59556720e-02,
-1.69079211e-02, -1.06814883e-02, 3.10169499e-02,
3.47912379e-03, -1.69516987e-02, -1.37996914e-02,
-1.51941604e-02, -2.88706914e-03, -2.41993427e-03,
-1.84270273e-02, -6.87835341e-03, -2.43612077e-03,
-5.00401302e-03, -7.32654108e-03, -7.67662052e-03,
7.71412885e-03, -3.21353065e-04, -6.68708228e-03,
2.55409975e-02, -1.02529819e-02, -7.25376472e-03,
9.23097034e-03, 9.63606751e-03, -9.39043583e-03,
5.16728257e-03, -2.15726175e-02, -1.16101632e-02,
9.39154969e-03, -1.82813463e-02, 1.50261133e-03,
-1.97860883e-03, -2.28665036e-02, -1.35441186e-02,
-5.26900930e-03, -6.03421587e-03, 3.29129457e-02,
-1.48312957e-02, -1.18516648e-02, 2.54568502e-02,
-2.39518611e-03, -9.66357577e-03, -2.00440936e-01,
-1.40474208e-02, 1.14182158e-01, -1.53689321e-02,
-8.49372671e-02, -6.62000218e-02, 4.04855793e-03,
-3.81547353e-02]])
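The raw array is hard to read; pairing each coefficient with its column name (a small helper not in the original post) makes the largest effects visible:

# Pair coefficients with feature names and sort by value
coef_sr = pd.Series(lr.coef_[0], index=x_train.columns)
print(coef_sr.sort_values())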
x_test
✔️ Result (table omitted)
# Two independent variables, one dependent variable
tempx = hr_df[['age', 'length_of_service']]
tempy = hr_df['is_promoted']
temp_lr = LogisticRegression()
temp_lr.fit(tempx, tempy)

temp_df = pd.DataFrame({'age': [20, 27, 30], 'length_of_service': [1, 3, 6]})
temp_df

pred = temp_lr.predict(temp_df)
pred
✔️ Result
array([0, 0, 0])
temp_lr.coef_
✔️ Result
array([[-0.01074458, -0.00053409]])
temp_lr.intercept_
✔️ Result
array([-1.96818509])
proba = temp_lr.predict_proba(temp_df)
proba
✔️ Result
array([[0.89876806, 0.10123194],
[0.9055003 , 0.0944997 ],
[0.90835617, 0.09164383]])
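These probabilities can be reproduced by hand from coef_ and intercept_ via the sigmoid function:

# P(is_promoted=1) = sigmoid(coef · x + intercept); first sample: age=20, service=1
z = temp_lr.coef_[0] @ np.array([20, 1]) + temp_lr.intercept_[0]
p = 1 / (1 + np.exp(-z))  # ≈ 0.1012, matching the second column of predict_proba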
4. Cross Validation
- A technique for addressing the problem that performance depends on how train_test_split happens to shuffle and split the data
- k-fold cross validation is the most commonly used form
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf
✔️ Result
KFold(n_splits=5, random_state=None, shuffle=False)
for train_index, test_index in kf.split(range(len(hr_df))):
    print(train_index, test_index)
    print(len(train_index), len(test_index))
✔️ Result
[ 2 3 4 ... 48656 48657 48659] [ 0 1 5 ... 48652 48653 48658]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [ 18 23 29 ... 48639 48641 48645]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [ 12 15 17 ... 48647 48650 48654]
38928 9732
[ 0 1 2 ... 48654 48656 48658] [ 3 24 31 ... 48655 48657 48659]
38928 9732
[ 0 1 3 ... 48657 48658 48659] [ 2 4 6 ... 48640 48644 48656]
38928 9732
kf = KFold(n_splits=5,random_state=10,shuffle=True)
kf
✔️ Result
KFold(n_splits=5, random_state=10, shuffle=True)
X = hr_df.drop('is_promoted', axis=1)
y = hr_df['is_promoted']

acc_list = []
for train_index, test_index in kf.split(range(len(hr_df))):
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    acc_list.append(accuracy_score(y_test, pred))
✔️ Result
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
(the same ConvergenceWarning is emitted for each fold)
acc_list
✔️ Result
[0.9114262227702425,
0.9094739005343198,
0.9173859432799013,
0.914406083025072,
0.9125565145910398]
np.array(acc_list).mean()
✔️ Result
0.913049732840115
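The manual fold loop above can be compressed into a single call; a sketch using sklearn's cross_val_score with the same kf object (not shown in the original post):

from sklearn.model_selection import cross_val_score

# 5-fold accuracy in one call, using the shuffled KFold defined above
scores = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring='accuracy')
print(scores.mean())  # ≈ 0.913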
Cross validation is used not to get a better score, but to obtain a trustworthy evaluation.