[Financial Risk Control Series] [3] Loan Default Detection

P粉084495128

Published: 2025-07-22 11:59:21 | 316 views | Source: php中文网 | Original

This article works through the Kaggle Home Credit Default Risk competition, building a model of clients' repayment ability from the application table and six related tables. After data cleaning and feature engineering, derived features that fuse information from all tables are fed to a LightGBM model, which reaches an online score of 0.78277, offering a reference point for assessing loans to people with little or no credit history.

Home Credit Default Risk

This competition comes from Kaggle and is used here for learning purposes only.


Because their credit history is thin or nonexistent, many people are classified as low-credit borrowers and struggle to obtain loans. To serve this population, Home Credit uses alternative data, including telecom and transaction records, to predict clients' repayment ability.

Home Credit provides 7 tables with 218 fields in total. The training set has about 310,000 samples (8% overdue) and the test set about 50,000.
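As a quick orientation, the seven tables arrive as eight CSV files (the application table is split into train/test). A minimal sketch of the file inventory, using the filenames as published on Kaggle:

```python
# The eight CSV files behind the seven Home Credit tables (Kaggle filenames).
TABLES = {
    "application_train.csv": "main application table (with TARGET)",
    "application_test.csv": "main application table (no TARGET)",
    "bureau.csv": "credit records reported by other institutions",
    "bureau_balance.csv": "monthly balances for bureau records",
    "POS_CASH_balance.csv": "monthly POS / cash-loan balances",
    "credit_card_balance.csv": "monthly credit-card snapshots",
    "previous_application.csv": "previous Home Credit applications",
    "installments_payments.csv": "repayment history for previous credits",
}

for name, desc in TABLES.items():
    print("{:30s} {}".format(name, desc))
```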


Tables

application_train/test: client application table

Contains:

  • the target variable (whether the client defaulted, a 0/1 variable)
  • loan application details (loan type, credit amount, annuity)
  • client demographics (gender, age, family, education, occupation, industry, housing)
  • client finances (annual income, home/car ownership)
  • documents supplied with the application, etc.

bureau/bureau_balance: the client's credit history at the credit bureau, as reported by other financial institutions (monthly data)

Contains the client's bureau records:

  • credit history
  • overdue amounts
  • overdue dates, etc.

Recorded row by row as a time series.

POS_CASH_balance: the client's POS (point of sales) and cash-loan history in the Home Credit database (monthly data)

Contains the client's:

  • paid instalments
  • unpaid instalments

credit_card_balance: monthly snapshots of the client's credit cards in the Home Credit database

Contains the client's:

  • number of purchases
  • purchase amounts

previous_application: the client's previous applications

Contains all of the client's historical applications (application details, outcomes, etc.).

installments_payments: repayment history for the client's previous credits

Contains the client's repayment behaviour:

  • payment dates
  • whether payments were late
  • payment amounts
  • whether amounts are still owed, etc.
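The tables above link back to the application table through shared keys: every child table carries `SK_ID_CURR` (and bureau/previous rows additionally have their own `SK_ID_BUREAU`/`SK_ID_PREV`). The standard pattern, used throughout this notebook, is to aggregate child rows to one row per applicant and left-join the result. A toy sketch with made-up values:

```python
import pandas as pd

# Toy frames illustrating the key linkage: application rows keyed by
# SK_ID_CURR; bureau rows keyed by SK_ID_BUREAU but carrying SK_ID_CURR.
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1000.0, 2000.0]})
bureau = pd.DataFrame({"SK_ID_BUREAU": [10, 11, 12],
                       "SK_ID_CURR": [1, 1, 2],
                       "AMT_CREDIT_SUM": [300.0, 200.0, 700.0]})

# Aggregate child rows to one row per applicant, then left-join back.
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .sum().rename("BUREAU_CREDIT_SUM").reset_index())
app = app.merge(agg, on="SK_ID_CURR", how="left")
print(app["BUREAU_CREDIT_SUM"].tolist())  # [500.0, 700.0]
```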


Main fields

Field Description
SK_ID_CURR ID of this application
TARGET Repayment risk of this application: 1 = higher risk; 0 = lower risk
NAME_CONTRACT_TYPE Loan type: cash or revolving (apply once, draw down repeatedly)
CODE_GENDER Applicant's gender
FLAG_OWN_CAR Whether the applicant owns a car
FLAG_OWN_REALTY Whether the applicant owns real estate
CNT_CHILDREN Number of children
AMT_INCOME_TOTAL Applicant's income
AMT_CREDIT Credit amount of this application
AMT_ANNUITY Loan annuity
AMT_GOODS_PRICE For consumer loans, the actual price of the goods
NAME_TYPE_SUITE Who accompanied the applicant for this application
NAME_INCOME_TYPE Applicant's income type
NAME_EDUCATION_TYPE Applicant's education level
NAME_FAMILY_STATUS Applicant's marital status
NAME_HOUSING_TYPE Applicant's housing situation (renting, owned, living with parents, etc.)
REGION_POPULATION_RELATIVE Population density of the applicant's region, normalized
DAYS_BIRTH Applicant's date of birth (days before the application date, negative)
DAYS_EMPLOYED Tenure in the applicant's current job (days before the application date, negative)
DAYS_REGISTRATION Time of the applicant's last registration change (days before the application date, negative)
DAYS_ID_PUBLISH Time the applicant last changed the identity document used for the application (days before the application date, negative)
FLAG_MOBIL Whether the applicant provided a mobile phone number (1 = yes, 0 = no)
FLAG_EMP_PHONE Whether the applicant provided an employer phone number (1 = yes, 0 = no)
FLAG_WORK_PHONE Whether the applicant provided a work phone number (1 = yes, 0 = no)
FLAG_CONT_MOBILE Whether the applicant's mobile phone was reachable (1 = yes, 0 = no)
FLAG_EMAIL Whether the applicant provided an email address (1 = yes, 0 = no)
OCCUPATION_TYPE Applicant's occupation
REGION_RATING_CLIENT The company's rating of the applicant's region (1, 2, 3)
REGION_RATING_CLIENT_W_CITY The company's rating of the applicant's region, taking the city into account (1, 2, 3)
WEEKDAY_APPR_PROCESS_START Day of the week the application was started
HOUR_APPR_PROCESS_START Hour of day the application was started
REG_REGION_NOT_LIVE_REGION Whether the applicant's permanent address differs from the contact address (1 = mismatch, 0 = match, at region level)
REG_REGION_NOT_WORK_REGION Whether the applicant's permanent address differs from the work address (1 = mismatch, 0 = match, at region level)
LIVE_REGION_NOT_WORK_REGION Whether the applicant's contact address differs from the work address (1 = mismatch, 0 = match, at region level)
REG_CITY_NOT_LIVE_CITY Whether the applicant's permanent address differs from the contact address (1 = mismatch, 0 = match, at city level)
REG_CITY_NOT_WORK_CITY Whether the applicant's permanent address differs from the work address (1 = mismatch, 0 = match, at city level)
LIVE_CITY_NOT_WORK_CITY Whether the applicant's contact address differs from the work address (1 = mismatch, 0 = match, at city level)
ORGANIZATION_TYPE Type of organization the applicant works for
EXT_SOURCE_1 Normalized score from external data source 1
EXT_SOURCE_2 Normalized score from external data source 2
EXT_SOURCE_3 Normalized score from external data source 3
APARTMENTS_AVG <----> EMERGENCYSTATE_MODE Normalized scores describing the applicant's housing
OBS_30_CNT_SOCIAL_CIRCLE <----> DEF_60_CNT_SOCIAL_CIRCLE Counts of 30/60-days-past-due observations and defaults in the applicant's social circle (the original author noted these fields were unclear)
DAYS_LAST_PHONE_CHANGE Time the applicant last changed phone number (days before the application date, negative)
FLAG_DOCUMENT_2 <----> FLAG_DOCUMENT_21 Whether the applicant additionally provided documents 2, 3, 4, ..., 21
AMT_REQ_CREDIT_BUREAU_HOUR Number of credit-bureau enquiries about the applicant in the hour before the application
AMT_REQ_CREDIT_BUREAU_DAY Number of credit-bureau enquiries in the day before the application
AMT_REQ_CREDIT_BUREAU_WEEK Number of credit-bureau enquiries in the week before the application
AMT_REQ_CREDIT_BUREAU_MON Number of credit-bureau enquiries in the month before the application
AMT_REQ_CREDIT_BUREAU_QRT Number of credit-bureau enquiries in the quarter before the application
AMT_REQ_CREDIT_BUREAU_YEAR Number of credit-bureau enquiries in the year before the application
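Note the `DAYS_*` convention above: values count days relative to the application date and are negative when in the past. A tiny worked example of recovering an age in years:

```python
# DAYS_BIRTH counts days relative to the application date, so it is negative;
# e.g. DAYS_BIRTH = -12775 means the applicant is about 35 years old.
days_birth = -12775
age_years = -days_birth / 365
print(round(age_years, 1))  # 35.0
```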
In [20]
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
       
       
In [22]
# Install dependencies
!pip install xgboost
!pip install lightgbm
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: xgboost in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.3.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.6.3)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.20.3)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.24.2)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.6.3)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.36.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
       
In [23]
import os
import gc
import warnings
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

warnings.simplefilter(action='ignore', category=FutureWarning)
   
In [24]
DATA_DIRECTORY = "./data"
df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
df = df_train.append(df_test)
del df_train, df_test; gc.collect()
       
39
               
In [25]
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)
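The cleaning above removes an extreme income outlier, drops the rare `XNA` gender code, and neutralizes the dataset's sentinel value: `DAYS_EMPLOYED` encodes "no value" as 365243 (roughly 1000 years in the future). A small standalone check of the sentinel replacement:

```python
import numpy as np
import pandas as pd

# DAYS_EMPLOYED uses 365243 as a "missing" sentinel; replacing it with NaN,
# as above, keeps it from poisoning ratio features later on.
s = pd.Series([-1200, 365243, -300])
s = s.replace(365243, np.nan)
print(s.isna().sum())  # 1
```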
   
In [26]
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 65: return 4
    elif age_years < 99: return 5
    else: return 0
   
In [27]
docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))
   
In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = eval('np.{}'.format(function_name))(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
   
In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']    
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']
   
In [30]
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [31]
def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [32]
def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [33]
def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')
   
In [35]
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])
    return df, categorical_columns
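`label_encoder` relies on `pd.factorize`, which maps each distinct label to an integer code (missing values become -1). A quick illustration with made-up contract types:

```python
import pandas as pd

# pd.factorize assigns integer codes in order of first appearance;
# NaN/None is encoded as -1 rather than getting its own category.
codes, uniques = pd.factorize(
    pd.Series(["Cash loans", "Revolving loans", "Cash loans", None]))
print(list(codes))    # [0, 1, 0, -1]
print(list(uniques))  # ['Cash loans', 'Revolving loans']
```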
   
In [36]
def drop_application_columns(df):
    drop_list = [
        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',
        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',
        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_AVG',
        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',
        'FLOORSMIN_AVG', 'LANDAREA_AVG', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',
        'COMMONAREA_MEDI', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'BASEMENTAREA_AVG',
        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
        'LIVINGAPARTMENTS_AVG', 'ELEVATORS_AVG', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',
        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',
        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_AVG', 'HOUSETYPE_MODE',
        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]
    # Also drop most of the document flags, keeping only 3, 8 and 18.
    for doc_num in [2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)
    return df
   
In [37]
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)
   
In [38]
df = pd.get_dummies(df)
   
In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))
   
In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']
   
In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]
    return df, categorical_columns
   
In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()
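The column-flattening line in `group` deserves a note: after `.agg()` with multiple functions, the result has a two-level column index, and the list comprehension collapses it into names like `BUREAU_AMT_CREDIT_SUM_MEAN`. A standalone sketch of just that step:

```python
import pandas as pd

# .agg() with several functions yields a MultiIndex of (column, function)
# pairs; flatten it into PREFIX_COLUMN_FUNC names as group() does above.
df = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_CREDIT_SUM": [10.0, 30.0, 5.0]})
agg = df.groupby("SK_ID_CURR").agg({"AMT_CREDIT_SUM": ["mean", "max"]})
agg.columns = ["{}{}_{}".format("BUREAU_", c[0], c[1].upper()) for c in agg.columns]
print(list(agg.columns))  # ['BUREAU_AMT_CREDIT_SUM_MEAN', 'BUREAU_AMT_CREDIT_SUM_MAX']
```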
   
In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by=aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on=aggregate_by)
   
In [44]
def get_bureau_balance(path, num_rows=None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category=False)
    # Calculate rate for each status category
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()
    # Min, max, mean and count of the monthly balance records
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')
    del bb; gc.collect()
    return bb_processed
   
In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category=False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0
for i in range(1, 6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]
   
In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',
            'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')
del agg_length; gc.collect()
       
39
               
In [47]
BUREAU_AGG = {
    'SK_ID_BUREAU': ['nunique'],
    'DAYS_CREDIT': ['min', 'max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],
    'AMT_ANNUITY': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean', 'sum'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_0': ['mean'],
    'STATUS_1': ['mean'],
    'STATUS_12345': ['mean'],
    'STATUS_C': ['mean'],
    'STATUS_X': ['mean'],
    'CREDIT_ACTIVE_Active': ['mean'],
    'CREDIT_ACTIVE_Closed': ['mean'],
    'CREDIT_ACTIVE_Sold': ['mean'],
    'CREDIT_TYPE_Consumer credit': ['mean'],
    'CREDIT_TYPE_Credit card': ['mean'],
    'CREDIT_TYPE_Car loan': ['mean'],
    'CREDIT_TYPE_Mortgage': ['mean'],
    'CREDIT_TYPE_Microloan': ['mean'],
    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'LL_DEBT_CREDIT_DIFF': ['mean'],
    'LL_STATUS_12345': ['mean'],
}

BUREAU_ACTIVE_AGG = {
    'DAYS_CREDIT': ['max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['min', 'mean'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'CREDIT_TO_ANNUITY_RATIO': ['mean'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}

BUREAU_CLOSED_AGG = {
    'DAYS_CREDIT': ['max', 'var'],
    'DAYS_CREDIT_ENDDATE': ['max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],
    'DAYS_CREDIT_UPDATE': ['max'],
    'ENDDATE_DIF': ['mean'],
    'STATUS_12345': ['mean'],
}

BUREAU_LOAN_TYPE_AGG = {
    'DAYS_CREDIT': ['mean', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
    'AMT_CREDIT_SUM': ['mean', 'max'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'DAYS_CREDIT_ENDDATE': ['max'],
}

BUREAU_TIME_AGG = {
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'STATUS_0': ['mean'],
    'STATUS_12345': ['mean'],
}
   
In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active, agg_bureau, 'BUREAU_ACTIVE_', BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed, agg_bureau, 'BUREAU_CLOSED_', BUREAU_CLOSED_AGG)
del active, closed; gc.collect()
for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)
    del type_df; gc.collect()
for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30 * time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)
    del time_frame_df; gc.collect()
   
In [49]
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
# Note: columns= is required here; a bare mapper would rename the index instead.
gr.rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
   
In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')
del agg_bureau, bureau
gc.collect()
       
39
               
In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))
   
In [52]
PREVIOUS_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
    'RATE_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['max'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],
    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}

PREVIOUS_ACTIVE_AGG = {
    'SK_ID_PREV': ['nunique'],
    'SIMPLE_INTERESTS': ['mean'],
    'AMT_ANNUITY': ['max', 'sum'],
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['sum'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'AMT_PAYMENT': ['sum'],
    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],
    'REMAINING_DEBT': ['max', 'mean', 'sum'],
    'REPAYMENT_RATIO': ['mean'],
}

PREVIOUS_LATE_PAYMENTS_AGG = {
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_LOAN_TYPE_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],
    'APPLICATION_CREDIT_DIFF': ['min', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['max'],
    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],
    'CNT_PAYMENT': ['mean'],
}

PREVIOUS_TIME_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['mean', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_APPROVED_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_CREDIT': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max'],
    'AMT_GOODS_PRICE': ['max'],
    'HOUR_APPR_PROCESS_START': ['min', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['mean'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['max'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],
    'DAYS_FIRST_DUE': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE': ['max', 'mean'],
    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],
    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}

PREVIOUS_REFUSED_AGG = {
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['min', 'max'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
   
In [53]
ohe_columns = [
    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',
    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category=False)
   
In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT']/prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment/prev['AMT_CREDIT'] - 1)/prev['CNT_PAYMENT']
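The `SIMPLE_INTERESTS` line approximates a per-instalment simple interest rate: total scheduled repayment relative to the credit, minus one, spread over the number of instalments. A worked numeric example with made-up values:

```python
# Worked example of the SIMPLE_INTERESTS formula above:
# (total_payment / credit - 1) / number_of_instalments.
amt_annuity, cnt_payment, amt_credit = 1100.0, 12, 12000.0
total_payment = amt_annuity * cnt_payment            # 13200.0
simple_interest = (total_payment / amt_credit - 1) / cnt_payment
print(round(simple_interest, 4))  # 0.0083, i.e. ~0.83% per instalment
```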
   
In [55]
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace= True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on= 'SK_ID_PREV', how= 'left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM'] / \
                                         active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']
del active_pay, active_pay_agg, active_df; gc.collect()
       
0
               
In [56]
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace=True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace=True)
   
In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']
   
In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}
   
In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')
del active_agg_df; gc.collect()
       
0
               
In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)
del approved, refused; gc.collect()
       
0
               
In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)
    del type_df; gc.collect()
   
In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()
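The late-payment flag is 1 when the actual payment date falls after the instalment due date (both fields are negative day offsets from the application). A tiny standalone check:

```python
# A payment is late when the entry date is after the instalment date;
# with day offsets, -10 (paid) vs -15 (due) means 5 days late.
days_entry_payment, days_instalment = -10, -15
late = 1 if (days_entry_payment - days_instalment) > 0 else 0
print(late)  # 1
```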
   
In [63]
# Assign back to agg_prev: the original code stored this result in agg_dpd
# and then deleted it, silently discarding the PREV_LATE_ features.
agg_prev = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,
                           'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)
del dpd_id; gc.collect()
       
0
               
In [64]
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30 * time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)
    del time_frame_df; gc.collect()
del prev; gc.collect()
       
0
               
In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')
   
In [66]
train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]
del df
gc.collect()
       
98
               
In [67]
labels = train['TARGET']
test_labels = test['TARGET']
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])
   
In [68]
feature = list(train.columns)

train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_labels
   
In [69]
imputer = SimpleImputer(strategy='median')
# Fit on the training data only, then apply the same medians to both sets;
# the original code refit on test, overwriting the training statistics.
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
   
In [70]
scaler = MinMaxScaler(feature_range=(0, 1))
# Likewise, fit the scaler on the training data only to avoid leaking
# test-set statistics into the transformation.
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
   
In [71]
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)
       
LGBMClassifier()
               
In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]
   
In [74]
submit = test_df[['SK_ID_CURR']].copy()
submit['TARGET'] = lgbm_pred
   
In [75]
submit.to_csv('lgbm.csv', index = False)
   

Summary

The submission result is as follows (submitting requires access to Kaggle):

Dataset: Home Credit Default Risk
Online score: 0.78277
