scorecardpipeline

@Time : 2023/2/15 17:55 @Author : itlubber @Site : itlubber.art

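class scorecardpipeline.FeatureSelection(target='target', empty=0.95, iv=0.02, corr=0.7, exclude=None, return_drop=True, identical=0.95, remove=None, engine='scorecardpy', target_rm=False)[source]

Bases: TransformerMixin, BaseEstimator

__init__(target='target', empty=0.95, iv=0.02, corr=0.7, exclude=None, return_drop=True, identical=0.95, remove=None, engine='scorecardpy', target_rm=False)[source]

Feature selection method.

Parameters:

target – name of the label column in the dataset, default target
empty – missing-value rate threshold, default 0.95; features with more than 95% missing values are dropped
iv – IV threshold, default 0.02; features with an IV below 0.02 are dropped
corr – correlation threshold, default 0.7; when two features have a correlation above 0.7, the one with the lower IV is dropped
identical – single-value share threshold, default 0.95; a feature is dropped when any one of its values accounts for more than 95% of the samples
exclude – features that must be kept regardless of the thresholds, default None
return_drop – whether to return information about the dropped features, default True
remove – features to drop unconditionally; available when the engine is scorecardpy
engine – engine used for feature selection, either "toad" or "scorecardpy", default scorecardpy
target_rm – whether to drop the label column, default False (the label is kept)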
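The empty and identical thresholds can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names (empty_rate, identical_rate, screen) are hypothetical, missing values are represented by None, and the identical share is assumed to be measured against all rows.

```python
from collections import Counter

def empty_rate(values):
    """Share of missing (None) entries in a column."""
    return sum(v is None for v in values) / len(values)

def identical_rate(values):
    """Share of rows taken by the most frequent non-missing value (assumed definition)."""
    counts = Counter(v for v in values if v is not None)
    return max(counts.values()) / len(values) if counts else 0.0

def screen(columns, empty=0.95, identical=0.95, exclude=()):
    """Return the kept column names and the reason each dropped column was removed."""
    kept, dropped = [], {}
    for name, values in columns.items():
        if name in exclude:
            kept.append(name)            # forced to stay, thresholds are skipped
        elif empty_rate(values) > empty:
            dropped[name] = "empty"
        elif identical_rate(values) > identical:
            dropped[name] = "identical"
        else:
            kept.append(name)
    return kept, dropped

columns = {
    "age": [23, 35, 41, 29, 52, 38, 44, 31, 27, 60] * 10,
    "mostly_missing": [None] * 98 + [1, 2],
    "constant_flag": [0] * 97 + [1] * 3,
}
kept, dropped = screen(columns)
print(kept, dropped)
```

Passing exclude=("constant_flag",) to this sketch would keep that column even though it exceeds the identical threshold, which mirrors the purpose of the exclude parameter.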

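fit(x, y=None)[source]

Fit the feature selector.

Parameters: x – dataset, which must contain the target variable
Returns: the fitted FeatureSelection

transform(x, y=None)[source]

Apply the fitted feature selection to a dataset.

Parameters: x – dataset to be filtered
Returns: pd.DataFrame, the dataset after feature selection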

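class scorecardpipeline.FeatureImportanceSelector(top_k=126, target='target', selector='catboost', params=None, max_iv=None)[source]

Bases: BaseEstimator, TransformerMixin

__init__(top_k=126, target='target', selector='catboost', params=None, max_iv=None)[source]

Feature selection based on feature importance.

Parameters:

top_k – rank features by importance and keep the top_k most important ones
target – name of the label column in the dataset, default target
selector – feature selector; only catboost is currently supported, which can handle datasets containing string columns
params – parameters passed to the selector; the selector defaults are used when omitted
max_iv – upper IV limit used to drop features whose IV is too high, default None; 1.0 is the recommended setting

fit(x, y=None)[source]

Fit the feature importance selector.

Parameters: x – dataset, which must contain the target variable
Returns: the fitted FeatureImportanceSelector

transform(x, y=None)[source]

Apply the fitted feature importance selector to a dataset.

Parameters: x – dataset to be filtered
Returns: pd.DataFrame, the dataset after feature selection

catboost_selector(x, y, cat_features=None)[source]

Feature importance selector based on CatBoost.

Parameters:

x – dataset to be ranked by feature importance, without the target variable
y – target values corresponding to the dataset
cat_features – indices of the categorical features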

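class scorecardpipeline.StepwiseSelection(target='target', estimator='ols', direction='both', criterion='aic', max_iter=None, return_drop=True, exclude=None, intercept=True, p_value_enter=0.2, p_remove=0.01, p_enter=0.01, target_rm=False)[source]

Bases: TransformerMixin, BaseEstimator

__init__(target='target', estimator='ols', direction='both', criterion='aic', max_iter=None, return_drop=True, exclude=None, intercept=True, p_value_enter=0.2, p_remove=0.01, p_enter=0.01, target_rm=False)[source]

Stepwise regression feature selection.

Parameters:

target – name of the label column in the dataset, default target
estimator – estimator, default ols; options are "ols", "lr", "lasso", "ridge"; the default is usually fine
direction – direction of the stepwise regression, default both; options are "forward", "backward", "both"; the default is usually fine
criterion – evaluation criterion, default aic; options are "aic", "bic", "ks", "auc"; the default is usually fine
max_iter – maximum number of iterations, a parameter used by sklearn, default None
return_drop – whether to return information about the dropped features, default True
exclude – features that must be kept
intercept – whether to include an intercept, default True
p_value_enter – p-value threshold for entry, used in forward selection to decide whether a feature enters the model
p_remove – p-value threshold for removal, used in backward elimination to decide whether a feature is removed
p_enter – p-value threshold used in bidirectional stepwise regression to decide whether a feature is removed or admitted
target_rm – whether to drop the label column from the dataset, default False (the label is kept)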
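The aic and bic options of criterion refer to the usual information criteria. The sketch below shows how a candidate feature could be compared against the current model under these criteria; the log-likelihood values are made up for illustration, and the comparison rule is an assumption rather than the library's exact procedure.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical fits: adding one feature raises the log-likelihood from -520 to -510.
current = aic(log_likelihood=-520.0, n_params=3)
candidate = aic(log_likelihood=-510.0, n_params=4)
accept_feature = candidate < current  # forward step: keep the feature if AIC improves

candidate_bic = bic(log_likelihood=-510.0, n_params=4, n_samples=1000)
print(current, candidate, accept_feature, round(candidate_bic, 3))
```

BIC penalises each extra parameter by ln(n) instead of 2, so on large samples it tends to admit fewer features than AIC.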

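fit(x, y=None)[source]

Fit the stepwise regression feature selector.

Parameters: x – dataset, which must contain the target variable
Returns: the fitted StepwiseSelection

transform(x, y=None)[source]

Apply the fitted stepwise regression feature selection to a dataset.

Parameters: x – dataset to be filtered
Returns: pd.DataFrame, the dataset after feature selection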

class scorecardpipeline.Combiner(target='target', method='chi', empty_separate=True, min_n_bins=2, max_n_bins=None, max_n_prebins=20, min_prebin_size=0.02, min_bin_size=0.05, max_bin_size=None, gamma=0.01, monotonic_trend='auto_asc_desc', adj_rules={}, n_jobs=1, **kwargs)[source]

Bases: TransformerMixin, BaseEstimator

__init__(target='target', method='chi', empty_separate=True, min_n_bins=2, max_n_bins=None, max_n_prebins=20, min_prebin_size=0.02, min_bin_size=0.05, max_bin_size=None, gamma=0.01, monotonic_trend='auto_asc_desc', adj_rules={}, n_jobs=1, **kwargs)[source]

Wrapper for feature binning.

Parameters:

target – name of the label column in the dataset, default target
method – binning method; options are "chi", "dt", "quantile", "step", "kmeans", "cart", "mdlp", "uniform"; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
empty_separate – whether missing values get a bin of their own, default True
min_n_bins – minimum number of bins, default 2
max_n_bins – maximum number of bins, default None (no limit); 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
max_n_prebins – number of pre-bins when optbinning is used
min_prebin_size – minimum share of samples per pre-binning leaf node (i.e. per pre-bin) when optbinning is used, default 2%
min_bin_size – minimum share of samples per final leaf node (i.e. per bin) when optbinning is used, default 5%
max_bin_size – maximum share of samples per final leaf node (i.e. per bin) when optbinning is used, default None
gamma – regularization parameter that limits overfitting when binning with optbinning; larger values penalize more, default 0.01
monotonic_trend – bad-rate trend enforced by optbinning in the final binning, default auto_asc_desc; options are "auto", "auto_heuristic", "auto_asc_desc", "ascending", "descending", "convex", "concave", "peak", "valley", "peak_heuristic", "valley_heuristic"
adj_rules – custom binning rules, in the form accepted by toad.Combiner
n_jobs – number of workers for multiprocess acceleration, default 1 (single process)

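update(rules)[source]

Update the binning rules of features in the Combiner.

Parameters: rules – dict, the rules to update, in the form {feature name: binning rule}

static optbinning_bins(feature, data=None, target='target', min_n_bins=2, max_n_bins=3, max_n_prebins=10, min_prebin_size=0.02, min_bin_size=0.05, max_bin_size=None, gamma=0.01, monotonic_trend='auto_asc_desc', **kwargs)[source]

Feature binning based on optbinning.OptimalBinning. When optbinning.OptimalBinning fails, the chi-square binning of toad.transform.Combiner is used as a fallback.

Parameters:

feature – name of the feature to bin
data – training dataset
target – name of the label column in the dataset, default target
min_n_bins – minimum number of bins, default 2
max_n_bins – maximum number of bins, default 3; 3 to 5 is recommended and more is rarely useful; occasionally has no effect
max_n_prebins – number of pre-bins when optbinning is used, default 10
min_prebin_size – minimum share of samples per pre-binning leaf node (i.e. per pre-bin) when optbinning is used, default 2%
min_bin_size – minimum share of samples per final leaf node (i.e. per bin) when optbinning is used, default 5%
max_bin_size – maximum share of samples per final leaf node (i.e. per bin) when optbinning is used, default None
gamma – regularization parameter that limits overfitting when binning with optbinning; larger values penalize more, default 0.01
monotonic_trend – bad-rate trend enforced by optbinning in the final binning, default auto_asc_desc; options are "auto", "auto_heuristic", "auto_asc_desc", "ascending", "descending", "convex", "concave", "peak", "valley", "peak_heuristic", "valley_heuristic"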

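fit(x: DataFrame, y=None)[source]

Fit the feature binner.

Parameters: x – dataset to bin, which must contain the target variable
Returns: Combiner, the fitted binner

check_rules(feature=None)[source]: Check whether missing values of categorical variables have been converted to strings and, if so, force them back to missing values; also check the bin order and correct it where needed.

transform(x, y=None, labels=False)[source]

Apply the fitted binning to a dataset.

Parameters:

x – dataset to be binned
labels – whether to convert values to bin labels, default False (values are converted to bin indices)

Returns:

pd.DataFrame, the binned dataset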
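The difference between labels=False and labels=True can be sketched for a single numeric value as follows. This is an illustration only: to_bin is a hypothetical helper, bins are assumed to be left-closed intervals defined by sorted split points, and the label format is made up rather than taken from the library.

```python
from bisect import bisect_right

def to_bin(value, splits, labels=False):
    """Map a numeric value to its bin index, or to an interval label when labels=True."""
    index = bisect_right(splits, value)  # number of split points <= value
    if not labels:
        return index
    lower = "-inf" if index == 0 else splits[index - 1]
    upper = "inf" if index == len(splits) else splits[index]
    return f"[{lower} ~ {upper})"

splits = [30, 50]  # hypothetical rule for an age-like feature
indices = [to_bin(v, splits) for v in (25, 30, 49, 70)]
label = to_bin(40, splits, labels=True)
print(indices, label)
```

Bin indices are what a downstream WOE transformer would typically consume, while labels are mainly useful for reports and plots.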

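export(to_json=None)[source]

Export the feature binner, optionally saving it as JSON.

Parameters: to_json – path of the JSON file
Returns: dict, the feature binning information

load(from_json)[source]

Load a feature binner from a previously saved JSON file.

Parameters: from_json – path of the JSON file
Returns: Combiner, the feature binner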

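classmethod feature_bin_stats(data, feature, target='target', rules=None, method='step', desc='', combiner=None, ks=True, max_n_bins=None, min_bin_size=None, max_bin_size=None, greater_is_better='auto', amount=None, empty_separate=True, return_cols=None, return_rules=False, verbose=0, **kwargs)[source]

Feature bin statistics table, summarizing the metrics of each bin of a feature.

Parameters:

data – dataset for which the bin statistics table is produced
feature – name of the feature to report on
target – name of the label column in the dataset, default target
rules – custom rules for the bin statistics table; accepts a list (rules for a single feature) or a dict (rules for several features)
combiner – a pre-fitted feature binner; lower priority than rules
method – binning method, ignored when rules or combiner is passed; options are "chi", "dt", "quantile", "step", "kmeans", "cart", "mdlp", "uniform"; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
desc – feature description, mostly used to pass the feature's human-readable name or meaning
ks – whether to compute KS statistics
max_n_bins – maximum number of bins, default None (no limit); 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
min_bin_size – minimum share of samples per final leaf node (i.e. per bin) when optbinning is used, default 5%
max_bin_size – maximum share of samples per final leaf node (i.e. per bin) when optbinning is used, default None
empty_separate – whether missing values get a bin of their own, default True (recommended)
return_cols – list, restricts the returned table to the given columns, default None
return_rules – whether to also return the binning rules, default False
greater_is_better – whether larger values are better, default "auto", which infers the direction from the lift of the last two bins; options are True, False, "auto"
amount – default None; accepts a numeric column (typically the loan amount), in which case amount-based results are also output when analysing overdue rates
kwargs – other parameters of scorecardpipeline.processing.Combiner

Returns:

Feature bin statistics table: pd.DataFrame
Feature binning rules: list, returned when return_rules is True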
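As a rough picture of what a bin statistics table aggregates, the sketch below computes the sample count, bad count, bad rate and lift for each bin. The function and key names are hypothetical and the real table contains more columns; lift is taken here as the bin's bad rate divided by the overall bad rate.

```python
def bin_stats(bin_index, target):
    """Aggregate count, bad count, bad rate and lift for each bin."""
    overall_bad_rate = sum(target) / len(target)
    table = {}
    for b, y in zip(bin_index, target):
        row = table.setdefault(b, {"count": 0, "bad": 0})
        row["count"] += 1
        row["bad"] += y
    for row in table.values():
        row["bad_rate"] = row["bad"] / row["count"]
        row["lift"] = row["bad_rate"] / overall_bad_rate
    return dict(sorted(table.items()))

bins = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]    # bin index of each sample
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]  # 1 = bad, 0 = good
stats = bin_stats(bins, labels)
for index, row in stats.items():
    print(index, row)
```

A lift above 1 marks a bin that is riskier than the population average, which is the kind of signal the greater_is_better inference relies on.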

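bin_plot(data, x, rule={}, desc='', result=False, save=None, **kwargs)[source]

Feature bin plot.

Parameters:

data – dataset for which the bin plot is drawn
x – name of the feature to plot
rule – custom binning rule for the feature; it does not modify the fitted binning information
desc – feature description, mostly used to pass the feature's human-readable name or meaning
result – whether to return the feature bin statistics table, default False
save – path where the image is saved; missing folders in the path are created, default None
kwargs – other parameters of the scorecardpipeline.utils.bin_plot function (listed in this reference as scorecardpipeline.bin_plot)

Returns:

pd.DataFrame, the feature bin statistics table, returned when result is True

proportion_plot(data, x, transform=False, labels=False, keys=None)[source]

Distribution of a feature in the dataset.

Parameters:

data – dataset whose sample distribution is inspected
x – name of the feature to inspect
transform – whether to bin the feature first, default False; for numeric features, binning before inspecting the distribution is recommended
labels – whether to convert values to bin labels when binning, default False (values are converted to bin indices)
keys – split the dataset by a key and show the distribution for each part, default None

corr_plot(data, transform=False, figure_size=(20, 15), save=None)[source]

Feature correlation plot.

Parameters:

data – dataset whose feature correlations are inspected
transform – whether to bin the features first, default False
figure_size – figure size, default (20, 15)
save – path where the image is saved; missing folders in the path are created, default None

badrate_plot(data, date_column, feature, labels=True)[source]

Check whether the bins stay stable across time periods. Lines whose gaps widen over time are preferable, since this indicates that the feature discriminates more strongly on more recent data. Lines that do not cross one another are preferable, since this indicates stable binning.

Parameters:

data – dataset to check for bin stability, which must contain a date column
feature – name of the feature to check
date_column – name of the date column in the dataset
labels – whether to convert values to bin labels when binning, default True (values are converted to bin labels)

property rules: dict, the detailed binning rules of the features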

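class scorecardpipeline.WOETransformer(target='target', exclude=None)[source]

Bases: TransformerMixin, BaseEstimator

__init__(target='target', exclude=None)[source]

WOE transformer.

Parameters:

target – name of the label column in the dataset, default target
exclude – columns that should not be WOE-transformed

fit(x, y=None)[source]

Fit the WOE transformer.

Parameters: x – data already transformed by a Combiner (with labels=False), which must contain the target variable
Returns: WOETransformer, the fitted WOE transformer

transform(x, y=None)[source]

Apply the WOE transformation to a dataset.

Parameters: x – dataset to be WOE-transformed
Returns: pd.DataFrame, the WOE-transformed dataset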
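For orientation, WOE and IV can be computed per bin as in the sketch below. The sign convention (log of bad share over good share) is an assumption and differs between libraries; woe_iv is a hypothetical helper and the counts are made up.

```python
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin; every count must be positive."""
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / good_total
        bad_share = bad / bad_total
        woe = math.log(bad_share / good_share)  # assumed convention: ln(bad% / good%)
        woes.append(woe)
        iv += (bad_share - good_share) * woe    # each bin contributes a non-negative term
    return woes, iv

woes, iv = woe_iv([(400, 20), (300, 30), (200, 50)])
print([round(w, 4) for w in woes], round(iv, 4))
```

The resulting IV is the quantity that the iv threshold of FeatureSelection (default 0.02) is compared against.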

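export(to_json=None)[source]

Export the WOE transformer, optionally saving it as JSON.

Parameters: to_json – path of the JSON file
Returns: dict, the WOE information of the features

load(from_json)[source]

Load a WOE transformer from a previously saved JSON file.

Parameters: from_json – path of the JSON file
Returns: WOETransformer, the WOE transformer

property rules: dict, the detailed WOE information of the features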

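class scorecardpipeline.ITLubberLogisticRegression(target='target', penalty='l2', calculate_stats=True, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)[source]

Bases: LogisticRegression

__init__(target='target', penalty='l2', calculate_stats=True, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)[source]

ITLubberLogisticRegression inherits from sklearn.linear_model.LogisticRegression and adds statistical summary output. The core logic follows: https://github.com/ing-bank/skorecard/blob/main/skorecard/linear_model/linear_model.py#L11

Parameters:

target – name of the label column in the dataset, default target
calculate_stats – whether to record model statistics during training, default True; the statistics can be printed with the summary method
tol – stopping tolerance, float, default 1e-4
C – inverse of the regularization strength λ, float, default 1.0; must be a positive float, and smaller values mean stronger regularization
fit_intercept – whether to fit an intercept (bias), bool, default True
class_weight – class weights, default None; accepts a dict or balanced; with balanced, weights are computed as n_samples / (n_classes * np.bincount(y))
solver – solver, default lbfgs. liblinear is a good choice for small datasets, while sag or saga tend to be faster on large ones. For multiclass problems only newton-cg, sag, saga and lbfgs can be used. Regarding regularization, newton-cg, lbfgs and sag support only L2 (these algorithms need continuous first- or second-order derivatives of the loss, so they cannot handle L1, whose derivative is not continuous), whereas liblinear and saga can handle L1. newton-cg is a conjugate-gradient method from the Newton family, lbfgs is a quasi-Newton method, sag is stochastic average gradient descent, saga is a stochastic optimization algorithm, and liblinear uses coordinate descent.
penalty – penalty term, default l2; options are l1 and l2; only L2 is supported when solver is newton-cg, sag or lbfgs. L1 corresponds to assuming a Laplace distribution for the model parameters, L2 to a Gaussian distribution
intercept_scaling – only used when solver is liblinear and fit_intercept is True
dual – dual or primal formulation, bool, default False; the dual formulation is only implemented for the L2 penalty with the liblinear solver. When the number of samples exceeds the number of features, dual is usually set to False
random_state – random seed, int, optional, default None; only used when solver is sag, saga or liblinear
max_iter – maximum number of iterations for the solver to converge, int, default 100; only relevant when solver is newton-cg, sag or lbfgs
multi_class – multiclass strategy, default auto; options are ovr and multinomial. For binary classification the two behave the same; the difference only shows in multiclass problems
verbose – verbosity; for the liblinear and lbfgs solvers, set to any positive number to print detailed progress
warm_start – warm start, bool; whether to reuse the previous solution as initialization, default False
n_jobs – number of parallel jobs, default None (treated as 1); -1 uses all CPU cores
l1_ratio – elastic-net mixing parameter between 0 and 1 inclusive; only used when penalty is elasticnet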
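The balanced formula quoted for class_weight, n_samples / (n_classes * np.bincount(y)), can be reproduced without NumPy as a quick check. The helper name below is hypothetical.

```python
from collections import Counter

def balanced_class_weight(y):
    """Weight of each class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# 90 good samples (0) and 10 bad samples (1)
weights = balanced_class_weight([0] * 90 + [1] * 10)
print(weights)
```

The rare class receives the larger weight, so each class contributes the same total weight to the loss.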

Example

>>> # target is the label column name and train the training DataFrame
>>> from sklearn.pipeline import Pipeline
>>> from scorecardpipeline import *
>>> feature_pipeline = Pipeline([
...     ("preprocessing_select", FeatureSelection(target=target, engine="scorecardpy")),
...     ("combiner", Combiner(target=target, min_samples=0.2)),
...     ("transform", WOETransformer(target=target)),
...     ("processing_select", FeatureSelection(target=target, engine="scorecardpy")),
...     ("stepwise", StepwiseSelection(target=target)),
...     # ("logistic", LogisticClassifier(target=target)),
...     ("logistic", ITLubberLogisticRegression(target=target)),
... ])
>>> feature_pipeline.fit(train)
>>> summary = feature_pipeline.named_steps['logistic'].summary()
>>> summary
                                                    Coef.  Std.Err       z  P>|z|  [ 0.025  0.975 ]    VIF
const                                               -0.8511   0.0991 -8.5920 0.0000  -1.0452  -0.6569 1.0600
credit_history                                       0.8594   0.1912  4.4954 0.0000   0.4847   1.2341 1.0794
age_in_years                                         0.6176   0.2936  2.1032 0.0354   0.0421   1.1932 1.0955
savings_account_and_bonds                            0.8842   0.2408  3.6717 0.0002   0.4122   1.3563 1.0331
credit_amount                                        0.7027   0.2530  2.7771 0.0055   0.2068   1.1987 1.1587
status_of_existing_checking_account                  0.6891   0.1607  4.2870 0.0000   0.3740   1.0042 1.0842
personal_status_and_sex                              0.8785   0.5051  1.7391 0.0820  -0.1116   1.8685 1.0113
purpose                                              1.1370   0.2328  4.8844 0.0000   0.6807   1.5932 1.0282
present_employment_since                             0.7746   0.3247  2.3855 0.0171   0.1382   1.4110 1.0891
installment_rate_in_percentage_of_disposable_income  1.3785   0.3434  4.0144 0.0001   0.7055   2.0515 1.0300
duration_in_month                                    0.9310   0.1986  4.6876 0.0000   0.5417   1.3202 1.1636
other_installment_plans                              0.8521   0.3459  2.4637 0.0138   0.1742   1.5301 1.0117
housing                                              0.8251   0.4346  1.8983 0.0577  -0.0268   1.6770 1.0205

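fit(x, sample_weight=None, **kwargs)[source]

Fit the logistic regression model.

Parameters:

x – training dataset, which must contain the target variable
sample_weight – sample weights, see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
kwargs – other parameters for fitting the logistic regression model

Returns:

ITLubberLogisticRegression, the fitted logistic regression model

decision_function(x)[source]

Decision function.

Parameters: x – dataset to predict on; it may contain the target variable, which is detected by column name and dropped if present
Returns: np.ndarray, the predictions

corr(data, save=None, annot=True)[source]

Feature correlation plot of a dataset.

Parameters:

data – dataset for which the correlation plot is drawn
save – path where the image is saved; missing folders in the path are created, default None
annot – whether to show the correlation values in the plot, default True

report(data)[source]

Logistic regression model report.

Parameters: data – dataset to evaluate
Returns: pd.DataFrame, the model report with metrics such as accuracy and F1, see: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

summary()[source]

Returns: pd.DataFrame, the statistics of the logistic regression model

Coef.: coefficient of each feature in the model
Std.Err: standard error
z: z-test statistic
P>|z|: p-value
[ 0.025: lower bound of the confidence interval
0.975 ]: upper bound of the confidence interval
VIF: variance inflation factor

Example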

>>> summary = logistic.summary()
>>> summary
                                                    Coef.  Std.Err       z  P>|z|  [ 0.025  0.975 ]    VIF
const                                               -0.8511   0.0991 -8.5920 0.0000  -1.0452  -0.6569 1.0600
credit_history                                       0.8594   0.1912  4.4954 0.0000   0.4847   1.2341 1.0794
age_in_years                                         0.6176   0.2936  2.1032 0.0354   0.0421   1.1932 1.0955
savings_account_and_bonds                            0.8842   0.2408  3.6717 0.0002   0.4122   1.3563 1.0331
credit_amount                                        0.7027   0.2530  2.7771 0.0055   0.2068   1.1987 1.1587
status_of_existing_checking_account                  0.6891   0.1607  4.2870 0.0000   0.3740   1.0042 1.0842
personal_status_and_sex                              0.8785   0.5051  1.7391 0.0820  -0.1116   1.8685 1.0113
purpose                                              1.1370   0.2328  4.8844 0.0000   0.6807   1.5932 1.0282
present_employment_since                             0.7746   0.3247  2.3855 0.0171   0.1382   1.4110 1.0891
installment_rate_in_percentage_of_disposable_income  1.3785   0.3434  4.0144 0.0001   0.7055   2.0515 1.0300
duration_in_month                                    0.9310   0.1986  4.6876 0.0000   0.5417   1.3202 1.1636
other_installment_plans                              0.8521   0.3459  2.4637 0.0138   0.1742   1.5301 1.0117
housing                                              0.8251   0.4346  1.8983 0.0577  -0.0268   1.6770 1.0205

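summary2(feature_map={})[source]

Same as summary, but accepts a data dictionary and outputs the statistics table with feature descriptions.

Parameters: feature_map – data dictionary, default {}
Returns: pd.DataFrame, the statistics of the logistic regression model

static convert_sparse_matrix(x)[source]: Optimization for sparse features.

plot_weights(save=None, figsize=(15, 8), fontsize=14, color=['#2639E9', '#F76E6C', '#FE7715'])[source]

Coefficient and error plot of the logistic regression model.

Parameters:

save – path where the image is saved; missing folders in the path are created, default None
figsize – figure size, default (15, 8)
fontsize – font size, default 14
color – theme colors of the figure; the default is fine

Returns:

Figure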

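class scorecardpipeline.ScoreCard(target='target', pdo=60, rate=2, base_odds=35, base_score=750, combiner={}, transer=None, pretrain_lr=None, pipeline=None, **kwargs)[source]

Bases: ScoreCard, TransformerMixin

__init__(target='target', pdo=60, rate=2, base_odds=35, base_score=750, combiner={}, transer=None, pretrain_lr=None, pipeline=None, **kwargs)[source]

Scorecard model.

Parameters:

target – name of the label column in the dataset, default target
pdo – the score drops by pdo points each time the odds are multiplied by rate, default 60
rate – odds multiplier, default 2
base_odds – base odds, usually set from business experience as the ratio of good to bad customers; it can be estimated as (1 - share of bad customers) / share of bad customers; default 35, i.e. 35:1 => 0.972 => a bad rate of 2.8%
base_score – score corresponding to the base odds, default 750
combiner – binning transformer; may be None when pipeline is passed
transer – WOE transformer; may be None when pipeline is passed
pretrain_lr – a pre-trained logistic regression model, optional
pipeline – a fitted pipeline, which must contain a Combiner and a WOETransformer
kwargs – other parameters, see toad.ScoreCard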
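How pdo, rate, base_odds and base_score interact can be sketched with the common scaling score = A + B * ln(odds). The formulas below are an assumption about that convention, with odds taken as good:bad so that riskier customers score lower; the function names are hypothetical.

```python
import math

def scorecard_scale(pdo=60, rate=2, base_odds=35, base_score=750):
    """Return (A, B) such that score = A + B * ln(good:bad odds)."""
    b = pdo / math.log(rate)
    a = base_score - b * math.log(base_odds)
    return a, b

def proba_to_score(bad_probability, a, b):
    """Convert a predicted bad probability into a score."""
    odds = (1 - bad_probability) / bad_probability  # good:bad odds
    return a + b * math.log(odds)

a, b = scorecard_scale()
base = proba_to_score(1 / 36, a, b)      # good:bad odds of 35:1
halved = proba_to_score(1 / 18.5, a, b)  # good:bad odds of 17.5:1
print(round(a, 2), round(b, 2), round(base, 2), round(halved, 2))
```

With the defaults, odds of 35:1 map to 750 points, and halving the good:bad odds (doubling the bad:good odds) lowers the score by pdo = 60 points.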

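fit(x)[source]

Fit the scorecard model.

Parameters: x – training data after WOE transformation, which must contain the target variable
Returns: ScoreCard, the fitted scorecard model

transform(x)[source]

Convert data to scores.

Parameters: x – raw data to be scored, not WOE-transformed data
Returns: the predicted scores

static score_clip(score, clip=50)[source]

Given score values, return equal-width binning rules based on the score distribution.

Parameters:

score – score data
clip – interval width

Returns:

list, the score binning rules

scorecard_scale()[source]

Output the scorecard scaling information, including base_odds, base_score, rate, pdo, A and B.

Returns: pd.DataFrame, the scorecard scaling information

classmethod format_bins(bins, index=False, ellipsis=None, decimal=4)[source]

Convert bins to labels.

Parameters:

bins – the bins
index – whether to include the index
ellipsis – maximum number of characters to display

Returns:

ndarray: the bin labels

scorecard_points(feature_map={})[source]

Output the scorecard bins and their corresponding points.

Parameters: feature_map – data dictionary, default {}; when a dictionary of the model features is passed, a column with the variable meaning is added to the output
Returns: pd.DataFrame, the scorecard bin information

scorecard2pmml(pmml: str = 'scorecard.pmml', debug: bool = False)[source]

Convert the scorecard model to a local PMML file. This requires JDK 1.8+ and the sklearn2pmml library to be installed beforehand.

Parameters:

pmml – path where the PMML model file is saved
debug – bool, whether to enable debug mode, default False; when True, the scorecard pipeline is returned and conversion details are shown

Returns:

sklearn.pipeline.Pipeline, the scorecard pipeline, returned when debug is True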

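static KS_bucket(y_pred, y_true, bucket=10, method='quantile')[source]

Evaluate the rank-ordering ability of the scorecard.

Parameters:

y_pred – model predictions, either scores predicted by the scorecard or probabilities predicted by the LR model
y_true – good/bad labels of the samples
bucket – number of bins, default 10
method – binning method, one of chi, dt, quantile, step, kmeans, default quantile

Returns:

statistics of the scorecard after binning; using the feature_bin_stats method directly is recommended

static KS(y_pred, y_true)[source]

Compute the KS metric.

Parameters:

y_pred – model predictions, either scores predicted by the scorecard or probabilities predicted by the LR model
y_true – good/bad labels of the samples

Returns:

float, the KS metric

static AUC(y_pred, y_true)[source]

Compute the AUC metric.

Parameters:

y_pred – model predictions, either scores predicted by the scorecard or probabilities predicted by the LR model
y_true – good/bad labels of the samples

Returns:

float, the AUC metric

static perf_eva(y_pred, y_true, title='', plot_type=['ks', 'roc'], save=None, figsize=(14, 6))[source]

Evaluate scorecard performance.

Parameters:

y_pred – model predictions, either scores predicted by the scorecard or probabilities predicted by the LR model
y_true – good/bad labels of the samples
title – figure title
plot_type – types of plots to draw; options are ks, roc, lift, pr
save – path where the image is saved; missing folders in the path are created, default None
figsize – figure size as a tuple, default (14, 6)

Returns:

dict, containing ks, auc, gini and figure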

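static ks_plot(score, y_true, title='', fontsize=14, figsize=(16, 8), save=None, **kwargs)[source]

KS curve and ROC curve of a numeric feature.

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
title – figure title
fontsize – font size, default 14
figsize – figure size, default (16, 8)
save – path where the image is saved; missing folders in the path are created, default None
kwargs – other parameters, see scorecardpipeline.utils.hist_plot

static PSI(y_pred_train, y_pred_oot)[source]

Compute the PSI between the scores or predictions of two datasets.

Parameters:

y_pred_train – numeric feature of the baseline dataset, usually the scorecard score
y_pred_oot – numeric feature of the comparison dataset

Returns:

float, the PSI value

static perf_psi(y_pred_train, y_pred_oot, y_true_train, y_true_oot, keys=['train', 'test'], x_limits=None, x_tick_break=50, show_plot=True, return_distr_dat=False)[source]

The perf_psi method of scorecardpy, which draws a PSI plot based on two datasets.

Parameters:

y_pred_train – numeric feature of the baseline dataset, usually the scorecard score
y_pred_oot – numeric feature of the comparison dataset
y_true_train – true labels of the baseline dataset
y_true_oot – true labels of the comparison dataset
keys – names of the baseline and comparison datasets
x_limits – range of the x axis, default None
x_tick_break – step of the score intervals
show_plot – whether to show the plot, default True
return_distr_dat – whether to return the distribution data

Returns:

dict, the PSI metric and the figure

static score_hist(score, y_true, figsize=(15, 10), bins=20, save=None, **kwargs)[source]

Distribution plot of a numeric feature.

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
figsize – figure size, default (15, 10)
bins – number of bins, default 20
save – path where the image is saved; missing folders in the path are created, default None
kwargs – other parameters of the scorecardpipeline.utils.hist_plot function

static class_steps(pipeline, query)[source]

Look up the steps in a pipeline that match query.

Parameters:

pipeline – sklearn.pipeline.Pipeline, the fitted preprocessing pipeline
query – class to look up; WOETransformer and Combiner can be searched for in the pipeline

Returns:

list, the matching components

feature_bin_stats(data, feature, rules={}, method='step', max_n_bins=10, desc='评分卡分数', ks=False, **kwargs)[source]

Evaluate the rank-ordering ability of the scorecard; outputs the metrics of each score interval.

Parameters:

data – dataset to inspect
feature – name of a numeric feature, usually the predicted probability or the scorecard score
rules – custom rules for splitting the intervals
method – binning method
max_n_bins – maximum number of bins
desc – feature description
ks – whether to compute the KS metric and output the related statistics
kwargs – other parameters of the Combiner.feature_bin_stats method

Returns:

pd.DataFrame, the statistics of each score interval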

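class scorecardpipeline.Rule(expr)[source]

Bases: object

__init__(expr)[source]

Rule set.

Parameters: expr – an expression in the same form as the argument of DataFrame.query; only rules on numeric variables are currently supported

Example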

>>> from scorecardpipeline import *
>>> target = "creditability"
>>> data = germancredit()
>>> data[target] = data[target].map({"good": 0, "bad": 1})
>>> data = data.select_dtypes("number") # string-typed rules are not supported yet
>>> rule1 = Rule("duration_in_month < 10")
>>> rule2 = Rule("credit_amount < 500")
>>> rule1.report(data, target=target)
>>> rule2.report(data, target=target)
>>> (rule1 | rule2).report(data, target=target)
>>> (rule1 & rule2).report(data, target=target)
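The | and & combinations used above can be pictured with a small stand-in class. MiniRule is hypothetical and only mimics the combination behaviour on plain dictionaries; it is not how Rule is implemented.

```python
class MiniRule:
    """Hypothetical stand-in for a rule: a name plus a row-level predicate."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

    def predict(self, rows):
        """Return True for every row hit by the rule."""
        return [bool(self.predicate(row)) for row in rows]

    def __or__(self, other):
        return MiniRule(f"({self.name}) | ({other.name})",
                        lambda row: self.predicate(row) or other.predicate(row))

    def __and__(self, other):
        return MiniRule(f"({self.name}) & ({other.name})",
                        lambda row: self.predicate(row) and other.predicate(row))

rows = [
    {"duration_in_month": 6, "credit_amount": 400},
    {"duration_in_month": 6, "credit_amount": 900},
    {"duration_in_month": 24, "credit_amount": 400},
    {"duration_in_month": 24, "credit_amount": 900},
]
short = MiniRule("duration_in_month < 10", lambda r: r["duration_in_month"] < 10)
small = MiniRule("credit_amount < 500", lambda r: r["credit_amount"] < 500)
either = (short | small).predict(rows)
both = (short & small).predict(rows)
print(either, both)
```

An OR combination hits every row caught by at least one rule, while an AND combination narrows the hit population to rows caught by both.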

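predict(X: DataFrame, part='')[source]

report(datasets: DataFrame, target='target', overdue=None, dpd=None, del_grey=False, desc='', filter_cols=None, prior_rules=None) → DataFrame[source]

Output the rule performance report as a table.

Parameters:

datasets – dataset, which must contain either the target variable or the overdue-days column; when the target variable is absent it is derived from the overdue days, in which case the DPD threshold that defines overdue must also be passed
target – name of the target variable, default target
desc – description of the rule, which appears in the returned table
filter_cols – list of columns to return, default None
prior_rules – prior rules; when passed, the data is first filtered by the prior rules and the rule is then evaluated on the result
overdue – name of the overdue-days column
dpd – overdue definition: overdue days > DPD gives 1, otherwise 0; only used when overdue is in effect
del_grey – whether to drop samples whose overdue days fall in (0, dpd]; only used when overdue is in effect

Returns:

pd.DataFrame, the rule performance table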
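The overdue, dpd and del_grey parameters describe how a target is derived from overdue days. A minimal sketch of that labelling, with a hypothetical helper name, could look like this:

```python
def overdue_to_target(overdue_days, dpd, del_grey=False):
    """Label each sample 1 when overdue days > dpd, else 0.

    With del_grey=True, samples whose overdue days fall in (0, dpd] are dropped.
    """
    labelled = []
    for days in overdue_days:
        if del_grey and 0 < days <= dpd:
            continue  # grey sample: overdue, but not beyond the DPD threshold
        labelled.append((days, 1 if days > dpd else 0))
    return labelled

days = [0, 3, 7, 10, 30]
with_grey = overdue_to_target(days, dpd=7)
without_grey = overdue_to_target(days, dpd=7, del_grey=True)
print(with_grey)
print(without_grey)
```

Dropping the grey samples leaves only clearly good (never overdue) and clearly bad (overdue beyond DPD) customers in the evaluation.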

result()[source]

static save(report, excel_writer, sheet_name=None, merge_column=None, percent_cols=None, condition_cols=None, custom_cols=None, custom_format='#,##0', color_cols=None, start_col=2, start_row=2, **kwargs)[source]: Save the rule results to Excel; the parameters are the same as those of https://scorecardpipeline.itlubber.art/scorecardpipeline.html#scorecardpipeline.dataframe2excel

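class scorecardpipeline.DecisionTreeRuleExtractor(target='target', labels=['positive', 'negative'], feature_map={}, nan=-1.0, max_iter=128, writer=None, seed=None, theme_color='2639E9', decimal=4)[source]

Bases: object

__init__(target='target', labels=['positive', 'negative'], feature_map={}, nan=-1.0, max_iter=128, writer=None, seed=None, theme_color='2639E9', decimal=4)[source]

Toolkit for automatic rule mining with decision trees.

Parameters:

target – name of the good/bad label column in the dataset, default target
labels – names of the good and bad sample labels, a list of length 2 whose element 0 is the good-sample label and element 1 the bad-sample label, default ["positive", "negative"]
feature_map – variable names and their meanings, used to make later reports and strategy output more readable, default {}
nan – value used to fill missing values during decision tree strategy mining, default -1
max_iter – maximum number of tree models trained on the dataset; after each tree is built, the feature with the highest importance is removed before the next tree is built, default 128
writer – an ExcelWriter created earlier in the program; when an existing writer is passed, all subsequent content is saved to its workbook, default None
seed – random seed, used to make results reproducible, default None
theme_color – theme color, default 2639E9 (Klein blue); other colors can be set
decimal – precision of the split thresholds of the decision tree nodes, default 4, i.e. 4 decimal places

encode_cat_features(X, y)[source]

get_dt_rules(tree)[source]

select_dt_rules(decision_tree, x, y, lift=0.0, max_samples=1.0, save=None, verbose=False, drop=False)[source]

query_dt_rules(x, y, parsed_rules=None)[source]

insert_dt_rules(parsed_rules, end_row, start_col, save=None, sheet=None, figsize=(500, 350))[source]

fit(x, y=None, max_depth=2, lift=0.0, max_samples=1.0, min_score=None, verbose=False, *args, **kwargs)[source]

Mine combined strategies.

Parameters:

x – dataset containing the label
max_depth – maximum depth of the decision tree, i.e. the maximum number of features combined in a strategy, default 2
lift – minimum lift of a combined strategy, default 0., i.e. all combined strategies are kept
max_samples – maximum sample share of each combined strategy, default 1.0, i.e. all combined strategies are kept
min_score – minimum AUC of a fitted decision tree; if it is not reached, no further trees are generated
verbose – whether to enable debug mode; only effective in a jupyter environment
kwargs – DecisionTreeClassifier parameters, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

transform(x, y=None)[source]

report(valid=None, sheet='组合策略汇总', save=None)[source]

Write the combined strategies into an Excel document.

Parameters:

valid – validation dataset
sheet – name of the sheet in which the combined strategies are saved
save – file path where the report is saved

Returns:

the hit statistics of the combined strategies on each dataset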

scorecardpipeline.KS(score, target)[source]

Calculate the KS value.

Args:

score (array-like) – list of scores or probabilities predicted by the model
target (array-like) – list of true target values

Returns: float, the maximum KS value
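As a reference for what the KS value measures, the sketch below computes the maximum gap between the cumulative bad share and the cumulative good share over all score thresholds. It is a plain-Python illustration with a hypothetical function name, not the library's implementation.

```python
def ks_statistic(score, target):
    """Maximum absolute gap between cumulative bad and good shares (target: 1 = bad)."""
    pairs = sorted(zip(score, target))
    bad_total = sum(target)
    good_total = len(target) - bad_total
    bad_seen = good_seen = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # consume all samples that share the same score before comparing
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            bad_seen += pairs[j][1]
            good_seen += 1 - pairs[j][1]
            j += 1
        best = max(best, abs(bad_seen / bad_total - good_seen / good_total))
        i = j
    return best

separated = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
uninformative = ks_statistic([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
print(separated, uninformative)
```

A score that separates good and bad samples perfectly reaches a KS of 1, while a constant score gives 0.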

scorecardpipeline.AUC(score, target, return_curve=False)[source]

AUC score.

Args:

score (array-like) – list of scores or probabilities predicted by the model
target (array-like) – list of true target values
return_curve (bool) – whether to also return the curve data for an ROC plot

Returns: float, the AUC score
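AUC can be read as the probability that a randomly chosen bad sample receives a higher score than a randomly chosen good one, with ties counted as one half. A small pairwise sketch (hypothetical function name, quadratic in the sample size, so only suitable for illustration):

```python
def auc_score(score, target):
    """Share of (bad, good) pairs ranked correctly; ties count as one half."""
    positives = [s for s, t in zip(score, target) if t == 1]
    negatives = [s for s, t in zip(score, target) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

value = auc_score([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(value)
```

When the input is a scorecard score where higher means safer, the ranking is reversed and this pairwise AUC would fall below 0.5, in which case 1 minus the value is usually reported.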

scorecardpipeline.PSI(test, base, combiner=None, return_frame=False)[source]

Calculate PSI.

Args:

test (array-like) – data whose PSI is tested
base (array-like) – base data used to calculate PSI
combiner (Combiner|list|dict) – combiner used to bin the data
return_frame (bool) – whether to also return the frame of proportions

Returns: float|Series
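Once both datasets are binned with the same rules, PSI is the sum over bins of (actual share - expected share) * ln(actual share / expected share). The sketch below works on per-bin counts; the function name and the counts are illustrative, and every bin is assumed to be non-empty.

```python
import math

def psi(expected_counts, actual_counts):
    """Population stability index from per-bin counts (all counts must be positive)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        e = expected / expected_total
        a = actual / actual_total
        value += (a - e) * math.log(a / e)
    return value

stable = psi([250, 250, 250, 250], [250, 250, 250, 250])
shifted = psi([250, 250, 250, 250], [400, 300, 200, 100])
print(stable, round(shifted, 4))
```

Identical distributions give a PSI of 0, and the value grows as the actual distribution drifts away from the expected one.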

scorecardpipeline.F1(score, target, split='best', return_split=False)[source]

Calculate the F1 value.

Args:

score (array-like)
target (array-like)

Returns:

float, the best F1 score
float, the best split point

scorecardpipeline.init_logger(filename=None, stream=True, fmt='[ %(asctime)s ][ %(levelname)s ][ %(filename)s:%(funcName)s:%(lineno)d ] %(message)s', datefmt=None)[source]

Initialize logging.

Parameters:

filename – path of the log file; when omitted, logs are not written to a file, default None
stream – whether to print logs to the terminal, default True
fmt – log format, see: https://docs.python.org/3/library/logging.html#formatter-objects
datefmt – date format

Returns:

logging.Logger
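A logger with the same kind of options can be assembled from the standard logging module. The sketch below is an approximation under stated assumptions: make_logger and the logger name are hypothetical, and the log level is set to INFO by choice rather than taken from the library.

```python
import logging
import sys

DEFAULT_FMT = "[ %(asctime)s ][ %(levelname)s ][ %(filename)s:%(funcName)s:%(lineno)d ] %(message)s"

def make_logger(name="scorecard_demo", filename=None, stream=True, fmt=DEFAULT_FMT, datefmt=None):
    """Build a logger that writes to the terminal and/or a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicated handlers when called repeatedly
    formatter = logging.Formatter(fmt, datefmt=datefmt)
    if stream:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    if filename:
        handler = logging.FileHandler(filename, encoding="utf-8")
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger

logger = make_logger()
logger.info("feature selection finished")
```

Passing a filename adds a second handler, so the same message is written to both the terminal and the file.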

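scorecardpipeline.init_setting(font_path=None, seed=None, freeze_torch=False, logger=False, **kwargs)[source]

Initialize the environment: suppress warnings, adjust the pandas default options, fix the random seed and set up logging.

Parameters:

seed – random seed, default None
freeze_torch – whether to fix the pytorch environment
font_path – font used in plots; accepts a system font name or the path of a local font file; defaults to the Chinese font shipped with scorecardpipeline
logger – whether to initialize a logger, default False; when True, the logger is returned
kwargs – parameters passed to the logger initialization

Returns:

logging.Logger, returned when logger is True

scorecardpipeline.load_pickle(file, engine='joblib')[source]

Load a pickle file.

Parameters: file – path of the pickle file
Returns: the content of the pickle file

scorecardpipeline.save_pickle(obj, file, engine='joblib')[source]

Save data to a pickle file.

Parameters:

obj – data to save
file – file path

scorecardpipeline.germancredit()[source]

Load the German Credit Data dataset.

Data source: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

Returns: pd.DataFrame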

scorecardpipeline.ColorScaleRule(start_type=None, start_value=None, start_color=None, mid_type=None, mid_value=None, mid_color=None, end_type=None, end_value=None, end_color=None)[source]: Backwards compatibility

scorecardpipeline.get_column_letter(col_idx)[source]

Convert decimal column position to its ASCII (base 26) form.

Because column indices are 1-based, strides are actually pow(26, n) + 26. Hence, a correction is applied between pow(26, n) and pow(26, 2) + 26 to prevent an additional column letter being prepended.

"A" == 1 == pow(26, 0)
"Z" == 26 == pow(26, 0) + 26 // decimal equivalent 10
"AA" == 27 == pow(26, 1) + 1
"ZZ" == 702 == pow(26, 2) + 26 // decimal equivalent 100

scorecardpipeline.column_index_from_string(col)[source]

Convert ASCII column name (base 26) to decimal with 1-based index.

Characters represent descending multiples of powers of 26.

"AFZ" == 26 * pow(26, 0) + 6 * pow(26, 1) + 1 * pow(26, 2)
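The two conversions are inverse to each other and can be sketched in a few lines of plain Python. The function names below are hypothetical stand-ins, and the loop-based approach is an illustration rather than the implementation behind these helpers.

```python
def column_letter(col_idx):
    """Convert a 1-based column index to its spreadsheet letters (1 -> "A", 27 -> "AA")."""
    letters = ""
    while col_idx > 0:
        # shift to 0-based before dividing, because there is no zero digit in this system
        col_idx, remainder = divmod(col_idx - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

def column_index(col):
    """Convert spreadsheet letters back to a 1-based column index."""
    index = 0
    for ch in col.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

print(column_letter(1), column_letter(26), column_letter(27), column_letter(702))
print(column_index("AFZ"))
```

The example "AFZ" evaluates to 26 + 6 * 26 + 1 * 676 = 858, matching the expansion given above.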

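scorecardpipeline.seed_everything(seed: int, freeze_torch=False)[source]

Fix the random seed of the current environment so that later experiments are reproducible.

Parameters:

seed – random seed
freeze_torch – whether to fix the random seed of pytorch

scorecardpipeline.feature_bins(bins, decimal=4)[source]

Generate the bin intervals from the rules of a Combiner, together with the index of each interval.

Parameters:

bins – rules of the Combiner
decimal – precision kept for the interval bounds, default 4 decimal places

Returns:

dict, whose keys are the interval indices and whose values are the intervals

scorecardpipeline.feature_bin_stats(data, feature, target='target', overdue=None, dpd=None, rules=None, method='step', desc='', combiner=None, ks=True, max_n_bins=None, min_bin_size=None, max_bin_size=None, greater_is_better='auto', amount=None, empty_separate=True, return_cols=None, return_rules=False, del_grey=False, verbose=0, **kwargs)[source]

Feature bin statistics table, summarizing the metrics of each bin of a feature.

Parameters:

data – dataset for which the bin statistics table is produced
feature – name of the feature to report on
target – name of the label column in the dataset, default target
overdue – name of the overdue-days column; when overdue is passed, the target parameter is ignored
dpd – overdue definition: overdue days > DPD gives 1, otherwise 0; only used when overdue is in effect
del_grey – whether to drop samples whose overdue days fall in (0, dpd]; only used when overdue is in effect
rules – custom rules for the bin statistics table; accepts a list (rules for a single feature) or a dict (rules for several features)
combiner – a pre-fitted feature binner; lower priority than rules
method – binning method, ignored when rules or combiner is passed; options are "chi", "dt", "quantile", "step", "kmeans", "cart", "mdlp", "uniform"; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
desc – feature description, mostly used to pass the feature's human-readable name or meaning
ks – whether to compute KS statistics
max_n_bins – maximum number of bins, default None (no limit); 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
min_bin_size – minimum share of samples per final leaf node (i.e. per bin) when optbinning is used, default 5%
max_bin_size – maximum share of samples per final leaf node (i.e. per bin) when optbinning is used, default None
empty_separate – whether missing values get a bin of their own, default True (recommended)
return_cols – list, restricts the returned table to the given columns, default None
return_rules – whether to also return the binning rules, default False
greater_is_better – whether larger values are better, default "auto", which infers the direction from the lift of the last two bins; options are True, False, "auto"
amount – default None; accepts a numeric column (typically the loan amount), in which case amount-based results are also output when analysing overdue rates
kwargs – other parameters of scorecardpipeline.processing.Combiner

Returns:

Feature bin statistics table: pd.DataFrame
Feature binning rules: list, returned when return_rules is True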

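scorecardpipeline.feature_efficiency_analysis(data, feature, overdue=['MOB1'], dpd=[7, 3, 0], greater_is_better='auto', verbose=True, ks=False, **kwargs)[source]

scorecardpipeline.extract_feature_bin(bin_var)[source]

Extract the lower and upper bounds of a bin from a single interval.

Parameters: bin_var – interval string
Returns: list or tuple

scorecardpipeline.inverse_feature_bins(feature_table, bin_col='分箱')[source]

Derive the rules of a Combiner from a variable bin table.

Parameters:

feature_table – variable bin table
bin_col – name of the bin column in the variable bin table, default 分箱

Returns:

list

scorecardpipeline.sample_lift_transformer(df, rule, target='target', sample_rate=0.7)[source]

When good and bad samples are sampled at a ratio of sample_rate:1, compute the lift on the sampled data and on the original data.

Parameters:

df – original data, in which all variables must be numeric
rule – Rule
target – name of the target variable
sample_rate – sampling ratio of the good samples

Returns:

lift_sam: float, lift of the rejected population on the sampled data
lift_ori: float, lift of the rejected population on the original data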

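scorecardpipeline.feature_describe(data, feature=None, percentiles=None, missing=None, cardinality=None)[source]

scorecardpipeline.groupby_feature_describe(data, by=None, n_jobs=-1, **kwargs)[source]

scorecardpipeline.bin_plot(feature_table, desc='', figsize=(10, 6), colors=['#2639E9', '#F76E6C', '#FE7715'], save=None, anchor=0.935, max_len=35, fontdict={'color': '#000000'}, hatch=True, ending='分箱图')[source]

Simple strategy mining: feature bin plot.

Parameters:

feature_table – statistics table of the feature bins, produced by feature_bin_stats
desc – human-readable meaning of the feature or other related information
figsize – figure size as a tuple, default (10, 6)
colors – theme colors of the figure; the default is fine
save – path where the image is saved; missing folders in the path are created, default None
anchor – position of the legend in the figure, usually around 0.95; adjust it according to the gap between the title and the legend
max_len – maximum displayed length of a bin label, which prevents long labels of categorical bins from shrinking the plot area, default 35 characters
fontdict – format of the text on the bars, see https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
hatch – whether the bars are drawn with hatching, default True
ending – suffix of the plot title; the title has the form f'{desc}{ending}'

Returns:

Figure

scorecardpipeline.corr_plot(data, figure_size=(16, 8), fontsize=16, mask=False, save=None, annot=True, max_len=35, linewidths=0.1, fmt='.2f', step=11, linecolor='white', **kwargs)[source]

Feature correlation plot.

Parameters:

data – original data
figure_size – figure size, default (16, 8)
fontsize – font size, default 16
mask – whether to show only the lower triangle, default False
save – path where the image is saved; missing folders in the path are created, default None
annot – whether to show the correlation values in the plot, default True
max_len – maximum displayed length of a feature name, which prevents long names from shrinking the plot area, default 35; pass None for no limit
fmt – number format, effective when annot is True; two decimal places by default
step – number of color steps, centered on 0; default 2 (symmetric about 0) * 5 (five color levels) + 1 (a separate level for 0) = 11
linewidths – width of the lines between cells, default 0.1; when set to None, no lines are drawn
linecolor – color of the lines, effective when linewidths is greater than 0, default white
kwargs – other parameters of sns.heatmap, see: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Returns:

Figure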

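scorecardpipeline.ks_plot(score, target, title='', fontsize=14, figsize=(16, 8), save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], anchor=0.945)[source]

KS curve & ROC curve of a numeric feature

Parameters:

score – numeric feature, usually the scorecard score
target – label values
title – figure title
fontsize – font size, default 14
figsize – figure size, default (16, 8)
save – path where the figure is saved; folders missing from the path are created; default None
colors – theme colors of the figure; the default is usually fine
anchor – legend position, default 0.945; adjust it in a small range around 0.95 according to how the figure renders

Returns: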

Figure
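
The KS statistic that such a curve visualizes can be sketched in plain Python. This is an illustrative sketch only, not the code behind ks_plot: the helper name ks_statistic and the sample data are hypothetical, and it assumes binary labels where 1 marks a bad sample.

```python
def ks_statistic(scores, targets):
    """Largest gap between the cumulative bad share and the cumulative good share."""
    total_bad = sum(targets)
    total_good = len(targets) - total_bad
    pairs = sorted(zip(scores, targets))
    cum_bad = cum_good = 0
    ks = 0.0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Only compare the two curves once all samples sharing this score are counted.
        last_of_score = i == len(pairs) - 1 or pairs[i + 1][0] != score
        if last_of_score:
            ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
    return ks


scores = [300, 350, 400, 450, 500, 550, 600, 650]
targets = [1, 1, 1, 0, 1, 0, 0, 0]
print(ks_statistic(scores, targets))
```

A score that separates the two classes perfectly gives a KS of 1.0, while a score unrelated to the label gives a value close to 0.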

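scorecardpipeline.hist_plot(score, y_true=None, figsize=(15, 10), bins=30, save=None, labels=['好样本', '坏样本'], desc='', anchor=1.11, fontsize=14, kde=False, **kwargs)[source]

Distribution plot of a numeric feature

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
figsize – figure size, default (15, 10)
bins – number of bins, default 30
save – path where the figure is saved; folders missing from the path are created; default None
labels – dict or list, class names shown in the legend, default ['好样本', '坏样本'] (good samples, bad samples), matched to the target values in order, starting from 0
anchor – legend position, default 1.11; adjust it in a small range around 1.1 according to how the figure renders
fontsize – font size, default 14
kwargs – other arguments of sns.histplot, see: https://seaborn.pydata.org/generated/seaborn.histplot.html

Returns: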

Figure

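scorecardpipeline.psi_plot(expected, actual, labels=['预期', '实际'], desc='', save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], figsize=(15, 8), anchor=0.94, width=0.35, result=False, plot=True, max_len=None, hatch=True)[source]

Feature PSI plot

Parameters:

expected – expected distribution; pass the bin table of the feature to be validated
actual – actual distribution; pass the bin table of the feature used as the reference
labels – names of the expected and actual distributions, default ['预期', '实际'] (expected, actual)
desc – name shown as the title prefix, empty by default; the feature name or scorecard name is recommended
save – path where the figure is saved; folders missing from the path are created; default None
colors – theme colors of the figure; the default is usually fine
figsize – figure size, default (15, 8)
anchor – legend position, default 0.94; adjust it in a small range around 0.94 according to how the figure renders
width – spacing between the expected and actual bars, default 0.35
result – whether to return the PSI statistics table, default False
plot – whether to draw the PSI plot, default True
max_len – maximum displayed length of a feature name, which keeps long names from shrinking the plot area; default None, meaning no limit
hatch – whether the bars are drawn with hatch lines, default True

Returns:

pd.DataFrame, returned when result is True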
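
For readers unfamiliar with the metric, the commonly used PSI formula can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of psi_plot: the function name psi, the eps floor for empty bins, and the sample shares are all hypothetical.

```python
import math


def psi(expected_shares, actual_shares, eps=1e-6):
    """Population stability index: sum of (a - e) * ln(a / e) over the bins."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        # Floor the shares so that an empty bin does not produce log(0).
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


expected = [0.25, 0.25, 0.25, 0.25]  # share of samples per bin, expected data
actual = [0.20, 0.30, 0.25, 0.25]    # share of samples per bin, actual data
print(round(psi(expected, actual), 6))
```

Identical distributions give a PSI of 0; the value grows as the two distributions drift apart.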

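scorecardpipeline.csi_plot(expected, actual, score_bins, labels=['预期', '实际'], desc='', save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], figsize=(15, 8), anchor=0.94, width=0.35, result=False, plot=True, max_len=None, hatch=True)[source]

Feature CSI plot

Parameters:

expected – expected distribution; pass the bin table of the feature to be validated
actual – actual distribution; pass the bin table of the feature used as the reference
score_bins – score table of the logistic regression model
labels – names of the expected and actual distributions, default ['预期', '实际'] (expected, actual)
desc – name shown as the title prefix, empty by default; the feature name or scorecard name is recommended
save – path where the figure is saved; folders missing from the path are created; default None
colors – theme colors of the figure; the default is usually fine
figsize – figure size, default (15, 8)
anchor – legend position, default 0.94; adjust it in a small range around 0.94 according to how the figure renders
width – spacing between the expected and actual bars, default 0.35
result – whether to return the CSI statistics table, default False
plot – whether to draw the CSI plot, default True
max_len – maximum displayed length of a feature name, which keeps long names from shrinking the plot area; default None, meaning no limit
hatch – whether the bars are drawn with hatch lines, default True

Returns:

pd.DataFrame, returned when result is True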
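
A common textbook definition of the characteristic stability index weights the shift in each bin by the score points assigned to that bin. The sketch below follows that definition as an assumption; the source does not state the exact formula used by csi_plot, and the function name csi and the numbers are hypothetical.

```python
def csi(expected_shares, actual_shares, bin_scores):
    """Characteristic stability index: sum of (a - e) * score over the bins."""
    return sum(
        (a - e) * score
        for e, a, score in zip(expected_shares, actual_shares, bin_scores)
    )


expected = [0.3, 0.5, 0.2]  # share of samples per bin, expected data
actual = [0.2, 0.5, 0.3]    # share of samples per bin, actual data
bin_scores = [10, 25, 40]   # scorecard points assigned to each bin
print(csi(expected, actual, bin_scores))
```

Under this definition the result reads as the average change in score contributed by the feature: here the population moved from the 10-point bin to the 40-point bin, so the value is positive.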

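scorecardpipeline.dataframe_plot(df, row_height=0.4, font_size=14, header_color='#2639E9', row_colors=['#dae3f3', 'w'], edge_color='w', bbox=[0, 0, 1, 1], header_columns=0, ax=None, save=None, **kwargs)[source]

Render a DataFrame as an image; recommended for data with few rows and columns

Parameters:

df – the DataFrame to draw
row_height – row height, default 0.4
font_size – font size, default 14
header_color – header color, default #2639E9
row_colors – row colors, default ['#dae3f3', 'w']; the two colors alternate
edge_color – table border color, default white
bbox – which edges are shown, [left, right, top, bottom], i.e. only the top and bottom borders are shown
header_columns – number of header rows; by default there is a single header row, i.e. 0
ax – to draw inside a subplot of an existing canvas, pass the corresponding ax
save – path where the figure is saved; folders missing from the path are created; default None
kwargs – arguments of plt.table, see: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.table.html

Returns: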

Figure

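scorecardpipeline.distribution_plot(data, date='date', target='target', save=None, figsize=(10, 6), colors=['#2639E9', '#F76E6C', '#FE7715'], freq='M', anchor=0.94, result=False, hatch=True)[source]

Sample distribution over time

Parameters:

data – the dataset
date – name of the date column; values that are not dates are converted to dates when possible; default date; replace it with the date column in your data (application time, credit-granting time, disbursement time, etc.)
target – name of the label column in the dataset, default target
save – path where the figure is saved; folders missing from the path are created; default None
figsize – figure size, default (10, 6)
colors – theme colors of the figure; the default is usually fine
freq – date frequency used for aggregation (year, quarter, month, week, day, etc.), see: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
anchor – legend position, default 0.94; adjust it in a small range around 0.94 according to how the figure renders
result – whether to return the distribution table, default False
hatch – whether the bars are drawn with hatch lines, default True

Returns: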

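class scorecardpipeline.ExcelWriter(style_excel=None, style_sheet_name='初始化', mode='replace', fontsize=10, font='楷体', theme_color='2639E9', opacity=0.85, system=None)[source]

Bases: object

__init__(style_excel=None, style_sheet_name='初始化', mode='replace', fontsize=10, font='楷体', theme_color='2639E9', opacity=0.85, system=None)[source]

Excel writer

Parameters:

style_excel – style template file; defaults to template.xlsx under the installed package path; adjust it if the path changes
style_sheet_name – name of the sheet in the template file that holds the initial styles; the default is usually fine
mode – write mode, default replace, options: replace, append; in append mode the content of the existing Excel file is copied into the new file
fontsize – font size of the content inserted into the Excel file, default 10
font – font of the content inserted into the Excel file, default 楷体 (KaiTi)
theme_color – theme color, default 2639E9; note that it does not include #
system – operating system the Excel report is adapted to, default mac, options: windows, linux; when set to windows, picture sizes are re-adapted
opacity – opacity of the theme color used as fill when writing a DataFrame, default 0.85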

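add_conditional_formatting(worksheet, start_space, end_space)[source]

Set conditional formatting

Parameters:

worksheet – the sheet to apply conditional formatting to
start_space – start cell position
end_space – end cell position

static set_column_width(worksheet, column, width)[source]

Adjust the width of an Excel column

Parameters:

worksheet – the sheet whose column width is adjusted
column – the column, given as an index or a letter
width – width to set for the column

static set_number_format(worksheet, space, _format)[source]

Set the number display format

Parameters:

worksheet – the sheet whose number display format is adjusted
space – cell range
_format – display format, see openpyxl

set_freeze_panes(worksheet, space)[source]

Freeze panes

Parameters:

worksheet – the sheet whose panes are frozen
space – cell position

get_sheet_by_name(name)[source]

Get the sheet named name; if it does not exist, copy a sheet named name from the initial template file

Parameters: name – name of the sheet to get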

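move_sheet(worksheet, offset: int = 0, index: int | None = None)[source]

Move a sheet

Parameters:

worksheet – the sheet to move, given as a string or a Worksheet
offset – relative position to move by, default 0; ignored when index is passed
index – target position index; values beyond the end move the sheet to the last position

insert_hyperlink2sheet(worksheet, insert_space, hyperlink=None, file=None, sheet=None, target_space=None)[source]

Insert a hyperlink into a cell of a sheet

Parameters:

worksheet – the sheet to insert the hyperlink into
insert_space – cell where the hyperlink is inserted, either "B2" or (2, 2); the first cell is (1, 1)
hyperlink – hyperlink address, mutually exclusive with target_space (hyperlink takes precedence); supported formats: [f"file://{file path} - #{sheet name}!{cell}", f"#{sheet name}!{cell}", f"{cell}"], where cell is written like "B2"
file – file path of the hyperlink target, default None, i.e. the current Excel file; ignored when hyperlink is passed; when passing file, make sure sheet is passed too, otherwise the current sheet is used
sheet – sheet name of the hyperlink target, default None, i.e. the current sheet; ignored when hyperlink is passed
target_space – cell of the hyperlink target, default None, either "B2" or (2, 2); ignored when hyperlink is passed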

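insert_value2sheet(worksheet, insert_space, value='', style='content', auto_width=False, end_space=None, align: dict | None = None, max_col_width=50)[source]

Insert content with a given style into a cell of a sheet

Parameters:

worksheet – the sheet to insert content into
insert_space – cell where the content is inserted, either "B2" or (2, 2)
value – the content to insert
style – the style to render, see the styles set up in init_style
end_space – to merge cells, pass the cell where the merge ends, either "B2" or (2, 2)
auto_width – whether to adjust the column width automatically
align – text alignment, see: Alignment
max_col_width – maximum column width, default 50

Returns:

The position after the last column and after the last row of the inserted element

insert_pic2sheet(worksheet, fig, insert_space, figsize=(600, 250))[source]

Insert an image into Excel

Parameters:

worksheet – the sheet to insert content into
fig – path of the image to insert
insert_space – start cell of the inserted image
figsize – image size

Returns:

The position after the last column and after the last row of the inserted element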

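insert_rows(worksheet, row, row_index, col_index, merge_rows=None, style='', auto_width=False, style_only=False, multi_levels=False)[source]

Insert one row of data into Excel; insert_df2sheet relies on this method

Parameters:

worksheet – the sheet to insert content into
row – the row data
row_index – row index of the inserted data, used to decide which border style to use
col_index – column index of the inserted data, used to decide which border style to use
merge_rows – row indexes whose cells need to be merged
style – Excel style of the inserted data
auto_width – whether to adjust the column width automatically; doing so changes the column's style template, so the default white fill of non-content columns no longer applies
style_only – whether to use the fill style
multi_levels – whether the data has a multi-level index or multi-level columns

insert_df2sheet(worksheet, data, insert_space, merge_column=None, header=True, index=False, auto_width=False, fill=False, merge=False, merge_index=True)[source]

Insert a DataFrame with the specified style into an Excel file

Parameters:

worksheet – the sheet to insert content into
data – the DataFrame to insert
insert_space – start cell of the inserted content
merge_column – columns to display as groups, given as indexes or column names; the data must be sorted beforehand, because from 0.1.33 ExcelWriter no longer sorts it automatically
header – whether to write the DataFrame header; multi-level headers and multi-level indexes are supported since 0.1.30
index – whether to write the DataFrame index
merge_index – when the index is written, whether consecutive identical values at the same index level are merged; default True, i.e. merged
auto_width – whether to adjust the column width automatically; doing so changes the column's style template, so the default white fill of non-content columns no longer applies
fill – whether to use color fill instead of borders; when fill is True, merge_column has no effect
merge – whether to merge cells, used together with merge_column; in the current version this only works when merge_column contains a single column

Returns:

The position after the last column and after the last row of the inserted element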

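merge_cells(worksheet, start, end)[source]

Merge cells in the same column and merge their styles accordingly

Parameters:

worksheet – the sheet whose cells are merged
start – position where the merged cells start
end – position where the merged cells end

Returns:

static check_contain_chinese(check_str)[source]

Check whether a string contains Chinese characters

Parameters: check_str – the string to check
Returns: a list<bool> marking whether each character is Chinese, the number of English characters, and the number of Chinese characters

static astype_insertvalue(value, decimal_point=4)[source]

Format content before it is stored in Excel: floats are saved with the configured precision, categorical or other special types are converted to strings, and anything else is stored as-is

Parameters:

value – the content to insert into Excel
decimal_point – precision to keep for floats, default 4 decimal places

Returns:

The formatted content to store in Excel

static calc_continuous_cnt(list_, index_=0)[source]

Given a list, count how many times the element at a given index repeats consecutively, starting from that index

Parameters:

list_ – the list to scan
index_ – index of the element

Returns:

Element value, index, number of consecutive occurrences

Examples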

>>> calc_continuous_cnt = ExcelWriter.calc_continuous_cnt
>>> list_ = ['A','A','A','A','B','C','C','D','D','D']
>>> calc_continuous_cnt(list_, 0)
('A', 0, 4)
>>> calc_continuous_cnt(list_, 4)
('B', 4, 1)
>>> calc_continuous_cnt(list_, 6)
('C', 6, 1)
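
The behaviour shown in the examples above can be reproduced with a short standalone function. This is an illustrative sketch that matches the documented outputs, not the library's own code; the name count_run is hypothetical.

```python
def count_run(values, start=0):
    """Return (value, start, run_length) for the run that begins at `start`."""
    value = values[start]
    length = 0
    # Walk forward from `start` until the value changes.
    for item in values[start:]:
        if item != value:
            break
        length += 1
    return value, start, length


list_ = ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D']
print(count_run(list_, 0))
print(count_run(list_, 5))
```

Note that the count starts at the given index and only looks forward, which is why index 6 (the second 'C') yields a run of 1 while index 5 yields a run of 2.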

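static itlubber_border(border, color, white=False)[source]

itlubber's border style generator

Parameters:

border – border style; if the input has length 3, [left, right, bottom] is generated; if it has length 4, [left, right, bottom, top] is generated
color – border color
white – whether to show a white border

Returns:

Border

static get_cell_space(space)[source]

Convert a cell position given in one format into the other format of Excel cell addressing

Parameters: space – the Excel cell position, in one of two formats, "B1" or (2, 2)
Returns: the cell position, tuple / str

Examples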

>>> get_cell_space = ExcelWriter.get_cell_space
>>> get_cell_space("B3")
(2, 3)
>>> get_cell_space((2, 2))
'B2'

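static calculate_rgba_color(hex_color, opacity, prefix='#')[source]

Compute the color that a given color takes at a given opacity

Parameters:

hex_color – color value in hex format
opacity – opacity, a value in [0, 1]
prefix – prefix of the returned color

Returns:

The color that corresponds to the given opacity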
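
One plausible way to turn a color plus an opacity into an opaque hex color is to blend it against a white background. The sketch below does exactly that and is an assumption, not the library's implementation: the function name blend_with_white is hypothetical, and here alpha is taken as the weight of the color itself, which may differ from how calculate_rgba_color interprets opacity.

```python
def blend_with_white(hex_color, alpha, prefix="#"):
    """Blend a hex color with white; alpha=1.0 keeps the color, alpha=0.0 gives white."""
    hex_color = hex_color.lstrip("#")
    # Split "RRGGBB" into three integer channels.
    channels = [int(hex_color[i:i + 2], 16) for i in (0, 2, 4)]
    blended = [int(alpha * c + (1 - alpha) * 255) for c in channels]
    return prefix + "".join(f"{c:02X}" for c in blended)


print(blend_with_white("2639E9", 1.0))
print(blend_with_white("2639E9", 0.0))
print(blend_with_white("000000", 0.5))
```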

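init_style(font, fontsize, theme_color)[source]

Initialize the cell styles

Parameters:

font – font name
fontsize – font size
theme_color – theme color

save(filename, close=True)[source]

Save the Excel file

Parameters:

filename – path where the Excel file is saved
close – whether to release the writer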

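scorecardpipeline.dataframe2excel(data, excel_writer, sheet_name=None, title=None, header=True, theme_color='2639E9', condition_color=None, fill=True, percent_cols=None, condition_cols=None, custom_cols=None, custom_format='#,##0', color_cols=None, percent_rows=None, condition_rows=None, custom_rows=None, color_rows=None, start_col=2, start_row=2, mode='replace', figures=None, figsize=(600, 350), writer_params={}, **kwargs)[source]

Insert a DataFrame with the specified style into an Excel file

Parameters:

data – the DataFrame to save; the index is not saved by default; to save it, call .reset_index().rename(columns={"index": "index name"}) first; for some indexes the column produced by reset_index is named 0 rather than index, so adjust as needed
excel_writer – path of the Excel file to save to, or an ExcelWriter
sheet_name – the sheet to insert content into; if it is a Worksheet, data is inserted into that Worksheet directly
title – title inserted before the DataFrame, if any
figures – images to insert between the title and the data table; several image paths can be passed at once and are inserted in the given order
figsize – size of the inserted images; to keep the layout consistent only a single size is supported; default: (600, 350) (width, height)
header – whether to write the DataFrame header; multi-level headers are not supported yet
theme_color – theme color
condition_color – theme color of the conditional formatting; defaults to theme_color when not passed
fill – whether to use the cell color fill style rather than the border style
percent_cols – columns to display as percentages; only the display format changes, not the values
condition_cols – columns to display with conditional formatting (borderless gradient data bars)
color_cols – columns to display with conditional-format color fill (gradient cell fill)
custom_cols – columns to display with a custom format, used together with custom_format
custom_format – the custom display format, used together with custom_cols; default #,##0, i.e. integers with thousands separators
start_col – start column in Excel, default 2, i.e. the second column
start_row – start row in Excel, default 2, i.e. the second row; if title is set, the DataFrame is inserted starting from row start_row + 2
mode – Excel write mode, options append and replace, default replace; with append, content is added to the existing Excel file without overwriting what is already there
writer_params – arguments passed through to ExcelWriter
**kwargs –
other arguments, passed through to insert_df2sheet; for example, auto_width=True adjusts column widths to the content

Returns:

The position after the last column and after the last row of the inserted element

Examples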

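>>> import numpy as np
>>> import pandas as pd
>>> from openpyxl.utils import column_index_from_string
>>> from scorecardpipeline import ExcelWriter, dataframe2excel
>>> writer = ExcelWriter(theme_color='3f1dba')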
>>> worksheet = writer.get_sheet_by_name("模型报告")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B2", value="模型报告", style="header")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B4", value="模型报告", style="header", end_space="D4")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B6", value="当前模型主要为评分卡模型", style="header_middle", auto_width=True)
>>> # Example: saving a DataFrame with a single-level index
>>> sample = pd.DataFrame(np.concatenate([np.random.random_sample((10, 10)) * 40, np.random.randint(0, 3, (10, 2))], axis=1), columns=[f"B{i}" for i in range(10)] + ["target", "type"])
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")))
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), fill=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), fill=True, header=False, index=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column="target")
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample.set_index("type"), (end_row + 2, column_index_from_string("B")), merge_column="target", index=True, fill=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column=["target", "type"])
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column=[10, 11])
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"])
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"], title="测试样例")
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"], title="测试样例", figures=["../examples/model_report/auto_report_corr_plot.png"])
>>> # Example: saving a DataFrame with a multi-level index
>>> multi_sample = pd.DataFrame(np.random.randint(0, 150, size=(8, 12)), columns=pd.MultiIndex.from_product([['模拟考', '正式考'], ['数学', '语文', '英语', '物理', '化学', '生物']]), index=pd.MultiIndex.from_product([['期中', '期末'], ['雷军', '李斌'], ['测试一', '测试二']]))
>>> multi_sample.index.names = ["考试类型", "姓名", "测试"]
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, header=False)
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True)
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, fill=False)
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=False, fill=False, merge_column=[('', '考试类型'), ('', '姓名')])
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=False, fill=False, merge_column=[('', '考试类型')], merge=True)
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, fill=True, merge_column=[('', '考试类型')], merge=True)
>>> writer.save("./测试样例.xlsx")

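scorecardpipeline.auto_eda_sweetviz(all_data, target=None, save='model_report/auto_eda.html', pairwise=True, labels=None, exclude=None, num_features=None, cat_features=None, text_features=None)[source]

Run automatic EDA and produce an analysis report for datasets with a small number of rows and features

Parameters:

all_data – the dataset to run EDA on
target – target variable; only integer or boolean targets are supported
labels – when target is passed, the names of the datasets being compared [true, false]; when target is not passed, the name of the dataset
save – path where the report is stored, with the .html suffix
pairwise – whether to show the pairwise relationships between features
exclude – names of the features to exclude
num_features – names of the features to force-convert to numeric
cat_features – names of the features to force-convert to categorical
text_features – names of the features to force-convert to text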

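scorecardpipeline.auto_data_testing_report(data: DataFrame, features=None, target='target', overdue=None, dpd=None, date=None, data_summary_comment='', freq='M', excel_writer=None, sheet='分析报告', start_col=2, start_row=2, dropna=False, writer_params={}, bin_params={}, feature_map={}, corr=False, pictures=['bin', 'ks', 'hist'], suffix='')[source]

Automatic data testing report, used to evaluate third-party data or the performance of in-house scores

Parameters:

suffix – image name suffix that keeps same-named images from being overwritten while the Excel file has not been saved
corr – whether to evaluate the correlation among numeric variables, default False; when set to True, the correlation plot and table are output
pictures – images to include, supports ["ks", "hist", "bin"]
data – the dataset to evaluate; it must contain the target variable
features – names of the features to analyze, passed as a single string or a list
target – name of the target variable
overdue – name of the days-overdue field; when overdue is passed, the target argument is ignored
dpd – overdue definition: days overdue > DPD gives 1, otherwise 0; only used when the overdue field is in effect
date – date column, usually the borrower's application date or disbursement date; optional; when passed, the distribution of good and bad customers is output at the time granularity given by freq
freq – used together with date; the granularity of the statistics, default M, i.e. monthly
data_summary_comment – remark to fill into the data sample overview, for example "grey customers whose historical maximum days overdue fall in [0, dpd] were removed"
excel_writer – name of the Excel file to save to, or a writer
sheet – name of the sheet to save to; an existing worksheet or a string can be passed
start_col – start column
start_row – start row
dropna – whether to drop missing values or specified values when analyzing a field, default False
writer_params – initialization arguments of the Excel writer; only effective when excel_writer is a string
bin_params – arguments for the bin statistics; supports the parameters of feature_bin_stats
feature_map – feature dictionary, used to make the document more readable, default {}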
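
The dpd rule above (days overdue > DPD gives 1, otherwise 0) can be illustrated with a few lines of plain Python. This is a sketch of the stated rule only; the helper name labels_from_overdue and the sample values are hypothetical, and any handling of grey customers is left out.

```python
def labels_from_overdue(overdue_days, dpd):
    """Label a sample 1 (bad) when its days overdue exceed dpd, else 0 (good)."""
    return [1 if days > dpd else 0 for days in overdue_days]


overdue_days = [0, 2, 5, 8, 16, 29]

# One label definition per dpd threshold, mirroring a call with dpd=[15, 7, 3].
for dpd in (15, 7, 3):
    print(dpd, labels_from_overdue(overdue_days, dpd))
```

Lowering the threshold marks more samples as bad, which is why passing several dpd values lets one report compare the data under different label definitions.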

Examples

>>> import numpy as np
>>> from scorecardpipeline import *
>>>
>>> # Load the dataset and convert the labels to 0 and 1
>>> target = "creditability"
>>> data = germancredit()
>>> data[target] = data[target].map({"good": 0, "bad": 1})
>>> data["MOB1"] = [np.random.randint(0, 30) for i in range(len(data))]
>>> features = data.columns.drop([target, "MOB1"]).tolist()
>>>
>>> # Generate the testing report
>>> auto_data_testing_report(data
...                          , features=features
...                          , target=target
...                          , date=None # pass a date column name to get the good/bad sample distribution at the granularity given by freq
...                          , freq="M"
...                          , data_summary_comment="三方数据测试报告样例，支持同时评估多个不同标签定义下的数据有效性"
...                          , excel_writer="三方数据测试报告.xlsx"
...                          , sheet="分析报告"
...                          , start_col=2
...                          , start_row=2
...                          , dropna=False
...                          , writer_params={}
...                          , overdue=["MOB1"]
...                          , dpd=[15, 7, 3]
...                          , bin_params={"method": "dt", "min_bin_size": 0.05, "max_n_bins": 10, "return_cols": ["坏样本数", "坏样本占比", "坏样本率", "LIFT值", "坏账改善", "累积LIFT值", "分档KS值"]} # parameters of the feature_bin_stats function
...                          , pictures=['bin', 'ks', 'hist'] # categorical variables do not support ks and hist
...                          , corr=True
...                          )

class scorecardpipeline.NumExprDerive(derivings=None)[source]

Bases: BaseEstimator, TransformerMixin

Derive features by expressions.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from scorecardpipeline.feature_engineering import NumExprDerive
>>> X = pd.DataFrame({"f0": [2, 1.0, 3], "f1": [np.inf, 2, 3], "f2": [2, 3, 4], "f3": [2.1, 1.4, -6.2]})
>>> fd = NumExprDerive(derivings=[("f4", "where(f1>1, 0, 1)"), ("f5", "f1+f2"), ("f6", "sin(f1)"), ("f7", "abs(f3)")])
>>> fd.fit_transform(X)

__init__(derivings=None)[source]

Parameters: derivings – list, default=None. Each entry is a (name, expr) pair representing a deriving rule.

fit(X, y=None)[source]

transform(X)[source]

class scorecardpipeline.StandardScoreTransformer(base_score=660, pdo=75, rate=2, bad_rate=0.15, down_lmt=300, up_lmt=1000, greater_is_better=True, cutoff=None)[source]

Bases: BaseScoreTransformer

Stretch the predicted probability to a normally distributed score.

__init__(base_score=660, pdo=75, rate=2, bad_rate=0.15, down_lmt=300, up_lmt=1000, greater_is_better=True, cutoff=None)[source]

fit(X, y=None, **fit_params)[source]

scorecard_scale()[source]

Output the scorecard baseline information, including base_odds, base_score, rate, pdo, A and B

Returns: pd.DataFrame, the scorecard baseline information
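
How base_score, pdo, rate and bad_rate typically combine into the constants A and B can be sketched with the common scorecard scaling convention. This is an assumption, not a statement of what StandardScoreTransformer does internally: the function names scale_constants and probability_to_score are hypothetical, the sign convention (higher score means lower risk) is assumed, and clipping to down_lmt and up_lmt is left out.

```python
import math


def scale_constants(base_score=660, pdo=75, rate=2, bad_rate=0.15):
    """Derive A and B so that score = A - B * ln(odds), with odds = p / (1 - p)."""
    base_odds = bad_rate / (1 - bad_rate)
    B = pdo / math.log(rate)               # points lost each time the odds grow by `rate`
    A = base_score + B * math.log(base_odds)
    return {"base_odds": base_odds, "base_score": base_score,
            "rate": rate, "pdo": pdo, "A": A, "B": B}


def probability_to_score(p, scale):
    """Map a predicted bad probability to a score using the constants above."""
    odds = p / (1 - p)
    return scale["A"] - scale["B"] * math.log(odds)


scale = scale_constants()
print(probability_to_score(0.15, scale))  # a sample at the base bad rate
```

Under this convention a sample whose predicted probability equals bad_rate receives base_score, and doubling the odds (rate=2) lowers the score by pdo points.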

transform(X)[source]

predict(X)[source]

inverse_transform(X)[source]

class scorecardpipeline.NPRoundStandardScoreTransformer(base_score=660, pdo=75, bad_rate=0.15, down_lmt=300, up_lmt=1000, round_decimals=0, greater_is_better=True, cutoff=None)[source]

Bases: StandardScoreTransformer

__init__(base_score=660, pdo=75, bad_rate=0.15, down_lmt=300, up_lmt=1000, round_decimals=0, greater_is_better=True, cutoff=None)[source]

class scorecardpipeline.RoundStandardScoreTransformer(base_score=660, pdo=75, bad_rate=0.15, down_lmt=300, up_lmt=1000, round_decimals=0, greater_is_better=True, cutoff=None)[source]

Bases: StandardScoreTransformer

Stretch the predicted probability to a normally distributed score.

__init__(base_score=660, pdo=75, bad_rate=0.15, down_lmt=300, up_lmt=1000, round_decimals=0, greater_is_better=True, cutoff=None)[source]

class scorecardpipeline.BoxCoxScoreTransformer(down_lmt=300, up_lmt=1000, greater_is_better=True, cutoff=None)[source]

Bases: BaseScoreTransformer

__init__(down_lmt=300, up_lmt=1000, greater_is_better=True, cutoff=None)[source]

fit(X, y=None, **fit_params)[source]

transform(X)[source]

predict(X)[source]

inverse_transform(X)[source]

class scorecardpipeline.TypeSelector(dtype_include=None, dtype_exclude=None, exclude=None)[source]

Bases: SelectorMixin

__init__(dtype_include=None, dtype_exclude=None, exclude=None)[source]

fit(x: DataFrame, y=None, **fit_params)[source]

class scorecardpipeline.RegexSelector(pattern=None, exclude=None)[source]

Bases: SelectorMixin

__init__(pattern=None, exclude=None)[source]

fit(x: DataFrame, y=None, **fit_params)[source]

class scorecardpipeline.ModeSelector(threshold=0.95, exclude=None, dropna=True, n_jobs=None, **kwargs)[source]

Bases: SelectorMixin

__init__(threshold=0.95, exclude=None, dropna=True, n_jobs=None, **kwargs)[source]

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.NullSelector(threshold=0.95, missing_values=nan, exclude=None, **kwargs)[source]

Bases: SelectorMixin

__init__(threshold=0.95, missing_values=nan, exclude=None, **kwargs)[source]

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.InformationValueSelector(threshold=0.02, target='target', regularization=1.0, methods=None, n_jobs=None, combiner=None, **kwargs)[source]

Bases: SelectorMixin

__init__(threshold=0.02, target='target', regularization=1.0, methods=None, n_jobs=None, combiner=None, **kwargs)[source]

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.LiftSelector(target='target', threshold=3.0, n_jobs=None, methods=None, combiner=None, **kwargs)[source]

Bases: SelectorMixin

Feature selection via lift score.

Attributes

threshold – float. The threshold value used for feature selection.
scores_ – array-like of shape (n_features,). Lift scores of features.
select_columns – array-like
dropped – DataFrame

__init__(target='target', threshold=3.0, n_jobs=None, methods=None, combiner=None, **kwargs)[source]

Parameters:

target – name of the target column
threshold – float or str (default=3.0). Features with a lift score greater than threshold are kept.
n_jobs – int or None (default=None). Number of parallel jobs.
combiner – Combiner
methods – Combiner's methods
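
As background on the metric, the lift of a bin is usually defined as the bad rate inside the bin divided by the overall bad rate. The sketch below computes per-bin lifts under that definition; it is not the LiftSelector implementation, the name bin_lifts and the sample data are hypothetical, and how the selector reduces per-bin lifts to a single score per feature is not stated in the source.

```python
def bin_lifts(bins, targets):
    """Lift of each bin: bad rate inside the bin divided by the overall bad rate."""
    overall_bad_rate = sum(targets) / len(targets)
    stats = {}
    for bin_label, target in zip(bins, targets):
        bad, total = stats.get(bin_label, (0, 0))
        stats[bin_label] = (bad + target, total + 1)
    return {label: (bad / total) / overall_bad_rate
            for label, (bad, total) in stats.items()}


bins = ["low", "low", "low", "low", "high", "high", "high", "high"]
targets = [0, 0, 0, 1, 1, 1, 1, 0]
lifts = bin_lifts(bins, targets)
print(lifts)
```

A lift above 1 means the bin concentrates bad samples relative to the whole population; a feature with no bin far from 1 carries little signal.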

fit(x: DataFrame, y=None, **fit_params)[source]

class scorecardpipeline.VarianceSelector(threshold=0.0, exclude=None)[source]

Bases: SelectorMixin

Feature selector that removes all low-variance features.

__init__(threshold=0.0, exclude=None)[source]

fit(x, y=None)[source]

class scorecardpipeline.VIFSelector(threshold=4.0, exclude=None, missing=-1, n_jobs=None)[source]

Bases: SelectorMixin

__init__(threshold=4.0, exclude=None, missing=-1, n_jobs=None)[source]

The higher the VIF, the more severe the effect of multicollinearity. In financial risk modeling a rule of thumb applies: if VIF > 4, multicollinearity is considered present. The computation is resource-intensive, so avoid it when the data has many dimensions.

Parameters:

exclude – variables in the dataset that must be kept
threshold – threshold; a feature whose VIF is greater than threshold is removed
missing – fill value for missing values, default -1
n_jobs – number of threads

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.CorrSelector(threshold=0.7, method='pearson', weights=None, exclude=None, **kwargs)[source]

Bases: SelectorMixin

__init__(threshold=0.7, method='pearson', weights=None, exclude=None, **kwargs)[source]

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.PSISelector(threshold=0.1, cv=None, method=None, exclude=None, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)[source]

Bases: SelectorMixin

__init__(threshold=0.1, cv=None, method=None, exclude=None, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)[source]

fit(x: DataFrame, y=None, groups=None)[source]

class scorecardpipeline.NullImportanceSelector(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[source]

Bases: SelectorMixin

__init__(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[source]

fit(x: DataFrame, y=None)[source]

class scorecardpipeline.TargetPermutationSelector(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[source]

Bases: NullImportanceSelector

__init__(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[source]

class scorecardpipeline.ExhaustiveSelector(estimator, min_features=1, max_features=1, scoring='accuracy', cv=3, verbose=0, n_jobs=None, pre_dispatch='2*n_jobs')[source]

Bases: SelectorMixin, MetaEstimatorMixin

Exhaustive Feature Selection for Classification and Regression.

Attributes

subset_info – list of dicts. A list of dictionaries with the following keys: 'support_mask', the mask array of the selected features; 'cv_scores', the cross-validation scores.
support_mask – array-like of booleans. Array of final chosen features
best_idx – array-like, shape = [n_predictions]. Feature Indices of the selected feature subsets.
best_score – float. Cross validation average score of the selected subset.
best_feature_indices – array-like, shape = (n_features,), Feature indices of the selected feature subsets.

Examples

>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.datasets import load_iris
>>> from scorecardpipeline.feature_selection import ExhaustiveSelector
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> knn = KNeighborsClassifier(n_neighbors=3)
>>> efs = ExhaustiveSelector(knn, min_features=1, max_features=4, cv=3)
>>> efs.fit(X, y)
ExhaustiveFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3), max_features=4)
>>> efs.best_score_
0.9733333333333333
>>> efs.best_idx_
12

__init__(estimator, min_features=1, max_features=1, scoring='accuracy', cv=3, verbose=0, n_jobs=None, pre_dispatch='2*n_jobs')[source]

Parameters:

estimator – scikit-learn classifier or regressor
min_features – int (default: 1). Minimum number of features to select
max_features – int (default: 1). Maximum number of features to select
verbose – int (default: 0). Prints progress as the number of epochs to stdout.
scoring – str (default: 'accuracy'). Scoring metric in {'accuracy', 'f1', 'precision', 'recall', 'roc_auc'} for classifiers, {'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'r2'} for regressors, or a callable object or function with signature scorer(estimator, X, y).
cv – int (default: 3). Scikit-learn cross-validation generator or int. If estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed, and regular k-fold cross-validation otherwise. No cross-validation if cv is None, False, or 0.
n_jobs – int or None (default: None). The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch – int, or string (default: '2*n_jobs'). Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1.

static ncr(n, r)[source]

Return the number of combinations of length r from n items.

Parameters:

n – int, Total number of items
r – int, Number of items to select from n

Returns:

Number of combinations, integer
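
To see why exhaustive search grows quickly, the number of candidate subsets can be counted with the standard library. This is a sketch using math.comb (Python 3.8+) and itertools, not the library's ncr; the helper name count_subsets is hypothetical.

```python
import math
from itertools import combinations


def count_subsets(n_features, min_features, max_features):
    """Number of feature subsets whose size lies in [min_features, max_features]."""
    return sum(math.comb(n_features, r)
               for r in range(min_features, max_features + 1))


# Four features (as in the iris example above), subsets of size 1 to 4.
print(count_subsets(4, 1, 4))

# The same subsets, enumerated explicitly as tuples of column indexes.
subsets = [c for r in range(1, 5) for c in combinations(range(4), r)]
print(len(subsets))
```

With 30 features and subsets of size 1 to 5 the count already exceeds 170,000, each of which needs its own cross-validated fit, so small min_features and max_features ranges are advisable.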

fit(X, y, groups=None, **fit_params)[source]

Perform feature selection and learn model from training data.

Parameters:

X – array-like of shape (n_samples, n_features)
y – array-like of shape (n_samples, ), Target values.
groups – array-like of shape (n_samples,), Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params – dict, Parameters to pass to the fit method of classifier

Returns:

ExhaustiveFeatureSelector

scorecardpipeline.processing

@Time : 2023/05/21 16:23 @Author : itlubber @Site : itlubber.art

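scorecardpipeline.processing.drop_identical(frame, threshold=0.95, return_drop=False, exclude=None, target=None)[source]

Remove features whose most frequent value accounts for too large a share of the dataset

Parameters:

frame – the dataset to screen for features dominated by a single value
threshold – threshold for the share of a single value; features above it are removed
return_drop – whether to return information about the removed features, default False
exclude – features excluded from the single-value check, default None
target – name of the target column in the dataset, default None, i.e. the dataset contains no target

Returns:

Filtered dataset: pd.DataFrame, the dataset without the features dominated by a single value
Removed features: list / np.ndarray, returned when return_drop is True; the list of removed features dominated by a single value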
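
The single-value check can be illustrated without pandas. The sketch below flags columns whose most frequent value exceeds the threshold; it is a simplified illustration, not the library's code: the names top_value_share and find_identical are hypothetical, columns are plain lists, and missing values are treated like any other value.

```python
from collections import Counter


def top_value_share(values):
    """Share of the most frequent value in a column."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def find_identical(columns, threshold=0.95, exclude=None):
    """Names of columns whose most frequent value exceeds the threshold."""
    exclude = set(exclude or [])
    return [name for name, values in columns.items()
            if name not in exclude and top_value_share(values) > threshold]


columns = {
    "constant_like": [1] * 97 + [0] * 3,  # 97% of the rows share one value
    "balanced": [0, 1] * 50,              # 50% / 50%
}
print(find_identical(columns))
```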

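scorecardpipeline.processing.drop_corr(frame, target=None, threshold=0.7, by='IV', return_drop=False, exclude=None)[source]

Remove features that are too highly correlated with other features in the dataset

Parameters:

frame – the dataset to screen for highly correlated features
target – name of the target column in the dataset, default None
threshold – correlation threshold; features above it are removed
by – metric that decides which feature is removed; when the correlation of two features exceeds the threshold, the feature with the larger metric is kept; IV is used by default
return_drop – whether to return information about the removed features, default False
exclude – features excluded from the correlation check, default None

Returns:

Filtered dataset: pd.DataFrame, the dataset without the highly correlated features
Removed features: list, returned when return_drop is True; the list of removed highly correlated features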

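scorecardpipeline.processing.select(frame, target='target', empty=0.95, iv=0.02, corr=0.7, identical=0.95, return_drop=False, exclude=None)[source]

Select features by missing rate, IV, correlation and single-value share, and return the dataset after removal together with information about the removed features

Parameters:

frame – the dataset to run feature selection on
target – name of the target column in the dataset, default target
empty – missing-rate threshold; features above it are removed
iv – IV threshold; features below it are removed
corr – correlation threshold; above it, the feature with the smaller IV is removed
identical – threshold for the share of a single value; features above it are removed
return_drop – whether to return information about the removed features, default False
exclude – features excluded from feature selection, default None

Returns:

Filtered dataset: pd.DataFrame, the dataset after feature selection
Removed feature information: dict, returned when return_drop is True; information about the removed features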
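
The IV threshold refers to the information value of a binned feature. A minimal sketch of the standard formula follows; it is not the library's implementation, the function name information_value and the counts are hypothetical, and it assumes every bin contains at least one good and one bad sample.

```python
import math


def information_value(good_counts, bad_counts):
    """IV = sum over bins of (bad_share - good_share) * ln(bad_share / good_share)."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    iv = 0.0
    for good, bad in zip(good_counts, bad_counts):
        good_share = good / total_good
        bad_share = bad / total_bad
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv


good_counts = [400, 300, 300]  # good samples in each bin
bad_counts = [20, 30, 50]      # bad samples in each bin
iv = information_value(good_counts, bad_counts)
print(round(iv, 4), iv >= 0.02)  # compared against the default iv threshold
```

Each term is non-negative, so bins where the good and bad shares coincide contribute nothing and the total grows with how differently the two classes spread over the bins.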

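class scorecardpipeline.processing.FeatureSelection(target='target', empty=0.95, iv=0.02, corr=0.7, exclude=None, return_drop=True, identical=0.95, remove=None, engine='scorecardpy', target_rm=False)[source]

Bases: TransformerMixin, BaseEstimator

__init__(target='target', empty=0.95, iv=0.02, corr=0.7, exclude=None, return_drop=True, identical=0.95, remove=None, engine='scorecardpy', target_rm=False)[source]

Feature selection

Parameters:

target – name of the label in the dataset, default target
empty – missing rate, default 0.95, i.e. features with more than 95% missing values are removed
iv – IV value, default 0.02, i.e. features with an IV below 0.02 are removed
corr – correlation, default 0.7, i.e. when the correlation between features exceeds 0.7 the feature with the smaller IV is removed
identical – share of a single value, default 0.95, i.e. a feature is removed when one of its values accounts for more than 95%
exclude – features that must be kept
return_drop – whether to return removal information, default True, i.e. information about the removed features is returned by default
remove – when the engine is scorecardpy, variables that must be removed can be passed here
engine – engine used for feature selection, options "toad" and "scorecardpy", default scorecardpy
target_rm – whether to remove the label, default False, i.e. the label is kept

fit(x, y=None)[source]

Fit the feature selection

Parameters: x – the dataset; it must contain the target variable
Returns: the fitted FeatureSelection

transform(x, y=None)[source]

Feature selection transformer

Parameters: x – the dataset to run feature selection on
Returns: pd.DataFrame, the dataset after feature selection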

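class scorecardpipeline.processing.StepwiseSelection(target='target', estimator='ols', direction='both', criterion='aic', max_iter=None, return_drop=True, exclude=None, intercept=True, p_value_enter=0.2, p_remove=0.01, p_enter=0.01, target_rm=False)[source]

Bases: TransformerMixin, BaseEstimator

Stepwise regression feature selection

Parameters:

target – name of the label in the dataset, default target
estimator – estimator, default ols, options "ols", "lr", "lasso", "ridge"; the default is usually fine
direction – direction of the stepwise regression, default both, options "forward", "backward", "both"; the default is usually fine
criterion – evaluation criterion, default aic, options "aic", "bic", "ks", "auc"; the default is usually fine
max_iter – maximum number of iterations, a parameter used in sklearn, default None
return_drop – whether to return information about the removed features, default True
exclude – features that must be kept
intercept – whether to include an intercept, default True
p_value_enter – p-value for feature entry, used in forward selection to decide whether a feature enters the model
p_remove – p-value for feature removal, used in backward elimination to decide whether a feature is removed
p_enter – feature p-value, used in bidirectional stepwise regression to decide whether a feature is removed or admitted
target_rm – whether to remove the label from the dataset, default False, i.e. the label is kept

fit(x, y=None)[source]

Fit the stepwise regression feature selection

Parameters: x – the dataset; it must contain the target variable
Returns: the fitted StepwiseSelection

transform(x, y=None)[source]

Stepwise regression feature selection transformer

Parameters: x – the dataset to run feature selection on
Returns: pd.DataFrame, the dataset after feature selection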

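class scorecardpipeline.processing.FeatureImportanceSelector(top_k=126, target='target', selector='catboost', params=None, max_iv=None)[source]

Bases: BaseEstimator, TransformerMixin

__init__(top_k=126, target='target', selector='catboost', params=None, max_iv=None)[source]

Feature selection based on feature importance

Parameters:

top_k – features are ranked by importance and the top_k most important ones are kept
target – name of the label in the dataset, default target
selector – feature selector; only catboost is supported at present, which can handle datasets containing string data
params – arguments of the selector; defaults are used when not passed
max_iv – whether to remove features with an excessively high IV; 1.0 is the recommended setting

fit(x, y=None)[source]

Fit the feature importance selector

Parameters: x – the dataset; it must contain the target variable
Returns: the fitted FeatureImportanceSelector

transform(x, y=None)[source]

Transform method of the feature importance selector

Parameters: x – the dataset to run feature selection on
Returns: pd.DataFrame, the dataset after feature selection

catboost_selector(x, y, cat_features=None)[source]

Feature importance selector based on CatBoost

Parameters:

x – the dataset to run feature importance selection on, without the target variable
y – the target values that correspond to the dataset
cat_features – indexes of the categorical features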

class scorecardpipeline.processing.Combiner(target='target', method='chi', empty_separate=True, min_n_bins=2, max_n_bins=None, max_n_prebins=20, min_prebin_size=0.02, min_bin_size=0.05, max_bin_size=None, gamma=0.01, monotonic_trend='auto_asc_desc', adj_rules={}, n_jobs=1, **kwargs)[源代码]

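Bases: TransformerMixin, BaseEstimator

Wrapper for feature binning

Parameters:

target – name of the label column in the dataset, default target
method – binning method; options: “chi”, “dt”, “quantile”, “step”, “kmeans”, “cart”, “mdlp”, “uniform”; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
empty_separate – whether missing values get a bin of their own, default True
min_n_bins – minimum number of bins, default 2, i.e. at least 2 bins
max_n_bins – maximum number of bins, default None, i.e. no limit; 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
max_n_prebins – number of pre-bins when using optbinning
min_prebin_size – share of samples per leaf node (i.e. per bin) during optbinning pre-binning, default 2%
min_bin_size – minimum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default 5%
max_bin_size – maximum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default None
gamma – regularization parameter that limits overfitting when binning with optbinning; larger values penalize more; default 0.01
monotonic_trend – bad-rate trend constraint for the final optbinning binning, default auto_asc_desc; options: “auto”, “auto_heuristic”, “auto_asc_desc”, “ascending”, “descending”, “convex”, “concave”, “peak”, “valley”, “peak_heuristic”, “valley_heuristic”
adj_rules – custom binning rules, in the form accepted by toad.Combiner
n_jobs – number of workers for multiprocess acceleration, default single process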

update(rules)[源代码]

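Update the binning rules of features in the Combiner

Parameters:: rules – dict of rules to update, in the form {feature name: binning rules}

Feature binning based on optbinning.OptimalBinning; when optbinning.OptimalBinning fails, chi-square binning from toad.transform.Combiner is used instead

Parameters:

feature – name of the feature to bin
data – training dataset
target – name of the label column in the dataset, default target
min_n_bins – minimum number of bins, default 2, i.e. at least 2 bins
max_n_bins – maximum number of bins, default None, i.e. no limit; 3 to 5 is recommended and more is rarely useful; occasionally has no effect
max_n_prebins – number of pre-bins when using optbinning
min_prebin_size – share of samples per leaf node (i.e. per bin) during optbinning pre-binning, default 2%
min_bin_size – minimum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default 5%
max_bin_size – maximum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default None
gamma – regularization parameter that limits overfitting when binning with optbinning; larger values penalize more; default 0.01
monotonic_trend – bad-rate trend constraint for the final optbinning binning, default auto; options: “auto”, “auto_heuristic”, “auto_asc_desc”, “ascending”, “descending”, “convex”, “concave”, “peak”, “valley”, “peak_heuristic”, “valley_heuristic”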

fit(x: DataFrame, y=None)[源代码]

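Fit the feature binner

Parameters:: x – dataset to bin; must contain the target variable
Returns:: Combiner, the fitted binner

check_rules(feature=None)[源代码]: Check whether missing values of categorical variables were converted to strings and, if so, force them back to missing values; also check the bin order and correct it

transform(x, y=None, labels=False)[源代码]

Apply the fitted binning to a dataset

Parameters:

x – dataset to transform
labels – whether to output bin labels instead of bin indices, default False, i.e. bin indices

Returns:

pd.DataFrame, the binned dataset

export(to_json=None)[源代码]

Export the feature binner to JSON

Parameters:: to_json – path of the JSON file
Returns:: dict, the binning rules

load(from_json)[源代码]

Load a feature binner from a JSON file saved offline

Parameters:: from_json – path of the JSON file
Returns:: Combiner, the feature binner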

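Feature binning statistics table, summarizing the metrics of every bin of a feature

Parameters:

data – dataset to compute the binning statistics on
feature – name of the feature to report
target – name of the label column in the dataset, default target
rules – custom rules used to build the table; accepts a list (rules for a single feature) or a dict (rules for several features)
combiner – a pre-fitted feature binner; lower priority than rules
method – binning method, ignored when rules or combiner is passed; options: “chi”, “dt”, “quantile”, “step”, “kmeans”, “cart”, “mdlp”, “uniform”; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
desc – feature description, mostly used to pass the feature's Chinese name or meaning
ks – whether to compute KS statistics
max_n_bins – maximum number of bins, default None, i.e. no limit; 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
min_bin_size – minimum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default 5%
max_bin_size – maximum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default None
empty_separate – whether missing values get a bin of their own, default False; True is recommended
return_cols – list, restricts the returned table to the given columns, default None
return_rules – whether to also return the binning rules, default False
greater_is_better – whether larger values are better, default “auto”, which infers the direction from the lift of the last two bins; options: True, False, auto
amount – empty by default; accepts a numeric column (usually the loan amount) so that amount-based results are also output when analysing the overdue rate
kwargs – other parameters of scorecardpipeline.processing.Combiner

Returns:

Feature binning statistics table: pd.DataFrame
Feature binning rules: list, returned when return_rules is True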

bin_plot(data, x, rule={}, desc='', result=False, save=None, **kwargs)[源代码]

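Feature binning plot

Parameters:

data – dataset to plot
x – name of the feature to plot
rule – custom binning rules for the feature; the fitted binning rules are not modified
desc – feature description, mostly used to pass the feature's Chinese name or meaning
result – whether to return the binning statistics table, default False
save – path where the figure is saved; missing folders in the path are created; default None
kwargs – other parameters of scorecardpipeline.utils.bin_plot

Returns:

pd.DataFrame, the binning statistics table, returned when result is True

proportion_plot(data, x, transform=False, labels=False, keys=None)[源代码]

Distribution of a feature in the dataset

Parameters:

data – dataset whose sample distribution is plotted
x – name of the feature to plot
transform – whether to apply the binning first, default False; for numeric features it is recommended to bin before looking at the distribution
labels – whether to output bin labels instead of bin indices when binning, default False, i.e. bin indices
keys – split the dataset by a key and show the distribution per group, default None

corr_plot(data, transform=False, figure_size=(20, 15), save=None)[源代码]

Feature correlation plot

Parameters:

data – dataset whose feature correlations are plotted
transform – whether to apply the binning first, default False
figure_size – figure size, default (20, 15)
save – path where the figure is saved; missing folders in the path are created; default None

badrate_plot(data, date_column, feature, labels=True)[源代码]

Check whether the bins are stable across time periods. A gap between the lines that widens over time is preferable, because it means the feature discriminates better in more recent periods. Lines that do not cross each other are preferable, because they indicate stable binning

Parameters:

data – dataset used to check binning stability; must contain a date column
feature – name of the feature to check
date_column – name of the date column in the dataset
labels – whether to output bin labels when binning, default True, i.e. bin labels

property rules: dict, detailed binning rules of the features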

scorecardpipeline.processing.feature_bin_stats(data, feature, target='target', overdue=None, dpd=None, rules=None, method='step', desc='', combiner=None, ks=True, max_n_bins=None, min_bin_size=None, max_bin_size=None, greater_is_better='auto', amount=None, empty_separate=True, return_cols=None, return_rules=False, del_grey=False, verbose=0, **kwargs)[源代码]

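Feature binning statistics table, summarizing the metrics of every bin of a feature

Parameters:

data – dataset to compute the binning statistics on
feature – name of the feature to report
target – name of the label column in the dataset, default target
overdue – name of the overdue-days column; when overdue is passed, target is ignored
dpd – overdue definition: overdue days > DPD is labelled 1, everything else 0; only used together with overdue
del_grey – whether to drop rows whose overdue days fall in (0, dpd]; only used together with overdue
rules – custom rules used to build the table; accepts a list (rules for a single feature) or a dict (rules for several features)
combiner – a pre-fitted feature binner; lower priority than rules
method – binning method, ignored when rules or combiner is passed; options: “chi”, “dt”, “quantile”, “step”, “kmeans”, “cart”, “mdlp”, “uniform”; see toad.Combiner: https://github.com/amphibian-dev/toad/blob/master/toad/transform.py#L178-L355 & optbinning.OptimalBinning: https://gnpalencia.org/optbinning/
desc – feature description, mostly used to pass the feature's Chinese name or meaning
ks – whether to compute KS statistics
max_n_bins – maximum number of bins, default None, i.e. no limit; 3 to 5 is recommended and more is rarely useful; occasionally has no effect when optbinning is used
min_bin_size – minimum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default None
max_bin_size – maximum share of samples per leaf node (i.e. per bin) in the final optbinning binning, default None
empty_separate – whether missing values get a bin of their own, default True
return_cols – list, restricts the returned table to the given columns, default None
return_rules – whether to also return the binning rules, default False
greater_is_better – whether larger values are better, default “auto”, which infers the direction from the lift of the last two bins; options: True, False, auto
amount – empty by default; accepts a numeric column (usually the loan amount) so that amount-based results are also output when analysing the overdue rate
kwargs – other parameters of scorecardpipeline.processing.Combiner

Returns:

Feature binning statistics table: pd.DataFrame
Feature binning rules: list, returned when return_rules is True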
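
The overdue / dpd / del_grey rule described above can be sketched in plain Python. This is only an illustration of the labelling rule as documented (overdue days > dpd gives 1, everything else 0, and rows in (0, dpd] are optionally dropped); the helper name label_by_dpd is hypothetical and is not part of scorecardpipeline.

```python
def label_by_dpd(overdue_days, dpd, del_grey=False):
    """Return (kept_indices, labels) for a list of overdue-day counts.

    Hypothetical helper: label is 1 when overdue days > dpd, else 0.
    With del_grey=True, rows whose overdue days fall in (0, dpd] are dropped.
    """
    kept, labels = [], []
    for i, days in enumerate(overdue_days):
        if del_grey and 0 < days <= dpd:
            continue  # grey sample: overdue, but not beyond the dpd cutoff
        kept.append(i)
        labels.append(1 if days > dpd else 0)
    return kept, labels


overdue_days = [0, 2, 7, 8, 30]
print(label_by_dpd(overdue_days, dpd=7))
print(label_by_dpd(overdue_days, dpd=7, del_grey=True))
```

Dropping the grey band keeps only clearly good (never overdue) and clearly bad (overdue beyond dpd) samples, which usually sharpens the contrast between the two classes at the cost of sample size.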

class scorecardpipeline.processing.WOETransformer(target='target', exclude=None)[源代码]

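Bases: TransformerMixin, BaseEstimator

__init__(target='target', exclude=None)[源代码]

WOE transformer

Parameters:

target – name of the label column in the dataset, default target
exclude – columns that should not be WOE-transformed

fit(x, y=None)[源代码]

Fit the WOE transformer

Parameters:: x – data already transformed by Combiner (with labels set to False); must contain the target variable
Returns:: WOETransformer, the fitted WOE transformer

transform(x, y=None)[源代码]

Apply the WOE transformation

Parameters:: x – dataset to WOE-transform
Returns:: pd.DataFrame, the WOE-transformed dataset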
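
As a rough illustration of what a WOE transformation computes, the sketch below derives a per-bin WOE value and the resulting IV in plain Python. It assumes the convention WOE = ln(bad share / good share); libraries differ on the sign, and this is not claimed to be the exact formula used by WOETransformer. The helper names woe_table and information_value are hypothetical.

```python
import math
from collections import Counter


def woe_table(bins, target):
    """Map each bin index to ln(bad share / good share).

    Assumes every bin contains at least one good and one bad sample.
    """
    bad = Counter(b for b, t in zip(bins, target) if t == 1)
    good = Counter(b for b, t in zip(bins, target) if t == 0)
    n_bad, n_good = sum(bad.values()), sum(good.values())
    return {
        b: math.log((bad[b] / n_bad) / (good[b] / n_good))
        for b in sorted(set(bins))
    }


def information_value(bins, target):
    """IV = sum over bins of (bad share - good share) * WOE."""
    bad = Counter(b for b, t in zip(bins, target) if t == 1)
    good = Counter(b for b, t in zip(bins, target) if t == 0)
    n_bad, n_good = sum(bad.values()), sum(good.values())
    woe = woe_table(bins, target)
    return sum((bad[b] / n_bad - good[b] / n_good) * w for b, w in woe.items())


bins = [0, 0, 0, 0, 1, 1, 1, 1]      # bin indices, as produced by a binner
target = [0, 0, 0, 1, 0, 1, 1, 1]    # 1 = bad, 0 = good
woe = woe_table(bins, target)
print(woe)
print([woe[b] for b in bins])        # the WOE-encoded feature
print(information_value(bins, target))
```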

export(to_json=None)[源代码]

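Export the WOE transformer to JSON

Parameters:: to_json – path of the JSON file
Returns:: dict, the WOE rules of the features

load(from_json)[源代码]

Load a WOE transformer from a JSON file saved offline

Parameters:: from_json – path of the JSON file
Returns:: WOETransformer, the WOE transformer

property rules: dict, detailed WOE information of the features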

scorecardpipeline.processing.feature_efficiency_analysis(data, feature, overdue=['MOB1'], dpd=[7, 3, 0], greater_is_better='auto', verbose=True, ks=False, **kwargs)[源代码]

scorecardpipeline.feature_selection

@Time : 2024/5/8 14:06 @Author : itlubber @Site : itlubber.art

class scorecardpipeline.feature_selection.SelectorMixin[源代码]

Bases: BaseEstimator, TransformerMixin

__init__()[源代码]

transform(x)[源代码]

fit(x, y=None)[源代码]

class scorecardpipeline.feature_selection.TypeSelector(dtype_include=None, dtype_exclude=None, exclude=None)[源代码]

Bases: SelectorMixin

__init__(dtype_include=None, dtype_exclude=None, exclude=None)[源代码]

fit(x: DataFrame, y=None, **fit_params)[源代码]

class scorecardpipeline.feature_selection.RegexSelector(pattern=None, exclude=None)[源代码]

Bases: SelectorMixin

__init__(pattern=None, exclude=None)[源代码]

fit(x: DataFrame, y=None, **fit_params)[源代码]

scorecardpipeline.feature_selection.value_ratio(x, value)[源代码]

scorecardpipeline.feature_selection.mode_ratio(x, dropna=True)[源代码]

class scorecardpipeline.feature_selection.NullSelector(threshold=0.95, missing_values=nan, exclude=None, **kwargs)[源代码]

Bases: SelectorMixin

__init__(threshold=0.95, missing_values=nan, exclude=None, **kwargs)[源代码]

fit(x: DataFrame, y=None)[源代码]

class scorecardpipeline.feature_selection.ModeSelector(threshold=0.95, exclude=None, dropna=True, n_jobs=None, **kwargs)[源代码]

Bases: SelectorMixin

__init__(threshold=0.95, exclude=None, dropna=True, n_jobs=None, **kwargs)[源代码]

fit(x: DataFrame, y=None)[源代码]

class scorecardpipeline.feature_selection.CardinalitySelector(threshold=10, exclude=None, dropna=True)[源代码]

Bases: SelectorMixin

Feature selection via categorical feature’s cardinality.

Examples

>>> import pandas as pd
>>> from scorecardpipeline.feature_selection import CardinalitySelector
>>> x = pd.DataFrame({"f2": ["F", "M", "F"], "f3": ["M1", "M2", "M3"]})
>>> cs = CardinalitySelector(threshold=2)
>>> cs.fit_transform(x)

__init__(threshold=10, exclude=None, dropna=True)[源代码]

fit(x, y=None, **fit_params)[源代码]

scorecardpipeline.feature_selection.IV(x, y, regularization=1.0)[源代码]

class scorecardpipeline.feature_selection.InformationValueSelector(threshold=0.02, target='target', regularization=1.0, methods=None, n_jobs=None, combiner=None, **kwargs)[源代码]

Bases: SelectorMixin

__init__(threshold=0.02, target='target', regularization=1.0, methods=None, n_jobs=None, combiner=None, **kwargs)[源代码]

fit(x: DataFrame, y=None)[源代码]

scorecardpipeline.feature_selection.LIFT(y_pred, y_true)[源代码]

Calculate lift according to label data.

Examples

>>> import numpy as np
>>> from scorecardpipeline.feature_selection import LIFT
>>> y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1, 1])
>>> y_pred = np.array([1, 0, 1, 0, 1, 1, 1, 1, 1])
>>> LIFT(y_pred, y_true) # (5 / 7) / (6 / 9)
1.0714285714285716
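
The arithmetic in the comment above, (5 / 7) / (6 / 9), can be reproduced without the library. The sketch below is a plain-Python illustration of lift for binary predictions, i.e. the positive rate among the flagged samples divided by the overall positive rate; the function name lift is hypothetical and this is not the library's implementation.

```python
def lift(y_pred, y_true):
    """Lift of a binary rule: precision among flagged samples / base rate."""
    hits = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    flagged = sum(y_pred)                       # samples the rule flags
    precision = hits / flagged                  # 5 / 7 in the example
    base_rate = sum(y_true) / len(y_true)       # 6 / 9 in the example
    return precision / base_rate


y_true = [0, 1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 1, 1, 1, 1]
print(lift(y_pred, y_true))
```

For two binary vectors this ratio is symmetric in its arguments, since it equals n * hits / (sum(y_pred) * sum(y_true)), so swapping the argument order does not change the result.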

class scorecardpipeline.feature_selection.LiftSelector(target='target', threshold=3.0, n_jobs=None, methods=None, combiner=None, **kwargs)[源代码]

Bases: SelectorMixin

Feature selection via lift score.

Attributes:

threshold – float. The threshold value used for feature selection.
scores_ – array-like of shape (n_features,). Lift scores of features.
select_columns – array-like
dropped – DataFrame

__init__(target='target', threshold=3.0, n_jobs=None, methods=None, combiner=None, **kwargs)[源代码]

Parameters:

target – name of the label column in the dataset
threshold – float or str (default=3.0). Feature which has a lift score greater than threshold will be kept.
n_jobs – int or None, (default=None). Number of parallel.
combiner – Combiner
methods – Combiner’s methods

fit(x: DataFrame, y=None, **fit_params)[源代码]

class scorecardpipeline.feature_selection.VarianceSelector(threshold=0.0, exclude=None)[源代码]

Bases: SelectorMixin

Feature selector that removes all low-variance features.

__init__(threshold=0.0, exclude=None)[源代码]

fit(x, y=None)[源代码]

scorecardpipeline.feature_selection.VIF(x, n_jobs=None, missing=-1)[源代码]

class scorecardpipeline.feature_selection.VIFSelector(threshold=4.0, exclude=None, missing=-1, n_jobs=None)[源代码]

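Bases: SelectorMixin

__init__(threshold=4.0, exclude=None, missing=-1, n_jobs=None)[源代码]

Parameters:

exclude – variables in the dataset that must be kept
threshold – threshold; a feature is dropped when its VIF exceeds threshold
missing – fill value for missing data, default -1
n_jobs – number of threads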

fit(x: DataFrame, y=None)[源代码]

class scorecardpipeline.feature_selection.CorrSelector(threshold=0.7, method='pearson', weights=None, exclude=None, **kwargs)[源代码]

Bases: SelectorMixin

__init__(threshold=0.7, method='pearson', weights=None, exclude=None, **kwargs)[源代码]

fit(x: DataFrame, y=None)[源代码]

scorecardpipeline.feature_selection.PSI(train, test, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')[源代码]

class scorecardpipeline.feature_selection.PSISelector(threshold=0.1, cv=None, method=None, exclude=None, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)[源代码]

Bases: SelectorMixin

__init__(threshold=0.1, cv=None, method=None, exclude=None, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)[源代码]

fit(x: DataFrame, y=None, groups=None)[源代码]

class scorecardpipeline.feature_selection.NullImportanceSelector(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[源代码]

Bases: SelectorMixin

__init__(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[源代码]

fit(x: DataFrame, y=None)[源代码]

class scorecardpipeline.feature_selection.TargetPermutationSelector(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[源代码]

Bases: NullImportanceSelector

__init__(estimator, target='target', threshold=1.0, norm_order=1, importance_getter='auto', cv=3, n_runs=5, **kwargs)[源代码]

class scorecardpipeline.feature_selection.ExhaustiveSelector(estimator, min_features=1, max_features=1, scoring='accuracy', cv=3, verbose=0, n_jobs=None, pre_dispatch='2*n_jobs')[源代码]

Bases: SelectorMixin, MetaEstimatorMixin

Exhaustive Feature Selection for Classification and Regression.

Attributes:

subset_info – list of dicts, each with the keys ‘support_mask’ (mask array of the selected features) and ‘cv_scores’ (cross-validation scores)
support_mask – array-like of booleans. Array of final chosen features
best_idx – array-like, shape = [n_predictions]. Feature indices of the selected feature subsets.
best_score – float. Cross-validation average score of the selected subset.
best_feature_indices – array-like, shape = (n_features,). Feature indices of the selected feature subsets.

Examples

>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.datasets import load_iris
>>> from scorecardpipeline.feature_selection import ExhaustiveSelector
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> knn = KNeighborsClassifier(n_neighbors=3)
>>> efs = ExhaustiveSelector(knn, min_features=1, max_features=4, cv=3)
>>> efs.fit(X, y)
ExhaustiveSelector(estimator=KNeighborsClassifier(n_neighbors=3), max_features=4)
>>> efs.best_score_
0.9733333333333333
>>> efs.best_idx_
12

__init__(estimator, min_features=1, max_features=1, scoring='accuracy', cv=3, verbose=0, n_jobs=None, pre_dispatch='2*n_jobs')[源代码]

Parameters:

estimator – scikit-learn classifier or regressor
min_features – int (default: 1). Minimum number of features to select
max_features – int (default: 1). Maximum number of features to select
verbose – int (default: 0). Non-zero values print progress to stdout.
scoring – str or callable (default: ‘accuracy’). Scoring metric in {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’} for classifiers, {‘mean_absolute_error’, ‘mean_squared_error’, ‘median_absolute_error’, ‘r2’} for regressors, or a callable object or function with signature scorer(estimator, X, y).
cv – int (default: 3). Scikit-learn cross-validation generator or int. If estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed, and regular k-fold cross-validation otherwise. No cross-validation if cv is None, False, or 0.
n_jobs – int or None (default: None). The number of CPUs to use for evaluating different feature subsets in parallel. -1 means ‘all CPUs’.
pre_dispatch – int, or string (default: ‘2*n_jobs’). Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1.

static ncr(n, r)[源代码]

Return the number of combinations of length r from n items.

Parameters:

n – int, Total number of items
r – int, Number of items to select from n

Returns:

Number of combinations, integer
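
To get a feel for how quickly an exhaustive search grows, the combination count can be reproduced with the standard library. The sketch below uses math.comb (Python 3.8+) rather than the class's own ncr; the helper n_subsets is hypothetical and simply sums the counts for every subset size between min_features and max_features.

```python
import math


def n_subsets(n_features, min_features, max_features):
    """Number of candidate feature subsets an exhaustive search must score."""
    return sum(
        math.comb(n_features, r)
        for r in range(min_features, max_features + 1)
    )


print(math.comb(4, 2))        # 6 ways to choose 2 of 4 features
print(n_subsets(4, 1, 4))     # 4 + 6 + 4 + 1 = 15 subsets for 4 features
print(n_subsets(20, 1, 20))   # 2**20 - 1 = 1048575 subsets for 20 features
```

With 4 features and subset sizes 1 to 4 there are 15 candidates, each of which is cross-validated, so the cost grows exponentially with the number of features.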

fit(X, y, groups=None, **fit_params)[源代码]

Perform feature selection and learn model from training data.

Parameters:

X – array-like of shape (n_samples, n_features)
y – array-like of shape (n_samples, ), Target values.
groups – array-like of shape (n_samples,), Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params – dict, Parameters to pass to the fit method of classifier

Returns:

ExhaustiveSelector

class scorecardpipeline.feature_selection.BorutaSelector[源代码]

Bases: SelectorMixin

__init__()[源代码]

class scorecardpipeline.feature_selection.MICSelector[源代码]: Bases: SelectorMixin

class scorecardpipeline.feature_selection.FeatureImportanceSelector[源代码]: Bases: SelectorMixin

class scorecardpipeline.feature_selection.StabilitySelector[源代码]: Bases: SelectorMixin

class scorecardpipeline.feature_selection.REFSelector[源代码]: Bases: SelectorMixin

class scorecardpipeline.feature_selection.SequentialFeatureSelector[源代码]: Bases: SelectorMixin

scorecardpipeline.model

@Time : 2023/05/21 16:23 @Author : itlubber @Site : itlubber.art

class scorecardpipeline.model.ITLubberLogisticRegression(target='target', penalty='l2', calculate_stats=True, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)[源代码]

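Bases: LogisticRegression

Parameters:

target – name of the label column in the dataset, default target
calculate_stats – whether to record model statistics while fitting, default True; the statistics can be output with the summary method
tol – stopping tolerance of the solver, float, default 1e-4
C – inverse of the regularization strength λ, float, default 1.0; must be a positive float; smaller values mean a stronger penalty
fit_intercept – whether to fit an intercept (bias), bool, default True
class_weight – class weights, default None; accepts a dict or balanced; with balanced the weights are computed as n_samples / (n_classes * np.bincount(y))
solver – solver, default lbfgs. For small datasets liblinear is a good choice; for large datasets saga or sag are faster. For multiclass problems only newton-cg, sag, saga and lbfgs can be used. Regarding regularization, newton-cg, lbfgs and sag support only L2 (these optimizers need continuous first or second derivatives of the loss, so they cannot be used with L1, which has no continuous derivative), whereas liblinear and saga can handle L1. newton-cg is a conjugate-gradient method from the Newton family, lbfgs is a quasi-Newton method, sag is stochastic average gradient descent, saga is a stochastic optimization algorithm, and liblinear uses coordinate descent.
penalty – penalty term, default l2; options: l1, l2; only L2 is supported when solver is newton-cg, sag or lbfgs. L1 assumes the model parameters follow a Laplace distribution, L2 assumes they follow a Gaussian distribution
intercept_scaling – only useful when solver is liblinear and fit_intercept is True
dual – dual or primal formulation, bool, default False; the dual formulation is only used for the L2 penalty with the liblinear solver. When the number of samples exceeds the number of features, dual is usually set to False
random_state – random seed, int, optional, default None; only useful when solver is sag or liblinear
max_iter – maximum number of iterations for the solver to converge, int, default 100; only useful when solver is newton-cg, sag or lbfgs
multi_class – multiclass strategy, default auto; options: ovr, multinomial; for binary problems the two behave the same, the difference only shows in multiclass problems
verbose – log level; for the liblinear and lbfgs solvers, any positive number shows the detailed computation
warm_start – warm start, bool, whether to reuse the previous model's solution as initialization, default False
n_jobs – number of parallel jobs, default None; -1 uses all CPUs of the machine
l1_ratio – elastic-net mixing parameter with 0 <= l1_ratio <= 1; only effective when penalty is elasticnet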
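
The balanced weighting formula quoted for class_weight, n_samples / (n_classes * np.bincount(y)), is easy to check by hand. The following plain-Python sketch reproduces it with collections.Counter instead of NumPy; the helper name balanced_class_weight is hypothetical.

```python
from collections import Counter


def balanced_class_weight(y):
    """Per-class weight n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Three good samples and one bad sample: the rare class gets the larger weight.
print(balanced_class_weight([0, 0, 0, 1]))
```

In credit data the bad class is typically rare, so balanced weighting raises its influence on the loss; note that this also shifts the predicted probabilities away from the raw bad rate.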

Examples

>>> feature_pipeline = Pipeline([
>>>     ("preprocessing_select", FeatureSelection(target=target, engine="scorecardpy")),
>>>     ("combiner", Combiner(target=target, min_samples=0.2)),
>>>     ("transform", WOETransformer(target=target)),
>>>     ("processing_select", FeatureSelection(target=target, engine="scorecardpy")),
>>>     ("stepwise", StepwiseSelection(target=target)),
>>>     # ("logistic", LogisticClassifier(target=target)),
>>>     ("logistic", ITLubberLogisticRegression(target=target)),
>>> ])
>>> feature_pipeline.fit(train)
>>> summary = feature_pipeline.named_steps['logistic'].summary()
>>> summary
                                                    Coef.  Std.Err       z  P>|z|  [ 0.025  0.975 ]    VIF
const                                               -0.8511   0.0991 -8.5920 0.0000  -1.0452  -0.6569 1.0600
credit_history                                       0.8594   0.1912  4.4954 0.0000   0.4847   1.2341 1.0794
age_in_years                                         0.6176   0.2936  2.1032 0.0354   0.0421   1.1932 1.0955
savings_account_and_bonds                            0.8842   0.2408  3.6717 0.0002   0.4122   1.3563 1.0331
credit_amount                                        0.7027   0.2530  2.7771 0.0055   0.2068   1.1987 1.1587
status_of_existing_checking_account                  0.6891   0.1607  4.2870 0.0000   0.3740   1.0042 1.0842
personal_status_and_sex                              0.8785   0.5051  1.7391 0.0820  -0.1116   1.8685 1.0113
purpose                                              1.1370   0.2328  4.8844 0.0000   0.6807   1.5932 1.0282
present_employment_since                             0.7746   0.3247  2.3855 0.0171   0.1382   1.4110 1.0891
installment_rate_in_percentage_of_disposable_income  1.3785   0.3434  4.0144 0.0001   0.7055   2.0515 1.0300
duration_in_month                                    0.9310   0.1986  4.6876 0.0000   0.5417   1.3202 1.1636
other_installment_plans                              0.8521   0.3459  2.4637 0.0138   0.1742   1.5301 1.0117
housing                                              0.8251   0.4346  1.8983 0.0577  -0.0268   1.6770 1.0205

fit(x, sample_weight=None, **kwargs)[源代码]

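Fit the logistic regression model

Parameters:

x – training dataset; must contain the target variable
sample_weight – sample weights, see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
kwargs – other parameters for fitting the logistic regression model

Returns:

ITLubberLogisticRegression, the fitted logistic regression model

decision_function(x)[源代码]

Decision function

Parameters:: x – dataset to predict on; it may contain the target variable, which is detected by column name and dropped if present
Returns:: np.ndarray, the predictions

corr(data, save=None, annot=True)[源代码]

Feature correlation plot of a dataset

Parameters:

data – dataset to plot the feature correlations for
save – path where the figure is saved; missing folders in the path are created; default None
annot – whether to show the correlation values in the plot, default True

report(data)[源代码]

Logistic regression model report

Parameters:: data – dataset to evaluate
Returns:: pd.DataFrame, the model report with accuracy, F1 and other metrics, see: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

summary()[源代码]

Returns:: pd.DataFrame, statistics of the logistic regression model

Coef.: coefficient of each feature in the model
Std.Err: standard error
z: z-test statistic
P>|z|: p-value
[ 0.025: lower bound of the confidence interval
0.975 ]: upper bound of the confidence interval
VIF: variance inflation factor

Examples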

>>> summary = logistic.summary()
>>> summary
                                                    Coef.  Std.Err       z  P>|z|  [ 0.025  0.975 ]    VIF
const                                               -0.8511   0.0991 -8.5920 0.0000  -1.0452  -0.6569 1.0600
credit_history                                       0.8594   0.1912  4.4954 0.0000   0.4847   1.2341 1.0794
age_in_years                                         0.6176   0.2936  2.1032 0.0354   0.0421   1.1932 1.0955
savings_account_and_bonds                            0.8842   0.2408  3.6717 0.0002   0.4122   1.3563 1.0331
credit_amount                                        0.7027   0.2530  2.7771 0.0055   0.2068   1.1987 1.1587
status_of_existing_checking_account                  0.6891   0.1607  4.2870 0.0000   0.3740   1.0042 1.0842
personal_status_and_sex                              0.8785   0.5051  1.7391 0.0820  -0.1116   1.8685 1.0113
purpose                                              1.1370   0.2328  4.8844 0.0000   0.6807   1.5932 1.0282
present_employment_since                             0.7746   0.3247  2.3855 0.0171   0.1382   1.4110 1.0891
installment_rate_in_percentage_of_disposable_income  1.3785   0.3434  4.0144 0.0001   0.7055   2.0515 1.0300
duration_in_month                                    0.9310   0.1986  4.6876 0.0000   0.5417   1.3202 1.1636
other_installment_plans                              0.8521   0.3459  2.4637 0.0138   0.1742   1.5301 1.0117
housing                                              0.8251   0.4346  1.8983 0.0577  -0.0268   1.6770 1.0205

summary2(feature_map={})[源代码]

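Builds on summary and additionally accepts a data dictionary, producing a statistics table that includes the feature descriptions

Parameters:: feature_map – data dictionary, default {}
Returns:: pd.DataFrame, statistics of the logistic regression model

static convert_sparse_matrix(x)[源代码]: Sparse feature optimization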

plot_weights(save=None, figsize=(15, 8), fontsize=14, color=['#2639E9', '#F76E6C', '#FE7715'])[源代码]

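Error-bar plot of the logistic regression coefficients

Parameters:

save – path where the figure is saved; missing folders in the path are created; default None
figsize – figure size, default (15, 8)
fontsize – font size, default 14
color – theme colors of the figure; the default is usually fine

Returns:

Figure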

class scorecardpipeline.model.ScoreCard(target='target', pdo=60, rate=2, base_odds=35, base_score=750, combiner={}, transer=None, pretrain_lr=None, pipeline=None, **kwargs)[源代码]

Bases: ScoreCard, TransformerMixin

__init__(target='target', pdo=60, rate=2, base_odds=35, base_score=750, combiner={}, transer=None, pretrain_lr=None, pipeline=None, **kwargs)[源代码]

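Scorecard model

Parameters:

target – name of the label column in the dataset, default target
pdo – the score drops by pdo points each time the odds are multiplied by rate, default 60
rate – the multiplier
base_odds – base odds, usually a ratio set from business experience (probability of being good / probability of default); it can be estimated as (1 - share of bad customers) / share of bad customers; default 35, i.e. 35:1 => 0.972 => bad rate 2.8%
base_score – score that corresponds to the base odds, default 750
combiner – binning transformer; may be None when pipeline is passed
transer – WOE transformer; may be None when pipeline is passed
pretrain_lr – a pre-trained logistic regression model; optional
pipeline – a fitted pipeline; must contain a Combiner and a WOETransformer
kwargs – other parameters, see toad.ScoreCard

fit(x)[源代码]

Fit the scorecard model

Parameters:: x – training data already converted to WOE; must contain the target variable
Returns:: ScoreCard, the fitted scorecard model

transform(x)[源代码]

Convert data to scores

Parameters:: x – raw data to score, not WOE-transformed data
Returns:: the predicted scores

static score_clip(score, clip=50)[源代码]

Given scores, return equal-width binning rules based on the score distribution

Parameters:

score – score data
clip – width of each interval

Returns:

list, the score binning rules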
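
One plausible reading of equal-width score binning is sketched below: round the minimum score down and the maximum score up to a multiple of clip, then emit every multiple in between. The documentation does not state how the library treats the outer edges, so the helper equal_width_edges is a hypothetical illustration and not the behaviour of score_clip itself.

```python
import math


def equal_width_edges(scores, clip=50):
    """Cut points spaced `clip` apart that cover the observed score range."""
    lo = math.floor(min(scores) / clip) * clip   # round the minimum down
    hi = math.ceil(max(scores) / clip) * clip    # round the maximum up
    return list(range(lo, hi + clip, clip))


scores = [312, 455, 608, 731]
print(equal_width_edges(scores, clip=50))
print(equal_width_edges(scores, clip=100))
```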

scorecard_scale()[源代码]

Output the scorecard scaling information, including base_odds, base_score, rate, pdo, A and B

Returns:: pd.DataFrame, the scorecard scaling information
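
The constants A and B follow from pdo, rate, base_odds and base_score through the usual scorecard scaling. The sketch below assumes the common form score = A + B * ln(odds) with odds expressed as good:bad, so that base_odds maps to base_score and multiplying the good:bad odds by rate adds pdo points (equivalently, multiplying the bad:good odds by rate removes pdo points). The sign convention inside the library may differ; the function names are hypothetical.

```python
import math


def scale_constants(pdo=60, rate=2, base_odds=35, base_score=750):
    """Return (A, B) such that score = A + B * ln(good:bad odds)."""
    B = pdo / math.log(rate)                  # points per unit of log-odds
    A = base_score - B * math.log(base_odds)  # anchors base_odds at base_score
    return A, B


def odds_to_score(odds, A, B):
    return A + B * math.log(odds)


A, B = scale_constants()
print(round(A, 4), round(B, 4))
print(odds_to_score(35, A, B))   # the base odds map to the base score
print(odds_to_score(70, A, B))   # doubling the odds shifts the score by pdo
```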

classmethod format_bins(bins, index=False, ellipsis=None, decimal=4)[源代码]

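Convert bins to labels

Parameters:

bins – the bins
index – whether to include the index
ellipsis – maximum displayed length of a label

Returns:

ndarray: the bin labels

scorecard_points(feature_map={})[源代码]

Output the scorecard bins and their corresponding points

Parameters:: feature_map – data dictionary, default {}; when a dictionary of the model features is passed, a column with the variable meaning is added to the output
Returns:: pd.DataFrame, the scorecard binning information

scorecard2pmml(pmml: str = 'scorecard.pmml', debug: bool = False)[源代码]

Convert the scorecard model to a local PMML file; this requires JDK 1.8+ and the sklearn2pmml library to be installed in the environment beforehand

Parameters:

pmml – path where the PMML model file is saved
debug – bool, whether to enable debug mode, default False; when True, the scorecard pipeline is returned and the conversion details are shown

Returns:

sklearn.pipeline.Pipeline, the scorecard pipeline, returned when debug is True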

static KS_bucket(y_pred, y_true, bucket=10, method='quantile')[源代码]

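Method for evaluating the rank-ordering ability of a scorecard

Parameters:

y_pred – model predictions: the scores predicted by the scorecard or the probabilities predicted by the LR
y_true – good/bad labels of the samples
bucket – number of bins, default 10
method – binning method; supports chi, dt, quantile, step, kmeans; default quantile

Returns:

Statistics of the binned scores; using feature_bin_stats directly is recommended

static KS(y_pred, y_true)[源代码]

Compute the KS metric

Parameters:

y_pred – model predictions: the scores predicted by the scorecard or the probabilities predicted by the LR
y_true – good/bad labels of the samples

Returns:

float, the KS metric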
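
For reference, the KS statistic is the largest gap between the cumulative distributions of bad and good samples when sorted by the prediction. The plain-Python sketch below illustrates that definition; it is not the library's implementation, and the function name ks_statistic is hypothetical.

```python
def ks_statistic(y_pred, y_true):
    """Max |cumulative bad share - cumulative good share| over all cut-offs."""
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    pairs = sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)

    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every sample tied at this prediction before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best


print(ks_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(ks_statistic([0.9, 0.8, 0.7, 0.6, 0.4, 0.3], [1, 0, 1, 1, 0, 0]))
```

Because the absolute gap is used, the result is the same whether higher predictions mean higher risk (probabilities) or lower risk (scores).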

static AUC(y_pred, y_true)[源代码]

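Compute the AUC metric

Parameters:

y_pred – model predictions: the scores predicted by the scorecard or the probabilities predicted by the LR
y_true – good/bad labels of the samples

Returns:

float, the AUC metric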
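
AUC can be read as the probability that a randomly chosen bad sample receives a higher prediction than a randomly chosen good one, with ties counted as one half. The sketch below computes it by direct pairwise comparison, which is quadratic in the sample size and meant only to illustrate the definition; the function name auc is hypothetical.

```python
def auc(y_pred, y_true):
    """Share of (bad, good) pairs ranked correctly; ties count as 0.5."""
    pos = [p for p, t in zip(y_pred, y_true) if t == 1]
    neg = [p for p, t in zip(y_pred, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(auc([0.9, 0.8, 0.7, 0.6, 0.4, 0.3], [1, 0, 1, 1, 0, 0]))
```

When scorecard scores are passed instead of probabilities, higher values usually mean lower risk, so an AUC computed this way falls below 0.5 and 1 - AUC is the figure to compare.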

static perf_eva(y_pred, y_true, title='', plot_type=['ks', 'roc'], save=None, figsize=(14, 6))[源代码]

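Evaluate the performance of a scorecard

Parameters:

y_pred – model predictions: the scores predicted by the scorecard or the probabilities predicted by the LR
y_true – good/bad labels of the samples
title – figure title
plot_type – types of plots to draw; options: ks, roc, lift, pr
save – path where the figure is saved; missing folders in the path are created; default None
figsize – figure size as a tuple, default (14, 6)

Returns:

dict containing ks, auc, gini and figure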

static ks_plot(score, y_true, title='', fontsize=14, figsize=(16, 8), save=None, **kwargs)[源代码]

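KS curve & ROC curve of a numeric feature

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
title – figure title
fontsize – font size, default 14
figsize – figure size, default (16, 8)
save – path where the figure is saved; missing folders in the path are created; default None
kwargs – other parameters, see: scorecardpipeline.utils.hist_plot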

static PSI(y_pred_train, y_pred_oot)[源代码]

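Compute the PSI between the scores or predictions of two datasets

Parameters:

y_pred_train – numeric feature of the baseline dataset, usually the scorecard score
y_pred_oot – numeric feature of the comparison dataset

Returns:

float, the PSI value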
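
As a reminder of what the PSI measures, the sketch below bins both samples with the same cut points and sums (actual share - expected share) * ln(actual share / expected share) over the bins. The cut points, the small floor eps for empty bins and the function name psi are assumptions made for this illustration; the library chooses its own binning.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` against `expected`."""

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)   # number of cut points <= v
            counts[idx] += 1
        # Floor each share at eps so that empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


train = [510, 560, 610, 640, 660, 700, 720, 760]
oot = [480, 520, 540, 590, 620, 650, 690, 710]
edges = [550, 650]

print(psi(train, train, edges))   # identical distributions
print(psi(train, oot, edges))     # scores shifted downwards
```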

static perf_psi(y_pred_train, y_pred_oot, y_true_train, y_true_oot, keys=['train', 'test'], x_limits=None, x_tick_break=50, show_plot=True, return_distr_dat=False)[源代码]

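The perf_psi method of scorecardpy: draws the PSI plot for two datasets

Parameters:

y_pred_train – numeric feature of the baseline dataset, usually the scorecard score
y_pred_oot – numeric feature of the comparison dataset
y_true_train – true labels of the baseline dataset
y_true_oot – true labels of the comparison dataset
keys – names of the baseline and comparison datasets
x_limits – range of the x axis, default None
x_tick_break – step of the score intervals
show_plot – whether to show the figure, default True
return_distr_dat – whether to return the distribution data

Returns:

dict, the PSI value & the figure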

static score_hist(score, y_true, figsize=(15, 10), bins=20, save=None, **kwargs)[源代码]

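Distribution plot of a numeric feature

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
figsize – figure size, default (15, 10)
bins – number of bins, default 20
save – path where the figure is saved; missing folders in the path are created; default None
kwargs – other parameters of scorecardpipeline.utils.hist_plot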

static class_steps(pipeline, query)[源代码]

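Look up the steps of a pipeline that match query

Parameters:

pipeline – sklearn.pipeline.Pipeline, the fitted preprocessing pipeline
query – the class to look for; WOETransformer and Combiner can be found in the pipeline this way

Returns:

list, the matching components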

feature_bin_stats(data, feature, rules={}, method='step', max_n_bins=10, desc='评分卡分数', ks=False, **kwargs)[源代码]

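Method for evaluating the rank-ordering ability of a scorecard; outputs the metrics of each score interval

Parameters:

data – dataset to inspect
feature – name of a numeric feature, usually the predicted probability or the scorecard score
rules – custom interval rules
method – binning method
max_n_bins – maximum number of bins
desc – feature description
ks – whether to compute the KS metric and output the related statistics
kwargs – other parameters of Combiner.feature_bin_stats

Returns:

pd.DataFrame, statistics of each score interval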

scorecardpipeline.auto_eda

@Time : 2023/12/29 11:17 @Author : itlubber @Site : itlubber.art

scorecardpipeline.auto_eda.auto_eda_sweetviz(all_data, target=None, save='model_report/auto_eda.html', pairwise=True, labels=None, exclude=None, num_features=None, cat_features=None, text_features=None)[源代码]

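Run automatic EDA on a dataset with few rows and few features and produce an analysis report

Parameters:

all_data – dataset to run EDA on
target – target variable; only integer or boolean targets are supported
labels – when target is passed, the names of the two datasets being compared [true, false]; when target is not passed, the name of the dataset
save – path where the report is stored; use the .html suffix
pairwise – whether to show the pairwise relationships between features
exclude – names of the features to exclude
num_features – names of the features to force to numeric
cat_features – names of the features to force to categorical
text_features – names of the features to force to text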

scorecardpipeline.auto_report

@Time : 2023/12/29 11:17 @Author : itlubber @Site : itlubber.art

scorecardpipeline.auto_report.auto_data_testing_report(data: DataFrame, features=None, target='target', overdue=None, dpd=None, date=None, data_summary_comment='', freq='M', excel_writer=None, sheet='分析报告', start_col=2, start_row=2, dropna=False, writer_params={}, bin_params={}, feature_map={}, corr=False, pictures=['bin', 'ks', 'hist'], suffix='')[源代码]

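Automatic data testing report, used to evaluate third-party data or the performance of in-house scores

Parameters:

suffix – suffix appended to picture names so that pictures with the same name are not overwritten when the Excel file has not been saved
corr – whether to evaluate the correlation between numeric variables, default False; when True, a correlation plot and table are output
pictures – pictures to include; supports [“ks”, “hist”, “bin”]
data – dataset to evaluate; must contain the target variable
features – names of the features to analyse; accepts a single string or a list
target – name of the target variable
overdue – name of the overdue-days column; when overdue is passed, target is ignored
dpd – overdue definition: overdue days > DPD is labelled 1, everything else 0; only used together with overdue
date – date column, usually the application date or the loan date; optional; when passed, the good/bad distribution is output at the time granularity given by freq
freq – used together with date; the granularity of the statistics, default M, i.e. monthly
data_summary_comment – remark to put in the data overview, e.g. “grey customers with historical maximum overdue days in [0, dpd] were removed”
excel_writer – name of the Excel file to save, or a writer object
sheet – name of the sheet to save; accepts an existing worksheet or a string
start_col – starting column
start_row – starting row
dropna – whether to drop missing values or a specified value when analysing a field, default False
writer_params – initialization parameters of the Excel writer; only effective when excel_writer is a string
bin_params – parameters of the binning statistics; supports the parameters of feature_bin_stats
feature_map – feature dictionary, used to make the report more readable, default {}

Examples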

>>> import numpy as np
>>> from scorecardpipeline import *
>>>
>>> # load the dataset and convert the label to 0 and 1
>>> target = "creditability"
>>> data = germancredit()
>>> data[target] = data[target].map({"good": 0, "bad": 1})
>>> data["MOB1"] = [np.random.randint(0, 30) for i in range(len(data))]
>>> features = data.columns.drop([target, "MOB1"]).tolist()
>>>
>>> # generate the testing report
>>> auto_data_testing_report(data
>>>                          , features=features
>>>                          , target=target
>>>                          , date=None # when a date column name is passed, the good/bad distribution is reported per time period according to freq
>>>                          , freq="M"
>>>                          , data_summary_comment="三方数据测试报告样例，支持同时评估多个不同标签定义下的数据有效性"
>>>                          , excel_writer="三方数据测试报告.xlsx"
>>>                          , sheet="分析报告"
>>>                          , start_col=2
>>>                          , start_row=2
>>>                          , dropna=False
>>>                          , writer_params={}
>>>                          , overdue=["MOB1"]
>>>                          , dpd=[15, 7, 3]
>>>                          , bin_params={"method": "dt", "min_bin_size": 0.05, "max_n_bins": 10, "return_cols": ["坏样本数", "坏样本占比", "坏样本率", "LIFT值", "坏账改善", "累积LIFT值", "分档KS值"]} # parameters of the feature_bin_stats function
>>>                          , pictures=['bin', 'ks', 'hist'] # categorical variables do not support ks and hist
>>>                          , corr=True
>>>                          )

scorecardpipeline.utils

@Time : 2023/05/21 16:23 @Author : itlubber @Site : itlubber.art

scorecardpipeline.utils.seed_everything(seed: int, freeze_torch=False)[源代码]

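Fix the random seed of the current environment so that later experiments are reproducible

Parameters:

seed – random seed
freeze_torch – whether to also fix the random seed of pytorch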

scorecardpipeline.utils.init_setting(font_path=None, seed=None, freeze_torch=False, logger=False, **kwargs)[源代码]

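Initialize the environment: suppress warnings, adjust the pandas defaults, fix the random seed and set up logging

Parameters:

seed – random seed, default None
freeze_torch – whether to also fix the pytorch environment
font_path – font used in plots; accepts a system font name or the path of a local font file; defaults to the Chinese font shipped with scorecardpipeline
logger – whether to initialize a logger, default False; when True, the logger is returned
kwargs – parameters passed to the logger initialization

Returns:

logging.Logger, returned when logger is True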

scorecardpipeline.utils.load_pickle(file, engine='joblib')[源代码]

Load a pickle file

Parameters:: file – path of the pickle file
Returns:: the content of the pickle file

scorecardpipeline.utils.save_pickle(obj, file, engine='joblib')[源代码]

Save data to a pickle file

Parameters:

obj – data to save
file – file path

scorecardpipeline.utils.feature_describe(data, feature=None, percentiles=None, missing=None, cardinality=None)[源代码]

scorecardpipeline.utils.groupby_feature_describe(data, by=None, n_jobs=-1, **kwargs)[源代码]

scorecardpipeline.utils.germancredit()[源代码]

Load the German Credit Data dataset

Data source: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

Returns:: pd.DataFrame

scorecardpipeline.utils.round_float(num, decimal=4)[源代码]

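Adjust the decimal precision of the upper or lower bound of a numeric bin; values that do not exceed the precision are returned unchanged

Parameters:

num – upper or lower bound of a bin
decimal – number of decimal places to keep

Returns:

The value after the precision adjustment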
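
The "round only when the value exceeds the precision" behaviour can be illustrated as follows. This is a guess at the intent based on the description above, not the library's code; the helper trim_precision is hypothetical and counts decimal digits from the value's repr.

```python
def trim_precision(num, decimal=4):
    """Round `num` to `decimal` places only if it has more decimals than that."""
    text = repr(float(num))
    if "e" in text or "." not in text:
        return num              # scientific notation, inf or nan: leave as is
    digits = len(text.split(".")[1])
    return round(num, decimal) if digits > decimal else num


print(trim_precision(0.123456789))   # more than 4 decimals: rounded
print(trim_precision(0.25))          # within the precision: unchanged
print(trim_precision(3))             # integers pass through untouched
```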

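scorecardpipeline.utils.feature_bins(bins, decimal=4)[source]

Build the bin intervals from a Combiner's rules, together with the index of each interval.

Parameters:

bins – the Combiner's rules
decimal – precision kept for the interval boundaries, default 4 decimal places

Returns:

dict whose keys are the interval indexes and whose values are the intervals

scorecardpipeline.utils.extract_feature_bin(bin_var)[source]

Extract the lower and upper boundaries of a bin from a single interval string.

Parameters: bin_var – the interval string
Returns: list or tuple

scorecardpipeline.utils.inverse_feature_bins(feature_table, bin_col='分箱')[source]

Recover the Combiner rules from a feature binning table.

Parameters:

feature_table – the feature binning table
bin_col – name of the column that holds the bins in the feature binning table, default 分箱

Returns:

list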

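scorecardpipeline.utils.bin_plot(feature_table, desc='', figsize=(10, 6), colors=['#2639E9', '#F76E6C', '#FE7715'], save=None, anchor=0.935, max_len=35, fontdict={'color': '#000000'}, hatch=True, ending='分箱图')[source]

Simple strategy mining: feature binning plot.

Parameters:

feature_table – table of per-bin statistics for the feature, produced by feature_bin_stats
desc – human-readable meaning of the feature or other related information
figsize – figure size as a tuple, default (10, 6)
colors – theme colors of the figure; the defaults are usually fine
save – path where the figure is saved; folders missing from the path are created, default None
anchor – position of the legend, usually around 0.95; adjust it to the gap between the title and the legend
max_len – maximum displayed length of a bin label, which keeps long categorical bin labels from squeezing the plot area, default 35 characters
fontdict – text style of the labels on the bars, see https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
hatch – whether the bars are drawn with hatching, default True
ending – suffix of the plot title; the title is formatted as f'{desc}{ending}'

Returns:

Figure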

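scorecardpipeline.utils.corr_plot(data, figure_size=(16, 8), fontsize=16, mask=False, save=None, annot=True, max_len=35, linewidths=0.1, fmt='.2f', step=11, linecolor='white', **kwargs)[source]

Feature correlation plot.

Parameters:

data – the raw data
figure_size – figure size, default (16, 8)
fontsize – font size, default 16
mask – whether to show only the lower triangle, default False
save – path where the figure is saved; folders missing from the path are created, default None
annot – whether to print the correlation values in the plot, default True
max_len – maximum displayed length of a feature name, which keeps long names from shrinking the plot area, default 35; pass None for no limit
fmt – number format, effective when annot is True, default two decimal places
step – number of color levels, centered on 0; default 2 (symmetric around 0) * 5 (five levels per side) + 1 (a separate level for 0) = 11
linewidths – width of the lines between cells, default 0.1; set to None to draw no lines
linecolor – color of the lines, effective when linewidths is greater than 0, default white
kwargs – other arguments of sns.heatmap, see https://seaborn.pydata.org/generated/seaborn.heatmap.html

Returns:

Figure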

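scorecardpipeline.utils.ks_plot(score, target, title='', fontsize=14, figsize=(16, 8), save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], anchor=0.945)[source]

KS curve and ROC curve of a numeric feature.

Parameters:

score – numeric feature, usually the scorecard score
target – label values
title – figure title
fontsize – font size, default 14
figsize – figure size, default (16, 8)
save – path where the figure is saved; folders missing from the path are created, default None
colors – theme colors of the figure; the defaults are usually fine
anchor – position of the legend, default 0.945; adjust in small steps around 0.95 as needed

Returns:

Figure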
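
For readers unfamiliar with the statistic behind the KS curve, the sketch below computes it with the usual definition: the largest gap between the cumulative share of bad samples and the cumulative share of good samples as the score threshold moves. ks_statistic is a hypothetical helper and says nothing about how ks_plot computes or draws the curve internally.

```python
def ks_statistic(scores, targets):
    """Max gap between cumulative bad and cumulative good shares.

    targets must be 0/1 labels (1 = bad) and contain both classes.
    """
    pairs = sorted(zip(scores, targets))
    total_bad = sum(targets)
    total_good = len(targets) - total_bad
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        # Consume every sample sharing the same score before measuring the gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
        i = j
    return ks


print(ks_statistic([300, 400, 500, 600], [1, 1, 0, 0]))  # 1.0, perfect separation
print(ks_statistic([1, 2, 3, 4], [1, 0, 1, 0]))          # 0.5
```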

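scorecardpipeline.utils.hist_plot(score, y_true=None, figsize=(15, 10), bins=30, save=None, labels=['好样本', '坏样本'], desc='', anchor=1.11, fontsize=14, kde=False, **kwargs)[source]

Distribution plot of a numeric feature.

Parameters:

score – numeric feature, usually the scorecard score
y_true – label values
figsize – figure size, default (15, 10)
bins – number of bins, default 30
save – path where the figure is saved; folders missing from the path are created, default None
labels – dict or list of class names shown in the legend, default ["好样本", "坏样本"]; the entries follow the order of the target values, starting from 0
anchor – position of the legend, default 1.11; adjust in small steps around 1.1 as needed
fontsize – font size, default 14
kwargs – other arguments of sns.histplot, see https://seaborn.pydata.org/generated/seaborn.histplot.html

Returns:

Figure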

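scorecardpipeline.utils.psi_plot(expected, actual, labels=['预期', '实际'], desc='', save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], figsize=(15, 8), anchor=0.94, width=0.35, result=False, plot=True, max_len=None, hatch=True)[source]

Feature PSI plot.

Parameters:

expected – the expected distribution; pass the binning table of the feature to be validated
actual – the actual distribution; pass the binning table of the feature used as reference
labels – names of the expected and actual distributions, default ["预期", "实际"]
desc – name shown as the title prefix, empty by default; the feature name or scorecard name is recommended
save – path where the figure is saved; folders missing from the path are created, default None
colors – theme colors of the figure; the defaults are usually fine
figsize – figure size, default (15, 8)
anchor – position of the legend, default 0.94; adjust in small steps around 0.94 as needed
width – spacing between the bars of the expected and actual distributions, default 0.35
result – whether to return the PSI statistics table, default False
plot – whether to draw the PSI plot, default True
max_len – maximum displayed length of a feature name, which keeps long names from shrinking the plot area; the default None means no limit
hatch – whether the bars are drawn with hatching, default True

Returns:

a pd.DataFrame when result is True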
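
The population stability index itself follows the common definition PSI = Σ (actual share − expected share) × ln(actual share / expected share), summed over bins. The sketch below applies it to raw per-bin counts. The psi helper and its eps floor are assumptions made for illustration; psi_plot takes binning tables rather than count lists, and its internal computation is not shown in this reference.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index over aligned bins, from per-bin sample counts."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the shares so an empty bin does not produce log(0) or division by zero.
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


print(psi([100, 200, 300], [100, 200, 300]))  # 0.0, identical distributions
print(round(psi([50, 50], [20, 80]), 4))      # 0.4159, a clear shift
```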

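scorecardpipeline.utils.csi_plot(expected, actual, score_bins, labels=['预期', '实际'], desc='', save=None, colors=['#2639E9', '#F76E6C', '#FE7715'], figsize=(15, 8), anchor=0.94, width=0.35, result=False, plot=True, max_len=None, hatch=True)[source]

Feature CSI plot.

Parameters:

expected – the expected distribution; pass the binning table of the feature to be validated
actual – the actual distribution; pass the binning table of the feature used as reference
score_bins – the scorecard table of the logistic regression model
labels – names of the expected and actual distributions, default ["预期", "实际"]
desc – name shown as the title prefix, empty by default; the feature name or scorecard name is recommended
save – path where the figure is saved; folders missing from the path are created, default None
colors – theme colors of the figure; the defaults are usually fine
figsize – figure size, default (15, 8)
anchor – position of the legend, default 0.94; adjust in small steps around 0.94 as needed
width – spacing between the bars of the expected and actual distributions, default 0.35
result – whether to return the CSI statistics table, default False
plot – whether to draw the CSI plot, default True
max_len – maximum displayed length of a feature name, which keeps long names from shrinking the plot area; the default None means no limit
hatch – whether the bars are drawn with hatching, default True

Returns:

a pd.DataFrame when result is True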

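scorecardpipeline.utils.dataframe_plot(df, row_height=0.4, font_size=14, header_color='#2639E9', row_colors=['#dae3f3', 'w'], edge_color='w', bbox=[0, 0, 1, 1], header_columns=0, ax=None, save=None, **kwargs)[source]

Render a dataframe as a picture; recommended for datasets with few rows and columns.

Parameters:

df – the dataframe to draw
row_height – row height, default 0.4
font_size – font size, default 14
header_color – header color, default #2639E9
row_colors – row colors, default ['#dae3f3', 'w']; the two colors alternate
edge_color – color of the table borders, default white
bbox – which edges are shown, [left, right, top, bottom]; the default shows only the top and bottom borders
header_columns – number of header rows; by default there is a single header row, i.e. 0
ax – pass the corresponding ax to draw into a subplot of an existing canvas
save – path where the figure is saved; folders missing from the path are created, default None
kwargs – arguments of plt.table, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.table.html

Returns:

Figure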

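scorecardpipeline.utils.distribution_plot(data, date='date', target='target', save=None, figsize=(10, 6), colors=['#2639E9', '#F76E6C', '#FE7715'], freq='M', anchor=0.94, result=False, hatch=True)[source]

Sample distribution over time.

Parameters:

data – the dataset
date – name of the date column, default date; a column that is not in a date format is converted automatically where possible. Replace it with the matching date column in your data (application time, credit-granting time, loan disbursement time, and so on)
target – name of the label column in the dataset, default target
save – path where the figure is saved; folders missing from the path are created, default None
figsize – figure size, default (10, 6)
colors – theme colors of the figure; the defaults are usually fine
freq – date frequency used for aggregation (year, quarter, month, week, day, and so on), see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
anchor – position of the legend, default 0.94; adjust in small steps around 0.94 as needed
result – whether to return the distribution table, default False
hatch – whether the bars are drawn with hatching, default True

Returns:

the distribution table when result is True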

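scorecardpipeline.utils.sample_lift_transformer(df, rule, target='target', sample_rate=0.7)[source]

When good and bad samples are drawn at a ratio of sample_rate:1, compute the lift on the sampled data and on the original data.

Parameters:

df – the original data; every variable must be numeric
rule – a Rule
target – name of the target variable
sample_rate – sampling ratio of good samples

Returns:

lift_sam: float, lift of the rejected population on the sampled data
lift_ori: float, lift of the rejected population on the original data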

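scorecardpipeline.utils.tasks_executor(tasks, n_jobs=-1, pool='thread')[source]

Run tasks with multiple processes or multiple threads.

Parameters:

tasks – the tasks
n_jobs – size of the thread pool or process pool
pool – pool type, default thread (thread pool)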
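
The general pattern of dispatching work to a thread or process pool can be sketched with concurrent.futures. The shape of tasks is not documented above, so the (function, args) pairs used here, like the helper name run_tasks_sketch, are assumptions for illustration only.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_tasks_sketch(tasks, n_jobs=-1, pool="thread"):
    """Hypothetical helper: run (func, args) pairs in a pool and keep the input order."""
    # Treat -1 as "use every available core" (assumption borrowed from joblib conventions).
    workers = (os.cpu_count() or 1) if n_jobs in (-1, None) else n_jobs
    executor_cls = ThreadPoolExecutor if pool == "thread" else ProcessPoolExecutor
    with executor_cls(max_workers=workers) as executor:
        futures = [executor.submit(func, *args) for func, args in tasks]
        # Collect in submission order, not completion order.
        return [future.result() for future in futures]


results = run_tasks_sketch([(pow, (2, 10)), (sum, ([1, 2, 3],)), (len, ("abc",))], n_jobs=2)
print(results)  # [1024, 6, 3]
```

Threads suit I/O-bound tasks such as writing many figures to disk; CPU-bound work in pure Python generally benefits more from a process pool.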

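scorecardpipeline.utils.monotonic_bad_rate_binning(df, feature, target, target_rates, greater_is_better=True)[source]

Find the best bin cut points for the given target default rates while keeping the overdue rate monotonic.

Parameters:

df – DataFrame containing the feature and the target
feature – name of the feature column to bin
target – name of the target column
target_rates – list of target default rates (ordered high to low or low to high, depending on greater_is_better)
greater_is_better – whether a higher score is better (lower default rate)

Returns:

list of bin cut points (sorted and unique)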

scorecardpipeline.excel_writer

@Time : 2023/05/14 16:23 @Author : itlubber @Site : itlubber.art

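class scorecardpipeline.excel_writer.ExcelWriter(style_excel=None, style_sheet_name='初始化', mode='replace', fontsize=10, font='楷体', theme_color='2639E9', opacity=0.85, system=None)[source]

Bases: object

__init__(style_excel=None, style_sheet_name='初始化', mode='replace', fontsize=10, font='楷体', theme_color='2639E9', opacity=0.85, system=None)[source]

Excel writer.

Parameters:

style_excel – style template file; defaults to template.xlsx under the installed package path, so pass the new location if that file is moved
style_sheet_name – name of the sheet inside the template that holds the initial styles; the default is usually fine
mode – write mode, replace (default) or append; in append mode the content of the existing Excel file is copied into the new file
fontsize – font size of the content written to the Excel file, default 10
font – font of the content written to the Excel file, default 楷体 (KaiTi)
theme_color – theme color, default 2639E9; note that the leading # is omitted
system – operating system the Excel report targets; mac is assumed by default (the signature default is None), windows and linux are also accepted, and windows re-fits the picture size
opacity – opacity applied to the theme color when a dataframe is written with color fill, default 0.85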

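add_conditional_formatting(worksheet, start_space, end_space)[source]

Apply conditional formatting.

Parameters:

worksheet – the sheet that receives the conditional formatting
start_space – position of the first cell
end_space – position of the last cell

static set_column_width(worksheet, column, width)[source]

Adjust an Excel column width.

Parameters:

worksheet – the sheet whose column width is adjusted
column – the column, given either as an index or as a letter
width – width to set for the column

static set_number_format(worksheet, space, _format)[source]

Set the number display format.

Parameters:

worksheet – the sheet whose number format is adjusted
space – the cell range
_format – the display format, see openpyxl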

set_freeze_panes(worksheet, space)[source]

Freeze panes.

Parameters:

worksheet – the sheet whose panes are frozen
space – the cell position that defines the frozen area

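get_sheet_by_name(name)[source]

Get the worksheet called name; if it does not exist, a sheet with that name is copied from the initial template file.

Parameters: name – name of the worksheet to fetch

move_sheet(worksheet, offset: int = 0, index: int | None = None)[source]

Move a sheet to another position.

Parameters:

worksheet – the sheet to move, given as a string or a Worksheet
offset – relative offset to move by, default 0; ignored when index is passed
index – target position index; a value beyond the last position moves the sheet to the end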

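insert_hyperlink2sheet(worksheet, insert_space, hyperlink=None, file=None, sheet=None, target_space=None)[source]

Insert a hyperlink into a cell of a sheet.

Parameters:

worksheet – the sheet that receives the hyperlink
insert_space – cell in which the hyperlink is inserted, either "B2" or (2, 2); the first cell is (1, 1)
hyperlink – target of the hyperlink; mutually exclusive with target_space, and hyperlink takes precedence. Accepted formats: [f"file://{file_path} - #{sheet_name}!{cell}", f"#{sheet_name}!{cell}", f"{cell}"], where cell looks like "B2"
file – path of the file the hyperlink points to, default None, meaning the current Excel file; ignored when hyperlink is passed. When passing file, make sure sheet is passed too, otherwise the current sheet is assumed
sheet – name of the sheet the hyperlink points to, default None, meaning the current sheet; ignored when hyperlink is passed
target_space – cell the hyperlink points to, default None, either "B2" or (2, 2); ignored when hyperlink is passed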

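insert_value2sheet(worksheet, insert_space, value='', style='content', auto_width=False, end_space=None, align: dict | None = None, max_col_width=50)[source]

Insert a value with a given style into a cell of a sheet.

Parameters:

worksheet – the sheet that receives the content
insert_space – cell in which the content is inserted, either "B2" or (2, 2)
value – the content to insert
style – the style to render with, see the styles set up in init_style
end_space – to merge cells, pass the position of the last cell of the merged range, either "B2" or (2, 2)
auto_width – whether to adjust the column width automatically
align – text alignment, see Alignment
max_col_width – maximum column width of the cell, default 50

Returns:

the position just after the last row and the last column of the inserted element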

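insert_pic2sheet(worksheet, fig, insert_space, figsize=(600, 250))[source]

Insert a picture into Excel.

Parameters:

worksheet – the sheet that receives the picture
fig – path of the picture to insert
insert_space – cell at which the picture starts
figsize – size of the picture

Returns:

the position just after the last row and the last column of the inserted element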

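insert_rows(worksheet, row, row_index, col_index, merge_rows=None, style='', auto_width=False, style_only=False, multi_levels=False)[source]

Insert one row of data into Excel; insert_df2sheet relies on this method.

Parameters:

worksheet – the sheet that receives the content
row – the row data
row_index – row index of the inserted data, used to decide which border style applies
col_index – column index of the inserted data, used to decide which border style applies
merge_rows – row indexes whose cells should be merged
style – Excel style of the inserted data
auto_width – whether to adjust the column width automatically; doing so changes the style template of that column, and the default white fill of non-content columns no longer applies
style_only – whether to use the fill style
multi_levels – whether the data has a multi-level index or multi-level columns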

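insert_df2sheet(worksheet, data, insert_space, merge_column=None, header=True, index=False, auto_width=False, fill=False, merge=False, merge_index=True)[source]

Insert a dataframe with a given style into an Excel file.

Parameters:

worksheet – the sheet that receives the content
data – the dataframe to insert
insert_space – cell at which the inserted content starts
merge_column – columns to display in groups, given as indexes or column names; the data must be sorted beforehand, because ExcelWriter no longer reorders it automatically as of 0.1.33
header – whether to write the dataframe header; multi-level headers and multi-level indexes are supported as of 0.1.30
index – whether to write the dataframe index
merge_index – when the index is written, whether consecutive identical values on the same index level are merged, default True
auto_width – whether to adjust the column width automatically; doing so changes the style template of that column, and the default white fill of non-content columns no longer applies
fill – whether to use color fill instead of borders; when fill is True, merge_column has no effect
merge – whether to merge cells, used together with merge_column; in the current version it only works when merge_column holds a single column

Returns:

the position just after the last row and the last column of the inserted element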

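merge_cells(worksheet, start, end)[source]

Merge cells within one column and merge their styles accordingly.

Parameters:

worksheet – the sheet whose cells are merged
start – position where the merged range starts
end – position where the merged range ends

static check_contain_chinese(check_str)[source]

Check whether a string contains Chinese characters.

Parameters: check_str – the string to check
Returns: a list<bool> telling for each character whether it is Chinese, the number of English characters, and the number of Chinese characters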
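
A minimal sketch of such a check is shown below, assuming that "Chinese" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that every other character counts as "English". Both assumptions, and the name check_contain_chinese_sketch, are illustrative; the library may classify characters differently.

```python
def check_contain_chinese_sketch(check_str):
    """Hypothetical re-implementation: per-character flags plus the two counts."""
    # True for characters inside the CJK Unified Ideographs block.
    flags = ["\u4e00" <= ch <= "\u9fff" for ch in check_str]
    chinese_count = sum(flags)
    english_count = len(flags) - chinese_count
    return flags, english_count, chinese_count


print(check_contain_chinese_sketch("评分卡AB"))
# ([True, True, True, False, False], 2, 3)
```

Counts like these are typically useful for estimating column widths, since CJK glyphs render wider than Latin ones; whether the library uses them that way is an assumption.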

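static astype_insertvalue(value, decimal_point=4)[source]

Parameters:

value – the content to insert into Excel
decimal_point – precision kept for floating-point values, default 4 decimal places

Returns:

the formatted content that is stored in Excel

static calc_continuous_cnt(list_, index_=0)[source]

Given a list, count how many times the element at a given index repeats consecutively, starting from that index.

Parameters:

list_ – the list to scan
index_ – index of the element

Returns:

the element value, the index, and the number of consecutive occurrences

Example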

>>> calc_continuous_cnt = ExcelWriter.calc_continuous_cnt
>>> list_ = ['A','A','A','A','B','C','C','D','D','D']
>>> calc_continuous_cnt(list_, 0)
('A', 0, 4)
>>> calc_continuous_cnt(list_, 4)
('B', 4, 1)
>>> calc_continuous_cnt(list_, 6)
('C', 6, 1)
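
The behavior shown in the example above can be reproduced with a short loop. The function below is a hypothetical re-implementation written to match those three outputs, not the library's source; the return value for an out-of-range index is an assumption.

```python
def calc_continuous_cnt_sketch(list_, index_=0):
    """Count how often list_[index_] repeats consecutively, starting at index_."""
    if index_ >= len(list_):
        return None, index_, 0  # assumption: nothing to count past the end
    value = list_[index_]
    count = 0
    for item in list_[index_:]:
        if item != value:
            break
        count += 1
    return value, index_, count


values = ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D']

# Walk the list run by run, as one would when deciding which cells to merge.
runs = []
i = 0
while i < len(values):
    value, start, cnt = calc_continuous_cnt_sketch(values, i)
    runs.append((value, start, cnt))
    i += cnt

print(runs)  # [('A', 0, 4), ('B', 4, 1), ('C', 5, 2), ('D', 7, 3)]
```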

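static itlubber_border(border, color, white=False)[source]

itlubber's border style generator.

Parameters:

border – border styles; an input of length 3 is read as [left, right, bottom], one of length 4 as [left, right, bottom, top]
color – border color
white – whether to draw white borders

Returns:

Border

static get_cell_space(space)[source]

Convert an Excel cell position from one notation to the other.

Parameters: space – the Excel cell position, in either of two notations, B1 or (2, 2)
Returns: the cell position in the other notation, tuple / str

Example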

>>> get_cell_space = ExcelWriter.get_cell_space
>>> get_cell_space("B3")
(2, 3)
>>> get_cell_space((2, 2))
'B2'

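static calculate_rgba_color(hex_color, opacity, prefix='#')[source]

Compute the color that corresponds to a given color at a given opacity.

Parameters:

hex_color – the color value in hex format
opacity – the opacity, a number in [0, 1]
prefix – prefix of the returned color

Returns:

the color at the given opacity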
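
One conventional way to turn a color plus an opacity into a solid color is alpha compositing over a white background, sketched below. The helper blend_with_white is hypothetical, and the white background as well as the direction in which opacity acts are assumptions; the library's own formula may differ.

```python
def blend_with_white(hex_color, opacity, prefix="#"):
    """Composite hex_color over white: opacity 1 keeps the color, opacity 0 gives white."""
    hex_color = hex_color.lstrip("#")
    rgb = [int(hex_color[i:i + 2], 16) for i in (0, 2, 4)]
    # Standard alpha compositing per channel against a white (255) background.
    blended = [round(opacity * c + (1 - opacity) * 255) for c in rgb]
    return prefix + "".join(f"{c:02X}" for c in blended)


print(blend_with_white("2639E9", 1.0))  # #2639E9
print(blend_with_white("2639E9", 0.0))  # #FFFFFF
```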

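init_style(font, fontsize, theme_color)[source]

Initialize the cell styles.

Parameters:

font – font name
fontsize – font size
theme_color – theme color

save(filename, close=True)[source]

Save the Excel file.

Parameters:

filename – path where the Excel file is saved
close – whether to release the writer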

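scorecardpipeline.excel_writer.dataframe2excel(data, excel_writer, sheet_name=None, title=None, header=True, theme_color='2639E9', condition_color=None, fill=True, percent_cols=None, condition_cols=None, custom_cols=None, custom_format='#,##0', color_cols=None, percent_rows=None, condition_rows=None, custom_rows=None, color_rows=None, start_col=2, start_row=2, mode='replace', figures=None, figsize=(600, 350), writer_params={}, **kwargs)[source]

Insert a dataframe with a given style into an Excel file.

Parameters:

data – the dataframe to save. The index is not saved by default; to keep it, call .reset_index().rename(columns={"index": "index name"}) before saving. For some indexes the column produced by reset_index is named 0 rather than index, so adjust the rename accordingly
excel_writer – path of the target Excel file, or an ExcelWriter
sheet_name – the sheet that receives the content; if it is a Worksheet, the data is inserted into it directly
title – optional title inserted before the dataframe
figures – pictures inserted between the title and the data table; several picture paths can be passed at once and are inserted in the given order
figsize – size of the inserted pictures; to keep the layout uniform only one size can be set, default (600, 350) (width, height)
header – whether to write the dataframe header; multi-level headers are not supported yet
theme_color – theme color
condition_color – theme color of the conditional formatting; defaults to theme_color when not passed
fill – whether to use the cell color-fill style instead of the border style
percent_cols – columns displayed as percentages; only the display format changes, not the values
condition_cols – columns displayed with conditional formatting (borderless gradient data bars)
color_cols – columns displayed with conditional-formatting color fill (gradient cell fill)
custom_cols – columns displayed in a custom format, used together with custom_format
custom_format – the custom display format, used together with custom_cols; the default #,##0 shows integers with thousands separators
start_col – first column in Excel, default 2, i.e. the second column
start_row – first row in Excel, default 2, i.e. the second row; if title is set, the dataframe starts at row start_row + 2
mode – Excel write mode, append or replace, default replace; append adds content to an existing Excel file without overwriting what is already there
writer_params – arguments passed through to ExcelWriter
**kwargs –
other arguments, passed through to insert_df2sheet; for example, auto_width=True adjusts the column widths to the content

Returns:

the position just after the last row and the last column of the inserted element

Example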

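>>> import numpy as np
>>> import pandas as pd
>>> from openpyxl.utils import column_index_from_string
>>> from scorecardpipeline.excel_writer import ExcelWriter, dataframe2excel
>>> writer = ExcelWriter(theme_color='3f1dba')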
>>> worksheet = writer.get_sheet_by_name("模型报告")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B2", value="模型报告", style="header")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B4", value="模型报告", style="header", end_space="D4")
>>> end_row, end_col = writer.insert_value2sheet(worksheet, "B6", value="当前模型主要为评分卡模型", style="header_middle", auto_width=True)
>>> # example: saving a dataframe with a single-level index
>>> sample = pd.DataFrame(np.concatenate([np.random.random_sample((10, 10)) * 40, np.random.randint(0, 3, (10, 2))], axis=1), columns=[f"B{i}" for i in range(10)] + ["target", "type"])
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")))
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), fill=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), fill=True, header=False, index=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column="target")
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample.set_index("type"), (end_row + 2, column_index_from_string("B")), merge_column="target", index=True, fill=True)
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column=["target", "type"])
>>> end_row, end_col = writer.insert_df2sheet(worksheet, sample, (end_row + 2, column_index_from_string("B")), merge_column=[10, 11])
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"])
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"], title="测试样例")
>>> end_row, end_col = dataframe2excel(sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, percent_cols=["B2", "B6"], condition_cols=["B3", "B9"], color_cols=["B4"], title="测试样例", figures=["../examples/model_report/auto_report_corr_plot.png"])
>>> # example: saving a dataframe with a multi-level index
>>> multi_sample = pd.DataFrame(np.random.randint(0, 150, size=(8, 12)), columns=pd.MultiIndex.from_product([['模拟考', '正式考'], ['数学', '语文', '英语', '物理', '化学', '生物']]), index=pd.MultiIndex.from_product([['期中', '期末'], ['雷军', '李斌'], ['测试一', '测试二']]))
>>> multi_sample.index.names = ["考试类型", "姓名", "测试"]
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, header=False)
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True)
>>> end_row, end_col = dataframe2excel(multi_sample, writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, fill=False)
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=False, fill=False, merge_column=[('', '考试类型'), ('', '姓名')])
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=False, fill=False, merge_column=[('', '考试类型')], merge=True)
>>> end_row, end_col = dataframe2excel(multi_sample.reset_index(names=multi_sample.index.names, col_level=-1), writer, theme_color='3f1dba', sheet_name="模型报告", start_row=end_row + 2, title="测试样例", index=True, fill=True, merge_column=[('', '考试类型')], merge=True)
>>> writer.save("./测试样例.xlsx")

scorecardpipeline.rule

@Time : 2024/2/26 12:00 @Author : itlubber @Site : itlubber.art

scorecardpipeline.rule.get_columns_from_query(query_str)[source]

Get the columns used by a pandas query statement.

Parameters: query_str – a query statement supported by pandas query
Returns: the column names used by the query statement
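
For simple numeric expressions, the column names can be recovered by parsing the query as a Python expression and collecting its identifiers. The helper below is a hypothetical sketch of that idea. It does not handle pandas-specific syntax such as backtick-quoted column names or @local variables, and it is not the library's implementation.

```python
import ast


def columns_from_query_sketch(query_str):
    """Hypothetical helper: sorted identifier names used in a query-style expression."""
    tree = ast.parse(query_str, mode="eval")
    # Every bare name in the expression is assumed to be a column.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names)


print(columns_from_query_sketch("duration_in_month < 10 and credit_amount < 500"))
# ['credit_amount', 'duration_in_month']
```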

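scorecardpipeline.rule.json2expr(data, max_index, feature_list)[source]

class scorecardpipeline.rule.Rule(expr)[source]

Bases: object

__init__(expr)[source]

Rule set.

Parameters: expr – an expression written the way it would be passed to DataFrame.query; only rules on numeric variables are currently supported

Example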

>>> from scorecardpipeline import *
>>> target = "creditability"
>>> data = germancredit()
>>> data[target] = data[target].map({"good": 0, "bad": 1})
>>> data = data.select_dtypes("number") # string-typed rules are not supported yet
>>> rule1 = Rule("duration_in_month < 10")
>>> rule2 = Rule("credit_amount < 500")
>>> rule1.report(data, target=target)
>>> rule2.report(data, target=target)
>>> (rule1 | rule2).report(data, target=target)
>>> (rule1 & rule2).report(data, target=target)

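predict(X: DataFrame, part='')[source]

report(datasets: DataFrame, target='target', overdue=None, dpd=None, del_grey=False, desc='', filter_cols=None, prior_rules=None) → DataFrame[source]

Produce the rule performance report as a table.

Parameters:

datasets – the dataset; it must contain either the target variable or the overdue days. When the target variable is missing, it is derived from the overdue days, in which case the DPD days that define overdue must also be passed
target – name of the target variable, default target
desc – description of the rule, shown in the returned table
filter_cols – list of fields to return, not passed by default
prior_rules – prior rules; when passed, the data is filtered by them first and the rule is evaluated on what remains
overdue – name of the overdue-days field
dpd – overdue definition: overdue days > DPD is labeled 1, everything else 0; only relevant when the overdue field is in use
del_grey – whether to drop rows whose overdue days fall in (0, dpd]; only relevant when the overdue field is in use

Returns:

pd.DataFrame, the rule performance evaluation table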
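
The labeling described for dpd and del_grey, together with the usual lift metric (bad rate among the samples a rule hits divided by the overall bad rate), can be sketched on plain lists. The helpers label_from_overdue and rule_lift are hypothetical, and the lift definition is the common one rather than something this reference spells out; the actual report works on DataFrames and returns more columns.

```python
def label_from_overdue(overdue_days, dpd, del_grey=False):
    """Return (kept_indices, labels): 1 if overdue days > dpd, else 0.

    With del_grey=True, rows with 0 < overdue days <= dpd are dropped.
    """
    kept, labels = [], []
    for i, days in enumerate(overdue_days):
        if del_grey and 0 < days <= dpd:
            continue
        kept.append(i)
        labels.append(1 if days > dpd else 0)
    return kept, labels


def rule_lift(hits, labels):
    """Lift = bad rate among hit samples / overall bad rate."""
    overall = sum(labels) / len(labels)
    hit_labels = [y for hit, y in zip(hits, labels) if hit]
    return (sum(hit_labels) / len(hit_labels)) / overall


overdue = [0, 3, 10, 20, 0, 40, 0, 0]
duration = [5, 12, 6, 8, 30, 7, 25, 9]

kept, labels = label_from_overdue(overdue, dpd=7)
hits = [duration[i] < 10 for i in kept]  # the rule "duration < 10"
lift = rule_lift(hits, labels)
print(labels)          # [0, 0, 1, 1, 0, 1, 0, 0]
print(round(lift, 2))  # 1.6
```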

result()[source]

static save(report, excel_writer, sheet_name=None, merge_column=None, percent_cols=None, condition_cols=None, custom_cols=None, custom_format='#,##0', color_cols=None, start_col=2, start_row=2, **kwargs)[source]: save the rule results to Excel; the parameters are the same as in https://scorecardpipeline.itlubber.art/scorecardpipeline.html#scorecardpipeline.dataframe2excel

scorecardpipeline.rule_extraction

@Time : 2024/2/29 13:29 @Author : itlubber @Site : itlubber.art

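class scorecardpipeline.rule_extraction.DecisionTreeRuleExtractor(target='target', labels=['positive', 'negative'], feature_map={}, nan=-1.0, max_iter=128, writer=None, seed=None, theme_color='2639E9', decimal=4)[source]

Bases: object

__init__(target='target', labels=['positive', 'negative'], feature_map={}, nan=-1.0, max_iter=128, writer=None, seed=None, theme_color='2639E9', decimal=4)[source]

Toolkit for automatic rule mining with decision trees.

Parameters:

target – name of the good/bad label column in the dataset, default target
labels – names of the good and bad labels, a list of length 2 whose element 0 is the good label and element 1 the bad label, default ["positive", "negative"]
feature_map – variable names and their meanings, used to make later reports and strategy output more readable, default {}
nan – value used to fill missing values during decision-tree strategy mining, default -1
max_iter – maximum number of trees trained on the dataset; after each tree is built, the feature with the highest importance is removed before the next tree is built, default 128
writer – an ExcelWriter created earlier in the program; an existing writer can be passed so that all later content is saved into its workbook, default None
seed – random seed, used to make results reproducible, default None
theme_color – theme color, default 2639E9 (Klein blue); other colors can be set
decimal – precision of the decision-tree split thresholds, default 4, i.e. 4 decimal places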

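encode_cat_features(X, y)[source]

get_dt_rules(tree)[source]

select_dt_rules(decision_tree, x, y, lift=0.0, max_samples=1.0, save=None, verbose=False, drop=False)[source]

query_dt_rules(x, y, parsed_rules=None)[source]

insert_dt_rules(parsed_rules, end_row, start_col, save=None, sheet=None, figsize=(500, 350))[source]

fit(x, y=None, max_depth=2, lift=0.0, max_samples=1.0, min_score=None, verbose=False, *args, **kwargs)[source]

Mine combined strategies.

Parameters:

x – the dataset, including the label
max_depth – maximum depth of the decision tree, i.e. the maximum number of features combined, default 2
lift – minimum lift of a combined strategy, default 0., which keeps every combined strategy
max_samples – maximum share of samples a combined strategy may cover, default 1.0, which keeps every combined strategy
min_score – minimum AUC of a fitted decision tree; if it is not reached, no further trees are built
verbose – whether to run in debug mode; only effective in a jupyter environment
kwargs – DecisionTreeClassifier arguments, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

transform(x, y=None)[source]

report(valid=None, sheet='组合策略汇总', save=None)[source]

Write the combined strategies into an Excel document.

Parameters:

valid – the validation dataset
sheet – name of the sheet in which the combined strategies are saved
save – file path where the report is saved

Returns:

how the combined strategies hit on each dataset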

scorecardpipeline.logger

scorecardpipeline.logger.init_logger(filename=None, stream=True, fmt='[ %(asctime)s ][ %(levelname)s ][ %(filename)s:%(funcName)s:%(lineno)d ] %(message)s', datefmt=None)[source]

Initialize logging.

Parameters:

filename – path of the log file; when not passed, nothing is logged to a file, default None
stream – whether to print the log to the terminal, default True
fmt – log format, see https://docs.python.org/3/library/logging.html#formatter-objects
datefmt – date format

Returns:

logging.Logger
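
A logger with the same parameters can be assembled from the standard logging module, as sketched below. init_logger_sketch, its name argument, and the INFO level are assumptions made for illustration; only the fmt string is taken from the signature above.

```python
import logging
import sys

DEFAULT_FMT = "[ %(asctime)s ][ %(levelname)s ][ %(filename)s:%(funcName)s:%(lineno)d ] %(message)s"


def init_logger_sketch(filename=None, stream=True, fmt=DEFAULT_FMT, datefmt=None, name="scorecard_sketch"):
    """Hypothetical helper: build a logger that writes to the terminal and/or a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # assumed level
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter(fmt, datefmt=datefmt)
    if stream:
        stream_handler = logging.StreamHandler(sys.stdout)
        stream_handler.setFormatter(formatter)
        logger.addHandler(stream_handler)
    if filename:
        file_handler = logging.FileHandler(filename, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


logger = init_logger_sketch()
logger.info("environment initialized")
```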