Pandas中高效处理宽表结构的教育经历数据：从宽表到长表的规范化重构

霞舞

发布时间：2026-03-14 15:48:13

417人浏览过

来源于php中文网

原创

本文介绍如何使用pandas的melt()将包含多组重复字段（如degree1/2/3、specialisation1/2/3等）的宽表Excel数据，规范化为每行一条教育经历的长表结构，从而简化后续迭代、校验与Selenium自动化流程。

本文介绍如何使用pandas的`melt()`将包含多组重复字段（如degree1/2/3、specialisation1/2/3等）的宽表excel数据，规范化为每行一条教育经历的长表结构，从而简化后续迭代、校验与selenium自动化流程。

在处理招聘或HR类Excel数据时，常遇到“宽表设计”问题：同一候选人（一行记录）的多个教育背景被横向展开为degree1、specialisation1、college1、degree2、specialisation2……直至degree5等列。这种结构虽便于人工阅读，却严重阻碍程序化处理——直接使用df.iterrows()无法自然遍历每位候选人的全部教育项，导致逻辑臃肿、易出错（如原问题中尝试的复杂索引分组函数）。

理想方案是数据重塑（Reshaping）：将宽表转换为标准长表（tidy data），即每行唯一标识一位候选人的一条教育经历。pandas 提供了简洁高效的 pd.melt() 方法实现这一目标。

✅ 正确做法：用 melt() 实现规范化长表

假设已通过 pd.read_excel() 加载数据：

import pandas as pd

df = pd.read_excel("applicants.xlsx")

观察原始列名，识别两类字段：

标识变量（id_vars）：描述候选人本身的列，如 'sr_no', 'name', 'mobile', 'emp_status' 等；
度量变量（value_vars）：需展开的教育相关列，如所有以 degree, specialisation, college 开头的列（共 3 类 × 最多 5 组 = 15 列）。

我们无需手动枚举全部15列——可动态提取：

PixVerse

PixVerse是一款强大的AI视频生成工具，可以轻松地将多种输入转化为令人惊叹的视频。

下载

# 定义不变的标识列
id_cols = ['sr_no', 'old_emp_id', 'name', 'address', 'mobile', 'emp_status']

# 自动筛选所有教育相关列（含 degreeX / specialisationX / collegeX）
edu_cols = [col for col in df.columns 
            if col.startswith(('degree', 'specialisation', 'college'))]

# 执行熔解（melt）：将教育列“折叠”为两列 —— attribute（字段名）和 value（值）
df_long = pd.melt(
    df,
    id_vars=id_cols,
    value_vars=edu_cols,
    var_name='attribute',
    value_name='value'
)

此时 df_long 结构如下（节选）：

sr_no	name	mobile	emp_status	attribute	value
1	Amit	356363474	full-time	degree1	Computer Science
1	Amit	356363474	full-time	specialisation1	Robotics
1	Amit	356363474	full-time	college1	IIT Delhi
1	Amit	356363474	full-time	degree2	MSC ML
1	Amit	356363474	full-time	specialisation2	MIT
...	...	...	...	...	...

✅ 优势立现：每位候选人对应多行，每行含完整身份信息 + 单一教育属性，可直接按 sr_no 分组后批量处理：

# 按候选人分组，聚合其全部教育记录
for sr_no, group in df_long.groupby('sr_no'):
    candidate_name = group['name'].iloc[0]
    print(f"\n✅ 处理候选人 {candidate_name}（ID: {sr_no}）:")

    # 提取该候选人的所有教育条目（每3行一组：degree/specialisation/college）
    edu_records = []
    for _, row in group.iterrows():
        attr, val = row['attribute'], row['value']
        if pd.isna(val): continue  # 跳过空值

        # 解析字段编号（如 degree1 → 1, college3 → 3）
        import re
        match = re.search(r'(\d+)$', attr)
        if not match: continue
        idx = int(match.group(1))

        # 动态构建字典（支持 degreeX / specialisationX / collegeX）
        key = attr.replace(str(idx), '')  # 如 'degree1' → 'degree'
        if len(edu_records) < idx:
            edu_records.append({})
        edu_records[idx-1][key] = val

    # 输出标准化教育列表（每项含 degree, specialisation, college）
    for i, edu in enumerate(edu_records, 1):
        print(f"  ? 教育{i}: {edu.get('degree', 'N/A')} | "
              f"{edu.get('specialisation', 'N/A')} | "
              f"{edu.get('college', 'N/A')}")

⚠️ 注意事项与最佳实践

避免 iterrows() 嵌套循环处理宽表：原问题中自定义函数依赖硬编码索引与除法运算，极易因列顺序变化或缺失字段崩溃；melt() 是声明式、健壮且可读性高的替代方案。
字段命名一致性至关重要：确保所有教育列严格遵循 degree1/specialisation1/college1 等命名模式，否则正则解析会失效。建议在数据导入后添加校验：
```
assert all(re.match(r'(degree|specialisation|college)\d+$', c) for c in edu_cols), "教育字段命名不规范"
```
空值与截断处理：实际数据中常存在 degree3 有值但 specialisation3 为空的情况。melt() 保留原始空值，后续聚合时应显式跳过 pd.isna()，避免填充错误。
性能提示：对于超大数据集（>10万行），可先 df[edu_cols] = df[edu_cols].fillna("") 再 melt()，减少 pd.NA 开销；必要时改用 pd.wide_to_long()（需列名满足严格格式，如 degree_1, degree_2）。

✅ 总结

将宽表教育数据通过 pd.melt() 重构为长表，是面向自动化流程（如 Selenium 表单填写）的关键预处理步骤。它消除了手工索引计算的脆弱性，使代码逻辑回归“一人一行教育经历”的自然语义，大幅提升可维护性与扩展性。掌握 melt()、wide_to_long() 及 pivot() 等重塑工具，是 Pandas 高阶数据清洗的核心能力。