
Pandas Data Cleaning

pandas string handling: common functions

  • Case conversion: lower(), upper()
  • Strip leading/trailing whitespace: strip()
  • Check whether values start with a given string: startswith()
  • Replace specific substrings: replace()
  • Split one column into several columns: split()
  • Custom per-value processing: lambda functions
```python
import pandas as pd
```
```python
data = pd.read_excel('demo-data.xlsx')
data
```

|   | Unnamed: 0 | day       | Name      | Status | Cid    |
|---|------------|-----------|-----------|--------|--------|
| 0 | 0          | Friday    | xiao meng | happy  | CS 001 |
| 1 | 1          | Saturday  | xiao meng | happy  | CS 002 |
| 2 | 2          | Sunday    | Xiao Meng | sad    | IS 001 |
| 3 | 3          | Monday    | XIAO Meng | happy  | IS 001 |
| 4 | 4          | Tuesday   | XIAO MENG | happy  | CS 002 |
| 5 | 5          | Wednesday | xiao mENg | sad    | IS 002 |
| 6 | 6          | Thursday  | xIAO MENg | sad    | IS 002 |
```python
data['name_upper'] = data['Name'].str.upper()
data
```
|   | Unnamed: 0 | day       | Name      | Status | Cid    | name_upper |
|---|------------|-----------|-----------|--------|--------|------------|
| 0 | 0          | Friday    | xiao meng | happy  | CS 001 | XIAO MENG  |
| 1 | 1          | Saturday  | xiao meng | happy  | CS 002 | XIAO MENG  |
| 2 | 2          | Sunday    | Xiao Meng | sad    | IS 001 | XIAO MENG  |
| 3 | 3          | Monday    | XIAO Meng | happy  | IS 001 | XIAO MENG  |
| 4 | 4          | Tuesday   | XIAO MENG | happy  | CS 002 | XIAO MENG  |
| 5 | 5          | Wednesday | xiao mENg | sad    | IS 002 | XIAO MENG  |
| 6 | 6          | Thursday  | xIAO MENg | sad    | IS 002 | XIAO MENG  |
```python
data['name_upper'].to_list()
```
```
['XIAO MENG',
 'XIAO MENG',
 'XIAO MENG',
 'XIAO MENG',
 'XIAO MENG',
 'XIAO MENG',
 'XIAO MENG']
```
```python
data['Cid'].str.startswith('CS')
```
```
0     True
1     True
2    False
3    False
4     True
5    False
6    False
Name: Cid, dtype: bool
```
```python
data['day'].str.replace('day', 'ABC')
```
```
0        FriABC
1      SaturABC
2        SunABC
3        MonABC
4       TuesABC
5     WednesABC
6      ThursABC
Name: day, dtype: object
```
  • The difference from the str version above: use .str.replace() to substitute substrings inside a single column's values; to replace entire values across the whole DataFrame, call replace() directly.
```python
data.replace({"happy": 1, "sad": 0})
```

|   | Unnamed: 0 | day       | Name      | Status | Cid    | name_upper |
|---|------------|-----------|-----------|--------|--------|------------|
| 0 | 0          | Friday    | xiao meng | 1      | CS 001 | XIAO MENG  |
| 1 | 1          | Saturday  | xiao meng | 1      | CS 002 | XIAO MENG  |
| 2 | 2          | Sunday    | Xiao Meng | 0      | IS 001 | XIAO MENG  |
| 3 | 3          | Monday    | XIAO Meng | 1      | IS 001 | XIAO MENG  |
| 4 | 4          | Tuesday   | XIAO MENG | 1      | CS 002 | XIAO MENG  |
| 5 | 5          | Wednesday | xiao mENg | 0      | IS 002 | XIAO MENG  |
| 6 | 6          | Thursday  | xIAO MENg | 0      | IS 002 | XIAO MENG  |
```python
data[['cid_major', 'cid_num']] = data['Cid'].str.split(" ", expand=True)
data
```
|   | Unnamed: 0 | day       | Name      | Status | Cid    | name_upper | cid_major | cid_num |
|---|------------|-----------|-----------|--------|--------|------------|-----------|---------|
| 0 | 0          | Friday    | xiao meng | happy  | CS 001 | XIAO MENG  | CS        | 001     |
| 1 | 1          | Saturday  | xiao meng | happy  | CS 002 | XIAO MENG  | CS        | 002     |
| 2 | 2          | Sunday    | Xiao Meng | sad    | IS 001 | XIAO MENG  | IS        | 001     |
| 3 | 3          | Monday    | XIAO Meng | happy  | IS 001 | XIAO MENG  | IS        | 001     |
| 4 | 4          | Tuesday   | XIAO MENG | happy  | CS 002 | XIAO MENG  | CS        | 002     |
| 5 | 5          | Wednesday | xiao mENg | sad    | IS 002 | XIAO MENG  | IS        | 002     |
| 6 | 6          | Thursday  | xIAO MENg | sad    | IS 002 | XIAO MENG  | IS        | 002     |
```python
data['day'].apply(lambda x: x + 'ADDED' if x == 'Friday' else x)
```
```
0    FridayADDED
1       Saturday
2         Sunday
3         Monday
4        Tuesday
5      Wednesday
6       Thursday
Name: day, dtype: object
```

Clever uses of lambda

  • Combining with map()

```python
map(function, iterable)
```

    Here function can be a lambda, and iterable is any iterable object such as a list or tuple.

    • map() applies function to every element of iterable and returns an iterator over the results. You usually convert it to a list (or another container) to see the results.
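As a quick illustration (the list of numbers here is invented for the example), map() with a lambda might look like this:

```python
# Square every element of a list with map() and a lambda
nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x ** 2, nums))  # materialize the iterator
print(squares)  # [1, 4, 9, 16, 25]
```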
  • reduce() applies a function cumulatively to the elements of an iterable, reducing them to a single value.

```python
# =========== Plain-function version: ===========
# 1. Adding two numbers
from functools import reduce

def add(x, y):
    return x + y

reduce(add, [1, 3, 5, 7, 9])  # sum of the list: 1+3+5+7+9
# Result: 25
```

    How the call executes: for reduce(add, [1, 3, 5, 7, 9]), reduce computes as follows:
    1. first the leading pair: f(1, 3), giving 4;
    2. then that result with the 3rd element: f(4, 5), giving 9;
    3. then with the 4th element: f(9, 7), giving 16;
    4. then with the 5th element: f(16, 9), giving 25;
    5. with no elements left, the computation stops and returns 25.

```python
# =========== Lambda version: ===========
# 2. Adding two numbers, lambda style
reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
# Result: 15
```

    Of course, summation is better served by the built-in sum(); there is no need for reduce here.

    3. But to turn the sequence [1, 3, 5, 7, 9] into the integer 13579, reduce comes in handy:

```python
from functools import reduce

def fn(x, y):
    return x * 10 + y

reduce(fn, [1, 3, 5, 7, 9])
# Result: 13579
```
  • filter()

```python
# =========== Plain-function version: ===========
# 1. Filter the odd numbers out of a list
def is_odd(n):
    return n % 2 == 1

newlist = filter(is_odd, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(list(newlist))
# Result: [1, 3, 5, 7, 9]

# =========== Lambda version: ===========
# 2. Keep the elements of [1, 2, 3] that are divisible by 3
newlist = filter(lambda x: x % 3 == 0, [1, 2, 3])
print(list(newlist))
# Result: [3]
```

pandas datetime handling

  • Converting a column to datetime type:
    • at load time with pd.read_csv()
    • with pd.to_datetime() and a format string
  • Extracting useful pieces from a datetime: date, time, year, month, day, hour, minute, second
  • Shifting dates based on a datetime column: pd.DateOffset()
  • Computing the time difference between two datetime columns
  • Time-zone conversion: tz_convert()
```python
import pandas as pd
```
  • parse_dates only recognizes standard formats; for custom formats it usually fails to convert, and you need to_datetime instead.
```python
data = pd.read_excel('date_time.xlsx', parse_dates=['v1', 'v2'])
data
```
|   | v1                  | v2                   |
|---|---------------------|----------------------|
| 0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 |
| 1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 |
| 2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 |
| 3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 |
| 4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 |
```python
data.dtypes
```
```
v1    datetime64[ns]
v2            object
dtype: object
```
  • Convert v2 to datetime
```python
pd.to_datetime(data['v2'], format="%m/%d/%yTEST%H:%M:%S")
```
```
0   2023-07-01 00:00:00
1   2023-07-01 01:05:10
2   2023-07-01 02:10:20
3   2023-07-02 03:15:30
4   2023-07-02 04:20:40
Name: v2, dtype: datetime64[ns]
```
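For a column that is already in a standard format (like v1 above), parse_dates at load time does work. A minimal sketch, using an in-memory CSV invented for this example in place of a real file:

```python
import io
import pandas as pd

# Hypothetical CSV content with a standard ISO-like timestamp column
csv = io.StringIO("v1\n2023-07-01 00:00:00\n2023-07-01 01:05:10\n")
df = pd.read_csv(csv, parse_dates=['v1'])
print(df.dtypes)  # v1 comes back as datetime64[ns], no to_datetime needed
```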
```python
data['v1_hour'] = data['v1'].dt.hour
data
```

|   | v1                  | v2                   | v1_hour |
|---|---------------------|----------------------|---------|
| 0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 | 0       |
| 1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 | 1       |
| 2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 | 2       |
| 3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 | 3       |
| 4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 | 4       |
```python
data['v1_plus_1day'] = data['v1'] + pd.DateOffset(days=1)
data
```

|   | v1                  | v2                   | v1_hour | v1_plus_1day        |
|---|---------------------|----------------------|---------|---------------------|
| 0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 | 0       | 2023-07-02 00:00:00 |
| 1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 | 1       | 2023-07-02 01:05:10 |
| 2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 | 2       | 2023-07-02 02:10:20 |
| 3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 | 3       | 2023-07-03 03:15:30 |
| 4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 | 4       | 2023-07-03 04:20:40 |
```python
data['v1_plus_1day'] - data['v1']
```
```
0   1 days
1   1 days
2   1 days
3   1 days
4   1 days
dtype: timedelta64[ns]
```
  • Converting to int: use .dt.days; note that astype(int) would give the underlying nanosecond count, not the number of days.
```python
# (data['v1_plus_1day'] - data['v1']).astype(int)
(data['v1_plus_1day'] - data['v1']).dt.days
```
```
0    1
1    1
2    1
3    1
4    1
dtype: int64
```
```python
data['v1'].dt.tz_localize('US/Central').dt.tz_convert('US/Eastern')
```
```
0   2023-07-01 01:00:00-04:00
1   2023-07-01 02:05:10-04:00
2   2023-07-01 03:10:20-04:00
3   2023-07-02 04:15:30-04:00
4   2023-07-02 05:20:40-04:00
Name: v1, dtype: datetime64[ns, US/Eastern]
```

Pandas Integrity Constraints

  • Functional dependency (key constraint)
    • data.drop_duplicates() removes violations of a functional dependency
    • data['key'].nunique() == len(data) checks whether the key constraint holds
    • data['key'].value_counts() inspects how often each key appears
    • data[data.groupby("key")['key'].transform("size") > 1] lists the rows that violate the key constraint
  • Inclusion dependency (foreign key constraint)
    • df1[df1['key'].isin(df2['key'])]
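The checks above can be sketched on a toy DataFrame (the data and the column name 'sid' are invented for illustration):

```python
import pandas as pd

# 'sid' is supposed to be a key, but the value 2 appears twice
df = pd.DataFrame({'sid': [1, 2, 2, 3], 'name': ['a', 'b', 'b', 'c']})

# Key-constraint check: is every sid unique?
print(df['sid'].nunique() == len(df))  # False

# Show only the violating rows
dupes = df[df.groupby('sid')['sid'].transform('size') > 1]
print(dupes)

# Inclusion-dependency check: keep rows whose sid also exists in a reference table
ref = pd.DataFrame({'sid': [1, 2]})
print(df[df['sid'].isin(ref['sid'])])
```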
This post is licensed under CC BY 4.0 by the author.