pandas 字符串处理 常用函数
- 大小写转换 lower() upper()
- 去除首尾空格 strip()
- 以特定字符串开始 startwith()
- 替换特定数据 replace()
- 将某列拆成多列 split()
- 个性化处理数据 lambda 函数
1
2
3
| data = pd.read_excel('demo-data.xlsx')
data
|
| Unnamed: 0 | day | Name | Status | Cid |
---|
0 | 0 | Friday | xiao meng | happy | CS 001 |
---|
1 | 1 | Saturday | xiao meng | happy | CS 002 |
---|
2 | 2 | Sunday | Xiao Meng | sad | IS 001 |
---|
3 | 3 | Monday | XIAO Meng | happy | IS 001 |
---|
4 | 4 | Tuesday | XIAO MENG | happy | CS 002 |
---|
5 | 5 | Wednesday | xiao mENg | sad | IS 002 |
---|
6 | 6 | Thursday | xIAO MENg | sad | IS 002 |
---|
1
| data['name_upper'] = data['Name'].str.upper()
|
| Unnamed: 0 | day | Name | Status | Cid | name_upper |
---|
0 | 0 | Friday | xiao meng | happy | CS 001 | XIAO MENG |
---|
1 | 1 | Saturday | xiao meng | happy | CS 002 | XIAO MENG |
---|
2 | 2 | Sunday | Xiao Meng | sad | IS 001 | XIAO MENG |
---|
3 | 3 | Monday | XIAO Meng | happy | IS 001 | XIAO MENG |
---|
4 | 4 | Tuesday | XIAO MENG | happy | CS 002 | XIAO MENG |
---|
5 | 5 | Wednesday | xiao mENg | sad | IS 002 | XIAO MENG |
---|
6 | 6 | Thursday | xIAO MENg | sad | IS 002 | XIAO MENG |
---|
1
| data['name_upper'].to_list()
|
1
2
3
4
5
6
7
| ['XIAO MENG',
'XIAO MENG',
'XIAO MENG',
'XIAO MENG',
'XIAO MENG',
'XIAO MENG',
'XIAO MENG']
|
1
| data['Cid'].str.startswith('CS')
|
1
2
3
4
5
6
7
8
| 0 True
1 True
2 False
3 False
4 True
5 False
6 False
Name: Cid, dtype: bool
|
1
| data['day'].str.replace('day','ABC')
|
1
2
3
4
5
6
7
8
| 0 FriABC
1 SaturABC
2 SunABC
3 MonABC
4 TuesABC
5 WednesABC
6 ThursABC
Name: day, dtype: object
|
- 与上面的str的区别是 上面替换某一column内部具体的值,使用str。替换整体直接replace
1
2
| data.replace({"happy":1,"sad":0})
|
| Unnamed: 0 | day | Name | Status | Cid | name_upper |
---|
0 | 0 | Friday | xiao meng | 1 | CS 001 | XIAO MENG |
---|
1 | 1 | Saturday | xiao meng | 1 | CS 002 | XIAO MENG |
---|
2 | 2 | Sunday | Xiao Meng | 0 | IS 001 | XIAO MENG |
---|
3 | 3 | Monday | XIAO Meng | 1 | IS 001 | XIAO MENG |
---|
4 | 4 | Tuesday | XIAO MENG | 1 | CS 002 | XIAO MENG |
---|
5 | 5 | Wednesday | xiao mENg | 0 | IS 002 | XIAO MENG |
---|
6 | 6 | Thursday | xIAO MENg | 0 | IS 002 | XIAO MENG |
---|
1
| data[['cid_major','cid_num']] = data['Cid'].str.split(" ",expand=True)
|
| Unnamed: 0 | day | Name | Status | Cid | name_upper | cid_major | cid_num |
---|
0 | 0 | Friday | xiao meng | happy | CS 001 | XIAO MENG | CS | 001 |
---|
1 | 1 | Saturday | xiao meng | happy | CS 002 | XIAO MENG | CS | 002 |
---|
2 | 2 | Sunday | Xiao Meng | sad | IS 001 | XIAO MENG | IS | 001 |
---|
3 | 3 | Monday | XIAO Meng | happy | IS 001 | XIAO MENG | IS | 001 |
---|
4 | 4 | Tuesday | XIAO MENG | happy | CS 002 | XIAO MENG | CS | 002 |
---|
5 | 5 | Wednesday | xiao mENg | sad | IS 002 | XIAO MENG | IS | 002 |
---|
6 | 6 | Thursday | xIAO MENg | sad | IS 002 | XIAO MENG | IS | 002 |
---|
1
| data['day'].apply(lambda x: x+'ADDED' if x =='Friday' else x)
|
1
2
3
4
5
6
7
8
| 0 FridayADDED
1 Saturday
2 Sunday
3 Monday
4 Tuesday
5 Wednesday
6 Thursday
Name: day, dtype: object
|
lambda 妙用
reduce(add, [1, 3, 5, 7, 9]) # 计算列表元素和:1+3+5+7+9
结果:
25
””” ===========执行步骤解析:=========== 调用 reduce(add, [1, 3, 5, 7, 9])时,reduce函数将做如下计算: 1 先计算头两个元素:f(1, 3),结果为4; 2 再把结果和第3个元素计算:f(4, 5),结果为9; 3 再把结果和第4个元素计算:f(9, 7),结果为16; 4 再把结果和第5个元素计算:f(16, 9),结果为25; 5 由于没有更多的元素了,计算结束,返回结果25。 “””
===========匿名函数写法:===========
2、两数相加,lambda 写法
reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
结果:
15
当然求和运算可以直接用Python内建函数sum(),没必要动用reduce。
3、但是如果要把序列 [1, 3, 5, 7, 9] 变换成整数 13579,reduce就可以派上用场:
from functools import reduce
def fn(x, y): return x * 10 + y
reduce(fn, [1, 3, 5, 7, 9])
结果:
13579
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| * filter 函数
```python
# ===========一般用法:===========
# 1、过滤出列表中的所有奇数
def is_odd(n):
return n % 2 == 1
newlist = filter(is_odd, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(list(newlist))
# 结果: [1, 3, 5, 7, 9]
# ===========匿名函数用法:===========
# 2、将列表[1, 2, 3]中能够被3整除的元素过滤出来
newlist = filter(lambda x: x % 3 == 0, [1, 2, 3])
print(list(newlist))
# 结果: [3]
|
pandas 时间 datetime
- 将某一列转为datetime类型
- pd.read_csv() 时候设置
- pd.to_datetime() 时候通过format
- 从datetime中获取有用信息 date, time, year, month, day, hour, minute, second
- 基于datetime列修改日期: pd.DateOffset()
- 获得两列datetime的时间差
- 时区转化 tz_convert()
- parse_dates 只能转化标准的格式 一般情况下都不可以成功,需要使用to_datetime
1
2
| data = pd.read_excel('date_time.xlsx',parse_dates=['v1','v2'])
data
|
| v1 | v2 |
---|
0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 |
---|
1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 |
---|
2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 |
---|
3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 |
---|
4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 |
---|
1
2
3
| v1 datetime64[ns]
v2 object
dtype: object
|
1
| pd.to_datetime(data['v2'],format="%m/%d/%yTEST%H:%M:%S")
|
1
2
3
4
5
6
| 0 2023-07-01 00:00:00
1 2023-07-01 01:05:10
2 2023-07-01 02:10:20
3 2023-07-02 03:15:30
4 2023-07-02 04:20:40
Name: v2, dtype: datetime64[ns]
|
1
2
| data['v1_hour'] = data['v1'].dt.hour
data
|
| v1 | v2 | v1_hour |
---|
0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 | 0 |
---|
1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 | 1 |
---|
2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 | 2 |
---|
3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 | 3 |
---|
4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 | 4 |
---|
1
2
| data['v1_plus_1day'] = data['v1'] + pd.DateOffset(days=1)
data
|
| v1 | v2 | v1_hour | v1_plus_1day |
---|
0 | 2023-07-01 00:00:00 | 07/01/23TEST00:00:00 | 0 | 2023-07-02 00:00:00 |
---|
1 | 2023-07-01 01:05:10 | 07/01/23TEST01:05:10 | 1 | 2023-07-02 01:05:10 |
---|
2 | 2023-07-01 02:10:20 | 07/01/23TEST02:10:20 | 2 | 2023-07-02 02:10:20 |
---|
3 | 2023-07-02 03:15:30 | 07/02/23TEST03:15:30 | 3 | 2023-07-03 03:15:30 |
---|
4 | 2023-07-02 04:20:40 | 07/02/23TEST04:20:40 | 4 | 2023-07-03 04:20:40 |
---|
1
| data['v1_plus_1day'] - data['v1']
|
1
2
3
4
5
6
| 0 1 days
1 1 days
2 1 days
3 1 days
4 1 days
dtype: timedelta64[ns]
|
- 转为int类型,ps astype会将days看为ASCII并且加进来
1
2
3
| # (data['v1_plus_1day'] - data['v1']).astype(int)
(data['v1_plus_1day'] - data['v1']).dt.days
|
1
2
3
4
5
6
| 0 1
1 1
2 1
3 1
4 1
dtype: int64
|
1
| data['v1'].dt.tz_localize('US/Central').dt.tz_convert('US/Eastern')
|
1
2
3
4
5
6
| 0 2023-07-01 01:00:00-04:00
1 2023-07-01 02:05:10-04:00
2 2023-07-01 03:10:20-04:00
3 2023-07-02 04:15:30-04:00
4 2023-07-02 05:20:40-04:00
Name: v1, dtype: datetime64[ns, US/Eastern]
|
Pandas 完整性约束 Integrity Constraints
- 函数依赖(funcitonal dependency, key constraint)
- data.drop_duplicates() 消除函数依赖
- data[‘key’].nunique() == len(data) 检查是否符合函数依赖
- data[‘key’].value_counts() 检查是否符合函数依赖
- data[data.groupby(“key”)[‘key’].transform(“size”)>1] 检查是否符合函数依赖
- 包含依赖(inclusion dependency, foreign key constraint)
- df1[df1[‘key’].isin(df2[‘key’])]