DataFrame 是表示型的数据结构,每一行就是一个记录,每一列是一个变量。简单说,DataFrame 是共享同一个 index 的 Series 集合。
创建 DataFrame
纯手工操作1
要创建,必须先创建索引,而且DataFrame 必须有索引。
import pandas as pd
import numpy as np
dates = pd.date_range('2011-1-1', periods=6, freq='D')
dates
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06'],
dtype='datetime64[ns]', freq='D')
用这个创建好的 dates 作为索引,创建 DataFrame:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
纯手工操作2
df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]},
index=pd.date_range('20110101', periods=3))
df1
|
A |
B |
C |
| 2011-01-01 |
1 |
4 |
7 |
| 2011-01-02 |
2 |
5 |
8 |
| 2011-01-03 |
3 |
6 |
9 |
从外部文件或者数据库导入
一般使用 DataFrame不会手工创建的, 都是导入的, 不过导入很简单了, 比如从 csv 导入:
df = pandas.read_csv('csvfile.txt')
查看和操作
查看列名, 索引, 值和统计量
Index(['A', 'B', 'C', 'D'], dtype='object')
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06'],
dtype='datetime64[ns]', freq='D')
array([[-0.66012233, -1.3129909 , 0.84327428, 3.96020219],
[ 0.48393364, 0.39004968, 1.38578819, -0.38039762],
[-0.85551546, -0.57764636, 0.90005938, 0.13927364],
[-0.13578653, 1.03186187, 1.47734054, -1.11916714],
[-0.99127477, -0.56881698, -0.64527863, 0.31228982],
[ 2.43062939, 0.07037805, 0.3527228 , 0.16526828]])
|
A |
B |
C |
D |
| count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
| mean |
-0.092473 |
0.280178 |
-0.684536 |
0.297170 |
| std |
0.536510 |
0.888420 |
1.062205 |
1.049473 |
| min |
-0.720014 |
-1.442339 |
-2.514868 |
-0.704865 |
| 25% |
-0.569908 |
0.270064 |
-1.048116 |
-0.523770 |
| 50% |
0.002167 |
0.577708 |
-0.480265 |
0.183566 |
| 75% |
0.378883 |
0.718902 |
0.081548 |
0.634094 |
| max |
0.416989 |
1.034505 |
0.336762 |
2.096034 |
行列操作
行
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
列
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
|
A |
C |
| 2011-01-01 |
0.416989 |
-2.514868 |
| 2011-01-02 |
0.334617 |
-0.536929 |
| 2011-01-03 |
0.393638 |
0.249931 |
| 2011-01-04 |
-0.720014 |
-1.218512 |
| 2011-01-05 |
-0.330284 |
-0.423601 |
| 2011-01-06 |
-0.649782 |
0.336762 |
条件提取
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
df[df['A'] > 0][['A', 'C']]
|
A |
C |
| 2011-01-01 |
0.416989 |
-2.514868 |
| 2011-01-02 |
0.334617 |
-0.536929 |
| 2011-01-03 |
0.393638 |
0.249931 |
不能这样同时提取行列
df[0:1, 'A']
使用 loc() 方法 – 开始变得像 matlab 的操作1
通过行列的标签名提取
df.loc[:, 'A'] # 所有行, A 列
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
df.loc[dates[0:2], 'A':'C']
|
A |
B |
C |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
这样还是不行:
df.loc[0:2, 'A':'C']
使用 iloc() – 开始变得像 matlab 的操作2
通过行列位置提取(index locate)
A -0.855515
B -0.577646
C 0.900059
D 0.139274
Name: 2011-01-03 00:00:00, dtype: float64
2011-01-01 0.843274
2011-01-02 1.385788
2011-01-03 0.900059
2011-01-04 1.477341
2011-01-05 -0.645279
2011-01-06 0.352723
Freq: D, Name: C, dtype: float64
df.iloc[[1,3], [2,3]] # 提取某几行某几列
|
C |
D |
| 2011-01-02 |
1.385788 |
-0.380398 |
| 2011-01-04 |
1.477341 |
-1.119167 |
|
C |
| 2011-01-02 |
1.385788 |
| 2011-01-03 |
0.900059 |
混合使用标签和位置索引, ix() – 不推荐使用
|
A |
B |
C |
D |
| 2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
| 2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
| 2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-02 1.385788
2011-01-04 1.477341
Freq: 2D, Name: C, dtype: float64
2011-01-02 1.385788
2011-01-03 0.900059
Freq: D, Name: C, dtype: float64
|
B |
C |
| 2011-01-02 |
0.390050 |
1.385788 |
| 2011-01-03 |
-0.577646 |
0.900059 |
DataFrame 操作
转置
|
2011-01-01 00:00:00 |
2011-01-02 00:00:00 |
2011-01-03 00:00:00 |
2011-01-04 00:00:00 |
2011-01-05 00:00:00 |
2011-01-06 00:00:00 |
| A |
-0.660122 |
0.483934 |
-0.855515 |
-0.135787 |
-0.991275 |
2.430629 |
| B |
-1.312991 |
0.390050 |
-0.577646 |
1.031862 |
-0.568817 |
0.070378 |
| C |
0.843274 |
1.385788 |
0.900059 |
1.477341 |
-0.645279 |
0.352723 |
| D |
3.960202 |
-0.380398 |
0.139274 |
-1.119167 |
0.312290 |
0.165268 |
排序
df.sort_index(ascending=False)
|
A |
B |
C |
D |
| 2011-01-06 |
2.430629 |
0.070378 |
0.352723 |
0.165268 |
| 2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
| 2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
| 2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
| 2011-01-02 |
0.483934 |
0.390050 |
1.385788 |
-0.380398 |
| 2011-01-01 |
-0.660122 |
-1.312991 |
0.843274 |
3.960202 |
df.sort_index(axis=1, ascending=False) # 注意排序方向
|
D |
C |
B |
A |
| 2011-01-01 |
3.960202 |
0.843274 |
-1.312991 |
-0.660122 |
| 2011-01-02 |
-0.380398 |
1.385788 |
0.390050 |
0.483934 |
| 2011-01-03 |
0.139274 |
0.900059 |
-0.577646 |
-0.855515 |
| 2011-01-04 |
-1.119167 |
1.477341 |
1.031862 |
-0.135787 |
| 2011-01-05 |
0.312290 |
-0.645279 |
-0.568817 |
-0.991275 |
| 2011-01-06 |
0.165268 |
0.352723 |
0.070378 |
2.430629 |
脑补 sort_index() 为 sort_values(by=index)
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
排名
rank() 返回相应的行或者列排序的排名
df.rank(axis=0) # 列方向升序排名
|
A |
B |
C |
D |
| 2011-01-01 |
6.0 |
5.0 |
1.0 |
3.0 |
| 2011-01-02 |
4.0 |
4.0 |
3.0 |
6.0 |
| 2011-01-03 |
5.0 |
6.0 |
5.0 |
4.0 |
| 2011-01-04 |
1.0 |
2.0 |
2.0 |
1.0 |
| 2011-01-05 |
3.0 |
1.0 |
4.0 |
2.0 |
| 2011-01-06 |
2.0 |
3.0 |
6.0 |
5.0 |
df.rank(axis=0).astype(int) # 使用 astype 将数据类型变为 int
|
A |
B |
C |
D |
| 2011-01-01 |
6 |
5 |
1 |
3 |
| 2011-01-02 |
4 |
4 |
3 |
6 |
| 2011-01-03 |
5 |
6 |
5 |
4 |
| 2011-01-04 |
1 |
2 |
2 |
1 |
| 2011-01-05 |
3 |
1 |
4 |
2 |
| 2011-01-06 |
2 |
3 |
6 |
5 |
df.rank(axis=1, ascending=False) # 行方向降序排名
|
A |
B |
C |
D |
| 2011-01-01 |
2.0 |
1.0 |
4.0 |
3.0 |
| 2011-01-02 |
3.0 |
2.0 |
4.0 |
1.0 |
| 2011-01-03 |
2.0 |
1.0 |
4.0 |
3.0 |
| 2011-01-04 |
3.0 |
1.0 |
4.0 |
2.0 |
| 2011-01-05 |
1.0 |
4.0 |
2.0 |
3.0 |
| 2011-01-06 |
4.0 |
2.0 |
3.0 |
1.0 |
增加行或者列
s1 = pd.Series([1,2,3,4,5], index=pd.date_range('20110102', periods=5))
s1
2011-01-02 1
2011-01-03 2
2011-01-04 3
2011-01-05 4
2011-01-06 5
Freq: D, dtype: int64
|
A |
B |
C |
D |
E |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
NaN |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
1.0 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2.0 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
3.0 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
4.0 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
5.0 |
在 df 新增了一列,而且会根据 index 自动对齐
# df = df.loc[:, 'A':'D']
df = df[list('ABCD')] # 两种方法都可以
df
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
pd.concat([df, df]) # 上下摞起来
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
pd.concat([df, df], axis=1) # 左右连起来
|
A |
B |
C |
D |
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]},
index=pd.date_range('20110101', periods=3))
df.append(df1) # 摞起来
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
| 2011-01-01 |
1.000000 |
4.000000 |
7.000000 |
NaN |
| 2011-01-02 |
2.000000 |
5.000000 |
8.000000 |
NaN |
| 2011-01-03 |
3.000000 |
6.000000 |
9.000000 |
NaN |
pd.concat([df, df1], join='inner')
|
A |
B |
C |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
| 2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
| 2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
| 2011-01-01 |
1.000000 |
4.000000 |
7.000000 |
| 2011-01-02 |
2.000000 |
5.000000 |
8.000000 |
| 2011-01-03 |
3.000000 |
6.000000 |
9.000000 |
删除
drop() 返回新的 DataFrame, del 直接在原 DataFrame 上操作
|
A |
B |
C |
D |
| 2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
|
B |
C |
D |
| 2011-01-01 |
0.724751 |
-2.514868 |
0.005752 |
| 2011-01-02 |
0.701355 |
-0.536929 |
2.096034 |
| 2011-01-03 |
1.034505 |
0.249931 |
0.361379 |
| 2011-01-04 |
0.208732 |
-1.218512 |
-0.704865 |
| 2011-01-05 |
-1.442339 |
-0.423601 |
-0.700278 |
| 2011-01-06 |
0.454061 |
0.336762 |
0.724999 |
|
A |
B |
C |
D |
| 2011-01-01 |
-0.660122 |
-1.312991 |
0.843274 |
3.960202 |
| 2011-01-02 |
0.483934 |
0.390050 |
1.385788 |
-0.380398 |
| 2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
| 2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
| 2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
| 2011-01-06 |
2.430629 |
0.070378 |
0.352723 |
0.165268 |
|
B |
C |
D |
| 2011-01-01 |
-1.312991 |
0.843274 |
3.960202 |
| 2011-01-02 |
0.390050 |
1.385788 |
-0.380398 |
| 2011-01-03 |
-0.577646 |
0.900059 |
0.139274 |
| 2011-01-04 |
1.031862 |
1.477341 |
-1.119167 |
| 2011-01-05 |
-0.568817 |
-0.645279 |
0.312290 |
| 2011-01-06 |
0.070378 |
0.352723 |
0.165268 |
替换
df.loc[dates[2], 'C'] = 0
df
|
B |
C |
D |
| 2011-01-01 |
-1.312991 |
0.843274 |
3.960202 |
| 2011-01-02 |
0.390050 |
1.385788 |
-0.380398 |
| 2011-01-03 |
-0.577646 |
0.000000 |
0.139274 |
| 2011-01-04 |
1.031862 |
1.477341 |
-1.119167 |
| 2011-01-05 |
-0.568817 |
-0.645279 |
0.312290 |
| 2011-01-06 |
0.070378 |
0.352723 |
0.165268 |
重置索引
new_index = pd.date_range('20110101', periods=7)
df.reindex(new_index, columns=list('ABCDE'))
|
A |
B |
C |
D |
E |
| 2011-01-01 |
0.558092 |
-0.044851 |
0.627649 |
1.376806 |
NaN |
| 2011-01-02 |
2.370370 |
0.309665 |
1.534129 |
0.477640 |
NaN |
| 2011-01-03 |
-0.918037 |
1.442996 |
0.718332 |
-0.764125 |
NaN |
| 2011-01-04 |
1.232411 |
1.239217 |
1.380759 |
-1.552525 |
NaN |
| 2011-01-05 |
-1.157906 |
0.486401 |
-0.015537 |
0.016348 |
NaN |
| 2011-01-06 |
2.025508 |
1.072081 |
-2.053920 |
-0.710797 |
NaN |
| 2011-01-07 |
NaN |
NaN |
NaN |
NaN |
NaN |
new_index = pd.date_range('20140101', periods=7)
df.reindex(new_index, columns=list('ABCD')) # 如果不存在, 那么用 NaN
|
A |
B |
C |
D |
| 2014-01-01 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-02 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-03 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-04 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-05 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-06 |
NaN |
NaN |
NaN |
NaN |
| 2014-01-07 |
NaN |
NaN |
NaN |
NaN |
轴
参考排序部分
DataFrame, Series 的运算
Series之间的运算
首先根据 index 匹配,无法匹配的话使用 NaN。
s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([4, 5, 6], index=list('BCD'))
A NaN
B 6.0
C 8.0
D NaN
dtype: float64
DataFrame 与 Series
Series 的 index 与 DataFrame 的 column 匹配。无法匹配的话使用 NaN。
df1 = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
index=list('abc'), columns=list('ABCD'))
df1
|
A |
B |
C |
D |
| a |
1 |
2 |
3 |
4 |
| b |
5 |
6 |
7 |
8 |
| c |
9 |
10 |
11 |
12 |
|
A |
B |
C |
D |
| a |
0.0 |
0.0 |
0.0 |
NaN |
| b |
4.0 |
4.0 |
4.0 |
NaN |
| c |
8.0 |
8.0 |
8.0 |
NaN |
DataFrame 之间
根据 index 和 column 匹配。无法匹配的话使用 NaN。
df2 = pd.DataFrame(np.arange(1, 13).reshape(4,3),
index=list('bcde'), columns=list('CDE'))
df2
|
C |
D |
E |
| b |
1 |
2 |
3 |
| c |
4 |
5 |
6 |
| d |
7 |
8 |
9 |
| e |
10 |
11 |
12 |
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
NaN |
NaN |
NaN |
| e |
NaN |
NaN |
NaN |
NaN |
NaN |
函数和映射
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
|
A |
B |
C |
D |
| 2011-01-01 |
0.959872 |
0.036688 |
-0.495552 |
0.350378 |
| 2011-01-02 |
0.118525 |
0.093901 |
1.964353 |
0.773641 |
| 2011-01-03 |
0.248025 |
-0.947414 |
1.420396 |
0.591633 |
| 2011-01-04 |
0.123047 |
0.702724 |
-2.514727 |
0.297213 |
| 2011-01-05 |
0.265320 |
-0.126869 |
0.628532 |
0.244290 |
| 2011-01-06 |
2.158399 |
1.175738 |
1.477467 |
0.553494 |
A 1.464420
B 0.836578
C 1.301486
D 1.216124
dtype: float64
2011-01-01 0.653242
2011-01-02 0.836578
2011-01-03 1.301486
2011-01-04 1.216124
2011-01-05 1.464420
2011-01-06 0.467605
Freq: D, dtype: float64
或者使用自定义函数:
f = lambda x: x.max() - x.min()
df.apply(f)
A 3.148747
B 2.923576
C 2.243189
D 2.499079
dtype: float64
数据规整化
判断时候为缺失值 NaN
|
A |
B |
C |
D |
E |
| a |
True |
True |
True |
True |
True |
| b |
True |
True |
False |
False |
True |
| c |
True |
True |
False |
False |
True |
| d |
True |
True |
True |
True |
True |
| e |
True |
True |
True |
True |
True |
|
A |
B |
C |
D |
E |
| a |
False |
False |
False |
False |
False |
| b |
False |
False |
True |
True |
False |
| c |
False |
False |
True |
True |
False |
| d |
False |
False |
False |
False |
False |
| e |
False |
False |
False |
False |
False |
选出不是缺失值的数据
b 8.0
c 15.0
Name: C, dtype: float64
缺失值的填充
以指定的数字填充
|
A |
B |
C |
D |
E |
| a |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
| b |
0.0 |
0.0 |
8.0 |
10.0 |
0.0 |
| c |
0.0 |
0.0 |
15.0 |
17.0 |
0.0 |
| d |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
| e |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
以列方向上前一个值填充
df3.fillna(method='ffill')
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| e |
NaN |
NaN |
15.0 |
17.0 |
NaN |
以列方向上后一个值填充
df3.fillna(method='bfill')
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
NaN |
NaN |
NaN |
| e |
NaN |
NaN |
NaN |
NaN |
NaN |
行方向
df3.fillna(method='bfill', axis=1)
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
8.0 |
8.0 |
8.0 |
10.0 |
NaN |
| c |
15.0 |
15.0 |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
NaN |
NaN |
NaN |
| e |
NaN |
NaN |
NaN |
NaN |
NaN |
限定个数
df3.fillna(method='pad', limit=1)
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| e |
NaN |
NaN |
NaN |
NaN |
NaN |
删除缺失值
df3.dropna(thresh=2) # 需要多少非 NaN 数据
|
A |
B |
C |
D |
E |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
|
A |
B |
C |
D |
E |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
删除重复数据
df4 = df3.fillna(method='pad')
df4
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
| e |
NaN |
NaN |
15.0 |
17.0 |
NaN |
a False
b False
c False
d True
e True
dtype: bool
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
df4.duplicated(['C']) # 仅仅对某列操作
a False
b False
c False
d True
e True
dtype: bool
df4.drop_duplicates(['C'])
|
A |
B |
C |
D |
E |
| a |
NaN |
NaN |
NaN |
NaN |
NaN |
| b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
| c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
使用 pandas 映射范畴数据
MAPPING CATEGORICAL DATA IN PANDAS
Python 不像 R可以使用 factor 代表范畴数据. 我们可以使用 astype('category') 将范畴数据转变为数字
Nominal Categories
无序
import pandas as pd
df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})
print(df)
df1 = df.vertebrates.astype("category").cat.codes # no change for the original df
df1
vertebrates
0 Bird
1 Bird
2 Mammal
3 Fish
4 Amphibian
5 Reptile
6 Mammal
0 1
1 1
2 3
3 2
4 0
5 4
6 3
dtype: int8
指定种类
也可以指定要要转换那些种类, 这样你就很清楚都有什么种类. 任何没有指定的种类都会被标为 1.
vertebrate_types = ['Reptile', 'Bird', 'Amphibian', 'Fish']
df.vertebrates.astype("category", categories=vertebrate_types).cat.codes
0 1
1 1
2 -1
3 3
4 2
5 0
6 -1
dtype: int8
dummes
但是上面这些种类其实和数字是没有关系的, 也就是没有什么大小关系. 那么我们可以将它们转换为 dummes.
However, there is no inherent relationship between these categories so it doesn’t necessary make sense to store these as different numbers on the same scale.
pd.get_dummies(df, columns=['vertebrates'])
|
vertebrates_Amphibian |
vertebrates_Bird |
vertebrates_Fish |
vertebrates_Mammal |
vertebrates_Reptile |
| 0 |
0 |
1 |
0 |
0 |
0 |
| 1 |
0 |
1 |
0 |
0 |
0 |
| 2 |
0 |
0 |
0 |
1 |
0 |
| 3 |
0 |
0 |
1 |
0 |
0 |
| 4 |
1 |
0 |
0 |
0 |
0 |
| 5 |
0 |
0 |
0 |
0 |
1 |
| 6 |
0 |
0 |
0 |
1 |
0 |
为了防止 dummy variable trap, 使用 drop_first=True.
pd.get_dummies(df, columns=['vertebrates'], drop_first=True)
|
vertebrates_Bird |
vertebrates_Fish |
vertebrates_Mammal |
vertebrates_Reptile |
| 0 |
1 |
0 |
0 |
0 |
| 1 |
1 |
0 |
0 |
0 |
| 2 |
0 |
0 |
1 |
0 |
| 3 |
0 |
1 |
0 |
0 |
| 4 |
0 |
0 |
0 |
0 |
| 5 |
0 |
0 |
0 |
1 |
| 6 |
0 |
0 |
1 |
0 |
Ordinal Categories
还有很多范畴数据是有序的, 比如成绩的优良中差, 满意度等. 使用 ordered = True.
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})
可以直接输出文本, 没有的种类为 NaN.
df.satisfaction.astype("category",
ordered=True,
categories=ordered_satisfaction
)
0 NaN
1 Happy
2 Unhappy
3 Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]
也可以使用数字, 那么没有的就是 -1.
df.satisfaction.astype("category",
ordered=True,
categories=ordered_satisfaction
).cat.codes
0 -1
1 3
2 1
3 2
dtype: int8