DataFrame 是表示型的数据结构,每一行就是一个记录,每一列是一个变量。简单说,DataFrame 是共享同一个 index 的 Series 集合。
创建 DataFrame
纯手工操作1
要创建,必须先创建索引,而且DataFrame 必须有索引。
import pandas as pd
import numpy as np
dates = pd.date_range('2011-1-1', periods=6, freq='D')
dates
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06'],
dtype='datetime64[ns]', freq='D')
用这个创建好的 dates 作为索引,创建 DataFrame:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
纯手工操作2
df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]},
index=pd.date_range('20110101', periods=3))
df1
|
A |
B |
C |
2011-01-01 |
1 |
4 |
7 |
2011-01-02 |
2 |
5 |
8 |
2011-01-03 |
3 |
6 |
9 |
从外部文件或者数据库导入
一般使用 DataFrame
不会手工创建的, 都是导入的, 不过导入很简单了, 比如从 csv 导入:
df = pandas.read_csv('csvfile.txt')
查看和操作
查看列名, 索引, 值和统计量
Index(['A', 'B', 'C', 'D'], dtype='object')
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06'],
dtype='datetime64[ns]', freq='D')
array([[-0.66012233, -1.3129909 , 0.84327428, 3.96020219],
[ 0.48393364, 0.39004968, 1.38578819, -0.38039762],
[-0.85551546, -0.57764636, 0.90005938, 0.13927364],
[-0.13578653, 1.03186187, 1.47734054, -1.11916714],
[-0.99127477, -0.56881698, -0.64527863, 0.31228982],
[ 2.43062939, 0.07037805, 0.3527228 , 0.16526828]])
|
A |
B |
C |
D |
count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
-0.092473 |
0.280178 |
-0.684536 |
0.297170 |
std |
0.536510 |
0.888420 |
1.062205 |
1.049473 |
min |
-0.720014 |
-1.442339 |
-2.514868 |
-0.704865 |
25% |
-0.569908 |
0.270064 |
-1.048116 |
-0.523770 |
50% |
0.002167 |
0.577708 |
-0.480265 |
0.183566 |
75% |
0.378883 |
0.718902 |
0.081548 |
0.634094 |
max |
0.416989 |
1.034505 |
0.336762 |
2.096034 |
行列操作
行
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
列
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
|
A |
C |
2011-01-01 |
0.416989 |
-2.514868 |
2011-01-02 |
0.334617 |
-0.536929 |
2011-01-03 |
0.393638 |
0.249931 |
2011-01-04 |
-0.720014 |
-1.218512 |
2011-01-05 |
-0.330284 |
-0.423601 |
2011-01-06 |
-0.649782 |
0.336762 |
条件提取
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
df[df['A'] > 0][['A', 'C']]
|
A |
C |
2011-01-01 |
0.416989 |
-2.514868 |
2011-01-02 |
0.334617 |
-0.536929 |
2011-01-03 |
0.393638 |
0.249931 |
不能这样同时提取行列
df[0:1, 'A']
使用 loc() 方法 – 开始变得像 matlab 的操作1
通过行列的标签名提取
df.loc[:, 'A'] # 所有行, A 列
2011-01-01 0.416989
2011-01-02 0.334617
2011-01-03 0.393638
2011-01-04 -0.720014
2011-01-05 -0.330284
2011-01-06 -0.649782
Freq: D, Name: A, dtype: float64
df.loc[dates[0:2], 'A':'C']
|
A |
B |
C |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
这样还是不行:
df.loc[0:2, 'A':'C']
使用 iloc() – 开始变得像 matlab 的操作2
通过行列位置提取(index locate)
A -0.855515
B -0.577646
C 0.900059
D 0.139274
Name: 2011-01-03 00:00:00, dtype: float64
2011-01-01 0.843274
2011-01-02 1.385788
2011-01-03 0.900059
2011-01-04 1.477341
2011-01-05 -0.645279
2011-01-06 0.352723
Freq: D, Name: C, dtype: float64
df.iloc[[1,3], [2,3]] # 提取某几行某几列
|
C |
D |
2011-01-02 |
1.385788 |
-0.380398 |
2011-01-04 |
1.477341 |
-1.119167 |
|
C |
2011-01-02 |
1.385788 |
2011-01-03 |
0.900059 |
混合使用标签和位置索引, ix() – 不推荐使用
|
A |
B |
C |
D |
2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-02 1.385788
2011-01-04 1.477341
Freq: 2D, Name: C, dtype: float64
2011-01-02 1.385788
2011-01-03 0.900059
Freq: D, Name: C, dtype: float64
|
B |
C |
2011-01-02 |
0.390050 |
1.385788 |
2011-01-03 |
-0.577646 |
0.900059 |
DataFrame 操作
转置
|
2011-01-01 00:00:00 |
2011-01-02 00:00:00 |
2011-01-03 00:00:00 |
2011-01-04 00:00:00 |
2011-01-05 00:00:00 |
2011-01-06 00:00:00 |
A |
-0.660122 |
0.483934 |
-0.855515 |
-0.135787 |
-0.991275 |
2.430629 |
B |
-1.312991 |
0.390050 |
-0.577646 |
1.031862 |
-0.568817 |
0.070378 |
C |
0.843274 |
1.385788 |
0.900059 |
1.477341 |
-0.645279 |
0.352723 |
D |
3.960202 |
-0.380398 |
0.139274 |
-1.119167 |
0.312290 |
0.165268 |
排序
df.sort_index(ascending=False)
|
A |
B |
C |
D |
2011-01-06 |
2.430629 |
0.070378 |
0.352723 |
0.165268 |
2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
2011-01-02 |
0.483934 |
0.390050 |
1.385788 |
-0.380398 |
2011-01-01 |
-0.660122 |
-1.312991 |
0.843274 |
3.960202 |
df.sort_index(axis=1, ascending=False) # 注意排序方向
|
D |
C |
B |
A |
2011-01-01 |
3.960202 |
0.843274 |
-1.312991 |
-0.660122 |
2011-01-02 |
-0.380398 |
1.385788 |
0.390050 |
0.483934 |
2011-01-03 |
0.139274 |
0.900059 |
-0.577646 |
-0.855515 |
2011-01-04 |
-1.119167 |
1.477341 |
1.031862 |
-0.135787 |
2011-01-05 |
0.312290 |
-0.645279 |
-0.568817 |
-0.991275 |
2011-01-06 |
0.165268 |
0.352723 |
0.070378 |
2.430629 |
脑补 sort_index()
为 sort_values(by=index)
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
排名
rank() 返回相应的行或者列排序的排名
df.rank(axis=0) # 列方向升序排名
|
A |
B |
C |
D |
2011-01-01 |
6.0 |
5.0 |
1.0 |
3.0 |
2011-01-02 |
4.0 |
4.0 |
3.0 |
6.0 |
2011-01-03 |
5.0 |
6.0 |
5.0 |
4.0 |
2011-01-04 |
1.0 |
2.0 |
2.0 |
1.0 |
2011-01-05 |
3.0 |
1.0 |
4.0 |
2.0 |
2011-01-06 |
2.0 |
3.0 |
6.0 |
5.0 |
df.rank(axis=0).astype(int) # 使用 astype 将数据类型变为 int
|
A |
B |
C |
D |
2011-01-01 |
6 |
5 |
1 |
3 |
2011-01-02 |
4 |
4 |
3 |
6 |
2011-01-03 |
5 |
6 |
5 |
4 |
2011-01-04 |
1 |
2 |
2 |
1 |
2011-01-05 |
3 |
1 |
4 |
2 |
2011-01-06 |
2 |
3 |
6 |
5 |
df.rank(axis=1, ascending=False) # 行方向降序排名
|
A |
B |
C |
D |
2011-01-01 |
2.0 |
1.0 |
4.0 |
3.0 |
2011-01-02 |
3.0 |
2.0 |
4.0 |
1.0 |
2011-01-03 |
2.0 |
1.0 |
4.0 |
3.0 |
2011-01-04 |
3.0 |
1.0 |
4.0 |
2.0 |
2011-01-05 |
1.0 |
4.0 |
2.0 |
3.0 |
2011-01-06 |
4.0 |
2.0 |
3.0 |
1.0 |
增加行或者列
s1 = pd.Series([1,2,3,4,5], index=pd.date_range('20110102', periods=5))
s1
2011-01-02 1
2011-01-03 2
2011-01-04 3
2011-01-05 4
2011-01-06 5
Freq: D, dtype: int64
|
A |
B |
C |
D |
E |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
NaN |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
1.0 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2.0 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
3.0 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
4.0 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
5.0 |
在 df 新增了一列,而且会根据 index 自动对齐
# df = df.loc[:, 'A':'D']
df = df[list('ABCD')] # 两种方法都可以
df
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
pd.concat([df, df]) # 上下摞起来
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
pd.concat([df, df], axis=1) # 左右连起来
|
A |
B |
C |
D |
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]},
index=pd.date_range('20110101', periods=3))
df.append(df1) # 摞起来
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
2011-01-01 |
1.000000 |
4.000000 |
7.000000 |
NaN |
2011-01-02 |
2.000000 |
5.000000 |
8.000000 |
NaN |
2011-01-03 |
3.000000 |
6.000000 |
9.000000 |
NaN |
pd.concat([df, df1], join='inner')
|
A |
B |
C |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
2011-01-02 |
0.334617 |
0.701355 |
-0.536929 |
2011-01-03 |
0.393638 |
1.034505 |
0.249931 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
2011-01-01 |
1.000000 |
4.000000 |
7.000000 |
2011-01-02 |
2.000000 |
5.000000 |
8.000000 |
2011-01-03 |
3.000000 |
6.000000 |
9.000000 |
删除
drop() 返回新的 DataFrame, del 直接在原 DataFrame 上操作
|
A |
B |
C |
D |
2011-01-01 |
0.416989 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-04 |
-0.720014 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-0.330284 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
-0.649782 |
0.454061 |
0.336762 |
0.724999 |
|
B |
C |
D |
2011-01-01 |
0.724751 |
-2.514868 |
0.005752 |
2011-01-02 |
0.701355 |
-0.536929 |
2.096034 |
2011-01-03 |
1.034505 |
0.249931 |
0.361379 |
2011-01-04 |
0.208732 |
-1.218512 |
-0.704865 |
2011-01-05 |
-1.442339 |
-0.423601 |
-0.700278 |
2011-01-06 |
0.454061 |
0.336762 |
0.724999 |
|
A |
B |
C |
D |
2011-01-01 |
-0.660122 |
-1.312991 |
0.843274 |
3.960202 |
2011-01-02 |
0.483934 |
0.390050 |
1.385788 |
-0.380398 |
2011-01-03 |
-0.855515 |
-0.577646 |
0.900059 |
0.139274 |
2011-01-04 |
-0.135787 |
1.031862 |
1.477341 |
-1.119167 |
2011-01-05 |
-0.991275 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-06 |
2.430629 |
0.070378 |
0.352723 |
0.165268 |
|
B |
C |
D |
2011-01-01 |
-1.312991 |
0.843274 |
3.960202 |
2011-01-02 |
0.390050 |
1.385788 |
-0.380398 |
2011-01-03 |
-0.577646 |
0.900059 |
0.139274 |
2011-01-04 |
1.031862 |
1.477341 |
-1.119167 |
2011-01-05 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-06 |
0.070378 |
0.352723 |
0.165268 |
替换
df.loc[dates[2], 'C'] = 0
df
|
B |
C |
D |
2011-01-01 |
-1.312991 |
0.843274 |
3.960202 |
2011-01-02 |
0.390050 |
1.385788 |
-0.380398 |
2011-01-03 |
-0.577646 |
0.000000 |
0.139274 |
2011-01-04 |
1.031862 |
1.477341 |
-1.119167 |
2011-01-05 |
-0.568817 |
-0.645279 |
0.312290 |
2011-01-06 |
0.070378 |
0.352723 |
0.165268 |
重置索引
new_index = pd.date_range('20110101', periods=7)
df.reindex(new_index, columns=list('ABCDE'))
|
A |
B |
C |
D |
E |
2011-01-01 |
0.558092 |
-0.044851 |
0.627649 |
1.376806 |
NaN |
2011-01-02 |
2.370370 |
0.309665 |
1.534129 |
0.477640 |
NaN |
2011-01-03 |
-0.918037 |
1.442996 |
0.718332 |
-0.764125 |
NaN |
2011-01-04 |
1.232411 |
1.239217 |
1.380759 |
-1.552525 |
NaN |
2011-01-05 |
-1.157906 |
0.486401 |
-0.015537 |
0.016348 |
NaN |
2011-01-06 |
2.025508 |
1.072081 |
-2.053920 |
-0.710797 |
NaN |
2011-01-07 |
NaN |
NaN |
NaN |
NaN |
NaN |
new_index = pd.date_range('20140101', periods=7)
df.reindex(new_index, columns=list('ABCD')) # 如果不存在, 那么用 NaN
|
A |
B |
C |
D |
2014-01-01 |
NaN |
NaN |
NaN |
NaN |
2014-01-02 |
NaN |
NaN |
NaN |
NaN |
2014-01-03 |
NaN |
NaN |
NaN |
NaN |
2014-01-04 |
NaN |
NaN |
NaN |
NaN |
2014-01-05 |
NaN |
NaN |
NaN |
NaN |
2014-01-06 |
NaN |
NaN |
NaN |
NaN |
2014-01-07 |
NaN |
NaN |
NaN |
NaN |
轴
参考排序部分
DataFrame, Series 的运算
Series之间的运算
首先根据 index 匹配,无法匹配的话使用 NaN。
s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([4, 5, 6], index=list('BCD'))
A NaN
B 6.0
C 8.0
D NaN
dtype: float64
DataFrame 与 Series
Series 的 index 与 DataFrame 的 column 匹配。无法匹配的话使用 NaN。
df1 = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
index=list('abc'), columns=list('ABCD'))
df1
|
A |
B |
C |
D |
a |
1 |
2 |
3 |
4 |
b |
5 |
6 |
7 |
8 |
c |
9 |
10 |
11 |
12 |
|
A |
B |
C |
D |
a |
0.0 |
0.0 |
0.0 |
NaN |
b |
4.0 |
4.0 |
4.0 |
NaN |
c |
8.0 |
8.0 |
8.0 |
NaN |
DataFrame 之间
根据 index 和 column 匹配。无法匹配的话使用 NaN。
df2 = pd.DataFrame(np.arange(1, 13).reshape(4,3),
index=list('bcde'), columns=list('CDE'))
df2
|
C |
D |
E |
b |
1 |
2 |
3 |
c |
4 |
5 |
6 |
d |
7 |
8 |
9 |
e |
10 |
11 |
12 |
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
NaN |
NaN |
NaN |
e |
NaN |
NaN |
NaN |
NaN |
NaN |
函数和映射
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
|
A |
B |
C |
D |
2011-01-01 |
0.959872 |
0.036688 |
-0.495552 |
0.350378 |
2011-01-02 |
0.118525 |
0.093901 |
1.964353 |
0.773641 |
2011-01-03 |
0.248025 |
-0.947414 |
1.420396 |
0.591633 |
2011-01-04 |
0.123047 |
0.702724 |
-2.514727 |
0.297213 |
2011-01-05 |
0.265320 |
-0.126869 |
0.628532 |
0.244290 |
2011-01-06 |
2.158399 |
1.175738 |
1.477467 |
0.553494 |
A 1.464420
B 0.836578
C 1.301486
D 1.216124
dtype: float64
2011-01-01 0.653242
2011-01-02 0.836578
2011-01-03 1.301486
2011-01-04 1.216124
2011-01-05 1.464420
2011-01-06 0.467605
Freq: D, dtype: float64
或者使用自定义函数:
f = lambda x: x.max() - x.min()
df.apply(f)
A 3.148747
B 2.923576
C 2.243189
D 2.499079
dtype: float64
数据规整化
判断时候为缺失值 NaN
|
A |
B |
C |
D |
E |
a |
True |
True |
True |
True |
True |
b |
True |
True |
False |
False |
True |
c |
True |
True |
False |
False |
True |
d |
True |
True |
True |
True |
True |
e |
True |
True |
True |
True |
True |
|
A |
B |
C |
D |
E |
a |
False |
False |
False |
False |
False |
b |
False |
False |
True |
True |
False |
c |
False |
False |
True |
True |
False |
d |
False |
False |
False |
False |
False |
e |
False |
False |
False |
False |
False |
选出不是缺失值的数据
b 8.0
c 15.0
Name: C, dtype: float64
缺失值的填充
以指定的数字填充
|
A |
B |
C |
D |
E |
a |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
b |
0.0 |
0.0 |
8.0 |
10.0 |
0.0 |
c |
0.0 |
0.0 |
15.0 |
17.0 |
0.0 |
d |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
e |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
以列方向上前一个值填充
df3.fillna(method='ffill')
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
e |
NaN |
NaN |
15.0 |
17.0 |
NaN |
以列方向上后一个值填充
df3.fillna(method='bfill')
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
8.0 |
10.0 |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
NaN |
NaN |
NaN |
e |
NaN |
NaN |
NaN |
NaN |
NaN |
行方向
df3.fillna(method='bfill', axis=1)
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
8.0 |
8.0 |
8.0 |
10.0 |
NaN |
c |
15.0 |
15.0 |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
NaN |
NaN |
NaN |
e |
NaN |
NaN |
NaN |
NaN |
NaN |
限定个数
df3.fillna(method='pad', limit=1)
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
e |
NaN |
NaN |
NaN |
NaN |
NaN |
删除缺失值
df3.dropna(thresh=2) # 需要多少非 NaN 数据
|
A |
B |
C |
D |
E |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
|
A |
B |
C |
D |
E |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
删除重复数据
df4 = df3.fillna(method='pad')
df4
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
d |
NaN |
NaN |
15.0 |
17.0 |
NaN |
e |
NaN |
NaN |
15.0 |
17.0 |
NaN |
a False
b False
c False
d True
e True
dtype: bool
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
df4.duplicated(['C']) # 仅仅对某列操作
a False
b False
c False
d True
e True
dtype: bool
df4.drop_duplicates(['C'])
|
A |
B |
C |
D |
E |
a |
NaN |
NaN |
NaN |
NaN |
NaN |
b |
NaN |
NaN |
8.0 |
10.0 |
NaN |
c |
NaN |
NaN |
15.0 |
17.0 |
NaN |
使用 pandas 映射范畴数据
MAPPING CATEGORICAL DATA IN PANDAS
Python
不像 R
可以使用 factor
代表范畴数据. 我们可以使用 astype('category')
将范畴数据转变为数字
Nominal Categories
无序
import pandas as pd
df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})
print(df)
df1 = df.vertebrates.astype("category").cat.codes # no change for the original df
df1
vertebrates
0 Bird
1 Bird
2 Mammal
3 Fish
4 Amphibian
5 Reptile
6 Mammal
0 1
1 1
2 3
3 2
4 0
5 4
6 3
dtype: int8
指定种类
也可以指定要要转换那些种类, 这样你就很清楚都有什么种类. 任何没有指定的种类都会被标为 1.
vertebrate_types = ['Reptile', 'Bird', 'Amphibian', 'Fish']
df.vertebrates.astype("category", categories=vertebrate_types).cat.codes
0 1
1 1
2 -1
3 3
4 2
5 0
6 -1
dtype: int8
dummes
但是上面这些种类其实和数字是没有关系的, 也就是没有什么大小关系. 那么我们可以将它们转换为 dummes
.
However, there is no inherent relationship between these categories so it doesn’t necessary make sense to store these as different numbers on the same scale.
pd.get_dummies(df, columns=['vertebrates'])
|
vertebrates_Amphibian |
vertebrates_Bird |
vertebrates_Fish |
vertebrates_Mammal |
vertebrates_Reptile |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
1 |
0 |
0 |
0 |
2 |
0 |
0 |
0 |
1 |
0 |
3 |
0 |
0 |
1 |
0 |
0 |
4 |
1 |
0 |
0 |
0 |
0 |
5 |
0 |
0 |
0 |
0 |
1 |
6 |
0 |
0 |
0 |
1 |
0 |
为了防止 dummy variable trap, 使用 drop_first=True
.
pd.get_dummies(df, columns=['vertebrates'], drop_first=True)
|
vertebrates_Bird |
vertebrates_Fish |
vertebrates_Mammal |
vertebrates_Reptile |
0 |
1 |
0 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
2 |
0 |
0 |
1 |
0 |
3 |
0 |
1 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
5 |
0 |
0 |
0 |
1 |
6 |
0 |
0 |
1 |
0 |
Ordinal Categories
还有很多范畴数据是有序的, 比如成绩的优良中差, 满意度等. 使用 ordered = True
.
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})
可以直接输出文本, 没有的种类为 NaN
.
df.satisfaction.astype("category",
ordered=True,
categories=ordered_satisfaction
)
0 NaN
1 Happy
2 Unhappy
3 Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]
也可以使用数字, 那么没有的就是 -1.
df.satisfaction.astype("category",
ordered=True,
categories=ordered_satisfaction
).cat.codes
0 -1
1 3
2 1
3 2
dtype: int8