学习心得9 - Pandas

创建 DataFrame
查看和操作
DataFrame， Series 的运算
数据规整化
使用 pandas 映射范畴数据
- Nominal Categories
  - 指定种类
  - dummes
- Ordinal Categories

DataFrame 是表示型的数据结构，每一行就是一个记录，每一列是一个变量。简单说，DataFrame 是共享同一个 index 的 Series 集合。

创建 DataFrame

纯手工操作1

要创建，必须先创建索引，而且DataFrame 必须有索引。

import pandas as pd
import numpy as np

dates = pd.date_range('2011-1-1', periods=6, freq='D')
dates

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06'],
              dtype='datetime64[ns]', freq='D')

用这个创建好的 dates 作为索引，创建 DataFrame：

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999

纯手工操作2

df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]}, 
                  index=pd.date_range('20110101', periods=3))
df1

	A	B	C
2011-01-01	1	4	7
2011-01-02	2	5	8
2011-01-03	3	6	9

从外部文件或者数据库导入

一般使用 DataFrame不会手工创建的, 都是导入的, 不过导入很简单了, 比如从 csv 导入:

df = pandas.read_csv('csvfile.txt')

查看和操作

查看列名, 索引, 值和统计量

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.index

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06'],
              dtype='datetime64[ns]', freq='D')

df.values

array([[-0.66012233, -1.3129909 ,  0.84327428,  3.96020219],
       [ 0.48393364,  0.39004968,  1.38578819, -0.38039762],
       [-0.85551546, -0.57764636,  0.90005938,  0.13927364],
       [-0.13578653,  1.03186187,  1.47734054, -1.11916714],
       [-0.99127477, -0.56881698, -0.64527863,  0.31228982],
       [ 2.43062939,  0.07037805,  0.3527228 ,  0.16526828]])

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.092473	0.280178	-0.684536	0.297170
std	0.536510	0.888420	1.062205	1.049473
min	-0.720014	-1.442339	-2.514868	-0.704865
25%	-0.569908	0.270064	-1.048116	-0.523770
50%	0.002167	0.577708	-0.480265	0.183566
75%	0.378883	0.718902	0.081548	0.634094
max	0.416989	1.034505	0.336762	2.096034

行列操作

行

df[0:3]

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379

列

df['A']

2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64

df.A

2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64

df[['A', 'C']]

	A	C
2011-01-01	0.416989	-2.514868
2011-01-02	0.334617	-0.536929
2011-01-03	0.393638	0.249931
2011-01-04	-0.720014	-1.218512
2011-01-05	-0.330284	-0.423601
2011-01-06	-0.649782	0.336762

条件提取

df[df['A'] > 0]

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379

df[df['A'] > 0][['A', 'C']]

	A	C
2011-01-01	0.416989	-2.514868
2011-01-02	0.334617	-0.536929
2011-01-03	0.393638	0.249931

不能这样同时提取行列

~~df[0:1, 'A']~~

使用 loc() 方法 – 开始变得像 matlab 的操作1

通过行列的标签名提取

df.loc[:, 'A']  # 所有行, A 列

2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64

df.loc[dates[0:2], 'A':'C']

	A	B	C
2011-01-01	0.416989	0.724751	-2.514868
2011-01-02	0.334617	0.701355	-0.536929

这样还是不行: ~~df.loc[0:2, 'A':'C']~~

使用 iloc() – 开始变得像 matlab 的操作2

通过行列位置提取(index locate)

df.iloc[2]  # 提取某行

A   -0.855515
B   -0.577646
C    0.900059
D    0.139274
Name: 2011-01-03 00:00:00, dtype: float64

df.iloc[:, 2]  # 提取某列

2011-01-01    0.843274
2011-01-02    1.385788
2011-01-03    0.900059
2011-01-04    1.477341
2011-01-05   -0.645279
2011-01-06    0.352723
Freq: D, Name: C, dtype: float64

df.iloc[[1,3], [2,3]]  # 提取某几行某几列

	C	D
2011-01-02	1.385788	-0.380398
2011-01-04	1.477341	-1.119167

df.iloc[1:3, 2:3]  # 切片

	C
2011-01-02	1.385788
2011-01-03	0.900059

混合使用标签和位置索引, ix() – 不推荐使用

df.ix[2:5]  # 行

	A	B	C	D
2011-01-03	-0.855515	-0.577646	0.900059	0.139274
2011-01-04	-0.135787	1.031862	1.477341	-1.119167
2011-01-05	-0.991275	-0.568817	-0.645279	0.312290

df.ix[[1,3], 2]

2011-01-02    1.385788
2011-01-04    1.477341
Freq: 2D, Name: C, dtype: float64

df.ix[1:3, 2]

2011-01-02    1.385788
2011-01-03    0.900059
Freq: D, Name: C, dtype: float64

df.ix[1:3, df.iloc[3]>0]

	B	C
2011-01-02	0.390050	1.385788
2011-01-03	-0.577646	0.900059

DataFrame 操作

转置

df.T

	2011-01-01 00:00:00	2011-01-02 00:00:00	2011-01-03 00:00:00	2011-01-04 00:00:00	2011-01-05 00:00:00	2011-01-06 00:00:00
A	-0.660122	0.483934	-0.855515	-0.135787	-0.991275	2.430629
B	-1.312991	0.390050	-0.577646	1.031862	-0.568817	0.070378
C	0.843274	1.385788	0.900059	1.477341	-0.645279	0.352723
D	3.960202	-0.380398	0.139274	-1.119167	0.312290	0.165268

排序

df.sort_index(axis=0, ascending=True, inplace=False)

将 DataFrame 按照标签排序

axis: 默认为0, 按照 index 排序. 为1时, 按照列名排序

  +------------+---------+--------+
  |            |  A      |  B     |
  +------------+---------+---------
  |2011-01-06  | 0.626386| 1.52325|----axis=1----->
  +------------+---------+--------+
               |         |
               | axis=0  |
               ↓         ↓

inplace: 如果 False, 不修改原始数据. 如果 True, 直接修改原始数据

df.sort_index(ascending=False)  

	A	B	C	D
2011-01-06	2.430629	0.070378	0.352723	0.165268
2011-01-05	-0.991275	-0.568817	-0.645279	0.312290
2011-01-04	-0.135787	1.031862	1.477341	-1.119167
2011-01-03	-0.855515	-0.577646	0.900059	0.139274
2011-01-02	0.483934	0.390050	1.385788	-0.380398
2011-01-01	-0.660122	-1.312991	0.843274	3.960202

df.sort_index(axis=1, ascending=False)  # 注意排序方向

	D	C	B	A
2011-01-01	3.960202	0.843274	-1.312991	-0.660122
2011-01-02	-0.380398	1.385788	0.390050	0.483934
2011-01-03	0.139274	0.900059	-0.577646	-0.855515
2011-01-04	-1.119167	1.477341	1.031862	-0.135787
2011-01-05	0.312290	-0.645279	-0.568817	-0.991275
2011-01-06	0.165268	0.352723	0.070378	2.430629

sort_values(by) 根据值排序

脑补 sort_index() 为 sort_values(by=index)

df.sort_values(by='C') 

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-06	-0.649782	0.454061	0.336762	0.724999

排名

rank() 返回相应的行或者列排序的排名

df.rank(axis=0)  # 列方向升序排名

	A	B	C	D
2011-01-01	6.0	5.0	1.0	3.0
2011-01-02	4.0	4.0	3.0	6.0
2011-01-03	5.0	6.0	5.0	4.0
2011-01-04	1.0	2.0	2.0	1.0
2011-01-05	3.0	1.0	4.0	2.0
2011-01-06	2.0	3.0	6.0	5.0

df.rank(axis=0).astype(int)  # 使用 astype 将数据类型变为 int

	A	B	C	D
2011-01-01	6	5	1	3
2011-01-02	4	4	3	6
2011-01-03	5	6	5	4
2011-01-04	1	2	2	1
2011-01-05	3	1	4	2
2011-01-06	2	3	6	5

df.rank(axis=1, ascending=False)  # 行方向降序排名

	A	B	C	D
2011-01-01	2.0	1.0	4.0	3.0
2011-01-02	3.0	2.0	4.0	1.0
2011-01-03	2.0	1.0	4.0	3.0
2011-01-04	3.0	1.0	4.0	2.0
2011-01-05	1.0	4.0	2.0	3.0
2011-01-06	4.0	2.0	3.0	1.0

增加行或者列

s1 = pd.Series([1,2,3,4,5], index=pd.date_range('20110102', periods=5))
s1

2011-01-02    1
2011-01-03    2
2011-01-04    3
2011-01-05    4
2011-01-06    5
Freq: D, dtype: int64

df['E'] = s1  # 新增一列
df

	A	B	C	D	E
2011-01-01	0.416989	0.724751	-2.514868	0.005752	NaN
2011-01-02	0.334617	0.701355	-0.536929	2.096034	1.0
2011-01-03	0.393638	1.034505	0.249931	0.361379	2.0
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865	3.0
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278	4.0
2011-01-06	-0.649782	0.454061	0.336762	0.724999	5.0

在 df 新增了一列,而且会根据 index 自动对齐

# df = df.loc[:, 'A':'D']
df = df[list('ABCD')]  # 两种方法都可以
df

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999

pd.concat([df, df])  # 上下摞起来

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999

pd.concat([df, df], axis=1)  # 左右连起来

	A	B	C	D	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999	-0.649782	0.454061	0.336762	0.724999

df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]}, 
                  index=pd.date_range('20110101', periods=3))
df.append(df1)  # 摞起来

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-02	0.334617	0.701355	-0.536929	2.096034
2011-01-03	0.393638	1.034505	0.249931	0.361379
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999
2011-01-01	1.000000	4.000000	7.000000	NaN
2011-01-02	2.000000	5.000000	8.000000	NaN
2011-01-03	3.000000	6.000000	9.000000	NaN

pd.concat([df, df1], join='inner')

	A	B	C
2011-01-01	0.416989	0.724751	-2.514868
2011-01-02	0.334617	0.701355	-0.536929
2011-01-03	0.393638	1.034505	0.249931
2011-01-04	-0.720014	0.208732	-1.218512
2011-01-05	-0.330284	-1.442339	-0.423601
2011-01-06	-0.649782	0.454061	0.336762
2011-01-01	1.000000	4.000000	7.000000
2011-01-02	2.000000	5.000000	8.000000
2011-01-03	3.000000	6.000000	9.000000

删除

drop() 返回新的 DataFrame, del 直接在原 DataFrame 上操作

df.drop(dates[1:3])

	A	B	C	D
2011-01-01	0.416989	0.724751	-2.514868	0.005752
2011-01-04	-0.720014	0.208732	-1.218512	-0.704865
2011-01-05	-0.330284	-1.442339	-0.423601	-0.700278
2011-01-06	-0.649782	0.454061	0.336762	0.724999

df.drop('A', axis=1)

	B	C	D
2011-01-01	0.724751	-2.514868	0.005752
2011-01-02	0.701355	-0.536929	2.096034
2011-01-03	1.034505	0.249931	0.361379
2011-01-04	0.208732	-1.218512	-0.704865
2011-01-05	-1.442339	-0.423601	-0.700278
2011-01-06	0.454061	0.336762	0.724999

df  # 没变

	A	B	C	D
2011-01-01	-0.660122	-1.312991	0.843274	3.960202
2011-01-02	0.483934	0.390050	1.385788	-0.380398
2011-01-03	-0.855515	-0.577646	0.900059	0.139274
2011-01-04	-0.135787	1.031862	1.477341	-1.119167
2011-01-05	-0.991275	-0.568817	-0.645279	0.312290
2011-01-06	2.430629	0.070378	0.352723	0.165268

del df['A']
df

	B	C	D
2011-01-01	-1.312991	0.843274	3.960202
2011-01-02	0.390050	1.385788	-0.380398
2011-01-03	-0.577646	0.900059	0.139274
2011-01-04	1.031862	1.477341	-1.119167
2011-01-05	-0.568817	-0.645279	0.312290
2011-01-06	0.070378	0.352723	0.165268

替换

df.loc[dates[2], 'C'] = 0
df

	B	C	D
2011-01-01	-1.312991	0.843274	3.960202
2011-01-02	0.390050	1.385788	-0.380398
2011-01-03	-0.577646	0.000000	0.139274
2011-01-04	1.031862	1.477341	-1.119167
2011-01-05	-0.568817	-0.645279	0.312290
2011-01-06	0.070378	0.352723	0.165268

重置索引

new_index = pd.date_range('20110101', periods=7)
df.reindex(new_index, columns=list('ABCDE'))

	A	B	C	D	E
2011-01-01	0.558092	-0.044851	0.627649	1.376806	NaN
2011-01-02	2.370370	0.309665	1.534129	0.477640	NaN
2011-01-03	-0.918037	1.442996	0.718332	-0.764125	NaN
2011-01-04	1.232411	1.239217	1.380759	-1.552525	NaN
2011-01-05	-1.157906	0.486401	-0.015537	0.016348	NaN
2011-01-06	2.025508	1.072081	-2.053920	-0.710797	NaN
2011-01-07	NaN	NaN	NaN	NaN	NaN

new_index = pd.date_range('20140101', periods=7)
df.reindex(new_index, columns=list('ABCD'))    # 如果不存在, 那么用 NaN

	A	B	C	D
2014-01-01	NaN	NaN	NaN	NaN
2014-01-02	NaN	NaN	NaN	NaN
2014-01-03	NaN	NaN	NaN	NaN
2014-01-04	NaN	NaN	NaN	NaN
2014-01-05	NaN	NaN	NaN	NaN
2014-01-06	NaN	NaN	NaN	NaN
2014-01-07	NaN	NaN	NaN	NaN

轴

参考排序部分

DataFrame， Series 的运算

Series之间的运算

首先根据 index 匹配，无法匹配的话使用 NaN。

s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([4, 5, 6], index=list('BCD'))

s1 + s2

A    NaN
B    6.0
C    8.0
D    NaN
dtype: float64

DataFrame 与 Series

Series 的 index 与 DataFrame 的 column 匹配。无法匹配的话使用 NaN。

df1 = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                  index=list('abc'), columns=list('ABCD'))
df1

	A	B	C	D
a	1	2	3	4
b	5	6	7	8
c	9	10	11	12

df1 - s1  # 按照

	A	B	C	D
a	0.0	0.0	0.0	NaN
b	4.0	4.0	4.0	NaN
c	8.0	8.0	8.0	NaN

DataFrame 之间

根据 index 和 column 匹配。无法匹配的话使用 NaN。

df2 = pd.DataFrame(np.arange(1, 13).reshape(4,3),
                  index=list('bcde'), columns=list('CDE'))
df2

	C	D	E
b	1	2	3
c	4	5	6
d	7	8	9
e	10	11	12

df3 = df1 + df2
df3

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN
d	NaN	NaN	NaN	NaN	NaN
e	NaN	NaN	NaN	NaN	NaN

函数和映射

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2011-01-01	0.959872	0.036688	-0.495552	0.350378
2011-01-02	0.118525	0.093901	1.964353	0.773641
2011-01-03	0.248025	-0.947414	1.420396	0.591633
2011-01-04	0.123047	0.702724	-2.514727	0.297213
2011-01-05	0.265320	-0.126869	0.628532	0.244290
2011-01-06	2.158399	1.175738	1.477467	0.553494

df.apply(max)

A    1.464420
B    0.836578
C    1.301486
D    1.216124
dtype: float64

df.apply(max, axis=1)

2011-01-01    0.653242
2011-01-02    0.836578
2011-01-03    1.301486
2011-01-04    1.216124
2011-01-05    1.464420
2011-01-06    0.467605
Freq: D, dtype: float64

或者使用自定义函数：

f = lambda x: x.max() - x.min()
df.apply(f)

A    3.148747
B    2.923576
C    2.243189
D    2.499079
dtype: float64

数据规整化

判断时候为缺失值 NaN

df3.isnull()

	A	B	C	D	E
a	True	True	True	True	True
b	True	True	False	False	True
c	True	True	False	False	True
d	True	True	True	True	True
e	True	True	True	True	True

df3.notnull()

	A	B	C	D	E
a	False	False	False	False	False
b	False	False	True	True	False
c	False	False	True	True	False
d	False	False	False	False	False
e	False	False	False	False	False

选出不是缺失值的数据

df3.C[df3.C.notnull()]

b     8.0
c    15.0
Name: C, dtype: float64

缺失值的填充

以指定的数字填充

df3.fillna(0)

	C	D
a	0.0	0.0
b	8.0	10.0
c	15.0	17.0
d	0.0	0.0
e	0.0	0.0

以列方向上前一个值填充

df3.fillna(method='ffill')

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN
d	NaN	NaN	15.0	17.0	NaN
e	NaN	NaN	15.0	17.0	NaN

以列方向上后一个值填充

df3.fillna(method='bfill')

	A	B	C	D	E
a	NaN	NaN	8.0	10.0	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN
d	NaN	NaN	NaN	NaN	NaN
e	NaN	NaN	NaN	NaN	NaN

行方向

df3.fillna(method='bfill', axis=1)

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	8.0	8.0	8.0	10.0	NaN
c	15.0	15.0	15.0	17.0	NaN
d	NaN	NaN	NaN	NaN	NaN
e	NaN	NaN	NaN	NaN	NaN

限定个数

df3.fillna(method='pad', limit=1)

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN
d	NaN	NaN	15.0	17.0	NaN
e	NaN	NaN	NaN	NaN	NaN

删除缺失值

df3.dropna(thresh=2)  # 需要多少非 NaN 数据

	A	B	C	D	E
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN

df3.dropna(how='all')

	A	B	C	D	E
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN

删除重复数据

df4 = df3.fillna(method='pad')
df4

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN
d	NaN	NaN	15.0	17.0	NaN
e	NaN	NaN	15.0	17.0	NaN

df4.duplicated()

a    False
b    False
c    False
d     True
e     True
dtype: bool

df4.drop_duplicates()

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN

df4.duplicated(['C'])  # 仅仅对某列操作

a    False
b    False
c    False
d     True
e     True
dtype: bool

df4.drop_duplicates(['C'])

	A	B	C	D	E
a	NaN	NaN	NaN	NaN	NaN
b	NaN	NaN	8.0	10.0	NaN
c	NaN	NaN	15.0	17.0	NaN

使用 pandas 映射范畴数据

MAPPING CATEGORICAL DATA IN PANDAS

Python 不像 R可以使用 factor 代表范畴数据. 我们可以使用 astype('category') 将范畴数据转变为数字

Nominal Categories

无序

import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})
print(df)

df1 = df.vertebrates.astype("category").cat.codes  # no change for the original df
df1

  vertebrates
      Bird
      Bird
    Mammal
      Fish
 Amphibian
   Reptile
    Mammal





  1
  1
  3
  2
  0
  4
  3
dtype: int8

指定种类

也可以指定要要转换那些种类, 这样你就很清楚都有什么种类. 任何没有指定的种类都会被标为 1.

vertebrate_types = ['Reptile', 'Bird', 'Amphibian', 'Fish']

df.vertebrates.astype("category", categories=vertebrate_types).cat.codes

  1
  1
 -1
  3
  2
  0
 -1
dtype: int8

dummes

但是上面这些种类其实和数字是没有关系的, 也就是没有什么大小关系. 那么我们可以将它们转换为 dummes.

However, there is no inherent relationship between these categories so it doesn’t necessary make sense to store these as different numbers on the same scale.

pd.get_dummies(df, columns=['vertebrates'])

	vertebrates_Amphibian	vertebrates_Bird	vertebrates_Fish	vertebrates_Mammal	vertebrates_Reptile
0	0	1	0	0	0
1	0	1	0	0	0
2	0	0	0	1	0
3	0	0	1	0	0
4	1	0	0	0	0
5	0	0	0	0	1
6	0	0	0	1	0

为了防止 dummy variable trap, 使用 drop_first=True.

pd.get_dummies(df, columns=['vertebrates'], drop_first=True)

	vertebrates_Bird	vertebrates_Fish	vertebrates_Mammal	vertebrates_Reptile
0	1	0	0	0
1	1	0	0	0
2	0	0	1	0
3	0	1	0	0
4	0	0	0	0
5	0	0	0	1
6	0	0	1	0

Ordinal Categories

还有很多范畴数据是有序的, 比如成绩的优良中差, 满意度等. 使用 ordered = True.

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})

可以直接输出文本, 没有的种类为 NaN.

df.satisfaction.astype("category",
  ordered=True,
  categories=ordered_satisfaction
)

0        NaN
1      Happy
2    Unhappy
3    Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]

也可以使用数字, 那么没有的就是 -1.

df.satisfaction.astype("category",
  ordered=True,
  categories=ordered_satisfaction
).cat.codes

 -1
  3
  1
  2
dtype: int8

学习心得9 - Pandas - DataFrame