个人学习 学,以致用

学习心得9 - Pandas - DataFrame

2018-02-17
Geng

DataFrame 是表示型的数据结构,每一行就是一个记录,每一列是一个变量。简单说,DataFrame 是共享同一个 index 的 Series 集合。

创建 DataFrame

纯手工操作1

要创建,必须先创建索引,而且DataFrame 必须有索引。

import pandas as pd
import numpy as np
dates = pd.date_range('2011-1-1', periods=6, freq='D')
dates
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06'],
              dtype='datetime64[ns]', freq='D')

用这个创建好的 dates 作为索引,创建 DataFrame:

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999

纯手工操作2

df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]}, 
                  index=pd.date_range('20110101', periods=3))
df1
A B C
2011-01-01 1 4 7
2011-01-02 2 5 8
2011-01-03 3 6 9

从外部文件或者数据库导入

一般使用 DataFrame不会手工创建的, 都是导入的, 不过导入很简单了, 比如从 csv 导入:

df = pandas.read_csv('csvfile.txt')

查看和操作

查看列名, 索引, 值和统计量

df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.index
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06'],
              dtype='datetime64[ns]', freq='D')
df.values
array([[-0.66012233, -1.3129909 ,  0.84327428,  3.96020219],
       [ 0.48393364,  0.39004968,  1.38578819, -0.38039762],
       [-0.85551546, -0.57764636,  0.90005938,  0.13927364],
       [-0.13578653,  1.03186187,  1.47734054, -1.11916714],
       [-0.99127477, -0.56881698, -0.64527863,  0.31228982],
       [ 2.43062939,  0.07037805,  0.3527228 ,  0.16526828]])
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.092473 0.280178 -0.684536 0.297170
std 0.536510 0.888420 1.062205 1.049473
min -0.720014 -1.442339 -2.514868 -0.704865
25% -0.569908 0.270064 -1.048116 -0.523770
50% 0.002167 0.577708 -0.480265 0.183566
75% 0.378883 0.718902 0.081548 0.634094
max 0.416989 1.034505 0.336762 2.096034

行列操作

df[0:3]
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379

df['A']
2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64
df.A
2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64
df[['A', 'C']]
A C
2011-01-01 0.416989 -2.514868
2011-01-02 0.334617 -0.536929
2011-01-03 0.393638 0.249931
2011-01-04 -0.720014 -1.218512
2011-01-05 -0.330284 -0.423601
2011-01-06 -0.649782 0.336762

条件提取

df[df['A'] > 0]
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
df[df['A'] > 0][['A', 'C']]
A C
2011-01-01 0.416989 -2.514868
2011-01-02 0.334617 -0.536929
2011-01-03 0.393638 0.249931

不能这样同时提取行列

df[0:1, 'A']

使用 loc() 方法 – 开始变得像 matlab 的操作1

通过行列的标签名提取

df.loc[:, 'A']  # 所有行, A 列
2011-01-01    0.416989
2011-01-02    0.334617
2011-01-03    0.393638
2011-01-04   -0.720014
2011-01-05   -0.330284
2011-01-06   -0.649782
Freq: D, Name: A, dtype: float64
df.loc[dates[0:2], 'A':'C']
A B C
2011-01-01 0.416989 0.724751 -2.514868
2011-01-02 0.334617 0.701355 -0.536929

这样还是不行: df.loc[0:2, 'A':'C']

使用 iloc() – 开始变得像 matlab 的操作2

通过行列位置提取(index locate)

df.iloc[2]  # 提取某行
A   -0.855515
B   -0.577646
C    0.900059
D    0.139274
Name: 2011-01-03 00:00:00, dtype: float64
df.iloc[:, 2]  # 提取某列
2011-01-01    0.843274
2011-01-02    1.385788
2011-01-03    0.900059
2011-01-04    1.477341
2011-01-05   -0.645279
2011-01-06    0.352723
Freq: D, Name: C, dtype: float64
df.iloc[[1,3], [2,3]]  # 提取某几行某几列
C D
2011-01-02 1.385788 -0.380398
2011-01-04 1.477341 -1.119167
df.iloc[1:3, 2:3]  # 切片
C
2011-01-02 1.385788
2011-01-03 0.900059

混合使用标签和位置索引, ix() – 不推荐使用

df.ix[2:5]  # 行
A B C D
2011-01-03 -0.855515 -0.577646 0.900059 0.139274
2011-01-04 -0.135787 1.031862 1.477341 -1.119167
2011-01-05 -0.991275 -0.568817 -0.645279 0.312290
df.ix[[1,3], 2]
2011-01-02    1.385788
2011-01-04    1.477341
Freq: 2D, Name: C, dtype: float64
df.ix[1:3, 2]
2011-01-02    1.385788
2011-01-03    0.900059
Freq: D, Name: C, dtype: float64
df.ix[1:3, df.iloc[3]>0]
B C
2011-01-02 0.390050 1.385788
2011-01-03 -0.577646 0.900059

DataFrame 操作

转置

df.T
2011-01-01 00:00:00 2011-01-02 00:00:00 2011-01-03 00:00:00 2011-01-04 00:00:00 2011-01-05 00:00:00 2011-01-06 00:00:00
A -0.660122 0.483934 -0.855515 -0.135787 -0.991275 2.430629
B -1.312991 0.390050 -0.577646 1.031862 -0.568817 0.070378
C 0.843274 1.385788 0.900059 1.477341 -0.645279 0.352723
D 3.960202 -0.380398 0.139274 -1.119167 0.312290 0.165268

排序

  • df.sort_index(axis=0, ascending=True, inplace=False)

    DataFrame 按照标签排序

    • axis: 默认为0, 按照 index 排序. 为1时, 按照列名排序
      +------------+---------+--------+
      |            |  A      |  B     |
      +------------+---------+---------
      |2011-01-06  | 0.626386| 1.52325|----axis=1----->
      +------------+---------+--------+
                   |         |
                   | axis=0  |
                            
    
    • inplace: 如果 False, 不修改原始数据. 如果 True, 直接修改原始数据
df.sort_index(ascending=False)  
A B C D
2011-01-06 2.430629 0.070378 0.352723 0.165268
2011-01-05 -0.991275 -0.568817 -0.645279 0.312290
2011-01-04 -0.135787 1.031862 1.477341 -1.119167
2011-01-03 -0.855515 -0.577646 0.900059 0.139274
2011-01-02 0.483934 0.390050 1.385788 -0.380398
2011-01-01 -0.660122 -1.312991 0.843274 3.960202
df.sort_index(axis=1, ascending=False)  # 注意排序方向
D C B A
2011-01-01 3.960202 0.843274 -1.312991 -0.660122
2011-01-02 -0.380398 1.385788 0.390050 0.483934
2011-01-03 0.139274 0.900059 -0.577646 -0.855515
2011-01-04 -1.119167 1.477341 1.031862 -0.135787
2011-01-05 0.312290 -0.645279 -0.568817 -0.991275
2011-01-06 0.165268 0.352723 0.070378 2.430629
  • sort_values(by) 根据值排序

脑补 sort_index()sort_values(by=index)

df.sort_values(by='C') 
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-06 -0.649782 0.454061 0.336762 0.724999

排名

rank() 返回相应的行或者列排序的排名

df.rank(axis=0)  # 列方向升序排名
A B C D
2011-01-01 6.0 5.0 1.0 3.0
2011-01-02 4.0 4.0 3.0 6.0
2011-01-03 5.0 6.0 5.0 4.0
2011-01-04 1.0 2.0 2.0 1.0
2011-01-05 3.0 1.0 4.0 2.0
2011-01-06 2.0 3.0 6.0 5.0
df.rank(axis=0).astype(int)  # 使用 astype 将数据类型变为 int
A B C D
2011-01-01 6 5 1 3
2011-01-02 4 4 3 6
2011-01-03 5 6 5 4
2011-01-04 1 2 2 1
2011-01-05 3 1 4 2
2011-01-06 2 3 6 5
df.rank(axis=1, ascending=False)  # 行方向降序排名
A B C D
2011-01-01 2.0 1.0 4.0 3.0
2011-01-02 3.0 2.0 4.0 1.0
2011-01-03 2.0 1.0 4.0 3.0
2011-01-04 3.0 1.0 4.0 2.0
2011-01-05 1.0 4.0 2.0 3.0
2011-01-06 4.0 2.0 3.0 1.0

增加行或者列

s1 = pd.Series([1,2,3,4,5], index=pd.date_range('20110102', periods=5))
s1
2011-01-02    1
2011-01-03    2
2011-01-04    3
2011-01-05    4
2011-01-06    5
Freq: D, dtype: int64
df['E'] = s1  # 新增一列
df
A B C D E
2011-01-01 0.416989 0.724751 -2.514868 0.005752 NaN
2011-01-02 0.334617 0.701355 -0.536929 2.096034 1.0
2011-01-03 0.393638 1.034505 0.249931 0.361379 2.0
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865 3.0
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278 4.0
2011-01-06 -0.649782 0.454061 0.336762 0.724999 5.0

在 df 新增了一列,而且会根据 index 自动对齐

# df = df.loc[:, 'A':'D']
df = df[list('ABCD')]  # 两种方法都可以
df
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999
pd.concat([df, df])  # 上下摞起来
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999
pd.concat([df, df], axis=1)  # 左右连起来
A B C D A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999 -0.649782 0.454061 0.336762 0.724999
df1 = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]}, 
                  index=pd.date_range('20110101', periods=3))
df.append(df1)  # 摞起来
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-02 0.334617 0.701355 -0.536929 2.096034
2011-01-03 0.393638 1.034505 0.249931 0.361379
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999
2011-01-01 1.000000 4.000000 7.000000 NaN
2011-01-02 2.000000 5.000000 8.000000 NaN
2011-01-03 3.000000 6.000000 9.000000 NaN
pd.concat([df, df1], join='inner')
A B C
2011-01-01 0.416989 0.724751 -2.514868
2011-01-02 0.334617 0.701355 -0.536929
2011-01-03 0.393638 1.034505 0.249931
2011-01-04 -0.720014 0.208732 -1.218512
2011-01-05 -0.330284 -1.442339 -0.423601
2011-01-06 -0.649782 0.454061 0.336762
2011-01-01 1.000000 4.000000 7.000000
2011-01-02 2.000000 5.000000 8.000000
2011-01-03 3.000000 6.000000 9.000000

删除

drop() 返回新的 DataFrame, del 直接在原 DataFrame 上操作

df.drop(dates[1:3])
A B C D
2011-01-01 0.416989 0.724751 -2.514868 0.005752
2011-01-04 -0.720014 0.208732 -1.218512 -0.704865
2011-01-05 -0.330284 -1.442339 -0.423601 -0.700278
2011-01-06 -0.649782 0.454061 0.336762 0.724999
df.drop('A', axis=1)
B C D
2011-01-01 0.724751 -2.514868 0.005752
2011-01-02 0.701355 -0.536929 2.096034
2011-01-03 1.034505 0.249931 0.361379
2011-01-04 0.208732 -1.218512 -0.704865
2011-01-05 -1.442339 -0.423601 -0.700278
2011-01-06 0.454061 0.336762 0.724999
df  # 没变
A B C D
2011-01-01 -0.660122 -1.312991 0.843274 3.960202
2011-01-02 0.483934 0.390050 1.385788 -0.380398
2011-01-03 -0.855515 -0.577646 0.900059 0.139274
2011-01-04 -0.135787 1.031862 1.477341 -1.119167
2011-01-05 -0.991275 -0.568817 -0.645279 0.312290
2011-01-06 2.430629 0.070378 0.352723 0.165268
del df['A']
df
B C D
2011-01-01 -1.312991 0.843274 3.960202
2011-01-02 0.390050 1.385788 -0.380398
2011-01-03 -0.577646 0.900059 0.139274
2011-01-04 1.031862 1.477341 -1.119167
2011-01-05 -0.568817 -0.645279 0.312290
2011-01-06 0.070378 0.352723 0.165268

替换

df.loc[dates[2], 'C'] = 0
df
B C D
2011-01-01 -1.312991 0.843274 3.960202
2011-01-02 0.390050 1.385788 -0.380398
2011-01-03 -0.577646 0.000000 0.139274
2011-01-04 1.031862 1.477341 -1.119167
2011-01-05 -0.568817 -0.645279 0.312290
2011-01-06 0.070378 0.352723 0.165268

重置索引

new_index = pd.date_range('20110101', periods=7)
df.reindex(new_index, columns=list('ABCDE'))
A B C D E
2011-01-01 0.558092 -0.044851 0.627649 1.376806 NaN
2011-01-02 2.370370 0.309665 1.534129 0.477640 NaN
2011-01-03 -0.918037 1.442996 0.718332 -0.764125 NaN
2011-01-04 1.232411 1.239217 1.380759 -1.552525 NaN
2011-01-05 -1.157906 0.486401 -0.015537 0.016348 NaN
2011-01-06 2.025508 1.072081 -2.053920 -0.710797 NaN
2011-01-07 NaN NaN NaN NaN NaN
new_index = pd.date_range('20140101', periods=7)
df.reindex(new_index, columns=list('ABCD'))    # 如果不存在, 那么用 NaN
A B C D
2014-01-01 NaN NaN NaN NaN
2014-01-02 NaN NaN NaN NaN
2014-01-03 NaN NaN NaN NaN
2014-01-04 NaN NaN NaN NaN
2014-01-05 NaN NaN NaN NaN
2014-01-06 NaN NaN NaN NaN
2014-01-07 NaN NaN NaN NaN

参考排序部分

DataFrame, Series 的运算

Series之间的运算

首先根据 index 匹配,无法匹配的话使用 NaN。

s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([4, 5, 6], index=list('BCD'))
s1 + s2
A    NaN
B    6.0
C    8.0
D    NaN
dtype: float64

DataFrame 与 Series

Series 的 index 与 DataFrame 的 column 匹配。无法匹配的话使用 NaN。

df1 = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                  index=list('abc'), columns=list('ABCD'))
df1
A B C D
a 1 2 3 4
b 5 6 7 8
c 9 10 11 12
df1 - s1  # 按照
A B C D
a 0.0 0.0 0.0 NaN
b 4.0 4.0 4.0 NaN
c 8.0 8.0 8.0 NaN

DataFrame 之间

根据 index 和 column 匹配。无法匹配的话使用 NaN。

df2 = pd.DataFrame(np.arange(1, 13).reshape(4,3),
                  index=list('bcde'), columns=list('CDE'))
df2
C D E
b 1 2 3
c 4 5 6
d 7 8 9
e 10 11 12
df3 = df1 + df2
df3
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

函数和映射

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
A B C D
2011-01-01 0.959872 0.036688 -0.495552 0.350378
2011-01-02 0.118525 0.093901 1.964353 0.773641
2011-01-03 0.248025 -0.947414 1.420396 0.591633
2011-01-04 0.123047 0.702724 -2.514727 0.297213
2011-01-05 0.265320 -0.126869 0.628532 0.244290
2011-01-06 2.158399 1.175738 1.477467 0.553494
df.apply(max)
A    1.464420
B    0.836578
C    1.301486
D    1.216124
dtype: float64
df.apply(max, axis=1)
2011-01-01    0.653242
2011-01-02    0.836578
2011-01-03    1.301486
2011-01-04    1.216124
2011-01-05    1.464420
2011-01-06    0.467605
Freq: D, dtype: float64

或者使用自定义函数:

f = lambda x: x.max() - x.min()
df.apply(f)
A    3.148747
B    2.923576
C    2.243189
D    2.499079
dtype: float64

数据规整化

判断时候为缺失值 NaN

df3.isnull()
A B C D E
a True True True True True
b True True False False True
c True True False False True
d True True True True True
e True True True True True
df3.notnull()
A B C D E
a False False False False False
b False False True True False
c False False True True False
d False False False False False
e False False False False False

选出不是缺失值的数据

df3.C[df3.C.notnull()]
b     8.0
c    15.0
Name: C, dtype: float64

缺失值的填充

以指定的数字填充

df3.fillna(0)
A B C D E
a 0.0 0.0 0.0 0.0 0.0
b 0.0 0.0 8.0 10.0 0.0
c 0.0 0.0 15.0 17.0 0.0
d 0.0 0.0 0.0 0.0 0.0
e 0.0 0.0 0.0 0.0 0.0

以列方向上前一个值填充

df3.fillna(method='ffill')
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
d NaN NaN 15.0 17.0 NaN
e NaN NaN 15.0 17.0 NaN

以列方向上后一个值填充

df3.fillna(method='bfill')
A B C D E
a NaN NaN 8.0 10.0 NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

行方向

df3.fillna(method='bfill', axis=1)
A B C D E
a NaN NaN NaN NaN NaN
b 8.0 8.0 8.0 10.0 NaN
c 15.0 15.0 15.0 17.0 NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

限定个数

df3.fillna(method='pad', limit=1)
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
d NaN NaN 15.0 17.0 NaN
e NaN NaN NaN NaN NaN

删除缺失值

df3.dropna(thresh=2)  # 需要多少非 NaN 数据
A B C D E
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
df3.dropna(how='all')
A B C D E
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN

删除重复数据

df4 = df3.fillna(method='pad')
df4
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
d NaN NaN 15.0 17.0 NaN
e NaN NaN 15.0 17.0 NaN
df4.duplicated()
a    False
b    False
c    False
d     True
e     True
dtype: bool
df4.drop_duplicates()
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN
df4.duplicated(['C'])  # 仅仅对某列操作
a    False
b    False
c    False
d     True
e     True
dtype: bool
df4.drop_duplicates(['C'])
A B C D E
a NaN NaN NaN NaN NaN
b NaN NaN 8.0 10.0 NaN
c NaN NaN 15.0 17.0 NaN

使用 pandas 映射范畴数据

MAPPING CATEGORICAL DATA IN PANDAS

Python 不像 R可以使用 factor 代表范畴数据. 我们可以使用 astype('category') 将范畴数据转变为数字

Nominal Categories

无序

import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})
print(df)

df1 = df.vertebrates.astype("category").cat.codes  # no change for the original df
df1
  vertebrates
0        Bird
1        Bird
2      Mammal
3        Fish
4   Amphibian
5     Reptile
6      Mammal





0    1
1    1
2    3
3    2
4    0
5    4
6    3
dtype: int8

指定种类

也可以指定要要转换那些种类, 这样你就很清楚都有什么种类. 任何没有指定的种类都会被标为 1.

vertebrate_types = ['Reptile', 'Bird', 'Amphibian', 'Fish']

df.vertebrates.astype("category", categories=vertebrate_types).cat.codes
0    1
1    1
2   -1
3    3
4    2
5    0
6   -1
dtype: int8

dummes

但是上面这些种类其实和数字是没有关系的, 也就是没有什么大小关系. 那么我们可以将它们转换为 dummes.

However, there is no inherent relationship between these categories so it doesn’t necessary make sense to store these as different numbers on the same scale.

pd.get_dummies(df, columns=['vertebrates'])
vertebrates_Amphibian vertebrates_Bird vertebrates_Fish vertebrates_Mammal vertebrates_Reptile
0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 1 0
3 0 0 1 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 0 0 0 1 0

为了防止 dummy variable trap, 使用 drop_first=True.

pd.get_dummies(df, columns=['vertebrates'], drop_first=True)
vertebrates_Bird vertebrates_Fish vertebrates_Mammal vertebrates_Reptile
0 1 0 0 0
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
4 0 0 0 0
5 0 0 0 1
6 0 0 1 0

Ordinal Categories

还有很多范畴数据是有序的, 比如成绩的优良中差, 满意度等. 使用 ordered = True.

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})

可以直接输出文本, 没有的种类为 NaN.

df.satisfaction.astype("category",
  ordered=True,
  categories=ordered_satisfaction
)
0        NaN
1      Happy
2    Unhappy
3    Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]

也可以使用数字, 那么没有的就是 -1.

df.satisfaction.astype("category",
  ordered=True,
  categories=ordered_satisfaction
).cat.codes
0   -1
1    3
2    1
3    2
dtype: int8

Similar Posts

Comments

你可以请我喝喝茶,聊聊天,鼓励我

Wechat Pay
wechat

Thanks!