Pandas高级教程之:处理缺失数据
Pandas高级教程之:处理缺失数据
简介
在数据处理中,Pandas会将无法解析的数据或者缺失的数据使用NaN来表示。虽然所有的数据都有了相应的表示,但是NaN很明显是无法进行数学运算的。
本文将会讲解Pandas对于NaN数据的处理方法。
NaN的例子
上面讲到了缺失的数据会被表现为NaN,我们来看一个具体的例子:
我们先来构建一个DF:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], ...: columns=['one', 'two', 'three']) ...: In [2]: df['four'] = 'bar'In [3]: df['five'] = df['one'] > 0In [4]: dfOut[4]: one two three four fivea 0.469112 -0.282863 -1.509059 bar Truec -1.135632 1.212112 -0.173215 bar Falsee 0.119209 -1.044236 -0.861849 bar Truef -2.104569 -0.494929 1.071804 bar Falseh 0.721555 -0.706771 -1.039575 bar True
上面DF只有acefh这几个index,我们重新index一下数据:
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])In [6]: df2Out[6]: one two three four fivea 0.469112 -0.282863 -1.509059 bar Trueb NaN NaN NaN NaN NaNc -1.135632 1.212112 -0.173215 bar Falsed NaN NaN NaN NaN NaNe 0.119209 -1.044236 -0.861849 bar Truef -2.104569 -0.494929 1.071804 bar Falseg NaN NaN NaN NaN NaNh 0.721555 -0.706771 -1.039575 bar True
数据缺失,就会产生很多NaN。
为了检测是否NaN,可以使用isna()或者notna() 方法。
In [7]: df2['one']Out[7]: a 0.469112b NaNc -1.135632d NaNe 0.119209f -2.104569g NaNh 0.721555Name: one, dtype: float64In [8]: pd.isna(df2['one'])Out[8]: a Falseb Truec Falsed Truee Falsef Falseg Trueh FalseName: one, dtype: boolIn [9]: df2['four'].notna()Out[9]: a Trueb Falsec Trued Falsee Truef Trueg Falseh TrueName: four, dtype: bool
注意在Python中None是相等的:
In [11]: None == None # noqa: E711Out[11]: True
但是np.nan是不等的:
In [12]: np.nan == np.nanOut[12]: False
整数类型的缺失值
NaN默认是float类型的,如果是整数类型,我们可以强制进行转换:
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())Out[14]: 0 11 22 <NA>3 4dtype: Int64
Datetimes 类型的缺失值
时间类型的缺失值使用NaT来表示:
In [15]: df2 = df.copy()In [16]: df2['timestamp'] = pd.Timestamp('20120101')In [17]: df2Out[17]: one two three four five timestampa 0.469112 -0.282863 -1.509059 bar True 2012-01-01c -1.135632 1.212112 -0.173215 bar False 2012-01-01e 0.119209 -1.044236 -0.861849 bar True 2012-01-01f -2.104569 -0.494929 1.071804 bar False 2012-01-01h 0.721555 -0.706771 -1.039575 bar True 2012-01-01In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nanIn [19]: df2Out[19]: one two three four five timestampa NaN -0.282863 -1.509059 bar True NaTc NaN 1.212112 -0.173215 bar False NaTe 0.119209 -1.044236 -0.861849 bar True 2012-01-01f -2.104569 -0.494929 1.071804 bar False 2012-01-01h NaN -0.706771 -1.039575 bar True NaTIn [20]: df2.dtypes.value_counts()Out[20]: float64 3datetime64[ns] 1bool 1object 1dtype: int64
None 和 np.nan 的转换
对于数字类型的,如果赋值为None,那么会转换为相应的NaN类型:
In [21]: s = pd.Series([1, 2, 3])In [22]: s.loc[0] = NoneIn [23]: sOut[23]: 0 NaN1 2.02 3.0dtype: float64
如果是对象类型,使用None赋值,会保持原样:
In [24]: s = pd.Series(["a", "b", "c"])In [25]: s.loc[0] = NoneIn [26]: s.loc[1] = np.nanIn [27]: sOut[27]: 0 None1 NaN2 cdtype: object
缺失值的计算
缺失值的数学计算还是缺失值:
In [28]: aOut[28]: one twoa NaN -0.282863c NaN 1.212112e 0.119209 -1.044236f -2.104569 -0.494929h -2.104569 -0.706771In [29]: bOut[29]: one two threea NaN -0.282863 -1.509059c NaN 1.212112 -0.173215e 0.119209 -1.044236 -0.861849f -2.104569 -0.494929 1.071804h NaN -0.706771 -1.039575In [30]: a + bOut[30]: one three twoa NaN NaN -0.565727c NaN NaN 2.424224e 0.238417 NaN -2.088472f -4.209138 NaN -0.989859h NaN NaN -1.413542
但是在统计中会将NaN当成0来对待。
In [31]: dfOut[31]: one two threea NaN -0.282863 -1.509059c NaN 1.212112 -0.173215e 0.119209 -1.044236 -0.861849f -2.104569 -0.494929 1.071804h NaN -0.706771 -1.039575In [32]: df['one'].sum()Out[32]: -1.9853605075978744In [33]: df.mean(1)Out[33]: a -0.895961c 0.519449e -0.595625f -0.509232h -0.873173dtype: float64
如果是在cumsum或者cumprod中,默认是会跳过NaN,如果不想统计NaN,可以加上参数skipna=False
In [34]: df.cumsum()Out[34]: one two threea NaN -0.282863 -1.509059c NaN 0.929249 -1.682273e 0.119209 -0.114987 -2.544122f -1.985361 -0.609917 -1.472318h NaN -1.316688 -2.511893In [35]: df.cumsum(skipna=False)Out[35]: one two threea NaN -0.282863 -1.509059c NaN 0.929249 -1.682273e NaN -0.114987 -2.544122f NaN -0.609917 -1.472318h NaN -1.316688 -2.511893
使用fillna填充NaN数据
数据分析中,如果有NaN数据,那么需要对其进行处理,一种处理方法就是使用fillna来进行填充。
下面填充常量:
In [42]: df2Out[42]: one two three four five timestampa NaN -0.282863 -1.509059 bar True NaTc NaN 1.212112 -0.173215 bar False NaTe 0.119209 -1.044236 -0.861849 bar True 2012-01-01f -2.104569 -0.494929 1.071804 bar False 2012-01-01h NaN -0.706771 -1.039575 bar True NaTIn [43]: df2.fillna(0)Out[43]: one two three four five timestampa 0.000000 -0.282863 -1.509059 bar True 0c 0.000000 1.212112 -0.173215 bar False 0e 0.119209 -1.044236 -0.861849 bar True 2012-01-01 00:00:00f -2.104569 -0.494929 1.071804 bar False 2012-01-01 00:00:00h 0.000000 -0.706771 -1.039575 bar True 0
还可以指定填充方法,比如pad:
In [45]: dfOut[45]: one two threea NaN -0.282863 -1.509059c NaN 1.212112 -0.173215e 0.119209 -1.044236 -0.861849f -2.104569 -0.494929 1.071804h NaN -0.706771 -1.039575In [46]: df.fillna(method='pad')Out[46]: one two threea NaN -0.282863 -1.509059c NaN 1.212112 -0.173215e 0.119209 -1.044236 -0.861849f -2.104569 -0.494929 1.071804h -2.104569 -0.706771 -1.039575
可以指定填充的行数:
In [48]: df.fillna(method='pad', limit=1)
fill方法统计:
方法名 | 描述 |
---|---|
pad / ffill | 向前填充 |
bfill / backfill | 向后填充 |
可以使用PandasObject来填充:
In [53]: dffOut[53]: A B C0 0.271860 -0.424972 0.5670201 0.276232 -1.087401 -0.6736902 0.113648 -1.478427 0.5249883 NaN 0.577046 -1.7150024 NaN NaN -1.1578925 -1.344312 NaN NaN6 -0.109050 1.643563 NaN7 0.357021 -0.674600 NaN8 -0.968914 -1.294524 0.4137389 0.276662 -0.472035 -0.013960In [54]: dff.fillna(dff.mean())Out[54]: A B C0 0.271860 -0.424972 0.5670201 0.276232 -1.087401 -0.6736902 0.113648 -1.478427 0.5249883 -0.140857 0.577046 -1.7150024 -0.140857 -0.401419 -1.1578925 -1.344312 -0.401419 -0.2935436 -0.109050 1.643563 -0.2935437 0.357021 -0.674600 -0.2935438 -0.968914 -1.294524 0.4137389 0.276662 -0.472035 -0.013960In [55]: dff.fillna(dff.mean()['B':'C'])Out[55]: A B C0 0.271860 -0.424972 0.5670201 0.276232 -1.087401 -0.6736902 0.113648 -1.478427 0.5249883 NaN 0.577046 -1.7150024 NaN -0.401419 -1.1578925 -1.344312 -0.401419 -0.2935436 -0.109050 1.643563 -0.2935437 0.357021 -0.674600 -0.2935438 -0.968914 -1.294524 0.4137389 0.276662 -0.472035 -0.013960
上面操作等同于:
In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
使用dropna删除包含NA的数据
除了fillna来填充数据之外,还可以使用dropna删除包含na的数据。
In [57]: dfOut[57]: one two threea NaN -0.282863 -1.509059c NaN 1.212112 -0.173215e NaN 0.000000 0.000000f NaN 0.000000 0.000000h NaN -0.706771 -1.039575In [58]: df.dropna(axis=0)Out[58]: Empty DataFrameColumns: [one, two, three]Index: []In [59]: df.dropna(axis=1)Out[59]: two threea -0.282863 -1.509059c 1.212112 -0.173215e 0.000000 0.000000f 0.000000 0.000000h -0.706771 -1.039575In [60]: df['one'].dropna()Out[60]: Series([], Name: one, dtype: float64)
插值interpolation
数据分析时候,为了数据的平稳,我们需要一些插值运算interpolate() ,使用起来很简单:
In [61]: tsOut[61]: 2000-01-31 0.4691122000-02-29 NaN2000-03-31 NaN2000-04-28 NaN2000-05-31 NaN ... 2007-12-31 -6.9502672008-01-31 -7.9044752008-02-29 -6.4417792008-03-31 -8.1849402008-04-30 -9.011531Freq: BM, Length: 100, dtype: float64
In [64]: ts.interpolate()Out[64]: 2000-01-31 0.4691122000-02-29 0.4344692000-03-31 0.3998262000-04-28 0.3651842000-05-31 0.330541 ... 2007-12-31 -6.9502672008-01-31 -7.9044752008-02-29 -6.4417792008-03-31 -8.1849402008-04-30 -9.011531Freq: BM, Length: 100, dtype: float64
插值函数还可以添加参数,指定插值的方法,比如按时间插值:
In [67]: ts2Out[67]: 2000-01-31 0.4691122000-02-29 NaN2002-07-31 -5.7850372005-01-31 NaN2008-04-30 -9.011531dtype: float64In [68]: ts2.interpolate()Out[68]: 2000-01-31 0.4691122000-02-29 -2.6579622002-07-31 -5.7850372005-01-31 -7.3982842008-04-30 -9.011531dtype: float64In [69]: ts2.interpolate(method='time')Out[69]: 2000-01-31 0.4691122000-02-29 0.2702412002-07-31 -5.7850372005-01-31 -7.1908662008-04-30 -9.011531dtype: float64
按index的float value进行插值:
In [70]: serOut[70]: 0.0 0.01.0 NaN10.0 10.0dtype: float64In [71]: ser.interpolate()Out[71]: 0.0 0.01.0 5.010.0 10.0dtype: float64In [72]: ser.interpolate(method='values')Out[72]: 0.0 0.01.0 1.010.0 10.0dtype: float64
除了插值Series,还可以插值DF:
In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8], ....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]}) ....: In [74]: dfOut[74]: A B0 1.0 0.251 2.1 NaN2 NaN NaN3 4.7 4.004 5.6 12.205 6.8 14.40In [75]: df.interpolate()Out[75]: A B0 1.0 0.251 2.1 1.502 3.4 2.753 4.7 4.004 5.6 12.205 6.8 14.40
interpolate还接收limit参数,可以指定插值的个数。
In [95]: ser.interpolate(limit=1)Out[95]: 0 NaN1 NaN2 5.03 7.04 NaN5 NaN6 13.07 13.08 NaNdtype: float64
使用replace替换值
replace可以替换常量,也可以替换list:
In [102]: ser = pd.Series([0., 1., 2., 3., 4.])In [103]: ser.replace(0, 5)Out[103]: 0 5.01 1.02 2.03 3.04 4.0dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])Out[104]: 0 4.01 3.02 2.03 1.04 0.0dtype: float64
可以替换DF中特定的数值:
In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})In [107]: df.replace({'a': 0, 'b': 5}, 100)Out[107]: a b0 100 1001 1 62 2 73 3 84 4 9
可以使用插值替换:
In [108]: ser.replace([1, 2, 3], method='pad')Out[108]: 0 0.01 0.02 0.03 0.04 4.0dtype: float64
来源https://www.cnblogs.com/flydean/p/14925464.html