pbootcms网站模板|日韩1区2区|织梦模板||网站源码|日韩1区2区|jquery建站特效-html5模板网

pandas 和 numpy 的意思不同

mean from pandas and numpy differ(pandas 和 numpy 的意思不同)
本文介紹了pandas 和 numpy 的意思不同的處理方法,對大家解決問題具有一定的參考價值,需要的朋友們下面隨著小編來一起學習吧!

問題描述

我有一個 MEMS IMU,我一直在其上收集數據,我正在使用 pandas 從中獲取一些統計數據.每個周期收集 6 個 32 位浮點數.對于給定的收集運行,數據速率是固定的.數據速率在 100Hz 和 1000Hz 之間變化,采集時間長達 72 小時.數據保存在一個平面二進制文件中.我是這樣讀取數據的:

I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:

import numpy as np
import pandas as pd
dataType=np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('e','<f4')])
df=pd.DataFrame(np.fromfile('FILENAME',dataType))
df['c'].mean()
-9.880581855773926
x=df['c'].values
x.mean()
-9.8332081

-9.833 是正確的結果.我可以創建一個類似的結果,有人應該能夠以這種方式重復:

-9.833 is the correct result. I can create a similar result that someone should be able to repeat this way:

import numpy as np
import pandas as pd
x=np.random.normal(-9.8,.05,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-9.859579086303711
x.mean()
-9.8000648778888628

我在 linux 和 windows、AMD 和 Intel 處理器、Python 2.7 和 3.5 上重復了這一點.我難住了.我究竟做錯了什么?得到這個:

I've repeated this on linux and windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:

x=np.random.normal(-9.,.005,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-8.999998092651367
x.mean()
-9.0000075889406528

我可以接受這種差異.它處于 32 位浮點數精度的極限.

I could accept this difference. It's at the limit of the precision of 32 bit floats.

沒關系.我在星期五寫了這篇文章,今天早上我遇到了解決方案.這是由于大量數據而加劇的浮點精度問題.我需要以這種方式在創建數據幀時將數據轉換為 64 位浮點數:

NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:

df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')

如果其他人遇到類似問題,我會離開帖子.

I'll leave the post should anyone else run into a similar issue.

推薦答案

短版:

之所以不同,是因為 pandas 在調用 mean 操作時使用了 bottleneck(如果已安裝),而不是僅僅依賴numpy.bottleneck 可能被使用,因為它似乎比 numpy 更快(至少在我的機器上),但以精度為代價.它們恰好匹配 64 位版本,但在 32 位土地上有所不同(這是有趣的部分).

Short version:

The reason it's different is because pandas uses bottleneck (if it's installed) when calling the mean operation, as opposed to just relying on numpy. bottleneck is presumably used since it appears to be faster than numpy (at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).

僅通過檢查這些模塊的源代碼很難判斷發生了什么(它們非常復雜,即使對于像 mean 這樣的簡單計算,數值計算也很困難).最好使用調試器來避免大腦編譯和那些類型的錯誤.調試器不會在邏輯上出錯,它會準確地告訴你發生了什么.

It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean, turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic, it'll tell you exactly what's going on.

這是我的一些堆棧跟蹤(值略有不同,因為沒有 RNG 種子):

Here's some of my stack trace (values differ slightly since no seed for RNG):

可以重現(Windows):

>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029

numpy 的版本沒有什么特別之處.pandas 版本有點古怪.

Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.

我們來看看df['x'].mean():

>>> def test_it_2():
...   import pdb; pdb.set_trace()
...   df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308            if we have an ndarray as a value, then simply perform the operation,
2309            otherwise delegate to the object
2310
2311            """
2312 ->         delegate = self._values
2313            if isinstance(delegate, np.ndarray):
2314                # Validate that 'axis' is consistent with Series's single axis.
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
2318                                              'numeric_only.'.format(name))
2319                with np.errstate(all='ignore'):
2320 ->                 return op(delegate, skipna=skipna, **kwds)
2321
2322            return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323                                    numeric_only=numeric_only,
2324                                    filter_type=filter_type, **kwds)

所以我們找到了問題所在,但現在事情變得有點奇怪:

So we found the trouble spot, but now things get kind of weird:

(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029

注意 delegate.mean()np.nanmean 輸出 -9.0000029 類型為 float32not -9.0 不像 pandas nanmean 那樣.稍微摸索一下,您可以在 pandas.core.nanops 中找到 pandas nanmean 的源代碼.有趣的是,它實際上看起來應該首先匹配numpy.我們來看看pandas nanmean:

Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas nanmean does. With a bit of poking around, you can find the source to pandas nanmean in pandas.core.nanops. Interestingly, it actually appears like it should be matching numpy at first. Let's have a look at pandas nanmean:

(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("
")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)

這是 bottleneck_switch 裝飾器的(短)版本:

Here's a (short) version of the bottleneck_switch decorator:

import bottleneck as bn
...
class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None
    ...

                if (_USE_BOTTLENECK and skipna and
                        _bn_ok_dtype(values.dtype, bn_name)):
                    result = bn_func(values, axis=axis, **kwds)

這是用 alt 作為 pandas nanmean 函數調用的,所以 bn_name'nanmean',這是從 bottleneck 模塊中獲取的屬性:

This is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:

(Pdb) l
 93                             result = np.empty(result_shape)
 94                             result.fill(0)
 95                             return result
 96
 97                     if (_USE_BOTTLENECK and skipna and
 98  ->                         _bn_ok_dtype(values.dtype, bn_name)):
 99                         result = bn_func(values, axis=axis, **kwds)
100
101                         # prefer to treat inf/-inf as NA, but must compute the fun
102                         # twice :(
103                         if _has_infs(result):
(Pdb) n
> d:anaconda3libsite-packagespandascore
anops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0

假設 bottleneck_switch() 裝飾器一秒鐘不存在.我們實際上可以看到,手動單步執行這個函數(沒有 bottleneck)調用會得到與 numpy 相同的結果:

Pretend that bottleneck_switch() decorator doesn't exist for a second. We can actually see that calling that manually stepping through this function (without bottleneck) will get you the same result as numpy:

(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029

但是,如果您安裝了 bottleneck,則永遠不會調用它.取而代之的是,bottleneck_switch() 裝飾器使用 bottleneck 的版本對 nanmean 函數進行爆破.這就是差異所在(有趣的是,它匹配 float64 案例):

That never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator instead blasts over the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly it matches on the float64 case, though):

(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807

bottleneck 據我所知,僅用于速度.我假設他們正在使用他們的 nanmean 函數采取某種快捷方式,但我沒有深入研究它(有關此主題的詳細信息,請參閱@ead 的答案).你可以看到它通常比 numpy 通過他們的基準測試快一點:https://github.com/kwgoodman/bottleneck.顯然,要為這種速度付出的代價就是精度.

bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.

瓶頸真的更快嗎?

確實看起來像(至少在我的機器上).

Sure looks like it (at least on my machine).

In [1]: import numpy as np; import pandas as pd

In [2]: x=np.random.normal(-9.8,.05,size=900000)

In [3]: y_32 = x.astype(np.float32)

In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop

In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 μs per loop

pandas 在這里引入一個標志可能會很好(一個用于速度,另一個用于更好的精度,默認是速度,因為這是當前的實現).有些用戶更關心計算的準確性,而不是計算速度.

It might be nice for pandas to introduce a flag here (one for speed, the other for better precision, default is for speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.

HTH.

這篇關于pandas 和 numpy 的意思不同的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!

【網站聲明】本站部分內容來源于互聯網,旨在幫助大家更快的解決問題,如果有圖片或者內容侵犯了您的權益,請聯系我們刪除處理,感謝您的支持!

相關文檔推薦

Python 3 Float Decimal Points/Precision(Python 3 浮點小數點/精度)
Converting Float to Dollars and Cents(將浮點數轉換為美元和美分)
What are some possible calculations with numpy or scipy that can return a NaN?(numpy 或 scipy 有哪些可能的計算可以返回 NaN?)
Python float to ratio(Python浮動比率)
How to manage division of huge numbers in Python?(如何在 Python 中管理大量數字的除法?)
Function to determine if two numbers are nearly equal when rounded to n significant decimal digits(用于確定兩個數字在舍入到 n 個有效十進制數字時是否幾乎相等的函數)
主站蜘蛛池模板: 涡街流量计_LUGB智能管道式高温防爆蒸汽温压补偿计量表-江苏凯铭仪表有限公司 | 电磁流量计厂家_涡街流量计厂家_热式气体流量计-青天伟业仪器仪表有限公司 | 背压阀|减压器|不锈钢减压器|减压阀|卫生级背压阀|单向阀|背压阀厂家-上海沃原自控阀门有限公司 本安接线盒-本安电路用接线盒-本安分线盒-矿用电话接线盒-JHH生产厂家-宁波龙亿电子科技有限公司 | 众品地板网-地板品牌招商_地板装修设计_地板门户的首选网络媒体。 | 首页_中夏易经起名网| 托盘租赁_塑料托盘租赁_托盘出租_栈板出租_青岛托盘租赁-优胜必达 | 尊享蟹太太美味,大闸蟹礼卡|礼券|礼盒在线预订-蟹太太官网 | 广州冷却塔维修厂家_冷却塔修理_凉水塔风机电机填料抢修-广东康明节能空调有限公司 | 辐射色度计-字符亮度测试-反射式膜厚仪-苏州瑞格谱光电科技有限公司 | 好物生环保网、环保论坛 - 环保人的学习交流平台 | 石英砂矿石色选机_履带辣椒色选机_X光异物检测机-合肥幼狮光电科技 | 商标转让-商标注册-商标查询-软著专利服务平台 - 赣江万网 | 水平筛厂家-三轴椭圆水平振动筛-泥沙震动筛设备_山东奥凯诺矿机 包装设计公司,产品包装设计|包装制作,包装盒定制厂家-汇包装【官方网站】 | 电销卡_稳定企业大语音卡-归属地可选-世纪通信 | 物和码官网,物和码,免费一物一码数字化营销SaaS平台 | 合肥钣金加工-安徽激光切割加工-机箱机柜加工厂家-合肥通快 | 订做不锈钢_不锈钢定做加工厂_不锈钢非标定制-重庆侨峰金属加工厂 | 电子海图系统-电梯检验系统-智慧供热系统开发-商品房预售资金监管系统 | 执业药师报名时间,报考条件,考试时间-首页入口 | 实验室隔膜泵-无油防腐蚀隔膜泵-耐腐蚀隔膜真空泵-杭州景程仪器 电杆荷载挠度测试仪-电杆荷载位移-管桩测试仪-北京绿野创能机电设备有限公司 | 活性氧化铝|无烟煤滤料|活性氧化铝厂家|锰砂滤料厂家-河南新泰净水材料有限公司 | 吹塑加工_大型吹塑加工_滚塑代加工-莱力奇吹塑加工有限公司 | 食药成分检测_调料配方还原_洗涤剂化学成分分析_饲料_百检信息科技有限公司 | 技德应用| 葡萄酒灌装机-食用油灌装机-液体肥灌装设备厂家_青州惠联灌装机械 | 南方珠江-南方一线电缆-南方珠江科技电缆-南方珠江科技有限公司 南汇8424西瓜_南汇玉菇甜瓜-南汇水蜜桃价格 | 水篦子|雨篦子|镀锌格栅雨水篦子|不锈钢排水篦子|地下车库水箅子—安平县云航丝网制品厂 | 钢衬四氟管道_钢衬四氟直管_聚四氟乙烯衬里管件_聚四氟乙烯衬里管道-沧州汇霖管道科技有限公司 | 北京森语科技有限公司-模型制作专家-展览展示-沙盘模型设计制作-多媒体模型软硬件开发-三维地理信息交互沙盘 | 南京PVC快速门厂家南京快速卷帘门_南京pvc快速门_世界500强企业国内供应商_南京美高门业 | 武汉高低温试验机-现货恒温恒湿试验箱-高低温湿热交变箱价格-湖北高天试验设备 | 必胜高考网_全国高考备考和志愿填报信息平台 | 房车价格_依维柯/大通/东风御风/福特全顺/江铃图片_云梯搬家车厂家-程力专用汽车股份有限公司 | 风信子发稿-专注为企业提供全球新闻稿发布服务 | 威客电竞(vk·game)·电子竞技赛事官网 | 众品地板网-地板品牌招商_地板装修设计_地板门户的首选网络媒体。 | MOOG伺服阀维修,ATOS比例流量阀维修,伺服阀维修-上海纽顿液压设备有限公司 | 塑料瓶罐_食品塑料瓶_保健品塑料瓶_调味品塑料瓶–东莞市富慷塑料制品有限公司 | 实验室pH计|电导率仪|溶解氧测定仪|离子浓度计|多参数水质分析仪|pH电极-上海般特仪器有限公司 | 全自动面膜机_面膜折叠机价格_面膜灌装机定制_高速折棉机厂家-深圳市益豪科技有限公司 | 台式低速离心机-脱泡离心机-菌种摇床-常州市万丰仪器制造有限公司 |