Reading a huge SAS dataset in Python
This post discusses how to read a huge SAS dataset in Python quickly; the answer below may be a useful reference for anyone facing the same problem.

Problem description

I have a 50 GB SAS dataset. I want to read it into a pandas DataFrame. What is the best way to read the SAS dataset quickly?

I used the code below, which is way too slow:

import pandas as pd
df = pd.read_sas("xxxx.sas7bdat", chunksize = 10000000)
dfs = []
for chunk in df:
    dfs.append(chunk)
df_final = pd.concat(dfs)

Is there any faster way to read a large dataset in Python? Can this process be run in parallel?

Recommended answer

I know it's a very late response, but I think my answer will be useful for future readers. A few months back, when I had to read and process SAS data in either SAS7BDAT or xpt format, I looked into the different libraries and packages available for reading these datasets and shortlisted the following:

  1. pandas (high on the list due to community support and performance)
  2. SAS7BDAT (able to read SAS7BDAT files only; last released July 2019)
  3. pyreadstat (promising performance according to the documentation, plus the ability to read metadata)

Before picking a package, I did some performance benchmarking. Although I don't have the benchmark results at hand at the time of posting this answer, I found pyreadstat to be faster than pandas (it seems to use multiprocessing while reading the data, as mentioned in the documentation, but I'm not entirely sure), and the memory consumption and footprint were much smaller with pyreadstat than with pandas. It can also read the metadata, and even allows reading only the metadata, so I finally ended up picking pyreadstat.

The data read with pyreadstat already comes back as a dataframe, so no manual conversion to a pandas dataframe is needed.
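
For reference, here is a minimal sketch of a plain pyreadstat read. The metadataonly flag and the read_file_multiprocessing helper are taken from the pyreadstat documentation as I understand it, and foo.sas7bdat is just a placeholder file name, so treat this as an illustration rather than a drop-in script:

import pyreadstat

# full read: returns a pandas dataframe plus a metadata object
df, meta = pyreadstat.read_sas7bdat('foo.sas7bdat')
print(meta.number_rows, meta.number_columns, meta.column_names)

# metadata only: no rows are loaded, so this is nearly instant even for huge files
_, meta_only = pyreadstat.read_sas7bdat('foo.sas7bdat', metadataonly=True)

# multiprocessing read, which splits the rows across worker processes
df_mp, meta_mp = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_sas7bdat, 'foo.sas7bdat', num_processes=4)

The multiprocessing variant is also the closest thing I know of to the "can this run in parallel?" part of the question, though how much it helps depends on the file and the machine.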

Talking about reading large SAS data, pyreadstat has row_limit and row_offset parameters which can be used to read the file in chunks, so memory is not going to be a bottleneck. Furthermore, while reading the SAS data in chunks, you can convert each chunk to categorical and append it to the resulting data before reading the next chunk; this compresses the data size, so memory consumption is extremely low (it depends on the data: the fewer unique values in the dataframe, the lower the memory usage). The following code snippet might be useful for someone who wants to read large SAS data:

import pandas as pd
import pyreadstat

filename = 'foo.SAS7BDAT'
CHUNKSIZE = 50000
offset = 0

# read the first chunk and convert it to categorical to keep the memory footprint small
allChunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
allChunk = allChunk.astype('category')

while True:
    offset += CHUNKSIZE
    # for xpt data, use pyreadstat.read_xpt()
    chunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
    if chunk.empty:
        break  # an empty chunk means the entire file has been read
    chunk = chunk.astype('category')  # union_categoricals below needs categorical inputs

    for eachCol in chunk:  # align the categories of each column across the chunks
        colUnion = pd.api.types.union_categoricals([allChunk[eachCol], chunk[eachCol]])
        allChunk[eachCol] = pd.Categorical(allChunk[eachCol], categories=colUnion.categories)
        chunk[eachCol] = pd.Categorical(chunk[eachCol], categories=colUnion.categories)

    allChunk = pd.concat([allChunk, chunk])  # append each chunk to the resulting dataframe

PS: Please note that the resulting dataframe allChunk will have all columns as Categorical data.
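
If downstream code can't work with categoricals, a small follow-up sketch (my own addition, not part of the original answer) is to cast each categorical column back to the dtype of its categories once the loop has finished; note that this gives back the memory savings:

# cast every categorical column back to the dtype of its categories (object, float64, ...)
for col in allChunk.select_dtypes('category').columns:
    allChunk[col] = allChunk[col].astype(allChunk[col].cat.categories.dtype)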

Here are some benchmarks (time to read the file into a dataframe) performed on real CDISC data (raw and standardized). The file sizes range from a few KB to a few MB, and both xpt and sas7bdat file formats are included:

Reading ADAE.xpt 49.06 KB for 100 loops:
    Pandas Average time : 0.02232 seconds
    Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
    Pandas Average time : 0.01610 seconds
    Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
    Pandas Average time : 0.03248 seconds
    Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
    Pandas Average time : 5.30275 seconds
    Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
    Pandas Average time : 15.53942 seconds
    Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04017 seconds
    Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01052 seconds
    Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04446 seconds
    Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01273 seconds
    Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
    Pandas Average time : 7.93266 seconds
    Pyreadstat Average time : 0.92295 seconds
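
For readers who want to run a similar comparison on their own files, a rough timing sketch along the lines below could be used; the file name and loop count are placeholders, not the exact setup used for the numbers above:

import timeit
import pandas as pd
import pyreadstat

filename = 'beck.sas7bdat'  # placeholder; point this at your own file
loops = 50

pandas_avg = timeit.timeit(lambda: pd.read_sas(filename), number=loops) / loops
pyreadstat_avg = timeit.timeit(lambda: pyreadstat.read_sas7bdat(filename), number=loops) / loops

print(f"Pandas Average time : {pandas_avg:.5f} seconds")
print(f"Pyreadstat Average time : {pyreadstat_avg:.5f} seconds")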

As you can see, for xpt files the read times are not better, but for sas7bdat files pyreadstat simply outperforms pandas.

The benchmarks above were run with pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.

