Problem description
We generate 20,000,000 text files every year, averaging approximately 250 KB each (35 KB zipped).
We must keep these files in some kind of archive for 10 years. There is no need to search inside the text files, but we must be able to find a single text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing it in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
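To make the proposed layout concrete, here is a minimal sketch of "indexed metadata columns plus one compressed blob column". It is not SQL Server code: an in-memory SQLite database stands in for SQL Server, a BLOB column stands in for varbinary(MAX), and zlib (DEFLATE) stands in for zipping. The table name `archive` and the helpers `store` and `find` are hypothetical, and only two of the 5-10 metadata fields are shown.

```python
import sqlite3
import zlib

# Assumption: in-memory SQLite is a stand-in for SQL Server,
# and BLOB plays the role of varbinary(MAX).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE archive (
        id           INTEGER PRIMARY KEY,
        productname  TEXT NOT NULL,
        creationdate TEXT NOT NULL,   -- ISO 8601 date string
        zipped_data  BLOB NOT NULL
    )
    """
)
# Index the searchable metadata columns, never the blob itself.
conn.execute("CREATE INDEX ix_archive_product ON archive (productname)")
conn.execute("CREATE INDEX ix_archive_created ON archive (creationdate)")


def store(productname: str, creationdate: str, text: str) -> int:
    """Compress the text and insert it with its metadata; return the new row id."""
    blob = zlib.compress(text.encode("utf-8"))
    cur = conn.execute(
        "INSERT INTO archive (productname, creationdate, zipped_data) "
        "VALUES (?, ?, ?)",
        (productname, creationdate, blob),
    )
    conn.commit()
    return cur.lastrowid


def find(productname: str, creationdate: str) -> list:
    """Look up files by metadata only and return their decompressed text."""
    rows = conn.execute(
        "SELECT zipped_data FROM archive "
        "WHERE productname = ? AND creationdate = ?",
        (productname, creationdate),
    ).fetchall()
    return [zlib.decompress(blob).decode("utf-8") for (blob,) in rows]


store("widget", "2024-01-15", "report body " * 1000)
store("gadget", "2024-01-15", "another report")
found = find("widget", "2024-01-15")
print(len(found), "file(s) found for widget")
```

The point of the sketch is that every query touches only the small metadata columns; the blob is read only for rows that already matched.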
The database will grow huge over the years: 5-10 TB. So I think we need to partition the data, for example by keeping one database per year.
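As a quick sanity check on that estimate, the arithmetic can be worked through from the figures given above. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^9 KB) and counts blob payload only, ignoring indexes, metadata rows and storage overhead. The constant and function names are made up for illustration.

```python
# Back-of-the-envelope archive sizing from the figures in the question.
FILES_PER_YEAR = 20_000_000
ZIPPED_KB = 35    # average zipped size per file
RAW_KB = 250      # average unzipped size per file
YEARS = 10

KB_PER_TB = 1000 ** 3  # assumption: decimal units, 1 TB = 10^9 KB


def total_tb(kb_per_file: int, years: int = YEARS) -> float:
    """Total payload in TB for a per-file size and retention period."""
    return FILES_PER_YEAR * kb_per_file * years / KB_PER_TB


zipped_per_year_tb = total_tb(ZIPPED_KB, years=1)
zipped_total_tb = total_tb(ZIPPED_KB)
raw_total_tb = total_tb(RAW_KB)

print(f"zipped, per year: {zipped_per_year_tb:.1f} TB")
print(f"zipped, 10 years: {zipped_total_tb:.1f} TB")
print(f"raw, 10 years:    {raw_total_tb:.1f} TB")
```

Under these assumptions the zipped payload comes to about 0.7 TB per year and 7 TB over the full retention period, which sits inside the 5-10 TB range, while storing the files unzipped would be closer to 50 TB.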
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is better suited to blobs larger than 1 MB?
Any other suggestions on how to manage such data volumes?
Recommended answer
FILESTREAM is definitely better suited to larger blobs (750 KB-1 MB), because for small files the overhead of opening the external file begins to hurt read and write performance compared with varbinary(max) blob storage. If that is not much of an issue (i.e. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable), then it's definitely an option.
I would probably suggest keeping the files directly in a varbinary(max) column if you can guarantee they won't get much larger, but have the blob data stored in a separate filegroup using the TEXTIMAGE_ON option, which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so that the actual storage of blobs can be split over multiple filegroups, either using partitions or via some multiple-table scheme, so you can scale to different disks if necessary in the future.
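One way to picture the "multiple table scheme" is a single metadata table plus one blob store per year, with the year deciding where each blob goes. The sketch below is only an illustration under loud assumptions: it uses SQLite rather than SQL Server, and a separately attached in-memory database per year stands in for a separate filegroup or disk. The names `metadata`, `files`, `blob_store`, `store` and `load` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    "id INTEGER PRIMARY KEY, productname TEXT NOT NULL, year INTEGER NOT NULL)"
)
attached = set()  # schema names of the per-year stores attached so far


def blob_store(year: int) -> str:
    """Return the schema name for a year, attaching and creating it on first use."""
    name = f"blobs_{int(year)}"  # built from an int, so safe to interpolate
    if name not in attached:
        # Assumption: one attached database per year mimics one filegroup per year.
        conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")
        conn.execute(
            f"CREATE TABLE {name}.files (id INTEGER PRIMARY KEY, data BLOB NOT NULL)"
        )
        attached.add(name)
    return name


def store(productname: str, year: int, data: bytes) -> int:
    """Insert the metadata row, then put the blob in that year's store."""
    schema = blob_store(year)  # attach first: ATTACH cannot run inside a transaction
    cur = conn.execute(
        "INSERT INTO metadata (productname, year) VALUES (?, ?)",
        (productname, year),
    )
    row_id = cur.lastrowid
    conn.execute(
        f"INSERT INTO {schema}.files (id, data) VALUES (?, ?)", (row_id, data)
    )
    conn.commit()
    return row_id


def load(row_id: int) -> bytes:
    """Find the year from the metadata, then read the blob from that year's store."""
    (year,) = conn.execute(
        "SELECT year FROM metadata WHERE id = ?", (row_id,)
    ).fetchone()
    (data,) = conn.execute(
        f"SELECT data FROM blobs_{int(year)}.files WHERE id = ?", (row_id,)
    ).fetchone()
    return data


id_2023 = store("widget", 2023, b"zipped bytes from 2023")
id_2024 = store("widget", 2024, b"zipped bytes from 2024")
print(sorted(attached))
```

Because searches only ever touch the metadata table, older per-year stores can be moved to slower or cheaper storage without changing how lookups work.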
Keeping the blobs directly associated with the SQL metadata, either via FILESTREAM or direct varbinary(max) storage, has many advantages over dealing with filesystem / SQL inconsistencies, including but not limited to ease of backup and other management operations.