In the Geo SCADA Expert help on recommended server configurations, the advice is to calculate the disk space you require based on the size of the records, and then multiply by a factor of two. This page provides further detail on why that advice is given.
NTFS clusters, and the inefficiency of storing data smaller than the cluster size, mean that more disk space is used than the raw numbers alone would suggest. Applying a factor is a simple way to allow for that inefficiency.
By default the NTFS cluster size is 4KB; the cluster is the unit in which the file system allocates storage. Only one file is allowed per cluster, and that cluster is wholly allocated to that file. This means that if you write a single 32 byte record (the size of a single historic record) it actually uses 4KB of disk space.
You can test this on your machine by creating a txt file, making the content just "A" and saving it. Go to the file properties in Explorer and it will show you both the size and the size on disk. The size on disk will be your cluster size, and as you grow the file to many MB, the size on disk will always be a multiple of your cluster size. Save "AAAAAA" to the file and the file size increases, but the size on disk, and therefore the actual disk space used, is unchanged: the file keeps using the allocated cluster until its size exceeds that cluster, at which point a new cluster is allocated.
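The rounding behaviour above can be sketched in a few lines. This is a minimal model, not a query of the real file system: it assumes the default 4KB cluster and simply rounds a file's logical size up to whole clusters, which is what "size on disk" reports for an ordinary (non-resident, non-compressed) NTFS file.

```python
import math

def size_on_disk(file_size_bytes, cluster_size_bytes=4096):
    """Round a file's logical size up to whole clusters.

    Any file with content occupies at least one full cluster,
    so a 1-byte file still consumes 4KB of disk space.
    """
    if file_size_bytes == 0:
        return 0
    clusters = math.ceil(file_size_bytes / cluster_size_bytes)
    return clusters * cluster_size_bytes

# The "A" file and the "AAAAAA" file both occupy one cluster:
print(size_on_disk(1))     # 4096
print(size_on_disk(6))     # 4096
# Only once the logical size exceeds the cluster is a second one allocated:
print(size_on_disk(4097))  # 8192
```

On a real Windows machine you can confirm the actual cluster size with `fsutil fsinfo ntfsinfo C:` (the "Bytes Per Cluster" value).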
This leads into the second question: why double it (with all of the historians taken into account)? Calculating actual disk usage versus raw data size, based on 32 bytes per record and whatever cluster size you use, isn't straightforward. Given that you rarely have a single record per week, and more likely have a lot more, causing multiple clusters to be used per file, it does seem to work itself out (in practice it seemed to be about 1.5 times the calculated size, but it does come down to how much data per minute you store; the guide of 4 values per minute per point isn't due to this behaviour, that limit is due to something else). It's a guide by the product team, and each customer will be unique.
So, with the default cluster size of 4KB and using historic values (32 bytes per record):
1 record per week = disk used is 4KB for 32 bytes of data, so you have a poor efficiency of 128 (i.e. 128 times the disk usage you calculate rather than "double it")
2 records per week = disk used is 4KB, data storage needed is 64 bytes, efficiency is 64
32 records per week = disk used is 4KB, data storage needed is 1KB, efficiency is 4
64 records per week = disk used is 4KB, data storage needed is 2KB, efficiency is 2 (i.e. the magic double it!)
128 records per week = disk used is 4KB, data storage needed is 4KB, efficiency is 1
129 records per week = disk used is 8KB, data storage needed is 4,128 bytes, efficiency is 1.98
256 records per week = disk used is 8KB, data storage needed is 8KB, efficiency is 1
257 records per week = disk used is 12KB, data storage needed is 8,224 bytes, efficiency is 1.49
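The table above can be reproduced with the same ceiling arithmetic. This is a sketch under the stated assumptions (32 bytes per historic record, 4KB clusters, one file per point per week); the names are illustrative, not from the product.

```python
import math

RECORD_BYTES = 32      # size of one historic record
CLUSTER_BYTES = 4096   # default NTFS cluster size

def overhead_factor(records_per_week):
    """Ratio of actual disk usage (whole clusters) to raw data size."""
    data_bytes = records_per_week * RECORD_BYTES
    disk_bytes = math.ceil(data_bytes / CLUSTER_BYTES) * CLUSTER_BYTES
    return disk_bytes / data_bytes

for n in (1, 2, 32, 64, 128, 129, 256, 257):
    print(f"{n:>3} records/week -> factor {overhead_factor(n):.2f}")
```

Running this prints factors of 128.00, 64.00, 4.00, 2.00, 1.00, 1.98, 1.00 and 1.49, matching the list above, with 64 records per week being the break-even point for the "double it" guidance.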
So you can see that with more data, as the number of clusters allocated to the file increases, the efficiency (disk usage based on the raw numbers versus actual disk usage by NTFS) actually improves on each roll-over. The answer is not simply to reduce the cluster size; that causes performance issues of its own and has to be balanced.
Regarding file fragmentation: the use of clusters, combined with the way data is streamed into a Geo SCADA system, could compound it. It makes little difference on SSDs, as their random read performance is on par with their sequential performance. On an HDD, however, because a cluster is reserved for a file, that cluster is slowly filled with data for that point for that week. Once it is full a new cluster is allocated on disk, and it is extremely likely that the new cluster won't sit next to the original one, since it takes time to fill a cluster on any system (especially at 32 bytes per historic record). Defragmentation takes those scattered clusters and puts them together on disk, significantly increasing sequential read performance, but it makes no difference to the cluster storage efficiency.