The advantage of random segmentation

New Contributor III

Hey Everyone,

Keen for thoughts on this.

I've been setting things up so that my HDBs are segmented. I have a few different projects, and I have one parent drive which hosts the home folders for the HDBs, plus four extra drives with segmented tables spread across them (no RAID, daily backups). I do this because I have more data than I can hold in memory, so any gains from parallel reads/writes are worth it.

Thing is, whenever I look at any examples, it always looks like data is assigned to segments by grouping the data in some non-arbitrary way, based on a feature of that data.

I think in reality this would almost always be the wrong approach (boom).

My thinking is that to get a gain from segmentation, you need to ensure that your data is being accessed in parallel. So for any given dataset, you would prefer it to be randomly distributed across segments. Segmenting based on an attribute of the data would increase 'clumpiness', so the chance of one or more segments being accessed more than the others goes up, and the number of parallel I/Os would, I reckon, decrease. In practice, I pull my segment list from par.txt, then randomly assign a segment label to each row of my data, then filter on that and write to each segment. Out of something like neat-freakishness, I also add the segment label as a column in the data, but frankly I don't think I'll ever use it (maybe some edge case when testing access speeds or something).
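A minimal sketch of that random-assignment step in q, for one day's table. The paths, table name, and columns are made up for illustration; your par.txt and schema will differ:

```q
/ read the segment directories listed in par.txt
segs:read0 `:/db/par.txt                       / e.g. ("/seg0";"/seg1";...)

/ example table to be written down for one date
t:([]sym:1000?`3;px:1000?100f)

/ randomly assign a segment index to each row
t:update seg:(count t)?count segs from t

/ filter on the label and splay each subset under its own segment
writeSeg:{[d;s]
  dir:hsym `$segs[s],"/",string[d],"/trade/";   / trailing / => splay
  dir set .Q.en[`:/db] delete seg from select from t where seg=s}
writeSeg[.z.d] each til count segs;
```

Dropping the `seg` column before writing (as here) is the usual choice; keeping it, as described above, only costs one extra on-disk column.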

Anyway, am I missing something?



Contributor III

It depends on the data/access-patterns/hardware. May make sense in your use case but not in others.


If you had a DB with a very large number of unique values in the `p#` column, then it can make sense to group them together, which makes the attribute more efficient and has the disk doing more sequential reads, which are faster.
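For instance, a small in-memory illustration of why grouping helps the parted attribute (hypothetical table; on disk the same idea means each sym's rows sit in one contiguous block):

```q
/ hypothetical table: sort by sym so like values are adjacent,
/ which is required before the parted attribute can be applied
t:([]sym:`b`a`c`a`b;px:5?100f)
t:`sym xasc t                  / group like values together
t:update `p#sym from t         / parted attribute: constant-time sym lookup
/ a query such as  select from t where sym=`a
/ now touches one contiguous run of rows rather than scattered ones
```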


There are some notes on estimating attribute overheads on:


In some IoT schemas, where there can be millions of unique identifiers, even these approaches would not be enough, and introducing a hash column/lookup can be needed to avoid the DB being slowed by large attributes and small random reads.
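A rough sketch of that hash-bucket idea in q. The table, hash function, and bucket count are all illustrative assumptions; the point is that the attribute ends up on a low-cardinality derived column instead of on millions of distinct ids:

```q
nb:16                                    / arbitrary bucket count
t:([]id:100000?`8;val:100000?1f)         / many unique identifiers

/ cheap illustrative hash: sum of character codes, modulo bucket count
t:update bucket:(sum each `int$string id) mod nb from t

/ sort and part on the small bucket column, not on id itself
t:`bucket xasc t
t:update `p#bucket from t

/ queries then constrain bucket first, then id within the bucket:
/ select from t where bucket=(sum `int$string `someid) mod nb, id=`someid
```

The attribute overhead now scales with the number of buckets rather than the number of ids, at the cost of recomputing the hash in every query.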