Showing results for 
Search instead for 
Did you mean: 

Question about structure of HDB segments

New Contributor
Hi all

Currently I have a HDB which is segments by year, something like

    /2009.01.01        /trade   
    /2009.01.02        /trade   
/2009.01.03 /trade
    …=============/2010    /2010.01.01        /trade  

I've a colleagues told me that it would be better to segments data by exchange, it would be faster

    /2009.01.01        /trade   
    /2009.01.02        /trade   
/2009.01.03 /trade
    …=============/HK    /2009.01.01        /trade  

I can understand it would be faster if there is query to get all data per difference exchange, but other than that use cases, it should be similar to segment by year, is that correct? ( Given that most query we do will not multi-threaded )

Or, somehow data is better to be segment by exchange?


New Contributor III
My discussion below carries two big assumptions:
1) You store the different segments on different physical disks, thus there is a real benefit by virtue of concurrent disk reads from different segments.
2) Your typical use case involves querying data across the whole of a date partition (or a range of date partitions).

With those assumptions, segmenting by a attribute other than date-related ones could be beneficial -- as your typical queries can be processed with the benefit of concurrent physical disk reads, whereas date-related segmentation would mean that such queries can only use a small subset of your physical disks.

It will all depend on query patterns but I think in general the answer would be no. 

If you have the data 
- segmented by exchange
- partitioned by date
- (parted by sym … this is an assumption)

Then the exchange segmentation isn’t really doing anything.  For example, if you just want to run

select from trade where date=X, sym=Y, exch=Z

(note you will still have to have an exch column within the table schema)

then kdb+ will still have to access every segment (and given you don’t use slaves it will be sequential) … so if you have 20+ exchanges for a decent sized database, then that is a lot of extra unnecessary work.  In you current layout it will only access 1 partition (disk access is expensive).  Any cross-exchange aggregations e.g.  "select max price by sym from trade where date=X, sym in Y”, would require you to hit all the segments as well, as opposed to your original format which would only require you to access 1. 

If you did lay out the data by exchange and you have a lot of queries for a single exchange then you could create a function to derive the file path and do queries like

// assuming getpartition[date;exch;table] returns `:/exch/date/table
select from getpartition[X; Z;`trade] where sym=Y

For me there would have to be lots of queries like this to warrant deviating from either having a separate exchange column, or from making the sym a concatenation of instrument+exchange 

Also with exchange segmentation the scaling becomes more painful - if you want to get data from a new exchange you have to create a new segment for it.