Showing results for 
Search instead for 
Did you mean: 

Loading data for combinations of instruments and date ranges (rolled series)

New Contributor II

I am trying to come up with an efficient way to load data from disk, for multiple instruments over different date ranges. The actual problem we're trying to solve is loading rolled series for futures, forwards, and other derivatives.

Say we want to load data for the following instruments and date ranges (spec)

inst startDate endDate
A 2022.01.01 2022.03.31
B 2022.04.01 2022.06.30
C 2022.07.01 2022.08.31

One way of doing this, following the usual date-then-sym query pattern would be:

minDate: min exec startDate from spec;
maxDate: max exec endDate from spec;
insts: exec inst from spec;

result: select from trades where date within (minDate; maxDate), sym in insts;

 And then perhaps join result and spec so that we can filter each instrument to its specific date range.

The problem with this is that we end up loading much more data than we care about, for example we are loading data for B from well before 2022.04.01. I imagine KDB is also wasting time scanning for instruments in historical dates that likely have no data at all for that instrument.

For our datasets, we've found the following to be much faster:

loadContract: {[x] select from trades where date within (x[`startDate]; x[`endDate]), sym=x[`inst]};
result: raze loadContract each spec

We now only load data for a given inst for its relevant dates.

Can we do better than this?

I was reading up on segmented databases, and that queries can be parallelised over how partitions are divided over the segments. I am concerned that by looping our queries rather than providing all the information upfront, we may lose out on this and other under-the-hood efficiencies/optimisations.

I had considered a table lookup, e.g.

where ([] date; sym) in ...

but have found in the past these don't leverage the partitioned/parted attributes.

Many thanks.


Contributor III
Contributor III

If you wanted to touch the files as little as possible you could work out ranges and overlaps

//Create a spec ensuring to include some overlaps and gaps to test code better
q)spec:flip `inst`startDate`endDate!(`A`B`C;2022.01.01 2022.02.01 2022.06.01;2022.03.31 2022.04.30 2022.07.31)
inst startDate  endDate
A    2022.01.01 2022.03.31
B    2022.02.01 2022.04.30
C    2022.06.01 2022.07.31
//Explode the ranges into their individual dates
q)ranges:ungroup select inst,date:startDate+til each 1+endDate-startDate from spec 
//Now regroup to gather instruments by date
q)ranges:0!select inst by date from ranges
//Find cases where there are gaps or the instruments change
q)ranges:update dDate:deltas[date],dInst:differ inst from ranges
//Grab the indexes from your ranges table for the beginning and ending of our needed queries
q)rInds:{-1_x,'-1+next x}(exec i from ranges where (dDate>1) or dInst),count ranges
0   30
31  89
90  119
120 180
//Pulling out the indexes you can see each pair of rows forms a query
q)select from ranges where i in raze rInds
date       inst dDate dInst
2022.01.01 ,`A  8036  1
2022.01.31 ,`A  1     0
2022.02.01 `A`B 1     1
2022.03.31 `A`B 1     0
2022.04.01 ,`B  1     1
2022.04.30 ,`B  1     0
2022.06.01 ,`C  32    1
2022.07.31 ,`C  1     0
//Sample table of data
q)trades:([] date:2022.01.01+til 365;sym:365?`A`B`C)
//Build and execute functional selects to query from disk
q)result:raze {?[`trades;((within;`date;x`date);(in;`sym;enlist x[`inst]0));0b;()]} each ranges each rInds
//Visually inspect result to see months and instruments pulled
q)select count i by sym,date.month from result
sym month  | x
-----------| --
A   2022.01| 8
A   2022.02| 11
A   2022.03| 12
B   2022.02| 8
B   2022.03| 9
B   2022.04| 10
C   2022.06| 15
C   2022.07| 11


Thank you for the prompt and thorough response. This is undoubtedly useful for loading various instruments with different ranges.

In our specific example of rolled series there would only be one day of overlap between contracts. I am wondering if there is a way of constructing this as a single-query, rather than looping by date ranges. This would allow KDB to decompose the query over different segments/threads as appropriate.

Perhaps this could be done by exploding the contract/date combinations as you suggest, and looking up into this table. However I'm not sure how to do this without losing the efficiency gained from filtering by date first. Could a join help?

kdb+ is already automatically using multithreaded map-reduce for each of your subqueries without any changes as long as you started your process with -s 

//I'll create a test HDB to test against - you can use real data instead

.z.zd:17 2 6 //Turning compression on
//Write some partitions
{.Q.dd[`:HDB;x,`tab`] set .Q.en[`:HDB] ([] sym:`p#asc n?`3;p:n?100.0;t:n?1000)} each 2022.01.01 + til 365

//Load the HDB
\l HDB

//q was started with -s 4 for 4 secondary threads
//Run our sample query 100 times for each of 0-4 secondary threads available
q)results:([] secondaryThreads:til 1+system"s"; timings:{system"s ",string x;value"\\t:100 select from tab where date within 2022.01.01 2022.03.31,sym=`aaa"}each til 1+system"s")
//Results show that the threads are being used and our query is speeding up with more threads available
secondaryThreads timings
0                12153
1                12225
2                7919
3                6222
4                5767

You can see the benefit of threads by running through some sample queries



If you wanted you could experiment with using the threads per subquery rather within each subquery.

This would only require you change each to peach on your existing code to have threads execute each subquery

result: raze loadContract peach spec

For your query though I expect the existing code is probably the better balance of compute and memory.

Thank you - the syntax for inspecting the thread timings is very useful.

If we had our data segmented over 2+ disks and were not using peach, surely we wouldn't utilise the available I/O throughput?

Whereas if we were to use peach, with the thread number being equal to, or a multiple of, the number of disks which our data is segmented across, then we would hopefully max out our I/O capacity. However this requires us manually ensuring that the date order in which we invoke the sub-function with peach matches the order in which data is distributed across segments.

I believe KDB does this natively* if you have a segmented, partitioned HDB but it seems like we aren't able to utilise that efficiently with the query we're trying to run.