Finding median from partitions using map reduce

New Contributor

I have a date-partitioned HDB that stores a large number of trades, e.g.:
trade:([] date; sym; time; side; size; price)

While the machine only has enough memory to hold 1 or 2 days of data, we want to find the median trade size across the past 30 days.
i.e. select med size by sym from trade where date > .z.D - 30

Is there a way to compute the median by mapping the raw trade data to daily summaries, then reducing those summaries to a 30-day median?


New Contributor
Perhaps create a frequency distribution of the sizes (a count per distinct size) and update it accordingly.


New Contributor
Hi Kei,

This is a nice question; I used to use streaming medians as an interview topic. There are many options depending on how accurate you need to be, with different memory/accuracy trade-offs.

At the simplest level, you could just store the median each day, and then calculate the median of medians. This is what kdb+ used to do to calculate medians over partitions prior to v3.0. It is very simple, but it sacrifices a bit of accuracy.
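For illustration, a minimal sketch of the median-of-medians idea, written in Python for clarity (the daily numbers are made up):

```python
import statistics

# Hypothetical "map" step output: one stored median per day for a sym.
daily_medians = [100.0, 150.0, 120.0, 130.0, 90.0]

# "Reduce" step: approximate the window median as the median of the
# daily medians. Tiny and fast, but only approximate -- days with
# more trades get no extra weight.
approx_median = statistics.median(daily_medians)
```

Note that because every day counts equally here, a thin day can pull the estimate as much as a heavy day, which is exactly where the accuracy is lost.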

To be completely accurate we don't need to store every trade, but at the very least we need to store the number of trades at each distinct trade size, i.e. to bin the data. Trade sizes are discrete, so this should cut things down significantly. The median can then be reconstructed by summing the bin counts over the 30-day window and finding where the CDF crosses 0.5.
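A sketch of that reconstruction in Python (the per-day counts are invented; for simplicity this takes the lower median when the total count is even):

```python
from collections import Counter

# Hypothetical "map" step: per-day bin tables, trade size -> count.
day1 = Counter({100: 3, 200: 1})
day2 = Counter({100: 1, 200: 2})

# "Reduce" step: sum the daily counts over the window, then walk the
# CDF and take the first size at which it reaches 0.5.
total = day1 + day2
n = sum(total.values())          # total trades in the window
cum = 0
for size in sorted(total):
    cum += total[size]
    if cum >= n / 2:             # CDF crosses 0.5 here
        median = size
        break
```

This is exact (up to the even-count convention) while storing only one count per distinct size per day.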

Trade sizes start at zero, but the size of our bins table could be bulked up by large but rare trades which are unlikely to have an effect on the median. To combat this you could clip the trade sizes at, say, the 90th percentile each day (but still count those trades at the clipped size), which would save some space. It probably wouldn't hurt the accuracy, but it could introduce some bias, especially as you transition between trading environments, e.g. from low to high volatility.
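A small sketch of the clipping step, again in Python with made-up numbers (the percentile calculation is deliberately crude):

```python
# One day's trade sizes, with a rare huge trade in the tail.
sizes = [100, 100, 200, 200, 300, 300, 400, 400, 500, 99999]

# Crude 90th percentile by rank; a real implementation would
# interpolate or use a library routine.
sizes_sorted = sorted(sizes)
clip = sizes_sorted[int(0.9 * (len(sizes_sorted) - 1))]

# Keep every trade, but count the outliers at the clip value,
# so the bins table stays small.
clipped = [min(s, clip) for s in sizes]
```

The outlier still contributes to the counts, just under a smaller bin, which is why the median is usually unaffected.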

To combat this bias, an alternative would be to use batch reservoir sampling to decide which trades to keep each day. This is more advanced, but it is likely to give a very accurate answer without any bias.
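A sketch of the classic reservoir sampling routine (Algorithm R) in Python; the function name and parameters are my own:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream
    without ever storing the whole stream."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)   # item i survives with prob k/(i+1)
            if j < k:
                sample[j] = x
    return sample
```

The median of the reservoir then estimates the true median, and because every trade had an equal chance of being kept, the estimate is unbiased.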

If space is at an absolute premium, and you would only like to store a few values per sym per day, then you could look at a technique called frugal streaming, but I don't think it can be vectorised.
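For the curious, a simplified Frugal-1U median sketch in Python (the full algorithm has randomised variants; this deterministic unit-step version is the simplest form, and the function name is mine):

```python
def frugal_median(stream, step=1.0):
    """Frugal streaming median sketch: O(1) memory per stream.
    Nudges a single running estimate toward each new value;
    converges (slowly) to the median."""
    m = 0.0
    for x in stream:
        if x > m:
            m += step
        elif x < m:
            m -= step
    return m
```

One float per sym per day is all you store, at the cost of slow convergence and sensitivity to the step size.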



New Contributor
Hi Kei,

I don't think med can be map-reduced in a nice way without having the whole data set. If you wanted something like avg over the past 30 days, you could store a sum and a count for each day and combine those, but there is no similar trick for med.
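To see why avg decomposes so cleanly, here is the sum-and-count approach sketched in Python (the daily figures are invented):

```python
# Hypothetical daily summaries: (sum of sizes, trade count) per day.
days = [(100.0, 2), (300.0, 2)]

# avg map-reduces exactly: sum the sums, sum the counts, divide.
total_sum = sum(s for s, _ in days)
total_cnt = sum(c for _, c in days)
avg_30d = total_sum / total_cnt
```

The median has no such pair of per-day statistics that can be combined exactly, which is the root of the problem.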

One suggestion, although not ideal, is to store the median size by sym for each day and calculate the median of these values. This will only give an estimate of the true median and could potentially be skewed if you have days where lots of the size data is close to the median for that day.   

If your data has a small number of distinct values, then you could try a
count each group x
on each partition. By keeping a count of every distinct size value for each day, you can rebuild the median from the merged counts. This won't be practical if you have a large number of distinct size values, or if size is a float column.

Hope this helps,