Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- KX Community
- :
- Discussion Forums
- :
- kdb+ and q
- :
- Finding median from partitions using map reduce

Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Finding median from partitions using map reduce

Options

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

2020.09.02 11:54 PM

Hi,

I have a date partitioned HDB that stores a big number of trades. e.g.:

trade:([] date; sym; time; side; size; price)

While the machine only have enough memory to host data for 1 or 2 days, we want to find the med trade size across past 30 days.

i.e. select med size by sym from trade where date > .z.D - 30

Is there a way to compute the median by mapping raw trade data to daily summary?

Then reduce the summary to 30 days median?

Then reduce the summary to 30 days median?

Thx,

Kei

3 REPLIES 3

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

2020.09.03 12:59 AM

Hi,

Perhaps create a frequency distribution of each size (count) and update accordingly.

Gary

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

2020.09.03 02:01 AM

This is a nice question. I used to use streaming medians for interviews.There are many options depending upon how accurate you need to be, with different memory, accuracy tradeoffs.

At the simplest level, you could just store the median each day, and then calculate the median of medians. This is what KDB used to do to calculate medians over partitions prior to v3.0. This is very simple, but it sacrifices as bit of accuracy.

To be completely accurate we don't need to store every trade, at the very least we need to store the number of trades with each different trade size. i.e to bin the data. Trade sizes are discrete, so this should cut down on things significantly. The median can then be reconstructed by calculating the 30 day rolling sum of the medians, and then finding where the CDF crosses 0.5 cross sectionally.

Trade sizes start at zero, so the size of our bins table could be bulked up by large but rare trades which are unlikely to have an effect on the median. To combat this you could clip the trade sizes at say 90% each day (but still include these trades at the clipped size), which would save some space. It probably wouldn't effect the accuracy, but could introduce some bias, especially as you transition between trading environments e.g. from low to high volatility.

To combat this bias, an alternative would be to use to use batch reservoir sampling, to decide which trades to include each day. This is more advanced, but is likely to give a very accurate answer without any bias.

If space is an absolute premium, and you would only like to store a few values per sym per day, then you could look at a technique called frugal streaming, but I don't think it can be vectorized.

Best,

George

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

2020.09.03 06:02 AM

Hi Kei,

I don't think med can be map-reduced in a nice way without having the whole data set. If you wanted something like avg over the past 30 days, then you could store sums and counts for each day and use that but there's no similar method for med.

One suggestion, although not ideal, is to store the median size by sym for each day and calculate the median of these values. This will only give an estimate of the true median and could potentially be skewed if you have days where lots of the size data is close to the median for that day.

If your data has a small number of distinct values, then you could try a

count each group x

on each partition. By having a count of all the distinct size values for each day, you could build a median this way. This won't be practical if you have a large number of distinct size values, or you have float columns.

Hope this helps,

Liam

Main Office Contacts

**EMEA**

Tel: +44 (0)28 3025 2242

**AMERICAS**

Tel: +1 (212) 447 6700

**APAC**

Tel: +61 (0)2 9236 5700

Useful Information

Popular Links

Follow Us

KX. All Rights Reserved.

KX and kdb+ are registered trademarks of KX Systems, Inc., a subsidiary of FD Technologies plc.