Group or fby

andiecairns4
New Contributor
Hi Q Experts, 

I'm unsure of the best option in this instance so hopefully some of you can help. 
Say I have a table with columns date, sym and size (small sample):

    date       sym  size
    ---------------------
    2018.02.01 `A   200
    2018.02.01 `B   400
    2018.02.02 `A   600
    2018.02.02 `B   200
    2018.02.05 `A   100
    2018.02.05 `B   101
    2018.02.06 `A   205
    2018.02.06 `B   400

How would I reduce this table to show only, say, 5 records per date (in this case the 5 highest sizes for each date; I can sort those columns first if need be)?
The big question: would I be better off using group, or would fby be more efficient here, and, more importantly, why? I was reading about fby on the following page but found it confusing:

https://code.kx.com/wiki/Reference/fby

As always thanks!
Regards
Andie C
2 REPLIES

markk1995
New Contributor
Hi Andie C

fby lets you apply an aggregation function to different sub-groups of your data and get the results back as a vector aligned with the original rows. It takes three inputs: an aggregation function, a data vector to apply the function to, and a group vector of the same length that defines the sub-groups. It is written as (function;data) fby group. fby works out the sub-groups from the distinct values of the group vector and finds the indices of the elements in each sub-group. It then applies the function to each sub-group in turn and merges the per-group results back together in the order in which the elements appeared in the original group vector.
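As a minimal sketch of that behaviour (the vectors v and g are made up for illustration, they are not from the question), fby returns one result per element rather than one per group:

```q
/ hypothetical toy data: five values and the group each one belongs to
v:10 20 30 40 50
g:`a`b`a`b`a

/ sum within each group, written back at every position of that group
(sum;v) fby g    / 90 60 90 60 90

/ same idea with max
(max;v) fby g    / 50 40 50 40 50
```

Because the result lines up with the original rows, it can be compared or filtered directly in a where clause.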

Your query could be achieved using fby by the following command:

select from t where 5>(rank;neg size)fby date

rank returns, for each element, the position it would take if the list were sorted in ascending order (https://code.kx.com/wiki/Reference/rank). To get the positions for descending order instead, the query ranks neg size, so the largest size gets rank 0.
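A small sketch of rank on its own may help here. The first vector holds the four sizes for 2018.05.29 from the sample data in this reply; the tie example is made up for illustration:

```q
/ the four sizes for 2018.05.29 in the sample data
s:105 131 156 100

rank s        / 1 2 3 0  (ascending: smallest value gets 0)
rank neg s    / 2 1 0 3  (descending: largest value gets 0)

/ hypothetical tie: equal values still get distinct ranks, earlier element first
rank neg 181 181 182   / 1 2 0
```

Since ties receive distinct ranks, 5>rank ... keeps at most 5 rows per group even when sizes repeat.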

q)t:([]date:.z.d + 15?3;sym:15?`a`b;size:100 + 15?100)
date       sym size
-------------------
2018.05.31 b   193
2018.05.29 b   105
2018.05.30 b   158
2018.05.29 b   131
2018.05.30 a   107
2018.05.30 a   153
2018.05.31 a   138
2018.05.31 a   123
2018.05.30 b   127
2018.05.31 a   132
2018.05.31 a   182
2018.05.31 a   181
2018.05.29 b   156
2018.05.31 a   181
2018.05.29 b   100


Indices of the sub-groups, split by the distinct values of date:
2018.05.29 => 1 3 12 14
2018.05.30 => 2 4 5 8
2018.05.31 => 0 6 7 9 10 11 13

Applying the aggregation function rank to the corresponding values in the data vector:

2018.05.29  rank -105 -131 -156 -100 => 2 1 0 3   5>2 1 0 3 => 1111b, corresponding to elements 1 3 12 14 of the group vector
2018.05.30  rank -158 -107 -153 -127 => 0 3 1 2   5>0 3 1 2 => 1111b, corresponding to elements 2 4 5 8
2018.05.31  rank -193 -138 -123 -132 -182 -181 -181 => 0 4 6 5 1 2 3   5>0 4 6 5 1 2 3 => 1100111b, corresponding to elements 0 6 7 9 10 11 13

Putting the results back in the original row order gives 111111101011111b, so the where clause of the select query keeps every row except those at indices 7 and 9.
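The walk-through can be checked by typing the sample table in by hand, since the random draw will differ on every run. The following is only a sketch of the idea, not fby's actual implementation, and the names d, s, z, g, r and k are chosen here for illustration:

```q
/ the 15 rows shown above, entered by hand
d:2018.05.31 2018.05.29 2018.05.30 2018.05.29 2018.05.30 2018.05.30 2018.05.31 2018.05.31 2018.05.30 2018.05.31 2018.05.31 2018.05.31 2018.05.29 2018.05.31 2018.05.29
s:`b`b`b`b`a`a`a`a`b`a`a`a`b`a`b
z:193 105 158 131 107 153 138 123 127 132 182 181 156 181 100
t:([]date:d;sym:s;size:z)

g:group d                            / date -> row indices, in order of first appearance
r:rank each neg z g                  / ranks within each date, largest size = 0
k:@[count[d]#0N;value g;:;value r]   / write each group's ranks back to its own rows
k~(rank;neg z) fby d                 / 1b: the same vector fby produces

where not 5>k                        / 7 9, the two rows that get dropped
```

The amend in the k line is the "merge" step: each group's results are assigned back at that group's row indices, which is why the final vector lines up with the table.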

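Wrapping the function with 0N!x lets you see the data passed in for each group. The final `long$() appears to come from fby also calling the function once on an empty list.
q)select from t where 5>({[x]0N!x;rank x};neg size)fby date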
-193 -138 -123 -132 -182 -181 -181
-105 -131 -156 -100
-158 -107 -153 -127
`long$()
date       sym size
-------------------
2018.05.31 b   193
2018.05.29 b   105
2018.05.30 b   158
2018.05.29 b   131
2018.05.30 a   107
2018.05.30 a   153
2018.05.31 a   138
2018.05.30 b   127
2018.05.31 a   182
2018.05.31 a   181
2018.05.29 b   156
2018.05.31 a   181
2018.05.29 b   100

You could achieve the same result with xgroup:

ungroup[t1] where raze exec 5>rank each neg[size] from t1:`date xgroup t

Here ungroup[t1] is needed to get something similar to the merge step of fby, because the grouped table no longer carries the original row indices. As a result the rows come back in group order rather than in their original order.
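To make the ordering difference concrete, here is a small sketch on a made-up table (u, u1, m and r are hypothetical names, and it keeps the top 2 per date rather than the top 5 to stay short):

```q
/ hypothetical table: two dates, three rows each
u:([]date:2018.02.01 2018.02.02 2018.02.01 2018.02.02 2018.02.01 2018.02.02;size:200 400 600 100 300 500)

u1:`date xgroup u                          / one row per date, size becomes a list per date
m:raze exec 2>rank each neg[size] from u1  / top-2 flags, flattened in group order
r:ungroup[u1] where m                      / flatten the table, then keep the flagged rows

r`size                                               / 600 300 400 500 (grouped by date)
exec size from u where 2>(rank;neg size) fby date    / 400 600 300 500 (original row order)
```

Both queries keep the same four rows; only the order in which they come back differs.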

You could also achieve it using group, but it requires extra steps to mimic what fby does when it flattens the per-group results back into one vector. An example follows. Note that the outputs below come from a different randomly generated t, so the indices do not match the table above.
First get the indices of the records for each group.
q)indices:exec group date from t
2018.05.30| 0 2 8
2018.05.31| 1 3 6 12 13 14
2018.05.29| 4 5 7 9 10 11

Then apply the function within each group, returning whether each element of the group is in the top 5 by descending size:
q)g:exec 5>rank neg size by date from t
2018.05.29| 111110b
2018.05.30| 111b
2018.05.31| 011111b

Use where to return the position of the 1s and index into the dictionary of indices from the original table
q)indices @' where each g
2018.05.30| 0 2 8
2018.05.31| 3 6 12 13 14
2018.05.29| 4 5 7 9 10

Flatten the indices using raze and index into the original table to return all the records:
q)t raze indices @' where each g
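The group-based steps can be put together on a small made-up table. This is a sketch under assumptions: the table w and the name i are hypothetical, and it keeps the top 2 per date to stay short. The later date deliberately appears first, so the two dictionaries end up with different key orders:

```q
/ hypothetical table; 2018.02.02 appears before 2018.02.01
w:([]date:2018.02.02 2018.02.01 2018.02.02 2018.02.01 2018.02.02;size:10 20 30 40 50)

indices:exec group date from w         / keys in order of first appearance
g:exec 2>rank neg size by date from w  / keys in sorted order
i:raze indices @' where each g         / surviving row numbers, grouped by date

asc i                                  / 1 2 3 4
(w asc i)~select from w where 2>(rank;neg size) fby date   / 1b
```

Sorting the indices with asc restores the original row order, which is the part fby handles for you.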

q)\t:10000 select from t where 5>(rank;neg size)fby date
136
q)\t:10000 t raze (exec group date from t) @' where each exec 5>rank neg size by date from t
188
q)\t:10000 ungroup[t1] where raze exec 5>rank each neg[size] from t1:`date xgroup t
290

fby is the fastest of the three in this example (136 ms against 188 ms and 290 ms for 10,000 runs), and I am told it will be for larger datasets too, given that it is built to handle these types of queries efficiently.
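To check how the three approaches compare on more data, something like the following could be pasted into a q session. It is only a sketch: the table big, its size n and the number of dates are arbitrary assumptions, and the timings (in milliseconds) will depend on the machine, so none are shown here:

```q
/ hypothetical larger table: 1 million rows spread over 20 dates, random sizes
n:1000000
big:([]date:2018.05.01+n?20;sym:n?`a`b;size:100+n?100)

/ time each approach over 10 runs
\t:10 select from big where 5>(rank;neg size)fby date
\t:10 big raze (exec group date from big) @' where each exec 5>rank neg size by date from big
\t:10 ungroup[b1] where raze exec 5>rank each neg[size] from b1:`date xgroup big
```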

Fantastic answer - Thanks Mark!