
## Group or fby

New Contributor
Hi Q Experts,

I'm unsure of the best option in this instance so hopefully some of you can help.
Say I've a table with columns date, sym and size (small sample);

date       sym  size
---------------------
2018.02.01 `A   200
2018.02.01 `B   400
2018.02.02 `A   600
2018.02.02 `B   200
2018.02.05 `A   100
2018.02.05 `B   101
2018.02.06 `A   205
2018.02.06 `B   400

How would I reduce this table to show only, say, 5 records per date (in this case, the 5 highest sizes for each date; I can sort those columns first if need be)?
The big question: would I be better off using group, or would fby be more efficient here, and, more importantly, why? I was reading about fby on the following page but found it confusing:

https://code.kx.com/wiki/Reference/fby

As always thanks!
Regards
Andie C
2 REPLIES 2
New Contributor
Hi Andie C

fby allows you to apply an aggregation function to different sub-groups of your data and return a result for each group. It takes three arguments: an aggregation function, a data vector to apply the function to, and a group vector of the same size that defines the sub-groups. fby works out the sub-groups from the distinct values of the group vector and finds the indices of the elements in each sub-group. It then applies the aggregation function to each sub-group in turn, takes the result for each group, and merges the results back together in the order in which the elements appeared in the original group vector.
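As a minimal sketch of those steps, the snippet below runs fby on two tiny vectors and then repeats the same group, apply and merge sequence by hand. The names `data`, `grp`, `idx` and `res` are made up for illustration, and the manual version only mirrors the description above; it is not meant as kdb+'s actual implementation.

```q
/ Illustrative vectors (not from the original post)
data:1 2 3 4
grp:`a`b`a`b

/ fby: sum within each group, result placed back at every element's position
(sum;data) fby grp                 / 4 6 4 6

/ The same steps by hand
idx:group grp                      / `a`b!(0 2;1 3): indices of each sub-group
agg:sum each data value idx        / 4 6: one result per sub-group
res:@[data;value idx;:;agg]        / write each result back at its group's indices
res~(sum;data) fby grp             / 1b
```

With a function such as rank, which returns one value per element rather than one per group, the merge step puts each element's own result back in its original position.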

Your query could be achieved using fby by the following command:

`select from t where 5>(rank;neg size)fby date`

rank returns the position where the element would place in a list sorted in ascending order (https://code.kx.com/wiki/Reference/rank). To return the index that each element would place in the list if sorted in descending order, neg size has been used.
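A quick sketch of what rank and neg do together, using four of the sizes that appear in the sample further down (the variable name `sizes` is only for illustration):

```q
sizes:105 131 156 100
rank sizes          / 1 2 3 0: position of each element in ascending order
rank neg sizes      / 2 1 0 3: position in descending order of size
5>rank neg sizes    / 1111b: fewer than five elements, so all are kept
rank -181 -181      / 0 1: tied values still receive distinct ranks
```

Because ties get distinct ranks, `5>rank ...` keeps at most five elements per group even when sizes repeat.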

    q)t:([]date:.z.d + 15?3;sym:15?`a`b;size:100 + 15?100)
    q)t
    date       sym size
    -------------------
    2018.05.31 b   193
    2018.05.29 b   105
    2018.05.30 b   158
    2018.05.29 b   131
    2018.05.30 a   107
    2018.05.30 a   153
    2018.05.31 a   138
    2018.05.31 a   123
    2018.05.30 b   127
    2018.05.31 a   132
    2018.05.31 a   182
    2018.05.31 a   181
    2018.05.29 b   156
    2018.05.31 a   181
    2018.05.29 b   100

Indices of the sub-groups, as split by the distinct values of date:
2018.05.29 => 1 3 12 14
2018.05.30 => 2 4 5 8
2018.05.31 => 0 6 7 9 10 11 13

Applying the aggregation function rank to the corresponding values in the data vector:

    2018.05.29: rank -105 -131 -156 -100 => 2 1 0 3
                5>2 1 0 3 => 1111b, corresponding to elements 1 3 12 14 of the group vector
    2018.05.30: rank -158 -107 -153 -127 => 0 3 1 2
                5>0 3 1 2 => 1111b, corresponding to elements 2 4 5 8
    2018.05.31: rank -193 -138 -123 -132 -182 -181 -181 => 0 4 6 5 1 2 3
                5>0 4 6 5 1 2 3 => 1100111b, corresponding to elements 0 6 7 9 10 11 13

Putting the results back in order gives 111111101011111b and so the where clause appended onto the select query will select all rows apart from the rows with indices 7 and 9.
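To check that walkthrough without depending on random draws, the sketch below rebuilds the 15-row sample as a literal table and recomputes the mask. The helper names `dates`, `syms`, `sizes`, `mask` and `r` are introduced here only for illustration.

```q
/ Literal copy of the 15-row sample (offsets are days after 2018.05.29)
dates:2018.05.29+2 0 1 0 1 1 2 2 1 2 2 2 0 2 0
syms:`b`b`b`b`a`a`a`a`b`a`a`a`b`a`b
sizes:193 105 158 131 107 153 138 123 127 132 182 181 156 181 100
t:([] date:dates; sym:syms; size:sizes)

/ Boolean mask produced by the where clause
mask:exec 5>(rank;neg size) fby date from t
mask                / 111111101011111b
where not mask      / 7 9: the two rows that are dropped

/ The filtered table keeps the original row order
r:select from t where 5>(rank;neg size) fby date
count r             / 13
```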

Using 0N!x inside the function lets you see the values passed in for each group. The trailing `long$() line suggests the function is also called once on an empty list:

    q)select from t where 5>({[x]0N!x;rank x};neg size)fby date
    -193 -138 -123 -132 -182 -181 -181
    -105 -131 -156 -100
    -158 -107 -153 -127
    `long$()
    date       sym size
    -------------------
    2018.05.31 b   193
    2018.05.29 b   105
    2018.05.30 b   158
    2018.05.29 b   131
    2018.05.30 a   107
    2018.05.30 a   153
    2018.05.31 a   138
    2018.05.30 b   127
    2018.05.31 a   182
    2018.05.31 a   181
    2018.05.29 b   156
    2018.05.31 a   181
    2018.05.29 b   100

You could achieve the same result with xgroup:

    ungroup[t1] where raze exec 5>rank each neg[size] from t1:`date xgroup t

Here ungroup[t1] has to be applied to get something similar to the merging step of fby, since the grouped table has lost the information about the original indices of the rows.
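A small sketch of what the xgroup round trip does to row order, using a hypothetical three-row table (`small`, `g1` and `flat` are illustrative names, not from the post):

```q
/ Hypothetical three-row table; the 2018.05.29 rows are not adjacent
small:([] date:2018.05.29 2018.05.30 2018.05.29; sym:`a`b`a; size:105 158 131)

/ Keyed on date; sym and size become nested lists, one list per date
g1:`date xgroup small

/ ungroup flattens again, but rows now come out grouped by date
flat:ungroup g1
flat~([] date:2018.05.29 2018.05.29 2018.05.30; sym:`a`a`b; size:105 131 158)   / 1b
```

So the xgroup version returns the same rows as the fby query, but not necessarily in the original order.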

You could also achieve it using group, but it requires extra steps to mimic how fby merges the results back together. An example follows. (The outputs below come from a different random t than the one shown above, so the indices differ.)
First get the indices of the records for each group:

    q)show indices:exec group date from t
    2018.05.30| 0 2 8
    2018.05.31| 1 3 6 12 13 14
    2018.05.29| 4 5 7 9 10 11

Then apply the function to each group, returning whether each element in the group is in the top 5 by descending size:

    q)show g:exec 5>rank neg size by date from t
    2018.05.29| 111110b
    2018.05.30| 111b
    2018.05.31| 011111b

Use where to return the positions of the 1s and index into the dictionary of indices from the original table:

    q)indices @' where each g
    2018.05.30| 0 2 8
    2018.05.31| 3 6 12 13 14
    2018.05.29| 4 5 7 9 10

Flatten the indices using raze and index into the original table to return all the records:
`t raze indices @' where each g`

    q)\t:10000 select from t where 5>(rank;neg size)fby date
    136
    q)\t:10000 t raze (exec group date from t) @' where each exec 5>rank neg size by date from t
    188
    q)\t:10000 ungroup[t1] where raze exec 5>rank each neg[size] from t1:`date xgroup t
    290

fby is the most efficient in this example, and I am told it will be for larger datasets too, given that it is built to handle these types of queries efficiently.