2022.05.24
01:07 AM
- last edited on
2023.04.19
03:10 AM
by
davidcrossey
I would like to understand what is the underlying optimization or performance tweak that kdb+/q is able to extract when we apply key on one/more columns in the table. Key serves the purpose of helping join 2 or more tables via row to row match using the keys. Internally the keyed table is stored as a map [ key value pair ]. The keyed columns are the key and the rest of unkeyed columns make the payload/data tuple. since it is a map querying the map for a particular key must be faster than an unkeyed table. The keys are basically nodes on a BST ( binary search tree ) which should look up in O(log(n)) time given n is size of the table. An unkeyed table on the other hand is an aggregate of rows [ an array basically ]. Hence searching it or querying should be O(n). But when we run a simple test with some data there is no difference in speed between querying unkeyed vs keyed table at all. Why is this ? In Fact the keyed table takes slightly more time to lookup a particular record. I do understand that the table may not be sorted in memory based on the keys but still with a dictionary type keyed table should a faster search be not possible ?
2022.05.24 03:04 AM
You can see the speed improvement if you index rather than query using qsql:
q)select from t1 where Name=`Anand
Name Age Salary
----------------
Anand 10 1000
q)select from t2 where Name=`Anand
Name | Age Salary
-----| ----------
Anand| 10 1000
q)t2`Anand
Age | 10
Salary| 1000
q)\ts do[100000;select from t1 where Name=`Anand]
164 1808
q)\ts do[100000;select from t2 where Name=`Anand]
174 1808
q)\ts do[100000;t2`Anand]
71 960
2022.05.24 04:37 AM
With qsql, the full column is searched before the result is presented. With key lookup, the search stops when it finds the first match. This is where the performance gain from key lookup comes from.
> The keys are basically nodes on a BST ( binary search tree ) which should look up in O(log(n)) time given n is size of the table
This isn't true. KDB's keyed table, or more generally a plain dictionary, does not use hashing techniques like Java's HashMap. Lookup is done by searching through the table/list linearly. If one wishes less time for the lookup, grouped (conceptually the same as index) or parted (conceptually a space-optimized index that assumes data are sorted) attributes help.
2022.05.24 03:04 AM
You can see the speed improvement if you index rather than query using qsql:
q)select from t1 where Name=`Anand
Name Age Salary
----------------
Anand 10 1000
q)select from t2 where Name=`Anand
Name | Age Salary
-----| ----------
Anand| 10 1000
q)t2`Anand
Age | 10
Salary| 1000
q)\ts do[100000;select from t1 where Name=`Anand]
164 1808
q)\ts do[100000;select from t2 where Name=`Anand]
174 1808
q)\ts do[100000;t2`Anand]
71 960
2022.05.24 03:52 AM
It's amazing that lookup via key works faster but same lookup isn't used as a hint when querying the table via qsql ? Sounds surprising behaviour actually and doesn't look sensible.
2022.05.24 04:24 AM
For optimisations in qsql queries you should look to attributes.
Here applying the unique attribute to the Name column:
q)t3:update `u#Name from t1
q)\ts do[100000;select from t3 where Name=`Anand]
64 1808
2022.05.24 04:37 AM
With qsql, the full column is searched before the result is presented. With key lookup, the search stops when it finds the first match. This is where the performance gain from key lookup comes from.
> The keys are basically nodes on a BST ( binary search tree ) which should look up in O(log(n)) time given n is size of the table
This isn't true. KDB's keyed table, or more generally a plain dictionary, does not use hashing techniques like Java's HashMap. Lookup is done by searching through the table/list linearly. If one wishes less time for the lookup, grouped (conceptually the same as index) or parted (conceptually a space-optimized index that assumes data are sorted) attributes help.
2022.05.24 04:44 AM
Regarding why q-sql doesn't do key lookup internally if the search is on the key, short answer is that they are not equivalent. Consider a keyed table where there are duplicate keys (yes this is allowed in KDB, there is no such thing as "primary key constraint" in this world), q-sql may return a table of multiple rows, while the key lookup form returns a dictionary representing the first entry (you know why - the search stops at first matched entry)
2022.05.24 08:36 AM
Just wanted to codify some of the above:
q)tab:([]sym:-50000?`6;px:50000?10)
q)ktab:`sym xkey tab
// wanted to show the last record to convince that if kdb
// has to search through entire list, no speed gain is evident
q)-1#tab
sym px
---------
obafmn 6
q)-1#ktab
sym | px
------| --
obafmn| 6
q)\ts do[100000;select from tab where sym=`obafmn]
2034 66240
q)\ts do[100000;select from ktab where sym=`obafmn]
2051 66240
// below no speed gain, but memory usage is reduced
q)\ts do[100000;ktab`obafmn]
2079 960
// however this does mean any other *valid* request would be faster
q)rand key ktab
sym| kfpfmd
q)\ts do[100000;ktab`kfpfmd]
1169 960
// showing point above with grouped attribute applied to keyed column
q)gktab:`sym xkey update `g#sym from tab
q)\ts do[100000;select from gktab where sym=`obafmn]
89 1808
// final point I wanted to make showing duplicate keys
q)`a`a!(1;1)
a| 1
a| 1
Thanks @darrenwsun and @rocuinneagain for contributing here.
2022.05.24 08:41 AM
Many thanks! I typed the responses during commute, would love to accompany the explanation with examples but doing that on mobile isn't too easy 🙂
EMEA
Tel: +44 (0)28 3025 2242
AMERICAS
Tel: +1 (212) 447 6700
APAC
Tel: +61 (0)2 9236 5700
KX. All Rights Reserved.
KX and kdb+ are registered trademarks of KX Systems, Inc., a subsidiary of FD Technologies plc.