cancel
Showing results for 
Search instead for 
Did you mean: 

Large Scale WindowJoins Questions

planefan
Contributor

Hello Fellow Qbies,

I'm attempting to run a wj across a table with a dimension of 1million rows and 30 columns.

It works, but it takes too long to run.

What are some tips and tricks to dealing with large datasets and windowjoins?

9 REPLIES 9

rocuinneagain
Contributor III
Contributor III

The documentation does have a note on attributes to apply:

https://code.kx.com/q/ref/wj/

t and q are simple tables to be joined
(q should be sorted `sym`time with `p# on sym)

Have you applied this?

planefan
Contributor

What if I'm not using a sym column? I'm going datetime to datetime for the windowjoin.

Would I get speed improvements by using a sym column?

So you are only joining on time windows? no other column?

You could apply the sorted attribute.

https://code.kx.com/q/ref/set-attribute/#sorted

The 'datetime' datatype is deprecated. 'timestamp' should be used instead.

https://code.kx.com/q/basics/datatypes/#temporal

If you can share a minimal reproducible example in q that will help us answer your questions.

https://stackoverflow.com/help/minimal-reproducible-example

I have 800,000 rows of data in a table with the following meta:

c              | t f a
---------------| -----
col1           | n    
col2           | s    
col3           | d    
mmm3           | f    
col4           | f    
col5           | s    
col6           | i    
col7           | i    
col8           | j    
col9           | f    
col10          | j    
dt             | p   s
col11          | f    
col12          | f    
col13          | j    
col14          | j    
col15          | f    
col16          | f    
col17          | f    
col18          | f    
col19          | f    
col20          | f    
col21          | f    
col22          | f    
col23          | j    
col24          | j  

Im attempting to get a rolling 5minute min/max using wj. (i recognize the window isn't such for 5minutes, im just writing up an example)

f:`dt;
w:(-1000000; 0)+\:data[`dt];
data: wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))];

The wj works just fine, but takes insanely long to complete.

I've tested it down to only 1 column, the timestamp, being wj to.

Is there a way to open threading or something? It's like 12,000 entries per timestamp being read. How long should that take?

On my laptop this took 765 milliseconds - what runtime are you seeing?

data:([] dt:`s#.z.p+0D00:00:01*til 800000;mmm3:800000?1000.0)
f:`dt;
w:(-1000000; 0)+\:data[`dt];
\ts result: wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]

It took me significantly longer. But, I'm also dealing with 30 columns -- would that matter, even though I'm just using mmm3 for the wj?

In testing, when I make the data table that im searching (data;(max:`mmm3)) smaller, things speed up. For example, I ran a 1 minute xbar on that table and the windowjoin now takes a couple seconds.

I dont understand what I could be doing wrong. At full scale, 800,000x30cols, it took like 30-40 minutes to complete a 5 minute lookback.

I added extra columns and it did not slow down the operation for me.
Is your data in memory or on disk?

data:([] dt:`s#.z.p+0D00:00:01*til 800000;mmm3:800000?1000.0)
data:data,'flip (`$"col",/:string til 30)!30#enlist til 800000
f:`dt;
w:(-1000000; 0)+\:data[`dt];
result: wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]

Im working in memory.

I load the file in its 800,000 x 7. Then run a bunch of updates to make the remaining 23 columns (30 or so total).

And then run the wj. And pass back the resulting table.

Im trying to simulate what it would be doing in a tp. Like as more data comes in the wj is going to continuously run slower until it hits max file for the day, 800k rows. So I figured loading the whole file in and doing it all at once would be a decent enough way to test what it'd be doing.