Hello Fellow Qbies,
I'm attempting to run a wj (window join) across a table of 1 million rows and 30 columns.
It works, but it takes too long to run.
What are some tips and tricks to dealing with large datasets and windowjoins?
So you are only joining on time windows, no other column?
You could apply the sorted attribute.
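For example (a sketch, assuming your table is called data and is already sorted by dt):

```q
/ wj can binary-search the window boundaries when the time column
/ carries the sorted attribute; without it the lookup is far slower
data:update `s#dt from data   / signals a `s-fail error if data is not actually sorted by dt
meta data                     / the a column should now show s for dt
```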
The 'datetime' datatype is deprecated. 'timestamp' should be used instead.
If you can share a minimal reproducible example in q that will help us answer your questions.
I have 800,000 rows of data in a table with the following meta:
c    | t f a
-----| -----
col1 | n
col2 | s
col3 | d
mmm3 | f
col4 | f
col5 | s
col6 | i
col7 | i
col8 | j
col9 | f
col10| j
dt   | p   s
col11| f
col12| f
col13| j
col14| j
col15| f
col16| f
col17| f
col18| f
col19| f
col20| f
col21| f
col22| f
col23| j
col24| j
I'm attempting to get a rolling 5-minute min/max using wj. (I recognize the window below isn't actually 5 minutes; I'm just writing up an example.)
f:`dt
w:(-1000000;0)+\:data[`dt]
data:wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]
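For reference, a true 5-minute lookback would use timespans rather than a raw nanosecond count:

```q
/ window from 5 minutes before each row's dt up to the row's dt
w:(-0D00:05:00;0D00:00:00)+\:data[`dt]
```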
The wj works just fine, but takes insanely long to complete.
It took me significantly longer. But I'm also dealing with 30 columns -- would that matter, even though I'm only using mmm3 in the wj?
In testing, when I make the table being searched (the last argument, (data;(max;`mmm3);(min;`mmm3))) smaller, things speed up. For example, I ran a 1-minute xbar on that table and the window join now takes a couple of seconds.
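Roughly what I did (a sketch; the bucket column names mx/mn are my own, and w/data are as in my earlier snippet):

```q
/ pre-aggregate the searched table to 1-minute buckets; the max of the
/ bucket maxes equals the overall max (likewise for min), at the cost
/ of the window edges snapping to bucket boundaries
small:0!select mx:max mmm3,mn:min mmm3 by dt:0D00:01 xbar dt from data
small:update `s#dt from small
data2:wj[w;`dt;data;(small;(max;`mx);(min;`mn))]
```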
I don't understand what I could be doing wrong. At full scale, 800,000 rows x 30 columns, it took 30-40 minutes to complete a 5-minute lookback.
I added extra columns and it did not slow down the operation for me.
Is your data in memory or on disk?
data:([]dt:`s#.z.p+0D00:00:01*til 800000;mmm3:800000?1000.0)
data:data,'flip(`$"col",/:string til 30)!30#enlist til 800000
f:`dt
w:(-1000000;0)+\:data[`dt]
result:wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]
I'm working in memory.
I load the file in; it's 800,000 x 7. Then I run a bunch of updates to build the remaining 23 columns (30 or so total).
And then run the wj. And pass back the resulting table.
I'm trying to simulate what it would be doing in a tickerplant. As more data comes in, the wj will run progressively slower until it hits the max file size for the day, 800k rows. So I figured loading the whole file in and doing it all at once would be a decent enough way to test what it'd be doing.