
Parallelising .Q.dpft with default compression enabled

New Contributor II

.Q.dpft and why to parallelise it

.Q.dpft can be split into 4 steps:
  1. Find the index ordering of the table with iasc
  2. Enumerate table against sym file
  3. Save down table 1 column at a time, reordering and applying attributes as it goes
  4. Set .d file down
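
As a rough in-memory illustration of steps 1 and 3 (steps 2 and 4 need an HDB directory on disk, so they are left out), here is a minimal sketch. The toy table and its column names are my own assumptions, not part of the tests below.

```q
// Toy table; column names and values are illustrative only
t:([]sym:`b`a`b`a;px:1 2 3 4f)
i:iasc t`sym          // step 1: index that sorts the table by the parted field
s:t[`sym]i            // step 3: reorder one column with that index
ps:`p#s               // apply the parted attribute to the parted field
px:t[`px]i            // other columns are reordered the same way, with no attribute
```

Because the same index `i` is reused for every column, each column can be reordered and written independently of the others, which is what makes the per-column work a candidate for peach.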


Since writing data down to disk is an IO-bound operation, attempting to parallelise any of the above steps would normally not yield any speed increase.


But if you are saving data down with a default compression algorithm, .Q.dpft spends time on every column compressing the data before it is written to disk, and that compression is CPU-bound rather than IO-bound.
Parallelising the compression should allow the IO channels to be saturated more often than a single thread can manage.
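
To check that default compression is actually being applied to a written column, something like the following sketch can be used. The scratch path /tmp/zdsketch is a hypothetical location of my choosing, and the vector is just filler data.

```q
// Assumption: /tmp/zdsketch is a throwaway scratch directory
.z.zd:17 2 6                        // logical block size 2^17, algorithm 2 (gzip), level 6
`:/tmp/zdsketch/col set 1000000#0   // files without an extension pick up .z.zd
stats:-21!`:/tmp/zdsketch/col       // compression statistics for the written file
stats`algorithm`compressedLength`uncompressedLength
```

If the file was written uncompressed, -21! has no compression statistics to report, so this is a quick sanity check before timing anything.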

Parallelising .Q.dpft

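The function is very similar to .Q.dpft, except that I replace the each-both with a peach. It takes the same arguments as .Q.dpft (directory, partition, parted field, table name):
func:{[d;p;f;t]
	i:iasc t f;
	tab:.Q.en[d;`. t];
	.[{[d;t;i;c;a]@[d;c;:;a t[c]i]}[d:.Q.par[d;p;t];tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t;
	@[d;`.d;:;f,c where not f=c];
	};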
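
The expression `flip(c;)(::;`p#)f=c:cols t` is the dense part, so here is a small sketch of what it builds, using hypothetical column names. It also shows that peach returns the same result as each, whether or not secondary threads are available.

```q
// Hypothetical column names and parted field, for illustration only
c:`timestamp`sym`a
f:`sym
pairs:flip(c;)(::;`p#)f=c     // one (column;function) pair per column
names:pairs[;0]               // the column names, in order
sq:{x*x}
same:(sq each til 5)~sq peach til 5
```

Only the parted field is paired with `p#; every other column gets the identity function, so the writer lambda can apply its second argument blindly. With no -s on the command line, peach simply runs serially.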


Before each test I would
  • Start a new q session
  • Clear out the HDB directory
  • Define default compression with .z.zd
  • Create table to be saved down
By running the following code:
// Set default compression and delete HDB
.z.zd:17 2 6;							
system"rm -r /home/alivingston/HDB/*";

// define parallelised .Q.dpft
func:{[d;p;f;t]
	i:iasc t f;
	tab:.Q.en[d;`. t];
	.[{[d;t;i;c;a]@[d;c;:;a t[c]i]}[d:.Q.par[d;p;t];tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t;
	@[d;`.d;:;f,c where not f=c];
	};

// Create table
trade:([]timestamp:.z.p+til n;sym:n?`2;a:n?0;b:n?1f;c:string n?`3;d:n?0b;e:n?0;f:n?1f;g:string n?`3;h:n?0b);


I then tested the original .Q.dpft and the new function with the number of slaves set to 0, 2, 4 and 8, while logging RAM usage with top.
\ts func[dir;.z.d;`sym;`trade]
\ts .Q.dpft[dir;.z.d;`sym;`trade]
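
If you would rather capture the timings programmatically than read them off the console, system"ts ..." returns the same pair of numbers as \ts. The sketch below uses a stand-in workload, not the save-down itself, purely to show the shape of the result.

```q
// Stand-in workload; substitute the real calls when measuring
ref:system"ts do[50;sum til 1000000]"   // (elapsed ms; bytes), reference run
new:system"ts do[50;sum til 1000000]"   // (elapsed ms; bytes), candidate run
speedup:ref[0]%new 0                    // above 1 means the candidate was faster
```

Note that memory was logged with top in my tests, so only the elapsed time is taken from this result here.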



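For the following tables, time and space have been normalised against a reference run of .Q.dpft. The time column is the speed-up relative to that run (higher is faster) and the space column is the relative memory usage (higher is more RAM).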
threads| time  space
-------| -----------
0      | 0.992 1    
2      | 1.52  1.17 
4      | 1.8   1.32 
8      | 2.61  1.66 
In an attempt to limit memory usage, I repeated this testing with immediate garbage collection enabled via -g 1:
threads| time  space
-------| -----------
0      | 0.981 1    
2      | 1.56  1.08 
4      | 1.84  1.2  
8      | 2.63  1.49 
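
For anyone reproducing this, the garbage collection mode and memory figures can also be inspected from inside the process. This is only a sketch of the built-ins involved; it does not replace watching the whole process with top.

```q
gcmode:system"g"          // current mode: 0 deferred (default), 1 immediate
mem:.Q.w[]`used`heap`peak // bytes in use, heap size and peak heap for this process
freed:.Q.gc[]             // manual collection; returns the number of bytes released
```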


With -g 1, the parallelised .Q.dpft func with 2 threads ran 56% faster using 8% more RAM, while 8 threads ran 163% faster using roughly 50% more RAM.



Due to the extra memory required, this would likely not be sensible to run on an RDB at EOD.
I think the best use case for this would be when attempting to ingest years of data into kdb as fast as possible where RAM isn't an issue.
Comments or critiques are more than welcome. I'd be interested to know if this should be avoided for the above use case or if there are any other issues with this that I have missed.

Moderator

Thanks for posting! Looking forward to seeing if the community has any feedback on your approach.

New Contributor III

Tacking on here some further improvements Alex and I discussed:

func:{[d;p;f;t]
	i:iasc t f;
	c:cols t;
	is:(ceiling count[i]%count c) cut i;
	tab:.Q.en[d;`. t];
	{[d;tab;c;t;f;i].[{[d;t;i;c;a]@[d;c;,;a t[c]i]}[d;tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t}[d:.Q.par[d;p;t];tab;c;t;f;]each is;
	@[d;`.d;:;f,c where not f=c];
	};

This reduces the memory drawback, and in theory it should be more memory efficient than the standard .Q.dpft. The code above slices the sort index into chunks and appends each chunk to the column files in turn. The chunks are sized so that one chunk of the table, across all columns, holds roughly the same number of entries as a single full column, which is the most data .Q.dpft holds in memory at once because it writes column by column.
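
A small sketch of the chunking arithmetic, with a made-up index of 10 rows and 3 columns:

```q
i:til 10                          // stand-in for the sort index
nc:3                              // stand-in for count cols t
is:(ceiling count[i]%nc) cut i    // chunks of at most 4 row indices each
cells:nc*count each is            // entries held per chunk across all columns
```

Here each full chunk holds 12 entries against 10 for a whole column, so the match is approximate because of the ceiling, and the final chunk is smaller.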

This should give the benefits of parallelisation described above without the memory drawback we saw from simply adding peach.

I claimed "more memory efficient than standard .Q.dpft" above because the chunks are sized to match the number of elements in a column. Since .Q.dpft writes column by column, its maximum memory usage is set by the column with the biggest datatype (in bytes). With the new method, a chunk only ever contains part of that large column, alongside parts of the smaller-datatype columns. So at worst, when every column has the same sized datatype, it uses the same memory as .Q.dpft.
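
To make that argument concrete, here is a rough sketch comparing the two sizes on a toy table. The table is my own assumption, and -22! (serialised size) is only a proxy for the memory actually held, so treat the numbers as indicative.

```q
n:1000
t:([]a:n?100;b:n?1f;c:n?0b)           // toy table: long, float and boolean columns
colmax:max{-22!x}each value flip t    // largest whole column, in serialised bytes
rows:ceiling n%count cols t           // rows per chunk, as in the chunked function
chunk:-22!value flip rows#t           // one chunk across all columns
smaller:chunk<colmax
```

The chunk comes out smaller here because the boolean column contributes only one byte per row, whereas the largest whole column costs eight bytes per row.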

Preliminary tests showed that the speed improvement was maintained, with no memory drawback. However, these tests were not standardised or run in a proper testing framework. I would love to see rigorous results for this at some point, whether generated by me or by someone else who is curious.