
Parallelising .Q.dpft with default compression enabled

alivingston
New Contributor II

.Q.dpft and why to parallelise it

.Q.dpft can be split into 4 steps:
  1. Find the index ordering of the table with iasc
  2. Enumerate table against sym file
  3. Save down table 1 column at a time, reordering and applying attributes as it goes
  4. Set .d file down

 

Since writing data down to disk is an IO-bound operation, attempting to parallelise any of the above steps would normally not yield any speed increase.

 

But if you are saving data down with a default compression algorithm enabled, .Q.dpft will spend a significant portion of each column's write time compressing the data before it is written to disk.
Parallelising this compression allows the IO channels to be saturated more often than a single thread can manage.
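As a reminder of how default compression is configured: .z.zd takes a triple of (logical block size; algorithm; level). The value used in the tests below, 17 2 6, means 2^17-byte blocks, gzip (algorithm 2), compression level 6:

```q
/ default compression: (logical block size; algorithm; level)
/ 17 = 2^17-byte blocks, 2 = gzip, 6 = gzip level 6
.z.zd:17 2 6
/ after writing, per-file compression stats can be inspected with -21!
/ e.g. -21!`:/path/to/HDB/2024.01.01/trade/a
```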
 

Parallelising .Q.dpft

Very similar to .Q.dpft, except I replace the each-both with a peach
{[d;p;f;t]
	i:iasc t f;
	tab:.Q.en[d;`. t];
	.[{[d;t;i;c;a]@[d;c;:;a t[c]i]}[d:.Q.par[d;p;t];tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t;
	@[d;`.d;:;f,c where not f=c];
	t
 };
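The densest part is the argument to peach. A quick breakdown of how it builds the (column; attribute-applier) pairs that feed the projected write lambda (illustrative, assuming columns `sym`price`size parted on `sym):

```q
/ illustrative breakdown, assuming cols `sym`price`size parted on `sym
c:`sym`price`size
f:`sym
f=c                  / 100b - marks the parted column
(::;`p#)f=c          / (`p#;::;::) - `p# projection for sym, identity otherwise
flip(c;)(::;`p#)f=c  / ((`sym;`p#);(`price;::);(`size;::)) - (column;applier) pairs
/ each pair is applied with . to the projected write lambda, one pair per
/ peach task, so columns are compressed and written in parallel
```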
 

Testing

Before each test I would
  • Start a new q session
  • Clear out the HDB directory
  • Define default compression with .z.zd
  • Create table to be saved down
By running the following code:
// Set default compression and delete HDB
.z.zd:17 2 6;
system"rm -r /home/alivingston/HDB/*";
dir:`:/home/alivingston/HDB;

// define parallelised .Q.dpft
func:{[d;p;f;t]
	i:iasc t f;
	tab:.Q.en[d;`. t];
	.[{[d;t;i;c;a]@[d;c;:;a t[c]i]}[d:.Q.par[d;p;t];tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t;
	@[d;`.d;:;f,c where not f=c];
	t
 };

// Create table
n:10000000;
trade:([]timestamp:.z.p+til n;sym:n?`2;a:n?0;b:n?1f;c:string n?`3;d:n?0b;e:n?0;f:n?1f;g:string n?`3;h:n?0b);

 

I then tested the original .Q.dpft and the new function with secondary threads (slaves) set to 0, 2, 4 and 8, while logging RAM usage with top.
\ts func[dir;.z.d;`sym;`trade]
\ts .Q.dpft[dir;.z.d;`sym;`trade]
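Note that peach only runs in parallel when q has been started with secondary threads; with none available it falls back to serial execution, which is why the 0-thread run matches .Q.dpft:

```q
/ secondary threads are set on the command line, not from the q prompt
/ $ q -s 4
\s    / query the current secondary-thread count from within q
```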

 

Results

For the following tables, time (speedup factor, higher is faster) and space (relative RAM usage) have been normalised against a reference run of .Q.dpft.
threads| time  space
-------| -----------
0      | 0.992 1    
2      | 1.52  1.17 
4      | 1.8   1.32 
8      | 2.61  1.66 
 
In an attempt to limit memory usage I repeated this testing with automatic garbage collection enabled with -g 1
threads| time  space
-------| -----------
0      | 0.981 1    
2      | 1.56  1.08 
4      | 1.84  1.2  
8      | 2.63  1.49 

 

The parallelised .Q.dpft func ran 56% faster using 8% more RAM with 2 threads, and 163% faster using 50% more RAM with 8 threads.

 

Conclusion

Due to the extra memory required, this would likely not be sensible to run on an RDB at EOD.
I think the best use case for this would be when attempting to ingest years of data into kdb as fast as possible where RAM isn't an issue.
 
Comments or critiques are more than welcome. I'd be interested to know if this should be avoided for the above use case or if there are any other issues with this that I have missed.

megan_mcp
Community Manager

Thanks for posting! Looking forward to seeing if the community has any feedback on your approach.

sbruce01
New Contributor III

Tacking on some further improvements Alex and I discussed:

funcMem:{[d;p;f;t]
	i:iasc t f;
	c:cols t;
	is:(ceiling count[i]%count c) cut i;
	tab:.Q.en[d;`. t];
	{[d;tab;c;t;f;i].[{[d;t;i;c;a]@[d;c;,;a t[c]i]}[d;tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t}[d:.Q.par[d;p;t];tab;c;t;f;]each is;
	@[d;`.d;:;f,c where not f=c];
	t
 };

This reduces the memory drawback; in theory it is more memory efficient than the standard .Q.dpft. The code slices the sorted index vector into chunks sized so that one chunk of the whole table contains the same number of entries as a single column, which is the maximum amount of data .Q.dpft holds in memory at once, since it writes column by column.
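The chunk sizing can be sanity-checked with a quick sketch, using the row and column counts from the trade table above:

```q
/ with n rows and k columns, each chunk holds ceiling n%k row indices,
/ so one chunk of the full table is roughly the size of one full column
n:10000000
k:10
is:(ceiling n%k) cut til n
count is             / 10 chunks
count first is       / 1000000 indices per chunk
```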

The result of this will lead to the benefits of parallelisation as above without the memory drawback we have seen by simply adding peach.

The claim of "more memory efficient than standard .Q.dpft" holds because the chunks are sized to match the row count of a single column. Since .Q.dpft writes column by column, its peak memory usage is driven by the largest (in bytes) column. A chunk in this new method only ever holds part of that large column at a time, alongside parts of the smaller columns, so at worst, when every column has the same-sized datatype, memory usage matches .Q.dpft's.

Preliminary tests showed the maintained improvement in speed, with no memory drawback. However these tests were not standardised or conducted in an official unit testing framework. Would love to know the official results of this at some point - be that generated by myself or someone else who is curious.

sujoy
New Contributor III

This is cool. I have also used and created a similar override to .Q.dpft. The use case is valid up to kdb+ 3.6.

If you use kdb+ 4.0 with slaves, it is internally optimised to save data to disk faster (probably some C code), something I did not find in the KX release notes.

In my tests, it will beat/match the peach override.

Many Thanks

Sujoy