cancel
Showing results for 
Search instead for 
Did you mean: 

Most memory efficient way of merging tables

KdbNoob
New Contributor

I have a bunch of tables on the file system. I need to read all of them into memory and merge them together. 

It seems like using raze or (uj), I will inevitable read all of them, then merge them together. That will at least double the memory usage. 

How do I merge them efficiently? 

What if these tables are organized as played table, would that change anything? 
2 REPLIES 2

Flying
New Contributor III
It depends on what you want to do with the "merged" table.

The following are talking about loading from CSV, the ideas behind merging large data sets, however, should be applicable to your case as well:

http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles
http://code.kx.com/wiki/Cookbook/LoadingFromLargeFilesAndSplaying

Matthew_McAuley
New Contributor

Hello,

Would it be possible for you to provide some more information? Are all the tables on disk? What size are they? What memory constraints do you have?

You suggested that splaying the tables might help. When you load a splayed table, the table is mapped to memory, rather than being loaded into RAM. However upon joining, they will be loaded into RAM. I understand that your concern is loading all the tables for the join and the memory cost associated with that.

It may be suitable to join one table at a time, then upsert into your final splayed table on disk. I've outlined how you might approach this below.

Turn on immediate garbage collection:
\g 1

For instance, create an empty splayed table, with the same schema as the end goal table, t:
t::([]a:`$();b:`int$();c:`int$();d:`int$();e:`int$())
`:hdb/ujtab/ set .Q.en[`:hdb] t


Create a few sample tables to be joined:
ta:([]a:`q`w`e;b:10 11 12i;c:1 2 3i)
tb:([]a:`r`t`y;d:18 16 15i;e:112 221 332i)


Join each table (uj) and upsert into your splayed table on disk (enumerating with .Q.en, assuming you have a sym column):
{`:hdb/ujtab/ upsert .Q.en[`:hdb] t uj value x} each `ta`tb

With immediate gc turned on, kdb will free up memory as each table is joined and should keep memory usage to a minimum.

Feel free to follow up with more details and I might be able to offer a more optimised solution.

Thanks,
Matthew
AquaQ Analytics