Transferring Large Data From Server to Client

kadir_kilicoglu
New Contributor II
I am trying to implement a function that loads large chunks of data, let's say 1 billion rows, from a remote server into my local kdb+ instance. So far, here is what I have:

\d .importer

fetch_data:{[table_names;batch_size;data_date]
    / (0) Open remote connection
    h::hopen `$":redacted:redacted:redacted";
    
    / (1) Set arguments
    tabs:enlist table_names;
    bsize::batch_size;
    dday::data_date;
    
    / (2) Create tables
    { .[x;();:;()] } each tabs;
    
    / (3) For each remote table;
    /     (3.1) Get the table count
    /     (3.2) Split the table to an optimal value
    /     (3.3) Iterate over the table by using the splits
    {[tab]
      / (3.1) row count for the requested date on the remote table
      tcount:h({[t;d] count select from t where date=d};tab;dday);
      / (3.2) build (start;end) index pairs covering at most bsize rows each
      splits:$[tcount>bsize;
        [batch:tcount div bsize;
         ((0;bsize-1)+/:bsize*til batch),enlist (batch*bsize;tcount-1)];
        enlist(0;tcount-1)];
      / (3.3) pull each batch with a functional select on the virtual row index i and upsert it locally
      {[t;dday;split] t upsert h({[t;y;dday] ?[t;((=;`date;dday);(within;`i;y));0b;()]};t;split;dday) }[tab;dday] each splits;
    }each tabs;
  }

fetch_today:fetch_data[;;.z.d];
fetch_quotes_today:fetch_data[`trades;;.z.d];
fetch_quotes:fetch_data[`trades;;];
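
To make the batching concrete, here is the split expression from the function above evaluated with small, made-up numbers:

/ illustrative values only: 25 rows split into batches of 10
bsize:10; tcount:25; batch:tcount div bsize;
((0;bsize-1)+/:bsize*til batch),enlist (batch*bsize;tcount-1)
/ => (0 9;10 19;20 24)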


Then I simply load the script and call, let's say,

.importer.fetch_quotes_today[1000000]

...and it works fine for up to 40-50 million rows. But once the row count exceeds 100M, the execution takes forever. Now that I have about a billion rows, is there a more effective way to transfer data of this size directly from one kdb+ instance to another?

Regards!

4 REPLIES

wp
New Contributor
Sounds like you are running out of memory. 
Glancing at the script, it looks like you upsert in memory. You could upsert on disk. 
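
A minimal sketch of that idea in q, assuming an illustrative HDB directory `:/data/db and a batch table already pulled from the remote side (both names are placeholders, not from your script):

/ append one batch to a splayed table on disk instead of an in-memory table;
/ .Q.en enumerates symbol columns against the sym file under dir
save_batch:{[dir;tab;batch] (` sv dir,tab,`) upsert .Q.en[dir;batch]}
/ e.g. save_batch[`:/data/db;`trades] each batches

Appending to `:/data/db/trades/ keeps memory bounded by the batch size rather than the full table.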

kadir_kilicoglu
New Contributor II
Well, I didn't see any OOM or swap issues, nor any alarms in monitoring, but thank you for the suggestion. I'll try persisting directly to disk and see!

Would it not make more sense to scp or rsync in this type of scenario?

kadir_kilicoglu
New Contributor II
As @wp suggested, I switched to upserting asynchronously into a file on disk, and it seems to be working fine now! Below you can find the updated snippet:

...
...
/ (1) Set arguments
    tabs:enlist table_names;
    bsize::batch_size;
    dday::data_date;
    
    / (2) Create tables
    { .[x;();:;()] } each tabs;
    
    / (3) For each remote table;
    /     (3.1) Get the table count
    /     (3.2) Split the table to an optimal value
    /     (3.3) Fetch the meta to create a proper CSV
    /     (3.4) Open file handle
    /     (3.5) Iterate over the table by using the splits and for each batch
    /           (3.5.1) Asynchronously save data to the CSV file
    {[tab]
      tcount:h({[t;d] count select from t where date=d};tab;dday);
      splits:$[tcount>bsize; [batch:tcount div bsize;((0;bsize-1)+/:bsize*til batch),enlist (batch*bsize;tcount-1)];enlist(0;tcount-1)];
      / (3.3) fetch the remote meta and write the CSV header line once
      tmeta:key h({[t] meta t};tab);
      (hsym `$ raze string tab,".csv") 0: enlist "," sv raze string flip tmeta cols tmeta;
      / (3.4) open an append handle on the CSV file
      fhandle:hopen (hsym `$ raze string tab,".csv");
      / (3.5) fetch each batch, drop the header row that "," 0: prepends, and append the lines
      {[t;dday;tcount;fhandle;split]
        neg[fhandle] peach 1_ "," 0: h({[t;y;dday] ?[t;((=;`date;dday);(within;`i;y));0b;()]};t;split;dday);
        show raze string split[0],"-",split[1]," of ", tcount; }[tab;dday;tcount;fhandle] each splits;
      hclose fhandle; 
    }each tabs;
    hclose h;
...
...
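
For completeness, the CSV produced above can be read back into the local instance with 0:. This is only a sketch: the type string "DSFJ" is illustrative and has to match the column types reported by the remote meta.

/ read trades.csv back into an in-memory table; the first row is taken as the header
trades:("DSFJ";enlist ",") 0: `:trades.csv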

Thanks for the advice!

Regards @wp and @Morten.