Optimizing CSV import

LamHinYan
New Contributor
The following program parses CSV into an in-memory table at about 22 MB/sec. I have repeated the benchmark a few times; the data is in the OS disk cache, and the program is CPU-bound on a single core. My CPU is an i5-3230M @ 2.60 GHz in a laptop. How can I make it faster? What have I missed in the tutorials? Thx.

http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles
http://code.kx.com/wiki/Cookbook/LoadingFromLargeFilesAndSplaying

/ csvParsingBenchmark.q
fieldNames: `tsStr`sym`tradePrice`cumTradeVol`bidFlag`bidPrice1`bidVol1`bidPrice2`bidVol2`bidPrice3`bidVol3`bidPrice4`bidVol4`bidPrice5`bidVol5`askFlag`askPrice1`askVol1`askPrice2`askVol2`askPrice3`askVol3`askPrice4`askVol4`askPrice5`askVol5;
formatString: "*SFJSFJFJFJFJFJSFJFJFJFJFJ";
csvFile: `$":/tmp/marketdata/20140130.csv";

a:0;
.Q.fsn[{flip fieldNames!(formatString;",") 0: x; show (string .z.P), " ", (string a+:1)}; csvFile; 100*1024*1024]

$ tail -5 20140130.csv
20140130_120000_030000,28964,0.032,4030000,B,0.029,100000,0.027,480000,0.026,480000,0.025,1780000,0.024,480000,A,0.036,480000,0.038,480000,0.039,480000,0.045,999999,999999.000,999999
20140130_120000_030000,28964,0.032,4030000,B,0.029,100000,0.027,480000,0.026,480000,0.025,1780000,0.024,480000,A,0.036,480000,0.037,480000,0.038,480000,0.039,480000,0.045,999999
20140130_120000_030000,28964,0.032,4030000,B,0.029,100000,0.027,480000,0.026,480000,0.025,1780000,0.024,480000,A,0.036,480000,0.037,480000,0.039,480000,0.045,999999,999999.000,999999
20140130_120000_030000,28964,0.032,4030000,B,0.029,100000,0.027,480000,0.026,480000,0.025,1780000,0.024,480000,A,0.036,480000,0.037,480000,0.045,999999,999999.000,999999,999999.000,999999
20140130_120000_030000,67387,999999.000,0,B,0.244,1000000,0.243,1000000,0.242,1000000,0.241,1000000,0.240,1000000,A,0.246,1000000,0.247,1000000,0.248,1000000,0.249,1000000,0.250,1000000
12 REPLIES

jlucid
New Contributor
Have you plotted total read time vs chunk size? There should be a minimum in this curve corresponding to the optimum chunk size for your machine.
Just as a double check, if you haven't already done so...

I have just plotted total read time vs block size.

It takes about 4.5 sec to read a 128 MB CSV file under various block sizes, i.e. about 28 MB/sec on average. The CPU is an i5-2400 @ 3.1 GHz in a desktop. Speed stopped increasing above a 64 KB block size. How can I make it faster? Thx.

"2015.03.14D08:21:08.150399000 - 0D00:00:15.505887000 1 k block"
"2015.03.14D08:21:17.790950000 - 0D00:00:09.640551000 2 k block"
"2015.03.14D08:21:24.702345000 - 0D00:00:06.911395000 4 k block"
"2015.03.14D08:21:30.300666000 - 0D00:00:05.598321000 8 k block"
"2015.03.14D08:21:35.237948000 - 0D00:00:04.936282000 16 k block"
"2015.03.14D08:21:39.991220000 - 0D00:00:04.753272000 32 k block"
"2015.03.14D08:21:44.511479000 - 0D00:00:04.520259000 64 k block"
"2015.03.14D08:21:48.921731000 - 0D00:00:04.410252000 128 k block"
"2015.03.14D08:18:15.374517000 - 0D00:00:04.427254000 256 k block"
"2015.03.14D08:18:19.814771000 - 0D00:00:04.439254000 512 k block"
"2015.03.14D08:18:24.336029000 - 0D00:00:04.521258000 1024 k block"
"2015.03.14D08:18:28.877289000 - 0D00:00:04.541260000 2048 k block"
"2015.03.14D08:18:33.387547000 - 0D00:00:04.510258000 4096 k block"
"2015.03.14D08:18:37.888804000 - 0D00:00:04.500257000 8192 k block"
"2015.03.14D08:18:42.397062000 - 0D00:00:04.508258000 16384 k block"
"2015.03.14D08:18:46.939322000 - 0D00:00:04.541260000 32768 k block"
"2015.03.14D08:18:51.530585000 - 0D00:00:04.590263000 65536 k block"
"2015.03.14D08:18:56.151849000 - 0D00:00:04.621264000 131072 k block"

/ csvParsingBenchmark.q
fieldNames: `tsStr`sym`tradePrice`cumTradeVol`bidFlag`bidPrice1`bidVol1`bidPrice2`bidVol2`bidPrice3`bidVol3`bidPrice4`bidVol4`bidPrice5`bidVol5`askFlag`askPrice1`askVol1`askPrice2`askVol2`askPrice3`askVol3`askPrice4`askVol4`askPrice5`askVol5;
formatString: "*SFJSFJFJFJFJFJSFJFJFJFJFJ";
/csvFile: `$":/tmp/marketdata/a.csv";

/ parse one chunk into a table
parseCsvPart:{
    flip fieldNames!(formatString;",") 0: x};

/ always the same 128 MB test file; the pass index argument is ignored
createFilename:{
    `$":/tmp/marketdata/128M.csv"};

/ time .Q.fsn over the file, doubling the block size each pass (1 KB, 2 KB, ... 128 KB);
/ raise the loop bound to cover the larger blocks in the timings above
a:0;
while[a<8;
    blockSize: `long$1024*xexp[2;a];
    t0: .z.P;
    .Q.fsn[parseCsvPart; createFilename[a]; blockSize];
    t1: .z.P;
    dt: t1-t0;
    show (string t1), " - ", (string dt), " ", (string blockSize%1024), " k block";
    a+:1]

The default chunk size of 131000 may be there for a good reason. I got no improvement above 64 KB ^_^

http://code.kx.com/wiki/DotQ/DotQDotfsn
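
(Editor's sketch, not from the original thread.) .Q.fs is .Q.fsn with a default chunk size of 131000 bytes, so one quick check is to call both on the same file and handler and confirm the timings agree; this assumes the fieldNames, formatString and csvFile from the first post:

f: {flip fieldNames!(formatString;",") 0: x};   / same per-chunk parser as above
\t .Q.fs[f; csvFile]                            / default chunking
\t .Q.fsn[f; csvFile; 131000]                   / the default chunk size made explicit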

Yes, it would be nice to get a more detailed explanation of that choice.

In the documentation the only optimization they suggest (that I saw) is the chunk size: "particularly useful when balancing load speed and RAM usage".

I was hoping that by tinkering with the chunk size, and paying the RAM usage penalty, you might get a load speed improvement, but your tests show no significant speedup beyond that limit.

Hope someone else can help you optimize the in-memory operation.

Well, I can read 4M rows, parse 1M rows on each of my 4 cores, and then concat the 4 slices together. Any other idea? ^_^

t: raze {flip (fieldNames)!1_/:(formatString;csv) 0: (hsym `$"/tmp/marketdata/file", string 1+x) } peach til 4

In this case the big file is already pre-split, each part a separate file with 1M rows.
Is this what you have tried, rather than using .Q.fsn?
I think we are cheating a little now, using extra cores 🙂
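
(Editor's note, not from the original thread.) For peach and .Q.fc to actually use several cores, q has to be started with secondary threads, e.g.:

$ q -s 4
q)system"s"      / number of secondary threads available to peach and .Q.fc
4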

/ 2.2 sec to parse 128 MB on 4 cores
/ The file is bigger than 1 GB. .Q.fs is used.

fieldNames: `tsStr`sym`tradePrice`cumTradeVol`bidFlag`bidPrice1`bidVol1`bidPrice2`bidVol2`bidPrice3`bidVol3`bidPrice4`bidVol4`bidPrice5`bidVol5`askFlag`askPrice1`askVol1`askPrice2`askVol2`askPrice3`askVol3`askPrice4`askVol4`askPrice5`askVol5;
formatString: "*SFJSFJFJFJFJFJSFJFJFJFJFJ";
/csvFile: `$":/tmp/marketdata/a.csv";

/ parse one chunk into a table
parseCsvSingleThread:{
    flip fieldNames!(formatString;",") 0: x};

/ .Q.fc cuts the chunk across the secondary threads and razes the per-slice results
parseCsvParallel:{
    .Q.fc[{parseCsvSingleThread x}] x};

parseCsv: parseCsvParallel;

csvFile: `$":/tmp/marketdata/128M.csv";
/ append each parsed chunk to quote (initialise quote, e.g. quote:(), beforehand if it does not already exist)
.Q.fs[{quote,: parseCsv x}; csvFile];

Ok Yan, I see what you mean. So you got a 2x speedup from 4 cores, which is 50% scaling efficiency. This is around the same as the tests done here using the parallel read on split files: almost a 3x speedup for 6 threads (it doesn't say the number of cores).

http://www.firstderivatives.com/downloads/q_for_Gods_August_2013.pdf


Can anyone help speed this up more?

sohagan
New Contributor
Are your disks busy?
Are your CPUs busy? Maybe use taskset to pin to one core to get fair results?

Remember what you are timing has to:
1) read in a chunk
2) split on ","
3) convert each column to the correct type

And the smaller the chunks you read, the more times this has to happen.

Can you maybe try without .Q.fs and see what speeds you are getting?
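
(Editor's sketch, not from the original thread.) A minimal way to separate the read cost from the parse cost, assuming the fieldNames and formatString from the earlier posts and enough RAM to hold the whole file in memory:

lines: read0 `$":/tmp/marketdata/128M.csv";      / whole file as a list of lines
\t read0 `$":/tmp/marketdata/128M.csv"           / 1) read + split into lines
\t (formatString;",") 0: lines                   / 2+3) field split and type conversion only
\t flip fieldNames!(formatString;",") 0: lines   / full single-threaded parse, no chunking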

Perhaps you should try out Simon's script. It is very well written, well tested, avoids a while loop, and has many other features you can avail of. Plus it has dynamic chunk sizes based on MB size. http://kx.com/q/e/csvguess.q. It outputs time taken, records per second, and speed, so it could be a lot of help to you.

q csvguess.q FILE -savescript
q csvguess.custom.q FILE -bl -chunksize 40

HTH and let us know,
Sean



I have tried csvguess.q. It loads at 36 MB/sec for chunk sizes from 4 to 128 MB. My disks are not busy; I ran the benchmark on a ramdisk. It is single-core CPU-bound: the max CPU% I have seen is 25% on a 4-core machine.

$ seq 2 7 | perl -ne 'chomp; print 2**$_; print qq{\n}' | xargs -n1 --verbose q 20140130.load.q f:/512M.csv -bl -chunksize
q 20140130.load.q f:/512M.csv -bl -chunksize 4
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:19:13 loading <f:/512M.csv> to variable DATA
07:19:28 done (2880282 records; 191814 records/sec; 36 MB/sec; CHUNKSIZE 4)
q 20140130.load.q f:/512M.csv -bl -chunksize 8
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:19:28 loading <f:/512M.csv> to variable DATA
07:19:43 done (2880282 records; 192390 records/sec; 36 MB/sec; CHUNKSIZE 8)
q 20140130.load.q f:/512M.csv -bl -chunksize 16
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:19:43 loading <f:/512M.csv> to variable DATA
07:19:58 done (2880282 records; 192493 records/sec; 36 MB/sec; CHUNKSIZE 16)
q 20140130.load.q f:/512M.csv -bl -chunksize 32
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:19:58 loading <f:/512M.csv> to variable DATA
07:20:13 done (2880282 records; 191890 records/sec; 36 MB/sec; CHUNKSIZE 32)
q 20140130.load.q f:/512M.csv -bl -chunksize 64
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:20:13 loading <f:/512M.csv> to variable DATA
07:20:28 done (2880282 records; 191942 records/sec; 36 MB/sec; CHUNKSIZE 64)
q 20140130.load.q f:/512M.csv -bl -chunksize 128
KDB+ 3.2 2015.02.10 Copyright (C) 1993-2015 Kx Systems
w32/ 4()core 4095MB yan yani5 192.168.1.5 NONEXPIRE

07:20:28 loading <f:/512M.csv> to variable DATA
07:20:43 done (2880282 records; 191724 records/sec; 36 MB/sec; CHUNKSIZE 128)

LamHinYan
New Contributor
The following code parses CSV at 80 MB/sec on a 4-core, 4-thread, 3 GHz+ CPU. My next problem is parsing from stdin piped from "zcat bigFile.csv.gz". Please post your suggestions. Thx.

fieldNames: `tsStr`sym`tradePrice`cumTradeVol`bidFlag`bidPrice1`bidVol1`bidPrice2`bidVol2`bidPrice3`bidVol3`bidPrice4`bidVol4`bidPrice5`bidVol5`askFlag`askPrice1`askVol1`askPrice2`askVol2`askPrice3`askVol3`askPrice4`askVol4`askPrice5`askVol5;
formatString: "*SFJSFJFJFJFJFJSFJFJFJFJFJ";
/csvFile: `$":/tmp/marketdata/a.csv";

/ parse a chunk of csv text into a table
parseCsv:{
    flip fieldNames!(formatString;",") 0: x};
show .z.P;
csvFile: `$":/tmp/marketdata/10lines.csv";
csvFile: `$":/tmp/marketdata/10MB.csv";
csvFile: `$":/tmp/marketdata/128MB.csv";
/.Q.fsn[{insert[`quote] .Q.fc[parseCsv] x}; csvFile; 4*1024*1024];

chunkSize:10*1024*1024;

/ move a raw byte offset forward to just past the next newline, so each chunk starts on a record boundary
adjustBound:{[csvFile; p]
    if[p=0; :0];
    if[p=hcount csvFile; :p];
    1+p+(raze read1(csvFile; p; 1024)) ? 0xa};

/ one job per chunk: line-aligned begin offset, end offset and byte length
jobs:([] begin: chunkSize*til (hcount csvFile) div chunkSize);
update adjustBound[csvFile] each begin from `jobs;
jobs[`end]:(-1_next jobs[`begin]), (hcount csvFile);
update length: end-begin from `jobs;
/ read and parse the chunks in parallel, then join the partial tables
\ts:10 raze { parseCsv read0(csvFile; x[`begin]; x[`length])} peach jobs
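
(Editor's sketch, not from the original thread.) On the zcat/stdin question: in kdb+ versions that provide .Q.fps (streaming over a pipe), one option is to decompress into a named pipe and stream chunks from it with the same per-chunk parser; the fifo path and the quote table are only illustrative.

$ mkfifo /tmp/marketdata/fifo
$ zcat bigFile.csv.gz > /tmp/marketdata/fifo &

q)quote: ();                                              / start empty; becomes a table after the first appended chunk
q).Q.fps[{quote,: parseCsv x}; `$":/tmp/marketdata/fifo"]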

LamHinYan
New Contributor
chunkSize:120*1024;

I got 98 MB/sec after REDUCING the chunk size from 10 MB to 120 KB. The default chunk size of 131000 is probably chosen to be smaller than half of 256 KB, the CPU's level-2 cache size per core: code + input data + output data < 256 KB.

http://code.kx.com/wiki/DotQ/DotQDotfsn
http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors
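
(Editor's sketch, not from the original thread.) A rough way to test the L2-cache hypothesis on another machine: reuse parseCsv, adjustBound and csvFile from the parallel loader above and time the same peach job for a few chunk sizes around the cache size.

sweep:{[cs]
    b: adjustBound[csvFile] each cs*til (hcount csvFile) div cs;   / line-aligned chunk starts
    e: (1_b), hcount csvFile;                                      / each chunk ends where the next begins
    jobs: ([] begin:b; end:e; length:e-b);
    t0: .z.P;
    raze {parseCsv read0(csvFile; x`begin; x`length)} peach jobs;
    .z.P-t0};
show sweep each 1024*64 120 256 1024 10240                         / 64 KB, 120 KB, 256 KB, 1 MB, 10 MB chunks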