
Format of splayed files

skuvvv
New Contributor
Is information about the structure of splayed files public?
E.g. I want to update a splayed table outside kdb, because kdb is limited by memory.
I can get the count of elements in a file and calculate the header size by working backwards, but I am curious about the structure of the headers.
16 REPLIES

LamHinYan
New Contributor
Counting backwards works for noattr and sorted columns. Look at the a, b, c, d item values in the dumps below.

tab:([]
  noattr:10+til 4;
  sorted:`s#10+til 4;
  parted:`p#10+til 4;
  grouped:`g#10+til 4
  );

`:/tmp/hdb/2015.09.21/tab/ set tab


yan@yani5 /tmp/hdb/2015.09.21/tab
$ ls | xargs -n1 --verbose od -tx8
od -tx8 grouped
0000000 00000000040720fe 0000000000000004
0000020 000000000000000a 000000000000000b
0000040 000000000000000c 000000000000000d
0000060 0000000102070503 0000000000000004
0000100 000000000000000a 000000000000000b
0000120 000000000000000c 000000000000000d
0000140 0000000000000000 ffffffffffffffff
0000160 ffffffffffffffff ffffffffffffffff
0000200 ffffffffffffffff 0000000000000003
0000220 0000000000000002 0000000000000001
0000240 0000000000070302 0000000000000005
0000260 0000000000000000 0000000000000001
0000300 0000000000000002 0000000000000003
0000320 0000000000000004 0000000000070302
0000340 0000000000000004 0000000000000000
0000360 0000000000000001 0000000000000002
0000400 0000000000000003
0000410
od -tx8 noattr
0000000 00000000000720fe 0000000000000004
0000020 000000000000000a 000000000000000b
0000040 000000000000000c 000000000000000d
0000060
od -tx8 parted
0000000 00000000030720fe 0000000000000004
0000020 000000000000000a 000000000000000b
0000040 000000000000000c 000000000000000d
0000060 0000000002070303 0000000000000004
0000100 000000000000000a 000000000000000b
0000120 000000000000000c 000000000000000d
0000140 0000000000000000 ffffffffffffffff
0000160 ffffffffffffffff ffffffffffffffff
0000200 ffffffffffffffff 0000000000000003
0000220 0000000000000002 0000000000000001
0000240 0000000000070202 0000000000000005
0000260 0000000000000000 0000000000000001
0000300 0000000000000002 0000000000000003
0000320 0000000000000004
0000330
od -tx8 sorted
0000000 00000000010720fe 0000000000000004
0000020 000000000000000a 000000000000000b
0000040 000000000000000c 000000000000000d
0000060

Thanks for the information!

Suppose KX, or you, delivered a spec: what would you see in a column vector file?

flags
length of the list
index (for lists of lists: offsets)
satellite data

The flags take constant time to read or write through q programs.

The length takes constant time to update if you append to a noattr list. You may need to update the sorted flag. Why don't you just upsert in Q?

The index will probably take linear time to update in the general case, i.e. rewriting the whole thing. Why don't you just rewrite in Q?

The satellite data will probably take linear time to update, too. Why don't you just rewrite in Q?
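To make the constant-time length update concrete: under the layout inferred from the od dumps (again, no official spec — a sketch only), appending to a noattr long vector means writing the new item at the end of the file and bumping the count word at offset 8:

```python
import os, struct, tempfile

def append_long(path: str, value: int) -> None:
    """Append one 8-byte long to a noattr long-vector file and bump the
    count word at offset 8 (layout inferred from the od dumps). Constant
    time: only the new item and the count are written."""
    with open(path, "r+b") as f:
        f.seek(8)
        (count,) = struct.unpack("<Q", f.read(8))
        f.seek(0, 2)                        # jump to end of file
        f.write(struct.pack("<q", value))
        f.seek(8)
        f.write(struct.pack("<Q", count + 1))

# Demo on a throwaway file holding the bytes of the 'noattr' dump above.
path = os.path.join(tempfile.mkdtemp(), "noattr")
with open(path, "wb") as f:
    f.write(struct.pack("<QQ4q", 0x00000000000720FE, 4, 10, 11, 12, 13))
append_long(path, 14)
data = open(path, "rb").read()
count = struct.unpack_from("<Q", data, 8)[0]
items = struct.unpack_from("<%dq" % count, data, 16)
print(count, items)   # -> 5 (10, 11, 12, 13, 14)
```

As noted above, this only works for plain noattr vectors; columns with attributes carry index sections that would have to be rewritten.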

I have read your last few posts. Random-access in-place overwrite of a particular element is the only operation you can do in C but not in q. Is there anything else you want to do in C, bypassing q?

The first thing I cannot do with kdb is sorting, so I am looking for an alternative way to sort outside kdb.

Sample data, please. What are you trying to sort?

Nothing special: 100M+ rows in the table, sorting 2-3 rows using xasc on disk.

Do you mean sorting 2-3 consecutive rows within a splayed table with 100M+ rows?

I mean 2-3 cols, e.g. sorting by symbol, by time, etc.

Why can't you use kdb sort?

skuvvv
New Contributor
Example scripts:
create table:
chunkSize:10000000
chunks:25
dir:`:f:/temp
do[chunks;
  table:([]symbol:chunkSize?`3;time:chunkSize?.z.t;price:chunkSize?100f;size:chunkSize?1000i);
  (` sv dir,`table`) upsert .Q.en[dir] table;
  ]
table:()    / drop the reference so the memory can be reclaimed
.Q.gc[]



and then sort:
dir: `:f:/temp
`symbol`time xasc (` sv dir, `table`)


Your table is 4.65 GB and your price column is 1.9 GB, so you have exceeded the 32-bit kdb+ memory limits. The size, symbol, and time columns are 900 MB+ each, which is close to the 1 GB per-vector limit.
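The quoted sizes follow from simple arithmetic on the script: 25 chunks of 10M rows is 250M rows; price is an 8-byte float, time and size are 4 bytes each, and the enumerated symbol column is assumed here to be stored as 4-byte indices:

```python
# Column sizes for 25 chunks x 10M rows = 250M rows (GiB = 1024**3 bytes).
rows = 25 * 10_000_000
GiB = 1024 ** 3

price_gb  = rows * 8 / GiB   # price: 8-byte float
symbol_gb = rows * 4 / GiB   # symbol: assumed 4-byte enumeration indices
time_gb   = rows * 4 / GiB   # time: 4 bytes
size_gb   = rows * 4 / GiB   # size: 4-byte int

print(round(price_gb, 2))                                    # -> 1.86
print(round(price_gb + symbol_gb + time_gb + size_gb, 2))    # -> 4.66
```

which lands on the roughly 4.65 GB and 1.9 GB figures quoted above.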

Then reduce chunks to 10 and you will get the same result.
That is why I want to do this outside kdb...

I got OOM when chunks=10. It worked when chunks=5. kdb needs temporary space to manipulate data.

Try making a permutation index list with iasc, then apply the permutation to each column. You should only need enough memory to hold one column at a time. Take a look at disksort[] here.

q for Gods Whitepaper Series (Edition 16): Intraday Writedown Solutions, March 2014
http://www.firstderivatives.com/downloads/q_for_Gods_March_2014.pdf
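Language aside (the whitepaper does this in q), the disksort idea can be sketched in a few lines: argsort the key column once (q's iasc), then apply that permutation to each column in turn, so only one column plus the permutation needs to be in memory. Plain lists stand in for column files here:

```python
def sort_columns(key_col, other_cols):
    """Reorder all columns by ascending key: build one permutation from
    the key column (q's iasc), then apply it column by column, so only
    one column plus the permutation is held in memory at a time."""
    perm = sorted(range(len(key_col)), key=key_col.__getitem__)   # iasc key_col
    reorder = lambda col: [col[i] for i in perm]                  # col[perm]
    return reorder(key_col), [reorder(c) for c in other_cols]

sym  = ["b", "a", "c", "a"]
time = [3, 1, 4, 2]
sorted_sym, (sorted_time,) = sort_columns(sym, [time])
print(sorted_sym, sorted_time)   # -> ['a', 'a', 'b', 'c'] [1, 2, 3, 4]
```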

They use iasc, which is too greedy with memory, so I cannot sort even one column of a table with 100M rows.
I have already tested everything you suggested...

Do a CSV export, then use the 64-bit Linux sort command, then import again. Memory management in 32-bit kdb is a lot of fun.
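The external-sort idea behind the Linux sort command — sort chunks that fit in memory, then k-way merge them — can be sketched in Python with heapq.merge. This is a toy stand-in for illustration, not the actual export/import pipeline:

```python
import heapq

def external_sort(rows, chunk_size=2):
    """Sort an iterable using bounded memory: sort fixed-size chunks
    independently, then k-way merge them. GNU sort does the same with
    temp files; in-memory chunks stand in for the files here."""
    chunks, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) == chunk_size:
            chunks.append(sorted(buf))
            buf = []
    if buf:
        chunks.append(sorted(buf))
    return list(heapq.merge(*chunks))

print(external_sort([5, 3, 9, 1, 4]))   # -> [1, 3, 4, 5, 9]
```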