
hdb creation

andyturk
New Contributor
I've written some code to import a series of csv files and store them in the hdb splayed/partitioned format for q. Things are mostly working, but I have a few questions for the more experienced q-wizards here:

1) When I put two tables into the same partition directory, q sometimes doesn't recognize one of them. This seems to happen when one of the tables doesn't occupy all the partitions of the other. For example, suppose I've imported two tables, A and B, into a directory called "dir". A has data in two partitions, 2009.08.01 and 2009.08.02, while B has data only in 2009.08.02. When I fire up q with "q dir", table A exists, but B does not. This happens even though the directory "dir/2009.08.02/B" contains the expected files. How do I get q to recognize both tables?

2) My import code is written with .Q.fs, which appears to read in a fairly small amount of information each time. I'm getting about 3700 lines per chunk from a fairly narrow table with only five columns. Is there a way to have .Q.fs work in larger chunks?

3) I've noticed that q supports partitioning by date and by month. However, partitioning by symbol (e.g., by ticker symbol) doesn't work. That is to say, when I partition by a ticker symbol and try to load the data in q, it's not recognized. Are the allowable partition types documented anywhere?

Thanks.
14 REPLIES

jake_mccrary
New Contributor
1) As far as I can tell, if a table is not present in the first date partition then q does not know it exists. An easy way to fix this is to run the .Q.chk[dbdirpath] function. This adds empty tables to partition directories where they do not exist.
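A minimal sketch of that fix (the db path is hypothetical):

.Q.chk `:dir    / writes an empty copy of each table into any partition that lacks it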

2) You can modify .Q.fs. If you look at the code you will see a number in there. Bump that up to chunk in larger sizes.
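Later kdb+ releases also expose .Q.fsn, which takes the chunk size in bytes as an explicit third argument, so the hack isn't needed there. A minimal sketch, assuming a headerless five-column csv like the one in this thread (the file name and schema are hypothetical):

f:{`future insert flip `ticker`date`time`price`volume!("SDTFJ";",") 0: x}
.Q.fsn[f; `:future.csv; 10485760]    / loop f over the file in 10MB chunks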

3) I can't point you to a document, but I believe allowable partition types are date, month, and year.

On 03.08.2009, at 15:27, Jake wrote:
> 3) I can't point you to a document, but I believe allowable partition types are date, month, and year.

http://kx.com/q/d/a/q.htm#Parallel%20Database

date, month, year or int(eger)
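For reference, writing into a date (or int) partition follows the same pattern; a minimal sketch with hypothetical paths and data, using .Q.en to enumerate the symbol column:

t:([] ticker:`DXH07`ERH07; time:09:30:00.000 09:30:01.000; price:89.39 104.5; volume:3 10)
`:dir/2009.08.02/future/ set .Q.en[`:dir] t    / date partition; an int partition would use e.g. `:dir/42/future/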

On Aug 3, 9:42 pm, Simon Garland wrote:
> On 03.08.2009, at 15:27, Jake wrote ...

Thanks guys. .Q.chk did the trick; q now sees both my tables. Also, hacking .Q.fs to use a larger chunk size made a difference: a 10x increase in the chunk size resulted in a 3x speedup for the overall import. It loads at about 210K csv records per second now, which is livable.

Playing with the data resulted in some more head scratching, though. The query performance seems decent, but there appears to be a significant memory leak. When I have q total trade volumes by symbol across the whole table, the q process grabs nearly a gigabyte of VM each execution and dies on the 4th try. Could this be related to my homegrown import code?

Here's some console output showing the problem:

KDB+ 2.5 2009.02.13 Copyright (C) 1993-2009 Kx Systems
m32/ 2()core 4096MB andy andy-turks-macbook-pro.local 10.211.55.2 PLAY 2009.05.14

q)count future
212775701j
q)\t r:select sum volume by ticker from future
4100
q)\t r:select sum volume by ticker from future
4036
q)\t r:select sum volume by ticker from future
4099
q)\t r:select sum volume by ticker from future
k){0!(?).@[x;0;p1[;y;P0::z]]}
'ticker.?(+`ticker`date`time`price`volume!`:./2007.08/future;();(,`ticker)!,`ticker;(,..
q))\
q)r
ticker| volume
------| --------
DXH07 | 3
DXH08 | 228248
DXH09 | 311908
DXM06 | 29
DXM07 | 104
DXM08 | 381560
DXM09 | 111060
DXU06 | 22
DXU07 | 109316
DXU08 | 400209
DXU09 | 2487
DXZ06 | 149
DXZ07 | 161025
DXZ08 | 379855
DXZ09 | 9
ERH07 | 10323294
ERH08 | 16067359
ERH09 | 16594435
ERM06 | 7201540
ERM07 | 11734516
..
q)

charlie
New Contributor II
we've fixed a whole bunch of things in the last few months.
You'd do well to try the latest release.

Other thing to note is that symbols are internalized - i.e. once allocated they cannot be freed. If you have strings that are not repetitive, maybe you are better off using char vectors.
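A minimal sketch of that conversion (the table and column names are hypothetical):

t:([] id:`ord1`ord2`ord3; px:1 2 3f)
update id:string id from t    / the symbol column becomes a list of char vectors, which q can free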

> we've fixed a whole bunch of things in the last few months. You'd do well to try the latest release.

OK, got it. Same problem though. On the 4th summation of volume, the system falls over.

> Other thing to note is that symbols are internalized - i.e. once allocated they cannot be freed. If you have strings that are not repetitive, maybe you are better off using char vectors.

There are 47 unique symbols in the entire dataset, so I hope that's not it. :-)

My hdb, after being loaded by q, looks like this:

q)future
month   ticker date       time         price volume
---------------------------------------------------
2006.04 DXM06  2006.04.02 23:00:19.000 89.39 0
2006.04 DXM06  2006.04.02 23:00:27.000 89.39 0
2006.04 DXM06  2006.04.02 23:06:51.000 89.39 0
2006.04 DXM06  2006.04.02 23:09:04.000 89.46 0
..

The first column, month, doesn't exist in the actual data, but is added by the partitioning (`month$date).

It looks like the vm blowout is the cause of the query failure. The result is maybe a few hundred bytes, so why would calculating it a second time gobble up an additional ~1GB of vm? Can I find out where this memory is going?

Hmm, it looks like the vm is going to memory-mapped files. Assuming that q calculated the first result correctly, it shouldn't have to map "new" files to calculate subsequent results. Is there a way to get a list of which files are currently being mapped? My guess is that some are being mapped multiple times.

Here's another trace showing workspace info:

$ q hdb
KDB+ 2.5 2009.08.04 Copyright (C) 1993-2009 Kx Systems
m32/ 2()core 4096MB andy andy-turks-macbook-pro.local 10.211.55.2 PLAY 2009.11.02

q)\w
102816 67108864 0 0j
q)count future
212775701j
q)\w
107328 67108864 0 851103380j
q)r:select sum volume by ticker from future
q)\w
109088 67108864 0 1702206760j
q)r:select sum volume by ticker from future
q)\w
109088 67108864 0 2553310140j
q)r:select sum volume by ticker from future
q)\w
109088 67108864 0 3404413520j
q)r:select sum volume by ticker from future
k){0!(?).@[x;0;p1[;y;P0::z]]}
'ticker.?(+`ticker`date`time`price`volume!`:./2007.08/future;();(,`ticker)!,`ticker;(,..
q))\w
116560 67108864 0 3594911260j
q))\
q)

Since I'm on OS X, I tried looking at the Activity Monitor app to see which files had been opened. This showed that after my query, all the 'volume' files remained open, one from each partition directory. However, the list of open files didn't change when I ran the query multiple times.

To get at that info programmatically, one can use the vmmap program, which also displays mapped files (along with lots of other information). Interestingly, vmmap *did* show files being mapped multiple times. A ha!

So it looks like q is re-mapping tables that it already has in memory, and on the 32-bit version, it runs out of vm fairly quickly. Seems like a bug, no?

-----------------------
$ q hdb
KDB+ 2.5 2009.08.04 Copyright (C) 1993-2009 Kx Systems
m32/ 2()core 4096MB andy andy-turks-macbook-pro.local 10.211.55.2 PLAY 2009.11.02

q)\w
102816 67108864 0 0j
q)count each group `$'system("vmmap ",(string .z.i),"|awk '/^mapped/{print $8}'")
q)/ *** no mapped files yet ***
q)r:select sum volume by ticker from future
q)\w
105024 201326592 0 851103380j
q)count each group `$'system("vmmap ",(string .z.i),"|awk '/^mapped/{print $8}'")
/Users/andy/p4/tickwrite/hdb/2006.04/future/volume| 1
/Users/andy/p4/tickwrite/hdb/2006.05/future/volume| 1
/Users/andy/p4/tickwrite/hdb/2006.06/future/volume| 1
..
q)r:select sum volume by ticker from future
q)count each group `$'system("vmmap ",(string .z.i),"|awk '/^mapped/{print $8}'")
/Users/andy/p4/tickwrite/hdb/2006.04/future/volume| 2
/Users/andy/p4/tickwrite/hdb/2006.05/future/volume| 2
/Users/andy/p4/tickwrite/hdb/2006.06/future/volume| 2
..
q)\w
108832 67108864 0 1702206760j
q)

charlie
New Contributor II
Very likely the header on the volume files has a bad field.
How were those files generated?

e.g. run

hexdump volume | head

and look at byte offsets 4,5,6,7 - they should be zero.

$ hexdump badfile | head
0000000 ff 20 00 00 20 3e 00 00 06 00 00 00 80 96 98 00
0000010 39 00 00 00 eb 02 00 00 ba 01 00 00 84 01 00 00
$ hexdump goodfile | head
0000000 ff 20 00 00 00 00 00 00 06 00 00 00 80 96 98 00
0000010 39 00 00 00 eb 02 00 00 ba 01 00 00 84 01 00 00
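The same check can be run from inside q; a small sketch (the path is hypothetical):

8#read1 `:hdb/2006.04/future/volume    / first 8 header bytes; offsets 4-7 should all be 00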

> Very likely the header on the volume files has a bad field.

My files seem to match your "badfile":

$ hexdump hdb/2006.04/future/volume | head
0000000 ff 20 00 00 20 3e 00 00 06 00 00 00 6c 21 00 00
0000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

> How were those files generated?

My code uses "`:path upsert t" (where t is an actual table, not a reference) to write the partitioned tables to disk. There's also a final pass with .Q.chk, but I don't think that touches existing files.

Aside from two bytes of magic, what's the difference between a good file and a bad one?
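For reference, a write along those lines might look like the following; the paths and table are hypothetical, with .Q.en handling the symbol column as before:

t:([] ticker:enlist `DXM06; date:enlist 2006.04.02; time:enlist 23:00:19.000; price:enlist 89.39; volume:enlist 0)
`:hdb/2006.04/future/ upsert .Q.en[`:hdb] t    / appends to the splayed table in that partition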

charlie
New Contributor II
Andy,

my private emails to you do not appear to be getting through. Maybe they are going straight to your spam folder?

We have identified the cause of this and it manifests itself in the 32bit version only; this would not show up in the 64bit version used for production.

We'll release a fix for the 32bit version later today.

thanks,
Charlie

Charlie,

The new build works like a champ! The first test was to load up the new build with the old data files. This worked fine with no vm leakage. In fact, there aren't any memory mapped files left lying around at all. Way cool.

The second test was to regenerate my data using the new build. I did this, and it works fine too. Tight as a drum--no leaks. I did notice, however, that the hexdump of the newly generated files still has the signature of your "badfile" (e.g., 20 3e in bytes 4-5). In fact, the new files are identical to the old. Note that being lazy, I only checked the first one. ;-)

Many thanks to you and Kx for supporting an "unsupported" developer release and for doing it so quickly.

Andy

charlie
New Contributor II
The problem was manifesting itself in the m32 build only, generating a bad file header when writing e.g.

`:test til 5

All other builds, 32/64-bit, are fine.

thanks for reporting it. We'll upload a fix soon.

Charlie

charlie
New Contributor II
sorry that should be

`:test set til 5
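A quick round-trip on the patched build confirms the write; a minimal sketch:

`:test set til 5    / write a simple vector to disk
get `:test          / read it back: 0 1 2 3 4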

> The problem was manifesting itself in the m32 build only

The mac never gets any respect. *sigh* :-)