
Create a splayed table in parallel?

KdbNoob
New Contributor
It is easy to create CSV files in parallel: they are independent of each other, so you can launch as many processes as you want and have each one write its own CSV file to the file system.

Splayed tables, however, are tricky because multiple sub-directories share the same "sym" file. It's unsafe to have multiple processes writing to the same "sym" file.

How do we get around this?


8 REPLIES

LamHinYan
New Contributor
http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles#Parallel_Loading

q will attempt to acquire a lock on the sym file when it's enumerating against it, so theoretically you should be able to have multiple processes enumerating against the same sym file in parallel.

q).z.i
27864
q)`:sym?`a
`sym$`a

strace shows:

$ strace -p 27864
Process 27864 attached - interrupt to quit
read(0, "`:sym?`a\n", 4080)             = 9
stat("sym", 0x7fff6e0fc5b0)             = -1 ENOENT (No such file or directory)
unlink("sym$")                          = -1 ENOENT (No such file or directory)
open("sym$", O_RDWR|O_CREAT, 0666)      = 3
write(3, "\377\1\v\0\0\0\0\0", 8)       = 8
close(3)                                = 0
rename("sym$", "sym")                   = 0
stat("sym#", 0x7fff6e0fc560)            = -1 ENOENT (No such file or directory)
open("sym", O_RDWR|O_CREAT, 0666)       = 3
read(3, "\377\1\v\0\0\0\0\0", 8)        = 8
lseek(3, 0, SEEK_SET)                   = 0
fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_CUR, start=0, len=0}) = 0
lseek(3, 2, SEEK_SET)                   = 2
fstat(3, {st_mode=S_IFREG|0644, st_size=8, ...}) = 0
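
The fcntl(3, F_SETLKW, {type=F_WRLCK, ...}) call in the trace is an ordinary POSIX advisory write lock. As a rough illustration only (Python rather than q), the same read-check-append pattern under such a lock might look like the sketch below. The function name append_symbol_locked and the newline-delimited file are hypothetical stand-ins, not the kdb+ sym file format, and the lock only helps if every writer takes it.

```python
import fcntl
import os
import tempfile


def append_symbol_locked(path, symbol):
    """Append symbol to a newline-delimited file under an exclusive lock.

    Returns the zero-based index of the symbol in the file. The file layout
    is a made-up stand-in for illustration, NOT the kdb+ sym file format.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Blocking exclusive lock over the whole file. lockf() is a wrapper
        # around fcntl() record locks, i.e. the F_SETLKW / F_WRLCK request
        # seen in the strace output.
        fcntl.lockf(fd, fcntl.LOCK_EX)

        # Read the current contents while holding the lock.
        data = b""
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            data += chunk
        symbols = data.decode().splitlines()

        # Already present: reuse its index instead of appending a duplicate.
        if symbol in symbols:
            return symbols.index(symbol)

        # Not present: append at the end, still under the lock.
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, (symbol + "\n").encode())
        return len(symbols)
    finally:
        # Closing the descriptor releases the fcntl lock.
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "sym")
    for s in ["a", "b", "a"]:
        print(s, append_symbol_locked(demo_path, s))
```

On a local file system, two processes calling this at the same time are serialized by the kernel. On NFS the guarantee is only as good as the client and server lock implementation, which is the concern raised in the rest of the thread.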


However, if your sym file is on NFS this might not be safe: http://0pointer.de/blog/projects/locking.html

Thanks Martin. Unfortunately I am on NFS. Is there a good way around it?

As shown in the diagram, the enumService should perform locking properly when NFS locking might not be safe. Is this overkill? Is there a simpler method? Thx.




Sorry Yan, I may not have understood you. Do you mean that loading in parallel on NFS is still safe?

The following "multiple parsers + single writer" design should be simpler than the previous enumService design. A single writer kdb+ process handles incoming IPC messages in FIFO order, so only that process ever touches the sym file. This gives parallel parsing and sequential writing.
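
The shape of that design can be sketched in a few lines. This is a language-neutral illustration in Python, not kdb+ code: the names parse_chunk, SingleWriter and load are hypothetical, threads stand in for the separate parser processes, and an in-memory list stands in for the sym file. The point it shows is that enumeration happens in exactly one place, so no file locking is required.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Parser step: turns "sym,price" text lines into tuples.

    It touches no shared state, so many parsers can run at once.
    """
    return [tuple(line.split(",")) for line in lines]


class SingleWriter:
    """The only owner of the symbol list (stand-in for the sym file)."""

    def __init__(self):
        self.sym = []     # enumeration domain, in first-seen order
        self.index = {}   # symbol -> position in self.sym
        self.rows = []    # stored rows as (symbol index, price)

    def enumerate_symbol(self, s):
        # Append unseen symbols; return the existing index otherwise.
        if s not in self.index:
            self.index[s] = len(self.sym)
            self.sym.append(s)
        return self.index[s]

    def write(self, parsed_rows):
        for sym, price in parsed_rows:
            self.rows.append((self.enumerate_symbol(sym), float(price)))


def load(chunks, workers=2):
    writer = SingleWriter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the writer consumes
        # parsed chunks first-in, first-out, one at a time.
        for parsed in pool.map(parse_chunk, chunks):
            writer.write(parsed)
    return writer


if __name__ == "__main__":
    w = load([["a,1.0", "b,2.0"], ["b,3.0", "c,4.0"]])
    print(w.sym)
    print(w.rows)
```

The trade-off is that the writer becomes the throughput bottleneck, which is acceptable when parsing is the expensive step and the write itself is comparatively cheap.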




Tom Martin: If I have 2 kdb instances on the same host enumerating against /nfs/hdb/sym at the same time, will the locking work properly?

I think it's implementation specific and depends on your OS.

http://man7.org/linux/man-pages/man2/fcntl.2.html

Record locking and NFS

Before Linux 3.12, if an NFSv4 client loses contact with the server for a period of time (defined as more than 90 seconds with no communication), it might lose and regain a lock without ever being aware of the fact. (The period of time after which contact is assumed lost is known as the NFSv4 leasetime. On a Linux NFS server, this can be determined by looking at /proc/fs/nfsd/nfsv4leasetime, which expresses the period in seconds. The default value for this file is 90.) This scenario potentially risks data corruption, since another process might acquire a lock in the intervening period and perform file I/O.

Since Linux 3.12, if an NFSv4 client loses contact with the server, any I/O to the file by a process which "thinks" it holds a lock will fail until that process closes and reopens the file. A kernel parameter, nfs.recover_lost_locks, can be set to 1 to obtain the pre-3.12 behavior, whereby the client will attempt to recover lost locks when contact is reestablished with the server. Because of the attendant risk of data corruption, this parameter defaults to 0 (disabled).

On Sunday, August 23, 2015 at 11:58:14 PM UTC+1, Yan Yan wrote:
Tom Martin: If I have 2 kdb instances on the same host enumerating against /nfs/hdb/sym at the same time, will the locking work properly?