
Loading new-line separated JSON docs

suls
New Contributor
Hi,

I am trying to load ORC files [1] into kdb+.
My initial strategy is as follows:

item:flip columnnames!"SIIISHSSIIISSHHHHHHHHHHHHHHHPPDHHPPSSSHISSISISISHIHSHHHHPPHHHPSPPS"$\:();

loadfile:{[file]
 cmd:" " sv ("./ORC-1.5.3-Linux/bin/orc-contents";1_string file); / returns 1 JSON document per line
 tmp:.j.k each system cmd;     / parse each line of JSON output
 tmp:columnnames xcol tmp;
 item::item,tmp;               / :: amends the global; a plain : would only create a local
 }


loadfile each filelist;

I wanted to ask 2 things:

1/ Is there a canonical way to preserve the schema for each JSON document? My simplistic join with , doesn't seem to work that way.

2/ I understand that using named pipes [2] instead could improve performance. However, I am failing to use .Q.fps in combination with .j.k. All the samples I have seen so far are about CSV and use the (""!",") syntax; I don't understand how I could use .j.k instead.

Cheers, Mathias

[1] https://orc.apache.org
[2] https://code.kx.com/q/cookbook/named-pipes/

TerryLynch
New Contributor II
I'm not sure what you mean by preserving the schema. If each document has a different schema, how do you plan to store them when they don't conform to a common one?

On your point about .Q.fps, the easiest way to figure out what to do with the stream of data is to extract out a chunk and inspect it, like so:

.Q.fps[{$[not `global in key `.;`global set x;()]};`:/path/to/myFIFO]

Then you can inspect the global to see what a chunk of the stream looks like. 
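For instance, once a chunk has been captured, you might inspect it like this (a sketch):

q)count global        / number of lines in the captured chunk
q)first global        / raw text of the first JSON document
q).j.k first global   / parsed into a q dictionary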

Terry



Thanks Terry! Your function set me on the right path!
This is where I am at the moment:

item:flip columnnames!"SIIISHSSIIISSHHHHHHHHHHHHHHHPPDHHPPSSSHISSISISISHIHSHHHHPPHHHPSPPS"$\:();

.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert z};`:fifo]

It turns out though that .j.k only returns character lists (C) or floats (f),

and thus I am not able to upsert into my `item table as defined above.
What is the best way to align the types of z with my target table?
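For example, a minimal illustration of the behaviour (a sketch):

j:.j.k "{\"shop_id\":42,\"name\":\"abc\"}"
type j`shop_id   / -9h: a float atom, even though the JSON value is an integer
type j`name      / 10h: a char vector, not a symbol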


Cheers, Mathias






TerryLynch
New Contributor II

Something like this might work for you, though it may not be the most efficient:

 

q)tmp:.j.k .j.j orig:([]col1:`aa`bb`cc;col2:11 22 33i;col3:.z.D-til 3;col4:1 2 3h)
q)clms:`col1`col2`col3`col4;
q)types:"SIDH";
q)t:flip clms!types$\:();
q)`t upsert {@[x;y;{$[0h=type y;x$y;("h"$.Q.t?lower x)$y]}z]}/[tmp;cols tmp;types];
q)t~orig
1b

/ or in your example (assuming you've defined a types variable):
`item upsert {@[x;y;{$[0h=type y;x$y;("h"$.Q.t?lower x)$y]}z]}/[z;cols z;types]



This was super helpful!

I ended up using only a subset of the fields present in the original JSON using:

.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert (select `int$shop_id, `long$price from z )};`:fifo]
// shop_id and price are both parsed as f

Which was working fine until I wanted to cast a string field in the JSON doc into a long:

.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert (select `int$shop_id, "J"$string_in_json_but_actually_a_long, `long$price from z )};`:fifo]

It seems that I am still not understanding some fundamentals of q.

TerryLynch
New Contributor II
In theory the "J"$ should work. Are you 100% certain that every value in the string_in_json_but_actually_a_long column is a string of type 10h? Using meta doesn't tell the whole story; it only reports the types of the first row. You would need to run distinct type each z`string_in_json_but_actually_a_long to be certain.

Values that are null in JSON actually get parsed as null floats (0n), even if the column is supposed to be a string column. I suspect your column has a mixture of strings and null floats, and the null floats are making your attempt to cast with "J"$ fail.
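For example (a sketch):

j:.j.k each ("{\"v\":\"123\"}";"{\"v\":null}")
type (first j)`v   / 10h: the char vector "123"
type (last j)`v    / -9h: a null float 0n, not an empty string
/ "J"$ applied to that 0n signals a 'type error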

Terry


Try a more careful cast like this instead:

select {$[10h=type x;"J"$x;0Nj]}each string_in_json_but_actually_a_long from z

Terry
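Putting the pieces together, the full pipeline might then look like this (a sketch only, reusing the column names, table, and FIFO path from the earlier posts):

.Q.fps[{[x]
  z:columnnames xcol .j.k peach x;   / parse one chunk of newline-separated JSON
  `item upsert select `int$shop_id,
    {$[10h=type x;"J"$x;0Nj]}each string_in_json_but_actually_a_long,
    `long$price from z
  };`:fifo]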