how to incorporate new data into an on-disk kdb database without downtime

New Contributor
I have a kdb application in which one kdb process receives data from
an external system and issues updates and inserts to an on-disk
kdb database. Concurrently, another kdb process is running select
queries against the on-disk database.

We cannot shut down the select query server while new data is being loaded
nor can we force it to perform a lengthy refresh operation to incorporate
new data.

What is the best way to ensure that this does not result in
corrupted data due to race conditions from partially
complete updates?

I believe the standard kdb thing is to have a "today" database (typically in memory)
and a "historical" database (typically on disk), with all queries merging data
from pairs of corresponding tables in the today and historical database.

But don't you have the same problem applying the import to the "today"
database every few seconds? And what about when you need to incorporate
the "today" data into the "historical" database at the end of the day --
how do you do that without downtime?

One approach to the latter problem would be for the code that applies
new imports to the "today" database to also apply them to a copy of the
historical database, which will replace the live copy at midnight of each day.

But this does not address the problem of getting a consistent update to
the "today" database. It also may require that each query server perform
some kind of refresh to ensure that it sees the new on-disk data after
the replacement.
Another idea is to segregate the new data from the old using timestamps.

This seems like it should work in the case where the kdb database receives
only inserts from the external system. In that case the select queries
could simply exclude rows timestamped later than the end of the last
complete import cycle.
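For the insert-only case, the filter could look something like the following, a minimal sketch where the table `trade`, its `importTime` column, and the global `lastComplete` are all hypothetical names:

```q
/ hypothetical schema: the loader stamps each row's importTime;
/ lastComplete marks the end of the last fully-applied import cycle
trade:([] sym:`a`b`c; px:10 20 30f;
  importTime:2024.01.15D09:00 2024.01.15D09:30 2024.01.15D10:15);
lastComplete:2024.01.15D10:00;

/ queries see only rows from complete import cycles
select from trade where importTime<=lastComplete
```

Readers would advance `lastComplete` only after an import cycle finishes, so a partially written batch is never visible.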

To try to get this to support updates, the schema could be designed so that
each column that gets updated is associated with two additional columns:
* A timestamp column that records when the value was updated
* A previous-value column that holds the value prior to the update

Queries would use the value from the previous value column when the timestamp
is later than the end of the last complete import cycle, but otherwise use the
value from the current value column.
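In q this previous-value selection could be expressed with the vector conditional; the column names (`val`, `prevVal`, `updTime`) and `lastComplete` are again hypothetical:

```q
/ val: current value; prevVal: value before the last update;
/ updTime: when val was written; lastComplete as before
t:([] id:`x`y; val:1 2f; prevVal:0n 1f;
  updTime:2024.01.15D09:00 2024.01.15D10:30);
lastComplete:2024.01.15D10:00;

/ fall back to the previous value for rows updated after the cutoff
select id, val:?[updTime>lastComplete; prevVal; val] from t
```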

In our system most of the updated columns ARE timestamps, and the previous value would always be null.
So no additional columns are needed; we just have to treat values later than the time of the last
complete update as null.

Does this approach sound reasonable to you?
What is the standard way to deal with such issues?


New Contributor
Your approach looks reasonable. You could also use native semaphores (requires a shared-object library, unfortunately) to acquire a lock before any update.

I posted the above question before I realized that you can't perform an update on a splayed table, basically because you can't splay a keyed table and you need a key to perform an update.

My database contains several objects that are represented by a row in a table and which need to change state over time.
This means my importer (i.e. a tickerplant) needs to look up pre-existing rows in the table and change the values in specific columns.

Is there some way to accomplish that without reading the entire table into memory and writing it back out ?
I would also like changes applied in this way to be immediately visible to another q process accessing the same table on disk.

If that is not possible, I might be able to redesign my schema so that the importer will only need to insert rather than update rows.
That would be kind of awkward for the main application because the state of one object would now involve more than one row in more than one table. Would that be a usual method of dealing with the issue ?
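If the schema were redesigned to be insert-only, new rows could be appended to the splayed table on disk without loading it, e.g. (a sketch with made-up paths and columns, assuming symbol columns are enumerated with `.Q.en`):

```q
/ append-only writedown to a splayed table; /db and events are hypothetical
newRows:([] sym:`aa`bb; state:`open`closed; ts:2#.z.p);

/ enumerate symbols against the db root, then append on disk;
/ upsert to a trailing-slash path appends to (or creates) the splayed table
`:/db/events/ upsert .Q.en[`:/db] newRows
```

The current state of an object would then be reconstructed at query time, e.g. by taking the latest row per key.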

New Contributor III
Andrew's approach is one possibility.

Another way is to create two copies of the on-disk HDB intraday. You can then switch between the two using the so-called red-blue approach.

You can then front the user queries with a gateway that diverts queries between the two instances -- read from one copy while updating the other, then swap over every once in a while. Your queries will not be completely up to date during the swap period, though. So it all depends on your exact requirements for how real-time the data in the HDB must be.
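The gateway side of this could be sketched roughly as follows, assuming two identical HDB processes on ports 5001 and 5002 (both the ports and the function names are made up):

```q
/ open handles to the two HDB copies (both processes must be running)
h:hopen each 5001 5002;
active:0;                       / index of the copy currently serving reads

runQuery:{[q] h[active] q};     / e.g. runQuery"select count i from trade"
swap:{active::1-active};        / called once a rewrite of the idle copy completes
```

Writers would update only the idle copy, call `swap` when done, and then bring the other copy up to date before the next cycle.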

New Contributor
So, with this method you have a "today" table which lives in memory and can therefore be freely updated, and a "historical" table splayed on disk.
Select queries incorporate data from both tables, taking the copy from the "today" table when the same record exists in both.
When a record needs to be updated in the historical table, it is copied into the today table.

At some point the today table is upserted into the historical table, producing a result that replaces the historical database the next day.
This seems like it could take some time, and the whole table might have to fit in memory. Is there some way around that?
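For reference, the usual end-of-day step in this setup is a writedown of the in-memory table as a new date partition; a minimal sketch, assuming an in-memory `trade` table with a `sym` column and a hypothetical db root `:/db:

```q
/ persist today's in-memory trade table as the partition for today's date,
/ parted on sym (.Q.dpft also enumerates symbol columns automatically)
.Q.dpft[`:/db; .z.D; `sym; `trade];

/ then empty the in-memory copy for the new day
delete from `trade;
```

This only requires the day's data (not the whole history) to fit in memory, since each date partition is written independently.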

In light of the comment above about not being able to update splayed tables on disk, it seems that even having 2 separate instances
would not work if you can't apply updates without reading the whole table into memory. 

Have a look at - there is a way to update files on disk since 3.4, maybe it is what you need.
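One mechanism this might refer to is functional amend applied directly to a file path, which updates items of an on-disk column in place; a hedged sketch with hypothetical paths and indices:

```q
/ amend rows 2 and 3 of the splayed column file trade/px on disk;
/ @[file;index;op;value] works on file symbols as well as in-memory lists
@[`:/db/2024.01.15/trade/px; 2 3; :; 99 99f]
```

Whether this matches the feature being referenced is an assumption, since the link above is missing from the post.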

WBR, Andrey.

So, I don't like the two-database approach because I am afraid that manual corrections applied to one copy won't make it into the other (human error).
But my boss thinks a file-locking approach may be error prone (locks don't get respected), perform poorly (locks get held too long), and just take
too long to design and code. Worst of all, problems might be intermittent and not appear until we are under heavy system load (i.e. in production).

Could you elaborate a bit on how the file locking solution might work ?

* Which *.so library would we be using ?
* Would this be advisory or mandatory locking ?
* Would we need the library to acquire/release a lock, or could we create/remove a lock file ?
* Will kdb+ use the library to acquire/release locks automatically on the read only side ?
* Could a directory structure similar to a partitioned table be used to only lock a subset of the table rows for each update ?
* Would this be easier to implement if we only insert and never update ?