I have a kdb application in which one kdb process receives data from
an external system and issues updates and inserts to an on-disk
kdb database. Concurrently, another kdb process is running select
queries against the on-disk database.
We cannot shut down the select query server while new data is being loaded,
nor can we force it to perform a lengthy refresh operation to incorporate
new data.
What is the best way to ensure that this does not result in
corrupted data due to race conditions caused by partially
complete updates?
I believe the standard kdb approach is to have a "today" database (typically in memory)
and a "historical" database (typically on disk), with all queries merging data
from corresponding pairs of tables in the two databases.
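As I understand it, that merge would look something like this in a gateway process; the ports, table name and column names here are just placeholders I'm inventing for illustration:

    / handles to the two servers (ports are placeholders)
    hdb:hopen `::5010;                         / historical (on-disk) process
    rdb:hopen `::5011;                         / "today" (in-memory) process
    getTrades:{[s]
      h:hdb({select from trade where date<.z.d, sym=x}; s);  / completed dates, on disk
      r:rdb({select from trade where sym=x}; s);              / today's rows, in memory
      h uj r }                                 / uj aligns the columns and appends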
But don't you have the same problem when applying the import to the "today"
database every few seconds? And when you need to incorporate
the "today" data into the "historical" database at the end of the day,
how do you do that without downtime?
One approach to the latter problem would be for the code that applies
new imports to the "today" database to also apply them to a copy of the
historical database, which would replace the live copy at midnight each day.
But this does not address the problem of getting a consistent update to
the "today" database, and it may also require that each query server perform
some kind of refresh to ensure that it sees the new on-disk data after
midnight.
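I believe the usual end-of-day pattern is to write the in-memory table down as a new date partition and then ask each query process to re-map the database, which should be quick since splayed columns are memory-mapped rather than copied. Roughly something like this, with the path, ports and table name as placeholders:

    / writer: persist today's in-memory table as a new date partition of the HDB
    .Q.dpft[`:/data/hdb; .z.d; `sym; `trade];
    / then tell each query process (started in the HDB root) to re-map the database
    reload:{h:hopen x; h"system\"l .\""; hclose h};
    reload each `::5010`::5012;                / ports are placeholders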
Another idea is to segregate the new data from the old using timestamps.
This seems like it should work in the case where the kdb database receives
only inserts from the external system. In that case the select queries
could simply exclude rows timestamped later than the end of the last
complete import cycle.
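For example, with lastGood being a name I'm making up for a marker the loader would publish at the end of each complete cycle:

    lastGood:2024.01.15D10:30:00.000000000;    / set by the loader after each complete cycle
    select from trade where time<=lastGood     / readers never see rows from in-flight imports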
To extend this to support updates, the schema could be designed so that
each column that gets updated is associated with two additional columns:
* A timestamp column that records when the value was updated
* A previous-value column that holds the value prior to the update.
Queries would use the value from the previous-value column when the timestamp
is later than the end of the last complete import cycle, and otherwise use the
value from the current column.
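In q I imagine that read-through would be a vector conditional, something like this, with px, prevPx and updTime as placeholder column names:

    / fall back to the previous value when the update post-dates the cut-off
    select sym, px:?[updTime>lastGood; prevPx; px] from trade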
In our system most of the updated columns ARE timestamps, and the previous value would always be null.
So no additional columns are needed; we just have to treat values later than the time of the last
complete import cycle as null.
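So the read side would be something like this, again with made-up table and column names:

    / a timestamp written by a partially complete import reads back as null
    select sym, fillTime:?[fillTime>lastGood; 0Np; fillTime] from orders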
Does this approach sound reasonable to you?
What is the standard way to deal with such issues?