Hi Saad
Just to be clear here, you have the data date partitioned (separate directory) within each country?
With kdb+ you can only partition by one field. In your case each country is more like a separate database. So if you want to query across all countries, you either need to load each database individually and query it, or you need to run multiple processes, each accessing each database, and query across them and join the results.
Depending on the dataset, it might be better to restructure the data. You can partition either by country OR date (if you partition by country then the country name needs to be enumerated to an int and mapped in each query). The choice probably depends on:
- how the majority of your queries are run- do you more often access the data for a subset of countries, or for a subset of dates. If most of the queries are of the form “for country X query across all dates” then country would be the better choice.
- how many of each do you have in the dataset. For example if you only have a handful of countries and lots of dates then partitioning by date might be the better choice as date filters will cut out more of the data more quickly.
- how the data arrives and whether you would need to sort it to maintain attributes.
You can also apply a p attribute on whichever one you don’t chose as the partitioning type to give faster lookups. Depending on what your data is you might also be able to just store the data in the usual format of partitioned by date, parted by stock symbol, and if there is a 1:1 mapping between symbol and country you could either add an additional country column also with a p attribute (i.e. a small bit of denormalisation) or manage it in the queries.
Thanks
Jonny