
Distribution of time-series data into train/dev/test sets for ML

marrowgari
New Contributor

I currently have a kdb+ database with ~1mil rows of financial tick data. What is the best way to break up this time-series financial data into train/dev/test sets for ML?


This paper suggests the use of k-fold cross-validation, which partitions the data into complementary subsets. But it's from spring 2014, and after reading it I'm still unclear on how to implement it in practice. Is this the best solution, or is something like hold-out validation more appropriate for financial data? I also found this paper on building a neural network in kdb+, but I didn't see any practical, real-world examples of dividing a dataset into the appropriate categories.


Thank you.

6 REPLIES

effbiae
New Contributor
KX has developed embedPy. This allows q to call Python, including ML libraries such as TensorFlow, as in the example here:

If having Python available from q opens up some options, look here.

quintanar401
New Contributor
Hi,

1 million is a big enough number (though this depends on what exactly you want to do); most benchmark datasets are smaller.

Otherwise you can use data augmentation, data mixing (constructing examples like alpha*ex1+(1-alpha)*ex2; see the sketch below), a pretrained model, etc.
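A minimal NumPy sketch of that mixing idea (a mixup-style scheme; the function name, shapes, and the Beta-distributed weight are illustrative assumptions, not a prescription):

import numpy as np

def mix_examples(X, y, alpha=0.2, rng=None):
    # Construct convex combinations of random example pairs.
    # X: features of shape (n, d); y: numeric targets of shape (n,).
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))              # random partner for each example
    lam = rng.beta(alpha, alpha, size=len(X))  # per-pair mixing weights in (0, 1)
    X_mix = lam[:, None] * X + (1 - lam[:, None]) * X[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return X_mix, y_mix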

WBR, Andrey Kozyrev.



Thanks for the reply, Andrey.

I know data augmentation and mixing are great approaches for small data sets or for classification problems, e.g. using logistic or softmax regression on cat pics to decide whether it's a cat or not. But I'm not sure how this approach would work with time-series data. It seems like the order of rows (days/prices) is an important factor in training the model, and if that's the case, mixing the data would blow up the loss function and destroy the model.

Augmenting time-series data is not something I'm familiar with. Do you have other examples of how to do this?
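For context, the order-preserving split I have in mind looks something like this pandas sketch, which cuts the table chronologically into train/dev/test (the time column name and the fractions are just placeholders):

import pandas as pd

def chrono_split(df, time_col="time", frac_train=0.7, frac_dev=0.15):
    # Sort by time and cut into contiguous train/dev/test blocks,
    # so no future row ever leaks into an earlier set.
    df = df.sort_values(time_col)
    i = int(len(df) * frac_train)
    j = int(len(df) * (frac_train + frac_dev))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]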

> cat or not
a picture of a cat is a rectangle of triples;
tick data is a sequence of triples, quadruples, quintuples or wider,
but simpler nonetheless.
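to make that concrete, one possible sketch: turn a tick sequence into fixed-width supervised examples with a sliding window (the width and the next-tick target are arbitrary choices, not a recommendation):

import numpy as np

def sliding_windows(prices, width=10):
    # Each row of X is `width` consecutive ticks; y is the next tick.
    prices = np.asarray(prices, dtype=float)
    X = np.stack([prices[i : i + width] for i in range(len(prices) - width)])
    y = prices[width:]
    return X, y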

You need to transform your time series to stationary processes.

Then there are a number of ways to perform cross-validation specific to time-series data; one typical approach uses:

https://www.sciencedirect.com/science/article/pii/S0304407600000300
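For illustration, a minimal Python sketch of both steps: log-returns as one common stationarity transform, and scikit-learn's TimeSeriesSplit as a basic expanding-window cross-validation scheme (simpler than the blocked method in the linked paper; the simulated series is just a stand-in):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = 100 * np.exp(np.cumsum(0.001 * np.random.randn(1000)))  # stand-in price series
returns = np.diff(np.log(prices))    # log-returns: a common stationarity transform

tscv = TimeSeriesSplit(n_splits=5)   # expanding train window, later block as validation
for train_idx, val_idx in tscv.split(returns.reshape(-1, 1)):
    # every validation index comes strictly after all training indices
    assert train_idx[-1] < val_idx[0]
    print(len(train_idx), "train rows ->", len(val_idx), "validation rows")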

Regards
Xi

heydi
New Contributor
Also, data augmentation for time series is usually done using MCMC techniques, but again it depends on your use case.
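As a toy illustration of the Markov-chain flavour of that idea (not a production scheme; the quantile binning and state count are arbitrary assumptions): fit a first-order chain to binned returns, then sample synthetic paths from it:

import numpy as np

def markov_resample(returns, n_states=8, length=1000, rng=None):
    # Fit a first-order Markov chain to quantile-binned returns and
    # sample a synthetic return path from it (toy augmentation only).
    rng = np.random.default_rng(rng)
    returns = np.asarray(returns, dtype=float)
    edges = np.quantile(returns, np.linspace(0, 1, n_states + 1)[1:-1])
    states = np.digitize(returns, edges)   # state label per observation
    T = np.ones((n_states, n_states))      # add-one smoothing
    for a, b in zip(states[:-1], states[1:]):
        T[a, b] += 1
    T /= T.sum(axis=1, keepdims=True)
    path = [rng.integers(n_states)]
    for _ in range(length - 1):
        path.append(rng.choice(n_states, p=T[path[-1]]))
    # draw an observed return from each visited state's bin
    # (assumes every bin is non-empty, true for continuous returns)
    return np.array([rng.choice(returns[states == s]) for s in path])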

Xi