Quantitative finance relies heavily on data to make informed decisions. Datasets are used as inputs to train the machine learning models that quants need to construct and validate their strategies. These datasets encompass a wide range of financial market data, economic indicators, and many so-called alternative datasets. aisot’s Head of R&D, Dr. Nino Antulov-Fantulin, and AI & Quant Advisory Lead, Dr. Petter Kolm, have released a dataset consisting of millisecond- and minute-frequency snapshots of trades and limit order books for BTC/USD (i.e. the Bitcoin / US dollar currency pair) from May 31, 2018 through September 30, 2018 from the Bitstamp exchange (https://www.bitstamp.net). Trade data is recorded at millisecond frequency. Limit order book snapshots are taken at minute frequency, with aggregated amounts for each price level and a depth of up to 5000 on both the bid and ask sides.
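To make the structure of a limit order book snapshot concrete, here is a minimal Python sketch of two quantities commonly derived from one: the mid price with its spread, and a simple depth imbalance. The layout used here, a list of (price, aggregated amount) pairs per side, is an assumption for illustration and not the released dataset's actual file format. The function names and all numbers are made up.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: one (price, aggregated_amount) pair per price level.
# The released dataset's real schema may differ.
Level = Tuple[float, float]


def top_of_book(bids: List[Level], asks: List[Level]) -> Tuple[float, float, float, float]:
    """Return (best_bid, best_ask, mid_price, spread) for one snapshot."""
    best_bid = max(price for price, _ in bids)  # highest price a buyer offers
    best_ask = min(price for price, _ in asks)  # lowest price a seller accepts
    mid_price = (best_bid + best_ask) / 2
    spread = best_ask - best_bid
    return best_bid, best_ask, mid_price, spread


def depth_imbalance(bids: List[Level], asks: List[Level], levels: int) -> float:
    """Volume imbalance over the `levels` price levels closest to the mid price.

    Ranges from -1 (only ask volume) to +1 (only bid volume).
    """
    top_bids = sorted(bids, key=lambda lvl: lvl[0], reverse=True)[:levels]
    top_asks = sorted(asks, key=lambda lvl: lvl[0])[:levels]
    bid_volume = sum(amount for _, amount in top_bids)
    ask_volume = sum(amount for _, amount in top_asks)
    return (bid_volume - ask_volume) / (bid_volume + ask_volume)


# Made-up snapshot with three levels per side (not real market data).
example_bids: List[Level] = [(7500.0, 1.5), (7499.5, 2.0), (7499.0, 0.5)]
example_asks: List[Level] = [(7501.0, 1.0), (7501.5, 0.5), (7502.0, 2.5)]

best_bid, best_ask, mid_price, spread = top_of_book(example_bids, example_asks)
imbalance = depth_imbalance(example_bids, example_asks, levels=2)
print(best_bid, best_ask, mid_price, spread, imbalance)
```

With snapshots that reach up to 5000 levels per side, the `levels` argument controls how much of the book enters the imbalance. Which depth is informative is an empirical question that a public dataset lets researchers compare on equal footing.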
Importance of publicly available datasets
Publicly available datasets and benchmarks are crucial for advancing machine learning. They provide a standardized basis for evaluating and comparing the performance of different algorithms and models, allowing researchers to measure progress objectively. Moreover, they democratize access to essential data, reducing barriers to entry for new researchers and fostering collaboration within the community. Publicly available datasets also stimulate innovation by inspiring researchers to tackle challenging problems and develop novel techniques that achieve state-of-the-art results, ultimately driving the field forward.
In sub-fields of machine learning such as computer vision, public datasets have proven pivotal. They facilitate model comparison, encourage innovation through healthy competition, and enable the development of more accurate and specialized algorithms, leading to progress in areas such as image classification, object detection, 3D vision, and medical imaging. These resources serve as a foundation for standardized evaluation, transfer learning, and collaborative research.
In contrast to computer vision, the use of publicly available datasets is not yet common practice in quant finance. Rather than using public datasets, banks and asset managers purchase large amounts of data for in-house research and production. Without publicly available datasets, it is difficult to evaluate and compare the effectiveness of models across research groups. Public datasets bring some obvious benefits to quant finance.
As in other ML fields such as computer vision, it is important to make more datasets publicly available for quantitative finance research. While research benchmarking is the most obvious benefit of public datasets, they also empower researchers, analysts, and investors to make data-driven decisions, develop innovative approaches, and ensure transparency and accountability in the financial industry. More broadly, their availability benefits not only professionals in the industry but can also contribute to the overall stability and integrity of financial markets worldwide.