CatBoost is a popular machine learning library that implements gradient-boosted decision tree models. It trains models on tabular data with different kinds of features (numeric, categorical, and textual, as well as embeddings) and provides good quality even with default parameters.
It is developed primarily by researchers and engineers at Yandex, the largest IT company in Russia, and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and at other companies.
In this presentation, we introduce distributed training of CatBoost models on Apache Spark.
We will discuss its key features and overall architecture, and present some benchmarks.