This is an open-source implementation of the Alternating Direction Method of Multipliers (ADMM)[1] optimization algorithm that relies on CVXPY[2], a Python-based toolbox for convex optimization, and is implemented with the COMPSs programming model[3]. The solution has been developed in the context of the I-BiDaaS project[4], and it can train several Machine Learning models, including Least Absolute Shrinkage and Selection Operator (LASSO) and elastic net regression, logistic loss-based classification, and clustering, as well as several other models with minimal additional coding effort. A number of related implementations are available in the I-BiDaaS knowledge repository[5], and the ADMM implementation of LASSO has been included in dislib[6], an open-source, COMPSs-based Machine Learning library. This innovation has also been recognized by the EU Innovation Radar[7], and it has been applied to several industrial real-world data sets in the context of the I-BiDaaS project.
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[2] S. Diamond and S. Boyd, "CVXPY: A Python-Embedded Modeling Language for Convex Optimization," Journal of Machine Learning Research, 17(83):1–5, 2016.
[3] F. Lordan, E. Tejedor, J. Ejarque, R. Rafanell, J. Álvarez, F. Marozzo, D. Lezzi, R. Sirvent, D. Talia, and R. M. Badia, "ServiceSs: An Interoperable Programming Framework for the Cloud," Journal of Grid Computing, 12(1):67–91, 2014. DOI: 10.1007/s10723-013-9272-5
[4] I-BiDaaS project, https://www.ibidaas.eu/
[5] I-BiDaaS Knowledge Repository, https://github.com/ibidaas/knowledge_repository/tree/master/tools_techn…
[6] J. Álvarez Cid-Fuentes, S. Solà, P. Álvarez, A. Castro-Ginard, and R. M. Badia, "dislib: Large Scale High Performance Machine Learning in Python," in Proceedings of the 15th International Conference on eScience, 2019, pp. 96–105.
[7] EU Innovation Radar, https://ec.europa.eu/jrc/sites/jrcsh/files/booklet-a4_innovation_radar…
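Although the description above does not spell out the algorithm, the parallel training scheme it describes matches the global consensus form of ADMM from [1] (this correspondence is our reading, not an excerpt from the text). Assuming the training data is split row-wise into N partitions, with local loss f_i on partition i and a shared regularizer g (the l1 norm in the LASSO case), each iteration performs:

```latex
\begin{aligned}
x_i^{k+1} &:= \operatorname*{argmin}_{x_i} \Big( f_i(x_i) + \tfrac{\rho}{2}\,\lVert x_i - z^k + u_i^k \rVert_2^2 \Big), \qquad i = 1,\dots,N, \\
z^{k+1}   &:= \operatorname*{argmin}_{z} \Big( g(z) + \tfrac{N\rho}{2}\,\lVert z - \bar{x}^{k+1} - \bar{u}^k \rVert_2^2 \Big), \\
u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1},
\end{aligned}
```

where \bar{x}, \bar{u} denote averages over the N partitions and \rho > 0 is the penalty parameter. The x_i-updates are mutually independent, which is what makes the per-iteration workflow embarrassingly parallel; only the z- and u-updates require gathering the partial results.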
The implementation can be used to train various Machine Learning models, including regression, classification, and clustering, over a computer cluster on which COMPSs has been installed. Users can run a pre-defined Machine Learning model, or they can encode a new one by specifying a different objective function for training. In this sense, the implementation can be seen as a code template from which a new Machine Learning model can be obtained with minor programming effort. The current implementation assumes that the input data is structured, organized into numerical matrices, and split across multiple files, one per machine in the cluster. Parallel, scalable execution over the cluster is achieved through the inherent parallelism of ADMM and the underlying COMPSs runtime system.
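As an illustration of the template idea, the following minimal sketch shows how a CVXPY-based local ADMM x-update can be parameterized by a loss function, so that switching models amounts to swapping the loss. The function names here are illustrative and do not correspond to the actual API of the implementation.

```python
import cvxpy as cp


def lasso_loss(A, b, x):
    # Least-squares data term; the l1 penalty is handled in the global z-update.
    return 0.5 * cp.sum_squares(A @ x - b)


def logistic_loss(A, b, x):
    # Logistic loss for classification with labels b in {-1, +1}.
    return cp.sum(cp.logistic(-cp.multiply(b, A @ x)))


def local_update(A, b, z, u, rho, loss=lasso_loss):
    # ADMM x-update: each worker solves a small convex subproblem with CVXPY.
    x = cp.Variable(A.shape[1])
    objective = cp.Minimize(loss(A, b, x) + (rho / 2) * cp.sum_squares(x - z + u))
    cp.Problem(objective).solve()
    return x.value
```

Passing `loss=logistic_loss` instead of the default turns the same local update into a logistic-regression trainer, which is the kind of minor change the template approach relies on.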
Banking; Telecommunications; Manufacturing; potentially any other domain where a Machine Learning model needs to be trained over a structured dataset.
The solution offers greater flexibility than some current commercial products for (Big) Data analytics based on Machine Learning algorithms, which typically ship with a pre-defined set of Machine Learning models and limited scope for tuning model parameters.
The ADMM code template allows users who understand how Machine Learning models are trained via loss-function minimization to specify an alternative Machine Learning model with minimal additional coding effort. Parallelization is handled transparently on the user's behalf, both by the nature of ADMM (which defines an embarrassingly parallel workflow) and by COMPSs (which generates the workflow from the source code and runs it in parallel on the available resources); see the sketch after the reference below. The solution may be combined with companion tools such as the Hecuba data management system[1].
[1] Hecuba, https://github.com/bsc-dd/hecuba
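The sketch below illustrates this transparent parallelization with PyCOMPSs: each local update is declared as a COMPSs task, so the runtime schedules all per-partition solves concurrently. It is a simplified illustration, not the project's actual source; it assumes the hypothetical local_update function from the earlier sketch.

```python
import numpy as np
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def x_update(A, b, z, u, rho):
    # Runs as an independent COMPSs task on whichever worker is available;
    # local_update is the CVXPY-based solver from the previous sketch.
    return local_update(A, b, z, u, rho)


def admm_step(partitions, z, us, rho):
    # One task per data partition: COMPSs builds the task graph and executes
    # the embarrassingly parallel x-updates concurrently.
    xs = [x_update(A, b, z, u, rho) for (A, b), u in zip(partitions, us)]
    xs = compss_wait_on(xs)  # synchronize before the global z-update
    # Plain averaging corresponds to an unregularized consensus problem; a
    # regularized model (e.g., LASSO) would additionally apply the
    # regularizer's proximal operator here (soft-thresholding for l1).
    z_new = np.mean([x + u for x, u in zip(xs, us)], axis=0)
    us_new = [u + x - z_new for x, u in zip(xs, us)]
    return z_new, us_new
```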
The novelty of this solution lies essentially in the reusability of the developed code: new models can be supported with small, straightforward code changes. It is the first implementation of an ADMM method in COMPSs and constitutes a novel addition to dislib. The technological novelty also lies in the integration of COMPSs and CVXPY through the ADMM framework to build efficient methods for Machine Learning training. In this way, we exploit for the first time both the parallel execution enabled by COMPSs and the efficient solution of convex optimization problems provided by CVXPY.