Building and using a machine learning microservice in the KEKO ecosystem

Tutorial version 1.0, 15.11.2020

Overview

This tutorial presents an example of using container-based machine learning tools in the KEKO platform. Containers enable the developer to harness the power of modern machine learning tools in data pipelines without the issues that arise from differing installations and library versions. Possible application areas range from real-time error detection based on sensor data to optimizing services in a building based on people flow predictions.

The examples for this tutorial are built using a Python + Flask + XGBoost stack, but with a container-based solution, any suitable server software and machine learning libraries could be used instead. In the following, we will first go through some design choices for a suitable container, and then outline a concrete example.

Structuring and building a machine learning container for KEKO

From the KEKO platform point of view, a machine learning container works simply as a function on data. The actual implementation of the container has very few requirements and restrictions. One of the main requirements is that the machine learning container implements a REST interface for communication between the KEKO system and the container.

Another restriction imposed by the KEKO system is that all HTTP calls to the container are blocking, i.e., they do not return until they have finished processing their data. This also means that the HTTP server in the container should be able to handle all such calls without timing out, as model training with large datasets can be quite time-consuming.

The container should also manage the machine learning models and model names. The KEKO platform does not restrict the names or models used. Suitable naming conventions depend on the application area, and hence this is a design decision to be made when building the container.

Current machine learning drivers and data formats

The KEKO platform currently has two drivers intended for use in machine learning pipelines. ML Training takes care of model creation and training. JSON API, on the other hand, lets the developer define a REST access point for making predictions using a container and a suitable model. If the functionalities of the current drivers do not suit a specific application, it is always possible to develop new drivers for the ecosystem.

The ML Training driver has an option to add a single JSON string, which is passed on to the container as a JSON-file at model creation. The server should therefore be able to handle such a file attached to a model creation request. This file can be used to specify the model options, for example.
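
As an illustration, the options file for the sample container of this tutorial could look roughly like the following. The field names (model, params, input_features, output_features) and the feature names are purely illustrative choices of this sample implementation; they are not mandated by the KEKO platform.

{
    "model": "xgb",
    "params": {},
    "input_features": ["temperature", "rain", "weekday"],
    "output_features": ["people_in"]
}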

Similarly, with model training and predicting, the data is passed into the container as JSON-files in a specific format, which the container should be able to process correctly. At the moment, the system uses the Pandas orientation “records”. That is, the data is formatted as an array of data objects, one object per row of data (see Pandas read_json documentation).
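
For example, the container could parse such a payload with pandas as follows (the column names here are hypothetical):

import io
import pandas as pd

# "records" orientation: a JSON array of objects, one object per row of data
payload = '[{"temperature": 5.1, "people_in": 120}, {"temperature": 1.8, "people_in": 95}]'
df = pd.read_json(io.StringIO(payload), orient="records")
print(df)
#    temperature  people_in
# 0          5.1        120
# 1          1.8         95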

Note that the container does not have direct access to the KEKO system’s data assets. Instead, the system is designed so that the models receive the results of a single query as the input data. Large data sets can be imported into the container in multiple slices using the input request multiple times. It is then a design choice of how to handle such data in the container. For instance, the data can be stored into an internal database solution, or processed and discarded immediately after each input request (for models that support incremental training).
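
As a sketch of the "process and discard" alternative, XGBoost allows continuing training from an existing booster, so each incoming slice can be used to update the model and then dropped (the parameters below are illustrative, not taken from the sample container):

import xgboost as xgb

def train_on_slice(booster, X_slice, y_slice, params):
    """Continue training the booster on one incoming slice of data."""
    dtrain = xgb.DMatrix(X_slice, label=y_slice)
    # xgb_model=booster continues from the previously trained model
    # (pass booster=None for the first slice)
    return xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)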

The sample machine learning microservice API

The Sample ML Microservice container implements the following REST interface for a standard machine learning workflow. This sample container does not yet implement features such as model evaluation or parameter tuning by cross-validation, which could be needed when developing a solution for production use.

The current ML Training driver requires the model creation, data input and training paths to be in the form shown below. The prediction path can be chosen freely by the developer. The exposed ports of the container should be mapped to "free" ports that can be assumed not to be in use (so not 80 or 8080, for instance).

Request                                            Remarks
PUT <host>:<port>/models/<model_name>              Create a model with the identifier, accepts an attached JSON-file
PUT <host>:<port>/models/<model_name>/data/train   Input training data into the model, training data as a JSON-file
PUT <host>:<port>/models/<model_name>/train        Train the model
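
As an illustration, a minimal Flask server implementing these paths could look roughly like the following. The internal bookkeeping (a models dictionary keyed by model name), the option field names, the /predict path and the assumption that the options and data arrive in the request body are choices of this sketch, not requirements of the platform; a production version would also need proper error handling and persistence.

import io
import json
import pandas as pd
import xgboost as xgb
from flask import Flask, request, jsonify

app = Flask(__name__)
models = {}  # model_name -> {"options": ..., "data": ..., "estimator": ...}

@app.route("/models/<model_name>", methods=["PUT"])
def create_model(model_name):
    # The ML Training driver can attach a JSON string with model options
    options = json.loads(request.data) if request.data else {}
    models[model_name] = {"options": options, "data": [], "estimator": None}
    return jsonify({"status": "created", "model": model_name})

@app.route("/models/<model_name>/data/train", methods=["PUT"])
def input_training_data(model_name):
    # Training data arrives as JSON in the pandas "records" orientation
    df = pd.read_json(io.StringIO(request.data.decode("utf-8")), orient="records")
    models[model_name]["data"].append(df)
    return jsonify({"status": "data received", "rows": len(df)})

@app.route("/models/<model_name>/train", methods=["PUT"])
def train_model(model_name):
    entry = models[model_name]
    data = pd.concat(entry["data"], ignore_index=True)
    X = data[entry["options"]["input_features"]]
    y = data[entry["options"]["output_features"]].squeeze()  # single target column in this sketch
    # Blocking call: the response is sent only after training has finished
    entry["estimator"] = xgb.XGBRegressor().fit(X, y)
    return jsonify({"status": "trained", "rows": len(data)})

@app.route("/models/<model_name>/predict", methods=["PUT"])
def predict(model_name):
    entry = models[model_name]
    X = pd.read_json(io.StringIO(request.data.decode("utf-8")), orient="records")
    preds = entry["estimator"].predict(X[entry["options"]["input_features"]])
    return jsonify([{model_name: preds.tolist()}])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)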

Building and loading the container

The container for this tutorial is based on a recent stable Ubuntu image. This produces a bigger container than solutions based on Alpine Linux, for example. However, the Ubuntu base image has wide support for all the different libraries required by Python and different machine learning models. For production use, a solution based on Alpine might produce a smaller container image, but will require more fine-tuning.

For local testing of the container, it is possible to download some test data from the KEKO platform; see “Designing queries and data pipelines” for more details. It is recommended to first test the container thoroughly locally with real data containing the expected null values and value ranges, since error message lengths in the KEKO platform are limited and errors can be difficult to pinpoint once the container has been uploaded into the system and is in use. The machine learning server should also recover from any errors so that only the model currently being handled is affected, and previously created and trained models remain unaffected.
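
For example, a local smoke test of the container could look like the following. The file name, model name and options are hypothetical; the data file is assumed to be a "records" oriented JSON export from the platform.

import requests

BASE = "http://localhost:5000"   # the port exposed by the locally running container
MODEL = "demo_model"             # hypothetical model name

options = {"model": "xgb",
           "input_features": ["temperature", "rain", "weekday"],
           "output_features": ["people_in"]}

# Test data exported from the KEKO platform via the code designer
with open("exported_test_data.json") as f:
    train_data = f.read()

print(requests.put(f"{BASE}/models/{MODEL}", json=options).status_code)
print(requests.put(f"{BASE}/models/{MODEL}/data/train", data=train_data).status_code)
print(requests.put(f"{BASE}/models/{MODEL}/train").status_code)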

Once the container is ready, built and tested, it is time to import it into the KEKO platform. This requires six files in the correct locations. In the root folder of the container data, there should be the following four files that define the service provided by the container in the KEKO platform:

File              Remarks
Dockerfile        The Docker build instructions for the container
service.id        An identifier (uid) for the service in the KEKO platform system
service.info      A JSON-file describing the service in the KEKO platform system. Allows giving startup parameters for the container; in particular, defines the external and internal ports of the container for docker run.
service.version   Version number for the service, 1 by default

In addition to these, the following two files have to be placed in the parent folder of the container root. Running these scripts requires a running instance of Docker, as well as Amazon AWS tools with correct login information. You can obtain this information from the KEKO platform admin.

File       Remarks
login.sh   Login script for the KEKO platform deployment API. Run once during a session before building and uploading the container.
build.sh   Builds the container (see Dockerfile in the previous table) and loads the container and the service files into the KEKO platform. The working directory for this script should be the root folder of the container.

Once the container image is successfully uploaded into the KEKO platform system, the service can be installed from Ecosystems -> KEKO Ecosystem -> Services.

After successful installation, the service is running in the KEKO platform until it is terminated from Settings -> Services.

Data aggregation and machine learning functions in the KEKO platform

Data ingestion

The KEKO platform uses Hive / Spark SQL as the database language for querying and manipulating data. Data aggregation begins by defining data sources in the “Integrate data” section. The system supports various drivers for data ingestion, such as file-based and database sources.

Designing queries and data pipelines

After the data sources are defined, they are available as data assets that can be referenced when building new pipelines. In the “Code designer” section you can formulate and test SQL queries to manipulate the data into a suitable format. The data assets for the query are selected in the selection box on the lower right corner of the window. Once selected, you can also review the schema of the data asset by clicking the arrow in front of the asset’s name.

For local testing and other purposes, it is sometimes necessary to export a portion of the data. Once you have designed a suitable query, you can export the data from the “Export” tab in the code designer. Data can be exported into a file in the most commonly used formats such as CSV and JSON.

A streamlined way to handle the data for machine learning in the KEKO platform is to format it so that a single row of data contains both the input and possible output variables of a single entry. The machine learning container should then be able to separate the “X” and the “y” internally from these data rows. In the example container, the input and output features are specified in the model options.
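
As a small sketch, separating the inputs and the target inside the container could then be as simple as the following (the column names are hypothetical):

import pandas as pd

# A single row contains both the input features and the target value
df = pd.DataFrame([{"temperature": 5.1, "rain": 1.8, "weekday": 3, "people_in": 1240}])

input_features = ["temperature", "rain", "weekday"]   # "X" columns from the model options
output_features = ["people_in"]                       # "y" column(s) from the model options

X = df[input_features]
y = df[output_features]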

Building machine learning pipelines

For a typical machine learning application, at least two pipelines are needed: one for model creation and training, and one for predictions. A pipeline in the KEKO system consists of three types of data tasks: input, transform and output. In the following example, we will build a model that aims to predict people flow in a building using weather data.

Training

The training pipeline consists of the data assets, data formatting tasks, and the output driver for model creation and training.

The training pipeline starts by choosing one or more data sources for data input. If the data is already in the correct format, it can be connected directly to the training task. If necessary, the data can first be formatted and joined with other data in a data task before training. In this example, the input data consists of some daily weather features and hourly sums of people entering and exiting the building.

The training pipeline ends with an output task with the driver ML Training (v1). The output task defines the options for the machine learning model, the port of the container used for this task and the Model Id, which is used to reference the trained model in the prediction phase.

In this example, both the internal server port and the port visible from the outside are 5000, the default port for Flask. With multiple containers, the externally visible port should be something else to avoid collisions. XGBoost is chosen as the model for this application, with the default parameters. For an actual application, more fine-tuning of the model would be needed.

Once the training pipeline is defined, it can be executed from the main “Pipeline Builder” window. This phase takes care of model creation and training. For complex models and large data sets, this will take quite a lot of time. Once the training pipeline has finished, predictive pipelines can be built using the model.

Predicting

The first prediction pipeline in this example defines an access point via the REST API. Only the data from the REST interface is used as the input for predictions, so a special empty data source is used as an input to the data task.

The prediction pipeline is based on a function, defined in SQL, that uses the machine learning container. This definition is made in a data task.

The definition of the prediction function consists of defining the port of the container, the return type of the function and the structure of the function itself.

The variables and the types of the prediction call are defined with the JSON API driver. After the prediction pipeline is defined, the predictions can be accessed by calling

https://<server>/api/<application>/<pipeline>/<path_#1>/<path_#2>/…
(for example: https://kone.whereos.com/api/flow_forecast/demo_predict/31/0/5.1/1.8)

The example call returns the following prediction output:

[
    {
        "xgb": [
            5.9604644775390625E-8,
            0.22008103132247925,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            5.9604644775390625E-8,
            -6.746053695678711E-4,
            1.5481176376342773,
            0.9573951959609985,
            2.716968059539795,
            0.6374685764312744,
            28.40283966064453,
            3.001434803009033,
            47.48316955566406,
            1.205366849899292,
            17.89867401123047,
            9.303804397583008,
            11.548957824707031,
            7.589839458465576,
            12.240169525146484,
            4.23101806640625,
            16.329538345336914,
            8.032950401306152,
            6.049868106842041,
            8.026823043823242,
            12.423727035522461,
            12.14470386505127,
            6.426841735839844,
            32.54367446899414,
            3.541936159133911,
            21.986160278320312,
            4.155551910400391,
            28.460851669311523,
            23.993276596069336,
            8.305785179138184,
            0.0011903941631317139,
            5.92604923248291,
            4.9953742027282715,
            2.712473154067993,
            1.0000437498092651,
            11.98184585571289,
            0.7505946159362793,
            2.424765110015869,
            5.9604644775390625E-8,
            0.029594719409942627
        ]
    }
]

Prediction output is returned as a JSON-file. The format of the data can be adjusted in the prediction pipeline’s data task.
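
A client could consume this output for example as follows (assuming the access point is called with a GET request and without additional authentication, which depends on the pipeline configuration):

import requests

# The example prediction call from above; the path parameters are the model inputs
url = "https://kone.whereos.com/api/flow_forecast/demo_predict/31/0/5.1/1.8"
response = requests.get(url)
predictions = response.json()[0]["xgb"]   # the list of predicted values
print(len(predictions), "predictions, first value:", predictions[0])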

Predicting to a file

In this second example, the prediction pipeline takes an internal data asset as the source for prediction input and writes the prediction outputs into an internal data asset. These predictions are formatted into an hourly instead of a daily data structure.

An internal data asset is first chosen as the input for the pipeline, and it is connected to a prediction task. Predictions are then connected to an output task that writes the data into a file in the system.

The prediction function is defined similarly to the first pipeline. The actual prediction is part of a select statement that formats the input and output data suitably.

The predicted data is output into a file. Once these definitions are made, the pipeline can be executed from the main “Pipeline Builder” window.

Predictions can be formatted and used in further data processing.