How to build a decision tree model in IBM Db2

Date:


After developing a machine learning model, you need a place to run your model and serve predictions. If your company is in the early stage of its AI journey or has budget constraints, you may struggle to find a deployment system for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Someone with the knowledge of SQL and access to a Db2 instance, where the in-database ML feature is enabled, can easily learn to build and use a machine learning model in the database.

In this post, I will show how to develop, deploy, and use a decision tree model in a Db2 database.

These are my major steps in this tutorial:

  1. Set up Db2 tables
  2. Explore ML dataset
  3. Preprocess the dataset
  4. Train a decision tree model
  5. Generate predictions using the model
  6. Evaluate the model

I implemented these steps in a Db2 Warehouse on-prem database. Db2 Warehouse on cloud also supports these ML features.

The machine learning use case

I will use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight’s origin airport, departure time, flying time, and arrival time. Also, a column in the dataset indicates if each flight had arrived on time or late. Using examples from the dataset, we’ll build a classification model with decision tree algorithm. Once trained, the model can receive as input unseen flight data and predict if the flight will arrive on time or late at its destination.

1. Set up Db2 tables

The dataset I use in this tutorial is available here as a csv file.

Creating a Db2 table

I use the following SQL for creating a table for storing the dataset.

db2start
connect to <database_name>

db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3  (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,                       
DAYOFWEEK INTEGER ,                       
UNIQUECARRIER VARCHAR(50 OCTETS) ,               
ORIGIN VARCHAR(50 OCTETS) ,                      
DEST VARCHAR(50 OCTETS) ,                       
CRSDEPTIME INTEGER ,                       
DEPTIME INTEGER ,                       
DEPDELAY REAL ,                       
DEPDEL15 REAL ,                       
TAXIOUT INTEGER ,                       
WHEELSOFF INTEGER ,                       
CRSARRTIME INTEGER ,                       
CRSELAPSEDTIME INTEGER ,                       
AIRTIME INTEGER ,                       
DISTANCEGROUP INTEGER ,                       
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";

After creating the table, I use the following SQL to load the data, from the csv file, into the table:

db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"

I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I’ll copy a subset of the records from this table to a separate table for the ML model development and evaluation, leaving the original copy of the data intact. 

SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3

 — — — 
1000000

Creating a separate table with sample records

Create a table with 10% sample rows from the above table. Use the RAND function of Db2 for random sampling.

CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA

Count the number of rows in the sample table.

SELECT count(*) FROM FLIGHT.FLIGHTS_DATA

— — — 
99879

Look into the scheme definition of the table.

SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO

FLIGHTSTATUS is the response or the target column. Others are feature columns.

Find the DISTINCT values in the target column.

From these values, I can see that it’s a binary classification task where each flight arrived either on time or late. 

Find the frequencies of distinct values in the FLIGHTSTATUS column.

SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, count(*) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS

From the above, I see the classes are imbalanced. Now I’ll not gain any further insights from the entire dataset, as this can leak information to the modeling phase. 

Creating train/test partitions of the dataset

Before collecting deeper insights into the data, I’ll divide this dataset into train and test partitions using Db2’s RANDOM_SAMPLING SP. I apply stratified sampling to preserve the ratio between two classes in the generated training data set.

Create a TRAIN partition.

call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')

Copy the remaining records to a test PARTITION.

CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA

2. Explore data

In this step, I’ll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.

Look into some sample records.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY

Some columns have encoded the time as numbers:

 — CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)

 — DepTime: Departure Time (hhmm)

 — CRSArrTime: Computer Reservation System (scheduled) Arrival Time

Now, I collect summary statistics from the FLIGHTS_TRAIN using SUMMARY1000 SP to get a global view of the characteristics of the dataset.

CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')

Here the intable has the name of the input table from which I want SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store gathered statistics for the entire dataset. Besides the outtable, SUMMARY1000 SP creates a few additional output tables — one table with statistics for each column type. Our dataset has two types of columns — numeric and nominal. So, SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:

FLIGHTS_TRAIN_SUM1000_NUM

FLIGHTS_TRAIN_SUM1000_CHAR

Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to specific datatype and reduce the number of columns whose statistics are viewed together. This simplifies the analysis process. 

Check the summary statistics of the numeric column.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM

For the numeric columns, SUMMARY1000 gather the following statistics:

  • Missing value count
  • Non-missing value count
  • Average
  • Variance
  • Standard deviation
  • Skewness
  • Excess kurtosis
  • Minimum
  • Maximum

Each of these statistics can help uncover insights into the dataset. For instance, I can see that DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I will create a decision tree model, I don’t need to deal with the large value and the missing values. Db2 will deal with both issues natively. 

Next, I investigate the summary statistics of the nominal columns.

select * from FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR

For nominal columns, SUMMARY1000 gathered the following statistics:

  • Number of missing values
  • Number of non-missing values
  • Number of distinct values
  • Frequency of the most frequent value

3. Preprocess data

From the above data exploration, I can see that the dataset has no missing values. These four TIME columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I’ll leave the nominal values in all columns as-is, as the decision tree implementation in Db2 can deal with them natively. 

Extract the hour part from the TIME columns — CRSARRTIME, DEPTIME, WHEELSOFF.

From looking up the description of the dataset, I see the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are encoding of hhmm of the time values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm. 

Scale CRSARRTIME COLUMN: divide the value with 100 gives the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100

Scale DEPTIME COLUMN: divide the value by 100 gives the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100

Scale WHEELSOFF COLUMN: divide the value by 100 will give the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100

4. Train a decision tree model

Now the training dataset is ready for the decision tree algorithm. 

I train a decision tree model using GROW_DECTREE SP. 

CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')

I called this SP using the following parameters:

  • model: the name I want to give to the decision tree model — FLIGHT_DECTREE
  • intable: the name of the table where the training dataset is stored
  • id: the name of the ID column
  • target: the name of the target column

After completing the model training, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset. Here are some of the key tables:

  • FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples of metadata include depth of the tree, strategy for handling missing values, and the number of leaf nodes in the tree. 
  • FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree. 
  • FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and their role in the trained model. The information includes the importance of a column in generating a prediction from the model. 

This link has the complete list of model tables and their details. 

5. Generate predictions from the model

Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it for generating predictions on the test records from the FLIGHTS_TEST table.

First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset. 

Scale CRSARRTIME COLUMN: divide the value by 100 will give the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100

Scale DEPTIME COLUMN: divide the value by 100 will give the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100

Scale WHEELSOFF COLUMN: divide the value by 100 will give the hour of the flight arrival time:

UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100

Generating predictions

I use PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:

CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')

Here is the list of parameters I passed when calling this SP:

  • model: the name of the decision tree model, FLIGHT_DECTREE
  • intable: name of the input table to generate predictions from
  • outtable: the name of the table that the SP will create and store predictions to
  • prob: a boolean flag indicating if we want to include in the output the probability of each prediction
  • outputtableprob: the name of the output table where the probability of each prediction will be stored 

6. Evaluate the model

Using generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model’s predictions.

Creating a confusion matrix

I use CONFUSION_MATRIX SP to create a confusion matrix based on the model’s prediction on the TEST dataset. 

CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

In calling this SP, here are some of the key parameters that I passed:

  • intable: the name of the table that contains the dataset and the actual value of the target column
  • resulttable: the name of the table that contains the column with predicted values from the model
  • target: the name of the target column
  • matrixTable: The output table where the SP will store the confusion matrix

After the SP completes its run, we have the following output table with statistics for the confusion matrix. 

FLIGHTS_TEST_CMATRIX:

This table has three columns. The REAL column has the actual flight status. PREDICTION column has the predicted flight status. Since flight status takes two values – 0 (on time) or 1 (delayed), we have four possible combinations between values in the REAL and the PREDICTION columns: 

  1. TRUE NEGATIVE: REAL: 0, PREDICTION: 0 — The model has accurately predicted the status of those flights that arrived on schedule. From that CNT column, we see that 11795 rows from the TEST table belong to this combination.
  2. FALSE POSITIVE: REAL: 0, PREDICTION: 1 — these are the flights that actually arrived on time but the model predicted them to be delayed. 671 is the count of such flights. 
  3. FALSE NEGATIVE: REAL: 1, PREDICTION: 0 — these flights have arrived late, but the model predicted them to be on time. From the CNT table, we find their count to be 2528.
  4. TRUE POSITIVE: REAL: 1, PREDICTION: 1 — the model has accurately identified these flights that were late. The count is 4981. 

I use these counts to compute a few evaluation metrics for the model. For doing so, I use CMATRIX_STATS SP as follows:

CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS SP generates two sets of output. The first one shows overall quality metrics of the model. The second one includes the model’s predictive performance for each class. 

First output — overall model metrics include correction predictions, incorrect prediction, overall accuracy, weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%. 

With classification tasks, it’s usually useful to view the model’s quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class level quality metrics. 

For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV) or Precision, and F-measure (F1 score). 

Conclusions and key takeaways

If you want to build and deploy an ML model in a Db2 database using Db2’s built-in stored procedures, I hope you’ll find this tutorial useful. Here are the main takeaways of this tutorial:

  • Demonstrated a complete workflow of creating and using a decision tree model in a Db2 database using in-database ML Stored procedures. 
  • For each step in the workflow, I provided concrete and functional SQL statements and stored procedures. For each code example, when applicable, I explained intuitively what it does, and its inputs and outputs. 
  • Included references to IBM Db2’s documentation for the ML stored procedures I used in this tutorial. 

O’Reilly’s 2022 AI Adoption survey[3] underscored challenges in building technical infrastructure and skills gap as two top bottlenecks to AI adoption in the enterprise. Db2 solves the first one by supplying an end-to-end ML infrastructure in the database. It also lessens the latter, the skills gap, by providing simple SQL API for developing and using ML models in the database. In the enterprise, SQL is a more common skill than ML.

Check out the following resources to learn more about the ML features in IBM Db2 and see additional examples of ML use cases implemented with these features.

Explore Db2 ML Product Documentation

Explore Db2 ML samples in GitHub


References

  1. Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp.1–29.
  2. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
  3. Loukides, Mike, AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/

Source link

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Share post:

Subscribe

Popular

More like this
Related