Lung Cancer Detection Algorithm in SQL Server

This document describes how to run a transfer learning algorithm for lung cancer detection using deep learning and SQL Server 2017. With this tutorial we would like to showcase one of the most exciting new features of SQL Server 2017: in-database stored procedures with Python.

Technically, an in-database stored procedure is a Python routine executed inside the database, so data movement is minimized. With this solution we want to demonstrate a new computation paradigm, where the intelligence of the application is brought to the data, instead of bringing the data to the application.
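As a minimal illustration of the mechanism (not one of this tutorial's scripts), Python stored procedures in SQL Server 2017 are built on top of `sp_execute_external_script`; the table and column names below are hypothetical:

```sql
-- Hedged sketch: runs a Python script inside SQL Server 2017.
-- InputDataSet and OutputDataSet are the standard pandas DataFrame
-- variables exposed to the script; dbo.scans is a hypothetical table.
EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
OutputDataSet = InputDataSet.describe().reset_index()
',
    @input_data_1 = N'SELECT patient_id, slice_count FROM dbo.scans';
```

A stored procedure can wrap a call like this so that the Python code runs next to the data it reads.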

The data we used are CT scans from the 2017 Data Science Bowl. The scans are horizontal slices of the thorax; the images are black and white and of size 512x512. The scans are grouped by patient: there are 1595 patients, and each of them has a variable number of scans, ranging from 100 to 500 images. The dataset is labelled per patient, not per image, meaning that for each patient we have a single label indicating whether they have cancer or not.
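The per-patient labels come in a CSV file with one row per patient id and a 0/1 cancer label. A minimal sketch of reading it with the standard library; the sample rows below are made up for illustration:

```python
import csv
import io

# Hypothetical sample of stage1_labels.csv: patient id hash, cancer label (1 = cancer)
sample = io.StringIO(
    "id,cancer\n"
    "0015ceb851d7251b8f399e39779d1e7d,1\n"
    "0030a160d58723ff36d73f41b170ec21,0\n"
)

# Map each patient id to its label
labels = {row["id"]: int(row["cancer"]) for row in csv.DictReader(sample)}
print(len(labels))           # number of labelled patients
print(sum(labels.values()))  # number of cancer-positive patients
```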

In the next figure there is an animation showing all the scans for one patient:

animation of lung cancer scans

We use transfer learning with a Convolutional Neural Network (CNN) pre-trained on the ImageNet dataset as a featurizer to generate features from the Data Science Bowl dataset; this step is computed with CNTK on a GPU. Once the features are computed, a boosted tree built with LightGBM is used to classify the image. The process is explained in detail in this blog.

In the next figure we present an overview of the complete algorithm:

transfer learning overview

To create the featurizer, we remove the last layer of the pretrained CNN (in this example we used the ResNet architecture) and use the output of the penultimate layer as features. Each patient has an arbitrary number of scan images. The images are cropped to 224×224 and packed in groups of 3 to match the input format of ImageNet models. They are fed to the pre-trained network in batches and convolved through each internal layer up to the penultimate one. This process is performed using CNTK. The outputs of the network are the features we feed to the boosted tree, implemented with LightGBM.
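The crop-and-pack step described above can be sketched with NumPy; the center-crop logic and the grouping are simplified assumptions, not the actual code from lung_cancer_utils.py:

```python
import numpy as np

def center_crop(img, size=224):
    """Crop a square window of size x size from the center of a 2-D scan."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Hypothetical patient: 9 grayscale slices of 512x512
scans = [np.zeros((512, 512), dtype=np.float32) for _ in range(9)]
cropped = [center_crop(s) for s in scans]

# Pack consecutive slices in groups of 3 to mimic the RGB channels
# expected by an ImageNet-pretrained ResNet
packed = [np.stack(cropped[i:i + 3], axis=0) for i in range(0, len(cropped), 3)]
print(packed[0].shape)  # (3, 224, 224)
```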

Once the boosted tree model is trained, it can be operationalized to classify cancerous scans for other patients using a web app.

In the next sections we explain how to execute this system inside SQL Server. All the data, models and resulting features are stored in, and queried from, different tables of a SQL database. There are 3 main processes: featurization, training and scoring. They are explained next, together with the initial setup.

Installation

The installation process can be found here.

Preprocessing

We have to download the data from the Kaggle dataset. The images are in DICOM format and consist of a group of slices of the thorax of each patient, as shown in the following figure:

lung cancer scans

We are going to upload the images to SQL Server. The reason for doing this, instead of reading the images directly from disk, is that we want to simulate a scenario where all the data is already in SQL. For demo purposes we are going to use a small subset of the images; it can be found in stage1_labels_partial.csv. This subset consists of 200 patients out of the 1595. The complete patient info is in stage1_labels.csv.

The first step is to create in SQL Server a database called lung_cancer_database.

The next step is to create a table for the images and upload them. First, set the correct paths in the file config_preprocessing.py. In case you want to upload the full dataset, just uncomment STAGE1_LABELS = os.path.join(DATA_PATH, 'stage1_labels.csv'). To import the images into the SQL database, execute the script insert_scan_images_in_sql_database.py. This will take a while.
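For reference, the relevant part of config_preprocessing.py presumably looks something like this; the DATA_PATH value is an assumption you must adapt to your machine:

```python
import os

# Hedged sketch of config_preprocessing.py; DATA_PATH is an assumed
# location for the Kaggle download, adjust it to your machine.
DATA_PATH = r"C:\data\lung_cancer"

# Partial labels (200 patients) used in this demo
STAGE1_LABELS = os.path.join(DATA_PATH, 'stage1_labels_partial.csv')

# Uncomment to use the full dataset (1595 patients):
# STAGE1_LABELS = os.path.join(DATA_PATH, 'stage1_labels.csv')

print(STAGE1_LABELS)
```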

In the meantime, execute the script insert_other_items_in_sql_database.py. This script creates and fills tables for the labels, the CNN model and a GIF representation of the images.

Process 1: Featurization of Lung Scans with CNN in a GPU

This initial process generates features from the scans using a pretrained ResNet. The code can be found in the SQL stored procedure script sp_00_cnn_feature_generation_create.sql. To create the stored procedure, you just need to execute the SQL file in SQL Server Management Studio. This will create a new stored procedure under lung_cancer_database/Programmability/Stored Procedures called dbo.GenerateFeatures.

The main routine is straightforward and consists of 9 lines of code. All the helper functions used by the script can be found in lung_cancer_utils.py.

# Run CNTK on the first GPU
try_set_default_device(gpu(0))
# Get the ids of all patients whose scans are stored in SQL
patients = get_patients_id(TABLE_SCAN_IMAGES, cur)
# Load the pretrained ResNet and truncate it at the penultimate layer ("z.x"),
# so its output becomes the feature vector
model = load_model(Model)
model = select_model_layer(model, "z.x")

for i, p in enumerate(patients):
    # Read the patient's scans from SQL and preprocess them (crop and pack)
    scans = get_patient_images(TABLE_SCAN_IMAGES, cur, p)
    scans = manipulate_images(scans)
    # Featurize the scans on the GPU and write the features back to SQL
    feats = compute_features_with_gpu(model, scans, BATCH_SIZE)
    insert_features(TABLE_FEATURES, cur, conn, p, feats)

The code selects the GPU, loads the pretrained model, truncates it at the penultimate layer, and then, for each patient, reads the scans from SQL, preprocesses them, computes the features on the GPU and inserts them back into SQL.

To execute this stored procedure, run the file sp_00_cnn_feature_generation_execute.sql. This process takes around 40 minutes on a Windows GPU DSVM.

To check that the GPU is actually executing the process, you can type nvidia-smi in a terminal.

Process 2: Training of Scan Features with Boosted Tree

Once the features are computed and inserted into the SQL table, we use them to train a boosted tree with the LightGBM library. The code for this process is in sp_01_boosted_tree_training_create.sql and generates a stored procedure called dbo.TrainLungCancerModel.

In this case the main code consists of 4 lines:

# Get the ids of the labelled patients
patients_train = get_patients_id(TABLE_LABELS, cur)
# Build train and validation sets from the features and labels stored in SQL
trn_x, val_x, trn_y, val_y = generate_set(TABLE_FEATURES, TABLE_LABELS, patients_train, cur)
# Train the LightGBM classifier
classifier = train_lightgbm(trn_x, val_x, trn_y, val_y)
# Serialize the trained model and store it in the model table
insert_model(TABLE_MODEL, cur, conn, classifier, LIGHTGBM_MODEL_NAME)
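The generate_set helper lives in lung_cancer_utils.py; conceptually it splits the data into train and validation sets at the patient level, since labels are per patient. A pure-Python sketch of that idea, where the split ratio and shuffling are assumptions:

```python
import random

def split_patients(patient_ids, val_fraction=0.2, seed=0):
    """Split patient ids into train/validation sets, keeping each patient whole."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # train, validation

# 200 hypothetical patient indices, as in the demo subset
train_ids, val_ids = split_patients(range(200))
print(len(train_ids), len(val_ids))  # 160 40
```

Splitting by patient rather than by image avoids leaking scans of the same patient into both sets.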

The code retrieves the ids of the labelled patients, builds the train and validation sets from the features and labels stored in SQL, trains the LightGBM classifier, and inserts the serialized model into the model table.

This process takes around 1 minute on a DSVM.

Process 3: Scoring with the Trained Classifier

The final process is the operationalization routine. The boosted tree can be used to compute the probability that a new patient has cancer. The script is sp_02_boosted_tree_scoring_create.sql and generates a stored procedure called PredictLungCancer. This can be connected to a web app via an API.

The inputs of the SQL stored procedure are @PatientIndex and @ModelName. The output is the prediction result @PredictionResult, given a patient index and a model name. Inside the stored procedure, we retrieve the boosted tree model and the features of the patient, both of which are serialized and stored as binary variables:

DECLARE @Model VARBINARY(MAX) = (SELECT TOP(1) model from dbo.model where name = @ModelName ORDER BY date DESC);
DECLARE @Features VARBINARY(MAX) = (SELECT TOP(1) array FROM dbo.features AS t1 
                                    INNER JOIN dbo.patients AS t2 ON t1.patient_id = t2.patient_id 
                                    WHERE t2.idx = @PatientIndex);
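Both the model and the features travel between SQL and Python as pickled byte strings, which map naturally to VARBINARY(MAX). A minimal round-trip sketch, using a plain dictionary as a stand-in for the LightGBM model:

```python
import pickle

# Stand-in object; in the demo this would be the trained LightGBM classifier
model = {"name": "lightgbm", "n_trees": 100}

blob = pickle.dumps(model)     # bytes suitable for a VARBINARY(MAX) column
restored = pickle.loads(blob)  # what the scoring procedure does with @Model

print(restored == model)  # True
```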

These two variables are sent to Python as serialized objects. The main Python code has 4 lines:

# Deserialize the patient features and the trained model
feats = pickle.loads(Features)
model = pickle.loads(Model)
# Compute the cancer probability and express it as a percentage
probability_cancer = prediction(model, feats)
PredictionResult = float(probability_cancer)*100

In this case there is an input and an output for the Python routine from SQL. The input is PatientIndex, the index of the patient we want to analyze. The output is PredictionResult, the probability that this patient has cancer. The code deserializes the features and the model, computes the cancer probability, and converts it to a percentage.

To execute this stored procedure you have to run this query, which takes about 1 second:

DECLARE @PredictionResultSP FLOAT;
EXECUTE lung_cancer_database.dbo.PredictLungCancer @PatientIndex = 0, @PredictionResult = @PredictionResultSP;

The variable @PredictionResultSP is the output of the stored procedure and @PatientIndex = 0 is the input. If we use the small dataset, the maximum input is 200; if we use the full dataset, the maximum input is 1594.

Lung Cancer Detection Web Service

We created a demo web app to showcase lung cancer detection with Python in SQL Server. To run it, you just need to execute api_service.py.

The web page can be accessed at http://localhost:5000.

In case you want to access it from outside, you have to open port 5000 in the Azure portal (Network Interfaces/Network security group/Inbound security rules). You need to do the same in the firewall inside the virtual machine (Windows Firewall with Advanced Security/Inbound rules). To access the web service from outside, just replace localhost with the DNS name or IP of the VM.

The final web page looks like this:

web page

You can try searching for a patient called Anthony or another called Ana. You can also search for patients by ID by entering a number between 0 and 200 (1594 if you use the full dataset).

Disclaimer

The idea of the lung cancer demo is to showcase that a deep learning algorithm can be computed on a GPU inside SQL Server with Python.

The accuracy of the actual algorithm is low, and the pipeline is very simple. This algorithm was created as a baseline for the lung cancer Kaggle competition, to allow users to quickly set up a DSVM and execute a CNTK algorithm.

An example of an algorithm with higher accuracy can be found here: its pipeline has a 3D CNN for nodule segmentation, one CNN for false positive reduction, another CNN for identifying whether the nodule is malignant, then transfer learning, and finally ensembling.

It is important to understand that the focus of the demo is not the algorithm itself but the pipeline, which makes it possible to execute deep learning inside a SQL database.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.