Impact of Sample Size on Transfer Learning

INTRODUCTION

Optical Coherence Tomography (OCT) is a non-invasive imaging technique that uses light waves to obtain cross-sectional images of biological tissues with micrometer resolution.  OCT is commonly used to image the retina, and allows ophthalmologists to diagnose several diseases such as glaucoma, age-related macular degeneration and diabetic retinopathy.  In this post I classify OCT images into four categories: choroidal neovascularization, diabetic macular edema, drusen and normal, with the help of a Deep Learning architecture.  Given that my sample size is too small to train a whole Deep Learning architecture from scratch, I decided to apply a transfer learning technique and to determine how small the training set can be while still obtaining classification results with high accuracy.  Specifically, a VGG16 architecture pre-trained with the ImageNet dataset is used to extract features from OCT images, and the last layer is replaced with a new Softmax layer with four outputs.  I tested different amounts of training data and determined that fairly small datasets (400 images, 100 per category) produce accuracies of over 85%.

BACKGROUND

Optical Coherence Tomography (OCT) is a non-invasive and non-contact imaging technique.  OCT detects the interference formed by the signal from a broadband laser beam reflected from a reference mirror and a biological sample.  OCT is capable of generating in vivo cross-sectional volumetric images of the anatomical structures of biological tissues with microscopic resolution (1-10μm) in real time.  OCT has been used to understand the pathogenesis of different diseases, and is commonly used in the field of ophthalmology.

The Convolutional Neural Network (CNN) is a Deep Learning technique that has gained popularity in the last few years and has been used successfully in image classification tasks.  Several architectures have been popularized, and one of the simplest is the VGG16 model.  Training such a CNN architecture from scratch requires large amounts of data.

Transfer learning is a method that consists of taking a Deep Learning model originally trained with large amounts of data to solve a specific problem, and applying it to a challenge on a different dataset that contains only small amounts of data.

In this study I use the VGG16 Convolutional Neural Network architecture that was originally trained with the ImageNet dataset, and apply transfer learning to classify OCT images of the retina into four groups.  The purpose of the study is to determine the minimum number of images required to obtain high accuracy.

DATA SET

For this project I decided to use OCT images obtained from the retina of human subjects.  The data can be found on Kaggle, and was originally used for the following publication.  The data set contains images from four types of patients: normal, diabetic macular edema (DME), choroidal neovascularization (CNV), and drusen.  An example of each type of OCT image can be observed in Figure 1.

Fig. 1: From left to right: Choroidal Neovascularization (CNV) with neovascular membrane (white arrowheads) and associated subretinal fluid (arrows). Diabetic Macular Edema (DME) with retinal-thickening-associated intraretinal fluid (arrows). Multiple drusen (arrowheads) present in early AMD.  Normal retina with preserved foveal contour and absence of any retinal fluid/edema. Image obtained from the following publication.

To train the model I used a maximum of 20,000 images (5,000 for each class) so that the data would be balanced across all classes.  Additionally, 1,000 images (250 for each class) were set aside as a testing set to determine the accuracy of the model.

MODEL

For this project I used a VGG16 architecture, as shown below in Figure 2.  This architecture consists of several convolutional layers, whose dimensions are reduced by applying max pooling.  After the convolutional layers, two fully connected neural network layers are applied, terminating in a Softmax layer that classifies the images into one of 1,000 categories.  In this project I used the weights in the architecture that had been pre-trained with the ImageNet dataset.  The model was built on Keras using a TensorFlow backend in Python.

Fig. 2: VGG16 Convolutional Neural Network architecture displaying the convolutional, fully connected and softmax layers.  After each convolutional block there was a max pooling layer.

Given that the objective is to classify the images into 4 groups instead of 1,000, the top layers of the architecture were removed and replaced with a Softmax layer with 4 classes, using a categorical crossentropy loss function, an Adam optimizer and a dropout of 0.5 to avoid overfitting.  The models were trained for 20 epochs.
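This modification can be sketched with the Keras VGG16 application as below.  The exact layer stack is my reconstruction from the description above, not the original code; the 100,356 trainable parameters come from the flattened Block 5 features feeding the new 4-class layer.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Load VGG16 without its 1,000-class top; weights come from ImageNet pre-training.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

# Replace the removed top with dropout and a new 4-class Softmax layer.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

A call such as model.fit(X_train, y_train, epochs=20) would then train only the 100,356 parameters of the new layer.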

Each image was grayscale, with identical values for the Red, Green and Blue channels.  Images were resized to 224 × 224 × 3 pixels to fit the VGG16 model.
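A minimal preprocessing sketch is shown below, assuming the images sit on disk as standard grayscale files; PIL and NumPy are my choices here, not necessarily what the original pipeline used.

```python
import numpy as np
from PIL import Image

def load_oct_image(path):
    """Load a grayscale OCT image and prepare it for VGG16."""
    img = Image.open(path).convert("L").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32)
    # Replicate the single grayscale channel into R, G and B.
    return np.stack([arr, arr, arr], axis=-1)  # shape (224, 224, 3)
```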

A) Determining the Optimal Feature Layer

The first part of the study consisted of determining the layer within the architecture that produced the best features to be used for the classification problem.  Seven locations were tested, indicated in Figure 2 as Block 1, Block 2, Block 3, Block 4, Block 5, FC1 and FC2.  I tested the algorithm at each layer location by modifying the architecture at that point.  All the parameters in the layers before the tested location were frozen (keeping the parameters originally trained with the ImageNet dataset).  Then I added a Softmax layer with 4 classes and trained only the parameters of that last layer.  An example of the modified architecture at the Block 5 location is presented in Figure 3.  This location has 100,356 trainable parameters.  Similar architecture modifications were created for the other 6 layer locations (images not shown).
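The per-location experiment can be sketched as follows.  The layer names (block1_pool through block5_pool, fc1, fc2) are the Keras VGG16 names that I assume correspond to the seven locations in Figure 2; I build with weights=None here only to keep the sketch light, whereas the study used the ImageNet weights.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# weights="imagenet" in the actual study; None avoids the large download here.
base = VGG16(weights=None, include_top=True)

def head_at(layer_name):
    """Freeze everything up to layer_name and attach a 4-class Softmax head."""
    for layer in base.layers:
        layer.trainable = False
    features = base.get_layer(layer_name).output
    if len(features.shape) > 2:  # convolutional blocks need flattening
        features = layers.Flatten()(features)
    out = layers.Dense(4, activation="softmax")(layers.Dropout(0.5)(features))
    model = models.Model(base.input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

locations = ["block1_pool", "block2_pool", "block3_pool",
             "block4_pool", "block5_pool", "fc1", "fc2"]
```

Training head_at(name) for each entry in locations reproduces the seven modified architectures; at "block5_pool" the head has the 100,356 trainable parameters mentioned above.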

 
Fig. 3: VGG16 Convolutional Neural Network architecture displaying a replacement of the top layer at the location of Block 5, where a Softmax layer with 4 classes was added, and the 100,356 parameters were trained.

For each of the seven modified architectures, I trained the parameters of the Softmax layer using all 20,000 training samples.  Then I tested the model on the 1,000 testing samples it had not seen before.  The accuracy on the test data at each location is presented in Figure 4.  The best result was obtained at the Block 5 location, with an accuracy of 94.21%.

 
Fig. 4: Accuracy of the model as a function of the different layers where the Softmax layer was placed within the VGG16 architecture.

Table 1 presents the probabilities obtained from the confusion matrix.  Ideally we would obtain 25% in each of the four values on the main diagonal.
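Percentages of this kind can be computed from the model's predictions as in the following sketch; the labels here are synthetic stand-ins for the 1,000 test predictions, not the actual model outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(4), 250)   # 250 test images per class
y_pred = y_true.copy()
flip = rng.random(1000) < 0.06          # simulate a few misclassifications
y_pred[flip] = rng.integers(0, 4, size=flip.sum())

# Normalize over all 1,000 samples, so a perfect classifier would put 25%
# in each of the four diagonal entries.
cm = confusion_matrix(y_true, y_pred, normalize="all") * 100
```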

 
Table 1: Confusion matrix indicating the percentage of test samples in each true-versus-predicted class pair.

B) Determining the Minimum Number of Samples

Using the modified architecture at the Block 5 location, which had previously provided the best results with the full dataset of 20,000 images, I trained the model with different sample sizes from 4 to 20,000 (with an equal number of samples per class).  The results are shown in Figure 5.  If the model were randomly guessing, it would have an accuracy of 25%. However, with as few as 40 training samples the accuracy was above 50%, and by 400 samples it had reached more than 85%.
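Drawing the class-balanced subsets can be sketched with a helper like the one below (a hypothetical function of my own; the original sampling code is not shown in the post):

```python
import numpy as np

def balanced_subset(y, n_per_class, num_classes=4, seed=0):
    """Return indices of a class-balanced random subset of the training data."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in range(num_classes)
    ])
    rng.shuffle(idx)
    return idx
```

For example, balanced_subset(y_train, 100) would select the 400-image training set (100 per category) used in the 85% experiment.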

 
Fig. 5: Accuracy of the model as a function of the sample size.

CONCLUSION

In this study I explored the use of transfer learning for a classification problem using medical images of the retina obtained with OCT.  I determined that applying transfer learning to a VGG16 architecture pre-trained with the ImageNet dataset, with features extracted at Block 5, produced the highest accuracy.  Finally, I demonstrated that with a small sample size (400 images) I was able to obtain an accuracy higher than 85%.  This approach is a viable method to classify images where the sample size is small, such as in medical applications.