This is one of a series of notebooks I've created as part of participating in the MRNet Competition.
In this notebook I performed an EDA on the MRNet Dataset.
For other notebooks and code, please visit the github repo.
In the paper Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet, the Stanford ML Group developed an algorithm to predict abnormalities in knee MRI exams, and measured the clinical utility of providing the algorithm’s predictions to radiologists and surgeons during interpretation.
They developed a deep learning model for detecting:
general abnormalities
anterior cruciate ligament (ACL)
meniscal tears
The dataset (~5.7G) was released along with the publication of the paper. You can download it by agreeing to the Research Use Agreement and submitting your details on the MRNet Competition page.
It consists of 1,370 knee MRI exams, containing:
1,104 (80.6%) abnormal exams
319 (23.3%) ACL tears
508 (37.1%) meniscal tears
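As a quick sanity check, the prevalence percentages quoted above follow directly from the counts:

```python
# Verify the quoted prevalence percentages against the raw counts.
total = 1370
for label, n in [('abnormal', 1104), ('ACL tear', 319), ('meniscal tear', 508)]:
    print(f'{label}: {n / total * 100:.1f}%')
```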
The dataset is split into:
training set (1,130 exams, 1,088 patients)
validation set (120 exams, 111 patients) - called tuning set in the paper
hidden test set (120 exams, 113 patients) - called validation set in the paper
The hidden test set is not publicly available and is used for scoring models submitted for the competition.
N.B.
Stratified random sampling was used to ensure that at least 50 positive examples of abnormality, ACL tear and meniscal tear were present in each set.
All exams from each patient were put in the same split.
In the paper, an external validation was also performed on a publicly available dataset.
First things first - let's inspect the contents and familiarise ourselves with the dataset.
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from glob import glob
data_dir = '../data/MRNet-v1.0'
!ls -lh $data_dir
The dataset is made up of train and valid directories, as well as csv files for the diagnoses of the three conditions we're interested in. For the time being, I'm going to focus on the train and valid directories, which contain MRI scan images.
train_dir = f'{data_dir}/train'
valid_dir = f'{data_dir}/valid'
!ls -lh $train_dir $valid_dir
print('Check disk usage\n')
!du -h $train_dir $valid_dir
print(f'First 5 items in {train_dir}/axial/\n')
!ls $train_dir/axial | head -5
planes = ['axial', 'coronal', 'sagittal']
datasets = {'train': train_dir, 'valid': valid_dir}
for dataset, path in datasets.items():
    print(f'\nNumber of exams in {dataset} set:')
    for plane in planes:
        print(plane, len(glob(f'{path}/{plane}/*.npy')))
So far, we've found that the train and valid sets are further divided into three planes - axial, coronal and sagittal - and each exam series is provided as a npy file. We have also validated that the train and valid sets contain 1,130 and 120 exams, respectively.
Let's open a couple of them.
train_axial_exams = glob(f'{train_dir}/axial/*.npy')[:5]
for exam in train_axial_exams:
    series = np.load(exam)
    print(series.shape)
def plot_first_ten(path):
    series = np.load(path)[:10]
    fig = plt.figure(figsize=(20, 6))
    for i, image in enumerate(series):
        ax = fig.add_subplot(1, 10, i+1, xticks=[], yticks=[])
        plt.imshow(image, cmap='gray')
        plt.axis('off');
plot_first_ten(train_axial_exams[0])
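Beyond eyeballing the slices, it's worth checking the pixel dtype and value range, since that determines how we normalise the images later. Here's a minimal sketch; describe_series is a hypothetical helper, and the random array is a stand-in for a real exam loaded with np.load:

```python
import numpy as np

def describe_series(series):
    """Summarise pixel statistics for one exam series (slices, height, width)."""
    return {
        'dtype': str(series.dtype),
        'shape': series.shape,
        'min': float(series.min()),
        'max': float(series.max()),
        'mean': float(series.mean()),
    }

# Synthetic stand-in for np.load(train_axial_exams[0]):
# 30 slices of 256x256 8-bit pixels.
fake_series = np.random.default_rng(0).integers(0, 256, size=(30, 256, 256), dtype=np.uint8)
stats = describe_series(fake_series)
print(stats['dtype'], stats['shape'], stats['min'], stats['max'])
```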
According to the paper, the number of images in each exam varies from 17 to 61 (mean 31.48, SD 7.07).
Let's have a look at the distributions for the train and valid datasets.
def get_num_images_per_exam(exams):
    num_images_per_exam = []
    for exam in exams:
        series = np.load(exam)
        num_images_per_exam.append(series.shape[0])
    return np.asarray(num_images_per_exam)
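Note that get_num_images_per_exam loads every array fully into memory just to read its first dimension. If this becomes slow on the full dataset, one option is np.load's mmap_mode, which memory-maps the file and exposes the shape without reading the pixel data. A sketch, demonstrated on a throwaway .npy file rather than a real exam:

```python
import os
import tempfile
import numpy as np

def num_images(path):
    """Slice count of an exam, read via memory-mapping instead of a full load."""
    return np.load(path, mmap_mode='r').shape[0]

# Demo on a synthetic file; the real files live under e.g. train_dir/axial/.
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
np.save(tmp_path, np.zeros((27, 256, 256), dtype=np.uint8))
print(num_images(tmp_path))
```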
def plot_num_images_per_exam(dataset_path):
    f, axes = plt.subplots(1, 3, figsize=(20, 6), sharex=True)
    for i, plane in enumerate(planes):
        paths = glob(f'{dataset_path}/{plane}/*.npy')
        num_images = get_num_images_per_exam(paths)
        print(f'{plane} - min: {num_images.min()}, max: {num_images.max()}')
        sns.distplot(num_images, ax=axes[i])
        axes[i].set_title(f'{plane} plane')
train set:
plot_num_images_per_exam(train_dir)
valid set:
plot_num_images_per_exam(valid_dir)
For each plane, the distributions are somewhat similar between the training and validation sets. In the majority of cases, the number of images falls between 20 and 40.
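To compare against the paper's reported mean of 31.48 (SD 7.07), we can also compute summary statistics from the per-exam counts. A small sketch with hypothetical counts standing in for the output of get_num_images_per_exam:

```python
import numpy as np

def summarise_counts(num_images):
    """Mean and sample standard deviation of images per exam."""
    arr = np.asarray(num_images, dtype=float)
    return arr.mean(), arr.std(ddof=1)

# Hypothetical counts; in the notebook these come from get_num_images_per_exam(paths).
counts = [24, 32, 36, 28, 40, 30]
mean, sd = summarise_counts(counts)
print(f'mean: {mean:.2f}, SD: {sd:.2f}')
```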
It's probably a good place to stop for now, as we've gained a good understanding of the MRI scan images.
!find $data_dir -name "*.csv"
train_abnormal_df = pd.read_csv(f'{data_dir}/train-abnormal.csv',
header=None,
names=['case', 'abnormal'],
dtype={'case': str, 'abnormal': np.int64})
print(train_abnormal_df.groupby('abnormal').count())
print('\nMissing values: ', train_abnormal_df.isnull().values.any())
train_abnormal_df.head()
train_acl_df = pd.read_csv(f'{data_dir}/train-acl.csv',
header=None,
names=['case', 'acl'],
dtype={'case': str, 'acl': np.int64})
print(train_acl_df.groupby('acl').count())
print('\nMissing values: ', train_acl_df.isnull().values.any())
train_acl_df.head()
train_meniscus_df = pd.read_csv(f'{data_dir}/train-meniscus.csv',
header=None,
names=['case', 'meniscus'],
dtype={'case': str, 'meniscus': np.int64})
print(train_meniscus_df.groupby('meniscus').count())
print('\nMissing values: ', train_meniscus_df.isnull().values.any())
train_meniscus_df.head()
It would be handy to have labels for each condition per case in one place, so we will merge the three dataframes.
train_df = pd.merge(train_abnormal_df, train_acl_df, on='case').merge(train_meniscus_df, on='case')
train_df.head()
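With the labels in one frame, the positive counts per condition are just column sums, which we can later check against the totals quoted from the paper. A sketch on a hypothetical miniature frame with the same columns as train_df:

```python
import pandas as pd

# Miniature stand-in for train_df after the merge.
df = pd.DataFrame({
    'case': ['0000', '0001', '0002', '0003'],
    'abnormal': [1, 1, 0, 1],
    'acl': [1, 0, 0, 0],
    'meniscus': [0, 1, 0, 1],
})
# Binary labels, so the column sum is the count of positive cases.
positives = df[['abnormal', 'acl', 'meniscus']].sum()
print(positives.to_dict())
```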
We will do the same for the valid set.
valid_abnormal_df = pd.read_csv(f'{data_dir}/valid-abnormal.csv',
header=None,
names=['case', 'abnormal'],
dtype={'case': str, 'abnormal': np.int64})
valid_acl_df = pd.read_csv(f'{data_dir}/valid-acl.csv',
header=None,
names=['case', 'acl'],
dtype={'case': str, 'acl': np.int64})
valid_meniscus_df = pd.read_csv(f'{data_dir}/valid-meniscus.csv',
header=None,
names=['case', 'meniscus'],
dtype={'case': str, 'meniscus': np.int64})
valid_df = pd.merge(valid_abnormal_df, valid_acl_df, on='case').merge(valid_meniscus_df, on='case')
valid_df.head()
We can now inspect co-occurrence of abnormalities, ACL tear and meniscal tear.
pd.options.display.float_format = '{:,.1f}'.format
train_co_occurrence_df = train_df.groupby(['abnormal', 'acl', 'meniscus']).count()
train_co_occurrence_df['%_freq'] = (train_co_occurrence_df.case / train_co_occurrence_df.case.sum()) * 100
train_co_occurrence_df.head()
valid_co_occurrence_df = valid_df.groupby(['abnormal', 'acl', 'meniscus']).count()
valid_co_occurrence_df['%_freq'] = (valid_co_occurrence_df.case / valid_co_occurrence_df.case.sum()) * 100
valid_co_occurrence_df.head()
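An alternative to the groupby-then-normalise pattern above is pd.crosstab with normalize='all', which produces the percentage frequencies in one call. A sketch on a hypothetical miniature label frame:

```python
import pandas as pd

# Miniature stand-in for the merged label frame.
df = pd.DataFrame({
    'case': ['0', '1', '2', '3', '4'],
    'abnormal': [1, 1, 1, 0, 1],
    'acl': [1, 0, 1, 0, 0],
    'meniscus': [0, 1, 1, 0, 1],
})
# normalize='all' divides each cell count by the grand total; x100 gives %_freq.
ct = pd.crosstab([df['abnormal'], df['acl']], df['meniscus'], normalize='all') * 100
print(ct)
```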
It's worth noting that there are discrepancies between the train and valid datasets in terms of the frequency of co-occurrence of these conditions.