This is one of a series of notebooks I've created as part of participating in the MRNet Competition.
In this notebook I performed an EDA on the MRNet Dataset.
For other notebooks and code, please visit the github repo.
In the paper Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet, the Stanford ML Group developed an algorithm to predict abnormalities in knee MRI exams, and measured the clinical utility of providing the algorithm’s predictions to radiologists and surgeons during interpretation.
They developed a deep learning model for detecting:
general abnormalities
anterior cruciate ligament (ACL)
meniscal tears
The dataset (~5.7G) was released along with the publication of the paper. You can download it by agreeing to the Research Use Agreement and submitting your details on the MRNet Competition page.
It consists of 1,370 knee MRI exams, containing:
1,104 (80.6%) abnormal exams
319 (23.3%) ACL tears
508 (37.1%) meniscal tears
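As a quick sanity check, the prevalence percentages quoted above follow directly from the counts:

```python
# Verify the quoted prevalence percentages against the raw counts.
total = 1370
for label, n in [('abnormal', 1104), ('ACL tear', 319), ('meniscal tear', 508)]:
    print(f'{label}: {n / total * 100:.1f}%')
```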
The dataset is split into:
training set (1,130 exams, 1,088 patients)
validation set (120 exams, 111 patients) - called tuning set in the paper
hidden test set (120 exams, 113 patients) - called validation set in the paper
The hidden test set is not publicly available and is used for scoring models submitted for the competition.
N.B.
Stratified random sampling was used to ensure that at least 50 positive examples of abnormality, ACL tear and meniscal tear were present in each set.
All exams from each patient were put in the same split.
In the paper, an external validation was also performed on a publicly available dataset.
First things first - let's inspect the contents and familiarise ourselves with the dataset.
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from glob import glob
data_dir = '../data/MRNet-v1.0'
!ls -lh $data_dir
The dataset is made up of train and valid directories, as well as csv files for the diagnoses of the three conditions we're interested in. For the time being, I'm going to focus on the train and valid directories, which contain MRI scan images.
train_dir = f'{data_dir}/train'
valid_dir = f'{data_dir}/valid'
!ls -lh $train_dir $valid_dir
print('Check disk usage\n')
!du -h $train_dir $valid_dir
print(f'First 5 items in {train_dir}/axial/\n')
!ls $train_dir/axial | head -5
planes = ['axial', 'coronal', 'sagittal']
datasets = {'train': train_dir, 'valid': valid_dir}
for dataset, path in datasets.items():
    print(f'\nNumber of exams in {dataset} set:')
    for plane in planes:
        print(plane, len(glob(f'{path}/{plane}/*.npy')))
So far, we've found that the train and valid sets are further divided into three planes - axial, coronal and sagittal - and each exam series is provided as a npy file. We have also validated that the train and valid sets contain 1,130 and 120 exams, respectively.
Let's open a couple of them.
train_axial_exams = glob(f'{train_dir}/axial/*.npy')[:5]
for exam in train_axial_exams:
    series = np.load(exam)
    print(series.shape)
def plot_first_ten(path):
    series = np.load(path)[:10]
    fig = plt.figure(figsize=(20, 6))
    for i, image in enumerate(series):
        ax = fig.add_subplot(1, 10, i+1, xticks=[], yticks=[])
        plt.imshow(image, cmap='gray')
        plt.axis('off');
plot_first_ten(train_axial_exams[0])
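Beyond eyeballing the slices, it's worth checking the pixel dtype and value range, since that determines how we normalise the images later. Here's a minimal sketch; describe_series is a hypothetical helper, and the random array is a stand-in for a real exam loaded with np.load:

```python
import numpy as np

def describe_series(series):
    """Summarise pixel statistics for one exam series (slices, height, width)."""
    return {
        'dtype': str(series.dtype),
        'shape': series.shape,
        'min': float(series.min()),
        'max': float(series.max()),
        'mean': float(series.mean()),
    }

# Synthetic stand-in for np.load(train_axial_exams[0]):
# 30 slices of 256x256 8-bit pixels.
fake_series = np.random.default_rng(0).integers(0, 256, size=(30, 256, 256), dtype=np.uint8)
stats = describe_series(fake_series)
print(stats['dtype'], stats['shape'], stats['min'], stats['max'])
```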
According to the paper, the number of images in each exam varies from 17 to 61 (mean 31.48, SD 7.07).
Let's have a look at the distributions for the train and valid datasets.
def get_num_images_per_exam(exams):
    num_images_per_exam = []
    for exam in exams:
        series = np.load(exam)
        num_images_per_exam.append(series.shape[0])
    return np.asarray(num_images_per_exam)
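Note that get_num_images_per_exam loads every array fully into memory just to read its first dimension. If this becomes slow on the full dataset, one option is np.load's mmap_mode, which memory-maps the file and exposes the shape without reading the pixel data. A sketch, demonstrated on a throwaway .npy file rather than a real exam:

```python
import os
import tempfile
import numpy as np

def num_images(path):
    """Slice count of an exam, read via memory-mapping instead of a full load."""
    return np.load(path, mmap_mode='r').shape[0]

# Demo on a synthetic file; the real files live under e.g. train_dir/axial/.
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
np.save(tmp_path, np.zeros((27, 256, 256), dtype=np.uint8))
print(num_images(tmp_path))
```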
def plot_num_images_per_exam(dataset_path):
    f, axes = plt.subplots(1, 3, figsize=(20, 6), sharex=True)
    for i, plane in enumerate(planes):
        paths = glob(f'{dataset_path}/{plane}/*.npy')
        num_images = get_num_images_per_exam(paths)
        print(f'{plane} - min: {num_images.min()}, max: {num_images.max()}')
        sns.distplot(num_images, ax=axes[i])
        axes[i].set_title(f'{plane} plane')
train set:
plot_num_images_per_exam(train_dir)
valid set:
plot_num_images_per_exam(valid_dir)
For each plane, the distributions are somewhat similar between the training and validation sets. In the majority of cases, the number of images falls between 20 and 40.
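To compare against the paper's reported mean of 31.48 (SD 7.07), we can also compute summary statistics from the per-exam counts. A small sketch with hypothetical counts standing in for the output of get_num_images_per_exam:

```python
import numpy as np

def summarise_counts(num_images):
    """Mean and sample standard deviation of images per exam."""
    arr = np.asarray(num_images, dtype=float)
    return arr.mean(), arr.std(ddof=1)

# Hypothetical counts; in the notebook these come from get_num_images_per_exam(paths).
counts = [24, 32, 36, 28, 40, 30]
mean, sd = summarise_counts(counts)
print(f'mean: {mean:.2f}, SD: {sd:.2f}')
```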
It's probably a good place to stop for now, as we've gained a good understanding of the MRI scan images.
!find $data_dir -name "*.csv"
train_abnormal_df = pd.read_csv(f'{data_dir}/train-abnormal.csv',
header=None,
names=['case', 'abnormal'],
dtype={'case': str, 'abnormal': np.int64})
print(train_abnormal_df.groupby('abnormal').count())
print('\nMissing values: ', train_abnormal_df.isnull().values.any())
train_abnormal_df.head()
train_acl_df = pd.read_csv(f'{data_dir}/train-acl.csv',
header=None,
names=['case', 'acl'],
dtype={'case': str, 'acl': np.int64})
print(train_acl_df.groupby('acl').count())
print('\nMissing values: ', train_acl_df.isnull().values.any())
train_acl_df.head()
train_meniscus_df = pd.read_csv(f'{data_dir}/train-meniscus.csv',
header=None,
names=['case', 'meniscus'],
dtype={'case': str, 'meniscus': np.int64})
print(train_meniscus_df.groupby('meniscus').count())
print('\nMissing values: ', train_meniscus_df.isnull().values.any())
train_meniscus_df.head()
It would be handy to have labels for each condition per case in one place, so we will merge the three dataframes.
train_df = pd.merge(train_abnormal_df, train_acl_df, on='case').merge(train_meniscus_df, on='case')
train_df.head()
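With the labels in one frame, the positive counts per condition are just column sums, which we can later check against the totals quoted from the paper. A sketch on a hypothetical miniature frame with the same columns as train_df:

```python
import pandas as pd

# Miniature stand-in for train_df after the merge.
df = pd.DataFrame({
    'case': ['0000', '0001', '0002', '0003'],
    'abnormal': [1, 1, 0, 1],
    'acl': [1, 0, 0, 0],
    'meniscus': [0, 1, 0, 1],
})
# Binary labels, so the column sum is the count of positive cases.
positives = df[['abnormal', 'acl', 'meniscus']].sum()
print(positives.to_dict())
```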
We will do the same for the valid set.
valid_abnormal_df = pd.read_csv(f'{data_dir}/valid-abnormal.csv',
header=None,
names=['case', 'abnormal'],
dtype={'case': str, 'abnormal': np.int64})
valid_acl_df = pd.read_csv(f'{data_dir}/valid-acl.csv',
header=None,
names=['case', 'acl'],
dtype={'case': str, 'acl': np.int64})
valid_meniscus_df = pd.read_csv(f'{data_dir}/valid-meniscus.csv',
header=None,
names=['case', 'meniscus'],
dtype={'case': str, 'meniscus': np.int64})
valid_df = pd.merge(valid_abnormal_df, valid_acl_df, on='case').merge(valid_meniscus_df, on='case')
valid_df.head()
We can now inspect co-occurrence of abnormalities, ACL tear and meniscal tear.
pd.options.display.float_format = '{:,.1f}'.format
train_co_occurrence_df = train_df.groupby(['abnormal', 'acl', 'meniscus']).count()
train_co_occurrence_df['%_freq'] = (train_co_occurrence_df.case / train_co_occurrence_df.case.sum()) * 100
train_co_occurrence_df.head()
valid_co_occurrence_df = valid_df.groupby(['abnormal', 'acl', 'meniscus']).count()
valid_co_occurrence_df['%_freq'] = (valid_co_occurrence_df.case / valid_co_occurrence_df.case.sum()) * 100
valid_co_occurrence_df.head()
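An alternative to the groupby-then-normalise pattern above is pd.crosstab with normalize='all', which produces the percentage frequencies in one call. A sketch on a hypothetical miniature label frame:

```python
import pandas as pd

# Miniature stand-in for the merged label frame.
df = pd.DataFrame({
    'case': ['0', '1', '2', '3', '4'],
    'abnormal': [1, 1, 1, 0, 1],
    'acl': [1, 0, 1, 0, 0],
    'meniscus': [0, 1, 1, 0, 1],
})
# normalize='all' divides each cell count by the grand total; x100 gives %_freq.
ct = pd.crosstab([df['abnormal'], df['acl']], df['meniscus'], normalize='all') * 100
print(ct)
```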
It's worth noting that there are discrepancies between the train and valid datasets in terms of the frequency of co-occurrence of these conditions.