Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce errors.
The goal of this article is to identify IDC when it is present in otherwise unlabeled histopathology images.
The dataset consists of 277,524 RGB digital image patches of 50x50 pixels, extracted from 162 whole-slide images of H&E-stained breast histopathology samples.
The breast tissue contains many cells, but only some of them are cancerous. Patches that are labeled '1' contain cells that are characteristic of invasive ductal carcinoma. For more information about the data, see https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872.
Dataset Download Link: https://www.kaggle.com/paultimothymooney/breast-histopathology-images
Let's start working on the dataset.
Step 1: Import Libraries
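The article does not list the imports, so here is a minimal set covering everything used below. I assume Keras with a TensorFlow backend (in newer Keras versions, to_categorical lives in keras.utils) and the imbalanced-learn package for the oversampling step later on.
import random
import fnmatch
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.utils.np_utils import to_categorical  # keras.utils.to_categorical in newer Keras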
Step 2: Explore Data
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.
imagePatches = glob('/kaggle/input/IDC_regular_ps50_idx5/**/*.png', recursive=True)
for filename in imagePatches[0:10]:
    print(filename)
Now make a function that can plot an image.
image_name = "/kaggle/input/IDC_regular_ps50_idx5/9135/1/9135_idx5_x1701_y1851_class1.png" #Image to be used as query
def plotImage(image_location):
    image = cv2.imread(image_location)  # read from the path argument, not the global image_name
    image = cv2.resize(image, (50, 50))
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)); plt.axis('off')
plotImage(image_name)
The plotImage function takes an image path and plots the image; here is the result of the function.
Let's plot sample images from the dataset.
# Plot Multiple Images
bunchOfImages = imagePatches
i_ = 0
plt.rcParams['figure.figsize'] = (10.0, 10.0)
plt.subplots_adjust(wspace=0, hspace=0)
for l in bunchOfImages[:25]:
    im = cv2.imread(l)
    im = cv2.resize(im, (50, 50))
    plt.subplot(5, 5, i_ + 1)  # .set_title(l)
    plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
    i_ += 1
def randomImages(a):
    r = random.sample(a, 3)  # sample only the three patches we plot
    plt.figure(figsize=(16, 16))
    plt.subplot(131)
    plt.imshow(cv2.cvtColor(cv2.imread(r[0]), cv2.COLOR_BGR2RGB))  # convert BGR to RGB for display
    plt.subplot(132)
    plt.imshow(cv2.cvtColor(cv2.imread(r[1]), cv2.COLOR_BGR2RGB))
    plt.subplot(133)
    plt.imshow(cv2.cvtColor(cv2.imread(r[2]), cv2.COLOR_BGR2RGB))
randomImages(imagePatches)
Step 3: Preprocess Data
For training and testing, we need to preprocess our data, because the dataset is not in the desired format.
Right now we have a bunch of images, but we don't know which ones are IDC(+) and which are IDC(-), so we cannot set the value of Y for training.
To find out, we parse the file name: if it ends with class0.png the image is IDC(-), and if it ends with class1.png it is IDC(+).
patternZero = '*class0.png'
patternOne = '*class1.png'
classZero = fnmatch.filter(imagePatches, patternZero)
classOne = fnmatch.filter(imagePatches, patternOne)
print("IDC(-)\n\n",classZero[0:5],'\n')
print("IDC(+)\n\n",classOne[0:5])
Now let's make a function called proc_images that processes every image and returns x and y, where x holds the images and y the labels: 0 if an image matches class0, otherwise 1.
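The article calls proc_images below without showing its definition. Here is a minimal sketch consistent with how it is used; the cubic-interpolation resize is an assumption.
def proc_images(lowerIndex, upperIndex):
    """
    Read the patches in imagePatches[lowerIndex:upperIndex], resize them
    to 50x50, and return x (images) and y (labels parsed from file names)
    """
    x = []
    y = []
    for imagePath in imagePatches[lowerIndex:upperIndex]:
        fullSizeImage = cv2.imread(imagePath)
        x.append(cv2.resize(fullSizeImage, (50, 50), interpolation=cv2.INTER_CUBIC))
        y.append(1 if imagePath.endswith('class1.png') else 0)
    return x, y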
X,Y = proc_images(0,90000)
df = pd.DataFrame()
df["images"]=X
df["labels"]=Y
X2=df["images"]
Y2=df["labels"]
X2=np.array(X2)
imgs0=[]
imgs1=[]
imgs0 = X2[Y2==0] # (0 = no IDC, 1 = IDC)
imgs1 = X2[Y2==1]
Description of the dataset:
def describeData(a, b):
    print('Total number of images: {}'.format(len(a)))
    print('Number of IDC(-) Images: {}'.format(np.sum(b==0)))
    print('Number of IDC(+) Images: {}'.format(np.sum(b==1)))
    print('Percentage of positive images: {:.2f}%'.format(100*np.mean(b)))
    print('Image shape (Width, Height, Channels): {}'.format(a[0].shape))
describeData(X2,Y2)
Results of the function:
Total number of images: 90000
Number of IDC(-) Images: 66025
Number of IDC(+) Images: 23975
Percentage of positive images: 26.64%
Image shape (Width, Height, Channels): (50, 50, 3)
dict_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
print(df.head(10))
print("")
print(dict_characters)
Plotting one sample image each for IDC(-) and IDC(+):
def plotOne(a, b):
    """
    Plot one IDC(-) and one IDC(+) numpy array
    """
    plt.subplot(1, 2, 1)
    plt.title('IDC (-)')
    plt.imshow(a[0])
    plt.subplot(1, 2, 2)
    plt.title('IDC (+)')
    plt.imshow(b[0])
plotOne(imgs0, imgs1)
def plotTwo(a, b):
    """
    Plot a bunch of numpy arrays sorted by label
    """
    for row in range(3):
        plt.figure(figsize=(20, 10))
        for col in range(3):
            plt.subplot(1, 8, col + 1)
            plt.title('IDC (-)')
            plt.imshow(a[row + col])
            plt.axis('off')
            plt.subplot(1, 8, col + 4)
            plt.title('IDC (+)')
            plt.imshow(b[row + col])
            plt.axis('off')
plotTwo(imgs0, imgs1)
Plotting a histogram of RGB pixel intensities for a sample image:
def plotHistogram(a, label):
    """
    Plot an image and a histogram of its RGB pixel intensities
    """
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.imshow(a)
    plt.axis('off')
    plt.title('IDC(+)' if label else 'IDC(-)')  # use the label of the image being shown
    histo = plt.subplot(1, 2, 2)
    histo.set_ylabel('Count')
    histo.set_xlabel('Pixel Intensity')
    n_bins = 30
    plt.hist(a[:,:,0].flatten(), bins=n_bins, lw=0, color='r', alpha=0.5)
    plt.hist(a[:,:,1].flatten(), bins=n_bins, lw=0, color='g', alpha=0.5)
    plt.hist(a[:,:,2].flatten(), bins=n_bins, lw=0, color='b', alpha=0.5)
plotHistogram(X2[100], Y2[100])
Splitting the dataset into training and testing sets:
The pixel values range from 0 to 255, but we want them scaled from 0 to 1; this makes the data compatible with a wide variety of classification algorithms. We also set aside 20% of the data for testing, so we can measure how well the trained model generalizes to unseen data. And finally, we will use an oversampling strategy to deal with the imbalanced class sizes.
X = np.array(X)
X = X / 255.0  # scale pixel values to [0, 1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
Training Data Shape: (72000, 50, 50, 3)
Testing Data Shape: (18000, 50, 50, 3)
Converting Y to categorical (one-hot) labels:
Y_trainHot = to_categorical(Y_train, num_classes = 2)
Y_testHot = to_categorical(Y_test, num_classes = 2)
Plotting the label distribution:
lab = df['labels']
dist = lab.value_counts()
sns.countplot(lab)
print(dict_characters)
{0: 'IDC(-)', 1: 'IDC(+)'}
We have a class-imbalanced dataset; let's deal with it.
Dealing with class imbalance
Step 4: Define Helper Functions for the Classification Task
from sklearn.utils import class_weight
# assign to a new name so the class_weight module is not shadowed;
# newer scikit-learn versions require keyword arguments here
class_weight1 = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(Y_train), y=Y_train)
print("Old Class Weights: ", class_weight1)
class_weight2 = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(Y_trainRos), y=Y_trainRos)
print("New Class Weights: ", class_weight2)
Old Class Weights: [ 0.68126337 1.87920864]
New Class Weights: [ 1. 1.]
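Step 6 below also calls plot_learning_curve and plot_confusion_matrix, which are never defined in the article. Here are minimal sketches of what they could look like; the exact styling is an assumption.
def plot_learning_curve(history):
    """Plot training/validation accuracy and loss from a Keras History object"""
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['acc'], label='train')        # 'accuracy' in newer Keras
    plt.plot(history.history['val_acc'], label='validation')
    plt.title('Accuracy'); plt.xlabel('Epoch'); plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='validation')
    plt.title('Loss'); plt.xlabel('Epoch'); plt.legend()

def plot_confusion_matrix(cm, classes):
    """Draw a labeled confusion matrix with per-cell counts"""
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], 'd'), ha='center',
                     color='white' if cm[i, j] > cm.max() / 2 else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')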
Step 5: Train the Model
For training the model:
Batch size = 128
Epochs = 15
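The network definition itself is not included in the article. Below is a minimal Keras CNN sketch consistent with the 50x50x3 input shape and the training settings above; the layer choices are my assumption, not necessarily the author's exact architecture.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))  # two classes: IDC(-) and IDC(+)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Stop once validation accuracy stops improving
# (the metric is named 'val_accuracy' in newer Keras versions)
early_stop = EarlyStopping(monitor='val_acc', patience=3)
history = model.fit(X_trainRosReshaped, Y_trainRosHot,
                    batch_size=128, epochs=15,
                    validation_data=(X_testRosReshaped, Y_testRosHot),
                    callbacks=[early_stop], verbose=1)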
After 8 epochs val_acc did not improve, so training stopped early.
Step 6: Evaluate the Classification Model
score = model.evaluate(X_testRosReshaped,Y_testRosHot, verbose=1)
print('\nKeras CNN #1C - accuracy:', score[1],'\n')
y_pred = model.predict(X_testRosReshaped)
map_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
print('\n', sklearn.metrics.classification_report(np.where(Y_testRosHot > 0)[1], np.argmax(y_pred, axis=1),
target_names=list(map_characters.values())), sep='')
Y_pred_classes = np.argmax(y_pred,axis=1)
Y_true = np.argmax(Y_testRosHot,axis=1)
plot_learning_curve(history)
plt.show()
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
plot_confusion_matrix(confusion_mtx, classes = list(dict_characters.values()))
plt.show()
Based on the learning curve and the confusion matrix, the model does not appear to be badly overfit or biased. In the future, I will improve the score by optimizing the data augmentation step as well as the network architecture.
Related Articles:
- Sentiment analysis of Amazon product reviews
- License Plate Detection (ANPR) Part2
- All You Need To Know About ANPR Part1
If you found this article useful, please clap and follow me; it will encourage me to write more articles on tech.
That's all for this article, I hope you liked it.