A Distributed Deep Learning Approach for the Mitosis Detection from Big Medical Images

The strongest indicator of a cancer patient's prognosis is the number of mitotic bodies that a pathologist manually counts from high-resolution whole-slide histopathology images. Manual counting is slow, yet automating mitosis detection remains challenging because training datasets are limited and model training and inference are computationally intensive. This presentation introduces a large-scale deep learning approach that trains a two-stage CNN-based model with high accuracy to detect mitosis locations directly from high-resolution whole-slide images. In detail, we first train a nuclei detection model to remove the background information from the raw whole-slide histopathology images. Second, a customized ResNet-50 model is trained on the cleaned dataset produced by the first step. The first step shortens training time and improves model performance in the second step. A false-positive oversampling approach further improves model performance. With these models, inference is run in parallel to detect mitosis locations across a large volume of histopathology images. The whole pipeline, including data preprocessing, model training, hyperparameter tuning, and inference, is parallelized with distributed TensorFlow, Apache Spark, and HDFS. The experience and techniques from this project can be applied to other large-scale deep learning problems as well.


2. A Distributed Deep Learning Approach for the Mitosis Detection from Big Medical Images
Fei Hu, Center for Open-source Data and AI Technologies, IBM
#UnifiedAnalytics #SparkAISummit

3. Center for Open-Source Data & AI Technologies (CODAIT)
Mission: Make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
Relaunch of the Spark Technology Center (STC) to reflect the expanded mission.
Location:
– Physical: 505 Howard St., San Francisco CA
– Web: http://codait.org
– Twitter: @ibmcodait
30+ open source developers
[Diagram: Gather Data, Analyze Data, Deploy Model, Maintain Model; Python Data Science Stack (Jupyter, Pandas, Scikit-Learn); Machine Learning; Deep Learning; Apache Spark; Keras + TensorFlow; Mleap + PFA; Model Asset eXchange; Fabric for Deep Learning (FfDL)]

4. Agenda
• Motivation
• Related Work
• Methodologies
  – Workflow
  – Training
    • Mask R-CNN based mitosis-proposing model
    • ResNet50-based mitosis classification model
  – Inference
    • Data pipeline
    • Distributed inference with Spark
• Results
• Model consumption with MAX

5. Motivation
• The number of mitotic bodies is one of the strongest indicators of a cancer patient's prognosis.
• Challenges
  – Education: years of training are needed to build the expertise and experience to do well
  – Time consuming: one pathologist spent 30 hours on 130 slides [1]
  – Subjectivity: agreement in diagnosis for some forms of breast cancer can be as low as 48%
https://newsnetwork.mayoclinic.org/discussion/frozen-section-analysis-for-breast-cancer-patients-could-save-more-than-90-million-plus-time-anxiety/
1. https://ai.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

6. Motivation
• Where is the mitosis?
  – Which area is the background?
  – Which spots are nuclei?
  – Which nuclei are in the phases of mitosis?
• Goal: develop an algorithm to automatically detect mitoses in stained tissue images
• Challenges
  – Large background area
  – Very small number of mitoses
  – Limited training dataset

7. Related work
• Handcrafted-feature based
  – Features: size, shape, textures
  – ML methods: SVM, random forest
• CNN-feature based
  – Sliding-window based classification
  – Object detection
• Selected references
  – Cireşan, D.C., Giusti, A., Gambardella, L.M. and Schmidhuber, J., 2013. Mitosis detection in breast cancer histology images with deep neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 411-418). Springer, Berlin, Heidelberg.
  – Paeng, K., Hwang, S., Park, S. and Kim, M., 2017. A unified framework for tumor proliferation score prediction in breast histopathology. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (pp. 231-239). Springer, Cham.
  – Li, C., Wang, X., Liu, W. and Latecki, L.J., 2018. DeepMitosis: Mitosis detection via deep detection, verification and segmentation networks. Medical Image Analysis, 45, pp. 121-133.

8. Methodologies
– Workflow
– Training
  • Mask R-CNN based mitosis-proposing model
  • ResNet50-based mitosis classification model
– Inference
  • Data pipeline
  • Distributed inference with Spark

9. Workflow
1st stage: Mask R-CNN based mitosis-proposing model
Whole Slide Images (WSI) n*50,000*50,000*3 -> Tiles m*512*512*3 -> Region of Interest (ROI) proposals -> Proposed tiles p*64*64*3
2nd stage: Customized ResNet50-based mitosis classification model
Proposed tiles p*64*64*3 -> Normalize -> Augment -> Augmented tiles q*64*64*3 -> Customized ResNet50 classification -> Classified tiles -> Marginalize (optional) -> Mitosis detection (cluster/smooth ROI, probability threshold search for F1) -> Mitosis coordinates [(x1, y1), (x2, y2), …, (xn, yn)] -> WSI features -> SVM -> Tumor proliferation score
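The workflow mentions a probability threshold search for F1 but gives no details. A minimal sketch of what such a search could look like, assuming per-tile predicted probabilities and binary labels are available as NumPy arrays. The function names and the candidate grid are illustrative assumptions, not the project's code.

```python
import numpy as np

def f1_at_threshold(probs, labels, threshold):
    """F1 score when tiles with probability >= threshold are called mitotic."""
    preds = probs >= threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_f1_threshold(probs, labels, candidates=None):
    """Return the (threshold, f1) pair that maximizes F1 over a candidate grid."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # assumed grid, step 0.05
    scores = [f1_at_threshold(probs, labels, t) for t in candidates]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(candidates[best]), float(scores[best])

# Toy validation data (hypothetical values).
probs = np.array([0.93, 0.83, 0.42, 0.37, 0.12])
labels = np.array([1, 1, 0, 1, 0])
threshold, f1 = best_f1_threshold(probs, labels)
print(threshold, f1)
```

The search is run on held-out validation predictions; the chosen threshold is then fixed for inference.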

10. Model training: Mask R-CNN Mitosis-proposing Model
Data: Data Science Bowl 2018 (https://www.kaggle.com/c/data-science-bowl-2018)
• Segmented nuclei images: 30,800 training labels
• Varied in cell type, magnification, and imaging modality -> good generality
Model configuration
- Backbone: ResNet50
- Stride size: [4, 8, 16, 32, 64]
- Anchor scales: [8, 16, 32, 64, 128]
- Ratios of anchor width/height: [0.5, 1, 2]
Mask R-CNN (https://medium.com/@jonathan_hui/image-segmentation-with-mask-r-cnn-ebe6d793272)
GitHub repo: https://github.com/matterport/Mask_RCNN
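To make the anchor configuration concrete, the sketch below derives anchor widths and heights from the listed scales and ratios, and counts anchors on a 512 x 512 tile. It assumes one anchor scale per pyramid level and one set of anchors at every feature-map cell, which is a common convention but is not stated on the slide. The function names are illustrative.

```python
import numpy as np

# Values from the model configuration above.
BACKBONE_STRIDES = [4, 8, 16, 32, 64]
ANCHOR_SCALES = [8, 16, 32, 64, 128]
ANCHOR_RATIOS = [0.5, 1, 2]  # width / height

def anchor_shapes(scale, ratios):
    """Return (width, height) rows with width/height = ratio and area = scale**2."""
    ratios = np.asarray(ratios, dtype=float)
    widths = scale * np.sqrt(ratios)
    heights = scale / np.sqrt(ratios)
    return np.stack([widths, heights], axis=1)

def anchors_per_tile(tile_size, strides, n_ratios):
    """Count anchors if each level places n_ratios anchors at every feature-map cell."""
    return sum((tile_size // s) ** 2 * n_ratios for s in strides)

for stride, scale in zip(BACKBONE_STRIDES, ANCHOR_SCALES):
    print(stride, scale, anchor_shapes(scale, ANCHOR_RATIOS).round(1).tolist())
print(anchors_per_tile(512, BACKBONE_STRIDES, len(ANCHOR_RATIOS)))
```

The smallest scales (8 and 16 pixels) sit on the finest feature maps, which suits small objects such as nuclei.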

11. Evaluate the proposed tiles
• The proposed tiles cover 99.46% of the mitoses (1,550) in the TUPAC16 training dataset.
• Overlapping proposed tiles (distance < 32 pixels) are clustered.
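The slide does not say which clustering method is used. One simple way to merge proposals whose centers are closer than 32 pixels is single-linkage grouping with a union-find structure, sketched below. The function name and the choice to return cluster centroids are assumptions.

```python
import numpy as np

def cluster_centers(centers, max_dist=32.0):
    """Merge tile centers closer than max_dist (single linkage); return centroids."""
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise check; fine for the proposals of a single tile.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) < max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(centers[i])
    return [tuple(np.mean(g, axis=0)) for g in groups.values()]

# Hypothetical (x, y) centers of proposed tiles.
proposals = [(100, 100), (110, 105), (400, 400), (120, 110), (900, 50)]
print(cluster_centers(proposals))
```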

12. 1st stage: Mask R-CNN based tile proposals
Tile m*512*512*3 -> Mask R-CNN -> Region of Interest (ROI) proposals m*512*512*3 -> Proposed tile p*64*64*3

13. Approach comparison
Sliding-window based classification approach vs. CNN-based object detection approach, on the validation HPF data:
• 3,203,181 tiles (classification approach)
• 344,795 tiles (object detection approach)
Advantages of the object detection approach:
• Removes the background area
• No need to consider the tile overlap
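The gap between the two tile counts follows from how a sliding window covers an image. The sketch below counts window positions for one image and computes the reduction from the reported totals. The 64-pixel tile matches the tile size used elsewhere in the deck, while the 32-pixel stride is an assumption chosen only for illustration.

```python
def sliding_window_count(height, width, tile=64, stride=32):
    """Number of tile positions a sliding window visits on one image."""
    rows = (height - tile) // stride + 1
    cols = (width - tile) // stride + 1
    return max(rows, 0) * max(cols, 0)

# Tile counts reported for the validation HPF data.
classification_tiles = 3_203_181
detection_tiles = 344_795
ratio = classification_tiles / detection_tiles
reduction = 1 - detection_tiles / classification_tiles

print(sliding_window_count(2000, 2000))
print(f"{ratio:.2f}x fewer tiles, {reduction:.1%} reduction")
```

Because the window count grows with image area regardless of content, most sliding-window tiles contain only background, which the proposal stage avoids.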

14. Model training: Customized ResNet50 Classification Model
Data
• TUPAC 2016: http://tupac.tue-image.nl/node/3
• ICPR 2014: https://mitos-atypia-14.grand-challenge.org
• ICPR 2012: http://ludo17.free.fr/mitos_2012
Images in TUPAC 2016
• 656 images of breast tumor tissue (~600 GB)
• Different sizes:
  – 1 HPF (2000 * 2000 pixels)
  – 8 HPFs (5657 * 5657 pixels)
• 40x magnification (0.25 μm / pixel)
• TIFF format
Labels
• (x, y) coordinates of the centers of the mitoses
• CSV format
• Annotated by a consensus of two pathologists

15. Training: Data Augmentation
Labeled patches (p x 64 x 64 x 3) -> Normalize -> Augment -> Augmented patches (q x 64 x 64 x 3) -> ResNet50 -> Predictions -> Update
• Random rotation, translation, mirroring, color, contrast, ...
• Add noise to the input data
• Increase the training data size
• Improve the model generalization
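A minimal NumPy sketch of patch augmentation, covering only a subset of the listed transforms (90-degree rotation, mirroring, and contrast/brightness jitter). The jitter ranges and function names are assumptions; the project's own augmentation settings are not given on the slide.

```python
import numpy as np

def augment_patch(patch, rng):
    """Randomly rotate, mirror, and jitter one H x W x 3 patch with values in [0, 1]."""
    patch = np.rot90(patch, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]  # horizontal mirror
    contrast = rng.uniform(0.8, 1.2)     # assumed jitter range
    brightness = rng.uniform(-0.1, 0.1)  # assumed jitter range
    mean = patch.mean(axis=(0, 1), keepdims=True)
    patch = (patch - mean) * contrast + mean + brightness
    return np.clip(patch, 0.0, 1.0)

def augment_batch(patches, copies, seed=0):
    """Turn p labeled patches into q = p * copies augmented patches."""
    rng = np.random.default_rng(seed)
    return np.stack([augment_patch(p, rng) for p in patches for _ in range(copies)])

# Four random stand-in patches; real input would be normalized tissue patches.
labeled = np.random.default_rng(1).random((4, 64, 64, 3))
augmented = augment_batch(labeled, copies=3)
print(augmented.shape)
```

Rotations and mirroring are natural choices here because tissue has no canonical orientation.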

16. Training: Model
[Figure: model architectures: pre-trained VGG16 base; ResNet50 base; custom ResNet]

17. Training: Model
Loss
• Binary classification problem
• Logistic loss (“sigmoid cross-entropy”)
Optimizers
• Train the new classifier: Adam
• Fine-tune a portion of the base model: SGD w/ Nesterov Momentum
Metrics
• Loss
• F1 score
• Precision
• Recall
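For reference, the loss and metrics listed above can be written out in a few lines of NumPy. This is a standalone sketch with illustrative function names, not the training code; the loss uses the usual numerically stable form of sigmoid cross-entropy on logits.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean logistic loss in the stable form max(x, 0) - x*z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return float(np.mean(np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))))

def precision_recall_f1(preds, labels):
    """Precision, recall, and F1 for binary predictions (1 = mitosis)."""
    preds = np.asarray(preds).astype(bool)
    labels = np.asarray(labels).astype(bool)
    tp = np.sum(preds & labels)
    fp = np.sum(preds & ~labels)
    fn = np.sum(~preds & labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

# Toy logits and labels (hypothetical values).
logits = np.array([2.0, -1.0, 0.5, -3.0])
labels = np.array([1, 0, 0, 1])
print(sigmoid_cross_entropy(logits, labels))
print(precision_recall_f1(logits > 0, labels))
```

With very few mitoses per image, accuracy would be dominated by the negative class, which is why F1, precision, and recall are tracked instead.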

18. Model-bootstrapped false-positive oversampling
Labeled patches (p x 64 x 64 x 3) -> Normalize -> Augment -> Augmented patches (q x 64 x 64 x 3) -> ResNet50 -> Predictions -> Update
[Diagram: model-bootstrapped FP oversampling is added to the training loop]
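The slide gives no implementation details for the oversampling step. One plausible reading, sketched below, is to score the training patches with the current model, find the negatives it wrongly calls mitotic, and repeat them in the next training round. The threshold, the repetition factor, and the function name are all assumptions.

```python
import numpy as np

def oversample_false_positives(probs, labels, threshold=0.5, factor=3):
    """Return training indices in which each current false positive
    (predicted mitosis, labeled normal) appears `factor` times in total."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    all_idx = np.arange(len(labels))
    fp_idx = all_idx[(probs >= threshold) & (labels == 0)]
    extra = np.repeat(fp_idx, factor - 1)
    return np.concatenate([all_idx, extra])

probs = np.array([0.9, 0.7, 0.2, 0.6, 0.1])  # predictions of the current model
labels = np.array([1, 0, 0, 0, 1])
indices = oversample_false_positives(probs, labels)
print(indices.tolist())
```

The returned indices can be shuffled and used to draw the next epoch's patches, so hard negatives are seen more often without discarding any data.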

19. Post processing
Cluster/smooth the ROIs: raw predictions -> clustered/smoothed predictions

20. Data Parallelized Prediction
[Diagram: Spark executors are paired with GPUs, from Node-0 (Executor_0 + GPU_0) to Node-n (Executor_m + GPU_m). Each node hosts GPU-0 and GPU-1. Each data partition (Partition_0 ... Partition_j) goes through inference and detection to produce mitosis locations. A GPU resource manager for Spark handles the GPU assignment.]

21. Data Parallelized Prediction
[Diagram: on each executor (Executor_0 + GPU_0 ... Executor_m + GPU_m), the stages Tiles, ROIs, Augmentation, Stack, Inference, Clustering, and Detection produce the mitosis locations]
Parallelized operations:
• Data transformation
• Image augmentation
• Model training & inference
• Data smoothing
Issues:
• Small images in HDFS
• Data transfer from Spark to TensorFlow
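The per-partition pattern behind data-parallel inference can be sketched without a cluster: a function consumes an iterator of tiles, stacks them into batches, and yields predictions. The function below is written in plain Python so it runs anywhere; `predict_partition`, `dummy_predict`, and the batch size are illustrative assumptions, and the stand-in model simply averages pixel values.

```python
import numpy as np

def predict_partition(tiles, predict_fn, batch_size=32):
    """Consume an iterator of (tile_id, H x W x 3 array) pairs, run the model on
    stacked batches, and yield (tile_id, probability) pairs."""
    ids, batch = [], []
    for tile_id, tile in tiles:
        ids.append(tile_id)
        batch.append(tile)
        if len(batch) == batch_size:
            yield from zip(ids, predict_fn(np.stack(batch)).tolist())
            ids, batch = [], []
    if batch:  # flush the final, smaller batch
        yield from zip(ids, predict_fn(np.stack(batch)).tolist())

def dummy_predict(batch):
    """Stand-in for a trained classifier: mean intensity as a fake probability."""
    return batch.mean(axis=(1, 2, 3))

partition = ((i, np.full((64, 64, 3), i / 10.0)) for i in range(5))
print(list(predict_partition(partition, dummy_predict, batch_size=2)))
```

With PySpark, a function of this shape could be applied through `RDD.mapPartitions`, so the model is loaded once per partition rather than once per tile, and stacking tiles into batches reduces the number of transfers between Spark and TensorFlow.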

22. Inference Result
Metric        Classification approach   Object detection approach
F1            0.604                     0.6142
Precision     0.613                     0.6311
Sensitivity   0.595                     0.5983
Time          9 hours 21 mins           1 hour 11 mins

Advantages of the object detection approach:
- No background data
- No need to consider the overlap between the sliding windows
- No need for marginalization
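A quick arithmetic check of the table: the speedup follows from the two run times, and the reported F1 values can be compared with the harmonic mean of the reported precision and sensitivity. Small differences in the last digit are expected because the inputs are already rounded. The helper names are illustrative.

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes duration to minutes."""
    return hours * 60 + minutes

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

classification_time = to_minutes(9, 21)  # sliding-window classification approach
detection_time = to_minutes(1, 11)       # object detection approach
speedup = classification_time / detection_time

print(f"speedup: {speedup:.1f}x")
print(f"F1 (classification): {f1_score(0.613, 0.595):.4f}")
print(f"F1 (object detection): {f1_score(0.6311, 0.5983):.4f}")
```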

23. Model Asset Exchange
Model Asset eXchange (MAX)
• Free, open-source models.
• Wide variety of domains.
• Multiple deep learning frameworks.
• Vetted and tested code and IP.
• Build and deploy a model web service in 30 seconds.
• Start training on Fabric for Deep Learning (FfDL) or Watson Machine Learning in minutes.
https://developer.ibm.com/exchanges/models/

24. Model Asset Exchange
Demo: MAX Breast Cancer Mitosis Detector
Deploy from Docker Hub:
$ docker run -it -p 5000:5000 codait/max-breast-cancer-mitosis-detector
Run locally:
$ git clone https://github.com/IBM/MAX-Breast-Cancer-Mitosis-Detector.git
$ cd MAX-Breast-Cancer-Mitosis-Detector
$ docker build -t max-breast-cancer-mitosis-detector .
$ docker run -it -p 5000:5000 max-breast-cancer-mitosis-detector
GitHub repo: https://github.com/IBM/MAX-Breast-Cancer-Mitosis-Detector

25. Thank you!
• We are hiring!
Join our project!
- https://github.com/CODAIT/deep-histopath
Check out other CODAIT & IBM projects:
- https://github.com/CODAIT
- https://developer.ibm.com/code/
Get in touch! fei.hu1@ibm.com
Try on IBM Cloud! https://ibm.biz/Bd23NU

