DEEP BAYESIAN APPROACH TO EYE FUNDUS IMAGE SEGMENTATION

Azat Garifullin

Acta Universitatis Lappeenrantaensis 1003

Dissertation for the degree of Doctor of Science (Technology) to be presented with due permission for public examination and criticism in the Auditorium 1314 at Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland on the 9th of December, 2021, at noon.

Supervisor
Professor Lasse Lensu
LUT School of Engineering Science
Lappeenranta-Lahti University of Technology LUT
Finland

Reviewers
Assistant Professor Juho Kannala
Department of Computer Science
Aalto University
Finland

Professor Emanuele Trucco
School of Science and Engineering
University of Dundee
Scotland, UK

Opponents
Professor Jussi Tohka
Faculty of Health Sciences
University of Eastern Finland
Finland

Professor Emanuele Trucco
School of Science and Engineering
University of Dundee
Scotland, UK

ISBN 978-952-335-761-7
ISBN 978-952-335-762-4 (PDF)
ISSN-L 1456-4491
ISSN 1456-4491

Lappeenranta-Lahti University of Technology LUT
LUT University Press 2021

Abstract

Azat Garifullin
Deep Bayesian Approach to Eye Fundus Image Segmentation
Lappeenranta 2021
64 pages
Acta Universitatis Lappeenrantaensis 1003
Diss. Lappeenranta-Lahti University of Technology LUT
ISBN 978-952-335-761-7, ISBN 978-952-335-762-4 (PDF), ISSN-L 1456-4491, ISSN 1456-4491

Eye diseases cause different retinal abnormalities that can be detected and diagnosed by examining eye fundus images. Due to the rapidly growing amount of data, there is a need for methods that are able to produce meaningful image representations and analysis results helping medical doctors to make correct diagnoses. Recent advances in deep learning have enabled very promising approaches for solving a variety of tasks related to automatic fundus image analysis. However, there is growing concern about the reliability of these methods, and possible issues exist regarding their utilization in risk-sensitive scenarios. This study extends the current research by studying fundus image segmentation from a deep Bayesian perspective that permits model parameters and their outputs to be treated as random variables. The treatment makes it possible to estimate how uncertain the model is about its predictions. The study focuses on subproblems including the segmentation of the retinal vasculature, optic disc, macula and diabetic retinopathy lesions. Considering the probabilistic nature of the chosen methods, validation procedures need to be augmented in order to evaluate not only the segmentation results but also the estimated uncertainties. The experimental results show that the proposed Bayesian baselines for fundus image segmentation yield a performance that is comparable to the existing state-of-the-art approaches. The produced uncertainty estimates provide meaningful information about possible problems during the inference. However, the uncertainty validation results suggest that predicting misclassifications using uncertainty in a straightforward manner is limited. The results of additional experiments using weight averaging techniques and spectral image data are provided. This work also discusses the problems encountered when applying Bayesian methods to fundus image segmentation.
Keywords: Bayesian deep learning, fundus imaging, image segmentation, diabetic retinopathy, lesion segmentation, vasculature, optic disc, macula Acknowledgements This work was carried out at Computer Vision and Pattern Recognition laboratory at Lappeenranta–Lahti University of Technology LUT, Finland, between 2018 and 2021. I am grateful to LUT Doctoral School for funding the research. I would like to express my deepest gratitude to my supervisor Prof. Lasse Lensu for the guidance during this research. I would like to thank Prof. Hannu Uusitalo for valuable advices and discussion especially during the journal publication process. My sincere thanks are due to Dr. Pauli Fält, Prof. Markku Hauta-Kasari, Prof. Hannu Uusitalo et al. for the development of spectral retinal imaging and the collection of spectral retinal images for the DiaRetDB2 dataset utilized in this work. I thank my honored pre-examiners and opponents Prof. Juho Kannala, Prof. Emanuele Trucco and Prof. Jussi Tohka for their valuable participation in the dissertation process. To the friends and family. Azat Garifullin December 2021 Lappeenranta, Finland Symbols and abbreviations a a scalar a a vector A a matrix D a dataset N the number of samples in a dataset x an input image p a ground truth segmentation map θ parameters of a model pˆ estimated segmentation map sigmoid sigmoid activation function y a logit exp(a) exponential function applied to a scalar argmaxx g(x) argument x at which function g takes its maximum p(a) probability density function of a p(a | b) conditional probability density function of a given b �n i=m product over i from m to n � g(x)dx integral of function g with respect to x log x natural logarithm of x �n i=m sum over i from m to n argminx g(x) argument x at which function g takes its minimum L a loss function R a regularization term N (µ,σ2) normal distribution with mean µ and variance σ2 D the number of parameters of a model yˆ an estimated vector of logits σ estimated standard deviations of logits A⊙B element-wise (Hadamard) product of A and B a ∼ pa a random variable a distributed according to pa NA the number of aleatoric samples I identity matrix qθ approximated posterior distribution of θ ω variational parameters MD dropout mask DKL Kullback-Leibler divergence LVI variational minimization objective RVI variational regularization term NE the number of epistemic samples θSWA parameters of a model estimated using stochastic weight averaging ΣSWAG a covariance matrix of parameters of a model estimated using stochastic weight averaging Gaussian Vp a variance under distribution p Ep an expectation under distribution p UA aleatoric uncertainty UE epistemic uncertainty UT total uncertainty SE sensitivity TP true positives FN false negatives PPV positive predictive value FP false positives SP specificity TN true negatives IoU intersection over union A ∪B union of sets A and B A ∩B intersection of sets A and B F1 F1 score ECE expected calibration error AUC area under the curve AV artery-vein AVR arteriole-to-venule ratio BCE binary cross-entropy BN batch normalization CAM class activation map CDR cup-to-disc ratio DCB dense convolutional block Dense-FCN dense fully-convolutional network DiaRetDB1 DiaRetDB1 diabetic retinopathy database DiaRetDB2 DiaRetDB2 diabetic retinopathy database DRIONS-DB digital retinal images for optic nerve segmentation database DRIVE digital retinal images for vessel extraction dataset ECE expected calibration error FCN fully-convolutional network FOV field-of-view HRF high 
resolution fundus image database IDRiD Indian diabetic retinopathy image dataset IoU intersection over union MC-Dropout Monte-Carlo dropout MCMC Monte-Carlo Markov chain MESSIDOR methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology OCT optical coherence tomography PPV positive predictive value PR precision-recall characteristic ReLU rectified linear units RGB red, green, blue RIM-ONE open retinal image database for optic nerve evaluation RITE retinal images vessel tree extraction RNFL retinal nerve fiber layer ROC receiver operating characteristic SE sensitivity SGD stochastic gradient descent SP specificity STARE structured analysis of the retina SWA stochastic weight averaging SWAG stochastic weight averaging Gaussian Contents Abstract Acknowledgments Contents List of publications 1 Introduction 15 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2 The eye, fundus imaging and computer-aided diagnosis 19 2.1 Structure and diseases of the eye . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Color and spectral imaging of the eye fundus . . . . . . . . . . . . . . . . 20 2.3 Fundus image datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 DRIVE and RITE . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3.2 IDRiD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.3 DiaRetDB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Computer-aided diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3 Fundus image segmentation 33 3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1.1 Deep learning for semantic segmentation . . . . . . . . . . . . . . . 33 3.1.2 Bayesian deep learning . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.3 Neural network architectures . . . . . . . . . . . . . . . . . . . . . 36 3.1.4 Segmentation and uncertainty validation . . . . . . . . . . . . . . . 37 3.2 Retinal artery-vein segmentation . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.2 Research findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3 Diabetic retinopathy lesion segmentation . . . . . . . . . . . . . . . . . . . 42 3.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.2 Research findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Hyperspectral image segmentation . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Research findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4 Discussion 53 4.1 Current results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5 Conclusion 57 References 58 List of publications This dissertation is based on the following peer-reviewed articles. The rights have been granted by publishers to include the papers in dissertation. I. Garifullin A., Kööbi O., Ylitepsa P., Ådjers P., Hauta-Kasari M., Uusitalo. H, Lensu L. (2018). Hyperspectral image segmentation of retinal vasculature, optic disc and macula. 
Conference article, Digital Image Computing: Techniques and Applications (DICTA), pp. 1-5. II. Garifullin A., Lensu L., Uusitalo H. (2020). On the uncertainty of retinal artery-vein classification with dense fully-Convolutional neural networks. Conference article, Advanced Concepts for Intelligent Vision Systems (ACIVS), pp. 87-98. III. Lindén, M., Garifullin, A., Lensu L. (2020). Weight averaging impact on the un- certainty of retinal artery-venous segmentation. Conference article, Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis (UNSURE), pp. 52-60. IV. Garifullin A., Lensu L., Uusitalo H. (2021). Deep Bayesian baseline for segmenting diabetic retinopathy lesions: Advances and challenges. Computers in Biology and Medicine, pp. 104725, Volume 136. Author’s contribution I prepared literature review, implemented the source code, conducted experiments and prepared the first article draft in papers I, II and IV. In paper III, MSc. Markus Lindén was the first author and I provided the initial source code and supervised further research and development. 14 Chapter I Introduction 1.1 Background Fundus photography is a useful tool offering possibilities for early screening of eye diseases and abnormal medical conditions. However, the screening requires trained personnel to perform the examination which can be time consuming and expensive especially due to the amount of data growing. In this situation computer aided screening tools can help to reduce the workload of the medical staff and to increase the efficiency of health care. In the last two decades there has been a significant progress in automatic fundus image analysis supported by the development of benchmarks and state-of-the-art machine vision techniques [33, 36]. The majority of the modern approaches are based on deep neural networks and it indicates that deep neural networks are more effective than the classical methods [36]. Certain eye diseases can be diagnosed by solving a classification task where an input image is mapped to a disease label or probability of a presence of the disease [17,53,58]. Most of the works published on the problem are based on traditional deep learning approaches where uncertainties of the models and the outputs are not considered [36]. Taking uncertainties into account might be crucial for high-risk applications [32]. Leibig et al. [35] evaluated Bayesian deep learning uncertainty measures and showed improved decision making for the diagnostic performance of diabetic retinopathy. The uncertainty measures were used to decide whether a patient needs a further examination. Filos et al. [10] formalized the previous research as a benchmark for robustness of Bayesian deep learning and compared different Bayesian deep learning approaches. They showed that the new benchmark is more realistic compared to the previously used datasets as modern Bayesian deep models fail to provide reliable uncertainty estimates. The alternative approach to screening is to assign each pixel of the input image a label describing a type of object to which the pixel belongs. The problem is called semantic segmentation and it is an area of active research. The typical problems include the segmentation of landmarks and different lesions [36]. The segmentation based approaches 15 16 1. Introduction can be more interpretable as they explicitly highlight the types of objects detected. 
Another advantage is that the segmentation based methods can better handle small objects, since the fundus images typically have higher resolution and are downscaled which causes an information loss. However, the scientific community has not sufficiently addressed the problem of reliability of the fundus segmentation methods. Figure 1.1 shows an abstract scheme of such a computer-aided diagnosis system. The patient’s eye is imaged using a fundus camera. Next, the resulting fundus images are processed by a Bayesian deep neural network which produces a probability distribution over the segmentation maps given the fundus images. The inferred distribution can be analyzed by a post-processing algorithm which produces an additional description of the patient’s condition. The description can include the status or grades of certain diseases or biomarkers which can be used as indicators of different diseases. The fundus images, segmentation maps and patient’s condition are provided to a clinician who can decide diagnosis and whether the patient’s treatment plan needs to be revised. Actor C NN PP Patient F Actor Fundus Image Segmentation map Patient's condition Clinician Figure 1.1: A scheme of a computer-aided diagnosis system: FC stands for a fun- dus camera producing fundus images; NN is a segmentation neural network pro- ducing a probability distribution of segmentation maps; PP is a post-processing algorithm characterizing the segmentation maps yielding a patient’s condition. 1.2 Objectives The goal of this work is to develop an uncertainty-aware Bayesian approach to the fun- dus image segmentation including both landmarks and diabetic retinopathy lesions. The considered landmarks are retinal arteries, veins, optic disc, and macula. The diabetic retinopathy lesions considered are hard exudates, soft exudates, haemorrhages, and mi- croaneurysms. One of the major issues with Bayesian deep neural networks is validating the produced uncertainty estimates which is aimed to be solved in this work. Another unexplored area is the calibration of deep neural networks in the application to the fundus image segmentation. Thus, the objectives are development of: 1.3 Outline 17 1. a Bayesian baseline for retinal artery-vein segmentation using different uncertainty quantification methods. 2. a Bayesian baseline for diabetic retinopathy retinal lesion segmentation and a val- idation procedure for predicted uncertainties. 3. Bayesian methods for hyperspectral retinal landmarks segmentation and compara- tive analysis with color fundus image segmentation. To the best of the author’s knowledge, this work is the first study of the topics specified in the list of the objectives above. 1.3 Outline The rest of the thesis is structured as follows: Chapter 2 introduces structure of the eye and fundus imaging. An overview of the used datasets is provided. A brief literature review of machine vision methods for disease screening is given. Chapter 3 contains the theoretical background of Bayesian deep learning and semantic segmentation as well as the segmentation validation metrics. The discussion of related works as well as proposed methods is provided. Chapter 4 concludes the thesis with the discussion of the results, major issues and limi- tations together with the possible directions of future research. 18 1. 
Introduction Chapter II The eye, fundus imaging and computer-aided diagnosis 2.1 Structure and diseases of the eye The human eye is an organ of sight which typically has a spherical shape and located in an orbital cavity. The human eye has a complicated structure. The main object of interest in this work is the eye fundus which is an interior surface of the eye opposite to the lens [29]. The fundus examination can give many insights on the patient’s health. The examination is often performed using fundus photography. The fundus photography provides fundus images which contain different objects of interest such as anatomical landmarks or lesions. The normal landmarks of the fundus are as follows [29]: 1. Retinal vasculature consists of arteries and veins. The arteries transport an oxy- genated blood from the heart all over the body and the veins transports it back. 2. Optic disc is a circular disc which is formed by the nerve fibre layer. Since there are no light-sensitive cells in the disc, it is also known as the blind spot. The optic nerve is a nerve that extends from the optic disc and transfers the visual information from the retina to the brain. The white circular area in the center of the optic disc is called the optic cup. 3. Macula is in the posterior part of the retina which is a pigmented area that con- sists of densely-packed photoreceptors (cones) enabling high visual acuity and color vision. The darker region in the center of the macula is called fovea. Various diseases can affect the fundus in different ways [29] by either affecting the land- marks or causing different lesions. Hypertensive retinopathy is a vascular disease caused by high blood pressure (hyperten- sion). The risk factors for hypertension include obesity, alcohol abuse, tobacco use and stress. A patient might experience headaches, pain in the eyes or blurred vision. The 19 20 2. The eye, fundus imaging and computer-aided diagnosis disease rarely causes visual loss but can be a sign of other vascular problems. Depending on the severity of the condition, different changes to the vasculature can occur, such as narrowing of the retinal arteries and changes in arteriovenous crossings. The arteriole-to- venule ratio (AVR) is also an important biomarker characterizing the retinal vasculature. The lower values of the ratio can indicate hypertension. The fundus photography can be used to identity these changes or to infer the biomarker [29]. Glaucoma is a chronic optic neuropathy causing damage to optic disc and loss of vision. Typically, glaucoma is caused by high intraocular pressure and the basis for the disease is mostly genetic. The symptoms include photophobia, worsening vision, nausea, ocular pain and eye redness. One way to detect glaucoma is to estimate the biomarker called cup-to-disc ratio (CDR) which is a ratio of the size of the optic cup to the size of the disc. The higher cup-to-disc ratio can provide evidence of the presence of the glaucoma. The biomarker can be inferred from fundus images [29]. Age-related macular degeneration is a condition which leads to the worsening of central vision and distorted and blurred vision. Apart from the advanced age, other risk factors are smoking, obesity and hypertension. The disease is caused by degeneration of arteries causing a lack of oxygen and other nutrients. Depending on the type of the disease, different types of lesions can appear near the macular region [29]. 
Diabetic retinopathy (DR) is a complication of diabetes damaging the retina and it is one of the leading causes of blindness. The disease affects the retinal vasculature by narrowing the arteries or fusiform venous dilatation. It can also be recognized by the appearance of different DR lesions depending on the grade of the disease. Depending on the proximity of exudative lesions to the macula region, diabetic maculopathy may be present. During the late stages of the disease retinal detachment appears leading to further increasing risks of loss of vision [29]. Apart from the landmarks, the objects of interest in this work include DR lesions [29]: • Microaneurysms are one of the earliest signs of DR and resemble red small dots. Microaneurysms are caused by the damage to the retinal capillary walls. • Haemorrhages are red lesions that appear after ruptured microaneurysms. Haem- orrhages are bigger than microaneurysms and have unclear edges. • Soft exudates which are also called cotton wool spots are exudates with blurred edges and contrast. They are the result of obstructed arterioles. • Hard exudates are yellow lesions with high contrast and clear edges. They are accumulations of lipids under the retinal layer. These lipids leak from damaged blood vessels. Figure 2.1 illustrates a fundus image with annotations for the landmarks and lesions. 2.2 Color and spectral imaging of the eye fundus Eye fundus photography is a common imaging technique allowing noninvasive examina- tions of the fundus. The images of the fundus are acquired using fundus cameras which 2.2 Color and spectral imaging of the eye fundus 21 Figure 2.1: The structure of the fundus [46]. Reprinted from Medical Image Analysis, Porwal, P., Pachade, S., Kokare, M., et al., IDRiD: Diabetic retinopa- thy–segmentation and grading challenge, 101561, ©(2020) with permission from Elsevier. are based on low power microscopes. In the process of fundus photography, the light from a light source is guided by the optical system to the eye of the patient. The reflected light is then registered by an imaging detector. A complementary metal–oxide–semiconductor or charge-coupled device (CCD) can be used as a detector. Based on these general principles, different devices producing different images can be constructed: • Color photography provides images with red, green, and blue (RGB) channels. • Spectral photography provides images where each channel corresponds to a certain wavelength or a limited band of the electromagnetic spectrum. Color eye fundus photography is widely used for studying diabetic retinopathy, age- related macular degeneration and cardiovascular diseases [2]. The image in Figure 2.1 is an example of the RGB fundus image. Whereas color fundus cameras provide RGB or grayscale images, spectral fundus imaging systems result in hyperspectral images. In these images each channel corresponds to particular spectral band, i.e., each pixel in the image contains information about the reflectance spectrum of the sample. Different chemical substances in a sample have different reflectance or absorbance spectra, thus, additional features are available for a more refined quantitative analysis [61]. A hyperspectral imaging setup can be a modified fundus camera with a light source with a broadband illumination, and a spectral device for selecting a spectral band. Fält et. al. 
[12] adapted a Canon CR5-45NM fundus camera to the spectral fundus camera by replacing the standard light source with a fibre optic illuminator consisting of a halogen lamp with illumination spectrum from 380 to 780 nm and 30 interference filters with 10 22 2. The eye, fundus imaging and computer-aided diagnosis nm step are used for the wavelengths selection. As a detector, grayscale CCD camera with array size 2048 × 2048 pixels and 2 × 2 binning was used. The imaging setup is presented in Figure 2.2a. (a) (b) Figure 2.2: (a) The spectral fundus imaging setup; (b) An example of a spectral retinal image. The image was normalized for the visualization purpose [33]. 2.3 Fundus image datasets Fundus image datasets allow benchmarking and analysis of different machine vision meth- ods for fundus image analysis. Typically, these datasets contain pairs of input images and corresponding ground truth data. Particular types of the input images and ground truth data depend on the type of problems the study aims to solve. The problems include the landmark segmentation, lesion segmentation, or disease grading. The datasets used to benchmark the optic disc segmentation algorithms are: 1. Digital retinal images for optic nerve segmentation database (DRIONS-DB) [7] contains 110 color fundus images with spatial resolution 600 × 400 pixels. The ground truth is presented in the form of contours of the optic disc. The annotations were produced by two medical experts. 2. Open retinal image database for optic nerve evaluation (RIM-ONE) [11] is com- posed of 169 fundus images where 118 images are gathered from non-glaucomatous patients and the remaining patients have signs of glaucoma of different stages. The spatial resolution of the images is 2144 × 1424 pixels. The corresponding ground truth data is presented in a form of binary segmentation masks. The datasets used to benchmark the retinal vasculature segmentation algorithms are: 1. High resolution fundus image database (HRF) [6] contains 45 fundus images with the corresponding binary segmentation masks for retinal blood vessels. The spatial resolution of the images is 3504× 2336 pixels. 2.3 Fundus image datasets 23 2. Structured analysis of the retina (STARE) dataset [20] contains 40 images with the corresponding binary segmentation masks for retinal blood vessels. The spatial resolution of the images is 700× 605 pixels. 3. Retinal images vessel tree extraction (RITE) dataset [22] contains 40 images la- belled for retinal arteries, veins and vessels segmentation. The spatial resolution of the images is 768× 584 pixels. RITE dataset is an extension of Digital Retinal Images for Vessel Extraction (DRIVE) dataset. The datasets used to study methods for detecting signs of diabetic retinopathy are: 1. Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR) dataset [8] contains 1200 images with different spatial resolutions. The images are annotated with DR grades and risks of macular edema. The grades are given based on the number and presence of different DR lesions. 2. DiaRetDB1 [31] is a dataset containing 89 images with the spatial resolution of 1500 × 1152 pixels. The ground truth segmentation masks are available for DR lesions. 3. Indian diabetic retinopathy image dataset (IDRiD) [45] contains 81 image with the spatial resolution of 4288× 2848 pixels. The ground truth information is presented by the segmentation masks for DR lesions, DR grade and binary masks for the optic disc segmentation. 
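Because the lesions differ greatly in size, lesion segmentation datasets such as IDRiD are strongly imbalanced at the pixel level (Section 2.3.2 and Figure 2.5 return to this point). A quick way to quantify this is to count the positive pixels in the ground truth masks. The following is a hypothetical sketch: the directory layout and file names are assumptions for illustration and do not correspond to the actual archive structure.

```python
import numpy as np
from PIL import Image
from pathlib import Path
from typing import List

# Hypothetical paths; the actual IDRiD archive layout and file names differ.
LESION_DIRS = {
    "hard_exudates": Path("idrid/masks/hard_exudates"),
    "soft_exudates": Path("idrid/masks/soft_exudates"),
    "haemorrhages": Path("idrid/masks/haemorrhages"),
    "microaneurysms": Path("idrid/masks/microaneurysms"),
}

def positive_pixel_counts(mask_dir: Path) -> List[int]:
    """Count positive (lesion) pixels in every binary ground truth mask in a directory."""
    counts = []
    for mask_path in sorted(mask_dir.glob("*.png")):
        mask = np.array(Image.open(mask_path).convert("L")) > 0
        counts.append(int(mask.sum()))
    return counts

for lesion, directory in LESION_DIRS.items():
    counts = positive_pixel_counts(directory)
    if counts:
        print(f"{lesion}: {sum(counts)} positive pixels over {len(counts)} images")
```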
The datasets used in this work were chosen based on the kind of ground truth data presented and availability for open access. RITE dataset is an open access dataset containing the ground truth data for both arteriovenous and vessels segmentation. IDRiD dataset contains the pixel-accurate segmentation masks for DR lesions. These datasets contain RGB fundus images. The exception is DiaRetDB2 dataset which contains 55 spectral images with the segmentation masks for retinal vasculature, optic disc, and macula. The access to DiaRetDB2 dataset was provided by University of Eastern Finland and University of Tampere. 2.3.1 DRIVE and RITE Digital retinal images for vessel extraction (DRIVE) dataset [21] is a standard benchmark for the retinal vasculature segmentation. The dataset consists of 20 test and 20 train images with the corresponding ground truth segmentation masks for the blood vessels. The ground truth was collected by two experts. The spatial resolution is 768 pixels. Retinal images vessel tree extraction (RITE) dataset [22] is based on the DRIVE dataset and augments it with the ground truth data for arteries and veins. Figure 2.3 illustrates an example RITE image with the corresponding ground truth. Ground truth labels for the arteries and veins contain labels for the arteries (red), veins (blue), branches (green), and uncertain (white). 24 2. The eye, fundus imaging and computer-aided diagnosis (a) (b) (c) Figure 2.3: The RITE data set: (a) An example test image; (b) corresponding artery-vein reference standard; (c) is the ground truth mask for the blood vessels [22]. 2.3.2 IDRiD Indian diabetic retinopathy image dataset (IDRiD) [45] is a database of fundus images developed for diabetic retinopathy screening research. The dataset contains ground truth data for the optic disc and fovea centers, diabetic retinopathy grade and pixel level seg- mentation masks for hard exudates, soft exudates, haemorrhages, and microaneurysms. There are 54 images for the train set and 27 images for the test set. The resolution of the input images is 4288× 2848. An example image from the dataset is shown in Figure 2.4. Due to different sizes of the lesions the dataset is very unbalanced. Figure 2.5 shows bar graphs with the number of positive pixels for each lesions and healthy tissue (back- ground). 2.3.3 DiaRetDB2 DiaRetDB2 is a dataset of images with the spatial resolution of 1024 × 1024 and with the 30 channels where each channel corresponds to the specific wavelength. The dataset contains manual ground truth segmentation masks for the vasculature, optic disc and macula as well as field-of-view (FOV) masks which indicate informative image regions. The segmentation masks for the optic disc and macula were collected by medical experts. The blood vessel annotations were produced by the author during his Master studies. A montage of the spectral bands is shown in Figure 2.2. An example of the sample with the corresponding masks is shown in Figure 2.6. 2.4 Computer-aided diagnosis The fundus imaging setups can be complemented with computer-aided diagnosis (CAD) systems similar to the one illustrated in Figure 1.1. Such systems utilize computer vision 2.4 Computer-aided diagnosis 25 (a) (b) (c) (d) (e) Figure 2.4: (a) An example of IDRiD image with ground truth masks for (b) hard exudates, (c) soft exudates, (d) haemorrhages, and (e) microaneurysms [46]. techniques to produce descriptions of images to help clinicians to make correct diagnoses. 
Depending on the requirements of the system, the possible implementations are [36]:

1. End-to-end methods that map retinal images to a disease grade [17, 35, 48, 58]. These methods can be implemented as supervised machine learning algorithms trained on input images and the corresponding disease grades. It is also possible to visualize the image features that are relevant to the predicted grade [48, 51].

2. Biomarker-based methods that are algorithms that map retinal images to biomarkers such as the AVR [3, 44] or CDR [24]. The diagnostic decision can be inferred from the predicted biomarkers.

3. Segmentation-based approaches that transform retinal images to segmentation maps where each pixel represents semantic information about the image content. The diagnostic decision can be inferred from the segmentation maps depending on the presence of specific lesions [62] or the state of the retinal landmarks [3].

Figure 2.5: Lesion statistics in the IDRiD dataset. The number of positive pixels per image for (a) hard exudates (EX), (b) soft exudates (SE), (c) haemorrhages (HE), and (d) microaneurysms (MA). (e) The number of pixels for the lesions and the background. (f) The number of positive pixels for each lesion for the whole dataset.

Figure 2.6: (a) An RGB image from the spectral retinal image dataset. (b) FOV mask and the corresponding segmentation masks for the (c) vessels, (d) optic disc and (e) macula.

It is worth noting that the described approaches are not mutually exclusive. It is possible to build systems that utilize combinations of these approaches [3, 55, 62].

Manikis et al. [40] proposed an image processing framework for detecting early signs of hypertension. The framework includes retinal blood vessel segmentation, optic disc detection and AVR estimation. The authors achieved an accuracy of 0.937 for blood vessel segmentation on the DRIVE dataset. Agurto et al. [3] also relied on methods for retinal vasculature segmentation together with AVR estimation and additional texture feature extraction for the hypertension classification problem. The study was conducted using a private dataset and the authors achieved an accuracy of 0.8 for hypertension prediction. Triwijoyo et al. [58] trained a convolutional neural network in an end-to-end manner to classify images as hypertensive and non-hypertensive. The method achieved an accuracy of 0.98 on the DRIVE dataset.

Medeiros et al. [41] proposed a deep neural network for retinal nerve fiber layer (RNFL) thickness prediction. The study was conducted using a private dataset that consisted of retinal color images, optical coherence tomography (OCT) scans and medical history for each patient. The RNFL thicknesses were inferred from the OCT scans. The neural network was trained end-to-end to solve the regression problem.
Next the RNFL thickness was used to differentiate between glaucomatous and healthy eyes. The authors achieved an accuracy of 0.837 for the glaucoma classification problem. The authors also presented visualizations of the image areas relevant for the network to make the prediction. For this purpose, class activation maps (CAM) [51] were used and example visualizations are presented in Figure 2.7. From the visualizations it is clear that the network focuses more on the optic disc and cup, but can also capture certain areas outside of the optic disc. (a) (b) Figure 2.7: (a) An RGB image and the corresponding CAM for a healthy eye; (b) An RGB image and the corresponding CAM for a glaucomatous eye [41]. The red color denotes more relevant parts, whereas blue represents less relevant parts. Reprinted from Ophthalmology, Vol. 126, Medeiros, F. A., Jammal, A. A., and Thompson, A. C., From machine to machine: An OCT-Trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs, 513-521, ©(2019) with permission from Elsevier. Sreng et al. [55] proposed an algorithm for optic disc segmentation and glaucoma classifi- cation. A schematic illustration of the proposed system is given in Figure 2.8. The system uses a neural network for the optic disc segmentation, and another neural network for 2.4 Computer-aided diagnosis 29 the glaucoma classification. The segmentation network is trained on a database of retinal images with optic disc annotations. Further, the segmented images cropped around the optic disc are reused to train the classification network. The authors achieved an accu- racy of 0.997 for the optic disc segmentation and 0.973 for the glaucoma classification. The presented results were achieved on RIM-ONE dataset. Figure 2.8: A schematic illustration of the method proposed by Sreng et al. [55]. Numerous end-to-end methods for DR grading have been proposed [17, 48]. Typically, they are trained to solve a classification task where the input is a color retinal image and the output is a vector of probabilities for each DR grade. Pratt et al. [48] studied feature visualization of convolutional neural networks for DR grading. Figure 2.9 presents CAMs for different DR grades. The CAMs can highlight areas with DR lesions. However, they are very coarse and it is possible that they highlight irrelevant parts of the images. The evaluation metric used for DR grading is quadratic weighted Kappa on the test data for the multi-class problem (larger values mean better performance). The authors achieved a Kappa value of 0.81. Wei et al. [62] proposed a method aiming to solve both lesion segmentation and DR grad- ing problem. The proposed network is a DR classification network with a side-stream for DR lesion segmentation and classification. Figure 2.10 presents a schematic representa- tion of the proposed network. The authors achieved a state-of-the-art Kappa of 0.803 for DR grading and 0.801 for DR lesion classification. Figure 2.11 shows visualizations of the results of DR lesion segmentation and classification. From the figure it can be seen that the produced segmentation are more accurate than the CAM visualizations in Figure 2.9. Based on the above examples, fundus image segmentation plays an important role in fundus image analysis. Fundus image segmentation can be used to help to assist in diagnosing hypertension, diabetic retinopathy, and glaucoma. 30 2. 
Figure 2.9: (Left) Fundus images from the Liverpool Diabetic Eye Screening Program (LDESP). (Middle) Class activation maps (CAMs) from the trained DenseNet multi-class DR model overlaid on the original image. (Right) Saliency map from the trained DenseNet multi-class diabetic retinopathy (DR) model overlaid on the original fundus image [48].

Figure 2.10: A schematic illustration of the method proposed by Wei et al. [62] (©2020 IEEE).

Figure 2.11: Qualitative results of lesion segmentation and classification. The red font indicates false alarms. The results are given for the method proposed by Wei et al. [62] (©2020 IEEE).

Chapter III

Fundus image segmentation

3.1 Methodology

3.1.1 Deep learning for semantic segmentation

Fundus image segmentation is a subproblem of semantic segmentation where pixels are grouped based on their semantic similarity. This problem can be efficiently solved using supervised deep learning methods [36]. Let $\mathcal{D} = \{(\mathbf{x}, \mathbf{p})_i\}_{i=0}^{N-1}$ be a dataset of $N$ input-output pairs where $\mathbf{x}$ is an input image and $\mathbf{p}$ is a corresponding ground truth segmentation map. Then the training can be formulated as an inference problem of parameters $\boldsymbol{\theta}$ of a model $f$ that maps the input image to an estimate of the segmentation map $\hat{\mathbf{p}}$ [13]:

$$\hat{\mathbf{p}}_i = \mathrm{sigmoid}\left(f(\mathbf{x}_i, \boldsymbol{\theta})\right), \quad (3.1)$$

where $\mathrm{sigmoid}(y) = (1 + \exp(-y))^{-1}$ is the sigmoid activation function mapping logits $y$ to label probabilities. The most common way of estimating the parameters is finding a maximum a posteriori (MAP) estimate [13]

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\mathrm{argmax}}\; p(\boldsymbol{\theta} \mid \mathcal{D}), \quad (3.2)$$

where $\hat{\boldsymbol{\theta}}$ is a MAP estimate of the parameters and $p(\boldsymbol{\theta} \mid \mathcal{D})$ is the posterior probability distribution of the parameters defined as [13]

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D})}, \quad (3.3)$$

with the likelihood [13]

$$p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=0}^{N-1} p(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\theta}), \quad (3.4)$$

the prior over the parameters $p(\boldsymbol{\theta})$, and the evidence $p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$.

The problem (3.2) is typically reformulated as a minimization problem [13]:

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\mathrm{argmin}}\; \left[-\log p(\boldsymbol{\theta} \mid \mathcal{D})\right] = \underset{\boldsymbol{\theta}}{\mathrm{argmin}} \left[-\sum_{i=0}^{N-1} \log p(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\theta}) - \log p(\boldsymbol{\theta}) + \log p(\mathcal{D})\right] = \underset{\boldsymbol{\theta}}{\mathrm{argmin}} \left[\sum_{i=0}^{N-1} \mathcal{L}(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\theta}) + \mathcal{R}(\boldsymbol{\theta})\right], \quad (3.5)$$

where $\mathcal{L}$ is the negative log-likelihood, which is responsible for the data fit and is also known as the loss function, and $\mathcal{R}$ is the negative log-prior of the parameters, which acts as a regularization term. The log-evidence $\log p(\mathcal{D})$ is cancelled, being a constant that does not depend on the parameters. In the case of image segmentation, it is natural to formulate the loss function as the binary cross-entropy (BCE)

$$\mathcal{L}(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = -\mathbf{p}_i \log \hat{\mathbf{p}}_i - (1 - \mathbf{p}_i) \log(1 - \hat{\mathbf{p}}_i). \quad (3.6)$$

In this work the prior distribution is modelled as a fully-factorized Gaussian distribution

$$p(\boldsymbol{\theta}) = \prod_{i=0}^{D-1} \mathcal{N}(0, \sigma_\theta^2), \quad (3.7)$$

where $D$ is the number of parameters and $\sigma_\theta$ controls the regularization strength.

The optimization problem (3.5) is typically solved using gradient descent based methods. One of the basic examples of such techniques is the stochastic gradient descent (SGD) algorithm. The method differs from the standard gradient descent method in that it estimates the gradient using mini-batches of the data examples. This modification helps to save computational resources while solving the problem. Nowadays, different modifications that improve the convergence of SGD are used [50].
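To make the MAP formulation concrete, the sketch below shows how Eqs. (3.5)–(3.7) reduce to a standard training step: the BCE term is the negative log-likelihood, and the Gaussian prior appears as L2 weight decay whose coefficient relates to $1/\sigma_\theta^2$ up to a constant factor. This is a minimal, hypothetical PyTorch example rather than the exact training code of the publications; the toy model, learning rate and weight-decay value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def map_training_step(model: nn.Module,
                      optimizer: torch.optim.Optimizer,
                      images: torch.Tensor,       # (B, 3, H, W) fundus patches
                      masks: torch.Tensor) -> float:  # (B, 1, H, W) ground truth in {0, 1}
    """One SGD step of MAP estimation: BCE data term + Gaussian prior via weight decay."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)                         # f(x, theta) in Eq. (3.1), before the sigmoid
    loss = nn.functional.binary_cross_entropy_with_logits(logits, masks)  # Eq. (3.6)
    loss.backward()
    optimizer.step()                               # weight_decay realizes the prior of Eq. (3.7)
    return loss.item()

# Illustrative setup: any segmentation network with one output channel would do here.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 1))
# The weight-decay value 1e-4 is an assumption; it corresponds to a choice of sigma_theta.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

images = torch.rand(2, 3, 64, 64)
masks = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(map_training_step(model, optimizer, images, masks))
```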
3.1.2 Bayesian deep learning

The approach described above produces only point estimates of the segmentation labels and the model's parameters. In order to better capture imperfect ground truth labelling and imaging conditions, it is possible to define the model's outputs and parameters as random variables and infer distributions over them. The first approach takes into account the aleatoric heteroscedastic uncertainty, while the latter models the epistemic uncertainty [32].

Aleatoric uncertainty

The aleatoric uncertainty is a data-induced uncertainty that can be caused by imperfect imaging conditions. It can be included into the model (3.1) by predicting standard deviations of the outputs together with the outputs themselves [32]:

$$[\hat{\mathbf{y}}_i, \boldsymbol{\sigma}_i] = f(\mathbf{x}_i, \boldsymbol{\theta}), \quad (3.8)$$
$$\hat{\mathbf{p}}_i = \mathrm{sigmoid}\left(\hat{\mathbf{y}}_i + \boldsymbol{\sigma}_i \odot \boldsymbol{\epsilon}\right), \quad (3.9)$$

where $\hat{\mathbf{y}}$ is a vector of logits, $\odot$ stands for the Hadamard product, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is normally distributed noise with the identity covariance matrix $\mathbf{I}$. Taking into account the modified model (3.8), it is possible to modify the loss function to work with multiple aleatoric samples $\hat{\mathbf{p}}_{ij}$:

$$\mathcal{L}_A(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = \sum_{j=0}^{N_A - 1} \left[-\mathbf{p}_i \log \hat{\mathbf{p}}_{ij} - (1 - \mathbf{p}_i) \log(1 - \hat{\mathbf{p}}_{ij})\right], \quad (3.10)$$

where $\hat{\mathbf{p}}_{ij}$ is the $j$-th sample inferred from the input image $\mathbf{x}_i$, and $N_A$ is the number of aleatoric samples.

Epistemic uncertainty

The epistemic uncertainty captures the model's ignorance about the underlying problem. From (3.2) one can see that $\boldsymbol{\theta}$ is a random variable which can be marginalized during the inference [13]:

$$p(\mathbf{p}^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{p}^* \mid \mathbf{x}^*, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}. \quad (3.11)$$

Calculating the posterior predictive (3.11) is a difficult task, since it involves an intractable integral. Different approximating techniques are used instead [1].

Gal et al. [14] reinterpreted dropout [56] as a stochastic variational inference technique, where the complex posterior distribution (3.3) is replaced by a simpler variational approximant $q_{\boldsymbol{\theta}}(\boldsymbol{\omega})$ with parameters $\boldsymbol{\omega}$. This approach is called Monte-Carlo dropout (MC-Dropout). The relationship between the true and approximate posteriors is given by [14, 56]

$$\boldsymbol{\omega} = \boldsymbol{\theta} \odot \mathbf{M}_D, \quad (3.12)$$

where $\mathbf{M}_D$ is a random binary dropout mask. In this case, the training algorithm aims to minimize the difference between the true posterior and the approximant [14]:

$$\mathcal{L}_{VI}(\boldsymbol{\omega}) = \int q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \log p(\mathcal{D} \mid \boldsymbol{\omega})\, d\boldsymbol{\omega} - D_{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\right), \quad (3.13)$$

where $D_{KL}$ is the Kullback-Leibler divergence

$$D_{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\right) = \int q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \log \frac{q_{\boldsymbol{\theta}}(\boldsymbol{\omega})}{p(\boldsymbol{\omega})}\, d\boldsymbol{\omega}. \quad (3.14)$$

The formula (3.13) is similar to (3.5) in the sense that the second term penalizes deviation of the model from the prior, while the first term is responsible for the data fit and is typically approximated using Monte-Carlo methods:

$$\mathcal{L}_{VI}(\boldsymbol{\omega}) \approx \sum_{i=0}^{N-1} \sum_{j=0}^{N_E - 1} \frac{1}{N_E} \mathcal{L}(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\omega}_j) + \mathcal{R}_{VI}(\boldsymbol{\omega}), \quad (3.15)$$

where $N_E$ is the number of epistemic samples, and the variational regularization term is $\mathcal{R}_{VI}(\boldsymbol{\omega}) = D_{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\right)$.

Maddox et al. [39] proposed to model the posterior distribution of the parameters as a fully-factorized Gaussian distribution

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \mathcal{N}(\boldsymbol{\theta}_{SWA}, \boldsymbol{\Sigma}_{SWAG}), \quad (3.16)$$

the parameters of which are estimated during training. This technique is called stochastic weight averaging Gaussian (SWAG) and it is based on the stochastic weight averaging (SWA) proposed in [25].
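The aleatoric formulation in Eqs. (3.8)–(3.10) can be sketched as follows: the network outputs two maps per class (logits and their standard deviations), Gaussian noise is injected in logit space, and the BCE is averaged over the aleatoric samples (Eq. (3.10) uses a sum; the constant factor does not change the optimum). This is a hypothetical, minimal PyTorch illustration, not the exact code of the publications; predicting the log of the standard deviation is a numerical convenience assumed here, whereas the architecture in Figure 3.3 uses an absolute-value activation instead.

```python
import torch

def aleatoric_bce_loss(logits: torch.Tensor,      # (B, 1, H, W), logits y in Eq. (3.8)
                       log_sigma: torch.Tensor,   # (B, 1, H, W), predicted log std of logits
                       target: torch.Tensor,      # (B, 1, H, W), ground truth in {0, 1}
                       n_aleatoric: int = 10) -> torch.Tensor:
    """Average BCE over noisy logit samples, following Eqs. (3.9)-(3.10)."""
    sigma = torch.exp(log_sigma)                  # ensure positive standard deviations
    loss = 0.0
    for _ in range(n_aleatoric):
        eps = torch.randn_like(logits)            # epsilon ~ N(0, I)
        noisy_logits = logits + sigma * eps       # Eq. (3.9) before the sigmoid
        loss = loss + torch.nn.functional.binary_cross_entropy_with_logits(
            noisy_logits, target)
    return loss / n_aleatoric

# Illustrative call with random tensors standing in for real network outputs.
logits = torch.randn(2, 1, 64, 64)
log_sigma = torch.randn(2, 1, 64, 64) * 0.1
target = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(aleatoric_bce_loss(logits, log_sigma, target).item())
```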
In more traditional approaches, Monte-Carlo Markov chain (MCMC) methods are typically used. For deep models, however, it is difficult to scale them properly due to the high dimensionality of the problem and costly likelihood evaluations. Ma et al. [38] formalized stochastic gradient extensions of the classical MCMC algorithms, which can work with subsets of the datasets to utilize stochastic gradient information to explore the distributions, and can be used to quantify epistemic uncertainty [26].

The theory above describes general principles of the Bayesian deep learning approach which can be applied to a variety of different architectures that formalize the model f in (3.1).

3.1.3 Neural network architectures

Most modern architectures for deep semantic segmentation are encoder-decoder models. The encoder compresses the input images to a hidden representation. Then, this representation is reconstructed by decoders into a feature map which is further transformed to the segmentation map using a pixelwise classifier.

One of the basic examples of such architectures is SegNet [5]. The encoder is composed of blocks of convolutional layers, batch normalization (BN) and rectified linear units (ReLU) which are followed by max-pooling [16]. The decoder is a symmetric reflection of the encoder with the pooling layers replaced by upsampling layers that use pooling indices to recover feature maps. Figure 3.1 is a schematic illustration of the architecture.

Figure 3.1: SegNet architecture [5] (©2020 IEEE).

Ronneberger et al. [49] proposed the U-Net architecture which follows similar principles but also passes data from the encoder to the decoder: the high-resolution feature maps are cropped, copied and concatenated with the decoded feature maps. This mechanism helps to preserve more information about border pixels and fine details. The architecture was developed specifically for medical image segmentation problems and it is one of the most widely used architectures across a variety of different domains [67]. Figure 3.2 shows the U-Net architecture.

Figure 3.2: U-Net architecture [49]. Reprinted by permission from Springer Nature Customer Service Centre GmbH, Springer Nature, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. U-Net: Convolutional networks for biomedical image segmentation, Ronneberger, O., Fischer, P., and Brox, T., ©(2015).

In addition to reusing features from the encoder for decoding purposes, it is also possible to adapt a similar approach to the internal components of the encoders and decoders. Huang et al. [23] proposed dense convolutional networks which connect outputs of previous convolutional layers to subsequent convolutional layers, forming dense convolutional blocks (DCB). This approach helps to overcome the vanishing gradients problem and to make networks deeper without significantly increasing the number of parameters. Jégou et al. [27] adapted this approach to build dense fully-convolutional neural networks (Dense-FCN) for semantic segmentation purposes. The architecture reuses features from different resolutions as well as features from different blocks of the encoder and decoder. Figure 3.3 illustrates an example of the Dense-FCN architecture.

Figure 3.3: The Dense-FCN architecture: Dense stands for DCB; C is a tensor concatenation; H is a block consisting of batch normalization, a rectified linear unit and a convolutional layer with growth rate g; Down is a transition-down block with F output feature maps; Up is a transition-up block with F output feature maps and 2×2 stride; logits std denotes standard deviations of logits.
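As a concrete illustration of the dense connectivity used in the DCBs of the Dense-FCN, the sketch below concatenates the output of each convolutional layer with all previous feature maps before passing them to the next layer. It is a hypothetical minimal PyTorch module assuming a BN-ReLU-Conv layer order and dropout as in Figure 3.3; the growth rate, layer count and other details of the blocks used in the publications may differ.

```python
import torch
import torch.nn as nn

class DenseConvBlock(nn.Module):
    """A dense convolutional block: each layer sees the concatenation of all previous outputs."""

    def __init__(self, in_channels: int, growth_rate: int = 16, n_layers: int = 4,
                 drop_p: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            # BN -> ReLU -> 3x3 Conv (the H block in Figure 3.3), followed by dropout.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.Dropout2d(drop_p),
            ))
            channels += growth_rate   # the next layer receives all previous feature maps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # dense connectivity via concatenation
            features.append(out)
        # Output only the newly produced feature maps (n_layers * growth_rate channels).
        return torch.cat(features[1:], dim=1)

block = DenseConvBlock(in_channels=48)
print(block(torch.rand(1, 48, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```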
3.1.4 Segmentation and uncertainty validation

A trained model needs to be tested on a validation set with the goal of estimating its performance. Firstly, point estimates of the segmentation masks are obtained as average probabilities using Monte-Carlo sampling:

$$\bar{\mathbf{p}}_i = \frac{1}{S} \sum_{s=0}^{S-1} f(\mathbf{x}_i, \boldsymbol{\omega}_s), \quad \boldsymbol{\omega}_s \sim q_{\boldsymbol{\theta}}, \quad (3.17)$$

where $S = N_E \times N_A$ is the total number of samples produced by the model, and the variational parameters $\boldsymbol{\omega}_s$ are sampled from the approximate probability distribution $q_{\boldsymbol{\theta}}$. The aleatoric $U_A$ and epistemic $U_E$ uncertainties can be estimated as

$$U_A = \mathbb{E}_q\left[\mathbb{V}_{p(\mathbf{p} \mid \mathbf{x}, \boldsymbol{\omega})}[\mathbf{p}]\right], \quad (3.18)$$
$$U_E = \mathbb{V}_q\left[\mathbb{E}_{p(\mathbf{p} \mid \mathbf{x}, \boldsymbol{\omega})}[\mathbf{p}]\right], \quad (3.19)$$
$$U_T = U_A + U_E, \quad (3.20)$$

where $\mathbb{E}$ and $\mathbb{V}$ denote expectation and variance, respectively, and $U_T$ is the total predictive uncertainty.

In order to evaluate the segmentation performance, the following classification metrics are used:

• Sensitivity (SE) is used to assess the ability of the model to discover lesions:
$$SE = \frac{TP}{TP + FN}, \quad (3.21)$$
where $TP$ and $FN$ are the numbers of true positive and false negative pixels, respectively.

• Positive predictive value (PPV) is used in addition to sensitivity but takes into account the false positives $FP$:
$$PPV = \frac{TP}{TP + FP}. \quad (3.22)$$

• Specificity (SP) is used to assess the ability of the model to correctly segment healthy pixels:
$$SP = \frac{TN}{TN + FP}, \quad (3.23)$$
where $TN$ is the number of true negative pixels.

• Intersection over union (IoU):
$$IoU = \frac{|T \cap P|}{|T \cup P|}, \quad (3.24)$$
where $T$ is the set of target pixels and $P$ is the set of predicted pixels.

• F1 score:
$$F_1 = \frac{TP}{TP + 0.5(FP + FN)}. \quad (3.25)$$

• Area under the receiver operating characteristic curve (ROC-AUC): the metrics above are calculated by thresholding the label probabilities (3.17), and in this work the threshold value is 0.5. ROC-AUC is an integral metric that does not depend on the threshold value; it is calculated as the area under the curve of the true positive rate plotted against the false positive rate while the threshold is varied.

• Area under the precision-recall curve (PR-AUC) is another integral metric that does not depend on the threshold value. PR-AUC represents the segmentation performance more realistically than ROC-AUC [46].

• Expected calibration error (ECE) is used to assess a model's calibration [18]:
$$ECE = \mathbb{E}_{\hat{p}}\left[\,\left|P(\hat{l} = l \mid \hat{p} = \pi) - \pi\right|\,\right], \quad \pi \in [0, 1], \quad (3.26)$$
where $\hat{p}$ is a confidence estimate of the predicted class $\hat{l}$, $l$ is the true label and $\pi$ is the true probability. Together with ECE, reliability diagrams can be presented. These reliability diagrams are graphs showing the expected accuracy against classification confidence, thereby representing calibration quality. In the case of perfect calibration, the graph is an identity function.
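The sampling scheme in Eqs. (3.17)–(3.20) can be written as two nested Monte-Carlo loops: for each of the $N_E$ epistemic samples (e.g., forward passes with a new dropout mask or weight sample), $N_A$ aleatoric samples of the per-pixel probabilities are drawn, and the variance is decomposed across the two levels. The sketch below is a hypothetical NumPy illustration that assumes the per-sample probabilities have already been computed; it is not the exact evaluation code of the publications.

```python
import numpy as np

def decompose_uncertainty(probs: np.ndarray):
    """Decompose predictive uncertainty from sampled probabilities.

    probs has shape (N_E, N_A, H, W): for each of N_E epistemic samples (weight draws),
    N_A aleatoric samples of the per-pixel foreground probability.
    Returns the mean prediction and the aleatoric, epistemic and total uncertainty maps.
    """
    mean_prediction = probs.mean(axis=(0, 1))                 # Eq. (3.17)
    aleatoric = probs.var(axis=1).mean(axis=0)                # Eq. (3.18): E_q[ V_p[p] ]
    epistemic = probs.mean(axis=1).var(axis=0)                # Eq. (3.19): V_q[ E_p[p] ]
    total = aleatoric + epistemic                             # Eq. (3.20)
    return mean_prediction, aleatoric, epistemic, total

# Illustrative call with random probabilities standing in for network outputs.
samples = np.random.rand(20, 10, 64, 64)                      # N_E = 20, N_A = 10
mean_p, u_a, u_e, u_t = decompose_uncertainty(samples)
print(mean_p.shape, u_a.max(), u_e.max(), u_t.max())
```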
Thus, in this work the uncertainty validation procedure is formulated as a binary classification problem where the estimated uncertainties are considered as predicted clas- sification scores and the misclassifications are as ground truth labels. The uncertainty validation metrics used in this work are similar to those used to validate the segmentation results but with prefix U. 3.2 Retinal artery-vein segmentation 3.2.1 Background The problem of the retinal artery-vein (AV) segmentation considered in this work is the simultaneous segmentation of the vasculature and its classification into arteries and veins. The problem can be solved by just applying regular frameworks for semantic segmentation but the major issue of segmenting thin vessels remains. In order to overcome the problem Girard et al. [15] proposed a post-processing technique which builds a vasculature tree and uses a likelihood propagation score to update the segmentation maps based on the connectivity patterns. Badawi et al. [4] aimed to solve the same problem by augmenting the BCE loss with an additional segment-level loss which is defined through the mismatch between the segments extracted from the vasculature tree. Zhang et al. [65] achieved better performance by training a refined U-Net that minimized a multi-scale loss. The utilized multi-scale loss was inspired by [37] and it sums loss values from different stages of decoding and the downscaled ground truth segmentation. Zhang et al. also proposed to use a cascade network which predicts the probabilities of the vessels labels and then sequentially passes the results to subnetworks for arteries and veins, Figure 3.4 illustrates this principle. More detailed review of artery-vein segmentation approaches is given in [43]. 3.2.2 Research findings This section presents the results from Publication II and Publication III. In this work the segmentation for the arteries and veins first produced p = [partery pvein] and then the probabilities for the blood vessels are inferred: pvessel = partery + pvein − parterypvein. (3.27) The minimized loss function is a sum of three terms for each label: θˆ = argmin θ [Lartery (θ) + Lvein (θ) + Lvessel (θ)] , (3.28) where L denotes the BCE loss for each corresponding label. The epistemic uncertainty was estimated using three different methods: 1. MC-Dropout is a baseline method; 3.2 Retinal artery-vein segmentation 41 Figure 3.4: Cascade network for the artery-veins segmentation [65] (©2020 IEEE). 2. SWA-MC-Dropout is a method which used MC-Dropout but SWA was applied as a part of the training; 3. SWAG is a method which estimates an approximation of the normal distribution during SWA training stage. The performance metrics for all three labels are given in Tables 3.1 – 3.31. Figure 3.5 shows an example of the resulting AV segmentation. Table 3.1: Network performance in artery classification (the best accuracy and calibration are presented in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.970 0.642 0.990 0.0988 0.974 SWA 0.975 0.690 0.992 0.0943 0.981 SWAG 0.973 0.706 0.989 0.0871 0.966 The examples of the estimated aleatoric and epistemic uncertainties are shown in Figure 3.6. From the images one can notice changes in aleatoric uncertainties when the weight averaging is applied. In the baseline case the aleatoric uncertainty is mostly higher near the optic disc and edges of the vessels. If the weight averaging is applied the pattern is similar but the uncertainties near optic disc are lower. 
It is also clear that just sampling around the found optimum using SWAG yields lower epistemic uncertainty than sampling 1Due to an error in the code calculating the average of the calibration errors the ECE values in Publication III are wrong. Here the corrected values are given. 42 3. Fundus image segmentation Table 3.2: Network performance in vein classification (the best accuracy and calibration are in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.971 0.655 0.994 0.169 0.980 SWA 0.974 0.742 0.991 0.120 0.991 SWAG 0.971 0.804 0.983 0.107 0.980 Table 3.3: Network performance in vessel classification (the best accuracy and calibration are in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.957 0.723 0.989 0.221 0.980 SWA 0.961 0.782 0.986 0.208 0.983 SWAG 0.961 0.836 0.978 0.338 0.984 using MC-Dropout. Table 3.4 presents the total estimated uncertainties which gives the quantitative support to the claims above. Table 3.4: Mean sums of estimated aleatoric and epistemic uncertainties per image. Method Aleatoric Epistemic Arteries Veins Vessels Arteries Veins Vessels Baseline 1276.2 1159.5 1807.5 4853.6 4066.4 5069.7 SWA 3.3 3.5 5.3 4038.6 3882.3 4659.7 SWAG 31.1 38.9 57.3 997.8 1104.3 1396.1 The proposed methods yield performance comparable to the state of the art methods without any additional preprocessing or multi-scale loss functions. Table 3.5 shows a comparison of the performance of recent works and proposed methods. 3.3 Diabetic retinopathy lesion segmentation 3.3.1 Background The IDRiD challenge [45, 46] is the common benchmark for diabetic retinopathy lesion segmentation algorithms. The best performing algorithms in the challenge are presented by deep learning based techniques. The authors experimented with different architectures and custom loss functions such as combinations of BCE and dice loss or balanced BCE [46]. The dataset is highly imbalanced and the custom loss functions were employed to overcome these problems. The input images are very high-dimensional 4288× 2848 and all the reported methods were trained on cropped patches. Yan et al. [63] proposed an architecture which aims to solve the problem of high dimen- sionality of the inputs and a lack of the global context when training only using the 3.3 Diabetic retinopathy lesion segmentation 43 (a) (b) (c) (d) Figure 3.5: (a) The input image; (b) ground truth; (c) mean predicted AV probabilities; (d) mean predicted vessel probabilities. The results are obtained using stochastic weight averaging. cropped images. The architecture consists of two U-Nets. The first network is the Glob- alNet which processes a downscaled input image and produces a coarse segmentation map. The second network is the LocalNet which processes cropped patches and concate- nates corresponding cropped features with the features from GlobalNet and produces the resulting segmentation map. The network is trained end to end using a combination of local and global supervision. Figure 3.7 illustrates the architecture proposed by Yan et al. The reported PR-AUC for hard exudates is 0.889, for soft exudates 0.697, for haemorrhages 0.703, and for microaneurysms 0.525. 44 3. Fundus image segmentation (a) (b) (c) (d) (e) (f) Figure 3.6: Aleatoric uncertainties calculated using (a) the baseline, (b) stochas- tic weight averaging, and (c) stochastic weight averaging Gaussian. Epistemic un- certainties calculated using (d) the baseline, (e) stochastic weight averaging, and (f) stochastic weight averaging Gaussian. 
The pseudo-colors represent different labels: the red channel shows the artery segmentation uncertainty, and the blue channel shows the vein segmentation uncertainty.

Figure 3.7: Local-Global U-Nets for DR lesion segmentation [63] (©2020 IEEE).

Table 3.5: Comparison of evaluation results (accuracies for each label). The datasets are specified with the splitting methods used by the authors.
Method                 Vessels  Arteries  Veins  Dataset
Girard et al. [15]     0.948    N/A       N/A    CT-DRIVE
Badawi et al. [4]      0.960    N/A       N/A    DRIVE (standard)
Hemelings et al. [19]  N/A      0.948     0.930  DRIVE (standard)
Zhang et al. [65]      N/A      0.977     0.975  DRIVE (5-fold CV)
Baseline               0.957    0.970     0.971  DRIVE (standard)
SWA                    0.961    0.975     0.974  DRIVE (standard)
SWAG                   0.961    0.973     0.971  DRIVE (standard)

3.3.2 Research findings

This section presents the results from Publication IV. In this work, the basic Bayesian deep learning approach with Dense-FCN was used. The major challenge was to overcome the class imbalance problem. The most effective approach in this work turned out to be oversampling. For each input batch, positive and negative samples are selected with a probability of 0.5. The probability of selecting a certain image is the logarithm of the number of positive pixels in the image, normalized by the total number of positive pixels in the dataset. The probability of sampling a certain patch is the logarithm of the number of positive pixels in the patch, normalized by the total number of positive pixels in the image. The network is trained on 224 × 224 patches and processes downsampled 2144 × 1440 images as a whole. It was empirically found that a simple preprocessing based on gamma correction and contrast limited adaptive histogram equalization [66] improves the segmentation performance.

Table 3.6 shows the performance metrics for DR lesion segmentation using the proposed Bayesian method. From the table, one can see that the trained segmentation models are highly specific and that the main issue is sensitivity. The best performance is achieved for hard exudate segmentation, since hard exudates have clear edges and are relatively large. The soft exudates and haemorrhages typically have lower contrast and more blurred edges than hard exudates. The microaneurysms are the most difficult to segment, as they are the smallest lesions.

Table 3.6: Evaluation results of the baseline training scheme.
Label           PR-AUC  ROC-AUC  Sensitivity  PPV    Specificity  ECE
Hard exudates   0.842   0.995    0.767        0.753  0.997        0.090
Soft exudates   0.641   0.993    0.639        0.611  0.999        0.145
Haemorrhages    0.593   0.977    0.464        0.670  0.997        0.066
Microaneurysms  0.484   0.997    0.434        0.531  0.999        0.116

Examples of the resulting segmentations together with visualizations of the misclassifications and uncertainties are given in Figures 3.8–3.11. From the images, it is clear that there are certain similarities between the misclassifications and the epistemic uncertainty visualizations. From Figure 3.9, it is clear that soft exudates can easily be confused with any yellow spots in the image, and from Figure 3.10 it can be noted that haemorrhages can be confused with the vasculature, since the model trained for DR lesion segmentation did not learn anything about the retinal vasculature.
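The uncertainty validation procedure described earlier in this chapter treats misclassified pixels as positive ground-truth labels and the per-pixel uncertainties as classification scores. A minimal sketch of the threshold-free metrics of the kind reported in Table 3.7 is given below; the function name and the use of scikit-learn are assumptions for illustration, average precision is used as the PR-AUC approximation, and the thresholded metrics (U-SE, U-PPV, U-SP) would additionally require binarizing the uncertainty map.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def uncertainty_validation_scores(uncertainty_map, predicted_mask, true_mask):
    """Threshold-free uncertainty validation metrics: misclassified pixels act
    as positive ground-truth labels and per-pixel uncertainties as scores."""
    misclassified = np.asarray(predicted_mask) != np.asarray(true_mask)
    labels = misclassified.ravel().astype(int)
    scores = np.asarray(uncertainty_map).ravel()
    return {
        "U-ROC-AUC": roc_auc_score(labels, scores),
        # Average precision serves here as the PR-AUC approximation.
        "U-PR-AUC": average_precision_score(labels, scores),
    }
```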
Figure 3.8: Inference results for hard exudates: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Figure 3.9: Inference results for soft exudates: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Figure 3.10: Inference results for haemorrhages: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Figure 3.11: Inference results for microaneurysms: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Table 3.7 shows the evaluation results for the estimated uncertainties. The uncertainties are very specific, but the sensitivity is very low, indicating that the uncertainty has been underestimated. Laves et al. [34] reported similar problems in the context of deep Bayesian regression. In this work, different attempts to calibrate the uncertainty estimates were made, but significantly better results were not achieved.

Table 3.7: Evaluation results for the estimated uncertainty maps.
Label           U-PR-AUC  U-SE   U-PPV  U-SP   U-ECE
Hard exudates   0.336     0.031  0.566  0.999  0.104
Soft exudates   0.257     0.113  0.388  0.999  0.195
Haemorrhages    0.243     0.029  0.302  0.999  0.303
Microaneurysms  0.257     0.045  0.332  0.999  0.237

3.4 Hyperspectral image segmentation

3.4.1 Background

The architectures described in Section 3.1.3 can be adapted to process HSI by reducing the dimensionality of the spectral features to dimensions suitable for the pretrained networks. The pretrained networks were trained using RGB images with 3 channels, whereas the HSI images contain 30 channels. The introduced dimensionality reduction approach makes it possible to employ transfer learning techniques to train the segmentation models. Transfer learning can improve model convergence by utilizing neural networks that have already been pretrained on large datasets [16]. Jiao et al. [28] proposed an architecture fusing features from a VGG16 encoder [54] and principal component analysis, and Yu et al. [64] showed that dimensionality reduction (DR) blocks can be trained end-to-end in combination with convolutional neural networks.

3.4.2 Research findings

This section presents the results from Publication I. In this work, an approach similar to [64] was used: a series of 1 × 1 convolutional layers, each followed by a ReLU activation, reduces the dimensionality of the spectral features. The architectures utilized in the experiments are based on SegNet and Dense-FCN. The SegNet-based architectures (DR-SegNet) use an encoder pretrained on ImageNet [9], whereas the Dense-FCN-based architectures (DR-DenseFCN) were not pretrained, which allows the output dimensionality of the dimensionality reduction blocks to be tuned.
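As an illustration of such dimensionality reduction, the following PyTorch sketch stacks 1 × 1 convolutions with ReLU activations to compress a 30-channel spectral cube into a 3-channel representation that an RGB-pretrained encoder can consume. The class name and the default channel sequence (chosen to mirror the DR-SegNet diagram in Publication I) are illustrative assumptions rather than the exact published implementation.

```python
import torch
import torch.nn as nn

class DimensionalityReduction(nn.Module):
    """Stack of 1x1 convolutions with ReLU that compresses the spectral
    channels of an HSI cube; with the default channel sequence the output
    has three channels and can be fed to an RGB-pretrained encoder."""

    def __init__(self, channels=(30, 30, 30, 16, 3)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=1))
            layers.append(nn.ReLU(inplace=True))
        self.reduce = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 30, H, W) hyperspectral input -> (batch, 3, H, W)
        return self.reduce(x)

# Example: reduce a random 30-channel image to a pseudo-RGB representation.
hsi = torch.randn(1, 30, 256, 256)
pseudo_rgb = DimensionalityReduction()(hsi)  # shape (1, 3, 256, 256)
```

Because the convolutions are 1 × 1, the block adds only a small number of parameters and operates purely along the spectral dimension.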
In the case of Dense-FCN, the dimensionality reduction blocks are built using the principles of the dense convolutional block. Additional experiments were performed without the dimensionality reduction blocks to assess the need for dimensionality reduction layers before the base architecture. MC-Dropout was used to estimate the epistemic uncertainty using $N_E = 100$ forward passes.

The architectures were trained in two stages: patchwise pretraining and full-resolution fine-tuning. During the pretraining stage, batches of three samples are drawn so that each sample corresponds to a particular class. The mining of positive samples was also employed to handle the class imbalance related to the optic disc and macula segmentation tasks: the patches for blood vessels were selected uniformly, whereas the patches for the optic disc and macula were selected so that they were centered around the mean of the true label coordinates.

Table 3.8: Evaluation results. The best F1 scores are in bold.
Architecture      Vessels           Optic Disc        Macula
                  F1       IoU      F1       IoU      F1       IoU
DR-SegNet         0.8091   0.6802   0.8947   0.8356   0.6566   0.5291
RGB-SegNet        0.7925   0.6571   0.8802   0.8149   0.6033   0.4657
DR-2-Dense-FCN    0.8243   0.7019   0.7311   0.6257   0.3084   0.2414
HSI-Dense-FCN     0.7974   0.6647   0.7323   0.6202   0.2932   0.2282
RGB-Dense-FCN     0.8112   0.6840   0.7154   0.6153   0.1543   0.1070

Table 3.8 contains an evaluation of the performance of the proposed architectures. From the table, it is clear that blood vessel segmentation is the easiest task and that segmenting the macula accurately is more difficult, which is due to the lack of clear edges of the object. The DR-SegNet architecture achieves the best performance for the optic disc and macula, whereas DR-DenseFCN with the spectral features reduced to two-dimensional vectors achieves the best results for the vasculature segmentation task. Additionally, the evaluation results for the architecture not pretrained on any data and without dimensionality reduction blocks (HSI-Dense-FCN) are given. The results suggest that the dimensionality reduction blocks help to utilize the spectral information more efficiently, leading to better segmentation performance.

Visualizations of the results are given in Figures 3.12–3.14. From the figures, it is clear that the additional spectral information can help to localize the macula more accurately. It is also worth noting that the model trained using RGB images is more uncertain in the areas where the labels overlap, and that the macula segmentation results contain artifacts where the macula and vasculature overlap. DR-SegNet shows fewer segmentation artifacts.

Figure 3.12: Top row: example segmentation results using DR-SegNet. Bottom row: standard deviations of the activations.

Figure 3.13: Top row: example segmentation results using RGB-SegNet. Bottom row: standard deviations of the activations.

Figure 3.14: Top row: example segmentation results using DR-6-Dense-FCN. Bottom row: the epistemic uncertainties presented as standard deviations of the activations. Each column shows the visualization for one task (from left to right): blood vessels, optic disc, macula.

Chapter IV

Discussion

4.1 Current results

This study focuses on the application of deep Bayesian approaches to the problem of fundus image segmentation.
The fundus segmentation problem was studied from different perspectives, including landmark segmentation as well as DR lesion segmentation. In addition to the different objects of interest, two imaging modalities were considered: RGB and spectral images.

The proposed Bayesian baseline for artery-vein segmentation produced performance comparable to the previously published state-of-the-art results. The presented method uses a multi-label segmentation model and two different uncertainty quantification methods: MC-Dropout and SWAG. In this work, it is shown that the utilization of weight averaging techniques can moderately improve the segmentation performance as well as the calibration of the model. It is shown that the highest uncertainties are concentrated near the optic disc, artery-vein crossings, and thin vessels. The high uncertainties near the optic disc and crossings can be explained by overlapping labels. The uncertain thin vessels are far away from the optic disc, and it is difficult to classify them into arteries and veins. It is shown that the weight averaging techniques can lead to vanishing aleatoric uncertainties. The utilization of aleatoric uncertainty inference can also be treated as a learned loss attenuation [32], and it seems that this attenuation plays a lesser role when weight averaging is applied. The method proposed by Zhang et al. [65] still outperforms the methods proposed in this work. Nevertheless, the Bayesian methods proposed in this work keep the predictions more consistent, since the probabilities for the blood vessels are inferred from the predicted probabilities for the arteries and veins. Thus, the objective of developing a Bayesian baseline for retinal artery-vein segmentation using MC-Dropout and SWAG for uncertainty quantification is achieved.

This work extended the recent research on DR lesion segmentation to the uncertainty quantification of deep Bayesian models. An uncertainty validation procedure based on the idea of predicting misclassifications from the uncertainty estimates was introduced. The uncertainties estimated using Monte-Carlo dropout can clearly highlight problematic image areas for DR lesion segmentation. Nevertheless, straightforward utilization of the produced uncertainties to detect misclassifications has not provided satisfactory results. Apart from the previously mentioned problems, the proposed method suffers from low sensitivity and significant miscalibration. Low sensitivity is a common problem also reported in recent studies. Thus, the objective of developing a Bayesian baseline for diabetic retinopathy lesion segmentation is achieved, and a validation procedure is proposed.

Spectral imaging makes it possible to capture additional information about the eye, which can potentially improve the segmentation performance. In this work, it is shown that the additional spectral information helps to moderately improve the retinal vasculature, optic disc and macula segmentation performance. The proposed method extends standard architectures with dimensionality reduction layers and uses MC-Dropout for epistemic uncertainty quantification. The utilization of dimensionality reduction layers helps to adapt standard architectures to retinal HSI segmentation without significantly increasing the number of model parameters. The highest uncertainties highlighted unclear label edges and areas of overlapping objects.
Thus, the objective of developing deep Bayesian models for hyperspectral fundus landmark segmentation is achieved, and a comparison with RGB image segmentation is provided.

4.2 Future work

Low sensitivity is a common problem in many subproblems of retinal image segmentation, including those considered in this work. The issue is primarily due to the small size of the objects of interest, such as microaneurysms or thin arteries and veins. Other objects, such as the macula or soft exudates, do not have clear edges. Another factor affecting the performance is related to unbalanced datasets where the objects of interest are underrepresented in comparison to the background classes.

In medical image analysis and segmentation, the use of custom heuristic loss functions has become commonplace [30]. Nevertheless, they typically have hyperparameters that are difficult to tune. In this work, results outperforming the proposed baselines trained with cross-entropy were not achieved and, thus, are not published. Further research focused on evaluating these loss functions is needed before making any conclusions on their effectiveness. It would also be interesting to study what effects the loss functions have on the produced uncertainty estimates and model calibration.

In this work, the methods addressing the class imbalance problem are based on simple heuristics involving sampling the image patches near the objects of interest. An alternative approach is to use adaptive active learning methods which can utilize the uncertainty estimates to propose interesting patches. Future work can focus on adapting those methods to address the class imbalance problem.

Another issue shown in this study is model miscalibration. There are a number of methods that address this problem. The traditional approaches are based on post-processing of training results to improve the calibration [18]. Thulasidasan et al. [57] showed that the calibration can be improved by applying the mix-up augmentation, which blends the input samples. Seo et al. [52] modified the training procedure by adding an additional term to the loss function which penalizes the model for overconfident or underconfident predictions. Seo et al. [52] also followed the Bayesian approach and used the produced uncertainty estimates to calculate the penalizing miscalibration term. Future research can focus on exploring existing or new calibration methods to improve fundus image segmentation methods.

The Monte-Carlo based methods utilized in this work are computationally costly and require a number of forward passes to estimate the uncertainty of the outputs. Currently, methods for calculating the uncertainty using a single forward pass are being explored [60]. These methods could reduce the computation time of the inference. However, applying them to semantic segmentation in a straightforward manner can lead to a number of issues resulting in poor uncertainty estimates [47, 59]. Thus, more research is needed to make the uncertainty estimation methods more practical for fundus image segmentation problems.

Chapter V

Conclusion

In this work, Bayesian multilabel baselines for artery-vein segmentation are proposed. The impacts of weight averaging techniques on the segmentation performance, model calibration and produced uncertainties were assessed. The epistemic uncertainties were estimated using Monte-Carlo dropout and SWAG.
It is shown that the weight averaging techniques can improve the overall performance, but they also lead to vanishing aleatoric uncertainties. SWAG also yields significantly lower epistemic uncertainty, which can potentially lead to problems in future risk analysis.

A deep Bayesian approach for diabetic retinopathy lesion segmentation is proposed. An analysis of the epistemic and aleatoric uncertainties is provided, and the calibration of the model is assessed. An extended approach to lesion segmentation is presented which takes into account both the segmentation performance and the quality of the uncertainty estimates. The proposed validation procedure suggests that the uncertainty estimates produced by Monte-Carlo dropout cannot be straightforwardly used to reliably estimate the misclassifications.

The spectral fundus image segmentation of the retinal vasculature, optic disc and macula was studied. The problem is formulated as a multi-label segmentation problem. The formulation makes it possible to form continuous segmentation maps for the retinal vasculature, optic disc and macula even in the case of overlapping labels. Neural network architectures for the segmentation of the landmarks are proposed. The proposed architectures are straightforward modifications of existing semantic segmentation architectures. The modifications take the form of dimensionality reduction layers trained end-to-end together with the rest of the architecture. Different configurations of the dimensionality reduction layers are analyzed and compared. It was shown that spectral information may give additional benefits in landmark segmentation, moderately improving the segmentation performance.

References

[1] Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion (2021).

[2] Abràmoff, M. D., Garvin, M. K., and Sonka, M. Retinal imaging and image analysis. IEEE Reviews in Biomedical Engineering 3 (2010), 169–208.

[3] Agurto, C., Joshi, V., Nemeth, S., Soliz, P., and Barriga, S. Detection of hypertensive retinopathy using vessel measurements and textural features. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2014), IEEE, pp. 5406–5409.

[4] Badawi, S., and Fraz, M. Multiloss function based deep convolutional neural network for segmentation of retinal vasculature into arterioles and venules. BioMed Research International 2019 (04 2019), 1–17.

[5] Badrinarayanan, V., Kendall, A., and Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 12 (2017), 2481–2495.

[6] Budai, A., Odstrcilík, J., Kolár, R., Hornegger, J., Jan, J., Kubena, T., and Michelson, G. A public database for the evaluation of fundus image segmentation algorithms. Investigative Ophthalmology & Visual Science 52 (2011), 1345–1345.

[7] Carmona, E. J., Rincón, M., García-Feijóo, J., and de-la Casa, J. M. Identification of the optic nerve head with genetic algorithms. Artificial Intelligence in Medicine 43, 3 (2008), 243–259.

[8] Decencière, E., Zhang, X., Cazuguel, G., Lay, B., Cochener, B., Trone, C., Gain, P., Ordóñez-Varela, J.-R., Massin, P., Erginay, A., Charton, B., and Klein, J.-C. Feedback on a publicly distributed image database: The MESSIDOR database.
Image Analysis & Stereology 33 (2014), 231–234. [9] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), IEEE, pp. 248–255. 59 60 REFERENCES [10] Filos, A., Farquhar, S., Gomez, A. N., Rudner, T. G., Kenton, Z., Smith, L., Alizadeh, M., de Kroon, A., and Gal, Y. A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks. arXiv preprint arXiv:1912.10481 (2019). [11] Fumero, F., Alayon, S., Sanchez, J. L., Sigut, J., and Gonzalez- Hernandez, M. RIM-ONE: An open retinal image database for optic nerve evalu- ation. In 2011 24th International Symposium on Computer-Based Medical Systems (CBMS) (2011), pp. 1–6. [12] Fält, P., Hiltunen, J., Hauta-Kasari, M., Sorri, I., Kalesnykiene, V., Pietilä, J., and Uusitalo, H. Spectral imaging of the human retina and compu- tationally determined optimal illuminants for diabetic retinopathy lesion detection. Journal of Imaging Science and Technology 55, 3 (2011), 253–263. [13] Gal, Y. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016. [14] Gal, Y., and Ghahramani, Z. Dropout as a Bayesian approximation: Repre- senting model uncertainty in deep learning. In International Conference on Machine Learning (2016), pp. 1050–1059. [15] Girard, F., Kavalec, C., and Cheriet, F. Joint segmentation and classification of retinal arteries/veins from fundus images. Artificial Intelligence in Medicine 94 (2019), 96 – 109. [16] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016. http://www.deeplearningbook.org. [17] Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P. C., Mega, J. L., and Web- ster, D. R. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 22 (12 2016), 2402–2410. [18] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning (2017), PMLR, pp. 1321–1330. [19] Hemelings, R., Elen, B., Stalmans, I., Van Keer, K., De Boever, P., and Blaschko, M. B. Artery-vein segmentation in fundus images using a fully convolutional network. Computerized Medical Imaging and Graphics (2019). [20] Hoover, A., Kouznetsova, V., and Goldbaum, M. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging 19, 3 (2000), 203–210. [21] Hoover, A., Kouznetsova, V., and Goldbaum, M. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical imaging 19, 3 (2000), 203–210. REFERENCES 61 [22] Hu, Q., Abràmoff, M. D., and Garvin, M. K. Automated separation of binary overlapping trees in low-contrast color retinal images. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 (Berlin, Heidelberg, 2013), K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds., Springer Berlin Heidelberg, pp. 436–443. [23] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 2261–2269. [24] Issac, A., Sarathi, M. P., and Dutta, M. K. 
An adaptive threshold based im- age processing technique for improved glaucoma detection and classification. Com- puter Methods and Programs in Biomedicine 122, 2 (2015), 229–244. [25] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In The Conference on Uncertainty in Artificial Intelligence (2018). [26] Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. What are Bayesian neural network posteriors really like? arXiv preprint arXiv:2104.14421 (2021). [27] Jegou, S., Drozdzal, M., Vázquez, D., Romero, A., and Bengio, Y. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. pp. 1175–1183. [28] Jiao, L., Liang, M., Chen, H., Yang, S., Liu, H., and Cao, X. Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 55, 10 (Oct 2017), 5585–5599. [29] Jogi. Basic ophthalmology. Jaypee Brothers Medical Publishers, New Delhi, India, 2008. [30] Jun, M. Segmentation loss odyssey. arXiv preprint arXiv:2005.13449 (2020). [31] Kauppi, T., Kalesnykiene, V., Kämäräinen, J., Lensu, L., Sorri, I., Ra- ninen, A., Voutilainen, R., Uusitalo, H., Kälviäinen, H., and Pietilä, J. The DIARETDB1 diabetic retinopathy database and evaluation protocol. In BMVC (2007). [32] Kendall, A., and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (2017), pp. 5574–5584. [33] Laaksonen, L. Spectral retinal image processing and analysis for ophthalmology. PhD thesis, Lappeenranta University of Technology, 2016. [34] Laves, M.-H., Ihler, S., Fast, J. F., Kahrs, L. A., and Ortmaier, T. Well- calibrated regression uncertainty in medical imaging with deep learning. In Medical Imaging with Deep Learning (2020), PMLR, pp. 393–412. 62 REFERENCES [35] Leibig, C., Allken, V., Ayhan, M. S., Berens, P., and Wahl, S. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7 (12 2017). [36] Li, T., Bo, W., Hu, C., Kang, H., Liu, H., Wang, K., and Fu, H. Applications of deep learning in fundus images: A review. Medical Image Analysis 69 (2021), 101971. [37] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2980–2988. [38] Ma, Y.-A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient mcmc. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2015), NIPS’15, MIT Press, p. 2917–2925. [39] Maddox, W., Garipov, T., Izmailov, P., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In NeurIPS (2019). [40] Manikis, G. C., Sakkalis, V., Zabulis, X., Karamaounas, P., Triantafyl- lou, A., Douma, S., Zamboulis, C., and Marias, K. An image analysis frame- work for the early assessment of hypertensive retinopathy signs. In 2011 E-Health and Bioengineering Conference (EHB) (2011), pp. 1–6. [41] Medeiros, F. A., Jammal, A. A., and Thompson, A. C. From machine to machine: An OCT-Trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology 126, 4 (2019), 513– 521. [42] Mobiny, A., Nguyen, H., Moulik, S., Garg, N., and Wu, C. 
DropConnect is effective in modeling uncertainty of Bayesian deep networks. ArXiv abs/1906.04569 (2019). [43] Mookiah, M. R. K., Hogg, S., MacGillivray, T. J., Prathiba, V., Pradeepa, R., Mohan, V., Anjana, R. M., Doney, A. S., Palmer, C. N., and Trucco, E. A review of machine learning methods for retinal blood ves- sel segmentation and artery/vein classification. Medical Image Analysis 68 (2021), 101905. [44] Noronha, K., Navya, K., and Nayak, K. P. Support system for the auto- mated detection of hypertensive retinopathy using fundus images. In International Conference on Electronic Design and Signal Processing (ICEDSP) (2012), pp. 7–11. [45] Porwal, P., Pachade, S., Kamble, R., Kokare, M., Deshmukh, G., Sa- hasrabuddhe, V., and Meriaudeau, F. Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research. Data 3, 3 (2018). [46] Porwal, P., Pachade, S., Kokare, M., Deshmukh, G., Son, J., Bae, W., Liu, L., Wang, J., Liu, X., Gao, L., et al. IDRiD: Diabetic retinopathy– segmentation and grading challenge. Medical Image Analysis 59 (2020), 101561. REFERENCES 63 [47] Postels, J., Segu, M., Sun, T., Van Gool, L., Yu, F., and Tombari, F. On the practicality of deterministic epistemic uncertainty. arXiv preprint arXiv:2107.00649 (2021). [48] Pratt, H., Coenen, F., Harding, S. P., Broadbent, D. M., and Zheng, Y. Feature visualisation of classification of diabetic retinopathy using a convolutional neural network. In CEUR Workshop Proceedings (2019), vol. 2429, pp. 23–29. [49] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer- Assisted Intervention – MICCAI 2015 (Cham, 2015), N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds., Springer International Publishing, pp. 234–241. [50] Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016). [51] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-Cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (2017), pp. 618–626. [52] Seo, S., Seo, P. H., and Han, B. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 9022–9030. [53] Sharma, A., Agrawal, M., Roy, S. D., and Gupta, V. Automatic glaucoma diagnosis in digital fundus images using deep CNNs. Springer Singapore, Singapore, 2020, pp. 37–52. [54] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large- scale image recognition. arXiv preprint arXiv:1409.1556 (2014). [55] Sreng, S., Maneerat, N., Hamamoto, K., and Win, K. Y. Deep learning for optic disc segmentation and glaucoma diagnosis on retinal images. Applied Sciences 10, 14 (2020), 4916. [56] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhut- dinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929–1958. [57] Thulasidasan, S., Chennupati, G., Bilmes, J., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. arXiv preprint arXiv:1905.11001 (2019). [58] Triwijoyo, B. K., Budiharto, W., and Abdurachman, E. The classification of hypertensive retinopathy using convolutional neural network. 
Procedia Computer Science 116 (2017), 166–173. Discovery and innovation of computer science tech- nology in artificial intelligence era: The 2nd International Conference on Computer Science and Computational Intelligence (ICCSCI 2017). [59] van Amersfoort, J., Smith, L., Jesson, A., Key, O., and Gal, Y. On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint arXiv:2102.11409 (2021). [60] Van Amersfoort, J., Smith, L., Teh, Y. W., and Gal, Y. Uncertainty estima- tion using a single deep deterministic neural network. In International Conference on Machine Learning (2020), PMLR, pp. 9690–9700. [61] Wang, L., and Zhao, C. Hyperspectral Image Processing. Springer, 2015. [62] Wei, Q., Li, X., Yu, W., Zhang, X., Zhang, Y., Hu, B., Mo, B., Gong, D., Chen, N., Ding, D., and xin Chen, Y. Learn to segment retinal lesions and beyond. 2020 25th International Conference on Pattern Recognition (ICPR) (2021), 7403–7410. [63] Yan, Z., Han, X., Wang, C., Qiu, Y., Xiong, Z., and Cui, S. Learning mu- tually local-global U-Nets for high-resolution retinal lesion segmentation in fundus images. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (2019), pp. 597–600. [64] Yu, S., Jia, S., and Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 219 (2017), 88–98. [65] Zhang, S., Zheng, R., Luo, Y., Wang, X., Mao, J., Roberts, C. J., and Sun, M. Simultaneous arteriole and venule segmentation of dual-modal fundus images using a multi-task cascade network. IEEE Access 7 (2019), 57561–57573. [66] Zhou, M., Jin, K., Wang, S., Ye, J., and Qian, D. Color retinal image enhancement based on luminosity and contrast adjustment. IEEE Transactions on Biomedical Engineering 65, 3 (2018), 521–527. [67] Zhou, T., Ruan, S., and Canu, S. A review: Deep learning for medical image segmentation using multi-modality fusion. Array 3-4 (2019), 100004. Publication I Garifullin A., Kööbi P., Ylitepsa P., Ådjers K., Hauta-Kasari M., Uusitalo H., and Lensu L. Hyperspectral Image Segmentation of Retinal Vasculature, Optic Disc and Macula Reprinted with permission from 2018 Digital Image Computing: Techniques and Applications (DICTA) pp. 1-5, 2018. © 2021, IEEE Hyperspectral Image Segmentation of Retinal Vasculature, Optic Disc and Macula Azat Garifullin∗, Peeter Ko¨o¨bi†§, Pasi Ylitepsa†, Kati A˚djers†§, Markku Hauta-Kasari‡, Hannu Uusitalo†§ and Lasse Lensu∗ ∗ Machine Vision and Pattern Recognition Laboratory, School of Engineering Science, Lappeenranta University of Technology, P.O. Box 20, 53851, Lappeenranta, Finland † SILK, Department of Ophthalmology, ARVO F313, 33014, University of Tampere, Finland ‡ University of Eastern Finland, School of Computing, P.O. Box 111. 80101, Joensuu, Finland § Tays Eye Center, Tampere University Hospital, P.O. Box 2000, 33520, Tampere, Finland Emails: {azat.garifullin, lasse.lensu}@lut.fi, {peeter.koobi, pasi.ylitepsa, hannu.uusitalo}@uta.fi, kati.adjers@pshp.fi, markku.hauta-kasari@uef.fi Abstract—The most common approach for retinal imaging is the eye fundus photography which usually results in RGB images. Recent studies show that the additional spectral information provides useful features for automatic retinal image analysis. The current work extends recent research on the joint segmentation of retinal vasculature, optic disc and macula which often appears in different retinal image analysis tasks. Fully convolutional neural networks are utilized to solve the segmentation problem. 
It is shown that the network architectures can be effectively modified for the spectral data and the utilization of spectral information provides moderate improvements in retinal image segmentation. I. INTRODUCTION Retinal diseases like diabetic retinopathy, age-related mac- ular degeneration and glaucoma are the leading causes of blindness worldwide [1]. A diagnostic process to recognize the signs of these diseases is traditionally based on retinal RGB images. Recent developments in machine vision technologies provide various methods for automatic RGB images analysis. In [2], it has been shown that additional spectral features introduced to machine learning methods may improve the performance of lesions classification and enable new ways to analyze the retinal tissue layers. Thus, spectral retinal imaging can be treated as a useful alternative to traditional color fundus imaging. This work studies the joint segmentation of retinal vasculature, optic disc and macula for hyperspectral retinal images. Deep convolutional neural networks is a common trend in both retinal and hyperspectral image analysis. Deep ar- chitectures similar to U-Net are extensively used for the vasculature [3], optic disc and cup segmentation [4] purposes. Tan et al. [5] studied the segmentation of optic disc, fovea and retinal vasculature using a single model trained on the DRIVE dataset [6]. All the mentioned approaches have been tested on RGB images. To the best of the authors’ knowledge, this work is the first work which studies spectral retinal image segmentation using deep fully-convolutional neural networks. One way to build deep architectures for hyperspectral image (HSI) segmentation is to combine dimensionality reduction methods and convolutional neural networks. Jiao et al. [7] proposed to use the feature fusion from VGG16 encoder [8] and principal component analysis for HSI segmentation. Yu et al. [9] showed that dimensionality reduction blocks can be trained end-to-end altogether with the convolutional neural networks. Other approaches are based on 3D convolutional neural networks [10], [11] that can effectively extract both spatial and spectral features. However, it is more difficult to scale the 3D convolutions on high-resolution images. In this paper, we followed ideas similar to Yu et al. [9] and Jiao et al. [7], and adapted SegNet [12] and dense fully-convolutional neural networks (Dense-FCNs) [13] for the spectral retinal seg- mentation task. These architectures are trained and evaluated by using a spectral image dataset with manual ground truth for the vasculature, optic disc and macula. II. SPECTRAL RETINAL IMAGE DATASET Several spectral fundus imaging setups with different optical principles have been proposed. A typical hyperspectral imag- ing setup is an adapted fundus camera with a light source with a broadband illumination, and a spectral device for selecting a spectral band. Fa¨lt et al. [14] modified a Canon CR5-45NM fundus camera to spectral fundus camera by replacing the standard light source with a fibre optic illuminator including a halogen lamp with the illumination spectrum from 380 to 780 nm and 30 interference filters with 10 nm interval were used for the wavelength selection. As the detector, a grayscale charge-coupled device (CCD) camera with the sensor array size of 2048× 2048 pixels and 2× 2 binning was used. The resulted dataset is a set of 1024 × 1024 images with the 30 channels where each channel corresponds to the certain wavelength. 
For each image in the dataset, the field-of-view (FOV) mask is provided. The FOV masks are binary images where the white areas correspond to regions of the fundus of the eye. Manual segmentation masks for the vasculature, optic disc and macula are available (Fig. 2). The dataset consists of 55 spectral retinal images acquired from patients with diabetic retinopathy: randomly selected 25 images are in the training set, and the rest are used as the testing set. Thus, the amount of data in the dataset is comparable to the amount 978-1-5386-6602-9/18/$31.00 ©2018 IEEE Authorized licensed use limited to: Lappeenrannan Teknillinen Korkeakoulu. Downloaded on August 13,2021 at 11:40:24 UTC from IEEE Xplore. Restrictions apply. Fig. 1. Visualization of the mean spectrum and 3σ range of (a) blood vessels, (b) optic disc and (c) macula of labelled data in DRIVE [6] and STARE [15] datasets which are typically used for blood vessel segmentation algorithms benchmarking. Examples of the mean and variation of the spectra are presented in Fig. 1. (a) (b) (c) (d) (e) (f) Fig. 2. (a) An RGB image from the spectral retinal image dataset. (b) Montage of the channel images composing an example hyperspectral retinal image. The image was normalized for the visualization purpose. (c) FOV mask and the corresponding segmentation masks for the (d) vessels, (e) optic disc and (f) macula. III. SEMANTIC SEGMENTATION ARCHITECTURES A. SegNet SegNet is an encoder-decoder architecture for semantic segmentation. In this work, we used basic-SegNet with small modifications for the HSI segmentation. The modified archi- tecture consists of a dimensionality reduction block, encoder and decoder. The dimensionality reduction block is a sequence of blocks consisting of 1× 1 convolutional layer and rectified linear unit (ReLU) activation. As the encoder, VGG16 pre- trained on ImageNet [16] was used. The decoder is a sequence of transposed convolutions and convolutional layers followed by batch normalization (BN), ReLU activation and dropout. The scheme of the architecture is shown in Fig. 3. DR-30 DR-30 DR-16 DR-3 VGG-16 Dec-512 Dec-256Dec-128 Dec-64Dec-32 Conv-3x1x1 + Sigmoid Segmentationmasks HSI DR-N – Conv Nx1x1 -  ReLU Dec-N – Transposed Conv Nx3x3 stride=2x2 - BN - ReLU - Dropout (p=0.5) + Conv Nx3x3 = BN - ReLU - Dropout (p=0.5) Fig. 3. DR-SegNet architecture. B. Dense-FCN It has been shown that Dense-FCNs have less parameters and may outperform the SegNet architecture in a variety of different segmentation tasks [13]. Here we adapted the Dense- FCN architecture for the retinal HSI segmentation task. The main building block of Dense-FCN is a dense con- volutional block (DCB) where the input of each layer is a concatenation of the outputs of previous layers. The block consists of repeating BN, ReLU, convolution and dropout p = 0.5 layers resulting in K feature maps (growth rate). The main concept of Dense-FCN is similar to SegNet in a sense that the input is first compressed to a hidden Authorized licensed use limited to: Lappeenrannan Teknillinen Korkeakoulu. Downloaded on August 13,2021 at 11:40:24 UTC from IEEE Xplore. Restrictions apply. representation by the downsampling part, and then the seg- mentation masks are recovered by an upsampling part. The downsampling part consists of DCBs and downsampling tran- sitions (DT) with skip connections to the upsampling part. The upsampling part consists of DCBs and upsampling transitions (UT). The scheme of the utilized architecture is given in Fig. 4. 
DDRB(3, 6) DCB(8, 2) DCB(16, 4) DCB(16, 16) TD(16) TD(64) TU(64) Conv 30x1x1 C C DCB(16, 4) TU(16) C DDRB(3, 6) DCB(8, 2) C Conv 3x1x1 + Sigmoid C C DDRB(K, N) C – DCB with 1x1 kernels – Dense Convolutional (3x3) Block K – growth rate N – depth of the block DCB(K, N) – Concatenation TU(N) TD(N) – Transposed Conv Nx3x3 stride=2x2 – BN-ReLU-Conv Nx1x1 + Dropout (p=0.5) + Max-Pool 2x2 HSI Segmentationmasks Fig. 4. DR-6-Dense-FCN architecture. C. Image preprocessing First, each channel of a spectral image is normalized to values between 0 and 255. After the normalization step, contrast limited adaptive histogram equalization [17] with the clip limit of 2 and the grid size of 8 × 8 is applied to each channel of the input. The described preprocessing scheme was applied to both RGB and spectral images. The preprocessing scheme was used to reduce the effects of uneven illumination fields of the channel images and indirectly affect the inter-person variation in the limited dataset. The scheme was found to improve the convergence and performance of the trained models. Examples of the preprocessed RGB and spectral images are given in Fig. 5. (a) (b) Fig. 5. Preprocessed (a) RGB and (b) spectral images. D. Training details Dense-FCNs are pretrained for 50 epochs with 500 steps per epoch on random patches 512 × 512 with the batch size equal to 3. Each batch consists of examples for blood vessels, optic disc, and macula. The vasculature examples are sampled uniformly, whereas the patch centers of the examples of the optic disc and macula are sampled from the normal distribution the parameters of which are estimated as the sample mean and covariance of true label coordinates. The common training step for both architectures is fine- tuning on full size images. The number of epochs used was 50 for Dense-FCN and 100 for SegNet. In both cases, the weights were initialized using HeNor- mal [18]. Binary cross-entropy was used as the loss function. In addition to dropout, l2 regularization with the weight decay factor 10−4 was used. As the optimizer, Adadelta [19] with the learning rate l = 1 and the decay rate ρ = 0.95 was used for the both pretraining and fine-tuning. The learning rate was dropped by a factor of 10 if the training loss was not decreased by 0.005 for 5 epochs. Data augmentation through flipping, reflecting and rescaling (with scale rates 0.8 and 1.2) was applied in both cases. The parameter values were determined empirically based on initial experiments. IV. EXPERIMENTS AND RESULTS The trained networks were evaluated on the full size images from the testing set using Monte Carlo dropout [20] in the test phase with 100 forward passes. The standard F1 measure, intersection over union (IoU) for each class and mean IoU over the classes were used as the evaluation metrics and they are presented in Table I. The evaluation metrics were calculated only inside the FOV. In order to distinguish between the RGB and spectral architectures, the DR prefix was added for the architectures for spectral images and the RGB prefix for the architectures for RGB images. The Dense-FCN architecture was also tested for the hyperspectral images without the di- mensionality reduction layers (HSI prefix) and with a different number of output channels of the DR layers (DR-6 means 6 output channels). From the table, it is clear that the vessel segmentation is the easiest task for all the architectures, since F1 score is comparable in all the cases, whereas the macular region is the most difficult to segment. 
The latter can be explained by the fact that there are normally no clearly defined structural characteristic in the macula. Furthermore, there are numerous images where the macula is only partly visible, and these images happen to be present only in the test set, because of what all the considered models were unable to generalize to a partly visible macula. Another interesting fact is that the region may have different shapes in different spectral channels. The VGG encoder pretrained on ImageNet allows to signif- icantly improve the performance on the optic disc and macula segmentation tasks. In the case of Dense-FCNs, it is difficult to achieve satisfactory performance for both the optic disc and macula segmentation, and the performance depends on the number of output channels of the dimensionality reduction layers. The segmentation results for the image presented in Fig. 5 are presented in Fig. 6 – 8. Comparing the segmentation results for hyperspectral and RGB images in Fig. 6 and Fig. 7 shows that in some cases the Authorized licensed use limited to: Lappeenrannan Teknillinen Korkeakoulu. Downloaded on August 13,2021 at 11:40:24 UTC from IEEE Xplore. Restrictions apply. TABLE I EVALUATION RESULTS. THE BEST F1 SCORES ARE IN BOLD. Architecture Vessels Optic Disc Macula Mean IoU # Parameters F1 IoU F1 IoU F1 IoU DR-SegNet 0.8091 0.6802 0.8947 0.8356 0.6566 0.5291 0.6816 21795786 RGB-SegNet 0.7925 0.6571 0.8802 0.8149 0.6033 0.4657 0.6458 21793379 DR-1-Dense-FCN 0.8125 0.6853 0.7128 0.6122 0.3962 0.3223 0.5399 730028 DR-2-Dense-FCN 0.8243 0.7019 0.7311 0.6257 0.3084 0.2414 0.5230 730383 DR-3-Dense-FCN 0.8069 0.6776 0.6986 0.5947 0.2822 0.2120 0.4948 730738 DR-4-Dense-FCN 0.8200 0.6958 0.6880 0.5867 0.3389 0.2665 0.5163 731093 DR-5-Dense-FCN 0.8006 0.6772 0.6977 0.5949 0.3843 0.2954 0.5225 731448 DR-6-Dense-FCN 0.8021 0.6714 0.7427 0.6394 0.4244 0.3402 0.5503 731803 HSI-Dense-FCN 0.7974 0.6647 0.7323 0.6202 0.2932 0.2282 0.5043 745837 RGB-Dense-FCN 0.8112 0.6840 0.7154 0.6153 0.1543 0.1070 0.4688 729043 Fig. 6. Top row: example segmentation results with DR-SegNet. Bottom row: standard deviations of the activations. Fig. 7. Top row: example segmentation results with RGB-SegNet. Bottom row: standard deviations of the activations. models trained on RGB fail to localize the macula properly, whereas the utilization of spectral information may help to avoid such problems. It is worth to mention that in both cases the segmentation artefacts are present in areas where the labels overlap. This is mainly caused by the VGG encoder pretrained on ImageNet, since there were no signs of such artefacts when Fig. 8. Top row: example segmentation results with DR-6-Dense-FCN. Bottom row: standard deviations of the activations. it was not used or used without pretraining. It is also clear that the results obtained with DR-SegNet show less artefacts compared to RGB-SegNet. The dimensionality reduction layers allowed to adapt stan- dard convolutional architecures for the hyperspectral image segmentation without significantly increasing the amount of parameters of the model. In addition, the utilization of the dimensionality reduction layers may slightly improve the performance of vessels and macula segmentation. In Fig. 9, the outputs of the dimensionality reduction layers for the both spectral architectures are shown. In Fig. 9 one can see that the models try to emphasize areas where certain labels are most visible. 
For example, in the first image of the bottom row the macula is clearly seen, whereas in the second and third images the optic disc is visible more clear compared to the first image. Nevertheless, training the dimensionality reduction layers is a challenging task, and it is not always possible to train them to extract useful features. In the case of DR-SegNet, the pretrained VGG encoder makes the training easier. However, if the network is trained from the scratch, it is difficult to get results comparable to the results obtained with a pretrained model. We also tried to add skip connections to DR-SegNet in a manner similar to DR-Dense- FCNs, but in the case of DR-SegNet, it just confuses the model Authorized licensed use limited to: Lappeenrannan Teknillinen Korkeakoulu. Downloaded on August 13,2021 at 11:40:24 UTC from IEEE Xplore. Restrictions apply. Fig. 9. Visualization of dimensionality reduction layers. The top two rows illustrate the results for DR-6-Dense-FCN and the bottom row for DR-SegNet. more and the performance decreases, whereas in the case of DR-Dense-FCNs, it boosts performance significantly. Previously, Laaksonen [2] has shown that diabetic retinopa- thy lesions classification algorithms trained on spectral data outperform algorithms trained on RGB data. Furthermore, spectral information can also be utilized for the histological analysis of fundus images [2]. From the presented results, it also clear that the utilization of additional spectral information may also improve the segmentation results compared to RGB images. V. CONCLUSIONS In this work, multilabel segmentation of retinal vasculature, optic disc and macula for spectral retinal images was studied. It was shown that spectral information may give additional advantages in optic disc and macula segmentation moderately improving the segmentation performance. The results also show that it is necessary to study more the dimensionality reduction layers to find a way to train them effectively on small datasets. The future work will be con- centrated on further improvements of training the Dense-FCN architecture in order to achieve comparable performance with the architectures pretrained on ImageNet. Another direction of the future work is the development of a gold standard based on label data from multiple experts. ACKNOWLEDGEMENTS The authors wish to thank CSC for the computational resources for some of the experiments. REFERENCES [1] “World Health Organization: causes of blindness and visual impairment,” http://www.who.int/blindness/causes/en/, accessed: 2018-06-09. [2] L. Laaksonen, “Spectral retinal image processing and analysis for oph- thalmology,” Ph.D. dissertation, Lappeenranta University of Technology, 2016. [3] L. Giancardo, K. Roberts, and Z. Zhao, “Representation learning for reti- nal vasculature embeddings,” in Fetal, Infant and Ophthalmic Medical Image Analysis, M. J. Cardoso, T. Arbel, A. Melbourne, H. Bogunovic, P. Moeskops, X. Chen, E. Schwartz, M. Garvin, E. Robinson, E. Trucco, M. Ebner, Y. Xu, A. Makropoulos, A. Desjardin, and T. Vercauteren, Eds. Cham: Springer International Publishing, 2017, pp. 243–250. [4] S. M. Shankaranarayana, K. Ram, K. Mitra, and M. Sivaprakasam, “Joint optic disc and cup segmentation using fully convolutional and adversarial networks,” in Fetal, Infant and Ophthalmic Medical Image Analysis, M. J. Cardoso, T. Arbel, A. Melbourne, H. Bogunovic, P. Moeskops, X. Chen, E. Schwartz, M. Garvin, E. Robinson, E. Trucco, M. Ebner, Y. Xu, A. Makropoulos, A. Desjardin, and T. 
Vercauteren, Eds. Cham: Springer International Publishing, 2017, pp. 168–176. [5] J. H. Tan, U. R. Acharya, S. V. Bhandary, K. C. Chua, and S. Sivaprasad, “Segmentation of optic disc, fovea and retinal vasculature using a single convolutional neural network,” Journal of Computational Science, vol. 20, pp. 70–79, 2017. [6] A. Hoover, V. Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response,” IEEE Transactions on Medical imaging, vol. 19, no. 3, pp. 203–210, 2000. [7] L. Jiao, M. Liang, H. Chen, S. Yang, H. Liu, and X. Cao, “Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5585–5599, Oct 2017. [8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [9] S. Yu, S. Jia, and C. Xu, “Convolutional neural networks for hyperspec- tral image classification,” Neurocomputing, vol. 219, pp. 88–98, 2017. [10] M. He, B. Li, and H. Chen, “Multi-scale 3d deep convolutional neural network for hyperspectral image classification,” in 2017 IEEE International Conference on Image Processing (ICIP), Sept 2017, pp. 3904–3908. [11] X. Yang, Y. Ye, X. Li, R. Y. K. Lau, X. Zhang, and X. Huang, “Hyperspectral image classification with deep learning models,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16, 2018. [12] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep con- volutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, Dec 2017. [13] S. Je´gou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 1175–1183. [14] P. Fa¨lt, J. Hiltunen, M. Hauta-Kasari, I. Sorri, V. Kalesnykiene, J. Pietila¨, and H. Uusitalo, “Spectral Imaging of the Human Retina and Com- putationally Determined Optimal Illuminants for Diabetic Retinopathy Lesion Detection,” Journal of Imaging Science and Technology, vol. 55, no. 3, pp. 253–263, 2011. [15] M. Michael Goldbaum, “Structured analysis of the retina,” http://www.cecas.clemson.edu/ ahoover/stare, 2003, online. [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009. [17] K. Zuiderveld, “Graphics gems iv,” P. S. Heckbert, Ed. San Diego, CA, USA: Academic Press Professional, Inc., 1994, ch. Contrast Limited Adaptive Histogram Equalization, pp. 474–485. [Online]. Available: http://dl.acm.org/citation.cfm?id=180895.180940 [18] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034. [19] M. D. Zeiler, “Adadelta: An adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012. [20] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059. Authorized licensed use limited to: Lappeenrannan Teknillinen Korkeakoulu. Downloaded on August 13,2021 at 11:40:24 UTC from IEEE Xplore. Restrictions apply. 
Publication II Garifullin, A., Lensu, L., and Uusitalo, H. On the Uncertainty of Retinal Artery-Vein Classification with Dense Fully-Convolutional Neural Networks Reprinted with permission from Advanced Concepts for Intelligent Vision Systems (ACIVS) pp. 87–98, 2020. © 2020, Springer Nature Switzerland AG On the Uncertainty of Retinal Artery-Vein Classification with Dense Fully-Convolutional Neural Networks Azat Garifullin1(B) , Lasse Lensu1 , and Hannu Uusitalo2,3 1 LUT University, P.O. Box 20, 53851 Lappeenranta, Finland {azat.garifullin,lasse.lensu}@lut.fi 2 SILK, Department of Ophthalmology, Tampere University, ARVO F313, 33014 Tampere, Finland hannu.uusitalo@tuni.fi 3 Tays Eye Center, Tampere University Hospital, P.O. Box 2000, 33520 Tampere, Finland Abstract. Retinal imaging is a valuable tool in diagnosing many eye diseases but offers opportunities to have a direct view to central nervous system and its blood vessels. The accurate measurement of the charac- teristics of retinal vessels allows not only analysis of retinal diseases but also many systemic diseases like diabetes and other cardiovascular or cerebrovascular diseases. This analysis benefits from precise blood ves- sel characterization. Automatic machine learning methods are typically trained in the supervised manner where a training set with ground truth data is available. Due to difficulties in precise pixelwise labeling, the question of the reliability of a trained model arises. This paper addresses this question using Bayesian deep learning and extends recent research on the uncertainty quantification of retinal vasculature and artery-vein classification. It is shown that state-of-the-art results can be achieved by using the trained model. An analysis of the predictions for cases where the class labels are unavailable is given. Keywords: Bayesian deep learning · Blood vessels segmentation · Artery-vein classification 1 Introduction A number of eye and systemic diseases influence the vasculature of the retina in different ways. The blood vessel characteristics in retinal images may provide visible evidence about numerous diseases such as hypertensive retinopathy, dia- betic retinopathy, as well as other cardio- and cerebrovascular diseases [12]. The related characteristics include the shape and size of retinal vessels, arteriovenous ratio and arteriovenous crossing [14]. These characteristics may be obtained by using blood vessel segmentation masks produced by automatic machine learning techniques [5]. c© Springer Nature Switzerland AG 2020 J. Blanc-Talon et al. (Eds.): ACIVS 2020, LNCS 12002, pp. 87–98, 2020. https://doi.org/10.1007/978-3-030-40605-9_8 88 A. Garifullin et al. The topic of blood vessels segmentation is well studied by the community [1]. However, the artery-vein (AV) classification task remains challenging not only for machines, but also for humans. Despite the fact that discriminative features based on color and geometry are described, it is still difficult to distinguish arteries from veins [14] due to imperfect imaging conditions and limited visibility of the retinal blood vessels. Recently, deep convolutional neural networks have become a common trend for retinal vasculature segmentation and AV classification because of the abil- ity to automatically learn meaningful features. Welikala et al. [16] proposed a method based on a convolutional neural network (CNN) classifying arteries and veins in a patch-wise manner. 
The authors considered the problem as a multi- class classification task placing a softmax layer at the end of the network. The UK Biobank database was used from which 100 images were labeled and classification accuracy of 82.26% for arteries and veins was reported. Girard et al. [5] proposed to use a modified U-Net [15] with likelihood score propagation in the minimum spanning tree effectively utilizing information about the global vessel topology. The approach was tested on the DRIVE data set [8] and it achieved 94.93% accu- racy for the AV classification. Badawi et al. [2] proposed to train a CNN with multiloss function consisting of pixelwise cross entropy loss and segment-level loss to overcome training issues appearing because of inconsistent thickness of blood vessels. The authors also created a new data set consisting of labeled sub- sets of EPIC and MESSIDOR [3] data sets and classification accuracy of 96.5% was reported. Hemelings et al. [7] applied the U-Net architecture for the task of AV classification stating the problem as a multi-class classification problem predicting labels for four classes (background, vein, artery, and unknown) with classification accuracy of 94.42% and 94.11% for arteries and veins, respectively. Zhang et al. [18] proposed cascade refined U-net which modifies the original model with multi-scale loss training and includes sub-networks for simultaneous AV and blood vessel segmentation. The authors achieved 97.27% arteriovenous classification accuracy evaluated on the automatically detected vessels. In this work, a multi-label classification approach is considered with the uncertainty quantification experiments presented. Our approach is most simi- lar to the method proposed by Zhang et al. [18] in a way how three-component loss is used. The main difference is that in this work, classification of arteries and veins are not conditioned on blood vessel predictions, but vessel labels are conditioned on arteries and veins. Using the multi-label classification approach, there is no need to separately model the AV crossings and background. To the best of authors’ knowledge, this work is the first presenting uncertainty quantifi- cation experiments for the of AV classification. For the experiments, the RITE data set is utilized. On the Uncertainty of Retinal Artery-Vein Classification 89 2 Data and Methods 2.1 DRIVE and RITE Data Sets The DRIVE database is a common benchmark for the retinal blood vessel seg- mentation task [8]. It contains 20 train and 20 test images with two sets of manual blood vessel segmentations. The RITE data set [9] extends DRIVE with an AV reference standard containing four types of labels: arteries (red), veins (blue), overlapping (green), and uncertain vessels (white). An example test image is shown in Fig. 1. 2.2 AV Classification Let f be a model with parameters θ that maps an input image x to a map of logits with the same spatial dimensionality as the original image: yˆ = f (x, θ) . (1) Given predicted logits yˆ = [yˆartery yˆvein], probabilities of assigning labels to arteries and veins can be calculated as follows: partery = sigmoid (yˆartery) , (2) pvein = sigmoid (yˆvein) . (3) In the multi-label setup, the same pixel can be classified with both artery and vein labels, which is meaningful in the case of AV crossings. A vessel probability label can then be naturally inferred by a simple formula: pvessel = partery + pvein − parterypvein. 
(4) Since the data set contains the masks for both the AV classification and blood vessel segmentation, it is possible to state the following optimization problem θˆ = argminθ [Lartery (θ) + Lvein (θ) + Lvessel (θ)] , (5) where L denotes the binary cross entropy loss for the corresponding labels. This way even if the labels for arteries and veins are not given for uncertain vessel labels, it is possible to enforce a model to predict correct labels for the blood vessels. 2.3 Aleatoric and Epistemic Uncertainties The approach described in the previous section gives only point estimates for the label probabilities and the model parameters are considered to be deterministic. In order to better capture imperfect data labeling and image noise, one can consider the model outputs and the parameters to be random variables. The first approach captures the heteroscedastic aleatoric uncertainty that depends 90 A. Garifullin et al. (a) (b) Fig. 1. The RITE data set: (a) An example test image and (b) corresponding artery- vein reference standard. (Color figure online) on the input data, whereas the second represents the epistemic uncertainty that models a distribution of the learned parameters. More detailed explanations for the uncertainties can be found in [13] and [4]. In this work, a brief explanation for the AV classification task is given below. Aleatoric uncertainty can be captured by modifying the original model to predict the mean and standard deviations of logits: [yˆ,σ] = f (x, θ) . (6) In order to predict standard deviations, a second layer similar and parallel to the one used for logits is added to the output of the network. In order to ensure that the predicted standard deviations are positive, an additional absolute value activation is added to the output of the layer. The probabilities of the labels can then be calculated as follows: pˆ = sigmoid (yˆ + σ ⊙ ǫ) , ǫ ∼ N (0, I) , (7) where ⊙ stands for the Hadamard product and ǫ are sampled during inference. The main inference scheme for AV remains the same with the exception that instead of a point estimate, the model now yields NA samples that are then used to calculate the loss (5). The final minimized loss is just an average over the predicted losses for each sample. Epistemic uncertainty can be captured by considering the model parameters to be a random variable and considering the following posterior predictive: p (y | x,D) = ∫ p (y | x, θ) p (θ | D) dθ, (8) On the Uncertainty of Retinal Artery-Vein Classification 91 where D denotes a data set of input-output pairs. Typically, the parameter’s posterior p (θ | D) for complex models such as deep neural networks is intractable and variational approximations are used. The posterior in (8) can be replaced by a simpler distribution q (θ) and the training procedure can then be formulated as the minimization of the Kullback-Leibler divergence between the true posterior and the approximation. In this work, the model f is parameterized as a dense fully-convolutional network (Dense-FCN) and Monte-Carlo dropout [4] is used for the variational approximation. The description of the utilized architecture is given below. 2.4 Architecture The architecture utilized in this work is a Dense-FCN. It has been shown that Dense-FCNs have less parameters and may outperform other fully-convolutional network (FCN) architectures in a variety of different segmentation tasks [11]. Here we adapt the Dense-FCN architecture for the AV classification tasks. 
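Before detailing the architecture, the label-probability computation of Sections 2.2 and 2.3 can be made concrete with a short sketch. The NumPy code below illustrates the aleatoric sampling of (7), the sigmoid probabilities of (2)–(3), the derived vessel probability of (4) and the three-term loss of (5); the function names are illustrative and NumPy is used only for clarity, so this is a sketch of the computation rather than the implementation used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def av_probabilities(y_hat, sigma, n_aleatoric=50, rng=None):
    """Sample artery/vein/vessel probabilities from predicted logit means and stds.

    y_hat, sigma: arrays of shape (H, W, 2) with the logit means and standard
    deviations for the artery and vein channels, as produced by (6).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Aleatoric sampling: y + sigma * eps with eps ~ N(0, I), Eq. (7).
    eps = rng.standard_normal((n_aleatoric,) + y_hat.shape)
    p = sigmoid(y_hat[None] + sigma[None] * eps)
    p_artery, p_vein = p[..., 0], p[..., 1]
    # Vessel probability from the addition law of probability, Eq. (4).
    p_vessel = p_artery + p_vein - p_artery * p_vein
    return p_artery, p_vein, p_vessel

def multi_label_loss(p_artery, p_vein, p_vessel, t_artery, t_vein, t_vessel, eps=1e-7):
    """Three-term binary cross-entropy of (5), averaged over the aleatoric samples."""
    def bce(p, t):
        p = np.clip(p, eps, 1.0 - eps)
        return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
    return bce(p_artery, t_artery) + bce(p_vein, t_vein) + bce(p_vessel, t_vessel)
```

Averaging the loss over the sampled probability maps corresponds to the training scheme described above, where the final minimized loss is the average of the losses of the individual aleatoric samples.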
The main building block of Dense-FCN is a dense convolutional block (DCB) where the input of each layer is a concatenation of the outputs of the previous layers. The block consists of repeating batch normalization (BN), ReLU, convolution and dropout (p = 0.5) layers, each resulting in g feature maps (growth rate). The main concept of Dense-FCN is similar to other encoder-decoder architectures in the sense that the input is first compressed to a hidden representation by the downsampling part, and then the segmentation masks are recovered by an upsampling part. The downsampling part consists of DCBs and downsampling transitions with skip connections to the upsampling part. The upsampling part consists of DCBs and upsampling transitions. An example of two blocks in the downsampling and upsampling paths of a Dense-FCN is given in Fig. 2. The architectural parameters used are given below:
– Growth rate for all DCBs: g = 16.
– The downsampling path consists of DCBs with depths Ddown = [4, 5, 7, 10, 12, 15].
– The upsampling path consists of five DCBs with depths Dup = [12, 10, 7, 5, 4].
– The first and last convolution layers are the same as in Fig. 2.
2.5 Image Preprocessing
It was noticed in the experimental part of the work that simple preprocessing involving contrast enhancement and channel normalization improves the convergence and performance of the trained models. First, contrast-limited adaptive histogram equalization [19] with the clip limit of 2 and the grid size of 8 × 8 is applied, and then each image channel is normalized to values between 0 and 255. The preprocessing scheme was used to reduce the effects of uneven illumination fields of the channel images (Fig. 3).
Fig. 2. Dense-FCN architecture: Dense stands for a DCB; C is a tensor concatenation; H is a block consisting of BN, ReLU and a convolutional layer with growth rate g; Down is a transition down block with F output feature maps; Up is a transition up with F output feature maps and 2 × 2 stride; logits std denotes standard deviations of logits.
2.6 Training Details
The Dense-FCN was pretrained for 200 epochs with 1000 steps per epoch on random patches of size 224 × 224 with the batch size equal to 5. Then it was fine-tuned for 50 epochs with 500 steps per epoch on full-size images with the batch size equal to 1. The weights were initialized using HeNormal [6]. In addition to dropout, l2 regularization with the weight decay factor 10−4 was used. As the optimizer, Adadelta [17] with the learning rate l = 1 and the decay rate ρ = 0.95 was used for both the pretraining and fine-tuning. The learning rate was dropped by a factor of 10 if the training loss did not decrease by 0.005 for 10 epochs. Data augmentation using flipping, reflecting and rescaling (with scale rates 0.8 and 1.2) was applied in both cases. During the fine-tuning stage, the images were randomly padded to the size 608 × 608 so that the size is divisible by 32 and could be properly compressed by the downsampling path. The parameter values were determined empirically based on initial experiments with the RITE database.
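As a complement to the description above, a possible implementation of the preprocessing of Section 2.5 is sketched below, assuming OpenCV is available. Whether CLAHE is applied to each channel separately is not stated in the text, so the per-channel application here is an assumption, and the function name is illustrative.

```python
import cv2
import numpy as np

def preprocess_fundus(image: np.ndarray) -> np.ndarray:
    """Contrast enhancement and channel normalization as described in Section 2.5.

    CLAHE with clip limit 2 and an 8x8 tile grid is applied, after which each
    channel is linearly rescaled to the range [0, 255].
    """
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    channels = []
    for c in cv2.split(image):
        c = clahe.apply(c.astype(np.uint8))
        # Per-channel min-max normalization to [0, 255].
        c = cv2.normalize(c, None, 0, 255, cv2.NORM_MINMAX)
        channels.append(c)
    return cv2.merge(channels)
```

The enhanced images can then be fed to the patch sampler used during pretraining and to the full-size fine-tuning stage.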
3 Experiments and Results
3.1 Training and Evaluation Strategies
Considering the given reference standard, the question arises of how to use the uncertain class labels and what their effect is on the final training results. Possible ways of utilizing this information are to consider these pixels to be arteries and veins simultaneously, including uncertain (IU), or to exclude them from the training completely, excluding uncertain (EU). In this work, a comparison of both training strategies is provided. The crossing labels are considered to be veins and arteries simultaneously. Both strategies are evaluated against the reference standard with excluded uncertain labels, and the vessel classification metrics are given by evaluating against the reference standard provided by the second expert.
Fig. 3. Two examples of preprocessed RITE images.
Since the AV classification problem is stated as a multilabel problem, binary classification metrics were calculated for each class separately: area under the receiver operating characteristic curve (ROC-AUC), accuracy, sensitivity and specificity. During the inference stage, the model parameters are sampled 100 times and the number of inferred aleatoric samples is NA = 50. The final posterior predictive mean is calculated over all predicted samples, and the aleatoric uncertainty UA and the epistemic uncertainty UE of the outputs are calculated as in [10]:
UA = Eq [ Vp(y | x,θ) [y] ], (9)
UE = Vq [ Ep(y | x,θ) [y] ], (10)
where E and V denote expectation and variance, respectively.
3.2 Experimental Results
The receiver operating characteristic (ROC) curves calculated after training with both strategies are shown in Fig. 4. The corresponding performance metrics are given in Tables 1 and 2. From the tables, it is clear that the AV classification performances are high, not far from the vessel pixel classification performance. Including uncertain labels in the training set leads to reduced classification accuracy for arteries and veins, but it slightly improves the performance of vessel classification. It is also clear that the including uncertain strategy increases classification sensitivity, since the training procedure now takes all labeled vessels into account during the AV inference stage.
Fig. 4. ROC curves for arteries (red), veins (blue) and vessels (orange): (a) Excluding uncertain and (b) including uncertain strategies. (Color figure online)
Table 1. Evaluation results for the excluding uncertain strategy.
Label     ROC-AUC  Accuracy  Sensitivity  Specificity
Arteries  0.973    0.970     0.607        0.992
Veins     0.976    0.970     0.669        0.992
Vessels   0.980    0.960     0.749        0.989
The segmentation results for two example images from the test set are illustrated in Fig. 5. Comparing the results for the training strategies shows that the network trained with the EU strategy tends to be more discriminative for arteries and veins in the areas closer to the optic disc. The common issue for both strategies is a learned bias towards classifying thin vessels as arteries and the inability to capture connectivity patterns in the predicted segmentation masks, which leads to vein branches being inferred as arteries. The aforementioned problems can also be visualized as predicted epistemic and aleatoric uncertainties, which are presented in Fig. 6 for the same images shown in Fig. 5. From the figure, it is clear that the epistemic uncertainty is larger near the optic disc where blood vessels cross.
Further away from the optic disc, it is concentrated mostly on the vessels' edges with a pattern similar to that of the aleatoric uncertainty. Similar observations can be made from Fig. 7, where the uncertainties are compared for the two training strategies. The regions of highest uncertainty include vessel crossings and thin vessels even in the case of correct classification.
3.3 Comparison with the State of the Art
Table 3 shows a comparison of the proposed method with recently proposed methods. It is difficult to compare the methods directly, since the evaluation methods and metrics used by different authors vary. The method proposed by Zhang et al. [18] is clearly superior compared to all the other methods, including the method studied in this work, but the authors use a 5-fold cross-validation split, meaning that they have at least 32 images in the training set, whereas in this work the experiments were carried out using the standard split with 20 images in the training set. Nevertheless, the performance obtained in this work is comparable with those recently published by Girard et al. [5] and Hemelings et al. [7].
Table 2. Evaluation results for the including uncertain strategy.
Label     ROC-AUC  Accuracy  Sensitivity  Specificity
Arteries  0.973    0.968     0.636        0.988
Veins     0.976    0.966     0.752        0.982
Vessels   0.981    0.961     0.797        0.984
Fig. 5. Visualization of inference results: from left to right, the original image, the reference standard, the posterior predictive mean obtained with the excluding uncertain strategy, and the posterior predictive mean obtained with the including uncertain strategy.
Table 3. Comparison of evaluation results. The datasets are specified with the splitting methods used by the authors.
Method                Vessels accuracy  Arteries accuracy  Veins accuracy  Dataset
Girard et al. [5]     0.948             N/A                N/A             CT-DRIVE
Badawi et al. [2]     0.960             N/A                N/A             DRIVE (standard)
Hemelings et al. [7]  N/A               0.948              0.930           DRIVE (standard)
Zhang et al. [18]     N/A               0.977              0.975           DRIVE (5-fold CV)
This work             0.960             0.970              0.970           DRIVE (standard)
Fig. 6. Visualization of estimated uncertainty: from left to right, targets with removed uncertain labels and crossings, posterior predictive mean, epistemic uncertainty and aleatoric uncertainty. The results are obtained using the excluding uncertain strategy.
Fig. 7. Visualization of estimated uncertainty: from left to right, targets with removed uncertain labels and crossings, posterior predictive mean, epistemic uncertainty, and aleatoric uncertainty. The results are obtained using the excluding uncertain (top row) and including uncertain (bottom row) strategies.
4 Conclusion
In this work, multilabel classification of arteries and veins using a Bayesian fully-convolutional network was studied. It was shown that the misclassified areas in the images can be visualized using uncertainty estimates. The proposed approach is comparable with recent state-of-the-art approaches for blood vessel segmentation and AV classification. The main topics for future research are how to reduce the epistemic uncertainty and a more careful study of the classification of uncertain labels in the RITE database. Retinal vasculature segmentation and AV classification methods typically include preprocessing procedures that affect the data. One of the open questions is how different preprocessing techniques change the aleatoric uncertainty estimates.
Other possible directions include differentiable end-to- end methods for modeling the connectivity and regularizations similar to [5] and [2]. References 1. Almotiri, J., Elleithy, K., Elleithy, A.: Retinal vessels segmentation techniques and algorithms: a survey. Appl. Sci. 8(2), 155 (2018) 2. Badawi, S., Fraz, M.: Multiloss function based deep convolutional neural network for segmentation of retinal vasculature into arterioles and venules. BioMed Res. Int. 2019, 1–17 (2019). https://doi.org/10.1155/2019/4747230 3. Decencie`re, E., et al.: Feedback on a publicly distributed image database: the messidor database. Image Anal. Stereol. 33(3), 231–234 (2014) 4. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learn- ing, pp. 1050–1059 (2016) 5. Girard, F., Kavalec, C., Cheriet, F.: Joint segmentation and classification of retinal arteries/veins from fundus images. Artif. Intell. Med. 94, 96–109 (2019). https:// doi.org/10.1016/j.artmed.2019.02.004 6. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human- level performance on ImageNet classification. In: Proceedings of the IEEE Inter- national Conference on Computer Vision, pp. 1026–1034 (2015) 7. Hemelings, R., Elen, B., Stalmans, I., Van Keer, K., De Boever, P., Blaschko, M.B.: Artery-vein segmentation in fundus images using a fully convolutional network. Comput. Med. Imaging Graph. 76, 101636 (2019) 8. Hoover, A., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19(3), 203–210 (2000) 9. Hu, Q., Abra`moff, M.D., Garvin, M.K.: Automated separation of binary overlap- ping trees in low-contrast color retinal images. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 436–443. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40763-5 54 10. Hu, S., Worrall, D., Knegt, S., Veeling, B., Huisman, H., Welling, M.: Super- vised uncertainty quantification for segmentation with multiple annotations. arXiv preprint arXiv:1907.01949 (2019) 98 A. Garifullin et al. 11. Je´gou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y.: The one hun- dred layers tiramisu: fully convolutional DenseNets for semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1175–1183. IEEE (2017) 12. Jogi, R.: Basic Ophthalmology. Jaypee Brothers Medical Publishers, New Delhi (2008) 13. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems, pp. 5574–5584 (2017) 14. Malek, J., Tourki, R.: Blood vessels extraction and classification into arteries and veins in retinal images. In: 10th International Multi-conferences on Systems, Sig- nals Devices 2013 (SSD 2013), pp. 1–6, March 2013 15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed- ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 16. Welikala, R., et al.: Automated arteriole and venule classification using deep learn- ing for retinal images from the UK Biobank cohort. Comput. Biol. Med. 90, 23–32 (2017). https://doi.org/10.1016/j.compbiomed.2017.09.005 17. 
Zeiler, M.D.: ADADELTA: an adaptive learning rate method. Technical report, December 2012. arXiv: 1212.5701 http://arxiv.org/abs/1212.5701 18. Zhang, S., et al.: Simultaneous arteriole and venule segmentation of dual-modal fundus images using a multi-task cascade network. IEEE Access 7, 57561–57573 (2019). https://doi.org/10.1109/ACCESS.2019.2914319 19. Zuiderveld, K.: Contrast limited adaptive histogram equalization. In: Heckbert, P.S. (ed.) Graphics Gems IV, pp. 474–485. Academic Press Professional Inc., San Diego (1994). http://dl.acm.org/citation.cfm?id=180895.180940 Publication III Lindén, M., Garifullin A., and Lensu, L. Weight Averaging Impact on the Uncertainty of Retinal Artery-Venous Segmentation Reprinted with permission from Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, pp. 52–60, 2020. © 2021, Springer Nature Switzerland AG Weight Averaging Impact on the Uncertainty of Retinal Artery-Venous Segmentation Markus Linde´n, Azat Garifullin(B) , and Lasse Lensu LUT University, P.O. Box 20, 53851 Lappeenranta, Finland {markus.linden,azat.garifullin,lasse.lensu}@lut.fi Abstract. By examining the vessel structure of the eye through retinal imaging, a variety of abnormalities can be identified. Owing to this, reti- nal images have an important role in the diagnosis of ocular diseases. The possibility of performing computer aided artery-vein segmentation has been the focus of several studies during the recent years and deep neural networks have become the most popular tool used in artery-vein segmen- tation. In this work, a Bayesian deep neural network is used for artery- vein segmentation. Two algorithms, that is, stochastic weight averaging and stochastic weight averaging Gaussian are studied to improve the performance of the neural network. The experiments, conducted on the RITE and DRIVE data sets, and results are provided along side uncer- tainty quantification analysis. Based on the experiments, weight averag- ing techniques improve the performance of the network. Keywords: Uncertainty quantification · Bayesian deep learning · Artery-vein segmentation · Blood vessel segmentation · Weight averaging 1 Introduction Eye diseases have become a rapidly increasing health threat worldwide. Retinal images are a great tool for detecting some of the many ocular disease and dis- eases such as diabetic retinopathy and glaucoma can be detected from retinal images [12]. Ocular diseases are typically detected from retinal images by ana- lyzing the vessel structure. The use of retinal images enables the diagnosis of ocular diseases in their early stages. The task of analyzing the vessel structure has been traditionally left to medical experts. The attention required by the medical experts in this tasks is, however, great and the task is very consum- ing and expensive. Studying the possibilities in making this process faster is for that reason important, as it would enable wider screenings for ocular diseases from retinal images. Automated image processing methods are a well-motivated possibility in solving this problem [3]. The possibility to use computers in performing artery-vein segmenta- tion has been the focus of a number of studies during the recent years. c© Springer Nature Switzerland AG 2020 C. H. Sudre et al. (Eds.): UNSURE 2020/GRAIL 2020, LNCS 12443, pp. 52–60, 2020. https://doi.org/10.1007/978-3-030-60365-6_6 Weight Averaging Impact on the Uncertainty 53 However, artery-vein segmentation still remains a challenging tasks for both humans and machines alike. 
Some of the difficulties in artery-vein segmentation are related to the imaging conditions in which the retinal images are taken. The images tend to suffer from low contrast and changing lighting conditions, both of which make the segmentation process harder. The deep convolutional neural network (DCNN) has recently become the most common tool used in artery-vein segmentation of retinal images, due to the DCNNs ability to automatically learn meaningful features from images. In a paper by Welikala et al., a convolutional neural network (CNN) was used in artery-vein segmentation. The CNN managed to achieve a 82.26% classifi- cation rate using UK Biobanks’ retinal image database [13]. Hemelings et al. proposed the usage of U-Net architecture for artery-vein classification [5]. In the paper, Hemeling et al. considered the task as a multi-class classification problem with the goal of labeling pixels into four classes: background, vein, artery and unknown. The problem was solved using the retinal images found in DRIVE data set [6] and it achieved classification rates of 94,42% and 94.11% for arter- ies and veins. Girard et al. [3] modified the U-Net for artery-vein segmentation and found out that using likelihood score in the minimum spanning tree it was possible to improve the performance of the network in the case of smaller ves- sels. The method was tested using DRIVE data set, achieving an accuracy of 94.93%. Zhang et al. proposed cascade refined U-net to be used in artery-vein classification [14]. The cascade refined U-net consisted of three sub-networks. The task of the first sub-net (A-net in their paper) was to detect all the vessels from the input image, B-net segmented veins from the predicted vessels from the A-net, and finally the C-net segmented the arterioles from the outputs of the previous nets. In the paper, a classification rate of 97.27% was achieved using the automatically detected vessels from the RITE data set. In a paper by Garifullin et al., a dense fully convolutional neural network (Desne-FCN) was used in the task of artery-vein classification [2]. Using the Dense-FCN architecture and the RITE data the authors were able to achieve classification rates of 96%, 97% and 97% for vessels, arteries and veins respectively. In addition to that the authors performed uncertainty quantification on the results obtained using Monte-Carlo dropout [1] for variational approximation. In the aforementioned article, how- ever, the authors did not illustrate the model calibration and the experiments were conducted with one training setup for different labelling strategies. Thus, the question of reliability of the shown uncertainty estimates arises. This work illustrates how stochastic weight averaging affects the estimated uncertainties. In addition, differences between two epistemic uncertainty esti- mation techniques are illustrated. Both more traditional binary classification metrics as well as uncertainty quantification metrics are used to evaluate the algorithms. 54 M. Linde´n et al. 2 Data The retinal image data set chosen to be used in this work was the DRIVE data set [6]. The DRIVE data set contains 20 RGB images for testing and 20 for training. The images are of size 584× 565. The AV references standard used in this work is the RITE data set [7]. The RITE data set extends the DRIVE data set with references for arteries, veins, overlapping vessels and uncertain vessels. 
Red labels in the DRIVE data set stand for arteries, blue labels for veins, green for overlapping vessels and white ones for uncertain vessels. An example of a retinal image from the DRIVE data set as well as the corresponding data labels from the RITE data set can be seen in Fig. 1. During the training the labels for crossings were replaced by labels for both arteries and veins simultaneously and the uncertain labels were omitted for arteries and veins and left for the vessels. Fig. 1. (a) Retinal image from the DRIVE data set. (b) Retinal image labels from RITE dataset. (Color figure online) 3 Bayesian AV Classification 3.1 Baseline Garifullin et al. followed a standard approach for deep Bayesian classifica- tion. First, a neural network f is used to estimate the distribution of logits parametrized through the estimate of the mean yˆ and variance σ of logits for arteries and veins: [yˆ,σ] = f (x, θ) . (1) Weight Averaging Impact on the Uncertainty 55 The probability vector p = [partery pvein] of the labels can then be calculated as follows: pˆ = sigmoid (yˆ + σ ⊙ ǫ) , ǫ ∼ N (0, I) . (2) Given the probability vector for arteries and veins the probability for the vessels can be inferred based on the addition law of probability: pvessel = partery + pvein − parterypvein. (3) The resulting optimisation objective is a sum of binary cross-entropy functions for all three labels over all produced aleatoric samples. The formulae (1)–(3) take into account heteroscedastic aleatoric uncertainty which is a type of uncertainty dependent on the data capturing imperfect imaging conditions, labeling and image noise. The second kind of uncertainty is epistemic uncertainty representing the model’s ignorance. By considering the parameters of the model as a random variable with the posterior p (θ | D) the posterior predictive distribution over logits can be calculated as follows: p (y | x,D) = ∫ p (y | x, θ) p (θ | D) dθ. (4) Typically, the integral (4) is intractable and stochastic approximations are used in order to estimate the posterior predictive. One of the most common techniques is to use stochastic variational approximation called MC-Dropout [1] which employs dropout as a Monte Carlo sampling technique in order to obtain samples from the model’s posterior. Another widely used method is stochastic weight averaging Gaussian [11] where the model’s posterior is approximated by a normal distribution the moments of which are estimated during the training procedure. 3.2 Stochastic Weight Averaging Izmailov et al. found out that the values traversed by SGD would be around the flat regions of the loss surface, without actually reaching the center of this area [9]. By equally averaging these points traversed by SGD, Izmailov et al. found out that points that are inside this more desirable part of the loss surface would be achieved. They named this method stochastic weight averaging (SWA) and it was shown to improve the results and generalization of networks on a variety of architectures and in multiple applications. Given initial pre-trained weights SWA can be implemented as a running average of the weights calculated while continuing training with an additional computation of batch normalization statistics after (see [9] for more details). 3.3 Stochastic Weight Averaging Gaussian SWAG was first introduced by Maddox et al. [11] for model averaging and uncer- tainty estimation. The main idea behind is to use SWA to calculate the mean of 56 M. Linde´n et al. 
the model’s parameters and at the same time to estimate a diagonal approxima- tion of the covariance matrix. Thus, the approximated posterior of the model’s parameters is a normal distribution: p (θ | D) = N (θSWA,ΣSWAG) , (5) where θSWA is a parameter vector estimated with SWA and ΣSWAG is a corre- sponding diagonal covariance matrix. 4 Experiments and Results 4.1 Description of Experiments The parameters and methodologies presented here were selected so that the baseline model used in this work would be as similar as possible to [2]. The utilized architecture is Dense-FCN-103 [10]. The baseline model was, however, re-implemented and the experiments reproduced to some degree in this work. In all the experiments, the network was first pre-trained on RITE dataset with random patches of the input images of size 224× 224. The batch size used in the pre-training was 5 and the network was pre-trained with 100 epochs and 1000 steps per epoch. After the pre-training, the networks were fine-tuned with full-size images that were padded to size of 608× 608 so that they could be properly compressed by the downsampling part of the network. The main optimizer used in all of the experiments was Adadelta with learning rate of 1 and decay rate of 0.95. The use of either SWA or SWAG would start on a later epochs of full resolution training. To increase the diversity of the data set data augmentation techniques were used. The augmentation was performed by applying rotation, flipping, and scal- ing to the input data. The rotation angles used were 90, 180 and 270 degrees and the scaling rates were 0.8, 0.9, 1.0, 1.1 and 1.2. The aleatoric and epistemic uncertainties were estimated using formulae from [8]. The uncertainties are estimated as an average sum standard deviations per image Sp = ∑ i ∑ j σj/Ntest, where i is an index of the image, j is an index of the pixel, and Ntest is the total number of test images (Table 4). Baseline. The fine-tuning of the network used as baseline was done using 50 epochs with 500 steps per epoch to match the hyperparameters used in [2]. The batch size used in the fine-tuning of the baseline was selected to be 1. MC- Dropout was used to quantify epistemic uncertainty. SWA. The SWA implementation also had 50 epochs with 500 steps in each epoch in the full resolution training. Like in the baseline the batch size used was 1. The starting epoch for SWA was selected to be 10 and it was only used in the fine-tuning of the network. The starting epoch was selected through empirical experimentation. MC-Dropout was used to quantify epistemic uncertainty. Weight Averaging Impact on the Uncertainty 57 SWAG. The hyperparameters used in the SWAG implementation were 500 epochs with 50 steps per epoch. This was done so that the Gaussian posteriori approximation formed by SWAG would be generated from a higher number of epochs. Like in the baseline the batch size used was 1. The SWAG starting epoch was selected to be 100. The epistemic uncertainty was quantified by sampling the model’s parameters from Gaussian distribution (5). Whereas the sampling is performed from the posterior estimated with SWAG, dropout is still used during the training phase. 4.2 Performance of the Networks Due to the fact that artery-vein classification was considered a multilabel prob- lem, the performance metrics used in were calculated for arteries, veins and vessels separately. 
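To make the weight averaging of Sections 3.2 and 3.3 more concrete, a minimal sketch of the running SWA mean and the diagonal SWAG covariance over a flattened parameter vector is given below. The class and method names are illustrative assumptions, and the sketch omits the batch normalization statistics update and the per-layer bookkeeping of the actual experiments.

```python
import numpy as np

class SwagEstimator:
    """Running SWA mean and diagonal SWAG covariance of the model parameters.

    Snapshots of the (flattened) weights are accumulated at the end of the
    selected training epochs; the diagonal covariance is E[w^2] - E[w]^2.
    """

    def __init__(self, num_params: int):
        self.n = 0
        self.mean = np.zeros(num_params)      # theta_SWA
        self.sq_mean = np.zeros(num_params)   # running mean of theta^2

    def update(self, theta: np.ndarray) -> None:
        """Update the running moments with a new parameter snapshot."""
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n

    def sample(self, rng=None) -> np.ndarray:
        """Draw parameters from N(theta_SWA, Sigma_SWAG) with a diagonal covariance, Eq. (5)."""
        rng = np.random.default_rng() if rng is None else rng
        var = np.clip(self.sq_mean - self.mean ** 2, 0.0, None)
        return self.mean + np.sqrt(var) * rng.standard_normal(self.mean.shape)
```

For SWA, the mean alone would be loaded back into the network, followed by recomputing the batch normalization statistics; for SWAG, epistemic samples are obtained by repeatedly calling sample and running inference with the sampled weights.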
The selected classification metrics were accuracy, sensitiv- ity, specificity, Area Under the Receiver Operating Characteristic Curve (ROC- AUC) and Estimated Calibration Error (ECE) [4]. By examining the performance metrics presented in Tables 1, 2 and 3, it can be seen that SWA improved the network performance overall compared to the baseline and SWAG models including the model calibration. Table 1. Network performance in artery classification (the best accuracy and calibra- tion are in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.970 0.642 0.990 0.00988 0.974 SWA 0.975 0.690 0.992 0.00943 0.981 SWAG 0.973 0.706 0.989 0.00871 0.966 Table 2. Network performance in vein classification (the best accuracy and calibration are in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.971 0.655 0.994 0.0169 0.980 SWA 0.974 0.742 0.991 0.0120 0.991 SWAG 0.971 0.804 0.983 0.0107 0.980 Table 3. Network performance in vessel classification (the best accuracy and calibra- tion are in bold) Method Accuracy Sensitivity Specificity ECE ROC-AUC Baseline 0.957 0.723 0.989 0.0221 0.980 SWA 0.961 0.782 0.986 0.0208 0.983 SWAG 0.961 0.836 0.978 0.0338 0.984 58 M. Linde´n et al. The example of the segmentation results for SWAG is given in Fig. 2. The segmentation examples for the baseline and SWA look similar. The uncertainties of the results were visualized and example figures can be seen in Fig. 3. In the figure, the intensities of the colors describe the uncertainty in that region as standard deviations of the predicted probabilities: the higher intensity the higher the uncertainty. Fig. 2. (a) The input image; (b) ground truth; (c) mean predicted AV probabilities; (d) mean predicted vessels probabilities. The results are obtained using SWAG. From the tables and figures, it can be concluded that the aleatoric uncer- tainty of the baseline is much higher than those of SWA and SWAG. It can also be concluded that sampling the network weights from the Gaussian posterior generated by SWAG to create the variational approximation, rather than using Monte-Carlo dropout, has a reducing effect on the levels of epistemic uncertainty present in the predictions. This could probably be explained by the fact that the variance is estimated only around a local optimum during the late stages of the training, whereas MC-Dropout is enabled during the whole training process. From the estimated performance metrics, however, it is difficult to conclude whether it is a positive or negative effect. One noticeable pattern is the high epistemic uncertainty near the optic disc when estimated with MC-Dropout. On the other hand, sampling from Gaussian distribution leads to the high uncer- tainties mostly near the end points of the blood vessels and the areas after the crossings which is also present in the case of MC-Dropout. At the same time one can see that aleatoric uncertainties change when SWA or SWAG are utilized. Kendall et al. [1] describe the aleatoric uncertainty as a loss attenuation mechanism allowing the model to adapt the loss dependent on the data and labelling. While the aleatoric uncertainty is meant to be data depen- dent, the changes to the training procedure affecting the model’s convergence and the parameters of the layers predicting variances also affect the predicted aleatoric uncertainties. 
For the baseline and SWAG, we can see a similar pattern of the higher aleatoric uncertainty levels near the optic disc and borders of the vasculature, whereas the aleatoric uncertainties almost vanish when estimated using MC-Dropout trained with SWA. Weight Averaging Impact on the Uncertainty 59 Table 4. Mean sums of estimated aleatoric and epistemic uncertainties per image. Method Aleatoric Epistemic Arteries Veins Vessels Arteries Veins Vessels Baseline 1276.2 1159.5 1807.5 4853.6 4066.4 5069.7 SWA 3.3 3.5 5.3 4038.6 3882.3 4659.7 SWAG 31.1 38.9 57.3 997.8 1104.3 1396.1 Fig. 3. Aleatoric uncertainties calculated using (a) the baseline, (b) SWA, and (c) SWAG. Epistemic uncertainties calculated using (d) the baseline, (e) SWA, and (f) SWAG. 4.3 Conclusions In this work, the focus was on blood vessel segmentation from retinal images and on artery-vein classification by using a deep neural network. More specifically, two algorithms were studied to improve the classification performance and help in the model calibration. SWA and SWAG algorithms were implemented on top of the baseline and experimented with the DRIVE and RITE data sets. The use of SWA improved the performance of the deep neural network on most of the binary classifications as well as the calibration metrics. SWAG showed slight improvements in the vessels and artery classification tasks. The weight averaging as a process significantly affecting the model’s convergence seems to lead to diminishing aleatoric uncertainties and sampling from the nor- mal distribution captures less epistemic uncertainty. 60 M. Linde´n et al. References 1. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learn- ing, pp. 1050–1059 (2016) 2. Garifullin, A., Lensu, L., Uusitalo, H.: On the uncertainty of retinal artery-vein classification with dense fully-convolutional neural networks. In: Blanc-Talon, J., Delmas, P., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2020. LNCS, vol. 12002, pp. 87–98. Springer, Cham (2020). https://doi.org/10.1007/978-3-030- 40605-9 8 3. Girard, F., Kavalec, C., Cheriet, F.: Joint segmentation and classification of retinal arteries/veins from fundus images. Artif. Intell. Med. 94, 96–109 (2019). https:// doi.org/10.1016/j.artmed.2019.02.004 4. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330 (2017) 5. Hemelings, R., Elen, B., Stalmans, I., Van Keer, K., De Boever, P., Blaschko, M.B.: Artery-vein segmentation in fundus images using a fully convolutional network. Comput. Med. Imaging Graph. 76, 101636 (2019) 6. Hoover, A., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19(3), 203–210 (2000) 7. Hu, Q., Abra`moff, M.D., Garvin, M.K.: Automated separation of binary overlap- ping trees in low-contrast color retinal images. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 436–443. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40763-5 54 8. Hu, S., Worrall, D., Knegt, S., Veeling, B., Huisman, H., Welling, M.: Supervised uncertainty quantification for segmentation with multiple annotations. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 137–145. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8 16 9. 
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D.P., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. In: The Conference on Uncertainty in Artificial Intelligence (2018) 10. Je´gou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y.: The one hun- dred layers tiramisu: fully convolutional densenets for semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1175–1183. IEEE (2017) 11. Maddox, W., Garipov, T., Izmailov, P., Vetrov, D.P., Wilson, A.G.: A simple base- line for Bayesian uncertainty in deep learning. In: NeurIPS (2019) 12. Miri, M., Amini, Z., Rabbani, H., Kafieh, R.: A comprehensive study of retinal vessel classification methods in fundus images. J. Med. Signals Sens. 7(2), 59 (2017) 13. Welikala, R., et al.: Automated arteriole and venule classification using deep learn- ing for retinal images from the UK biobank cohort. Comput. Biol. Med. 90, 23–32 (2017). https://doi.org/10.1016/j.compbiomed.2017.09.005 14. Zhang, S., et al.: Simultaneous arteriole and venule segmentation of dual-modal fundus images using a multi-task cascade network. IEEE Access 7, 57561–57573 (2019). https://doi.org/10.1109/ACCESS.2019.2914319 Publication IV Garifullin, A., Lensu, L., and Uusitalo, H. Deep Bayesian baseline for segmenting diabetic retinopathy lesions: Advances and challenges Reprinted with permission from Computers in Biology and Medicine Vol. 136, pp. 104725, 2021. © 2021, Elsevier Ltd Computers in Biology and Medicine 136 (2021) 104725 Available online 6 August 2021 0010-4825/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). Deep Bayesian baseline for segmenting diabetic retinopathy lesions: Advances and challenges Azat Garifullin a,*, Lasse Lensu a, Hannu Uusitalo b,c a Computer Vision and Pattern Recognition Laboratory, School of Engineering Science, LUT University, P.O. Box 20, 53851, Lappeenranta, Finland b Department of Ophthalmology, ARVO F313, 33014, Tampere University, Finland c Tays Eye Center, Tampere University Hospital, P.O. Box 2000, 33520, Tampere, Finland A R T I C L E I N F O Keywords: Bayesian deep learning Diabetic retinopathy Lesion segmentation Microaneurysm Hard exudate Soft exudate Haemorrhage A B S T R A C T Early diagnosis of retinopathy is essential for preventing retinal complications and visual impairment due to diabetes. For the detection of retinopathy lesions from retinal images, several automatic approaches based on deep neural networks have been developed in the recent years. Most of the proposed methods produce point estimates of pixels belonging to the lesion areas and give no or little information on the uncertainty of method predictions. However, the latter can be essential in the examination of the medical condition of the patient when the goal is early detection of abnormalities. This work extends the recent research with a Bayesian framework by considering the parameters of a convolutional neural network as random variables and utilizing stochastic variational dropout based approximation for uncertainty quantification. The framework includes an extended validation procedure and it allows analyzing lesion segmentation distributions, model calibration and prediction uncertainties. Also the challenges related to the deep probabilistic model and uncertainty quantification are presented. 
The proposed method achieves area under precision-recall curve of 0.84 for hard exudates, 0.641 for soft exudates, 0.593 for haemorrhages, and 0.484 for microaneurysms on IDRiD dataset. 1. Introduction Diabetic retinopathy (DR) is the most common complication of dia-betes mellitus and can lead to a vision loss if not treated properly [1]. Screening of the condition and early detection of retinal abnormalities is essential and consists of examining retinal images for diabetic lesions. In the early stages of the disease, these lesions are small, typically have low contrast and sometimes difficult to detect for humans. The core of the screening problem is, however, the amount of images that need to be analyzed. Thus, automatic retinal image analysis methods are required. One way to build an assisting system is to train an end-to-end classifier that processes an input image and yields a diabetic retinopathy grade [2]. These systems are often criticized for being black-boxes producing results that are difficult to interpret [3]. As an alternative, one can train a segmentation model that processes the input image and produces a segmentation map where each element represents the probability of being a lesion. This way the diagnosis can be inferred from the seg-mentation maps by counting the detected lesions. In recent years, the field of DR lesion segmentation has advanced with the introduction of new retinal image datasets making it possible to accelerate research in related computer vision methods [4]. One of the most widely used benchmarks is Indian diabetic retinopathy image dataset (IDRiD) dataset providing high-quality ground truth masks for hard exudates, soft exudates, haemorrhages and microaneurysms. Por-wal et al. [5] published the results of the IDRiD challenge held in 2018. The best performing algorithms were represented by deep convolutional architectures such as U-Net [6], dense fully-convolutional network (Dense-FCN) [7] and Mask-RCNN [8] or their variants. It should be noted that the data is very unbalanced and achieving high sensitivity was a challenge for many algorithms. To overcome this issue, the au-thors used balanced cross-entropy [9] and dice loss [10]. Due to the high resolution of the images, the models were trained in a patchwise manner. Guo et al. [11] proposed a multi-scale feature fusion method to handle issues with small lesion detection. Binary cross-entropy (BCE) loss with balancing coefficients was used to improve the sensitivity. The model was trained with full images resized to 1440 × 960 pixels without any further preprocessing. Yan et al. [12] proposed mutually local-global U-Net mitigating the problems of patchwise training which does not capture the global context. The proposed architecture consists * Corresponding author. E-mail addresses: azat.garifullin@lut.fi (A. Garifullin), lasse.lensu@lut.fi (L. Lensu), hannu.uusitalo@tuni.fi (H. Uusitalo). Contents lists available at ScienceDirect Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/compbiomed https://doi.org/10.1016/j.compbiomed.2021.104725 Received 28 June 2021; Received in revised form 29 July 2021; Accepted 30 July 2021 Computers in Biology and Medicine 136 (2021) 104725 2 Fig. 1. (a) An example of IDRiD image with ground truth masks for (b) hard exudates, (c) soft exudates, (d) haemorrhages, and (e) microaneurysms. Fig. 2. Statistics of lesions in the train set. 
The number of positive pixels per image for (a) hard exudates (EX), (b) soft exudates (SE), (c) haemorrhages (HE), and (d) microaneurysms (MA). (e) The number of pixels for the lesions and the background. (f) The number of positive pixels for each lesion for the whole dataset. A. Garifullin et al. Computers in Biology and Medicine 136 (2021) 104725 3 of two U-Nets (global and local) that share the last layers of their de-coders. Both networks are jointly trained minimizing weighted cross-entropy loss to deal with the class imbalance. The aforementioned approaches consider only point estimates of the trained models and produced results. Thus, the question of reliability of a trained model arises. In this work, the problem is addressed by using Bayesian deep learning modeling a distribution over the learned pa-rameters of the model and produces the segmentation results in a form of posterior predictive distribution. Recently, Bayesian deep learning models have started finding their applications in the area of retinal image analysis. Leibig et al. [13] evaluated dropout based uncertainty measures and demonstrated improved diagnostic performance using uncertainty-informed decisions. Filos et al. [14] proposed a new benchmark for deep Bayesian models with application to DR diagnosis also assessing the robustness of the models to out-of-distribution ex-amples and distribution shift. This work extends the preceding research with Bayesian DR lesion segmentation. To the best of authors’ knowledge, this is the first work discussing the Bayesian approach for DR lesion segmentation. The aim is to establish a baseline that would inspire future research on the topic. The contributions of this work can be highlighted as follows: 1. The introduction of a novel Bayesian baseline for DR lesion seg-mentation allowing the analysis of segmentation distributions. Fig. 3. Statistics of lesions in the test set. The number of positive pixels per image for (a) hard exudates (EX), (b) soft exudates (SE), (c) haemorrhages (HE), and (d) microaneurysms (MA). (e) The number of pixels for the lesions and the background. (f) The number of positive pixels for each lesion for the whole dataset. A. Garifullin et al. Computers in Biology and Medicine 136 (2021) 104725 4 2. An assessment and analysis of model calibration and prediction uncertainties. 3. The presentation of an extended validation procedure for DR lesion segmentation task beyond the point estimates. The rest of the paper is organized as follows: Section 2 describes the utilized dataset and gives the information about class imbalance and the statistics of labels, and Section 3 explains the Bayesian image segmen-tation setup, utilized data sampling approach and training details. Sec-tion 4 explains the evaluation protocol and presents the performance metrics together with the visualizations of the inferred results. Section 5 discusses faced issues and directions for future research. The results of the work are summarized in Section 6. 2. IDRiD dataset The IDRiD dataset is a common benchmark for the diabetic reti-nopathy lesion segmentation [5]. It contains 54 train and 27 test images of resolution 4288× 2848 with segmentation masks aiming to be spatially accurate for four lesion types: hard exudates, soft exudates, haemorrhages, and microaneurysms. An example image from the data-set is shown in Fig. 1. The class imbalance can be visualized as a bar graph with the number of positive pixels for lesions for each image separately as well as for the whole dataset. 
The calculated statistics for the train and test sets are presented in Fig. 2 and Fig. 3. 3. Bayesian lesion segmentation 3.1. Background The classical approaches give only point estimates for the class label probabilities and the model parameters are considered to be determin-istic. In order to capture imperfect data labeling and image noise, the model outputs and learned parameters can be considered as random variables. The first approach captures the heteroscedastic aleatoric un-certainty that depends on the input data, whereas the second represents the epistemic uncertainty that models a distribution of the learned pa-rameters. Here, a brief explanation for the lesion segmentation task is given below. More detailed explanations for the uncertainties can be found in Refs. [15,16]. Let f be a model, with parameters θ, that maps an input image x to a map of logits ŷ, accompanied by a map standard deviations σ of the logits: [ŷ, σ] = f (x, θ). (1) Then, the probabilities of the class labels can be calculated as follows: p̂ = sigmoid(ŷ + σ ⊙ ε), ε ∼ 𝒩(0, I), (2) where ⊙ stands for the Hadamard product and ε are sampled during inference. Epistemic uncertainty can be captured by considering the model parameters to be a random variable and making use of the following posterior predictive: p(y|x,𝒟) = ∫ p(y|x, θ)p(θ|𝒟)dθ, (3) where 𝒟 denotes a dataset of input-output pairs. Typically, the parameter’s posterior p(θ|𝒟) for complex models such as deep neural networks is intractable and variational approximations are used [16]. The posterior in (3) can be replaced by a simpler distri-bution qθ(ω) with variational parameters ω. In this work, Monte-Carlo dropout [16] is used as a framework to perform stochastic variational inference. The relation between the true and approximate posteriors is given by ω = θ ⊙ MD, (4) where MD is a dropout mask that randomly sets the model weights to zero. The training procedure can then be formulated as the minimization of the Kullback-Leibler divergence DKL between the true posterior and the approximation. This is equivalent to minimizing the negative vari-ational lower bound [16]: L VI(ω)= ∫ qθ (ω) log p (Y|X, ω) dω − DKL(qθ(ω) ‖p(ω)), (5) where X,Y represent the inputs and outputs of the model, respectively, and p(ω) is the prior for the variational parameters ω. The expectation in the first part of (5) is typically approximated using Monte-Carlo inte-gration [16]. In this work, it is approximated using one sample from the variational distribution. Therefore, the optimization objective becomes ℒMCD(ω) = ∑N−1 i=0 ℒ(yi|xi,ω) + ℛ(ω), (6) where i is an index of the training example and N is the total number of samples in the training set. ℛ is a regularization term that depends on the form of a prior distribution over the parameters of the model. In this case, the prior is a normal distribution corresponding to L2 weight decay. The loss function chosen for this work is binary cross-entropy and it is summed over the aleatoric samples: ℒ(y|x,ω) = ∑N−1 i=0 ∑NA−1 j=0 ℒBCE ( yij|xi,ω ) , (7) where NA is a number of aleatoric samples. The training scheme described above does not take into account class imbalance. In this work, a straightforward oversampling scheme based on class frequencies statistics is used and it is described in the next section. 3.2. Oversampling One way to handle class imbalance is to perform oversampling of the underrepresented classes. Here, three-stage sampling is performed: 1. 
Positive samples are selected with π+ probability and negative samples are selected with 1− π+ probability. 2. An image of the selected class is sampled with the probability pimagei proportional to the logarithm of the pixel count of the given class, that is, p image i = log max(N imagei , 1)∑ jlog max(N image j , 1) , (8) where Nimagei is the number of positive pixels for the class of interest in the image with index i. 3. The final step is to select an image patch containing pixels of the class of interest. In order to select such a patch, we follow a scheme similar to the previous stage. The image is divided into a set of overlapping patches and the patch is selected with probability p patch i = log max(Mpatchi , 1)∑ jlog max(M patch j , 1) , (9) where Mpatchi is the number of positive pixels for the class of interest in the patch with index i. The log scale here is used in order to increase the diversity of chosen A. Garifullin et al. Computers in Biology and Medicine 136 (2021) 104725 5 Fig. 4. The Dense-FCN architecture: Dense stands for a dense convolutional block; C is a tensor concatenation; H is a block consisting of batch normalization (BN), rectified linear unit (ReLU) and a convolutional layer with growth rate g; Down is a transition-down block with F output feature maps; Up is a transition up with F output feature maps and 2 × 2 stride; logits std denotes standard deviations of logits. Fig. 5. Two examples of the original (left column) and enhanced (right column) images. A. Garifullin et al. Computers in Biology and Medicine 136 (2021) 104725 6 samples. π+ is a tunable hyperparameter and should be chosen depending on the class imbalance in a particular case. In this work, π+ =0.5 is used as experimentally it has been found that this value provides the best results. 3.3. Architecture The architecture utilized in this work is a Dense-FCN [17]. It has been shown that Dense-FCNs have less parameters and may outperform other fully-convolutional network (FCN) architectures in a variety of different segmentation tasks [17]. Here we adapt the Dense-FCN ar-chitecture for the lesion segmentation task. The main building block of Dense-FCNs is a dense convolutional block (DCB) where the input of each layer is a concatenation of the outputs of the previous layers. The block consists of repeating batch normalization (BN), rectified linear unit (ReLU), convolution and dropout p = 0.5 layers resulting in g feature maps (growth rate). The main concept of Dense-FCNs is similar to other encoder-decoder architectures in the sense that the input is first compressed to a hidden representation by the downsampling part. Thereafter the segmentation masks are recovered by an upsampling part. The downsampling part consists of DCBs and downsampling transitions with skip connections to the upsampling part. The upsampling part consists of DCBs and upsampling transitions. An example of two blocks in downsampling and upsampling paths of a Dense-FCN is shown in Fig. 4. The total number of trainable parameters is 9319778. The archi-tectural parameters used are are as follows: ● The growth rate for all DCBs: g = 16. ● The downsampling path consists of DCBs with depths Ddown = [4,5, 7,10, 12,15]. ● The upsampling also consists of five DCBs with depths Dup = [12, 10,7, 5,4]. ● The first and last convolution layers are the same as in Fig. 4. 3.4. Image preprocessing It was noticed in the experimental part of the work that simple preprocessing proposed in Ref. [18] improves the results. 
3.4. Image preprocessing

It was noticed in the experimental part of the work that the simple preprocessing proposed in Ref. [18] improves the results. The preprocessing is implemented in two steps:

1. Luminosity enhancement employs a luminance gain matrix G that is applied in the red-green-blue (RGB) color space:

x′ = [G ⊙ r, G ⊙ g, G ⊙ b], (10)

G_i = V′_i / max{r_i, g_i, b_i}, (11)

where r, g and b are the red, green and blue image channels, respectively, x′ is the enhanced image, and V′_i is the enhanced luminance value at the pixel with index i. The enhanced luminance value is calculated by converting the image to the hue-saturation-value (HSV) color space and enhancing the luminance V using gamma enhancement. Here, we choose Γ = 1/2.2 as in the original work [18].

2. Contrast enhancement is performed using the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm [19] with a clip limit of 0.1 and a grid size of 8 × 8.

In order to reduce the requirements for computing resources, the images were resized to the resolution of 2144 × 1440 pixels. Two examples of the original and enhanced images are presented in Fig. 5.

3.5. Training details

The Dense-FCN was trained for 100 epochs with 500 steps per epoch on random 224 × 224 patches with a batch size of 6. The patches were generated with an overlap of 192 × 192. Data augmentation by vertical and horizontal mirroring was applied. The parameter values were empirically tuned based on initial experiments with the IDRiD dataset. The weights were initialized using HeNormal [20]. In addition to dropout, L2 regularization with the weight decay factor 10−4 was used. As the optimizer, Adadelta [21] with the learning rate l = 1 and the decay rate ρ = 0.95 was used. The learning rate was adjusted according to the following schedule:

1. if 0 ≤ epoch < 50, l = 1;
2. if 50 ≤ epoch < 70, l = 0.1;
3. if 70 ≤ epoch < 85, l = 0.01;
4. if 85 ≤ epoch < 100, l = 0.001.

4. Experiments and results

4.1. Evaluation protocol

In [5], many authors processed images in a patchwise manner during the validation stage. In this work, it was noticed that with Bayesian neural networks this can lead to checkerboard artifacts that have a negative impact on the segmentation performance. Therefore, in the inference stage, images are not divided into patches but are processed as full images. It is also worth noting that full-resolution processing is much faster, taking approximately 14 min to process an image with 50 epistemic and 100 aleatoric samples. The input and output images have a resolution of 2144 × 1440 pixels.

In order to evaluate the segmentation performance, the following classification metrics are used:

● Sensitivity (SE) is used to assess the ability of the model to discover lesions:

SE = TP / (TP + FN), (12)

where TP and FN are the numbers of true positive and false negative pixels, respectively.

● Positive predictive value (PPV) is used in addition to sensitivity, taking into account the false positives FP:

PPV = TP / (TP + FP). (13)

● Specificity (SP) is used to assess the ability of the model to correctly segment healthy pixels:

SP = TN / (TN + FP), (14)

where TN is the number of true negative pixels.

● The area under the receiver-operating-characteristic curve (ROC-AUC) is an integral metric that does not depend on the thresholding value. ROC-AUC is calculated as the area under the curve obtained by plotting the true positive rate against the false positive rate while varying the threshold.

● The area under the precision-recall curve (PR-AUC) is another integral metric that does not depend on the thresholding value. PR-AUC represents the segmentation performance more realistically than ROC-AUC.

● Expected calibration error (ECE) is used to assess a model's calibration [22]:

ECE = E_p̂[ |P(ŷ = y | p̂ = π) − π| ], π ∈ [0, 1], (15)

where p̂ is the confidence estimate of the predicted class ŷ, y is the true label and π is the true probability.

Together with ECE, reliability diagrams are also presented. These reliability diagrams are graphs showing the expected accuracy against the classification confidence, thereby representing the calibration quality. In the case of perfect calibration, the graph is the identity function.
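As an illustration of how ECE (15) and a reliability diagram can be obtained from pixel-wise predictions, the sketch below bins the confidences of the predicted classes and compares the mean confidence of each bin with its empirical accuracy. The use of 10 equal-width bins and the function name are assumptions made for the sketch, not details stated in the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Approximate ECE (Eq. (15)) for binary pixel-wise predictions.

    probs:  predicted foreground probabilities (any shape).
    labels: binary ground-truth mask of the same shape.
    Returns (ece, bin_confidence, bin_accuracy); the last two can be
    plotted against each other as a reliability diagram.
    """
    probs = probs.ravel()
    labels = labels.ravel().astype(np.float64)
    preds = (probs >= 0.5).astype(np.float64)
    conf = np.where(preds == 1.0, probs, 1.0 - probs)  # confidence of the predicted class
    bins = np.linspace(0.5, 1.0, n_bins + 1)           # binary confidences lie in [0.5, 1]
    idx = np.minimum(np.digitize(conf, bins) - 1, n_bins - 1)
    ece, bin_conf, bin_acc = 0.0, [], []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        acc = np.mean(preds[mask] == labels[mask])
        avg_conf = np.mean(conf[mask])
        ece += mask.mean() * abs(acc - avg_conf)  # bin weight times calibration gap
        bin_conf.append(avg_conf)
        bin_acc.append(acc)
    return ece, np.array(bin_conf), np.array(bin_acc)
```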
In the evaluation, sensitivity, specificity and positive predictive value are calculated by thresholding the output predictive mean with T = 0.5. In the inference, the model parameters are sampled 100 times and the number of inferred aleatoric samples is N_A = 100. The final posterior predictive mean is calculated over all the predicted samples, and the aleatoric uncertainty U_A and the epistemic uncertainty U_E of the outputs are calculated as in Ref. [23]:

U_A = E_q[ V_{p(y|x,θ)}[y] ], (16)

U_E = V_q[ E_{p(y|x,θ)}[y] ], (17)

U_T = U_A + U_E, (18)

where E and V denote the expectation and variance, respectively, and U_T is the total predictive uncertainty.

Apart from characterizing the total uncertainty, it is also important to evaluate the meaningfulness of the produced uncertainty maps. This is a more challenging task since only point estimates of the ground truth labels are available. However, it is reasonable to assume that incorrectly segmented areas must have higher uncertainties. Mobiny et al. [24] proposed to use the uncertainty as a tool to predict incorrect classification results by thresholding the output uncertainties. Camarasa et al. [25] analyzed different uncertainty measures for medical image segmentation and concluded that the averaged variance and the averaged entropy perform equally well and are better than the other metrics. In this work, the standard deviation is used. We follow the same approach and use the following:

1. The area under the uncertainty precision-recall curve (U-PR-AUC) is used as an integral metric to assess the quality of the uncertainty estimates.
2. Uncertainty sensitivity (U-SE) is used to assess the ability of the uncertainty estimates to discover misclassifications.
3. Uncertainty specificity (U-SP) is used to assess the ability of the uncertainty estimates to correctly recognize correctly classified pixels.
4. Uncertainty expected calibration error (U-ECE) is also used to validate the uncertainty calibration.

U-SE and U-SP are calculated using a threshold equal to half of the maximum uncertainty value; a minimal computational sketch is given at the end of this subsection. To summarize, the extended validation approach consists of the analysis of the produced segmentation masks as well as a comparison of the produced uncertainties with the misclassification maps.
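The decomposition in (16)-(18) and the misclassification-detection evaluation described above can be computed directly from the sampled probability maps. The following numpy sketch assumes the samples are arranged as (epistemic, aleatoric, ...), matching the sampling sketch in Section 3.1; the function names and the exact array layout are illustrative assumptions.

```python
import numpy as np

def decompose_uncertainty(p_samples):
    """Uncertainty decomposition of Eqs. (16)-(18).

    p_samples: sampled probabilities of shape (N_E, N_A, ...), where axis 0
    indexes the epistemic (dropout) samples and axis 1 the aleatoric samples.
    """
    u_aleatoric = np.mean(np.var(p_samples, axis=1), axis=0)  # Eq. (16)
    u_epistemic = np.var(np.mean(p_samples, axis=1), axis=0)  # Eq. (17)
    u_total = u_aleatoric + u_epistemic                       # Eq. (18)
    p_mean = p_samples.mean(axis=(0, 1))
    return p_mean, u_aleatoric, u_epistemic, u_total

def uncertainty_sensitivity_specificity(p_mean, u_map, labels, t=0.5):
    """U-SE and U-SP: treat high uncertainty as a misclassification predictor.

    u_map: a per-pixel uncertainty map (in this work, the standard deviation
    of the sampled probabilities); a pixel is flagged when its uncertainty
    exceeds half of the maximum uncertainty value.
    """
    misclassified = (p_mean >= t) != labels.astype(bool)
    flagged = u_map > 0.5 * u_map.max()
    tp = np.sum(flagged & misclassified)
    fn = np.sum(~flagged & misclassified)
    tn = np.sum(~flagged & ~misclassified)
    fp = np.sum(flagged & ~misclassified)
    u_se = tp / max(tp + fn, 1)
    u_sp = tn / max(tn + fp, 1)
    return u_se, u_sp
```

In the result figures, the square roots of the variance maps (standard deviations of the probabilities) are visualized.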
4.2. Evaluation of segmentation results

The precision-recall (PR) and receiver operating characteristic (ROC) curves are shown in Fig. 6. The ROC curves suggest close-to-optimal classification results due to the large class imbalance, whereas the PR curves represent the classification performance more realistically. The corresponding performance metrics are given in Table 1.

Fig. 6. Precision-recall and receiver operating characteristic curves for (a), (e) hard exudates; (b), (f) soft exudates; (c), (g) haemorrhages; (d), (h) microaneurysms.

Table 1. Evaluation results of the baseline training scheme. The abbreviations of the evaluation metrics are explained in the text.

Label | PR-AUC | ROC-AUC | Sensitivity | PPV | Specificity | ECE
Hard exudates | 0.842 | 0.995 | 0.767 | 0.753 | 0.997 | 0.090
Soft exudates | 0.641 | 0.993 | 0.639 | 0.611 | 0.999 | 0.145
Haemorrhages | 0.593 | 0.977 | 0.464 | 0.670 | 0.997 | 0.066
Microaneurysms | 0.484 | 0.997 | 0.434 | 0.531 | 0.999 | 0.116

Fig. 7. Reliability diagrams for (a) hard exudates; (b) soft exudates; (c) haemorrhages; (d) microaneurysms.

Fig. 8. Visualizations of inference results for input image 5b for lesions: (a), (b), (c) hard exudates; (d), (e), (f) soft exudates; (g), (h), (i) haemorrhages; (j), (k), (l) microaneurysms. The first column shows the ground truth masks, the second shows the mean inferred probabilities and the third shows the epistemic uncertainty masks (standard deviations of probabilities).

Based on the figures and the table, it is clear that the easiest task is to segment the hard exudates, whereas the most difficult one is the segmentation of microaneurysms. Low sensitivities are a common problem for the DR lesion segmentation task [5]. This can be explained by the relatively low contrast and small size of the lesions. Apart from the analysis of true positive classifications, it is also essential to have classifiers with high specificity. From Table 1, it can be seen that the specificities are very high, close to one, for all types of lesions. Nevertheless, this can be easily achieved due to the class imbalance. PPVs, on the other hand, give more insight into the problem of false positive classifications by relating them to true positives. It is easy to notice that in the worst-case scenario, for microaneurysms, there are almost as many falsely classified pixels of healthy tissue as correctly discovered pixels of microaneurysms. This fact gives additional motivation for analyzing the uncertainties.

The reliability diagrams are given in Fig. 7. It can be seen that the trained models are miscalibrated, and the one for haemorrhages represents the best result. Guo et al. [22] have shown that deep neural networks are typically poorly calibrated, and the authors proposed methods for decreasing the degree of miscalibration. Guo et al. claimed that an ECE of approximately 0.01 to 0.02 can be achieved for standard classification benchmark datasets and Dense architectures. In this work, no methods for improving the calibration were used and the reliability is assessed for the baseline model.

Fig. 9. Visualizations of inference results for input image 5d for lesions: (a), (b), (c) hard exudates; (d), (e), (f) soft exudates; (g), (h), (i) haemorrhages; (j), (k), (l) microaneurysms. The first column shows the ground truth masks, the second shows the mean inferred probabilities and the third shows the epistemic uncertainty masks (standard deviations of probabilities).

Table 2. Evaluation results for the estimated uncertainty maps. The abbreviations of the evaluation metrics are explained in the text.

Label | U-PR-AUC | U-SE | U-PPV | U-SP | U-ECE
Hard exudates | 0.336 | 0.031 | 0.566 | 0.999 | 0.104
Soft exudates | 0.257 | 0.113 | 0.388 | 0.999 | 0.195
Haemorrhages | 0.243 | 0.029 | 0.302 | 0.999 | 0.303
Microaneurysms | 0.257 | 0.045 | 0.332 | 0.999 | 0.237

The segmentation results for two example images from the test set (shown in Fig. 5) are illustrated in Fig. 8 and Fig. 9. From the images, it is possible to observe visual similarities between the ground truth and the mean inferred probability maps.
Higher uncertainties are concentrated around the areas with high predicted confidence and around false positive pixels. A more detailed discussion of the inference results and the estimated uncertainties is given in the next section.

4.3. Uncertainty quantification

The uncertainty PR curves and reliability diagrams are shown in Fig. 10 and Fig. 11, and the evaluation metrics are given in Table 2. From the results, it is clear that the normalized uncertainties are not efficient predictors of misclassifications and have low sensitivities. It is worth noting that the evaluation procedure is straightforward and considers only soft uncertainties against hard misclassifications. Moreover, the uncertainties are not necessarily high only near the misclassification areas, but also near the areas of relatively low confidence, as shown below. This can also explain the uncertainty miscalibrations. From the reliability diagrams in Fig. 11, it is clear that the uncertainties are mostly underestimated, since the growing confidence values no longer match the increasing accuracy values.

Inference results for hard exudates of the magnified example image are shown in Fig. 12. It is clear that the misclassifications and the epistemic uncertainties are mostly concentrated around the edges of the lesions. This can be explained by the unclear boundaries of the lesions. The aleatoric uncertainties, acting as a learned loss attenuation, are also higher around the borders. Boundary uncertainties are a general pattern for segmentation models and can be observed in a wide variety of tasks. It is also possible to see small yellow lesions being incorrectly classified as background, which highlights the problem of detecting small-scale lesions. It is worth noting that there is a soft exudate to the left of the hard exudate cluster, and the model is confident in not classifying it as a hard exudate.

Inference results for soft exudates of the magnified example image are shown in Fig. 13. High boundary uncertainties are present in this case as well. Soft exudates typically have low contrast, no texture and unclear edges, and can be easily confused with the background. It is possible to see false positive detections of soft exudates in the lower left part of the image, which is slightly more yellow compared with the other background pixels. The soft exudate in the lower right part of the image has uneven contrast, and the low-contrast part of the lesion is incorrectly classified as background. In both cases, the model yielded non-maximum mean confidence and the incorrectly classified pixels also have high uncertainties.

In Fig. 14, the inference results for the haemorrhages of the magnified example image are presented. The lesion is surrounded by blood vessels and a part of the macula is present in the magnified input image. The part with blood vessels to the left is incorrectly classified as a haemorrhage. It is also possible to see the model's confusion about the part with the macula. The epistemic uncertainty is in general higher near the areas with similar colors, highlighting the surrounding blood vessels and the macula.

Inference results for microaneurysms of the magnified example image are given in Fig. 15. Microaneurysms are the smallest of all the lesions and the epistemic uncertainty is high over the whole area of the lesions. On the other hand, the aleatoric uncertainties are still higher near the edges.
Being small-scale lesions with no texture, microaneurysms are confused with any small red spots, which is visible in the epistemic uncertainty maps.

Fig. 10. Uncertainty precision-recall curves for (a) hard exudates; (b) soft exudates; (c) haemorrhages; (d) microaneurysms.

5. Discussion

The approach presented in this work shows classification performance comparable to previously reported methods [5]. The uncertainty maps can be used for the visual inspection and analysis of the performance. The estimated uncertainties and the produced confidence maps provide more information about the model's behaviour. Nevertheless, a few challenges remain; they are discussed in this section together with brief explanations of failed experiments.

One of the main issues in lesion detection is the low sensitivity of the segmentation model. This problem is present in the related previous works [4,26] and also in this study.

Fig. 11. Uncertainty reliability diagrams for (a) hard exudates; (b) soft exudates; (c) haemorrhages; (d) microaneurysms.

Fig. 12. Inference results for hard exudates with magnified input image 5b: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

In medical image analysis and segmentation, it is common to use custom heuristic loss functions [26] to improve sensitivity [27] or to deal with lesion boundary issues [28]. We also experimented with other loss functions, including the focal loss [29], the Tversky loss [27], the generalized Dice loss [28] and the boundary loss [30]. Nevertheless, results outperforming the proposed baseline were not achieved. This negative outcome is likely due to omitting the tuning of the loss functions' hyperparameters. These objectives are typically synthetic in the sense that they are formulated directly as loss functions and not as log-likelihoods. This means that they are not derived from specific distributions encoding information about the class imbalance. Binary cross-entropy, on the other hand, is derived as the negative logarithm of the Bernoulli likelihood. To study the issue of low sensitivity, more focused research is required to evaluate modern loss functions for medical image segmentation in the context of Bayesian deep learning and model calibration.

In this work, a straightforward scheme based on label statistics is used to balance the lesion and background data. A potentially more efficient approach would be to use Bayesian active learning [31], where uncertainty-based acquisition functions are used to select the training samples. Typically, these methods do not work well with unbalanced data, which can be another topic for future research.

Model and uncertainty calibration are also subjects for further improvement. Apart from the classical calibration methods described in Ref. [22], alternative ways of improving the calibration exist. Thulasidasan et al. [32] proposed to use mix-up augmentation to improve the model calibration. Seo et al. [33] proposed single-shot calibration by regularizing the model with the uncertainty of the outputs.
Laves et al. [34] considered uncertainty calibration in the context of deep Bayesian regression and discovered that the predicted uncertainties are typically underestimated. The problem was solved using simple temperature scaling of the aleatoric and epistemic uncertainty. During the development of this work, experiments with uncertainty calibration using Platt scaling and isotonic regression were conducted. However, no improvements over the baseline were found. It is likely that a more systematic approach aiming to solve both calibration problems is required.

Fig. 13. Inference results for soft exudates with magnified input image 5b: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Fig. 14. Inference results for haemorrhages with magnified input image 5b: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

Fig. 15. Inference results for microaneurysms with magnified input image 5b: (a) input image; (b) ground truth mask; (c) misclassifications; (d) mean inferred probabilities; (e) aleatoric uncertainties (standard deviations of probabilities); (f) epistemic uncertainties (standard deviations of probabilities).

6. Conclusion

In this paper, a Bayesian baseline for diabetic retinopathy lesion segmentation is proposed, allowing the analysis of segmentation distributions, model calibration and prediction uncertainties. An extended validation approach, consisting of the analysis of the segmentation performance and of the ability of the uncertainty estimates to detect false classifications, is also provided. The presented results from the uncertainty quantification experiments show that the estimates are qualitatively similar to the misclassification maps and can be used to assess issues in the lesion segmentation. Overall, the main challenges for the deep probabilistic model are small-scale lesions and areas with low contrast and unclear boundaries. The color information is also essential for successful segmentation, and healthy tissues can be confused with lesions when they are of a similar color. Further research and development is required to make the predicted lesion segmentation uncertainties suitable for numeric quantification.

Declaration of competing interest

None of the authors have any conflict of interest.

Acknowledgement

This work has been supported by LUT Doctoral School. They were not involved in the study design, data collection and analysis, decision to publish, or preparation of this work. The computational resources for this work were provided by CSC – IT Center for Science, Finland. The authors wish to thank the authors of the open-access data utilized in this work.

References

[1] E. Reichel, D. Salz, Diabetic Retinopathy Screening, Springer International Publishing, Cham, 2015, pp. 25–38.
[2] F. Arcadu, F. Benmansour, A. Maunz, J. Willis, Z. Haskova, M. Prunotto, Deep learning algorithm predicts diabetic retinopathy progression in individual patients, NPJ Digit. Med. 2 (1) (2019) 1–9.
[3] J. de la Torre, A. Valls, D. Puig, A deep learning interpretable classifier for diabetic retinopathy disease grading, Neurocomputing 396 (2020) 465–476, https://doi.org/10.1016/j.neucom.2018.07.102.
[4] T. Li, W. Bo, C. Hu, H. Kang, H. Liu, K. Wang, H. Fu, Applications of deep learning in fundus images: a review, Med. Image Anal. 69 (2021) 101971, https://doi.org/10.1016/j.media.2021.101971.
[5] P. Porwal, S. Pachade, M. Kokare, G. Deshmukh, J. Son, W. Bae, L. Liu, J. Wang, X. Liu, L. Gao, et al., IDRiD: diabetic retinopathy – segmentation and grading challenge, Med. Image Anal. 59 (2020) 101561.
[6] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: N. Navab, J. Hornegger, W.M. Wells, A.F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing, Cham, 2015, pp. 234–241.
[7] S. Jégou, M. Drozdzal, D. Vázquez, A. Romero, Y. Bengio, The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation, 2017, pp. 1175–1183, https://doi.org/10.1109/CVPRW.2017.156.
[8] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[9] S. Xie, Z. Tu, Holistically-nested edge detection, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1395–1403, https://doi.org/10.1109/ICCV.2015.164.
[10] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, C. Pal, The importance of skip connections in biomedical image segmentation, in: G. Carneiro, D. Mateus, L. Peter, A. Bradley, J.M.R.S. Tavares, V. Belagiannis, J.P. Papa, J.C. Nascimento, M. Loog, Z. Lu, J.S. Cardoso, J. Cornebise (Eds.), Deep Learning and Data Labeling for Medical Applications, Springer International Publishing, Cham, 2016, pp. 179–187.
[11] S. Guo, T. Li, H. Kang, N. Li, Y. Zhang, K. Wang, L-Seg: an end-to-end unified framework for multi-lesion segmentation of fundus images, Neurocomputing 349 (2019) 52–63.
[12] Z. Yan, X. Han, C. Wang, Y. Qiu, Z. Xiong, S. Cui, Learning mutually local-global U-Nets for high-resolution retinal lesion segmentation in fundus images, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019, pp. 597–600.
[13] C. Leibig, V. Allken, M.S. Ayhan, P. Berens, S. Wahl, Leveraging uncertainty information from deep neural networks for disease detection, Sci. Rep. 7 (2017).
[14] A. Filos, S. Farquhar, A.N. Gomez, T.G. Rudner, Z. Kenton, L. Smith, M. Alizadeh, A. de Kroon, Y. Gal, A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks, arXiv preprint arXiv:1912.10481.
[15] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[16] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in: International Conference on Machine Learning, 2016, pp. 1050–1059.
[17] S. Jégou, M. Drozdzal, D. Vázquez, A. Romero, Y. Bengio, The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, IEEE, 2017, pp. 1175–1183.
[18] M. Zhou, K. Jin, S. Wang, J. Ye, D. Qian, Color retinal image enhancement based on luminosity and contrast adjustment, IEEE Trans. Biomed. Eng. 65 (3) (2018) 521–527, https://doi.org/10.1109/TBME.2017.2700627.
[19] K. Zuiderveld, Contrast limited adaptive histogram equalization, in: P.S. Heckbert (Ed.), Graphics Gems IV, Academic Press Professional, Inc., San Diego, CA, USA, 1994, pp. 474–485.
[20] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[21] M.D. Zeiler, ADADELTA: an adaptive learning rate method, arXiv preprint arXiv:1212.5701 (2012), http://arxiv.org/abs/1212.5701.
[22] C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
[23] S. Hu, D. Worrall, S. Knegt, B. Veeling, H. Huisman, M. Welling, Supervised uncertainty quantification for segmentation with multiple annotations, arXiv preprint arXiv:1907.01949.
[24] A. Mobiny, H. Nguyen, S. Moulik, N. Garg, C. Wu, DropConnect is effective in modeling uncertainty of Bayesian deep networks, arXiv preprint arXiv:1906.04569.
[25] R. Camarasa, D. Bos, J. Hendrikse, P. Nederkoorn, E. Kooi, A. van der Lugt, M. de Bruijne, Quantitative comparison of Monte-Carlo dropout uncertainty measures for multi-class segmentation, in: Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis, Springer, 2020, pp. 32–41.
[26] J. Ma, Segmentation loss odyssey, arXiv preprint arXiv:2005.13449.
[27] N. Abraham, N.M. Khan, A novel focal Tversky loss function with improved attention U-Net for lesion segmentation, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019, pp. 683–687.
[28] C.H. Sudre, W. Li, T. Vercauteren, S. Ourselin, M.J. Cardoso, Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2017, pp. 240–248.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[30] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, I.B. Ayed, Boundary loss for highly unbalanced segmentation, in: International Conference on Medical Imaging with Deep Learning, PMLR, 2019, pp. 285–296.
[31] A. Kirsch, J. van Amersfoort, Y. Gal, BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning, in: NeurIPS, 2019.
[32] S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, S. Michalak, On mixup training: improved calibration and predictive uncertainty for deep neural networks, arXiv preprint arXiv:1905.11001.
[33] S. Seo, P.H. Seo, B. Han, Learning for single-shot confidence calibration in deep neural networks through stochastic inferences, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9022–9030.
[34] M.-H. Laves, S. Ihler, J.F. Fast, L.A. Kahrs, T. Ortmaier, Well-calibrated regression uncertainty in medical imaging with deep learning, in: Medical Imaging with Deep Learning, PMLR, 2020, pp. 393–412.