Representation Learning for Non-Melanoma Skin Cancer
S. M. Thomas | 5th September 2022
Australian e-Health Research Centre, CSIRO
This article explores ideas presented in the paper Representation Learning for Non-Melanoma Skin Cancer using Latent Autoencoders.
The accompanying code is available in the GitHub repository.
Note: For a smoother interactive experience, it is recommended that you download the repository and view it locally.
Useful Representations of Non-Melanoma Skin Cancer
Generative modelling techniques such as GANs, diffusion models, and autoregressive transformers have
shown unprecedented representational capacity. They take high-dimensional data from a particular problem domain (e.g. images, text,
or both) and learn a lower-dimensional representation that captures semantic structure and approximates
the real-world distribution. We can then sample from this distribution, generating synthetic images
or texts that resemble real-world data. However, merely drawing samples from the distribution has limited application
to real-world problems. A more desirable ability is to project existing data points into a structured latent
space, rather than one optimized only for discriminative tasks such as classification. Why this is desirable is arguably non-obvious
and under-appreciated, particularly in high-stakes decision domains such as medical imaging.
This work therefore attempts to showcase several ways in which learning to generate real images can (in the long term) improve
the quality of our models and deliver highly interpretable outputs. It focuses on images and text within the context of digital pathology,
but the techniques are applicable to other medical (and non-medical) domains with multi-modal data.
Digital pathology utilises microscopic images of tissues and cells at various magnifications, where the morphological features
are visually enhanced using H&E staining (a pink and purple spectrum of colours). In this case, images of skin tissue
were used, representing healthy tissue, cancerous tissue (Intra-epidermal Carcinoma, IEC), and the gradations in between. The data
consisted of 11,588 images of size $256 \times 256$ pixels, each with an accompanying natural language description using a controlled
vocabulary of anatomical pathology terms.
Each image captures three layers of the skin:
- Keratin Layer
- Epidermal Layer
- Dermal Layer
The accompanying captions described the above layers in a systematic way, e.g.
The upper layer shows fragmented basket weave keratosis with focal parakeratosis.
The epidermis shows severe dysplasia. The dermis shows inflammation.
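These captions follow a simple per-layer template drawn from the controlled vocabulary. The sketch below illustrates that structure in Python; the term lists and helper name are illustrative examples, not the paper's exact vocabulary or code.

```python
# Illustrative (not the paper's exact) controlled-vocabulary terms per layer.
KERATIN_TERMS = ["basket weave keratosis",
                 "fragmented basket weave keratosis with focal parakeratosis"]
EPIDERMIS_TERMS = ["no dysplasia", "mild dysplasia", "severe dysplasia"]
DERMIS_TERMS = ["no inflammation", "inflammation"]

def compose_caption(keratin: str, epidermis: str, dermis: str) -> str:
    """Assemble a systematic three-sentence caption, one sentence per layer."""
    return (f"The upper layer shows {keratin}. "
            f"The epidermis shows {epidermis}. "
            f"The dermis shows {dermis}.")

print(compose_caption(KERATIN_TERMS[1], EPIDERMIS_TERMS[2], DERMIS_TERMS[1]))
```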
Instead of using a traditional Generative Adversarial Network (GAN) training paradigm, which implicitly learns to match a target distribution,
the Adversarial Latent Autoencoder (ALAE) paradigm is used. This is a slight modification that adds a term to explicitly
match the target distribution via an autoencoding loss. To improve reconstruction quality, a subnetwork, consisting of
an $\text{encoder}$ and $\text{decoder}$ network, is first pre-trained for image reconstruction. The adversarial training then begins
with this network locked, so that only the latent representation, $w$, is learned (refer to the paper for details).
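The two stages can be summarised in code. Below is a minimal sketch with small stand-in linear networks in place of the real convolutional models; the module names, losses, and hyperparameters are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in networks; the real models are convolutional and operate on
# 256 x 256 H&E tiles. Names and sizes here are illustrative only.
latent_dim, image_dim = 64, 3 * 256 * 256
encoder = nn.Linear(image_dim, latent_dim)    # image -> w
decoder = nn.Linear(latent_dim, image_dim)    # w -> image
mapping = nn.Linear(latent_dim, latent_dim)   # z -> w
discriminator = nn.Linear(image_dim, 1)       # real/fake score

def pretrain_step(images, opt):
    """Stage 1: train encoder + decoder for pixel-wise reconstruction."""
    loss = F.mse_loss(decoder(encoder(images)), images)
    opt.zero_grad(); loss.backward(); opt.step()

def adversarial_step(images, d_opt, g_opt):
    """Stage 2: with the autoencoder locked, only the latent space is shaped."""
    z = torch.randn(images.size(0), latent_dim)
    fake = decoder(mapping(z))                 # sample z, map to w, decode
    # Non-saturating GAN losses on real vs. generated images.
    d_loss = (F.softplus(discriminator(fake.detach())) +
              F.softplus(-discriminator(images))).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = F.softplus(-discriminator(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

images = torch.rand(8, image_dim)              # dummy batch of flattened tiles
ae_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)
pretrain_step(images, ae_opt)

# Lock the pre-trained autoencoder before the adversarial stage begins.
for p in [*encoder.parameters(), *decoder.parameters()]:
    p.requires_grad = False

d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
g_opt = torch.optim.Adam(mapping.parameters(), lr=1e-4)
adversarial_step(images, d_opt, g_opt)
```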
The video below shows the progress of the adversarial training stage. The model learns the global structure of a highly diverse
set of histological images rather quickly, with the majority of training dedicated to learning finer and finer detail.
The model captures variations in all three tissue layers, including staining and background colours. Although it may look just like a traditional autoencoder,
the adversarial component constrains the model so that all inputs are placed in a structured latent space $w$, capturing the relationships between
features within the images.
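One simple way to probe this structure is to project two real images into $w$ and decode points along the straight line between them; with a well-structured latent space, the intermediate decodings look like plausible tissue rather than pixel-wise blends. A hypothetical sketch, reusing the stand-in `encoder` and `decoder` from above:

```python
import torch

@torch.no_grad()
def interpolate(image_a, image_b, steps=8):
    """Project two images into w and decode points on the line between them."""
    w_a, w_b = encoder(image_a), encoder(image_b)
    return [decoder((1 - t) * w_a + t * w_b) for t in torch.linspace(0, 1, steps)]
```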
Concept Vectors
We can refine the way we explore the structure of the latent space using labels associated with the images. Indeed, it is reasonable to assume that we would already have labels for
real data, whether dense in the case of text, or sparse in the case of class labels. We can use these labels to define concept vectors: directions within the latent space $w$
that correspond to intentional semantic manipulations of the image content. Examples can be seen below, where images are transformed by moving their location in $w$-space along a particular
concept, e.g. increased inflammation or increased dysplasia.
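One common construction for such a direction (and the one sketched here; the paper's exact procedure may differ) is the difference between the mean latent codes of images with and without the concept. Assuming `w_codes` is an $(N, \text{latent\_dim})$ tensor of projected latents and `labels` is a binary tensor marking, say, the inflamed images:

```python
import torch

@torch.no_grad()
def concept_vector(w_codes, labels):
    """Unit direction from the 'without'-concept mean to the 'with'-concept mean."""
    v = w_codes[labels == 1].mean(dim=0) - w_codes[labels == 0].mean(dim=0)
    return v / v.norm()

@torch.no_grad()
def apply_concept(image, v, strength=2.0):
    """Shift an image's latent code along the concept direction and decode."""
    return decoder(encoder(image) + strength * v)
```

Larger values of `strength` exaggerate the concept in the decoded image; negative values suppress it.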