Representation Learning for Non-Melanoma Skin Cancer

S.M.Thomas | 5th September 2022

Australian e-Health Research Centre, CSIRO

This article explores ideas presented in the paper Representation Learning for Non-Melanoma Skin Cancer using Latent Autoencoders. The accompanying code is available in the GitHub repository.

Note: For a smoother interactive experience, it is recommended that you download the repository and view it locally.



Useful Representations of Non-Melanoma Skin Cancer

Generative modelling techniques such as GANs, diffusion models and autoregressive transformers have shown unprecedented representational capacity. They take high-dimensional data from a particular problem domain (e.g. images, text or both) and learn a lower-dimensional representation which captures semantic structure and approximates the real-world distribution. We can then sample from that distribution, generating synthetic images or texts which resemble real-world data. However, merely drawing samples from the distribution is of limited use for real-world problems. A more desirable ability is to project existing data points into a structured latent space, rather than one optimized only for discriminatory tasks such as classification. Why this is desirable is arguably not obvious and under-appreciated, particularly in high-stakes decision domains such as medical imaging. This work therefore attempts to showcase several ways in which learning to generate real images can (in the long term) improve the quality of our models and also deliver highly interpretable outputs. The focus is on images and text within the context of digital pathology, but the techniques are applicable to other medical (and non-medical) domains with multi-modal data.

Digital pathology utilises microscopic images of tissues and cells at various magnifications, where the morphological features are visually enhanced using H&E staining (a pink and purple spectrum of colours). In this case, images of skin tissue were used, representing healthy tissue, cancerous tissue (intra-epidermal carcinoma, IEC), and the gradations in between. The data consisted of 11,588 images of size $256 \times 256$ pixels, each with an accompanying natural language description using a controlled vocabulary of anatomical pathology terms.
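To make the data layout concrete, below is a minimal sketch of how the paired image/caption data could be wrapped for training. The file names, caption file format and normalisation are illustrative assumptions, not the repository's actual loader.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class SkinPatchDataset(Dataset):
    """Pairs each 256x256 H&E image patch with its pathology caption (sketch only)."""

    def __init__(self, image_dir: str, caption_file: str):
        self.image_dir = Path(image_dir)
        # Assumed format: {"patch_0001.png": "The upper layer shows ...", ...}
        self.captions = json.loads(Path(caption_file).read_text())
        self.names = sorted(self.captions.keys())
        self.to_tensor = transforms.Compose([
            transforms.ToTensor(),                          # scale to [0, 1]
            transforms.Normalize([0.5] * 3, [0.5] * 3),     # shift to [-1, 1], common for GAN-style models
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(self.image_dir / name).convert("RGB")
        return self.to_tensor(image), self.captions[name]
```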




Each image captures three layers of the skin.

  • Keratin Layer
  • Epidermal Layer
  • Dermal Layer

The accompanying captions described the above layers in a systematic way, e.g.

The upper layer shows fragmented basket weave keratosis with focal parakeratosis. The epidermis shows severe dysplasia. The dermis shows inflammation.

Instead of using a traditional Generative Adversarial Network (GAN) training paradigm, which implicitly learns to match a target distribution, the Adversarial Latent Autoencoder (ALAE) paradigm is used. This is a slight modification which includes an additional term to explicitly match the target distribution via an autoencoding loss. To improve reconstruction quality, a subnetwork consisting of an $\text{encoder}$ and $\text{decoder}$ network is first pre-trained for image reconstruction. The adversarial training then begins using a locked network, where only the latent representation, $w$, is learned. [Refer to the paper for details].
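The sketch below illustrates the two-stage structure described above, under simplifying assumptions: pre-train an autoencoder for reconstruction, then freeze it and train adversarially so that latents mapped from the prior match the distribution of encoded real images in $w$-space. The modules are toy stand-ins (the real networks are convolutional) and the losses are a simplification; refer to the paper and repository for the actual ALAE formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

latent_dim = 512
# Toy stand-in modules; the real encoder/decoder are convolutional networks.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, latent_dim))                      # image -> w
decoder = nn.Sequential(nn.Linear(latent_dim, 3 * 256 * 256), nn.Unflatten(1, (3, 256, 256)))    # w -> image
mapping = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
                        nn.Linear(latent_dim, latent_dim))                                        # z -> w
critic = nn.Linear(latent_dim, 1)                                                                 # real/fake score on w
# `loader` is assumed to yield (image, caption) batches, e.g. a DataLoader over the dataset sketch above.

# Stage 1: reconstruction pre-training of the encoder and decoder only.
ae_opt = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
for images, _ in loader:
    recon = decoder(encoder(images))
    loss = F.mse_loss(recon, images)
    ae_opt.zero_grad(); loss.backward(); ae_opt.step()

# Stage 2: adversarial training in latent space; the autoencoder is locked.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)
g_opt = optim.Adam(mapping.parameters(), lr=1e-4)
d_opt = optim.Adam(critic.parameters(), lr=1e-4)
for images, _ in loader:
    w_real = encoder(images)                                   # latents of real images
    w_fake = mapping(torch.randn(images.size(0), latent_dim))  # latents mapped from the prior
    # Discriminator learns to separate encoded real latents from mapped prior latents.
    d_loss = (F.softplus(critic(w_fake.detach())) + F.softplus(-critic(w_real))).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Mapping network learns to make its latents indistinguishable from the real ones.
    g_loss = F.softplus(-critic(mapping(torch.randn(images.size(0), latent_dim)))).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```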

The video below shows the training progress of the adversarial training stage. The model learns the global structure of a highly diverse set of histological images rather quickly, with the majority of training dedicated to learning finer and finer detail. The model captures variations in all three tissue layers, including staining and background colours. Although it may look just like a traditional autoencoder, the adversarial component constrains the model so that all inputs are placed in a structured latent space $w$, capturing the relationships between features within the images.

Diverse Samples and Interpolations

We can demonstrate the learned representation clearly by sampling from the latent space (via a mapping network $F$, such that $w = F(z)$, where $z \sim N(0, 1)$). Below you can see diverse samples of synthetic images, and interpolations between different points within $w$. The result is that every point within $w$ corresponds to a real, or seemingly (conjectured) real, image. Indeed, we can directly see for ourselves how a healthy tissue transforms into a cancerous tissue (according to how the model understands the world).
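A hedged sketch of sampling and interpolating in the learned latent space, reusing the placeholder `mapping` ($F$), `decoder` and `latent_dim` names from the sketch above; variable names are illustrative, not the repository's API.

```python
import torch

with torch.no_grad():
    z = torch.randn(2, latent_dim)           # two samples from the prior
    w = mapping(z)                            # w = F(z), structured latent codes
    endpoints = decoder(w)                    # two synthetic images

    # Linear interpolation between the two latent codes: every intermediate
    # point should also decode to a plausible tissue image.
    alphas = torch.linspace(0, 1, steps=8).view(-1, 1)
    w_path = (1 - alphas) * w[0] + alphas * w[1]
    frames = decoder(w_path)                  # 8 images morphing from one tissue to the other
```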

Concept Vectors

We can refine the way that we explore the structure of the latent space using labels associated with the images. Indeed, it is a reasonable assumption that we would already have labels for real data, whether dense in the case of text, or sparse in the case of class labels. We can use the labels to define concept vectors, describing directions within the latent space $w$ that correspond to intentional semantic manipulations of the image content. Examples of this can be seen below, where images are transformed by moving their location in $w$-space along a particular concept, e.g. increased inflammation or increased dysplasia.


[Image panels: concept-vector manipulations for Parakeratosis / Basket weave, + Dysplasia / - Dysplasia, - Inflammation / + Inflammation, and - Solar Elastosis / + Solar Elastosis.]

The above shows that we can use simple linear directions to characterise images in terms of high-level concepts along a continuum. With this ability, we can work towards characterising cancer as a progression along a continuum, rather than as a binary label. This has enormous potential for systematic and repeatable characterisation of cancer, alleviating the discrepancies associated with inter- and intra-observer variability among pathologists (Lewis et al., 2015).
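To make these "simple linear directions" concrete, below is a minimal sketch of one common way to derive a concept vector: the difference of class means in $w$-space between labelled examples (a linear classifier's normal vector is another option). The latents and labels here are random stand-ins; in practice the latents would come from the encoder and the labels from the captions.

```python
import torch

latent_dim = 512
# Stand-ins for real data: in practice w_all = encoder(images) and labels indicate
# whether a concept (e.g. "inflammation") is present in the image's caption.
w_all = torch.randn(1000, latent_dim)
labels = torch.randint(0, 2, (1000,))

def concept_direction(w: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Unit vector in w-space pointing from the negative class towards the positive class."""
    d = w[y == 1].mean(0) - w[y == 0].mean(0)
    return d / d.norm()

d_inflammation = concept_direction(w_all, labels)
w_edited = w_all[0] + 2.5 * d_inflammation        # the step size controls the edit strength
# edited_image = decoder(w_edited.unsqueeze(0))   # decode with the generator to view the edit
```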

The features in the images above retain many correlations, e.g. increased inflammation correlates with increased dysplasia. In reality, such features are more independent. Consequently, as models improve and features in the latent space are further disentangled (e.g. using StyleSpace analysis (Wu et al., 2021)), the value of this method for producing highly interpretable outputs, and the potential for finer and finer separation of the tissue progression, increases.

Exhaustive Latent Space Exploration

The fact that we can sample from the latent space also helps to put bounds on what is and is not a realistic point in that space. For example, one approach is to learn a 2-dimensional representation of both the sampled (synthetic) images and the real images, and see where they do and do not overlap. Exhaustive sampling reveals what every single point looks like, as well as its closest real data point, allowing probability estimates to be established for a given region. This technique also reveals how the densities of the sampled and real distributions compare. Intuitively, images that fall outside of the "cloud of realism" can be easily identified and visualised.
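Below is a hedged sketch of this comparison: project real and sampled latents into a shared 2-D view and find each synthetic point's closest real neighbour. The paper's exact projection method may differ; a t-SNE / nearest-neighbour combination from scikit-learn is used here purely for illustration, with random stand-ins for the latents.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# Stand-ins: latents of real images (encoder output) and of sampled images (mapping output).
w_real = np.random.randn(2000, 512)
w_fake = np.random.randn(2000, 512)

# Shared 2-D embedding of both sets, so their densities and overlap can be compared directly.
coords = TSNE(n_components=2, init="pca").fit_transform(np.vstack([w_real, w_fake]))
real_2d, fake_2d = coords[:2000], coords[2000:]

# For every sampled point, locate the closest real data point in latent space.
nn_index = NearestNeighbors(n_neighbors=1).fit(w_real)
dist, idx = nn_index.kneighbors(w_fake)

# Large distances flag samples that sit outside the "cloud of realism".
outliers = np.argsort(-dist[:, 0])[:20]
```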




Highly Expressive Characterisation

Pathologists have created a highly specialised vocabulary to describe all the nuanced variations they see under the microscope. There is thus a natural pairing of images and words. When combined with text generation models (in this case an autoregressive transformer), we can work towards a highly expressive characterisation of the entire latent space, using the nuanced language that pathologists themselves use. This firstly provides a way for us to criticise the knowledge within the model, e.g. errors between images and their captions*, but, secondly, provides a means for the model to explain itself in terms we already understand.

* The examples below, although accurate in some instances, still contain many errors between the images and captions. The interface is primarily an example of what future interpretability / knowledge interfaces could look like and be used for.
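As a rough illustration of how a latent code can drive caption generation, the sketch below conditions a small autoregressive transformer decoder on $w$ by projecting it to a single "memory" token that the decoder attends to. The vocabulary, dimensions and wiring are assumptions for illustration; the model used in the paper may be arranged differently.

```python
import torch
from torch import nn

class LatentCaptioner(nn.Module):
    """Toy autoregressive captioner conditioned on a latent code w (sketch only)."""

    def __init__(self, vocab_size: int, latent_dim: int = 512, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, w: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        memory = self.latent_proj(w).unsqueeze(1)               # (B, 1, d_model): w as a single memory token
        tgt = self.embed(tokens)                                # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                                   # next-token logits

# Toy usage: greedy decoding of a caption for one latent code.
model = LatentCaptioner(vocab_size=1000)
w = torch.randn(1, 512)
tokens = torch.zeros(1, 1, dtype=torch.long)                    # assumed <start> token id 0
for _ in range(20):
    logits = model(w, tokens)
    tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
```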

References

Lewis Jr, J. S., Tarabishy, Y., Luo, J., Mani, H., Bishop, J. A., Leon, M. E., ... & Di Palma, S. (2015). Inter-and intra-observer variability in the classification of extracapsular extension in p16 positive oropharyngeal squamous cell carcinoma nodal metastases. Oral oncology, 51(11), 985-990.

Wu, Z., Lischinski, D., & Shechtman, E. (2021). Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12863-12872).