Multimodal perception and interaction with transformers
In this advanced tutorial we review the emergence of attention in bilingual translation models, and show how this led to the transformer architecture composed of stacked encoder and decoder layers built on multi-head attention. We discuss techniques for token embedding of natural language, and show how these embeddings can be trained with a masked language modeling objective. We describe how this approach can be extended to multiple modalities by concatenating encodings of the individual modalities, and discuss problems and approaches in adapting transformers for use with computer vision and spoken language interaction. We conclude with a review of current research challenges, performance evaluation metrics, and benchmark datasets.
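To make the central mechanism concrete, the sketch below implements multi-head scaled dot-product attention in plain NumPy. It is a minimal illustration of the technique named above, not code from any particular model: the function name, the toy dimensions (4 tokens, model width 8, 2 heads), and the random weight matrices are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project inputs to queries, keys, values, then split into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head).
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Per-head attention weights: softmax(Q K^T / sqrt(d_head)).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Weighted sum of values; re-concatenate heads; output projection.
    out = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

# Toy usage with hypothetical sizes: 4 tokens, width 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (4, 8)
```

A full encoder or decoder layer, as covered in the tutorial, wraps this block with residual connections, layer normalization, and a position-wise feed-forward network; the same attention computation also underlies the multimodal extension, where encodings of the individual modalities are concatenated along the sequence dimension before attention is applied.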