For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read and write text. We can see images and watch videos. We listen to music to relax and watch out for strange noises to detect danger. Bein