Here we go! Microsoft introduces a multimodal large language model called Kosmos-1. Achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more.

Take a look at some of the examples of generations from the Kosmos-1 model that can perceive general modalities. Would love to see how something like this might do with more data-specific visuals like charts, etc.

More examples demonstrating visual dialogue capabilities.

Okay, I have to admit this one caught my attention. Results on the Raven IQ test.

And of course, multimodal chain-of-thought prompting which allows for dealing with complex question-answering and reasoning tasks.

And before I forget, the paper: arxiv.org/abs/2302.14045

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large...

arxiv.org/abs/2302.14045

Language Is Not All You Need: Aligning Perception with Language Models

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large...

elvis

@omarsar0

Machine Learning & NLP Research • PhD • Building @dair_ai • Previously: Meta AI, Elastic

Follow on 𝕏

twitter-thread.com/t/1630392643602599937

elvis@omarsar0

Here we go! Microsoft introduces a multimodal large language model called Kosmos-1. Achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more.

elvis

elvis
@omarsar0