# Multi-Modal Networks
Following the success of transformer models on NLP tasks, there have been many attempts to apply the same or similar architectures to computer vision tasks. There is also growing interest in building models that *combine* vision and natural language capabilities. One such attempt is CLIP, developed by OpenAI.
## Contrastive Language-Image Pre-Training (CLIP)