## Image-Text to Text

Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.

{{{tips.linksToTaskPage.image-text-to-text}}}

### Recommended models

{{#each recommendedModels.conversational-image-text-to-text}}
- [{{this.id}}](https://huggingface.co/{{this.id}}): {{this.description}}
{{/each}}

{{{tips.listModelsLink.image-text-to-text}}}

### Using the API

{{{snippets.conversational-image-text-to-text}}}

### API specification

For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification).
