## Image-Text to Text
Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, which are limited to specific use cases such as image captioning, they accept an additional text input and may also be trained to accept a conversation as input.
{{{tips.linksToTaskPage.image-text-to-text}}}
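In a conversational request, images and text are interleaved inside a single user message. The following minimal sketch uses `huggingface_hub`'s `InferenceClient`; the model ID and image URL are illustrative placeholders, and provider-specific snippets follow under "Using the API".

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

# Model ID and image URL below are illustrative placeholders.
response = client.chat_completion(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                # An image and a text prompt combined in one message
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```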
### Recommended models
{{#each recommendedModels.conversational-image-text-to-text}}
- [{{this.id}}](https://huggingface.co/{{this.id}}): {{this.description}}
{{/each}}
{{{tips.listModelsLink.image-text-to-text}}}
### Using the API
{{{snippets.conversational-image-text-to-text}}}
### API specification
For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification).