## Image-Text to Text

Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they take an additional text input, which means they are not restricted to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.

{{{tips.linksToTaskPage.image-text-to-text}}}

### Recommended models

{{#each recommendedModels.conversational-image-text-to-text}}
- [{{this.id}}](https://huggingface.co/{{this.id}}): {{this.description}}
{{/each}}

{{{tips.listModelsLink.image-text-to-text}}}

### Using the API

{{{snippets.conversational-image-text-to-text}}}

### API specification

For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification).
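
As a concrete illustration of the conversational format, here is a minimal sketch of an image-text-to-text request using the `huggingface_hub` client. The model id, image URL, and token handling are assumptions chosen for the example, not prescribed by this page; the provider-specific snippets rendered above show the supported clients in full.

```python
# Minimal sketch: a chat-completion request to a VLM via huggingface_hub.
# Assumptions (not from this page): the model id, the image URL, and that
# an HF_TOKEN environment variable holds a valid access token.
import os

from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ["HF_TOKEN"])

# VLMs accept chat messages whose user content mixes image and text parts.
response = client.chat_completion(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # example model id; any VLM works
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```

Because the request body follows the Chat Completion API, multi-turn conversations work the same way: append the assistant's reply and a new user message to `messages` and send the request again.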