⚡️ Saturday AI Sparks 🤖 - 🖼️📝 Image Captioning with a Pre-trained Model
Posted On: September 13, 2025
Description:
How do we get a machine to look at an image and describe it in natural language? This task, called image captioning, combines computer vision (extracting features from images) with natural language generation (turning those features into words).
In this post, we’ll use a pretrained multimodal model (nlpconnect/vit-gpt2-image-captioning) from Hugging Face that links a Vision Transformer (ViT) with GPT-2 to generate captions directly — no training required.
Why Image Captioning?
- Accessibility: Generate alt-text for the visually impaired.
- Content tagging: Add metadata automatically to images.
- Creative AI: Build story generators and auto-annotation tools.
Installing Requirements
pip install transformers torch pillow
Loading the Pretrained Model
We use a Vision Transformer (encoder) to process the image and GPT-2 (decoder) to generate captions.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
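If you want to confirm that the checkpoint really pairs a ViT encoder with a GPT-2 decoder, the loaded config exposes both halves. A minimal sketch, assuming the model loaded above (exact config fields can differ slightly across transformers versions):
# Peek at the two halves of the encoder-decoder model
print("Encoder type:", model.config.encoder.model_type)  # expected: vit
print("Decoder type:", model.config.decoder.model_type)  # expected: gpt2
print("Total parameters:", sum(p.numel() for p in model.parameters()))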
Input Image
You can supply an image either from a URL, as shown below, or from your local file system (a sketch for local files follows the URL example).
from PIL import Image
import requests
url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
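If you'd rather caption a local file, the same PIL call works. This is a minimal sketch; my_photo.jpg is a placeholder path, and convert("RGB") simply guards against grayscale or RGBA inputs.
# Alternative: load a local image (placeholder path)
image = Image.open("my_photo.jpg").convert("RGB")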
Generating a Caption
We preprocess the image, feed it into the model, and decode the output.
import torch
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
# The ViT encoder sees a fixed grid of image patches (no padding), so no attention mask is needed
output_ids = model.generate(
    pixel_values,
    max_length=16,
    num_beams=4
)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print("Generated Caption:", caption)
Sample Output
Generated Caption: two dogs are sitting in a field with flowers
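The caption you get depends heavily on the decoding settings. As a rough sketch (these values are illustrative, not tuned), a wider beam search tends to give longer, more conservative captions, while sampling produces more varied ones:
# Wider beam search: more complete but conservative captions
output_ids = model.generate(pixel_values, max_length=32, num_beams=8, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
# Sampling: more varied captions, occasionally odd phrasing
output_ids = model.generate(pixel_values, max_length=32, do_sample=True, top_k=50, temperature=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())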
Key Takeaways
- Pretrained vision + language models can generate captions out-of-the-box.
- No labeled dataset or training required — just plug and play.
- Useful for accessibility, auto-tagging, and creative AI projects.
Code Snippet:
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import matplotlib.pyplot as plt
from PIL import Image
import requests
import torch
# Model and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# Load image from a URL (sample)
url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
plt.figure(figsize=(2, 2)) # small display window
plt.imshow(image)
plt.axis("off")
plt.title("Sample Image for Image Captioning")
plt.show()
# Preprocess image
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)
# Generate caption IDs (the ViT encoder has no padding, so no attention mask is needed)
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
# Decode IDs into text
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print("Generated Caption:", caption)