⚡️ Saturday AI Sparks 🤖 - 🖼️📝 Image Captioning with a Pre-trained Model


Description:

How do we get a machine to look at an image and describe it in natural language? This task, called image captioning, combines computer vision (extracting features from images) with natural language generation (turning those features into words).

In this post, we’ll use a pretrained multimodal model (nlpconnect/vit-gpt2-image-captioning) from Hugging Face that links a Vision Transformer (ViT) with GPT-2 to generate captions directly — no training required.


Why Image Captioning?

  • Accessibility: Generate alt-text for the visually impaired.
  • Content tagging: Add metadata automatically to images.
  • Creative AI: Build story generators and auto-annotation tools.

Installing Requirements

pip install transformers torch pillow

Loading the Pretrained Model

We use a Vision Transformer (encoder) to process the image and GPT-2 (decoder) to generate captions.

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
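
Optionally, move the model to a GPU when one is available; the complete snippet at the end of this post does the same. Any input tensors then need to live on the same device as the model.

import torch

# Use a GPU if available; everything also works on CPU, just more slowly
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)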

Input Image

You can supply an image from a URL or from your local file system.

from PIL import Image
import requests

url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
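
An image from the local file system works the same way; only the loading line changes (the path below is just a placeholder). Converting to RGB guards against grayscale or RGBA files.

from PIL import Image

# Load an image from disk instead of a URL (replace the placeholder path)
image = Image.open("my_photo.jpg").convert("RGB")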

Generating a Caption

We preprocess the image into pixel values, feed them to the model, and decode the generated token IDs back into text. The ViT encoder works directly on pixel values, so no attention mask is needed for a single image (there is no padding to mask out).

pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(model.device)

output_ids = model.generate(
    pixel_values,
    max_length=16,
    num_beams=4
)

caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print("Generated Caption:", caption)

Sample Output

Generated Caption: two dogs are sitting in a field with flowers

Key Takeaways

  • Pretrained vision + language models can generate captions out-of-the-box.
  • No labeled dataset or training required — just plug and play.
  • Useful for accessibility, auto-tagging, and creative AI projects.

Code Snippet:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import matplotlib.pyplot as plt
from PIL import Image
import requests
import torch


# Model and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


# Load image from a URL (sample)
url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
plt.figure(figsize=(2, 2))   # small display window
plt.imshow(image)
plt.axis("off")
plt.title("Sample Image for Image Captioning")
plt.show()


# Preprocess image
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)

# Generate caption IDs (ViT consumes raw pixel values, so no attention mask is needed)
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

# Decode IDs into text
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print("Generated Caption:", caption)
