Conceptual Captions Dataset

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.




Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).

Google's Conceptual Captions dataset has more than 3 million images, paired with natural-language captions. In contrast with the curated style of the MS-COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. The raw descriptions are harvested from the Alt-text HTML attribute associated with web images. We developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

More details are available in this paper (please cite the paper if you use or discuss this dataset in your work):

@inproceedings{sharma2018conceptual,
  title = {Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning},
  author = {Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle = {Proceedings of ACL},
  year = {2018},
}

Dataset Description

The Conceptual Captions dataset release contains two splits: train (~3.3M examples) and validation (~16K examples). See Table 1 below for more details.

Table 1: Dataset stats.

| Split | Examples | Unique Tokens | Tokens per Caption (Mean) | Tokens per Caption (StdDev) | Tokens per Caption (Median) |
| ------------- | --------- | ------------- | ------------------------- | --------------------------- | --------------------------- |
| Train | 3,318,333 | 51,201 | 10.3 | 4.5 | 9.0 |
| Valid | 15,840 | 10,900 | 10.4 | 4.7 | 9.0 |
| Test (Hidden) | 12,559 | 9,645 | 10.2 | 4.6 | 9.0 |

Hidden Test set

We are not releasing the official test split (~12.5K examples). Instead, we are hosting a competition dedicated to supporting submissions and evaluations of model outputs on this blind test set.

We strongly believe that this setup has several advantages: a) it allows evaluation on a large, unbiased set of images; b) it keeps the test set completely blind, eliminating suspicions of fitting to the test set, cheating, etc.; c) it provides a clean setup for advancing the state of the art on this task, including reporting reproducible results for paper publications.

Data Format

The released data is provided as TSV (tab-separated values) text files with the following columns:

Table 2: Columns in TSV files.

| Column | Description |
| ------ | ----------------------------------------------------- |
| 1 | Caption. The text has been tokenized and lowercased. |
| 2 | Image URL |

Contact us

If you have a technical question regarding the dataset, code or publication, please create an issue in this repository. This is the fastest way to reach us.

If you would like to share feedback or report concerns, please email us at [email protected]
