key: cord-0547668-hlpdcogo
authors: Desai, Karan; Kaul, Gaurav; Aysola, Zubin; Johnson, Justin
title: RedCaps: web-curated image-text data created by the people, for the people
date: 2021-11-22
journal: nan
DOI: nan
sha: b684ce8e13f4caaee46f6ea445423f7a21e3c0a7
doc_id: 547668
cord_uid: hlpdcogo

Large datasets of paired images and text have become increasingly popular for learning generic representations for vision and vision-and-language tasks. Such datasets have been built by querying search engines or collecting HTML alt-text -- since web data is noisy, they require complex filtering pipelines to maintain quality. We explore alternate data sources to collect high quality data with minimal filtering. We introduce RedCaps -- a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. We collect data from a manually curated set of subreddits, which give coarse image labels and allow us to steer the dataset composition without labeling individual instances. We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.

Figure 1: RedCaps dataset comprises 12M image-text pairs from 350 subreddits. RedCaps data is created by the people, for the people - it contains everyday things that users like to share on social media, for example hobbies (r/crafts) and pets (r/shiba). Captions often contain specific and fine-grained descriptions (northern cardinal, taj mahal). Subreddit names provide relevant image labels (r/shiba) even when captions may not (mlem!), and sometimes may group many visually unrelated images through a common semantic meaning (r/perfectfit).

Large datasets of image-text pairs from the web have enabled successful transfer learning applications in computer vision. Two such prominent datasets - SBU [1] and Conceptual Captions [2] - are widely used for pre-training vision-and-language (V&L) representations [3][4][5][6][7][8][9][10][11] that transfer to a variety of downstream V&L tasks like visual question answering [12][13][14], visual reasoning [15, 16], and image captioning [17, 18]. Recent work [19, 20] also shows that image-text data from COCO [17] can be used to learn visual features that are competitive with supervised pretraining [21] on ImageNet [22, 23] when transferred to downstream tasks [24][25][26][27][28]. More recently, CLIP [29] and ALIGN [30] scale up to 400M and 1B+ web-curated image-text pairs, enabling zero-shot visual recognition.

These datasets have an appealing advantage - they are free from expensive annotations. However, they apply complex filtering steps to deal with noisy web data. For example, Conceptual Captions (CC-3M [2], CC-12M [31]) discard captions without nouns, or whose nouns do not match image labels predicted by in-house image taggers. They also perform text pre-processing like replacing proper nouns with common nouns. These pipelines are data-inefficient - for example, CC-3M collected 5B image-text pairs and filtered them down to 3.3M. CLIP and ALIGN scale primarily by relaxing such filtering, resulting in gargantuan datasets which could be extremely noisy. How can we obtain high-quality image-text data from the web without complex data filtering? We argue that the quality of data depends on its source and the intent behind its creation.
Revisiting data sources, SBU queries Flickr with predefined keywords, while CC-3M and CC-12M extract images and HTML alt-text from an unspecified set of web pages; CLIP and ALIGN give only vague descriptions of their data sources, and their datasets are non-public. In these sources, text is secondary to images: Flickr focuses on photos, and alt-text is an oft-overlooked fallback for when images cannot be viewed, which frequently contains metadata or generic text (e.g. "alt img" [30]). To obtain higher-quality data, we look for sources where humans use both images and text equally for interaction on the web.

In this paper, we explore the Reddit [32] social media platform for collecting image-text pairs. Textual data from Reddit is already used for pre-training massive language models [33][34][35][36] in NLP. We collect images and their captions as submitted by Reddit users in topic-specific subreddits. Our dataset of image captions from Reddit (RedCaps in short) consists of 12M image-text pairs submitted in 350 subreddits between 2008-2020. RedCaps data is created by the people, for the people, to engage with the broader community. Figure 1 shows some examples from RedCaps - the captions are more conversational, humorous, emotional, and generally more diverse than HTML alt-text. Apart from linguistic diversity, Reddit offers many other advantages. Subreddits provide additional image labels and group related content - manually selecting subreddits allows us to steer dataset contents without labeling individual instances. Reddit's voting system gives free and organic quality control: unappealing or spam content is actively downvoted by users or removed by moderators. RedCaps is one of the largest public image-text datasets, but it is not static: we plan to release regular updates with newly uploaded Reddit content, allowing RedCaps to grow over time.

We claim that captions written with the intent of human interaction on Reddit are a better source of data than those used in other image-text datasets. To this end, we follow VirTex [19] to learn visual representations by training image captioning models from scratch. We find that human evaluators prefer captioning outputs from models trained on RedCaps vs CC-3M. We also transfer the learned features to eleven different downstream datasets for tasks including image classification, object detection, instance segmentation, and fine-grained recognition, using both fine-tuning and language-based zero-shot classification [29]. We show that features learned on RedCaps outperform those learned on SBU or CC-3M, demonstrating the utility of our data collection strategy.

Reddit is the singular data source for RedCaps. This leads to a very different data collection pipeline than datasets based on HTML alt-text or search engine results. Here we describe how we collect RedCaps.

Overview of Reddit: Reddit is a social media platform for content sharing and discussion. It comprises user-run communities called subreddits that cover diverse topics like animals (r/cats, r/foxes), food (r/pizza, r/sushi), leisure (r/hiking, r/crafts), and utility (r/ceramics, r/tools). Users can submit new posts or share existing posts from other subreddits (cross-posting), and may comment and upvote (or downvote) posts to express their interest. We are specifically interested in posts containing images. Figure 2 shows an image post submitted by user u/johndoe in subreddit r/itookapicture.
It comprises an image, caption, score (upvotes minus downvotes), and information about the author and time of post creation. We extract this metadata from millions of image posts to build RedCaps. Reddit posts also have associated comment threads. These are usually casual conversations loosely based on the image. In Figure 2, the comment describes ducks as following social distancing - it includes context beyond the image (COVID-19 pandemic) and conveys it with a witty remark. Prior works in dialog modeling and text summarization have trained on Reddit comments [33, [37][38][39][40]]. For RedCaps, we only use captions as textual data and leave comments for future work.

Reddit's uniform structure allows us to parallelize data collection as independent tasks - each task involves collecting posts submitted to a single subreddit in one year. Our collection pipeline has three steps: (1) subreddit selection, (2) image post filtering, and (3) caption cleaning.

Step 1. Subreddit selection: We collect data from a manually curated set of subreddits. Subreddits have their own rules, community norms, and moderators, so curating subreddits allows us to steer the dataset's composition without annotating individual instances. We select subreddits with a high volume of image posts, where images tend to be photographs (rather than memes, drawings, screenshots, etc.) and post titles tend to describe image content (rather than making jokes, political commentary, etc.). We do not select any NSFW, banned, or quarantined subreddits. We want to minimize the number of people that appear in RedCaps, so we omit subreddits whose primary purpose is to share or comment on images of people (such as celebrity pics or user selfies). We choose subreddits focused on general photography (r/pics, r/itookapicture), animals (r/axolotls, r/birdsofprey, r/dachshund), plants (r/roses, r/succulents), objects (r/classiccars, r/trains, r/mechanicalkeyboards), food (r/steak, r/macarons), scenery (r/cityporn, r/desertporn), or activities (r/carpentry, r/kayaking). In total we collect data from 350 subreddits; the full list can be found in Appendix A.

Step 2. Image post filtering: We use the Pushshift [41] and Reddit [42, 43] APIs to download all image posts submitted to our selected subreddits from 2008-2020. Posts are collected at least six months after their creation to let upvotes stabilize. We only collect posts with images hosted on three domains: Reddit (i.redd.it), Imgur (i.imgur.com), and Flickr (staticflickr.com). Some image posts contain multiple images (gallery posts) - in this case we only collect the first image and associate it with the caption. We discard posts with < 2 upvotes to avoid unappealing content, and we discard posts marked NSFW (by their authors or subreddit moderators) to avoid pornographic or disturbing content.

Step 3. Caption cleaning: We expect Reddit post titles to be less noisy than other large-scale sources of image captions such as alt-text [2, 31], so we apply minimal text cleaning. We lowercase captions and use ftfy [44] to remove character accents, emojis, and non-Latin characters, following [29, 35, 36]. Then we apply simple pattern matching to discard all sub-strings enclosed in brackets ((.*), [.*]). These sub-strings usually give non-semantic information: original content tags [oc], image resolutions (800x600 px), camera specs (shot with iPhone), self-promotion [Instagram: @user], and other references (link in comments).
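To make this cleaning step concrete, below is a minimal sketch of the normalization described above, also including the social-media-handle replacement described in the next paragraph. This is an illustrative approximation under our own assumptions, not the dataset's actual preprocessing code; the exact Unicode handling may differ.

```python
import re
import unicodedata
import ftfy

BRACKETS = re.compile(r"\(.*?\)|\[.*?\]")   # (.*) and [.*] sub-strings
HANDLES = re.compile(r"@\S+")               # social media handles

def clean_caption(raw_caption: str) -> str:
    """Illustrative approximation of RedCaps caption cleaning."""
    text = ftfy.fix_text(raw_caption).lower()
    # strip accents, emojis, and other non-Latin characters
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = BRACKETS.sub("", text)            # drop [oc], (800x600 px), (shot with iphone), ...
    text = HANDLES.sub("[USR]", text)        # replace @handles with a [USR] token
    return " ".join(text.split())            # collapse leftover whitespace

print(clean_caption("ITAP of a mallard duck [OC] (shot with iPhone) -- @someuser"))
# -> "itap of a mallard duck -- [USR]"
```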
Finally, like [31] we replace social media handles (words starting with '@') with a [USR] token to protect user privacy and reduce redundancy. Due to such filtering, ≈12K (0.1%) captions in our dataset are empty strings. We do not discard them, as subreddit names alone provide meaningful supervision. Unlike CC-3M or CC-12M, which discard captions without nouns or whose nouns do not overlap predicted image tags, we do not discard any instances in this step.

Through this pipeline, we collect 13.4M instances from 350 subreddits. Our collection pipeline is less resource-intensive than those of existing datasets - we do not require webpage crawlers, search engines, or large databases of indexed webpages. RedCaps is easily extensible in the future by selecting more subreddits and collecting posts from future years. Next, we perform additional filtering to mitigate user privacy risks and harmful stereotypes in RedCaps, resulting in a final size of 12M instances.

There has been growing awareness about potential biases and harms that can arise from internet-scale image and text datasets [45][46][47][48][49][50][51]. There is a fundamental tension in such datasets: the use of internet data is motivated by the desire to use datasets larger than can be manually annotated or verified, but this also means that such datasets cannot be fully controlled or curated by their creators. We identify two potential risks with RedCaps - privacy of people appearing in RedCaps images, and harmful stereotypes - and attempt to minimize them by automatic data filtering. We also discuss the impact of data curation from Reddit on user consent and data distribution in RedCaps.

Privacy: The individual who posts a given photo on Reddit may not be the person appearing in said photo; this can pose privacy risks for people who did not expect to appear in images online [49, 50]. Our first method of mitigation is the manual curation of subreddits which are not focused on describing people (Section 2.1). As an additional measure, we use RetinaFace [52] to filter images having any face detection with confidence ≥ 0.9. Results of this filtering are shown in Table 1. The number of detections is high (1.2M), but the precision is low (32%) - most detections are masked faces, statues, and animals. Nevertheless, we remove all of these images to reduce privacy risks while minimizing impact to downstream vision tasks.

Harmful stereotypes: Another concern with Reddit data is that images or language may represent harmful stereotypes about gender, race, or other characteristics of people [48, 49, 51]. We select only non-NSFW subreddits with active moderation for collecting data. This stands in contrast to less curated uses of Reddit data, such as GPT-2 [35], whose training data includes at least 63K documents from banned or quarantined subreddits which may contain toxic language [53]. We attempt to further reduce harmful stereotypes in two ways:

- NSFW images: We use the InceptionV3 [54] model from [55] to filter images detected as porn or hentai with confidence ≥ 0.9. Similar to face filtering, we estimated the precision of this filtering and the amount of missed detections, shown in Table 1. The model detects 87K images with low precision (∼1%) - most detections are non-NSFW images with pink and beige hues.
- Potentially derogatory language: We filter instances whose captions contain words or phrases from a common blocklist [56].
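The sketch below illustrates the shape of this safety filtering - thresholding detector confidences for faces and NSFW content and matching captions against a blocklist. The detector calls (detect_faces, nsfw_scores) are hypothetical wrappers standing in for RetinaFace and the InceptionV3 NSFW model, not their actual APIs; only the thresholds are the values stated above.

```python
import re

FACE_CONF = 0.90      # RetinaFace confidence threshold described above
NSFW_CONF = 0.90      # InceptionV3 porn/hentai confidence threshold
BLOCKLIST = {"slur1", "slur2"}  # stand-in for the public blocklist [56]

def keep_instance(image, caption, detect_faces, nsfw_scores):
    """Return True if an instance survives the privacy / NSFW / language filters.

    detect_faces(image) -> list of confidences; nsfw_scores(image) -> dict of
    class probabilities. Both are placeholders for off-the-shelf models.
    """
    if any(conf >= FACE_CONF for conf in detect_faces(image)):
        return False                           # drop images with detected faces
    scores = nsfw_scores(image)
    if max(scores.get("porn", 0.0), scores.get("hentai", 0.0)) >= NSFW_CONF:
        return False                           # drop likely NSFW images
    words = set(re.findall(r"[a-z']+", caption.lower()))
    if words & BLOCKLIST:
        return False                           # drop captions with blocklisted terms
    return True
```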
It is important to note that such coarse filtering might suppress language from marginalized groups reclaiming slurs [51]; however, as RedCaps is not intended to describe people, we believe this is a pragmatic tradeoff to avoid propagating harmful labels.

Consent: When submitting to Reddit, users expect their posts to be publicly visible and accessible via the Reddit API we use to download data. However, they did not explicitly consent for their data to be used for training large-scale neural networks [49]. We mitigate this concern in two ways. First, we distribute URLs instead of images; posts deleted from Reddit will thus be automatically removed from RedCaps. Second, we provide a public form on our website allowing anyone to request that specific instances be removed from RedCaps. These decisions mean that over time some images will disappear from RedCaps, making it difficult to exactly reproduce experiments in the future. However, we believe this to be less important than allowing users to opt out of RedCaps. Even if images are removed, we expect RedCaps to grow over time as we include newer posts (Figure 3).

Reddit demographics: Reddit's user demographics are not representative of the population at large. Compared to US adults, Reddit users skew male (69% vs 49%), young (58% 18-29 years old vs 22%), college educated (36% vs 28%), and politically liberal (41% vs 25%) [57]. Reddit users are predominantly white (63%) [57], and 49% of desktop traffic to Reddit comes from the United States [58]. All of the subreddits in RedCaps use English as their primary language. Taken together, these demographic biases likely also bias the types of objects and places that appear in images on Reddit, and the language used to describe these images. We do not offer explicit countermeasures to these biases, but users of RedCaps should keep in mind that size doesn't guarantee diversity [51]. Subtler issues may also exist, such as imbalanced representation of demographic groups [59] or gender bias in object co-occurrence [60] or language [61]. These are hard to control in internet data, so we release RedCaps with explicit instructions on suitable use-cases; specifically, we request that models not be trained to identify people or to make decisions that impact people. We document these instructions and other terms-of-use in a datasheet [45], provided in Appendix G.

3 RedCaps data analysis

RedCaps is 2× larger than the English subset of the multilingual Wikipedia image-text dataset [62], and nearly as large as CC-12M [31]. Based on current trends, we expect RedCaps to outsize CC-12M by the end of 2021. While CLIP [29] and ALIGN [30] used orders of magnitude larger training datasets, they are not released for public use - RedCaps remains one of the largest public image-text datasets.

Subreddit distribution: RedCaps instances are distributed across 350 subreddits in a long-tail distribution. In Figure 4, we show the top 20 subreddits with the most instances in RedCaps. Subreddit sizes highly correlate with their popularity on Reddit, which depends on what users find interesting to view and share on social media. Large subreddits are based on general photography (r/pics, r/mildlyinteresting, r/itookapicture), while specific subreddits show that Reddit users enjoy sharing images of food (r/food, r/foodporn) and cute pets (r/cats, r/dogpictures, r/rabbits), and showing off their hobbies (r/gardening, r/crochet, r/baking) and accessories (r/sneakers, r/mechanicalkeyboards, r/carporn).
This gives a distribution of visual concepts encountered by humans in daily life, without having to predefine an ontology of object classes.

Caption lengths: Figure 5 compares caption lengths between RedCaps and other datasets. We see that RedCaps has the highest mode length at 5 words (vs 3 for CC-3M and SBU) and a heavier tail of long captions (≥25 words). SBU has a fairly flat distribution of caption lengths between 3 and 17 words, likely because it only retains captions with at least one preposition and two words from a manually curated term list; RedCaps and CC-3M captions are not filtered in this way and have more peaked distributions reflecting natural language usage.

Word count statistics: Table 2 (top) compares linguistic diversity between datasets by computing the number of unique unigrams (words), bigrams, and trigrams occurring at least 10 times. This reveals that CC-3M has surprisingly little linguistic diversity, having fewer unique unigrams than SBU despite having ≈3× more captions. RedCaps has the most unique terms, with more than 4× unigrams and more than 3× bigrams and trigrams compared to CC-3M. Greater linguistic diversity means that models trained on RedCaps should recognize a larger variety of visual concepts.

Linguistic statistics: We use part-of-speech (POS) tagging to dig deeper into the linguistic diversity of RedCaps. We use the en_core_web_trf model from SpaCy [63] to tag POS in all captions. Figure 6 (top) shows the number of unique words per POS appearing at least 10 times. RedCaps has >2× more common nouns and >4× more proper nouns than SBU, and >2× more adjectives and >1.5× more verbs than CC-3M. Nouns in CC-3M are artificially deflated, since their pipeline replaces proper nouns and named entities with hypernyms (which may explain their low unigram counts in Table 2). Figure 6 (bottom) shows the most frequently occurring nouns in RedCaps. We see a variety of common nouns, both concrete (cat, plant) and abstract (day, time). We find that nouns like guy, baby, and boy frequently occur with RedCaps images of pet animals. Moreover, the most frequent proper nouns include many cities (chicago, london), states (california, texas), and countries (japan, germany, india), indicating the geographical diversity of RedCaps.
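These diversity statistics are straightforward to reproduce; the sketch below counts n-grams occurring at least 10 times and tags parts of speech with the same SpaCy model, assuming the captions are already loaded into a Python list (corpus loading and the exact tokenization we used are not shown).

```python
from collections import Counter
import spacy

def ngram_diversity(captions, n, min_count=10):
    """Count unique n-grams that appear at least `min_count` times."""
    counts = Counter()
    for caption in captions:
        tokens = caption.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c >= min_count)

def pos_vocabulary(captions, min_count=10):
    """Unique words per part-of-speech tag (NOUN, PROPN, ADJ, VERB, ...)."""
    nlp = spacy.load("en_core_web_trf", disable=["parser", "ner", "lemmatizer"])
    per_pos = {}
    for doc in nlp.pipe(captions, batch_size=256):
        for token in doc:
            per_pos.setdefault(token.pos_, Counter())[token.text] += 1
    return {pos: sum(1 for c in words.values() if c >= min_count)
            for pos, words in per_pos.items()}

captions = ["itap of a northern cardinal", "my shiba did a mlem"]  # toy examples
print(ngram_diversity(captions, n=2, min_count=1))
```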
We aim to show that RedCaps offers a unique style of data for both vision and V&L applications. We demonstrate both applications by adapting VirTex [19], a recent method for pre-training visual representations by performing image captioning as a proxy task. In this section, we measure the effect of data quality on downstream vision tasks by training VirTex models with the same architecture but different datasets - SBU, CC-3M, and RedCaps. To control for RedCaps's size, we also train on a subset of RedCaps instances from 2020 - this has a size comparable to CC-3M (3.2M vs 2.9M).

Extending VirTex to VirTex-v2: VirTex comprises an image encoder (visual backbone) and a pair of text decoders (textual head) that predict the caption token-by-token in forward and backward directions. The base model from [19] used a ResNet-50 [21] visual backbone and Transformers [64] in the textual head that are L = 1 layers deep and H = 2048 dimensions wide, and was trained on COCO Captions [17] (118K images). We modify this model to VirTex-v2 in order to scale to larger noisy datasets, making the following changes:

- Model architecture: We use deeper Transformers with L = 6 layers. To balance the memory requirements, we reduce the width to H = 512. We use the recent Pre-LN Transformer variant [35, 65, 66] that is more stable when training large transformers [67] - LayerNorm [68] is moved inside the residual connection, and we add LayerNorm before the prediction layer.

- Tokenization: Similar to VirTex, we use the SentencePiece tokenizer [69] with BPE [70]. We build a vocabulary of 30K tokens from the combined caption corpus of SBU, CC-3M, and RedCaps. For fair comparison, we use the same vocabulary for all models trained on different datasets. When training with RedCaps, we prefix the caption with subreddit tokens. We use wordsegment [71] to break subreddit names into words (e.g. itookapicture → i took a picture).

- Training details: We use AdamW [72, 73] with weight decay 10^-2 and a max learning rate of 5 × 10^-4 with linear warmup for the first 10K iterations, followed by cosine decay [74] to zero. We also use label smoothing (ε_ls = 0.1) [54], which has improved language generation for machine translation [64]. We train for 1.5M iterations with a total batch size of 256 across 8× 2080Ti GPUs. We save checkpoints every 2000 iterations, and average the last five checkpoints for use in downstream tasks and image captioning. All other details remain unchanged from [19]. We have open-sourced all the training code and pre-trained checkpoints, available at https://redcaps.xyz.

We evaluate the quality of visual representations learned from SBU, CC-3M, and RedCaps by training VirTex-v2 models on each, then transferring the visual backbone to image classification and instance segmentation on eleven different downstream datasets. Our evaluation setup closely follows recent works on self-supervised [75][76][77] and language-supervised [19, 29] learning. We describe the main evaluation settings here; see Appendix F for more details.

Zero-shot image classification: Training with language supervision enables zero-shot transfer to downstream tasks without any task-specific training [29, 78]. We evaluate the utility of different datasets for representation learning by comparing zero-shot performance on seven classification datasets: Oxford-IIIT Pets [79], Food-101 [80], Flowers-102 [81], Stanford Cars [82], Country-211 [29], SUN-397 [83], and Birdsnap [84]. Inspired by CLIP [29], we perform zero-shot classification by designing one prompt per category in the target dataset and ranking the log-probabilities predicted by the trained captioning model for each prompt, averaging predictions from the forward and backward Transformers. For SBU and CC-3M we follow CLIP's prompt format.

Table 3: Transfer learning: zero-shot and linear probe. We train VirTex-v2 models on different image-text datasets, then transfer the learned features to seven downstream classification datasets (N = #classes). Models trained on RedCaps perform best on all datasets except one.

Results are shown in Table 3 (top). VirTex-v2 models trained on RedCaps outperform those trained on SBU and CC-3M by wide margins on six out of seven datasets. This is not due to RedCaps's larger size: models trained on RedCaps-20 also outperform those trained on CC-3M.

Linear probe image classification: We also evaluate image classification on these datasets by training linear models over frozen visual features. Our evaluation details exactly follow CLIP - we use scikit-learn [85] logistic regression with L-BFGS. We train for 1K iterations, and search the L2 regularization λ over 96 logarithmically spaced values in [10^-6, 10^6] by validating on a held-out 10% of the training data.
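A minimal sketch of this linear probe is shown below, assuming backbone features have already been extracted into NumPy arrays. It mirrors the CLIP-style protocol described above rather than our exact evaluation code; for brevity it brute-forces the λ grid instead of the staged search detailed in the appendix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# train_feats / test_feats: (N, 2048) frozen backbone features; *_labels: (N,)
def linear_probe(train_feats, train_labels, test_feats, test_labels):
    # hold out 10% of training data to select the L2 regularization strength
    n_val = len(train_feats) // 10
    lambdas = np.logspace(-6, 6, num=96)          # 96 log-spaced values in [1e-6, 1e6]
    best_lam, best_acc = None, -1.0
    for lam in lambdas:
        clf = LogisticRegression(C=1.0 / lam, solver="lbfgs", max_iter=1000)
        clf.fit(train_feats[n_val:], train_labels[n_val:])
        acc = clf.score(train_feats[:n_val], train_labels[:n_val])
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    # retrain on the full training set with the selected regularization
    clf = LogisticRegression(C=1.0 / best_lam, solver="lbfgs", max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)     # top-1 accuracy
```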
Results are shown in Table 3. Note that these results are not directly comparable to CLIP [29], which uses a larger transformer (12 vs 6 layers), a larger dataset (400M vs 12M instances), longer training (12.8B image updates vs 384M), and prompt ensembling. Our goal is not to achieve state-of-the-art performance, but instead to compare the impact of different data sources on the quality of learned visual features.

Other tasks: We evaluate on standard transfer tasks with four other datasets: PASCAL VOC and ImageNet-1k linear classification with frozen features, and instance segmentation [86] on COCO [26] and LVIS [27] with end-to-end fine-tuning of Mask R-CNN. These tasks follow the same setup as [19]. On ImageNet, we also perform k-nearest neighbor classification (k = 20), following [87, 88], and zero-shot classification as described above. Results are shown in Table 4.

We hope that the human-interaction flavored data of RedCaps enables more human-like and conversational image captioning models. We use VirTex-v2 pre-trained models for image captioning - we use nucleus sampling [89] with nucleus size 0.9 to decode a caption from the forward Transformer. In this section, we demonstrate all results on an additional held-out test set of 1K instances sampled randomly from image posts submitted to our selected subreddits in the first week of 2021.

Evaluating caption predictions: Automatic captioning evaluation metrics correlate poorly with human judgement [90, 91]. We thus evaluate caption predictions via user studies. We sample captions from models trained on RedCaps and CC-3M, then present crowd workers with the image and both captions. Workers are told that one caption is written by a human and the other is machine-generated, and asked to guess which is human-written. We take a majority vote among three workers for each of our 1K test images. Workers preferred captions from the RedCaps-trained model for 633/1000 images. We run a similar study to compare against ground-truth captions, and workers still prefer generated captions for 416/1000 images. Some qualitative results are shown in Figure 7; more are shown in Appendix (Figure 10).

Figure 7: Human evaluation: CC-3M vs. RedCaps. We decode image captions from VirTex-v2 models trained on CC-3M and RedCaps. We show both captions (excluding subreddit names) to three crowd workers and ask them to guess which is more likely to be written by a human. All three workers chose the underlined caption for each of the displayed images. We found that workers preferred organic references (little guy vs animal), witty remarks (snow sculpture), and specific mentions (singapore) from the RedCaps-trained model. The negative cases are mostly instances where the RedCaps-trained model makes blatant errors in identifying common visual objects (e.g. pizza).

Subreddit-conditioned generation: Captions from different subreddits have distinct styles, focusing on different image aspects or using community-specific jargon. We use this observation to generate captions with distinct styles by prompting a RedCaps-trained model with different subreddits. Figure 8 shows examples of such diverse captions for images; see Appendix (Figure 11) for more.
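Decoding in both of these settings uses nucleus (top-p) sampling with p = 0.9. Below is a self-contained sketch of one top-p sampling step over next-token logits, assuming a PyTorch tensor of vocabulary logits; the captioning model itself and the subreddit-prompted decoding loop are not shown.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample a token id from the smallest set of tokens whose probability mass exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep every token strictly below the p threshold, plus the one that crosses it
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return int(sorted_ids[choice].item())

# toy example: a 5-token vocabulary
logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -2.0])
token_id = nucleus_sample(logits, p=0.9)
```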
Figure 8: Subreddit-controlled caption style. We prompt the VirTex-v2 model trained on RedCaps with subreddit names while decoding captions. We observe that such conditioning captures subtle linguistic structures (r/itookapicture: itap of ..., r/somethingimade: i made ...) or changes the main subject of the caption (r/earthporn: venice, r/food: cold beer). However, for completely unrelated images (saturn), the model tends to ignore the conditioning while generating captions.

RedCaps is directly related to many recent efforts on building large datasets of image-text pairs from the internet without expensive human annotation. Two notable datasets are SBU [1] and Conceptual Captions [2]. Originally intended for image-text retrieval and image captioning, they are now widely used for training generic V&L representations [3][4][5][6][7][8][9][10][11][92] that transfer to downstream tasks like visual question answering [12][13][14], referring expressions [93], and visual reasoning [15, 16]. More recent works build larger datasets specifically for V&L pre-training, e.g. LAIT [94], Conceptual-12M [31], and Wikipedia-ImageText [62]. Similar to these datasets, RedCaps offers rich semantic data for pre-training applications. However, our choice of data source, and hence the data quality, is unique.

Image-text datasets are now also used for learning visual features. Li et al. [78] trained visual N-gram models on YFCC-100M [95]; [19, 20] learn features from COCO Captions [17] that are competitive with supervised ImageNet training [21, 96] on many downstream tasks [22, 24, [26][27][28]], and [29, 30] scale up to non-public datasets that are much larger than RedCaps. A core motivation for collecting image-text data is scaling to larger datasets without bearing annotation costs. Related to this goal are efforts that learn from large quantities of noisy non-text labels for web images, such as WebVision [97], YFCC-100M [95], JFT-300M [98, 99], and Instagram-3.5B [100].

This paper has introduced RedCaps, a large-scale dataset of images and captions collected from Reddit. As a source of data, Reddit is appealing: text and images are both created and shared by people, for the explicit purpose of starting a discussion with other people, leading to natural and varied content. Its subreddit structure allows manual curation of our dataset's content without labeling individual instances. We utilize this structure to collect a dataset focused on animals, objects, scenery, and activities, and specifically aim to minimize the appearance of people. We have shown that RedCaps is useful for learning visual representations that transfer to many downstream tasks, including zero-shot settings that use no task-specific training data. We have also shown that RedCaps can be used to learn image captioning models that generate high-quality text in multiple styles.

RedCaps is not without flaws. We have tried to minimize problematic content through subreddit curation and automated filtering, but the unfathomable nature of large data means that RedCaps may contain a small number of instances with NSFW images or harmful language. Reddit's demographic biases mean that RedCaps may not equally represent all groups. Users should carefully consider these limitations for any new tasks developed on RedCaps, and should be especially wary of applications that make predictions about people. Despite these limitations, we hope that RedCaps will help enable a wide variety of new applications and advances in vision and language.

We thank Mohit Virli for suggestions on the project website. We thank Mohamed El Banani, Nilesh Kulkarni, Stefan Lee, Ramprasaath Selvaraju, Ramakrishna Vedantam, and Erik Wijmans for helpful discussions and feedback on the paper. We thank Priya Goyal and Ishan Misra for help related to the VISSL codebase.
We thank all anonymous reviewers for constructive feedback during the review phase. We also thank the UMich ARC-TS team for support with GPU cluster management.

We curated RedCaps from a manually chosen set of 350 subreddits, as described in Section 2.1. All these subreddits are listed below alphabetically with the number of instances in each subreddit.

In Section 4.2, we conducted user studies to evaluate the quality of caption predictions from VirTex-v2 models trained on CC-3M and RedCaps. Here are some additional details of the evaluation procedure. We conduct the user study on Amazon Mechanical Turk (AMT). The task is framed as a guessing game - we tell the crowd workers that an AI bot is trying to impersonate humans by generating its own image captions. We set the price of this task at $0.3 for a batch of 5 images and obtain worker choices for 1K images, with 3 workers per image. Refer to the detailed instructions in Figure 9 below. Our final accuracy from this evaluation shows that humans preferred the RedCaps pre-trained model over CC-3M for 633/1000 images.

Each instance in RedCaps belongs to one of 350 subreddits. These subreddits serve as image labels, and can cluster visually similar images together. Here we observe this effect by visualizing the visual feature space of RedCaps images per subreddit. We choose an off-the-shelf ResNeXt-101 32×8d pre-trained on 940M Instagram images [100] as a feature extractor. We extract 2048-dimensional global average pooled features for all images and average them per subreddit, resulting in a single 2048-dimensional vector per subreddit. We perform dimensionality reduction using Barnes-Hut t-SNE [101] with default parameters in scikit-learn. The visualization is shown below in Figure 12. This feature space reveals that subreddits of similar topics form very tight local clusters, such as dogs in the top-center (r/corgi, r/husky, r/lookatmydog, r/pugs) and food and drinks in the top-right (r/bbq, r/cocktails, r/eatsandwiches, r/pizza, r/spicy, r/tea). Hence, manually selecting subreddits lets us steer the distribution of visual concepts in RedCaps.

Table 5: Additional results: zero-shot (top-5) and low-shot transfer. We report top-5 accuracy of zero-shot image classification on datasets evaluated in our transfer experiments. We also perform low-shot transfer to six datasets - end-to-end fine-tuning on a 1K randomly sampled class-balanced subset of each dataset. Models trained on RedCaps perform best on all datasets except SUN-397.

Linear probe image classification: We use scikit-learn Logistic Regression with the L-BFGS solver, 1000 maximum iterations, and tolerance set to 10^-4. For each dataset, we hold out a randomly sampled 10% subset of the training data and use it for validation. Similar to CLIP, we start by sweeping the L2 regularization parameter λ ∈ {10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6} and select the two λ values with the highest top-1 accuracy on the held-out split (these were always consecutive in our experiments). We then zoom into this range with eight equally spaced λ per decade in logarithmic space to find the best value. Finally, we use this λ to train on the combined training data (including the held-out 10%) and report top-1 accuracy on the test split. The number of instances in the training and test splits is exactly the same as used for evaluating CLIP.
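As a concrete illustration of this two-stage search, the sketch below builds the refined λ grid from a coarse sweep; the held-out accuracies in the example are hypothetical, and the evaluation of each λ (a logistic regression fit, as in the linear probe sketch earlier) is omitted.

```python
import numpy as np

def refined_lambda_grid(coarse_lambdas, coarse_accs, per_decade=8):
    """Zoom into the best region of a coarse log-spaced lambda sweep.

    Returns a fine grid with `per_decade` log-spaced points per decade,
    spanning the two best (consecutive) coarse values.
    """
    best = int(np.argmax(coarse_accs))
    if best == 0:
        lo = 0
    elif best == len(coarse_accs) - 1 or coarse_accs[best - 1] >= coarse_accs[best + 1]:
        lo = best - 1
    else:
        lo = best
    hi = lo + 1
    lo_exp, hi_exp = np.log10(coarse_lambdas[lo]), np.log10(coarse_lambdas[hi])
    num = int(round(per_decade * (hi_exp - lo_exp))) + 1
    return np.logspace(lo_exp, hi_exp, num=num)

coarse = np.logspace(-6, 6, num=7)                    # 1e-6, 1e-4, ..., 1e6
accs = [0.41, 0.55, 0.68, 0.71, 0.64, 0.52, 0.40]     # hypothetical held-out accuracies
print(refined_lambda_grid(coarse, accs))              # 17 values between 1e-2 and 1e0
```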
Low-shot classification: Another common way of transfer learning is end-to-end fine-tuning of learned features. Hence, in addition to zero-shot and linear probe classification, we also transfer to low-shot image classification on a subset of six datasets from the main experiments - Oxford-IIIT Pets [79], Food-101 [80], Flowers-102 [81], Stanford Cars [82], SUN-397 [83], and Birdsnap [84]. For each dataset, we randomly sample 1000 instances such that their class distribution stays balanced. We perform end-to-end fine-tuning of pre-trained weights and follow the same training schedule for every dataset, closely following VTAB [102]. We use SGD with momentum 0.9 and weight decay 10^-6. We use a batch size of 256 distributed across 8 GPUs (with synchronized BatchNorm [103]) and train for 5000 iterations (∼1250 epochs with 1K examples). We use a maximum learning rate of 0.1 which is multiplied by 0.1 at iterations 1500, 3000, and 4500. We use the VISSL [104] codebase for all the low-shot transfer experiments. Results are shown in Table 5. Similar to zero-shot transfer, models trained on RedCaps and RedCaps-20 perform best on all but one dataset.

Datasheets for datasets, introduced by Gebru et al. [45], serve as a medium of communication between the creators and consumers (users) of a dataset. They effectively consolidate the motivation, creation process, composition, and intended uses of a dataset as a series of questions and answers. In this document, we provide a datasheet for the RedCaps dataset. It accompanies the first version (v1.0) released in October 2021 with our accepted paper at the NeurIPS 2021 Track on Datasets and Benchmarks. For the rest of this document:
- All mentions of RedCaps and all reported data statistics refer to RedCaps v1.0.
- All mentions of the dataset website refer to https://redcaps.xyz.
- All mentions of the data collection code refer to the redcaps-downloader repository available at https://github.com/redcaps-dataset/redcaps-downloader (also linked on the website).

Q1. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
-Large datasets of image-text pairs are widely used for pre-training generic representations that transfer to a variety of downstream vision and vision-and-language tasks. Existing public datasets of this kind were curated from search engine results (SBU Captions [1]) or HTML alt-text from arbitrary web pages (Conceptual Captions [2, 31]). They performed complex data filtering to deal with noisy web data. Due to aggressive filtering, their data collection is inefficient and diversity is artificially suppressed. We argue that the quality of data depends on its source, and the human intent behind its creation. In this work, we explore Reddit - a social media platform - for curating high quality data. We introduce RedCaps - a large dataset of 12M image-text pairs from Reddit. While we expect the use-cases of RedCaps to be similar to existing datasets, we discuss how Reddit as a data source leads to fast and lightweight collection, better data quality, lets us easily steer the data distribution, and facilitates ethically responsible data curation.

Q2. Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

We collected RedCaps without any monetary costs, since no part of our dataset requires annotations from crowd workers or contractors. This research work was partially supported by the Toyota Research Institute (TRI).
However, note that this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

-No.

Q5. What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
-Each instance in RedCaps represents a single Reddit image post.

Q6. How many instances are there in total (of each type, if appropriate)?
-There are nearly 12M (12,011,111) instances in RedCaps.

Q7. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
-RedCaps is a small sample drawn from all the data uploaded to Reddit. Millions of Reddit users submit image posts across thousands of subreddits on a daily basis. We hand-picked 350 subreddits containing high-quality photographs with descriptive captions, while leaving out many subreddits focused on other topics like politics, religion, science, and memes. Even within the selected subreddits, we filtered instances to improve data quality and mitigate privacy risks for people appearing in images. Hence, RedCaps data does not fully represent Reddit.

Q8. What data does each instance consist of? "Raw" data (e.g., unprocessed text or images) or features? In either case, please provide a description.
-Each instance in RedCaps consists of nine metadata fields:
• "image_id": Unique alphanumeric ID of the image post (assigned by Reddit).
• "author": Reddit username of the image post author.
• "url": Static URL for downloading the image associated with the post.
• "raw_caption": Textual description of the image, written by the post author.
• "caption": Cleaned version of "raw_caption" by us (see Q35).
• "subreddit": Name of the subreddit where the post was submitted.
• "score": Net upvotes (discounting downvotes) received by the image post.
• "created_utc": Integer time epoch (in UTC) when the post was submitted to Reddit.
• "permalink": Partial URL of the Reddit post (https://reddit.com/).

Q9. Is there a label or target associated with each instance? If so, please provide a description.
-No, we do not define any label or target for the instances. Targets are task-dependent. RedCaps can be used for a variety of tasks such as image captioning (inputs = images, targets = captions), image classification (inputs = images, targets = subreddits), text-to-image generation (inputs = captions, targets = images), or self-supervised visual learning (inputs = images, no targets).

Q10. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

Q11. Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit.
-Some implicit relationships do exist in our data.
All instances belonging to the same subreddit are likely to have highly related visual and textual content. Moreover, multiple images posted by a single Reddit user may be highly related (photos of their pets, cars, etc.).

Q12. Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
-We intend our dataset to be primarily used for pre-training with one or more specific downstream task(s) in mind. Hence, all instances in our dataset would be used for training, while the validation split is derived from downstream task(s). If users require a validation split, we recommend sampling it such that it follows the same subreddit distribution as the entire dataset.

Q13. Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
-RedCaps is noisy by design since image-text pairs on the internet are noisy and unstructured. Some instances may also have duplicate images and captions - Reddit users may have shared the same image post in multiple subreddits. Such redundancies constitute a very small fraction of the dataset, and should have almost no effect on training large-scale models.

Q14. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, (a) Are there guarantees that they will exist, and remain constant, over time? (b) Are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created)? (c) Are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
-We do not distribute images of our dataset to respect Reddit user privacy and to limit our storage budget. Instead we provide image URLs ("url", Q8) that point to images hosted on either Reddit, Imgur, or Flickr image servers. In response to the sub-questions: (a) These image servers ensure stable access unless the Reddit user deletes their image post. (b) Yes, Reddit archives all the metadata of submitted posts. For images, Reddit only archives the URL and not the media content, giving full control of accessibility to the users. (c) All image URLs are freely accessible. It is unlikely for the image servers to restrict access in the future, given their free accessibility over the past decade.

Q15. Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals non-public communications)? If so, please provide a description.
-No, the subreddits included in RedCaps do not cover topics that may be considered confidential. All posts were publicly shared on Reddit prior to inclusion in RedCaps.

Q16. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
-The scale of RedCaps means that we are unable to verify the contents of all images and captions.
However, we have tried to minimize the possibility that RedCaps contains data that might be offensive, insulting, threatening, or might cause anxiety via the following mitigations: (a) We manually curate the set of subreddits from which to collect data; we only chose subreddits that are not marked NSFW and which generally contain non-offensive content. (b) Within our curated subreddits, we did not include any posts marked NSFW. (c) We removed all instances whose captions contained any of 400 potentially offensive words or phrases. Refer to Section 2.2 in the main paper. (d) We removed all instances whose images were flagged NSFW by an off-the-shelf detector. We manually checked 50K random images in RedCaps and found one image containing nudity (exposed buttocks; no identifiable face). Refer to Section 2.2 in the main paper.

Q17. Does the dataset relate to people? If not, you may skip the remaining questions in this section.
-The dataset pertains to people in that people wrote the captions and posted the images to Reddit that we curate in RedCaps. We made specific design choices while curating RedCaps to avoid large quantities of images containing people: (a) We collect data from manually curated subreddits whose content primarily pertains to animals, objects, places, or activities. We exclude all subreddits whose primary purpose is to share and describe images of people (such as celebrity photos or user selfies). (b) We use an off-the-shelf face detector to find and remove images with potential presence of human faces. We manually checked 50K random images in RedCaps (Q16) and found 79 images with identifiable human faces - the entire dataset may have ≈19K (0.15%) images with identifiable people. Refer to Section 2.2 in the main paper.

Q18. Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.
-RedCaps does not explicitly identify any subpopulations. Since some images contain people and captions are free-form natural language written by Reddit users, it is possible that some captions may identify people appearing in individual images as part of a subpopulation.

Q19. Is it possible to identify one or more natural persons, either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
-Yes, all instances in RedCaps include the Reddit usernames of their post authors. This could be used to look up the Reddit user profile, and some Reddit users may have identifying information in their profiles. Some images may contain human faces (Q17) which could be identified by appearance. However, note that all this information is already public on Reddit, and searching it in RedCaps is no easier than searching directly on Reddit.

Q20. Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.
-Highly unlikely: the data from our manually selected subreddits does not contain sensitive information of the above forms. If some instances do have such information, note that it is already publicly available on Reddit.

Q21. Any other comments?
-No.
Q22. How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
-We collected instance IDs using the Pushshift API (https://pushshift.io) and the remaining metadata fields (Q8) using the Reddit API (https://www.reddit.com/wiki/api). All fields except "caption" are available in API responses; "caption" is derived by applying text preprocessing to the "raw_caption" field (Q35).

Q23. What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
-We collected all data using compute resources at the University of Michigan. The code for querying APIs and filtering data is implemented in Python. We validated our implementation by manually checking a few RedCaps instances against their posts on https://reddit.com.

Q24. If the dataset is a sample from a larger set, what was the sampling strategy?
-RedCaps is a small sample containing data from 350 subreddits out of thousands of subreddits on Reddit. We hand-picked each subreddit for our dataset based on its content. See Q7, Q16, and Q17 for details on how we selected each subreddit.

Q25. Who was involved in the data collection process (e.g., students, crowd-workers, contractors) and how were they compensated (e.g., how much were crowd-workers paid)?
-Our data collection pipeline is fully automatic and does not require any human annotators. Reddit users have uploaded image posts whose metadata is a part of RedCaps - we did not directly interact with these users.

Q26. Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please provide a description of the timeframe.
-RedCaps contains image posts that were uploaded to Reddit between 2008-2020. We collected all data in early 2021, which we used to conduct experiments for our NeurIPS 2021 submission. Since Reddit posts may get deleted over time, we re-collected a fresh version in August 2021 after acceptance (and re-ran all our experiments). Reddit posts observe the most user activity (upvotes, comments, moderation) for six months after their creation - posts from 2008-2020 are less likely to be updated after August 2021.

Q27. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
-We did not conduct a formal ethical review process via institutional review boards. However, as described in Section 2.2 of the main paper and Q16, we employed several filtering mechanisms to try and remove instances that could be problematic.

Q28. Does the dataset relate to people? If not, you may skip the remaining questions in this section.
-Some images of RedCaps may contain images of people (see Q17).

Q29. Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
-We collected data submitted by Reddit users indirectly through the Reddit API.
However, users agree to Reddit's User Agreement regarding redistribution of their data by Reddit.

Q30. Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.
-No. Reddit users are anonymous by default, and are not required to share their personal contact information (email, phone numbers, etc.). Hence, the only way to notify the authors of RedCaps image posts is by sending them private messages on Reddit. This is practically difficult to do manually, and programmatically sending a templated message to millions of users would be classified as spam and blocked by Reddit.

Q31. Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.
-Users did not explicitly consent to the use of their data in our dataset. However, by uploading their data on Reddit, they consent that it would appear on the Reddit platform and will be accessible via the official Reddit API (which we use to collect RedCaps).

Q32. If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).
-Users have full control over the presence of their data in our dataset. If users wish to revoke their consent, they can delete the underlying Reddit post - it will be automatically removed from RedCaps since we distribute images as URLs. Moreover, we provide an opt-out request form on our dataset website for anybody to request removal of an individual instance if it is potentially harmful (e.g. NSFW, violates privacy, harmful stereotypes, etc.).

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

Preprocessing, Cleaning, and/or Labeling

Q35. Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.
-We filtered all image posts with < 2 net upvotes, and those marked NSFW on Reddit. We remove character accents, emojis, non-Latin characters, and sub-strings enclosed in brackets ((.*), [.*]), and replace social media handles (words starting with '@') with a special [USR] token. Refer to Section 2.1 in the main paper for more details. We also remove additional instances with a focus on ethical considerations; see Q16 and Q17 for more details.

Q36. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data.
-We provide the unprocessed captions obtained as-is from Reddit as part of our annotations (see "raw_caption" in Q8).
However, we entirely discard all instances that were filtered out on ethical grounds - based on the presence of faces, NSFW content, or harmful language.

-We anticipate that the dataset could be used for a variety of vision-and-language (V&L) tasks, such as image or text retrieval or text-to-image synthesis.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
-This is very difficult to anticipate. Future users of our dataset should be aware of Reddit's user demographics (as described in Section 2.2 of the main paper), which might subtly influence the types of images, languages, and ideas that are present in the dataset. Moreover, users should be aware that our dataset intentionally excludes data from subreddits whose primary purpose is to share images that depict or describe people.

Q43. Are there any tasks for which the dataset should not be used? If so, please provide a description.
-Broadly speaking, our dataset should only be used for non-commercial academic research. Our dataset should not be used for any tasks that involve identifying features related to people (facial recognition, gender, age, ethnicity identification, etc.) or making decisions that impact people (mortgages, job applications, criminal sentences; or moderation decisions about user-uploaded data that could result in bans from a website). Any commercial and for-profit uses of our dataset are restricted - it should not be used to train models that will be deployed in production systems as part of a product offered by businesses or government agencies.

Q44. Any other comments?
-No.

Q45. Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
-Yes, our dataset will be publicly available.

Q46. How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
-We distribute our dataset as a ZIP file containing all the annotations (JSON files). Users will have to download the images themselves by using our data collection code. All uses of RedCaps should cite the NeurIPS 2021 paper as the reference.

Q47. When will the dataset be distributed?
-The dataset will be publicly available starting from October 2021.

Q48. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
-Uses of our dataset are subject to the Reddit API terms (https://www.reddit.com/wiki/api-terms). Additionally, users must comply with the Reddit User Agreement, Content Policy, and Privacy Policy - all accessible at https://www.redditinc.com/policies. The data collection code is released with an MIT license.

Q49. Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
Q47. When will the dataset be distributed?
-The dataset will be publicly available starting from October 2021.

Q48. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
-Uses of our dataset are subject to the Reddit API terms (https://www.reddit.com/wiki/api-terms). Additionally, users must comply with the Reddit User Agreement, Content Policy, and Privacy Policy - all accessible at https://www.redditinc.com/policies. The data collection code is released under an MIT license.

Q49. Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
-The images corresponding to our instances are legally owned by Reddit users. Our dataset users can download them from the URLs we provide in annotation files, but redistributing images for commercial use is prohibited.

Q50. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

Q52. Who will be supporting/hosting/maintaining the dataset?
-The dataset is hosted using the Dropbox service provided by the University of Michigan. All information about the dataset, including links to the paper, code, and future announcements, will be accessible at the dataset website (https://redcaps.xyz).

Q53. How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
-The contact emails of the authors are available on the dataset website and in this datasheet.

Q54. Is there an erratum? If so, please provide a link or other access point.
-There is no erratum for our initial release. We will version all errata as future releases (Q55) and document them on the dataset website.

Q55. Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?
-We will update our dataset once every year and announce it on the dataset website. These future versions would include new instances corresponding to image posts made in 2021 and beyond, and would remove instances that were requested to be removed via the opt-out form (Q32).

Q56. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
-Some images in RedCaps may depict people (Q17). Rather than directly distributing images, we distribute URLs that point to the original images uploaded by Reddit users. This means that users retain full control of their data - any post deleted from Reddit will be automatically removed from RedCaps (see also Q10, Q14, Q31).
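To make this URL-based removal mechanism concrete, the short sketch below (not part of the released code) checks whether an annotated image URL is still reachable before use. It assumes that removed or deleted images return a non-200 HTTP status, which may vary across image hosts.

import requests


def is_image_available(url: str, timeout: float = 10.0) -> bool:
    # Returns True if the URL still serves content with HTTP 200.
    # Assumption: removed/deleted images yield a non-200 status; actual
    # behavior may differ between hosts (i.redd.it, imgur, etc.).
    try:
        with requests.get(url, stream=True, timeout=timeout) as resp:
            return resp.status_code == 200
    except requests.RequestException:
        return False


# Example: filter out instances whose images have been deleted by their owners.
# annotations = [ann for ann in annotations if is_image_available(ann["url"])]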
Q57. Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.
-A new version release of RedCaps will automatically deprecate its previous version. We will only support and maintain the latest version at all times. Deprecated versions will remain accessible on the dataset website for a few weeks, after which they will be removed. We decided to deprecate old versions to ensure that any data that is requested to be removed (Q32) will no longer be accessible in future versions.

Q58. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
-Anyone can extend RedCaps by using our data collection code (linked on the website). We are open to accepting extensions via personal communication with contributors. Otherwise, our code and data licenses allow others to create independent derivative works (with proper attribution), as long as they are used for non-commercial academic research.
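As a rough illustration of such an extension, one could gather new image posts from a manually chosen subreddit with a Reddit API wrapper such as PRAW. This is not the official RedCaps collection code: the credentials, subreddit name, and output schema are placeholders, and the released pipeline applies additional filtering (e.g., for faces, NSFW content, and harmful language) that is omitted here.

import json

import praw  # Python Reddit API Wrapper

# Placeholder credentials; register a script app on Reddit to obtain real ones.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="redcaps-extension-sketch",
)

records = []
for submission in reddit.subreddit("itookapicture").new(limit=500):
    # Keep only safe-for-work posts that link directly to an image file.
    if submission.over_18 or not submission.url.endswith((".jpg", ".jpeg", ".png")):
        continue
    records.append({
        "image_id": submission.id,
        "url": submission.url,
        "caption": submission.title,
        "created_utc": int(submission.created_utc),
    })

# Write annotations in the same illustrative schema used in the earlier sketch.
with open("itookapicture_new.json", "w") as f:
    json.dump({"annotations": records}, f)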