key: cord-0604805-l4n14cc5 title: Opal: Multimodal Image Generation for News Illustration authors: Liu, Vivian; Qiao, Han; Chilton, Lydia date: 2022-04-19 doc_id: 604805 cord_uid: l4n14cc5

Multimodal AI advancements have presented people with powerful ways to create images from text. Recent work has shown that text-to-image generations are able to represent a broad range of subjects and artistic styles. However, translating text prompts into visual messages is difficult. In this paper, we address this challenge with Opal, a system that produces text-to-image generations for editorial illustration. Given an article text, Opal guides users through a structured search for visual concepts and provides pipelines allowing users to illustrate based on an article's tone, subjects, and intended illustration style. Our evaluation shows that Opal efficiently generates diverse sets of editorial illustrations, graphic assets, and concept ideas. Users with Opal were more efficient at generation and generated over two times more usable results than users without. We conclude with a discussion of how structured and rapid exploration can help users better understand the capabilities of human-AI co-creative systems.

Text-to-image generative frameworks allow users to generate images based on text. However, while in text and language we can compose any number of visual concepts, there is no guarantee that any prompt passed through a text-to-image AI will produce a quality outcome. These frameworks rely on deep learning models that have been pre-trained on large quantities of data. While this gives users the freedom to experiment with a near-infinite number of visual concepts, what a model is capable of generating is opaque to the user. Recent work [27] has empirically shown that structured prompts such as "{SUBJECT} in the style of {STYLE}" can produce consistent results across a wide variety of styles, and that styles with very salient visual characteristics (such as a specific artist's name) can help guide users towards high-quality aesthetic outcomes. However, these models still perform variably depending on the style and subject matter. One core usability problem with these frameworks is that understanding which combinations of subject and style the model tends to succeed at is often a random, trial-and-error process. When generations fail, they often exhibit distorted, uncanny compositions that fall far afield from anything natural. Because these systems are also stochastic, we explore how to make the challenge of translating text into images systematic, structured, and efficient. An area that would greatly benefit from text-to-image models is news illustration. News illustration refers to the practice of creating visuals that accompany news and journalism feature pieces. News illustrations often capture facets of the article, such as its emotion, levity, and subject matter, with visual metaphors and conceptual illustrations. News illustrators often work under the breakneck pace of a news cycle and the pressure for the news to be presented in a timely, relevant, and eye-catching manner. Illustrations often have to be created rapidly after the article is written, or even concurrently. To help editorial illustrators efficiently create illustrations, we present Opal, a system that guides users through a structured search for visual concepts from text.
From the beginning state of an article text, the structured search helps illustrators arrive at visual concepts and aesthetics that organize and streamline prompt ideation for text-to-image generation. We provide three pipelines that allow users to engage with the article in all its facets: its emotional tone, its subject matter, and its intended illustration style. At each stage of the pipeline, we employ techniques and leverage advancements from natural language processing, such as semantic search and prompt engineering methods that query GPT3 as a knowledge base and association network. We make the following contributions:
• Opal, a system that guides users through a structured search for visual concepts
• An automatic structured prompt engineering technique to help translate text into images
• Annotation studies demonstrating that suggestions prompted from large language models can retrieve associated visual concepts as keywords and approach human gold standards with significantly less effort
• User studies demonstrating the efficacy of this structured prompt engineering technique in a system. Users with Opal were over two times more efficient at generation and generated over two times more usable results than users without.
We end by discussing how keywords prompted from large language models can visually anchor and structure exploration with text-to-image AI, how such technologies can benefit the process of illustrators, and how users can systematically experiment and receive system feedback to understand generative AI capabilities. Our related work pertains to the domains of human-AI co-creation, prompt engineering in natural language processing, and multimodal systems. Generative models have evolved greatly over the past decade in tandem with machine learning. Neural style transfer was an early, popular, and seminal approach, transitively applying style attributes to target images [19]. StyleGAN and StyleGAN2 similarly showed that deep learning models could learn to isolate style attributes within classes of images such as faces [24]. SPADE likewise granted users stylistic control over high-quality images by preserving content information across learning blocks; within their system, users could control style along a continuous dimension, turning a mountain landscape from a cool, cloudy blue atmosphere into something fiery red and volcanic [31]. For generative design and AI, style has long been a popular method of steering and interacting with generative art tools. One rationale for why stylization is so compelling as a core functionality for generative systems could be that subject and style have been empirically found to have different cognitive markers [4, 14, 25]. While style is one vector of interest, a north-star goal for human-AI co-creation systems has always been declarative generation: offering users the ability to semantically control generative outputs. This goal has been approached in a number of domains, from images to music to 3D shapes [1, 3, 10]. For example, within the domain of music, Cococo gave users the ability to control at a local level how happy, sad, or surprising a measure of music was, allowing users to declaratively edit music based on emotion using Semantic Sliders. Language, however, presents one of the highest forms of control as it is, by nature, understandable.
A number of studies have looked at how language models can be embedded into creativity support systems while navigating the trade-offs between suggestion and control. [7] conducted an exploration into the trade-offs between autocomplete and user control for novelists interacting with language models to write fiction. [8] and [22] both investigated how computational models explored the connections from abstract concept to abstract concept via symbols to implement metaphor generation in the vein of creativity support. The rise of transformer models [6] and their success in few-shot learning has led to a new form of interaction with AI models known as prompt engineering. Prompt engineering refers to the formal search for prompts that can condition the model to produce more relevant and higher quality outcomes [5, 34]. At present, it is unknown how prompt engineering compares to finetuning, and whether it is better, as well as whether automatically discovered continuous prompts [18] or discrete prompts are better for prompt engineering [26]. While automatically tuned prompts have shown modestly better performance on certain benchmarks [26], discrete, manually crafted prompts are advantageous in that they are human readable and can draw from existing traditions of work. For example, in the system Sparks [23], the prompt engineering that helped extract ideas from GPT2 was crafted based on narrative and expository theory. This allowed the prompts to be rendered in the system as structured guidance and ideation for their writing support tool. Alongside the GPT family, BERT has long been a popular model to study and utilize [16]. Before prompt engineering was reified as an active line of research and a novel form of interaction with AI systems, BERT's masked language modeling objective was often employed for model completion, in essence also creating something to the effect of a templated prompt. Within [20], BERT was utilized as part of a text-to-image pipeline to produce shape analogies by responding to a masked template. They demonstrated how the commonsense and world knowledge collected by large pretrained models could be incorporated within systems. Likewise, we use GPT3 in this paper by querying it as a knowledge base for keyword, association, and stylistic knowledge. However, we approach GPT3 from a prompt engineering perspective, and provide a number of prompt engineering solutions that connect text and image together on dimensions beyond just shape. Our system adds to a long tradition of work within multimodal authoring tools. [9] connected language to 3D scenes, allowing users to use text prompts to generate 3D scenes. [37] studied color as a multimodal bridge between language and vision by studying how correlations between color and emotion could be computationally modeled. Most recently, multimodal authoring tools have leveraged text, image, and aural prompts to help inspire creative writing [36]. Recent momentum within multimodal representation learning has driven more people to create and open-source multimodal frameworks and systems. One of the latest advancements, CLIP, demonstrated that image and text pairs could be contrastively learned such that the image and text embeddings were optimized to share a multimodal embedding space [32].
Many of the earliest open-source text-to-image generative frameworks to gain traction used CLIP in a discriminator-like fashion, in conjunction with a host of generative models from BigGAN to VQGAN [2, 12, 13, 17]. Newer methods such as diffusion models have also increased output quality [11, 15, 29, 30]. To understand how a text-to-image model could best augment a news illustrator's process, we conducted a co-design process with three news illustrators on a weekly basis over the course of two months. We listened to how the illustrators described their usual process, and how they traditionally translated the text they received from writers into illustrations. Week to week, we had the illustrators produce sets of text-to-image generations for a given news article. These generations explored prompts structured as "{SUBJECT} in the style of {STYLE}", to search for which qualities of prompts could perform well and capture different facets of news articles. As a group, we discussed these text-to-image generations holistically for whether or not they were able to express the subject and style of the prompts. 3.1.1 Traditional News Illustration Process. We first learned how news illustrators usually craft illustrations. All the illustrators confirmed that they are generally given partial information from work-in-progress article drafts to start with: chunks of article text, keywords, working article titles, and high-level descriptions. More often than not, news illustrators do not get well-formed, complete articles to work from. One of the news illustrators said that they often received text from early or late paragraphs of their assigned articles. They rarely received titles, because those generally tended to be written last. Thus, in our system we intended to separate out short lines from longer blurbs, to reflect where illustrators could put in shorter ledes and headlines versus longer draft paragraphs. Styles. We also learned that in news illustration, illustrators tended to stick with their own style by virtue of their expertise. However, the illustrators explored more broadly across styles when they produced text-to-image generations. As a group, we gradually found that there was a certain set of styles that worked better for news, and that these styles tended to connect with the subject on two dimensions: keywords and tones. For example, "glitch art" was a style that worked well for many articles where computers, the Internet, and the digital age were relevant, connecting on keywords. On the other hand, we found that artistic styles were often effective at conveying an emotion across the composition, connecting on tone. For example, action painting as an art style created a sense of dynamic movement, positive energy, and excitement that paired well with news articles describing family fights during Thanksgiving dinners, which we can see in the generation of "food fight in the style of action painting" in Figure 2. Other styles, such as Impressionism, captured abstract qualities such as happiness, wonder, and a sense of tranquility. However, we also learned that prompting the text-to-image models with words that were too abstract made it difficult for the model to find purchase in a recognizable subject. To remedy this, we would expand abstract concepts into symbols for those concepts; for example, we expanded "happiness" to "sun", "beachballs", and "emojis".
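To make this design pattern concrete, the sketch below composes "{SUBJECT} in the style of {STYLE}" prompts from a small set of subjects (including symbols expanded from an abstract tone word) and styles. The subject and style lists here are illustrative stand-ins, not the corpus curated during the co-design sessions.

```python
from itertools import product

# Symbols expanded from an abstract tone word (illustrative values,
# not the curated lists from the co-design sessions).
tone_symbols = {"happiness": ["sun", "beachballs", "emojis"]}

subjects = ["food fight"] + tone_symbols["happiness"]
styles = ["action painting", "glitch art", "Impressionism"]

def structured_prompts(subjects, styles):
    """Compose '{SUBJECT} in the style of {STYLE}' prompts for every pairing."""
    return [f"{subject} in the style of {style}"
            for subject, style in product(subjects, styles)]

for prompt in structured_prompts(subjects, styles):
    print(prompt)  # e.g. "food fight in the style of action painting"
```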
We later began to refer to this approach as finding icons for keywords, and we continued to use it, particularly for tone words. Illustrators liked to attempt to reproduce styles that often accompany news articles, like vector art. Generating in vector art styles tended to be hit or miss: the generations sometimes successfully produced thick, cartoonish, flat illustrations and sometimes drew blanks. Based on experimentation with the illustrators, we curated a list of styles commonly seen in journalism and pared it down to ones that did well when generated with VQGAN+CLIP. These styles included cartoon, vector art, street photography, pencil drawing, flat illustration, and so on. We additionally curated a list of styles that happened to perform well in terms of aesthetic quality from VQGAN+CLIP, which built off of previous research that analyzed the stylistic knowledge spanned by VQGAN+CLIP, creating a style corpus of 125 styles. Week by week, we conducted exercises to try to produce text-to-image generations that could be adequate for news illustration, while searching for design patterns as a way to structure the process. From these exercises, we realized that oftentimes the generations did not necessarily need to be taken as is. Sometimes they were fantastic standalone, but often their artifacts and flaws could be easily edited away. Generations could be collaged, drawn over, and post-processed in a number of ways as visual assets. Illustrators commented that they could work around unclear subjects by using the generations as a background and editing in subjects from real images, or by taking the compositions of generations as idea springboards. The learnings from these weekly co-design sessions helped us justify the design choices that we later implemented in Opal. The design goals were to implement the findings from our co-design sessions with news illustrators. To do so, we developed a simple prompt engineering method to retrieve keyword and tone suggestions. Similarly, we retrieved stylistic information from GPT3 to create a corpus of style information. To generate keyword suggestions for some {ARTICLE TEXT}, we prompted GPT3 with the following prompt: "Here are ten keywords for: {ARTICLE TEXT}". Likewise, we additionally extracted emotions that could have been conveyed through the tone of the text using the following prompt for GPT3: "Here are ten emotions for: {ARTICLE TEXT}". 3.2.2 Icon Suggestions for Keywords and Tones. We again prompted GPT3 using a prompt template: "Here are 10 icons related to:". The wording of this prompt was deliberate on a conceptual level: "icons" are the bridge between a lemma and the image representation of that lemma. In this fashion, we utilized GPT3 as a knowledge base providing associative knowledge. From these expansions, we were able to collect a diverse set of subjects from an unbounded number of options that would have otherwise been unavailable had we used dataset approaches such as Small World of Words. The stylistic knowledge that was searched over was also collected by scraping GPT3 as a knowledge base for artistic styles using the following prompt: "What are some of the defining characteristics of {STYLE} as an art style?" We conducted this "scrape" with the "best-of" parameter set to 3 and the maximum number of tokens set to 256.
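A minimal sketch of how these prompt templates might be issued programmatically is shown below, using the legacy (pre-1.0) OpenAI Completions API. The engine name, temperature, and response parsing are assumptions; the prompt wordings, the best-of value of 3, and the 256-token limit for the style "scrape" follow the description above.

```python
import os
import openai  # legacy (<1.0) Completions API

openai.api_key = os.environ["OPENAI_API_KEY"]

def complete(prompt, max_tokens=64, best_of=1):
    """Query GPT3 as a knowledge base with a templated prompt."""
    response = openai.Completion.create(
        engine="text-davinci-002",   # assumed engine; the paper only says "GPT3"
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.7,             # assumed value
        best_of=best_of,
    )
    return response["choices"][0]["text"]

def suggest_keywords(article_text):
    return complete(f"Here are ten keywords for: {article_text}")

def suggest_tones(article_text):
    return complete(f"Here are ten emotions for: {article_text}")

def suggest_icons(word):
    return complete(f"Here are 10 icons related to: {word}")

def style_description(style):
    # Style "scrape": best-of set to 3 and up to 256 tokens, per the paper.
    return complete(
        f"What are some of the defining characteristics of {style} as an art style?",
        max_tokens=256,
        best_of=3,
    )
```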
For example, for "gothic art", GPT3 elaborated: "Some common characteristics of Gothic art are intricate designs, often featuring pointed arches; tall, thin spires; and large stained glass windows. Gothic art is often associated with Gothic architecture, which is characterized by its pointed arches, ribbed vaults, and flying buttresses." These results were saved as sentences to asymmetrically search over using SBERT's approach for semantic search. For example, upon inputting a query such as "dark and moody", users were presented with results such as "gothic art", "baroque", "black and white photography", and "photo negative", based on the result sentences that GPT3 had returned.

Figure 2: Text-to-image generations that were successful with news illustrators during the co-design process. These generations captured design patterns discussed in the formative study, where subjects and styles were matched based on alignment in keywords and tones. For example, glitch art is relevant to technology and crisis, art deco is relevant to movie theaters, action painting conveys the tone of chaos, and watercolor conveys the tones of calm and hopeful.

For text-to-image generation, we used the checkpoint and configuration of VQGAN+CLIP pretrained on ImageNet with the 16384 codebook size [30]. Each image was generated at 256x256 pixels and iterated on for 100 steps on a local NVIDIA GeForce RTX 3080 GPU or a remote Tesla V100 GPU.

Figure 3: System design: the system takes in a news headline, generates keywords and tones, allows users to generate more icons for keywords, and allows users to get artistic style recommendations. The actual system organizes the exploration pipeline on the left side and the gallery view on the right side.

Opal is an interface composed of two components: a Palette and an Oeuvre, which is by definition "the works of a painter, composer, or author regarded collectively". The Palette provides users with a pipeline of tools to help create an editorial illustration. The user begins with an Article Area, which asks the user to input some article text as well as the article title. As per our learnings from the co-design, this article text could be from any part of the article and does not have to be well formed. These text boxes allowed users to put in short one-line descriptions of the article, which could include an actual article title, as well as longer-form text. The words populating the Keyword Area and the words populating the Tone Area could both be expanded, one by one, into visual icons relevant to the selected keyword or tone; both were retrieved by the same icon suggestion prompt. The next area was the Style Explorer, which is shown in the middle of Figure 3. Users were presented with a text area in which they could enter any query about a style they wanted. This query was then performed as an asymmetric semantic search using SentenceBERT [33]. Alongside the styles came rationales that helped explain why each style was returned by the Style Explorer. The Style Explorer also came with default styles: styles that either performed well during the formative phase of our system design or were included in the curated list of styles that performed well for journalism in particular. We additionally identified a set of diverse styles to use as defaults: "abstract art", "vector art", "documentary photography", "collage", and "sketch". A key feature of the Style Explorer was its ability to match subjects to styles.
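The sketch below illustrates the kind of asymmetric semantic search the Style Explorer performs over the GPT3-scraped style descriptions. The SentenceBERT checkpoint and the corpus entries are assumptions for illustration, not the exact configuration used in Opal.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint: the paper cites SBERT's asymmetric semantic search
# approach but does not name the exact model used.
model = SentenceTransformer("msmarco-distilbert-base-v4")

# Corpus: one descriptive passage per style, collected from GPT3
# (abbreviated, illustrative entries).
style_corpus = {
    "gothic art": "Intricate designs, pointed arches, tall thin spires, and "
                  "large stained glass windows.",
    "baroque": "Dramatic lighting, rich deep color, and a sense of grandeur.",
    "vector art": "Clean lines, flat colors, and simple geometric shapes.",
}

style_names = list(style_corpus)
corpus_embeddings = model.encode(list(style_corpus.values()), convert_to_tensor=True)

def recommend_styles(query, top_k=3):
    """Asymmetric search: a short query against longer style descriptions."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    return [(style_names[hit["corpus_id"]], hit["score"]) for hit in hits]

print(recommend_styles("dark and moody"))
```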
When subjects were selected in the Style Explorer area, they were searched as asymmetric queries against the corpus of style information we collected from GPT3. Lastly, users were given a Prompt Area to put in prompts of their own. In a baseline version of the Opal system, this area and the Article Area were the only features given to users. The system was implemented in Python and Flask as well as HTML/CSS, JavaScript, and Ajax. The text-to-image framework used was originally written by [30] and Katherine Crowson and refactored by the authors [11, 30]. The Opal system automatically generates keywords, tones, and icons by prompting GPT3. The first part of this study aims to address the research questions below, testing the results from GPT3 against the human gold standard for the tasks implemented in Opal:
• RQ1 Does GPT3 perform as well as humans at extracting keywords from an article, extracting tones from an article, expanding keywords into icons, and expanding tones into icons?
• RQ2 Can GPT3 reduce the mental demand of completing these tasks?
• RQ3 Which of the two ways of prompting GPT3 for artistic style recommendations performs better?
We conducted annotation studies to answer the research questions above. To prepare the data used in the study, we randomly selected 5 news articles: 3 articles were from the New York Times' "2021 The Year in Illustration" and 2 articles were from a local newspaper's "Illustrations 2021 Year in Review". From each of the 5 articles, we extracted the title and the first and last paragraphs as a summary of the article. To address RQ1, we compared annotation results for keywords, tones, and icons generated by GPT3 and by human participants. To address RQ2, we asked participants to answer the NASA Task Load Index questionnaire at the end of each set of tasks. To address RQ3, we compared the annotation results for art styles generated by GPT3 using forward and backward search. Forward search refers to a method of prompting GPT3 directly by inputting "Give me 5 visual artistic styles associated with [selected keyword / tone]." Backward search refers to a method where we prompt GPT3 with "Give me visual hallmarks of [style]" to build an information database for a list of curated styles (a sketch of both strategies follows this section). We prompted GPT3 with "Give me 10 keywords associated with: [summary of article]" to get 10 keywords associated with the article input. One participant was given the same summary of the article and was asked to come up with 10 keywords for the article. Two annotators rated the relevance of the keywords generated by both GPT3 and the human by answering the question: "On a scale of 1-5, how relevant is the keyword to the article?" An example of the rubrics and results is shown in Figure 4. The task was repeated for all five selected news articles, and in total 100 keywords were rated. Tones. Similar to the previous steps, we prompted GPT3 with "Give me 10 emotions associated with: [summary of article]" to get 10 tones of the article. One participant was given the same summary of the article and was asked to come up with 10 tones of the article. Two annotators later rated the relevance of the tones generated by both GPT3 and the human by answering the question: "On a scale of 1-5, how relevant is the tone to the article?" An example of the rubrics and results is shown in Figure 5. The task was repeated for all five selected news articles, and in total 100 tones were rated.
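To make the forward and backward style-recommendation methods described above concrete, here is a hedged sketch of both strategies. The GPT3 engine, the curated style list, and the embedding model are illustrative assumptions rather than the study's exact setup.

```python
import os
import openai  # legacy (<1.0) Completions API
from sentence_transformers import SentenceTransformer, util

openai.api_key = os.environ["OPENAI_API_KEY"]
ENGINE = "text-davinci-002"  # assumed engine; the paper only says "GPT3"

def forward_styles(word, n=5):
    """Forward search: ask GPT3 directly for styles that fit a keyword or tone."""
    response = openai.Completion.create(
        engine=ENGINE,
        prompt=f"Give me {n} visual artistic styles associated with {word}.",
        max_tokens=64,
    )
    return response["choices"][0]["text"]

def build_style_database(styles):
    """Backward search, step 1: collect visual hallmarks for each curated style."""
    database = {}
    for style in styles:
        response = openai.Completion.create(
            engine=ENGINE,
            prompt=f"Give me visual hallmarks of {style}.",
            max_tokens=256,
        )
        database[style] = response["choices"][0]["text"]
    return database

def backward_styles(word, database, model, top_k=5):
    """Backward search, step 2: retrieve styles whose hallmarks match the word."""
    names = list(database)
    corpus = model.encode(list(database.values()), convert_to_tensor=True)
    query = model.encode(word, convert_to_tensor=True)
    hits = util.semantic_search(query, corpus, top_k=min(top_k, len(names)))[0]
    return [names[hit["corpus_id"]] for hit in hits]

# Usage (illustrative curated subset, not the paper's full style list):
sbert = SentenceTransformer("msmarco-distilbert-base-v4")
database = build_style_database(["gothic art", "vector art", "watercolor"])
print(forward_styles("chaos"))
print(backward_styles("chaos", database, sbert))
```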
Keywords. From the list of GPT3-generated keywords, one annotator was asked to select the three keywords that best represented each of the articles. To evaluate the performance of icon generation, we then prompted GPT3 with "Give me 10 icons associated with [keyword]", and we asked one participant to also come up with 10 icons associated with the same keywords. Again, two annotators rated the relevance of the icons generated by GPT3 and by the human to the keywords by answering the question: "On a scale of 1-5, how relevant is the icon to the keyword?" An example of the rubrics and results is shown in Figure 6. This process was repeated for each of the five selected news articles. In total, 13 keywords were expanded into 260 icons; two keywords were eliminated from the list due to duplication. Tones. Similar to the previous section, we asked one annotator to select the three tones that best represented each of the articles. We then prompted GPT3 with "Give me 10 icons associated with [tone]", and we asked one participant to also come up with 10 icons associated with the same tones. Again, two annotators rated the relevance of the icons generated by GPT3 and by the human to the tones by answering the question: "On a scale of 1-5, how relevant is the icon to the tone?" An example of the rubrics and results is shown in Figure 7. This process was repeated for each of the five selected news articles. In total, 14 tones were expanded into 280 icons; one tone was eliminated from the list due to duplication. We used another annotation study to evaluate the performance of style recommendation by prompting GPT3 in two ways, forward and backward. For the forward method, we directly prompted GPT3 with "Give me 5 visual artistic styles associated with [selected keyword / tone]". For the backward method, we first curated X artistic styles, including popular journalistic styles and high-performance styles reported by previous studies, and then prompted GPT3 with "Give me visual hallmarks of [style]" to build an information database based on the responses from GPT3. Five keywords and five tones from the previously generated GPT3 keyword and tone lists were selected for this study. The five selected keywords and tones include the three most representative ones selected by one of the annotators and the top two unselected ones given the order by GPT3. We recruited two participants with art backgrounds to annotate the artistic styles based on the question "On a scale of 1-5, how well can the artistic style express the keyword or tone?" An example of the rubrics and results is shown in Figure 8. All participants were compensated $20 per hour for however long the task took. We used a weighted Cohen's Kappa to measure the interrater reliability between the two annotators. We report a weighted Cohen's Kappa of 0.48 (moderate agreement) for the ratings on keywords. We used a two-sample t-test to compare the mean ratings of keywords generated by the human and by GPT3 and found that the mean rating of human-generated keywords was significantly higher than the mean rating of the GPT3-generated keywords (p-value = 0.008). For human-generated keywords, 77% were rated high (4 or higher), and for GPT3-generated keywords, 56% were rated high. The results are summarized in Table 1 and Figure 9. Although people are clearly better at extracting accurate keywords for an article, GPT3 still produces many good results.
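The analysis described above can be reproduced in outline with standard libraries. The sketch below computes a weighted Cohen's Kappa and a two-sample t-test on illustrative ratings; the paper does not specify the weighting scheme, so linear weights are assumed, and the arrays are placeholders rather than the study's data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy import stats

# Illustrative 1-5 relevance ratings from two annotators (not the study's data).
annotator_1 = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4])
annotator_2 = np.array([5, 3, 4, 4, 5, 2, 3, 5, 3, 4])

# Weighted Cohen's kappa for inter-rater reliability on ordinal ratings.
# The paper does not say whether linear or quadratic weights were used.
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="linear")

# Two-sample t-test comparing mean ratings of human- vs GPT3-generated items.
human_ratings = np.array([5, 4, 5, 4, 4, 5, 3, 4, 5, 4])
gpt3_ratings = np.array([4, 3, 4, 5, 2, 4, 3, 4, 3, 4])
t_stat, p_value = stats.ttest_ind(human_ratings, gpt3_ratings)

print(f"weighted kappa = {kappa:.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```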
Tones. We report a weighted Cohen's Kappa of 0.28 (fair agreement) when measuring the interrater reliability for the ratings on tones. Again, we used a two-sample t-test to compare the mean ratings of tones generated by the human and by GPT3 and found that the mean rating of human-generated tones was significantly higher than the mean rating of the GPT3-generated tones (p-value = 0.005). For human-generated tones, 49% were rated high (4 or higher), and for GPT3-generated tones, 23% were rated high. The results are summarized in Table 1 and Figure 9. Similar to the result for extracting keywords, people are better at extracting accurate tones for an article, but GPT3 still produces several good results. We report a weighted Cohen's Kappa of 0.13 (none to slight agreement) for interrater reliability on the ratings for icons expanded from keywords. We used a two-sample t-test to compare the mean ratings of icons generated by the human and by GPT3 and found that the mean rating of human-generated icons was significantly higher than the mean rating of the GPT3-generated icons (p-value < 0.001). For human-generated icons, 71% were rated high (4 or higher), and for GPT3-generated icons, 53% were rated high. The results are summarized in Table 1 and Figure 9. Again, although people are better at generating relevant icons from keywords, GPT3 still produces a decent amount of good results. Tones. We report a weighted Cohen's Kappa of 0.33 (fair agreement) for interrater reliability on the ratings for icons expanded from tones. We used a two-sample t-test to compare the mean ratings of icons generated by the human and by GPT3 and found that the mean rating of human-generated icons was significantly higher than the mean rating of the GPT3-generated icons (p-value < 0.001). For human-generated icons, 65% were rated high (4 or higher), and for GPT3-generated icons, 37% were rated high. The results are summarized in Table 1 and Figure 9. Again, people are better at coming up with relevant icons from tones, but GPT3 still produces a good amount of highly rated results. 4.2.6 Generating Art Styles from Keywords. We report a weighted Cohen's Kappa of 0.28 (fair agreement) for interrater reliability on the ratings for art styles related to keywords. We used a two-sample t-test to compare the mean ratings of art styles generated by forward prompting and backward prompting and found that the mean rating of forward prompting was significantly higher than the mean rating of backward prompting (p-value = 0.009). For forward prompting, 39% were rated 4 or higher, and for backward prompting, 24% were rated 4 or higher. The results are summarized in Table 2 and Figure 11. 4.2.7 Generating Art Styles from Tones. We report a weighted Cohen's Kappa of 0.1 (none to slight agreement) for interrater reliability on the ratings for art styles related to tones. We used a two-sample t-test to compare the mean ratings of art styles generated by forward prompting and by backward prompting and found that the mean rating of forward prompting was significantly higher than the mean rating of backward prompting (p-value < 0.001). For forward prompting, 48% were rated 4 or higher, and for backward prompting, 27% were rated 4 or higher. The results are summarized in Table 2 and Figure 11. From the annotation results, we see that GPT3 is not as good as humans at the tasks of generating keywords, tones, and icons.
However, we still argue that using GPT3 in the system can be helpful in the image generation process, reducing time as well as mental stress. GPT3 can complete in a few seconds the same tasks that take humans more than 25 minutes. The NASA-TLX results also indicated a high level of mental demand, effort, and frustration for doing the tasks, and using GPT3 to provide recommendations reduces this effort and stress. Furthermore, we found that the percentage of GPT3-generated results rated 4 or higher is enough to provide ideas and inspiration for people. Specifically, out of ten generated outcomes, we would get at least 5 good keywords, 2 good tones, 5 icons from keywords, and 3 icons from tones, which is enough for coming up with ideas in the graphics generation process. Therefore, the performance of GPT3 and the effort level required of humans to complete these tasks justify the use of GPT3 in the system for generating keywords, tones, and icons. For artistic style recommendations, we found that both the forward and backward methods still have room for improvement. We found that out of five recommended art styles, at least two would be good results with the forward method and at least one would be a good result with the backward method. Although the forward search method received higher ratings, failures in recommending artistic styles occurred more often with the forward search method than with the backward search method, which led to our decision to use backward search in the system. One failure mode is that with forward search, random results that are not necessarily visual artistic styles can appear. For instance, some of the "artistic styles" that appeared in the list include social distancing art, coronavirus art, affordable art, prog rock, and baroque pop. Some of these are made-up art styles and others are not visual art styles but music genres. Another failure with forward search is that the recommendations often default to certain generic artistic styles. For instance, the top 5 most frequent styles recommended by the forward method were cubism, surrealism, expressionism, abstract expressionism, and neo-expressionism, which together accounted for 33% of all recommendations. On the other hand, with the backward method, the top 5 most frequent results accounted for only 17% of all recommendations. Therefore, we argue that with backward search, the system can provide more variety and fewer mistakes in style recommendations. Study 1 demonstrated that using GPT3 to suggest keywords and tones has validity. We next wanted to understand if this technique could structure text-to-image prompting and help people get more usable generations. In particular, we explored this approach in Opal towards the goal of creating generations for news illustrations. Formally, we investigate the following research questions around this system:
• RQ1 Does Opal help users arrive at a larger set of usable generations than our baseline?
• RQ2 Does Opal help users arrive at a set of generations with lower cognitive load than our baseline?
• RQ3 Does Opal help users arrive at a set of generations with greater creative expression than our baseline?
• RQ4 To what extent can Opal help users create generations of usable quality as is, as visual assets, or as ideas?
For our user study, we recruited 12 participants, all of whom had some degree of experience with art and design across the following disciplines: editorial illustration, fine art, user interface design, architecture, and interactive media arts. This study was approved by the relevant IRB. Participants were given the task of illustrating two articles, one about the problem of overcrowding in national parks and another about healing in the time of COVID-19. These articles were randomly selected from a pool of 62 articles that came with editorial illustrations and can be found in the Appendix. We conducted a within-subjects experiment and created an ablated version of our system to use as a baseline. This baseline presents the user with fields for the title and article text and nothing more beyond a text box for prompts. For each participant, the system automatically generated an image for the article title. This baseline is depicted in Figure 12. Our experiment took place online through screen-shared video chat in two stages. First, participants were introduced to the task and given a tour of the interface and the article they were to work with for the first round. To mitigate learning effects, we counterbalanced our control and experimental conditions, giving half of the participants Opal in the first round and the other half the baseline in the first round. We additionally counterbalanced the order of the articles we presented. These articles were chosen randomly from a set of 62 articles with illustrations from the New York Times "2021 The Year in Illustration" using a random number generator, such that the articles did not include named entities. (Articles with named entities would have been better accompanied by a photo than an illustration.) Users spent approximately 25 minutes with each of the system and the baseline. They then spent 15 minutes post-study in a semi-structured interview. We had all users talk through the images they had generated. First, we had them eliminate unusable generations to arrive at a set of usable generations. Then, we had participants talk through the remaining generations and categorize them based on whether they would use them as is, as a visual asset that could be used with editing, or as an idea. Often, participants found it difficult to bin generations into one category; for example, a participant might find an image potentially usable both as is and with a bit of post-processing, or both as an idea and with a bit of post-processing. In situations like these, we rounded up to the higher degree of usability for analysis. Participants then filled out a questionnaire including NASA-TLX workload questions as well as Likert scale questions for creativity support measurement. The entire study took about an hour and a half, and participants were compensated $40 USD. RQ1: Does Opal help users arrive at larger sets of usable generations? We had the following hypothesis H1: Opal would help users arrive at larger sets of usable generations. We found that Opal significantly increased the number of generations participants created compared to our baseline. Across 12 participants, participants using Opal generated 43 generations on average, while participants using the baseline generated 16 generations on average. Using a paired t-test, we found that this difference was significant; using Opal increased the number of generations created by over two times (p < 0.01).
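As a small illustration of this within-subjects analysis, the sketch below runs a paired t-test on hypothetical per-participant generation counts; the numbers are placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Illustrative per-participant generation counts for 12 participants
# (placeholders, not the actual study data).
opal_counts = np.array([40, 45, 38, 50, 42, 47, 39, 44, 41, 46, 43, 48])
baseline_counts = np.array([15, 18, 12, 20, 16, 17, 14, 19, 15, 18, 16, 13])

# Paired t-test for the within-subjects comparison of Opal vs. the baseline.
t_stat, p_value = stats.ttest_rel(opal_counts, baseline_counts)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```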
In Figure 13, we can see this pictorially in the gray bars; the number of generations increased across all twelve participants. We further found that Opal significantly improved the number of usable generations compared to our baseline. With Opal, users found an average of 17 usable generations, while with the baseline, users found an average of 6 usable generations. Using a paired t-test, we again found that these improvements in usable generations were significant (p < 0.01), meaning that Opal increased the number of usable generations by over two times. In Figure 13, we can see this pictorially in the green bars; the number of usable generations improved across all twelve participants. Given these results, we conclude and confirm H1: Opal does allow users to arrive at a larger set of usable generations within the same amount of time. We had the following hypothesis H2: Opal would help lower cognitive load for users. To answer RQ2 (Does Opal help users arrive at a set of generations with lower cognitive load?), we looked at a number of different measures from NASA-TLX as well as creativity support questionnaires drawn from the evaluations of other AI co-creative systems [1, 35, 28]. After each round, participants answered Likert scale questions on these measures on a scale from 1 to 7. We found that Opal was rated highly in terms of performance. On average, Opal was given a 5.45/7 for performance, with half the participants giving Opal a 6 or 7. In testing for statistical significance using paired t-tests, we found that across the NASA-TLX measures, the differences between the averages of Opal and the baseline on performance, temporal demand, mental demand, effort, and frustration were insignificant. However, in studying Figure 14, we can see more nuance in the distributions for each measure. We see that the median performance for Opal compared to the baseline is higher. In terms of mental demand, we also see a very wide spread in the mental demand measure for Opal, captured in the boxplot as a large interquartile range and a standard deviation of 1.9. For participants, Opal performed near neutrally in terms of frustration (mean=4.36/7) and effort (mean=4/7), which was lower than the frustration and effort scores for the baseline (mean=4.81/7 and 4.90/7, respectively).

Figure 13: Across twelve participants, we found that the average number of generations created using Opal was significantly higher (2.68x) than the average number of generations created with the baseline. The number of usable generations created by Opal compared to the baseline was also significantly higher (2.28x).

From these results, we cannot confirm H2, which hypothesized that Opal would lower cognitive load. RQ3: Does Opal help users arrive at a set of generations with greater creativity support? We had the following hypothesis H3: Opal would help users arrive at a set of generations with greater creativity support. Next, we analyzed the creativity support measures. We found through paired t-tests that Opal was significantly better at helping users in terms of exploration (p = 0.05) compared to the baseline system. Ten of the twelve users rated exploration highly, at a 6 or 7, which is also illustrated in the boxplot for exploration in Figure 15. In the other measures of control, ability to create novel generations, enjoyment, ease, and completeness, we found no significant difference between Opal and the baseline.
From these results, we partially confirm H3, which hypothesized that Opal would increase creativity support, but only in the dimension of exploration. To better contextualize these numbers for cognitive load and creativity support, we elaborate on them through qualitative analysis of our interview data later on. RQ4: To what extent can Opal help users create generations of usable quality: as is, as visual assets, or as ideas? We had the following hypothesis H4: Opal would primarily help users create visual assets and ideas, and to a lesser degree generations usable as is. To understand how our system performed on RQ4, we conducted qualitative analysis on the interview data we collected as participants analyzed their gallery of generations and pared it down into a usable set. As users pruned their generations, they reported a number of reasons why a generation could not be usable. Aside from the obviously blank generations the AI would occasionally return, participants noted that artifacts in generations in photorealistic styles, distortions in figures (i.e., animals or people), and uncanny compositions made them not want to use a generation at all. We then had participants categorize the generations based on our schema: usable as is, as visual assets, or as ideas. Regardless of the quantity of usable outcomes a participant came up with or whether the participant was using Opal or the baseline, participants tended to choose only a few generations to use as is, with many choosing none. However, many tended to look at these generations positively as visual assets or ideation, reporting different ways they could post-process and use the generations as visual assets.

Figure 14: Boxplots describing NASA-TLX subjective workload measures, which range from 1 to 7. Higher scores mean that there was better performance or that less cognitive demand was incurred. We did not find a significant difference between Opal and the baseline on any dimension.

Figure 15: Boxplots describing creativity support measures, which ranged from 1 to 7 (disagree to agree). Higher scores indicated agreement that the system fulfilled these measures: user control, user ability to create novel things, enjoyment, ability to create at least one design, ease of use, ability to explore, and exploration completeness. We found a significant difference (p = 0.05) in terms of exploration, in that Opal helped users better explore the space of potential designs.

With Opal, ten of twelve participants were able to come up with at least one generation they could use as a visual asset, while in the baseline, only seven participants were able to come up with at least one generation usable as a visual asset. In terms of ideas, with Opal, eleven participants came up with at least one generation that gave them an idea; similarly, with the baseline, ten participants came up with at least one such generation. After the study, users were also given the option to be compensated for 10-15 minutes spent post-processing their visual assets, to complete the thoughts they shared during the user studies. These illustrations are displayed in Figure 17. Participant 11 created the artist edit depicted in the center of the figure to show overcrowding in national parks, creating a visual blend of the generations for the prompts "visitor numbers in the style of painting" and "crowds in a national park in the style of collage".
Using a combination of selection and masking, they were able to overlay the two generations as visual assets such that the artifacts of each were compensated for by the other. Findings. From our interview data, we observed different ways in which user interaction with Opal differed from user interaction with the baseline. Participants unanimously reported that the keywords were helpful. From the icons expanded from keywords, participants were able to generate broadly across a wide range of visual concepts. For example, to follow the keyword and tone expansion process through for one participant, P7 began with keywords such as "Sierra Nevadas", "conservation", and "national parks". They then drilled down into icons that could picture national parks. GPT3 responded with a list of subjects ranging from "a person with their hands in the air" to "an animal in the wild" to "Acadia National Park". They chose all of these related icons, found inspiration in "an animal in the wild in the style of painting", and wanted to take "an animal in the wild in the style of collage" as is. As generations for the subject "a person with their hands in the air" came in, they felt that the set could be used as inspiration for another hypothetical news article about the private lives of tech CEOs. In contrast, when P7 then used the baseline, they generated four images and stopped, finding the first two generations outside of the default one for the article title unusable and the last one a lukewarm idea. In reflecting on their process with the keywords, they felt that because the suggestions were present, they wanted to give each angle of exploration "a fair shot", whereas with the baseline, the less feedback it gave them, the less they were willing to try. As P10 noted, sometimes the tones were less helpful because they could take participants in a divergent direction. This stage interrupted some participants just as they began to converge on ideas they had seen in the icon stage for keywords. For example, in the case of P4, after having generated a number of concrete visualizations of "trees", "a picture of a campground", "a megaphone", and "a national park", they pulled out of that focus to generate "serenity" and "wonder", and later icons for these tones: "a bubbling brook winding through a verdant stream" and "a waterfall cascading down a mountain". They later went back to the keywords stage, reviewed the keywords, and began generating with a custom subject for "family hiking", converging upon the final concept with Opal.

Figure 16: Participants found usable generations in all three of the categories we defined for usability: use as is, use as a visual asset, and use as a conceptual idea. Examples are shown above, with the subject and style delimited by a comma / custom prompt.

Participants also reported that the amount of choices returned when tones and keywords were expanded into icons could at times be overwhelming and conducive to choice paralysis, as well as repetitive and overspecific. To quote P9, "Keywords kind of give me generic images. I like how these [icons] start to get a little bit more specific, more rich conceptually. Some of these were a little bit too predictable, like a broken heart or a wilted flower, I feel like you see these images a lot. So I thought 'a person looking at the stars' is not predictable, so that's why I picked it."
In terms of the Style Explorer, participants enjoyed trying diverse styles even if they had had no exposure to them, testing out conte crayon drawing, decollage, double exposure photography, and more. However, all participants tended to concentrate their exploration on subjects rather than styles. Another theme we note is that Opal encouraged reaction over intention. P1 commented, "I'm reacting more to what it's generating than me proactively putting in things. It's nice for generating ideas, it's good for cherry picking. Here [Opal] spits out so many, I can just do this forever until I get that one ideal one. The other one took a lot more effort in terms of generating an image." Similarly, P6 acknowledged that they were more interested in the outcome than the prompts driving the process: "I wish it just did everything for me. I really don't care if I chose the right keyword, but I wanted the right outcome." In contrast, P10 highlighted that the baseline allowed them to "have control of what I am searching. I can come up with keywords on my own, which gives me more time to think about what I want to see more intentionally." Some participants mentioned that the way they viewed the generations was influenced by the amount of effort they put in. P1 commented that because the baseline took deliberate search, they felt less likely to edit the images and more likely to take them as is. P12 similarly mentioned that the text-to-image systems seemed to be giving them "solutions" compared to what they would usually get from Google Images, where they would have gotten ideas and reference images. They did not feel like they would significantly modify the images because "it would interrupt the aesthetic of the generation". Next, we note that participants tended to experiment more with prompts to understand system quality, but only in the baseline. Participants would put in custom prompts to build a concept of what the system could or could not do, even if the prompts they chose deviated from their overall goal of generating a news graphic. This occurred across all 12 participants, though the way they approached this experimentation was highly individual. Overall, participants tried to understand how different nuances of prompt language would be handled by the system. P12 wanted to understand how well the system could handle abstraction: "I tried economics, conservation... abstract ideas to see if the AI could generate anything that could be used stylistically more than just subject matter." Participant 6 attempted different linguistic arrangements of the prompt, at first trying to convey photographic qualities (e.g., "close up of a hiker in the style of photography"), then different prepositions ("an elk centered in front of a mountain in the style of..."), and additionally plurality (e.g., "two hikers"). P5 began with a suggested icon for a tone keyword and replaced "a person" with "woman", wondering if the model would respond better if the subject was gendered. From this experimentation, participants formed working assumptions, such as a reluctance to try subjects that were nonhuman, figurative, or animal (P1, P2, P3). Meanwhile, P11 and P12 presumed early on that the model might be better at scenic images or "high texture ones" that possess an "uncontrolled" quality often seen in nature. Most participants in the baseline condition pursued a trajectory for often no more than a few prompts before making a working assumption.
For this reason, some participants enjoyed the "tailored" (P4), cumulative, and yet explorative nature of Opal, which was structured as a suggested pipeline. P12 in particular enjoyed this quality: "I prefer this one [Opal]. I think it was like a pipeline. You start with an idea and you develop it more. The other one [baseline] felt like going to ground zero each time. It didn't feel like the ideas were building on each other that much. You were still in the same zone, you were still dealing with the same thing, it was changing but it didn't seem like it was accumulating as much." We center our discussion around 1) how suggested keywords and tones can help visually anchor text-to-image exploration, 2) how editorial illustrators can benefit from computational assistance, and 3) how experimentation and system feedback can help people understand generative AI capabilities. We found in our qualitative findings that keywords were excellent ways to find connections between the news article and visual concepts, and to anchor text-to-image exploration for users such that they had new directions at every step of the way. Each stage captured a diverse range of subjects, from named entities ("Yellowstone National Park") to figures in action ("a person looking up at the stars") to complex phrases like "people not respecting nature". Study 1 validated these prompt-engineered suggestions as accurate and plausible, but more importantly it surfaced the insight that it is remarkably challenging and taxing for a human to come up with a large, diverse set of associations. Even though GPT3 could not beat the human gold standard in terms of performance, what GPT3 can do is significantly mitigate human effort by suggesting concepts. This was one of our design rationales behind integrating associations from GPT3 into Opal. However, the results from Study 2 on RQ2, measuring cognitive load, illustrate a trade-off: if users are presented with too many options without the ability to control how generic or specific they are, the suggestions add to their cognitive load and induce choice paralysis. If users are presented with too few options, they are stuck with recalling associations and working through brute-force trial and error. Addressing this could be as simple as controlling the number of suggestions returned, or as complex as crafting the prompt engineering to be more context- and user-dependent; we present these as lines of future work to consider. In interviews with illustrators who worked on journalism teams (n=3), we found that illustrators are receptive to Opal and to using computational tools within their process. P11, an illustrator for a community newspaper, said, "I've also in the past used an AI generator GauGAN, the landscape generator, because I was asked to draw a fantasy landscape. I was like, that's a lot of work to think about. I'm going to go generate a bunch of landscapes in different colors and see what it gives me. It helped me figure out color palette or like some of the textures and some of the more concrete forms that would go in my illustration." P11 and P8 both mentioned that text-to-image systems have an advantage in terms of copyright. P8 mentioned that as opposed to browsing the internet for inspiration and running into potential concerns of copyright infringement, such as accidentally taking inspiration from other people's illustrations, they could freely use the generations from Opal in any way.
Both referred to Opal as a great resource for generating reference photos that they could trace over. Text-to-image systems also have the advantage that they can generate images into galleries indefinitely, but in a curated and mostly controlled manner. For example, within Google Images or Pinterest, it is impossible to avoid certain types of results or to explore automatically in a clean and declarative manner. They instead return millions of results, whereas text-to-image systems can return higher quality sets with differences that can be followed and captioned. Two of the editorial illustrators mentioned in their think-aloud process that they had a concept in mind at the start. Even though they were not able to input that concept directly into a text box, they were able to work with the generations and build on top of their mental image nonetheless. For example, P11, an editorial illustrator for a community newspaper, said that they had a mental image of a "glowing heart" from the start. They began putting in prompts related to "glowing hearts" and eventually converged on some concepts: "I got more clarity on what the background could look like, because it had to do with nature, and this gave me a color palette, some texture, some more concrete ways to represent that. So then from here I could not only use these images but find ones that are similar to it and look up fields that look similar to that one to use in that image." As they used Opal, Participant 6 responded with multiple ideas to a generation of "an animal in the wild in the style of painting", which had depicted a blend of a lionesque animal onto a rock face. The first idea was that the image was "good for a children's book story", but also for articles intended for "people who got scammed by something...for some article about how some person's iCloud account got hacked and they got a nightmare. These sorts of things are good for conveying emotions." These examples are depicted in Figure 16. In Study 2, we found that all users wanted to understand the AI's "operating characteristics" before working with it towards the goal of an editorial illustration. If they began with the baseline, they experimented with a number of different prompts to get a quick impression of what the model could and could not do. This is in line with work by Gero et al. [21], who mention that it is important for users interacting with AI to develop a mental concept of what kind of knowledge an AI is capable of displaying. When users began with Opal, they were able to get more feedback from the text-to-image model because significantly more images were being generated at less effort cost to the user. An added benefit of allowing users to see more generations as a form of system feedback is that the generations become less expensive to users. P1 commented that they felt more open to editing the generations when using Opal as opposed to the baseline (where they accepted the images for what they were). This could have been because the participant felt more open to using the generations as a design material when they saw a greater quantity of them coming in with ease. In creating a large garden of options to peruse, users are also able to build a good understanding of the system they are using. While many generative systems have tried to pursue high-fidelity forms of user control, control is hard to guarantee in generative systems, which are inherently stochastic.
One limitation related to our insight about the difficulty of translating text into images is that we supported the process with a novel technology that departs significantly from illustrators' usual workflows. Four participants reported that this process inverted certain parts of their illustration or design process. While we did structure Opal through learnings from co-design with two editorial illustrators, which involved a focus on keywords and visual concept exploration, our user interface and our controls for steering the AI generation were primarily text-based. Many participants mentioned that image creation has fundamentally been about interacting with the image through direct manipulation, establishing composition spatially rather than through language. A limitation of our system, therefore, was that we did not give users the ability to work more directly from images and the other processes they were traditionally used to. While we had tried to implement ways of letting users pass in image prompts, which is possible with these technologies [30], we found that image prompts added another layer of stochasticity that would have compounded the lack of control a user could feel.

Nonetheless, we could have pursued other simple ways of bridging language and image, such as color. Many participants found that they could use color as a way to sift and respond to the generations. For P1, "most of these are color studies, they relate more strongly to the color than the actual subject." Many generations were also judged usable when they served as a springboard for potential color palettes. Color is a sensible direction to explore, as it is a cousin to tone and has been validated in research as a bridge between language and visual attributes.
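As a sketch of how color could serve as such a bridge, the snippet below extracts a dominant color palette from a generated image so that results could be sifted or grouped by palette rather than by subject. It assumes Pillow, NumPy, and scikit-learn as dependencies, and the dominant_palette helper is illustrative; this is a possible direction, not a feature of Opal.

```python
# Minimal sketch (not part of Opal): pull a dominant color palette out of a
# generation so users can sift results by color rather than by subject.
# Assumes Pillow, NumPy, and scikit-learn are installed.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans


def dominant_palette(image_path: str, n_colors: int = 5) -> list:
    """Return the n_colors most dominant RGB colors in the image as hex strings."""
    image = Image.open(image_path).convert("RGB").resize((64, 64))  # downsample for speed
    pixels = np.asarray(image).reshape(-1, 3).astype(float)
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    # Order cluster centers by how many pixels each one covers.
    counts = np.bincount(kmeans.labels_, minlength=n_colors)
    centers = kmeans.cluster_centers_[np.argsort(-counts)]
    return ["#%02x%02x%02x" % tuple(int(round(c)) for c in center) for center in centers]


# Hypothetical usage: dominant_palette("generation_042.png") returns hex codes
# such as "#8a9b5c" ordered from most to least dominant.
```

Surfacing such palettes alongside generations would let participants like P1 treat them explicitly as color studies and filter results by palette.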
People can now generate images with nothing more than text, and we explore applications of this novel technology for news illustration. We address the challenge of translating article text into visual messages with Opal, a system that produces text-to-image generations for editorial illustration. Opal guides users through a structured search for visual concepts and provides pipelines allowing users to illustrate based on an article's tone, subjects, and intended illustration style. Our evaluation shows that Opal efficiently generates diverse sets of editorial illustrations, graphic assets, and concept ideas. Users with Opal were over two times more efficient at generation and generated over two times more usable results than users without. We conclude with a discussion of how structured and rapid exploration can help users better understand the capabilities of human AI co-creative systems.

REFERENCES
2020. CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems.
Neuro-Symbolic Generative Art: A Preliminary Study.
Style follows content: On the microgenesis of art perception.
Gpt-3 creative fiction.
Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners.
How Novelists Use Generative Language Models: An Exploratory User Study.
MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding.
SceneSeer: 3D Scene Design with Natural Language.
AttribIt: Content Creation with Semantic Attributes.
2021. afiaka87/clip-guided-diffusion: A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI.
Rivers Have Wings. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance.
Viewing artworks: Contributions of cognitive control and perceptual facilitation to aesthetic experience.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Taming Transformers for High-Resolution Image Synthesis.
Making Pre-trained Language Models Better Few-shot Learners.
A Neural Algorithm of Artistic Style.
Visual Conceptual Blending with Large-scale Language and Vision Models.
Mental Models of AI Agents in a Cooperative Game Setting.
Metaphoria: An Algorithmic Companion for Metaphor Creation.
Sparks: Inspiration for Science Writing using Language Models.
Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN.
Neural correlates of beauty.
Prefix-Tuning: Optimizing Continuous Prompts for Generation.
Design Guidelines for Prompt Engineering Text-to-Image Generative Models.
Dream Lens: Exploration and Visualization of Large-Scale Generative Design Datasets.
lucidrains/big-sleep: A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.
Semantic Image Synthesis with Spatially-Adaptive Normalization.
Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.
Design Adjectives: A Framework for Interactive Model-Guided Exploration of Parameterized Design Spaces.
Where to Hide a Stolen Elephant: Leaps in Creative Writing with Multimodal Machine Intelligence.
'Lighter' Can Still Be Dark: Modeling Comparative Color Descriptions.