title: Effectiveness of Area-to-Value Legends and Grid Lines in Contiguous Area Cartograms
authors: Fung, Kelvin L. T.; Perrault, Simon T.; Gastner, Michael T.
date: 2022-01-09

Abstract: A contiguous area cartogram is a geographic map in which the area of each region is rescaled to be proportional to numerical data (e.g., population size) while keeping neighboring regions connected. Few studies have investigated whether readers can make accurate quantitative assessments using contiguous area cartograms. Therefore, we conducted an experiment to determine the accuracy, speed, and confidence with which readers infer numerical data values for the mapped regions. We investigated whether including an area-to-value legend (in the form of a square symbol next to the value represented by the square's area) makes it easier for map readers to estimate magnitudes. We also evaluated the effectiveness of two additional features: grid lines and an interactive area-to-value legend that allows participants to select the value represented by the square. Without any legends and only informed about the total numerical value represented by the whole cartogram, the distribution of estimates for individual regions was centered near the true value with substantial spread. Selectable legends with grid lines significantly reduced the spread but led to a tendency to underestimate the values. When comparing differences between regions or between cartograms, legends and grid lines made estimation slower but not more accurate. However, legends and grid lines made it more likely that participants completed the tasks. We recommend considering the cartogram's use case and purpose before deciding whether to include grid lines or an interactive legend.

As contemporary computer technology has simplified the production of data visualizations, researchers are now interested in evaluating and improving existing design practices for visualizations displayed on a computer screen. A cartogram is a type of data visualization for which there are currently only a few design guidelines [1], [2]. A contiguous area cartogram is a special type of cartogram in which the area of each region is rescaled according to quantitative statistical data without changing the underlying map topology (i.e., neighboring regions must remain connected). Because contiguous area cartograms can simultaneously visualize geography and statistics, they are used, for example, in newspaper articles [3], textbooks [4], and online tutorials [5]. An example of a contiguous area cartogram is shown in Fig. 1. The term "cartogram" is also used for many other related map designs (e.g., distance cartograms [6] and non-contiguous area cartograms [7]). Hereinafter, we refer to contiguous area cartograms simply as "cartograms" for the sake of brevity. Cartograms satisfy Tufte's principle of graphical integrity: "The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented" [8]. To assist map readers with quantitative assessments, Dent recommended including an area-to-value legend in every cartogram [1]. Such a legend comprises a square symbol that shows how the area of the square is to be converted to the numerical value printed next to the square (Fig. 1).
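In symbols (our notation, not the paper's): if the legend square has area $A_{\text{legend}}$ and is printed next to the value $V_{\text{legend}}$, then a map region of area $A$ represents the value

$$V = \frac{A}{A_{\text{legend}}}\, V_{\text{legend}}.$$

For instance, a region covering twelve legend squares on a cartogram whose legend square stands for 10 million people represents roughly 120 million people.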
We build on Dent's recommendation [1] and propose two additional features: grid lines and a selectable legend (i.e., an interactive area-to-value legend that allows the user to choose the area of the square). The purpose of this study is to evaluate whether static legends, grid lines, and selectable legends help map readers to retrieve quantitative information from cartograms.

Fig. 2. Illustration of the interactive selectable legend feature used in the experiment. Participants could select from three squares placed below the cartogram. In this example, the areas of the squares corresponded to populations of 10, 20, and 50 million. The active square appeared white. The other squares had a gray fill color and were stacked below the active square. When the participant clicked on a legend square that was not currently selected, the newly selected square was highlighted, and the space between the grid lines either expanded or contracted to correspond to the new legend square size. Through this example, we demonstrate the effect of switching from a legend square that corresponds to 10 million people (left) to a square that corresponds to 50 million people (right).

We use the term "static legend" to refer to a single, noninteractive square symbol next to the associated numerical value. For the size of the legend square, we chose an area that represents around 1% of the cartogram's total area, consistent with Dent's recommendation that this area should be at "the low end of the value range" [1] (see Section 2.1). We only used values that were "nice" numbers (i.e., powers of 10 multiplied by 1, 2, or 5) [2], [9], [10]. Grid lines are vertical and horizontal lines overlaid on a cartogram, as shown in Fig. 1. In this study, to ensure that the lines did not obfuscate the underlying map, we used translucent gray lines. The size of each square in the grid corresponds to the size of the legend square. In our experiment, a selectable legend consists of three squares of different sizes overlapping each other, as shown in Fig. 2. We chose to have three squares because previous studies on proportional symbol maps found that having at least three symbols can reduce estimation errors [9], [11]. When a user hovered the mouse cursor over a legend square that was not currently selected, the mouse cursor changed into a pointer, indicating that the user could perform a left click. Upon left clicking, the newly selected legend square was highlighted, the legend magnitude was updated, and the space between the grid lines either expanded or contracted to correspond to the legend square size. For the smallest square, we chose an area that represents approximately 1% of the cartogram's total data value, and we ensured that the value was a "nice" number, as described in Section 1.1.1. We then computed three consecutive nice numbers, with the value of the smallest square as the smallest number (a code sketch below illustrates this computation). The areas of the medium and large squares represent the second-smallest and largest of these nice numbers, respectively. For example, in Fig. 2, the small square represents 10 million people, the medium square represents 20 million people, and the large square represents 50 million people. Area is the visual variable with which cartograms communicate quantitative data. Humans are generally poorer at judging areas than they are at judging lengths or angles [12]. The ability of readers to extract quantitative information from areas shown in cartograms has been the focus of previous studies as well.
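The following R sketch (ours, not the authors' published code) shows one way to implement the sizing rule just described: take the smallest "nice" number that is at least 1% of the cartogram's total data value, then the next two nice numbers for the medium and large squares of the selectable legend. The function names are hypothetical, and the "at least 1%" anchor is our reading of "approximately 1%".

```r
nice_number <- function(k) {
  # k-th element of the series 1, 2, 5, 10, 20, 50, 100, ... (k = 1, 2, ...)
  c(1, 2, 5)[(k - 1) %% 3 + 1] * 10^((k - 1) %/% 3)
}

selectable_legend_values <- function(total) {
  target <- 0.01 * total                       # "approximately 1%" anchor
  k <- 1
  while (nice_number(k) < target) k <- k + 1   # smallest nice number >= target
  sapply(k:(k + 2), nice_number)               # three consecutive nice numbers
}

selectable_legend_values(1.2e9)   # e.g., total population of 1.2 billion
# [1] 2e+07 5e+07 1e+08   (i.e., 20, 50, and 100 million)
```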
In 1975, Dent conducted an experimental study to evaluate the communication aspects of cartograms [1]. Dent reported that participants found cartograms "confusing and difficult to read, but at the same time [they appeared] interesting, generalized, innovative, unusual, and having-as opposed to lacking-style." From the quantitative data collected during the experiment, Dent concluded that "better magnitude estimation is achieved when at least a key symbol at the low end of the value range is included." Dent did not specify what the low end of the value range is, but the sample cartogram he provided has a key that represents approximately 1% of the total population. We followed Dent's example by having each legend in our experiment represent approximately 1% of the total value. Subsequent studies have often pitted cartograms against other types of thematic maps (e.g., choropleth maps and proportional symbol maps). Kaspar et al. [13] conducted a study that assessed how well map readers made spatial inferences about data presented in cartograms versus data presented in choropleth maps. They found that cartograms and choropleth maps with graduated circles were equally effective when participants had to answer simple questions. However, when questions were complex, participants performed better with choropleth maps. Similar results were also obtained in other cognitive experiments [14], [15], [16]. Recently, Nusrat et al. [17] evaluated the effectiveness of four types of area cartograms: contiguous, non-contiguous, rectangular, and Dorling cartograms. Their experimental results provide partial evidence that contiguous area cartograms lead to the lowest error rate when areas are compared. In this study, we investigated whether area-to-value legends and grid lines can aid map readers in making more accurate area judgments. In this digital age, features that can be added to cartograms are no longer constrained by the limits of pen and paper. As Goodchild noted, cartography's "true potential lies in less conventional methods of analysis and display and in the degree to which it can escape its traditional constraints" [18]. An early study conducted in 1999 by Peterson [19] explored the potential of computer technology by developing an interactive legend that controlled a cartographic animation using JavaScript. In a more recent study, Duncan et al. [20] experimentally evaluated whether cartograms can convey information more effectively with three additional interactive features: cartogram-switching animations, linked brushing, and infotips. These features were also implemented using JavaScript and are currently deployed on the web application go-cart.io [21], [22]. In our study, we also tapped the potential of computer technology. We considered whether an interactive feature (a selectable legend) allows map readers to retrieve information from cartograms more effectively. To assess how well cartograms convey information, participants should be asked to perform a variety of task types that encapsulate the different ways cartograms can be used as a source of information. To this end, Nusrat and Kobourov [23] designed a cartographic task taxonomy with ten objective-based task types (i.e., task types that "focus on user intent, or what the user wishes to perform"). These task types are characterized by verbs (e.g., "Compare" and "Identify") that do not specify the method or the feature that the participant should use when performing the tasks.
We adopted the same task taxonomy for our experiment to evaluate the effectiveness of legends and grid lines. From the objective-based task taxonomy by Nusrat and Kobourov [23], we selected four task types for which legends and grid lines are conceivably relevant: Compare, Detect Change, Cluster, and Find Top. We used the Compare and Detect Change task types twice: once for administrative units (e.g., states in the USA or provinces in Nepal) and a second time for "zones", which are spatially contiguous areas formed by aggregating administrative units. We divided the respective countries into two zones of approximately equal size. The zones were colored yellow and purple, respectively, to achieve a clear color contrast. For example, we considered New Zealand's South Island as a "zone" with seven administrative units (Canterbury, Marlborough, Nelson, Otago, Southland, Tasman, and West Coast). The South Island was colored yellow, and the other zone (i.e., the North Island) was colored purple. We included task types for zones as well as for individual administrative units so that participants had to make area comparisons on a variety of length scales. We also added a task type that we call Estimate Administrative Unit, which is not part of Nusrat and Kobourov's task taxonomy. Estimate Administrative Unit required participants to quantify an administrative unit's associated data value. We added this task type because it asks directly for the magnitude associated with a region in a single cartogram, whereas all other task types concern the relation between different regions or between different cartograms. In Table 1, we list all task types used in our experiment, each with a description and an example task.

TABLE 1
The seven task types used in our experiment, with a sample question for each task type.

Estimate Administrative Unit
Task: Given a cartogram, participants were required to estimate an administrative unit's associated data value.
Example: On the other monitor, you can see a conventional map of Denmark and a population (2018) cartogram. Estimate the population of Hovedstaden (HS). If you are uncertain, please enter "NA".

Compare Administrative Units
Task: Given a cartogram, participants were required to determine whether an administrative unit was larger or smaller than another, and by what magnitude (e.g., population size) they were different.

Detect Change in Zone
Task: Given two cartograms of the same country that was divided into two zones (indicated by the colors yellow versus purple), participants were required to estimate the extent to which a zone changed in magnitude (e.g., population size).
Example: On the other monitor, you can see a conventional map of New Zealand and two cartograms representing the population in 1991 and 2018. Is the population of the South Island (yellow) in 1991 larger or smaller than its population in 2018? By what magnitude is the population of the yellow region smaller in 1991? If you are uncertain, please enter "NA".

Cluster
Task: Given a cartogram and an administrative unit U, participants were required to choose, from a set of four candidates, the administrative unit that had an area most similar to U.
Example: On the other monitor, you can see a conventional map of the United States and a population (2018) cartogram. Out of the states listed below, which state has a population most similar to Colorado (CO)?

Find Top
Task: Given a cartogram, participants were required to identify the administrative unit with the largest area from a set of four candidates.
Example: On the other monitor, you can see a conventional map of Kazakhstan and a population cartogram. Which district has the largest population?

We generated cartograms of 28 different countries. These countries are listed in Section 1 of the supplemental text, available online. Participants encountered each cartogram only once during the experiment because we wanted to mitigate any learning effects. To generate the cartograms, we used several types of data, such as population, GDP, and COVID-19 cases. For the Detect Change in Administrative Unit and Detect Change in Zone questions, where the task was to compare two cartograms, we used data sets of the same type of data for two different years (e.g., population in 1985 and population in 2018) and showed the participants both cartograms next to each other. All cartograms were generated using the web application go-cart.io, which uses the fast flow-based algorithm proposed by Gastner et al. [24]. For each task, a conventional equal-area map was presented alongside the cartogram(s). Each administrative unit was filled with the same color on the conventional map and the cartogram(s). We used the ColorBrewer palette Dark2 with six colors [25] and ensured that distinct colors were used for neighboring regions. We also implemented a linked-brushing effect, whereby the color of an administrative unit changed its brightness on both maps when the participant hovered the pointer over the unit on one of the maps [20]. The choice of colors and the linked-brushing effect were intended to make it easier for participants to locate an administrative unit on all maps shown during a task [2]. On the conventional map, we labeled every administrative unit with two-letter abbreviations. The labels simplified the process of locating the administrative units for participants who may be unfamiliar with the geography of the displayed country.

We recruited 44 participants, all students of Yale-NUS College. Out of the total sample, 20 participants were female, 22 were male, and 2 preferred not to answer. The mean age of the participants was 20.0, and all participants were between 18 and 24 years of age. Because the experiment required participants to distinguish between map regions filled with different colors, we used the Ishihara color perception test to ensure that the participants were not color blind. All participants were able to identify the correct numbers on the Ishihara test plates. Thus, we believe that none of the participants were color blind. Before participants started working on any cartogram task, we asked them to assess their familiarity with maps and cartograms on a 5-point Likert scale (1 = "Not familiar at all" and 5 = "Extremely familiar"). The mean rating was 3.0 for familiarity with maps in general and 1.9 for cartograms. The low familiarity with cartograms is likely to be representative of most casual cartogram users. We acknowledge that a limitation of our experiment is that all participants were college students. Therefore, our results may only be representative of a younger, more educated population. Previous cartogram evaluation studies also faced this limitation [1], [26], [27]. However, the tasks in our experiment did not require specialized knowledge and can be grasped by teenagers and adults with normal vision (possibly with correction). Thus, we believe that our results can be generalized to a larger population.
The experiment consisted of four parts. During all four parts, an experiment supervisor was always present to oversee the procedure in a one-on-one setting.

(1) Introduction: At the start of the experiment, participants read an information sheet and then signed a form to indicate that they consented to participate in the experiment. Next, participants sat in front of two liquid-crystal display monitors, each with a resolution of 1920 × 1080, and watched a five-minute video that introduced them to cartograms and provided details about the experimental procedure. During the video, participants could pause, rewind, and ask for clarification as they wished. After the video, the experiment supervisor set up the two displays as follows (Fig. 3). Monitor 1 displayed a Qualtrics XM survey, where participants read task descriptions and entered their answers. Monitor 2 displayed a web-based graphical user interface that showed the conventional maps and cartograms. With this setup, participants answered a practice question and tried out the selectable legend feature. The cartogram and country used in this practice question were not reused in the later questions.

(2) Preliminary questions: Participants answered questions about their age, gender, and level of education. Participants also rated their familiarity with maps and cartograms. Thereafter, we conducted a color perception test with four Ishihara test plates.

(3) Cartogram tasks: Participants answered 28 task-based questions. Each question required them to read a conventional map and one or two cartograms. They were provided with scratch paper and a pocket calculator. We also informed participants that they could take as much time as they needed for each question and that the time they took to complete each question would be recorded. We used a 7 × 4 within-subject experiment design. There were 7 task types, as listed in Table 1, and 4 treatments based on the features displayed in the cartogram(s) for the task:
• Neither legend nor grid lines.
• Static legend only.
• Static legend with grid lines.
• Selectable legend with grid lines.
Each participant encountered each combination of task type and treatment only once. We randomly divided the 44 participants into 4 groups, and the order in which the tasks appeared was the same for all groups. However, participants in different groups encountered the 4 experimental conditions at different times, as per the Latin square design [28] (see the sketch below). Section 1 of the supplemental text, available online, lists the order in which combinations of countries, task types, and features appeared during the experiment.

(4) Attitude study: We posed four free-response questions in which participants were asked to briefly describe the different strategies they used to perform the tasks in each of the four treatments (i.e., no features, static legend only, static legend with grid lines, and selectable legend with grid lines). Participants had to write their descriptions in text boxes displayed on the monitor that also showed all previous survey questions. Next, participants rated the aesthetics and effectiveness of static legends, static grid lines, and selectable legends with grid lines in a semantic differential test that we adapted from the tests used by Dent [1] and Nusrat et al. [26].
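To make the counterbalancing concrete, here is a minimal R sketch of a 4 × 4 Latin square of treatment orders and a random assignment of the 44 participants to the 4 groups. This is our illustration under the assumption of a cyclic square; the paper states only that a Latin square design [28] was used, so the actual square may differ (a Williams design, for example, would additionally balance first-order carryover effects).

```r
# A cyclic 4x4 Latin square: every treatment occurs exactly once per
# group (row) and exactly once per position (column).
treatments <- c("no features", "static legend",
                "static legend + grid lines", "selectable legend + grid lines")

latin_square <- t(sapply(0:3, function(g) treatments[(g + 0:3) %% 4 + 1]))
rownames(latin_square) <- paste("group", 1:4)
colnames(latin_square) <- paste("position", 1:4)
latin_square

# Randomly assign the 44 participants to the 4 groups (11 per group).
set.seed(1)                        # hypothetical seed, for illustration only
group_of_participant <- sample(rep(1:4, each = 11))
```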
We devised seven pairs of words and phrases that are polar opposites (e.g., "ugly" versus "elegant", "difficult to understand" versus "easy to understand"), and participants marked them on a 5-point Likert scale for their attitude toward each of the three features. The instructional video used in part (1) and the complete list of questions used in parts (2)-(4) are available as supplemental material for this article from the journal's website.

The seven task types in the experiment were divided into two categories: task types that required participants to enter numerical responses and task types in which participants had to select the name of an administrative unit. Five of the seven task types belong to the former category (Estimate Administrative Unit, Compare Administrative Units, Compare Zones, Detect Change in Administrative Unit, and Detect Change in Zone), while two belong to the latter (Find Top and Cluster). For the five task types that required numerical responses, we wanted to compare the participants' task completion rates, the accuracy of their responses, and their response times when performing the tasks under the four treatments listed in Section 3.4. For the two task types that did not require numerical responses, we wanted to compare error rates and response times under the four treatments.

The five task types with numerical responses can be divided into two subcategories. In the first subcategory (Estimate Administrative Unit), the correct answer had to be a positive number. However, in the second subcategory (comprising both Compare and both Detect Change task types), the correct answer could be positive or negative. Therefore, responses to the second subcategory were in two parts. First, participants had to answer whether the focal region had a larger or smaller area than the reference region (i.e., they had to determine the sign of the difference in area). Subsequently, the participant entered the magnitude of the difference. To compare the participants' accuracy for the five task types that required numerical responses, we first normalized the numerical data as follows:

1) If a participant answered the first part of a two-part question (e.g., Compare Administrative Units) incorrectly, we inverted the sign of the numerical response given in the second part. For example, suppose that a participant thought that region A is larger than region B by 5000, but the correct answer is that region A is smaller than region B by 3000. The two parts of the participant's answer would have been recorded as (i) "Larger" and (ii) "5000". Because (i) was incorrect, we inverted the sign of (ii) to obtain −5000. Thus, the participant would have been off by −5000 (response) − 3000 (correct answer) = −8000. This step allows us to properly account for the difference between the participant's numerical response and the correct answer. A positive sign of the sign-adjusted response indicates that the direction of the estimate is correct.

2) We then normalized the response using the formula

normalized response = (response − correct answer) / correct answer.

Normalized responses for Estimate Administrative Unit tasks had to be greater than or equal to −1. For all other task types, values less than −1 were possible. After normalizing the numerical responses, we could meaningfully aggregate the results for different tasks of the same task type, even if the magnitude of the answer differed between the tasks.
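As a compact restatement of the two normalization steps, here is a minimal R sketch (ours; the function and argument names are hypothetical, not taken from the authors' published scripts):

```r
# Normalize a two-part response (direction + magnitude), as described above.
# direction_correct: TRUE if the participant chose the correct direction
#                    ("Larger"/"Smaller"); always TRUE for Estimate tasks.
# magnitude:         the participant's (unsigned) numerical response
# correct:           the (unsigned) magnitude of the correct answer
normalize_response <- function(direction_correct, magnitude, correct) {
  signed <- ifelse(direction_correct, magnitude, -magnitude)  # step 1
  (signed - correct) / correct                                # step 2
}

# The worked example above: wrong direction, response 5000, correct 3000.
normalize_response(FALSE, 5000, 3000)  # (-5000 - 3000) / 3000 = -2.67
```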
However, even after the above two steps, many of the distributions of the normalized responses were skewed. Therefore, we used the Kruskal-Wallis test for differences between treatments. The Kruskal-Wallis test is a non-parametric test that does not assume that the data are normally distributed [29]. The test statistic is χ²-distributed with 3 degrees of freedom. For post-hoc analysis, we used pairwise Mann-Whitney U tests with Bonferroni-Holm correction [30], [31]. In the five task types with numerical responses, participants had the option to enter "NA" (no answer) instead of a number if they were uncertain. We define the task completion rate as the percentage of non-"NA" responses. To assess whether the participants' task completion rate differed between treatments, we used Cochran's Q test [32]. In the post-hoc analysis, we used pairwise McNemar tests with Bonferroni-Holm correction [30], [33]. For significant pairwise differences, we used the odds ratio to measure the effect size and calculated confidence intervals using the method proposed by Fay [34]. To calculate the distribution of response times for the five numerical-response task types, we excluded the response times of participants who did not complete the task (i.e., entered "NA" for the area estimate). Because the distributions of response times were right-skewed, we used the Kruskal-Wallis test to assess whether there are differences between treatments. If the Kruskal-Wallis test rejected the null hypothesis that there is no difference, we used pairwise Mann-Whitney U tests with Bonferroni-Holm correction for post-hoc analysis. For the task types that did not require a numerical response (i.e., Find Top and Cluster), we treated the response as binary data (correct versus incorrect), applied Cochran's Q test, used pairwise McNemar tests for post-hoc analysis, and determined the odds ratio and confidence intervals. To compare response times for these two task types, we excluded the response times of participants who answered the question incorrectly. As with the numerical-response task types, we compared response times using the Kruskal-Wallis test. For post-hoc analysis, we used the pairwise Mann-Whitney U test with Bonferroni-Holm correction. For Estimate Administrative Unit tasks, we performed additional data analysis. We were interested in whether there was a bias toward overestimating or underestimating the magnitude. The distribution of the residuals (i.e., the normalized responses minus the mean conditioned on the treatment) failed the Shapiro-Wilk test for normality (W = 0.96, p < 10⁻³). Thus, we applied the non-parametric Wilcoxon signed-rank test for a difference between zero and the pseudomedian of the normalized responses, with Bonferroni-Holm correction. We also applied the Fligner-Killeen test to determine whether the variability of the numerical responses differed among the four treatments. As a post-hoc test, we applied the Ansari-Bradley test for differences in scale parameters with Bonferroni-Holm correction. In all tests, we considered p-values ≤ 0.05 as statistically significant. However, p-values alone are not a good basis for drawing insights [35]. Therefore, the supplemental text, available online, also reports confidence intervals for all pairwise comparisons. The data and R scripts used for our statistical analysis are publicly available at https://github.com/kvelon/cartogram-legend-effectiveness.
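The pipeline above maps almost one-to-one onto base-R functions. The following sketch shows the skeleton for one task type; it is our illustration, not the authors' released script, and the data frame `d` (columns `resp`, `treatment`) and the matrix `completed` (participants × treatments, 0/1) are hypothetical stand-ins for the published data set.

```r
# Omnibus test for treatment differences in the skewed normalized responses.
kruskal.test(resp ~ treatment, data = d)

# Post-hoc: pairwise Mann-Whitney U tests with Bonferroni-Holm correction.
pairwise.wilcox.test(d$resp, d$treatment, p.adjust.method = "holm")

# Task completion (binary, within-subject): Cochran's Q omnibus test, e.g.,
#   DescTools::CochranQTest(completed)
# followed by pairwise McNemar tests, e.g.,
#   mcnemar.test(table(completed[, 1], completed[, 2]))

# Estimate Administrative Unit only: bias and spread of the estimates.
shapiro.test(resid(lm(resp ~ treatment, data = d)))   # residual normality
wilcox.test(d$resp[d$treatment == "static legend"], mu = 0)  # signed-rank
fligner.test(resp ~ treatment, data = d)              # equal-spread omnibus
ansari.test(d$resp[d$treatment == "no features"],     # post-hoc scale test
            d$resp[d$treatment == "selectable legend + grid lines"])
```

Holm-adjusted p-values for the signed-rank and Ansari-Bradley families can be obtained by collecting the raw p-values and calling p.adjust(p, method = "holm").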
Prior to the experiment, we believed that responses to Estimate Administrative Unit tasks would not exhibit a systematic trend of overestimating or underestimating areas. However, we anticipated that legends and grid lines would reduce variability in the responses because these features would provide a visual guide for the estimation. Therefore, we assumed that the distribution of numerical responses to Estimate Administrative Unit tasks would have

H1: a mean near the true magnitude of the region to be estimated for all treatments.
H2: less variability in treatments with additional features (i.e., static legend, grid lines, and selectable legend) than in the no-features treatment.

We generally hypothesized that legends and grid lines would improve accuracy and increase the task completion rate in all task types with numerical responses (i.e., Estimate, Compare, and Detect Change). However, we foresaw that additional features might cause slower responses because they would encourage participants to make more careful and, thus, time-consuming judgments. Hence, we expected that in numerical-response task types

H3: additional features would lead to more accurate estimates of magnitudes, conditioned on the task being completed.
H4: additional features would give participants greater confidence in estimating magnitudes and, thus, increase the task completion rate.
H5: participants would need more time in the presence of additional features.

Both task types that did not require numerical responses (i.e., Find Top and Cluster) could be answered by ranking areas. While we expected that legends and grid lines would allow participants to make more accurate judgments, we conjectured that these features would not provide direct support for ranking.

H6: For Find Top and Cluster tasks, we hypothesized that there would be no significant differences between treatment groups in terms of error rates and response times.

The results of our data analysis for the five task types with numerical responses are shown in Fig. 4. Tabular summaries of the results can be found in the supplemental text, available online. In the supplement, we also report confidence intervals of significant pairwise differences. The data analysis, outlined in Section 3.5, led to the following findings.

• Estimate Administrative Unit: Differences between treatments had no significant effect on the average accuracy. However, the Fligner-Killeen test rejected the null hypothesis of equal standard deviation in the normalized responses [χ²(3) = 12.21, p < 0.01]. Pairwise Ansari-Bradley tests revealed that the selectable-legend-with-grid-lines treatment had a significantly smaller standard deviation (0.21) than the no-features treatment (0.38). Wilcoxon signed-rank tests detected that the median was different from zero in all but the no-features treatment. Participants generally tended to underestimate the areas. The median ranged from −0.14 for the no-features treatment to −0.25 for the static-legend-only treatment. There was no evidence of a dependence of the response time on the treatment, but there was a significant effect on the task completion rate [χ²(3) = 36.97, p < 10⁻⁷]. When participants had access to a static legend with grid lines or a selectable legend with grid lines, all participants completed the task.
The post-hoc analysis of the participants' task completion rate confirmed that there were significant differences between the treatments without grid lines (lower completion rate) and those with grid lines (higher completion rate).

• Compare Administrative Units: Post-hoc analyses for the accuracy and response time showed pairwise differences between the static-legend-only treatment and the two treatments with grid lines. Hence, grid lines made responses more accurate but slower. For the task completion rate, we found pairwise differences between the no-features treatment and the other three treatments, as well as between the static-legend-only treatment and the two treatments with grid lines, which made participants more likely to complete the tasks. Furthermore, we found that the majority of participants (72.7%) did not complete the task when they did not have any features.

• Compare Zones: We did not observe a significant effect of the treatments on the accuracy of the response. However, we did observe a significant effect on the response time [χ²(3) = 31.18, p < 10⁻⁶] and the task completion rate [χ²(3) = 24.70, p < 10⁻⁴]. For the response time and task completion rate, we observed pairwise differences between the no-features treatment (fastest responses and lowest completion rate) and the two treatments with grid lines. We also observed a difference between the static-legend-only treatment and the two treatments with grid lines (slowest responses and highest completion rate).

• Detect Change in Administrative Unit: Differences between treatments had no significant effect on accuracy, but we observed a significant effect on the response time [χ²(3) = 9.64, p = 0.02] and task completion rate [χ²(3) = 45.20, p < 10⁻⁹]. A majority of participants (56.8%) did not complete the task when they did not have any additional features. Post-hoc analysis revealed pairwise differences between the no-features treatment and the other three treatments, and between the static-legend-only treatment and the selectable-legend-with-grid-lines treatment, which made participants more likely to complete the task.

• Detect Change in Zone: For the task completion rate, we found a pairwise difference between the no-features treatment (lowest completion rate) and the selectable-legend-with-grid-lines treatment (highest completion rate).

For the non-numerical task types, we measured the error rates and response times (Fig. 5). We did not give participants the option to skip non-numerical tasks; the tasks had to be completed to move on to the next question.

• Find Top: Differences between treatments had no significant effect on the error rate or on the response time for tasks of this type.

• Cluster: We did not observe a significant effect of the treatments on the error rate. However, we did observe a significant effect on the response time [χ²(3) = 11.97, p < 0.01]. Post-hoc analysis of the response time revealed pairwise differences between the no-features treatment (fastest responses) and both treatments with grid lines (slowest responses).

Regarding the hypotheses made prior to the experiment (see Section 3.6), we draw the following conclusions:

• H1 is rejected because our experiment revealed a tendency to underestimate the true magnitude in Estimate Administrative Unit tasks when a legend was available.

• H2 is partly supported by our results because we observed that the standard deviation of normalized responses decreased when a selectable legend with grid lines was available in Estimate Administrative Unit tasks.
• H3 is rejected because the treatments had no significant effect on the participants' accuracy in four out of five numerical-response task types.

• H4 and H5 are supported. For task types that required numerical responses, legends and grid lines tended to increase the task completion rates and response times.

• H6 is partially supported. As expected, we did not observe any significant effect of the treatments on the error rates of the Find Top and Cluster tasks. However, we found evidence that additional features slowed down the responses to Cluster tasks.

In the last part of the experiment, we asked the participants to rate the aesthetics and effectiveness of the three features (i.e., legend, grid lines, and selectable legend with grid lines). We showed the participants seven pairs of phrases, and each pair contained two phrases that were polar opposites, including:

• Hindering – Helpful
• Redundant – Essential
• Ugly – Elegant
• Difficult to understand – Easy to understand
• Showing magnitude poorly – Showing magnitude clearly
• Does not form an immediate impression – Forms an immediate impression

The participants rated the three features in terms of these seven phrase pairs on a 5-point Likert scale. In Fig. 6, we show the mean ratings for each of the seven phrase pairs and for each feature. The legend feature received the lowest mean rating among the three features for six out of seven phrase pairs (mean of all seven phrase pairs: 2.80). The selectable legend with grid lines had the highest mean rating (mean of all seven phrase pairs: 4.31).

In free-response questions at the end of the experiment, participants described their strategies under the different treatment conditions. For the no-features treatment, many participants wrote that they had first estimated the proportion of the total area that was occupied by the administrative unit or zone mentioned in the question. They then multiplied this proportion by the total value of the mapping variable stated below the legend. However, a few participants wrote that they were unable to perform the tasks without access to the features. With a legend, participants described two different strategies. The first strategy involved estimating the number of squares that could fit within an area; participants then multiplied this number by the legend value. The second strategy was to disregard the legend and apply the same method many participants used when they had no features; that is, they estimated the proportion of the total area and then multiplied the proportion by the total value of the mapping variable. For the static-legend-with-grid-lines treatment, almost every participant responded that they had performed the task by using the grid lines to count the number of squares covering the regions mentioned in the question. Finally, for the selectable-legend-with-grid-lines treatment, nearly 40% of the participants stated that they had chosen the legend size that had the closest fit for the region mentioned in the task; they then counted the number of squares. The "best" fit was a judgment that differed from individual to individual and from question to question. For certain participants, the best fit was obtained when the number of partially filled squares was minimized. Another strategy, which was adopted by approximately 20% of the participants, was to choose the largest of the three legend sizes to minimize the number of squares that needed to be counted.
The median of the normalized responses to the Estimate Administrative Unit tasks was significantly smaller than zero when a legend was available. The tendency to underestimate the ratio of a larger area (an administrative unit) to a smaller area (a legend symbol) is consistent with results from psychophysics about area perception [36]. Participants tended to underestimate the magnitude even in the presence of grid lines. A plausible explanation is that participants might have judged the area based on a count of squares that were completely contained in the administrative unit, and squares that were only partly inside the administrative unit may have been omitted. While the median response to Estimate Administrative Unit tasks did not become more accurate, the variability became smaller when participants had access to a selectable legend with grid lines. In our opinion, the smaller variability and, hence, greater consistency of the estimates more than offset the negative consequences of moderately underestimating the displayed magnitudes. An interesting question for future research is whether the tendency toward underestimation could be corrected using the "apparent magnitude scaling" technique [36].

Static legends without grid lines did not seem to have a statistically significant impact on the accuracy of the participants' responses when compared to the no-features treatment. Moreover, the legend did not significantly affect participants' response times compared to the no-features treatment. However, we observed that participants in the legend-only treatment were significantly faster than those in the two treatments with grid lines for the Compare Administrative Units and Compare Zones tasks. The faster responses are likely a consequence of the participants' strategies for performing the tasks when only a legend was available. The first strategy involved estimating how many squares were needed to cover an administrative unit or zone; the second strategy disregarded the legend altogether and simply involved estimating the proportion of the total area occupied by a region. Neither strategy involved meticulous counting of squares. Consequently, the static-legend-only treatment was faster than the two treatments with grid lines. Compared to the no-features treatment, the legend significantly reduced the number of participants who did not complete Compare Administrative Units and Detect Change in Administrative Unit tasks. Notably, the legend made participants statistically significantly more likely to complete tasks of these two administrative-unit-based task types but not their sibling task types for zones (i.e., Compare Zones and Detect Change in Zone). A plausible explanation is that the administrative-unit-based task types were more difficult when using the strategy of estimating proportions because administrative units typically made up a much smaller proportion of the total area than zones. In the attitude study, participants considered the legend feature to be more elegant than grid lines (mean rating of 3.27 versus 2.43), presumably because the legend was less obtrusive. However, participants gave lower ratings to the legend than to the grid lines for all other phrase pairs. A plausible explanation for the moderately negative attitude toward the legend is that it appeared some distance below the cartogram, whereas grid lines were directly superimposed on the cartogram.
Compared to the no-features treatment, the combination of a static legend with grid lines was not effective in improving the participants' accuracy. The only statistically significant effect observed was that this treatment allowed participants to give more accurate responses than the legend-only treatment for Compare Administrative Units tasks. The box plot of the normalized responses for Compare Administrative Units in Fig. 4 shows many outliers in the static-legend-with-grid-lines treatment. Therefore, the positive effect of the static legend with grid lines is counterbalanced by more outliers in both directions. This treatment also caused participants to be slower while performing Detect Change in Zone and Cluster tasks compared to the no-features treatment, as well as slower in performing Compare Zones tasks compared to the no-features treatment and the legend-only treatment. As mentioned in Section 4.3, when participants had no additional features or only a legend for Compare Zones or Detect Change in Zone tasks, many used the strategy of estimating the proportion of the total area occupied by the zone. However, with grid lines, most participants used the strategy of counting the number of squares that covered the zone. The second strategy is more time-consuming, and we believe that it explains why participants were slower with this treatment for these two task types. A potential improvement in the future might be to make grid cells interactive so that a group of squares can be highlighted by clicking and dragging (or by shift-clicking individual squares) while the total highlighted area is displayed next to the cartogram. The static legend with grid lines had a positive effect on participants' task completion rate. In four out of the five task types where participants were required to provide numerical responses (Estimate Administrative Unit, both Compare task types, and Detect Change in Administrative Unit), we observed that this treatment significantly increased the task completion rate compared to the no-features treatment. In fact, all participants completed the Estimate Administrative Unit tasks.

The selectable legend with grid lines did not appear to have an effect on accuracy when compared to the no-features treatment. However, we did observe a statistically significant effect on accuracy compared to the static-legend-only treatment in the Compare Administrative Units tasks. The likely reason is that, with three legend sizes, participants were able to choose a legend size that was closest to the sizes of the administrative units that needed to be compared, thereby making it easier to spot differences in their areas. Compared to having no features, participants were significantly slower in performing Compare Zones, Detect Change in Zone, and Cluster tasks when given access to a selectable legend with grid lines. We believe that participants were slower because, with access to a selectable legend, they tended to select a legend size and then count the number of squares that fit into an area. This strategy is more time-consuming than the strategy of roughly estimating proportions, which did not utilize any of the features. The selectable legend with grid lines made participants more likely to complete the task compared to the treatment without additional features. We observed a statistically significant effect in all five task types that required numerical responses.
Additionally, in these five task types, no treatment had a lower percentage of missing responses than the selectable-legend-with-grid-lines treatment. Compared to the legend-only treatment, the selectable legend with grid lines also made participants significantly more likely to complete the tasks for four of the five numerical-response task types: Estimate Administrative Unit, both Compare task types, and Detect Change in Administrative Unit. Overall, although this treatment was not effective in improving the accuracy of responses and even caused participants to be slower in performing the tasks, it was effective in increasing the task completion rate. Participants rated the selectable legend with grid lines as the most helpful (4.64 out of 5) and most essential (4.45 out of 5) among the three additional features (i.e., legend, grid lines, selectable legend with grid lines).

The experimental results suggest that the three additional features that we tested (i.e., static legend, static grid lines, and selectable legend with grid lines) did not affect how accurately the participants performed the seven task types. At the same time, there is some indication that the additional features negatively affected the participants' speed. However, we saw a significant increase in the task completion rate when participants had access to any of the three features, especially the selectable legend with grid lines, which reduced the percentage of missing responses to less than 12% for all the task types that required numerical responses. There are advantages and disadvantages to adding grid lines or a selectable legend. Thus, we believe that the decision to include any of the features in a cartogram should depend on the individual use case.

In web-based cartograms, it is possible to include more interactive features than those investigated in the present study. For example, an infotip [20] (i.e., a text box with information about the region at the position of the mouse pointer) serves a purpose similar to that of legends and grid lines; that is, it allows map readers to infer the magnitude of the numerical data value that is represented by a region in a cartogram. As illustrated in Fig. 7, an infotip can show precise values. Consequently, it is possible to retrieve information more accurately and quickly with an infotip than with legends or grid lines. However, we still recommend including a static legend by default because it communicates the magnitude of the displayed mapping variable [2]. Moreover, an infotip allows information retrieval for only one administrative unit at a time, whereas a legend and grid lines can be used just as easily for multiple administrative units (i.e., zones) as for one administrative unit. However, we do not recommend including grid lines or a selectable legend by default. Although the participants rated the selectable legend with grid lines as helpful rather than hindering, we believe that not all users of web-based cartograms intend to perform tasks that are similar to those in our experiment. Moreover, certain users may find a selectable legend with grid lines obstructive because it adds significant clutter. Therefore, we recommend adding a toggle for this feature to web-based cartograms so that users can decide for themselves whether they want to activate it. Common slide show and video formats (e.g., PowerPoint and MP4) do not allow individual access to interactive graphics during a presentation.
As for any other medium, we recommend including a legend because it contains essential information that allows readers to relate areas to quantitative data. However, for a slide show or video, there is no need for the legend to be selectable. Here, the intended purpose of the cartogram should be the main factor that determines whether static grid lines should be added. If the cartogram is intended solely for its visual impact, then adding grid lines may not be necessary. Another consideration is the amount of time the viewers are given to look at the cartogram. On one hand, a teacher may include a cartogram in a PowerPoint presentation and give students enough time to extract numerical information from the cartogram. On the other hand, a business consultant may include a cartogram in a PowerPoint presentation and may only display the slide for a few seconds. In these scenarios, the teacher should show a cartogram with grid lines, whereas the consultant should use a cartogram without them. If the cartogram is to be printed on paper, it is impossible to include interactive features such as a selectable legend or an infotip. In this case, we recommend adding a static legend with grid lines. Although our results do not indicate that the grid lines improved accuracy, they increased the number of participants completing the tasks. Therefore, readers are more likely to engage with a cartogram, rather than quickly passing over it, when grid lines are present.

Cartograms are a useful type of data visualization because they simultaneously show geography and statistics. However, previous studies have shown that retrieving quantitative information from cartograms is not a trivial task [1], [13], [37]. The purpose of this study was to find out whether legends and grid lines, both with and without interactivity, can support information retrieval from contiguous area cartograms. The results of our experiment show that these additional features cause map readers to be slower in estimating numerical values. The estimates are less variable, but not more accurate, when legends or grid lines are added. However, the additional features, especially the selectable legend with grid lines, have the positive effect that they significantly increase map readers' confidence so that readers are more likely to complete cartogram reading tasks. We acknowledge that our study was limited in scope to contiguous cartograms. Therefore, it remains a question for further research whether legends and grid lines are effective for other cartogram types (e.g., rectangular [38] and mosaic cartograms [39]). With these considerations in mind, we do not recommend incorporating grid lines into cartograms as a standard practice. Instead, we believe that it is crucial to examine the use case and the intended function of the cartogram before deciding which additional features are to be included. For contiguous cartograms, our study provided quantitative evidence that legends and grid lines cause readers to engage more with the presented data. We hope that our recommendations help cartograms achieve their potential to become "a more socially just form of mapping" [40] that effectively highlights inequalities (e.g., in income [41] and health [42]) to a wide audience.
Fig. 7. Example of the infotip feature shown on maps of Singapore. When the user hovers the mouse cursor over a region, a pop-up appears that contains the region's name and numerical data.

References
[1] Communication aspects of value-by-area cartograms.
[2] Motivating good practices for the creation of contiguous area cartograms.
[3] Die Erde in Karten: So haben Sie die Welt noch nicht gesehen [The Earth in maps: You have never seen the world like this].
[4] Stats: Data and Models.
[5] The carbon map.
[6] A new algorithm for distance cartogram construction.
[7] Noncontiguous area cartograms.
[8] The Visual Display of Quantitative Information.
[9] Self-adjusting legends for proportional symbol maps.
[10] The grammar of graphics.
[11] Anchor effects and the estimation of graduated circles and squares.
[12] Graphical perception: Theory, experimentation, and application to the development of graphical methods.
[13] Empirical study of cartograms.
[14] Kognitionsstudien mit mengentreuen Flächenkartogrammen [Cognition studies with area-true cartograms].
[15] Effectiveness of cartogram for the representation of spatial data.
[16] Experimental evaluation of the usability of cartogram for representation of GlobeLand30 data.
[17] Recognition and recall of geographic data in cartograms.
[18] Stepping over the line: Technological constraints and the new cartography.
[19] Active legends for interactive cartographic animation.
[20] Task-based effectiveness of interactive contiguous area cartograms.
[21] go-cart.io: A web application for generating contiguous cartograms.
[22] Creating cartograms online.
[23] Task taxonomy for cartograms.
[24] Fast flow-based algorithm for creating density-equalizing map projections.
[25] ColorBrewer: Color advice for maps.
[26] Evaluating cartogram effectiveness.
[27] Learning from cartograms: The effects of region familiarity.
[28] The Latin square principle in the design and analysis of psychological experiments.
[29] Kruskal-Wallis test.
[30] Holm's sequential Bonferroni procedure.
[31] Mann-Whitney U test.
[32] Which is the correct statistical test to use?
[33] McNemar test.
[34] Two-sided exact tests and matching confidence intervals for discrete data.
[35] Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis.
[36] The relative effectiveness of some common graduated point symbols in the presentation of quantitative data.
[37] Recognition of areal units on topological cartograms.
[38] On rectangular cartograms.
[39] Mosaic drawings and cartograms.
[40] Area cartograms: Their use and creation.
[41] Adapted Dorling cartogram on wage inequality in Portugal.
[42] Mapping the inequality of the global distribution of seasonal influenza vaccine.

Acknowledgments
We thank Ian K. Duncan and Chen-Chieh Feng for helpful discussions. This work was supported by the Singapore Ministry of Education (AcRF Tier 1 Grant IG18-PRB104, R-607-000-401-114) and the Yale-NUS Summer Research Programme. We would like to thank Editage (www.editage.com) for English language editing.

Kelvin L. T. Fung received his B.Sc. in mathematical, computational, and statistical sciences from Yale-NUS College (Singapore) in 2021. He is currently a student at University College London, pursuing an M.Sc. in machine learning, and will graduate in 2022.