Bringing Cognitive Augmentation to Web Browsing Accessibility
Pina, Alessandro; Baez, Marcos; Daniel, Florian
2020-12-07

(This is a post-peer-review, pre-copyedit version of an article accepted to the International Workshop on AI-enabled Process Automation at ICSOC 2020.)

In this paper we explore the opportunities brought by cognitive augmentation to provide a more natural and accessible web browsing experience. We explore these opportunities through conversational web browsing, an emerging interaction paradigm for the Web that enables blind and visually impaired users (BVIP), as well as regular users, to access the contents and features of websites through conversational agents. Informed by the literature, our previous work and prototyping exercises, we derive a conceptual framework for supporting the conversational web browsing needs of BVIP, and then focus on the challenges of automatically providing this support, describing our early work and a prototype that leverages heuristics based on structural and content features only.

Accessing the Web has long relied on users correctly processing and interpreting visual cues in order to have a proper user experience. Web browsers, as well as information and services on the Web, are optimised to make full use of users' visual perceptual capabilities for organising and delivering content and fulfilling their goals. This introduces problems for blind and visually impaired people (BVIP) who, due to genetic, health or age-related conditions, are not able to effectively rely on their visual perception [7].

Assistive technologies such as screen readers have traditionally supported BVIP users in interacting with visual interfaces. These tools exploit the accessibility tags used by Web developers and content creators to facilitate access to information and services online, typically by reading out the elements of the website sequentially from top to bottom (see Figure 1). They are usually controlled with a keyboard, offering shortcuts to navigate and access content. The challenges faced by BVIP in browsing the Web with this type of support are well documented in the literature, ranging from websites not designed for accessibility [19, 11] to limitations of screen reading technology [22, 30, 3].

Cognitive augmentation has been regarded as a promising direction to empower populations challenged by the traditional interaction paradigm for accessing information and services [8]. Conversational browsing is an emerging interaction paradigm for the Web that builds on this promise to enable BVIP, as well as regular users, to access the contents and services provided by websites through dialog-based interactions with a conversational agent [5]. Instead of relying on the sequential navigation and keyboard shortcuts provided by screen readers, this approach would enable BVIP to express their goals by directly "talking to websites". The first step towards this vision was to identify the conceptual vocabulary for augmenting websites with conversational capabilities and to explore techniques for generating chatbots out of websites equipped with bot-specific annotations [14].
In this paper we take a deeper dive into the opportunities of cognitive augmentation for BVIP by building a conceptual framework that draws on lessons learned from the literature, our prior work and prototyping exercises to highlight areas for conversational support. We then focus on the specific tasks that are currently served rather poorly by screen readers, and describe our early work towards a heuristic-based approach that leverages visual and structural properties of websites to translate the experience of graphical user interfaces into a conversational medium.

[Figure 2: Example conversational browsing session on The Tambury Gazette. The agent announces the website ("You are now visiting the Tambury Gazette. This is a news website in English. There are 17 main options. The top 5 are: 1. Politics, 2. Local, 3. World, 4. Sports, 5. Weather. You can ask for 'more options', or 'go to' and the name or number of the option to access your choice."), reports lookup results ("There are 2 matching articles in this page, with titles: 1. Tambury matches postponed indefinitely. 2. New vaccine in development for ..."), and confirms navigation ("You are now in the article 'Tambury matches postponed indefinitely'. This is part of the 'Sports' section.").]

Reading the newspaper. Peter, 72, is a visually impaired man affected by Parkinson's disease, who keeps hearing about the new COVID-19 virus on TV and wants to stay constantly updated on recent news from his favorite local newspaper, The Tambury Gazette. However, his experience with screen readers has been poor and frustrating, often requiring assistance from others to get informed.

The vision is to enable users like Peter to browse the Web by directly "talking" to websites. As seen in Figure 2, the user interacts with the website through dialog-based, voice-based interactions with a conversational agent (e.g., Google Assistant). The user can start the session by searching for the website, or by opening it directly if already bookmarked. Once the website is open, the user can inquire about the relevant actions available in the current context (e.g., "What can I do in this website?"), which are automatically derived by the conversational agent based on heuristics. Instead of sequentially going through the website, the user can look up specific information matching their interests (e.g., "Lookup COVID"). The user can then follow up on the list of resulting articles and choose one to be read out. Throughout these interactions, the user can use voice commands to navigate and get oriented in the website.

The above illustrates the experience of browsing a website by leveraging natural language commands that improve on the keyboard-based sequential navigation of websites. As we will see, more advanced support can be provided by leveraging the contents and application-specific domain knowledge, but in this work we focus on improving on the features provided by screen readers, making no assumptions about compliance with accessibility and bot-specific annotations.

Enabling conversational browsing requires, first and foremost, understanding the type of support needed to meet BVIP needs. Informed by previous research, our own work and prototyping experiences, we highlight a few relevant areas in Table 1 and describe them below.

Table 1. Areas of conversational support with example utterances.

  Category     Intent             Example utterances
  Browsing     Outline            "What can I do in this website?"
               Orientation        "Where am I?"
               Navigation         "Go to the main page"; "Next article"
               Lookup             "Lookup COVID"
               Reading            "Read article"; "Stop reading"
  Workflows    Element-specific   "Fill out the form"
               App-specific       "Post a new comment on the news article"
  Operations   Open               "Open The Tambury Gazette"
               Search             "Search for The Tambury Gazette"
               Bookmark           "Bookmark page The Tambury Gazette"
               Speech             "Increase speech rate"
               Verbosity          "Turn on short interactions"
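To make the framework more concrete, the areas in Table 1 can be thought of as an intent catalog that the conversational agent consumes. The sketch below is a minimal, hypothetical Python encoding of this catalog with a naive matcher; the intent names, matching strategy and overall structure are illustrative assumptions, not the design of our prototype or of any assistant platform.

```python
# Hypothetical intent catalog derived from Table 1 (illustrative only).
INTENT_CATALOG = {
    "browsing": {
        "outline":     ["What can I do in this website?"],
        "orientation": ["Where am I?"],
        "navigation":  ["Go to the main page", "Next article"],
        "lookup":      ["Lookup COVID"],
        "reading":     ["Read article", "Stop reading"],
    },
    "workflows": {
        "element_specific": ["Fill out the form"],
        "app_specific":     ["Post a new comment on the news article"],
    },
    "operations": {
        "open":      ["Open The Tambury Gazette"],
        "search":    ["Search for The Tambury Gazette"],
        "bookmark":  ["Bookmark page The Tambury Gazette"],
        "speech":    ["Increase speech rate"],
        "verbosity": ["Turn on short interactions"],
    },
}

def match_intent(utterance: str) -> str:
    """Very naive matcher used only for illustration: pick the intent whose
    sample utterance shares the most words with the user's utterance.
    A real agent would rely on a trained NLU component instead."""
    words = set(utterance.lower().split())
    best, best_overlap = "fallback", 0
    for category, intents in INTENT_CATALOG.items():
        for intent, samples in intents.items():
            for sample in samples:
                overlap = len(words & set(sample.lower().split()))
                if overlap > best_overlap:
                    best, best_overlap = f"{category}.{intent}", overlap
    return best

print(match_intent("go to the sports section"))  # -> "browsing.navigation"
```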
Conversational access to content and metadata. BVIP should be able to satisfy their information needs without having to sequentially go through all of the website's content and structure, a process that can be time-consuming and frustrating for screen reader users [19]. This support is rooted in ongoing efforts in conversational Q&A [13] and document-centered digital assistants [18]. The idea is to enable BVIP users to perform natural language queries (NLQ) on the contents of websites, and to inquire about the properties defined in the website's metadata. For example, a BVIP user might request an overview of the website (e.g., "What is this website about?"), engage in question answering, with questions that can be answered by referring directly to the contents of the website (e.g., "When are the sports coming back?"), and ask for summaries of the contents of the website, its parts, or the responses from the agent (e.g., "Summarise the article"). Users might also ask about the properties and metadata of the artefacts, such as last modification, language or authors (e.g., "Who are the authors of this article?"), or simply engage in yes/no questions on metadata and content (e.g., "Is the document written in English?").

Conversational browsing. BVIP should be able to explore and navigate the artefacts using natural language, so as to support more traditional information-seeking tasks. The idea is to improve on the navigation provided by traditional screen readers, which often requires learning complex shortcuts and lower-level knowledge about the structure of the artefact (e.g., to move between different sections), by allowing users to utter simpler, high-level commands in natural language. This category of support is inspired by work in Web accessibility on using spoken commands to interact with non-visual web browsers [29, 3] and by conversational search [27, 28]. For example, BVIP should be able to inquire about the website organization and get an outline (e.g., "What can I do in this website?"), navigate through the structure of the website and even across linked webpages (e.g., "Go to the main page"), and get oriented during this exploratory process (e.g., "Where am I?"). The user should also be able to look up relevant content to avoid sequentially navigating the page structure (e.g., "Lookup COVID").

Conversational user workflows. BVIP should also be able to enact user workflows by leveraging the features provided by the website. Users typically do this by enacting their plan step by step: following links, filling out forms and pressing buttons. These low-level interactions have been explored by speech-enabled screen readers such as Capti-Speak [3], which enable users to utter commands such as "press the cart button" or "move to the search box". We call these element-specific intents. In our previous work we highlighted the need for supporting application-specific intents, i.e., intents that are specific to the offerings of a website (e.g., "Post a new comment on the news article") and that trigger a series of low-level actions as a result.
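To illustrate the difference between element-specific and application-specific intents, the sketch below expands a hypothetical "post a comment" intent into the sequence of element-level actions an agent could execute on the page (e.g., through a headless browser). The action vocabulary and CSS selectors are assumptions made for illustration; they are not the bot-specific annotation vocabulary of [14].

```python
# Hypothetical expansion of an application-specific intent into the
# element-specific actions a browsing agent would perform on the page.
# Action names and CSS selectors are illustrative assumptions.
POST_COMMENT_WORKFLOW = [
    {"action": "focus", "target": "textarea[name='comment']"},
    {"action": "type",  "target": "textarea[name='comment']", "value": "{comment_text}"},
    {"action": "click", "target": "button[type='submit']"},
]

def enact(workflow, slots):
    """Walk the workflow and emit the low-level steps the agent would run;
    here they are only printed, but a real agent would drive a browser."""
    for step in workflow:
        value = step["value"].format(**slots) if "value" in step else ""
        print(step["action"], step["target"], value)

# The user says: "Post a new comment on the news article".
enact(POST_COMMENT_WORKFLOW, {"comment_text": "Get well soon, Tambury!"})
```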
In our earlier approach, supporting such application-specific intents required bot-specific annotations [14]. The automation of such workflows has also been explored in the context of Web accessibility. For example, Bigham et al. [10] introduced the TrailBlazer system, which facilitates the creation of web automation macros by providing step-by-step suggestions based on CoScript [20]. Workflow automation is also the focus of research on robotic process automation [21].

Conversational control operations. BVIP should be able to easily access and personalise the operational environment. This ranges from simple operations that support the main browsing experience, such as searching for and opening websites and managing bookmarks, to personalising properties of the voice-based interaction. Recent works in this context have highlighted the importance of providing BVIP with greater control over the experience. Abdolrahmani et al. [1] investigated the experience of BVIP with voice-activated personal assistants and reported that users often find responses too verbose, feel frustrated at interacting at a lower pace than desired, or are unable to adapt interactions to the requirements of social situations. It has been argued [12] that the guidelines of major commercial voice-based assistants fail to capture the preferences and experience of BVIP, who are used to faster and more efficient interactions with screen readers. This calls for further research into conversation design tailored to BVIP.

There are many challenges in delivering the type of support required for conversational browsing. As discussed in our prior work [14], it requires deriving two important types of knowledge (sketched in the example below):
- Domain knowledge: knowledge about the type of functionality and content provided by the website, which informs the agent of what should be exposed to the users (e.g., intents, utterances and slots);
- Interaction knowledge: knowledge about how to operate and automate the browsing interactions on behalf of the user.
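A minimal sketch of how these two types of knowledge could be represented for a news website is given below; the schema, field names and values are assumptions for illustration rather than the model used in [14].

```python
# Illustrative (assumed) representation of the two knowledge types for a
# news website; not a fixed schema.
DOMAIN_KNOWLEDGE = {
    "intents": {
        "read_section": {
            "utterances": ["Go to {section}", "Open the {section} section"],
            "slots": {"section": ["Politics", "Local", "World", "Sports", "Weather"]},
        },
        "lookup": {
            "utterances": ["Lookup {query}", "Search the page for {query}"],
            "slots": {"query": "free text"},
        },
    },
}

INTERACTION_KNOWLEDGE = {
    # How the agent operates the website to fulfil each intent.
    "read_section": [
        {"action": "click", "target": "menu link with text '{section}'"},
        {"action": "read",  "target": "main content area"},
    ],
    "lookup": [
        {"action": "scan", "target": "page text", "match": "{query}"},
    ],
}
```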
Websites are not equipped with the required conversational knowledge to enable voice-based interaction, which has motivated three general approaches. The annotation-based approach provides conversational access to websites by enabling developers and content producers to provide appropriate annotations [5]. Early approaches can be traced back to enabling access to web pages through telephone call services via VoiceXML [23]. Another general approach to voice-based accessible information is to rely on accessibility technical specifications, such as Accessible Rich Internet Applications (WAI-ARIA) [16], but these specifications are meant for screen reading. Baez et al. [5] instead propose to equip websites with bot-specific annotations. The challenge in this regard is the adoption of annotations by developers and content producers: a recent report analysing one million websites found that a staggering 98.1% of them had detectable accessibility errors, illustrating how limited the adoption of accessibility tags and proper design choices on the Web still is. The crowd-based approach relies on collaborative metadata augmentation [9, 26], i.e., on the crowd to "fix" accessibility problems or to provide annotations for voice-based access. The Social Accessibility project [25] is one such initiative, whose database supports various non-visual browsers. Still, collaborative approaches require a significant community to be viable, and even then the number of services and the rate at which they are created make it virtually impossible to cover all websites. Automatic approaches, finally, support non-visual browsing through heuristics and algorithms. The work in this space has focused on automatically fixing accessibility issues (e.g., page segmentation [15]), deriving the browsing context [22] or predicting the next user actions based on the current context [24]. These approaches, however, have not focused on enabling conversational access to websites.

The above illustrates the diverse approaches that can support the cognitive augmentation of websites to enable voice-based conversational browsing. In this work we explore automatic approaches, which have not been studied in the context of conversational access to websites.

From the conceptual framework, it becomes clear that enabling BVIP to browse websites conversationally requires us to:
- determine the main (and contextual) offerings of the website;
- identify the current navigation context;
- enable navigation through meaningful segments of the website;
- allow for scanning and searching for information in the website.

Determining the offerings of the website can be done by leveraging the components that graphical user interfaces use to guide users through their offerings: menus. Menus have specific semantic tags in HTML (e.g., the <nav> element).
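A minimal sketch of how this heuristic could be implemented is shown below, assuming Python with the requests and BeautifulSoup libraries (the actual prototype may differ). It collects the links found inside semantic menu containers and exposes them as the options the agent can announce and navigate to.

```python
import requests
from bs4 import BeautifulSoup

def extract_menu_options(url: str) -> list:
    """Heuristic sketch: treat links inside semantic menu containers
    (<nav>, <menu>, or elements with role='navigation') as the website's
    main offerings."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    containers = soup.find_all(["nav", "menu"])
    containers += soup.find_all(attrs={"role": "navigation"})

    options, seen = [], set()
    for container in containers:
        for link in container.find_all("a", href=True):
            label = link.get_text(strip=True)
            if label and label.lower() not in seen:
                seen.add(label.lower())
                options.append({"label": label, "href": link["href"]})
    return options

# Example: announce the top options, as in the dialog of Figure 2.
# options = extract_menu_options("https://example.org")
# print(f"There are {len(options)} main options. The top 5 are:")
# for i, option in enumerate(options[:5], start=1):
#     print(f"{i}. {option['label']}")
```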