title: It Takes Two to Tango: Combining Visual and Textual Information for Detecting Duplicate Video-Based Bug Reports
authors: Cooper, Nathan; Bernal-Cárdenas, Carlos; Chaparro, Oscar; Moran, Kevin; Poshyvanyk, Denys
date: 2021-01-22

When a bug manifests in a user-facing application, it is likely to be exposed through the graphical user interface (GUI). Given the importance of visual information to the process of identifying and understanding such bugs, users are increasingly making use of screenshots and screen-recordings as a means to report issues to developers. However, when such information is reported en masse, such as during crowd-sourced testing, managing these artifacts can be a time-consuming process. As the reporting of screen-recordings in particular becomes more popular, developers are likely to face challenges related to manually identifying videos that depict duplicate bugs. Due to their graphical nature, screen-recordings present challenges for automated analysis that preclude the use of current duplicate bug report detection techniques. To overcome these challenges and aid developers in this task, this paper presents Tango, a duplicate detection technique that operates purely on video-based bug reports by leveraging both visual and textual information. Tango combines tailored computer vision techniques, optical character recognition, and text retrieval. We evaluated multiple configurations of Tango in a comprehensive empirical evaluation on 4,860 duplicate detection tasks that involved a total of 180 screen-recordings from six Android apps. Additionally, we conducted a user study investigating the effort required for developers to manually detect duplicate video-based bug reports and compared this to the effort required to use Tango. The results reveal that Tango's optimal configuration is highly effective at detecting duplicate video-based bug reports, accurately ranking target duplicate videos in the top-2 returned results in 83% of the tasks. Additionally, our user study shows that, on average, Tango can reduce developer effort by over 60%, illustrating its practicality.

I. INTRODUCTION

Many modern mobile applications (apps) allow users to report bugs in a graphical form, given the GUI-based nature of mobile apps. For instance, Android and iOS apps can include built-in screen-recording capabilities in order to simplify the reporting of bugs by end-users and crowd-testers [10, 12, 18]. The reporting of visual data is also supported by many crowd-testing and bug reporting services for mobile apps [8-13, 15, 17-19], which intend to aid developers in collecting, processing, and understanding the reported bugs [26, 77]. The practice of sharing images to convey additional context for understanding bugs, e.g., in Stack Overflow Q&As, has been steadily increasing over the last few years [81]. Given this trend and the increased integration of screen-capture technology into mobile apps, developers are likely to face a growing set of challenges related to processing and managing app screen-recordings in order to triage and resolve bugs, and hence maintain the quality of their apps. One important challenge that developers will likely face in relation to video-related artifacts is determining whether two videos depict and report the same bug (i.e., detecting duplicate video-based bug reports), as is currently done for textual bug reports [27, 86, 87].
When video-based bug reports are collected at scale, either via a crowdsourced testing service [8-13, 15, 17-19] or by popular apps, the sizable corpus of collected reports will likely lead to significant developer effort dedicated to determining if a new bug report depicts a previously-reported fault, which is necessary to avoid redundant effort in the bug triaging and resolution process [26, 27, 73, 77] . In a user study which we detail later in this paper (Sec. III-E), we investigated the effort required for experienced programmers to identify duplicate video-based bug reports and found that participants reported a range of difficulties for the task (e.g., a single change of one step can result in two very similar looking videos showing entirely different bugs), and spent about 20 seconds of comprehension effort on average per video viewed. If this effort is extrapolated to the large influx of bug reports that could be collected on a daily basis [27, 35, 86, 87] , it illustrates the potential for the excessive effort associated with video-based duplicate bug detection. This is further supported by the plans of a large company that supports crowd-sourced bug reporting (name omitted for anonymity), which we contacted as part of eliciting the design goals for this research, who stated that they anticipate increasing developer effort in managing videobased reports and that they are planning to build a feature in their framework to support this process. To aid developers in determining whether video-based bug reports depict the same bug, this paper introduces TANGO, a novel approach that analyzes both visual and textual information present in mobile screen-recordings using tailored computer vision (CV) and text retrieval (TR) techniques, with the goal of generating a list of candidate videos (from an issue tracker) similar to a target video-based report. In practice, TANGO is triggered upon the submission of a new video-based report to an issue tracker. A developer would then use TANGO to retrieve the video-based reports that are most similar (e.g., top-5) to the incoming report for inspection. If duplicate videos are found in the ranked results, the new bug report can be marked as a duplicate in the issue tracker. Otherwise, the developer can continue to inspect the ranked results until she has enough confidence that the newly reported bug was not reported before (i.e., it is not a duplicate). TANGO operates purely upon the graphical information in videos in order to offer flexibility and practicality. These videos may show the unexpected behavior of a mobile app (i.e., a crash or other misbehavior) and the steps to reproduce such behavior. Two videos are considered to be duplicates if they show the same unexpected behavior (aka a bug) regardless of the steps to reproduce the bug. Given the nature of screen-recordings, video-based bug reports are likely to depict unexpected behavior towards the end of the video. TANGO attempts to account for this by leveraging the temporal nature of video frames and weighting the importance of frames towards the end of videos more heavily than other segments. We conducted two empirical studies to measure: (i) the effectiveness of different configurations of TANGO by examining the benefit of combining visual and textual information from videos, as opposed to using only a single information source; and (ii) TANGO's ability to save developer effort in identifying duplicate video-based bug reports. 
To carry out these studies, we collected a set of 180 video-based bug reports from six popular apps used in prior research [25, 34, 78, 79], and defined 4,860 duplicate detection tasks that resemble those that developers currently face for textual bug reports, wherein a corpus of potential duplicates must be ranked according to their similarity to an incoming bug report. The results of these studies illustrate that TANGO's most effective configuration, which selectively combines visual and textual information, achieves 79.8% mRR and 73.2% mAP, an average rank of 1.7, a HIT@1 of 67.7%, and a HIT@2 of 83%. This means that TANGO is able to suggest correct duplicate reports in the top-2 of the ranked candidates for 83% of duplicate detection tasks. The results of the user study we conducted with experienced programmers demonstrate that, on a subset of the tasks, TANGO can reduce the time they spend in finding duplicate video-based bug reports by ≈ 65%. In summary, the main contributions of this paper are:
1) TANGO, a duplicate detection approach for video-based bug reports of mobile apps which is able to accurately suggest duplicate reports;
2) The results of a comprehensive empirical evaluation that measures the effectiveness of TANGO in terms of suggesting candidate duplicate reports;
3) The results of a user study with experienced programmers that illustrates TANGO's practical applicability by measuring its potential for saving developer effort; and
4) A benchmark (included in our online appendix [42]) that enables (i) future research on video-based duplicate detection, bug replication, and mobile app testing, and (ii) the replicability of this work. The benchmark contains 180 video-based bug reports with duplicates, source code, trained models, duplicate detection tasks, TANGO's output, and detailed evaluation results.

II. THE TANGO APPROACH

TANGO (detecting duplicate screen recordings of software bugs) is an automated approach based on CV and TR techniques, which leverages visual and textual information to detect duplicate video-based bug reports.

A. TANGO Overview

TANGO models duplicate bug report detection as an information retrieval problem. Given a new video-based bug report, TANGO computes a similarity score between the new video and the videos previously submitted by app users in a bug tracking system. The new video represents the query and the set of existing videos represents the corpus. TANGO sorts the corpus of videos in decreasing order by similarity score and returns a ranked list of candidate videos. In the list, videos that are more likely to show the same bug as the new video are ranked higher than those that show a different bug. TANGO has two major components, which we refer to as TANGO vis and TANGO txt (Fig. 1), that compute video similarity scores independently. TANGO vis computes the visual similarity and TANGO txt computes the textual similarity between videos. The resulting similarity scores are linearly combined to obtain a final score that indicates the likelihood of two videos being duplicates. In designing TANGO vis, we incorporated support for two methods of computing visual similarity, one of which is sensitive to the sequential order of visual data and one of which is not; we evaluate the effectiveness of these two techniques in the experiments described in Sec. III-IV. The first step in TANGO's processing pipeline (Fig. 1) is to decompose the query video, and the videos from the existing corpus, into their constituent frames using a given sampling rate (i.e., 1 or 5 frames per second, fps).
Then, the TANGO vis and TANGO txt components of the approach are executed in parallel. The unordered TANGO vis pipeline is shown at the top of Fig. 1, comprising steps V 1 - V 3; the ordered TANGO vis pipeline is illustrated in the middle of Fig. 1, comprising steps V 1, V 4, and V 5; and finally, the TANGO txt pipeline is illustrated at the bottom of Fig. 1 through steps T 1 - T 3. Any of these three pipelines can be used to compute the video ranking independently or in combination (i.e., combining the two TANGO vis pipelines together, one TANGO vis pipeline with TANGO txt, which we call TANGO comb, or all three; see Sec. III-C). Next, we discuss these three pipelines in detail. The unordered version of TANGO vis computes the visual similarity (S vis) of video-based bug reports by extracting visual features from video frames and converting these features into a vector-based representation for a video using a Bag-of-Visual-Words (BoVW) approach [58, 92]. This process is depicted at the top of Fig. 1. The visual features are extracted by the visual feature extractor model (V 1 in Fig. 1). Then, the visual indexer V 2 assigns to each frame feature vector a visual word from a visual codebook and produces a BoVW for the video. The visual encoder V 3, based on the video's BoVW, encodes the video using a TF-IDF representation that can be used for similarity computation. 1) Visual Feature Extraction: The visual feature extractor V 1 can either use the SIFT [75] algorithm to extract features, or SimCLR [40], a recently proposed Deep Learning (DL) model capable of learning visual representations in an unsupervised, contrastive manner. TANGO's implementation of SimCLR is adapted to extract visual features from app videos. The first method by which TANGO can extract visual features is the Scale-Invariant Feature Transform (SIFT) [75] algorithm. SIFT is a state-of-the-art technique for extracting local features from images that are invariant to scale and orientation. These features can be matched across images for detecting similar objects. This matching ability makes SIFT promising for generating features that can help locate duplicate images (in our case, duplicate video frames) by aggregating the extracted features. TANGO's implementation of SIFT does not resize images and uses the top-10 features that are the most invariant to changes, based on the local contrast of neighboring pixels (higher contrast usually means more invariant). This is done to reduce the number of SIFT features, which can reach thousands for a single frame, and make the visual indexing V 2 (through k-Means; see Sec. II-B2) computationally feasible. The other technique that TANGO can use to extract features is SimCLR. In essence, the goal of this technique is to generate robust visual features that cluster similar images together while maximizing the distance between dissimilar images in an abstract feature space.
This is accomplished by (i) generating sets of image pairs (containing one original image and one augmented image) and applying a variety of random augmentations (i.e., image cropping, horizontal flipping, color jittering, and gray-scaling); (ii) encoding this set of image pairs using a base encoder, typically a variation of a convolutional neural network (CNN); and (iii) training a multi-layer-perceptron (MLP) to produce feature vectors that increase the cosine similarity between each pair of image variants and decrease the cosine similarity between negative examples, where negative examples for a given image pair are represented as all other images not in that pair, for a given training batch. TANGO's implementation of SimCLR employs the ResNet50 [53] CNN architecture as the base encoder, where this architecture has been shown to be effective [40] . To ensure that TANGO's visual feature extractor is tailored to the domain of mobile app screenshots, we trained this component on the entire RICO dataset [43] , which contains over 66k Android screenshots from over 9k of the most popular apps on Google Play. Our implementation of SimCLR was trained using a batch size of 1, 792 and 100 epochs, the same hyperparameters (e.g., learning rate, weight decay, etc.) recommended by Chen et al. [40] in the original SimCLR paper, and resized images to 224×224 to ensure consistency with our base ResNet50 architecture. The training process was carried out on an Ubuntu 20.04 server with three NVIDIA T4 Tesla 16GB GPUs. The output of the feature extractor for SimCLR is a feature vector (of size 64) for each frame of a given video. 2) Visual Indexing: While the SimCLR or SIFT feature vectors generated by TANGO's visual feature extractor V 1 could be used to directly compute the similarity between video frames, recent work has suggested that a BoVW approach combined with a TF-IDF similarity measure is more adept to the task of video retrieval [66] . Therefore, to transform the SimCLR or SIFT feature vectors into a BoVW representation, TANGO uses a visual indexing process V 2 . This process produces an artifact known as a Codebook that maps SimCLR or SIFT feature vectors to "visual words" -which are discrete representations of a given image, and have been shown to be suitable for image and video recognition tasks [66] . The Codebook derives these visual words by clustering feature vectors and computing the centroids of these clusters, wherein each centroid corresponds to a different visual word. The Codebook makes use of the k-Means clustering algorithm, where the k represents the diversity of the visual words, and thus can affect the representative power of this indexing process. TANGO's implementation of this process is configurable to 1k, 5k, or 10k for the k number of clusters (i.e., the number of visual words -VW). 1k VW and 10k VW were selected as recommended by Kordopatis-Zilos et al. [66] and we included 5k VW as a "middle ground" to better understand how the number of visual words impacts TANGO's performance. A Codebook is generated only once for a given k, however, it must be trained before it can be applied to convert an input feature vector to its corresponding visual word(s). Once trained, a Codebook can then be used to map visual words from frame feature vectors without any further modification. Thus, we trained TANGO's six Codebooks, three for SIFT and three for SimCLR, using features extracted from 15, 000 randomly selected images from the RICO dataset [43] . 
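To make the indexing and encoding steps that follow more concrete, the sketch below illustrates how per-frame feature vectors (e.g., the 64-dimensional SimCLR embeddings) could be clustered into a visual codebook with k-Means, mapped to visual words, and encoded as a TF-IDF-weighted Bag-of-Visual-Words that is compared with cosine similarity. This is a simplified illustration rather than TANGO's released implementation: the placeholder training data, the use of scikit-learn's MiniBatchKMeans, and the IDF computation over individual images are assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# --- Codebook training (illustrative stand-in for TANGO's k-Means codebooks) ---
# `training_features`: per-frame feature vectors (e.g., 64-d SimCLR embeddings)
# extracted from a sample of RICO screenshots; placeholder data here.
rng = np.random.default_rng(0)
training_features = rng.normal(size=(15_000, 64)).astype(np.float32)

n_visual_words = 1_000  # 1k visual words (TANGO also uses 5k and 10k)
codebook = MiniBatchKMeans(n_clusters=n_visual_words, random_state=0).fit(training_features)

def bovw_histogram(frame_features: np.ndarray) -> np.ndarray:
    """Map each frame's feature vector to its nearest visual word (centroid)
    and count word occurrences to form the video's bag of visual words."""
    words = codebook.predict(frame_features)
    return np.bincount(words, minlength=n_visual_words).astype(np.float64)

# --- TF-IDF encoding and similarity ---
# IDF computed over the image corpus, treating each image as one document.
corpus_words = codebook.predict(training_features)
df = np.bincount(corpus_words, minlength=n_visual_words) + 1  # +1 avoids div-by-zero
idf = np.log(len(training_features) / df)

def encode(frame_features: np.ndarray) -> np.ndarray:
    """TF-IDF-weighted BoVW vector for one video (rows = frame features)."""
    return bovw_histogram(frame_features) * idf

def visual_similarity(video_a: np.ndarray, video_b: np.ndarray) -> float:
    """Cosine similarity between the encodings of two videos (the unordered S BoVW score)."""
    return float(cosine_similarity(encode(video_a).reshape(1, -1),
                                   encode(video_b).reshape(1, -1))[0, 0])
```

In TANGO, a codebook is trained once per vocabulary size and then reused, without modification, to index every new video.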
We did not use the entire RICO dataset due to computational constraints of the k-means algorithm. After the feature vector for a video frame is passed through the visual indexing process, it is mapped to its BoVW representation by a trained Codebook. To do this, the Codebook selects the closest centroid to each visual feature vector, based on Euclidean distance. For SIFT, this process may generate more than one feature vector for a single frame, due to the presence of multiple SIFT feature descriptors. In this case, TANGO assigns multiple visual words to each frame. For SimCLR, TANGO assigns one visual word to each video frame, as SimCLR generates only one vector per frame. 3) Visual Encoding: After the video is represented as a BoVW, the visual encoder V 3 computes the final vector representation of the video through a TF-IDF-based approach [91] . The term frequency (TF) is computed as the number of visual words occurrences in the BoVW representation of the video, and the inverse document frequency (IDF) is computed as the number of occurrences of a given visual word in the BoVW representations built from the RICO dataset. Since RICO does not provide videos but individual app screenshots, we consider each RICO image as one document. We opted to use RICO to compute our IDF representation for two reasons: (i) to combat the potentially small number of frames present in a given video recording, and (ii) to bolster the generalizability of our similarity measure across a diverse set of apps. 4) Similarity Computation: Given two videos, TANGO vis encodes them into their BoVW representations, and each video is represented as one visual TF-IDF vector. These vectors are compared using cosine similarity, which is taken as the visual similarity S of the videos (S vis = S BoV W ). The ordered version of TANGO vis considers the sequence of video frames when comparing two videos and is capable of giving more weight to common frames nearer the end of the videos, as this is likely where buggy behavior manifests. To accomplish this, the feature vector extractor V 1 is used to derive descriptive vectors from each video frame using either SimCLR or SIFT. TANGO determines how much the two videos overlap using an adapted longest common substring (LCS) algorithm V 4 . Finally, during the sequential comparison process V5 , TANGO calculates the similarity score by normalizing the computed LCS score. 1) Video Overlap Identification: In order to account for the sequential ordering of videos, TANGO employs two different versions of the longest common substring (LCS) algorithm. The first version, which we call fuzzy-LCS (f-LCS), modifies the comparison operator of the LCS algorithm to perform fuzzy matching instead of exact matching between frames in two videos. This fuzzy matching is done differently for SimCLR and SIFT-derived features. For SimCLR, given that each frame is associated with only a single visual word, the standard BoVW vector would be too sparse for a meaningful comparison. Therefore, we compare the feature vectors that SimCLR extracts from the two frames directly using cosine similarity. For SIFT, we utilize the BoVW vectors derived by the visual encoder V 3 , but at a per-frame level. The second LCS version, which we call weighted-LCS (w-LCS), uses the same fuzzy matching that f-LCS performs. However, the similarity produced in this matching is then weighted depending on where the two frames from each video appeared. 
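The sketch below illustrates the fuzzy LCS comparison just described, together with the positional weighting and normalization defined in the following paragraphs (Eq. 1). It is a simplified sketch rather than TANGO's exact algorithm (the detailed f/w-LCS algorithms are in the online appendix [42]); in particular, the match threshold used to decide when two frames are similar enough to count as a fuzzy match is an assumption introduced here for illustration.

```python
import numpy as np

def frame_similarity(fa: np.ndarray, fb: np.ndarray) -> float:
    """Fuzzy frame match: cosine similarity between per-frame feature vectors
    (for SimCLR) or per-frame BoVW vectors (for SIFT)."""
    denom = np.linalg.norm(fa) * np.linalg.norm(fb)
    return float(fa @ fb / denom) if denom else 0.0

def lcs_overlap(video_a, video_b, weighted=False, match_threshold=0.8):
    """Fuzzy longest-common-substring overlap between two frame sequences.
    `match_threshold` is an assumption of this sketch: pairs below it are
    treated as non-matches so that common runs can reset. If `weighted`,
    each match is scaled by (i/m) * (j/n) so that matches near the end of
    both videos count more (the w-LCS idea)."""
    m, n = len(video_a), len(video_b)
    dp = np.zeros((m + 1, n + 1))
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim = frame_similarity(video_a[i - 1], video_b[j - 1])
            if sim >= match_threshold:
                contrib = sim * (i / m) * (j / n) if weighted else sim
                dp[i, j] = dp[i - 1, j - 1] + contrib
                best = max(best, dp[i, j])
    return best

def f_lcs_score(video_a, video_b):
    """Normalize the fuzzy overlap by the length of the shorter video."""
    overlap = lcs_overlap(video_a, video_b, weighted=False)
    return overlap / min(len(video_a), len(video_b))

def w_lcs_score(video_a, video_b):
    """Normalize the weighted overlap by the maximum weighted overlap that a
    perfect, end-aligned match of the shorter video could achieve (Eq. 1)."""
    overlap = lcs_overlap(video_a, video_b, weighted=True)
    mn, mx = sorted((len(video_a), len(video_b)))
    upper = sum(((mn - k) / mn) * ((mx - k) / mx) for k in range(mn))
    return overlap / upper
```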
Frames that appear later in the video are weighted more heavily, since that is where the buggy behavior typically occurs in a video-based bug report, and thus where the evidence for duplicate detection is concentrated. The exact weighting scheme used is (i/m) × (j/n), where i is the position of a frame in video A, m is the number of frames in video A, j is the position of a frame in video B, and n is the number of frames in video B. 2) Sequential Comparison: In order to incorporate the LCS overlap measurements into TANGO's overall video similarity calculation, the overlap scores must be normalized to the range [0, 1]. To accomplish this, we consider the case where two videos overlap perfectly to be the upper bound of the possible LCS score between two videos, and use this bound to perform the normalization. For f-LCS, this is done by simply dividing by the number of frames in the shorter video, since the maximum possible overlap occurs when the shorter video is a subsection of the longer video; the score is calculated as overlap/min, where overlap denotes the amount the two videos share in terms of their frames and min denotes the number of frames in the shorter of the two videos. For w-LCS, if the videos have different lengths, we align the end of the shorter video to the end of the longer video and consider this the upper bound on the LCS score, which is used to normalize as follows:

S w-LCS = overlap / Σ_{k=0}^{min-1} [ ((min − k)/min) × ((max − k)/max) ]    (1)

where S w-LCS is the normalized similarity value produced by w-LCS, overlap and min are as in the f-LCS calculation, and max denotes the number of frames in the longer of the two videos. The denominator in Eq. 1 calculates the maximum possible overlap that can occur if the videos were exact matches, summing the similarity score of each end-aligned frame pair. Our online appendix contains the detailed f/w-LCS algorithms with examples [42]. 3) Similarity Computation: f-LCS and w-LCS output the visual similarity scores S f-LCS and S w-LCS, respectively. These can be combined with S BoVW to obtain an aggregate visual similarity score: S vis = (S BoVW + S f-LCS)/2 or S vis = (S BoVW + S w-LCS)/2. We denote these TANGO vis variations as B+f-LCS and B+w-LCS, respectively. In order to determine the textual similarity between video-based bug reports, TANGO leverages the textual information from labels, titles, messages, etc. found in the app GUI components and screens depicted in the videos. TANGO txt adopts a standard text retrieval approach based on Lucene [51] and Optical Character Recognition (OCR) [1, 16] to compute the textual similarity (S txt) between video-based bug reports. First, a textual document is built from each video in the issue tracker (T 1 in Fig. 1) using OCR to extract text from the video frames. The textual documents are preprocessed using standard techniques to improve similarity computation, namely tokenization, lemmatization, and removal of punctuation, numbers, special characters, and one- and two-character words. The pre-processed documents are indexed for fast similarity computation (T 2). Each document is then represented as a TF-IDF vector based on this index [91]. In order to build the textual documents from the videos, TANGO txt applies OCR on the video frames through the Tesseract engine [1, 16] in the textual extractor T 1. We experiment with three strategies to compose the textual documents using the extracted frame text. The first strategy (all-text) concatenates all the text extracted from the frames.
The second strategy (unique-frames) concatenates all the text extracted from unique video frames, determined by applying exact text matching (before text pre-processing). The third strategy (unique-words) concatenates the unique words in the frames (after pre-processing). 1) Similarity Computation: TANGO computes the textual similarity (S txt) in step S (Fig. 1) using Lucene's scoring function [14], which is based on cosine similarity and document length normalization. TANGO combines the visual (S vis) and textual (S txt) similarity scores produced by TANGO vis and TANGO txt, respectively (S in Fig. 1). TANGO uses a linear combination approach to produce an aggregate similarity value:

S = (1 − w) · S vis + w · S txt

where w is a weight for S vis and S txt that takes a value between zero (0) and one (1). Smaller w values weight S vis more heavily, and larger values weight S txt more heavily. We denote this approach as TANGO comb. Based on the combined similarity, TANGO generates a ranked list of the video-based bug reports found in the issue tracker. This list is then inspected by the developer to determine if the new video reports a previously reported bug.

We empirically evaluated TANGO with two goals in mind: (i) determining how effective TANGO is at detecting duplicate video-based bug reports, when considering different configurations of components and parameters, and (ii) estimating the effort that TANGO can save developers during duplicate video bug detection. Based on these goals, we defined the following research questions (RQs):
RQ 1: How effective is TANGO when using either visual or textual information alone to retrieve duplicate videos?
RQ 2: What is the impact of combining frame sequence and visual information on TANGO's detection performance?
RQ 3: How effective is TANGO when combining both visual and textual information for detecting duplicate videos?
RQ 4: How much effort can TANGO save developers in finding duplicate video-based bug reports?
To answer our RQs, we first collected video-based bug reports for six Android apps (Sec. III-A), and based on them, defined a set of duplicate detection tasks (Sec. III-B). We instantiated different configurations of TANGO by combining its components and parameters (Sec. III-C), and executed these configurations on the defined tasks (Sec. III-D). Based on standard metrics, applied to the video rankings that TANGO produces, we measured TANGO's effectiveness (Sec. III-D). We answer RQ 1, RQ 2, and RQ 3 based on the collected measurements. To answer RQ 4 (Sec. III-E), we conducted a user study where we measured the time humans take to find duplicates for a subset of the defined tasks, and estimated the time TANGO can save for developers. We present and discuss the evaluation results in Sec. IV. We collected video-based bug reports for six open-source Android apps, namely AntennaPod (APOD) [2], Time Tracker (TIME) [6], Token (TOK) [7], GNUCash (GNU) [4], GrowTracker (GROW) [5], and Droid Weight (DROID) [3]. We selected these apps because they have been used in previous studies [25, 34, 78, 79], support different app categories (finance, productivity, etc.), and provide features that involve a variety of GUI interactions (taps, long taps, swipes, etc.). Additionally, none of these apps are included in the RICO dataset used to train TANGO's SimCLR model and Codebooks, preventing the possibility of data snooping. Since video-based bug reports are not readily available in these apps' issue trackers, we designed and carried out a systematic procedure for collecting them.
In total, we collected 180 videos that display 60 distinct bugs -10 bugs for each app and three videos per bug (i.e., three duplicate videos per bug). From the 60 bugs, five bugs (one bug per app except for DROID) are reported in the apps' issue trackers. These five bugs were selected because they were the only ones in the issue trackers that we were able to reproduce based on the provided bug descriptions. During the reproduction process, we discovered five additional new bugs in the apps not reported in the issue trackers (one bug each for APOD, GNU, and TOK, and two bugs for TIME) for a total of 10 confirmed real bugs. The remaining 50 bugs were introduced in the apps through mutation by executing MutAPK [46] , a mutation testing tool that injects bugs (i.e., mutants) into Android APK binaries via a set of 35 mutation operators that were derived from a largescale empirical study on real Android application faults. Given the empirically-derived nature of these operators, they were shown to accurately simulate real-world Android faults [46] . We applied MutAPK to the APKs of all six apps. Then, from the mutant list produced by the tool, we randomly selected 7 to 10 bugs for each app, and ensured that they could be reproduced and manifested in the GUI. To diversify the bug pool, we selected the bugs from multiple mutant operators and ensured that they affected multiple app features/screens. When selecting the 60 bugs, we ensured they manifest graphically and were reproducible by manually replicating them on a specific Android emulator configuration (virtual Nexus 5X with Android 7.0 configured via Android Studio). For all the bugs, we screen-recorded the bug and the reproduction scenario. We also generated a textual bug report (for bugs that did not have one) containing the description of the unexpected and expected app behavior and the steps to reproduce the bug. To generate the remaining 120 video-based bug reports, we asked two professional software engineers and eight computer science (CS) Ph.D. students to replicate and record the bugs (using the same Android emulator), based only on the textual description of the unexpected and expected app behavior. The participants have between 2 and 10 years of programming experience (median of 6 years). All the textual bug reports given to the study participants contained only a brief description of the observed and expected app behavior, with no specific reproduction steps. We opted to perform the collection in this manner to ensure the robustness of our evaluation dataset by maximizing the diversity of video-based reproduction steps, and simulating a real-world scenario where users are aware of the observed (incorrect) and expected app behavior, and must produce the reproduction steps themselves. We randomly assigned the bugs to the participants in such a way that each bug was reproduced and recorded by two participants, and no participant recorded the same bug twice. Before reproducing the bugs, the participants spent five minutes exploring the apps to become familiar with their functionality. Since some of the participants could not reproduce the bugs after multiple attempts (mainly due to bug misunderstandings) and some of the videos were incorrectly recorded (due to mistakes), we reassigned these bugs among the other participants, who were able to reproduce and record them successfully. 
Our bug dataset consists of 35 crashes and 25 non-crashes, and include a total of 470 steps (397 taps, 12 long taps, 14 swipes, among other types), with an average of 7.8 steps per video. The average video length is ≈ 28 seconds. For each app, we defined a set of tasks that mimic a realistic scenario for duplicate detection. Each duplicate detection task is composed of a new video (i.e., the new bug report, aka the query) and a set of existing videos (i.e., existing bug reports in the issue tracker, aka the corpus). In practice, a developer would determine if the new video is a duplicate by inspecting the corpus of videos in the order given by TANGO (or any other approach). For our task setup, the corpus contains both duplicate and non-duplicate videos. There are two different types of duplicate videos that exist in the corpus: (i) those videos that are a duplicate of the query (the Same Bug group), and (ii) those videos which are duplicates of each other, but are not a duplicate of the query (the Different Bug group). This second type of duplicate video is represented by bug reports marked as duplicates in the issue tracker and their corresponding master reports [35, 86, 93] . Each non-duplicate video reports a distinct bug. We constructed the duplicate detection tasks on a per app basis, using the 30 video reports collected for each app (i.e., three video reports for each of the 10 bugs, for a total of 30 video reports per app). We first divided all the 30 videos for an app into three groups, each group containing 10 videos (one for each bug) created by one or more participants. Then, we randomly selected a video from one bug as the query and took the other two videos that reproduce the same bug as the Same Bug duplicate group (i.e., the ground truth). Then, we selected one of the remaining nine bugs and added its three videos to the Different Bug duplicate group. Finally, we selected one video from the remaining eight bugs, and used these as the corpus' Non-Duplicate group. This resulted in a total of 14 distinct bug reports per task (two in the Same Bug group, three in the Different Bug group, eight in the Non-Duplicate group, and the query video). After creating tasks based on all the combinations of query and corpus, we generated a total of 810 duplicate detection tasks per app or 4, 860 aggregating across all apps. We designed the duplicate detection setting described above to mimic a scenario likely to be encountered in crowd-sourced app testing, where duplicates of the query, other duplicates not associated with the query, and other videos reporting unique bugs, exist in a shared corpus for a given app. While there are different potential task settings, we opted not to vary this experimental variable in order to make for a feasible analysis that allowed us to explore more thoroughly the different TANGO configurations. We designed TANGO vis and TANGO txt to have different configurations. TANGO vis 's configurations are based on different visual feature extractors (SIFT or SimCLR), video sampling rates (1 and 5 fps), # of visual words (1k, 5k, and 10k VW), and approaches to compute video similarity (BoVW, f-LCS, w-LCS, B+f-LCS, and B+w-LCS). TANGO txt 's configurations are based on the same sampling rates (1 and 5 fps) and the approaches to extract the text from the videos (all-text, unique-frames, and unique-words). TANGO comb combines TANGO vis and TANGO txt as described in Sec. II-E. 
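As a small illustration of how a single duplicate detection task is scored, the sketch below combines precomputed visual and textual similarities using the linear combination from Sec. II-E and ranks the corpus videos for inspection. The video identifiers and similarity values are hypothetical.

```python
from typing import Dict, List

def rank_candidates(s_vis: Dict[str, float],
                    s_txt: Dict[str, float],
                    w: float = 0.2) -> List[str]:
    """Combine precomputed visual and textual similarities between the query
    and each corpus video (S = (1 - w) * S_vis + w * S_txt, Sec. II-E) and
    return corpus video ids sorted from most to least similar."""
    combined = {vid: (1 - w) * s_vis[vid] + w * s_txt[vid] for vid in s_vis}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical task: similarities of three corpus videos to the query.
ranking = rank_candidates(
    s_vis={"video_07": 0.81, "video_03": 0.42, "video_11": 0.66},
    s_txt={"video_07": 0.55, "video_03": 0.61, "video_11": 0.30},
)
print(ranking)  # ['video_07', 'video_11', 'video_03']
```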
We executed each TANGO configuration on the 4, 860 duplicate detection tasks and measured its effectiveness using standard metrics used in prior text-based duplicate bug detection research [35, 86, 93] . For each task, we compare the ranked list of videos produced by TANGO and the expected duplicate videos from the ground truth. We measured the rank of the first duplicate video found in the ranked list, which serves as a proxy for how many videos the developer has to watch in order to find a duplicate video. A smaller rank means higher duplicate detection effectiveness. Based on the rank, we computed the reciprocal rank metric: 1/rank. We also computed the average precision of TANGO, which is the average of the precision values achieved at all the cutting points k of the ranked list (i.e., precision@k). Precision@k is the proportion of the top-k returned videos that are duplicates according to the ground truth. We also computed HIT@k (aka Recall Rate@k [35, 86, 93] ), which is the proportion of tasks that are successful for the cut point k of the ranked list. A task is successful if at least one duplicate video is found in the top-k results returned by TANGO. We report HIT@k for cut points k = 1-2 in this paper, and 1-10 in our online appendix [42] . Additionally, we computed the average of these metrics over sets of duplicate detection tasks: mean reciprocal rank (mRR), mean average precision (mAP), and mean rank (µ rank or µRk) per app and across all apps. Higher mRR, mAP, and HIT@k values indicate higher duplicate detection effectiveness. These metrics measure the overall performance of a duplicate detector. We focused on comparing mRR values to decide if one TANGO configuration is more effective than another, as we consider that it better reflects the usage scenario of TANGO. In practice, the developer would likely stop inspecting the suggested duplicates (given by TANGO) when she finds the first correct duplicate. This scenario is captured by mRR, through the rank metric, which considers only the first correct duplicate video as opposed to the entire set of duplicate videos to the query (as mAP does). We conducted a user study in order to estimate the effort that developers would spend while manually finding video-based duplicates. This effort is then compared to the effort measurements of the best TANGO configuration, based on µ rank and HIT@k. This study and the data collection procedure were conducted remotely due to COVID-19 constraints. 1) Participants and Tasks: One professional software engineer and four CS Ph.D. students from the data collection procedure described in Sec. III-A participated in this study. The study focused on APOD, the app that all the participants had in common from the data collection. We randomly selected 20 duplicate detection tasks, covering all 10 APOD bugs. 2) Methodology: Each of the 20 tasks was completed by two participants. Each participant completed four tasks, each task's query video reporting a unique bug. The assignment of the tasks to the participants was done randomly. For each task, the participants had to watch the new video (the query) and then find the videos in the corpus that showed the same bug of the new video (i.e., find the duplicate videos). All the videos were anonymized so that the duplicate videos were unknown to the participants. To do this, we named each video with a number that represents the video order and the suffix "vid" (e.g., "2 vid.mp4"). 
The corpus videos were given in random order and the participants could watch them in any order. To make the bug of the new video clearer to the participants, we provided them with the description of the unexpected and expected app behavior, taken from the textual bug reports that we generated for the bugs. We consider the randomization of the videos as a reasonable baseline given that other baselines (e.g., videobased duplicate detectors) do not currently exist and the videobased bug reports in our dataset do not have timestamps (which can be used to give a different order to the videos). This is a threat to validity that we discuss in Sec. V. 3) Collected Measurements: Through a survey, we asked each participant to provide the following information for each task: (i) the name of the first video they deemed a duplicate of the query, (ii) the time they spent to find this video, (iii) the number of videos they had to watch until finding the first duplicate (including the duplicate), (iv) the names of other videos they deemed duplicates, and (v) the time they spent to find these additional duplicates. We instructed the participants to perform the tasks without any interruptions in order to minimize inaccuracies in the time measurements. 4) Comparing TANGO and Manual Duplicate Detection: The collected measurements from the participants were compared against the effectiveness obtained by executing the best TANGO configuration on the 20 tasks, in terms of µ rank and HIT@k. We compared the avg. number of videos the participants watched to find one duplicate against the avg. number of videos they would have watched had they used TANGO. We analyzed the performance of TANGO when using only visual or textual information exclusively. In this section, we present the results for TANGO's best performing configurations. However, complete results can be found in our online appendix [42] . Table I shows the results for TANGO vis and TANGO txt when using SimCLR, SIFT, as the visual feature extractor, and OCR as the textual extractor. For simplicity, we use SimCLR, SIFT, and OCR&IR to refer to SimCLRbased TANGO vis , SIFT-based TANGO vis , and TANGO txt , respectively. The best results for each metric are illustrated in bold on a per app basis. The results provided in Table I are those for the best parameters of the SimCLR, SIFT, and OCR&IR feature extractors, which are (BoVW, 5 fps, 1k VW), (w-LCS, 1 fps, 10k VW), and (all-text, 5 fps), respectively. Table I shows that TANGO vis is more effective when using SimCLR rather than SIFT across all the apps, achieving an overall mRR, mAP, avg. rank, HIT@1, and HIT@2 of 75.3%, 67.8%, 1.9, 61.6%, and 78%, respectively. SimCLR is also superior to OCR&IR overall, whereas SIFT performs least effectively of the three approaches. When analyzing the results per app, we observe that SimCLR is outperformed by OCR&IR (by 0.7% -4% difference in mRR) for APOD, DROID, GNU and GROW; with OCR&IR being the most effective for these apps. SimCLR outperforms the other two approaches for TIME and TOK by more than 16% difference in mRR. The differences explain the overall performance of SimCLR and OCR&IR. SimCLR is more consistent in its performance compared to OCR&IR and SIFT. Across apps, the mRR standard deviation of SimCLR is 6.2%, which is lower than that for SIFT and OCR&IR: 11.1% and 13.9%, respectively. The trend is similar for mAP and avg. rank. 
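For reference, the sketch below shows how the retrieval metrics reported in Table I (defined in Sec. III-D) can be computed from a task's ranked list and aggregated across tasks. It uses the standard formulation of average precision, and the rankings and ground-truth sets are hypothetical.

```python
import numpy as np

def task_metrics(ranked_ids, ground_truth, ks=(1, 2)):
    """Per-task measurements: rank of the first true duplicate, reciprocal
    rank, average precision, and HIT@k for the given cut points."""
    rank = next(i for i, v in enumerate(ranked_ids, start=1) if v in ground_truth)
    hits, precisions = 0, []
    for i, v in enumerate(ranked_ids, start=1):
        if v in ground_truth:
            hits += 1
            precisions.append(hits / i)  # precision@i at each relevant position
    return {
        "rank": rank,
        "rr": 1.0 / rank,
        "ap": float(np.mean(precisions)),
        **{f"hit@{k}": int(rank <= k) for k in ks},
    }

def aggregate(all_tasks):
    """Mean rank, mRR, mAP, and HIT@k rates over a set of tasks."""
    return {key: float(np.mean([t[key] for t in all_tasks])) for key in all_tasks[0]}

# Hypothetical example: two tasks, each with two true duplicates in the corpus.
t1 = task_metrics(["v3", "v7", "v1", "v9"], ground_truth={"v7", "v9"})
t2 = task_metrics(["v2", "v5", "v4", "v8"], ground_truth={"v2", "v4"})
print(aggregate([t1, t2]))  # mean rank, mRR ('rr'), mAP ('ap'), HIT@k rates
```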
Since the least consistent approach across apps is TANGO txt in terms of effectiveness, we investigated the root causes for its lower performance on TIME and TOK. After manually watching a subset of the videos for these apps, we found that their textual content was quite similar across bugs. Based on this, we hypothesized that the amount of vocabulary shared between duplicate videos (from the same bugs) and nonduplicate videos (across different bugs) affected the discriminatory power of Lucene-based TANGO txt (see Sec. II-D). To verify this hypothesis, we measured the shared vocabulary of duplicate and non-duplicate video pairs, similarly to Chaparro et al.'s analysis of textual bug reports [35] . We formed unique pairs of duplicate and non-duplicate videos from all the videos collected for all six apps. For each app, we formed 30 duplicate and 405 non-duplicate pairs, and we measured the avg. amount of shared vocabulary of all pairs, using the vocabulary agreement metric used by Chaparro et al. [35] . Table II shows the vocabulary agreement of duplicate (V d ) and non-duplicate pairs (V nd ) as well as the mRR and mAP values of TANGO txt for each app. The table reveals that the vocabulary agreement of duplicates and non-duplicates is very similar for TIME and TOK, and dissimilar for the other apps. The absolute difference between these measurements (i.e., |V d − V nd |) for TIME and TOK is 0.3% and 8.6%, while for the other apps it is above 16%. We found 0.94 / 0.91 Pearson correlation [47] between these differences and the mRR/mAP values. The results indicate that, for TIME and TOK, the similar vocabulary between duplicate and non-duplicate videos negatively affects the discriminatory power of TANGO txt , which suggests that for some apps, using only textual information may be sub-optimal for duplicate detection. Answer for RQ 1 : SimCLR performs the best overall with an mRR and HIT@1 of 75.3% and 61.6%, respectively. For 4 of 6 apps, OCR&IR outperforms SimCLR by a significant margin. However, due to issues with vocabulary overlap, it performs worse overall. SIFT is the worst-performing technique across all the apps. To answer RQ 2 , we compared the effectiveness of the best configuration of TANGO when using visual information alone (SimCLR, BoVW, 5fps, 1k VW) and when combining visual & frame sequence information (i.e., B+f-LCS and B+w-LCS). The results are shown in Table III . Overall, using TANGO with BoVW alone is more effective than combining the approaches; TANGO based on BoVW achieves 75.3%, 67.8%, 1.9, 61.6%, and 78% mRR, mAP, avg. rank, HIT@1, and HIT@2, respectively. Using BoVW and w-LCS combined is the least effective approach. BoVW alone and B+f-LCS are comparable in performance. However, BoVW is more consistent in its performance across apps: 6.2% mRR std. deviation vs. 6.6% and 9.2% for B+f-LCS and B+w-LCS. The per-app results reveal that B+w-LCS consistently is the least effective approach for all apps except for GROW, for which B+w-LCS performs best. After watching the videos for GROW, we found unnecessary steps in the beginning/middle of the duplicate videos, which led to their endings being weighted more heavily by w-LCS, where steps were similar. In contrast, BoVW and B+f-LCS give a lower weight to these cases thus reducing the overall video similarity. The lower performance of B+f-LCS and B+w-LCS, compared to BoVW, is partially explained by the fact that f-LCS and w-LCS are more restrictive by definition. 
Since they find the longest common sub-strings of frames between videos, small variations (e.g., extra steps) in the reproduction steps of the bugs may lead to drastic changes in similarity measurement for these approaches. Also, these approaches only find one common substring (i.e., the longest one), which may not be highly discriminative for duplicate detection. In the future, we plan to explore additional approaches for aligning the frames, for example, by using an approach based on longest common sub-sequence algorithms [49] that can help align multiple portions between videos. Another potential reason for these results may lie in the manner that TANGO combines visual and sequential similarity scores -weighting both equally. In future work, we plan to explore additional combination techniques. Answer for RQ 2 : Combining ordered visual information (via f-LCS and w-LCS) with the orderless BoVW improves the results for four of the six apps. However, across all apps, BoVW performs more consistently. We investigated TANGO's effectiveness when combining visual and textual information. We selected the best configurations of TANGO vis (SimCLR, BoVW, 5 fps, 1k VW) and TANGO txt (all-text, 5 fps) from RQ 1 based on their mRR score and measured its performance overall and per app. We provide the results for the best weight we obtained for TANGO's similarity computation and ranking which was w = 0.2, i.e., a weight of 0.8 and 0.2 on TANGO vis and TANGO txt , respectively. These weights were found by evaluating different w values from zero (0) to one (1) in increments of 0.1 and selecting the one leading to the highest overall mRR score. Complete results can be found in our online appendix [42] . Table IV shows that the overall effectiveness achieved by TANGO comb is higher than that achieved by TANGO txt and TANGO vis . TANGO comb achieves 75.9%, 69.2%, 1.9, 62.2%, and 78.5% mRR, mAP, avg. rank, HIT@1, and HIT@2, on average. The avg. improvement margin of TANGO comb is substantially higher for TANGO txt (6.2%/5.5% mRR/mAP) than for TANGO vis (0.7%/1.4% mRR/mAP). Our analysis of the per-app results explains these differences. Table IV reveals that combining visual and textual information substantially increases the performance over just using one of the information types alone, except for the TIME and TOK apps. This is because TANGO txt 's effectiveness is substantially lower for these apps, compared to the visual version (see Table I ), due to the aforementioned vocabulary agreement. Thus, incorporating the textual information significantly harms the performance of TANGO comb . The results indicate that combining visual and textual information is beneficial for most of our studied apps but harmful for a subset (TIME and TOK). This is because the textual information used alone, for TIME and TOK, leads to low performance. The analysis we made for TANGO txt in RQ 1 , revealed that the reason for the low performance of TANGO txt lies in the similar amount of vocabulary overlap between duplicate and non-duplicate videos. Fortunately, based on this amount of vocabulary, we can predict the performance of TANGO txt for new video-based bug reports as follows [35] . In practice, the issue tracker will contain reports marked as duplicates (reporting the same bugs) from previous submissions of bug reports as well as non-duplicates (reporting unique bugs). 
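The sketch below illustrates how such previously marked duplicate and non-duplicate reports could be used to measure the vocabulary agreement gap that drives the selective strategy described next. The pair construction mirrors the 30 duplicate and 405 non-duplicate pairs per app described above; the Dice-style term overlap is an illustrative stand-in for the vocabulary agreement metric of Chaparro et al. [35], and the 12.8% threshold is the one derived in the following paragraph.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

def vocab_agreement(doc_a: Set[str], doc_b: Set[str]) -> float:
    """Illustrative stand-in for the vocabulary agreement metric of [35]:
    here, the Dice overlap between the two videos' term sets."""
    if not doc_a and not doc_b:
        return 0.0
    return 2 * len(doc_a & doc_b) / (len(doc_a) + len(doc_b))

def avg_pair_agreement(pairs: List[Tuple[Set[str], Set[str]]]) -> float:
    return sum(vocab_agreement(a, b) for a, b in pairs) / len(pairs)

def agreement_gap(videos_by_bug: Dict[str, List[Set[str]]]) -> float:
    """|V_d - V_nd|: average agreement of duplicate pairs (same bug) versus
    non-duplicate pairs (different bugs), computed from existing reports."""
    dup_pairs, nondup_pairs = [], []
    bugs = list(videos_by_bug)
    for bug in bugs:
        dup_pairs += list(combinations(videos_by_bug[bug], 2))
    for b1, b2 in combinations(bugs, 2):
        nondup_pairs += [(a, b) for a in videos_by_bug[b1] for b in videos_by_bug[b2]]
    return abs(avg_pair_agreement(dup_pairs) - avg_pair_agreement(nondup_pairs))

def use_textual_information(videos_by_bug, threshold=0.128) -> bool:
    """Selective combination rule (see the threshold discussion below):
    combine visual and textual similarity only when the gap clears the
    threshold; otherwise rely on visual similarity alone."""
    return agreement_gap(videos_by_bug) > threshold
```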
This information can be used to compute the vocabulary agreement between duplicates and non-duplicates, which in turn can be used to predict how well TANGO txt would perform for new reports. Based on this, we defined a new approach for TANGO, which is based on the vocabulary agreement metric from [35] applied to existing duplicate and non-duplicate reports. This approach dictates that if the difference in vocabulary agreement between existing duplicates and non-duplicates is greater than a certain threshold, then TANGO should combine visual and textual information. Otherwise, TANGO should use only the visual information, because it is likely that the combination would not be better than using the visual information alone. From the vocabulary agreement measurements shown in Table II, we infer a proper threshold for the new TANGO approach. This threshold may be taken as any value between 8.6% and 16.9% (exclusive), because those are the limits that separate the apps for which TANGO txt obtains low (TIME and TOK) and high (APOD, DROID, GNU, and GROW) performance. For practical reasons, we select the threshold to be the middle value: 8.6 + (16.9 − 8.6)/2 ≈ 12.8%. In future work, we plan to further evaluate this threshold on other apps. We implemented this approach for TANGO, using 0.2 as the weight, and measured its effectiveness. This approach resulted in an mRR, mAP, avg. rank, HIT@1, and HIT@2 of 79.8%, 73.2%, 1.7, 67.7%, and 83%, respectively. The approach leads to a substantial improvement (i.e., 3.9% / 4.1% higher mRR / mAP) over TANGO comb shown in Table IV. These results mean that the best version of TANGO is able to suggest correct duplicate video-based bug reports in the first or second position of the returned candidate list for 83% of the duplicate detection tasks.

Answer for RQ 3: Combining visual and textual information significantly improves results for 4 of 6 apps. However, due to the vocabulary agreement issue, across all apps, this approach is similar in effectiveness to using visual information alone. Accounting for this vocabulary overlap issue through a selective combination of visual and textual information via a threshold, TANGO achieves the highest effectiveness: an mRR, mAP, avg. rank, HIT@1, and HIT@2 of 79.8%, 73.2%, 1.7, 67.7%, and 83%, respectively.

As expected, the participants were successful in finding the duplicate videos for all 20 tasks. In only one task did a participant incorrectly flag a video as a duplicate, because it was highly similar to the query. Participants found the first duplicate video in 96.4 seconds on average and watched 4.3 videos on average across all tasks to find it. Participants also found all the duplicates in 263.8 seconds on average by watching the entire corpus of videos. This means they spent 20.3 seconds on average watching one video. We compared these results with the measurements taken from TANGO's best version (i.e., selective TANGO) on the tasks the participants completed. TANGO achieved a 1.5 avg. rank, which means that, by using TANGO, participants would only have to watch one or two videos on average to find the first duplicate. This would have resulted in (4.3 − 1.5)/4.3 = 65.1% of the time saved. In other words, instead of spending 20.3 × 4.3 ≈ 87.3 seconds (on avg.) finding a duplicate for a given task, the participants could have spent 20.3 × 1.5 ≈ 30.5 seconds. These results indicate the potential of TANGO to help developers save time when finding duplicates.
Answer for RQ 4 : On average, TANGO's bestperforming configuration can save 65.1% of the time participants spend finding duplicate videos. Limitations. TANGO has three main limitations that motivate future work. The first one stems from the finding that textual information may not be beneficial for some apps. The best TANGO version implements an approach for detecting this situation, based on a threshold for the difference in vocabulary overlap between duplicate and non-duplicate videos, which is used for selectively combining visual or textual information. This threshold is based on the collected data and may not generalize to other apps. Second, the visual TF-IDF representation for the videos is based on the mobile app images from the RICO dataset, rather than on the videos found in the tasks' corpus, as it is typically done in text retrieval. Additionally, we considered single images as documents rather than groups of frames that make up a video. These decisions were made to improve the generalization of TANGO's visual features and to support projects that have limited training data. Third, differences in themes and languages across video-based bug reports for an application could have an impact in the performance of TANGO. We believe that different themes (i.e., dark vs. light modes) will not significantly impact TANGO since the SimCLR model is trained to account for such color differences by performing color jittering and gray-scaling augmentations. However, additional experiments are needed to validate this conjecture. For different languages, TANGO currently assumes the text in an application to be English when performing OCR and textual similarity. Therefore, its detection effectiveness where the bug reports display different languages (e.g., English vs. French) could be negatively impacted. We will investigate this aspect in our future work. Internal & Construct Validity. Most of the mobile app bugs in our dataset were introduced by MutAPK [46] , and hence potentially may not resemble real bugs. However, Mu-tAPK's mutation operators were empirically derived from a large-scale study of real Android faults, and prior research lends credence of the ability of mutants to resemble real faults [21] . We intentionally selected generated mutants from a range of operators to increase the diversity of our set of bugs and mitigate potential biases. Another potential threat is related to using real bugs from issue trackers that cannot be reproduced or that do not manifest graphically. We mitigated this threat by using a small, carefully vetted subset of real bugs that were analyzed by multiple authors before being used in our dataset. We did not observe major differences in the results between mutants and real bugs. Another threat to validity is that our approach to construct the duplicate detection tasks does not take into account bug report timestamps, which would be typical in a realistic scenario [86] , and timestamps could be used as a baseline ordering of videos for comparing against the ranking given by TANGO. The lack of timestamps stems from the fact that we were not able to collect the video-based bug reports from existing mobile projects. We mitigated this threat in our user study by randomizing the ordering of the corpus videos given to the participants. 
We consider this as a reasonable baseline for evaluating our approach considering that, to the best of our knowledge, (1) no existing datasets, with timestamps, are available for conducting research on video-based duplicate detection, and (2) no existing duplicate detectors work exclusively on video-based bug reports, as TANGO does. External Validity. We selected a diverse set of apps that have different functionality, screens, and visual designs, in an attempt to mitigate threats to external validity. Additionally, our selection of bugs also attempted to select diverse bug types (crashes and non-crashes), and the duplicate videos were recorded by different participants. As previously discussed, there is the potential that TANGO's different parameters & thresholds may not generalize to video data from other apps. Our research is related to work in near duplicate video retrieval, analysis of graphical software artifacts, and duplicate detection of textual bug reports. Near Duplicate Video Retrieval. Extensive research has been done outside SE in near-duplicate video retrieval, which is the identification of similar videos to a query video (e.g., exact copies [45, 50, 57, 68] or similar events [41, 58, 66, 88] ). The closest work to ours is by Kordopatis-Zilos et al. [66] , who addressed the problem of retrieving videos of incidents (e.g., accidents). In their work, they explored using handcrafted-based [23, 59, 60, 75, 102, 106] (e.g., SURF or HSV histograms) and DL-based [32, 37, 65, 67, 71, 98] (e.g., CNNs) visual feature extraction techniques and ways of combining the extracted visual features [31, 58, 62, 65, 88, 92] (e.g., VLAD). While we do make use of the best performing model (CNN+BoVW) from this work [66] , we did not use the proposed handcrafted approaches, as these were designed for scenes about real-world incidents, rather than for mobile bug reporting. We also further modified and extended this approach given our different domain, through the combination of visual and textual information modalities, and adjustments to the CCN+BoVW model, including the layer configuration and training objective. Analysis of Graphical Software Artifacts. The analysis of graphical software artifacts to support software engineering tasks has been common in recent years. Such tasks include mobile app testing [25, 56, 61, 79] , developer/user behavior modelling [22, 30, 48] , GUI reverse engineering and code generation [24, 38, 39, 44, 80, 83] , analysis of programming videos [20, 69, 76, 85, 103, 104] , and GUI understanding and verification [33, 105] . None of these works deal with finding duplicate video-based bug reports, which is our focus. Detection of Duplicate Textual Bug Reports. Many research projects have focused on detecting duplicate textual bug reports [28, 29, 35, 36, 52, 54, 55, 64, 70, 72, 74, 82, 84, 86, 87, 89, 90, 93-97, 99-101, 107] . Similar to TANGO, most of the proposed techniques return a ranked list of duplicate candidates [35, 63] . The work most closely related to TANGO is by Wang et al. [99] , who leveraged attached mobile app images to better detect duplicate textual reports. Visual features are extracted from the images (e.g., representative colors), using computer vision, which are combined with textual features extracted from the text to obtain an aggregate similarity score. 
VII. CONCLUSION

This paper presented TANGO, an approach that combines visual and textual information to help developers find duplicate video-based bug reports. Our empirical evaluation, conducted on 4,680 duplicate detection tasks created from 180 video-based bug reports from six mobile apps, illustrates that TANGO effectively identifies duplicate reports and saves developer effort. Specifically, TANGO correctly suggests duplicate video-based bug reports within the top-2 candidate videos for 83% of the tasks and saves 65.1% of the time that humans spend finding duplicate videos. Our future work will focus on addressing TANGO's limitations and extending TANGO's evaluation. Specifically, we plan to (1) explore additional ways to address the vocabulary overlap problem, (2) investigate the resilience of TANGO to different app characteristics such as the use of different themes, languages, and screen sizes, (3) extend TANGO to detect duplicate bug reports that contain multimedia information (text, images, and videos), (4) evaluate TANGO using data from additional apps, and (5) assess the usefulness of TANGO in industrial settings.

VIII. DATA AVAILABILITY

Our online appendix [42] includes the collected video-based bug reports with duplicates, TANGO's source code, trained models, evaluation infrastructure, TANGO's output, and detailed evaluation results.

REFERENCES

Growtracker
Lucene's TFIDFSimilarity javadoc
Outklip
Code localization in programming screencasts
Is mutation an appropriate tool for testing experiments?
scvripper: video scraping tool for modeling developers' behavior using interaction data
Surf: Speeded up robust features
pix2code: Generating code from a graphical user interface screenshot
Translating video recordings of mobile app usages into replayable scenarios
What makes a good bug report? In FSE'08
Duplicate bug reports considered harmful
Extracting structural information from bug reports
A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects
Javasketchit: Issues in sketching the look of user interfaces
Million-scale near-duplicate video retrieval system
Quo vadis, action recognition? A new model and the Kinetics dataset
Associating the visual representation of user interfaces with their internal structures and metadata
Assessing the quality of the steps to reproduce in bug reports
On the vocabulary agreement in software issue descriptions
Reformulating queries for duplicate bug report detection
Large scale online learning of image similarity through ranking
From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation
Object detection for graphical user interface: old fashioned or deep learning or a combination?
A simple framework for contrastive learning of visual representations
An interactive semantic video mining and retrieval platform-application in transportation surveillance video for incident detection
Rico: A mobile app dataset for building data-driven design applications
Content and hierarchy in pixel-based methods for reverse engineering interface structure
An image-based approach to video copy detection with spatio-temporal post-filtering
Mutapk: Source-codeless mutant generation for Android apps
Statistics (4th edn.)
Inspectorwidget: A system to analyze users behaviors in their applications
Algorithms on strings, trees, and sequences: Computer science and computational biology
Stochastic multiview hashing for large-scale near-duplicate video retrieval
Lucene in Action. Manning Publications
Duplicate bug report detection using dual-channel convolutional neural networks
Deep residual learning for image recognition
A contextual approach towards more accurate duplicate bug report detection and ranking
Preventing duplicate bug reports by continuously querying bug reports. EMSE'18
AppFlow: using machine learning to synthesize robust, reusable UI tests
Partial copy detection in videos: A benchmark and an evaluation of popular methods
Towards optimal bag-of-features for object categorization and semantic video retrieval
Global-view hashing: harnessing global relations in near-duplicate video retrieval
Image indexing using color correlograms
Seven best practices for optimizing mobile testing efforts
Aggregating local descriptors into a compact image representation
Automated Duplicate Bug Reports Detection - An Experiment at Axis Communication AB. Master's thesis
New Features for Duplicate Bug Detection
Near-duplicate video retrieval by aggregating intermediate CNN layers
Fivr: Fine-grained incident video retrieval
Near-duplicate video retrieval with deep metric learning
Trecvid 2011 content-based copy detection: Task overview
Apparition: Crowdsourced user interfaces that come to life as you sketch them
Improving the Accuracy of Duplicate Bug Report Detection Using Textual Similarity Measures
Gradient-based learning applied to document recognition
Finding Duplicates of Your Yet Unwritten Bug Report
Are all duplicates value-neutral? An empirical analysis of duplicate issue reports
Has this bug been reported? In WCRE'13
Distinctive image features from scale-invariant keypoints. IJCV'04
Code, camera, action: How software developers document and share program knowledge using YouTube
Crowd intelligence enhances automated mobile testing
Auto-completing bug reports for Android applications
Automatically discovering, reporting and reproducing Android application crashes
Machine Learning-Based Prototyping of Graphical User Interfaces for Mobile Apps. TSE'18
Eye of the mind: Image processing for social coding
Duplicate Bug Report Detection with a Combination of Information Retrieval and Topic Modeling
Reverse engineering mobile application user interfaces with remaui
A systematic comparison of search algorithms for topic modelling-a study on duplicate bug report identification
Automatic identification and classification of software development video tutorial fragments
Revisiting the performance evaluation of automated approaches for the retrieval of duplicate issue reports
Revisiting the performance of automated approaches for the retrieval of duplicate reports in issue tracking systems that perform just-in-time duplicate retrieval
Event retrieval in large video collections with circulant temporal encoding
A soft alignment model for bug deduplication
Detection of Duplicate Defect Reports Using Natural Language Processing
Introduction to Modern Information Retrieval
Video Google: a text retrieval approach to object matching in videos
Towards more accurate retrieval of duplicate bug reports
A Discriminative Model Approach for Accurate Duplicate Bug Report Retrieval
Detecting Duplicate Bug Report Using Character N-Gram-Based Features
DupFinder: Integrated Tool Support for Duplicate Bug Report Detection
Improved Duplicate Bug Report Identification
Learning spatiotemporal features with 3D convolutional networks
Images don't lie: Duplicate crowdtesting reports detection with screenshot information
iSENSE2.0: Improving completion-aware crowdtesting management with duplicate tagger and sanity checker
An Approach to Detecting Duplicate Bug Reports Using Natural Language and Execution Information
Practical elimination of near-duplicates from web video search
Extracting code from programming tutorial videos
Actionnet: vision-based workflow action recognition from programming screencasts
Seenomaly: Vision-based linting of GUI animation effects against design-don't guidelines
Dynamic texture recognition using local binary patterns with an application to facial expressions
Learning to Rank Duplicate Bug Reports

ACKNOWLEDGMENTS

We thank Winson Ye for experimenting with different CNN models, the study participants for their time and useful feedback, and the companies who responded to our initial internal survey. This research was supported in part by the NSF CCF-1955853 and CCF-1815186 grants. Any opinions, findings, and conclusions expressed herein are the authors' and do not necessarily reflect those of the sponsors.