key: cord-0048554-w8370wyl authors: Williams, Grant; Tushev, Miroslav; Ebrahimi, Fahimeh; Mahmoud, Anas title: Modeling user concerns in Sharing Economy: the case of food delivery apps date: 2020-08-09 journal: Autom Softw Eng DOI: 10.1007/s10515-020-00274-7 sha: 0c1bb308e8ed342b064b22680cb320c005390606 doc_id: 48554 cord_uid: w8370wyl Sharing Economy apps, such as Uber, Airbnb, and TaskRabbit, have generated a substantial consumer interest over the past decade. The unique form of peer-to-peer business exchange these apps have enabled has been linked to significant levels of economic growth, helping people in resource-constrained communities to build social capital and move up the economic ladder. However, due to the multidimensional nature of their operational environments, and the lack of effective methods for capturing and describing their end-users’ concerns, Sharing Economy apps often struggle to survive. To address these challenges, in this paper, we examine crowd feedback in ecosystems of Sharing Economy apps. Specifically, we present a case study targeting the ecosystem of food delivery apps. Using qualitative analysis methods, we synthesize important user concerns present in the Twitter feeds and app store reviews of these apps. We further propose and intrinsically evaluate an automated procedure for generating a succinct model of these concerns. Our work provides a first step toward building a full understanding of user needs in ecosystems of Sharing Economy apps. Our objective is to provide Sharing Economy app developers with systematic guidelines to help them maximize their market fitness and mitigate their end-users’ concerns and optimize their experience. objective is to understand and classify the main pressing user concerns in the ecosystem of these apps. • We propose, formally describe, and evaluate a fully automated procedure for modeling user concerns in the ecosystem of food delivery apps along with their main attributes and triggers. The generated model is intended to provide SE app developers with a framework for assessing the fitness of their mobile apps and understanding the complex realities of their ecosystem. The remainder of this paper is organized as follows. Section 2 provides a brief background of existing related research, motivates our work in this paper, and presents our research questions. Section 3 describes our qualitative analysis. Section 4 proposes an automated procedure for extracting and modeling crowd concerns in the ecosystem of food delivery apps. Section 5 discusses our key findings and their impact. Section 6 describes the main limitations of our study. Finally, Sect. 7 concludes the paper and describes our future work. In this section, we provide a brief summary of seminal related research, motivate our work, and present our research questions. The research on SE has become a prominent subject of research across multiple disciplines (Dillahunt et al. 2017; Hossain 2020) . This can be explained based on the interdisciplinary nature of the problems often raised in this domain. In general, the research on SE can be categorized into five main categories: • Economic: Recent research revealed that adapting solutions of SE can foster economic growth in big cities and local communities (Cheng 2016; Zhu et al. 2017) . Specifically, SE can help to counter excessive spending and purchase habits (Hüttel et al. 2018) while generating new sources of revenue (Matzler et al. 2015) . 
However, major concerns are frequently raised about the impact of this new business model on traditional long-established markets, affecting the revenue and business practices of these markets and threatening to put millions (e.g., taxi drivers and employees in the hotel industry) out of work by making their jobs obsolete (Aznar et al. 2017; Zervas et al. 2017 ). • Social: Existing research often describe SE as a vehicle for building social capital and establishing social relationships within local communities (Benkler 2017; Tussyadiah and Pesonen 2016) . However, on the negative side, SE has paved the way for a new form of social challenges, including problems such as digital discrimination, which refers to scenarios where a business transaction is influenced by race, gender, age, or other aspects of appearance of service provider or receiver (Edelman and Luca 2014) . For instance, a recent report by the National Bureau of Economic Research found that black riders using Uber waited 30% longer to be picked up (Ge et al. 2017) . Another study reported that non-black Airbnb hosts were able to charge 12% more than black hosts (Edelman and Luca 2014 ). • Environmental: Several studies suggest that SE promotes environmental awareness by enabling more sustainable consumption practices in modern-day societies (Ala-Mantilaa et al. 2016; Bonciu and Balgar 2016) . Other studies argue that this impact is not as substantial, suggesting that environmental factors are not as important for consumers as economic factors (Acquier et al. 2017) . In fact, some other studies went even further to suggest that SE can lead to more environmental pressure and resource exploitation due to the more affordable alternatives it provides (Tussyadiah and Pesonen 2016) . • Legal: This category of studies investigate existing regulations and suggest new regulatory infrastructures for protecting users of SE platforms from unwanted business practices. The main objective is to propose legislation to regulate the relationship between the app (e.g., Uber or Airbnb), service providers (e.g., drivers or apartment owners), and service receivers (e.g., riders or renters) (Bond 2014; Murillo et al. 2017) , especially when the terms-of-service are somehow violated, such as in cases of drunk drivers, under-insured cars, unsafe apartments, and fraud (Cannon and Summers 2014) . • Computing: In computing, studies of SE often tackle the problem from an algorithmic and humancomputer interaction (HCI) perspectives (Dillahunt et al. 2017) . Algorithmic papers are mainly concerned with proposing new and more efficient algorithms for P2P matching, path planning in ride-sharing (Chow and Yuan Yu 2015; He et al. 2012) , platform fairness (Bistaffa et al. 2015; Thebault-Spieker et al. 2015 , and pricing (Bistaffa et al. 2015) . HCI related study, on the other hand, propose design solutions to optimize user experience (Dillahunt and Malone 2015), including protecting their privacy (Goel et al. 2016; Xu et al. 2017 ) and safety (Bellotti et al. 2015) and understanding their usage patterns and motivations to participate in SE ). The research on mining mobile app user feedback has noticeably advanced in the past few years. The objective of this line of research is to help software developers infer their end-users' needs, detect bugs in their code, and plan for future releases of their apps. In general, two main channels of feedback are often considered: app store reviews and Twitter. 
• App store reviews: A systematic survey of studies related to app store review analysis is provided in Martin et al. (2017) . In general, this line of research proposes new tools [e.g., AR-Miner (Chen et al. 2014) , MARA (Iacob and Harrison 2013) , MARC (Jha and Mahmoud 2018) , and CLAP (Villarroel et al. 2016) ], methods, and procedures for analyzing user reviews available on Google Play and the Apple App Store. The main objective is to capture any actionable main-tenance requests in these reviews, such as bug reports and feature requests as well as non-functional requirements concerns, such as usability, reliability, security, and privacy (Groen et al. 2017; Jha and Mahmoud 2018) . To automatically identify informative user reviews, reviews are typically classified using standard text classification techniques, including Naive Bayes (NB), Support Vector Machines (SVM), Random Forests (RF), and Decision Trees (DT) (Jha and Mahmoud 2018; Panichella et al. 2015) as well as clustering algorithms such as DBSCAN (Villarroel et al. 2016) . Simpler techniques, which rely on linguistic pattern and term matching have also been proposed in the literature (Guzman and Maalej 2014; Iacob and Harrison 2013; Panichella et al. 2015) . In terms of modeling, techniques such as Latent Direchlet Allocation (LDA), are commonly used to infer meaningful high-level topics from reviews (Chen et al. 2014; Guzman and Maalej 2014) . Text processing techniques, such as sentiment analysis, lemmatization, and part of speech tagging, are also commonly used to improve the accuracy of review classification and modeling techniques (Carreńo and Winbladh 2013; Maalej and Nabil 2015; Mcllroy et al. 2016; . In addition, meta-data attributes of user reviews, such as their star rating and author information, are used to improve the predictive capabilities of review classifiers (Khalid et al. 2015; Maalej and Nabil 2015) . • Twitter: Twitter enables large populations of end-users of software to publicly share their experiences and concerns about their apps in the form of microblogs. Analysis of large datasets of tweets collected from the Twitter feeds of software systems revealed that around 50% of collected tweets contained actionable maintenance information . Such information was found to be useful for different groups of technical and non-technical stakeholders (Guzman et al. 2017) , providing complementary information to support mobile app developers during release planning tasks. The results also showed that text classifiers, such as SVM and NB, summarization methods, such as Hybrid TF.IDF and SumBasic, and modeling methods, such as LDA, can be effectively used to categorize, summarize, and cluster software-related tweets into semantically related groups of technical feedback (Williams and Mahmoud 2017). Our review shows that systematically analyzing and synthesizing user feedback at a domain level can help app developers to critically evaluate the current landscape of competition and to understand their end-users' expectations, preferences, and needs (Coulton and Bamford 2011; Finkelstein et al. 2014; Harman et al. 2012; Palomba et al. 2018; Svedic 2015) . Understanding the domain of competition is critical for the survival of SE apps. Specifically, the clusters of functionally-related SE apps form distinct micro-ecosystems within the app store ecosystem. 
A software ecosystem can be defined as a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them (Jansen et al. 2009 ). However, the majority of existing research on mining crowd feedback in the mobile app market is focused on individual apps, with little attention paid to how such information can be utilized and integrated to facilitate software analysis at an ecosystem, or application domain, level (Martin et al. 2017; Panichella et al. 2015) . Extracting concerns at a domain level can be a more challenging problem than focusing on single apps, which typically receive only a limited number of reviews or tweets per day (Mcilroy et al. 2017) . Furthermore, existing crowd feedback mining techniques are calibrated to extract technical user concerns, such as bug reports and feature requests, often ignoring other non-technical types of concerns that originate from the operational characteristics of the app (Jha and Mahmoud 2019; Martin et al. 2017) . These observations emphasize the need for new methods that can integrate multiple heterogeneous sources of user feedback to reflect a more accurate picture of the ecosystem. To bridge the gap in existing research in this paper, we present a case study on modeling crowd feedback in ecosystems of SE apps. Our case study targets the ecosystem of food delivery apps. Emerging evidence has shown that, unlike other SE apps, the demand for food delivery services has significantly increased after the COVID-19 shelter-in-place order (Chen et al. 2020) . In fact, according to The New York Times, while use of Uber's ride-sharing service went down by 80% in April of 2020, UberEats has experienced 89% increase in demand (Conger and Griffith 2020) . This makes food delivery a particularly interesting SE domain to be targeted by our analysis. The first major food courier service to emerge was Seamless, in 1999. A product of the internet boom, seamless allowed users to order from participating restaurants using an online menu, a unique innovation that granted the service considerable popularity. Following seamless, Grubhub was also met with success when it began offering web-based food delivery for the Chicago market in 2004. As smart phones became more popular, a number of new food couriers took advantage of the new demand for a more convenient mobile app-based delivery services. Of these competitors, UberEATS rose to the top, leveraging their experience with ride-sharing to adapt to food delivery. By the end of 2017, UberEATS became the most downloaded food-related app on the Apple App Store. The set of food delivery apps along with their consumer (e.g., restaurant patrons and drivers) and business (e.g., restaurants) components represent a uniquely complex and dynamic multi-agent ecosystem. This complexity imposes several challenges on the operation of these apps. These challenges, which can also be often found in other SE ecosystems, can be described as follows: • Fierce competition: users often have multiple services to choose from within a given metropolitan area. Switching from one app to another is trivial, and users are highly impatient with late or incorrect orders. For instance, food delivery services have less than one hour for delivery. This forces developers to constantly innovate to provide faster delivery than their rivals. • Decentralized fulfillment: the drivers are generally independent contractors who choose whom to work for and when to work. 
This creates challenges, not only for job assignment, but also for predicting when and where human resources will become available. • Multi-lateral communication: in order to fulfill an order, the delivery app must communicate with users, drivers, and restaurants to ensure that the food order is ready when the driver arrives, and that the user knows when to expect delivery. Each channel of communication presents an opportunity for failure. The main objective of our analysis is to demonstrate the feasibility of automatically generating an abstract conceptual model of user concerns in such a dynamic and complex ecosystem. Such model is intended to provide systematic technical and business insights for app developers as well as newcomers trying to break into the SE market. To guide our analysis, we formulate the two following research questions: • RQ 1 : What types of concerns are raised by users of food delivery apps? Mobile app users are highly vocal in sharing suggestions and criticism. Understanding this feedback is critical for evaluating and prioritizing potential changes to software. However, not all concerns, especially in businessoriented apps, are technical in nature. Therefore, developers must also be aware of business discussions, such as talk of competitors, poor service, or issues with other actors in their ecosystems. Therefore, the first phase of our analysis is focused on systematically externalizing and classifying crowd feedback available in the Twitter feeds and app store reviews of food delivery apps. • RQ 2 : How can user concerns in the ecosystem of food delivery apps be automatically and effectively modeled? The second phase of our analysis is focused on automatically externalizing and modeling user concerns in the ecosystem of food delivery apps. Modeling such information can provide valuable information for SE app developers, enabling them to discover the most important user concerns in their ecosystem, along with their defining attributes and triggers. To answer this question, we propose an automated procedure for generating a new form of user feedback analysis models and we compare its performance to LDA, a commonly used technique for generating topics of app user concerns from online user feedback. To answer our first research question ( ), in this section, we qualitatively analyze a large dataset of app store reviews and tweets, sampled from the crowd feedback of four popular food delivery apps. In what follows, we describe our data collection process as well as the main findings of our analysis. In order to determine which apps to include in our case study, we used the top charts feature of the Apple App Store and Google Play. These charts keep the public aware of the top grossing and downloaded apps in the app store. As of September of 2018, UberEats is the most popular food delivery app on the App Store. Among the top ten apps in the Food category, there are three additional competing delivery apps: Doordash, GrubHub, and PostMates. If we broaden our focus to the top twenty-five apps, only one additional food delivery app is found, Eat24. Eat24 was recently acquired by GrubHub, and have redirected users to their parent app, allowing us to exclude it from the analysis. 1 The Google Play Store shows the top 25 most popular apps in an arbitrary order. However, we find that UberEats and its three main competing apps are also present within the top 25. 
Therefore, the apps UberEats, Doordash, GrubHub, and PostMates covers the most popular food delivery services available on both platforms. It is important to point out that there are several other food delivery apps in the app market. These apps often operate in very limited geographical areas or have smaller user base. In our analysis, we are interested in apps with the biggest market share (as quantified by their app store download numbers), thus we narrowed down our ecosystem to its fittest elements from a user perspective. Popular apps receive significantly more crowd feedback on app stores and social media in comparison to smaller apps (Mcilroy et al. 2017) . Furthermore, selecting mature apps gives smaller and newcomer apps a chance to learn from the mistakes of the big players in the market (Pagano and Maalej 2013) . After the list of apps is determined, the second step in our analysis is to identify and classify the main user concerns in the ecosystem. Prior research has revealed that software-relevant feedback can be found in tweets and app store reviews (Maalej and Nabil 2015; Panichella et al. 2015; Sorbo et al. 2016; . To extract reviews, we used the free third-party service AppAnnie. 2 This service allows reviews up to 90 days old to be retrieved from Google Play and the Apple App Store. To collect tweets, we limited our search to tweets directed to the Twitter account of our apps. For example, to retrieve tweets associated with UberEats, we searched for to:ubereats. Our previous analysis has revealed that this query form yields a large rate (roughly 50%) of meaningful technical feedback among the resulting tweets . In our analysis, we collected tweets in the period from September 4th to December 4th of 2018. In total, 1833 tweets, 13,557 App Store reviews, and 29,674 Google Play reviews were extracted. Table 1 summarizes our dataset. Collecting data from multiple sources of feedback (multiple app stores and twitter) and over a long period of time is necessary to minimize any sampling bias that may impact the validity of the analysis (Martin et al. 2015) . To conduct our qualitative analysis, we sampled 900 posts (300 tweets, 300 iOS reviews, and 300 Android reviews) from the data collected for each app in our domain. Sampling 900 posts from the population of posts for each app ensures a confidence level of 99%. To perform the sampling, we developed a Ruby program to first execute a shuffle() method on the lists of tweets and reviews to randomize the order, taking the time of the post into consideration to avoid selecting posts from the same time period (e.g., tweets from one week only). The first 300 posts from each source of user feedback were then selected. To manually classify our data, we followed a systematic and iterative coding process. Specifically, three judges participated in the data classification process. The judges have an average of three years of industrial software engineering experience. For each post (tweet and review), each judge had to answer three main questions: (a) does the post raise any concerns (informative vs. uninformative)?, (b) what is the broad issue raised in the post?, and (c) what is the specific concern raised in the post? The manual classification process was carried over four sessions, each session lasted around 6 h, divided into two periods of three hours each to avoid any fatigue issues and to ensure the integrity of the data (Wohlin et al. 2012) . 
A final meeting was then held to generate the main categories of concerns as they appeared in the individually classified data. Conflicts were detected in less than 5% of the cases, mainly on the granularity level of the classification. For example, concerns about refunds and promo codes were considered two separate categories by one judge, while another judge classified them under the same concern category (money issues). Such conflicts were resolved after further discussion and eventually using majority voting. In what follows, we describe the results of our qualitative analysis in greater detail. A post during our manual classification task was considered informative if it raised any form of user concerns. The rest of the reviews were considered miscellaneous. Posts containing spam (e.g.,"#UberEats Always late!! Check bit.ly/1xTaYs") or context-free praise or insults (e.g., "I hate this app!" and "This app is great!") were also considered irrelevant. In general, the following general categories and sub categories of concerns were identified in the set of informative posts: • Business concerns: This category includes any concerns that are related directly to the business aspects of food delivery. In general, these concerns can be subdivided into two main subcategories: • Human: these concerns are related to interactions with employees of the apps. Users often complained about orders running late, cancellations, restaurant workers being rude, and drivers getting lost on the way to delivery. Human related reviews were on average the longest (30 words), often narrating multiparagraph sequences of human (mainly driver) failures that led to undesirable outcomes. • Market: the apps in our dataset generally make money either through flatrate delivery charges or surcharges added to the price of individual menu items. Users are highly sensitive to the differences between what they would pay at the restaurant versus at their doorstep. Posts complimenting low fees and markups were rare. Price complaints were not the only form of marketrelated feedback. Other posts included generic discussions of market-related concerns such as business policy (such as refunds), discussion of competitors, promotions, and posts about participating restaurants and delivery zones. Requests for service in remote areas were fairly common too. • Technical concerns: This set of concerns includes any technical issues that are related to the user experience when using the app itself. As have been shown before (Maalej and Nabil 2015) , technical concerns often revolve around two subcategories: • Bug reports: Posts classified under this category contain descriptions of software errors, or differences between the described and the observed behaviors of the app. Bug reports commonly consist of a simple narration of an app failure. In our dataset, we observed that the most common bugs were related to payments (174 out of 533) while crashes and service outages counted for 53 posts. • Feature requests: These posts contain requests for specific functionality to be added to the app, or discussions of success/failure of distinct features. For example, some users of DoorDash complained about being forced to tip before the order was delivered. Users of Eat24 lament a recent update which removed the ability to reorder the last meal requested through the app. 
Under this category, we also include non-functional requirements (NFRs), or aspects of software which are related to overall utility of the app rather than its functional behavior (e.g., usability, reliability, security, and accessibility) (Cleland-Huang et al. 2005; Glinz 2007 ). Ease-of-use was the most common NFR cited by users, followed by user experience (UX). In terms of specific concerns, nine different concerns were identified: drivers, customer service, refund, service outage, promo code, communication, security, routing, and order. Thorough descriptions as well as examples of these concerns are shown in Table 2 . In Table 3 , we show the number of posts classified under each category of concerns in the sampled dataset. In general, our qualitative analysis revealed that, based on the total number of relevant posts, Table 2 A fine-grained classification of user concerns in the ecosystem of food delivery apps Concern Description Example post The single most common problem was with drivers. Specifically, drivers were dispatched inefficiently, or combined orders, causing long wait times. Users were especially upset when drivers went the wrong direction -"In addition, the address that I gave to #UberEats took the driver to a completely different parking lot" and "@DoorDash The driver did not follow the order instructions, was belligerent, and shouted at me" Customer service Users commonly expressed dissatisfaction with the friendliness of service members and how long it took to receive answers "Do not ever use this service! The contact number is nowhere to be found; I had to ask Google to find it" Refund Users were often frustrated to discover that services generally only offered refunds for the delivery charge, excluding the order, even if the food was rendered inedible due to long delivery time "They were unable to get me a refund for food that arrived cold and rubbery when I live 3 min away from the restaurant" Service outage Whenever a service was down, users immediately turned to social media to complain "The servers are down!" and "Great timing for an outage" Promo code A common bug report was promotion codes not being applied to orders correctly "The promo code was rejected, inaccurately saying that I was not eligible" Communication Bugs commonly originated from failed communication between the delivery service and the restaurant, especially regarding menu items and hours-of-operation "@Postmates so I ordered baby blues spent 52$ for my postmate to send me a picture of the place closed so I had to cancel my order and now I cant get food tonight" Security Security errors were surprisingly common. Several users reported unexplained charges to their accounts "@Postmates my account was hacked. I reset my password and people all over the country are still ordering on my account" Occasionally the GPS systems in the drivers' apps failed, causing drivers to ask users for help. Many users were upset when this happened "Driver got lost had to ask me for BASIC directions, then drove in the complete opposite direction. The food came so late it was inedible" Order Sometimes, services failed to route a driver to an order, and rather than alert the customer, they gradually pushed the delivery window back "I had to contact #grubhub, not the other way around, about a delivery that was an hour beyond the delivery window and the estimated time kept pushing further back" Android reviews were the least informative in comparison to other sources of feedback. 
One potential explanation for this phenomenon is that Google Play does not pose any restriction on the number of times an app can request users to leave a review for the app, while the Apple App Store limits app in this respect. As a result, many Android reviews were terse, with statements such as "I'm only posting this because the app keeps nagging me" being common. Finally, the results also show that the distribution of concerns over the apps was almost the same. As Fig. 1 shows, concern types spread almost equally among apps, highlighting the similarity between the apps in their core features and user base. It is important to point out that our identified categories were considered orthogonal: each post could be any combination of human, market, bug, and feature issues. Therefore, there was considerable overlap between categories. This overlap is shown in Fig. 2 . Bug Feature Fig. 1 The distribution of concern categories for each app. Y-axes is the number of posts (reviews and tweets) In the first phase of our analysis, we qualitatively analyzed a large dataset of crowd feedback, sampled from the set of app store reviews and tweets directed to the apps in our ecosystem. Our results showed that user concerns tend to overlap and extend over a broad range of technical and business issues. Furthermore, these concerns tend to spread over multiple feedback channels and apps in the domain, which makes it practically infeasible to collect and synthesize such feedback manually. This emphasizes the need for automated tools that developers can use to make sense of such data. To address these challenges, our second research question in this paper ( ) aims at proposing automated methods for generating representative models of the data. To answer this question, we first investigate the performance of LDA as one of the most commonly used topic modeling techniques in app user feedback analysis (Chen et al. 2014; Gomez et al. 2015; Guzman and Maalej 2014; Iacob and Harrison 2013) . We then propose a novel frequency-based approach for generating more expressive models of the data. The performance of both techniques is evaluated based on their ability to capture the main concerns of food delivery app users as well as their main attributes and triggers ( Introduced by Blei et al. (2003) , LDA is an unsupervised probabilistic approach for estimating a topic distribution over a text corpus. A topic consists of a group of words that collectively represents a potential thematic concept (Blei et al. 2003; Hofmann 1999) . Formally, LDA assumes that words within documents are the observed data. The known parameters of the model include the number of topics k, and the Dirichlet priors on the topic-word and document-topic distributions and . Each topic t i in the latent topic space (t i ∈ T) is modeled as a multidimensional probability distribution, sampled from a Dirichlet distribution , over the set of unique words ( w i ∈ W ) in the corpus D, such that, w|t ∼ Dirichlet( ) . Similarly, each document from the collection ( d i ∈ D ) is modeled as a probability distribution, sampled from a Dirichlet distribution over the set of topics, such that, t|d ∼ Dirichlet( ) . t|d and w|t are inferred using approximate inference techniques such as Gibbs Sampling (Griffiths and Steyvers 2004) . Gibbs sampling creates an initial, naturally weak, full assignment of words and documents to topics. 
The sampling process then iterates through each word in each document until word and topic assignments converge to an acceptable (stable) estimation (Blei et al. 2003) . We use Gensim 3 to extract topics from our dataset of user posts (reviews and tweets) (Rehurek and Sojka 2010) . Gensim is a Python-based open-source toolkit for vector space modeling and topic modeling. We apply lemmatization and stopword removal on the posts to enhance the quality of generated topics. For lemmatization we use the spaCy library for Python 4 and to remove stop-words we use Gensim's built-in stop-word removal function. LDA's hyper-parameters and are optimized by Gensim, where is automatically learned from the corpus and is set to be 1/(number of topics). To determine the number of topics, we rely on Gensim's coherence score. Topic coherence provides a convenient measure to judge how good a given topic model is. Our analysis shows that at around 8-10 topics, our data will generate the most cohesive topics (Fig. 3a) . The list of generated topics are shown in Table 4 . In general, the topics are of poor quality, in other words, they do not seem to capture any of the major concerns identified either by our qualitative analysis. For example, while the second topic in Table 4 includes words such as delivery, food, and fee, it fails to represent a coherent concern due to the mixture of words from more than one concern category. Other topics in Table 4 also contain almost no words collectively representative of any of the concern categories identified during our qualitative analysis phase. These poor results can be explained based on the limited length of user reviews and tweets. Recent research has shown that LDA does not perform well when the input documents are short in length (Bing et al. 2011; Hong and Davison 2010; Yan et al. 2013) . Specifically, LDA is a data-intensive technique that requires large quantities of text to generate meaningful topic distributions. However, due to the sparsity attribute of short-text, applying standard LDA to short-text data (e.g., user reviews or tweets) often produces incoherent topics (Hong and Davison 2010; Zhao et al. 2011) . To overcome this problem, researchers use supplemental strategies to effectively train LDA in short-text environments. Such strategies, often known as pooling, are based on merging (aggregating) related texts together and presenting them as single pseudo-documents to LDA, thus, increasing the amount of text per document to work with. In our analysis, we aggregate posts from each source (App Store reviews, Google Play reviews, and Twitter) for each app in a single document, thus producing 3 × 4 documents. We then generate topics for our aggregated data. Using this data, the coherence score hits a local maxima at six topics (Fig. 3b) . The generated topics are shown in Table 5 . In general, aggregating user posts resulted in producing very similar topics. Generated topics are more redundant, providing only incomplete representations of the user concern in our data. The poor generalization ability of LDA can be attributed to two main reasons. First, due to the overlapping nature of the different concern categories, the classes are not separable by LDA. As a result, we see a mixture of words from different concern categories in the same topic. Second, LDA is a data-intensive technique that requires large quantities of text to generate meaningful topic distributions (Blei et al. 2003) . 
However, our dataset is relatively small, consisting of only 3600 user posts, and even much less documents when these posts are aggregated. In summary, our attempt to automatically generate our list of concerns using LDA was relatively unsuccessful. In order to generate meaningful topics, LDA requires a balance between the number and length of text artifacts being modeled (Tang et al. 2014) . While we had a relativity large number of artifacts, their length was limited. Our attempt to generate larger artifacts using Assisted LDA resulted in only few lengthy artifacts (12). This has negatively impacted LDA's ability to converge, or generate meaningful latent topic structures. Our expectation is that, applying more fine-grained text aggregation strategies that can produce sufficiently long, but not too long, documents (e.g., aggregating tweets based on hashtags) would help to improve the quality of generated topics (Hong and Davison 2010; Mimno et al. 2011 ). The first part of our modeling analysis showed that LDA comes with several inherent limitations related to its computational complexity and the nature of our data. These limitations prevent LDA from producing meaningful representations of crowd feedback. To overcome these limitations, in this section, we propose a fully automated procedure for generating succinct representations of crowd feedback in the ecosystem of food delivery apps. In general, our automated model generation procedure can be divided into four main steps: In what follows, we describe these steps in greater detail. The first step in our procedure is to separate informative user feedback from uninformative feedback. A large body of research exists on classifying mobile app user feedback into different categories of software maintenance tasks, such as feature requests and bug reports (Maalej and Nabil 2015; Panichella et al. 2015; . Our classification configurations can be described as follows: • Classification algorithms: To represent our data, we experiment with three different classification algorithms: Support Vector Machines (SVM), Naive Bayes (NB), and Random Forests (RF). These algorithms have been extensively used to classify crowd feedback in the app market (Maalej and Nabil 2015; Panichella et al. 2015) . Their success can be attributed to their ability to deal effectively with short text (e.g., tweets, user reviews, YouTube comments, etc.) (Wang and Manning 2012). • Training settings: to train our classifiers, we used 10-fold cross validation. This method creates 10 partitions of the dataset such that each partition has 90% of the instances as a training set and 10% as an evaluation set. The benefit of this technique is that it uses all the data for building the model, and the results often exhibit significantly less variance than those of simpler techniques such as the holdout method (e.g., 70% training set and 30% testing set). • Text pre-processing: English stop-words were removed and stemming was applied to reduce words to their morphological roots. We used Weka's built-in stemmer and stop-word list to pre-process the posts in our dataset (Lovins 1968 ). It is important to point out that lemmatization is sometimes used instead of stemming in app review classification tasks (Panichella et al. 2015) . The results often show a marginal impact of these techniques on the precision of classification. In our analysis, we use stemming for its lower overhead. 
Specifically, lemmatization techniques are often exponential to the text length, while stemming is known for its linear time complexity (Bird et al. 2009 ). • Sentiment Analysis: sentiment analysis is often used in app user feedback classification tasks as a classification feature of the input data (Williams and Mahmoud 2018; Jha and Mahmoud 2019). The underlying hypothesis is that user concerns are often expressed using negative sentiment (Lin et al. 2018) . To calculate the sentiment of our data, we used SentiStrength (Thelwall et al. 2010 ). SentiStrength assigns positive (p) and negative (n) sentiment scores to input text, using a scale of − 5 to + 5, based on the emotional polarity of individual words. To convert SentiStrength's numeric scores into these categories, we adapted the approach proposed by Jongeling et al. (2017) and Thelwall et al. (2012) . Specifically, a post is considered positive if p + n > 0 , negative if p + n < 0 , and neutral if p + n = 0 . It is worth mentioning that other sentiment analysis techniques, such as VADER and the Stanford CoreNLP are also used in related studies. However, the difference in performance between these tools is often marginal (Panichella et al. 2015; Jongeling et al. 2015 ; Williams and Mahmoud 2017). • Text representation: to classify our data, we experimented with simple bag-ofwords with lowercase tokens. The bag-of-words representation encodes each post as a vector. Each attribute of the vector corresponds to one word in the vocabulary of the dataset. A word is included in the vocabulary if it is present in at least two posts. Words that appear in a single post are highly unlikely to carry any predictive value to the classifier. An attribute of one in a post's vector indicates that the corresponding word is present, while a zero indicates absence. This representation can be extended to treat common sequences of adjacent words, called n-grams, a gram is a single word; n is the number of adjacent words, so two adja-cent words are a bi-gram. For example, the phrase "this app is good" contains four words, and three b-grams ("this app", "app is", "is good"). Figure 4 illustrates how bag-of-words and n-gram representations work; "updated", "app", and "crashes" are the key words that occur in the tweet "I updated the app, but now it crashes". "Now it crashes" is a tri-gram that is also included. Each '1' in the vector representation at the bottom corresponds to one of the highlighted n-grams, while each '0' corresponds to a vocabulary word that is not found in the tweet. To generate this representation, we utilized the n-gram tokenizer in Weka, which allowed uni-gram, bi-gram, and tri-gram tokens to be included in a single dataset. We trained two set of classifiers to categorize our data. One classifier for detecting business posts and one classifier for detecting technical posts. The standard measures of Precision (P), Recall (R), and F-Score ( F ) are used to evaluate the performance of our classification algorithms. Assuming t p is the set of true positives, f p is the set of false positives, and f n is the set of false negatives; precision is calculated as: t p ∕(t p + f p ) and recall is calculated as: t p ∕(t p + f n ) . The F-measure is the weighted harmonic mean of P and R, calculated as: F = ((1 + 2 )PR)∕( 2 P + R) . In our analysis, we use = 2 to emphasize recall over precision (Berry 2017) . 
All tweets and reviews in our original dataset were stored in ARFF format, a common text-based file format often used for representing machine learning datasets, and then fed to Weka. 5 Table 6 shows the performance of NB, SVM, and RF in terms of P, R, and F 2 . SVM provided the best average classification performance in separating the different types of concerns, in comparison to NB and RF respectively. The best SVM results were obtained using the Pearson VII function-based universal kernel (Puk) with kernel parameters = 8 and = 1 (Üstün et al. 2006) . Universal Kernels are known to be effective for a large class of classification problems, especially for noisy data (Steinwart 2001) . RF was evaluated with 100 iterations. Raising iterations above this number did not improve the performance. We also notice that almost all classifiers achieved better performance when classifying the reviews and tweets into generic categories of Business and Technical. The performance deteriorated when the data was classified at a subcategory level (Human, Market, Bug, and Feature) due to the fact that the classifier had to deal with a larger set of classes (labels). Separating concerns at this level can be challenging, especially when the data is relatively unbalanced. In general, business-related posts were easier to classify than technicallyrelated posts. This phenomenon is driven by the quantity of each class. Table 3 shows that technical posts were rare. The prior-probability of any given post being technical is less than 25%, negatively impacting the performance of all three classifiers. This problem was exacerbated for the individual technical categories, with feature requests only occurring in 6.5% of posts. The relative sparsity of technical posts in comparison to other application domains can be explained based on the fact that the domain food delivery is a business domain in nature, thus, users had so many more business-related issues to discuss. For instance, Food courier services would often fail behind the scene, causing drivers to be dispatched to incorrect locations, or customer support to fail to call. These failures often caused customers to discuss competition and pricing. As a result, business concerns crowded out technical concerns. In other domains, failures are more immediate and visible to consumers, meaning that user concerns are more likely to take the form of bug reports. We further experimented with the bag-of-words representation of text, and then allowing bi-and tri-grams to be included alongside individual words. Neither approach improved the performance. Table 7 shows a comparison between the uni-gram encoding (i.e., bag-of-words), and the encoding which included biand tri-grams. The lack of improvement partly stems from the fact that the additional composite tokens often had the same class implications as their constituent words. For example, the term account was found to have a negative implication on the business class, meaning that posts containing the word account were unlikely to be business-related. Most of the related N-grams, including account got hacked and account was hacked had the same implication, except with a substantially smaller weight. Therefore, they were essentially irrelevant to classification. In some other cases, bi-and tri-grams did not have the same implication as their constituent words. For example, promo was positively implicated to business, but promo code had a negative implication. 
However, the single word in this case, and in many others, had a higher weight than the bi-and tri-grams, and occurred in substantially more posts. Often times, the bi-grams had the same weight and occurrence as the tri-grams, making the tri-grams superfluous. Our results also show that the sentiment polarity of posts had almost no impact on the classification accuracy. Specifically, the results show that miscellaneous posts (posts not business or technically-relevant) were detected as having more positive sentiment than any other category. These result were expected; non-miscellaneous posts often described problems users were having. Otherwise, as Fig. 5 shows, the categories had substantially similar sentiment scores overall. For future work, we suspect that enhancing SentiStrength's dictionary with emotion-provoking softwarerelated words (crash, uninstall, etc.), or using customized sentiment analysis classifiers (e.g., Williams and Mahmoud 2017) would help to better estimate the emotional polarity of posts. In order to specify the main entities (nodes) of our model, we look for important words in the set of reviews and tweets classified as informative in the previous step. Our assumption is that such words capture the essence of user concerns in the ecosystem. In Object Oriented software design, when generating conceptual models from requirements text or any textual data, nouns are considered candidate classes (objects), verbs are considered as candidate operations (functions), while adjectives commonly represent attributes (Abbott 1983; Elbendak et al. 2011 ). Based on these assumptions, we only consider important nouns, verbs, and adjectives in our analysis. To extract these parts of speech (POS), we utilize the Natural Language Toolkit (NTLK) (Bird et al. 2009 ) POS tagging library. We further apply lemmatization to reduce the morphological variants of words in our dataset down to their base forms. For example, drink, drinks, drinking, drank, and drunk, are all transformed to simply drink. By applying lemmatization, we avoid the problem of morphological variants being treated as entirely different words by our model. After lemmatization, we merge words together under each part of speech category. For example drive and drives are merged to simply drive when used as verbs. However, the word drive can also be a noun (e.g., "that was a long drive"). Therefore, we only merge words where TF(w i ) is the term frequency of the word w i in the entire collection, |R| is the total number of posts in the collection, and |r j ∶ w i ∈ r j ∧ r j ∈ R| is the number of posts in R that contain the word w i . The purpose of TF.IDF is to score the overall importance of a word to a particular document or dataset. In general, TF.IDF balances general frequency and appearance in number of posts. High frequent words appearing in few documents have higher TF.IDF. After defining TF.IDF, we extract important POS from the set of informative business and technical posts. The top ten nouns, verbs, and adjectives in our dataset are shown in Table 8 . Our model generation procedure depends on the co-occurrence statistics of words in the data to capture their relations. For example, in our dataset, the words customer and refund appear in a very large number of user reviews and tweets. Therefore, the procedure assumes there is a relation connecting these two entities. To count for such information, we use pointwise mutual information (PMI). 
PMI is an information-theoretic measure of information overlap, or statistical dependence, between two words (Church and Hanks 1990) . PMI was introduced by Church and Hanks (1990) , and later used by Turney (2001) to identify synonym pairs using Web search results. Formally, PMI between two words w 1 and w 2 can be measured as the probability of them occurring in the same text versus their probabilities of occurring separately. Assuming the corpus contains N documents, PMI between two words w 1 and w 2 can be calculated as: where C(w 1 , w 2 ) is the number of documents in the collection containing both w 1 and w 2 , and C(w 1 ), C(w 2 ) are the numbers of documents containing w 1 and w 2 respectively. Mutual information compares the probability of observing w 1 and w 2 together against the probabilities of observing w 1 and w 2 independently. Formally, mutual information is a measure of how much the actual probability of a co-occurrence of an event P(w 1 , w 2 ) differs from the expectation based on the assumption of independence of P(w 1 ) and P(w 2 ) (Bouma 2009 ). If the words w 1 and w 2 are frequently associated, the probability of observing w 1 and w 2 together will be much larger than the probability of observing them independently. This results in a PMI > 1. On the other hand, if there is absolutely no relation between w 1 and w 2 , then the probability of observing w 1 and w 2 together will be much less than the probability of observing them independently (i.e., PMI < 1). PMI is intuitive, scalable, and computationally efficient (Mihalcea et al. 2006; Newman et al. 2010 ). These attributes have made it an appealing similarity method to be used to process massive corpora of textual data in tasks such as short-text retrieval (Mihalcea et al. 2006) , Semantic Web (Sousa et al. 2010; Turney 2001) , source code retrieval (Khatiwada et al. 2017) . To generate the relations in our model, we computed PMI between every pair of words to determine their relatedness. One potential pitfall of relying on PMI as a measure of relatedness is that PMIs hits a maximum with words occurring only once. This happens often with misspellings and irrelevant words. In order to prevent this phenomenon, we restrict our analysis to only words that occur at least ten times. Ten was chosen due to being the point at which sensitivity to additional increases became less noticeable (i.e., changing 10-11 would not substantially alter the results). To generate our model, we extract the top 10 nouns ranked by TF.IDF and then use PMI to extract the three most related verbs and adjectives with each noun. An example of a node, or an atomic entity in our model, is shown in Fig. 6 . This node consists of three main parts: • Concern: the middle part of the node represents the concern's name (food), which is basically one of the important nouns (based on TF.IDF) in our dataset. (2) PMI = log 2 Arrive Deliver Prepare Fig. 6 The key elements of the entity-action-property relations represented by our model • Properties: directly attached to the entity's name from the right is the top three adjectives associated with the entity (based on PMI). In our example, food could be cold, hot, or late. • Triggers: on the left side of the node, we attach the list the top three verbs frequently associated (based on PMI) with the noun (concern's name). Verbs often represent triggers, or leading causes of concerns. In our example, the verbs arrive, deliver, and prepare are commonly associated with the word food. 
Formally, our model generation process can be described as follows, given a set of Words, containing all words in the dataset occurring at least ten times, we define the parts of speech of a word, or pos(word), Adjs, Verbs, and Nouns as follows: We define three helper sets to help us express our graph mathematically. SelNouns is the list of the top 10 selected nouns when ranked by Hybrid TF.IDF. Verbs w and Adjs w are the sets of three most closely related (by PMI) verbs and adjectives for a given word w. These sets are defined, using the function top (n, pred) to retrieve the top n words after words are sorted based on the predecessor function pred(word). We use two functions to sort words: TF.IDF for nouns and PMI for verbs and adjectives. We express this using notation for defining anonymous functions, such that, x.TFIDF(x) means define a function that takes an x and returns its TF.IDF. This results in the following expressions: We define a graph, (V, E), expressed as a tuple of vertices and edges, as follows: The set of vertices (V) is constructed by creating a smaller set containing each selected noun and its related adjectives and verbs, and then taking the union of these smaller sets to form the entire set of relevant entities, properties, and actions. The set of edges (E) is simply the union of associations of nouns to adjectives and nouns to verbs. Applying this process to our informative posts in the domain of food delivery apps results in the model in Fig. 7 . Verbs w = top (3, Verbs, v.PMI(v, w) ) Adjs w = top(3, Adjs, a.PMI(a, w)) Due to the lack of a priori ground-truth, evaluating domain models can be a challenging task. In general, a domain model is an abstraction that describes a specific body of knowledge. Therefore, the quality of the model can be assessed based on its completeness, or its ability to encompass the main concepts present in the knowledge it models (Mohagheghi and Dehlen 2019; Rubén 1990) . These concepts are often determined manually by domain experts. To evaluate our model generation procedure, we examine the main concepts captured in the model. Specifically, we assess the extent to which the noun-verb-adjective associations presented in our model reflect the main concerns identified by our qualitative analysis: • Customer Service: Concerns about customer service frequently appeared when an order was not delivered on time, when the order was inaccurate, or when refunds were denied. The model identified both customer and service as important nouns along with the relations and . Furthermore, both customer and service were associated with the adjectives poor and terrible in the model. • Orders: Orders were commonly associated with delays. Users complained about receiving cold food as a result. Users were disappointed whenever food was left waiting at the restaurant to be picked up. The model identified whenever restaurants refused to cancel orders or the app refused to take action when things went wrong. In addition, was a common occurrence as these two words often appeared together (e.g., "place order"). The relation originated from posts of users complaining about having to re- Fig. 7 A suggested model diagram depicting the relationships between important nouns (entities of the ecosystem), adjectives (attributes), and verbs (concern triggers) order for the second time and the relation originated from people asking for full refunds. • Food: Food was directly related to arrival. This was captured in the relations , , and . 
Food was also associated with temperature, mainly due to the number of complaints about receiving cold or hot food (e.g., and ). Complaints about orders being late were common, resulting in the relation . • Delivery: Delivery was associated with a number of complaints about incorrect estimated times, explaining the relation . The relation occurred due to issues with orders being stuck in the preparation stage and never being dispatched for delivery. The relation primarily occurred in the context of users stating that they would "choose a different delivery service". • Time: Time was primarily present in complaints about delivery delays. The relations and appeared for the same reasons they appeared with delivery. A common occurrence was due to unexpected delays and order cancellations. The relation occurred in similar contexts, as in "it took longer than the estimated time". • App: App appeared alongside comments about ease-of-use, resulting in the relation . The relation was a general complaint about poor policies or bad usability. The relation appeared when users discussed deleting an app after a poor experience. A common association was , appearing due to phrases such as "look into this" and "looks like". The relation appeared from posts were users complained that they "ended up" eating cold or incorrect food, or not eating at all. • Money: Money issues were captured by the relation . This relation stems from incidents were users ordered food that ended up being inedible and being unable to obtain a refund, which also yielded the model relation . The verb take was associated with money in posts such as "you take my money but did not deliver", resulting the relation . • Drivers: Drivers are a critical component of the ecosystem. All services struggled with their drivers' timing, directions, and friendliness. Users frequently complained about drivers combining orders. The model successfully identified the relation from posts discussing a driver's inability to find their destination. Lack of friendliness is captured in the relation . • Restaurants: Users often asked services to add new restaurants as well as discussed problems that occurred between the app, restaurant, and driver. The relation appeared in the model partly due to users stating that the restaurant they wanted did not "show up" in the app. However, this phrase was more often associated with the driver not appearing at the restaurant. Communication problems between restaurants and consumers were captured through the relation. In summary, to answer , in terms of completeness (the omission of domain concepts and relationships), our model was able to recover a large number of concepts in the data. Missed concerns were rare (e.g., inability to find a customer service number). In terms of clarity, some of the captured relations, such as or were more obvious than others, for example . Incorrect, or hard to explain, relations were also present in the model. For example, the relations and did not seem to reflect any issues that were identified by our qualitative analysis of the data, rather they originated from posts such as "not a big fan of the driver" or "no longer interested". While these relations were relatively rare, they can be eliminated by compiling a list of such common English adjectives to filter them out before they make their way to the model. Another observation is that technical concerns, despite not being accurately classified, have also found their way into the model. 
Another observation is that technical concerns, despite not being accurately classified, have also found their way into the model. For instance, hacking was a popular technical concern: the verb hack appeared in association with the nouns customer and service. A summary of the main steps of the proposed approach is depicted in Fig. 8.

The first phase of our analysis revealed that user concerns in SE extend beyond the technical issues of mobile apps to cover other business- and service-oriented matters. These results emphasize the importance of studying user feedback in the app market at an ecosystem level. Specifically, apps should be analyzed in bundles, or clusters, of functionally related apps rather than studied individually. In fact, such clusters can be automatically generated using app classification techniques (AlSubaihin et al. 2016). Once these fine-grained categories of semantically similar apps are identified, automated data clustering, classification, and modeling techniques should be employed to consolidate and analyze user feedback and identify the main pressing user concerns in these clusters. Our analysis also provided additional evidence of the value of considering multiple sources of user feedback to get the full picture of user concerns. For instance, in the domain of food delivery apps, users preferred to use Twitter as a low-latency channel to get instant reactions from app developers or operators; such complaints were especially common whenever any of the services in our ecosystem went down for some reason. Understanding how users utilize different sources of feedback can help developers focus their attention on the right channels of feedback while planning their next release.

In the second phase of our analysis, we proposed an automated procedure for generating conceptual models of user concerns in the ecosystem of food delivery apps. According to Yu (2009), "conceptual modeling frameworks aim to offer succinct representations of certain aspects of complex realities through a small number of modeling constructs, with the intent that they can support some kinds of analysis". Our procedure adapted ideas from Object-Oriented programming and text processing to extract the main entities of our ecosystem; an underlying tenet is that the vocabulary of a domain provides an easily accessible supply of concepts. An information-theoretic approach, which utilizes term co-occurrence statistics, was then used to establish a structuring mechanism for assembling and organizing the extracted concepts. Our evaluation showed that relying on these techniques can generate a high-quality model which captures most of the latent concepts in the domain knowledge. By changing the TF.IDF and PMI thresholds and the number of nouns, verbs, and adjectives in Eq. (4), domain entities and relations can be included or excluded, giving app developers the flexibility to generate domain models at different levels of granularity. The simplicity and configurability of our procedure give it an advantage over other, more computationally expensive methods, such as LDA (Blei et al. 2003), which requires large amounts of data and a calibration of several hyperparameters in order to produce meaningful topics (Chen et al. 2014; Guzman and Maalej 2014).
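For reference, such an LDA baseline can be set up in a few lines of code. The sketch below assumes the gensim library, and the number of topics, the number of passes, and the priors shown are illustrative values that would still require the calibration discussed above.

# A minimal LDA baseline, assuming gensim; hyperparameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

def lda_topics(tokenized_posts, num_topics=10):
    dictionary = corpora.Dictionary(tokenized_posts)
    dictionary.filter_extremes(no_below=10)   # mirror the ten-occurrence cutoff
    bow = [dictionary.doc2bow(post) for post in tokenized_posts]
    lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics,
                   passes=20, alpha="auto", random_state=1)
    return lda.print_topics(num_words=5)

# Example usage:
#   topics = lda_topics([["order", "late", "refund"], ["driver", "lost"], ...])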
In terms of impact, our generated model can provide valuable ecosystem-wide information to SE app developers, acting as a vehicle to facilitate a quick transition from domain knowledge to requirements specifications. For instance, startups, or newcomers, trying to break into the food delivery app market can use our procedure to quickly generate a model of their micro-ecosystem of operation. Through the model's entities and relations, they can gain insights into the complex realities of their operational environments. Such information can help them redirect their effort toward innovations that avoid these issues in their apps. For example, developers can work on more accurate driver dispatching procedures to avoid delays, add new features for payments and refunds to reduce the amount of money and time wasted, add more security measures to prevent hacking, and implement smarter rating systems for drivers, customers, and restaurants to control the quality of service provided through the app. After release, developers can further use our model to automatically track users' reactions to their newly released features.

Our analysis takes the form of a case study. Case studies often suffer from external validity threats since they target specific phenomena in their specific contexts (Wohlin et al. 2012). For instance, our case study only included four apps, which might not represent the entire domain of food delivery. However, as mentioned earlier, our analysis focused only on the fittest actors in the ecosystem; these popular apps often receive significantly more feedback than smaller apps (Mcilroy et al. 2017). Furthermore, to minimize any sampling bias, our data collection process included multiple sources of user feedback and extended over a long period of time to capture as much information about the apps in our ecosystem as possible. In terms of generalizability, we anticipate that our proposed approach could be applied to other application domains beyond SE, especially for apps operating in complex multi-agent ecosystems. However, independent case studies need to be conducted before we can make such a claim.

Internal validity threats may stem from the fact that we relied only on the textual content of user posts and their sentiment as classification features (a minimal sketch of such a configuration is shown below). In the literature, meta-data attributes, such as the star rating of a review or the number of retweets, have also been considered as classification features (Guzman and Maalej 2014). The decision to exclude such attributes was motivated by our goal of maintaining simplicity. Specifically, practitioners trying to use our procedure do not have to worry about collecting and normalizing such data, especially since the impact of such attributes on the quality of classification was found to be limited (Guzman and Maalej 2014). Threats might also stem from our model evaluation procedure. Specifically, our generated LDA topics and models were only evaluated intrinsically, based on how well the generated model correlated with the results of the qualitative analysis. While such an evaluation can be sufficient for model generation and calibration tasks, it does not capture the practical significance of the model. Therefore, a main direction of our future work will be dedicated to the extrinsic evaluation of our model. Extrinsic evaluation is concerned with criteria relating to the system's function, or role, in relation to its purpose (e.g., validation through experience). To conduct such an analysis, our model will be provided to selected groups of app developers to be used as an integral part of their app development activities. Evaluation data will be collected through surveys that will measure the level of adoption as well as the impact of such models on idea formulation and on the success or failure of mobile app products.
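As a rough illustration of the lightweight classification configuration described above (textual content plus a sentiment score as the only features), the sketch below assumes scikit-learn and substitutes a toy lexicon-based stub for a real sentiment tool; it is not our exact experimental setup.

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for a real sentiment tool; returns a crude polarity score.
POSITIVE = {"great", "good", "fast", "friendly", "delicious"}
NEGATIVE = {"cold", "late", "terrible", "poor", "never"}

def sentiment(post):
    tokens = post.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def build_features(posts, vectorizer=None):
    # Textual content as TF-IDF vectors plus a single appended sentiment column.
    if vectorizer is None:
        vectorizer = TfidfVectorizer(min_df=2)
        text = vectorizer.fit_transform(posts)
    else:
        text = vectorizer.transform(posts)
    sentiment_col = csr_matrix([[float(sentiment(p))] for p in posts])
    return hstack([text, sentiment_col]), vectorizer

# Usage with labeled posts (e.g., informative vs. non-informative):
#   X_train, vec = build_features(train_posts)
#   clf = LinearSVC().fit(X_train, train_labels)
#   X_test, _ = build_features(test_posts, vec)
#   predictions = clf.predict(X_test)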
SE has come with a set of unconventional challenges for software engineers. Understanding these challenges begins with understanding end-users' needs, and then using such knowledge to develop a better understanding of the internal dynamics of such a complex and dynamic software ecosystem. To achieve this goal, in this paper, we proposed an automated approach for modeling crowd feedback in ecosystems of SE apps. The proposed approach was evaluated through a case study targeting the ecosystem of food delivery apps. Our results showed that users tend to express a variety of concerns in their feedback, and that such concerns often extend over a broad range of technical and business issues. The results also showed that, in our ecosystem of interest, business concerns were more prevalent than technical concerns. In the second phase of our analysis, we proposed an approach for automatically generating an abstract conceptual model of the main user concerns in the ecosystem of food delivery apps. The results showed that a descriptive model can be generated by relying on the specificity, frequency, and co-occurrence statistics of nouns, verbs, and adjectives in textual user feedback. The results also showed that, despite being relatively rare and hard to classify, dominant technical concerns were reflected in the model. We further compared our generated model's entities with topics generated using the topic modeling technique LDA. The results showed that, due to the short and unstructured nature of user feedback text, LDA failed to generate any cohesive topics that were representative of valid user concerns. In addition to extrinsically evaluating our generated model, our future work in this domain will include conducting more case studies targeting SE apps operating in dynamic and multi-agent ecosystems, such as ridesharing or freelancing. These models will be enriched with more information, such as the priority of user concerns or the magnitude and direction of the relation between two ecosystem entities. Such information will enable us to understand the SE app market at a micro level and provide more succinct representations of its complex realities.

References

Program design by informal English descriptions
Promises and paradoxes of the sharing economy: an organizing framework
To each their own? The greenhouse gas impacts of intra-household sharing in different urban zones
Clustering mobile apps based on mined textual features
The irruption of Airbnb and its effects on hotels' profitability: an analysis of Barcelona's hotel sector
A muddle of models of motivation for using peer-to-peer economy systems
Peer production, the commons, and the future of the firm
Evaluation of tools for hairy requirements and software engineering tasks
Using query log and social tagging to refine queries based on latent topics
Natural Language Processing with Python
Payments for large-scale social ridesharing
Sharing economy as a contributor to sustainable growth. An EU perspective
An app for that: local governments and the rise of the sharing economy
Normalized (pointwise) mutual information in collocation extraction. German Society for Computational Linguistics and Language Technology
How Uber and the sharing economy can win over regulators
Analysis of user comments: an approach for software requirements evolution
Covid-19 pandemic exposes the vulnerability of the sharing economy
AR-Miner: Mining informative reviews for developers from mobile app marketplace
Current sharing economy media discourse in tourism
Real-time bidding based vehicle sharing
Word association norms, mutual information, and lexicography
Goal-centric traceability for managing non-functional requirements
The results are in for the sharing economy. They are ugly
Experimenting through mobile apps and app stores
The promise of the sharing economy among disadvantaged communities
The sharing economy in computing: a systematic literature review
Adding evidence to the debate: quantifying Airbnb's disruptive impact on ten key hotel markets
Digital discrimination: The case of Airbnb
Parsed use case descriptions as a basis for object-oriented class model generation
App store analysis: mining app stores for relationships between customer, business and technical characteristics
Racial and gender discrimination in transportation network companies
Method and instruments for modeling integrated knowledge
On non-functional requirements
Privacy-aware dynamic ride sharing
When app stores listen to the crowd to fight bugs in the wild
Finding scientific topics
Users: The hidden software product quality experts?: A study on how app users report quality aspects in online reviews
A little bird told me: mining tweets for requirements and software evolution
How do users like this feature? A fine grained sentiment analysis of app reviews
App store mining and analysis: MSR for app stores
Mining regular routes from GPS data for ridesharing recommendations
Probabilistic latent semantic indexing
Empirical study of topic modeling in Twitter
Sharing economy: a comprehensive literature review
To purchase or not? Why consumers make economically (non-) sustainable consumption choices
Retrieving and analyzing mobile apps feature requests from online reviews
Comparing Twitter summarization algorithms for multiple post summaries
A sense of community: a research agenda for software ecosystems
Using frame semantics for classifying and summarizing application store reviews
Mining non-functional requirements from app store reviews
Choosing your weapons: On sentiment analysis tools for software engineering research
On negative results when using sentiment analysis tools for software engineering research
What do mobile app users complain about?
Just enough semantics: an information theoretic approach for IR-based software bug localization
Sentiment analysis for software engineering: how far can we go
Development of a stemming algorithm
Bug report, feature request, or simply praise? On automatically classifying app reviews
The app sampling problem for app store mining
The sharing economy: a pathway to sustainability or a nightmarish form of neoliberal capitalism?
A survey of app store analysis for software engineering
Adapting to the sharing economy
User reviews of top mobile apps in apple and google app stores
Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews
Corpus-based and knowledge-based measures of text semantic similarity
Optimizing semantic coherence in topic models
Existing model metrics and relations to model quality
When the sharing economy becomes neoliberalism on steroids: unravelling the controversies
Evaluating topic models for digital libraries
A dynamic theory of organizational knowledge creation
User feedback in the appstore: an empirical study
Crowdsourcing user reviews to support the evolution of mobile apps
How can I improve my app? Classifying user reviews for software maintenance and evolution
PwC: The sharing economy: consumer intelligence series
Software framework for topic modelling with large corpora
What would users change in my app? Summarizing app reviews for recommending software changes
Characterization of the twitter replies network: Are user ties social or topical?
On the influence of the kernel on the consistency of support vector machines
The effect of informational signals on mobile apps sales ranks across the globe
Understanding the limiting factors of topic modeling via posterior contraction analysis
Avoiding the south side and the suburbs: the geography of mobile crowdsourcing markets
Toward a geographic understanding of the sharing economy: systemic biases in uberx and taskrabbit
Sentiment strength detection in short informal text
Sentiment strength detection for the social web
Mining the web for synonyms: PMI-IR versus LSA on TOEFL
Impacts of peer-to-peer accommodation use on travel patterns
Facilitating the application of support vector regression by using a universal pearson vii function based kernel
Release planning of mobile apps based on user reviews
Baselines and bigrams: simple, good sentiment and topic classification
Analyzing, classifying, and interpreting emotions in software users' tweets
Mining Twitter feeds for software user requirements
Modeling user concerns in the app store: a case study on the rise and fall of Yik Yak
Experimentation in Software Engineering
Economy: Privacy respecting contract based on public blockchain
A biterm topic model for short texts
Conceptual modeling: Foundations and applications. chap. Social Modeling and i*
The rise of the sharing economy: estimating the impact of Airbnb on the hotel industry
Comparing Twitter and traditional media using topic models
Inside the sharing economy: understanding consumer motivations behind the adoption of mobile applications

Acknowledgements This work was supported in part by the U.S. National Science Foundation (Award CNS 1951411) and LSU Economic Development Assistantships awards.