Gear, Aditya Srinivas; Prakash, Kritika; Singh, Nonidh; Paruchuri, Praveen. PredictRV: A Prediction Based Strategy for Negotiations with Dynamically Changing Reservation Value. In: Group Decision and Negotiation: A Multidisciplinary Perspective (2020-04-25). DOI: 10.1007/978-3-030-48641-9_10

Negotiation is an important component of the interaction process among humans. With increasing automation, autonomous agents are expected to take over much of this interaction process. Much of the automated negotiation literature focuses on agents having a static and known reservation value. In dynamic environments, e.g., an agent negotiating on behalf of a human about a meeting, the agent can have a reservation value (RV) that is a function of time. This leads to a different set of challenges that may require additional reasoning about the concession behavior. In this paper, we build upon negotiation algorithms such as ONAC (Optimal Non-Adaptive Concession) and Time-Dependent Tactics such as Boulware, which work in settings where the reservation value of the agent is fixed and known. Although these algorithms can encode a dynamic RV, their concession behavior, and hence the properties they are expected to display, would differ from the static-RV case, even though the underlying negotiation algorithm remains the same. We therefore propose to use one of a Counter model, Bayesian Learning with Regression Analysis, or an LSTM model on top of each algorithm to develop the PredictRV strategy, and show that PredictRV indeed performs better on two different metrics, tested on two different domains over a variety of parameter settings.

Negotiation is an important component of the interaction process among humans [18, 19, 22]. A lot of the negotiation literature assumes that we have a good amount of information about our own choices [10, 15] and reservation value (RV), while not knowing our opponent's preferences [4, 5, 12]. Note that the RV refers to the utility of a bid in the negotiation below which we would not be willing to accept any bid. A bid whose utility is below the RV may be rejected because of a better BATNA, the Best Alternative to a Negotiated Agreement [6] (so the RV may be set to the BATNA), or simply because the agent receives a utility that is not good enough to accept. In settings where the environment is dynamic, there can be situations where our RV changes with time (while the preference profile is static) [16]. We may not know how the changes will pan out, e.g., an agent acting on behalf of a meeting attendee may have varying estimates of when the human will arrive for the meeting [3, 23]. The dynamicity of the RV can therefore pose additional challenges when we are unaware of the nature of the changes (which is different from the RV changing because of a discount factor, where the change is computable). Bids that simply react to the dynamicity may not be sufficient, since they can change in a random fashion and result in lower utility. For example, it can be hard to agree on a meeting time if an agent acting on behalf of a human declares that the human will arrive in 30 min, re-declares shortly after that the human will arrive in 10 min, and then quickly changes the estimate to 20 min, even though the agent may simply be acting on its belief of when the human will arrive.
Making concessions to reach an agreement is an important part of the negotiation process [8, 14, 20]. There are a variety of ways in which negotiating agents can concede. One such category of techniques is Time-Dependent Tactics (TDTs) [7, 9], e.g., the Boulware and Conceder agents. [1] presents an Optimal Non-Adaptive Concession (ONAC) algorithm with incomplete information, where time pressure (the amount of time to the deadline) is a primary criterion influencing the concession behavior. Negotiation algorithms such as ONAC and Boulware [1] work in settings where the RV of the agent is fixed and known. Although these algorithms can work with (or be modeled as a function of) a dynamic RV, their concession behaviors can have a lot more randomness or fluctuation compared to when they have a static RV. For purposes of a more stable bidding behavior, the agent should therefore make choices based on predicted RV values. While the quality of agreement is a default metric used in negotiations, popular negotiation frameworks such as the Genius platform [17] do not support the modeling of a dynamic RV. We therefore had to develop a simple negotiation simulator that can encode a dynamic RV. In addition to the quality of agreement, we use prediction as an additional metric to evaluate the concession behavior. We propose to use the following models on top of negotiation algorithms to handle the effects of a dynamic RV: (a) a Counter model [24], (b) Bayesian Learning with Regression Analysis [25, 26] and (c) an LSTM model. All three models are present in the literature and we adapt them here to work suitably with the different negotiation algorithms. While the paper builds on top of the ONAC and Boulware algorithms, the procedure, in general, would be suitable for algorithms that are sensitive to the dynamicity of the RV (which results in fluctuations in bidding). Given that the models help to predict the RV to reduce the effect of dynamicity, we refer to the new strategy as PredictRV.

The rest of the paper is organized as follows: Sect. 2 presents an overview of the negotiation model and of two negotiation algorithms, namely ONAC and Boulware, with a static RV. Section 3 presents a dynamic RV version of the negotiation model and of the ONAC and Boulware algorithms. In addition, it introduces the PredictRV strategy and presents the three methods used to make predictions over the dynamic RV, namely Counter, Bayesian Learning with Regression Analysis and LSTM based prediction. Section 4 showcases the working of the three prediction methods via an example when faced with a dynamic RV. In Sect. 5, we present a variety of experiments on two different domains to evaluate the performance of the PredictRV strategy. Section 6 presents the conclusions of the paper.

The negotiation model we use follows the alternating offers protocol [21] for a bilateral negotiation: consider two agents A and B with utility functions U_A(z) and U_B(z) ∈ [0, 1], where z belongs to the set of all possible negotiation outcomes for a domain D. The RVs for the agents are rv_A and rv_B ∈ [0, 1]. The agents will propose offers with utility higher than their own RVs.

The ONAC algorithm [1] aims to construct optimal concession strategies against specific classes of acceptance strategies [2]. It applies sequential decision techniques to find analytical solutions that optimize the bidder's expected utility, given certain strategy sets of the opponent. The ONAC solution was found to significantly outperform state-of-the-art approaches in terms of obtained utility.
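To make the alternating-offers setup above concrete, here is a minimal Python sketch of a bilateral negotiation with static RVs. The agent class, the linear concession strategies and the acceptance rule are illustrative assumptions for this sketch, not the simulator or the strategies used in the paper.

```python
# Minimal sketch of the alternating-offers protocol with static RVs.
# Utility functions, concession strategies and the acceptance rule are
# illustrative assumptions, not the paper's implementation.
from typing import Callable, Optional

class Agent:
    def __init__(self, name: str, utility: Callable[[float], float], rv: float):
        self.name, self.utility, self.rv = name, utility, rv

    def acceptable(self, bid: float) -> bool:
        # An agent never accepts a bid whose utility falls below its RV.
        return self.utility(bid) >= self.rv

def negotiate(a: Agent, b: Agent, propose_a, propose_b, deadline: int) -> Optional[float]:
    """Alternate offers until one side accepts or the deadline passes."""
    proposer, responder = (a, propose_a), (b, propose_b)
    for rnd in range(deadline):
        bid = proposer[1](rnd)               # proposer's concession strategy
        if responder[0].acceptable(bid):     # responder checks against its RV
            return bid
        proposer, responder = responder, proposer
    return None                              # disagreement at the deadline

# Example: opposing preferences over a single issue z in [0, 1].
A = Agent("A", lambda z: z, rv=0.3)
B = Agent("B", lambda z: 1.0 - z, rv=0.2)
outcome = negotiate(A, B, propose_a=lambda r: 1.0 - 0.007 * r,
                    propose_b=lambda r: 0.007 * r, deadline=100)
print("agreement:", outcome)
```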
As shown in [1], the utility of the ONAC bid is computed by taking into account the probability of acceptance of the bid (x, the bid of agent A) by the opponent, where the agents have opposing preferences. Here U(x) ∈ [0, 1], U_{j+1} is the utility of the bid proposed by the agent at round (j + 1), U_j is the utility at round j, and x is a valid bid with utility greater than the RV. This recurrence formula (Eq. (1)) gives the utility of the bid at each round, where rv_A is the RV for agent A and N is the deadline.

The Boulware algorithm is a TDT [7, 9], which concedes considerably more as the negotiation deadline approaches. TDTs consist of a family of functions that represent an infinite number of possible tactics, one for each value of β. The formula for this family of functions gives the utility to bid at the j-th round, where β should be in the range (0, 1) for Boulware.

The negotiation model remains the same as for the static RV case with the following difference: since agent A's RV is dynamic, it is represented as rv_A(t) (and rv_B(t) for B, for generality). To model the dynamic RV, we assume that the value of the RV is drawn from an unknown probability distribution, and in each round agent A receives a signal rv_A(t) drawn from that distribution. PredictRV attempts to predict this probability distribution (p.d.) and incorporate it into a negotiation algorithm (ONAC here). We assume that there is no noise in the signal rv_A(t), hence it corresponds to the actual RV at time step t. The PredictRV recurrence formula is given in Eq. (3): for a dynamic RV, the value to bid will no longer be determined using Eq. (1). Instead, we first need to assign the new RV to U_N and then re-compute U_j as shown in Eq. (3). The Boulware algorithm present in the negotiation literature assumes a static RV. For a Boulware variant that works with a dynamic RV (Boulware-D), utilities can be generated using an analogous function that takes the current RV as input.

Consider a toy example, where the RV can be either 0.1 or 0.9 and it changes randomly every 2 rounds for a total of 100 rounds. Figure 1 shows the concession curves obtained by using the ONAC and Boulware algorithms. The x-axis of each figure shows the number of rounds from 0 to 100 while the y-axis shows the utility values ranging from 0 to 1. The utilities of the bids at each round are computed using Eq. (3). The figures show that the concession curves are not monotonic due to the dynamic nature of the RV, which results in to-and-fro concessions being made, where peaks correspond to an RV of 0.9 and troughs to an RV of 0.1.

Given a negotiation algorithm (like ONAC or Boulware), the first step is to generate hypotheses: we divide the range over which the RV can vary into n intervals I_i for i ∈ {1, 2, ..., n}. A suitable point x_i is selected as a representative value for each interval I_i. If the RV falls within an interval, it is classified as having the utility of the point that represents that interval. We then compute the negotiation-algorithm utilities for each representative point. At the start, all hypotheses are equally likely, hence each hypothesis is initialized with probability 1/n, i.e., a uniform distribution over hypotheses. As the negotiation progresses we may have a better prediction over the hypotheses based on the past RVs, hence the probability distribution would change. The second step of the PredictRV strategy is to update the weights of the hypotheses as each new round starts. How the weights are updated depends on the actual procedure we use, namely the Counter, Bayesian Learning or LSTM models presented below.
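For concreteness, the sketch below illustrates the hypothesis-generation step just described (intervals I_i, representative points x_i, uniform prior) together with a Boulware-style target utility that takes the current RV as input. The midpoint choice for x_i, the function and variable names, and the specific time-dependent formula (the standard polynomial TDT form is assumed here) are illustrative; the paper's own Eqs. (1)-(4) are not reproduced.

```python
import numpy as np

def make_hypotheses(rv_min: float, rv_max: float, n: int):
    """Step 1 of PredictRV: split [rv_min, rv_max] into n intervals I_i,
    pick a representative point x_i for each (midpoint assumed here),
    and start from a uniform prior over the hypotheses."""
    edges = np.linspace(rv_min, rv_max, n + 1)
    x = (edges[:-1] + edges[1:]) / 2.0          # representative points x_i
    prior = np.full(n, 1.0 / n)                 # uniform initial belief
    return edges, x, prior

def interval_of(rv: float, edges: np.ndarray) -> int:
    """Classify an observed RV into the interval that contains it."""
    return int(np.clip(np.searchsorted(edges, rv, side="right") - 1,
                       0, len(edges) - 2))

def boulware_target(j: int, N: int, rv: float, beta: float = 0.2) -> float:
    """Illustrative Boulware-D target utility at round j with deadline N.
    Standard time-dependent tactic (assumed form): concede little early,
    concede towards the current RV as j approaches N; beta in (0, 1)."""
    f = (j / N) ** (1.0 / beta)                 # time-pressure factor
    return rv + (1.0 - rv) * (1.0 - f)

# Example: two hypotheses over RV in [0, 1] (representatives 0.25 and 0.75).
edges, x, prior = make_hypotheses(0.0, 1.0, n=2)
print(x, prior, interval_of(0.9, edges), boulware_target(50, 100, rv=0.9))
```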
In the Counter based learning procedure, the count for each hypothesis is initialized as c_{x_i} = 0, where i ∈ {1, 2, ..., n}. At a new round j, we obtain a new RV. As step 2 of PredictRV, we use the new RV to increment the counter of the hypothesis that corresponds to it, and we re-compute the probability for each interval (Eq. (5)). As step 3 of PredictRV, using the probabilities computed over the different intervals, we compute the utility U_j to be bid by PredictRV (Eq. (6)).

In the BLRA procedure presented in [25], the learning agent i has a belief about the p.d. of its opponent's negotiation parameters (i.e., the deadline and RV). As shown in step 1 of PredictRV, we instead have a belief over the hypotheses of our own (dynamic) RV. By keeping track of the history of RV values obtained so far and comparing it with fitted estimates derived from a regression analysis, the agent can revise its belief over the hypotheses using a Bayesian updating rule and can correspondingly adapt its concession strategy. As the negotiation proceeds [25], the utility u_t for a TDT decreases according to a decision function (Eq. (7)), where T is the deadline and β is the concession parameter. We adopt this terminology to express the function in terms of agent A's own dynamic RV. We assume the RV to be 0 at the start of the negotiation and to vary according to Eq. (7), where u_T is the RV at the deadline and u_0 is the RV at the start. In every round, we receive an RV for that round. We compute the regression line (the fitted RVs) R̂V_{t_b} = {û_0, û_1, û_2, ..., û_{t_b}} based on the historical RVs.

Step 1: Generate the hypotheses and initialize their probabilities as mentioned in Sect. 3.5 (steps of the strategy), with x_i representing the utility of each hypothesis.
Step 2: Based on Eq. (8), we use a power regression function (Eq. (9)), where N is the deadline, to calculate the regression curve. Next, β is calculated using Eq. (10) (as proposed in [25]).
Step 3: Based on the calculated regression curve given by Eqs. (9) and (10), the fitted RVs at each round are R̂V_{t_b} = {û_0, û_1, û_2, ..., û_{t_b}} (where û_0 = u_0).
Step 4: We now calculate the non-linear correlation between the historical RVs RV_{t_b} and the fitted RVs R̂V_{t_b}. The coefficient of non-linear correlation γ is given by Eq. (11), which uses the averages of all the historical and of all the fitted RVs, respectively.
Step 5: The parameter γ (−1 ≤ γ ≤ 1) is used to evaluate the resemblance between the chosen RV (x_i) and the real RVs (u_t). To use γ as a probability when performing the belief update in Bayesian Learning, we normalize it to [0, 1] (γ_new in Eq. (11)).

Step 1: Bayesian Learning can be used if we have hypotheses about the prediction. The belief about the p.d. of these hypotheses can be revised through a posterior probability by observing the RV. Each hypothesis H_i represents a possible RV at the end of the negotiation. The prior p.d., denoted by P(H_i), i ∈ {1, 2, ..., n}, signifies the agent's belief about the hypothesis, i.e., how likely the hypothesis is to match the RV at the end of the negotiation.
Step 2: The agent can initialize the p.d. over the hypotheses based on some prior information if available; otherwise a uniform distribution P(H_i) = 1/n is assigned. During each round t_b of the negotiation, the probability of each hypothesis is computed using the Bayesian updating rule in Eq. (12).
Step 3: The observed outcome here is the history of RVs RV_{t_b} = {u_0, u_1, u_2, ..., u_{t_b}}.
As presented in [25], the agent will update the prior probability P(H_i) using the posterior probability P(H_i | RV_{t_b}); thus a more precise estimate is achieved using Eq. (12).
Step 4: As presented in [25], the conditional probability P(RV_{t_b} | H_i) is obtained by comparing the fitted points R̂V_{t_b} on the regression line based on each selected RV x_i with the historical RVs RV_{t_b}. The more correlated the fitted RVs are with the historical RVs, the higher P(RV_{t_b} | H_i) will be.
Step 5: The difference between the regression curve and the real RV sequence is indicated by the non-linear correlation coefficient γ_new. Thus, we can use the value of γ_new as the conditional probability P(RV_{t_b} | H_i) in Eq. (12). The learning approach will increase the probability of a hypothesis when the selected RV (x_i) is most correlated with the RV at the end of the negotiation. As mentioned in step 4 of PredictRV, using the probabilities over the different intervals, we compute the utility to bid at that round (Eq. (13)).

LSTM (Long Short-Term Memory) [13] is a popular recurrent neural network architecture for deep learning tasks and is useful for time-series prediction. The negotiation problem introduced here can be modeled as a time-series prediction task wherein the agent learns more information as the negotiation progresses. We therefore propose to use an LSTM based approach to predict the RV at the last time step n of the negotiation, using time-series forecasting. As shown in Fig. 2, the input to the LSTM at each time step t is RV(t) (i.e., the RV provided by the environment at t). Note that there is a single LSTM cell A to which the input is fed repeatedly (one value at every time step) along with the output of the previous time step. The output at t is the predicted value of the RV at the last time step n, denoted R̂V_t(n). The LSTM is trained using a mean squared error loss function and learns to predict better as the number of epochs increases. There are n hypotheses in our problem, whose probabilities are updated at every time step based on the predicted RV for the last time step, R̂V. This is similar to the Counter model: we identify the interval that R̂V falls into and increase the count of that hypothesis by 1 (Eq. (5)). Using the probabilities for the different hypotheses, we compute the utility to be bid by PredictRV (Eq. (6)).

The rest of the example is explained using the ONAC-D algorithm (ONAC-D is the ONAC strategy applied, without any changes, to the dynamic RV). Figure 4 shows the utility values generated by the Counter, BLRA and LSTM models, computed using Eqs. (6) and (13). The x-axes show the number of rounds from 0 to 100 while the y-axes show the utility values ranging from 0 to 1. Figure 3 shows the belief plots for the three models. A belief plot shows how the belief in a particular hypothesis changes as the rounds progress. The figure shows two plots corresponding to the two hypotheses that the RV is 0.1 (hypothesis 0.1) and 0.9 (hypothesis 0.9). The x-axes of both plots show the number of rounds from 0 to 100 and the y-axes show the probability of belief in the hypothesis that the plot represents; e.g., a y-axis value of 0.3 in the plot on the left implies that an algorithm believes the RV is 0.1 with probability 0.3, which implies that the other hypotheses are true with the rest of the probability (in this case, the only other hypothesis is hypothesis 0.9).
The belief plots show that: (a) For hypothesis 0.1, while Counter stays close to the middle (a probability of 0.5), BLRA and LSTM are clearer in their belief for this hypothesis (the former converges close to 0 while the latter converges close to a probability of 1, and both stay with these probabilities once converged), showing the inherent differences between the models. (b) Counter converges quickly to a belief of 0.5, since the RV alternates between the hypotheses every 2 steps and hence the counts are more or less balanced. (c) For BLRA, belief in hypothesis 0.1 converges close to 0, since not just the counts but also the times at which the RV changes come into play here. (e) For LSTM, belief in hypothesis 0.1 converges close to 1 faster than for the other models, although to the opposite belief from BLRA for this example. The outcome utilities are (0.5, 0.5) for (ONAC-D, Counter), (0.25, 0.75) for (ONAC-D, Bayesian) and (0.6, 0.4) for (ONAC-D, LSTM).

The parameters of our algorithm are the number of hypotheses, the number of rounds of negotiation N, and the update rate (frequency of change of the RV). N is fixed to 100 for all experiments. Experiments were performed on the Fire Disaster Response and Meeting Scheduling domains. In both domains, the agent is faced with a dynamic RV. For purposes of experimentation, we model the dynamic RV using a Markov chain model [we omit the specifics of our modeling due to space constraints]. The number of hypotheses varies across the domains. The update rate of the RV is varied among the values {2, 5, 10, 20, 50}. We run each experiment for 100 iterations keeping the parameters constant. In each plot, the x-axis shows the (hypothesis, update rate) pairs while the y-axis shows the respective metric.

1) Outcome Utility Metric: The outcome utility is measured in a negotiation between agents A and B, where A uses one of ONAC-D or Boulware-D and B uses the PredictRV strategy (Counter, BLRA or LSTM). We average the outcome over 100 iterations and compute the outcome utility for each update rate and hypothesis (the averaged utility is represented as OD for ONAC-D, C for Counter, B for BLRA and L for LSTM). We then compute the utility of PredictRV w.r.t. ONAC-D using Eq. (14) (represented in the graphs as Average Percentage Utility).

2) Prediction Metric: We allow each model to train until the end of the negotiation (N rounds). At the last round N, we have an RV predicted by each of the models, i.e., R̂V for round N + 1 (which is not part of the negotiation). For each of the models, we then compute the difference between R̂V and the actual RV at round N + 1, which is used to capture the quality of prediction. This value is averaged over 100 iterations, where a lower average difference implies a better prediction.

Consider a forest fire where the fire can spread quickly in any of the 4 directions, i.e., North, South, East or West. Assume that the forest is modeled as a grid of size n_1 × n_1 [11]. Fire fighting units (local units) are dispatched to many locations to fight the fire. The commander in charge has a global picture of the fire and wants to reduce the resources given to each local unit. The local unit leader (modeled as an agent) would like to negotiate with the commander to obtain a higher number of resources (than the minimum needed to just put out the fire) so as to stop the fire quickly at the local point. Given that the direction of the fire changes across time steps, the RV is dynamic, i.e., it changes with time. We operationalize the experimental parameters as follows: a negotiation is carried out with N (= 100) as the deadline. The parameters here are the number of hypotheses, the update rate of the RV and the grid size.
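Before turning to the results, the following sketch shows one way the two metrics above could be computed for the PredictRV models, together with the Counter-style update of Eqs. (5)-(6) in its simplest form. The function names, the expectation form of the bid in Eq. (6), and the simulator interfaces referenced in the comments (run_negotiation, env.rv, m.predict) are assumptions made for illustration.

```python
import numpy as np

def counter_update(counts: np.ndarray, obs_rv: float, edges: np.ndarray) -> np.ndarray:
    """Eq. (5), as described: increment the count of the interval that the
    observed RV falls into and renormalise the counts into probabilities."""
    i = int(np.clip(np.searchsorted(edges, obs_rv, side="right") - 1,
                    0, len(edges) - 2))
    counts[i] += 1
    return counts / counts.sum()

def bid_utility(probs: np.ndarray, utils_per_hypothesis: np.ndarray) -> float:
    """Eq. (6), assumed here to be the expectation over hypotheses of the
    negotiation-algorithm utility computed with each representative x_i."""
    return float(np.dot(probs, utils_per_hypothesis))

def prediction_error(predicted_rv_at_N1: float, actual_rv_at_N1: float) -> float:
    """Prediction metric for one iteration: difference between the model's
    predicted RV for round N+1 and the actual RV at round N+1."""
    return abs(predicted_rv_at_N1 - actual_rv_at_N1)

# Averaging over iterations (100 in the paper) would then look like the
# following, using hypothetical helpers for illustration only:
# avg_outcome = np.mean([run_negotiation(seed) for seed in range(100)])
# avg_pred_err = np.mean([prediction_error(m.predict(), env.rv(N + 1))
#                         for m, env in runs])
```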
Figure 5 shows two plots corresponding to ONAC and Boulware with a random start point for the fire and a grid size of 100. Both plots show that the values of the outcome utility metric (Sect. 5.2) for PredictRV are higher than for ONAC-D and Boulware-D respectively; e.g., plot (b) of Fig. 5 shows that the Average Percentage Utility for BLRA varies from 50% at the lowest to 95% at the highest. The plot also shows that the overall Average Percentage Utility across all the intervals and update rates is 71% for BLRA, 61% for Counter and −4.4% for LSTM. We performed paired significance tests on the outcome utilities: if the calculated p-value is less than 0.05, the mean difference (in outcome utility, as shown in Fig. 5) between the paired observations is statistically significant. Our testing showed that the mean outcome utilities for Bayesian vs. Counter do not differ significantly (both for ONAC and Boulware, i.e., 2 tests). All the other 10 tests showed that the differences in the averaged outcome utility are statistically significant.

For brevity, we present only the gist of the Meeting Scheduling domain here. We operationalized the parameters for this domain from the E-Elves [23] application. The parameters for the algorithm are the update rate of the RV, the delay intervals and the number of hypotheses. We consider 9 possible delay intervals, i.e., {5, 10, 15, 20, 25, 30, 35, 40, 45} min. A delay interval of 10 min means that a meeting supposed to start at 10 am is now rescheduled to start at 10:10 am. For this domain each hypothesis corresponds to a delay interval, hence the 9 hypotheses correspond to the 9 delay intervals. The overall value of the meeting is computed as below:

DelayCost = (delay^alpha) * 2, Value of the meeting = 200, Overall value = Value of the meeting − DelayCost (15)

where delay is the delay w.r.t. the scheduled starting time and alpha ∈ {1.0, 1.2, 1.4, 1.6}. The utility of a hypothesis is calculated by normalizing the reward obtained using Eq. (15). The prediction measurement for the meeting domain is shown in Fig. 6. Experiments measuring prediction for the meeting domain were performed; we skip the individual graphs due to space constraints and summarize the results in Fig. 7.

In Fig. 7, 1 signifies the best performing model and 2 signifies the second-best performing model for a given metric and domain, and x% denotes how much better the best model is relative to the second-best model. Formulation: a = metric value of the best model, b = metric value of the second-best model. For the outcome utility, relative performance = 100 * (a − b) / a (with a > b). For the prediction metric, relative performance = 100 * (b − a) / b (with b > a). Explanation: for each of the metrics, we measure the relative value of the best performing model w.r.t. the second-best performing model for each domain. For example, in the Fire Random domain for the ONAC algorithm, BLRA is 5% better than Counter on the outcome utility metric and LSTM is 33.91% better than Counter on the prediction metric.

We introduced the PredictRV strategy, which uses one of the Counter, BLRA or LSTM learning models to predict the dynamic RV and thereby negotiate better. Our results show that: a) For outcome utility, the BLRA model performs slightly better than Counter, although the difference is not statistically significant. b) For the prediction metric, LSTM is the best performing model while Counter performs next best. c) Outcome utility is the standard metric used to evaluate negotiations.
Given that both the BLRA and Counter methods perform well on this metric, they can be tested for the specific use case at hand and one of them picked based on the insights obtained. In summary, the key novelty of our work is that we enhance the ability of current negotiation algorithms to handle a dynamic RV. The problem can be made more general, where only an indicator function for the RV is available rather than the actual value at each update step as assumed here. Popular negotiation platforms such as Genius currently allow us to encode only a static RV; we believe this work takes a significant step towards dealing with the challenges of handling a dynamic RV.

References
[1] Optimal non-adaptive concession strategies with incomplete information
[2] Effective acceptance conditions in real-time automated negotiation
[3] Electric elves: agent technology for supporting human organizations
[4] A genetic agent-based negotiation system
[5] Learning on opponent's preferences to make effective multi-issue negotiation trade-offs
[6] Precedents in negotiated decisions: Korea-Australia free trade agreement negotiations
[7] Negotiation decision functions for autonomous agents
[8] Principles of Automated Negotiation
[9] Optimal negotiation of multiple issues in incomplete information settings
[10] An agenda-based framework for multi-issue negotiation
[11] Courier Corporation
[12] Opponent modelling in automated multi-issue negotiation using Bayesian learning
[13] Long short-term memory
[14] Automated negotiation: prospects, methods and challenges
[15] Strategic Negotiation in Multiagent Environments
[16] Automated negotiation in open and distributed environments
[17] Genius: an integrated environment for supporting the design of generic automated negotiators
[18] The Art and Science of Negotiation
[19] Predicting human decision-making: from prediction to action
[20] Rules of Encounter: Designing Conventions for Automated Negotiation Among Computers
[21] Perfect equilibrium in a bargaining model
[22] Issues in automated negotiation and electronic commerce: extending the contract net framework
[23] Towards adjustable autonomy for the real world
[24] Introduction to Probability and its Applications. Cengage Learning
[25] An adaptive bilateral negotiation model based on Bayesian learning
[26] Bayesian learning in negotiation