key: cord-0044135-1a8gbka2
authors: Shaabanzadeh, Seyedeh Soheila; Sánchez-González, Juan
title: On the Prediction of Future User Connections Based on Historical Records in Wireless Networks
date: 2020-05-04
journal: Artificial Intelligence Applications and Innovations
DOI: 10.1007/978-3-030-49190-1_8
sha: 5e277c0634d18973ff41b1772d49e5fd340f3570
doc_id: 44135
cord_uid: 1a8gbka2

Recent developments in data monitoring and analytics technologies in the context of wireless networks will boost the capacity to extract knowledge about the network and its users. On the one hand, the obtained knowledge can be useful for running more efficient network management tasks related to network reconfiguration and optimization. On the other hand, the extraction of knowledge related to user needs, user mobility patterns and user habits and interests can also be useful to provide a more personalized service to the clients. Focusing on user mobility, this paper presents a methodology that predicts the Access Point (AP) to which a user will be connected in the future in a Wi-Fi network. The prediction is based on historical data about the APs to which the user previously connected. Different approaches are proposed, according to the data that is used for prediction, in order to capture weekly, daily and hourly user activity-based behaviours. Two prediction algorithms are compared, based on Neural Networks (NN) and Random Forest (RF). The methodology has been evaluated in a large Wi-Fi network deployed in a University Campus.

The demand for new multimedia services (e.g. online multimedia applications, high-quality video, augmented/virtual reality) has increased dramatically in recent years. One solution to cope with the high bandwidth and strict Quality of Service (QoS) requirements associated with these new services consists of network densification through the deployment of Small Cells (SC) operating cellular technologies (e.g.
4G/5G), complemented with Wi-Fi hotspots using the unlicensed spectrum. In fact, Wi-Fi technology is a competitive option for serving multimedia demands due to its popularity among mobile users. In recent years, a dramatic increase in the amount of IEEE 802.11 (i.e. Wi-Fi) traffic has been observed. It is estimated that by 2021, 63% of the global cellular data traffic will be offloaded to Wi-Fi or small cell networks [1]. Globally, there will be nearly 549 million public Wi-Fi hotspots by 2022, up from 124 million in 2017, a fourfold increase.

For example, a reliable prediction of the future user location and of the length of stay connected to the different SCs/APs enables the use of Location Based Advertising (LBA) mechanisms. This allows advertisers to personalize their messages to people based on their location and interests [11].

Within this context, this paper proposes a methodology that predicts the future user connections to the different APs of a Wi-Fi network according to historical user records. The main contribution of the paper is a prediction methodology that is able to extract periodic user patterns at different time levels in order to capture weekly, daily and hourly user activity-based behaviours. Two prediction algorithms based on a supervised learning process are compared, one using a Neural Network (NN) and the other based on Random Forest (RF). The proposed methodologies are evaluated for a large Wi-Fi network deployed in a University Campus.

The remainder of the paper is organized as follows. Section 2 presents the proposed AP prediction methodology, while Sect. 3 describes the considered prediction tools. The results are presented in Sect. 4, while Sect. 5 summarizes the conclusions.

The proposed prediction methodology is shown in Fig. 1 and assumes a Wi-Fi network with monitoring capabilities for the collection of measurements reported by the users when connected to the different APs.
The Collection of Network Measurements process collects a list of metrics for each u-th user (u = 1, …, U) when connected to each AP (e.g. the instants of time at which the user begins and ends a connection to each AP, the average Signal to Noise Ratio (SNR), the average Received Signal Strength Indicator (RSSI), the number of bytes transmitted/received during the connection of the user to each AP, etc.). All this information is stored in a database. Then, for each user, the collected data is pre-processed so that the measurements collected during each d-th day (with d = 1, …, D) are grouped into M time periods of equal duration T. In particular, the pre-processing step generates a matrix A for each user, in which each term $a_{d,m}$ (with m = 1, …, M and d = 1, …, D) represents the identifier of the AP to which this user was connected during the m-th time period of the d-th day. If the user connects to more than one AP during the same m-th time period, the term $a_{d,m}$ is taken to be the AP with the longest connection duration.

For the prediction of the AP to which the user will be connected in a specific $m^*$-th time period of a specific $d^*$-th day in the future ($a_{d^*,m^*}$), the proposed methodology makes use of historical information about the APs to which the user was connected in the past and a prediction function $f(\cdot)$ that is obtained by means of supervised learning. For that purpose, the Selection of historical data process selects some specific terms of matrix A. Different approaches are presented below:

Prediction Based on Time-Period Patterns (PBTP): In this case, the prediction of $a_{d^*,m^*}$ is based on the APs to which the user was connected in the last N time periods.

It is worth noting that all these terms $a_{d,m}$ in matrix A correspond to categorical values (e.g. an AP identifier). For both training and prediction, these terms are converted into numerical attributes by means of the so-called dummy coding process [11].
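The pre-processing step that builds matrix A can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the record layout `(ap_id, start, end)` and the function name `build_day_row` are assumptions, since the paper does not give the database schema, and the values T = 15 min and M = 96 are those used later in the evaluation. Periods in which the user has no connection are marked as `None`, which is also an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta

T_MINUTES = 15                   # period duration T used in the evaluation
M = 24 * 60 // T_MINUTES         # periods per day, M = 96


def build_day_row(connections, day_start):
    """Return one row of matrix A: M AP identifiers (or None) for one day.

    `connections` is a hypothetical list of (ap_id, start, end) tuples
    for a single user; the real schema is not given in the paper.
    """
    period = timedelta(minutes=T_MINUTES)
    row = []
    for m in range(M):
        p_start = day_start + m * period
        p_end = p_start + period
        seconds_per_ap = defaultdict(float)
        for ap_id, start, end in connections:
            # Time spent on this AP inside the m-th period.
            overlap = (min(end, p_end) - max(start, p_start)).total_seconds()
            if overlap > 0:
                seconds_per_ap[ap_id] += overlap
        if seconds_per_ap:
            # Keep the AP with the longest connection duration in the period.
            row.append(max(seconds_per_ap, key=seconds_per_ap.get))
        else:
            row.append(None)     # assumption: no connection in this period
    return row


if __name__ == "__main__":
    day = datetime(2020, 1, 8)
    records = [
        ("AP_A", datetime(2020, 1, 8, 9, 0), datetime(2020, 1, 8, 9, 20)),
        ("AP_B", datetime(2020, 1, 8, 9, 20), datetime(2020, 1, 8, 10, 0)),
    ]
    row = build_day_row(records, day)
    print(row[36:40])
```

In the 9:15 to 9:30 period of this toy example the user spends 5 min on `AP_A` and 10 min on `AP_B`, so that period is assigned to `AP_B`. Stacking one such row per day gives the D × M matrix A for the user.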
A dummy variable is a binary variable coded as 0 or 1 to represent the absence or presence of some categorical attribute. Therefore, each of the N elements used for prediction, $a_k$ (k = 1, …, N), is converted into a set of G dummy variables $c_k = (c_{k,1}, \ldots, c_{k,g}, \ldots, c_{k,G})$, where G is the number of different APs in the set of N measurements, so that $c_{k,g} = 1$ if $a_k$ corresponds to the g-th AP and $c_{k,g} = 0$ otherwise. Then, the resulting $D = N \cdot G$ dummy variables are used for the prediction of $a_{d^*,m^*}$ according to (1). Before training, the same dummy coding is also applied to all the tuples of the training set.

In this paper, two supervised learning algorithms, based on Neural Networks and Random Forest, are compared. A brief description of them is provided below.

Neural Networks (NN): In this case, the prediction is done by means of a feedforward Neural Network that consists of an input layer, one or more hidden layers and an output layer [12]. Each layer is made up of processing units called neurons. The inputs are fed simultaneously into the units of the input layer. Then, these inputs are weighted and fed simultaneously to the first hidden layer. The outputs of the hidden layer units are input to the next hidden layer, and so on. A supervised learning technique called backpropagation is used for training. Backpropagation iteratively learns the weights of the Neural Network by comparing the network's prediction for each training tuple with the known target value.

Random Forest (RF): Ensemble methodologies are used to increase the overall accuracy by learning and combining a series of individual classifier models. Random Forest is a popular ensemble method. RF is based on building multiple decision trees, generated during the training phase, and merging them together in order to obtain a more accurate and stable prediction [3].
Unlike the single decision tree methodology, where each node of the tree is split by searching for the most important feature, Random Forest includes additional random components (i.e. for each node of each tree, the algorithm searches for the best feature among a random subset of features). Once all the trees are built, the result of the prediction is the most frequent prediction among all the trees of the forest.

The considered scenario consists of a large Wi-Fi network with 429 APs deployed in a University Campus with 33 buildings and four floors per building. The reported user measurements are collected by the Cisco Prime Infrastructure tool [13]. The users' measurements were collected during D = 84 consecutive days (i.e. W = 12 weeks). The prediction methodology was run for U = 967 users. According to the methodology described in Sect. 2, the matrix A is built for each user by determining the AP to which the user is connected in each of the M = 96 periods of T = 15 min for each of the D = 84 days. Based on this data, the proposed prediction methodology is run in order to predict the AP to which each user will be connected in all the M = 96 time periods of T = 15 min in the subsequent week. The obtained predictions are compared to the real APs to which the user connected. The prediction accuracy is calculated as the percentage of time periods that have been predicted correctly in the range between 6:00 h and 22:00 h for all the weekdays (from Monday to Friday) for all the users that connected to the Wi-Fi network at least once every day. The prediction methodology has been implemented by means of Rapidminer Studio [14]. The parameters of each supervised learning algorithm have been tuned to obtain the maximum prediction accuracy.
In particular, the Neural Network is configured with a learning rate of 0.05, a momentum of 0.9, 100 training cycles and 1 hidden layer of size 20, and the Random Forest is configured with 100 trees, the gain_ratio criterion and a maximal depth of 10.

In order to illustrate the performance of the proposed prediction methodology, let us first focus on the AP prediction for a specific user for all the time periods on a Wednesday. Assuming here the PBWP approach, the AP prediction at the m-th time period is based on the APs to which the user connected in the previous N = 6 Wednesdays at the same m-th time period of the day. The training set is built by using the last F = 12 weeks. For validation purposes, the predictions are compared to the real APs to which the user connected during this Wednesday. Fig. 2 presents this comparison when using the Neural Network algorithm. As shown, for this particular user, the methodology is able to correctly predict the AP in 60 out of 64 periods of 15 min (i.e. a prediction accuracy of 93.75%). In general, most of the transitions between APs are correctly predicted. In fact, an error of only one period of 15 min is observed in predicting the user's time of arrival, while a slightly higher error is observed in the prediction of the time of departure. During lunch time, the connection to AP XSFD4P202 is not correctly predicted, but the predicted AP was XSFD4P102, which is located on the floor immediately below in the same building. This indicates that, although the AP is not correctly predicted in this case, the methodology correctly predicted the region where the user was located.

In order to gain insight into the performance of the proposed methodology, the prediction process has been run for the whole set of users in all the time periods of 15 min during a whole week. Then, the predictions were compared to the AP to which each user connected at each time period.
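The accuracy metric used in these comparisons (the percentage of 15 min periods predicted correctly between 6:00 h and 22:00 h) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the 0-based period indexing are assumptions, and the sketch assumes that a period without connection is compared like any other label.

```python
def prediction_accuracy(predicted, actual, first_period=24, last_period=88):
    """Percentage of periods in [first_period, last_period) predicted correctly.

    With T = 15 min and 0-based indices, periods 24..87 cover the range
    from 6:00 h to 22:00 h, i.e. 64 periods per day.
    """
    window = range(first_period, last_period)
    hits = sum(1 for m in window if predicted[m] == actual[m])
    return 100.0 * hits / len(window)


if __name__ == "__main__":
    # Toy day: the user stays on one AP; 4 of the 64 evaluated periods
    # are mispredicted, as in the single-user example above.
    actual = ["XSFD4P202"] * 96
    predicted = list(actual)
    for m in (24, 50, 51, 87):
        predicted[m] = "XSFD4P102"
    print(prediction_accuracy(predicted, actual))  # 60 of 64 correct: 93.75
```

Averaging this value over the weekdays of the test week and over users gives per-approach figures of the kind compared in the tables.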
Table 1 presents the percentage of users for which each approach (PBWP, PBDP and PBTP) provides the best prediction accuracy. As shown, for both the NN and RF predictors, the PBTP approach provides better prediction accuracy than PBWP or PBDP for most of the users. This indicates that the AP to which a user was connected in the most recent time periods is the most useful information for prediction. However, it is worth noting that, for a relatively high percentage of users, the best accuracy is obtained with PBWP or PBDP (e.g. around 30% and 18% for NN and RF, respectively). This result indicates that the daily or weekly periodic behaviour of some users can be captured better by the PBDP or PBWP approaches, respectively.

Figure 3 presents the Cumulative Distribution Function of the prediction accuracy for the different prediction approaches with the Neural Network. For comparison purposes, in all the approaches, the prediction is based on the 6 previous observations. Therefore, in PBWP, PBDP and PBTP, the sliding window is set to N = 6. In the JBP approach, the sliding window is set to N = 2, i.e. the prediction is based on the APs to which the user connected in the last N = 2 time periods (i.e. $a_{d^*,m^*-2}$, $a_{d^*,m^*-1}$), the APs at the same time period $m^*$ for the last N = 2 days (i.e. $a_{d^*-2,m^*}$, $a_{d^*-1,m^*}$) and the APs at the same day of the week and time period of the day for the last N = 2 weeks (i.e. $a_{d^*-14,m^*}$, $a_{d^*-7,m^*}$). As shown in Fig. 3, the JBP approach provides better prediction accuracy than each of the other approaches separately. The reason is that JBP is able to jointly capture the hourly, daily and weekly user behaviour.

In Table 2, the average prediction accuracy and the average computation time required for running the methodology for each user are compared for the different approaches for both NN and RF.
The methodology was executed on a computer with a Core i5-3330 processor at 3.00 GHz and 8 GB of RAM running Microsoft Windows 10. It has been observed that the computation time is mainly due to the training process, while the time for the prediction step is negligible. As shown in Table 2, the PBTP and PBDP approaches exhibit a higher computation time per user since they make use of a larger number of training tuples, leading to longer training times. Table 2 also shows that the JBP approach provides the best prediction accuracy with a relatively low computation time. It is worth noting that the computation time required for training may impose some restrictions on the maximum number of users that can be included in the AP prediction or on the frequency with which the training is updated. The average computation times in Table 2 may be excessively high in a Wi-Fi network that may have several thousands of simultaneous user connections. As a consequence, running the proposed methodology for such a high number of users may require parallelising it across multiple processors, each of them running the prediction for a group of users.

This section presents the impact of the amount of historical data used for building the training set. In particular, Fig. 4a presents a comparison of the prediction accuracy for JBP with N = 2 for Neural Network and Random Forest as a function of the number of days with measurements considered for generating the training set, while Fig. 4b shows the average computation time required per user. As shown in Fig. 4a, the Neural Network predictor provides higher average prediction accuracy than Random Forest. Figure 4a also shows that using a larger amount of data for generating the training set provides higher prediction accuracy. However, processing a larger amount of data requires a higher computation time, especially for the Neural Network, as shown in Fig. 4b.
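To make the relation between the amount of historical data and the size of the JBP training set concrete, the sketch below builds dummy-coded training tuples from a matrix A. It is only a sketch under stated assumptions: the function names are hypothetical, indices are 0-based, one tuple is generated for every (d, m) whose full history lies inside A, the recent-period history does not wrap across days, and A contains no empty periods. The paper does not specify these details.

```python
def jbp_features(A, d, m, n=2):
    """Select the 3*n historical terms of A used by JBP for target (d, m)."""
    if d < 7 * n or m < n:
        raise ValueError("not enough history for this target")
    recent = [A[d][m - k] for k in range(n, 0, -1)]       # last n periods
    daily = [A[d - k][m] for k in range(n, 0, -1)]        # same period, last n days
    weekly = [A[d - 7 * k][m] for k in range(n, 0, -1)]   # same weekday, last n weeks
    return recent + daily + weekly


def dummy_code(values, ap_ids):
    """Dummy coding: one 0/1 indicator per (value, AP) pair."""
    return [1 if v == ap else 0 for v in values for ap in ap_ids]


def build_training_set(A, n=2):
    """Return (ap_ids, tuples), each tuple being (dummy-coded inputs, target AP)."""
    ap_ids = sorted({ap for row in A for ap in row})
    tuples = []
    for d in range(7 * n, len(A)):             # days with n full weeks of history
        for m in range(n, len(A[d])):          # periods with n previous periods
            x = dummy_code(jbp_features(A, d, m, n), ap_ids)
            tuples.append((x, A[d][m]))
    return ap_ids, tuples


if __name__ == "__main__":
    # Toy matrix A: 15 days x 96 periods with three synthetic AP identifiers.
    A = [[f"AP{(d + m) % 3}" for m in range(96)] for d in range(15)]
    ap_ids, tuples = build_training_set(A, n=2)
    print(len(tuples))        # 1 valid day x 94 valid periods = 94
    print(len(tuples[0][0]))  # 6 selected terms x 3 APs = 18 dummy variables
```

Under these assumptions, D = 84 days with M = 96 and N = 2 would give (84 − 14) × (96 − 2) = 6580 tuples per user, and fewer days of measurements shrink the training set proportionally, which is consistent with the trend of lower accuracy and lower computation time for smaller training sets.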
This section evaluates the impact of the size of the sliding window for the JBP approach when D = 84 days of measurements are considered for generating the training set. In particular, Table 3 presents the average prediction accuracy and the computation time per user for different values of the sliding window N. As shown in Table 3, a sliding window that is too small reduces the capability to detect weekly, daily and hourly periodic user behaviour, which leads to lower prediction accuracy. However, when the sliding window is too large, fewer tuples are available for generating the training set and, as a consequence, the training is poorer, which also reduces the prediction accuracy.

This paper has proposed a methodology for the prediction of the future APs to which users will be connected in a Wi-Fi network. The proposed methodology is based on supervised learning that makes use of historical user connectivity to build a prediction model. Different approaches have been defined depending on the historical data that is used. In general, the PBTP approach, in which the prediction is based on the most recent APs to which the user connected, provides the best prediction accuracy. However, PBDP or PBWP perform better for users that follow some daily or weekly periodic behaviour. As shown, a joint approach (JBP) provides better prediction accuracy than each of the other approaches separately, with a relatively low computation time per user. The impact of the training set size has been illustrated for the JBP approach in terms of prediction accuracy and computation time. As shown, considering a higher number of days with measurements for generating the training set provides higher prediction accuracy at the expense of a higher computation time, especially for Neural Networks. The impact of the size of the sliding window has also been evaluated.
A sliding window that is too small results in a worse capability to detect weekly, daily and hourly periodic user behaviour, while one that is too large leaves too few tuples for training. The results indicate that the prediction based on the Neural Network provides higher prediction accuracy than the prediction based on Random Forest at the expense of an increase in the computation time.

[1] Cisco Visual Networking Index: Forecast and Trends
[2] On big data analytics for greener and softer RAN
[3] Handbook of Statistical Analysis and Data Mining Applications
[4] Activity-based human mobility patterns inferred from mobile phone data: a case study of Singapore
[5] Identifying users' profiles from mobile calls habits
[6] On extracting user-centric knowledge for personalised quality of service in 5G Networks
[7] 791: Technical Specification Group Services and System Aspects. Study of Enablers for Network Automation for 5G
[8] A secure seamless handover authentication technique for wireless LAN
[9] Optimised access point selection with mobility prediction using hidden Markov model for wireless networks
[10] Predicting length of stay at WiFi hotspots
[11] CELoF: WiFi dwell time estimation in free environment
[12] Data Mining: Concepts and Techniques. Morgan Kaufman (MK)
[13] Cisco Prime Infrastructure 3.5 Administrator Guide. www.cisco.com
[14]

Acknowledgements. This work has been supported by the Spanish Research Council and FEDER funds under SONAR 5G grant (ref. TEC2017-82651-R).