A taxonomy of web prediction algorithms

Josep Domenech, Bernardo de la Ossa, Julio Sahuquillo, Jose A. Gil and Ana Pont
Universitat Politecnica de Valencia. Camí de Vera, s/n. 46022 Valencia (Spain)
jdomenech@upvnet.upv.es; berospe@doctor.upv.es; {jsahuqui,jagil,apont}@disca.upv.es

November 10, 2011

Abstract

Web prefetching techniques are an attractive solution to reduce the user-perceived latency. These techniques are driven by a prediction engine or algorithm that guesses the following actions of web users. A large number of prediction algorithms have been proposed since the first prefetching approach was published, although only over the last two or three years have they begun to be successfully implemented in commercial products. These algorithms can be implemented in any element of the web architecture and can use a wide variety of information as input, which affects their structure, data structures, computational resources and accuracy. Knowing the input information and understanding how it can be handled to make predictions can help to improve the design of current prediction engines and, consequently, of prefetching techniques. This paper analyzes fifty of the most relevant algorithms proposed over 15 years of prefetching research and proposes a taxonomy in which the algorithms are classified according to the input data they use. For each group, the main advantages and shortcomings are highlighted.

1 Introduction

Web prediction algorithms aim to discover patterns that allow them to foresee subsequent web user demands. Applications of these predictions include recommendation systems [11], e-commerce [1], content personalization [34], and the reduction of user-perceived latency by means of web prefetching [36] and cache prevalidation [17]. Since the goals of these applications are completely different, prediction algorithms have different requirements when used for each of them. For instance, prediction algorithms for web prefetching might work at object-level granularity because prefetching a single object might reduce the user-perceived latency. In contrast, recommendation systems focus on page-level granularity, since users expect the system to recommend a page, not a single object composing a page.

This paper focuses on prediction algorithms applied to web prefetching. Web prefetching is a technique aimed at reducing user-perceived latency by processing (usually downloading) user requests before users actually issue them. For example, if a user is likely to visit page B after visiting page A, the browser can download and cache page B just after requesting page A, i.e., while the user is reading page A. In this case, when the user actually requests page B, it is displayed without network latency since it is already in the cache. Accurate predictions are required because, on a misprediction, prefetched objects waste client, server and/or network resources, which could degrade overall system performance. For this reason, the prediction algorithm is the core of web prefetching.
A wide range of prediction algorithms for web prefetching has been proposed in the literature over the last 15 years. Although early proposals focused on exploiting object popularity to determine the objects that are more likely to be requested in the near future, the most explored option is to predict subsequent requests from past sequences of requests. In recent years, however, other data sources beyond the sequence of accesses, such as the site link structure, the HTTP headers or the page contents, have been incorporated into prediction algorithms.

Understanding and identifying the advantages and shortcomings of prediction algorithms is a hard task even for an experienced researcher in this field because of the large number of proposals that work according to different rules. As a result, selecting the best approach, or developing a new one for a given environment, is an arduous task.

Few attempts at classifying prediction algorithms have been published, and they mainly focus on how sequences of accesses are analyzed. Rabinovich and Spatscheck [41] classify the different approaches depending on how predictions are made. They differentiate three categories of Markov models depending on how many previous accesses are used. Another category includes algorithms based on the popularity that web objects exhibit, and a further one is dedicated to algorithms exploiting the structure of the web. The most recent classification, presented by Davison [15], divides prediction algorithms into five categories according to how the sequence of accesses is analyzed. Although it mentions that more objects can be predicted if extra information sources are used (e.g., web page content), this possibility is not analyzed in depth.

Since Davison's classification was published, the complexity of the prediction algorithms in the literature has noticeably increased. As a consequence, many of these proposals do not fit well in any of the previously identified categories. Therefore, a new effort is required to organize and clarify them in order to provide a sound understanding of their advantages, progress and real possibilities of use.

In this paper, we propose a prediction algorithm taxonomy in which the algorithms are classified by the type of information they use to make predictions. The algorithms are organized into four main categories depending on whether they take as input the object characteristics, the sequence of requests, the HTTP headers and/or the content of web pages. This classification is proposed because the input of a predictor determines what kind of accesses can be predicted and how complex the analysis is. For instance, a recently added link can be predicted by an algorithm that uses the sequence of requests as its only input just when at least one user has already followed that link.

The main contribution of this paper is a comprehensive review of web prediction algorithms grouped by a novel taxonomy, thus easing the understanding of how predictors work and of the implications of selecting a given prediction algorithm when designing a new prefetching system. In other words, this paper is aimed at examining the pros and cons of each group of algorithms and at identifying current and future trends in web prediction engines.

The remainder of the paper is organized as follows. Section 2 describes the elements that compose the web prefetching architecture, highlighting the function of the prediction algorithm. Section 3 classifies web prediction algorithms according to the proposed taxonomy.
Finally, Section 4 draws some concluding remarks.

2 Web prefetching architecture

Web prefetching systems are implemented by extending the generic web architecture (with clients, servers and proxies) with two extra elements: the prediction engine and the prefetching engine. The function of the prediction engine is to guess future client actions, e.g., requests. It is implemented as a prediction algorithm whose output is a hint list. In most proposals, this hint list contains references to the predicted objects, although a few approaches suggest other types of actions [12]. These predictions (or hints) are usually made based on previous experience about user accesses and preferences. The hints are provided to a prefetching engine, which is aimed at preprocessing (e.g., downloading) the objects predicted by the prediction engine. By preprocessing the requests in advance, the user-perceived latency is reduced when these objects are eventually requested by users.

The amount and scope of the information that can be used by the prediction engine depend on the element of the web architecture (the client, the proxy or the server) at which the engine is located. When the prediction engine is located at the client, the algorithm can only use access patterns, contents seen and preferences coming from one single user; consequently, its predictions are only useful for that user. If the prediction engine is located at a proxy server, it can take advantage of the multi-user and multi-server information gathered at this element to perform the predictions. When the predictor is located at the server side (the origin server or a replica in the context of Content Distribution Networks, CDNs), it makes predictions based on multi-user accesses and preferences for the same web site. This option has been the most explored in the research literature because of the accuracy of the predictions made and its potential use in real scenarios. Finally, in more complex proposals, the predictions can be performed by several elements in collaboration, which benefits the accuracy and coverage of the predictions.

The prefetching engine is independent of the predictor and can be located in any element of the web architecture that receives the hint list. However, the common trend is to locate it at the client side, as is currently done in Mozilla-based browsers. The prefetched objects are stored in a cache until they are demanded or evicted. Therefore, only those objects that can be stored at the web client should be predicted or prefetched. This is not a drawback because, although a web page can be the result of a dynamic response, many of the objects composing that page are usually static and consequently cacheable. Furthermore, a dynamically generated response can be cached if its headers properly mark the response as cacheable. Since all cacheable objects can be prefetched (precached), prefetching techniques cover, among other technologies, dynamic web server and application programming, and browser application programming like AJAX, Java, Flash, and other Rich Internet Applications.

Figure 1: Web prediction algorithm taxonomy

The performance of prediction algorithms is usually measured by means of metrics that quantify both the efficiency and the efficacy of a proposal. Although some proposals use specific indexes, the general trend when evaluating prediction algorithms is to trade off recall and precision as the main performance metrics. Recall is defined as the ratio of requested objects that were previously predicted. As recall quantifies the weight of the predicted objects over the number of objects requested by the user, it gives an idea of the usefulness or coverage of the predictions. Precision is defined as the ratio of good predictions to the total number of predictions, so it is related to the amount of wasted resources.
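To make these two metrics concrete, the following minimal Python sketch (the function and the example values are ours, purely for illustration) computes both for one prefetching session:

```python
def precision_recall(predicted, requested):
    """Compute precision and recall for one prefetching session.

    predicted: set of URIs hinted by the prediction engine.
    requested: set of URIs actually demanded by the user.
    """
    good = predicted & requested  # predictions that turned out to be useful
    precision = len(good) / len(predicted) if predicted else 0.0
    recall = len(good) / len(requested) if requested else 0.0
    return precision, recall

# Example: 3 hints, 2 of them later requested, out of 4 user requests.
p, r = precision_recall({"a.html", "b.png", "c.css"},
                        {"a.html", "b.png", "d.js", "e.html"})
print(p, r)  # 0.666..., 0.5
```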
3 Prediction engine taxonomy

This section presents the proposed taxonomy, which breaks down the prediction algorithms into four categories according to the information used to make the predictions.

Earlier proposals simply consider isolated object popularity, while others analyze the sequence of user requests to discover user navigation patterns. To discover the structure of web pages and the relations between requests, more recent algorithms take into account the HTTP headers or examine page contents. Content-based algorithms may involve the analysis of the list of hyperlinks, the text surrounding hyperlinks, or the labels with meta-information included by the web designer.

This taxonomy, represented in Figure 1, covers prediction algorithms ranging from the earliest works in 1995 to the most recent ones found in 2010. The algorithms were designed considering the characteristics of their contemporary web sites and Internet technologies. In this sense, this survey represents a review that covers more than 15 years of research following the evolution of the Web, from its genesis, when static pages had few embedded objects and hyperlinks, to the current web, with dynamic and personalized pages.

3.1 Object characteristics

Most of the algorithms falling in this category were proposed early in the life of the Web. In those years, more than a decade ago, the Web structure was much simpler than it is nowadays. As a consequence, and in the absence of other approaches, the most intuitive and simplest algorithms were proposed.

These proposals focus on analyzing object characteristics, like temporal locality and frequency of access (i.e., popularity), which were the first aspects studied in the Web. Algorithms falling in this category usually follow the long-term approach, in which clients are subscribed to proxies that store specific objects and are in charge of keeping updated copies of them. The opposite is referred to as the short-term approach, which aims to prefetch those objects that are likely to be referenced in the near future, given the user's recent accesses.

The work by Bestavros [5] provides an overview of the performance benefits of propagating information from servers to proxies that are closer to their clients. No algorithm is described, only some ideas on which such an algorithm should be based. The suggested algorithm takes as input the rate of transferred bytes between clients and proxies as well as the cumulative distribution of accesses to the objects. This idea was developed further by the same author in [6], which explores the performance gains of propagating information from servers to proxies; that is, the information is replicated at proxies that are closer to the final users. The level of propagation is based on the popularity of the objects and on the expected reduction in traffic. This approach exploits the temporal and geographical locality of references, which, unlike the previous proposals, requires a large community (e.g., thousands) of users.
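A minimal sketch of the popularity-driven selection that characterizes this category may clarify the idea; the counter structure, the threshold and the names below are our own illustration, not taken from any specific proposal:

```python
from collections import Counter

class PopularityPredictor:
    """Illustrative long-term predictor: select for prefetching (or for
    server-initiated push) the objects whose popularity exceeds a threshold."""

    def __init__(self, threshold=0.01):
        self.counts = Counter()  # accesses observed per object
        self.total = 0           # total accesses observed
        self.threshold = threshold

    def record(self, uri):
        self.counts[uri] += 1
        self.total += 1

    def candidates(self):
        # Objects whose relative access frequency exceeds the threshold.
        return [uri for uri, c in self.counts.items()
                if c / self.total > self.threshold]
```

Proposals such as [26, 30] refine this scheme by thresholding other object characteristics (lifetime, access rate or recency) instead of raw popularity.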
The Top-10 approach proposed by Markatos and Chronaki [33], also published in the last century, is one of the first attempts to exploit the popularity of web documents for prefetching purposes. Additional information obtained from the client side is also used to adapt prefetching to different client profiles. In this approach, the web server publishes a list of the most popular objects, and this information is complemented with client access preferences to predict future user accesses. The approach has been included in this category because the main idea of the proposal is the use of object popularity to predict future accesses, although client profile information is also used to control and adapt prefetching.

In [26], Jiang et al. propose an algorithm based on prefetching objects that exhibit a given characteristic. Three main characteristics are explored and quantified: object lifetime, probability of being accessed, and user request arrival rate. The algorithm prefetches those objects whose estimated value for a given characteristic exceeds a specific threshold. The proposal involves clients and either proxies or servers. The server collects information characterizing the objects and makes it available to the prefetching engine.

Wu et al. [47] propose an algorithm aimed at improving a given performance metric, e.g., hit ratio, bandwidth consumption, or a ratio of both (H/B). Depending on the target metric, the algorithm prefetches objects showing some particular characteristic (e.g., popularity, longer update intervals, or a mix of both). The best performance is achieved by the variant that maximizes the H/B metric. The approach assumes that web servers push fresh copies into web caches (assuming unlimited cache size) or proxy servers whenever such copies are updated.

In [46], Venkataramani et al. propose a long-term approach in which proxy caches are subscribed to popular objects that are updated with low frequency. This approach is aimed at the last-level caches of a hierarchical CDN structure, while typical prefetching (i.e., the short-term approach) should remain in the lower-level caches. In [30], Lau and Ng propose a prefetching algorithm that analyzes the browser's cache according to the long-term approach. The algorithm predicts and prefetches web documents as well as the documents they link to. Two main characteristics are analyzed for each object: its access rate and the last time it was accessed. As in [26], those objects whose characteristic exceeds a given threshold are prefetched. The algorithm uses detection theory to determine the threshold value. In addition, users can take an active part in the prefetching engine through a graphical user interface where they can modify any computed prefetching schedule.

The work by Rangarajan et al. [42] can be characterized as a short-term approach. It presents a technique that dynamically groups users based on their frequency of access to each object. In this way, the server structure can be reorganized according to the needs of each group of users. To avoid overloading the network, the proposal predicts in a cluster-based fashion instead of predicting for each single user; that is, whenever a host connects to the server or a proxy, the prefetching strategy returns the URIs predicted for the cluster to which the host belongs.
Unlike the previous proposals, in the work by Duchamp [21] servers manage statistics not only about the objects they store but also about objects in other servers. All this information is summarized and a probability of being used is estimated for each object. The prediction algorithm takes information from both the server and client sides. Once the prediction is received, the client side can decide whether or not to prefetch the hinted hyperlinks according to some local parameters.

Algorithms in this category made sense when they were proposed (i.e., early in the life of the Web), since the Web was simple and static from the point of view of both object structure and object lifetime. Nowadays, this approach fits better with CDNs, which control both the origin and replica contents.

3.2 Sequence of requests

Most of the web prediction algorithms proposed in the literature rely on the analysis of the paths followed by each client, that is, on the sequence of requests. By comparing the user's recent accesses with the paths previously recorded, these algorithms are able to predict future accesses.

Algorithms falling into this category need to distinguish which client is issuing each request in order to keep track of the right sequence of accesses. Although solving this issue is trivial when the prediction engine is implemented in the web browser, it becomes more complex when implemented in the proxy (keeping track of the client's IP address may be enough) or in the server (frequently requiring the use of cookies).

The element of the web architecture (i.e., client, proxy or server) in which the prediction engine is implemented is also a performance-limiting factor for prediction engines of this type [19, 29]. This is because predictors falling in this category cannot predict accesses to resources that have not been requested before at the element where the predictor is placed (first-seen resources). As expected, the ratio of first-seen resources is low when predictors are located at the server but high when located at the client [19]. However, this limitation can be overcome if additional sources of information are provided to the prediction engine. Therefore, algorithms in this category have been divided into pure and hybrid, depending on whether the predictor considers only the sequence of requests or also takes into account additional information to refine its predictions.

3.2.1 Pure predictors

There is a large number of proposals specifically addressed to predicting subsequent accesses from the sequence of requests that a web server receives. Variations of Markov models are often used to predict and hint resources that another element of the web architecture (i.e., the proxy or the clients) will prefetch [36, 13, 4, 38]. The dependency graph (DG) proposed by Padmanabhan and Mogul [36] is usually taken as the baseline for performance comparisons. The graph has a node for every object that has ever been accessed. There is an arc from node A to node B if and only if at some point in time a client accessed B within w accesses after A, where w is the lookahead window size. The weight of the arc is the ratio of the number of accesses to B within a window after A to the number of accesses to A itself.
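Since DG is the usual baseline, a compact sketch may help to fix ideas. The following Python code is our own reading of the description above; the confidence cut-off and the data structures are implementation choices, not prescribed by [36]:

```python
from collections import defaultdict, deque

class DGPredictor:
    """Sketch of the dependency graph (DG) predictor of Padmanabhan and
    Mogul: arc A -> B exists when some client accessed B within the last
    w accesses after A; its weight is count(A -> B) / count(A)."""

    def __init__(self, w=2, threshold=0.3):
        self.threshold = threshold
        self.node_count = defaultdict(int)  # accesses per object (node)
        self.arc_count = defaultdict(int)   # (A, B) -> window co-occurrences
        self.window = defaultdict(lambda: deque(maxlen=w))  # last w accesses per client

    def record(self, client, uri):
        for prev in self.window[client]:    # uri seen within w accesses after prev
            if prev != uri:
                self.arc_count[(prev, uri)] += 1
        self.node_count[uri] += 1
        self.window[client].append(uri)

    def predict(self, uri):
        # Hint every successor whose arc weight exceeds the threshold.
        return [b for (a, b), c in self.arc_count.items()
                if a == uri and c / self.node_count[uri] > self.threshold]
```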
Some other algorithms [43, 31] use the predictions to preprocess resources at the server side. Schechter et al. [43] use session paths to make predictions for generating some dynamic content in advance. In the context of a distributed web server cluster, the proposal of Lee et al. [31] combines a prediction algorithm with a workload distribution algorithm to prefetch objects from disk in each web server.

Kim et al. [28] propose a predictor that takes into account the sequence of requests from the client perspective. This sequence is used to predict the link in the current page that the user is most likely to follow. The main advantage of analyzing the sequence of requests at the client side is that no accesses are hidden by the effects of caching. However, since the predictor considers single-user information, it can only predict previously followed links, not accesses to new content.

Focusing on the proxy perspective, Fan et al. [22] employ the well-known prediction by partial matching (PPM) algorithm in such a way that it aggregates the data of all users to make predictions for each single user in a client-proxy context. The joint benefits of caching and prefetching are also analyzed by Teng et al. [45] in a proxy-server context. Their prediction algorithm is based on rules, each with a confidence level, in which the next requested object depends on the sequence of objects recently requested. In contrast, in the proposal of Huang et al. [25], the prediction engine first tries to make a prediction from the same user's previous accesses; only when this is not possible does the algorithm use the aggregate information of other users.

Since the number of accesses recorded is high, an important issue when the prediction engine is located at the proxy is reducing resource consumption. To deal with this concern, Yang et al. [49] propose a data structure for recording web sessions on which any data mining algorithm can be applied. The proposed algorithm works in two steps: first, user sessions are clustered; second, transition probabilities between pages in the same cluster are computed. In a similar way, the algorithm proposed by Pallis et al. [37] identifies clusters of pages whose accesses are correlated as the first step in making predictions.

Other approaches take into account the sequence of requests from several points of view in order to complement the sequences perceived at one element. In this context, requests that hit in a proxy cache are hidden from the origin server. This fact distorts the sequence of requests seen at the server and prevents the server from making a prediction for those requests. Chen et al. [9] deal with this problem by proposing an algorithm that coordinates servers and proxies to make predictions. Another coordinated approach is taken by Pons [39] in the context of a web application, but considering coordination between proxies and clients. In this proposal, the proxy builds a generic prediction model that each client receives and adapts according to its own patterns.

Some other prediction algorithms use the sequence of accesses as their only input, but are not tied to a specific element of the architecture. To improve PPM prediction accuracy, Ban et al. [3] propose including extra data in the model to describe the prediction capability of each node. Chen and Zhang [8] propose a memory-efficient version of a PPM algorithm whose prediction tree branches have different heights depending on the popularity of the root object. A similar algorithm, using Markov models of different orders, is proposed by Dongshan and Junyi [20] to deal with the scalability of the prediction engine.
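Most of these pure predictors share the same core: estimate the probability of the next request from the k most recent ones. The following order-k Markov sketch is our own minimal illustration (PPM-style proposals combine several such models of different orders):

```python
from collections import defaultdict, Counter

class MarkovPredictor:
    """Illustrative order-k Markov model over request sequences."""

    def __init__(self, k=2):
        self.k = k
        self.successors = defaultdict(Counter)  # context tuple -> next-page counts

    def train(self, session):
        # session: ordered list of page URIs requested by one user.
        for i in range(len(session) - self.k):
            context = tuple(session[i:i + self.k])
            self.successors[context][session[i + self.k]] += 1

    def predict(self, recent, n=1):
        # The n most likely next pages given the k most recent requests.
        context = tuple(recent[-self.k:])
        return [page for page, _ in self.successors[context].most_common(n)]
```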
The model presented by Gunduz and Özsu [24] compares session similarities to make predictions. Sessions are modeled taking into account the sequence of user requests and the time spent on each page.

The work by Bonino et al. [7] describes an initial set of predictors (represented as finite-state machines) that evolve according to genetic operators. One of these predictors is selected every time a request is received in order to make a prediction. The graph representing user sessions differs from those usually employed in Markov models, since requests are represented by edges, and nodes represent predictions instead of objects.

3.2.2 Hybrid predictors

Predictors falling into this subcategory complement the sequence of requests with another source of information, e.g., the HTTP headers or the page contents. The main reasons for taking extra sources into account are usually either to increase the prediction accuracy or to control the computational complexity of the algorithms.

Three of the prediction algorithms analyzed by Zukerman et al. [50] are based on Markov models that employ the sequence of requests seen at the server as the main source. One of them, the Linked Space-Time Markov model, combines the sequence of accesses with the referrer information found in the HTTP headers. In a later work [2], the authors combine all the models to build a single predictor.

The Double Dependency Graph (DDG) prediction algorithm (see Section 3.3) proposed by Domenech et al. [18] employs the sequence of user accesses to build a request dependency graph similar to that of the DG algorithm proposed in [36]. This predictor also requires analyzing response headers to distinguish container from embedded objects.

The prediction engine proposed by Georgakis and Li in [23] (described in Section 3.4) also takes as input the sequence of requests received by the server. However, it is more interested in the sequence of content read than in the path followed by the user.

Nanopoulos et al. [35] propose an algorithm that builds association rules from the sequence of accesses, allowing only those page transitions found in the link structure of the web site. In a similar way, the proposal by Mabroukeh and Ezeife [32] builds a Markov model to make the predictions. In this case, however, the prediction tree is pruned by using the concept of semantic distance. Computing this distance requires a domain ontology, which is provided during the design of the web site.

3.3 HTTP headers

This category includes those prediction algorithms that use the HTTP headers either as their only input or in combination with other inputs. Web clients issue HTTP requests that contain, besides the URI, several headers providing information about request details. One of the most used request headers is "Referer". Although this request header is optional in the standard protocol, the most popular web browsers provide it in their requests by default. A prediction algorithm can discover first-order dependencies by observing only (requested object URI, referrer URI) pairs.

Together with the requested object, the web server returns some HTTP response headers. By looking at these headers, a prediction algorithm can take some features of the requested objects as input. Response headers used in prediction algorithms include "Content-Type" and "Content-Length".
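As a concrete illustration of this idea, first-order dependencies can be mined from (requested URI, Referer) pairs alone, with no session reconstruction. The sketch below, with names of our own choosing, counts transitions and hints the most frequent successors:

```python
from collections import defaultdict, Counter

link_counts = defaultdict(Counter)  # referrer URI -> counts of requested URIs

def record(request_uri, referer_uri):
    # Each (requested URI, Referer) pair reveals a first-order dependency.
    if referer_uri:
        link_counts[referer_uri][request_uri] += 1

def predict(current_page, n=3):
    # The n objects most often requested with current_page as their Referer.
    return [uri for uri, _ in link_counts[current_page].most_common(n)]
```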
Zukerman et al. [50] compare several Markov-based prediction models that take into account different factors: the sequence of document requests, the web site structure inferred by using the "Referer" request header, and a combination of both. Their results show that the model combining all these factors in the Markov model yields the best predictive power, in terms of time and space consumption. The same authors propose, in a later work [2], the max-Hybrid prediction algorithm, which uses as input the sequence of accesses with timestamps, the "Referer" request header, and the "Content-Length" response header.

The DDG algorithm [18], classified and introduced in the previous section, employs the "Content-Type" response header to distinguish between container and embedded objects. Container objects are those that the user demands explicitly, while embedded objects are those contained in the page and requested automatically by the web browser. This distinction allows DDG to differentiate dependencies between objects of the same page from dependencies between objects of different pages. Thus, DDG represents the structure of the web site more faithfully than DG.

The Referrer Graph (RG) prediction algorithm proposed by de la Ossa et al. [16] is based on DDG. Both algorithms use the "Content-Type" response header to classify objects as primary (container) or secondary (embedded). However, unlike DDG, which uses the temporal sequence of accesses, RG uses the URI and the "Referer" request header provided in each individual object request to build the arcs of the graph. In this way, RG builds arcs only between objects that are really linked by hyper-references in the web site, and not between objects that merely happen to be requested sequentially by a web browser. This allows RG to represent the structure of a web site more faithfully than DDG.

HTTP headers can be considered accurate information, since they are provided explicitly by the web client and server. In other words, there is no derived guessing that might be erroneously interpreted, as happens in the sequence-of-requests category. As it is unnecessary to access the full object content, these algorithms allow shorter processing times and are less intrusive than those in the content category. Moreover, the headers are provided continually, which allows the algorithms to capture user accesses as they happen.

3.4 Content

Prediction algorithms that use information extracted from the content of web pages as their main input are included in this category. These prediction engines use the current web page to make predictions of future accesses, usually without requiring history information. The algorithms in this category are specially designed to deal with objects that are newly created or have never been visited before, and with dynamic sites where a URL may have time-dependent contents. Despite the wide variety of proposals falling into this category, we have identified three trends in the use of page content as the main input information.

3.4.1 Using hyperlinks

The first approach proposed simply uses the references to objects found when parsing the current web page, without any further analysis. Chinen and Yamaguchi [10] designed and implemented a prefetching proxy server, called the WWW Collector (Wcol), based on this idea. This proxy prefetches and stores referenced pages, and also shares the prefetched objects among its clients.

Cohen and Kaplan describe in [12] three simple prefetching techniques (pre-resolving host names, pre-connecting and pre-warming) that can be used with non-prefetchable URLs. Predictions are performed by extracting URLs from the anchor links of previously visited pages. The study considers two classes of requests: the first class includes pages linked from the results page of a search engine or web portal, whereas the second class contains all requests that were not preceded by a request to the same server in the last 60 seconds.
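The parsing step that these proposals share is straightforward. As a minimal sketch (our own, using only the Python standard library), the candidate set of a hyperlink-based predictor is simply the set of anchors in the current page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor in a page: the candidate set
    used by hyperlink-based predictors such as Wcol."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<html><body><a href="news.html">News</a> '
               '<a href="img/logo.png">Logo</a></body></html>')
print(extractor.links)  # ['news.html', 'img/logo.png']
```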
More recently, in the context of CDNs, Sidiropoulos et al. [44] identify clusters of "correlated" web pages in a site (web site communities) and make these communities the basic outsourcing unit of content to be prefetched. The approach is based only on hyperlink information, since it assumes that two pages are "similar" or "correlated" if there exists a link between them.

The algorithm by Nanopoulos et al. [35], described in Section 3.2, also falls into this category because it complements the prediction model obtained from the sequence of requests with the hyperlink graph of the web site. This graph is applied to the prediction model to allow only those page transitions that are included in it.

3.4.2 Semantic approaches

Algorithms in this subcategory try to capture the user's surfing interest from semantic characteristics of the visited pages.

A keyword-based semantic prefetching approach is proposed by Xu and Ibrahim [48]. This algorithm relies on the fact that client surfing is often guided by keywords in the text that surrounds hyperlink definitions (hrefs) in web pages. Therefore, this approach predicts future requests based on semantic preferences inferred from previously retrieved objects. To do so, the algorithm ranks the links of the last visited page according to the similarity of the keywords found in their anchor text to those in the previously accessed links. More recently, Georgakis and Li [23] propose an algorithm that also considers the anchor text around each of the outbound hyperlinks. To assign a weight to each of these links, the algorithm keeps counters of the frequencies of appearance of bigrams for visited and non-visited links.

Davison [14] proposes an algorithm that predicts the user's next web page request by analyzing the text of his/her recently requested web pages. The algorithm compares the text in and around the hypertext anchors of the current page with the text contained in previously visited pages.
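A minimal sketch of the keyword-matching core shared by these semantic approaches may help; the whitespace tokenization and the overlap score below are our own simplifications of the similarity measures used in [48, 14]:

```python
def rank_links_by_anchor_text(links, history_keywords):
    """Rank outbound links of the current page by the overlap between each
    link's anchor text and keywords from anchors the user followed before.

    links: list of (uri, anchor_text) pairs extracted from the current page.
    history_keywords: set of words from previously followed anchors.
    """
    def score(anchor_text):
        words = set(anchor_text.lower().split())
        return len(words & history_keywords) / len(words) if words else 0.0

    return sorted(links, key=lambda link: score(link[1]), reverse=True)

ranked = rank_links_by_anchor_text(
    [("match.html", "football match tonight"), ("tv.html", "TV listings")],
    {"football", "league", "match"})
print(ranked[0][0])  # 'match.html'
```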
3.4.3 Labeled approaches

This subcategory includes those proposals in which the web designer is involved in the prediction process by including extra information that is later used to predict future accesses. The common advantage of these specific, designer-dependent proposals is that they show interesting results for the environments where they are tested, that is, those for which they were developed. This is also their main disadvantage, because they are not generic solutions that can be easily adapted to general, real scenarios.

Khan and Tao identify in [27] four dominant patterns in hyperlink structure (Chain, Tree, Complete Graph, and Tree with Complete Core). Their proposal consists of labeling links with this kind of information so as to develop smarter prefetching techniques that also take the structure of the webspace into consideration.

Pons [40] proposes to conduct web prefetching by using semantic links, that is, semantic information explicitly embedded during the web design and associated with each web page link. A semantic link differs from a hyperlink in that it represents a pointer with a type or meaning directed from the predecessor web page to the successor web page. Examples of semantic link types are sequential, similar-to, cause-effective, implication, etc.

The proposal of Mabroukeh and Ezeife [32] was classified above into the group of algorithms using the sequence of requests as input, because this is the main information they use for prediction. Nevertheless, it is actually a hybrid algorithm that also employs semantic information to prune the Markov model according to a computed semantic distance. The proposal assumes that the semantic information is available in the form of a domain ontology provided during the design of the web site.

4 Conclusions

Since the inception of web prefetching, proposals of prediction algorithms have adapted to the increasing complexity of web content, and the complexity of the predictors has increased accordingly. In this paper, we have summarized and categorized more than 15 years of research in prediction algorithms applied to web prefetching techniques. To this end, a new taxonomy based on the input information used by the prediction algorithms has been proposed and successfully applied. Through the proposed taxonomy, the pros and cons of each category of algorithms have been discussed.

Table 1: Surveyed papers by category and year

Category               | 1995-2000               | 2001-2005                             | 2006-2010
Obj. characteristics   | [6, 5, 33, 21]          | [26, 46, 30, 42]                      | [47]
Seq. requests (pure)   | [4, 36, 43, 13, 22, 38] | [8, 20, 7, 24, 28, 49, 9, 25, 39, 45] | [31, 3, 37]
Seq. requests (hybrid) | [2, 50]                 | [35]                                  | [23, 32, 18]
HTTP headers           | [2, 50]                 |                                       | [16, 18]
Content                | [10]                    | [12, 14, 27, 35, 48]                  | [23, 40, 44, 32]

Table 1 summarizes the classification of the surveyed papers according to the taxonomy proposed in this work and their year of publication. The table permits us to identify trends in algorithm design. As can be observed, during the first five-year period most of the proposed algorithms dealt only with object characteristics or with the sequence of requests to make predictions. In the next five-year period, the predominant choice was to use only the sequence of accesses as input to the prediction algorithms, although the number of content-based proposals also increased. Finally, during the last five years, researchers have moved towards more sophisticated inputs. Algorithms considering several input types, as well as predictors that analyze web page contents or require the collaboration of the web page designer, are gaining relative importance in the current trend. Furthermore, an increasing concern for the computational cost of the algorithms has also been observed.

Despite this, several gaps in the design space of prediction algorithms remain unexplored. Few papers, e.g., [21, 9, 39], analyze several web architecture points of view simultaneously, and none of them combines different input categories. Besides, the use of HTTP headers has not been intensively explored, and no algorithm combining header and content analysis has been proposed.

As future work, we plan to classify other aspects of web prefetching systems: proposals of prefetching engines according to their location in the web architecture, and proposals of web prefetching architectures according to how the prediction and prefetching engines are coordinated, thus providing a complete study of web prefetching techniques from their beginnings to the present day.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2009-08201, Generalitat Valenciana under grant GV/2011/002 and Universitat Politecnica de Valencia under grant PAID-06-10/2424.
References

[1] Silvana Vanesa Aciar, Christian Serarols-Tarres, Marcelo Royo-Vela, and Josep Lluis de la Rosa i Esteva. Increasing effectiveness in e-commerce: recommendations applying intelligent agents. International Journal of Business and Systems Research, 1(1):81–97, 2007.

[2] David W. Albrecht, Ingrid Zukerman, and Ann E. Nicholson. Pre-sending documents on the WWW: A comparative study. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[3] Zhijie Ban, Zhimin Gu, and Yu Jin. A PPM prediction model based on stochastic gradient descent for web prefetching. In Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA), pages 166–173, Okinawa, Japan, 2008.

[4] Azer Bestavros. Using speculation to reduce server load and service time on the WWW. In Proceedings of the 4th ACM International Conference on Information and Knowledge Management, Baltimore, USA, 1995.

[5] Azer Bestavros. Speculative data dissemination and service to reduce server load, network traffic and service time in distributed information systems. In Proceedings of the Twelfth International Conference on Data Engineering, pages 180–187, New Orleans, USA, 1996.

[6] Azer Bestavros and C. Cunha. Server-initiated document dissemination for the WWW. IEEE Data Engineering Bulletin, 1996.

[7] Dario Bonino, Fulvio Corno, and Giovanni Squillero. A real-time evolutionary algorithm for web prediction. In Proceedings of the IEEE/WIC International Conference on Web Intelligence, Halifax, Canada, 2003.

[8] Xin Chen and Xiaodong Zhang. Popularity-based PPM: An effective web prefetching technique for high accuracy and low storage. In Proceedings of the International Conference on Parallel Processing, Vancouver, Canada, 2002.

[9] Xin Chen and Xiaodong Zhang. Coordinated data prefetching for web contents. Computer Communications, 28:1947–1958, 2005.

[10] K. Chinen and S. Yamaguchi. An interactive prefetching proxy server for improvement of WWW latency. In Proceedings of the Seventh Annual Conference of the Internet Society (INET'97), 1997.

[11] Wei Chu and Seung-Taek Park. Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web, pages 691–700, Madrid, Spain, 2009.

[12] Edith Cohen and Haim Kaplan. Prefetching the means for document transfer: a new approach for reducing web latency. Computer Networks, 39(4):437–455, 2002.

[13] Edith Cohen, Balachander Krishnamurthy, and Jennifer Rexford. Efficient algorithms for predicting requests to web servers. In Proceedings of the IEEE INFOCOM '99 Conference, New York, USA, 1999.

[14] Brian D. Davison. Predicting web actions from HTML content. In Proceedings of the 13th ACM Conference on Hypertext and Hypermedia, College Park, USA, 2002.

[15] Brian D. Davison. Learning web request patterns. In Web Dynamics - Adapting to Change in Content, Size, Topology and Use, pages 435–460. Springer, 2004.

[16] Bernardo de la Ossa, Ana Pont, Julio Sahuquillo, and J. A. Gil. Referrer graph: a low-cost web prediction algorithm. ACM, 2010.

[17] Josep Domenech, Jose A. Gil, Julio Sahuquillo, and Ana Pont. Speculative validation of web objects for further reducing the user-perceived latency. In Proceedings of the 9th International IFIP TC 6 Networking Conference, Chennai, India, 2010.

[18] Josep Domenech, Jose A. Gil, Julio Sahuquillo, and Ana Pont. Using current web page structure to improve prefetching performance. Computer Networks, 54(9):1404–1417, 2010.
[19] Josep Domènech, Julio Sahuquillo, José A. Gil, and Ana Pont. The impact of the web prefetching architecture on the limits of reducing user's perceived latency. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, China, 2006.

[20] Xing Dongshan and Shen Junyi. A new Markov model for web access prediction. Computing in Science and Engineering, 4(6):34–39, 2002.

[21] Dan Duchamp. Prefetching hyperlinks. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, USA, 1999.

[22] Li Fan, Pei Cao, Wei Lin, and Quinn Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 178–187, Atlanta, USA, 1999.

[23] A. Georgakis and H. Li. User behavior modeling and content based speculative web page prefetching. Data & Knowledge Engineering, 59:770–788, 2006.

[24] Sule Gunduz and M. Tamer Özsu. A web page prediction model based on click-stream tree representation of user behavior. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–540, New York, USA, 2003. ACM Press.

[25] Yin-Fu Huang and Jhao-Min Hsu. Mining web logs to improve hit ratios of prefetching and caching. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, Compiegne, France, 2005.

[26] Yingyin Jiang, Min-You Wu, and Wei Shu. Web prefetching: Costs, benefits and performance. In Proceedings of the 7th International Workshop on Web Content Caching and Content Distribution, Boulder, USA, 2002.

[27] Javed I. Khan and Qingping Tao. Exploiting webspace organization for accelerating web prefetching. In Proceedings of the IEEE/WIC International Conference on Web Intelligence, Halifax, Canada, 2003.

[28] Yuna Kim and Jong Kim. Web prefetching using display-based prediction. In Proceedings of the IEEE/WIC International Conference on Web Intelligence, Halifax, Canada, 2003.

[29] Thomas M. Kroeger, Darrell D.E. Long, and Jeffrey C. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. In Proceedings of the 1st USENIX Symposium on Internet Technologies and Systems, Monterey, USA, 1997.

[30] Kelvin Lau and Yiu-Kai Ng. A client-based web prefetching management system based on detection theory. In Proceedings of the Web Content Caching and Distribution: 9th International Workshop (WCW 2004), pages 129–143, Beijing, China, 2004.

[31] Heung Ki Lee, Gopinath Vageesan, Ki Hwan Yum, and Eun Jung Kim. A proactive request distribution (PRORD) using web log mining in a cluster-based web server. In Proceedings of the International Conference on Parallel Processing (ICPP'06), Columbus, USA, 2006.

[32] Nizar R. Mabroukeh and Christie I. Ezeife. Semantic-rich Markov models for web prefetching. In Proceedings of the IEEE International Conference on Data Mining Workshops, pages 465–470, Miami, USA, 2009. IEEE.

[33] Evangelos Markatos and Catherine Chronaki. A top-10 approach to prefetching on the web. In Proceedings of INET'98, Geneva, Switzerland, 1998.

[34] Bamshad Mobasher. The Adaptive Web, volume LNCS 4321, chapter Data Mining for Web Personalization, pages 90–135. Springer, 2007.

[35] Alexandros Nanopoulos, Dimitrios Katsaros, and Yannis Manolopoulos. A data mining algorithm for generalized web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5):1155–1169, 2003.
[36] Venkata N. Padmanabhan and Jeffrey C. Mogul. Using predictive prefetching to improve World Wide Web latency. Computer Communication Review, 26(3):22–36, 1996.

[37] G. Pallis, A. Vakali, and J. Pokorny. A clustering-based prefetching scheme on a web cache environment. Computers & Electrical Engineering, 34:309–323, 2008.

[38] Themistoklis Palpanas and Alberto Mendelzon. Web prefetching using partial match prediction. In Proceedings of the 4th International Web Caching Workshop, San Diego, USA, 1999.

[39] A. Pons. Improving the performance of client web object retrieval. Journal of Systems and Software, 74:303–311, 2005.

[40] A. Pons. Semantic prefetching objects of slower web site pages. Journal of Systems and Software, 79:1715–1724, 2006.

[41] Michael Rabinovich and Oliver Spatscheck. Web Caching and Replication. Addison-Wesley, 2002.

[42] S.K. Rangarajan, V.V. Phoha, K.S. Balagani, R.R. Selmic, and S.S. Iyengar. Adaptive neural network clustering of web users. Computer, 37:34–40, 2004.

[43] Stuart Schechter, Murali Krishnan, and Michael D. Smith. Using path profiles to predict HTTP requests. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998.

[44] Antonis Sidiropoulos, George Pallis, Dimitrios Katsaros, Konstantinos Stamos, Athena Vakali, and Yannis Manolopoulos. Prefetching in content distribution networks via web communities identification and outsourcing. World Wide Web, 11:39–70, 2008.

[45] Wei-Guang Teng, Cheng-Yue Chang, and Ming-Syan Chen. Integrating web caching and web prefetching in client-side proxies. IEEE Transactions on Parallel and Distributed Systems, 16(5):444–455, 2005.

[46] Arun Venkataramani, Praveen Yalagandula, Ravindranath Kokku, Sadia Sharif, and Mike Dahlin. The potential costs and benefits of long-term prefetching for content distribution. Computer Communications, 25:367–375, 2002.

[47] Bin Wu and Ajay D. Kshemkalyani. Objective-optimal algorithms for long-term web prefetching. IEEE Transactions on Computers, 55(1):2–17, 2006.

[48] Cheng-Zhong Xu and T.I. Ibrahim. A keyword-based semantic prefetching approach in internet news services. IEEE Transactions on Knowledge and Data Engineering, 16:601–611, 2004.

[49] Qiang Yang, Joshua Zhexue Huang, and Michael Ng. A data cube model for prediction-based web prefetching. Journal of Intelligent Information Systems, 20(1):11–30, 2003.

[50] Ingrid Zukerman, David W. Albrecht, and Ann E. Nicholson. Predicting users' requests on the WWW. In Proceedings of the 7th International Conference on User Modeling, Banff, Canada, 1999.