🔮 Libraries & tools for enabling Machine Learning driven user-experiences on the web
Libraries and tools for enabling data-driven user-experiences on the web.
For Webpack users:
Install and configure GuessPlugin - the Guess.js webpack plugin which automates as much of the setup process for you as possible.
Should you wish to try out the modules we offer individually, the packages directory contains three packages:
ga - a module for fetching structured data from the Google Analytics API to learn about user navigation patterns.
webpack - a webpack plugin for setting up predictive fetching in your application. It consumes the ga and parser modules and offers a large number of options for configuring how predictive fetching should work in your application.
For non-Webpack users:
Our predictive-fetching for sites workflow provides a set of steps you can follow to integrate predictive fetching using the Google Analytics API into your site.
This repo uses Google Analytics data to determine which page a user is most likely to visit next from a given page. A client-side script (which you'll add to your application) sends a request to the server to get the URL of the page it should fetch, then prefetches this resource.
Guess.js provides libraries & tools to simplify predictive data-analytics driven approaches to improving user-experiences on the web. This data can be driven from any number of sources, including analytics or machine learning models. Guess.js aims to lower the friction of consuming and applying this thinking to all modern sites and apps, including building libraries & tools for popular workflows.
Predictive data-analytics thinking could be applied to sites in a number of contexts.
By collaborating across different touch-points in the ecosystem where data-driven approaches could be easily applied, we hope to generalize common pieces of infrastructure to maximize their applicability in different tech stacks.
The first large priority for Guess.js will be improving web performance through predictive prefetching of content.
By building a model of pages a user is likely to visit, given an arbitrary entry-page, a solution could calculate the likelihood a user will visit a given next page or set of pages and prefetch resources for them while the user is still viewing their current page. This has the possibility of improving page-load performance for subsequent page visits as there's a strong chance a page will already be in the user's cache.
In order to predict the next page a user is likely to visit, solutions could use the Google Analytics API. Google Analytics session data can be used to create a model to predict the most likely page a user is going to visit next on a site. The benefit of this session data is that it can evolve over time, so that if particular navigation paths change, the predictions can stay up to date too.
With the availability of this data, an engine could insert <link rel="prefetch"> tags to speed up the load time for the next page request. In some tests, such as Mark Edmondson's Supercharging Page-Loads with R, this led to a 30% improvement in page load times. The approach Mark used in his research involved using GTM tags and machine-learning to train a model for page predictions. This is an idea Mark continued in Machine Learning meets the Cloud - Intelligent Prefetching.
While this approach is sound, the methodology used could be deemed a little complex. A simpler approach is to get prediction data directly from the Google Analytics API. If you ran a report for the Page and Previous Page Path dimensions combined with the Pageviews and Exits metrics, this should provide enough data to wire up prefetches for the most popular pages.
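As a rough sketch of how such a report could be turned into prefetch candidates, the TypeScript below groups rows by previous page and converts pageview counts into transition probabilities. The row shape (`previousPagePath`, `pagePath`, `pageviews`), the function names and the 0.3 threshold are assumptions made for illustration; they are not the Google Analytics API response format, nor something Guess.js is known to ship.

```typescript
// Hypothetical, simplified report row: one (previous page -> page) pair and its count.
interface ReportRow {
  previousPagePath: string;
  pagePath: string;
  pageviews: number;
}

// page -> (next page -> estimated probability of navigating there).
type TransitionTable = Map<string, Map<string, number>>;

function buildTransitions(rows: ReportRow[]): TransitionTable {
  // First pass: accumulate raw counts per (previous page, page) pair.
  const counts = new Map<string, Map<string, number>>();
  for (const row of rows) {
    const next = counts.get(row.previousPagePath) ?? new Map<string, number>();
    next.set(row.pagePath, (next.get(row.pagePath) ?? 0) + row.pageviews);
    counts.set(row.previousPagePath, next);
  }
  // Second pass: normalize so the probabilities leaving each page sum to 1.
  const table: TransitionTable = new Map();
  counts.forEach((next, from) => {
    let total = 0;
    next.forEach((n) => { total += n; });
    if (total === 0) return; // no usable data for this page
    const probs = new Map<string, number>();
    next.forEach((n, to) => { probs.set(to, n / total); });
    table.set(from, probs);
  });
  return table;
}

// Pages worth prefetching from `current`, most likely first.
function prefetchCandidates(
  table: TransitionTable,
  current: string,
  threshold = 0.3, // assumed cut-off; tune per site
): string[] {
  const probs = table.get(current);
  if (!probs) return [];
  return Array.from(probs.entries())
    .filter((entry) => entry[1] >= threshold)
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

A threshold keeps rarely visited pages from being prefetched; the Exits metric could further discount pages where most sessions end, which this sketch leaves out.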
ML could help improve the overall accuracy of a solution's predictions, but is not a necessity for an initial implementation. Predictive fetching could be accomplished by training a model on the pages users are likely to visit and improving on this model over time.
Deep neural networks are particularly good at teasing out the complexities that may lead to a user choosing one page over another, in particular, if we wanted to attempt a version of the solution that was catered to the pages an individual user might visit vs. the pages a "general/median" user might visit next. Fixed page sequences (prev, current, next) might be the easiest to begin dealing with initially. This means building a model that is unique to your set of documents.
Model updates tend to be done periodically, so one might set up a nightly or weekly job to refresh the model based on new user behaviour. This could be done in real-time, but is likely complex, so doing it periodically might be sufficient. One could imagine a generic model representing behavioural patterns for users on a site, driven by a trained status set, Google Analytics, or a custom description you plug in as a new layer in a router, giving the site the ability to predictively fetch future pages and improve page load performance.
Speculative prefetch can prefetch pages likely to be navigated to on page load. This assumes the existence of knowledge about the probability a page will need a certain next page or set of pages, or a training model that can provide a data-driven approach to determining such probabilities.
Prefetching on page load can be accomplished in a number of ways, from deferring to the UA to decide when to prefetch resources (e.g. at low priority with <link rel=prefetch>), during page idle time (via requestIdleCallback()) or at some other interval. No further interaction is required by the user.
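The idle-time variant could look roughly like the sketch below. `PrefetchHost`, `prefetchWhenIdle` and `browserHost` are hypothetical names introduced here; only `requestIdleCallback` and the `<link rel="prefetch">` hint are real platform features. The host is passed in so the scheduling logic can run outside a browser.

```typescript
// Minimal host abstraction (an assumption of this sketch, not a platform API).
interface PrefetchHost {
  // Schedules work for idle time; left undefined where requestIdleCallback is missing.
  requestIdle?: (task: () => void) => void;
  // Emits one prefetch hint for the given URL.
  addHint: (url: string) => void;
}

// Adds a prefetch hint for each distinct URL, deferring to idle time when possible.
function prefetchWhenIdle(urls: string[], host: PrefetchHost): void {
  const run = () => {
    const seen = new Set<string>();
    for (const url of urls) {
      if (seen.has(url)) continue; // avoid duplicate hints
      seen.add(url);
      host.addHint(url);
    }
  };
  if (host.requestIdle) {
    host.requestIdle(run);
  } else {
    run(); // no idle scheduling available: add the hints right away
  }
}

// Browser wiring; returns undefined when no DOM is present.
function browserHost(): PrefetchHost | undefined {
  const g = globalThis as any;
  if (!g.document) return undefined;
  return {
    requestIdle: g.requestIdleCallback
      ? (task: () => void) => { g.requestIdleCallback(task); }
      : undefined,
    addHint: (url: string) => {
      const link = g.document.createElement("link");
      link.rel = "prefetch";
      link.href = url;
      g.document.head.appendChild(link);
    },
  };
}
```

In a page this might be called as `const host = browserHost(); if (host) prefetchWhenIdle(["/next"], host);`, leaving the actual fetch priority to the user agent.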
A page could speculatively begin prefetching content when links in the page are visible in the viewport, signifying that the user may have a higher chance of wanting to click on them.
As with any mechanism for prefetching content ahead of time, this needs to be approached very carefully. A user on a restricted data-plan may not appreciate or benefit as much from pages being fetched ahead of time, in particular if they start to eat up their data. There are mechanisms a site/solution could take to be mindful of this concern, such as respecting the Save-Data header.
Prefetching links to "logout" pages is likely undesirable. The same could be said of any pages that trigger an action on page-load (e.g. one-click purchase). Solutions may wish to include a blacklist of URLs which are never prefetched to increase the likelihood of a prefetched page being useful.
Some of the attempts to accomplish similar proposals in the past have relied on <link rel=prerender>. The Chrome team is currently exploring deprecating rel=prerender in favor of NoStatePrefetch - a lighter version of this mechanism that only prefetches to the HTTP cache but uses no other state of the web platform. A solution should factor in whether it will be relying on the replacement to rel=prerender or using prefetch/preload/other approaches.
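Both cautions, respecting Save-Data and never prefetching action-triggering URLs, can be folded into a single guard. The sketch below is one possible shape, assuming request headers are available as a plain object; the function names and the example patterns in `NEVER_PREFETCH` are made up for illustration.

```typescript
// True when the client sent "Save-Data: on". Header names are case-insensitive.
function prefersReducedData(headers: Record<string, string | undefined>): boolean {
  for (const name of Object.keys(headers)) {
    if (name.toLowerCase() === "save-data") {
      return (headers[name] ?? "").trim().toLowerCase() === "on";
    }
  }
  return false;
}

// Hypothetical list of paths that must never be prefetched.
const NEVER_PREFETCH: RegExp[] = [
  /^\/logout(?:[/?#]|$)/,
  /^\/checkout\/one-click(?:[/?#]|$)/,
];

// Decides whether a predicted path is safe and polite to prefetch.
function shouldPrefetch(
  path: string,
  headers: Record<string, string | undefined>,
  neverPrefetch: RegExp[] = NEVER_PREFETCH,
): boolean {
  if (prefersReducedData(headers)) return false; // user asked for less data usage
  return !neverPrefetch.some((pattern) => pattern.test(path));
}
```

Checking Save-Data first means a data-conscious user gets no speculative fetches at all, regardless of how confident the prediction is.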
There are two key differences between NoStatePrefetch and Prefetch:
nostate-prefetch is a mechanism, and <link rel=prefetch> is an API. The nostate-prefetch can be requested by other entry points: omnibox prediction, custom tabs, <link rel=prerender>.
The implementation is different: <link rel=prefetch> prefetches one resource, but nostate-prefetch on top of that runs the preload scanner on the resource (in a fresh new renderer), discovers subresources and prefetches them as well (without recursing into preload scanner).
There are three primary types of data analytics worth being aware of in this problem space: descriptive, predictive and prescriptive. The types are related, and each helps teams leverage a different kind of insight.
Descriptive analytics summarizes raw data and turns it into something interpretable by humans. It can look at past events, regardless of when the events have occurred. Descriptive analytics allows teams to learn from past behaviors, and this can help them influence future outcomes. Descriptive analytics could determine what pages on a site users have previously viewed and what navigation paths they have taken from any given entry page.
Predictive analytics “predicts” what can happen next. Predictive analytics helps us understand the future and gives teams actionable insights using data. It provides estimates of the likelihood of a future outcome being useful. It’s important to keep in mind that few algorithms can predict future events with complete accuracy, but we can use as many of the signals available to us as possible to help improve baseline accuracy. The foundation of predictive analytics is based on probabilities we determine from data. Predictive analytics could predict the next page or set of pages a user is likely to visit given an arbitrary entry page.
Prescriptive analytics enables prescribing different possible actions to guide towards a solution. Prescriptive analytics provides advice, attempting to quantify the impact future decisions may have to advise on possible outcomes before these decisions are made. Prescriptive analytics aims to not just predict what is going to happen but goes further, explaining why it will happen and providing recommendations about actions that can take advantage of such predictions. Prescriptive analytics could predict the next page a user will visit, but also suggest actions such as informing you of ways you can customize their experience to take advantage of this knowledge.
The key objective of a prediction model in the prefetching problem space is to identify the subsequent requests a user may need, given a specific page request. This allows a server or client to pre-fetch the next set of pages and attempt to ensure they are in a user’s cache before they directly navigate to the page. The idea is to reduce overall loading time. When this is implemented with care, this technique can reduce page access times and latency, improving the overall user experience.
Markov models have been widely used for researching and understanding stochastic (random probability distribution) processes [Ref, Ref]. They have been demonstrated to be well-suited for modeling and predicting a user’s browsing behavior. The input for these problems tends to be the sequence of web pages accessed by a user or set of users (site-wide) with the goal of building Markov models we can use to model and predict the pages a user will most likely access next. A Markov process has states representing accessed pages and edges representing transition probabilities between states which are computed from a given sequence in an analytics log. A trained Markov model can be used to predict the next state given a set of k previous states.
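To make the states-and-edges description concrete, here is a minimal first-order Markov model (k = 1) trained on page sequences. It is an illustrative sketch rather than the Guess.js implementation; the class name, method names and the session representation are assumptions.

```typescript
// A session is the ordered list of page paths one user visited.
type Session = string[];

class FirstOrderMarkov {
  // counts.get(from).get(to) = number of observed from -> to transitions.
  private counts = new Map<string, Map<string, number>>();

  // Counts every consecutive pair of pages in every session.
  train(sessions: Session[]): void {
    for (const session of sessions) {
      for (let i = 0; i + 1 < session.length; i++) {
        const from = session[i];
        const to = session[i + 1];
        const next = this.counts.get(from) ?? new Map<string, number>();
        next.set(to, (next.get(to) ?? 0) + 1);
        this.counts.set(from, next);
      }
    }
  }

  // Estimated P(next = to | current = from); 0 when `from` was never seen.
  probability(from: string, to: string): number {
    const next = this.counts.get(from);
    if (!next) return 0;
    let total = 0;
    next.forEach((n) => { total += n; });
    return (next.get(to) ?? 0) / total;
  }

  // Most likely next page, or undefined when `from` has no recorded transitions.
  predict(from: string): string | undefined {
    let best: string | undefined;
    let bestCount = 0;
    this.counts.get(from)?.forEach((n, to) => {
      if (n > bestCount) {
        best = to;
        bestCount = n;
      }
    });
    return best;
  }
}
```

Transition probabilities here are plain relative frequencies, so the model can be retrained from scratch whenever a fresh analytics export is available.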
In some applications, first-order Markov models aren’t as accurate in predicting user browsing behaviors as these do not always look into the past to make a distinction between different patterns that have been observed. This is one reason higher-order models are often used. These higher-order models have limitations with state-space complexity, less broad coverage and sometimes reduced prediction accuracy.
One way [Ref] to overcome this problem is to train varying order Markov models, which we then use during the prediction phase. This was attempted in the All-Kth-Order Markov model proposed in this Ref. This can make state-space complexity worse, however. Another approach is to identify frequent access patterns (longest repeating subsequences) and use this set of sequences for predictions. Although this approach can have an order of magnitude reduction on state-space complexity, it can reduce prediction accuracy.
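The varying-order idea can be sketched as follows: train one model per order from 1 to K, then predict with the longest context that was actually observed, falling back to shorter ones. This is a simplified reading of the All-Kth-Order scheme, not a reproduction of the referenced paper; the class and its serialization of contexts with `JSON.stringify` are assumptions.

```typescript
class AllKthOrderMarkov {
  private maxOrder: number;
  // models[k - 1]: context of k pages (serialized) -> counts of the page that followed.
  private models: Array<Map<string, Map<string, number>>> = [];

  constructor(maxOrder: number) {
    this.maxOrder = maxOrder;
    for (let k = 0; k < maxOrder; k++) this.models.push(new Map());
  }

  // For each visited page, records it as the successor of its 1..K preceding pages.
  train(sessions: string[][]): void {
    for (const session of sessions) {
      for (let i = 1; i < session.length; i++) {
        for (let k = 1; k <= Math.min(this.maxOrder, i); k++) {
          const key = JSON.stringify(session.slice(i - k, i));
          const model = this.models[k - 1];
          const next = model.get(key) ?? new Map<string, number>();
          next.set(session[i], (next.get(session[i]) ?? 0) + 1);
          model.set(key, next);
        }
      }
    }
  }

  // Uses the highest-order model that has seen the recent history.
  predict(history: string[]): string | undefined {
    for (let k = Math.min(this.maxOrder, history.length); k >= 1; k--) {
      const next = this.models[k - 1].get(JSON.stringify(history.slice(-k)));
      if (!next) continue; // context unseen at this order: try a shorter one
      let best: string | undefined;
      let bestCount = 0;
      next.forEach((n, page) => {
        if (n > bestCount) {
          best = page;
          bestCount = n;
        }
      });
      return best;
    }
    return undefined;
  }
}
```

Every extra order adds another table of contexts, which is the state-space growth the text warns about.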
Selective Markov models (SMM) which only store some states within the model have also been proposed as a solution to state-space complexity tradeoffs. They begin with an All-Kth-Order Markov Model - a post-pruning approach is then used to prune states that are not expected to be accurate predictors. The result of this is a model which has the same prediction power of All-Kth-Order models with less space complexity and higher prediction accuracy. In Deshpande and Karypis, different criteria to prune states in the model before prediction (frequency, confidence, error) are looked at.
In Mabroukeh and Ezeife, the performance of semantic-rich 1st and 2nd order Markov models was studied and compared with that of higher-order SMM and semantic-pruned SMM. They discovered that semantic-pruned SMM have a 16% smaller size than frequency-pruned SMM and provide nearly equal accuracy.
Observing navigation patterns can allow us to analyze user behavior. This approach requires access to user-session identification, clustering sessions into similar clusters and developing a model for prediction using current and earlier access patterns. Much of the previous work in this field has relied on clustering schemes like the K-means clustering technique with Euclidean distance for improving confidence of predictions. The drawbacks of K-means include the difficulty of deciding on the number of clusters and of selecting the initial random centers, and the fact that the order of page visits is not always considered. Kumar et al investigated this, proposing a hierarchical clustering technique with a modified Levenshtein distance, pagerank using access time length, frequency and higher order Markov models for prediction.
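Of the pruning criteria mentioned, frequency is the simplest to illustrate: contexts observed fewer than some minimum number of times are dropped because their estimates are unreliable. The sketch below shows only that idea in simplified form; it is not the authors' exact procedure, and the `ContextCounts` shape, the function name and the `minSupport` parameter are assumptions.

```typescript
// context (serialized pages) -> counts of the page that followed.
type ContextCounts = Map<string, Map<string, number>>;

// Keeps only contexts that were observed at least `minSupport` times in total.
function pruneByFrequency(model: ContextCounts, minSupport: number): ContextCounts {
  const pruned: ContextCounts = new Map();
  model.forEach((next, context) => {
    let support = 0;
    next.forEach((n) => { support += n; });
    if (support >= minSupport) pruned.set(context, next);
  });
  return pruned;
}
```

The input model is left untouched, so different thresholds can be compared against the same trained counts.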
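An edit distance over page sequences, unlike Euclidean distance over visit counts, is sensitive to the order of visits. The sketch below implements the standard Levenshtein distance between two sessions; it does not include the modifications proposed by Kumar et al, and the function name is an assumption.

```typescript
// Minimum number of page insertions, deletions or substitutions turning session a into b.
function sessionEditDistance(a: string[], b: string[]): number {
  // prev[j] holds the distance between the pages of `a` processed so far and the first j pages of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]; // turning i pages into an empty session takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[b.length];
}
```

A pairwise distance like this can feed a hierarchical clustering step directly, since such methods only need distances between sessions and no cluster count fixed in advance.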
Many of the papers referenced in the following section are centered around the Markov model, association rules and clustering. Papers highlighting relevant work related to pattern discovery for evolving page prediction accuracy are our focus.
Uses first-order Markov models to model the sequence of web-pages requested by a user for predicting the next page they are likely to access. Markov chains allow the system to dynamically model URL access patterns observed in navigation logs based on previous state. A “personalized” Markov model is trained for each user and used to predict a user’s future sessions. In practice, it’s overly expensive to construct a unique model for each user and the cost of scaling this becomes more challenging when a site has a large user-base.
First paper to investigate Hidden Markov Models (HMM). The author collected web server logs, pruned the data and patched the paths users followed. Based on HMM, the author constructed a specific model for web browsing that predicts in real-time whether users intend to make a purchase. Related measures, like speeding up the operation and their impact when in a purchasing mode, are investigated.
Proposes a framework to predict the ranking position of a page based on its previous rankings. Assuming a set of successive Top-K rankings, the author identifies predictors based on different methodologies. Prediction quality is quantified as the similarity between predicted and actual rankings. Exhaustive experiments were performed on a real-world large scale dataset for both global and query-based top-K rankings. A variety of existing similarity measures for comparing Top-K ranked lists are used, including a novel one introduced in the paper.
Proposes using N-hop Markov models to predict the next web page users are likely to access. Pattern matches the user’s current access sequence with the user’s historical web access sequences to improve the prediction accuracy for prefetches.
Proposes dynamic clustering-based methods to increase Markov model accuracy in representing a collection of web navigation sessions. Uses a state cloning concept to duplicate states in a way separating in-links whose corresponding second-order probabilities diverge. The method proposed includes a clustering technique determining a way to assign in-links with similar second-order probabilities to the same clone.
Extends the use of a page-rank algorithm with numerous navigational attributes: size of the page, duration time of the page, duration of transition (two page visits sequentially), frequency of page and transition. Defines a Duration Based Page Rank (DPR) and Popularity Based Page Rank (PPR). The author looked at the popularity of transitions and pages using duration information, combining it with page size and visit frequency. Using the popularity value of pages, this paper attempts to improve conventional page rank algorithms and model a next page prediction under a given Top-N value.