Web Analytics: Overview, Options, and Technology Enablers

I have been doing a lot of web and clickstream analytics work for various companies with a wide range of sophistication and latency requirements, so I thought a lay of the land might be useful…

Access to timely clickstream data from a company’s website provides insight into online visitor behavior and patterns, which in turn enables companies to be more effective in myriad ways: with improved pricing, more effective campaigns and offers, better visitor segmentation and targeting, optimized website layout and workflow, and more.

Many companies use off-the-shelf web analytics products, which often make it relatively simple to monitor and analyze website activity, eliminating the need for significant IT efforts. There is no shortage of these web analytics products from which companies can choose.

As companies become more sophisticated in their web analytics requirements, though, they often need to augment the capabilities of these packaged web analytics products. Building and maintaining an internal clickstream data warehouse (CDW) enables these companies to manage, segment and report on the data in ways that the packaged products do not.

Until recently, building and maintaining a CDW has been prohibitive for most companies because of the volume and complexity of the warehoused data, as well as the volume and complexity of the raw source data files. However, recent advances in data warehousing technologies, such as columnar databases and data warehousing appliances that are designed to deliver very high levels of performance with massive and complex data sets, have made CDWs a practical option for many companies.

In addition to requirements for high-performance OLAP database technologies, the raw clickstream source data, which often exists as huge, complex text files, must be parsed, structured, cleansed and loaded into the CDW on a regular (sometimes daily) basis before the CDW can provide value. This step often adds cost, complexity and delay. So the ability to process the raw clickstream data files in a rapid, cost-effective and scalable manner is also critical for any CDW initiative to succeed.
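
As a rough illustration of the parse, structure and cleanse step, here is a minimal Python sketch. It assumes the raw files are web server logs in the common Apache/NGINX "combined" format, which is only one of the possible inputs; the pattern, the field names and the `parse_line` helper are my own choices for the example, not part of any particular product.

```python
import re
from datetime import datetime

# Assumption: lines follow the Apache/NGINX "combined" log format.
# Other servers or custom formats would need a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a structured record, or None if the line cannot be parsed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # cleansing: drop lines that do not fit the format
    record = match.groupdict()
    try:
        record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None  # cleansing: drop lines with an unreadable timestamp
    record["status"] = int(record["status"])
    # The server writes "-" when no response body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /pricing.pdf HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0"')
record = parse_line(sample)
print(record["path"], record["status"], record["bytes"])
```

At real volumes this logic would run in a parallel or streaming pipeline rather than a single loop, but the shape of the work (match, convert types, discard bad lines) stays the same.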

Traditional Web Analytics Products

Off-the-shelf software as a service (SaaS) web analytics products have been available for years. One major vendor reports that it has over 5,000 customers, and some of the major search engine companies offer popular web analytics products for free.

These products use a variety of underlying technologies (including page tagging, packet sniffing, and others) to collect a company’s website visitor data on the analytics vendors’ servers, and then provide each customer with capabilities to report on their specific website data. The primary benefit of using a SaaS web analytics product is that it requires much less effort than taking the data in-house and building a clickstream data warehouse. For many companies, these products provide sufficient levels of detail and flexibility in their reporting.

However, problems and limitations exist with these products, including:

• Lack of user-centric segmentation. Off-the-shelf web analytics products are useful for tracking activity in a page-centric manner, but for many companies the information and segmentation they offer does not satisfy requirements for user-centric information. So, while customers are able to track the number of visitors, page views, and conversions on their website, they are unable to segment the data by user session to understand what a user does in a particular session, or to track a user’s activity across multiple sessions, for example.
• Historical analysis. The pages to be tracked and the tracking criteria must be defined in advance – it is impossible to report on new criteria from previous (historical) website activity. The new criteria must first be defined, and only subsequent activity can be tracked.
• Visitor tracking limitations. A limitation of technologies like page tagging is that not all user visits are tracked. For example, visits from users who have deleted cookies, or whose clients don’t have JavaScript enabled (such as on mobile devices), are not recorded.
• Object tracking limitations. Activities with certain object types, such as PDF views and file downloads, are not tracked.
• Server tracking limitations. Since most of these products rely on code that is executed in the client, they cannot report on server responses, such as failed requests, response times, etc.
• Confidentiality. Web analytics vendors store all of their customers’ web traffic data on their own servers. For some companies (and government organizations), the risk of their proprietary website analytics information being used without their knowledge or approval is not acceptable.

These issues, and others, are sufficient for some companies to build and maintain their own CDW from their clickstream data, and perform analytics on the data with sophisticated Business Intelligence (BI) software.

Clickstream Data Warehouse

A clickstream data warehouse is used to store all of the historical website activity in a structured format – typically on the company’s own servers – so that sophisticated queries and reports can be run on the data with BI software. Because of the large volume of clickstream data generated on a daily basis, and the large number of fields in the data, the prospect of implementing a CDW can be daunting. Even so, the business advantages of augmenting – or supplanting – packaged SaaS web analytics products with a CDW often provide sufficient justification for companies to undergo the initiative.

Benefits of implementing a CDW include:

• Flexibility. Since the company has all of the data, it can process, segment, and report on the data in whatever ways it chooses. For example, the ability to segment the data into unique user sessions and to combine multiple visits of a particular user over time provides rich insight into customer value.
• Combining multiple touchpoints. Combining a customer’s clickstream activity with data from customer support, procurement, POS, and other operational systems provides companies with a more complete view of the customer, and allows for more precise customer scoring.
• Historical analysis. With a CDW, queries do not need to be pre-defined. Days, months, or years after the activity, an organization can ask new questions of the data that it did not initially think to ask.
• More powerful BI. Companies often use sophisticated BI software with their CDW, providing analytics capabilities far beyond what off-the-shelf web analytics products provide.
• Superior tracking. Web server logs have far fewer limitations in tracking objects and visitors than technologies such as page tagging. Web server logs capture activity regardless of the characteristics of the client, and include user activity with PDF files, file downloads, server response times, etc.
• Confidentiality. CDWs built from web server logs eliminate any risk that an analytics vendor will share the data, since all of the data remains inside the company (on the web servers and in the company’s CDW).
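
To make the session segmentation mentioned under Flexibility concrete, here is a small Python sketch. It assumes each event already carries a visitor identifier and a timestamp, and it uses a 30-minute inactivity timeout to split sessions; that timeout is a widely used convention, not a requirement, and the `sessionize` function and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events):
    """Group (visitor_id, timestamp, path) events into sessions per visitor.

    Returns a dict mapping visitor_id to a list of sessions, where each
    session is a list of (timestamp, path) tuples in time order.
    """
    sessions = {}
    for visitor_id, ts, path in sorted(events, key=lambda e: (e[0], e[1])):
        visitor_sessions = sessions.setdefault(visitor_id, [])
        last_ts = visitor_sessions[-1][-1][0] if visitor_sessions else None
        if last_ts is not None and ts - last_ts <= SESSION_TIMEOUT:
            visitor_sessions[-1].append((ts, path))  # continue current session
        else:
            visitor_sessions.append([(ts, path)])  # start a new session
    return sessions


events = [
    ("visitor-a", datetime(2023, 10, 10, 9, 0), "/home"),
    ("visitor-b", datetime(2023, 10, 10, 9, 5), "/home"),
    ("visitor-a", datetime(2023, 10, 10, 9, 10), "/pricing"),
    ("visitor-a", datetime(2023, 10, 10, 11, 0), "/checkout"),
]
result = sessionize(events)
for visitor_id, visitor_sessions in result.items():
    print(visitor_id, [len(session) for session in visitor_sessions])
```

In a real CDW this grouping would more likely be expressed in SQL or in the transformation layer, but the rule is the same: order each visitor's events by time and cut wherever the gap exceeds the timeout.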

Clickstream Source Data

Just as different SaaS web analytics vendors use different underlying technologies – such as page tagging or packet sniffing – to track clickstream data, a CDW can leverage various sources of clickstream data as input.

For example, companies that are already using a web analytics product often use their analytics vendors’ source data as input to their CDW. All of the major analytics vendors offer their customers access to their full source data via batch delivery services or APIs. This enables these companies to continue to leverage their prior investments in customizing the product, and offsets many of the limitations of the SaaS offering.

Alternatively, a CDW can be built by processing the raw log files written by the web servers. Since the web server log files contain every transaction, these files provide more complete data to be mined, eliminating many of the limitations of client-side tracking technologies, such as the inability to track visitors running clients without JavaScript.
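
As a sketch of the kind of server-side facts that log files expose and client-side tags miss, the Python below counts failed requests and file downloads from already-parsed log records. The record layout (a dict with `path` and `status`), the `summarize` function and the list of download extensions are assumptions made for this example.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

# Assumption: these extensions count as "downloads" for reporting purposes.
DOWNLOAD_EXTENSIONS = {".pdf", ".zip", ".csv"}


def summarize(records):
    """Count failed requests by status code and successful file downloads."""
    failures = Counter()
    downloads = Counter()
    for rec in records:
        path = urlsplit(rec["path"]).path  # strip any query string
        if rec["status"] >= 400:
            failures[rec["status"]] += 1
        elif splitext(path)[1].lower() in DOWNLOAD_EXTENSIONS:
            downloads[path] += 1
    return failures, downloads


records = [
    {"path": "/home", "status": 200},
    {"path": "/whitepaper.pdf?src=email", "status": 200},
    {"path": "/whitepaper.pdf", "status": 200},
    {"path": "/old-page", "status": 404},
    {"path": "/checkout", "status": 500},
]
failures, downloads = summarize(records)
print(dict(failures))
print(dict(downloads))
```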

Some companies choose to build CDWs by combining multiple data sources that use different types of clickstream data. Doing so enables companies to leverage the benefits of multiple underlying technologies – for example, by enriching the batch data files from their web analytics vendor with the web server log data.

Enabling Technologies

Until recently, the massive amount of computing resources required to effectively work with the clickstream data made CDW initiatives prohibitive for most companies. However, recent advances in data warehousing technologies, including massively parallel processing (MPP) architectures and columnar databases, require less investment in hardware resources and deliver significantly more attractive price-performance ratios than ever before. As a result, many of the newer, successful CDW initiatives rely on these high-performance data warehousing technologies.
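
For readers who have not worked with columnar databases, the toy Python below shows the basic idea. It is only an illustration with made-up data and says nothing about how any specific product is implemented: the same records are held row by row and column by column, an aggregate over one field only needs that one column, and a column with many repeated values compresses well with something as simple as run-length encoding.

```python
from itertools import groupby

# Row-oriented layout: each record is stored together.
rows = [
    {"visitor": "a", "path": "/home", "status": 200, "bytes": 512},
    {"visitor": "a", "path": "/pricing", "status": 200, "bytes": 2048},
    {"visitor": "b", "path": "/home", "status": 200, "bytes": 512},
    {"visitor": "b", "path": "/missing", "status": 404, "bytes": 128},
]

# Column-oriented layout: each field is stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing bytes from rows touches every record in full;
# summing from the columnar layout reads a single list.
total_from_rows = sum(row["bytes"] for row in rows)
total_from_column = sum(columns["bytes"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_status = run_length_encode(columns["status"])
print(total_from_rows, total_from_column, encoded_status)
```

Clickstream tables tend to be wide, and most queries touch only a handful of fields, which is why this layout suits them.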

In addition to high-performance data warehousing technology, another critical component of a successful CDW implementation is high-performance data transformation technology that can parse, structure, and cleanse the raw clickstream source files to initially populate the CDW, and refresh the CDW on an ongoing basis. Since these source files are typically very large and complex, and require significant processing to extract the desired information, it pays to look for the highest-performance transformation technologies available.

Finally, some of the newer analytic database technologies provide extremely fast import capability, so that the data can be available for reporting and analytics minutes or even seconds after the actual event is logged, in some cases enabling behavioral targeting for in-session, programmatic control of workflow, content, offers, and advertising.

Clickstream data is a rich source of information for companies. While many web analytics products are available, limitations associated with these products often drive companies to undertake their own internal clickstream data warehouse initiatives. With a clickstream data warehouse in place, companies can segment individual user sessions, combine customers’ online activity with data from other operational systems, and overcome many other limitations of the SaaS web analytics products.

New, high-performance data warehousing technologies are often used for these initiatives because of the massive data volumes and the complexity and cardinality of the data. High-performance data transformation technology that can parse, structure, and cleanse the raw clickstream source files, both to initially populate the CDW and to refresh it on a regular basis, is also available.

These technologies make it simpler and more cost effective for companies to manage their clickstream data in-house than ever before, effectively raising the bar on the insight and benefits that can be obtained via clickstream analysis.

Some relevant links:
http://hadoop.apache.org/
http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
