Because log analysis figures are misleading, ThinkMetrics' CEO advises switching to page-based tracking, and helps you through the process.
There are two methods of analyzing website traffic: log analysis and page-based tracking. It is fairly common for people to switch from log analysis to page-based tracking as their understanding of web analytics grows. Making this switch can create continuity problems, because the numbers from the two methods will differ. This article addresses these issues and outlines some solutions.
Differences between methods
The first thing to understand is that the numbers from page-based tracking and log analysis will never match, because they are not measuring the same things. This difference is further obscured by the fact that both methods use the same terms for different measurements. To see what is going on, we need to understand how the two systems of web analytics work.
Log analysis
Web servers keep files called "access logs," which record the pages requested, who requested them, when, and so forth. Log analysis involves reading and analyzing these access logs.
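To make the idea concrete, here is a minimal sketch of reading one access-log line in Python. It assumes the server writes the widely used Combined Log Format; the article does not say which format is in play, and the sample line, the `LOG_PATTERN` name and the `parse_line` helper are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

# Made-up sample line, for illustration only.
sample = (
    '203.0.113.7 - - [10/Oct/2005:13:55:36 +0000] '
    '"GET /products.html HTTP/1.1" 200 2326 '
    '"http://www.example.com/" "Mozilla/5.0 (Windows; U)"'
)
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"])
```

Everything a log analyzer reports is built from records like this one: the page, the requester, the time and the self-reported user agent.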
Log analysis is easy and cheap. Page-based tracking is much harder, and processor-intensive, so commercial-grade page-based tracking is almost non-existent. As a result, all Open Source web analytics software uses log analysis. For this reason, log analysis is much more common than page-based tracking. If you're not sure what you want, or why, and you don't want to pay for it, the web activity reports you get will be derived from log analysis.
Problems with log analysis
The most important cause of discontinuity between log analysis and page-based tracking is that log analysis doesn't distinguish between pages requested by search engines indexing the site and pages requested by humans reading it. It would be easy to separate the two, but I've never come across any system that does.
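One way such a separation might look is sketched below. It assumes each parsed log entry carries the user-agent string the client sent, and it matches that string against a short, illustrative list of crawler markers. The `CRAWLER_MARKERS` list and both function names are hypothetical, and because user agents are self-reported, this approach only catches crawlers that identify themselves.

```python
# Illustrative substrings only; a real list would be longer and need upkeep.
CRAWLER_MARKERS = ("googlebot", "bingbot", "slurp", "crawler", "spider", "bot")

def is_crawler(user_agent):
    """Guess whether a user-agent string belongs to a search engine crawler."""
    agent = user_agent.lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)

def split_requests(entries):
    """Split log entries into (human, crawler) lists by user agent."""
    humans, crawlers = [], []
    for entry in entries:
        (crawlers if is_crawler(entry["agent"]) else humans).append(entry)
    return humans, crawlers

# Made-up entries, for illustration only.
entries = [
    {"path": "/", "agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"path": "/about.html",
     "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"path": "/products.html", "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X)"},
]
humans, crawlers = split_requests(entries)
print(len(humans), "human requests,", len(crawlers), "crawler requests")
```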
Bundling search engines and people together in your reports distorts your numbers. Search engines read pages much faster than people, perhaps one or two pages per second. In addition, they may read many more pages in a single visit than the average human. Counting search engines in the same way as people skews the key performance indicators. It increases total page views and reduces the average visit duration. It may also increase the average number of pages viewed per visit.
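The effect is easy to see with a handful of numbers. The sketch below computes the same indicators twice, once for human visits only and once with crawler visits mixed in. The visit data is invented purely to show the direction of the skew, and the `kpis` helper is a hypothetical name.

```python
from statistics import mean

# Hypothetical visits: (kind, pages viewed, duration in seconds).
visits = [
    ("human", 6, 540),
    ("human", 4, 420),
    ("human", 5, 480),
    ("crawler", 60, 40),   # many pages, read at one or two per second
    ("crawler", 45, 30),
]

def kpis(rows):
    """Compute the headline indicators for a list of visits."""
    return {
        "visits": len(rows),
        "page_views": sum(pages for _, pages, _ in rows),
        "avg_pages_per_visit": mean(pages for _, pages, _ in rows),
        "avg_duration_s": mean(seconds for _, _, seconds in rows),
    }

humans_only = kpis([row for row in visits if row[0] == "human"])
everything = kpis(visits)

print("humans only:", humans_only)
print("bundled:    ", everything)
```

With this made-up data, bundling the crawlers in raises total page views and pages per visit while pulling the average visit duration down, which is the pattern described above.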
How much counting search engines will distort your numbers depends on how many people are visiting the site and the number and depth of search engine visits. If you have a site that has a relatively low number of human visits, but a high number of search engine visits, the skewing could be extreme, presenting a completely false picture of what is happening. In my experience, even an average commercial site will see a halving of the average visit duration.
Since log analysis systems don't separate search engines from humans when they calculate these numbers, you can't adjust the figures manually even if you understand what's happening. In reality, few people understand that this is going on, so they assume that counts for visits, unique visitors, page views and average durations all refer to humans. This is why most web designers think people spend half as much time on websites as they really do. Most web designers will tell you the typical visit to a website is three to four minutes. In fact, it's more like eight to 10 minutes, and much longer if you've caught someone's attention.
The other cause of discontinuity between page-based tracking and log analysis is caching. Most web browsers will hold a copy of a web page for a period after the visit. If you return to the page, your browser checks to see if the page has changed since the last visit. If not, your browser simply shows you the copy it already has. This is a hold-over from the days of slow internet connections, and was designed to save you those agonizing five-minute waits for each page.
Corporate proxy servers may also store local copies of commonly requested pages to save bandwidth charges. In both cases this is called caching. The problem is that caching cannot be detected by the web server, so views of cached pages are not recorded in the log files. This means that log analysis under-counts views of pages by humans. Whereas log analysis could remove search engines from the numbers, there is no way it can add back the views served from a cache. Caching can account for up to 30 percent of a site's total page views.
All this makes it extremely difficult to understand exactly what people are doing on your website with log analysis. Caching pulls the page view numbers down, while search engines pull them up. How far out the result is, and in which direction, depends on how much search engine activity there is on the site and on the type of visitors and their connections. There's no practical way to correct for all this.
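A back-of-the-envelope model shows how the two effects pull against each other. The sketch below assumes a simplified picture in which a fixed percentage of human page views is served from caches and never reaches the log, while every crawler request does. The function name and all figures are hypothetical, and a log report never reveals these inputs, which is exactly the problem.

```python
def logged_page_views(human_views, cached_percent, crawler_views):
    """Page views a log-based report would show under this simple model."""
    if not 0 <= cached_percent <= 100:
        raise ValueError("cached_percent must be between 0 and 100")
    # Human views that actually reach the server and get logged.
    seen_human_views = human_views * (100 - cached_percent) / 100
    return seen_human_views + crawler_views

# Scenario A: heavy caching, light crawling -> the log under-counts humans.
true_views = 10_000
report_a = logged_page_views(true_views, cached_percent=30, crawler_views=1_000)

# Scenario B: light caching, heavy crawling -> the log over-counts humans.
report_b = logged_page_views(true_views, cached_percent=10, crawler_views=5_000)

print("A: reported", report_a, "error", report_a - true_views)
print("B: reported", report_b, "error", report_b - true_views)
```

The same 10,000 real page views come out as an under-count in one scenario and an over-count in the other, depending only on the mix of caching and crawler activity.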
Next: About page-based tracking, and how to switch processes.