Slides 1 used in workshop session A9 on "Lies, Damn Lies, and Web Statistics" at the IWMW 2005 event held at the University of Manchester on 6-8 July 2005.
See http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2005/sessions/lowndes/
IWMW 2005: Lies, Damn Lies, and Web Statistics (1)
1. Dr. Mike Lowndes,
Interactive Media Manager,
Natural History Museum, London
– Houses 350 permanent scientific staff, plus postgraduate
students; one of the largest UK research institutes in the
natural sciences.
IWMW 2005: Whose Web Is It Anyway?
Lies, Damn Lies, and Web Statistics
2. Contents
• Why bother?
• Issues with web logs
• Issues with analytic tools
• Browser tracking
• Comparison between approaches
• Known issues with browser tracking
• Nedstat input and findings from Newcastle
University
3. Why bother?
• Web log analysis is currently the main method used to
quantify web site usage for reporting.
• Results are used by the government as performance
indicators for institutional websites.
• Not accurate or meaningful most of the time
– no good for absolute measurement of usage.
Can be used for:
• Trend analysis
• Content preferences
• ROI estimation
• Checking and fixing your site
• Understanding user behaviour
• Testing assumed pathways
4. Issues with server logs
• Dynamic IP
– Many users using the same IP number over time.
– Same user assigned many IP numbers over time.
• Proxies
– Several or many users behind 1 IP number
• Caches (can be ‘in’ Proxies)
– Commonly requested files cached closer to the users.
– Can form the top 20-50 hosts accessing sites.
• Robots and spiders
– Few visits but lots of hits.
– Analytic packages cannot keep up to date with all of them for exclusion.
• Syndication
– RSS feeds generate huge logs, but are not ‘read’ by humans initially.
– Click-through configuration.
• Reporting by analysis tools
– Often weekly or monthly reports: real-time reporting is very labour- and server-intensive
– Reports often complex and techy.
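The robot and proxy issues above can be illustrated with a minimal sketch, assuming the server writes the common "combined" log format. The sample lines, the `ROBOT_MARKERS` list and the `summarise` helper are hypothetical; a real exclusion list would be far longer and would still go out of date.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical and deliberately incomplete list of robot markers.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# Made-up sample lines using documentation IP ranges.
SAMPLE = [
    '192.0.2.10 - - [06/Jul/2005:10:00:00 +0100] "GET /visiting/index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.10 - - [06/Jul/2005:10:00:05 +0100] "GET /visiting/map.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '198.51.100.7 - - [06/Jul/2005:10:01:00 +0100] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]


def is_robot(agent):
    """Guess from the user-agent string whether a request came from a robot."""
    agent = agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def summarise(lines):
    """Count hits from robots and from everything else, grouped by host."""
    hits_by_host = Counter()
    robot_hits = 0
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if is_robot(match.group("agent")):
            robot_hits += 1
            continue
        hits_by_host[match.group("host")] += 1
    return {
        "human_hits": sum(hits_by_host.values()),
        "robot_hits": robot_hits,
        "hosts": len(hits_by_host),
        "top_hosts": hits_by_host.most_common(3),
    }


print(summarise(SAMPLE))
```

Note that "hosts" here counts IP numbers, not people: a proxy or cache serving many users still appears as a single host, which is why such hosts can dominate the top of the list.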
5. Issues with log analysis tools
• WebTrends vs Summary SP (summary.net)
• 1. Natural History Museum
– Summary SP (summary.net) Version 4.2.1, unregistered demo, default configuration
• 2. UKOLN (Bath)
– WebTrends (www.webtrends.com) Version 5, default configuration
• Both tools were applied to the same log file
• Default configurations – not removing robots
– Note: WebTrends documentation not clear on this point
6. Measurement discrepancies
Measurement                   Summary SP   WebTrends
Connections (hits)            -            +0.67%
Page views (page hits)        -            +5.00%
Visits (user sessions)        -            +0.07%
Failed hits                   -            +0.30%
Average visit duration        -            -30.0% (+250%)
Browsers
  IE                          75%          86%
  Netscape compatible         2%           4%
Referrers
Top Level Domains             US           US
                              UK           UK
                              AUS          CAN
                              NETHER       NETHER
                              CAN          AUS
                              JAP          JAP
7. Comparison between tools
• Not a single measurement was identical.
• Most measurements were within 5%
• Visit duration measurement widely different, and
can depend on configuration. Possible bug in
WebTrends version 5.
• Page view measurements were quite different.
Results broadly similar but direct comparisons,
especially of Page Views, are not really justified.
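One way a configuration setting can change visit counts and visit durations is the inactivity timeout used to split requests into visits. The sketch below is a generic illustration of that effect, not the algorithm of either tool: the `sessionise` and `average_duration_seconds` helpers and the request times are hypothetical, and duration is taken as last request minus first request, which is itself an assumption.

```python
from datetime import datetime, timedelta


def sessionise(timestamps, timeout_minutes):
    """Group request times into visits; a gap longer than the timeout starts a new visit."""
    visits = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][-1] <= timedelta(minutes=timeout_minutes):
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits


def average_duration_seconds(visits):
    """Average of (last request - first request) over all visits."""
    durations = [(v[-1] - v[0]).total_seconds() for v in visits]
    return sum(durations) / len(durations)


# Made-up request times from a single host on one morning.
REQUESTS = [
    datetime(2005, 7, 6, 10, 0),
    datetime(2005, 7, 6, 10, 2),
    datetime(2005, 7, 6, 10, 20),
    datetime(2005, 7, 6, 10, 21),
    datetime(2005, 7, 6, 11, 30),
]

for timeout in (30, 10):
    visits = sessionise(REQUESTS, timeout)
    print(timeout, len(visits), average_duration_seconds(visits))
# 30-minute timeout: 2 visits, average 630.0 seconds
# 10-minute timeout: 3 visits, average 60.0 seconds
```

The same five requests give two visits averaging 10.5 minutes or three visits averaging 1 minute, depending only on the timeout chosen.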
8. Browser tracking
• Does it have fewer inaccuracies and distortions?
• Is it easier on the web team?
• Is it affordable?
• Does it give us more information / better
information?
9. Browser tracking
• Requires code to be added to pages
• Uses an image, sourced from the tracking website.
Also uses JavaScript and cookies for gathering
extended and repeat-visit information
• Usually hosted services
• Provide near real-time tracking
• Few of the issues distorting logs affect these
measurements (according to the blurb)
• Main players: Nedstat, Nielsen//NetRatings,
WebSideStory
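The image-plus-cookie mechanism described above might look roughly like the following on the tracking server. This is a minimal sketch under stated assumptions, not how any of the named vendors implement it: the cookie name `visitor_id`, the one-year lifetime and the `handle_pixel_request` function are all hypothetical.

```python
import base64
import uuid
from http.cookies import SimpleCookie

# A 1x1 transparent GIF: the image the tagged page requests from the tracker.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def handle_pixel_request(cookie_header, page):
    """Record one page view and build the response for the image request.

    cookie_header is the raw Cookie request header (or None);
    page is the tagged page's URL, passed by the page tag.
    """
    cookie = SimpleCookie(cookie_header or "")
    is_repeat = COOKIE_NAME in cookie
    visitor_id = cookie[COOKIE_NAME].value if is_repeat else uuid.uuid4().hex

    record = {"visitor": visitor_id, "page": page, "repeat": is_repeat}
    headers = [
        ("Content-Type", "image/gif"),
        # Ask browsers and proxies not to cache the image, so every
        # page view produces a request the tracker can count.
        ("Cache-Control", "no-store"),
        ("Set-Cookie", f"{COOKIE_NAME}={visitor_id}; Max-Age=31536000; Path=/"),
    ]
    return record, headers, PIXEL


first, _, _ = handle_pixel_request(None, "/visiting/index.html")
again, _, _ = handle_pixel_request(f"visitor_id={first['visitor']}", "/visiting/map.html")
print(first["repeat"], again["repeat"])  # False True
```

Because the visitor is identified by the cookie rather than by IP number, users behind a shared proxy are counted separately, but a user who rejects or deletes the cookie looks like a new visitor each time.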
10. Comparison between tools
• Summary SP vs Nielsen//NetRatings
• Run on one section of a site over a month.
• ‘Visiting’ section of the Natural History Museum site
– small but popular and easily tagged.
11. Results 1 – visits and visitors
Measure                              Browser track   Log analysis   Difference   (unlabelled)
Visits / User sessions               27,663          40,402         -32%         35,395
Visits per day (ave)                 922             1,347                       1,180
Visits per visitor per month (ave)   1.1             1.7                         1.5
Unique visitors (browsers)           25,127          23,585                      23,084
Pages per visit (ave)                3.31            3                           2.1
Visit duration (ave)                 02:09           07:13                       04:08
Page impressions                     91,506          117,447                     71,895
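The derived figures in this table can be checked with a little arithmetic. The sketch below takes the first two figures in each row to be the browser-tracking and log-analysis results respectively (an assumption, inferred from slides 12 and 14) and assumes a 30-day month; the variable names are illustrative only.

```python
DAYS_IN_MONTH = 30  # assumption: the month measured had 30 days

# Figures copied from the table; column meaning is an assumption (see above).
browser = {"visits": 27663, "visitors": 25127, "pages": 91506}
log = {"visits": 40402, "visitors": 23585, "pages": 117447}

# Relative difference in visits, browser tracking versus log analysis.
visit_diff = (browser["visits"] - log["visits"]) / log["visits"]
print(f"{visit_diff:.0%}")  # -32%

# Averages derived from the raw counts.
print(round(browser["visits"] / DAYS_IN_MONTH))            # 922
print(round(log["visits"] / DAYS_IN_MONTH))                # 1347
print(round(browser["visits"] / browser["visitors"], 1))   # 1.1
print(round(log["visits"] / log["visitors"], 1))           # 1.7
print(round(browser["pages"] / browser["visits"], 2))      # 3.31
```

Under these assumptions the derived rows agree with the raw counts, which supports reading the -32% as the drop in visits when moving from log analysis to browser tracking.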
12. Results 2 – pages viewed
Top 10                               Browser track   Log analysis
index.html, Visiting home.           31,117          28,591
where are we? page                   17,897          26,566
planning your visit page             6,835           16,773
events calendar page                 9,221           9,369
howtogethere -local map page         4,700           5,005
access guide introduction page       1,978           4,653
travel details page                  3,550           3,668
facilities page                      2,767           3,497
activities page                      3,293           3,375
multilingual info.                   828             1,901
top ten totals                       82,186          103,398
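To see where the two methods disagree most, the per-page figures can be compared as a ratio of log-analysis views to browser-tracking views. This is a small illustrative calculation on the table's own numbers; the `totals` and `log_to_browser_ratios` helpers are hypothetical names.

```python
# (page, browser-tracking page views, log-analysis page views) from the table.
TOP_TEN = [
    ("index.html, Visiting home", 31117, 28591),
    ("where are we? page", 17897, 26566),
    ("planning your visit page", 6835, 16773),
    ("events calendar page", 9221, 9369),
    ("howtogethere -local map page", 4700, 5005),
    ("access guide introduction page", 1978, 4653),
    ("travel details page", 3550, 3668),
    ("facilities page", 2767, 3497),
    ("activities page", 3293, 3375),
    ("multilingual info.", 828, 1901),
]


def totals(rows):
    """Sum each method's page views over all rows."""
    return sum(b for _, b, _ in rows), sum(l for _, _, l in rows)


def log_to_browser_ratios(rows):
    """Return (page, log views / browser views), smallest ratio first."""
    return sorted(((page, l / b) for page, b, l in rows), key=lambda item: item[1])


print(totals(TOP_TEN))  # (82186, 103398)
for page, ratio in log_to_browser_ratios(TOP_TEN):
    print(f"{ratio:.2f}  {page}")
```

The home page is the only page where log analysis reports fewer views than browser tracking (ratio below 1), while for "planning your visit" the log figure is roughly two and a half times the browser-tracking figure.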
13. Results 3 – country
Countries   Browser track   GeoIP (Summary SP)
            uk 75%          uk 62%
            us 5%           us 8%
            spain           spain
            italy           netherlands
            netherlands     germany
            france          italy
            germany         france
            belgium         canada
            poland          poland
• Depends on the quality of the geographical IP database, not
the mode of tracking?
14. Conclusions regarding traditional log analysis
Assuming browser tracking is more accurate…
• We have fewer visit sessions than we thought, but
more visitors
– Fewer visits (sessions), possibly due to robot exclusion
– More visitors (unique users), possibly due to the masking
effect of proxies/caches and browser caches
• Visit duration is much shorter than thought
– possibly due to robots/spiders and cache updating.
• Country information is roughly accurate so long as a
geographical lookup is used.
• Activity of popular pages, which are often cached,
will be underestimated
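The masking effect of proxies on visitor counts can be shown with a toy example. It is a sketch only: the `(ip, cookie_id)` records are made up, and the comparison assumes every visitor accepts and keeps the tracking cookie.

```python
# Made-up page-view records: (IP number seen by the server, cookie ID seen by the tracker).
RECORDS = [
    ("203.0.113.1", "a"),  # three different users behind one proxy
    ("203.0.113.1", "b"),
    ("203.0.113.1", "c"),
    ("192.0.2.5", "d"),    # one user whose dynamic IP changed between visits
    ("192.0.2.9", "d"),
]


def unique_by_ip(records):
    """Unique visitors as a log analyser would count them, by IP number."""
    return len({ip for ip, _ in records})


def unique_by_cookie(records):
    """Unique visitors as a browser tracker would count them, by cookie ID."""
    return len({cookie for _, cookie in records})


print(unique_by_ip(RECORDS))      # 3
print(unique_by_cookie(RECORDS))  # 4
```

Here the proxy hides two visitors from the IP-based count while the dynamic IP adds one spurious visitor; which effect dominates on a real site depends on its audience.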
15. Browser tracking advantages
• Almost real-time analysis, incremental data.
• Better repeat user tracking and individual pathway
analysis.
• Configurable, graphical reports for non-techies
– Techie still needs to configure those reports however, as
an understanding of web analytics is required
• Cut our monthly staff time down from 1.5 days to 1
hour
• Appear to be more accurate in describing the
activity of real people, but we would like to see
some independent research.
16. Issues with browser tracking
• Setup is not trivial: You need to add code to every page.
– Multiple server / ownership issues.
• Does not always work (or get full user details) if JavaScript is turned
off or cookies disallowed.
• Does not work with text-only browsers.
• Unknown compatibility with PDAs, mobiles etc.
Questions:
• Would we get different results with different hosted services?
– ABCE: industry standards for measurement
• Cookies often deleted unless user is confident in the source?
– This would affect the measurement of repeat visitors and behaviour
Political issues:
• Issues with external hosting of institutional data
• Personal data security issues with external hosting
17. Next steps
• Many private sector and public sector sites have
already moved to browser tracking.
• About 6 National Museums are currently discussing
hosted browser tracking.
• 5 Universities currently involved in a trial of
Nedstat.