Thursday, April 9, 2009

When more is definitely less

First, some background: We are creating a user behavior table that captures our web users' actions. Imagine a table made of columns (session_key, event_key, date_time_key). Events are of the type "user logs in", "user does a search", "user makes a form error", "user clicks on a result", and many others.

And I told my boss that we want to keep the granularity of user behavior (the various events) roughly consistent. To give a hypothetical and silly example: if I build a table that tracks people as they go through their day, with events "get up", "drive to work", "waste time on web", "lunch", "drive back", "pick up child from daycare", it seems like a bad idea to also have events that are much more granular, say, "take deep breath".

But why is it bad? The argument from a disk space perspective is easy, but there is something else.

The basic query against this table is "give me sessions that did event X, followed by Y, immediately followed by Z". The problem is in the "immediately" part. A query system needs to handle exceptions: event Z is not deemed immediately after Y if there are intermediate events a, b, c; yet Z is deemed immediately after Y even if there are intermediate events e, f, g, h. Further, such exception lists can vary for different pairs of events.
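To make the exception-list idea concrete, here is a minimal sketch in Python. The event names, the helper function, and the per-pair exception lists are all hypothetical, invented for illustration; a real query system would push this logic into the database.

```python
def immediately_follows(events, x, y, ignorable=frozenset()):
    """Return True if some occurrence of x is followed by y with only
    'ignorable' events in between. 'ignorable' is the per-pair exception
    list: events that do NOT break immediacy for the pair (x, y)."""
    for i, e in enumerate(events):
        if e != x:
            continue
        for later in events[i + 1:]:
            if later == y:
                return True
            if later not in ignorable:
                break  # a non-ignorable event intervened; try the next x
    return False

# Hypothetical per-pair exception lists: for ("search", "click"),
# a "hover" event does not break immediacy.
EXCEPTIONS = {
    ("search", "click"): frozenset({"hover"}),
}

session = ["login", "search", "hover", "click", "form_error"]
print(immediately_follows(session, "search", "click",
                          EXCEPTIONS.get(("search", "click"), frozenset())))
# True: the intervening "hover" is on the exception list for this pair
print(immediately_follows(session, "login", "search"))
# True: nothing intervenes
print(immediately_follows(session, "login", "click"))
# False: "search" intervenes and is not excepted for this pair
```

Notice that every extra trivial event recorded (a "hover", a "scroll") forces a decision in EXCEPTIONS for every pair of events it might fall between, which is exactly how the bookkeeping grows.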

The greater the number of trivial events that are recorded, the more involved the management of these exception lists becomes, and the more mistakes creep into the analysis.

...adding that current BI/OLAP tools cannot handle any kind of chronology, much less distinguish immediate chronology.