
I am looking for a model that determines the likelihood of a customer making a purchase after visiting my website multiple times.

What I anticipate is that there are two types of visitors: (1) those that are just checking prices, and (2) those that visit a few times before making a purchase.

All the customers, if they don't make a purchase, are effectively censored because I have no idea if they'll come back or not.

Let's say I have the following longitudinal sample data. "Time" is the time elapsed from the user's first appearance to the current record. Purchase is 0 if a purchase was not made and 1 if a purchase was made.

UserID  Time    Purchase
1       0       0
1       1       0
1       3       1
2       0       0
2       1       0
2       2       0
2       3       0
2       4       0
2       5       0
2       6       0
2       7       0
2       8       0
2       9       0
2       10      0
3       0       0
3       2       1
4       0       0
4       4       0
4       6       0
5       0       0
5       1       0
5       2       0
5       3       1

I can see a 60% customer purchase rate. But can I do better?
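As a quick check of that figure, here is a minimal sketch that recomputes the per-customer purchase rate from the sample table. The tuple layout and the function name `purchase_rate` are illustrative choices, not anything prescribed by the question.

```python
# Sample visit log from the table above: (user_id, time, purchase)
visits = (
    [(1, 0, 0), (1, 1, 0), (1, 3, 1)]
    + [(2, t, 0) for t in range(11)]          # user 2: times 0..10, never buys
    + [(3, 0, 0), (3, 2, 1)]
    + [(4, 0, 0), (4, 4, 0), (4, 6, 0)]
    + [(5, 0, 0), (5, 1, 0), (5, 2, 0), (5, 3, 1)]
)

def purchase_rate(rows):
    """Fraction of distinct users with at least one purchase record."""
    bought = {}
    for user_id, _time, purchase in rows:
        bought[user_id] = bought.get(user_id, False) or purchase == 1
    return sum(bought.values()) / len(bought)

print(purchase_rate(visits))  # 3 of 5 users purchased -> 0.6
```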

I can look at a chart of purchase likelihood as the customer "ages"...

[Figure: chart of purchase likelihood by customer age]
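It is not stated how the chart was produced. One way a curve of this kind could be estimated while respecting the censoring described earlier is a Kaplan-Meier estimate, sketched below in plain Python. The `(duration, event)` encoding is an assumption: duration is the time of each user's last record, and event is 1 if that record is a purchase.

```python
# (duration, event) per user, derived from the sample table.
# Users 2 and 4 never purchase, so they are censored at their last visit.
observations = [(3, 1), (10, 0), (2, 1), (6, 0), (3, 1)]

def kaplan_meier(obs):
    """Return [(time, survival)], survival = estimated P(no purchase by time)."""
    survival = 1.0
    curve = []
    for t in sorted({d for d, e in obs if e == 1}):
        at_risk = sum(1 for d, _ in obs if d >= t)           # still under observation
        events = sum(1 for d, e in obs if d == t and e == 1)  # purchases at time t
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

for t, s in kaplan_meier(observations):
    print(t, round(1 - s, 3))  # cumulative purchase probability by time t
```

On this tiny sample both censored users drop out after the last purchase time, so the estimate at time 3 matches the raw 60% rate. The two numbers would diverge if some users were censored earlier.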

But on a customer by customer level, I think I can do even better...

For example, customer 4 looks to be sort of poking around... there's a lot of time between revisits... so how do I incorporate the timing factor here?

And suppose I had some covariates, like gender? How would I be able to account for all of these?

1 Answer


Sounds like you are doing supervised classification, where the classes are "will buy" and "won't buy", and the features include number of visits, time between visits, and other information you have about customers.

To keep things simple, I would start with older data and assume that after some period of time, if someone has not bought, they won't. Later you can come back to dealing with censored data (using methods from survival analysis).
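A minimal sketch of that labeling rule follows. It assumes a hypothetical `observed_for` value (how long ago the user's first visit was, which is not in the sample data) and a hypothetical `cutoff` window. Users who are too recent to judge are left unlabeled rather than counted as non-buyers.

```python
def label_user(records, observed_for, cutoff):
    """Label one user for supervised training.

    records:      list of (time, purchase) pairs, time measured from first visit
    observed_for: time elapsed since the first visit (hypothetical input)
    cutoff:       window after which a non-buyer is assumed never to buy
    Returns 1 (bought within the window), 0 (did not), or None (too recent).
    """
    if any(purchase == 1 and time <= cutoff for time, purchase in records):
        return 1
    if observed_for >= cutoff:
        return 0
    return None  # still inside the window: exclude from the training set

user_1 = [(0, 0), (1, 0), (3, 1)]
user_4 = [(0, 0), (4, 0), (6, 0)]

print(label_user(user_1, observed_for=30, cutoff=14))  # 1
print(label_user(user_4, observed_for=30, cutoff=14))  # 0
print(label_user(user_4, observed_for=8, cutoff=14))   # None
```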

I would start with logistic regression, then possibly try KNN or a random forest classifier.
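To make the logistic regression step concrete, here is a dependency-free sketch fitted by batch gradient descent. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice. The two features (visit count and mean gap between visits) are computed from each user's full history in the sample table for brevity, and five rows are far too few for a meaningful fit.

```python
import math

# One row per user: (number of visits, mean gap between visits).
X = [(3, 1.5), (11, 1.0), (2, 2.0), (3, 3.0), (4, 1.0)]
y = [1, 0, 1, 0, 1]  # 1 if the user eventually purchased

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(weights, row):
    """Estimated purchase probability; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], row)))

def log_loss(weights, X, y):
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for row, label in zip(X, y):
        p = min(max(predict(weights, row), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(X)

def fit(X, y, lr=0.05, steps=2000):
    """Batch gradient descent on the mean log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = predict(weights, row) - label
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

weights = fit(X, y)
print(log_loss([0.0, 0.0, 0.0], X, y), log_loss(weights, X, y))
print(predict(weights, (3, 2.0)))  # hypothetical visitor: 3 visits, mean gap 2
```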

One other consideration is how many models to build. You could have one model that you train with data from the first visit, and use to make a prediction after the first visit. Then a second model for the second visit, etc.

Or you could make a single model with number of visits as a factor. I am leaning toward the first option, but I might try both.
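Either option can start from the same "snapshot" table, with one row per user per visit containing only what was known at that point. The sketch below builds such rows from the sample data. The field names are illustrative, and it assumes records are sorted by time and that a purchase, if any, is the last record.

```python
from collections import defaultdict

histories = {
    1: [(0, 0), (1, 0), (3, 1)],
    2: [(t, 0) for t in range(11)],
    3: [(0, 0), (2, 1)],
    4: [(0, 0), (4, 0), (6, 0)],
    5: [(0, 0), (1, 0), (2, 0), (3, 1)],
}

def snapshots(histories):
    """One row per user per non-purchase visit, labeled with the final outcome."""
    rows = []
    for user_id, records in histories.items():
        label = int(any(purchase == 1 for _, purchase in records))
        pre_purchase = [time for time, purchase in records if purchase == 0]
        for k in range(1, len(pre_purchase) + 1):
            rows.append({
                "user": user_id,
                "n_visits": k,                   # visits seen so far
                "elapsed": pre_purchase[k - 1],  # time since first visit
                "label": label,
            })
    return rows

rows = snapshots(histories)

# Option 1: one training set (and one model) per visit count.
by_visit = defaultdict(list)
for row in rows:
    by_visit[row["n_visits"]].append(row)

# Option 2: a single training set with n_visits kept as a feature.
print(len(rows), {k: len(v) for k, v in sorted(by_visit.items())})
```

With the first option the per-visit training sets shrink quickly (only user 2 has more than three non-purchase visits here), which is one practical argument for trying the single-model variant as well.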

Hope that helps!

  • How would I codify the data so I can capture the times between visits? Edit: What I mean is... do I need a variable for the time between each visit, then some interaction variable for whether the n-th visit actually occurred? – MODELMAN Jun 21 '16 at 18:00
  • Let's say you're making a prediction after n visits, so you've seen n-1 inter-visit times. You could include all n-1 of them as factors, but maybe mean and standard deviation would be enough (or something more robust, like median and IQR). – Allen Downey Jun 21 '16 at 18:22
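The summary features suggested in the last comment could be computed along these lines, using only Python's standard `statistics` module. The function name `gap_features` and the choice to return zeros when fewer than two gaps are available are assumptions made for this sketch.

```python
import statistics

def gap_features(visit_times):
    """Summarize inter-visit times for one user, given sorted visit times."""
    gaps = [b - a for a, b in zip(visit_times, visit_times[1:])]
    if len(gaps) < 2:
        # Too few gaps for spread estimates; fall back to zeros (an assumption).
        only = float(gaps[0]) if gaps else 0.0
        return {"mean": only, "std": 0.0, "median": only, "iqr": 0.0}
    q1, _q2, q3 = statistics.quantiles(gaps, n=4, method="inclusive")
    return {
        "mean": statistics.mean(gaps),
        "std": statistics.stdev(gaps),      # sample standard deviation
        "median": statistics.median(gaps),
        "iqr": q3 - q1,                     # robust measure of spread
    }

print(gap_features([0, 4, 6]))     # customer 4: gaps of 4 and 2
print(gap_features([0, 1, 2, 3]))  # customer 5: three gaps of 1
```

Because the output has a fixed length no matter how many visits a user has made, it avoids the need for a separate variable per visit plus indicators for whether the n-th visit occurred.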