1

I have a dataset of blog posts, and for each post I have the number of views. I want to extract the topics (or phrases) associated with the posts that got more views.

I am planning to divide the posts into two sets based on the number of views (one set with low view counts and the other with high view counts), then extract topics from each set using LDA and compare how they differ.

I am wondering whether this is the right approach, and whether there are other approaches that would be better or comparable.

2 Answers

1

Seems right. However, establishing causality will not be as simple as extracting keywords and noticing the differences. I would also suggest not dividing the posts. Instead, combine them, run LDA on the whole set, extract the keywords, and then analyse how the topics differ between high-view and low-view posts. By separating the posts first you introduce a sizeable bias into your model.

Himanshu Rai
  • Thanks for your feedback. To make the causal case stronger, I think I would also add how popular each topic is in general, based on Google Trends data, and the number of incoming links to each post. – user3550351 Dec 26 '16 at 15:13
0

Instead of diving into LDA directly, I would start with simpler methods such as TF-IDF and see whether they can extract keywords from each class/blog. I recently worked on a similar problem where I needed to extract topics from tweets, and I got useful results with TF-IDF as part of my method.

I would treat each blog post as an individual data point rather than merging them, so that documents can be grouped by the similarity of their words, and then extract a topic from each group. At the end you can use the view counts to compute the average views each topic has received.
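One possible way to do this grouping is sketched below. Note that k-means on TF-IDF vectors is my choice for the example and not the only option; it assumes scikit-learn and NumPy, and the data and the number of clusters are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: each post is one data point with its own view count.
posts = [
    "python pandas tutorial for data analysis",
    "pandas dataframe tricks in python",
    "my holiday trip to the mountains",
    "mountain hiking trip photos",
    "python machine learning tutorial",
    "weekend trip to the beach",
]
views = np.array([900, 1200, 50, 80, 1500, 60])

# Represent each post by its TF-IDF vector and group similar posts.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Average views per group (each group stands in for one "topic").
avg_views = {int(c): float(views[labels == c].mean()) for c in np.unique(labels)}
for cluster, mean_views in avg_views.items():
    members = [posts[i] for i in np.where(labels == cluster)[0]]
    print(cluster, round(mean_views, 1), members)
```

The number of clusters has to be chosen by you, so it is worth trying a few values and checking whether the groups make sense.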

There are also tools like LSA, which constructs a matrix based on word counts. This matrix is then reduced by SVD, which can be computationally expensive on very large matrices.

So before trying any of the heavier methods, try the simpler ones first, and move on to other methods only if the results are unsatisfactory.

Hope it helps.

Kiritee Gak