Suppose we have a data set of how long people watch YouTube, where each data point is only the raw number of minutes watched. It is known that some of those data points include situations where the viewer watched YouTube for a while, then left the room and let the videos keep playing. How would we statistically determine a duration beyond which we conclude the videos had been playing with no one present? Additionally, how many minutes should be subtracted from such an observation to obtain a more accurate viewing time?

My idea is to use outliers, i.e. any value more than 1.5 × IQR above the third quartile is deemed to have been played with no one watching. Additionally, perhaps it would make sense to use a hypothesis test, at a given confidence level, of whether the value is significantly different from the average.
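To make the idea concrete, here is a minimal sketch of the 1.5 × IQR rule applied to session lengths. The `sessions` values are made up for illustration, the function names are my own, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different fences.

```python
import statistics

def upper_fence(minutes, k=1.5):
    """Tukey upper fence: Q3 + k * IQR."""
    # quantiles(..., n=4) returns [Q1, median, Q3]; "inclusive" uses
    # linear interpolation between the sorted observations.
    q1, _, q3 = statistics.quantiles(minutes, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def flag_long_sessions(minutes, k=1.5):
    """Return True for each session longer than the upper fence."""
    fence = upper_fence(minutes, k)
    return [m > fence for m in minutes]

# Hypothetical session lengths in minutes (not real data).
sessions = [12, 25, 30, 31, 40, 45, 52, 60, 75, 480]

fence = upper_fence(sessions)
flags = flag_long_sessions(sessions)
print(f"upper fence: {fence} minutes")
print([m for m, f in zip(sessions, flags) if f])
```

This only flags sessions that are unusually long relative to the rest of the data. It says nothing about whether anyone was actually in the room.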

For the reduction of time, would the box plot imply using the upper bound (the upper fence, Q3 + 1.5 × IQR) as the value that replaces every outlier?
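What I have in mind for the replacement step would look roughly like the sketch below: every value above the upper fence is capped at the fence. The data and function names are hypothetical, and the choice of the fence as the replacement value is exactly the assumption I am asking about, not something I can justify.

```python
import statistics

def upper_fence(minutes, k=1.5):
    """Tukey upper fence: Q3 + k * IQR (inclusive quartile convention)."""
    q1, _, q3 = statistics.quantiles(minutes, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def cap_at_upper_fence(minutes, k=1.5):
    """Replace every value above the upper fence with the fence itself."""
    fence = upper_fence(minutes, k)
    return [min(m, fence) for m in minutes]

# Hypothetical session lengths in minutes (not real data).
sessions = [12, 25, 30, 31, 40, 45, 52, 60, 75, 480]

capped = cap_at_upper_fence(sessions)
print(capped)
```

Note that the fence is computed from data that still contains the suspect sessions, and every capped session ends up with the same value, so this is a rough adjustment rather than an estimate of the actual viewing time.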

NMA
  • Have you read any of the threads on the site about outliers? – gung - Reinstate Monica Aug 22 '22 at 20:28
  • What about people who legitimately watch the whole video? – Dave Aug 22 '22 at 20:46
  • @gung-ReinstateMonica I am wondering how this relates to the question? – NMA Aug 22 '22 at 20:47
  • @dave I don't think I was clear enough. It is not just for a view of one video, but for an entire YouTube streaming session, i.e. 10 videos of 3 minutes each, etc. – NMA Aug 22 '22 at 20:48
  • Then what about the people who legitimately watch for a long time? – Dave Aug 22 '22 at 20:49
  • @Dave We are looking at a case of specific data points that have been marked as incredibly long for specific users – NMA Aug 22 '22 at 20:54
  • Did you read the question & the answers? It addresses the idea of using 1.5 x IQR, eg. – gung - Reinstate Monica Aug 22 '22 at 21:13
  • @gung-ReinstateMonica Yes, thank you! I am now trying to find out how to deal with the outliers. If we are unable to simply remove them, how do we calculate a corrected estimate for them? – NMA Aug 22 '22 at 21:25
  • No purely statistical method will distinguish the causes of the long times. Do you have any data related to actual viewing times vs. nominal elapsed times? If not, then any approach you take will have a significant element of arbitrariness to it. – whuber Aug 22 '22 at 22:38
  • @whuber Yes there is data that shows what ordinary view times should be – NMA Aug 23 '22 at 11:29
  • Please describe them, then! – whuber Aug 23 '22 at 13:39

0 Answers