This sounds like a question that should have come up before, but I couldn't find it on CV.

I am trying to use a column called limit_price, the limit price of an order, in a machine learning project (you can assume I am trying to predict the average trading price of an order). This column is an upper limit for a buy order and a lower limit for a sell order. For example, an AAPL buy order with a limit price of \$142 means we are not willing to buy this stock for more than \$142.

This column either has a limit price or it has the value "Market", which means the limit price is missing since the upper / lower limit for this order was not specified.

I am wondering what model and preprocessing technique I should use to properly model this column. More broadly, I am also curious about how to model dependencies in my training set that span multiple columns.

  • This is still a float, there are no categories. The issue is that the variable is bounded. In a way no different than having a percentage variable, which is bounded by 0 and 100. There are some posts on this site about this. – user2974951 Dec 09 '22 at 08:00
  • I added more explanation now. I should have specified that when the value is missing the value is "Market", which is a category – Kaan Yolsever Dec 09 '22 at 08:02
  • A common way to recode such a variable is to create two variables from it: a binary variable indicating whether the order is "Market", and a numeric variable holding the price when "Market" = 0. See this CV thread for an in-depth discussion: How do you deal with "nested" variables in a regression model? – dipetkov Dec 10 '22 at 19:49
  • I didn't know this type of variable is called a nested variable. I am using ML techniques other than linear regression, so my question was directed more towards other models. I appreciate this though @dipetkov – Kaan Yolsever Dec 12 '22 at 07:47
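
The recoding suggested in the comments (one binary indicator plus one numeric price column) could look roughly like the sketch below. It assumes the data lives in a pandas DataFrame and that limit_price is stored as strings. The toy rows, the `side` column, the new column names `is_market` and `limit_price_num`, and the placeholder value 0.0 are all assumptions made for illustration, not something given in the question.

```python
import pandas as pd

# Hypothetical toy data; only the column name limit_price and the
# "Market" marker come from the question.
orders = pd.DataFrame({
    "side": ["buy", "sell", "buy", "sell"],
    "limit_price": ["142", "Market", "Market", "150.5"],
})

# Binary indicator: 1 when no limit was specified ("Market"), else 0.
orders["is_market"] = (orders["limit_price"] == "Market").astype(int)

# Numeric price: "Market" cannot be parsed, so errors="coerce" turns it
# into NaN; the NaN is then replaced by an arbitrary placeholder (0.0).
orders["limit_price_num"] = (
    pd.to_numeric(orders["limit_price"], errors="coerce").fillna(0.0)
)

print(orders[["side", "is_market", "limit_price_num"]])
```

The placeholder only makes sense together with the indicator: a model can use `is_market` to learn that the numeric column is meaningless for those rows. The choice of placeholder matters more for models that are sensitive to scale than for tree-based ones, and some gradient-boosting implementations accept NaN directly, in which case the fillna step could be dropped.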
