1

In some data analysis challenge (so I don't control the data given), I have a dataset made with the price of a commodity in two location (let say Japan and Korea) and at different day to relate to other data, let's say the price of oil and iron on international market.

So a typical row is like

DAY_ID | COUNTRY | PRICE | OIL_PRICE | IRON_PRICE | ...
  1    |   JPN   |  4.35 |    7.22   |    6.55    | 
  1    |   KOR   |  5.32 |    7.22   |    6.55    |
  2    |   JPN   |  3.51 |    6.38   |    4.27    |

As you notice the price of iron is the same for the two first lines since they are in the same day. Also the data are incomplete, some row are missing meaning I can have the row for one day in Japan but not in Korea.

My problem

The DAY_IDis just an identification and does not reflect any chronological order. At the moment I don't know what to do with it so I just drop this column and then train my regression model.

However I feel like that I am erasing some information that I could use since the price in Japan and Korea at the same day are correlated.

How to use the DAY_ID column?

  • 1
    What would be the purpose of the regression? Why are some values missing? – whuber Jul 28 '23 at 19:22
  • 1
    I add some information to make it clearer. – EtienneBfx Jul 29 '23 at 07:28
  • 1
    It sounds like you are looking for a multivariate regression model in which the response is the vector of (JPN,KOR) values and the model allows for the vector error terms on any given day to be correlated. The [tag:duplicate-records] tag looks inappropriate to me. – whuber Aug 01 '23 at 14:35
  • Thank you, as a beginner in this field I don't know the right world to phrase it. – EtienneBfx Aug 03 '23 at 07:30
  • 1
    "then train my regression model." What is your model training? The price of the commodity as function of the price of iron and the price of oil? What is the use case? Estimation of models in order to figure out which location is having a cheaper commodity? – Sextus Empiricus Aug 03 '23 at 08:25
  • 1
    "The DAY_ID is just an identification and does not reflect any chronological order" How can you have time series data where the time is unknown? – Sextus Empiricus Aug 03 '23 at 08:29
  • Are the oil price, iron price and other columns, always the same in the different places? – Sextus Empiricus Aug 03 '23 at 08:31
  • To answer all your question : The use case is a data analysis challenge when you want to estimate PRICE as a function of the price of everything else depending on the date and location. – EtienneBfx Aug 03 '23 at 16:16
  • This is problem is not made to be a problem about time series, that's why the creator randomize the DAY_ID – EtienneBfx Aug 03 '23 at 16:17
  • If two rows have the same DAY_ID then every data but COUNTRY are the same. They represent the price on a international market. – EtienneBfx Aug 03 '23 at 16:18
  • "as a function of the price of everything else depending on the date and location" this assumption is problematic because it discards the nature of a date variable. It is unclear why this is the case and desirable to be mentioned why this is the case. – Sextus Empiricus Aug 03 '23 at 18:21
  • I am not sure to understand your comment. The price vary from day to day, but as the organizers of the challenge doesn't want it to be about time series, they randomize the id of the days. – EtienneBfx Aug 05 '23 at 20:10
  • 1
    Ah, I see, so the reason that this is not a time series is some artificial reason, namely that it is some sort of exercise. What is the goal of the challenge? How do you score points in that challenge? – Sextus Empiricus Aug 05 '23 at 22:33
  • I got a first set of data (1500 rows) with the price for the training and a second set where price is blank (600 rows) for the submission. The metric is Spearman Correlation. – EtienneBfx Aug 06 '23 at 07:23
  • So your goal is to have a function that predicts the price based on a set of predictors and you determine performance by the the sum of squared residuals. – Sextus Empiricus Aug 06 '23 at 09:01
  • It would still be interesting to try to consider this as time series data, or use other outside information about the nature of the data. E.g. it might be that several data points have very similar errors and should be weighted less strongly in fitting the model. I remember another question about some time series where the sampling rate is not constant and the question was about how to deal with that. A solution is to treat several data points that are closer together (and are correlated) with less weight. A similar approach could work here. https://stats.stackexchange.com/questions/272958/ – Sextus Empiricus Aug 06 '23 at 09:09
  • It could be interesting to create some ordering of the data points by creating a path from point to point that minimizes the distance in between the points (using an algorithm for the travellings salesman problem), then fit the model considering a potential correlation between errors with an algorithm for solving the generalized least squares problem. Or skip the ordering and use some matrix for the distances and a function for the correlation as function of distance. Some plotting and searching may help to consider the best function. – Sextus Empiricus Aug 06 '23 at 09:18
  • Could you write your last comments as an answer so you could pretend for the bounty – EtienneBfx Aug 06 '23 at 21:57
  • What is the PRICE compared to the oil price and iron price? – Dave Aug 07 '23 at 13:45
  • PRICE is the target, the price of the final commodity. We want to understand it with respect to 32 features like for example the price of the oil or iron on international market. – EtienneBfx Aug 07 '23 at 21:19
  • So then how is this different from any other regression problem? It seems like you have some features that might be predictive of an outcome you want to predict. What concerns do you have about predicting that outcome? – Dave Aug 08 '23 at 10:34
  • @Dave the prediction is of a vector of multiple prices as function of the features. The error might be correlated and that can influence the regression. – Sextus Empiricus Aug 08 '23 at 11:30
  • @SextusEmpiricus It seems like each prediction will be a single PRICE, not a vector of multiple prices (such as those of oil and iron). Or do you have a different interpretation of the earlier comment about what the PRICE column is? – Dave Aug 08 '23 at 14:31
  • @Dave there are multiple predictions of PRICE for different countries, or at least there is a repetition of the DAY_ID variable, which allows a regression that is different from standard OLS (I assume that a mixed effects regression could deal with it). – Sextus Empiricus Aug 08 '23 at 15:12
  • @SextusEmpiricus You wouldn’t just encode the countries as $0$ and $1?$ – Dave Aug 08 '23 at 15:31
  • @Dave, yes, but the issue is whether the errors for different countries, yet the same other features, are correlated. – Sextus Empiricus Aug 08 '23 at 15:36
  • @Dave in a very concrete way, my issue is that I am dropping the DAY_ID column because I don't know what to do with it. So I am surely loosing some information. – EtienneBfx Aug 08 '23 at 20:07
  • @SextusEmpiricus I still encourage you to copy paste your comment as a solution. – EtienneBfx Aug 09 '23 at 10:31
  • @EtienneBfx I have been thinking about converting the comments in an answer, but I have no time before the bounty ends. (Maybe still tomorrow, during the grace period) – Sextus Empiricus Aug 09 '23 at 22:24

1 Answers1

-1

This is a case of insufficient features to explain the target variable. I assume that your target variable in the regression is gonna be price of the commodity. In that case,

  • you can create new features explaining the difference between two countries like amount of natural resources or value of currency etc.
  • try to collect data on what else can influence the price of the commodity, ex: labor cost, transportation cost etc.

Regression results are as good as your features, hope this helps!

Update: Binary encoding or one-hot encoding will help in including these variables in the regression. Reference: https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

  • Could you please explain what is "insufficient" about these features? For instance, if the data hypothetically were that PRICE in JPN is the average of the OIL_PRICE and IRON_PRICE and that the corresponding value in KOR is 1.00 greater on the same day, that would be easy to identify with data like these and would be a perfect fit. – whuber Jul 28 '23 at 19:25
  • Just to be clear I make the situation more simple, in my case I have 30 other features. – EtienneBfx Jul 28 '23 at 20:02
  • That was pretty evident from the ellipsis in your table. But it doesn't change anything: include COUNTRY as an explanatory factor in your model, and possibly its interactions with the other variables, too. You could even just run separate models for each country, depending on why you are doing this regression. You still need to explain that. – whuber Jul 28 '23 at 20:17
  • If its just the two locations JPN and KOR, we can easily include them in the regression using One-Hot-Encoding or Binary Encoding, which essentially is to create another feature {0,1} to indicate if the row belongs to JPN or KOR. Reference: https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/ – mugndhn Jul 31 '23 at 17:21
  • Hello, I think you read to quickly the issue I have. – EtienneBfx Aug 03 '23 at 07:33