1

Specifically I have to determine to what extend the prices of some houses in one year can be predicted from the prices of some houses in a previous year. The data set is split up into houses of various types. I have made numerical summaries of each set of data in each year and have made normal quantile plots for each type of house in each year but I don't really know where to go from here. Any help or ideas about where to start would be very helpful.

Edit: Would one way to approach this to be to show how similar or different the distributions of the earlier and later year prices are? Similar meaning easier to predict and different meaning harder to predict?

Flose
  • 135
  • 4

1 Answers1

1

Sebastian Ruder has some work on measuring the distance of datasets from your target dataset, using Bayesian Optimization:

http://ruder.io/learning-select-data/

Paper: https://arxiv.org/abs/1707.05246 "Learning to select data for transfer learning with Bayesian Optimization"

First couple of paragraphs from the blog post above:

"In Machine Learning, the traditional assumption is that the data our model is applied to is the same as the data we used for training. This assumption is proven false as soon as we move into the real world: many of the data sources we encounter will be very different than our original training data (same meaning here that it comes from the same distribution). In practice, this causes the performance of our model to deteriorate significantly.

"Domain adaptation is a prominent approach to transfer learning that can help to bridge this discrepancy between the training and test data. Domain adaptation methods typically seek to identify features that are shared between the domains or learn representations that are general enough to be useful for both domains. In this blog post, I will discuss the motivation for, and the findings of the recent paper that I published with Barbara Planck. In it, we outline a complementary approach to domain adaptation – rather than learning a model that can adapt between the domains, we will learn to select data that is useful for training our model."

Hugh Perkins
  • 4,697