Find formula for a known dataset

Question

I am trying to figure out how the computer in this game works.

In short: it's like normal Monopoly, except that instead of owning a property completely, everybody can buy shares in the property (max 9 available). If a player lands on this property and you own some shares, you also profit from the rent paid. Conversely, if you own some shares and you land on this property, you pay less rent to the owner. A small computer holds information about each player's number of shares, displays the value of each share (which goes up and down slightly now and again), calculates rent, etc.

I already figured out the formula for the price of buying a share. The price goes up when fewer shares are available, when the owner owns the whole street, or when houses are built on the property, for example. Now I want to know the formula for calculating the rent. There are a few parameters that "could" influence the result (I do not know for sure whether every variable is in the formula, of course):

the starting value of the property
the number of shares in play (or free shares, since the max is 9?)
the number of shares the current player holds when he/she lands on the property
the number of houses that have been built on the property
whether the owner owns the complete "street" (has a monopoly for this "color")

With a fixed set of parameters, the computer gives a fixed result, so there MUST be a formula somehow. For example, here are a few values:

Property value	Shares in play (1-9)	Player owned shares (0-4)	Monopoly(yes/no)	Rent to pay
60	1	0	no	4
60	2	0	no	4
60	2	1	no	1
60	5	0	no	5
160	7	0	no	21

Property value	Shares in play (1-9)	Player owned shares (0-4)	Monopoly(yes/no)	Rent to pay
320	1	0	no	56
320	2	0	no	56
320	2	1	no	24
320	3	0	no	54
320	3	1	no	32
320	4	0	no	56
320	4	1	no	36
320	4	2	no	20
320	5	0	no	55
320	5	1	no	36
320	5	2	no	24
320	6	0	no	54
320	6	1	no	40
320	6	2	no	28
320	6	3	no	15
320	7	0	no	56
320	7	1	no	42
320	7	2	no	30
320	7	3	no	20
320	8	0	no	56
320	8	1	no	42
320	8	2	no	30
320	8	3	no	20
320	8	4	no	12
320	9	0	no	54
320	9	1	no	40
320	9	2	no	28
320	9	3	no	18
320	9	4	no	15
320	1	0	yes	112
320	2	0	yes	112
320	2	1	yes	49
320	3	0	yes	111
320	3	1	yes	64
320	4	0	yes	112
320	4	1	yes	72
320	4	2	yes	42
320	5	0	yes	110
320	5	1	yes	76
320	5	2	yes	48
320	6	0	yes	108
320	6	1	yes	80
320	6	2	yes	56
320	6	3	yes	33
320	7	0	yes	112
320	7	1	yes	84
320	7	2	yes	60
320	7	3	yes	40
320	8	0	yes	112
320	8	1	yes	84
320	8	2	yes	60
320	8	3	yes	40
320	8	4	yes	28
320	9	0	yes	108
320	9	1	yes	80
320	9	2	yes	63
320	9	3	yes	42
320	9	4	yes	30

I have a lot more data than this of course (I can even get ALL possible data, but that is beside the point). How can I find the formula for such a set? I tried entering all these values in Excel, plotting graphs to find a "line", etc., but to no avail. What would be the best way?

The first, second, fourth, sixth and ninth lines are not encouraging for a simple formula. — Henry, Nov 18 '23 at 00:30
Hi @Henry, thanks for your reply! I think the formula applies rounding somewhere, which results in these "strange" numbers. I could expand the table with more values, but then what? I think my question is a broader one: how do I approach this problem (I don't have a degree in math)? I have a feeling it would be easier to start with the "bigger" numbers (to avoid the large rounding problems with smaller numbers). — , Nov 18 '23 at 10:32
My issue is that it is not clear from those lines whether increases in shares in play tend to increase or reduce rent (in each case with no player owned shares, no houses and no monopoly) — Henry, Nov 18 '23 at 10:39
I expanded the table with some more data for a single property in an attempt to answer your question. A rule of the game is that I only need to pay rent if I am NOT the owner of the property. Ownership only changes if you buy extra shares and afterwards own more shares than the current owner (so it is possible to have the same amount of shares, but you still have to pay rent since you are not yet the owner). — , Nov 18 '23 at 10:54
Maybe try symbolic regression: https://www.r-bloggers.com/2019/04/symbolic-regression-genetic-programming-or-if-kepler-had-r/ — jblood94, Nov 20 '23 at 13:41
Thanks for the suggestion! Symbolic regression looks to be the term I am looking for, but I have a hard time finding software to "predict the formula based on this dataset". I looked at (and tried) PySR and feyn, but I failed because of a lack of knowledge. — Wietse, Nov 24 '23 at 21:40
This question was posed many years ago at https://stats.stackexchange.com/questions/10363/. — whuber, Nov 29 '23 at 22:32

R Carnell · Answer 1 · 2023-12-01T22:18:01.667

A statistical process to follow would be to use linear regression. You can do this in Excel, R, or Python.
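For readers working in Python, the simple linear model can be sketched with the standard library alone. This is a minimal illustration rather than the code used in this answer: `load_rows` and `fit_ols` are hypothetical helpers, the rows are the 63 observations from the question, and the fit solves the normal equations by Gaussian elimination. On that data it should reproduce the `lm()` coefficients shown in the R output further down.

```python
# Sketch: ordinary least squares via the normal equations, standard library only.
# load_rows and fit_ols are hypothetical helpers, not part of any package.

def load_rows():
    """Return (value, shares, owned, monopoly, rent) tuples from the question."""
    rows = [(60, 1, 0, 0, 4), (60, 2, 0, 0, 4), (60, 2, 1, 0, 1),
            (60, 5, 0, 0, 5), (160, 7, 0, 0, 21)]
    # For value 320 the table lists shares 1..9 with owned 0..min(4, shares // 2).
    pairs = [(s, o) for s in range(1, 10) for o in range(min(4, s // 2) + 1)]
    rent_no = [56, 56, 24, 54, 32, 56, 36, 20, 55, 36, 24, 54, 40, 28, 15,
               56, 42, 30, 20, 56, 42, 30, 20, 12, 54, 40, 28, 18, 15]
    rent_yes = [112, 112, 49, 111, 64, 112, 72, 42, 110, 76, 48, 108, 80, 56, 33,
                112, 84, 60, 40, 112, 84, 60, 40, 28, 108, 80, 63, 42, 30]
    rows += [(320, s, o, 0, r) for (s, o), r in zip(pairs, rent_no)]
    rows += [(320, s, o, 1, r) for (s, o), r in zip(pairs, rent_yes)]
    return rows


def fit_ols(X, y):
    """Solve (X'X) b = X'y with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rows = load_rows()
X = [[1.0, v, s, o, m] for v, s, o, m, _ in rows]  # leading 1.0 is the intercept
y = [float(r) for *_, r in rows]
coef = fit_ols(X, y)
for name, b in zip(["(Intercept)", "value", "shares", "owned", "monopoly"], coef):
    print(f"{name:12s} {b: .5f}")
```

Solving the normal equations directly is fine for five well-scaled columns like these; for larger or nearly collinear designs a QR-based routine from a numerical library would be the safer choice.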

Typical steps:

Try a simple linear model
- Examine the residuals and determine if there are necessary variable transforms
Try to engineer variables
- difference in shares between own and outstanding
- percent of outstanding shares owned
- percent of total possible shares outstanding
Try a variable transform on the dependent variable (log)
Try a non-linear model
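The first engineered variable, the difference between shares in play and the player's own shares, can be inspected directly before fitting anything. The sketch below is an exploratory check on the value-320 rows from the question, not a claim about how the game's computer works: it tests whether each observed rent is a whole multiple of that difference, which is one way per-share integer rounding would show up. The helper name `per_share` and the data layout are illustrative assumptions.

```python
# Exploratory check on the value-320 rows from the question:
# is rent always a whole multiple of (shares in play - player-owned shares)?

# The table lists shares 1..9 with owned 0..min(4, shares // 2), in this order.
pairs = [(s, o) for s in range(1, 10) for o in range(min(4, s // 2) + 1)]
rent_no = [56, 56, 24, 54, 32, 56, 36, 20, 55, 36, 24, 54, 40, 28, 15,
           56, 42, 30, 20, 56, 42, 30, 20, 12, 54, 40, 28, 18, 15]
rent_yes = [112, 112, 49, 111, 64, 112, 72, 42, 110, 76, 48, 108, 80, 56, 33,
            112, 84, 60, 40, 112, 84, 60, 40, 28, 108, 80, 63, 42, 30]


def per_share(shares, owned, rent):
    """Return rent per share held by others, or None if it is not a whole number."""
    others = shares - owned
    return rent // others if rent % others == 0 else None


tables = {}
for label, rents in (("no monopoly", rent_no), ("monopoly", rent_yes)):
    tables[label] = {(s, o): per_share(s, o, r) for (s, o), r in zip(pairs, rents)}
    misses = [key for key, q in tables[label].items() if q is None]
    print(label, "- rows not divisible:", misses)
    print("  per-share amounts with owned = 0:",
          [tables[label][(s, 0)] for s in range(1, 10)])
```

Every row in these two blocks passes the check. With no player-owned shares, the per-share amounts equal 56 (or 112 with a monopoly) divided by the number of shares in play and rounded down. Whether that pattern carries over to other properties, or explains the rows where the player owns shares, is not something this data settles.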

A good, but not perfect, model is:

$$\mathbf{Rent} = \exp\left(0.96 + 0.01\,\mathbf{Value} - 2.42\,\frac{\mathbf{owned}}{\mathbf{shares}} - 0.244\,\frac{\mathbf{shares}}{9} + 0.71\,\mathbf{Monopoly}\right)$$
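To see what this fitted formula implies, it can be evaluated on a few rows from the question. The following sketch uses the rounded coefficients from the equation above; `predicted_rent` is an illustrative name, not something from the original analysis.

```python
import math


def predicted_rent(value, shares, owned, monopoly):
    """Evaluate the log-linear model with the rounded coefficients quoted above."""
    log_rent = (0.96 + 0.01 * value
                - 2.42 * owned / shares
                - 0.244 * shares / 9
                + 0.71 * monopoly)
    return math.exp(log_rent)


# (value, shares, owned, monopoly, observed rent) taken from the question's tables
samples = [(320, 1, 0, 0, 56), (320, 9, 4, 0, 15),
           (320, 9, 4, 1, 30), (160, 7, 0, 0, 21)]
for value, shares, owned, monopoly, observed in samples:
    fitted = predicted_rent(value, shares, owned, monopoly)
    print(f"value={value} shares={shares} owned={owned} monopoly={monopoly}: "
          f"fitted {fitted:.1f}, observed {observed}")

# The monopoly term acts as a constant multiplier in this model.
print("monopoly multiplier:", round(math.exp(0.71), 2))
```

The multiplier exp(0.71) ≈ 2.03 is close to the factor of two visible in the table, where monopoly rents are roughly double the corresponding non-monopoly rents.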

To get a better model this way, you need more examples of different values and some examples with houses.

Update

I also tried the symbolic regression procedure that was mentioned in the comments and in other answers. The results you get are very dependent on the allowed functions and operators. I could not easily beat the linear regression approach with the data presented.

Here is R code to illustrate:

dat <- structure(list(value = c(60, 60, 60, 60, 160, 320, 320, 320, 
  320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 
  320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 
  320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 
  320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 320, 
  320, 320, 320), shares = c(1, 2, 2, 5, 7, 1, 2, 2, 3, 3, 4, 4, 
  4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 
  9, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 
  8, 8, 8, 8, 9, 9, 9, 9, 9), owned = c(0, 0, 1, 0, 0, 0, 0, 1, 
  0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 
  0, 1, 2, 3, 4, 0, 0, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 
  1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4), houses = c(0, 0, 0, 0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), monopoly = c(0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
  rent = c(4, 4, 1, 5, 21, 56, 56, 24, 54, 32, 56, 36, 20, 
  55, 36, 24, 54, 40, 28, 15, 56, 42, 30, 20, 56, 42, 30, 20, 
  12, 54, 40, 28, 18, 15, 112, 112, 49, 111, 64, 112, 72, 42, 
  110, 76, 48, 108, 80, 56, 33, 112, 84, 60, 40, 112, 84, 60, 
  40, 28, 108, 80, 63, 42, 30)), class = "data.frame", row.names = c(NA, 
  -63L))
# no examples provided with houses
lm1 <- lm(rent ~ value + shares + owned + monopoly, data = dat)
summary(lm1)
#> 
#> Call:
#> lm(formula = rent ~ value + shares + owned + monopoly, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -25.488  -5.911  -2.093   4.843  21.256 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -8.52642    6.10570  -1.396   0.1679
#> value         0.18937    0.02116   8.950  1.6e-12 ***
#> shares        1.39035    0.60704   2.290   0.0257 *
#> owned       -17.71002    1.16736 -15.171  < 2e-16 ***
#> monopoly     37.34621    2.63746  14.160  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.05 on 58 degrees of freedom
#> Multiple R-squared:  0.9056, Adjusted R-squared:  0.8991 
#> F-statistic: 139.1 on 4 and 58 DF,  p-value: < 2.2e-16
plot(lm1, which = 1)


# try to engineer some variables
dat$sharediff <- dat$shares - dat$owned
dat$sharepct <- dat$shares / 9
dat$sharediffpct <- dat$sharediff / 9
dat$pctowned <- dat$owned / dat$shares
# try a multiplicative model
lm4 <- lm(log(rent) ~ value + pctowned + sharepct + monopoly, data = dat)
summary(lm4)
#> 
#> Call:
#> lm(formula = log(rent) ~ value + pctowned + sharepct + monopoly, 
#>     data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.30260 -0.10133  0.01289  0.07176  0.65647 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.9601256  0.0904284  10.618 3.21e-15 ***
#> value        0.0101118  0.0003152  32.078  < 2e-16 ***
#> pctowned    -2.4199149  0.1020827 -23.705  < 2e-16 ***
#> sharepct    -0.2442338  0.0736702  -3.315  0.00158 **
#> monopoly     0.7076224  0.0392421  18.032  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1496 on 58 degrees of freedom
#> Multiple R-squared:  0.9739, Adjusted R-squared:  0.9721 
#> F-statistic: 540.3 on 4 and 58 DF,  p-value: < 2.2e-16
plot(lm4, which = 1, col = dat$shares, pch = 19)


################################################################################
require(gramEvol)
#> Loading required package: gramEvol
#> Warning: package 'gramEvol' was built under R version 4.3.2
ruleDef <- list(expr = gramEvol::grule(op(expr, expr), func(expr), var),
                func = gramEvol::grule(exp, sqrt),
                op = gramEvol::grule('+', '-', '*', '/'),
                var = gramEvol::grule(dat$value, dat$shares, dat$owned, dat$monopoly))
grammarDef <- gramEvol::CreateGrammar(ruleDef)
grammarDef
#> <expr> ::= <op>(<expr>, <expr>) | <func>(<expr>) | <var>
#> <func> ::= exp | sqrt
#> <op>   ::= "+" | "-" | "*" | "/"
#> <var>  ::= dat$value | dat$shares | dat$owned | dat$monopoly
set.seed(123)
gramEvol::GrammarRandomExpression(grammarDef, 6)
#> [[1]]
#> expression(sqrt(exp(dat$value + dat$owned)))
#> 
#> [[2]]
#> expression(exp(exp(dat$value/dat$monopoly)) + dat$owned)
#> 
#> [[3]]
#> expression((dat$value - sqrt(dat$owned))/dat$monopoly)
#> 
#> [[4]]
#> expression(dat$monopoly)
#> 
#> [[5]]
#> expression(exp(dat$shares))
#> 
#> [[6]]
#> expression(dat$shares + dat$value)
SymRegFitFunc <- function(expr) {
  suppressWarnings(result <- eval(expr))
  if (any(is.nan(result)))
    return(Inf)
  return(mean((dat$rent - result)^2))
}
SymRegFitFunc(expression(exp(dat$shares)))
#> [1] 11838780
ge <- gramEvol::GrammaticalEvolution(grammarDef, SymRegFitFunc, 
                                     terminationCost = 0.1, 
                                     iterations = 2500, 
                                     max.depth = 5)
ge
#> Grammatical Evolution Search Results:
#>   No. Generations:  2500 
#>   Best Expression:  sqrt(dat$value) * (exp(dat$monopoly) + dat$monopoly) + dat$owned 
#>   Best Cost:        693.648325144468
yhat <- eval(ge$best$expressions)
resid <- dat$rent - yhat
plot(yhat, resid, xlab = "Predicted Rent", ylab = "Residuals")


# error of linear model (residuals are on the log(rent) scale)
var(lm4$residuals)
#> [1] 0.02093519
mean(lm4$residuals^2)
#> [1] 0.02060289
# error of symbolic regression (residuals are on the rent scale)
var(resid)
#> [1] 600.0785
mean(resid^2)
#> [1] 693.6483

Created on 2023-12-01 with reprex v2.0.2
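One caveat when reading the two error figures side by side: the `lm4` residuals are on the log(rent) scale, while the symbolic-regression residuals are on the rent scale. A rough way to put them on the same footing is to back-transform the log-linear predictions and compute the mean squared error in rent units. The sketch below does this in Python with the coefficients printed by `summary(lm4)`. The data layout and the function name `lm4_rent` are illustrative assumptions, and exponentiating a log-scale fit is only one of several possible back-transforms.

```python
import math

# (value, shares, owned, monopoly, rent) rows from the question
rows = [(60, 1, 0, 0, 4), (60, 2, 0, 0, 4), (60, 2, 1, 0, 1),
        (60, 5, 0, 0, 5), (160, 7, 0, 0, 21)]
# For value 320 the table lists shares 1..9 with owned 0..min(4, shares // 2).
pairs = [(s, o) for s in range(1, 10) for o in range(min(4, s // 2) + 1)]
rent_no = [56, 56, 24, 54, 32, 56, 36, 20, 55, 36, 24, 54, 40, 28, 15,
           56, 42, 30, 20, 56, 42, 30, 20, 12, 54, 40, 28, 18, 15]
rent_yes = [112, 112, 49, 111, 64, 112, 72, 42, 110, 76, 48, 108, 80, 56, 33,
            112, 84, 60, 40, 112, 84, 60, 40, 28, 108, 80, 63, 42, 30]
rows += [(320, s, o, 0, r) for (s, o), r in zip(pairs, rent_no)]
rows += [(320, s, o, 1, r) for (s, o), r in zip(pairs, rent_yes)]


def lm4_rent(value, shares, owned, monopoly):
    """Back-transform the lm4 fit: exponentiate the predicted log(rent)."""
    return math.exp(0.9601256 + 0.0101118 * value
                    - 2.4199149 * owned / shares
                    - 0.2442338 * shares / 9
                    + 0.7076224 * monopoly)


errors = [rent - lm4_rent(v, s, o, m) for v, s, o, m, rent in rows]
mse = sum(e * e for e in errors) / len(errors)
print(f"rows: {len(rows)}, rent-scale MSE of the log-linear model: {mse:.1f}")
print("symbolic-regression MSE reported above: 693.6")
```

On this data the back-transformed error should come out well below the 693.6 reported for the symbolic-regression expression, which is consistent with the conclusion that the regression approach was not easily beaten.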

Thank you for your input, I will try this somewhere in the upcoming days and get back to you! — Wietse, Nov 28 '23 at 13:39

score 1 · Answer 2 · answered Nov 29 '23 at 21:55

Why not use symbolic regression, for example with PySR, a symbolic regression package? If you can generate enough observations, record every factor that influences the result, and the data contain no noise, then this approach to finding formulas would be my go-to.

Here's a simple example of how you can use PySR to fit a model:

import numpy as np
from pysr import PySRRegressor  # the older `pysr(X, y, ...)` function is deprecated

# Generate some example data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 3 * X.squeeze() + np.random.normal(0, 0.1, 100)

# Configure the search and fit it to the data
model = PySRRegressor(niterations=100, binary_operators=["+", "-", "*", "/"])
model.fit(X, y)

# Print the discovered equations
print("Discovered equations:")
print(model)

This should recover, up to the added noise, the y = 3x relationship used to generate the data.
