1

I am a CS student, who is currently working on something related to the Set Covering Problem. To quickly describe it: There is a universe set $U = \{1,2, \dots , n\}$ and a set of subsets of $U$ called $S$, with each subset in $S$ having some cost $c$. The goal is to find a selection of subsets whose Union is equal to $U$ with minimal cost.

Im currently trying to augment test instances by assigning new costs to subsets. And ideally i want the sampling of these costs to fullfil two conditions

  • the sum of all costs should equal 1
  • i want to be able to predetermine the standard deviation of the costs
  • all values should lie between 0 and 1

Is there a particular algorithm/prob. distribution that allows me to achieve this?

  • The Dirichlet family of distributions immediately comes to mind -- but there are many other approaches one could take. – whuber Oct 18 '23 at 11:08
  • Hmm I mainly struggle with coming up with a procedure that fulfills all three conditions at the end – elson1608 Oct 18 '23 at 12:53

1 Answers1

0

With the first and third constraints, the (sample) standard deviation will be bounded by:

$$0\leq\sigma\leq1/\sqrt{n}$$

For a sample that meets all three conditions, first calculate:

$$\alpha=\frac{-\ln(\sigma\sqrt{n})}{\sqrt{n}}$$

(I determined this through trial and error, but it seems to work well for a large range of $n$ and over valid values of $\sigma$.)

Next, get $X$ as $n$ samples from $\text{Gamma}(\alpha,\alpha)$:

$$x_i\sim\text{Gamma}(\alpha,\alpha)$$

Then shift and scale $X$ to get $Y$ such that the sum of $Y$ is 1 and the standard deviation is $\sigma$.

Finally, sample $X$ and calculate $Y$ in this manner until $Y\in[0,1]$.

Demonstrating with R:

Define a function that implements the above procedure:

f <- function(s, n) {
  a <- -log(s*sqrt(n))/sqrt(n)
  repeat({
    x <- as.numeric(1/n + s*scale(rgamma(n, a, a)))
    if (max(abs(range(x) - 0.5)) < 0.5) return(x)
  })
}

set.seed(1454902277) n <- 20

Test 10 random values for $\sigma$ (s):

(s <- runif(10, 0, 1/sqrt(n)))
#>  [1] 0.08615341 0.08472108 0.14550804 0.22037651 0.06881280 0.17247496 0.18822976 0.01613209 0.03530513 0.18090801

Get a sample for each s:

(x <- mapply(f, s, n))
#>              [,1]        [,2]        [,3]         [,4]       [,5]        [,6]        [,7]       [,8]       [,9]       [,10]
#>  [1,] 0.007907081 0.336322174 0.012343162 0.9862760107 0.01741199 0.008957655 0.005227555 0.04005786 0.03090365 0.007536068
#>  [2,] 0.008356469 0.005858934 0.009945213 0.0007192069 0.01646632 0.008956147 0.005291920 0.03952673 0.11196374 0.007464323
#>  [3,] 0.133484604 0.010552570 0.009947305 0.0007192069 0.01832309 0.012025164 0.005227555 0.04555531 0.03508344 0.007464323
#>  [4,] 0.371463293 0.006317710 0.009944963 0.0007192069 0.23854714 0.008956148 0.005237011 0.03728021 0.05216805 0.817698791
#>  [5,] 0.019413860 0.005858742 0.010399976 0.0007192069 0.02320560 0.782342445 0.015729396 0.04073632 0.14619246 0.007464323
#>  [6,] 0.009732502 0.006762692 0.663209568 0.0007192069 0.04776607 0.008956147 0.005227775 0.05461365 0.03085467 0.008121243
#>  [7,] 0.007998791 0.021090367 0.009944958 0.0007192069 0.01739560 0.012547467 0.012556406 0.04314252 0.03299831 0.007464323
#>  [8,] 0.039310453 0.016720323 0.009946820 0.0007192069 0.01660457 0.033639421 0.019469028 0.08705585 0.03094422 0.008041811
#>  [9,] 0.007965665 0.049350931 0.009948243 0.0007192069 0.03387258 0.008957000 0.023901285 0.05187458 0.03554667 0.007464323
#> [10,] 0.007912343 0.008278865 0.009945472 0.0007192069 0.11984135 0.008956147 0.005227556 0.03901123 0.03162774 0.007479019
#> [11,] 0.008229556 0.153381236 0.009977766 0.0007192069 0.03651738 0.008956275 0.005227555 0.08323189 0.03226310 0.007464547
#> [12,] 0.007928077 0.017472852 0.009945005 0.0007192069 0.01646084 0.009611041 0.005578243 0.07968621 0.03134422 0.007464325
#> [13,] 0.086368696 0.005863440 0.037301528 0.0007192069 0.04227043 0.020285243 0.005239877 0.06607766 0.03135961 0.007464390
#> [14,] 0.008513847 0.052577711 0.009944965 0.0007192069 0.02730623 0.010126587 0.005228076 0.03785879 0.03123822 0.007470939
#> [15,] 0.008603736 0.005878002 0.011055799 0.0007192069 0.01703140 0.011867449 0.005340251 0.03937878 0.12884401 0.007481898
#> [16,] 0.129920006 0.026579643 0.074321211 0.0007192069 0.01658535 0.008956149 0.005227555 0.03760609 0.04971147 0.007464323
#> [17,] 0.081442291 0.198799552 0.009944958 0.0007192069 0.02084913 0.008977050 0.005227555 0.03808770 0.05535147 0.007464323
#> [18,] 0.008343874 0.051532733 0.062029416 0.0007782642 0.23988052 0.008956149 0.005250849 0.05144209 0.03131703 0.046598064
#> [19,] 0.008459360 0.008866054 0.009955088 0.0007192069 0.01646219 0.008956147 0.849356811 0.04752063 0.03935074 0.007464323
#> [20,] 0.038645497 0.011935469 0.009948584 0.0007192070 0.01720219 0.009014169 0.005227741 0.04025591 0.03093716 0.007464323

Check that the conditions are met:

all.equal(colSums(x), rep(1, 10))
#> [1] TRUE
all.equal(sapply(1:10, \(i) sd(x[,i])), s)
#> [1] TRUE
jblood94
  • 1,459
  • Thank you very much for your answer. Is there any particular reson why you chose the Gamma Distribution? Also for me it is often stuck for a long time in the repeat loop – elson1608 Oct 19 '23 at 14:25
  • 1
    It's a limited version of a Dirichlet distribution. For a more general and flexible construction see https://stats.stackexchange.com/questions/36093. The price you pay for that greater generality is that you need to relate the expected variance of the components to the Dirichlet parameters so you can choose those parameters to match the intended variance. – whuber Oct 20 '23 at 13:26
  • 1
    @elson1608, this answer has a lot of room for improvement, but first, can you clarify what is meant by "i want to be able to predetermine the standard deviation of the costs"? Specifically, do you want to specify the standard deviation of a sample, or do you want to specify the standard deviation of the distribution(s) from which the $c_{i=1..n}$ are drawn? Also, do you want the $c_{i=1..n}$ to be identically distributed? – jblood94 Oct 23 '23 at 10:27
  • At the end i want that my sample has a specific standard deviation. – elson1608 Oct 23 '23 at 11:00
  • @elson1608: Please add clarifications as an edit to the post and not only as a comment! – kjetil b halvorsen Oct 26 '23 at 18:57