3

I am trying to make a python program that will calculate a result based on a formula, given factors and input data table. I am having troubles making it and I couldn't find a solution on my own nor a similar problem on SO. I hope someone can help!

Below I am showing my sample data in order to understand if I am doing something wrong.

My input data looks like this. I have a number of cars (N_cars) on a given length of the road (l) and their average speed (v).

input_columns = ['l', 'N_cars', 'v']
input_data = [[3.5, 1000, 100], [5.7, 500, 110], 
              [10, 367, 110], [11.1, 1800, 95],
              [2.8, 960, 105], [4.7, 800, 120], 
              [10.4, 103, 111], [20.1, 1950, 115]]
        
input_df = pd.DataFrame(input_data, columns=input_columns)
input_df

      l  N_cars    v
0   3.5    1000  100
1   5.7     500  110
2  10.0     367  110
3  11.1    1800   95
4   2.8     960  105
5   4.7     800  120
6  10.4     103  111
7  20.1    1950  115

I also know the factors needed for the formula for each category of cars, and I know the % of each category of total category. I also have different options for each category (3 options that I have here are just an example, there are many more options).

factors_columns = ['category', 'category %', 'option', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
factors_data = [['A', 58, 'opt_1', 0.000011, 0.23521, 0.93847, 0.39458, 0.00817, 0.24566, 0.0010, 0],
                ['A', 58, 'opt_2', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
                ['A', 58, 'opt_3', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
                ['B', 22, 'opt_1', 0.002452, 0.48327, 0.83773, 0.92852, 0.00871, 0.29568, 0.0009, 0.02],
                ['B', 22, 'opt_2', 0.002899, 0.49327, 0.83773, 0.92852, 0.00871, 0.30468, 0.0009, 0.02],
                ['B', 22, 'opt_3', 0.002452, 0.48327, 0.83773, 0.92852, 0.00771, 0.29568, 0.0119, 0.01],
                ['C', 17, 'opt_1', 0.082583, 0.39493, 0.02462, 0.82714, 0.00918, 0.28572, 0.0012, 0],
                ['C', 17, 'opt_2', 0.072587, 0.35493, 0.02852, 0.82723, 0.00912, 0.29572, 0.0018, 0], 
                ['C', 17, 'opt_3', 0.082583, 0.39493, 0.02852, 0.82714, 0.00962, 0.28572, 0.0012, 0.01], 
                ['D', 3, 'opt_1', 0.018327, 0.32342, 0.82529, 0.92752, 0.00988, 0.21958, 0.0016, 0],
                ['D', 3, 'opt_2', 0.014427, 0.32342, 0.82729, 0.92752, 0.00968, 0.22558, 0.0026, 0],
                ['D', 3, 'opt_3', 0.018327, 0.32342, 0.82729, 0.94452, 0.00988, 0.21258, 0.0016, 0]]
        
factors_df = pd.DataFrame(factors_data, columns=factors_columns)
factors_df

   category  category % option         a        b        c        d        e        f       g     h
0         A          58  opt_1  0.000011  0.23521  0.93847  0.39458  0.00817  0.24566  0.0010  0.00
1         A          58  opt_2  0.000011  0.23521  0.93145  0.39458  0.00467  0.24566  0.0010  0.00
2         A          58  opt_3  0.000011  0.23521  0.93145  0.39458  0.00467  0.24566  0.0010  0.00
3         B          22  opt_1  0.002452  0.48327  0.83773  0.92852  0.00871  0.29568  0.0009  0.02
4         B          22  opt_2  0.002899  0.49327  0.83773  0.92852  0.00871  0.30468  0.0009  0.02
5         B          22  opt_3  0.002452  0.48327  0.83773  0.92852  0.00771  0.29568  0.0119  0.01
6         C          17  opt_1  0.082583  0.39493  0.02462  0.82714  0.00918  0.28572  0.0012  0.00
7         C          17  opt_2  0.072587  0.35493  0.02852  0.82723  0.00912  0.29572  0.0018  0.00
8         C          17  opt_3  0.082583  0.39493  0.02852  0.82714  0.00962  0.28572  0.0012  0.01
9         D           3  opt_1  0.018327  0.32342  0.82529  0.92752  0.00988  0.21958  0.0016  0.00
10        D           3  opt_2  0.014427  0.32342  0.82729  0.92752  0.00968  0.22558  0.0026  0.00
11        D           3  opt_3  0.018327  0.32342  0.82729  0.94452  0.00988  0.21258  0.0016  0.00

GOAL: For each option (opt_1, opt_2, opt_3), I have to calculate the result based on this formula (factors are taken from the factors table, but v is coming from the input table):

formula = ( (a*v*v) + (b*v) + c + (d/v) )  /  ( (e*v*v) + (f*v) + g) * (1 - h)
result = l * N_cars * formula

BUT, I have to take into account the % of each category of cars. For each row of the input_df I have to make calculations 3 times, for each of the 3 options. For example, for the index 0 of input_df, I have 1000 cars, v=100 and l=3.5, the output should be something like this:

for opt_1:
result = 3.5 * 1000 * ( (58% of category A {formula for index 0 of factors_df}) + 
(22% of category B {formula for index 3 of factors_df) + 
(17% of category C {formula for index 6 of factors_df}) +
(3% of category D {formula for index 9 of factors_df}) )

for opt_2:
result = 3.5 * 1000 * ( (58% of category A {formula for index 1 of factors_df}) + 
(22% of category B {formula for index 4 of factors_df) + 
(17% of category C {formula for index 7 of factors_df}) +
(3% of category D {formula for index 10 of factors_df}) )

for opt_3:
result = 3.5 * 1000 * ( (58% of category A {formula for index 2 of factors_df}) + 
(22% of category B {formula for index 5 of factors_df) + 
(17% of category C {formula for index 8 of factors_df}) +
(3% of category D {formula for index 11 of factors_df}) )

So, as an output, for each of the rows in input_df, I should have 3 results, each for one of the 3 options.

I can do the calculation manually for each step, but what I am having troubles with is to make a loop that does it automatically for each input row and all 3 options and then passes to the next input row and so on until the last input row.

If anyone has an idea how to solve this easily, please do help!

ctrl_z
  • 254
  • 1
  • 14
  • It's not clear what you mean by "I have to take into account the % of each categories". Percentage of *what*? The total number of rows in that category? – ddejohn Feb 06 '22 at 18:54
  • that percantages show how the cars are divided. So If I have 100 as N_cars, I know that 58 of them are of category A, 22 of category B, 17 of category C and 3 of category D – ctrl_z Feb 06 '22 at 18:56
  • I find the question very clear and well written. Don't understand the downvotes. Please comment reasons before downvoting. – Krishnakanth Allika Feb 06 '22 at 19:21
  • 1
    Does https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas help? – Karl Knechtel Feb 07 '22 at 01:08
  • 1
    While I think OP could benefit from learning how to use `map`, `apply` etc, I think this is a complicated enough use-case to warrant keeping this question open, as it involves using `apply` across two different dataframes, one of which is grouped. – ddejohn Feb 07 '22 at 01:44
  • @ddejohn, I did it before and re-did it again :) – ctrl_z Feb 24 '22 at 09:45

3 Answers3

8

Solution

Not sure what your expected results are, but I believe this does what you're asking for:

def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()


groups = factors_df.rename(columns={"category %": "pct"}).groupby("option")
result = input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)

Output:

In [5]: input_df.join(result)
Out[5]:
      l  N_cars    v         opt_1         opt_2         opt_3
0   3.5    1000  100   5411.685077   5115.048256   5500.985916
1   5.7     500  110   4425.339734   4169.893681   4483.595803
2  10.0     367  110   5698.595376   5369.652565   5773.612841
3  11.1    1800   95  30820.717985  29180.106606  31384.785443
4   2.8     960  105   4165.270216   3930.726187   4226.877893
5   4.7     800  120   5860.057879   5506.509637   5919.496692
6  10.4     103  111   1663.960420   1567.455541   1685.339848
7  20.1    1950  115  60976.735053  57375.300546  61685.075902

Explanation

The first step is to group factors_df by option. Just to show what that looks like:

In [6]: groups.apply(print)
  category  pct option         a        b  ...        d        e        f       g     h
0        A   58  opt_1  0.000011  0.23521  ...  0.39458  0.00817  0.24566  0.0010  0.00
3        B   22  opt_1  0.002452  0.48327  ...  0.92852  0.00871  0.29568  0.0009  0.02
6        C   17  opt_1  0.082583  0.39493  ...  0.82714  0.00918  0.28572  0.0012  0.00
9        D    3  opt_1  0.018327  0.32342  ...  0.92752  0.00988  0.21958  0.0016  0.00

[4 rows x 11 columns]
   category  pct option         a        b  ...        d        e        f       g     h
1         A   58  opt_2  0.000011  0.23521  ...  0.39458  0.00467  0.24566  0.0010  0.00
4         B   22  opt_2  0.002899  0.49327  ...  0.92852  0.00871  0.30468  0.0009  0.02
7         C   17  opt_2  0.072587  0.35493  ...  0.82723  0.00912  0.29572  0.0018  0.00
10        D    3  opt_2  0.014427  0.32342  ...  0.92752  0.00968  0.22558  0.0026  0.00

[4 rows x 11 columns]
   category  pct option         a        b  ...        d        e        f       g     h
2         A   58  opt_3  0.000011  0.23521  ...  0.39458  0.00467  0.24566  0.0010  0.00
5         B   22  opt_3  0.002452  0.48327  ...  0.92852  0.00771  0.29568  0.0119  0.01
8         C   17  opt_3  0.082583  0.39493  ...  0.82714  0.00962  0.28572  0.0012  0.01
11        D    3  opt_3  0.018327  0.32342  ...  0.94452  0.00988  0.21258  0.0016  0.00

Note that I renamed the category % to pct. This isn't necessary, but made accessing that column in the formula() function a bit cleaner (g.pct vs g["category %"]).

The next step was to implement formula() in such a way as to accept a group from factors_df as an argument:

def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()

In the function signature, g is a group from factors_df, then the keyword-only arguments l, N_cars, and v, which will come from a single row of input_df at a time.

Each of the three groups shown above will be entered into the formula() function one at a time, in their entirety. For example, during one call to formula(), the g argument will hold all of this data:

  category  pct option         a        b  ...        d        e        f       g     h
0        A   58  opt_1  0.000011  0.23521  ...  0.39458  0.00817  0.24566  0.0010  0.00
3        B   22  opt_1  0.002452  0.48327  ...  0.92852  0.00871  0.29568  0.0009  0.02
6        C   17  opt_1  0.082583  0.39493  ...  0.82714  0.00918  0.28572  0.0012  0.00
9        D    3  opt_1  0.018327  0.32342  ...  0.92752  0.00988  0.21958  0.0016  0.00

When the formula uses something like g.e, it's accessing the entire e column, and is taking advantage of vectorization to perform the arithmetic calculations on the entire column at the same time. When the dust settles, x will be a Series where each item in the series will be the result of the formula for each of the four categories of car. Here's an example:

0    0.231242
3    0.619018
6    7.188941
9    1.792376

Notice the indices? Those correspond to category A, B, C, and D from factors_df, respectively.

From there, we need to call formula() on each row of input_df, using the axis argument of pd.DataFrame.apply():

input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)

The lambda r is an anonymous function object being passed to apply, being applied over axis 1, meaning that r will be a single row from input_df at a time, for example:

In [13]: input_df.apply(print, axis=1)
l            3.5
N_cars    1000.0
v          100.0
Name: 0, dtype: float64
.
.
.

Now, on each row-wise apply, we're also applying the formula() function on the groups groupby object with lambda g: formula(g, **r). The **r unpacks the row from input_df as keyword arguments, which helps to ensure that the values for v, l, and N_cars aren't misused in the formula (no need to worry about which order they're passed into the formula() function).

ddejohn
  • 7,491
  • 3
  • 16
  • 28
  • 1
    ddejohn, I already accepted this answer before and Icommented with a thank you note, but I don't see that comment anymore :( I re-did it now, and I wanted to thank you a lot for the detailed explanation, it really helped a lot to understand the process :) – ctrl_z Feb 24 '22 at 09:43
  • Cheers, glad I could help. – ddejohn Feb 24 '22 at 16:05
2

Here is the code I wrote. It's somewhat long but it works. May be you (or someone) can modify and make it shorter.

# Transforming factors_df
df = factors_df.pivot(columns=["category", "option"])
df.reset_index(inplace=True)
# Renaming column names for each combination of option and category
df.columns = [s3 + s2 + s1 for (s1, s2, s3) in df.columns.to_list()]
df.drop(columns=["index"], inplace=True)
# Flattening to a single row to be able to apply formula
df = pd.DataFrame(df.max()).T
# Merging input with transformed factors data
input_df["tmp"] = 1
df["tmp"] = 1
df = pd.merge(input_df, df, on="tmp", how="left")
df.drop("tmp", axis=1, inplace=True)

# Calculating values for opt_1 using the formula
df["opt_1_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_1Acategory %"]
            / 100
            * (
                df["opt_1Aa"] * df["v"] * df["v"]
                + df["opt_1Ab"] * df["v"]
                + df["opt_1Ac"]
                + df["opt_1Ad"] / df["v"]
            )
            / (
                (
                    df["opt_1Ae"] * df["v"] * df["v"]
                    + df["opt_1Af"] * df["v"]
                    + df["opt_1Ag"]
                )
                * (1 - df["opt_1Ah"])
            )
        )
        + (
            df["opt_1Bcategory %"]
            / 100
            * (
                df["opt_1Ba"] * df["v"] * df["v"]
                + df["opt_1Bb"] * df["v"]
                + df["opt_1Bc"]
                + df["opt_1Bd"] / df["v"]
            )
            / (
                (
                    df["opt_1Be"] * df["v"] * df["v"]
                    + df["opt_1Bf"] * df["v"]
                    + df["opt_1Bg"]
                )
                * (1 - df["opt_1Bh"])
            )
        )
        + (
            df["opt_1Ccategory %"]
            / 100
            * (
                df["opt_1Ca"] * df["v"] * df["v"]
                + df["opt_1Cb"] * df["v"]
                + df["opt_1Cc"]
                + df["opt_1Cd"] / df["v"]
            )
            / (
                (
                    df["opt_1Ce"] * df["v"] * df["v"]
                    + df["opt_1Cf"] * df["v"]
                    + df["opt_1Cg"]
                )
                * (1 - df["opt_1Ch"])
            )
        )
    )
)

# Calculating values for opt_2 using the formula
df["opt_2_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_2Acategory %"]
            / 100
            * (
                df["opt_2Aa"] * df["v"] * df["v"]
                + df["opt_2Ab"] * df["v"]
                + df["opt_2Ac"]
                + df["opt_2Ad"] / df["v"]
            )
            / (
                (
                    df["opt_2Ae"] * df["v"] * df["v"]
                    + df["opt_2Af"] * df["v"]
                    + df["opt_2Ag"]
                )
                * (1 - df["opt_2Ah"])
            )
        )
        + (
            df["opt_2Bcategory %"]
            / 100
            * (
                df["opt_2Ba"] * df["v"] * df["v"]
                + df["opt_2Bb"] * df["v"]
                + df["opt_2Bc"]
                + df["opt_2Bd"] / df["v"]
            )
            / (
                (
                    df["opt_2Be"] * df["v"] * df["v"]
                    + df["opt_2Bf"] * df["v"]
                    + df["opt_2Bg"]
                )
                * (1 - df["opt_2Bh"])
            )
        )
        + (
            df["opt_2Ccategory %"]
            / 100
            * (
                df["opt_2Ca"] * df["v"] * df["v"]
                + df["opt_2Cb"] * df["v"]
                + df["opt_2Cc"]
                + df["opt_2Cd"] / df["v"]
            )
            / (
                (
                    df["opt_2Ce"] * df["v"] * df["v"]
                    + df["opt_2Cf"] * df["v"]
                    + df["opt_2Cg"]
                )
                * (1 - df["opt_2Ch"])
            )
        )
    )
)

# Calculating values for opt_3 using the formula
df["opt_3_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_3Acategory %"]
            / 100
            * (
                df["opt_3Aa"] * df["v"] * df["v"]
                + df["opt_3Ab"] * df["v"]
                + df["opt_3Ac"]
                + df["opt_3Ad"] / df["v"]
            )
            / (
                (
                    df["opt_3Ae"] * df["v"] * df["v"]
                    + df["opt_3Af"] * df["v"]
                    + df["opt_3Ag"]
                )
                * (1 - df["opt_3Ah"])
            )
        )
        + (
            df["opt_3Bcategory %"]
            / 100
            * (
                df["opt_3Ba"] * df["v"] * df["v"]
                + df["opt_3Bb"] * df["v"]
                + df["opt_3Bc"]
                + df["opt_3Bd"] / df["v"]
            )
            / (
                (
                    df["opt_3Be"] * df["v"] * df["v"]
                    + df["opt_3Bf"] * df["v"]
                    + df["opt_3Bg"]
                )
                * (1 - df["opt_3Bh"])
            )
        )
        + (
            df["opt_3Ccategory %"]
            / 100
            * (
                df["opt_3Ca"] * df["v"] * df["v"]
                + df["opt_3Cb"] * df["v"]
                + df["opt_3Cc"]
                + df["opt_3Cd"] / df["v"]
            )
            / (
                (
                    df["opt_3Ce"] * df["v"] * df["v"]
                    + df["opt_3Cf"] * df["v"]
                    + df["opt_3Cg"]
                )
                * (1 - df["opt_3Ch"])
            )
        )
    )
)

# Removing unnecessary columns
df_final = df[["l", "N_cars", "v", "opt_1_value", "opt_2_value", "opt_3_value"]]
print(df_final)

Output:

      l  N_cars    v  opt_1_value  opt_2_value  opt_3_value
0   3.5    1000  100  1496.002370  1420.656629  1534.748740
1   5.7     500  110   750.997279   710.944885   767.411691
2  10.0     367  110   551.157686   521.754019   562.906668
3  11.1    1800   95  2685.551348  2554.477141  2756.164589
4   2.8     960  105  1439.467965  1364.815604  1475.082027
5   4.7     800  120  1206.116125  1138.614075  1229.225287
6  10.4     103  111   154.744048   146.445615   157.990346
7  20.1    1950  115  2933.825622  2773.297776  2990.828374
ddejohn
  • 7,491
  • 3
  • 16
  • 28
  • 1
    thank you for the proposed solution, as I am looking for something automatic, short and simple this seems overly manual / step by step calculation. I hope you understand :) also `1 - h` is indeed the correct subtraction term, but that's easy to change – ctrl_z Feb 07 '22 at 11:43
  • @ctrl_z, I fully understand :-) I realized my code needs to be shortened but I didn't know how. – Krishnakanth Allika Feb 07 '22 at 13:43
  • 2
    @ctrl_z could you at least confirm if this is the expected result? Two solutions posted so far gave different results. – steven Feb 19 '22 at 15:00
  • 1
    @KrishnakanthAllika you get a different result because you used wrong brackets in the formula, you calculate `formula = ( (a*v*v) + (b*v) + c + (d/v) ) / (( (e*v*v) + (f*v) + g) * (1 - h))` instead of `formula = ( (a*v*v) + (b*v) + c + (d/v) ) / ( (e*v*v) + (f*v) + g) * (1 - h)` – ctrl_z Feb 24 '22 at 09:41
1

Another way to do it, not nearly as elegand as @ddejhon 's solution, tho:

def formula(input_index, factors_index):
    formula = ((factors_df.loc[factors_index,'a']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'b']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'c'])+
               (factors_df.loc[factors_index,'d']/input_df['v'][input_index])
            )/(
               (factors_df.loc[factors_index,'e']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'f']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'g'])
           )*(1-factors_df.loc[factors_index,'h'])
    return formula
index_list = [factors_df[factors_df['option'] == opt].index.tolist() for opt in factors_df['option'].unique().tolist()]

Edit1: got rid of that ugly nested for structure and replaced it with list comprehension

output_df            = pd.DataFrame(np.repeat(input_df.values, len(factors_df['option'].unique()), axis=0))
output_df.columns    = input_df.columns
output_df['option']  = factors_df['option'].unique().tolist()*len(input_df.index)
output_df['formula'] = [n for sub_list in [[sum(factors_df['category %'].unique()[k]/100 * formula(i,j[k]) 
                          for k in range(len(factors_df['category'].unique()))) 
                          for j in index_list] for i in input_df.index] for n in sub_list]
output_df['result']  = output_df['l'] * output_df['N_cars'] * output_df['formula']

Output:

output_df

       l    N_cars      v  option    formula          result
 0   3.5    1000.0  100.0   opt_1   1.546196     5411.685077
 1   3.5    1000.0  100.0   opt_2   1.461442     5115.048256
 2   3.5    1000.0  100.0   opt_3   1.571710     5500.985916
 3   5.7     500.0  110.0   opt_1   1.552751     4425.339734
 4   5.7     500.0  110.0   opt_2   1.463121     4169.893681
 5   5.7     500.0  110.0   opt_3   1.573192     4483.595803
 6  10.0     367.0  110.0   opt_1   1.552751     5698.595376
 7  10.0     367.0  110.0   opt_2   1.463121     5369.652565
 8  10.0     367.0  110.0   opt_3   1.573192     5773.612841
 9  11.1    1800.0   95.0   opt_1   1.542578    30820.717985
10  11.1    1800.0   95.0   opt_2   1.460466    29180.106606
11  11.1    1800.0   95.0   opt_3   1.570810    31384.785443
12   2.8     960.0  105.0   opt_1   1.549580     4165.270216
13   2.8     960.0  105.0   opt_2   1.462324     3930.726187
14   2.8     960.0  105.0   opt_3   1.572499     4226.877893
15   4.7     800.0  120.0   opt_1   1.558526     5860.057879
16   4.7     800.0  120.0   opt_2   1.464497     5506.509637
17   4.7     800.0  120.0   opt_3   1.574334     5919.496692
18  10.4     103.0  111.0   opt_1   1.553361     1663.960420
19  10.4     103.0  111.0   opt_2   1.463271     1567.455541
20  10.4     103.0  111.0   opt_3   1.573319     1685.339848
21  20.1    1950.0  115.0   opt_1   1.555727    60976.735053
22  20.1    1950.0  115.0   opt_2   1.463842    57375.300546
23  20.1    1950.0  115.0   opt_3   1.573800    61685.075902
P. Pinho
  • 416
  • 1
  • 6