Compare two dataframes with different shapes and with condition in python

Question

I have two dataframes in Python.

First dataframe, tf_words, of shape (1 row, 2235 columns), looks like this:

     0   1    2     3      4     5      6    ......  2234
0   aa, aaa, aaaa, aaan, aaanu, aada, aadhyam,.....zindabad

Second dataframe, tf1_bigram, of shape (4000, 34319), contains bigrams with their occurrences in the dataset and looks like this:

(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
  1        0         0         1            0                 0        ...
  0        1         0         0            1                 0        ...
  0        0         1         0            0                 1        ...

I have to compare the tf_words dataframe with the tf1_bigram dataframe, and the comparison should work as follows.

E.g. the word 'aa' in tf_words matches one of the two words in each of the columns (aa, aala), (aa, accountinte) and (aa,adhamanaya) of tf1_bigram, so the values in those matching columns should be multiplied by 0.5.

Then check for 'aaa' and, if it is found, multiply the matching columns by 0.5;

then check for 'aaaa' and, if it is found, multiply the matching columns by 0.5;

then check for 'aaan' and, if it is found, multiply the matching columns by 0.5;

and so on up to the last word, 'zindabad' (column no. 2234).

The resulting tf1_bigram should then look like this:

(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
  1        0         0         0.5          0                 0        ...
  0        1         0         0            0.5               0        ...
  0        0         1         0            0                 0.5      ...

I have tried tf1_bigram.apply(lambda x: np.multiply(x * 0.5) if x.name in tf_words else x), but the output is not what I expected.

Please help.
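One likely reason the attempted apply call changes nothing: the `in` operator on a DataFrame tests the column labels, not the cell values, so `x.name in tf_words` compares each bigram label against 0, 1, 2, ... and is never true. The sketch below illustrates this on a hypothetical three-word stand-in for tf_words (the real frame has 2235 columns). Separately, np.multiply expects two operands, so np.multiply(x * 0.5) would fail if that branch were ever reached; plain x * 0.5 is enough.

```python
import pandas as pd

# Hypothetical miniature stand-in for tf_words (1 row, one word per column).
tf_words = pd.DataFrame([["aa", "aaa", "aaaa"]])

# `in` on a DataFrame looks at the column labels, which here are 0, 1, 2.
label_hit = "aa" in tf_words
position_hit = 0 in tf_words

# Pulling the words out of the single row gives a set that is searched by value.
words = set(tf_words.iloc[0])
value_hit = "aa" in words

print(label_hit, position_hit, value_hit)  # prints: False True True
```

Building a set of words first, as in the last step, also keeps each lookup cheap when there are thousands of words and tens of thousands of columns.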

Hi Prasad, please follow these guidelines on how to write a minimum reproducible example, this will make it easier for people to understand and answer your question :) [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) — noob100, May 31 '22 at 11:54
[How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — mozway, May 31 '22 at 11:54
Please provide enough code so others can better understand or reproduce the problem. — Community, May 31 '22 at 12:34

Rafael M R de Rezende · Answer 1 · 2022-06-04T13:46:28.313

Try this:

import pandas as pd
table = {
    'a, en':[1,0,0],
    'a, ha':[0,1,0],
    'a, padam':[0,0,1],
    'aa, aala' :[1,0,0],
    'aaa, accountinte':[0,1,0],
    'aaaa,adhamanaya':[0,0,1]
}
tf1_bigram = pd.DataFrame(table)

table = {0:['aa'], 1:['aaa'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

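# tf_words has a single row, so flatten it into a plain list of words.
list_tf_words = tf_words.iloc[0].tolist()

print(tf1_bigram)

print('\n\n-------------BREAK-----------\n\n')


def func(x):
    # x is one column of tf1_bigram; x.name is its label.
    # Halve the column once if any word occurs in the label (substring match).
    for y in list_tf_words:
        if y in x.name:
            return x * 0.5
    return x

# axis=0 passes each column to func in turn.
tf1_bigram = tf1_bigram.apply(func, axis=0)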

print(tf1_bigram)

OUTPUT

   a, en  a, ha  a, padam  aa, aala  aaa, accountinte  aaaa,adhamanaya
0      1      0         0         1                 0                0
1      0      1         0         0                 1                0
2      0      0         1         0                 0                1


-------------BREAK-----------


   a, en  a, ha  a, padam  aa, aala  aaa, accountinte  aaaa,adhamanaya
0      1      0         0       0.5               0.0              0.0
1      0      1         0       0.0               0.5              0.0
2      0      0         1       0.0               0.0              0.5
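Note that the code above matches by substring, so a word such as 'aa' would also hit a label that merely contains it inside a longer word. If whole-word matching is what is wanted, a minimal sketch could look like the following. It assumes the column labels are comma-separated strings as in this answer; the column 'a, paala' and the helper name has_listed_word are made up for illustration.

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "a, paala": [0, 1, 0],          # hypothetical: contains "aa" only inside "paala"
    "aa, aala": [1, 0, 0],
    "aaa, accountinte": [0, 1, 0],
})
tf_words = pd.DataFrame([["aa", "aaa", "aaaa"]])

# Set of words to look for, taken from the single row of tf_words.
words = set(tf_words.iloc[0])


def has_listed_word(label):
    # Split "w1, w2" into its two words and compare them as whole words.
    tokens = {token.strip() for token in label.split(",")}
    return not tokens.isdisjoint(words)


# One scale factor per column: 0.5 where a whole word matches, 1.0 otherwise.
scale = pd.Series(
    [0.5 if has_listed_word(col) else 1.0 for col in tf1_bigram.columns],
    index=tf1_bigram.columns,
)

# Multiply every column by its factor in a single vectorised step.
result = tf1_bigram.mul(scale, axis="columns")
print(result)
```

Here 'a, paala' is left untouched even though the substring test `'aa' in 'a, paala'` is true, while 'aa, aala' and 'aaa, accountinte' are halved.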

If you want to multiply by 0.5 once for every matching word, rather than only once per column, use the code below:

import pandas as pd
table = {
    'a, en':[1,0,0],
    'a, ha':[0,1,0],
    'a, padam':[0,0,1],
    'aa, aala' :[1,0,0],
    'aaa, aaanu, accountinte':[0,1,0],
    'aaaa,adhamanaya':[0,0,1]
}
tf1_bigram = pd.DataFrame(table)

table = {0:['aa'], 1:['aaa'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

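# tf_words has a single row, so flatten it into a plain list of words.
list_tf_words = tf_words.iloc[0].tolist()

print(tf1_bigram)

print('\n\n-------------BREAK-----------\n\n')


def func(x):
    # Halve the column once for every word that occurs in its label,
    # so a label matching n words ends up multiplied by 0.5 ** n.
    for y in list_tf_words:
        if y in x.name:
            x = x * 0.5
    return x

# axis=0 passes each column to func in turn.
tf1_bigram = tf1_bigram.apply(func, axis=0)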

print(tf1_bigram)

OUTPUT

   a, en  a, ha  a, padam  aa, aala  aaa, aaanu, accountinte  aaaa,adhamanaya
0      1      0         0         1                        0                0
1      0      1         0         0                        1                0
2      0      0         1         0                        0                1


-------------BREAK-----------


   a, en  a, ha  a, padam  aa, aala  aaa, aaanu, accountinte  aaaa,adhamanaya
0      1      0         0       0.5                   0.0000            0.000
1      0      1         0       0.0                   0.0625            0.000
2      0      0         1       0.0                   0.0000            0.125
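In the output above, 'aaaa,adhamanaya' ends up at 0.125 because 'aa', 'aaa' and 'aaaa' are all substrings of its label. If the intent is to count only whole words, one possible variant is sketched below. It assumes comma-separated string labels; the column 'aaa, aaanu', the reduced word set and the helper name count_listed are hypothetical.

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "aa, aala": [1, 0, 0],
    "aaa, aaanu": [0, 1, 0],        # hypothetical: both words are in the list
    "aaaa,adhamanaya": [0, 0, 1],
})
words = {"aa", "aaa", "aaaa", "aaan", "aaanu"}


def count_listed(label):
    # Number of words in the label that appear in `words` as whole words.
    return sum(token.strip() in words for token in label.split(","))


# Factor 0.5 ** n for a label with n matching words (n = 0 leaves it unchanged).
scale = pd.Series(
    [0.5 ** count_listed(col) for col in tf1_bigram.columns],
    index=tf1_bigram.columns,
)

result = tf1_bigram.mul(scale, axis="columns")
print(result)
```

With whole-word counting, 'aaaa,adhamanaya' is halved once (only 'aaaa' matches) and 'aaa, aaanu' is multiplied by 0.25.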

Thanks for your reply, but this is not what I expect. As mentioned, the multiplication by 0.5 should apply only to those columns whose words appear in both tf_words and tf1_bigram — Prasad Joshi, May 31 '22 at 16:39
I updated the code, now it takes the column name and evaluates if it contains 'aa' inside the name, I hope it is the desired solution, and if it is, be sure to mark it as solved — Rafael M R de Rezende, Jun 01 '22 at 11:22
Thanks for your update! But after implementing your code, the diagonal values of (aa, aala), (aa, accountinte) and (aa,adhamanaya) are showing zero. — Prasad Joshi, Jun 03 '22 at 06:07
Rafael M R de Rezende One more thing is missing in the code: I want to iterate over the dataframe tf_words and check each value, i.e. check for aa and, if found, multiply the column by 0.5; then aaa and, if found, multiply the column by 0.5; then aaaa; then aaan; and so on. Eagerly waiting for your reply, thanks once again... — Prasad Joshi, Jun 03 '22 at 06:16
From what I understand, just having "aa" in the string already triggers the multiplication, so the solution I made corresponds to what you just said. I updated the code and the example for you to visualize. In case I got it wrong, please try to explain it another way so that I can better assist — Rafael M R de Rezende, Jun 03 '22 at 12:29
I made the code cleaner too, now it does everything in one line — Rafael M R de Rezende, Jun 03 '22 at 12:48
Thanks, it's working, but I do not want only "aa". I have a dataframe named tf_words of shape (1,) and the contents are [aa, aaa, aaaa, aaan, aaanu, aada, aadhyam,.....]. I have to check every value and, if it is found in df, multiply by 0.5. I hope you understand. Eagerly waiting for your reply, thanks once again — Prasad Joshi, Jun 04 '22 at 10:14
I edited the code, see if this is what you need. Note: I included two solutions, in case you want to multiply more than once — Rafael M R de Rezende, Jun 04 '22 at 13:48
