I've got the following task (code is unfortunately quite long, so I try an example via a barebone):
Get one Dataframe (df1) from a MySQL DB
Get another Dataframe (df2) from the code / some other place
# Check if df2 has updated values:
if df1 and df2 contain rows with the same value in two columns:
update several columns of the df1.row from df2.row
# Check if df2 has new values:
if df2 has rows that df1 does not contain
append those rows to df1
When I run my code, I receive the error:
ValueError: cannot reindex from a duplicate axis
The error is raised in the line:
df1.at[index, columns] = row_new[columns]
I've built a barebone of my code, but surprisingly that works:
import pandas as pd
s1 = pd.Series([1, 'test1', 'name1', 'a'])
s2 = pd.Series([2, 'test2', 'name2', 'b'])
s3 = pd.Series([3, 'test3', 'name3', 'c'])
s4 = pd.Series([4, 'test4', 'name4', 'd'])
s5 = pd.Series([5 ,'test5', 'name5', 'e'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = ["id", "A", "B", "C"])
# print(df1)
s1 = pd.Series(['test1', 'name1new', 'a'])
s2 = pd.Series(['test2', 'name2new', 'b'])
s3 = pd.Series(['test6', 'name6', 'f'])
s4 = pd.Series(['test7', 'name7', 'g'])
s5 = pd.Series(['test8', 'name8', 'h'])
s6 = pd.Series(['test9', 'name9', 'i'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5), list(s6)], columns = ["A", "B", "C"])
print(df1.index)
print(df1.columns)
print(df2.index)
print(df2.columns)
columns = ['A', 'B']
for index, row in df1.iterrows():
for index_new, row_new in df2.iterrows():
if(row['C'] == row_new['C']
and row['A'] == row_new['A']):
print("-- SAME --")
print(row)
print(index)
print(index_new)
df1.at[index, columns] = row_new[columns]
print(df1)
The index of the barebone and the real code are the same:
RangeIndex(start=0, stop=10, step=1)
(just with different stop values)
The columns are different from df1 and df2 (in the barebone, I've added id. In the real code, I've got id from the MySQL and around 5 columns more in df1 than in df2)
I've read other topics like: What does `ValueError: cannot reindex from a duplicate axis` mean? on the same issue which suggest, that there is an issue with the duplicate column names. But it seems like, in my barebone, duplicate column names work fine.
In my real code, I've tried as well to drop the index before iterrows:
df2_copy = df2.reset_index(drop=True)
df1_copy = df1.reset_index(drop=True)
but still get the same error.
My question now (unfortunately a bit theoretical since the barebone works): How can I update the df1 row with some df2 values?