left_join function creating duplicates when merging dataframes

Question

A bit of a lengthy explanation here, so please bear with me with.

I am working with two dataframes:

qb.metrics.df
roster.df

In the end, I would like to left_join them together.

Here is a look at qb.metrics.df

 Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
   <dbl>          <dbl> <chr>                    <dbl> <chr>            <chr>        <chr>            <chr>           
 1     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 2     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 3     3          14536 R. Wilson                 32   37.1             2789         69.8             28              
 4     4          18890 P. Mahomes                25.2 36.6             2687         66.9             25              
 5     5           2593 A. Rodgers                37   34.9             2578         67.8             26              
 6     6           4314 T. Brady                  43.3 38.5             2739         66               23              
 7     7          18857 D. Watson                 25.2 33.4             2539         68.1             18              
 8     8          21681 J. Herbert                22.7 37.6             2333         66.8             19              
 9     9          19781 L. Jackson                23.9 27.4             1762         64               14              
10    10          16497 T. Bridgewater            28   33.1             2544         71.9             13

... add here is a look at roster.df

season team  position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
    <dbl> <chr> <chr>    <chr>                    <int> <chr>  <chr>     <chr>      <chr>     <date>     <chr>  <chr> 
 1   2020 ARI   C        C                           52 Active Mason Co~ Mason      Cole      1996-03-28 6-5    292   
 2   2020 ARI   C        C                           53 Active Lamont G~ Lamont     Gaillard  1996-02-08 6-3    305   
 3   2020 ARI   CB       NB                          33 Active Byron Mu~ Byron      Murphy    1998-01-18 5-11   190   
 4   2020 ARI   CB       NA                          20 Active Prince A~ Prince     Amukamara 1989-06-06 6-0    204   
 5   2020 ARI   CB       NA                          39 Active Jace Whi~ Jace       Whittaker 1995-07-16 5-11   185   
 6   2020 ARI   CB       NA                          27 Injur~ Kevin Pe~ Kevin      Peterson  1994-03-22 5-11   185   
 7   2020 ARI   CB       LCB                         25 Active Johnatha~ Johnathan  Joseph    1984-04-16 5-11   185   
 8   2020 ARI   CB       NA                          23 Injur~ Robert A~ Robert     Alford    1988-10-31 5-10   186   
 9   2020 ARI   CB       LCB                         21 Active Patrick ~ Patrick    Peterson  1990-07-11 6-1    203   
10   2020 ARI   CB       RCB                         20 Active Dre Kirk~ Dre        Kirkpatr~ 1989-10-26 6-2    190

In order to user left_join, I thought it best to match up 'PlayerShortName' from qb.metrics.df and 'full_name' from roster.df.

To do so, I dropped everything before the first letter of the first name and added a period (to match p with 'PlayerShortName' in qb.metrics.df):

roster.df$full_name <- sub('(.)\\w+ (\\w+)', '\\1. \\2', roster.df$full_name)

It works fine, as you can see in the updated version of roster.df:

 season team  position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
    <dbl> <chr> <chr>    <chr>                    <int> <chr>  <chr>     <chr>      <chr>     <date>     <chr>  <chr> 
 1   2020 ARI   C        C                           52 Active M. Cole   Mason      Cole      1996-03-28 6-5    292   
 2   2020 ARI   C        C                           53 Active L. Gaill~ Lamont     Gaillard  1996-02-08 6-3    305   
 3   2020 ARI   CB       NB                          33 Active B. Murphy Byron      Murphy    1998-01-18 5-11   190   
 4   2020 ARI   CB       NA                          20 Active P. Amuka~ Prince     Amukamara 1989-06-06 6-0    204   
 5   2020 ARI   CB       NA                          39 Active J. Whitt~ Jace       Whittaker 1995-07-16 5-11   185   
 6   2020 ARI   CB       NA                          27 Injur~ K. Peter~ Kevin      Peterson  1994-03-22 5-11   185   
 7   2020 ARI   CB       LCB                         25 Active J. Joseph Johnathan  Joseph    1984-04-16 5-11   185   
 8   2020 ARI   CB       NA                          23 Injur~ R. Alford Robert     Alford    1988-10-31 5-10   186   
 9   2020 ARI   CB       LCB                         21 Active P. Peter~ Patrick    Peterson  1990-07-11 6-1    203   
10   2020 ARI   CB       RCB                         20 Active D. Kirkp~ Dre        Kirkpatr~ 1989-10-26 6-2    190

However, when I do a left_merge, I get all kinds of duplicates ...

qb.metrics.df <- qb.metrics.df %>%
  left_join(roster.df, by = c("PlayerShortName" = "full_name"))

Doing so results in this:

   Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
   <dbl>          <dbl> <chr>                    <dbl> <chr>            <chr>        <chr>            <chr>           
 1     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 2     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 3     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 4     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 5     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 6     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 7     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 8     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 9     3          14536 R. Wilson                 32   37.1             2789         69.8             28              
10     3          14536 R. Wilson                 32   37.1             2789         69.8             28

As you can see, it creates duplicates of the information and I cannot figure out why.

Any help is greatly appreciated!

EDIT:

Here is the minimal reproducible example for qb.metrics.df:

structure(list(Rank = c(1, 2, 3, 4, 5, 6), PlayerPlayerId = c(20889, 
19801, 14536, 18890, 2593, 4314), PlayerShortName = c("K. Murray", 
"J. Allen", "R. Wilson", "P. Mahomes", "A. Rodgers", "T. Brady"
), PlayerAgeExact = c(23.3, 24.5, 32, 25.2, 37, 43.3), PassAttemptsPerGame = c("34.6", 
"36.4", "37.1", "36.6", "34.9", "38.5"), PassingYards = c("2375", 
"2873", "2789", "2687", "2578", "2739"), CompletionPercentage = c("68.2", 
"68.4", "69.8", "66.9", "67.8", "66"), PassingTouchdowns = c("17", 
"21", "28", "25", "26", "23"), Interceptions = c("8", "7", "10", 
"1", "3", "7"), PlayactionPassCompletionPercentage = c("71.4", 
"66.4", "76.5", "71.3", "68.1", "74.7"), TruePasserRating = c("102.6", 
"99.4", "119", "128.5", "130.2", "105.8"), ReceiverTargetSeparation = c("1.79", 
"1.89", "2.16", "1.96", "2", "1.53"), DroppedPasses = c("9", 
"19", "14", "61", "49", "29"), AccuracyRating = c("7.6", "7.5", 
"7.8", "7.4", "7.6", "7.2"), RedZoneAttempts = c("34", "50", 
"55", "46", "48", "64"), RedZoneCompletionPercentage = c("67.6", 
"66", "72.7", "67.4", "66.7", "59.4"), InterceptablePasses = c("11", 
"18", "11", "8", "8", "10"), InterceptablePassesPerGame = c("1.2", 
"1.8", "1.2", "0.9", "0.9", "1")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

And here it is for roster.df:

structure(list(season = c(2020, 2020, 2020, 2020, 2020, 2020), 
    team = c("ARI", "ARI", "ARI", "ARI", "ARI", "ARI"), position = c("C", 
    "C", "CB", "CB", "CB", "CB"), depth_chart_position = c("C", 
    "C", "NB", NA, NA, NA), jersey_number = c(52L, 53L, 33L, 
    20L, 39L, 27L), status = c("Active", "Active", "Active", 
    "Active", "Active", "Injured Reserve"), full_name = c("Mason Cole", 
    "Lamont Gaillard", "Byron Murphy", "Prince Amukamara", "Jace Whittaker", 
    "Kevin Peterson"), first_name = c("Mason", "Lamont", "Byron", 
    "Prince", "Jace", "Kevin"), last_name = c("Cole", "Gaillard", 
    "Murphy", "Amukamara", "Whittaker", "Peterson"), birth_date = structure(c(9583, 
    9534, 10244, 7096, 9327, 8846), class = "Date"), height = c("6-5", 
    "6-3", "5-11", "6-0", "5-11", "5-11"), weight = c("292", 
    "305", "190", "204", "185", "185"), college = c("Michigan", 
    "Georgia", "Washington", "Nebraska", "Arizona", "Oklahoma State"
    ), high_school = c("East Lake (FL)", "Pine Forest (NC)", 
    "Saguaro (AZ)", "Apollo (AZ)", "Oceanside (CA)", "Wagoner (OK)"
    ), gsis_id = c("00-0034785", "00-0035536", "00-0035236", 
    "00-0027957", "00-0036043", "00-0032848"), espn_id = c(3115972L, 
    3128707L, 4038999L, 13975L, 3821572L, 2977742L), sportradar_id = c("53d25371-e3ce-4030-8d0a-82def5cdc600", 
    "0c8b0581-9ed2-488b-bcaa-ef783261dfd1", "c025b513-9431-4097-bc25-9777bf08f846", 
    "f1879cfa-4c07-4140-9da0-c7ebe9af2dfd", "528bf5c9-1d23-40c3-adda-df21f8f0e2ab", 
    "30e539c3-74dd-4a9a-9ebb-4bdd0f0d39f8"), yahoo_id = c(31067L, 
    32011L, 31865L, 24806L, 33089L, 29924L), rotowire_id = c(12795L, 
    13678L, 13560L, 7509L, 14793L, 11101L), update_dt = structure(c(1605942526.92275, 
    1605942526.92275, 1605942526.92275, 1605942526.92275, 1605942526.92275, 
    1605942526.92275), tzone = "America/New_York", class = c("POSIXct", 
    "POSIXt")), headshot_url = c("https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3115972.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3128707.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/4038999.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/13975.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3821572.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/2977742.png"
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

Using `unique(qb.metrics.df)` should remove the duplicates. Can you provide some data to verify the cause? — Ronak Shah, Nov 22 '20 at 01:27
In order for us to help you, please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example, to produce a minimal data set, you can use `head()`, `subset()`, or the indices. Then use `dput()` to give us something that can be put in R immediately. Also, please make sure you know what to do [when someone answers your question](https://stackoverflow.com/help/someone-answers). More info can be found at Stack Overflow's [help center](https://stackoverflow.com/help). Thank you! — iamericfletcher, Nov 22 '20 at 01:37
compare `n_distinct(qb.roster.df$full_name)` to `nrow(qb.roster.df)` and `n_distinct(qb.metrics.df$PlayerShortName` to `nrow(qb.metrics.df$PlayerShortName)`. Are there duplicated names in either? n_distinct should match nrow in both comparisons. — icj, Nov 22 '20 at 01:56
It would be helpful if your reproducible example also reproduced the issue you are experiencing. Please update your reprex. — iamericfletcher, Nov 22 '20 at 02:07
I tried to reproduce the issue with the data you provided. However, for me, the `left_join` works without an issue (no duplicates). Are you sure, you have the most current package versions and `R` installation? Edit: Also, maybe think about introducing a new variable (e.g. `short_name`) in the `roster.df` instead of modifying `full_name`, as this might help with code clarity. — Dom42, Nov 22 '20 at 13:25
Your reproducible example doesn't work (nor does the example you provide in your question), because the `roster.df` examples you provide do not contain any of the names from the `qb.metrics.df` file. Regardless, just looking at the column headers in the two files in your examples, my immediate reaction is that you have alot of double entries because the `roster.df` file includes the column "season", which means that the players listed in `roster.df` have multiple entries for each season they play, and when you left_join those to `qb_roster.df` there would be multiple entries. — anguyen1210, Nov 22 '20 at 16:52

left_join function creating duplicates when merging dataframes

0 Answers0