I have a large dataset and have searched online, but I can't find an answer that addresses my problem. I want to exclude a number of individuals from my dataset.

My dataset has two columns of dates. I want to remove all individuals where dateA is before dateB.

I've tried a variety of things but can't work it out. When I do manage to remove these individuals, all the other columns in the dataset show NA instead of the values they are meant to show.

If my dataset is called cohort, and dateA and dateB are the names of the columns, then this is what I tried to remove the individuals:

cohort2 <- cohort[!(cohort$dateA<cohort$dateB),]

cohort3 <- subset(cohort, cohort$dateA!<cohort$dateB)

Here is an example of what my dataset looks like. In this case I want to remove row 4, because DateA occurs before DateB, creating a new cohort made up of the first three rows and the last row.

DateA    DateB
1/1/16   1/1/15
2/2/16   2/2/15
3/3/16   3/3/15
4/4/15   4/4/16
NA       5/5/15

TarJae
Eams
  • In order for us to help you, please edit your question to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example, to produce a minimal data set, you can use `head()`, `subset()`, or the indices. Then use `dput()` to give us something that can be put in R immediately. Also, please make sure you know what to do [when someone answers your question](https://stackoverflow.com/help/someone-answers). More info can be found at Stack Overflow's [help center](https://stackoverflow.com/help). Thank you! – iamericfletcher Mar 11 '21 at 20:26
  • In addition to the useful suggestions above, you should first do a good-faith search. When I search on "[r] remove rows condition" I get hundreds of hits: https://stackoverflow.com/search?q=%5Br%5D+condition+remove+rows – IRTFM Mar 11 '21 at 20:34

1 Answer

Using the dplyr package:


cohort <- data.frame(
  DateA = as.Date(c('1/1/2016', '2/2/2016', '3/3/2016', '4/4/2015', NA), format = "%m/%d/%Y"),
  DateB = as.Date(c('1/1/2015', '2/2/2015', '3/3/2015', '4/4/2016', '5/5/2015'), format = "%m/%d/%Y")
)

library(dplyr)


cohort %>% 
  filter(DateA > DateB | is.na(DateA) | is.na(DateB))

#>        DateA      DateB
#> 1 2016-01-01 2015-01-01
#> 2 2016-02-02 2015-02-02
#> 3 2016-03-03 2015-03-03
#> 4       <NA> 2015-05-05

Created on 2021-03-11 by the reprex package (v0.3.0)
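
One detail worth checking: the filter above uses a strict `>`, so a row where both dates are equal would also be dropped, although the question only asks to remove rows where DateA is before DateB. A minimal sketch of a variant that keeps ties, assuming that is the intended behaviour. The `cohort_ties` data and its tied row are hypothetical and are not part of the asker's example:

```r
library(dplyr)

# Hypothetical data: row 4 has identical dates (a tie)
cohort_ties <- data.frame(
  DateA = as.Date(c("2016-01-01", "2015-04-04", NA, "2017-06-06")),
  DateB = as.Date(c("2015-01-01", "2016-04-04", "2015-05-05", "2017-06-06"))
)

# >= keeps rows where the dates are equal; the is.na() terms keep
# rows where either date is missing
kept <- cohort_ties %>%
  filter(DateA >= DateB | is.na(DateA) | is.na(DateB))

kept
```

Only the second row (DateA before DateB) should be removed here; the tied row and the row with a missing DateA stay.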

Base R approach:

cohort <- data.frame(
  DateA = as.Date(c('1/1/2016', '2/2/2016', '3/3/2016', '4/4/2015', NA), format = "%m/%d/%Y"),
  DateB = as.Date(c('1/1/2015', '2/2/2015', '3/3/2015', '4/4/2016', '5/5/2015'), format = "%m/%d/%Y")
)

cohort[cohort$DateA > cohort$DateB | is.na(cohort$DateA) | is.na(cohort$DateB), ] 

#>        DateA      DateB
#> 1 2016-01-01 2015-01-01
#> 2 2016-02-02 2015-02-02
#> 3 2016-03-03 2015-03-03
#> 5       <NA> 2015-05-05

Created on 2021-03-11 by the reprex package (v0.3.0)
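
As for why the attempt in the question filled rows with NA: comparing a missing date gives NA, and an NA inside a logical row index makes base R return a row of NAs instead of skipping it. (The second attempt fails for a different reason: `!<` is not an R operator.) The following is a rough sketch of that behaviour and of one possible alternative using `which()`; the names `a_before_b`, `with_na_row`, `to_drop` and `cohort_kept` are made up for illustration:

```r
cohort <- data.frame(
  DateA = as.Date(c("1/1/2016", "2/2/2016", "3/3/2016", "4/4/2015", NA), format = "%m/%d/%Y"),
  DateB = as.Date(c("1/1/2015", "2/2/2015", "3/3/2015", "4/4/2016", "5/5/2015"), format = "%m/%d/%Y")
)

# The comparison is NA wherever either date is missing
a_before_b <- cohort$DateA < cohort$DateB

# An NA in a logical row index yields a row made up entirely of NA
with_na_row <- cohort[!a_before_b, ]

# which() returns only the positions that are TRUE, ignoring NA, so
# negative indexing removes exactly the rows where DateA is before DateB.
# The length check matters: cohort[-integer(0), ] would return zero rows.
to_drop <- which(a_before_b)
cohort_kept <- if (length(to_drop) > 0) cohort[-to_drop, ] else cohort

cohort_kept
```

With this approach, rows with a missing date are kept with their original values, and rows where the two dates are equal are kept as well.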

iamericfletcher
  • So if my dates were all between the years 2000 and 2019, would I change as.Date to as.Date('2000/01/01'), as.Date('2020/01/01')? – Eams Mar 11 '21 at 20:40
  • Don't worry about that section of my code. I needed to produce my own sample data because you didn't provide any. Use the code starting from `library(dplyr)` and change `df` to the name of your dataset. If the columns are indeed named `dateA` and `dateB` then the code should work as is. – iamericfletcher Mar 11 '21 at 20:42
  • Either way, you should edit your question to include a reproducible example as noted in my comment. We need to see your data in order to give you an answer that is as accurate as possible. Otherwise we are taking shots in the dark. – iamericfletcher Mar 11 '21 at 20:43
  • Thank you. So I did that and renamed it cohort2 for this purpose, but this is now a cohort of the excluded individuals rather than the ones I want to keep, so how do I remove them from the original cohort? – Eams Mar 11 '21 at 20:52
  • I am not sure I understand what you are asking. Can you please update your question to include a reproducible example and expected output? – iamericfletcher Mar 11 '21 at 20:57
  • You asked to remove all the dates where dateA is before dateB. That is what this code does. – iamericfletcher Mar 11 '21 at 20:59
  • I've edited it now. Hopefully it makes more sense. I also have some missing pieces of data in each of the columns, and I need to keep these individuals as well. – Eams Mar 11 '21 at 21:05
  • I have updated my solution that now uses your data and outputs what you want. – iamericfletcher Mar 11 '21 at 21:06
  • Apologies, I think I was still adding a row in when you edited your comment. I also have some pieces of missing information in the column date A. I want to keep those individuals, but currently they get excluded with this code. – Eams Mar 11 '21 at 21:09
  • Check my latest update. Does that work for you? It is outputting exactly what you are looking for. – iamericfletcher Mar 11 '21 at 21:14
  • Yes! Thank you, I appreciate you helping! – Eams Mar 11 '21 at 21:18
  • Please upvote and mark this solution as accepted by clicking the checkmark. – iamericfletcher Mar 11 '21 at 21:18