161

I am trying to use grep to test whether the strings in one vector are present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this:

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

I have a vector of string patterns to be found in the "Letter" column, for example: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector are present in the "Letter" column. If they are, I would like to output the unique matching values.

The problem is, I don't know how to use grep with multiple patterns. I tried:

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is not correct. Any suggestions?

zx8754
user971102
  • You can't use `fixed=TRUE` because your pattern is a _true_ regular expression. – Marek Oct 05 '11 at 15:27
  • Using `match` or `%in%` or even `==` is the *only* correct way to compare exact matches. Regex is very dangerous for such a task and can lead to unexpected results. – David Arenburg Sep 12 '16 at 05:34

10 Answers

307

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))
Henrik
Brian Diggs
  • Any way to do this when your list of strings includes regex operators as punctuation? – user124123 Jan 27 '15 at 17:10
  • @user1987097 It should work the same way, with or without any other regex operators. Did you have a specific example this didn't work for? – Brian Diggs Feb 04 '15 at 18:26
  • @user1987097 Use 2 backslashes before a dot or bracket. The first backslash is an escape character so that the second one, which is needed to disable the operator, is interpreted. – mbh86 Mar 11 '16 at 14:48
  • Using regex for exact matches seems dangerous to me and can have unexpected results. Why not just `toMatch %in% myfile$Letter`? – David Arenburg Sep 12 '16 at 05:30
  • @user4050 No specific reason. The version in the question had it and I probably just carried it through without thinking about whether it was necessary. – Brian Diggs Jun 01 '17 at 03:16
  • method also works for matching multiple patterns not in a dataframe, but within a character vector. – Momchill Nov 30 '20 at 15:55
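The two concerns raised in the comments above (partial matches and regex operators inside the search strings) can be sketched as follows. This is only an illustration under stated assumptions: the values `A10`, `A.1` and `AB1` are hypothetical and do not appear in the original data.

```r
# Hypothetical data: "A10" is added to show the partial-match pitfall
letters_col <- c("A1", "A6", "A7", "A10", "A9")
toMatch <- c("A1", "A9", "A6")

# Unanchored alternation also returns "A10", because "A1" is a substring of it
loose <- unique(grep(paste(toMatch, collapse = "|"), letters_col, value = TRUE))

# Wrapping the alternation in ^( ... )$ restricts it to whole-string matches
anchored <- paste0("^(", paste(toMatch, collapse = "|"), ")$")
exact <- unique(grep(anchored, letters_col, value = TRUE))

# If the search strings contain regex operators, one option is to match each
# string literally with fixed = TRUE and combine the results with a logical OR
x <- c("A.1", "AB1", "A1")          # hypothetical values
literal <- c("A.1")                 # the dot should be a literal dot here
as_regex <- grepl(paste(literal, collapse = "|"), x)
as_fixed <- Reduce(`|`, lapply(literal, grepl, x = x, fixed = TRUE))

print(loose)     # "A1"  "A6"  "A10" "A9"
print(exact)     # "A1" "A6" "A9"
print(as_regex)  # TRUE TRUE FALSE
print(as_fixed)  # TRUE FALSE FALSE
```

The `fixed = TRUE` route avoids hand-written escaping, at the cost of one `grepl` call per search string.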
43

Good answers, however don't forget about filter() from dplyr:

library(dplyr)

patterns <- c("A1", "A9", "A6")

> your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

# Keep the rows whose Letter matches any of the patterns
result <- filter(your_df, grepl(paste(patterns, collapse = "|"), Letter))

> result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6
Adamm
  • I think that `grepl` works with one pattern at a time (it needs a vector of length 1). We have 3 patterns (a vector of length 3), so we can combine them into one using a separator that `grepl` understands, `|`. Try your luck with others :) – Adamm Feb 23 '18 at 09:16
  • Oh, I get it now. So it's a compact way to output something like A1|A2, so if one wanted all conditions then the collapse would be with an & sign. Cool, thanks. – Ahdee Feb 23 '18 at 15:41
  • Hi, using `)|(` to separate patterns might make this more robust: `paste0("(", paste(patterns, collapse=")|("),")")`. Unfortunately it also becomes slightly less elegant. This results in the pattern `(A1)|(A9)|(A6)`. – fabern Jul 09 '19 at 16:09
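On the question of requiring all patterns rather than any: regular expressions in R have no `&` operator, so collapsing with `"&"` would search for a literal ampersand. One possible approach, sketched here with hypothetical input strings, is to run `grepl` once per pattern and combine the logical vectors with `&`.

```r
patterns <- c("A1", "A9")
x <- c("A1 A9", "A1", "A9 A1", "A6")   # hypothetical strings for illustration

# TRUE where at least one pattern matches (alternation with "|")
any_match <- grepl(paste(patterns, collapse = "|"), x)

# TRUE where every pattern matches: one grepl per pattern, combined with &
all_match <- Reduce(`&`, lapply(patterns, grepl, x = x))

print(any_match)  # TRUE TRUE TRUE FALSE
print(all_match)  # TRUE FALSE TRUE FALSE
```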
36

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'
petermeissner
BOC
  • `%like%` isn't in base R, so you should mention what package(s) are needed to use it. – Gregor Thomas Nov 01 '18 at 16:39
  • For others looking at this answer, `%like%` is part of the `data.table` package. Also similar in `data.table` are `like(...)`, `%ilike%`, and `%flike%`. – steveb May 05 '20 at 15:35
8

Based on Brian Diggs's post, here are two helpful functions for filtering lists of strings:

# Returns the unique items in theList that do NOT match any pattern in toMatch.
# toMatch can be a single pattern or a vector of patterns.
exclude <- function(theList, toMatch) {
  return(setdiff(theList, include(theList, toMatch)))
}

# Returns the unique items in theList that DO match at least one pattern in toMatch.
# toMatch can be a single pattern or a vector of patterns.
include <- function(theList, toMatch) {
  matches <- unique(grep(paste(toMatch, collapse = "|"),
                         theList, value = TRUE))
  return(matches)
}
Austin
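As a possible alternative to the `setdiff` approach in the answer above, base R's `grep` accepts `invert = TRUE`, which returns the elements that do not match. The sketch below reuses the question's "Letter" values; the variable names are chosen only for illustration.

```r
theList <- c("A1", "A6", "A7", "A1", "A9", "A6")
toMatch <- c("A1", "A9", "A6")
pattern <- paste(toMatch, collapse = "|")

# Unique elements matching at least one pattern
included <- unique(grep(pattern, theList, value = TRUE))

# Unique elements matching none of the patterns
excluded <- unique(grep(pattern, theList, value = TRUE, invert = TRUE))

print(included)  # "A1" "A6" "A9"
print(excluded)  # "A7"
```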
6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)
dwitvliet
user3877096
  • One thing to note with `match` is that it is not using patterns, it is expecting an exact match. – steveb May 05 '20 at 15:39
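For exact matching, a small sketch with the question's data may help show what `match` returns and how `%in%` yields the values that are present. The data frame is rebuilt here so that the block is self-contained.

```r
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)
toMatch <- c("A1", "A9", "A6")

# match() gives the position of the first exact match for each search string
# (NA when a string is absent)
first_pos <- match(toMatch, myfile$Letter)

# %in% gives a logical vector; subsetting returns the values that are present
present <- unique(myfile$Letter[myfile$Letter %in% toMatch])

print(first_pos)  # 1 5 2
print(present)    # "A1" "A6" "A9"
```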
5

To add to Brian Diggs's answer:

Another way, using grepl, returns a data frame containing all the matching rows.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?

DryLabRebel
4

Not sure whether this answer has already appeared...

For the particular pattern in the question, you can just do it with a single grep() call,

grep("A[169]", myfile$Letter)
BenBarnes
Assaf
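A brief check, using the question's "Letter" values, that the character class `A[169]` selects the same elements as the alternation `A1|A9|A6`. Note that this shortcut only applies when the patterns differ in a single character position.

```r
x <- c("A1", "A6", "A7", "A1", "A9", "A6")

# Indices matched by the character class and by the alternation
idx_class <- grep("A[169]", x)
idx_alt <- grep("A1|A9|A6", x)

print(idx_class)                      # 1 2 4 5 6
print(identical(idx_class, idx_alt))  # TRUE
```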
2

Take away the spaces, and drop fixed=TRUE, because with fixed=TRUE the | is matched as a literal character. So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))
Saurabh Chauhan
1

Using sapply:

patterns <- c("A1", "A9", "A6")
df <- data.frame(name = c("A", "Ale", "Al", "lex", "x"),
                 Letters = c("A1", "A2", "A9", "A1", "A9"))

> df
  name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9

> df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = FALSE)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9
dondapati
-1

I suggest writing a little script and doing multiple searches with grep. I've never found a way to search for multiple patterns, and believe me, I've looked!

Like so, your shell file, with an embedded string:

 #!/bin/bash
 # grep reads files or standard input, so the string is fed in as a here-string;
 # -o prints only the matching part instead of the whole line
 text="Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
 grep -o "A6" <<< "$text"
 grep -o "A7" <<< "$text"
 grep -o "A8" <<< "$text"

Then run it by typing ./myshell.sh (after making it executable with chmod +x myshell.sh).

If you want to be able to pass in the string on the command line, do it like this, with a shell argument (this is bash notation, by the way):

 #!/bin/bash
 stringtomatch="${1}"  # first command-line argument
 grep -o "A6" <<< "${stringtomatch}"
 grep -o "A7" <<< "${stringtomatch}"
 grep -o "A8" <<< "${stringtomatch}"

And so forth.

If there are a lot of patterns to match, you can put it in a for loop.

Jaap
ChrisBean
  • Thank you ChrisBean. There are actually lots of patterns, so maybe it would be better to use a file. I am new to BASH, but maybe something like this should work… #!/bin/bash for i in 'pattern.txt' do echo $i j='grep -c "${i}" myfile.txt' echo $j if [$j -eq o ] then echo $i >> matches.txt fi done – user971102 Sep 29 '11 at 15:44
  • doesn't work…the error message is '[grep: command not found'…I have grep in the /bin folder, and /bin is on my $PATH…Not sure what is happening…Can you please help? – user971102 Sep 29 '11 at 16:33