6

I have a vector of strings, each of which looks like this:

ABC_EFG_HIG_ADF_AKF_MNB

From each element I want to extract the third underscore-separated field (counting from the left), i.e. HIG in this case. How can I achieve this in R?

zx8754
Rajarshi Bhadra

4 Answers

13

substr extracts a substring by position:

substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)

returns

[1] "HIG"
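A minimal sketch of when a fixed position is safe, assuming a hypothetical second string (`"A_BC_DEF_G"`, not from the question) whose fields have different widths. substr() is vectorized, but it only returns the third field when every string has the same layout:

```r
# Only the first element comes from the question; the second is a
# hypothetical string with fields of different widths.
strings <- c("ABC_EFG_HIG_ADF_AKF_MNB", "A_BC_DEF_G")

# substr() is applied to every element, always at characters 9 to 11
by_position <- substr(strings, 9, 11)
print(by_position)
# The first element gives "HIG"; the second gives "_G", not its third field "DEF"
```

If field widths can vary, splitting on the delimiter is likely the safer route.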
alistaire
8

Here's one more possibility:

strsplit(str1,"_")[[1]][3]
#[1] "HIG"

The command strsplit() does what its name suggests: it splits a string. The second argument is the pattern on which the string is split, wherever it occurs within the string; by default it is interpreted as a regular expression.

Perhaps somewhat surprisingly, strsplit() returns a list, with one element per input string. We can either use unlist() to flatten it or, as here, select the first element with [[1]]: the list in this example has only one member, a vector of six character strings (cf. the output of str(strsplit(str1,"_"))). To access the third entry of that vector, we append [3] to the command.

The string str1 is defined here as in the answer by @akrun.
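Since the question mentions a vector of strings, the same idea can be sketched for several strings at once. This is only an illustration: the second element of `strings` and the names `parts` and `third` are assumptions, not part of the original question.

```r
# Hypothetical input vector; only the first element comes from the question.
strings <- c("ABC_EFG_HIG_ADF_AKF_MNB", "AAA_BBB_CCC_DDD")

# strsplit() returns a list with one character vector per input string.
# fixed = TRUE treats "_" as a literal string rather than a regex.
parts <- strsplit(strings, "_", fixed = TRUE)

# Take the third piece of each vector; strings with fewer than
# three pieces yield NA.
third <- vapply(parts, function(p) p[3], character(1))
print(third)
#[1] "HIG" "CCC"
```

vapply() is used here instead of sapply() so that the result is guaranteed to be a character vector.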

RHertel
    Was about to post the same, but slightly different: `strsplit(str1,"_")[[c(1,3)]]`, just to show what a vector does inside `[[`. – nicola Mar 02 '16 at 17:30
6

We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and wrap that in a capture group. Since we want the third set of non-_ characters, we repeat that group twice ({2}), then add a second capture group of one or more non-_ characters, and match the rest of the string with .*. In the replacement, we use the backreference to the second capture group (\\2).

sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
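sub is vectorized, so the same pattern can be applied to a whole vector. One caveat worth sketching: when the pattern does not match, sub returns the input unchanged. The example below assumes a hypothetical vector `strings` (only its first element is from the question) and guards against that case with grepl.

```r
# Hypothetical input; the last element has fewer than three fields.
strings <- c("ABC_EFG_HIG_ADF_AKF_MNB", "AAA_BBB_CCC_DDD", "ONLY_TWO")
pattern <- "^([^_]+_){2}([^_]+).*"

# Replace each matching string by its second capture group
third <- sub(pattern, "\\2", strings)

# Non-matching strings come back unchanged, so mark them as missing
third[!grepl(pattern, strings)] <- NA_character_
print(third)
#[1] "HIG" "CCC" NA
```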

Another option is scan:

scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"

A similar option, as mentioned by @RHertel, would be to use read.table/read.csv on the string:

read.table(text=str1, sep="_", stringsAsFactors=FALSE)[,3]
#[1] "HIG"

data

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
akrun
5

If you know the position of the substring you are looking for, and you know that it is fixed (here, characters 9 through 11), you can simply use str_sub() from the stringr package.

library(stringr)  # provides str_sub()
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)
#[1] "HIG"
Rtist