Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter that ignores the last character if it is a "/".

I tried the following regex "(http:\\/\\/)?compras\\.dados\\.gov\\.br.*\\?.*(?<!\\/)", as tested on regexr.com/4om61, but it doesn't work when I run it in R as:

regex_exp_R   <- "(http:\\/\\/)?compras\\.dados\\.gov\\.br.*\\?.*(?<!\\/)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)

I need this to work in pure regex and grep function, without using any string R package. Thank you.

Simplified Case: After the helpful contributions from all of you, one last issue remains. Because I will use the regex as an input to another function, the solution must work with pure regex and grep.

The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried

grep(".*[^//]", "a1bc/", perl = T, value = T), but I still get "a1bc/" instead of "a1bc". Any hints? Thank you.

Jaap
Fabio Correa
  • Use this as regex: `(?:http://)?compras\.dados\.gov\.br.*\?[^/]*` There is no need to use lookbehind here. – anubhava Nov 12 '19 at 16:47
  • `gsub('/$', '', x)` will make a copy of `x` without the `/` at the end (if there is one for the given element of `x`) – IceCreamToucan Nov 12 '19 at 16:53
  • I am not completely clear on what you are looking for--what do you mean by ignore? Do you want it returned without the last `/`, or do you want it to be an optional element of your search pattern? – Andrew Nov 12 '19 at 16:55
  • Dear Andrew, I want the string returned without the last "/". Thank you – Fabio Correa Nov 12 '19 at 17:01
  • @FabioCorrea: Did you try my suggested regex? – anubhava Nov 12 '19 at 17:04
  • @anubhava, yes, it did not work. Simplifying the problem and building on your proposed solution, when I try `grep(".*[^//]", "abc/", perl = T, value = T)`, I get "abc/" instead of "abc". Thank you. – Fabio Correa Nov 12 '19 at 17:14
  • With `grep()`, even if you correctly match part of the string, it will return the original string regardless. E.g., `grep("a", "abc", value = T)` – Andrew Nov 12 '19 at 17:16
  • @FabioCorrea: Check this: https://stackoverflow.com/a/23901600/548225 – anubhava Nov 12 '19 at 17:20
  • Just `grep` the `gsub("/+$", "", x)` – Wiktor Stribiżew Nov 12 '19 at 17:25
  • I edited the original question for a simplified last issue, after all contributions. Thank you. – Fabio Correa Nov 12 '19 at 19:08
  • @FabioCorrea, it cannot work with only `grep()` because `grep()` is not designed to return partial matches. It is designed to return an index--`grep("c", letters)`--but it can return the value of the original string instead of the index--`grep("c", letters, value = T)`. I would suggest using another base function such as `gsub` (on its own, or with `grep`). See the Value section of `?grep`. – Andrew Nov 12 '19 at 19:30

4 Answers

If you want to return the string without the last /, you can do this in several ways. Below are a couple of options using base R:

Using a back-reference in gsub() (sub() would work too here):

gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

# or, adapting your original pattern
gsub("((http:\\/\\/)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

By position using ifelse() and substr() (this will probably be a little faster if scaling matters):

ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

Data:

x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
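
To make the difference between a regex route and the positional route concrete, here is a minimal sketch on a small vector. The vector `urls` and its values are made up for illustration, and `endsWith()` (base R, available since R 3.3.0) is an assumption swapped in for the `substr()` comparison; it is not what the answer above wrote. The pattern `/+$` is borrowed from the comments.

```r
# Hypothetical inputs: one trailing slash, none, and two.
urls <- c("a1bc/", "a1bc", "a1bc//")

# Regex route: "/+$" removes every trailing slash.
by_regex <- sub("/+$", "", urls)

# Positional route: drops exactly one trailing character when it is "/".
by_position <- ifelse(endsWith(urls, "/"),
                      substr(urls, 1, nchar(urls) - 1),
                      urls)

print(by_regex)     # "a1bc" "a1bc" "a1bc"
print(by_position)  # "a1bc" "a1bc" "a1bc/"
```

The two routes agree unless a string ends in more than one slash, so the choice depends on whether repeated trailing slashes can occur in the data.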
Andrew

Use sub to remove a trailing /:

x <- c("a1bc/", "a2bc")
sub("/$", "", x)

This changes nothing on a string that does not end in /.

As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
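
As a rough illustration of that division of labour, the sketch below uses `grep()` only to select elements and `sub()` to strip the slash afterwards. The input vector and the simplified host pattern are assumptions made for the example, not taken from the question.

```r
# Hypothetical inputs; only the first two contain the host of interest.
x <- c("http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/",
       "compras.dados.gov.br/materiais/v1/materiais.html?pdm=1",
       "http://example.org/?pdm=08275/")

# Simplified pattern (an assumption): the host, then a "?" somewhere later.
pattern <- "compras\\.dados\\.gov\\.br.*\\?"

# grep() only selects elements; the selected strings come back unchanged.
hits <- grep(pattern, x, value = TRUE)

# sub() then strips any trailing slashes from the selected elements.
cleaned <- sub("/+$", "", hits)
print(cleaned)
```

Running the filter first and the cleanup second keeps each regex simple; the order could be swapped if the trailing slash would interfere with the filter pattern.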

ngwalton

You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:

.+(?<!\/)

You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match URLs, replace the `.+` part at the beginning with your URL regex.
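
In R, a look-behind needs `perl = TRUE`, and `grep()` would still hand back the whole element even when the pattern stops before the slash. One way to get only the matched part in base R is `regexpr()` with `regmatches()`; this pairing is a suggestion added here, not part of the answer above, and the input vector is the simplified case from the question.

```r
x <- c("a1bc/", "a1bc")

# grep() reports which elements match, so the trailing slash is still there.
unchanged <- grep(".+(?<!/)", x, perl = TRUE, value = TRUE)
print(unchanged)  # "a1bc/" "a1bc"

# regexpr() locates the match; regmatches() extracts just that substring.
m <- regexpr(".+(?<!/)", x, perl = TRUE)
trimmed <- regmatches(x, m)
print(trimmed)    # "a1bc" "a1bc"
```

Note that `regmatches()` silently drops elements with no match, so the result can be shorter than the input when some strings do not match the pattern.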

David542

How about trying gsub("(.*?)/+$","\\1",s)?

ThomasIsCoding