I have two texts:

First one: My favorite programming language is c++.

Second one: My favorite programming language is c.

I want to search for c and c++ in those texts separately.

To find c I can write \bc\b, but it also matches the first text, which is wrong; the second text is matched correctly. I also tried \bc^\+\b, but that doesn't work. To find c++ I tried, for example, \bc\+\+\b, but then it matches neither text. Help please.

EDIT:

And what if the text is: I programme in c++ a lot!

EDIT:

Here is the unit test that I need to pass:

package adhoc;

import java.util.HashSet;
import java.util.Set;

import org.junit.Test;

import junit.framework.TestCase;

public class FinderProgrammingTechnologyInTextTest extends TestCase{

    @Test
    public void testFind() {
        // Given:
        Set<String> setOfProgrammingLanguagesToSeek = new HashSet<>();
        setOfProgrammingLanguagesToSeek.add("java");
        setOfProgrammingLanguagesToSeek.add("perl");
        setOfProgrammingLanguagesToSeek.add("c");
        setOfProgrammingLanguagesToSeek.add("c++");

        // When:
        FinderProgrammingTechnologyInText finder = new FinderProgrammingTechnologyInText(
                setOfProgrammingLanguagesToSeek);
        Set<String> result = finder.find("java , perl! c++ and other staff");

        // Then:
        assertTrue(result.contains("java"));
        assertTrue(result.contains("perl"));
        assertFalse(result.contains("c"));
        assertTrue(result.contains("c++"));
    }

}

by changing ONLY the argument to the compile() method:

package adhoc;

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FinderProgrammingTechnologyInText {

    Set<String> setOfTechnologiesToSearch;

    public FinderProgrammingTechnologyInText(Set<String> x) {
        this.setOfTechnologiesToSearch = x;
    }

    public Set<String> find(String text) {
        // Keep every search term whose pattern is found somewhere in the text.
        return setOfTechnologiesToSearch.stream()
                .filter(x -> Pattern
                        .compile(x)  // change only this line
                        .matcher(text).find()
                        )
                .collect(Collectors.toSet());
    }
}

2 Answers

Replace .compile(x) line with

.compile("(?<![\\w\\p{S}])" + Pattern.quote(x) + "(?![\\w\\p{S}])")

Here, (?<![\w\p{S}]) is a negative lookbehind that will make sure there is no word or symbol char immediately to the left of the current location, and (?![\w\p{S}]) negative lookahead will make sure there is no word or symbol char immediately to the right of the current location (that is, word and symbol chars are your allowed "word" chars now).

See a sample regex demo for a c++ keyword at regex101.com.

Since the search words are passed as literal char sequences to Pattern, they must be escaped, and that is what Pattern.quote(x) is doing in the code.
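As a rough check of the pattern above, here is a minimal, self-contained sketch that applies it to the sentence from the unit test and to the second text from the question. The class name `LookaroundDemo` and the helpers `keywordPattern` and `find` are hypothetical names used only for illustration; they are not part of the original code.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class LookaroundDemo {

    // Hypothetical helper: wraps a literal keyword in the two lookarounds.
    static Pattern keywordPattern(String keyword) {
        return Pattern.compile(
                "(?<![\\w\\p{S}])" + Pattern.quote(keyword) + "(?![\\w\\p{S}])");
    }

    // Hypothetical helper: returns the keywords that occur in the text.
    static Set<String> find(List<String> keywords, String text) {
        Set<String> found = new LinkedHashSet<>();
        for (String keyword : keywords) {
            if (keywordPattern(keyword).matcher(text).find()) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("java", "perl", "c", "c++");

        // "c" is rejected here because '+' (a symbol char) follows it.
        System.out.println(find(keywords, "java , perl! c++ and other staff"));
        // prints [java, perl, c++]

        // '.' is punctuation, not a symbol, so "c" is accepted here.
        System.out.println(find(keywords, "My favorite programming language is c."));
        // prints [c]
    }
}
```

Note that `!` and `.` fall into the punctuation category rather than `\p{S}`, which is why `perl!` and `c.` still match.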

Wiktor Stribiżew

You could just look for the last word in the sentence before the dot.

[\w+]+(?=\.$)

https://regex101.com/r/aPYDTE/1

The problem with your pattern is that the plus sign is not a word character, so the word boundary \b does not match after it. If you used the dot as an anchor, you would get a match: \b(c\+\+)\.
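The word-boundary behaviour described here can be checked with a small sketch. The class `WordBoundaryDemo` and its `matches` helper are hypothetical names for illustration; the patterns and sample sentences come from the question and this answer.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String first = "My favorite programming language is c++.";
        String edited = "I programme in c++ a lot!";

        // '+' and '.' are both non-word chars, so no \b exists between them.
        System.out.println(matches("\\bc\\+\\+\\b", first));      // false

        // A \b does exist between 'c' (word char) and '+' (non-word char),
        // so the plain "c" pattern gives a false positive on "c++".
        System.out.println(matches("\\bc\\b", first));            // true

        // Anchoring on the literal dot works for the first text...
        System.out.println(matches("\\b(c\\+\\+)\\.", first));    // true
        // ...but not when "c++" is followed by something other than a dot.
        System.out.println(matches("\\b(c\\+\\+)\\.", edited));   // false
    }
}
```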

If you just want to match c/c++ and other languages, try \W(c\+\+|css|c|java)\W
I have added a non-word \W as a boundary on each side. Using lookarounds instead lets you use the full match rather than the capturing group $1.

(?<=\W)(c\+\+|css|c|java)(?=[^\w\+])

https://regex101.com/r/qWnOsB/4
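A minimal sketch of the lookaround pattern above, run in Java against the texts from the question. The class `AlternationDemo` and the `findAll` helper are hypothetical names for illustration. One caveat worth noting: because `(?<=\W)` requires a non-word character before the match and `(?=[^\w\+])` requires a character after it, a keyword at the very start or very end of the text is not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {

    // "c\+\+" is listed before "c" so the longer alternative is tried first.
    static final Pattern LANGUAGES =
            Pattern.compile("(?<=\\W)(c\\+\\+|css|c|java)(?=[^\\w\\+])");

    // Hypothetical helper: collects every full match in the text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = LANGUAGES.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("My favorite programming language is c++.")); // [c++]
        System.out.println(findAll("My favorite programming language is c."));   // [c]
        System.out.println(findAll("I programme in c++ a lot!"));                // [c++]

        // "java" sits at index 0, so the lookbehind has nothing to inspect
        // and the keyword is skipped; "perl" is not in the alternation.
        System.out.println(findAll("java , perl! c++ and other staff"));         // [c++]
    }
}
```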

wp78de