23

I have a string, for example:

String src = "How are things today /* this is comment **/ and is your code  /** this is another comment */ working?";

I want to remove the /* this is comment **/ and /** this is another comment */ substrings from the src string.

I tried to use a regex but failed for lack of experience.

Alan Moore
hanumant
  • 7
    Parsing Java code with regex is not something I'd recommend. – Confluence Oct 22 '12 at 15:49
  • @Confluence, I am not sure what could be the best option to achieve the result? Can you suggest one. – hanumant Oct 22 '12 at 15:52
  • What regex did you try? As you already say that you have tried something, you can as well just paste it here, so we can see your approach. We can go into more/less details about the solutions depending on your experience. – brimborium Oct 22 '12 at 15:52
  • This is what I used: /\\*.*\\/ ... and it removed the whole string after the first match – hanumant Oct 22 '12 at 15:59
  • from https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch07s06.html, you can use either `/\*.*?\*/` or `/\*[\s\S]*?\*/` – psykid Jul 14 '21 at 13:00

8 Answers

52

The best multiline comment regex is an unrolled version of (?s)/\*.*?\*/ that looks like

String pat = "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/";

See the regex demo and explanation at regex101.com.

In short,

  • /\* - match the comment start /*
  • [^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
  • (?:[^/*][^*]*\*+)* - 0+ sequences of:
    • [^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
  • / - closing /

David's regex needs 26 steps to find the match in my example string, while my regex needs just 12. With huge inputs, David's regex is likely to fail with a stack overflow or a similar error, because the lazy dot pattern .*? is expanded one character at a time and the rest of the pattern is retried at each position, whereas my pattern matches linear chunks of text in one go.
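
As a minimal sketch of how the unrolled pattern could be applied in Java, the snippet below precompiles it and removes every match. The class name UnrolledCommentDemo and the method stripBlockComments are hypothetical names chosen for illustration, and the sample string assumes the question's comments end in **/ and start with /**.

```java
import java.util.regex.Pattern;

class UnrolledCommentDemo {
    // The unrolled pattern from this answer, compiled once so it can be reused.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Hypothetical helper: deletes every /* ... */ comment, including multi-line ones.
    static String stripBlockComments(String src) {
        return BLOCK_COMMENT.matcher(src).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "How are things today /* this is comment **/ and is your code  "
                + "/** this is another comment */ working?";
        System.out.println(stripBlockComments(src));
        // Two empty comments with text in between: only the text should remain.
        System.out.println(stripBlockComments("/**/ kept /**/"));
    }
}
```

No (?s) flag is needed here because the negated character classes [^*] and [^/*] already match line breaks. The surrounding spaces are left in place, so removing a comment that sits between two spaces leaves a double space behind.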

Wiktor Stribiżew
  • How did you come up with this? – JiaChen ZENG Oct 07 '17 at 14:15
  • 2
    @AT-Aoi It is basically taken from *Mastering Regular Expressions*, *Removing C Comments* section. – Wiktor Stribiżew Oct 07 '17 at 14:21
  • 1
    This has a bug: it incorrectly extends comments that consist only of asterisks past the closing `*/`. A small, syntactically correct C snippet demonstrates this issue: `/**/ Incorrectly removed /**/`. – jerry Jun 05 '19 at 20:37
  • 1
    @jerry I introduced a change some time ago, trying to accommodate for repeating asterisks at the start. Rolled back to the original version. Now, your issue is [not repro](https://regex101.com/r/dU5fO8/73). – Wiktor Stribiżew Jun 05 '19 at 20:43
  • 2
    An assumption like "because the .*? lazy dot matching is inefficient" cannot be made in general without referring to a specific regex engine and version. Even if it holds true for some engine, it may not hold true for another one and not even for a different version of the same one. It's not defined how a regex engine works; that's comparable to SQL not specifying how a database really works under the hood. – Mecki Mar 22 '20 at 02:17
  • Your ``my regex`` link still links to the old broken version that fails for /***/. – erg Aug 01 '20 at 15:00
  • awesome solution :) – Nav Jan 23 '21 at 18:48
  • Do you know how to fix the case where the comment is actually a part of a string? I posted it here: https://stackoverflow.com/q/66301705 – john c. j. Feb 21 '21 at 12:09
  • How about finding multiple line comments having a certain word like `foo`? – Hasanuzzaman Sattar Dec 14 '21 at 11:53
  • @HasanuzzamanSattar Using POSIX-like pattern here would be hard, [I'd suggest](https://regex101.com/r/dU5fO8/177) `(?s)/\*(?:(?!/\*|\*/).)*?foo(?:(?!/\*|\*/).)*\*/`, but note it is not going to be efficient. The best approach here is to match all of the comments and just filter out those containing some other pattern. – Wiktor Stribiżew Dec 14 '21 at 12:00
  • @WiktorStribiżew It's not working at Dreamweaver search tool. I want match all c/C++ style multiple lines comments where `foo` word is present inside the comment at least once. – Hasanuzzaman Sattar Dec 14 '21 at 12:09
  • 1
    @HasanuzzamanSattar Then it does not use Java / PCRE compliant regex engine. Probably, they use some kind of ECMAScript there, so you need `/\*(?:(?!/\*|\*/)[\w\W])*?foo(?:(?!/\*|\*/)[\w\W])*\*/`. – Wiktor Stribiżew Dec 14 '21 at 12:16
20

Try using this regex (Single line comments only):

String src = "How are things today /* this is comment */ and is your code /* this is another comment */ working?";
String result = src.replaceAll("/\\*.*?\\*/", ""); // single-line comments only
System.out.println(result);

REGEX explained:

Match the character "/" literally

Match the character "*" literally

"." Match any single character

"*?" Between zero and unlimited times, as few times as possible, expanding as needed (lazy)

Match the character "*" literally

Match the character "/" literally

Alternatively, adding (?s) makes the dot match line breaks too, so the regex handles both single-line and multi-line comments:

// note the added \n, which the previous regex cannot match across
String src = "How are things today /* this\n is comment */ and is your code /* this is another comment */ working?";
String result = src.replaceAll("(?s)/\\*.*?\\*/", "");
System.out.println(result);


ThomasW
David Kroukamp
3

Try this one:

(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)

If you want to exclude the parts enclosed in " " then use:

(\"[^\"]*\"(?!\\))|(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)

The first capturing group identifies all " " parts, and the second capturing group gives you the comments (both single-line and multi-line).

Copy the regular expression to regex101 if you want an explanation.
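
A rough sketch of how the second regex might be used from Java follows. It rests on a few assumptions that are not stated in the answer: the pattern is compiled with Pattern.MULTILINE so that $ can end a // comment at each line break, the replacement "$1" is used to write quoted parts back while dropping comments, and the names QuoteAwareStripDemo and stripComments are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteAwareStripDemo {
    // The second regex from this answer, written as a Java string literal.
    // MULTILINE is an assumption made here so that $ matches at every line end.
    private static final Pattern STRING_OR_COMMENT = Pattern.compile(
            "(\"[^\"]*\"(?!\\\\))|(//[^\\n]*$|/(?!\\\\)\\*[\\s\\S]*?\\*(?!\\\\)/)",
            Pattern.MULTILINE);

    // Hypothetical helper: a quoted part (group 1) is written back unchanged;
    // for a comment match group 1 is unset, so "$1" inserts nothing.
    static String stripComments(String src) {
        return STRING_OR_COMMENT.matcher(src).replaceAll("$1");
    }

    public static void main(String[] args) {
        String src = "say(\"/* not a comment */\"); /* a comment */\n"
                + "int x = 1; // trailing comment";
        System.out.println(stripComments(src));
    }
}
```

Note that [^"]* in the first group stops at the first double quote, so a string literal containing an escaped quote (\") is not handled by this pattern.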

Akshay
0

C/C++ style comments in Java source cannot be matched on their own.
Quoted strings have to be parsed at the same time, within the same regex,
because a string may contain /* or //, which looks like the start of a comment when it is just part
of the string.

Note that additional regex considerations are needed if raw string constructs
are possible in the language.

The regex below does this.
Group 1 contains the comment and group 2 contains the non-comment.
For example, to remove comments:

Find
(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)

Replace
$2


Stringed:
"(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n|$))|(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|[\\S\\s][^/\"'\\\\]*)"

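As a sketch only, the "Stringed" pattern and the $2 replacement above could be wired together in Java like this. The class name CommentOrCodeDemo and the method removeComments are hypothetical; the pattern text is the stringed version from this answer, split across two literals purely for readability.

```java
import java.util.regex.Pattern;

class CommentOrCodeDemo {
    // Group 1 = comment (block or line), group 2 = non-comment
    // (double-quoted string, single-quoted literal, or a run of other text).
    private static final Pattern COMMENT_OR_CODE = Pattern.compile(
            "(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n|$))"
            + "|(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|[\\S\\s][^/\"'\\\\]*)");

    // Hypothetical helper: every match is replaced by its group 2 text,
    // so non-comments are copied through and comments (group 2 unset) vanish.
    static String removeComments(String src) {
        return COMMENT_OR_CODE.matcher(src).replaceAll("$2");
    }

    public static void main(String[] args) {
        String src = "String s = \"a /* b */\"; /* c */ int x; // d";
        System.out.println(removeComments(src));
    }
}
```

In Java, a group that did not take part in a match contributes nothing to the replacement, which is why "$2" alone is enough here. The line-comment branch also consumes the line break that ends the comment, so removing a // comment joins the following line onto the current one.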
0
(?s)(?i)(^|\s+?)(\/\*)((.)(?!\*\/))*?(this)(.*?)(\*\/)

With this pattern you can find comments that contain a specific word (here, the word this).

richardec
-1
System.out.println(src.replaceAll("\\/\\*.*?\\*\\/ ?", ""));

You have to use the non-greedy-quantifier ? to get the regex working. I also added a ' ?' at the end of the regex to remove one space.

jens-na
-1

Try this which worked for me:

System.out.println(src.replaceAll("(\\/\\*.*?\\*\\/)+", ""));
Digerkam
-1

This could be the best approach for multi-line comments

System.out.println(text.replaceAll("\\/\\*[\\s\\S]*?\\*\\/", ""));

Mahesh Yadav