I am working on a plugin that parses HTML files. I use a naming convention like this:

<!--$include="a.html" -->

or

<!--$include="a.html"-->

which is equivalent.

Following this pattern (similar to server-side includes), I want to search an HTML file. The task is:

Find each occurrence of the pattern and get its value (a.html in my example; the name varies).

It should work something like this:

while (!finishedWholeFile) {
    fileName = findPatternFunc(htmlFile);
    replaceFunc(fileName, something);
}

PS: I don't know whether it is better to use a regex in Java or to implement this differently (for example with .indexOf()). If regex performs well in this situation, I want to use it.

Any ideas?

kamaci

3 Answers


You mean like this?

<!--\$include=\"(?<htmlName>[a-z_-]*)\.html\"\s?-->
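A minimal sketch of how a pattern like this might be used from Java to collect every included name. The class and method names (`IncludeFinder`, `findIncludes`) are hypothetical, and the pattern assumes lowercase file names made of letters, hyphens, and underscores, as in the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeFinder {
    // Assumed directive shape: <!--$include="name.html"--> with an optional
    // whitespace character before the closing "-->".
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"(?<htmlName>[a-z_-]*)\\.html\"\\s?-->");

    // Returns the captured names (without the .html extension) in document order.
    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(html);
        while (m.find()) {
            names.add(m.group("htmlName"));
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>x</p><!--$include=\"a.html\" --><!--$include=\"b.html\"-->";
        System.out.println(findIncludes(html)); // prints [a, b]
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every file, which is usually the main regex cost in a loop like the one in the question.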
Muqito

Read the file into a string, then

str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);

will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)

If you only want to replace filenames with the html extension, swap the [^\"]+ for [^.]+\\.html (the dot is escaped so that it matches a literal dot).

Using regex for this task is fine performance-wise, but see e.g. How to use regular expressions to parse HTML in Java? and Java Regex performance etc.
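If the goal is the loop from the question (find each file name, then substitute something that depends on it), one possible sketch uses `Matcher.find` with `appendReplacement`. The names `IncludeExpander`, `expand`, and the `fragments` map are hypothetical; the map stands in for whatever the plugin would actually load for each file name. `Matcher.quoteReplacement` is used because `$` and `\` have special meaning in replacement strings.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeExpander {
    // Captures anything between the double quotes, with an optional space before "-->".
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"([^\"]+)\" ?-->");

    // Replaces each whole directive with the fragment registered for its file name.
    // Directives whose file name is not in the map are left unchanged.
    static String expand(String html, Map<String, String> fragments) {
        Matcher m = INCLUDE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fileName = m.group(1);
            String replacement = fragments.containsKey(fileName)
                    ? fragments.get(fileName)
                    : m.group();
            // Quote the replacement so "$" and "\" in it are treated literally.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This makes a single pass over the input, so the file is scanned once regardless of how many directives it contains.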

MikeM
  • quote from your links: "Using regular expressions to pull values from HTML is always a mistake." and: "Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!" :) – linski Dec 31 '12 at 12:15
  • @linski. Yes, I included the links because I wanted kamaci to consider such opinions before _making up his own mind_. – MikeM Dec 31 '12 at 13:58
  • I thought it might be more visible, now that I have read it again it seems more obvious. – linski Dec 31 '12 at 14:03

I have used that pattern:

"<!--\\$include=\"(.+)(.)(html|htm)\"-->"
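A few caveats with this pattern may be worth checking: `(.)` matches any character rather than a literal dot, the greedy `(.+)` can run across two directives on the same line, and the variant with a space before `-->` is not matched. The sketch below compares it with a tighter alternative; the class name `IncludePatternCheck` and the `TIGHTER` pattern are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludePatternCheck {
    // The pattern from the answer above, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("<!--\\$include=\"(.+)(.)(html|htm)\"-->");

    // Hypothetical tighter variant: no quotes inside the name, a literal dot,
    // and an optional whitespace character before "-->".
    static final Pattern TIGHTER =
            Pattern.compile("<!--\\$include=\"([^\"]+)\\.(html?)\"\\s?-->");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstName(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String twoOnOneLine = "<!--$include=\"a.html\"--><!--$include=\"b.html\"-->";
        // The greedy (.+) spans both directives here.
        System.out.println(firstName(ORIGINAL, twoOnOneLine));
        // The tighter pattern stops at the first closing quote.
        System.out.println(firstName(TIGHTER, twoOnOneLine));
    }
}
```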
kamaci