7

I have a text containing just HTML entities such as &lt; and &nbsp;. I need to remove all of them and get just the text content:

&nbsp;Hello there&lt;testdata&gt;

So I need to get "Hello there" and "testdata" from this string. Is there any way to do this using a negative lookahead?

I tried the following: /((?!&.+;).)+/ig, but this doesn't seem to work very well. How can I extract just the required text?

Mkl Rjv

3 Answers

17

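A better way to match HTML entities is the following regular expression:

/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig

This pattern ignores false entities: it only matches an ampersand followed by an alphanumeric name, a decimal reference, or a hexadecimal reference, terminated by a semicolon.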
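
For anyone working in Python, here is a minimal sketch of the same pattern used to strip entities. The function name `strip_entities` and the sample string are assumptions made for illustration; `re.IGNORECASE` stands in for the `/i` flag, and `re.sub` already replaces every match, like `/g`.

```python
import re

# Python translation of the pattern above.
ENTITY_RE = re.compile(
    r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});",
    re.IGNORECASE,
)

def strip_entities(text, replacement=""):
    """Replace every well-formed entity reference with `replacement`."""
    return ENTITY_RE.sub(replacement, text)

# Sample string assumed to be what the question intended.
print(strip_entities("&nbsp;Hello there&lt;testdata&gt;"))  # Hello theretestdata

# An ampersand that does not start a reference is left alone.
print(strip_entities("a & b; c"))  # a & b; c
```

Passing a separator such as `" "` as `replacement` keeps the text pieces apart instead of running them together.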

Kevin Doyon
Mahoor13
  • This doesn't necessarily matter, but it's worth noting that this is technically not comprehensive. `&amp`, `{`, and `{` are all valid HTML entities that won't be matched by this. – Grant Gryczan Sep 10 '21 at 03:05
  • [a-z0-9]+ matches &amp and similar forms, and #[0-9]{1,6} matches all numeric entities up to &#999999;. I think other forms are not useful. – Mahoor13 Sep 11 '21 at 12:59
  • It matches `&amp;`, not `&amp`. Your regex requires a semicolon, but `&amp` is a valid HTML entity. And I didn't say anything about whether those forms of entities are useful. I only said this regex is not comprehensive. If someone needed a comprehensive regex for their use case, this would not work. – Grant Gryczan Sep 11 '21 at 17:30
4

Here are 2 suggestions:

1) Match all the entities using /(&.+;)/ig. Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this SO question for a similar solution that accounts for more cases: How to remove html special chars?

2) If you really want to do this using the plaintext portions, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. What it actually does is match the pieces between ; and &, with special cases for the start and end of the string where there is no entity. This is probably not the way to go; you are likely to run into cases where it breaks.
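
A rough Python sketch of both suggestions, assuming the question's sample is `&nbsp;Hello there&lt;testdata&gt;`. One caveat about suggestion 1: `.+` is greedy, so on that sample `&.+;` would swallow everything from the first `&` to the last `;`. The sketch therefore uses the lazy `.+?`, which is an adjustment and not part of the suggestion as written. The function names are made up for illustration.

```python
import re

# Suggestion 1, adjusted to a lazy quantifier so each match stops at the nearest ';'.
ENTITY_RE = re.compile(r"&.+?;")

# Suggestion 2, unchanged: text between ';' (or the start) and '&' (or the end).
PLAIN_TEXT_RE = re.compile(r"(?:^|;)([^&;]+)(?:&|$)")

def remove_entities(text):
    """Delete everything that looks like an entity."""
    return ENTITY_RE.sub("", text)

def plain_text_pieces(text):
    """Return the plain-text runs captured by suggestion 2's pattern."""
    return PLAIN_TEXT_RE.findall(text)

sample = "&nbsp;Hello there&lt;testdata&gt;"
print(remove_entities(sample))     # Hello theretestdata
print(plain_text_pieces(sample))   # ['Hello there', 'testdata']

# A literal ';' in the text is one of the cases that breaks suggestion 2:
print(plain_text_pieces("Hi; there&amp;"))  # [' there']  ("Hi" is lost)
```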

dtyler
1

It's language-specific, but in Python you can use html.unescape (see the documentation). For example:

import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >
gneusch