0

I know you can remove HTML tags with a command such as this:

REGEXP_REPLACE(overview, '<.+?>')

But, some of the text has actual HTML encoding, where the application actually encoded things, like single quotes to be: &#39 or ’

I'm assuming these are pretty standard. Is there a way to remove them and replace them with the actual character, or am I stuck with REPLACE and listing them?

Many thanks!

Landon Statis
  • 573
  • 1
  • 7
  • 18

2 Answers2

1

Use a proper XML parser:

with t (overview) as (
  SELECT '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual UNION ALL
  SELECT '<html><head><title>Test</title></head><body><p>&lt;test&gt;</p></body></html>' from dual
)
SELECT x.*
FROM   t
       CROSS JOIN LATERAL (
         SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
         FROM   XMLTABLE(
                  '//*'
                  PASSING XMLTYPE(t.overview)
                  COLUMNS
                    value CLOB PATH './text()'
                )
      ) x

Which outputs:

TEXT
Some entities: & ' < > to be handled
Test<test>

db<>fiddle here

MT0
  • 113,669
  • 10
  • 50
  • 103
0

You can use utl_i18n.unescape_references():

utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>'))

As a demo:

-- sample data
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>'
  from dual
)
select REGEXP_REPLACE(overview, '<.+?>') as result1,
  utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>')) as result2
from t

gets

RESULT1 RESULT2
Some entities: &amp; &#39; &lt; &gt; to be handled Some entities: & ' < > to be handled

db<>fiddle


I'm not endorsing (or attacking) the notion of using regular expressions; that's handled and refuted and discussed elsewhere. I'm just addressing the part about encoded entities.

Alex Poole
  • 173,063
  • 10
  • 166
  • 288