PL SQL Remove HTML Encoding

Question

I know you can remove HTML tags with a command such as this:

REGEXP_REPLACE(overview, '<.+?>')

But some of the text contains actual HTML encoding, where the application encoded characters such as single quotes as &#39; or &rsquo;

I'm assuming these are pretty standard. Is there a way to remove them and replace them with the actual character, or am I stuck with REPLACE and listing them?

Many thanks!

The regular expression fails on the HTML `
`
– MT0, Feb 03 '22 at 20:47

score 1 · Answer 1 · answered Feb 03 '22 at 21:14

Use a proper XML parser:

with t (overview) as (
  SELECT '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual UNION ALL
  SELECT '<html><head><title>Test</title></head><body><p>&lt;test&gt;</p></body></html>' from dual
)
SELECT x.*
FROM   t
       CROSS JOIN LATERAL (
         SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
         FROM   XMLTABLE(
                  '//*'
                  PASSING XMLTYPE(t.overview)
                  COLUMNS
                    value CLOB PATH './text()'
                )
      ) x

Which outputs:

TEXT

Some entities: & ' < > to be handled

Test<test>


Alex Poole · Accepted Answer · 2022-02-03T18:44:39.447

You can use utl_i18n.unescape_reference():

utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>'))

As a demo:

-- sample data
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>'
  from dual
)
select REGEXP_REPLACE(overview, '<.+?>') as result1,
  utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>')) as result2
from t

gets

RESULT1	RESULT2
Some entities: &amp; &#39; &lt; &gt; to be handled	Some entities: & ' < > to be handled


I'm not endorsing (or attacking) the notion of using regular expressions; that's handled and refuted and discussed elsewhere. I'm just addressing the part about encoded entities.
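
If the same expression is needed in several queries, it could be wrapped in a small helper function. This is only a sketch: the name strip_html_text is hypothetical (it is not an Oracle built-in), it assumes the input fits in a VARCHAR2, and it inherits whatever limitations the tag-stripping regular expression has.

```sql
-- In SQL*Plus or SQL Developer, run SET DEFINE OFF first so that the
-- ampersands below are not treated as substitution variables.

-- Hypothetical helper (the name is an assumption, not part of any Oracle package):
-- strips tags with the same regular expression used above, then decodes
-- character references with utl_i18n.unescape_reference.
create or replace function strip_html_text (p_html in varchar2)
  return varchar2
  deterministic
is
begin
  return utl_i18n.unescape_reference(regexp_replace(p_html, '<.+?>'));
end strip_html_text;
/

-- Usage: tag removal and reference decoding in a single call
select strip_html_text('<p>It&#39;s &lt;b&gt; &amp; more</p>') as result
from dual;
```

Keeping both steps in one function means a later switch to a different tag-stripping approach only has to be made in one place.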

utl_i18n.unescape_reference appears to be working. New feature for me......thanks!! — Landon Statis, Feb 04 '22 at 03:38
