16

I have the following records in one of my table

CD&M Communications 
auburndale oil & propane inc  
C F La Fountaine #7561  
Laramie County Fire District # 2  
AmeriGas Propane LP #2250  

Is there a way to remove the characters like &, #7561, #2250 etc.

"&" should be replaced with "&" as per C# HTMLDECODE function

Andrew Morton
  • 22,953
  • 9
  • 55
  • 76
VPP
  • 665
  • 1
  • 8
  • 30

4 Answers4

24

There is a much easier solution...

SQL Server supports the XML datatype, and it supports decoding XML/HTML encoded entities. If you just cast the string to the XML datatype, you can use the built in decode function.

That would look like this:

select cast('Q & A' as XML).value('.[1]','nvarchar(max)' );

To turn it into a function for easy use:

create function dbo.xmlDecode (@string nvarchar(max))
returns varchar(max)
begin
    return cast(@string as XML).value('.[1]','nvarchar(max)' )
end;

Keep in mind that in OP's example, the string seems to have been encoded 3 times in a row. & got turned into & then into & and then into &. As a result, to get the "original" string back, you have to use the decode function 3 times.

Wouter
  • 1,688
  • 2
  • 24
  • 34
  • One user pointed out that this conversion fails when the string contains the characters < or >. I have not been able to reproduce this issue. If anyone encounters this, I would be interested in an example case. – Wouter May 04 '18 at 09:00
  • There are 5 characters that break this solution: `, &, ", '` For example: try `select cast('Q&A' as XML).value('.[1]','nvarchar(max)' )` – egerardus May 21 '19 at 00:28
  • @Geronimo: aha, thanks for that clarification. Indeed, when the string being cast to XML contains characters that are *not* XML encoded, an error is thrown. Because of this, this approach will not work for mixed-encoded strings, but only for strings that have been fully XML-encoded one or multiple times. – Wouter May 21 '19 at 09:03
  • This approach will work in most use-cases, yet the currently accepted answer by @sathish remains a good "plan B" approach for the mixed-encoding cases. – Wouter May 21 '19 at 09:05
  • 1
    I liked your use of casting because it covers the HEX entities that the accepted answer does not. I have plagarized it below and added handling for XML entities. – egerardus May 21 '19 at 16:24
  • 2
    Note that is will generally not work, as XML encoding is not the same as HTML encoding which this question is about. E.g. it will fail at `'A B'` with *"XML parsing: line 1, character 7, well formed check: undeclared entity"*. – GSerg Jun 28 '19 at 12:53
12

The following SQL function would work in your case or it would be a good starting point for you to extend it. However, please note the String manipulations in the Database [SQL Server] would be slower compared to the string manipulations in application layer.

GO

IF OBJECT_ID('dbo.MyHTMLDecode') IS NOT NULL BEGIN DROP FUNCTION dbo.MyHTMLDecode END

GO
CREATE FUNCTION dbo.MyHTMLDecode (@vcWhat VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @vcResult VARCHAR(MAX)
    DECLARE @siPos INT
        ,@vcEncoded VARCHAR(7)
        ,@siChar INT

    SET @vcResult = RTRIM(LTRIM(CAST(REPLACE(@vcWhat COLLATE Latin1_General_BIN, CHAR(0), '') AS VARCHAR(MAX))))

    SELECT @vcResult = REPLACE(REPLACE(@vcResult, '&#160;', ' '), '&nbsp;', ' ')

    IF @vcResult = ''
        RETURN @vcResult

    SELECT @siPos = PATINDEX('%&#[0-9][0-9][0-9];%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 6)
            ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, 3) AS INT)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
            ,@siPos = PATINDEX('%&#[0-9][0-9][0-9];%', @vcResult)
    END

    SELECT @siPos = PATINDEX('%&#[0-9][0-9][0-9][0-9];%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 7)
            ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, 4) AS INT)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
            ,@siPos = PATINDEX('%&#[0-9][0-9][0-9][0-9];%', @vcResult)
    END

    SELECT @siPos = PATINDEX('%#[0-9][0-9][0-9][0-9]%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 5)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, '')
            ,@siPos = PATINDEX('%#[0-9][0-9][0-9][0-9]%', @vcResult)
    END

    SELECT @vcResult = REPLACE(REPLACE(@vcResult, NCHAR(160), ' '), CHAR(160), ' ')

    SELECT @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult, '&amp;', '&'), '&quot;', '"'), '&lt;', '<'), '&gt;', '>'), '&amp;amp;', '&')

    RETURN @vcResult
END

GO

Illustration:

  DECLARE @S VARCHAR(MAX)='CD&amp;amp;amp;M Communications 
    auburndale oil &amp;amp;amp; propane inc  
    C F La Fountaine #7561  
    Laramie County Fire District # 2  
    AmeriGas Propane LP #2250'

    SELECT dbo.MyHTMLDecode (@s)

OUTPUT:

CD&M Communications 
auburndale oil & propane inc  
C F La Fountaine   
Laramie County Fire District # 2  
AmeriGas Propane LP 
Sathish
  • 1,806
  • 4
  • 26
  • 37
  • This script worked great except it skipped some characters. I needed to add `REPLACE(@vcResult, '"', '"')`, `REPLACE(@vcResult, '&', '&')`, `REPLACE(@vcResult, '*', '*')`, `REPLACE(@vcResult, '+', '+')` and `REPLACE(@vcResult, '/', '/')`. – Tot Zam Oct 18 '17 at 16:18
  • It is not my intent to explicitly "advertise" my own answer. But make sure to have a look at it before using the above script. This accepted answer is outdated imho. – Wouter Oct 29 '18 at 08:56
  • 1
    @Wouter: Could you clarify in what sense that accepted answer is outdated? Happy to correct. – Sathish Oct 29 '18 at 17:18
  • @Sathish: Since SQL Server supports the XML decode/encode functionality, implementing this yourself has become redundant. Also, as Tot Zam mentioned, it is not as complete a solution as the SQL Server functionality, since it does not cover all XML/HTML encoded entities. It remains a great contribution, by no means I intent to critique your work, but I no longer see an added value of using this solution, over the native SQL Server functions. – Wouter Oct 30 '18 at 09:08
8

Previous version does not works with Japan, Korean, ... Here fixed version:

GO
IF OBJECT_ID('dbo.fn_HTMLDecode') IS NOT NULL BEGIN DROP FUNCTION dbo.fn_HTMLDecode END
GO
CREATE FUNCTION dbo.fn_HTMLDecode(
    @vcWhat NVARCHAR(MAX)
    ,@toDecodeMainISOSymbols bit = 1
    ,@toDecodeISOChars bit = 1
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @vcResult NVARCHAR(MAX);
    DECLARE @siPos INT ,@vcEncoded NVARCHAR(9) ,@siChar INT;
    SET @vcResult = RTRIM(LTRIM(CAST(REPLACE(@vcWhat COLLATE Latin1_General_BIN, CHAR(0), '') AS NVARCHAR(MAX))));
    SELECT @vcResult = REPLACE(REPLACE(@vcResult, '&#160;', ' '), '&nbsp;', ' ');
    IF @vcResult = '' RETURN @vcResult;

    declare @s varchar(35);
    declare @n int; set @n = 6;
    declare @i int;

    while @n > 2
    begin
        set @s = '';
        set @i=1;
        while @i<=@n
        begin
            set @s = @s + '[0-9]';
            set @i = @i + 1;
        end
        set @s = '%&#' + @s + '%';
        SELECT @siPos = PATINDEX(@s, @vcResult);
        WHILE @siPos > 0
        BEGIN
            SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, @n+3)
                ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, @n) AS INT)
                ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
                ,@siPos = PATINDEX(@s, @vcResult);
        END
        set @n = @n - 1;
    end

    if @toDecodeMainISOSymbols=1
    begin
        select @vcResult = REPLACE(REPLACE(@vcResult, NCHAR(160), ' '), CHAR(160), ' ');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult, '&amp;', '&'), '&quot;', '"'), '&lt;', '<'), '&gt;', '>'), '&amp;amp;', '&'),'&rdquo;','”'),'&bdquo;','„'),'&ndash;','–'),'&mdash;','—');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult,'&lsquo;','‘'),'&rsquo;','’'),'&bull;','•'),'&hellip;','…'),'&permil;','‰') COLLATE Latin1_General_BIN,'&prime;','′') COLLATE Latin1_General_BIN,'&Prime;','″'),'&circ;','ˆ'),'&tilde;','˜'),'&nbsp;',' ');
    end

    if @toDecodeISOChars=1
    begin
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Scaron;','Š') COLLATE Latin1_General_BIN,'&scaron;','š') COLLATE Latin1_General_BIN,'&Ccedil;','Ç') COLLATE Latin1_General_BIN,'&ccedil;','ç');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult,'&Agrave;','À') COLLATE Latin1_General_BIN,'&agrave;','à') COLLATE Latin1_General_BIN,'&Aacute;','Á') COLLATE Latin1_General_BIN,'&aacute;','á') COLLATE Latin1_General_BIN,'&Acirc;','Â') COLLATE Latin1_General_BIN,'&acirc;','â') COLLATE Latin1_General_BIN,'&Atilde;','Ã') COLLATE Latin1_General_BIN,'&atilde;','ã') COLLATE Latin1_General_BIN,'&Auml;','Ä') COLLATE Latin1_General_BIN,'&auml;','ä') COLLATE Latin1_General_BIN,'&Aring;','Å') COLLATE Latin1_General_BIN,'&aring;','å') COLLATE Latin1_General_BIN,'&AElig;','Æ') COLLATE Latin1_General_BIN,'&aelig;','æ');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Egrave','È') COLLATE Latin1_General_BIN,'&egrave','è') COLLATE Latin1_General_BIN,'&Eacute;','É') COLLATE Latin1_General_BIN,'&eacute;','é') COLLATE Latin1_General_BIN,'&Ecirc;','Ê') COLLATE Latin1_General_BIN,'&ecirc;','ê') COLLATE Latin1_General_BIN,'&Euml;','Ë') COLLATE Latin1_General_BIN,'&euml;','ë');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Igrave;','Ì') COLLATE Latin1_General_BIN,'&igrave;','ì') COLLATE Latin1_General_BIN,'&Iacute;','Í') COLLATE Latin1_General_BIN,'&iacute;','í') COLLATE Latin1_General_BIN,'&Icirc;','Î') COLLATE Latin1_General_BIN,'&icirc;','î') COLLATE Latin1_General_BIN,'&Iuml;','Ï') COLLATE Latin1_General_BIN,'&iuml;','ï');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Ograve;','Ò') COLLATE Latin1_General_BIN,'&ograve;','ò') COLLATE Latin1_General_BIN,'&Oacute;','Ó') COLLATE Latin1_General_BIN,'&oacute;','ó') COLLATE Latin1_General_BIN,'&Ocirc;','Ô') COLLATE Latin1_General_BIN,'&ocirc;','ô') COLLATE Latin1_General_BIN,'&Otilde;','Õ') COLLATE Latin1_General_BIN,'&otilde;','õ') COLLATE Latin1_General_BIN,'&Ouml;','Ö') COLLATE Latin1_General_BIN,'&ouml;','ö') COLLATE Latin1_General_BIN,'&Oslash','Ø') COLLATE Latin1_General_BIN,'&oslash','ø');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Ugrave;','Ù') COLLATE Latin1_General_BIN,'&ugrave;','ù') COLLATE Latin1_General_BIN,'&Uacute;','Ú') COLLATE Latin1_General_BIN,'&uacute;','ú') COLLATE Latin1_General_BIN,'&Ucirc;','Û') COLLATE Latin1_General_BIN,'&ucirc;','û') COLLATE Latin1_General_BIN,'&Uuml;','Ü') COLLATE Latin1_General_BIN,'&uuml;','ü');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&ETH;','Ð') COLLATE Latin1_General_BIN,'&eth;','ð') COLLATE Latin1_General_BIN,'&Ntilde;','Ñ') COLLATE Latin1_General_BIN,'&ntilde;','ñ') COLLATE Latin1_General_BIN,'&Yacute;','Ý') COLLATE Latin1_General_BIN,'&yacute;','ý') COLLATE Latin1_General_BIN,'&THORN;','Þ') COLLATE Latin1_General_BIN,'&thorn;','þ') COLLATE Latin1_General_BIN,'&szlig;','ß');
    end
    RETURN @vcResult;
END
-- test:
-- select dbo.fn_HTMLDecode(N'A fine example of man and nature co-existing is Slovenia&#8217;s ecological tourist farms.',1,1)
-- select dbo.fn_HTMLDecode(N'm0 &#50752;&#51064;&#48516;&#50556;&#50640;&#49436; m1 &#44032;&#51109; m2 &#50689;&#54693;&#47141; m3&#51080;&#45716;m10',1,1)
QazyCat
  • 119
  • 1
  • 2
3

Using the casting method from @Wouter handles decimal and hex entities, which the accepted answer does not.

However, a handler is also needed to deal with XML special characters in the input.

XML special characters: <, >, ", ', &

create function [dbo].[XMLdecode] (@input nvarchar(max))
returns nvarchar(max)
begin

    declare @output nvarchar(max) = ''
    declare @next nchar(1)
    declare @endIdx int = 0
    declare @idx int = 0
    while @idx < len(@input)
    begin

        set @idx += 1
        set @next = substring(@input, @idx, 1)
        set @endIdx = charindex(';', @input, @idx) - @idx

        if @next = '&' and (@endIdx > 8 or @endIdx < 1) 
            set @output += '&amp;'
        else if @next = '&' and @endIdx > 1 and @endIdx < 8
        begin
            set @output += lower(substring(@input,@idx,@endIdx+1))
            set @idx += @endIdx
        end
        else
            set @output += @next
    end

    set @output = replace(@output,'<','&lt;')
    set @output = replace(@output,'>','&gt;')
    set @output = replace(@output,'"','&quot;')
    set @output = replace(@output,'''','&apos;')

    return cast(@output as XML).value('.[1]','nvarchar(max)')

end;

Example usage:

select dbo.XMLdecode('this is a t&#xc9;st &#xba;f HEX &amp; DECIMAL &#8364;ntities & <<<< non-entities too! &#9745;')
------------------------------------------------------------------------------
returns: 'this is a tÉst ºf HEX & DECIMAL €ntities & <<<< non-entities too! ☑'

It's not bulletproof, but handled all of my cases.

Something that looks like an HTML entity but which is not actually, can still cause trouble. For example: &#x2; or even &;

It also does not handle non-XML named entities like &check; or &hearts;

egerardus
  • 11,016
  • 12
  • 78
  • 119