I built an extension that converts HTML-formatted text into something better suited to a list view. It removes all HTML tags, except that it replaces closing heading (</h1>, </h2>, ...) and paragraph (</p>) tags with <br /> to keep the list view readable. It also shortens the text of longer posts. I render the result in my Razor view with Html.Raw(model.text).

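    public static string FixHTML(string input, int? strLen)
    {
        string s = input.Trim();
        // Turn closing paragraph and heading tags into line breaks.
        s = Regex.Replace(s, "</p.*?>", "<br />");
        s = Regex.Replace(s, "</h.*?>", "<br />");
        // Protect the line breaks with a placeholder while all other tags are stripped.
        s = s.Replace("<br />", "*ret$990^&");
        s = Regex.Replace(s, "<.*?>", String.Empty);
        s = Regex.Replace(s, "</.*", String.Empty);
        s = s.Replace("*ret$990^&", "<br />");
        // Truncate to strLen characters; this is the step that can cut a <br /> in half.
        int i = (strLen ?? s.Length);
        s = s.Substring(0, (i > s.Length ? s.Length : i));
        return s;
    }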

PROBLEM: if the text is cut off in the middle of a <br />, the displayed text gets messed up. For example, if it is cut off at blah blah blah <br, the display isn't nice. How can I use a regex (or even a string replace) to find only the last occurrence of <b... and only if it has no closing >?

I was thinking of something like:

s = s.Substring(0, s.Length - 6) + Regex.Replace(s.Substring(s.Length - 6), "<.*", string.Empty);

That will probably work, but my whole converter seems to use a lot of code for something that should be relatively simple.

How can I do this?

NetMage
dave317
  • Using regex to parse HTML is not recommended. – Jan 18 '18 at 20:29
  • Is there anything that IS recommended to "clean" HTML? What I am doing above works, but I agree it's not pretty. – dave317 Jan 18 '18 at 20:40
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Lews Therin Jan 18 '18 at 21:26
  • I would suggest a library such as [HtmlAgilityPack](https://www.nuget.org/packages/HtmlAgilityPack) to parse through and change your HTML – Mike Kuenzi Jan 18 '18 at 22:21

1 Answer

Try this:

s = Regex.Replace(s, "(<|<b|<br|<br |<br /|<br/)$", "");
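
A slightly more general variant, offered as a sketch rather than a definitive fix: since <br /> is the only markup left by the time the text is shortened, any < near the end that has no closing > can be treated as a cut-off tag and dropped. The class and method names below (HtmlPreview, TrimPartialTag, Truncate) are hypothetical and not part of the original code. It also assumes the text never legitimately ends with a literal <, because that would be removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; not part of the original post.
public static class HtmlPreview
{
    // Removes a trailing tag fragment such as "<", "<br" or "<br /".
    // "<[^>]*$" matches a '<' followed only by non-'>' characters up to the
    // end of the string, i.e. a tag that was opened but never closed.
    public static string TrimPartialTag(string s)
    {
        return Regex.Replace(s, "<[^>]*$", string.Empty);
    }

    // Shortens s to at most maxLength characters (null means no limit),
    // then drops any tag fragment left behind by the cut.
    public static string Truncate(string s, int? maxLength)
    {
        int length = Math.Max(0, Math.Min(maxLength ?? s.Length, s.Length));
        return TrimPartialTag(s.Substring(0, length));
    }
}
```

With this sketch, HtmlPreview.Truncate("one<br />two", 6) cuts the text to "one<br" and then returns "one", while a limit of 9 leaves the complete "one<br />" untouched.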
SBFrancies