How can I get only the text from CKEditor's RichTextField in Python?

Question

I have a CKEditor field and I need its plain text on the server side. I tried stripping the tags with re.sub, but that fails when the content contains tables, images, or links. How can I exclude them?

I ran into problems when several \xa0 characters appear consecutively, and with links such as

<a href="#"> abc</a>

where I don't want the inner text (abc). Also, \xa0 is only one such character: when other input similar to \xa0 arrives, this code will not eliminate it.

Example:

I have this string:

'<p>wtw tebrs g&nbsp; <u>resgs e<s>re</s> sger</u>g erg<a href="http://resgser g"> rgresg</a> rgergre<a id="grerg" name="grerg"></a>g er gerge ge rge<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="quality" value="high" /><param name="movie" value="e" /><embed pluginspage="http://www.macromedia.com/go/getflashplayer" quality="high" src="e" type="application/x-shockwave-flash"></embed></object> gere</p>\r\n\r\n<hr />\r\n<p>greger gerg ger ge ges s5ulp&uuml;ğna&nbsp; fg</p>\r\n\r\n<table border="1" cellpadding="1" cellspacing="1" style="width:500px">\r\n\t<tbody>\r\n\t\t<tr>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t</tr>\r\n\t\t<tr>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t</tr>\r\n\t\t<tr>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t\t<td>&nbsp;</td>\r\n\t\t</tr>\r\n\t</tbody>\r\n</table>\r\n\r\n<p>&nbsp;rgesr es ges esg es&uuml;er g sg&uuml; serg&uuml;ser &uuml;ges greklsm r&ouml;e&ccedil; şjgslejr gesi gs relkgjesgrjs pt ıqqoqotjg ler<img alt="" src="terger" /></p>'

I want to get the string output as:

'wtw tebrs g resgs e re sger g erg rgresg rgergre g er gerge ge rge gere greger gerg ger ge ges s5ulpüğna fg rgesr es ges esg esüer g sgü sergüser üges greklsm röeç şjgslejr gesi gs relkgjesgrjs pt ıqqoqotjg ler'

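My best solution so far:

import re
# html.unescape stands in for unescape_entities, which was not defined here
# (assumed to be Django's old django.utils.text.unescape_entities helper).
from html import unescape

def preprocess_text(text):
    text = unescape(text)                       # "&nbsp;" -> "\xa0", "&uuml;" -> "ü"
    text = re.sub(r"<.*?>", " ", text)          # crude tag stripping; element contents are kept
    text = re.sub(r"\n|\r|\t|\xa0", " ", text)  # only these four whitespace characters
    text = re.sub(r" +", " ", text)             # collapse runs of spaces
    text = text.strip()
    return text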
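On the consecutive \xa0 problem specifically, one possible simplification is sketched below. It relies on the fact that, for str patterns in Python 3, \s is Unicode-aware and matches \xa0 along with other Unicode spaces. The function name normalize_whitespace is made up for illustration.

```python
import re
from html import unescape


def normalize_whitespace(text):
    """Decode entities, then collapse every run of whitespace to one space."""
    text = unescape(text)  # "&nbsp;" becomes "\xa0"
    # For str patterns, \s matches \xa0, \u2003 and other Unicode whitespace,
    # so no character has to be listed by hand.
    return re.sub(r"\s+", " ", text).strip()


print(repr(normalize_whitespace("a&nbsp;&nbsp;\xa0 b\r\n\t\u2003c")))  # 'a b c'
```

This only covers whitespace; it does nothing about tags or the contents of links and objects.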

I doubt there is a 100% reliable way to do it without a full browser engine. However, check this question for some suggestions https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python — Alexandr Tatarinov, Aug 12 '21 at 17:10
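Short of a browser engine, a parser-based approach is one option. Below is a minimal sketch using only the standard library's html.parser; it is not a complete solution. The names TextExtractor and html_to_text and the default skip list are assumptions made for illustration. Text inside the skipped elements is dropped, every tag boundary is treated as a separator, and whitespace (including \xa0) is collapsed at the end.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring everything inside the `skip` elements."""

    def __init__(self, skip):
        super().__init__(convert_charrefs=True)  # entities reach handle_data decoded
        self.skip = set(skip)
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.skip_depth += 1
        self.parts.append(" ")  # every tag boundary becomes a separator

    def handle_endtag(self, tag):
        if tag in self.skip and self.skip_depth > 0:
            self.skip_depth -= 1
        self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html, skip=("script", "style", "object")):
    parser = TextExtractor(skip)
    parser.feed(html)
    parser.close()  # flush any text buffered at the end of the input
    # \s also matches \xa0, so runs of &nbsp; collapse with ordinary whitespace
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


print(html_to_text('<p>a&nbsp;&nbsp;<u>b</u><a href="#"> abc</a></p>'))              # a b abc
print(html_to_text('<p>a&nbsp;&nbsp;<u>b</u><a href="#"> abc</a></p>', skip={"a"}))  # a b
```

Passing skip={"a"} drops link text as well. The skip set should contain only paired elements: an unclosed tag in it would swallow the rest of the input.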
