I use CKEditor and need the plain text of its content on the server side. I tried re.sub to strip the tags, but it fails when the content contains tables, images, or links. How can I avoid these failures?
I run into problems when several \xa0 characters appear consecutively, and with links such as
<a href="#"> abc</a>
where I don't want the inner text (abc). \xa0 is also just one such character: if the input contains some other character like it, this code will not remove it.
Example:
I have this string:
'<p>wtw tebrs g <u>resgs e<s>re</s> sger</u>g erg<a href="http://resgser g"> rgresg</a> rgergre<a id="grerg" name="grerg"></a>g er gerge ge rge<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="quality" value="high" /><param name="movie" value="e" /><embed pluginspage="http://www.macromedia.com/go/getflashplayer" quality="high" src="e" type="application/x-shockwave-flash"></embed></object> gere</p>\r\n\r\n<hr />\r\n<p>greger gerg ger ge ges s5ulpüğna fg</p>\r\n\r\n<table border="1" cellpadding="1" cellspacing="1" style="width:500px">\r\n\t<tbody>\r\n\t\t<tr>\r\n\t\t\t<td> </td>\r\n\t\t\t<td> </td>\r\n\t\t</tr>\r\n\t\t<tr>\r\n\t\t\t<td> </td>\r\n\t\t\t<td> </td>\r\n\t\t</tr>\r\n\t\t<tr>\r\n\t\t\t<td> </td>\r\n\t\t\t<td> </td>\r\n\t\t</tr>\r\n\t</tbody>\r\n</table>\r\n\r\n<p> rgesr es ges esg esüer g sgü sergüser üges greklsm röeç şjgslejr gesi gs relkgjesgrjs pt ıqqoqotjg ler<img alt="" src="terger" /></p>'
I want to get the string output as:
'wtw tebrs g resgs e re sger g erg rgresg rgergre g er gerge ge rge gere greger gerg ger ge ges s5ulpüğna fg rgesr es ges esg esüer g sgü sergüser üges greklsm röeç şjgslejr gesi gs relkgjesgrjs pt ıqqoqotjg ler'
Here is my best solution so far:
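import re

def preprocess_text(text):
    # unescape_entities (not shown) decodes HTML entities
    text = unescape_entities(text)
    text = re.sub("<.*?>", " ", text)  # replace each tag with a space
    text = re.sub("\n|\r|\t|\xa0", " ", text)  # newlines, tabs, no-break spaces
    text = re.sub(" +", " ", text)  # collapse runs of spaces
    text = text.strip()
    return text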
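
For comparison, a parser-based approach might look like the following. This is only a minimal sketch, assuming Python 3 and nothing beyond the standard library's html.parser. The names TextExtractor, html_to_text, and skip_tags are made up for illustration and are not part of CKEditor or any library. Every tag is treated as a word separator, text inside the tags listed in skip_tags is dropped, and all whitespace (including \xa0) is collapsed with str.split().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring text inside the tags in skip_tags."""

    def __init__(self, skip_tags=()):
        # convert_charrefs=True decodes entities such as &nbsp; and &amp;
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)  # hypothetical option, e.g. {"a"}
        self.skip_depth = 0              # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        self.chunks.append(" ")  # a tag boundary separates words

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1
        self.chunks.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html, skip_tags=()):
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # str.split() with no arguments splits on any Unicode whitespace,
    # including \xa0, \r, \n and \t, so consecutive runs collapse
    return " ".join("".join(parser.chunks).split())
```

Calling html_to_text(html) keeps link text, while html_to_text(html, skip_tags=("a",)) drops it. Self-closing tags such as <hr /> and <img ... /> are handled because the parser reports them as a start tag followed by an end tag. Passing an unclosed void tag such as a bare <br> in skip_tags would suppress all following text, so the sketch assumes skip_tags only names tags that are closed.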