118

This code almost does what I need it to:

for line in all_lines:
    s = line.split('>')

Except it removes all the '>' delimiters.

So,

<html><head>

Turns into

['<html','<head']

Is there a way to use the split() method but keep the delimiter, instead of removing it?

I want these results:

['<html>','<head>']
some1
  • 21
    This doesn't really answer your question, but if you're trying to parse HTML in Python, I highly recommend [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). – Michael Mior Oct 23 '11 at 12:33
  • 2
    See also [In Python, how do I split a string and keep the separators?](http://stackoverflow.com/questions/2136556/in-python-how-do-i-split-a-string-and-keep-the-separators). – outis Oct 23 '11 at 12:44
  • 9
    This question should be reopened. The duplicate one is regex-specific. – orestisf Apr 26 '20 at 05:06
  • 2
    @orestisf Also, the "duplicate" one answers a different problem. `['', '', '']` is different from `['', '']`. I know it's been a few months but I just voted to reopen. If you do too, someone else may take it over the finish line? – user1717828 Oct 26 '20 at 00:37
  • 1
    re.split(r"(?<=>(?!$))", '') directly gives the answer. This way it can be handled by playing with regex look-arounds – dhgoratela Dec 31 '20 at 07:46

4 Answers

70
d = ">"
for line in all_lines:
    # Split on the delimiter, drop empty pieces, and re-append the delimiter to each piece.
    s = [e + d for e in line.split(d) if e]
Wes Modes
P.Melch
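A quick check of the comprehension above on a few inputs may help when deciding whether it fits. This is only a sketch: `split_keep_naive` is a hypothetical wrapper name, not something from the answer. It suggests the approach assumes every piece is terminated by the delimiter and that empty pieces can be discarded.

```python
def split_keep_naive(line, d=">"):
    # Hypothetical wrapper around the comprehension shown above.
    return [e + d for e in line.split(d) if e]

# Every piece ends with the delimiter: the result matches the question.
print(split_keep_naive("<html><head>"))      # ['<html>', '<head>']

# Trailing text without a delimiter still gets one appended.
print(split_keep_naive("<html><head>text"))  # ['<html>', '<head>', 'text>']

# Consecutive delimiters collapse, because empty pieces are filtered out.
print(split_keep_naive("<a>>"))              # ['<a>']
```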
37

If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?

Anyway, the following works for me:

>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']
gb.
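To see why the `[1::2]` slice is there: when the pattern passed to `re.split` contains a capturing group, the matched separators are returned in the list as well, alternating with the text between them. A small illustration (the variable names are mine):

```python
import re

pattern = r'(<[^>]*>)'

# Text between separators lands at even indices, captured separators at odd ones.
pieces = re.split(pattern, '<body><table><tr><td>')
print(pieces)        # ['', '<body>', '', '<table>', '', '<tr>', '', '<td>', '']
print(pieces[1::2])  # ['<body>', '<table>', '<tr>', '<td>']

# With text around a tag, the even indices are no longer empty.
print(re.split(pattern, 'a<b>c'))  # ['a', '<b>', 'c']
```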
  • If you are not sure whether the string in question will end with the delimiter, it looks like you can do: `re.split("(.*\n?)", "my\nstr\ning")[1::2]` – Seth Robertson Oct 11 '18 at 17:34
  • If you want to parse HTML, go to https://automatetheboringstuff.com/2e/chapter12/ and read that chapter. It has everything you need to know about parsing HTML and web scraping. If this link ever breaks, look into using the requests, beautifulsoup, and selenium libraries. – zicameau Feb 11 '22 at 23:39
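For the `<a title='growth > 8%' href='#something'>` case raised in the answer, here is a rough sketch of what a real parser buys you. It uses the standard library's `html.parser` rather than any of the libraries named in this thread, and `StartTagCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records the raw text of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source text of the most recently opened start tag.
        self.tags.append(self.get_starttag_text())


markup = "<a title='growth > 8%' href='#something'>"

# A plain split cuts the tag in two at the '>' inside the attribute value.
print(markup.split(">"))

# The parser treats the quoted '>' as part of the attribute and keeps the tag whole.
collector = StartTagCollector()
collector.feed(markup)
collector.close()
print(collector.tags)
```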
21

How about this:

import re
s = '<html><head>'
re.findall('[^>]+>', s)
Óscar López
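For reference, this is what the `findall` call returns on the question's input. The second call shows a caveat worth knowing: the pattern only matches runs that end in `>`, so trailing text with no closing `>` is dropped silently.

```python
import re

pattern = re.compile(r'[^>]+>')

# The question's input: each match is a run of non-'>' characters plus the '>' that ends it.
print(pattern.findall('<html><head>'))      # ['<html>', '<head>']

# Trailing text without a '>' does not match and is left out of the result.
print(pattern.findall('<html><head>text'))  # ['<html>', '<head>']
```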
1

Just split it, then for each element in the array/list (apart from the last one) add a trailing ">" to it.

orangething
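One way to write that down, as a minimal sketch (`split_keep` is a hypothetical name): split, put the delimiter back on every element except the last, and keep the last element only if it is not empty.

```python
def split_keep(line, d=">"):
    parts = line.split(d)
    # Every element except the last was followed by a delimiter in the original string.
    kept = [p + d for p in parts[:-1]]
    # The last element is whatever came after the final delimiter; skip it if empty.
    if parts[-1]:
        kept.append(parts[-1])
    return kept


print(split_keep("<html><head>"))      # ['<html>', '<head>']
print(split_keep("<html><head>text"))  # ['<html>', '<head>', 'text']
print(split_keep("<a>>"))              # ['<a>', '>']
```

Because empty elements in the middle are kept, a doubled delimiter survives as its own `'>'` entry, and joining the result with `"".join(...)` gives back the original string.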
  • 1
    What about the case of ">>"? It would just become ">". – paulm Mar 21 '16 at 09:06
  • @paulm no, because splitting two `>`s like in `">body".split('>')` creates an empty element in the middle [...] to result in just a single `>` after processing, in which case you could first remove those empty strings. – yyny Sep 28 '18 at 08:25