I don't know much about regular expressions. I need to extract a single word enclosed by left and right parentheses in a string.

For instance, in the string

(ROOT (S (NP (NNP Washington) (NNP (CNN)) (NNP Donald) (NNP Trump)) (VP (VBD was) (VP (VBN asked) (PP ( by) (NP (NP (DT a) (NN member)) (PP (IN of) (NP (DT a) (NNP Fox) (NNP News) (NN town) (NN hall) (NN audience))))) (NP-TMP (DT this) (NN week)) (SBAR (WHNP (WP what)) (S (NP (PRP he)) (VP (MD would) (VP (VB do) (S (VP (TO to) (VP (VB reduce) (NP (JJ violent) (NN crime)) (PP (IN in) (NP (DT the) (NNS country's) (JJ inner) (NN cities.)))))))))))))) (NN debate,)) (ADJP (JJ due) (S (VP (TO to) (VP (VB be) (VP (VBN broadcast) (S (NP (JJ live)) (ADJP (RBS most) (RB everywhere.))))))))))))))))

I need to get the substrings (CNN) and ( by).

EDIT

    import re

    def fixeup_tree_string(T):
        # Collect (start, end, 'Y') for every parenthesised bare word.
        change_index = []
        Match = [(m.start(0), m.end(0), 'Y') for m in re.finditer(r"\(\s*[\w|#|~|!|@|#|$|%|^|&|*|<|>|.|,|;|:|`|'''|_|-|+|/]+\s*\)", T)]
        if len(Match) == 0:
            return T

        # Interleave the matched spans ('Y') with the untouched spans ('N').
        if Match[0][0] != 0:
            change_index.append((0, Match[0][0], 'N'))
        for i in range(len(Match) - 1):
            change_index.append(Match[i])
            change_index.append((Match[i][1], Match[i + 1][0], 'N'))
        change_index.append(Match[-1])
        if Match[-1][1] < len(T):
            change_index.append((Match[-1][1], len(T), 'N'))

        # Rebuild the string, rewriting each matched span as "(NN word)".
        new_T = []
        for r in change_index:
            if r[2] == 'N':
                new_T.append(T[r[0]:r[1]])
            else:
                word = T[r[0]:r[1]].replace(' ', '')
                word = word.split(')')[0].split('(')[-1]
                new_T.append('(NN ' + word + ')')
        return ''.join(new_T)

1 Answer

You could try:

y = re.findall(r"\(\s*[a-zA-Z]+\s*\)", x)

Where x is the input.

Note - You will need the statement import re at the top of your program to use the regex functions.

Explanation -

re.findall is a function that returns a list of all non-overlapping matches of the regex in the string. Its arguments are, in order: the regex, then the string.

r"\(\s*[a-zA-Z]+\s*\)" is a raw string, as indicated by the r prefix, so the backslashes reach the regex engine unchanged.
\( matches a literal opening parenthesis '('
\s* means that the opening parenthesis may be followed by zero or more whitespace characters
[a-zA-Z]+ matches one or more letters inside the parentheses
\s* means that the word may be followed by zero or more whitespace characters, and \) then matches the closing ')'
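
As a rough check of the breakdown above, here is the same pattern written with re.VERBOSE so each piece carries its own comment. The name BARE_WORD and the shortened sample string are only illustrative; they are not part of the original question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# BARE_WORD is a hypothetical name chosen for this sketch.
BARE_WORD = re.compile(r"""
    \(          # a literal opening parenthesis
    \s*         # optional whitespace after it
    [a-zA-Z]+   # one or more ASCII letters (the word itself)
    \s*         # optional whitespace before the close
    \)          # a literal closing parenthesis
""", re.VERBOSE)

# A shortened, made-up fragment in the style of the tree from the question.
sample = "(NP (NNP Washington) (NNP (CNN)) (PP ( by) (NP (DT a))))"
matches = BARE_WORD.findall(sample)
print(matches)
```

Nodes such as (NNP Washington) are skipped because a second word follows the first one before the closing parenthesis. Note that [a-zA-Z] only covers ASCII letters, so a token containing an apostrophe or a full stop would need a wider character class.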

Edit 1 -

To answer your follow-up question in the comments, you could try this:

z = []

for i in y:
    # Drop the surrounding parentheses and any padding, then add the tag.
    tagged = "(NN " + i[1:-1].strip() + ")"
    z.append(tagged)

print(z)
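
If the goal is to change the original string rather than build a separate list, one possible approach is re.sub with a capture group around the word. This is only a sketch: the function name add_nn_tag and the short sample string are assumptions made for illustration, and the pattern still accepts ASCII letters only.

```python
import re

def add_nn_tag(tree):
    """Rewrite every '(word)' or '( word )' node as '(NN word)'.

    Hypothetical helper for illustration; the group ([a-zA-Z]+) captures
    the word so the replacement can refer to it as \\1.
    """
    return re.sub(r"\(\s*([a-zA-Z]+)\s*\)", r"(NN \1)", tree)

# A shortened, made-up fragment in the style of the tree from the question.
sample = "(NNP (CNN)) (PP ( by) (NP (DT a)))"
result = add_nn_tag(sample)
print(result)
```

Because the substitution happens in a single pass over the string, there is no need to track the start and end indices of each match by hand.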
  • Thank you so much @coffeehouse-coder. Then how can I replace that in the string with something else? More specifically, I want to add the word 'NN' before the word found in the parentheses. For instance, "(CNN)" should be replaced with "(NN CNN)". – Hamid Apr 02 '18 at 19:41
  • @Hamid Do you want to add `NN` before **all the matched** elements eg: (NN CNN) and (NN by)? – Robo Mop Apr 03 '18 at 04:08
  • I apologize for being so late :( – Robo Mop Apr 03 '18 at 04:08
  • Yes I do want that. I did it yesterday. May not be very efficient though – Hamid Apr 03 '18 at 17:42
  • I added the code to the post above. T is the string like the one I posted above. I find everything that matches using your solution, then replace each match by adding NN before it. – Hamid Apr 03 '18 at 17:52
  • @Hamid I can give you my answer to that problem, or is your question fully answered? – Robo Mop Apr 03 '18 at 18:09
  • @Hamid If your question is answered, please consider marking that answer as helpful by clicking on the green tick-mark, for any future reference. – Robo Mop Apr 03 '18 at 18:15
  • Yes that's perfect. I voted up your answer. I don't know how to mark it as helpful :( – Hamid Apr 03 '18 at 19:21
  • @Hamid Does it not allow you to click on the green tick-mark? – Robo Mop Apr 04 '18 at 02:11