Grab specific field values from a string using regex

Question

I have a text file, from which I have extracted these two paragraph blocks. An example of the text is given below:

Text Example:

NOMEAR ISABELLE FERREIRA ZARONI, ID FUNCIONAL Nº 5100796-7, para exercer, com validade a contar de 16 de novembro de 2020, o cargo em comissão de Assessor, símbolo DAS-7, da Sub- secretaria de Concessões e Parcerias, da Secretaria de Estado de Planejamento e Gestão, anteriormente ocupado por Vinicius dos San- tos Silva, ID Funcional n° 5108029-0. Processo nº SEI- 1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .

NOMEAR KARINE MATOS DIAS, ID FUNCIONAL Nº 5092869-4 para exercer, com validade a contar de 16 de novembro de 2020, o cargo em comissão de Assessor, símbolo DAS-7, da Secretaria de Estado de Planejamento e Gestão, anteriormente ocupado por Amauri Ferrei- ra do Carmo, ID Funcional nº 5099579-0. Processo nº SEI- 1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .

From each paragraph in the text above, I want to grab only the bold values, as an individual row.

What I have tried:

filter_data_nomear = ['NOMEAR ISABELLE FERREIRA ZARONI, ID FUNCIONAL Nº\n5100796-7, para exercer, com validade a contar de 16 de novembro\nde 2020, o cargo em comissão de Assessor, símbolo DAS-7, da Sub-\nsecretaria de Concessões e Parcerias, da Secretaria de Estado de\nPlanejamento e Gestão, anteriormente ocupado por Vinicius dos San-\ntos Silva, ID Funcional n° 5108029-0. Processo nº SEI-\n1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .', 'NOMEAR KARINE MATOS DIAS, ID FUNCIONAL Nº 5092869-4 para\nexercer, com validade a contar de 16 de novembro de 2020, o cargo\nem comissão de Assessor, símbolo DAS-7, da Secretaria de Estado\nde Planejamento e Gestão, anteriormente ocupado por Amauri Ferrei-\nra do Carmo, ID Funcional nº 5099579-0. Processo nº SEI-\n1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .', 'NOMEAR ROSIONE FERNANDES DE SÁ, ID FUNCIONAL Nº\n4413710-9, para exercer, com validade a contar de 16 de novembro\nde 2020, o cargo em comissão de Assistente II, símbolo DAI-6, da\nSecretaria de Estado de Planejamento e Gestão, anteriormente ocu-\npado por Luis Henrique Ferreira de Aquino, ID Funcional nº 1914315-\n0. Processo nº SEI-120001/014825/2020.', 'NOMEAR FRANCISCO DE ASSIS PINTO CAVALCANTE para exer-\ncer, com validade a contar de 16 de novembro de 2020, o cargo em\ncomissão de Assistente II, símbolo DAI-6, da Secretaria de Estado de\nPlanejamento e Gestão, anteriormente ocupado por Edson Carneiro\nda Silva, ID Funcional nº 570136-8. Processo nº SEI-\n120001/014825/2020.']

for i in filter_data_nomear:
    splited_ini = i.split(',')
    splited_ini = list(filter(lambda x: x != 'para exercer', splited_ini))
    # Inside the brackets, lines continue implicitly, so no backslashes are needed
    # (a backslash followed by a trailing space is a syntax error).
    splited = [x.strip()
        .replace("\n", ' ')
        .replace('anteriormente ocupado por ', '')
        .replace('para exercer', '')
        .replace('NOMEAR', '')
        .replace('o cargo em comissão de ', '')
        .replace('ID FUNCIONAL Nº ', '')
        .replace('com validade a contar de ', '')
        .replace('ID Funcional ', '')
        .replace('Processo nº SEI-', '')
        .replace('símbolo ', '')
        .strip()
        .replace("nº", '--')
        .replace('para exer- cer', '')
        .strip() for x in splited_ini]
    print(splited)

My Current Output:

['ISABELLE FERREIRA ZARONI', '5100796-7', '16 de novembro de 2020', 'Assessor', 'DAS-7', 'da Sub- secretaria de Concessões e Parcerias', 'da Secretaria de Estado de Planejamento e Gestão', 'Vinicius dos San- tos Silva', 'n° 5108029-0.  1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .']

My current output is almost OK, but chaining multiple replace() calls is a problem, and sometimes these static replacements break my code. Is there another way to achieve this, using regex matching on the bold text?

https://regex101.com/r/wn5moF/3 you can try that regex, not sure if it is good and hence leaving this as a comment, you just have to check for `,` in group 6 when you iterate through the match — python_user, Nov 27 '20 at 09:35
You might use 3 capturing groups to get all the different values `\b(?:(?:NOMEAR|d[ea]|por) ([^,]+?)(?: e Gestão)?,|((?:[A-Z]+|\d+)-\d+)|SEI- ([\d /]+) )` https://regex101.com/r/xN7CYm/1 — The fourth bird, Nov 27 '20 at 12:11

The fourth bird · Accepted Answer · 2020-11-27T13:09:21.610

To get the values in bold, you might use 3 capturing groups with an alternation:

\b(?:(?:NOMEAR|d[ea]|por) ([^,]+?)(?: e Gestão)?,|([A-Z\d]+-\d+)|SEI- ([\d /]+)\b)

In parts

\b A word boundary, so the match cannot start in the middle of a longer word
(?: Non-capturing group
- (?:NOMEAR|d[ea]|por) Match one of NOMEAR, de, da or por, followed by a space
- ([^,]+?) Capture group 1, match 1+ times any char except , (non-greedy)
- (?: e Gestão)?, Optionally match e Gestão and match a ,
- | Or
- ([A-Z\d]+-\d+) Capture group 2, match 1+ times either A-Z or a digit, then - and 1+ digits
- | Or
- SEI- ([\d /]+)\b Match SEI- and a space, capture in group 3 1+ digits, spaces or / characters, followed by a word boundary
) Close the non-capturing group
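
The breakdown above can also be written directly into the pattern with re.VERBOSE. The following is only a sketch: the names PATTERN, sample and pairs are made up for illustration, and the sample is a shortened, single-line variant of the second paragraph from the question. Because each match fills exactly one of the three groups, match.lastindex tells which one it was.

```python
import re

# Same pattern as in the answer, spelled out in verbose mode.
# Unescaped whitespace is ignored in verbose mode (outside character
# classes), so literal spaces are written as [ ].
PATTERN = re.compile(
    r"""
    \b
    (?:
        (?:NOMEAR|d[ea]|por)[ ]   # keyword and the space after it
        ([^,]+?)                  # group 1: shortest run of non-comma characters
        (?:[ ]e[ ]Gestão)?,       # optional " e Gestão", then the comma
      |
        ([A-Z\d]+-\d+)            # group 2: codes such as DAS-7 or 5092869-4
      |
        SEI-[ ]([\d /]+)\b        # group 3: digits, spaces and slashes after "SEI- "
    )
    """,
    re.VERBOSE,
)

# Hypothetical sample: a shortened, single-line version of the second paragraph.
sample = (
    "NOMEAR KARINE MATOS DIAS, ID FUNCIONAL Nº 5092869-4 para exercer, "
    "o cargo em comissão de Assessor, símbolo DAS-7. "
    "Processo nº SEI- 1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 ."
)

# Pair each value with the number of the group that captured it.
pairs = [(m.lastindex, m.group(m.lastindex)) for m in PATTERN.finditer(sample)]
print(pairs)
```

Keeping the group number next to each value makes it easier to tell a name or department (group 1) from a code (group 2) or a process number (group 3) when building rows.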


For example

import re

regex = r"\b(?:(?:NOMEAR|d[ea]|por) ([^,]+?)(?: e Gestão)?,|([A-Z\d]+-\d+)|SEI- ([\d /]+)\b)"

filter_data_nomear = ['NOMEAR ISABELLE FERREIRA ZARONI, ID FUNCIONAL Nº\n5100796-7, para exercer, com validade a contar de 16 de novembro\nde 2020, o cargo em comissão de Assessor, símbolo DAS-7, da Sub-\nsecretaria de Concessões e Parcerias, da Secretaria de Estado de\nPlanejamento e Gestão, anteriormente ocupado por Vinicius dos San-\ntos Silva, ID Funcional n° 5108029-0. Processo nº SEI-\n1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .', 'NOMEAR KARINE MATOS DIAS, ID FUNCIONAL Nº 5092869-4 para\nexercer, com validade a contar de 16 de novembro de 2020, o cargo\nem comissão de Assessor, símbolo DAS-7, da Secretaria de Estado\nde Planejamento e Gestão, anteriormente ocupado por Amauri Ferrei-\nra do Carmo, ID Funcional nº 5099579-0. Processo nº SEI-\n1 2 0 0 0 1 / 0 1 4 6 11 / 2 0 2 0 .', 'NOMEAR ROSIONE FERNANDES DE SÁ, ID FUNCIONAL Nº\n4413710-9, para exercer, com validade a contar de 16 de novembro\nde 2020, o cargo em comissão de Assistente II, símbolo DAI-6, da\nSecretaria de Estado de Planejamento e Gestão, anteriormente ocu-\npado por Luis Henrique Ferreira de Aquino, ID Funcional nº 1914315-\n0. Processo nº SEI-120001/014825/2020.', 'NOMEAR FRANCISCO DE ASSIS PINTO CAVALCANTE para exer-\ncer, com validade a contar de 16 de novembro de 2020, o cargo em\ncomissão de Assistente II, símbolo DAI-6, da Secretaria de Estado de\nPlanejamento e Gestão, anteriormente ocupado por Edson Carneiro\nda Silva, ID Funcional nº 570136-8. Processo nº SEI-\n120001/014825/2020.']

for text in filter_data_nomear:
    result = []
    for match in re.finditer(regex, text):
        # Each match fills exactly one of the three groups; keep the one that is set
        for group in match.groups():
            if group is not None:
                result.append(group)
    print(result)

Output

['ISABELLE FERREIRA ZARONI', '5100796-7', '16 de novembro\nde 2020', 'Assessor', 'DAS-7', 'Sub-\nsecretaria de Concessões e Parcerias', 'Secretaria de Estado de\nPlanejamento', 'Vinicius dos San-\ntos Silva', '5108029-0']
['KARINE MATOS DIAS', '5092869-4', '16 de novembro de 2020', 'Assessor', 'DAS-7', 'Secretaria de Estado\nde Planejamento', 'Amauri Ferrei-\nra do Carmo', '5099579-0']
['ROSIONE FERNANDES DE SÁ', '4413710-9', '16 de novembro\nde 2020', 'Assistente II', 'DAI-6', 'Estado de Planejamento', 'Luis Henrique Ferreira de Aquino', 'SEI-120001']
['FRANCISCO DE ASSIS PINTO CAVALCANTE para exer-\ncer', '16 de novembro de 2020', 'Assistente II', 'DAI-6', 'Secretaria de Estado de\nPlanejamento', 'Edson Carneiro\nda Silva', '570136-8']
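
The captured values above still contain the line breaks and line-end hyphens of the source text. A minimal cleanup sketch follows, assuming that a hyphen at the end of a line followed by a letter is a wrap hyphen and not part of a compound word; that assumption can be wrong for genuinely hyphenated words. The helper name clean is hypothetical, and row holds a few values taken from the first output line.

```python
import re


def clean(value):
    """Rejoin words wrapped with a line-end hyphen and collapse whitespace."""
    # Drop "-\n" only when a letter follows, so a wrapped word is rejoined.
    # A hyphen followed by a digit is left in place.
    joined = re.sub(r"-\n(?=[^\W\d_])", "", value)
    # Turn remaining newlines and repeated spaces into single spaces.
    return " ".join(joined.split())


row = [
    'ISABELLE FERREIRA ZARONI',
    '5100796-7',
    '16 de novembro\nde 2020',
    'Sub-\nsecretaria de Concessões e Parcerias',
    'Vinicius dos San-\ntos Silva',
]
cleaned = [clean(value) for value in row]
print(cleaned)
```

Applying the cleanup to the captured values, and not to the input text, leaves the behaviour of the regex itself unchanged.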

It is almost OK, but there is an issue with the last part; please have a look at the 3rd element of the output. I also need to get the value after SEI- as a new element, e.g. https://regex101.com/r/E2sYnI/2 — Always Sunny, Nov 27 '20 at 15:50
@AlwaysSunny Perhaps like this? https://regex101.com/r/IVDHMK/1 — The fourth bird, Nov 27 '20 at 15:54
@AlwaysSunny I see, there can be optional whitespace chars in between which you could match using `\s*` https://regex101.com/r/JHTTyT/1 — The fourth bird, Nov 27 '20 at 16:01
@AlwaysSunny You can name all 3 groups https://regex101.com/r/9YA3Gu/1 — The fourth bird, Nov 27 '20 at 16:12
Sir, I couldn't find any reason why, in the 3rd paragraph block, the regex doesn't match the full string "Secretaria de Estado de Planejamento e Gestão". See: https://regex101.com/r/JHTTyT/1. It seems the newline is causing the issue, but `[^,]` should match any character until `,`, so why doesn't it match fully? — Always Sunny, Nov 28 '20 at 13:20
@AlwaysSunny I see, the space should be a `\s`. Try it like this https://regex101.com/r/xmqGRy/1 — The fourth bird, Nov 28 '20 at 13:25
Sir, could you please have a look at this? https://stackoverflow.com/questions/65094987/grab-required-field-values-from-the-paragraph-block-using-regex-in-python — Always Sunny, Dec 01 '20 at 16:59
@AlwaysSunny For the current question, this also might be an option to get all parts in their own capturing group https://regex101.com/r/GKCJMB/1 — The fourth bird, Dec 01 '20 at 17:43
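
Two suggestions from the comments, matching whitespace with `\s` and naming the groups, can be combined as in the sketch below. This is not the pattern behind the regex101 links, which is not reproduced here. It is the answer's pattern with three changes that are assumptions of this sketch: `\s` replaces the literal spaces, the SEI alternative is tried before the code alternative so that SEI-120001 is not captured as a code, and the group names text, process and code are invented. The sample is an excerpt of the third paragraph.

```python
import re

PATTERN = re.compile(
    r"\b(?:"
    # keyword, then any whitespace (a newline included), then the value
    r"(?:NOMEAR|d[ea]|por)\s+(?P<text>[^,]+?)(?:\s+e\s+Gestão)?,"
    # process number: tried before the generic code alternative
    r"|SEI-\s*(?P<process>[\d /]+)\b"
    # codes such as DAI-6; a line break may follow the hyphen
    r"|(?P<code>[A-Z\d]+-\s*\d+)"
    r")"
)


def extract(paragraph):
    """Return (group name, value) pairs in the order they appear."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in PATTERN.finditer(paragraph)]


# Excerpt of the third paragraph, where "da" is followed by a newline.
sample = (
    "símbolo DAI-6, da\nSecretaria de Estado de Planejamento e Gestão, "
    "anteriormente ocu-\npado por Luis Henrique Ferreira de Aquino, "
    "ID Funcional nº 1914315-\n0. Processo nº SEI-120001/014825/2020."
)
fields = extract(sample)
print(fields)
```

In the original pattern, the space after NOMEAR, de, da or por is a literal space, so "da" followed by a newline does not start a match and the next "de " is picked up instead. That is why only part of the department name was captured in the third paragraph.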
