PowerShell: extract text between two strings

Question

I know this question has been asked before but I can't get any of the answers I have looked at to work. I have a JSON file which has thousands of lines and want to simply extract the text between two strings every time they appear (which is a lot).

As a simple example my JSON would look like this:

    "customfield_11300": null,
    "customfield_11301": [
      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,
    "customfield_11300": null,
    "customfield_11301": [
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,

So I want to output everything between "customfield_11301" and "customfield_10730":

      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],

I'm trying to keep it as simple as possible - so don't care about brackets being displayed in the output.

This is what I have (which outputs way more than what I want):

$importPath = "todays_changes.txt"
$pattern = "customfield_11301(.*)customfield_10730"

$string = Get-Content $importPath
$result = [regex]::match($string, $pattern).Groups[1].Value
$result

why don't you decode the JSON into an object and address the properties directly? — Gerald Schneider, Apr 20 '16 at 14:05
The quick answer is - change your greedy capture `(.*)` to non greedy - `(.*?)`. That should do it. — SamWhan, Apr 20 '16 at 14:13
Glad it helped. Feel free to mark my answer below as *accepted*. — SamWhan, Apr 20 '16 at 14:27
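The first comment's suggestion of decoding the JSON could look roughly like the sketch below. It assumes the file is a complete, valid JSON document shaped as an array of issues. That layout is a guess, since the question only shows a fragment, so the `$json` sample here is hypothetical.

```powershell
# Hypothetical, complete JSON document: an array of issues, each carrying
# the custom fields from the question. The real file layout may differ.
$json = @'
[
  { "customfield_11301": [ { "self": "xxxxxxxx", "value": "xxxxxxxxx", "id": "10467" } ],
    "customfield_10730": null },
  { "customfield_11301": [ { "self": "zzzzzzzzzzzzz", "value": "zzzzzzzzzzz", "id": "10467" } ],
    "customfield_10730": null }
]
'@

# Parse the text into objects, then read the property directly
# instead of matching on the raw text.
$issues = $json | ConvertFrom-Json
$values = @($issues | ForEach-Object { $_.customfield_11301 } | ForEach-Object { $_.value })

$values   # xxxxxxxxx, then zzzzzzzzzzz
```

With a real file, `Get-Content -Raw $importPath | ConvertFrom-Json` would replace the here-string, provided the file parses as JSON.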

score 10 · Accepted Answer · answered Apr 20 '16 at 14:26

The quick answer is - change your greedy capture (.*) to non greedy - (.*?). That should do it.

customfield_11301(.*?)customfield_10730

Otherwise the capture will eat as much as it can, resulting in it continuing 'til the last customfield_10730.

Regards

SamWhan

With this approach, if the same pattern appears multiple times on a single line, it only returns the first occurrence. Any idea how to apply this to multiple occurrences on the same line? – jcromanu Nov 08 '18 at 16:48
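For the multiple-occurrence case raised in the comment above, one possible approach is `[regex]::Matches`, which returns every non-overlapping match rather than only the first. The sketch below runs on an in-memory sample. The `$text` contents are illustrative, and the `(?s)` inline option is added on the assumption that the delimiters may sit on different lines.

```powershell
# Sample text shaped like the question's file (contents are illustrative).
$text = @'
"customfield_11301": [ { "value": "xxx" } ],
"customfield_10730": null,
"customfield_11301": [ { "value": "zzz" } ],
"customfield_10730": null,
'@

# (?s) lets . match newlines; .*? stops at the nearest closing delimiter.
$pattern = '(?s)customfield_11301(.*?)customfield_10730'

# Matches returns every non-overlapping match, not only the first one.
$found = @([regex]::Matches($text, $pattern) | ForEach-Object { $_.Groups[1].Value })

$found.Count   # 2
$found
```

The same call handles several occurrences on a single line, since matching resumes right after the end of the previous match.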

score 9 · Answer 2 · edited Oct 28 '17 at 13:44

Here is a PowerShell function which will find a string between two strings.

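function GetStringBetweenTwoStrings($firstString, $secondString, $importPath){

    # Read the whole file as one string so a match can span several lines
    $file = Get-Content -Raw $importPath

    # Escape both delimiters so they match literally; (?s) lets . match newlines
    $pattern = "(?s)" + [regex]::Escape($firstString) + "(.*?)" + [regex]::Escape($secondString)

    # Find the first match and take the text captured between the delimiters
    $result = [regex]::Match($file, $pattern).Groups[1].Value

    # Return result
    return $result

}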

You can then run the function like this:

GetStringBetweenTwoStrings -firstString "Lorem" -secondString "is" -importPath "C:\Temp\test.txt"

My test.txt file has the following text within it:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

So my result:

Ipsum

score 4 · Answer 3 · answered Apr 20 '16 at 14:22

You need to make your RegEx Lazy:

customfield_11301(.*?)customfield_10730

Live Demo on Regex101

Your regex was greedy. This means it will find customfield_11301 and then carry on until it finds the very last customfield_10730.

Here is a simpler example of Greedy vs Lazy Regex:

# Regex (Greedy): \[(.*)\]
# Input:          [foo]and[bar]
# Output:         foo]and[bar

# Regex (Lazy):   \[(.*?)\]
# Input:          [foo]and[bar]
# Output:         "foo" and "bar" separately

Your regex was very similar to the first one: it captured too much. The new one captures the least amount of data possible, and will therefore work as you intended.
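The greedy and lazy behaviour described above can be checked directly in PowerShell. This is a minimal sketch. The variable names are made up for illustration, and the square brackets are escaped with a backslash so they match literally.

```powershell
$inputText = '[foo]and[bar]'

# Greedy: .* runs on to the last ] in the input.
$greedy = [regex]::Match($inputText, '\[(.*)\]').Groups[1].Value

# Lazy: .*? stops at the first ] after each [.
$lazy = @([regex]::Matches($inputText, '\[(.*?)\]') | ForEach-Object { $_.Groups[1].Value })

$greedy          # foo]and[bar
$lazy -join ','  # foo,bar
```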

Thank you kindly for your help, @ClasG answered a few minutes before you so I'll accept his as the answer. But thank you especially for the regex101 demo link, that really helped me understand what was happening. — adjuzy, Apr 20 '16 at 14:41

score 2 · Answer 4 · answered Aug 06 '20 at 08:51

The first issue is that Get-Content emits the file line by line rather than as a single string. You can pipe Get-Content to Out-String to get the entire content as one string and run the regex on that.

A working solution for your problem is:

Get-Content .\todays_changes.txt | Out-String | % {[Regex]::Matches($_, "(?<=customfield_11301)((.|\n)*?)(?=customfield_10730)")} | % {$_.Value}

And the output will be:

": [
  {
    "self": "xxxxxxxx",
    "value": "xxxxxxxxx",
    "id": "10467"
  }
],
"

": [
  {
    "self": "zzzzzzzzzzzzz",
    "value": "zzzzzzzzzzz",
    "id": "10467"
  }
],
"
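As a possible alternative to piping through Out-String, `Get-Content -Raw` (available from PowerShell 3.0) returns the file as one string. The sketch below writes a small temporary file to show the difference. The file name and its contents are made up for the demonstration, and `(?s)` stands in for the `(.|\n)` alternation.

```powershell
# Write a small two-line sample to a temporary file (name is illustrative).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'between_demo.txt'
Set-Content -Path $path -Value @('"customfield_11301": [', '"x" ], "customfield_10730": null')

$lines = Get-Content $path        # array with one string per line
$raw   = Get-Content $path -Raw   # one string, line breaks preserved

# (?s) makes . match newlines, so the two lookarounds can sit on different lines.
$between = [regex]::Matches($raw, '(?s)(?<=customfield_11301).*?(?=customfield_10730)') |
    ForEach-Object { $_.Value }

$lines.Count          # 2
$raw -is [string]     # True
$between

# Clean up the temporary file.
Remove-Item $path
```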
