6

I'm finding myself somewhat stumped on a simple problem. I'm trying to remove fancy quoting from a bunch of text files. I have the following script, where I'm trying a number of different replacement methods, but without results.

Here's an example that downloads the data from github and attempts to convert.

$srcUrl="https://raw.github.com/gist/1129778/d4d899088ce7da19c12d822a711ab24e457c023f/gistfile1.txt"
$wc = New-Object net.WebClient
$wc.DownloadFile($srcUrl,"foo.txt")
$fancySingleQuotes = "[" + [string]::Join("",[char[]](0x2019, 0x2018)) + "]"

$c = Get-Content "foo.txt"
$c | % { `
        $_ = $_.Replace("’","'")
        $_ = $_.Replace("`“","`"")
        $_.Replace("`”","`"")       
    } `
    |  Set-Content "foo2.txt"

What's the trick for this to work?

Scott Weinstein

4 Answers

6

UPDATE: Fixed my answer (manojlds's comments were correct; the $_ thing was a red herring). Here's a version that works, and I've updated it to incorporate your testing code:

    $srcUrl="https://raw.github.com/gist/1129778/d4d899088ce7da19c12d822a711ab24e457c023f/gistfile1.txt"
    $wc = New-Object net.WebClient
    $wc.DownloadFile($srcUrl,"C:\Users\hartez\SO6968270\foo.txt")

    $fancySingleQuotes = "[\u2019\u2018]" 
    $fancyDoubleQuotes = "[\u201C\u201D]" 

    $c = Get-Content "foo.txt" -Encoding UTF8

    $c | % { `
        $_ = [regex]::Replace($_, $fancySingleQuotes, "'")   
        [regex]::Replace($_, $fancyDoubleQuotes, '"')     
    } `
    |  Set-Content "foo2.txt"

The reason that manojlds's version wasn't working for you is that the file you're getting from GitHub is UTF-8 encoded, and Get-Content wasn't reading it as UTF-8 by default, so the curly quotes never arrived as the Unicode characters the replacement looks for. Reading it in as UTF-8 fixes the problem.
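To see the mismatch concretely, here is a small sketch (not part of the original script) that encodes U+2019 as UTF-8 and then decodes the same bytes two ways. It assumes code page 1252 stands in for the system's default ANSI code page, which is common on Western-locale Windows but not universal, and that this code page is available to .NET in your PowerShell version.

```powershell
# Sketch: why a regex for U+2019 misses when UTF-8 bytes are decoded as ANSI.
# Assumption: 1252 stands in for the system's default ANSI code page.
$utf8 = [System.Text.Encoding]::UTF8
$ansi = [System.Text.Encoding]::GetEncoding(1252)

$quote = [string][char]0x2019        # right single curly quote
$bytes = $utf8.GetBytes($quote)      # three bytes: E2 80 99

$asUtf8 = $utf8.GetString($bytes)    # decoded correctly: one character
$asAnsi = $ansi.GetString($bytes)    # mis-decoded: three unrelated characters

$pattern = '[\u2018\u2019]'
$matchesUtf8 = $asUtf8 -match $pattern   # the pattern finds the quote
$matchesAnsi = $asAnsi -match $pattern   # nothing to find in the mis-decoded text
```

The second decode is roughly what happens when the file is read without `-Encoding UTF8`: the text no longer contains the characters the pattern names, so nothing is replaced.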

E.Z. Hart
  • the last `$_.Replace("`“","'")` does push the output line – Scott Weinstein Aug 06 '11 at 17:26
  • Your answer doesn't add anything imo. Why do you need to replace and assign it to $_ and then return $_? Just `[regex]::Replace($_,$fancySingleQuotes, "'")` already returns to the pipeline. And the OP is already doing it. – manojlds Aug 06 '11 at 18:22
  • Starting with PowerShell version 2, you can now use the `-replace` operator instead of `[regex]::Replace()`: `$line = $line -replace '[\u2019\u2018]', "'"` followed by `$line = $line -replace '[\u201C\u201D]', '"'` – Nathan Hartley Nov 16 '11 at 18:34
2

The following works on the input and output that you had given:

    $c = Get-Content $file 
    $c | % { `

        $_ = $_.Replace("’","'")
        $_ = $_.Replace("`“","`"")
        $_.Replace("`”","`"")
        } `
        |  Set-Content $file
manojlds
0

Your last replace replaces a left fancy quote with a single quote. Is that what you want? It doesn't match your sample output. Try this:

$_.Replace("`“","`"")
$_.Replace("`”","`"")
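If pasting literal curly quotes into a script or console is unreliable, one option is to build them from their code points, as the question's `$fancySingleQuotes` line does. This is only a sketch with a hypothetical sample string; the calls are chained so that each input line yields a single output string.

```powershell
# Build the curly quotes from code points so the script file itself stays ASCII.
$leftDouble  = [string][char]0x201C
$rightDouble = [string][char]0x201D
$rightSingle = [string][char]0x2019

# Hypothetical sample line, for illustration only.
$line = $leftDouble + 'It' + $rightSingle + 's fine' + $rightDouble

# Chain the calls: String.Replace returns a new string each time.
$clean = $line.Replace($leftDouble, '"').Replace($rightDouble, '"').Replace($rightSingle, "'")
```

Inside a `%` block the same chain would be applied to `$_` instead of `$line`.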
zdan
  • that is my answer, but what you have given as code is wrong, `"“”"` - replace will try to replace the entire string, not individual characters. and i believe they have to be escaped as well. – manojlds Aug 06 '11 at 18:57
  • @manojlds: right you are, the console was playing tricks on me. I've got to remember to use ISE for unicode. – zdan Aug 07 '11 at 00:53
-1

This question is very close to what I needed. I was looking for something that would match any non-ASCII character and found this one: "Notepad++, How to remove all non ascii characters with regex?" Its regex seems to work fine in PowerShell as well.

The regex they use that works in PowerShell is:

[^\x00-\x7F]+

This matches any run of non-ASCII characters; you can hone the regex if you need to be more specific.

My input only had curly quotes as non-ASCII characters, so this simple substitution worked:

Replace the non-ASCII quote with a standard single quote:

$cq = $cq -replace "[^\x00-\x7F]+", "'"
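One thing to watch with this pattern: the `+` collapses a whole run of adjacent non-ASCII characters into a single `'`, and any other non-ASCII character (an accented letter, say) is replaced too. A sketch with a hypothetical sample string, compared against a pattern limited to the curly single quotes:

```powershell
# Build the non-ASCII characters from code points to keep the script ASCII.
$open   = [string][char]0x2018   # left single curly quote
$close  = [string][char]0x2019   # right single curly quote
$eAcute = [string][char]0x00E9   # e with acute accent

# Hypothetical sample: an accented word, then two adjacent curly quotes.
$cq = 'caf' + $eAcute + ' ' + $open + $close + 'x'

# Broad pattern: every run of non-ASCII characters becomes one apostrophe.
$broad = $cq -replace '[^\x00-\x7F]+', "'"

# Narrow pattern: only the curly single quotes are replaced, one for one.
$narrow = $cq -replace '[\u2018\u2019]', "'"
```

Here `$broad` loses the accented letter and merges the two quotes into one, while `$narrow` keeps the letter and both quotes.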

Community