63

I want to remove all the non-ASCII characters from a file in place.

I found one solution with tr, but I guess I need to write back that file after modification.

I need to do it in place with relatively good performance.

Any suggestions?

dda
Sujit
  • can you provide a link to the one liner with tr? – Jordan Sitkin Jun 28 '16 at 19:00
  • The OP probably(?) meant non-printable characters (ctrl-c, unicode number U+0002, is an ASCII character). The question should also specify the locale - without that information one could (should?) assume he meant the "C" locale. A naive answer would be to strip any byte greater than 0x7f - that would preserve characters that are not printable in the C locale, but are perfectly legitimate ASCII characters. I'm downvoting the question for these reasons, which make it too vague. – Juan Mar 07 '18 at 00:58

11 Answers

81

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i says that the file is going to be edited in place, and a backup of the original is going to be saved with the extension .bak.

ssegvic
48
# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME
JellicleCat
Ivan
  • @Sujit: Note that `sed -i` still creates an intermediate file. It just does it behind the scenes. – Dennis Williamson Jul 26 '10 at 19:57
  • @Dennis - then what would be the better solution? – Sujit Jul 26 '10 at 20:43
  • 4
    @Sujit: There's not a better solution. I just wanted to point out that an intermediate file is still created. Sometimes that matters. I just didn't want you to be under the assumption that it was doing it *literally* in place. – Dennis Williamson Jul 26 '10 at 21:22
  • On MacOSX, `sed: 1: "FILENAME": unterminated substitute pattern` – h3xStream Aug 08 '12 at 15:01
  • `sed -i "s/[\d128-\d255]//g" FILE` works for me on centos w/ GNU sed. You may have to use different quoting strategy (double quotes instead of single) depending on your OS/shell. – Joe Atzberger Aug 09 '13 at 16:27
  • 56
    Prints "Invalid collation character" on GNU sed 4.2.1. – Jason C Jun 18 '14 at 15:16
  • 30
    I can avoid the "invalid collation character" error with `LANG=C sed -i 's/[\d128-\d255]//g' FILE` – Patrick Dec 30 '14 at 21:58
  • 1
    @Patrick then your setup is broken. C locale implies 7-bit characters, and should generate that error with that pattern space. I recommend using a locale that has 8-bit characters, like iso-8859-1. That worked for me. – MarkI Jan 26 '15 at 18:39
  • On cygwin I got the same problem as @JasonC and Patrick's solution didn't fix it for me. I used the Perl solution below. – skiphoppy Nov 10 '16 at 21:15
  • @skiphoppy Try using double backslashes with cygwin. [related discussion](https://superuser.com/questions/552041/why-is-it-true-that-three-backslashes-are-needed-on-windows-for-sed-replace) – C8H10N4O2 Jan 19 '18 at 16:47
  • I fixed the "Invalid collation character" error by prefixing the sed invocation with `LC_ALL=C`. – Diomidis Spinellis Jan 02 '21 at 12:16
28

I tried all the solutions and nothing worked. The following, however, does:

tr -cd '\11\12\15\40-\176'

Which I found here:

https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix

My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
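
To apply the same filter to a file rather than a pipe, the output has to go somewhere else first, because tr only reads standard input. A minimal sketch, assuming a POSIX shell with mktemp available; strip_to_printable is a hypothetical helper name, not part of tr:

```shell
# strip_to_printable FILE  (hypothetical helper)
# Keeps tab (\11), newline (\12), carriage return (\15) and printable
# ASCII from space (\40) to tilde (\176); deletes every other byte.
strip_to_printable() {
    tmp=$(mktemp) || return 1
    # Filter into a temp file, then copy the result back over the
    # original only if tr succeeded. Writing back with cat keeps the
    # original file's permissions and inode.
    if LC_ALL=C tr -cd '\11\12\15\40-\176' < "$1" > "$tmp"; then
        cat "$tmp" > "$1"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would be `strip_to_printable somefile.txt`. Like sed -i, this still goes through an intermediate file; it just makes that step explicit.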

Katastic Voyage
16
sed -i 's/[^[:print:]]//' FILENAME

Also, this acts like dos2unix

jcalfee314
  • 11
    Does not work. [:print:] is not the same as ASCII. There are many printable non-ASCII characters. – Jason C Jun 18 '14 at 15:17
  • 1
    Also the g modifier is missing. Only the first non-printable character would be removed. – proski Nov 30 '17 at 00:18
  • 1
    @JasonC There are also many non-printable ASCII characters. It's likely the original question was poorly formed. – Juan Mar 07 '18 at 01:21
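
The points raised in these comments can be seen on a small sample. This is a sketch with made-up input, run in the C locale; in a UTF-8 locale the é would typically count as printable and be kept.

```shell
# Sample bytes: a tab, "é" (\303\251 in UTF-8) and a CRLF line ending.
sample=$(mktemp)
printf 'a\tb \303\251\r\n' > "$sample"

# In the C locale every byte above 0x7f is non-printable, and so are
# tab and carriage return, so all of them are removed. The g flag
# makes sed remove every match on the line, not just the first one.
LC_ALL=C sed 's/[^[:print:]]//g' "$sample" | od -c
```

Note that the tab and the carriage return are stripped along with the non-ASCII bytes, which is why the command also behaves like dos2unix.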
15

Try tr instead of sed

tr -cd '[:print:]' < file.txt
Vivek
  • 4
    The OP specifically mentioned he didn't want to use tr (because he wanted an "in place" conversion which sed -i pretends to be - really writes to a temp file and renames behind the scenes). So this answer doesn't help the OP. BUT... for those who want to use tr, you might want to preserve newlines (the 20180228 version shown here does not). A simple tweak however preserves newlines and carriage returns: `tr -cd '[:print:]\n\r' < file.txt` – Juan Mar 07 '18 at 00:08
  • 1
    `tr -cd '[:print:]' – evandrix Aug 07 '19 at 21:48
7
# -i (inplace)

LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)

The role of the LANG=C part is to avoid an "Invalid collation character" error.

Based on Ivan's answer and Patrick's comment.
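
If the \d escapes are a problem (they are a GNU sed extension), one possible variant is to build the same byte set from POSIX character classes. This is a sketch, not a tested replacement for every sed: it relies on the fact that in the C locale [:print:] and [:cntrl:] together cover bytes 0 to 127. The sample file is made up.

```shell
# Throwaway file: "café", a tab, "ok", then a CRLF line ending.
f=$(mktemp)
printf 'caf\303\251\tok\r\n' > "$f"

# In the C locale the negated class below matches exactly the bytes
# 128-255, so only non-ASCII bytes are deleted.
# -i.bak keeps a backup copy next to the edited file.
LC_ALL=C sed -i.bak 's/[^[:print:][:cntrl:]]//g' "$f"

od -c "$f"   # tab and CRLF survive; the two bytes of é are gone
```

Unlike the [:print:]-only patterns, this leaves tabs, carriage returns and other ASCII control characters alone.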

evandrix
Nicolas Raoul
6

This worked for me:

sed -i 's/[^[:print:]]//g'
Jorge Y. C. Rodriguez
AJn
5

I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:

sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE
ACK_stoverflow
  • 1
    I don't have your system to test it on, but considering space " " is character 32 (decimal) and tilde "~" is character 126, all of the printable ASCII characters fall between these. If your sed supports [a-z] type ranges, and [^ type "not in" syntax, you should be able to replace that long string of characters with: `sed -i 's/[^ -~]//g' FILE` (that's /[^ -~]/, with a literal space after the ^) – JohnGH Nov 25 '20 at 15:46
  • 1
    @JohnGH Excellent, this does indeed work! A much better solution, albeit six years down the road :) – ACK_stoverflow Nov 25 '20 at 16:56
  • 1
    Sorry for the laggy response ;-) – JohnGH Dec 03 '20 at 15:49
3
awk '{ gsub(/[^]a-zA-Z0-9"!@#$%^&*|_[(){}]/, ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt
guestSA
3

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.

Note: ed(1) reads the entire file into memory to edit it in-place, so for really large files you should use sed -i ..., perl -i ...

# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l' 
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'
trevor
0

I appreciate the tips I found on this site.

But, on my Windows 10, I had to use double quotes for this to work ...

sed -i "s/[\d128-\d255]//g" FILENAME

Noticed these things ...

  1. For FILENAME, the entire path\name needs to be quoted. This didn't work: %TEMP%\"FILENAME". This did: "%TEMP%\FILENAME".

  2. sed leaves behind temp files in the current directory, named sed*

Renats Stozkovs
Larry8811
  • Note: this answer works with gnu sed, but is not portable to other versions of sed (e.g., bsd). Given the side effects mentioned in this answer, it seems like a weird windows compiled version that tries to emulate gnu sed. Or the user is muddying the water with unrelated shell issues. – Juan Mar 07 '18 at 01:30