Remove non-ASCII characters from CSV

Question

I want to remove all the non-ASCII characters from a file in place.

I found one solution with tr, but I guess I need to write back that file after modification.

I need to do it in place with relatively good performance.

Any suggestions?

The OP probably(?) meant non-printable characters (Ctrl-C, Unicode U+0003, is an ASCII character). The question should also specify the locale; without that information one could (should?) assume he meant the "C" locale. A naive answer would be to strip any byte greater than 0x7f, but that would preserve characters that are not printable in the C locale yet are perfectly legitimate ASCII characters. I'm downvoting the question because these issues make it too vague. – Juan Mar 07 '18 at 00:58
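
The two readings of the question give different results. As a minimal sketch (the sample bytes and the `hex` helper are made up for illustration, and `LC_ALL=C` is assumed so that `tr` works byte by byte), the first command removes only bytes above 0x7f, while the second keeps only tab, LF, CR and printable ASCII:

```shell
# Helper (hypothetical): print stdin as a compact hex string so control bytes are visible.
hex() { od -An -tx1 | tr -d ' \n'; }

# Sample: 'a', 0x02 (a control character), 'b', UTF-8 e-acute (0xC3 0xA9), 'c', LF.
sample='a\002b\303\251c\n'

# Reading 1: drop only bytes outside 7-bit ASCII; the control byte 0x02 survives.
ascii_only=$(printf "$sample" | LC_ALL=C tr -d '\200-\377' | hex)

# Reading 2: keep only tab, LF, CR and space through tilde; 0x02 is dropped too.
printable_only=$(printf "$sample" | LC_ALL=C tr -cd '\11\12\15\40-\176' | hex)

echo "non-ASCII removed:     $ascii_only"
echo "non-printable removed: $printable_only"
```

Which of the two is wanted depends on whether control characters in the CSV count as garbage.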

score 81 · Answer 1 · answered Jul 26 '10 at 18:52

A Perl one-liner would do:

perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i says that the file is going to be edited in place, and the backup is going to be saved with the extension .bak.

answered Jul 26 '10 at 18:52

ssegvic

This one is also usable with `stdin` as input. – h3xStream Aug 08 '12 at 14:59
The perl solution is faster than the sed solution. Trying to update a 122 GB file using sed took 3 hours, while perl took just under 2 hours for me. – user8128167 Sep 15 '14 at 19:01
I couldn't get the `sed` solution to work in my environment (Ubuntu gnu sed 4.2.2) but this worked like a charm. – steve klein Jun 01 '15 at 12:02
Tried everything and this was the only one that worked for me. Gotta love the power of Perl. Thanks! – jbrahy Dec 20 '16 at 19:00

score 48 · Accepted Answer · edited Apr 20 '15 at 19:21

# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME

edited Apr 20 '15 at 19:21

JellicleCat

answered Jul 26 '10 at 18:51

Ivan

@Sujit: Note that `sed -i` still creates an intermediate file. It just does it behind the scenes. – Dennis Williamson Jul 26 '10 at 19:57
@Dennis - then what would be the better solution? – Sujit Jul 26 '10 at 20:43
@Sujit: There's not a better solution. I just wanted to point out that an intermediate file is still created. Sometimes that matters. I just didn't want you to be under the assumption that it was doing it *literally* in place. – Dennis Williamson Jul 26 '10 at 21:22
On MacOSX, `sed: 1: "FILENAME": unterminated substitute pattern` – h3xStream Aug 08 '12 at 15:01
`sed -i "s/[\d128-\d255]//g" FILE` works for me on centos w/ GNU sed. You may have to use different quoting strategy (double quotes instead of single) depending on your OS/shell. – Joe Atzberger Aug 09 '13 at 16:27
Prints "Invalid collation character" on GNU sed 4.2.1. – Jason C Jun 18 '14 at 15:16
I can avoid the "invalid collation character" error with `LANG=C sed -i 's/[\d128-\d255]//g' FILE` – Patrick Dec 30 '14 at 21:58
@Patrick then your setup is broken. C locale implies 7-bit characters, and should generate that error with that pattern space. I recommend using a locale that has 8-bit characters, like iso-8859-1. That worked for me. – MarkI Jan 26 '15 at 18:39
On cygwin I got the same problem as @JasonC and Patrick's solution didn't fix it for me. I used the Perl solution below. – skiphoppy Nov 10 '16 at 21:15
@skiphoppy Try using double backslashes with cygwin. [related discussion](https://superuser.com/questions/552041/why-is-it-true-that-three-backslashes-are-needed-on-windows-for-sed-replace) – C8H10N4O2 Jan 19 '18 at 16:47
I fixed the "Invalid collation character" error by prefixing the sed invocation with `LC_ALL=C`. – Diomidis Spinellis Jan 02 '21 at 12:16
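
The points raised in these comments can be checked with a short sketch. It assumes GNU sed (BSD/macOS sed expects an argument after -i and may not accept `\dNNN`); the file name and contents are hypothetical. Comparing inode numbers before and after shows that `sed -i` replaces the file rather than rewriting it literally in place:

```shell
# Assumes GNU sed. Scratch file name and contents are for illustration only.
workdir=$(mktemp -d)
file="$workdir/sample.csv"
printf 'caf\303\251,1\n' > "$file"

inode_before=$(ls -i "$file" | awk '{print $1}')

# LC_ALL=C makes sed treat every byte as one character, which avoids
# the "Invalid collation character" error reported above.
LC_ALL=C sed -i 's/[\d128-\d255]//g' "$file"

inode_after=$(ls -i "$file" | awk '{print $1}')

echo "inode before: $inode_before, after: $inode_after"
cat "$file"
```

A changed inode means hard links to the old file still point at the unmodified data.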

score 28 · Answer 3 · answered Dec 21 '17 at 05:39

I tried all the solutions and nothing worked. The following, however, does:

tr -cd '\11\12\15\40-\176'

Which I found here:

https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix

My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
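
For the file case in the original question, `tr` cannot write back to its own input, so one option is a temporary file followed by `mv`. This is a sketch only; the scratch directory, file name and sample bytes are hypothetical:

```shell
# Scratch file name and contents are for illustration only.
workdir=$(mktemp -d)
file="$workdir/sample.csv"
printf 'caf\303\251\t1\r\n\001x\n' > "$file"

# Filter into a temp file in the same directory, then replace the original
# only if tr succeeded.
tmp=$(mktemp "$workdir/clean.XXXXXX")
LC_ALL=C tr -cd '\11\12\15\40-\176' < "$file" > "$tmp" && mv "$tmp" "$file"

od -c "$file"    # tab, CR and LF are kept; 0x01 and the UTF-8 bytes are gone
```

Note that the replaced file takes on the temp file's permissions, so a `chmod` may be needed afterwards.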

score 16 · Answer 4 · answered Jan 17 '12 at 18:59

sed -i 's/[^[:print:]]//' FILENAME

Also, this acts like dos2unix

answered Jan 17 '12 at 18:59

jcalfee314

Does not work. [:print:] is not the same as ASCII. There are many printable non-ASCII characters. – Jason C Jun 18 '14 at 15:17
Also the g modifier is missing. Only the first non-printable character would be removed. – proski Nov 30 '17 at 00:18
@JasonC There are also many non-printable ASCII characters. It's likely the original question was poorly formed. – Juan Mar 07 '18 at 01:21

score 15 · Answer 5 · answered Feb 28 '18 at 10:24

Try tr instead of sed

tr -cd '[:print:]' < file.txt

answered Feb 28 '18 at 10:24

Vivek

The OP specifically mentioned he didn't want to use tr (because he wanted an "in place" conversion, which sed -i pretends to be; it really writes to a temp file and renames it behind the scenes). So this answer doesn't help the OP. BUT... for those who want to use tr, you might want to preserve newlines (the 20180228 version shown here does not). A simple tweak, however, preserves newlines and carriage returns: `tr -cd '[:print:]\n\r' < file.txt` – Juan Mar 07 '18 at 00:08
`tr -cd '[:print:]' – evandrix Aug 07 '19 at 21:48

score 7 · Answer 6 · edited Aug 07 '19 at 21:49

# -i (inplace)

LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)

The role of the LANG=C part is to avoid an "Invalid collation character" error.

Based on Ivan's answer and Patrick's comment.

edited Aug 07 '19 at 21:49

evandrix

answered May 02 '18 at 03:41

Nicolas Raoul

score 6 · Answer 7 · edited Jun 09 '17 at 12:18

This worked for me:

sed -i 's/[^[:print:]]//g' FILENAME

edited Jun 09 '17 at 12:18

Jorge Y. C. Rodriguez

answered May 01 '17 at 20:22

AJn

I'm still getting unicode characters like 007F in my terminal. – Katastic Voyage Dec 21 '17 at 05:35
@KatasticVoyage What is your locale set to (LANG, LC_CTYPE)? – Juan Mar 07 '18 at 00:43

score 5 · Answer 8 · answered Oct 28 '14 at 16:40

I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:

sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE

answered Oct 28 '14 at 16:40

ACK_stoverflow

I don't have your system to test it on, but considering space " " is character 32 (decimal) and tilde "~" is character 126, all of the printable ASCII characters fall between these. If your sed supports [a-z] type ranges, and [^ type "not in" syntax, you should be able to replace that long string of characters with: `sed -i 's/[^ -~]//g' FILE` (that's /[^ -~]/, with a space before the hyphen) – JohnGH Nov 25 '20 at 15:46
@JohnGH Excellent, this does indeed work! A much better solution, albeit six years down the road :) – ACK_stoverflow Nov 25 '20 at 16:56
Sorry for the laggy response ;-) – JohnGH Dec 03 '20 at 15:49
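
The space-to-tilde range from this thread can be tried on a stream without touching any file. A minimal sketch, with a made-up sample string and `LC_ALL=C` assumed so that the range is interpreted by byte value:

```shell
# Sample: 'a', TAB, 'b', UTF-8 e-acute (0xC3 0xA9), 'c', '~', space, 'd'.
# Everything outside space (32) through tilde (126) is removed, including the tab.
cleaned=$(printf 'a\tb\303\251c~ d\n' | LC_ALL=C sed 's/[^ -~]//g')
echo "$cleaned"
```

Because sed works line by line, the newline at the end of each line is not affected by the substitution.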

score 3 · Answer 9 · edited Aug 19 '14 at 19:08

awk '{ gsub(/[^][a-zA-Z0-9"!@#$%^&*|_(){}]/, ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt

edited Aug 19 '14 at 19:08

answered Aug 19 '14 at 16:56

guestSA

This answer is missing its educational explanation. – mickmackusa May 17 '22 at 00:16

score 3 · Answer 10 · answered Jul 28 '10 at 13:05

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.

Note: ed(1) reads the entire file into memory to edit it in place, so for really large files you should use sed -i ... or perl -i ... instead.

# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l' 
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'

score 0 · Answer 11 · edited Mar 08 '17 at 01:57

I appreciate the tips I found on this site.

But, on my Windows 10, I had to use double quotes for this to work ...

sed -i "s/[\d128-\d255]//g" FILENAME

Noticed these things ...

For FILENAME, the entire path\name needs to be quoted. This didn't work: %TEMP%\"FILENAME". This did: "%TEMP%\FILENAME".
sed leaves behind temp files in the current directory, named sed*

edited Mar 08 '17 at 01:57

Renats Stozkovs

answered Mar 07 '17 at 22:22

Larry8811

Note: this answer works with gnu sed, but is not portable to other versions of sed (e.g., bsd). Given the side effects mentioned in this answer, it seems like a weird windows compiled version that tries to emulate gnu sed. Or the user is muddying the water with unrelated shell issues. – Juan Mar 07 '18 at 01:30
