Identifying and removing null characters in UNIX

Question

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:

1. Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
2. Remove the null characters? Running strings on the file cleaned it up, but I'm wondering whether this is the best way.

In fact, this question is on superuser.com: http://superuser.com/questions/75130/how-to-remove-ths-symbol-with-vim — jrb, Apr 26 '11 at 09:23

score 141 · Accepted Answer · edited Jan 27 '14 at 03:06

I’d use tr:

tr < file-with-nulls -d '\000' > file-without-nulls

If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
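
As a quick check of both points, here is a minimal sketch. The sample data and the out-a/out-b file names are made up for illustration. It strips the NULs with tr, compares byte counts, and runs the command with the input redirection in two different positions to show that the result is the same.

```shell
#!/bin/sh
# Scratch directory so no real file is touched.
d=$(mktemp -d) || exit 1

# Hypothetical sample data: 13 bytes, 3 of them NUL.
printf 'ab\000c\nde\000\000f\ng\n' > "$d/file-with-nulls"

# Redirection before the arguments, as in the command above.
tr < "$d/file-with-nulls" -d '\000' > "$d/out-a"
# Redirection after the arguments; the shell handles both forms the same way.
tr -d '\000' < "$d/file-with-nulls" > "$d/out-b"

before=$(wc -c < "$d/file-with-nulls")
after=$(wc -c < "$d/out-a")
if cmp -s "$d/out-a" "$d/out-b"; then same=yes; else same=no; fi
echo "bytes before: $before, after: $after, outputs identical: $same"

rm -rf "$d"
```

The difference between the two byte counts is the number of NUL bytes that were removed.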

Palec

answered Mar 07 '10 at 23:14

Pointy

and a "diff file-with-nulls file-without-nulls" should show me which lines had null characters? It brings back a lot more than expected. – dogbane Mar 07 '10 at 23:27
Actually, I believe it should be `tr -d '\000' < file-with-nulls > file-without-nulls` since ` – Mikael S Mar 07 '10 at 23:50
Most shells will recognize & deal with < or > anywhere in the argument string, actually. Surprised me too. – pra Mar 08 '10 at 18:16
+1 For usage of input redirection instead of `cat |`. A fine, clean solution and it solved my problem. – Krzysztof Jabłoński Feb 13 '14 at 07:14
This is an order of magnitude slower than `sed` for me – diachedelic Oct 30 '17 at 04:06
@diachedelic that's pretty interesting. I wonder what's behind that; buffering? – Pointy Oct 30 '17 at 13:00
@Pointy I have no idea how the internals of either tool work, so I can't hazard a guess – diachedelic Nov 02 '17 at 01:46
@Pointy Is there a reason you use '\000' instead of '\0'? On the surface they seem to have the same effect – Harold Fischer May 31 '18 at 01:38
@HaroldFischer I don't recall why I wrote it that way 8 years ago I'm afraid :) – Pointy May 31 '18 at 02:08
@Pointy '\000' is used in lieu of '\0' in the POSIX opengroup specification for tr. That is a good reason to prefer it – Harold Fischer May 31 '18 at 02:45
@HaroldFischer well I'm not sure what you're trying to do; if you want to see if file has any nulls in it you could use `wc` to compare the size of the file pre-filtering and post-filtering. In a Unicode world it's generally a better idea to be prepared for non-ASCII characters than to worry about them. – Pointy May 31 '18 at 17:35
@Pointy I do apologize, that question was meant for someone else – Harold Fischer May 31 '18 at 18:03
I manage to detect nulls using `grep -Poa '\000'` and using `wc`. Seems easier and more direct / less error-prone. – Pysis Apr 26 '19 at 03:07
@Pysis I'm glad that works for you, but I'm not sure what's less "error-prone" about it. – Pointy Apr 26 '19 at 03:25
I can identify nulls more exactly matched to the content rather than aggregate file size. I guess if we are replacing all nulls then maybe that's less of a problem. Grepping just gives a more focused approach that I was trying to mention. – Pysis Apr 26 '19 at 04:17
This is an extra-good answer for the description of the file redirects. That has the potential to make things so much clearer! – bballdave025 Jun 10 '19 at 16:38
How can we automate the process, so that whenever multiple files arrive, their null characters are removed? – PAA Jan 06 '21 at 05:58

score 74 · Answer 2 · edited Oct 08 '12 at 16:12

Use the following sed command for removing the null characters in a file.

sed -i 's/\x0//g' null.txt

This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
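
A minimal sketch of the backup variant, assuming GNU sed (the \x0 escape is GNU behaviour, and BSD/macOS sed may treat it differently). The sample data is made up for illustration.

```shell
#!/bin/sh
# Assumes GNU sed; scratch directory so no real file is touched.
d=$(mktemp -d) || exit 1

# Hypothetical sample data containing two NUL bytes.
printf 'one\000two\nthree\000\n' > "$d/null.txt"

# Edit in place and keep the original as null.txt.bak.
sed -i.bak 's/\x0//g' "$d/null.txt"

# Count the NUL bytes left in the edited file and in the backup.
remaining=$(tr -cd '\000' < "$d/null.txt" | wc -c)
in_backup=$(tr -cd '\000' < "$d/null.txt.bak" | wc -c)
echo "NULs in edited file: $remaining, NULs in backup: $in_backup"

rm -rf "$d"
```

Keeping the .bak copy until the result has been checked makes it easy to recover if the substitution removed more than intended.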

Community

answered Mar 08 '10 at 07:13

rekha_sri

Note: In FreeBSD (and I believe also Mac OS X), `sed -i` *requires* an extension in the next argument, but it may be empty. In those systems, add a `''`, as in: `sed -i '' 's/\x0//g "$FILE"`. – Tim Čas Feb 01 '17 at 21:05
This is an order of magnitude faster than `tr` for me – diachedelic Oct 30 '17 at 04:06
For me, using Git for Windows and `$ sed --version` -> `sed (GNU sed) 4.7`, I had to use the following invocation to get a backup file called `example.csv.bak`: `sed -i.bak 's/\x0//g' example.csv` – Andrew Keeton Jan 22 '20 at 18:21
@TimČas you did it great, just missed one ' so it should be sed -i '' 's/\x0//g' some_file.xml – Dark Apr 29 '20 at 07:29
On mac this only did the first null character and not all of them. `gsed` did work to do all of them. – phyatt Sep 08 '21 at 18:36

score 23 · Answer 3 · answered Mar 07 '10 at 23:16

A large number of unwanted NUL characters, say one every other byte, often indicates that the file is encoded in UTF-16, in which case you should use iconv to convert it to UTF-8 rather than delete the NULs.
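
A minimal sketch of that conversion, using a made-up three-character sample. The encoding is given as UTF-16LE because the sample has no byte order mark; for a file that starts with one, -f UTF-16 lets iconv pick the byte order.

```shell
#!/bin/sh
# Scratch directory so no real file is touched.
d=$(mktemp -d) || exit 1

# Hypothetical sample: the text "hi" plus a newline, encoded as UTF-16LE.
printf 'h\000i\000\n\000' > "$d/utf16.txt"

# Every second byte is NUL, which is the pattern described above.
od -c "$d/utf16.txt"

# Convert to UTF-8 instead of deleting the NUL bytes.
iconv -f UTF-16LE -t UTF-8 "$d/utf16.txt" > "$d/utf8.txt"

converted=$(cat "$d/utf8.txt")
size=$(wc -c < "$d/utf8.txt")
echo "converted text: $converted ($size bytes)"

rm -rf "$d"
```

For plain ASCII text, deleting the NULs happens to give the same bytes, but any non-ASCII character would be corrupted, which is why a real conversion is the safer choice.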

Ignacio Vazquez-Abrams

I ran out of disk space while my application was logging. This resulted in these characters. – dogbane Mar 07 '10 at 23:21
For example, it works using this command: `iconv -f UTF-16 -t UTF-8 file`. – djule5 Apr 27 '20 at 15:30

score 7 · Answer 4 · answered Mar 08 '10 at 08:08

I discovered the following, which prints out which lines, if any, have null characters:

perl -ne '/\000/ and print;' file-with-nulls

Also, a byte-wise octal dump (od -b) can tell you if there are nulls:

od -b file-with-nulls | grep ' 000'

dogbane

score 5 · Answer 5 · answered Nov 24 '15 at 10:41

If the lines in the file end with \r\n\000, then what works is to delete the \n and \000 characters and then replace each \r with \n:

tr -d '\n\000' <infile | tr '\r' '\n' >outfile
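
A minimal sketch of the pipeline on made-up input. It assumes every line ends with exactly \r\n\000, since any other \n in the data would be deleted as well.

```shell
#!/bin/sh
# Scratch directory so no real file is touched.
d=$(mktemp -d) || exit 1

# Hypothetical sample: two lines, each terminated by CR, LF, NUL.
printf 'first\r\n\000second\r\n\000' > "$d/infile"

# Drop LF and NUL, then turn each remaining CR into LF.
tr -d '\n\000' < "$d/infile" | tr '\r' '\n' > "$d/outfile"

text=$(cat "$d/outfile")
size=$(wc -c < "$d/outfile")
leftover=$(tr -cd '\r\000' < "$d/outfile" | wc -c)
echo "output is $size bytes with $leftover CR or NUL bytes left"

rm -rf "$d"
```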

wwmbes

PS. If you find yourself in a Windows DOS shell, you can get the GNU/win32 versions of Unix commands from Sourceforge.net. I use them all the time. Check out "od" the octal dump command for analysing what's in a file... – wwmbes Jun 20 '16 at 14:22

kenorb · Answer 6 · 2017-11-30T13:13:36.563

Here is an example of how to remove NUL characters using ex (in place):

ex -s +"%s/\%x00//g" -cwq nulls.txt

and for multiple files:

ex -s +'bufdo!%s/\%x00//g' -cxa *.txt

For recursion, you may use the globbing pattern **/*.txt (if it is supported by your shell).

Useful for scripting, since the -i option of sed is a non-standard extension whose syntax differs between GNU and BSD sed.
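
If ex is not available either, a similar effect can be sketched with tr and a temporary file. The function name strip_nuls and the sample files are hypothetical, and mktemp, although widely available, is not part of POSIX.

```shell
#!/bin/sh
# strip_nuls is a hypothetical helper name, not a standard command.
strip_nuls() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Write the cleaned copy to a temporary file, then copy it back over
        # the original so the file keeps its inode and permissions.
        if tr -d '\000' < "$f" > "$tmp"; then
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}

# Demonstration on two made-up files in a scratch directory.
d=$(mktemp -d) || exit 1
printf 'x\000y\n' > "$d/a.txt"
printf '\000\000z\n' > "$d/b.txt"

strip_nuls "$d"/*.txt

left=$(cat "$d/a.txt" "$d/b.txt" | tr -cd '\000' | wc -c)
first=$(cat "$d/a.txt")
echo "NULs left: $left, a.txt now contains: $first"

rm -rf "$d"
```

The cleaned data is copied back with cat rather than moved with mv so that ownership and permissions of the original file stay as they were.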

See also: How to check if the file is a binary file and read all the files which are not?

score 2 · Answer 7 · edited Dec 17 '16 at 13:00

I used:

recode UTF-16..UTF-8 <filename>

to get rid of zeroes in file.

kenorb

answered Jun 22 '15 at 10:04

logisec

score 0 · Answer 8 · edited Sep 04 '18 at 07:00

I faced the same error with:

import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')

I solved the problem by changing the encoding to utf-16

f=cd.open(filePath,'r','utf-16')

Inian

answered Sep 04 '18 at 06:57

Ming Young
