How can I batch-convert files in a directory for their encoding (e.g. ANSI → UTF-8) with a command or tool?
For single files, an editor helps, but how can I do the mass files job?
With PowerShell you can do something like this:
Get-Content IN.txt | Out-File -encoding ENC -filepath OUT.txt
where ENC is something like unicode, ascii, utf8, or utf32. Check out 'help Out-File'.
To convert all the *.txt files in a directory to UTF-8, do something like this:
foreach($i in ls -name DIR/*.txt) {
    Get-Content DIR/$i |
        Out-File -encoding utf8 -filepath DIR2/$i
}
which creates a converted version of each .txt file in DIR2.
To replace the files in all subdirectories, use:
foreach($i in ls -recurse -filter "*.java") {
    $temp = Get-Content $i.fullname
    Out-File -filepath $i.fullname -inputobject $temp -encoding utf8 -force
}
Out-File, Get-Content, Set-Content... all have an -Encoding parameter which allows utf8BOM or utf8NoBOM. iconv is much worse in this regard because it doesn't support UTF-8 with a BOM at all
– phuclv
Aug 21 '20 at 23:17
Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)
Convert from one (-f) to the other (-t) with:
$ iconv -f windows-1252 -t utf-8 infile > outfile
Or in a find-all-and-conquer form. Note that find's -exec does not go through a shell, so shell redirection (>) is not available here; use iconv's -o option to write the output in place:
## this will clobber the original files!
## (and see the comment below: iconv may silently truncate files larger
## than its internal buffer when the output file is the input file)
$ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:
There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.
The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).
The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:
[...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
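If you are not sure which codepage your own Windows system treats as "ANSI", one way to check (assuming the standard NLS registry layout, where the ACP value holds the ANSI codepage) is:
> reg query HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v ACP
If this reports 1252, the windows-1252 commands above apply as-is; otherwise pass the matching windows-<codepage> charset name to iconv.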
iconv seems to truncate files to 32,768 bytes if they exceed this size. Since it writes to the same file it's trying to read from, it manages to do the job if the file is small enough; otherwise it truncates the file without any warning...
– sylbru
Sep 11 '14 at 07:32
The character encoding of each matching text file is detected automatically, and each file is converted to UTF-8:
$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag and passing the filename as the positional argument "$1" via -- {}. In between, the UTF-8 output file is temporarily named converted.
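To see what the detection step inside $(...) produces on its own (the exact output varies by platform and file; notes.txt is just a hypothetical example):
$ file -bi notes.txt
text/plain; charset=iso-8859-1
$ file -bi notes.txt | sed -e "s/.*[ ]charset=//"
iso-8859-1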
The find command is very useful for this kind of file-management automation.
To convert from ANSI: find . -type f -iname "*.txt" -exec sh -c 'iconv -f windows-1252 -t utf-8 "$1" > converted && mv converted "$1"' -- "{}" \;
– djjeck
Feb 17 '21 at 20:05
My iconv doesn't have the -o option, so I use file redirection > instead: find . -type f -name '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" |sed -e "s/.*[ ]charset=//") -t utf-8 > /tmp/converted "$1" && mv /tmp/converted "$1"' -- {} \;. Any advantage of using this syntax -- as opposed to passing {} directly? find . -type f -name '*.txt' -exec sh -c 'iconv -f $(file -bi {} |sed -e "s/.*[ ]charset=//") -t utf-8 > /tmp/converted {} && mv /tmp/converted {}' \;
– Sybuser
Dec 30 '21 at 14:31
The Wikipedia page on newlines has a section on conversion utilities.
This seems like your best bet for a conversion using only tools Windows ships with:
TYPE unix_file | FIND "" /V > dos_file
There is a free and open-source batch encoding converter named CP Converter.
UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.
UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings including UTF-8, UTF-16 and UTF-32 to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension; UTFCast can automatically detect text files and convert them.
In my use case, I needed automatic input encoding detection, and there were a lot of files with Windows-1250 encoding, for which the command file -bi <FILE> returns charset=unknown-8bit. That is not a valid parameter for iconv.
I have had the best results with enca.
Convert all files with the .txt extension to UTF-8:
find . -type f -iname '*.txt' -exec sh -c 'echo "$1" && enca "$1" -x utf-8' -- {} \;
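If enca refuses to guess an encoding on its own, its detection can be helped by naming the language of the files with -L (a sketch with a hypothetical Polish-language file; see enca's man page for the supported languages):
$ enca -L polish -x utf-8 notes.txt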
enca is really useful, and way easier to use... when it works. Then again, other solutions fail, too...
– Gwyneth Llewelyn
May 13 '20 at 00:36
Use this Python script: https://github.com/goerz/convert_encoding.py It works on any platform. Requires Python 2.7.
I finally made a tool for this: https://github.com/gonejack/transcode
Install:
go get -u github.com/gonejack/transcode
Usage:
> transcode source.txt
> transcode -s gbk -t utf8 source.txt
--------------- Solution 1 ---------------
There are two flaws in @akira's answer. First, without -LiteralPath, file names that contain wildcard characters (such as square brackets) trigger errors like:
Set-Content : An object at the specified path ...txt does not exist, or has been filtered by the -Include or -Exclude parameter.
Second, if Get-Content fails, the file still gets overwritten. This is an improved version, adding -LiteralPath and an if($?) guard:
foreach($i in ls -name *.txt) {
    $relativePath = Resolve-Path -Relative -LiteralPath "$i"
    $temp = Get-Content -LiteralPath "$relativePath"
    if($?)
    {
        Out-File -LiteralPath "$i" -inputobject $temp -encoding utf8 -force
    }
}
--------------- Solution 2 (Better) ---------------
PowerShell can convert between only a limited set of encodings; gb2312 and Shift-JIS, for example, are not among them.
Notepad++ has a Python Script plugin that can do a better job than PowerShell, and it is relatively safer: you can review what you are about to convert.
1. Use Everything to find the files you want to convert. (Download link is below.)
2. Menu -> Plugins -> Python Script -> New Scripts
3. Drag the files from Everything into Notepad++.
4. Menu -> Plugins -> Python Script -> Scripts
There are two scripts below; the bottom one converts the opened tabs to UTF-8 and saves them.
Script 1
https://gist.github.com/bjverde/88bbc418e79f016a57539c2d5043c445
Script 2
for filename, bufferID, index, view in notepad.getFiles():
    console.write(filename + "\r\n")
    notepad.activateIndex(view, index)
    # UTF-8 (without BOM)
    notepad.menuCommand(MENUCOMMAND.FORMAT_CONV2_AS_UTF_8)
    notepad.save()
    notepad.reloadCurrentDocument()
iconv -f original_charset -t utf-8 originalfile > newfile
Run the above command in a for loop.
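For example, a minimal sh loop over the .txt files in the current directory (a sketch assuming a single known source charset, here windows-1252, and writing the converted copies with a hypothetical .utf8 suffix):
for f in *.txt; do
    iconv -f windows-1252 -t utf-8 "$f" > "$f.utf8"
done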
original_charset is just a placeholder here, not actually the magical "detect my encoding" feature we all might hope for.
– mwfearnley
Feb 26 '20 at 09:11
Some of these solutions rely on the -o option, which is not available in some flavours of iconv (namely, macOS, and I suspect FreeBSD as well).
On the other hand, the for loop is non-trivial to create if you require it to traverse a deep tree structure of directories...
– Gwyneth Llewelyn
May 13 '20 at 00:38
ConvertZ is another Windows GUI tool for batch conversion:
- Convert file (plain text) or clipboard content among the following encodings: big5, gbk, hz, shift-jis, jis, euc-jp, unicode big-endian, unicode little-endian, and utf-8.
- Batch file conversion
- Preview file content and converted result before actual conversion.
- Auto-update the charset in the <meta> tag, if specified, in HTML docs.
- Auto-fix mis-mapped Big5/GBK characters after conversion.
- Change filename's encoding among big5, gbk, shift-jis and unicode.
- Convert MP3's ID3 or APE among big5, gbk, shift-jis, unicode and utf-8 encoding.
- Convert Ogg tag between Traditional and Simplified Chinese in utf-8.
Alternative download link: https://www.softking.com.tw/download/1763/
There is dos2unix on Unix, and there was a similar tool for Windows.
The question "How do I convert between Unix and Windows text files?" has some more tricks.
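For reference, typical invocations (modern dos2unix converts files in place; notes.txt is just a hypothetical example):
$ dos2unix notes.txt   # CRLF -> LF
$ unix2dos notes.txt   # LF -> CRLF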
dos2unix is useful for converting line breaks, but the OP is looking to convert character encodings.
– Sony Santos
Apr 17 '14 at 03:01
I have created an online tool for that:
https://encoding-converter.netlify.app
You can upload a bunch of files at once to be converted; the upload will start automatically.