How does the Linux command `file` recognize the encoding of my files?

Question

How does the Linux command file recognize the encoding of my files?

zell@ubuntu:~$ file examples.desktop 
examples.desktop: UTF-8 Unicode text

zell@ubuntu:~$ file /etc/services 
/etc/services: ASCII text

Possible duplicate of [How does Linux recognize a file as a certain file type, and how can I change it?](https://stackoverflow.com/questions/10131631/what-causes-the-computer-to-recognize-a-file-as-a-certain-file-type-and-how-can), [How to find encoding of a file via script on Linux?](https://stackoverflow.com/q/805418/608639), etc. — jww, Oct 15 '19 at 00:53

score 1 · Answer 1 · answered Oct 10 '19 at 18:32

The man page is pretty clear:

The filesystem tests are based on examining the return from a stat(2) system call...
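To make the first stage concrete, here is a minimal sketch of a stat-based classification in Python. It is not how `file` is implemented; the function name `filesystem_kind` and the returned labels are made up for illustration. It only shows that some answers can be given from `stat(2)` data alone, without reading the file's contents.

```python
import os
import stat

def filesystem_kind(path):
    """Classify a path from lstat() data alone (hypothetical helper)."""
    info = os.lstat(path)  # lstat, so a symlink is reported rather than followed
    mode = info.st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISFIFO(mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISBLK(mode):
        return "block special"
    if info.st_size == 0:
        return "empty"
    # Only regular, non-empty files go on to the content-based tests.
    return "regular file"
```

Anything that comes back as "regular file" here is what the later magic and text tests would look at.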

The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) a.out file, whose format is defined in #include <a.out.h> and possibly #include <exec.h> in the standard include directory. These files have a 'magic number' stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a 'magic' has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. The information identifying these files is read from the compiled magic file /usr/share/misc/magic.mgc, or the files in the directory /usr/share/misc/magic if the compiled file does not exist. In addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used in preference to the system magic files. If /etc/magic exists, it will be used together with other magic files.
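The "invariant identifier at a small fixed offset" idea can be sketched in a few lines of Python. The table below is a tiny hand-written stand-in for the real magic database, and `MAGIC_TABLE` and `match_magic` are hypothetical names; the real magic files use a much richer test language than a plain byte comparison.

```python
# Hypothetical mini "magic database": (offset, expected bytes, description).
# The signatures are well-known ones; the table itself is illustrative only.
MAGIC_TABLE = [
    (0, b"\x7fELF", "ELF executable or object"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"\x1f\x8b", "gzip compressed data"),
    (0, b"%PDF-", "PDF document"),
    (257, b"ustar", "POSIX tar archive"),  # the identifier need not be at offset 0
]

def match_magic(data):
    """Return the description of the first matching entry, or None."""
    for offset, signature, description in MAGIC_TABLE:
        if data[offset:offset + len(signature)] == signature:
            return description
    return None  # no match: fall through to the text tests
```

For example, `match_magic(b"\x7fELF\x02\x01\x01")` returns `"ELF executable or object"`, while plain text such as `b"hello"` returns `None`.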

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported.

In short, for regular files, the magic values are tested first. If there's no match, `file` checks whether it looks like a text file, making an educated guess about the specific encoding from the actual byte values in the file.
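The "educated guess" can be sketched roughly as follows. This is a simplified approximation in Python, not the actual algorithm: the function name `guess_text_encoding` is made up, the set of tolerated control characters is an assumption, and EBCDIC and BOM-less UTF-16 are ignored entirely. It only illustrates how byte ranges and sequences separate the candidates.

```python
def guess_text_encoding(data):
    """Guess a label for a byte string, or return None if it looks binary."""
    if not data:
        return "empty"
    # A byte-order mark is the easy way to spot UTF-16.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 Unicode text"
    # Assumption: these control bytes are acceptable in text
    # (bell, backspace, tab, newline, form feed, carriage return, escape).
    allowed_controls = {7, 8, 9, 10, 12, 13, 27}
    for byte in data:
        if (byte < 32 and byte not in allowed_controls) or byte == 127:
            return None  # other control characters suggest binary data
    if all(byte < 128 for byte in data):
        return "ASCII text"
    try:
        data.decode("utf-8")  # valid multi-byte sequences throughout?
        return "UTF-8 Unicode text"
    except UnicodeDecodeError:
        pass
    # 0x80-0x9F are control codes in ISO-8859-x, so their presence
    # points to some other 8-bit character set.
    if all(not (0x80 <= byte <= 0x9F) for byte in data):
        return "ISO-8859 text"
    return "Non-ISO extended-ASCII text"
```

With this sketch, `guess_text_encoding(b"hello\n")` returns `"ASCII text"`, and the same function returns `"UTF-8 Unicode text"` once the input contains a valid multi-byte sequence such as `"é".encode("utf-8")`. Note that ASCII is checked before UTF-8 because every ASCII file is also valid UTF-8, which is why a file is only reported as UTF-8 when it actually contains non-ASCII characters.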

Oh, and you can also download the source code and look at the implementation for yourself.
