UTF-8 Width Display Issue of Chinese Characters

Question

When I print data with printf in Perl or C, I use a format specifier to control the width of each column, like this:

printf("%-30s", str);

But when str contains Chinese characters, the columns don't align as expected (see the attached picture).

My Ubuntu locale is zh_CN.utf8. As far as I know, UTF-8 encodes each character in 1 to 4 bytes, and a Chinese character takes 3 bytes. In my test, I found that printf's width control counts a Chinese character as 3, but it actually displays 2 ASCII columns wide.

So the real display width is not constant as expected, but varies with the number of Chinese characters, i.e.

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

Here w is the expected field width, x is the number of Chinese characters, and Sw(x) is the real display width.

So the more Chinese characters str contains, the narrower it displays.
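To make the arithmetic above concrete, here is a minimal C sketch. It assumes, as in the question, that the string holds only ASCII (1 column each) and 3-byte CJK characters (2 columns each). `displayed_columns` is a hypothetical helper, not a library function: it lets `snprintf` pad by bytes exactly as `printf("%-30s", str)` does, then counts the columns a terminal would show under that assumption.

```c
#include <stdio.h>
#include <stddef.h>

/* Columns occupied on screen after printf-style padding to field width w.
   ASSUMPTION: s contains only ASCII and 3-byte UTF-8 sequences that
   display 2 columns wide; other sequences are not counted. Output longer
   than the 255-byte scratch buffer is truncated. */
size_t displayed_columns(const char *s, int w)
{
    char buf[256] = "";
    size_t cols = 0;

    /* The field width in "%-*s" is measured in bytes, not columns. */
    snprintf(buf, sizeof buf, "%-*s", w, s);

    for (const unsigned char *p = (const unsigned char *)buf; *p; p++) {
        if (*p < 0x80)
            cols += 1;              /* ASCII, including the padding spaces */
        else if ((*p & 0xF0) == 0xE0)
            cols += 2;              /* lead byte of a 3-byte sequence */
        /* continuation bytes (10xxxxxx) add nothing */
    }
    return cols;
}
```

For a field width of 10, the string 中文 (6 bytes) gets 4 spaces of padding and so occupies 2 * 2 + 4 = 8 columns, which matches w - x = 10 - 2.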

How can I get the alignment I want? Should I count the Chinese characters before calling printf?

As far as I know, all Chinese characters, and I guess all wide characters, display 2 columns wide, so why does printf count each one as 3? UTF-8's encoding has nothing to do with display width.

In other words, you're looking for a multibyte-aware version of `printf` for Perl and/or C? — deceze, May 25 '12 at 09:32
I've never done UTF-8 decoding in C, but here's some Go code that counts runes in a UTF-8 string: http://golang.org/src/pkg/unicode/utf8/utf8.go?s=4824:4876#L202 — Denys Séguret, May 25 '12 at 09:33
@dystroy It isn’t just a matter of counting the code points (i.e., runes). Rather, it is taking into account that different code points represent 0, 1, or 2 print columns per UAX#11, and this is fairly subtle, especially with the `East_Asian_Width=Ambiguous` characters. I don’t know of any Go library that deals with this the way the Perl library described in my answer does, but if there is such a thing for Go, I’d love to learn about it! Thanks. — tchrist, May 26 '12 at 07:40
@tchrist: I learned something. And I just ran a test: "go fmt" doesn't correctly format structs with "long" characters. So I guess there are still imperfections in Go's handling of the gigantic beast that is Unicode... — Denys Séguret, May 26 '12 at 07:59
Display width (number of screen positions), number of characters and number of bytes are three different things. `printf` only cares about the number of bytes. If you want to take into account the number of characters, use `wprintf` (remember, it takes a `wchar_t*` format). There's no formatting function in C that takes into account display width. — n. 1.8e9-where's-my-share m., May 26 '12 at 08:00
@n.m. Dealing with physical code units instead of logical code points is always a major fail. — tchrist, May 26 '12 at 14:58
@tchrist: that's what the specification of `printf` says. Perhaps a better one is in order. — n. 1.8e9-where's-my-share m., May 26 '12 at 18:20
@n.m. Depends whose `printf`. Some deal in code points. None deal with print widths. — tchrist, May 26 '12 at 19:19
@tchrist: `printf` is defined by the C standard, any deviation from it is, well, non-standard. — n. 1.8e9-where's-my-share m., May 26 '12 at 19:26
@n.m. Immaterial. Read the fourth word in the original poster’s question. Only the C `printf` is defined by the C standard, **no one else’s.** Certainly Perl’s `printf` wouldn’t be so stupid as to treat characters as bytes, and doesn’t. Neither does Ruby’s `printf`, nor Java’s `printf`, nor Go’s `fmt`, nor Python’s `%` operator. I’m sure there are plenty of other languages with a modern character processing model that aren’t crippled by bytethink. — tchrist, May 26 '12 at 20:59
@tchrist: sorry I should have made clear that I'm only talking about C, not Perl or anything else. — n. 1.8e9-where's-my-share m., May 26 '12 at 21:25

score 7 · Answer 1 · edited May 23 '17 at 11:45

Yes, this is a problem with all versions of printf that I am aware of. I briefly discuss the matter in this answer and also in this one.

For C, I do not know of a library that will do this for you, but if anyone has it, it would be ICU.

For Perl, you have to use the Unicode::GCString module from CPAN to calculate the number of print columns a Unicode string will take up. This takes into account Unicode Standard Annex #11: East Asian Width.

For example, some code points take up 1 column and others take up 2 columns. There are even some that take up no columns at all, like combining characters and invisible control characters. The class has a columns method that returns how many columns the string takes up.
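For C, the 0/1/2-column idea can be sketched by hand. The following is only a rough approximation, not an implementation of UAX #11: the helper names `utf8_next`, `cp_columns`, and `utf8_columns` are hypothetical, the ranges are a small hand-picked subset, East Asian Ambiguous characters are treated as narrow, and the input is assumed to be valid UTF-8. On POSIX systems, `wcwidth()` from `<wchar.h>` gives a locale-dependent answer for a single `wchar_t` and is likely a better starting point for real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point at *s and advance *s past it.
   ASSUMPTION: the input is valid UTF-8; nothing is validated. */
static uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80)
        cp = (cp << 6) | (*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Approximate column width of one code point (simplified ranges). */
static int cp_columns(uint32_t cp)
{
    if ((cp >= 0x0300 && cp <= 0x036F) ||   /* combining diacritical marks */
        (cp >= 0x200B && cp <= 0x200F))     /* zero-width space, joiners, direction marks */
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo initial consonants */
        (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) || /* CJK, kana, and neighbours */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||   /* CJK compatibility ideographs */
        (cp >= 0xFF01 && cp <= 0xFF60) ||   /* fullwidth forms */
        (cp >= 0x20000 && cp <= 0x3FFFD))   /* supplementary ideographs */
        return 2;
    return 1;
}

/* Sum of approximate column widths over a NUL-terminated UTF-8 string. */
int utf8_columns(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    int cols = 0;

    while (*s)
        cols += cp_columns(utf8_next(&s));
    return cols;
}
```

With these ranges, `utf8_columns("包子")` is 4, while an "e" followed by the combining acute accent U+0301 counts as a single column.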

I have an example of using this for aligning Unicode text vertically here. It will sort a bunch of Unicode strings, including some with combining characters and “wide” Asian ideograms (CJK characters), and allow you to align things vertically.

(Image: sample terminal output)

The code for the little umenu demo program, which prints that nicely aligned output, is included below.

You might also be interested in the far more ambitious Unicode::LineBreak module, of which the aforementioned Unicode::GCString class is just a smaller component. This module is much cooler, and takes into account Unicode Standard Annex #14: Unicode Line Breaking Algorithm.

Here’s the code for the little umenu demo, tested on Perl v5.14:

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new(locale => "ja");

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

As you see, the key to making it work in that particular program is this line of code, which just calls other functions defined above, and uses the module I was discussing:

print pad(entitle($item), $width, ".");

That will pad out the item to the given width using dots as the fill character.

Yes, it’s a lot less convenient than printf, but at least it is possible.
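The same padding idea carries over to C. Here is a minimal sketch, assuming the caller has already measured the string's column width by some other means (for instance with a library routine); `pad_columns` and its parameters are hypothetical names, chosen to mirror the Perl `pad` above.

```c
#include <stddef.h>
#include <string.h>

/* Copy s into out, then append `fill` until the text occupies `width`
   columns. `cols` is the display width of s as measured by the caller;
   this function never counts bytes as columns. Returns 0 on success,
   or -1 if out (of size outsz) is too small. */
int pad_columns(char *out, size_t outsz, const char *s,
                int cols, int width, char fill)
{
    size_t len  = strlen(s);                        /* bytes to copy */
    size_t npad = (width > cols) ? (size_t)(width - cols) : 0;

    if (len + npad + 1 > outsz)
        return -1;
    memcpy(out, s, len);
    memset(out + len, fill, npad);                  /* pad in columns */
    out[len + npad] = '\0';
    return 0;
}
```

Padding 包子 (6 bytes, 4 columns) to a width of 8 with `'.'` appends four dots, where a byte-based `printf("%-8s", ...)` would append only two. As in the Perl version, a string already wider than the target is copied through with no padding.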
