The same regex works differently in perl and sed

Question

Okay, maybe something is wrong with Unicode or the like, but the code says it all:

$ cat leo
сказывать
ссказываю
сказав
BladeMight@Chandere ~ 23:24:58
$ cat leo | perl -pe 's/^с+каз/Рассказ/g'
Рассказывать
ссказываю
Рассказав
BladeMight@Chandere ~ 23:25:00
$ cat leo | sed -r 's/^с+каз/Рассказ/g'
Рассказывать
Рассказываю
Рассказав

I have a file, leo, whose contents are Cyrillic. I wanted to fix the misspelled words using the regex ^с+каз with perl -pe, but it only replaces the lines that start with exactly one с (the Cyrillic letter); in other words, + does nothing in this case (for non-Cyrillic text it works fine). With sed -r it works perfectly. Why could that be?

You'll also want to avoid the [useless `cat`](/questions/11710552/useless-use-of-cat). — tripleee, Nov 29 '19 at 21:48
Tip: No need to involve `cat`. You can use `perl -pe'...' leo` — ikegami, Nov 29 '19 at 21:51

score 4 · Accepted Answer · answered Nov 29 '19 at 21:40

Perl needs to be told that your source code is UTF-8 (-Mutf8) and that it should treat stdin and stdout as UTF-8 (-CS).

$ cat leo | perl -Mutf8 -CS -pe 's/^с+каз/Рассказ/g'
Рассказывать
Рассказываю
Рассказав

hobbs

Note: `use utf8` is necessary only when the source code itself contains UTF-8 text (here, the search pattern and the replacement). The `-CS` option is needed practically any time UTF-8 input or output is involved. – Polar Bear Nov 30 '19 at 00:11
