What is the difference between a null-terminated string and a string that is not terminated by null in x86 assembly language?

Question

I'm currently learning assembly programming by following Kip Irvine's "assembly language x86 programming" book.

In the book, the author states

The most common type of string ends with a null byte (containing 0). Called a null-terminated string

In the subsequent section of the book, the author had a string example without the null byte

greeting1 \
BYTE "Welcome to the Encryption Demo program "

So I was just wondering, what is the difference between a null-terminated string and a string that is not terminated by null in x86 assembly language? Are they interchangeable? Or are they not equivalent to each other?

How do you get the length of `greeting2`? With `greeting1` you can look for the first null byte. Example 1 is like C, Example 2 is like Java (Java `String` stores length). — Elliott Frisch, Jun 27 '17 at 00:45
@ElliottFrisch thank you for the comment. Correct me if I'm wrong, but my understanding of strings in x86 assembly language is: a string can be defined with or without a null at the end, and we can determine the length of strings with a null at the end, but the same can't be done for strings without one. Is my understanding correct? Is there any other difference? — Thor, Jun 27 '17 at 00:48
Consider also storing strings that might contain multiple `\0`(s). I can't really think of another one. But maybe you can. Good luck. — Elliott Frisch, Jun 27 '17 at 00:54
@ElliottFrisch thanks for helping out! I will keep googling lol — Thor, Jun 27 '17 at 00:57
Some very relevant background can be found in [this answer](https://stackoverflow.com/questions/44534685/how-to-index-through-a-string-in-assembly/44535805#44535805) (although your question is not a duplicate of that one). — Cody Gray, Jun 27 '17 at 07:56
There's no string in assembly at all. That line in your question sets 39 bytes of memory to particular values (converted from text by the assembler, using standard ASCII encoding), i.e. the first byte has the value `87`. The difference is that a "null-terminated string" has one more byte defined after the last character, with the value `0` (so it also uses one more byte of memory). In both cases it's just values in memory, so from this perspective they are the same. But if the code is looking for the terminating zero (like C `strlen`), then a non-terminated string will confuse it and make it run into the memory beyond the string. — Ped7g, Jun 27 '17 at 08:21
I.e. it is the code using the memory content that gives the content some structural meaning. If you have code which wants "strings" defined as a length byte at offset 0 followed by the string content, then you would define `"Hello"` as: `pascal_string_example: DB 5, "Hello"`. Such code would happily work with that one. If you tried to use this "string" with C-like code expecting null termination, it would wrongly display the `5` as the first character and run until it hit the first zero in the following memory, or crash on an invalid access. Meanwhile the memory is just an array of bytes... — Ped7g, Jun 27 '17 at 08:28

Peter Cordes · Accepted Answer · 2020-10-10T14:30:38.023

There's nothing specific to asm here; it's the same issue in C. It's all about how you store strings in memory and keep track of where they end.

What is the difference between a null-terminated string and a string that is not terminated by null?

A null-terminated string has a 0 byte after it, so you can find the end with `strlen` (e.g. with a slow `repne scasb`). This makes it usable as an implicit-length string, like C uses.

*NASM Assembly - what is the ", 0" after this variable for?* explains the NASM syntax for creating one in static storage with `db`. *db usage in nasm, try to store and print string* shows what happens when you forget the 0 terminator.
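A minimal C sketch of the implicit-length idea. The helper name `my_strlen` and the two arrays are made up for illustration; they are not taken from the linked Q&As.

```c
#include <stddef.h>

/* Hypothetical helper: count the bytes before the first 0 byte.
   This is the same job strlen does, written as a plain byte loop. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')      /* stop at the terminator */
        p++;
    return (size_t)(p - s); /* the terminator itself is not counted */
}

/* The same text stored two ways in static storage. */
const char terminated[]    = "Hello";                   /* 6 bytes: H e l l o 0 */
const char unterminated[5] = {'H', 'e', 'l', 'l', 'o'}; /* 5 bytes, no 0 byte   */
```

Calling `my_strlen(terminated)` stops at the stored 0 byte. Calling it on `unterminated` would read past the end of the array, which is undefined behaviour in C and the same failure mode as forgetting the `, 0` in assembly.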

Are they interchangeable?

If you know the length of a null-terminated string, you can pass pointer+length to a function that wants an explicit-length string. That function will never look at the 0 byte, because you will pass a length that doesn't include the 0 byte. It's not part of the string data proper.

But if you have a string without a terminator, you can't pass it to a function or system-call that wants a null-terminated string. (If the memory is writeable, you could store a 0 after the string to make it into a null-terminated string.)
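As a rough C sketch of both directions (the function names `count_byte` and `terminate_in_place` are hypothetical, and the `capacity` parameter is my assumption about how a caller would know the memory after the string is writable):

```c
#include <stddef.h>

/* Explicit-length consumer: looks only at buf[0] .. buf[len-1],
   so it never sees a terminator even if one is stored at buf[len]. */
size_t count_byte(const char *buf, size_t len, char wanted)
{
    size_t hits = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == wanted)
            hits++;
    return hits;
}

/* Turn an explicit-length string into a null-terminated one by storing
   a 0 right after it. Only possible if the buffer is writable and has
   at least one spare byte; returns -1 when there is no room. */
int terminate_in_place(char *buf, size_t len, size_t capacity)
{
    if (len >= capacity)
        return -1;
    buf[len] = '\0';
    return 0;
}
```

So a null-terminated string can always be handed to `count_byte` as pointer plus `strlen` result, while going the other way needs either spare writable space or a copy.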

In Linux, many system calls take strings as C-style implicit-length null-terminated strings. (i.e. just a char* without passing a length).

For example, `open(2)` takes a string for the path: `int open(const char *pathname, int flags);`. You must pass a null-terminated string to the system call. It's impossible to create a file with a name that includes a `'\0'` in Linux (same as most other Unix systems), because all the system calls for dealing with files use null-terminated strings.

OTOH, `write(2)` takes a memory buffer which isn't necessarily a string. It has the signature `ssize_t write(int fd, const void *buf, size_t count);`. It doesn't care if there's a 0 at `buf+count` because it only looks at the bytes from `buf` to `buf+count-1`.

You can pass a string to write(). It doesn't care. It's basically just a memcpy into the kernel's pagecache (or into a pipe buffer or whatever for non-regular files). But like I said, you can't pass an arbitrary non-terminated buffer as the path arg to open().
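A small C sketch of that contrast, assuming a POSIX system. The wrapper names `write_buffer` and `open_path` and the `greeting` array are illustrative only; the underlying `write` and `open` calls are the real ones.

```c
#include <fcntl.h>
#include <unistd.h>

/* 5 bytes with no terminator: fine for write(2), unusable as a path. */
const char greeting[5] = {'h', 'e', 'l', 'l', 'o'};

/* write(2) is given pointer + count, so it never reads buf[count]. */
ssize_t write_buffer(int fd, const void *buf, size_t count)
{
    return write(fd, buf, count);
}

/* open(2) is given only a pointer. The kernel reads the path up to the
   first 0 byte, so `path` must be a null-terminated string. */
int open_path(const char *path)
{
    return open(path, O_RDONLY);
}
```

`write_buffer(1, greeting, sizeof greeting)` is well defined, whereas `open_path(greeting)` would let the kernel keep reading past the array while looking for a terminator.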

Or are they not equivalent to each other?

Implicit-length and explicit-length are the two major ways of keeping track of string data/constants in memory and passing them around. They solve the same problem, but in opposite ways.

Long implicit-length strings are a bad choice if you sometimes need to find their length before walking through them. Looping through a string is a lot slower than just reading an integer. Finding the length of an implicit-length string is O(n), but for an explicit-length string it is of course O(1) (it's already known!). At least the length in bytes is known; the length in Unicode characters might not be, if the string is in a variable-length encoding like UTF-8 or UTF-16.
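To make the O(1) vs. O(n) point concrete, here is a hedged C sketch. The `str_view` type and both function names are invented for this example, and counting code points is used as a stand-in for "Unicode characters", assuming the bytes are valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical explicit-length string: pointer plus byte count. */
struct str_view {
    const char *data;
    size_t      len;   /* length in bytes, stored alongside the pointer */
};

/* O(1): the byte length is just an integer that is already there. */
size_t byte_length(struct str_view s)
{
    return s.len;
}

/* O(n): even with a known byte length, counting UTF-8 code points
   means looking at every byte. Continuation bytes have the form
   10xxxxxx, so every byte that is NOT of that form starts a code point. */
size_t utf8_code_points(struct str_view s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++)
        if (((unsigned char)s.data[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the 6-byte UTF-8 encoding of "héllo", `byte_length` reads the stored 6 directly, while `utf8_code_points` has to scan all 6 bytes to arrive at 5.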

"so you can find the end with `strlen`. (e.g. with a slow `rep scasb`)." -- There's no such thing as `rep scasb`, there's only `repe scasb` and `repne scasb` (you meant the latter). — ecm, Oct 10 '20 at 14:21
@ecm: thanks, fixed. Didn't notice that mistake from the original answer while adding new stuff >.< — Peter Cordes, Oct 10 '20 at 14:31

score 2 · Answer 2 · answered Jun 27 '17 at 02:14

How a string is terminated has nothing to do with assembly. Historically, strings have also been terminated with '$' or with CRLF ([13,10], i.e. [0D,0A] in hex), and those two bytes are sometimes reversed, as with GEDIT under Linux. Conventions are determined by how your system is going to interact with itself or other systems. As an example, my applications are strictly oriented around ASCII, so if I read a file that's UTF-8 or UTF-16, my application would fail miserably. NULLs or any other kind of termination could be optional.

Consider this example

Title:  db  'Proto_Sys 1.00.0', 0, 0xc4, 0xdc, 0xdf, 'CPUID', 0, 'A20', 0
        db  'AXCXDXBXSPBPSIDIESDSCSSSFSGS'
Err00:  db  'Retry [Y/N]', 0

I've implemented a routine where if CX=0 then it's assumed a NULL terminated string is to be displayed, otherwise only one character is read and repeated CX times. That is why 0xc4 0xdc 0xdf are not terminated. Similarly, there isn't a terminator before 'Retry [Y/N]' because the way my algo is designed, there doesn't need to be.
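For readers more comfortable with C, the convention described above could look roughly like this. It is only a sketch under assumptions: `count` stands in for CX, and the function name, the output buffer with its `cap` limit, and the returned "next byte" pointer are all hypothetical choices of mine, not the actual routine.

```c
#include <stddef.h>

/* Emit one item from `src` into `out` (at most `cap` bytes).
   count == 0 : treat src as a null-terminated string.
   count != 0 : read a single byte and repeat it `count` times.
   Stores the number of bytes produced in *written and returns a
   pointer to the first source byte after the consumed item. */
const unsigned char *emit_item(const unsigned char *src, unsigned count,
                               char *out, size_t cap, size_t *written)
{
    size_t n = 0;
    if (count == 0) {
        /* String mode: copy bytes up to the 0 terminator. */
        while (*src != 0) {
            if (n < cap)
                out[n++] = (char)*src;
            src++;
        }
        src++;                  /* step over the terminator */
    } else {
        /* Repeat mode: one source byte, emitted `count` times. */
        for (unsigned i = 0; i < count && n < cap; i++)
            out[n++] = (char)*src;
        src++;                  /* exactly one byte was consumed */
    }
    *written = n;
    return src;
}
```

Because repeat mode always consumes exactly one byte, that byte needs no terminator after it, which is the reason given for leaving `0xc4, 0xdc, 0xdf` unterminated in the data.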

The only thing you need to concern yourself with is where your data comes from and whether your application needs to be compatible with something else. Then you simply implement whatever you need to make it work.
