You're right, your code uses 16-bit CX. Much worse, it depends on CL being zero before this snippet executes! An 8-bit loop counter that starts at zero will wrap back to zero 256 decrements (or increments) later.
mov al, 0 ; uint8_t i = 0. xor ax,ax is the same code size but zeros AH
loop: ; do {
dec al
jnz loop ; }while(--i != 0)
Nothing in the question said there needed to be any work inside the loop; this is just an empty delay loop.
efficiency notes: dec ax is smaller than dec al, and loop rel8 is even more compact than dec/jnz. So if you were optimizing for real 8086 or 8088, you'd want to keep the loop body smaller because it runs more times than the code ahead of the loop. Of course if you actually wanted to just delay, this would delay longer since code-fetch would take more memory accesses. Overall code size is the same either way, for mov ax, 256 (3 bytes) vs. xor ax,ax (2 bytes) or mov al, 0 (2 bytes).
This works the same with any 8-bit register; AL isn't special for any of these instructions, so you'd often want to keep it free for stuff that can benefit from its special encodings for stuff like cmp al, imm8 in 2 bytes instead of the usual 3.