`ncplane_putnstr` doesn't fully respect the maximum number of bytes #2754

zhiayang · 2024-02-07T05:58:31Z

Ok, so there's the putNstr family of functions that let you specify the maximum number of bytes, right? The issue is, that limit is not really respected. first, the repro:

#include <unistd.h>
#include <notcurses/notcurses.h>

int main()
{
	auto opts = notcurses_options {
		.loglevel = NCLOGLEVEL_ERROR,
		.flags = NCOPTION_NO_FONT_CHANGES | NCOPTION_SUPPRESS_BANNERS,
	};

	auto nc = notcurses_init(&opts, stdout);
	auto np = notcurses_stdplane(nc);

	// note this clown string with 3 invalid UTF-8 bytes at the end
	const char* asdf = "abcd\xaa\xaa\xaa";

	while(true)
	{
		ncplane_cursor_move_yx(np, 0, 0);

		// but!! I'm only telling it to print 4 bytes -- 
		// so it should never see the 3 garbage bytes!
		ncplane_putnstr(np, 4, asdf);

		notcurses_render(nc);
		usleep(100 * 1000);
	}

	notcurses_stop(nc);
}

The expected behaviour is that I see abcd printed, but what actually happens is that I see abc, putnstr returns -3, and I get an error message about invalid UTF-8.

I managed to trace down the bug:

ncplane_putnstr_yx calls ncplane_putegc_yx -- without telling it how many bytes it can read from the buffer
ncplane_putegc_yx calls utf8_egc_len with the same problem
utf8_egc_len calls mbrtowc, and tells it that it can read up to MB_LEN_MAX bytes from the pointer (which on my system is 6).

Of course, we know that by the time the gcluster pointer is pointing at d, it should really only be allowed to read 1 more byte. This behaviour might actually be platform specific, and it might be macOS's implementation of mbrtowc overzealously reading more bytes even though by right it should stop once it sees a byte <= 0x7F.

The fix should be fairly straightforward, but might touch quite a bit of code, since we'd need
utf8_egc_len to know the maximum number of bytes -- and do something like min(MB_LEN_MAX, n).

tbh if you think that's an OK solution, i'm willing to PR that. For now my fix is to overallocate 1 byte and manually NULL-terminate my buffers so mbrtowc doesn't go looking where it shouldn't. (the garbage bytes in this case were coming from uninitialised memory that i didn't wanna memset because why bother)

edit: this reproduces on linux too, so I guess it's (surprisingly) not apple fucking up.

Please include the following data:

notcurses version 3.0.9
macOS 13.6

COLORTERM truecolor
LANG en_US.UTF-8
TERM xterm-256color
TERMINFO_DIRS :/opt/pacman/usr/share/terminfo
TERM_PROGRAM WezTerm
TERM_PROGRAM_VERSION 20240205-070437-39d2b6ca

The text was updated successfully, but these errors were encountered:

dankamongmen · 2024-02-07T13:11:17Z

So ncplane_putnstr() calls ncplane_putnstr_yx():

static inline int
ncplane_putnstr(struct ncplane* n, size_t s, const char* gclustarr){
  return ncplane_putnstr_yx(n, -1, -1, s, gclustarr);
}

Going to ncplane_putnstr(), we have the following comment:

// Write a series of EGCs to the current location, using the current style.                                                         
// They will be interpreted as a series of columns (according to the definition                                                     
// of ncplane_putc()). Advances the cursor by some positive number of columns                                                       
// (though not beyond the end of the plane); this number is returned on success.                                                    
// On error, a non-positive number is returned, indicating the number of columns                                                    
// which were written before the error. No more than 's' bytes will be written.

See that last line: No more than 's' bytes will be written.

s is a maximum on the output, not the input.

nevermind, i see you already did all of this work, lol. i ought read the report all the way through like less of a dumbass.

ok, i didn't think mbrtowc() would touch anything beyond the next valid utf-8 character assuming it was a valid character. i have a hard time believing that. if that's the case, you essentially always need to know the number of valid bytes in the utf8 string you're handing to mbrtowc(), since it could read past even a well-terminated string. let me go test that out. maybe it guarantees it won't read past a 0 byte, but that makes no sense, since you'd only want to manipulate the end of the buffer if you're (a) doing some kind of parallelism or (b) shaping the memory load.

let me do some testing and lock down exactly what's up. i agree with you: mbrtowc() ought not touch anything once it has seen a char under 0x7f.

btw, mbrtowc() is a core plot point in the novel i just wrote, lol:
midnight-mbrtowc.pdf

dankamongmen · 2024-02-07T13:21:06Z

#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

int main(void){
  if(!setlocale(LC_ALL, "")){
    fprintf(stderr, "no locale\n");
    return EXIT_FAILURE;
  }
  const char u[] = "abcd\xaa\xaa\xaa";
  size_t r = 0;
  mbstate_t m = {0};
  do{
    wchar_t w;
    size_t s = mbrtowc(&w, u + r, MB_CUR_MAX, &m);
    fprintf(stderr, "R: %zu S: %zu\n", r, s);
    if(s == (ssize_t)-1){
      fprintf(stderr, "nope!\n");
      return EXIT_FAILURE;
    }
    r += s;
  }while(1);
  fprintf(stderr, "yep!\n");
  return EXIT_SUCCESS;
}```

```sh
[schwarzgerat](0) $ ./a 
R: 0 S: 1
R: 1 S: 1
R: 2 S: 1
R: 3 S: 1
R: 4 S: 18446744073709551615
nope!
[schwarzgerat](1) $

so that's as expected--we get four valid results

dankamongmen · 2024-02-07T13:24:22Z

i don't think this is a bug in utf8_egc_len(); it smells more like ncplane_putnstr_yx(). let's see.

dankamongmen · 2024-02-07T13:30:50Z

glust: abcd cols: 1 wcs: 1
glust: bcd cols: 1 wcs: 1
glust: cd cols: 1 wcs: 1
utf8_egc_len:107:invalid UTF8:

dankamongmen · 2024-02-07T13:34:32Z

glust: abcd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 0
glust: bcd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 1
glust: cd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 2
utf8_egc_len:107:invalid UTF8: 
post: cols: -1 wcs: 1 offset: 3

dankamongmen · 2024-02-07T13:39:20Z

hrmmm this looks like a bug in utf8_egc_len(). the 'd' ought be taking us out of the loop, but it does not. we then go around and a '0' would toss us, but crap does not. ok good deal, definite bug.

dankamongmen · 2024-02-07T13:39:58Z

heh this could be uncharitably interpreted as a security bug, technically

dankamongmen · 2024-02-07T13:47:47Z

GOT THAT RET: 1
GOT THAT RET: 1
glust: abcd cols: 1 wcs: 1
post: 97 ret: 0 cols: 1 wcs: 1 offset: 0
GOT THAT RET: 1
GOT THAT RET: 1
glust: bcd cols: 1 wcs: 1
post: 98 ret: 1 cols: 1 wcs: 1 offset: 1
GOT THAT RET: 1
GOT THAT RET: 1
glust: cd cols: 1 wcs: 1
post: 99 ret: 2 cols: 1 wcs: 1 offset: 2
GOT THAT RET: 1
utf8_egc_len:107:invalid UTF8: 
post: 100 ret: 3 cols: -1 wcs: 1 offset: 3

dankamongmen · 2024-02-07T13:48:18Z

yep. we're doing an extra cycle through the utf8_egc_len() main loop on each call.

well, this might be a perf win, too, nice.

dankamongmen · 2024-02-07T14:00:09Z

ok, i see what's going on now.

utf8_egc_len() returns an EGC, not a unicode character. an EGC can be composed of multiple unicodes.

let's say you had X̄ at the end. that's:

u[schwarzgerat](0) $ unicode X̄
U+0058 LATIN CAPITAL LETTER X
UTF-8: 58 UTF-16BE: 0058 Decimal: &#88; Octal: \0130
X (x)
Lowercase: 0078
Category: Lu (Letter, Uppercase); East Asian width: Na (narrow)
Unicode block: 0000..007F; Basic Latin
Bidi: L (Left-to-Right)


U+0304 COMBINING MACRON
UTF-8: cc 84 UTF-16BE: 0304 Decimal: &#772; Octal: \01404
 ̄
Category: Mn (Mark, Non-Spacing); East Asian width: A (ambiguous)
Unicode block: 0300..036F; Combining Diacritical Marks
Bidi: NSM (Non-Spacing Mark)

Combining: 230 (Above)

[schwarzgerat](0) $

here, we absolutely want to read six bytes to output four (for "abcX̄"). so you're calling with invalid input. since s bounds the output size, it doesn't bind the input size, and you're passing in garbage before we have four full EGCs.

yeah, this is not a notcurses bug.

now, whether some function which binds input size is necessary is a valid question...

zhiayang · 2024-02-07T15:25:25Z

ahhh it's another user error on my part, i feel dumb now LOL SORRY ><

in my defence it didn't make sense to want to limit the number of bytes output -- maybe egcs or columns but not bytes, at least to me. I was looking through the putstr functions to see if there was something that took a length, so I probably skimmed through the comment too quickly when i saw something that had a size_t parameter x_x

i do feel like it would be useful to have such a function tho (one that bounds the number of bytes read)... thoughts?

dankamongmen · 2024-02-07T16:01:15Z

yeah i'm not sure this is the most useful bound either. surely we want something that limits the number of columns, assuming we have that?

dankamongmen · 2024-02-07T16:02:40Z

i do feel like it would be useful to have such a function tho (one that bounds the number of bytes read)... thoughts?

hrmm maybe. we start getting into a complex landscape of features, though (this is probably just as valid as my comment above, though)

zhiayang · 2024-02-07T16:05:18Z

i feel like if i'm doing any sort of string manipulation i tend to go for string_view like types/interfaces, which kinda basically need that size field. much better than making null terminated strings all the time imo

neurocyte · 2024-02-08T07:32:53Z

I've run into this issue a couple of times.

I'm working on a text editor that uses hybrid rope/piece-table like structures a lot which may reference small substrings anywhere in a large body of text. Reads with no explicit bounds are obviously problematic in this sort of situation as anything could follow the desired bytes in memory. Adding terminators requires rearranging the entire memory layout and voids most of the advantages of the design.

So, because of this fairly subtle detail, I'm forced to copy pretty much anything used in a notcurses call to a temporary, null terminated, buffer. It's not a big deal, but seems like a rather inflexible API.

dankamongmen · 2024-02-08T19:13:21Z

thanks for the feedback, you two. i think strndup(), strncpy(), and friends have shown the value of limiting the amount of source data. furthermore, i agree that the current API does not provide much use.

i think i'll go ahead and create a bug to add this functionality. i appreciate the comments from real world users!

zhiayang · 2024-02-08T22:35:05Z

i think i'll go ahead and create a bug to add this functionality. i appreciate the comments from real world users!

if you have any guidelines for which parts of the API to change (or which parts not to change), I'd be willing to make a PR to add this!

dankamongmen · 2024-02-09T06:07:00Z

if you have any guidelines for which parts of the API to change (or which parts not to change), I'd be willing to make a PR to add this!

i've assigned #2756 to myself, but if you send in a PR, i'd be delighted to see you do it. run any naming by me, and see my notes there about language wrappers etc.

zhiayang added the bug Something isn't working label Feb 7, 2024

dankamongmen self-assigned this Feb 7, 2024

dankamongmen added this to the 3.1.0 milestone Feb 7, 2024

dankamongmen mentioned this issue Feb 7, 2024

amata_next diagnostics on startup #2755

Open

dankamongmen added the nonfix closed without a successful fix (invalid, wontfix) label Feb 7, 2024

dankamongmen closed this as completed Feb 7, 2024

dankamongmen mentioned this issue Feb 9, 2024

input byte-limited versions of output functions #2756

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ncplane_putnstr` doesn't fully respect the maximum number of bytes #2754

`ncplane_putnstr` doesn't fully respect the maximum number of bytes #2754

zhiayang commented Feb 7, 2024 •

edited

Loading

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

zhiayang commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

zhiayang commented Feb 7, 2024

neurocyte commented Feb 8, 2024

dankamongmen commented Feb 8, 2024

zhiayang commented Feb 8, 2024

dankamongmen commented Feb 9, 2024

ncplane_putnstr doesn't fully respect the maximum number of bytes #2754

ncplane_putnstr doesn't fully respect the maximum number of bytes #2754

Comments

zhiayang commented Feb 7, 2024 • edited Loading

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

zhiayang commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

dankamongmen commented Feb 7, 2024

zhiayang commented Feb 7, 2024

neurocyte commented Feb 8, 2024

dankamongmen commented Feb 8, 2024

zhiayang commented Feb 8, 2024

dankamongmen commented Feb 9, 2024

`ncplane_putnstr` doesn't fully respect the maximum number of bytes #2754

`ncplane_putnstr` doesn't fully respect the maximum number of bytes #2754

zhiayang commented Feb 7, 2024 •

edited

Loading