Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ncplane_putnstr doesn't fully respect the maximum number of bytes #2754

Closed
zhiayang opened this issue Feb 7, 2024 · 18 comments
Closed

ncplane_putnstr doesn't fully respect the maximum number of bytes #2754

zhiayang opened this issue Feb 7, 2024 · 18 comments
Assignees
Labels
bug Something isn't working nonfix closed without a successful fix (invalid, wontfix)
Milestone

Comments

@zhiayang
Copy link

zhiayang commented Feb 7, 2024

Ok, so there's the putNstr family of functions that let you specify the maximum number of bytes, right? The issue is, that limit is not really respected. first, the repro:

#include <unistd.h>
#include <notcurses/notcurses.h>

int main()
{
	auto opts = notcurses_options {
		.loglevel = NCLOGLEVEL_ERROR,
		.flags = NCOPTION_NO_FONT_CHANGES | NCOPTION_SUPPRESS_BANNERS,
	};

	auto nc = notcurses_init(&opts, stdout);
	auto np = notcurses_stdplane(nc);

	// note this clown string with 3 invalid UTF-8 bytes at the end
	const char* asdf = "abcd\xaa\xaa\xaa";

	while(true)
	{
		ncplane_cursor_move_yx(np, 0, 0);

		// but!! I'm only telling it to print 4 bytes -- 
		// so it should never see the 3 garbage bytes!
		ncplane_putnstr(np, 4, asdf);

		notcurses_render(nc);
		usleep(100 * 1000);
	}

	notcurses_stop(nc);
}

The expected behaviour is that I see abcd printed, but what actually happens is that I see abc, putnstr returns -3, and I get an error message about invalid UTF-8.

I managed to trace down the bug:

  1. ncplane_putnstr_yx calls ncplane_putegc_yx -- without telling it how many bytes it can read from the buffer
  2. ncplane_putegc_yx calls utf8_egc_len with the same problem
  3. utf8_egc_len calls mbrtowc, and tells it that it can read up to MB_LEN_MAX bytes from the pointer (which on my system is 6).

Of course, we know that by the time the gcluster pointer is pointing at d, it should really only be allowed to read 1 more byte. This behaviour might actually be platform specific, and it might be macOS's implementation of mbrtowc overzealously reading more bytes even though by right it should stop once it sees a byte <= 0x7F.

The fix should be fairly straightforward, but might touch quite a bit of code, since we'd need
utf8_egc_len to know the maximum number of bytes -- and do something like min(MB_LEN_MAX, n).

tbh if you think that's an OK solution, i'm willing to PR that. For now my fix is to overallocate 1 byte and manually NULL-terminate my buffers so mbrtowc doesn't go looking where it shouldn't. (the garbage bytes in this case were coming from uninitialised memory that i didn't wanna memset because why bother)

edit: this reproduces on linux too, so I guess it's (surprisingly) not apple fucking up.

Please include the following data:

  • notcurses version 3.0.9
  • macOS 13.6
COLORTERM truecolor
LANG en_US.UTF-8
TERM xterm-256color
TERMINFO_DIRS :/opt/pacman/usr/share/terminfo
TERM_PROGRAM WezTerm
TERM_PROGRAM_VERSION 20240205-070437-39d2b6ca
@zhiayang zhiayang added the bug Something isn't working label Feb 7, 2024
@dankamongmen
Copy link
Owner

So ncplane_putnstr() calls ncplane_putnstr_yx():

static inline int
ncplane_putnstr(struct ncplane* n, size_t s, const char* gclustarr){
  return ncplane_putnstr_yx(n, -1, -1, s, gclustarr);
}

Going to ncplane_putnstr(), we have the following comment:

// Write a series of EGCs to the current location, using the current style.                                                         
// They will be interpreted as a series of columns (according to the definition                                                     
// of ncplane_putc()). Advances the cursor by some positive number of columns                                                       
// (though not beyond the end of the plane); this number is returned on success.                                                    
// On error, a non-positive number is returned, indicating the number of columns                                                    
// which were written before the error. No more than 's' bytes will be written.

See that last line: No more than 's' bytes will be written.

s is a maximum on the output, not the input.

nevermind, i see you already did all of this work, lol. i ought read the report all the way through like less of a dumbass.

ok, i didn't think mbrtowc() would touch anything beyond the next valid utf-8 character assuming it was a valid character. i have a hard time believing that. if that's the case, you essentially always need to know the number of valid bytes in the utf8 string you're handing to mbrtowc(), since it could read past even a well-terminated string. let me go test that out. maybe it guarantees it won't read past a 0 byte, but that makes no sense, since you'd only want to manipulate the end of the buffer if you're (a) doing some kind of parallelism or (b) shaping the memory load.

let me do some testing and lock down exactly what's up. i agree with you: mbrtowc() ought not touch anything once it has seen a char under 0x7f.

btw, mbrtowc() is a core plot point in the novel i just wrote, lol:
midnight-mbrtowc.pdf

@dankamongmen dankamongmen self-assigned this Feb 7, 2024
@dankamongmen dankamongmen added this to the 3.1.0 milestone Feb 7, 2024
@dankamongmen
Copy link
Owner

#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

int main(void){
  if(!setlocale(LC_ALL, "")){
    fprintf(stderr, "no locale\n");
    return EXIT_FAILURE;
  }
  const char u[] = "abcd\xaa\xaa\xaa";
  size_t r = 0;
  mbstate_t m = {0};
  do{
    wchar_t w;
    size_t s = mbrtowc(&w, u + r, MB_CUR_MAX, &m);
    fprintf(stderr, "R: %zu S: %zu\n", r, s);
    if(s == (ssize_t)-1){
      fprintf(stderr, "nope!\n");
      return EXIT_FAILURE;
    }
    r += s;
  }while(1);
  fprintf(stderr, "yep!\n");
  return EXIT_SUCCESS;
}```

```sh
[schwarzgerat](0) $ ./a 
R: 0 S: 1
R: 1 S: 1
R: 2 S: 1
R: 3 S: 1
R: 4 S: 18446744073709551615
nope!
[schwarzgerat](1) $

so that's as expected--we get four valid results

@dankamongmen
Copy link
Owner

i don't think this is a bug in utf8_egc_len(); it smells more like ncplane_putnstr_yx(). let's see.

@dankamongmen
Copy link
Owner

glust: abcd cols: 1 wcs: 1
glust: bcd cols: 1 wcs: 1
glust: cd cols: 1 wcs: 1
utf8_egc_len:107:invalid UTF8: 

@dankamongmen
Copy link
Owner

glust: abcd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 0
glust: bcd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 1
glust: cd cols: 1 wcs: 1
post: cols: 1 wcs: 1 offset: 2
utf8_egc_len:107:invalid UTF8: 
post: cols: -1 wcs: 1 offset: 3

@dankamongmen
Copy link
Owner

hrmmm this looks like a bug in utf8_egc_len(). the 'd' ought be taking us out of the loop, but it does not. we then go around and a '0' would toss us, but crap does not. ok good deal, definite bug.

@dankamongmen
Copy link
Owner

heh this could be uncharitably interpreted as a security bug, technically

@dankamongmen
Copy link
Owner

GOT THAT RET: 1
GOT THAT RET: 1
glust: abcd cols: 1 wcs: 1
post: 97 ret: 0 cols: 1 wcs: 1 offset: 0
GOT THAT RET: 1
GOT THAT RET: 1
glust: bcd cols: 1 wcs: 1
post: 98 ret: 1 cols: 1 wcs: 1 offset: 1
GOT THAT RET: 1
GOT THAT RET: 1
glust: cd cols: 1 wcs: 1
post: 99 ret: 2 cols: 1 wcs: 1 offset: 2
GOT THAT RET: 1
utf8_egc_len:107:invalid UTF8: 
post: 100 ret: 3 cols: -1 wcs: 1 offset: 3

@dankamongmen
Copy link
Owner

yep. we're doing an extra cycle through the utf8_egc_len() main loop on each call.

well, this might be a perf win, too, nice.

@dankamongmen dankamongmen added the nonfix closed without a successful fix (invalid, wontfix) label Feb 7, 2024
@dankamongmen
Copy link
Owner

ok, i see what's going on now.

utf8_egc_len() returns an EGC, not a unicode character. an EGC can be composed of multiple unicodes.

let's say you had at the end. that's:

u[schwarzgerat](0) $ unicode X̄
U+0058 LATIN CAPITAL LETTER X
UTF-8: 58 UTF-16BE: 0058 Decimal: &#88; Octal: \0130
X (x)
Lowercase: 0078
Category: Lu (Letter, Uppercase); East Asian width: Na (narrow)
Unicode block: 0000..007F; Basic Latin
Bidi: L (Left-to-Right)


U+0304 COMBINING MACRON
UTF-8: cc 84 UTF-16BE: 0304 Decimal: &#772; Octal: \01404
 ̄
Category: Mn (Mark, Non-Spacing); East Asian width: A (ambiguous)
Unicode block: 0300..036F; Combining Diacritical Marks
Bidi: NSM (Non-Spacing Mark)

Combining: 230 (Above)

[schwarzgerat](0) $ 

here, we absolutely want to read six bytes to output four (for "abcX̄"). so you're calling with invalid input. since s bounds the output size, it doesn't bind the input size, and you're passing in garbage before we have four full EGCs.

yeah, this is not a notcurses bug.

now, whether some function which binds input size is necessary is a valid question...

@zhiayang
Copy link
Author

zhiayang commented Feb 7, 2024

ahhh it's another user error on my part, i feel dumb now LOL SORRY ><

in my defence it didn't make sense to want to limit the number of bytes output -- maybe egcs or columns but not bytes, at least to me. I was looking through the putstr functions to see if there was something that took a length, so I probably skimmed through the comment too quickly when i saw something that had a size_t parameter x_x

i do feel like it would be useful to have such a function tho (one that bounds the number of bytes read)... thoughts?

@dankamongmen
Copy link
Owner

yeah i'm not sure this is the most useful bound either. surely we want something that limits the number of columns, assuming we have that?

@dankamongmen
Copy link
Owner

i do feel like it would be useful to have such a function tho (one that bounds the number of bytes read)... thoughts?

hrmm maybe. we start getting into a complex landscape of features, though (this is probably just as valid as my comment above, though)

@zhiayang
Copy link
Author

zhiayang commented Feb 7, 2024

i feel like if i'm doing any sort of string manipulation i tend to go for string_view like types/interfaces, which kinda basically need that size field. much better than making null terminated strings all the time imo

@neurocyte
Copy link

I've run into this issue a couple of times.

I'm working on a text editor that uses hybrid rope/piece-table like structures a lot which may reference small substrings anywhere in a large body of text. Reads with no explicit bounds are obviously problematic in this sort of situation as anything could follow the desired bytes in memory. Adding terminators requires rearranging the entire memory layout and voids most of the advantages of the design.

So, because of this fairly subtle detail, I'm forced to copy pretty much anything used in a notcurses call to a temporary, null terminated, buffer. It's not a big deal, but seems like a rather inflexible API.

@dankamongmen
Copy link
Owner

thanks for the feedback, you two. i think strndup(), strncpy(), and friends have shown the value of limiting the amount of source data. furthermore, i agree that the current API does not provide much use.

i think i'll go ahead and create a bug to add this functionality. i appreciate the comments from real world users!

@zhiayang
Copy link
Author

zhiayang commented Feb 8, 2024

i think i'll go ahead and create a bug to add this functionality. i appreciate the comments from real world users!

if you have any guidelines for which parts of the API to change (or which parts not to change), I'd be willing to make a PR to add this!

@dankamongmen
Copy link
Owner

if you have any guidelines for which parts of the API to change (or which parts not to change), I'd be willing to make a PR to add this!

i've assigned #2756 to myself, but if you send in a PR, i'd be delighted to see you do it. run any naming by me, and see my notes there about language wrappers etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working nonfix closed without a successful fix (invalid, wontfix)
Projects
None yet
Development

No branches or pull requests

3 participants