
feat(ansi): add Scanner for splitting ANSI encoded strings. #215

Open · wants to merge 4 commits into base: main

Conversation

pachecot
Contributor

@pachecot pachecot commented Oct 14, 2024

Hi,

I thought this might be helpful for manipulating styled strings without getting into the details of the Parser and all the encodings of styled strings. It could be very useful in other modules, like lipgloss, or for their users.

Here are a few samples of functions rewritten using this:

// Strip removes ANSI escape codes from a string.
func Strip(s string) string {
	var buf bytes.Buffer // buffer for collecting printable characters
	scanner := NewScanner(s)
	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		if tk == TextToken {
			buf.WriteString(txt)
		}
	}
	return buf.String()
}
// Truncate1 truncates a string to a given length, adding a tail to the
// end if the string is longer than the given length.
// This function is aware of ANSI escape codes and will not break them, and
// accounts for wide characters (such as East Asian characters and emojis).
func Truncate1(s string, length int, tail string) string {
	var (
		ignoring  bool
		sb        strings.Builder
		tailWidth = StringWidth(tail)
		width     = 0
	)
	scanner := NewScanner(s)
	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		switch tk {
		case TextToken:
			if ignoring {
				continue
			}
			if txWidth := scanner.Width(); width+txWidth <= length {
				width += txWidth
				sb.WriteString(txt)
				continue
			}
			// text is too big and needs to be truncated
			// read characters until out of space
			// append with tail if there is room
			g := uniseg.NewGraphemes(txt)
			for g.Next() && width+g.Width() <= length-tailWidth {
				sb.WriteString(g.Str())
				width += g.Width()
			}
			if width+tailWidth <= length {
				sb.WriteString(tail)
			}
			ignoring = true

		default:
			sb.WriteString(txt)
		}
	}
	return sb.String()
}

I was trying to do something like transposing a line into a column:

\x1b[31mHi\x1b[0m

to

\x1b[31mH\x1b[0m
\x1b[31mi\x1b[0m

Here is a sample function to do that:

// Transpose breaks a line into individual lines for each rune, preserving
// ANSI escape codes, which are distributed to each new line.
//
// TODO: minimize the ANSI codes. Currently, if there are multiple codes,
// they are just concatenated, which may lead to redundancy in some cases.
func Transpose(s string) string {
	var (
		prefix  strings.Builder
		lines   = make([]strings.Builder, 0, len(s))
		scanner = NewScanner(s, ScanRunes)
	)

	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		switch tk {
		case ControlToken:
			prefix.WriteString(txt)
			for i := range lines {
				lines[i].WriteString(txt)
			}
		case RuneToken:
			n := len(lines)
			lines = append(lines, strings.Builder{})
			lines[n].WriteString(prefix.String())
			lines[n].WriteString(txt)
		}
	}
	var sb strings.Builder
	for i, l := range lines {
		if i > 0 {
			sb.WriteString("\n")
		}
		sb.WriteString(l.String())
	}
	return sb.String()
}

Let me know if you are interested and if you think it still needs some changes.

Thanks.
Tom

@aymanbagabas
Member

Hi @pachecot, this looks cool ngl. I've thought about a similar idea before, but performance was more important, which is what led us to implement both ansi.Parser and ansi.DecodeSequence.

In terms of the PR, I think this adds a lot of code surface to the ansi package on top of what already exists. You can already "scan" an input using ansi.Parser and/or ansi.DecodeSequence. What I would argue for instead is adding a bufio.SplitFunc implementation that wraps ansi.DecodeSequence and, subsequently, using bufio.Scanner to scan for sequences. The ansi package defines a "sequence" as one of the following:

  1. Escape sequences, i.e. the ones that start with CSI, OSC, DCS, ESC, etc., for example SGR style sequences like \x1b[31m
  2. Control codes in the C0 and C1 ranges; this includes things like newlines \n, carriage returns \r, and tabs \t
  3. Grapheme clusters and UTF-8 sequences, which include the ASCII character set, UTF-8 encoded characters, and multi-codepoint characters such as wide emojis and CJK wide characters

The cellbuf package has a good example of using ansi.DecodeSequence that you can check out.

Another thing to keep in mind is that the purpose of the ansi package is to provide a low-level, performant ANSI parser and helpers; a scanner can introduce unnecessary memory allocations and CPU cycles.

Let me know what you think about the bufio.SplitFunc idea and if you're interested in working on something like that 🙂
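The SplitFunc idea above can be sketched with the standard library alone. The splitter below is a hand-rolled stand-in, not the proposed wrapper around ansi.DecodeSequence: it only understands CSI sequences and two-byte ESC sequences (OSC, DCS, and grapheme clustering are out of scope for this sketch), but it shows how a bufio.SplitFunc can emit one sequence or one text run per Scan call:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitANSI is a bufio.SplitFunc that emits either one escape sequence or
// one run of plain text per Scan call. It is a simplified stand-in: only
// CSI sequences (ESC '[' ... final byte in 0x40-0x7E) and two-byte ESC
// sequences are recognized; OSC, DCS, etc. are not handled here.
func splitANSI(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if len(data) == 0 {
		return 0, nil, nil
	}
	if data[0] == 0x1b {
		if len(data) < 2 {
			if atEOF {
				return len(data), data, nil
			}
			return 0, nil, nil // need more data
		}
		if data[1] == '[' { // CSI: advance to the final byte
			for i := 2; i < len(data); i++ {
				if data[i] >= 0x40 && data[i] <= 0x7e {
					return i + 1, data[:i+1], nil
				}
			}
			if atEOF {
				return len(data), data, nil
			}
			return 0, nil, nil
		}
		return 2, data[:2], nil // two-byte ESC sequence
	}
	// Plain text: everything up to the next ESC.
	for i, b := range data {
		if b == 0x1b {
			return i, data[:i], nil
		}
	}
	if atEOF {
		return len(data), data, nil
	}
	return 0, nil, nil
}

func main() {
	sc := bufio.NewScanner(strings.NewReader("\x1b[31mHi\x1b[0m"))
	sc.Split(splitANSI)
	for sc.Scan() {
		fmt.Printf("%q\n", sc.Text())
	}
}
```

A real implementation would delegate the per-token advance/width logic to ansi.DecodeSequence instead of the byte loops above.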

@pachecot
Contributor Author

Hi @aymanbagabas thanks.

Still learning this library, will need to play with it some more.

The bufio.Scanner is for tokens; there is no additional information. How would you distinguish the sequence type (cluster, control, or escape)? I'm not sure how to make that work nicely.

@aymanbagabas
Member

> The bufio.Scanner is for tokens; there is no additional information. How would you distinguish the sequence type (cluster, control, or escape)? I'm not sure how to make that work nicely.

That's a fair point. We can use the ansi.HasCsiPrefix and others to distinguish the sequence/token types. It would be nice to have a bufio.SplitFunc implementation that wraps ansi.DecodeSequence to be used with bufio.Scanner. I think it makes sense to have our own ansi.Scanner type that conforms to a similar interface.

func NewScanner(r io.Reader) *Scanner

// bufio.Scanner scanner interface to conform to
type Scanner interface {
  Bytes() []byte
  Text() string
  Err() error
  Scan() bool
}

// extra functions we need for ansi.Scanner
func (s *Scanner) Width() int // returns the cell width of the last scanned token
func (s *Scanner) Len() int // returns the bytes length of the last scanned token

// bufio.SplitFunc implementation to be used with bufio.Scanner
func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

I'm a little hesitant about introducing a new type just to distinguish the sequence type. I want to keep using primitive types everywhere possible. Let me know if you have a good solution for this.
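As a rough illustration of the prefix-check approach, a caller could classify a scanned token from its leading bytes alone, keeping primitive types throughout. The byte checks below are hand-rolled stand-ins for helpers like ansi.HasCsiPrefix, not the actual implementations:

```go
package main

import "fmt"

// kind classifies a scanned token by its leading bytes, the way a caller
// might use helpers such as ansi.HasCsiPrefix without a dedicated token
// type. These checks are simplified stand-ins for the real helpers.
func kind(tok []byte) string {
	switch {
	case len(tok) == 0:
		return "empty"
	case len(tok) >= 2 && tok[0] == 0x1b && tok[1] == '[':
		return "csi"
	case len(tok) >= 2 && tok[0] == 0x1b && tok[1] == ']':
		return "osc"
	case tok[0] == 0x1b:
		return "esc"
	case tok[0] < 0x20 || tok[0] == 0x7f:
		return "control"
	default:
		return "text"
	}
}

func main() {
	for _, tok := range [][]byte{[]byte("\x1b[31m"), []byte("\n"), []byte("Hi")} {
		fmt.Println(kind(tok))
	}
}
```

This keeps the scanner's surface down to []byte/string plus a handful of predicate calls, at the cost of re-inspecting the token's first bytes at the call site.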

@pachecot
Contributor Author

Removed the TokenType and simplified it some to be more in line with bufio.Scanner:

type Scanner interface {
  Bytes() []byte              // returns last token
  EOF() bool
  Error() error
  IsEscape() bool             // reports whether the last scanned token is a control/escape sequence
  Len() int                   // returns the bytes length of the last scanned token
  Scan() bool
  Split(f SplitFunc)          // set the split function
  Text() string               // returns last token as a string 
  Token() ([]byte, int, bool) // returns token, width, escape (same as s.Bytes(), s.Width(), s.IsEscape()) 
  Width() int                 // returns the cell width of the last scanned token
}

type SplitFunc func(data []byte, width int, atEOF bool) (advance int, token []byte, err error)

I was thinking the width in the split function was needed, but maybe it's not needed anymore without the TokenState.

This still just uses the parser, not ansi.DecodeSequence. Since it currently doesn't inspect the escape sequences, DecodeSequence doesn't seem necessary.

Is there something you would want to do with the escape sequences here? I was looking at this as an ansi.Split function to safely get at the text and separate out the escape codes.

@aymanbagabas
Member

Hey @pachecot, I wonder how the performance of Scanner compares to Parser and DecodeSequence?
