
feat(ansi): add Scanner for splitting ANSI encoded strings. #215

Open · wants to merge 4 commits into base: main

Conversation

pachecot
Contributor

@pachecot pachecot commented Oct 14, 2024

Hi,

I thought this might be helpful for manipulating styled strings without getting into the details of the Parser and all the encodings of styled strings. It could be very useful in other modules, like lipgloss, or for their users.

Here are a few samples of functions rewritten using this:

// Strip removes ANSI escape codes from a string.
func Strip(s string) string {
	var buf bytes.Buffer // buffer for collecting printable characters
	scanner := NewScanner(s)
	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		if tk == TextToken {
			buf.WriteString(txt)
		}
	}
	return buf.String()
}
// Truncate1 truncates a string to a given length, adding a tail to the
// end if the string is longer than the given length.
// This function is aware of ANSI escape codes and will not break them, and
// accounts for wide characters (such as East Asian characters and emojis).
func Truncate1(s string, length int, tail string) string {
	var (
		ignoring  bool
		sb        strings.Builder
		tailWidth = StringWidth(tail)
		width     = 0
	)
	scanner := NewScanner(s)
	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		switch tk {
		case TextToken:
			if ignoring {
				continue
			}
			if txWidth := scanner.Width(); width+txWidth <= length {
				width += txWidth
				sb.WriteString(txt)
				continue
			}
			// text is too big and needs to be truncated
			// read characters until out of space
			// append with tail if there is room
			g := uniseg.NewGraphemes(txt)
			for g.Next() && width+g.Width() <= length-tailWidth {
				sb.WriteString(g.Str())
				width += g.Width()
			}
			if width+tailWidth <= length {
				sb.WriteString(tail)
			}
			ignoring = true

		default:
			sb.WriteString(txt)
		}
	}
	return sb.String()
}

I was trying to do something like transposing a line into a column:

\x1b[31mHi\x1b[0m

to

\x1b[31mH\x1b[0m
\x1b[31mi\x1b[0m

Here is a sample function to do that:

// Transpose breaks a line into individual lines for each rune, preserving
// ANSI escape codes, which are distributed to each new line.
//
// TODO: minimize the ANSI codes. Currently, if there are multiple codes,
// they are just concatenated, which may lead to redundancy in some cases.
func Transpose(s string) string {
	var (
		prefix  strings.Builder
		lines   = make([]strings.Builder, 0, len(s))
		scanner = NewScanner(s, ScanRunes)
	)

	for !scanner.EOF() {
		tk, txt := scanner.Scan()
		switch tk {
		case ControlToken:
			prefix.WriteString(txt)
			for i := range lines {
				lines[i].WriteString(txt)
			}
		case RuneToken:
			n := len(lines)
			lines = append(lines, strings.Builder{})
			lines[n].WriteString(prefix.String())
			lines[n].WriteString(txt)
		}
	}
	var sb strings.Builder
	for i, l := range lines {
		if i > 0 {
			sb.WriteString("\n")
		}
		sb.WriteString(l.String())
	}
	return sb.String()
}

Let me know if you are interested and if you think it still needs some changes.

Thanks.
Tom

@aymanbagabas
Member

Hi @pachecot, this looks cool ngl. I've thought about a similar idea before, but performance was more important, which is what led us to implement both ansi.Parser and ansi.DecodeSequence.

In terms of the PR, I think this adds a lot of code surface to the ansi package on top of what already exists. You can already "scan" an input using ansi.Parser and/or ansi.DecodeSequence. What I would argue for instead is adding a bufio.SplitFunc implementation that wraps ansi.DecodeSequence and, subsequently, using bufio.Scanner to scan for sequences. The ansi package defines a "sequence" as one of the following:

  1. Escape sequences, i.e. the ones that start with CSI, OSC, DCS, ESC, etc., for example SGR style sequences like \x1b[31m
  2. Control codes in the C0 and C1 ranges; this includes things like newlines \n, carriage returns \r, and tabs \t
  3. Grapheme clusters and UTF-8 sequences, which include the ASCII character set, UTF-8 encoded characters, and multi-codepoint characters such as wide emojis and CJK wide characters

The cellbuf package has a good example of using ansi.DecodeSequence that you can check out.

Another thing to keep in mind is that the purpose of the ansi package is to provide a low-level, performant ANSI parser and helpers; a scanner can introduce unnecessary memory allocations and CPU cycles.

Let me know what you think about the bufio.SplitFunc idea and if you're interested in working on something like that 🙂
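The SplitFunc idea above can be sketched with the standard library alone. The splitter below is a hand-rolled stand-in, not the proposed wrapper around ansi.DecodeSequence: it only understands CSI sequences and two-byte ESC sequences (OSC, DCS, and grapheme clustering are out of scope for this sketch), but it shows how a bufio.SplitFunc can emit one sequence or one text run per Scan call:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitANSI is a bufio.SplitFunc that emits either one escape sequence or
// one run of plain text per Scan call. It is a simplified stand-in: only
// CSI sequences (ESC '[' ... final byte in 0x40-0x7E) and two-byte ESC
// sequences are recognized; OSC, DCS, etc. are not handled here.
func splitANSI(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if len(data) == 0 {
		return 0, nil, nil
	}
	if data[0] == 0x1b {
		if len(data) < 2 {
			if atEOF {
				return len(data), data, nil
			}
			return 0, nil, nil // need more data
		}
		if data[1] == '[' { // CSI: advance to the final byte
			for i := 2; i < len(data); i++ {
				if data[i] >= 0x40 && data[i] <= 0x7e {
					return i + 1, data[:i+1], nil
				}
			}
			if atEOF {
				return len(data), data, nil
			}
			return 0, nil, nil
		}
		return 2, data[:2], nil // two-byte ESC sequence
	}
	// Plain text: everything up to the next ESC.
	for i, b := range data {
		if b == 0x1b {
			return i, data[:i], nil
		}
	}
	if atEOF {
		return len(data), data, nil
	}
	return 0, nil, nil
}

func main() {
	sc := bufio.NewScanner(strings.NewReader("\x1b[31mHi\x1b[0m"))
	sc.Split(splitANSI)
	for sc.Scan() {
		fmt.Printf("%q\n", sc.Text())
	}
}
```

A real implementation would delegate the per-token advance/width logic to ansi.DecodeSequence instead of the byte loops above.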

@pachecot
Contributor Author

Hi @aymanbagabas thanks.

Still learning this library, will need to play with it some more.

The bufio.Scanner is for tokens; there is no additional information. How would you distinguish the sequence type (cluster, control, or escape)? I'm not sure how to make that work nicely.

@aymanbagabas
Member

> The bufio.Scanner is for tokens; there is no additional information. How would you distinguish the sequence type (cluster, control, or escape)? I'm not sure how to make that work nicely.

That's a fair point. We can use the ansi.HasCsiPrefix and others to distinguish the sequence/token types. It would be nice to have a bufio.SplitFunc implementation that wraps ansi.DecodeSequence to be used with bufio.Scanner. I think it makes sense to have our own ansi.Scanner type that conforms to a similar interface.

func NewScanner(r io.Reader) *Scanner

// bufio.Scanner scanner interface to conform to
type Scanner interface {
  Bytes() []byte
  Text() string
  Err() error
  Scan() bool
}

// extra functions we need for ansi.Scanner
func (s *Scanner) Width() int // returns the cell width of the last scanned token
func (s *Scanner) Len() int // returns the bytes length of the last scanned token

// bufio.SplitFunc implementation to be used with bufio.Scanner
func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

I'm a little hesitant about introducing a new type just to distinguish the sequence type. I want to keep using primitive types everywhere possible. Let me know if you have a good solution for this.
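As a rough illustration of the prefix-check approach, a caller could classify a scanned token from its leading bytes alone, keeping primitive types throughout. The byte checks below are hand-rolled stand-ins for helpers like ansi.HasCsiPrefix, not the actual implementations:

```go
package main

import "fmt"

// kind classifies a scanned token by its leading bytes, the way a caller
// might use helpers such as ansi.HasCsiPrefix without a dedicated token
// type. These checks are simplified stand-ins for the real helpers.
func kind(tok []byte) string {
	switch {
	case len(tok) == 0:
		return "empty"
	case len(tok) >= 2 && tok[0] == 0x1b && tok[1] == '[':
		return "csi"
	case len(tok) >= 2 && tok[0] == 0x1b && tok[1] == ']':
		return "osc"
	case tok[0] == 0x1b:
		return "esc"
	case tok[0] < 0x20 || tok[0] == 0x7f:
		return "control"
	default:
		return "text"
	}
}

func main() {
	for _, tok := range [][]byte{[]byte("\x1b[31m"), []byte("\n"), []byte("Hi")} {
		fmt.Println(kind(tok))
	}
}
```

This keeps the scanner's surface down to []byte/string plus a handful of predicate calls, at the cost of re-inspecting the token's first bytes at the call site.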

@pachecot
Contributor Author

Removed the TokenType and simplified it some to be more in line with bufio.Scanner:

type Scanner interface {
  Bytes() []byte              // returns last token
  EOF() bool
  Error() error
  IsEscape() bool             // reports whether the last scanned token is a control/escape sequence
  Len() int                   // returns the bytes length of the last scanned token
  Scan() bool
  Split(f SplitFunc)          // set the split function
  Text() string               // returns last token as a string 
  Token() ([]byte, int, bool) // returns token, width, escape (same as s.Bytes(), s.Width(), s.IsEscape()) 
  Width() int                 // returns the cell width of the last scanned token
}

type SplitFunc func(data []byte, width int, atEOF bool) (advance int, token []byte, err error)

I was thinking the width in the split function was needed, but maybe it's not needed anymore without the TokenState.

This still just uses the parser, not ansi.DecodeSequence. Since it currently doesn't inspect the escape sequences, DecodeSequence doesn't seem necessary.

Is there something you would want to do with the escape sequences here? I was looking at this as an ansi.Split function to safely get at the text and separate out the escape codes.

@aymanbagabas
Member

Hey @pachecot, I wonder how the performance of Scanner compares to Parser and DecodeSequence?
