Skip to content

crop text in middle instead of decimate it #809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

wassname
Copy link

@wassname wassname commented Oct 29, 2024

A few problems:

  • for cropping we were discarding every N sentence, this creates garbage text
  • we were calculating the max tokens wrong (or at least the logic didn't make sense to me)

So I've

  • change the cropText for the whole extension to remove the excess text from the middle and replace it with a warning "this text has had X% removed"
  • use \n instead of , for more natural text
  • change the way we count the token limit to help us use all of it

#800

@wassname wassname changed the title crop transcript in middle instead of decimate it crop text in middle instead of decimate it Oct 29, 2024
} else {
maxLength -= 100 + clamp(userConfig.maxResponseTokenLength, 1, maxLength - 1000)
// if we don't have the models exact content limit use the default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe the entire else block should be removed?

@PeterDaveHello PeterDaveHello requested a review from Copilot May 31, 2025 20:18
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the text cropping functionality to remove excess text from the middle rather than discarding every N sentence, while also adjusting token limit calculations and separator characters for a more natural output. Key changes include:

  • Modifying cropText to split text by an expanded set of delimiters and crop from the middle with an appended disclaimer.
  • Updating the token limit logic and default configurations for extended model contexts.
  • Adjusting the YouTube site adapter to use newline as a separator and refining the summary prompt.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/utils/crop-text.mjs Revised crop text function: new cropping algorithm, delimiter update, and token limit adjustments.
src/content-script/site-adapters/youtube/index.mjs Updated subtitle extraction and summary prompt adjustments.
src/config/index.mjs Increased default maxResponseTokenLength to accommodate longer texts.

croppedText = splits.slice(0, middleIndex).join('\n')
if (croppedTokens > 0) {
croppedText += `\n\n**Important disclaimer**, this text is incomplete! ${croppedTokens} or ${
(croppedTokens / totalTokens).toFixed(2) * 100
Copy link
Preview

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider explicitly converting the result of toFixed to a number (e.g., using Number(...)) to avoid any unexpected type coercion issues.

Suggested change
(croppedTokens / totalTokens).toFixed(2) * 100
Number((croppedTokens / totalTokens).toFixed(2)) * 100

Copilot uses AI. Check for mistakes.

let firstHalfTokens = 0
let secondHalfTokens = 0
const halfTargetTokens = Math.floor(cropTargetLength / 2)
let middleIndex = -1
Copy link
Preview

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input text is short and the first pass never exceeds the halfTargetTokens, middleIndex will remain -1, which may lead to unexpected slicing. Consider adding a condition to bypass cropping when the text length is within limits.

Copilot uses AI. Check for mistakes.

@PeterDaveHello
Copy link
Member

/review

Copy link

qodo-merge-pro bot commented Jun 2, 2025

Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

800 - Partially compliant

Compliant requirements:

• Fix the "decimation" approach to text cropping which creates garbage text

Non-compliant requirements:

• Fix the issue where the extension uses subtitles instead of transcript on YouTube
• Ensure the extension can handle uncommon words that are missed in subtitles

Requires further human verification:

• Improve the quality of text used for summarization - The PR changes the cropping method but still uses subtitles, so verification is needed to see if the summary quality has improved despite still using subtitles

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Percentage Calculation

The percentage calculation for the removed tokens might be incorrect. The expression (croppedTokens / totalTokens).toFixed(2) * 100 first converts to a string with 2 decimal places, then multiplies by 100, which could lead to unexpected results.

croppedText += `\n\n**Important disclaimer**, this text is incomplete! ${croppedTokens} or ${
  (croppedTokens / totalTokens).toFixed(2) * 100
}% of tokens have been removed from this location in the text due to lack limited model context of ${maxLength}\n\n`
Incomplete Implementation

The PR addresses the cropping method but doesn't solve the core issue from the ticket - it still uses subtitles instead of transcript data. The PR improves how text is cropped but doesn't change the data source.

const splits = text.split(/[,,。??!!;;\n]/).map((s) => s.trim())
const splitsLength = splits.map((s) => (tiktoken ? encode(s).length : s.length))
const cropTargetLength = maxLength - startLength - endLength

let firstHalfTokens = 0
let secondHalfTokens = 0
const halfTargetTokens = Math.floor(cropTargetLength / 2)
let middleIndex = -1
let endStartIndex = splits.length
let totalTokens = splitsLength.reduce((sum, length) => sum + length + 1, 0)
let croppedTokens = 0
let croppedText = ''
let currentLength = 0

// First pass: find the middle
for (let i = 0; i < splits.length; i++) {
  if (firstHalfTokens < halfTargetTokens) {
    firstHalfTokens += splitsLength[i] + 1
  } else {
    middleIndex = i
    break
  }
}

// Second pass: find the start of the end section
for (let i = splits.length - 1; i >= middleIndex; i--) {
  secondHalfTokens += splitsLength[i] + 1
  if (secondHalfTokens >= halfTargetTokens) {
    endStartIndex = i
    break
  }
}

// Calculate cropped tokens
croppedTokens = totalTokens - firstHalfTokens - secondHalfTokens

// Construct the cropped text
croppedText = splits.slice(0, middleIndex).join('\n')
if (croppedTokens > 0) {
  croppedText += `\n\n**Important disclaimer**, this text is incomplete! ${croppedTokens} or ${
    (croppedTokens / totalTokens).toFixed(2) * 100
  }% of tokens have been removed from this location in the text due to lack limited model context of ${maxLength}\n\n`
}
croppedText += splits.slice(endStartIndex).join('\n')

@PeterDaveHello
Copy link
Member

/improve

Copy link

qodo-merge-pro bot commented Jun 2, 2025

Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix percentage calculation and grammar

The disclaimer message has a grammatical error with "due to lack limited model
context". Also, the percentage calculation is incorrect as .toFixed(2) returns a
string, so multiplying by 100 doesn't work as expected.

src/utils/crop-text.mjs [101-103]

 croppedText += `\n\n**Important disclaimer**, this text is incomplete! ${croppedTokens} or ${
-  (croppedTokens / totalTokens).toFixed(2) * 100
-}% of tokens have been removed from this location in the text due to lack limited model context of ${maxLength}\n\n`
+  ((croppedTokens / totalTokens) * 100).toFixed(2)
+}% of tokens have been removed from this location in the text due to limited model context of ${maxLength}\n\n`
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies two real issues: a grammatical error ("lack limited" should be "limited") and a mathematical bug where .toFixed(2) returns a string that shouldn't be multiplied by 100. Both fixes improve code correctness and user experience.

Medium
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants