fix(operator): correct regex escaping in WordCloud operator #4261

Merged
bobbai00 merged 4 commits into apache:main from bobbai00:fix/wordcloud-regex-escaping
Mar 6, 2026

@bobbai00 bobbai00 commented Mar 5, 2026

What changes were proposed in this PR?

Fixed two issues in WordCloudOpDesc.scala:

  1. Regex escaping bug: The pyb refactor in #4189 (feat(backend): introduce python code template builder for creating Python based operators) changed manipulateTable() from s"..." to pyb"""...""", but the regex \\w was not adjusted. In a single-quoted s"..." string, \\ is an escape sequence for one backslash, so \\w produces \w. In a triple-quoted pyb"""...""" string, backslashes are literal, so \\w stays as \\w. The generated Python therefore contained r'\\w', which matches a literal backslash followed by w instead of a word character. This caused all rows to be filtered out, resulting in: "text column does not contain words or contains only nulls." Fixed by changing the pattern to \w.

  2. Duplicate statement: Removed a duplicate Map(...) line in getOutputSchemas.
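
To illustrate the difference between the two patterns once Python has parsed them, here is a minimal sketch using the standard `re` module. The sample rows and the `keep_matching` helper are hypothetical and only approximate the operator's `str.contains` filter; this is not the code the operator generates.

```python
import re

# The two patterns as Python sees them in the generated source.
broken_pattern = r'\\w'  # regex: a literal backslash followed by the letter w
fixed_pattern = r'\w'    # regex: any single word character

# Hypothetical sample rows standing in for the text column.
rows = ["hello world", "42", "!!!", r"path\ware"]


def keep_matching(pattern, values):
    """Return the values in which the pattern matches anywhere."""
    return [v for v in values if re.search(pattern, v)]


# Only the row that happens to contain a backslash followed by "w" survives.
broken_kept = keep_matching(broken_pattern, rows)

# Every row with at least one word character survives; "!!!" is dropped.
fixed_kept = keep_matching(fixed_pattern, rows)

print(broken_kept)
print(fixed_kept)
```

With ordinary text that contains no backslashes, the broken pattern keeps nothing, which is consistent with the "does not contain words" error described above.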

Added unit tests to verify the regex pattern is correct.

Any related issues, documentation, discussions?

Regression introduced by #4189.

How was this PR tested?

Added WordCloudOpDescSpec with tests that verify:

  • manipulateTable() uses r'\w' (not r'\\w')
  • Text column name appears in generated code
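
The shape of these checks can be sketched in Python as plain string assertions over generated source. The actual spec is written in Scala; the two source strings, the column name `review_body`, and both helper functions below are hypothetical stand-ins, not the output of manipulateTable().

```python
# Hypothetical generated snippets: one with the fixed pattern, one with the bug.
good_source = r"table = table[table['review_body'].str.contains(r'\w')]"
bad_source = r"table = table[table['review_body'].str.contains(r'\\w')]"


def has_correct_word_regex(source: str) -> bool:
    """True if the source uses r'\\w' with one backslash and not two."""
    return r"r'\w'" in source and r"r'\\w'" not in source


def mentions_column(source: str, column: str) -> bool:
    """True if the column name appears somewhere in the generated source."""
    return column in source


print(has_correct_word_regex(good_source))
print(has_correct_word_regex(bad_source))
print(mentions_column(good_source, "review_body"))
```

Checking for the full `r'\w'` token, including the `r'` prefix, matters here: the shorter fragment `\w'` also appears inside the broken `r'\\w'`, so a looser substring check would pass on the buggy output.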

All tests pass.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

The regex pattern `r'\\w'` in the Scala triple-quoted string produced
`r'\\w'` in Python, which matches a literal backslash + 'w' instead of
word characters. This caused `str.contains` to filter out all rows,
resulting in "text column does not contain words or contains only nulls".

Also removed a duplicate Map statement in getOutputSchemas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bobbai00 bobbai00 changed the title from "fix: correct regex escaping in WordCloud operator" to "fix(operator): correct regex escaping in WordCloud operator" Mar 5, 2026
@bobbai00 bobbai00 self-assigned this Mar 5, 2026
@bobbai00 bobbai00 requested a review from aglinxinyuan March 5, 2026 16:50
@bobbai00 bobbai00 added the fix label Mar 5, 2026
Verify that manipulateTable() uses r'\w' (word character match)
instead of r'\\w' (literal backslash+w match).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the common label Mar 5, 2026
@chenlica chenlica left a comment

LGTM now (after adding the test case).

@bobbai00 bobbai00 merged commit ac909a0 into apache:main Mar 6, 2026
15 of 17 checks passed