Improve support for copy-paste from Microsoft Word#5595
Improve support for copy-paste from Microsoft Word#5595AllanOXDi merged 3 commits intolearningequality:unstablefrom
Conversation
nucleogenesis
left a comment
There was a problem hiding this comment.
I'll spin it up tomorrow but overall the code looks great overall. Left a suggestion and a couple of questions in the meantime.
| } | ||
| } catch (err) { | ||
| editor.value.chain().focus().insertContent(clipboardAccessFailed$()).run(); | ||
| return handlePasteNoFormat(); |
There was a problem hiding this comment.
Is this call here necessary if the one 2 lines down is outside of the if block altogether (or vice-versa?)
There was a problem hiding this comment.
Good catch! The inner return handlePasteNoFormat() was redundant, as the outer fallback already covers both cases. Thanks
| return `$$${latex || ''}$$`; | ||
| }; | ||
|
|
||
| export function sanitizePastedHTML(html) { |
There was a problem hiding this comment.
If this is specific to MS word then maybe putting that in the name is a good idea a la sanitizeMSWordHTML or something?
There was a problem hiding this comment.
Yes, the sanitizer initially targeted MS Word, but we’ve expanded it to handle copy-paste issues from Google Docs and LibreOffice as well (e.g., strike-through bleed, nested list normalization). And since it now applies to all external HTML paste sources, renaming it to sanitizeMSWordHTML would be misleading. Keeping the more general name sanitizePastedHTML better reflects its broader use and sounds good to me.
| const items = list.querySelectorAll(':scope > li'); | ||
| items.forEach(item => { | ||
| const nestedLists = Array.from(item.children).filter( | ||
| child => child.tagName === 'UL' || child.tagName === 'OL', |
There was a problem hiding this comment.
Why are these uppercase? Is that coming from the pasted text?
There was a problem hiding this comment.
The uppercase comes from the DOM API itself. Browsers normalize element.tagName to uppercase for all HTML elements, regardless of how they appear in the pasted HTML. https://developer.mozilla.org/en-US/docs/Web/API/Element/tagName.
https://html.spec.whatwg.org/multipage/dom.html#htmlelement
HTML elements have an uppercase local name.
So I think checking against 'UL' and 'OL' is the correct and standard way to identify list elements in sanitized HTML
|
Looking forward to checking this out on |
There was a problem hiding this comment.
@radinamatic could you give this image a look and share any thoughts you have? Do you think this is worth getting onto unstable? I'm not using Microsoft Word but I am seeing inconsistencies between filetypes and where I open and copy the contents so I'm not sure we could ever really enumerate all of the various formats we might run into.
I opened the various testing files @AllanOXDi shared in the reviewer guidance and made examples of them below in the question & answer boxes of an exercise for demonstration purposes. I put which filetype it was and where I opened it (and then copied it from) as the first line in each box -- then below that is the pasted text from that filetype/program combination.
I did do the sort of "happy paths" where files were opened in places they're most compatible but also mixed and matched things like opened in a LibreOffice file in Google Docs and DOCX in LibreOffice and so forth -- although I'm not sure the acceptance criteria.
nucleogenesis
left a comment
There was a problem hiding this comment.
@AllanOXDi I'm sorry - I saw this and was wondering why it hadn't merged yet but I realize now it's because I failed to come back and approve it after we discussed it on Slack 😓
Summary
This PR improves how the Studio editor handles copy-paste from Microsoft Word so that pasted content more closely matches the formatting that Studio itself supports.
References
#4325
Reviewer guidance
Copy and paste from these sample docs MS Word 2019. LibreOffice Writer v25.8.1.1. Google Doc to exercise editor and verify if: