🚀 Feature request: preprocessing of the source text before translation #86

Closed
3 tasks done
tshu-w opened this issue May 5, 2023 · 14 comments
Assignees
Labels
enhancement New feature or request fixed in next release The issue will be closed once next release is available good first issue Good for newcomers

Comments

@tshu-w

tshu-w commented May 5, 2023

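Please confirm the following first

  • I have read the README carefully
  • I have searched the issues page (including closed issues) and found no similar feature request
  • Easydict has been upgraded to the latest version

Feature description

Preprocessing of the source text before translation. When translating a selection, the source text is often split across several lines, or, in PDFs, words are broken with "-" at line ends. I suggest adding source-text preprocessing options similar to Bob's.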

Screenshot 2023-05-05 at 18 56 53

Use case (optional)

No response

Implementation plan (optional)

No response

@tshu-w tshu-w added the enhancement New feature or request label May 5, 2023
@github-actions

github-actions bot commented May 5, 2023

Hello tshu-w, Thank you for your first issue contribution 🎉

@tisfeng
Owner

tisfeng commented May 5, 2023

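Preprocessing the source text before translation seems like a feature worth having.

What exactly is the scenario? Could you give a few concrete examples?

When translating a selection, the source text is often split across several lines, or, in PDFs, words are broken with "-" at line ends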

@tisfeng tisfeng added the feat label May 5, 2023
@tshu-w
Author

tshu-w commented May 5, 2023

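Preprocessing the source text before translation seems like a feature worth having.

What exactly is the scenario? Could you give a few concrete examples?

When translating a selection, the source text is often split across several lines, or, in PDFs, words are broken with "-" at line ends

Examples:

  1. Plain-text email or any other auto-wrapped layout (line breaks need to be converted to spaces)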

Screenshot 2023-05-05 at 21 12 04

Content after copying:
This paper evaluates the viability of using fixed language models for
training text classification networks on low-end hardware. We combine language
models with a CNN architecture and put together a comprehensive benchmark with
8 datasets covering single-label and multi-label classification of topic,
sentiment, and genre. Our observations are distilled into a list of trade-offs,
concluding that there are scenarios, where not fine-tuning a language model
yields competitive effectiveness at faster training, requiring only a quarter
of the memory compared to fine-tuning.
  2. PDF files (the "- " sequence, a hyphen followed by a space, needs to be removed)

Screenshot 2023-05-05 at 21 14 24

Content after copying:
Methods of machine learning belong to the standard reper- toire of any data analytics endeavour nowadays. However many machine learning algorithms rely on input in the form of dense numerical vectors, which is in stark contrast to the conventional representation of knowledge graphs. To make KGs usable for machine learning tasks Knowledge Graph Embedding approaches are used to encode KG entities (and sometimes relationships) into a lower-dimensional space.
While there are different paradigms of algorithms most embedding approaches score the plausibility of a given tri- ple (h, r, t), i.e. how likely is this statement to be true. The goal of the algorithm is then to compute the embeddings in such a way that positive examples (triples contained in the
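
The two cases above can be sketched as a pair of small text transforms. This is a minimal illustration in Python, not Easydict's or Bob's actual implementation; the function names `join_wrapped_lines` and `remove_pdf_hyphenation` are hypothetical.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Case 1: replace each line break (plus adjacent spaces/tabs) with one space."""
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", text).strip()


def remove_pdf_hyphenation(text: str) -> str:
    """Case 2: rejoin words such as 'reper- toire'.

    Heuristic: drop a hyphen followed by whitespace only when it sits
    between two lowercase ASCII letters.
    """
    return re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", text)


if __name__ == "__main__":
    print(join_wrapped_lines("fixed language models for\ntraining text classification"))
    print(remove_pdf_hyphenation("belong to the standard reper- toire of any"))
```

The hyphen rule is only a heuristic: it would also join a suspended hyphen such as "single- and multi-label" into "singleand", so a real implementation would probably want it to be optional.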

@tisfeng tisfeng added the good first issue Good for newcomers label May 5, 2023
@tisfeng
Owner

tisfeng commented May 5, 2023

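The second one, removing the "- " (hyphen plus space) from PDF text, I understand.

As for the first one: are you capturing the text with OCR, and it fails to handle the line breaks properly? Or is the text copied directly from the email, carrying extra line breaks that need to be handled?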

@tshu-w
Author

tshu-w commented May 6, 2023

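As for the first one: are you capturing the text with OCR, and it fails to handle the line breaks properly? Or is the text copied directly from the email, carrying extra line breaks that need to be handled?

The example above was text copied directly from an email. Also, in Markdown or LaTeX a line break does not start a new paragraph; a blank line is required for that. I also tried OCR on the second screenshot above, and it likewise puts a line break after every line of text.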
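
The Markdown/LaTeX convention described here (a single line break is just whitespace, a blank line separates paragraphs) could be sketched as follows. This is an illustrative Python snippet under that assumption only; `unwrap_paragraphs` is a hypothetical name, not a function from Easydict.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join wrapped lines inside each paragraph, keeping blank-line paragraph breaks."""
    # Normalize Windows line endings so the blank-line split below sees plain "\n".
    text = text.replace("\r\n", "\n").strip()
    # A paragraph boundary is a line break followed by one or more blank lines.
    paragraphs = re.split(r"\n[ \t]*\n+", text)
    # Inside a paragraph, collapse every whitespace run (including single
    # line breaks) into one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = "line one\nline two\n\nnext paragraph\ncontinues here"
    print(unwrap_paragraphs(sample))
```

Compared with replacing every line break, this keeps paragraph boundaries, which matters when the selection spans more than one paragraph.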

@tisfeng
Owner

tisfeng commented May 6, 2023

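I haven't used LaTeX, so I don't quite understand this. In the case below, how would you like it to be handled? Convert the line breaks to spaces?

In Markdown or LaTeX a line break does not start a new paragraph; a blank line is required for that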

@tisfeng
Owner

tisfeng commented May 6, 2023

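The OCR line-break handling in version 1.3.0 is sometimes wrong; I will improve the algorithm step by step.

The latest code can already handle it, and I will publish a new release shortly.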

image

@tshu-w
Author

tshu-w commented May 7, 2023

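I haven't used LaTeX, so I don't quite understand this. In the case below, how would you like it to be handled? Convert the line breaks to spaces?

In Markdown or LaTeX a line break does not start a new paragraph; a blank line is required for that

My current idea is to do the same as Bob: "convert line breaks to spaces" (a more advanced option would be to let users configure the replacement, but that feels a bit too complicated)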
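
For illustration, the "user-configurable replacement" idea mentioned in passing could look like an ordered list of pattern/replacement pairs. This is a hedged Python sketch only; `DEFAULT_RULES` and `preprocess` are hypothetical names and do not describe how Bob or Easydict implement their options.

```python
import re
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # (regular expression, replacement string)

# Hypothetical defaults mirroring the two cases discussed in this thread.
DEFAULT_RULES: Tuple[Rule, ...] = (
    (r"(?<=[a-z])-\s+(?=[a-z])", ""),  # rejoin words hyphenated at a line end
    (r"\s*\n\s*", " "),                # any whitespace run containing a line break -> one space
)


def preprocess(text: str, rules: Iterable[Rule] = DEFAULT_RULES) -> str:
    """Apply each rule in order, then trim leading and trailing whitespace."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


if __name__ == "__main__":
    print(preprocess("the standard reper-\ntoire of any\ndata analytics endeavour"))
    # A user-supplied rule set: delete line breaks instead of inserting spaces.
    print(preprocess("第一行\n第二行", rules=[(r"\n", "")]))
```

Rule order matters in this sketch: the hyphen rule runs first so that a word broken across a line end is rejoined before the remaining line breaks become spaces.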

@tisfeng
Owner

tisfeng commented May 7, 2023

Understood, I will consider this later.

@tisfeng
Owner

tisfeng commented Mar 31, 2024

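Yesterday I ran into a use case for replacing line breaks with spaces: https://www.mail-archive.com/xz-devel@tukaani.org/msg00566.html

The quick-action menu happens to be finished now, so this feature can be scheduled.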

Progress will not happen until there is new maintainer. XZ for C has sparse 
commit log too. Dennis you are better off waiting until new maintainer happens 
or fork yourself. Submitting patches here has no purpose these days. The 
current maintainer lost interest or doesn't care to maintain anymore. It is sad 
to see for a repo like this.
image

@tisfeng tisfeng self-assigned this Mar 31, 2024
@tisfeng tisfeng added the fixed in next release The issue will be closed once next release is available label Mar 31, 2024
@tisfeng
Owner

tisfeng commented May 1, 2024

This feature has been implemented in version 2.7.0.

@tisfeng tisfeng closed this as completed May 1, 2024
@tshu-w
Author

tshu-w commented May 3, 2024

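@tisfeng Hi, I couldn't find where the setting is (the button shown in the screenshot above is also gone in the latest version). Is it enabled by default?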

@tisfeng
Owner

tisfeng commented May 4, 2024

As I recall, the code enables it by default. Take a look at this option on the settings page.

image

@tshu-w
Author

tshu-w commented May 4, 2024

Thanks, it showed up after closing and reopening.
