[problem] macOS version becomes unresponsive when uploading a large batch of txt files at once #41

Closed
F-crystal opened this issue Apr 20, 2024 · 2 comments · Fixed by #43

Comments

F-crystal commented Apr 20, 2024

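Hello, and thank you very much for developing and continuously updating NeoSCA, which makes analyzing English lexical and syntactic complexity much easier. I have run into a small problem while using it: I downloaded the macOS version of the NeoSCA app, but when I upload many txt files at once the app stops responding for a long time. I suspect a memory problem. Could you provide the earlier Python package version so that I can write my own code to process the files? Thanks again for your hard work!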

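F-crystal changed the title from "Multiple file upload not responding" to "macOS version becomes unresponsive when uploading a large batch of txt files at once" Apr 20, 2024
F-crystal changed the title from "macOS version becomes unresponsive when uploading a large batch of txt files at once" to "[problem] macOS version becomes unresponsive when uploading a large batch of txt files at once" Apr 20, 2024
F-crystal reopened this Apr 20, 2024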
tanloong (Owner) commented Apr 20, 2024

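Hello! Thanks for the feedback. My first guess was that NeoSCA uses too much memory while checking file names for conflicts. If a.txt from folder A has already been added and a.txt from folder B is added afterwards, the Name column in the file area at the bottom has to show the second one as "a (2).txt". This check is resource-hungry: each newly added file is compared one by one against every file already added, then joins the list of added files and takes part in checking later additions, so the cost snowballs.

That guess was wrong. The check is not that slow; the slow part was that the file area table automatically adjusted its column widths every time a file was added. I changed it so that the widths are adjusted only once, after all files have been added, and it no longer freezes. You can see this change by running from source on the current dev branch, or in the next release.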
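
The idea behind the change described above can be sketched roughly as follows. This is a toy model and not NeoSCA's actual code: `FileTable`, `_unique_name`, and `_resize_columns` are hypothetical names, and the expensive GUI column adjustment is replaced by a simple counter. It assumes names are made unique with an "a (2).txt" style suffix, as in the comment above.

```python
from pathlib import PurePath


class FileTable:
    """Toy stand-in for the GUI file table (hypothetical, not NeoSCA's class)."""

    def __init__(self) -> None:
        self.names: list[str] = []
        self._seen: set[str] = set()
        self.resize_count = 0

    def _unique_name(self, name: str) -> str:
        # Turn a clashing "a.txt" into "a (2).txt", "a (3).txt", ...
        if name not in self._seen:
            return name
        path = PurePath(name)
        n = 2
        while f"{path.stem} ({n}){path.suffix}" in self._seen:
            n += 1
        return f"{path.stem} ({n}){path.suffix}"

    def _resize_columns(self) -> None:
        # Placeholder for the expensive column-width adjustment.
        self.resize_count += 1

    def add_files(self, paths: list[str]) -> None:
        for p in paths:
            name = self._unique_name(PurePath(p).name)
            self._seen.add(name)
            self.names.append(name)
        # Resize once per batch instead of once per added file.
        self._resize_columns()


table = FileTable()
table.add_files(["A/a.txt", "B/a.txt", "C/a.txt", "A/b.txt"])
print(table.names)
print(table.resize_count)
```

Keeping the names in a set makes each conflict lookup cheap, and moving the resize call out of the loop means its cost no longer grows with the number of files in a batch.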

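To process the files with your own code, you can download the source and call the API directly, which bypasses this check. NeoSCA 0.1.0+ is no longer published to PyPI, so it has to be installed from GitHub with Git.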

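  1. Install Git.

  2. Install Python. NeoSCA 0.1.0+ requires Python 3.10 or later.

  3. Open a terminal (on Windows, press Win+S and search for PowerShell; on macOS, search for Terminal in Launchpad) and run the following command to install NeoSCA from its GitHub repository.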

pip3 install git+https://github.com/tanloong/neosca
  4. Download the Stanza English model. It is saved to the ns_data/stanza_resources folder under the NeoSCA installation path.
import stanza
from neosca import STANZA_MODEL_DIR

stanza.download("en", model_dir=str(STANZA_MODEL_DIR), resources_url="stanfordnlp")
  5. Call NeoSCA from your own program:
import csv
import io

from neosca.ns_io import Ns_IO
from neosca.ns_sca.ns_sca import Ns_SCA
from neosca.ns_lca.ns_lca import Ns_LCA

sca_kwargs = {
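    # All available measures: ["W", "S", "VP", "C", "T", "DC", "CT", "CP", "CN", "MLS", "MLT", "MLC", "C/S", "VP/T", "C/T", "DC/C", "DC/T", "T/S", "CT/T", "CP/T", "CP/C", "CN/T", "CN/C"]
    # If this argument is omitted, all available measures are computed
    "selected_measures": ["MLS", "MLC", "MLT", "C/S"],
    # Cache intermediate files to save time on later runs over the same files; the cache lives in ns_data/cache under the neosca installation path
    "is_cache": True,
    # Whether to use existing cache; when True, a cache file is used if it is non-empty and was last modified later than its input file
    "is_use_cache": True,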
}
sca_analyzer = Ns_SCA(**sca_kwargs)

lca_kwargs = {
    # There is no selected_measures option yet; all available measures are computed
    "wordlist": "bnc",  # or "anc"
    "tagset": "ud",  # or "ptb"
    "is_cache": True,
    "is_use_cache": True,
}
lca_analyzer = Ns_LCA(**lca_kwargs)

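# get_verified_ifile_list collects every file of a type NeoSCA supports (txt/docx/odt) under the given folders, including nested subfolders. Move unrelated files of these types out of the folder first, or they will be analyzed too.
# This function does not check for file name conflicts.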
file_paths = Ns_IO.get_verified_ifile_list(["./files"])

sname_value_map = {}
lname_value_map = {}
with io.StringIO() as sca_output, io.StringIO() as lca_output:
    for file_path in file_paths:
        sca_counter = sca_analyzer.run_on_file_or_subfiles(file_path)
        sname_value_map: dict[str, str] = sca_counter.get_all_values(precision=4)
        sca_values = sname_value_map.values()
        sca_writer = csv.writer(sca_output)
        sca_writer.writerow(sca_values)
        # Save the matches; this clears any existing files in ./sca_matches
        sca_counter.dump_matches("./sca_matches")

        lca_counter = lca_analyzer.run_on_file_or_subfiles(file_path)
        lname_value_map: dict[str, str] = lca_counter.get_all_values(precision=4)
        lca_values = lname_value_map.values()
        lca_writer = csv.writer(lca_output)
        lca_writer.writerow(lca_values)
        # Save the matches; this likewise clears any existing files in ./lca_matches
        lca_counter.dump_matches("./lca_matches")

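    # newline="" avoids blank rows between lines on Windows, since the buffered rows already end in "\r\n"
    with open("./neosca_sca_results.csv", "w", newline="") as f:
        sca_writer = csv.writer(f)
        sca_writer.writerow(sname_value_map.keys())  # column names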
        f.write(sca_output.getvalue())

    with open("./neosca_lca_results.csv", "w", newline="") as f:
        lca_writer = csv.writer(f)
        lca_writer.writerow(lname_value_map.keys())
        f.write(lca_output.getvalue())
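
The loop above writes only the measure values, so a row in the CSV cannot be traced back to its input file. One possible variation, sketched here with only the standard library, adds a file path column. `write_results` and the `Filepath` column name are hypothetical, and the numbers are made-up placeholders; in real use each inner dict would be the mapping returned by `get_all_values(precision=4)` for that file.

```python
import csv
import io


def write_results(rows: dict[str, dict[str, str]], out) -> None:
    """Write one CSV row per file; rows maps file path -> {measure name: value}."""
    # Build the header from the measure names, in first-seen order.
    fieldnames = ["Filepath"]
    for values in rows.values():
        for name in values:
            if name not in fieldnames:
                fieldnames.append(name)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for file_path, values in rows.items():
        writer.writerow({"Filepath": file_path, **values})


# Placeholder values for illustration only, not real NeoSCA output.
rows = {
    "files/a.txt": {"MLS": "12.5000", "MLC": "8.2500"},
    "files/b.txt": {"MLS": "15.0000", "MLC": "9.1000"},
}
buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, open it with `newline=""` as the `csv` module documentation recommends.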

Alternatively, analyze the files from the terminal with NeoSCA's command-line interface.

python3 -m neosca sca ./files
python3 -m neosca lca ./files
# Use --help to see the help message
# python3 -m neosca --help
# python3 -m neosca sca --help
# python3 -m neosca lca --help
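
If the command-line route is preferred but still needs to be driven from a script, the two commands above could be wrapped with `subprocess`. This is a minimal sketch: `build_neosca_command` and `run_neosca` are hypothetical helper names, only the `sca` and `lca` subcommands shown above are used, and `sys.executable` stands in for `python3` on the assumption that NeoSCA is installed for the interpreter running the script.

```python
import subprocess
import sys


def build_neosca_command(analysis: str, input_path: str) -> list[str]:
    """Build the argument list for `python -m neosca <analysis> <input_path>`."""
    if analysis not in ("sca", "lca"):
        raise ValueError(f"unsupported analysis: {analysis!r}")
    return [sys.executable, "-m", "neosca", analysis, input_path]


def run_neosca(analysis: str, input_path: str) -> int:
    """Run the command and return its exit code (requires NeoSCA to be installed)."""
    return subprocess.run(build_neosca_command(analysis, input_path), check=False).returncode


# Only build the commands here; call run_neosca() to actually execute them.
commands = [build_neosca_command(a, "./files") for a in ("sca", "lca")]
for command in commands:
    print(" ".join(command))
```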

F-crystal (Author) commented

Hello, I have installed the latest neosca library and it runs successfully. Thank you very much for your help!

tanloong added a commit that referenced this issue Apr 24, 2024
tanloong added a commit that referenced this issue Apr 24, 2024
tanloong added a commit that referenced this issue Jun 25, 2024
tanloong added a commit that referenced this issue Jun 25, 2024
@tanloong tanloong mentioned this issue Jul 23, 2024
Merged
2 participants