Can MetaPAD work on a Chinese corpus? #1
Unfortunately, this MetaPAD package was developed only for English corpora, though the methodology (please see the paper) works for any kind of corpus. If you look at the sample input data, you will see that the English corpus has been processed by a few important tools: entity recognition, entity typing, and fine-grained typing. If you use a Chinese-language toolkit to process your corpus into the same format, you can then run MetaPAD.
Thank you very much for your reply. I will keep following your paper and try out your method.
I am also very interested in the work in this paper.
@imnujf Sure. You should be able to find tools for entity recognition, typing, and fine-grained typing. Many papers have been published on these three tasks. I'd suggest picking your favorite work among them and either using the open-source tool at the link the authors provide or contacting the authors to ask whether they can share it with you. I obtained the tools I used from the authors of three papers (SIGMOD'15, KDD'16 and WWW'17, as cited in my paper).
I have found the tools from KDD'16 and WWW'17 cited in your paper, but the SIGMOD'15 tool does not seem to be open source. Could you share a link, please? It is too hard for me to reimplement that paper myself.
https://shangjingbo1226.github.io/2018-03-04-autophrase/ You will likely find this a more powerful toolkit.
Thank you very, very much!
We collected tuples from both our method's output and the competing methods' output. Then we manually labeled each tuple as true or false. Five volunteers labeled them, and we took the majority vote as the ground truth (true/false).
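The voting step described above can be sketched as a simple majority vote. This is a minimal illustration in Python, not the script actually used for the evaluation; the function name `majority_vote` and the example tuples and votes are hypothetical.

```python
def majority_vote(votes):
    """Return True if more than half of the annotators voted True.

    `votes` is a list of booleans, one per annotator. With an odd number
    of annotators (five in the setup described above) a tie cannot occur.
    """
    return sum(votes) > len(votes) / 2


# Hypothetical extracted tuples with five annotator votes each
# (illustrative data only, not from the actual evaluation).
annotations = {
    ("Barack Obama", "president", "United States"): [True, True, True, False, True],
    ("Paris", "capital", "Germany"): [False, False, True, False, False],
}

# Ground truth for each tuple is the majority decision.
ground_truth = {tup: majority_vote(v) for tup, v in annotations.items()}

for tup, label in ground_truth.items():
    print(tup, "->", label)
```

Using an odd number of annotators keeps the binary vote free of ties, so no tie-breaking rule is needed in this sketch.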
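Thank you.
How did you implement PATTY? Did you code it yourself, or is there open-source code? I read the paper and could not find any open-source code.
Email the authors :)
@dmsquare I can't reach them; my emails keep bouncing.
How did you implement the PATTY patterns, please? I have been looking for the source code or the patterns for a long time without success.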
Hi,
I am very interested in how MetaPAD works, especially on a Chinese corpus. However, after I switched the input to a Chinese corpus, some files seem to be missing. Could you tell me what I should do to run it on a Chinese corpus? Thank you very much.
When I ran the test from your GitHub repository, I got the results of your test. But when I ran the test after changing the corpus, it gave the following output:
rm -rf bin
mkdir -p bin
g++ -std=c++11 -Wall -O3 -msse2 -fopenmp -I.. -pthread -lm -Wno-unused-result -Wno-sign-compare -Wno-unused-variable -Wno-parentheses -Wno-format -o bin/segphrase_train src/main.cpp
Traceback (most recent call last):
  File "metapad.py", line 1166, in <module>
  File "metapad.py", line 17, in Encrypt
IndexError: list index out of range
Traceback (most recent call last):
  File "metapad.py", line 1166, in <module>
  File "metapad.py", line 17, in Encrypt
IndexError: list index out of range
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 30
Maximum Length Threshold = 20
POS-Tagging Mode Disabled
Discard Ratio = 0.050000
Number of threads = 15
Auto labels from knowledge bases
=======
of total tokens = 6438
of total word tokens = 6438
max word token id = 1564
of documents = 810
of POS tags = 0
The number of sentences = 1
unigrams inserted
of frequent patterns of length-1 = 1566
of frequent patterns of length-2 = 2
of frequent patterns of length-3 = 2
of frequent patterns of length-4 = 2
of frequent patterns of length-5 = 2
of frequent patterns of length-6 = 2
of frequent patterns of length-7 = 2
of frequent patterns of length-8 = 2
of frequent patterns of length-9 = 2
of frequent patterns of length-10 = 2
of frequent patterns of length-11 = 2
of frequent patterns of length-12 = 2
of frequent patterns of length-13 = 2
of frequent patterns of length-14 = 2
of frequent patterns of length-15 = 2
of frequent patterns of length-16 = 2
of frequent patterns of length-17 = 2
of frequent patterns of length-18 = 2
of frequent patterns of length-19 = 2
of frequent patterns of length-20 = 2
of frequent patterns = 1584
total occurrence = 35810
feature extraction done!
=== Generate Labels ===
matched positives = 0
matched negatives = 19
selected positives = 0
selected negatives = 19
Loaded Truth = 19
Recognized Truth = 19
Feature Matrix = 1584 X 14
of threads = 15
Start Classifier Training...
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!
[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 30
Maximum Length Threshold = 20
POS-Tagging Mode Disabled
Discard Ratio = 0.050000
Number of threads = 15
Auto labels from knowledge bases
=======
of total tokens = 6438
of total word tokens = 6438
max word token id = 1564
of documents = 810
of POS tags = 0
The number of sentences = 1
unigrams inserted
of frequent patterns of length-1 = 1566
of frequent patterns of length-2 = 2
of frequent patterns of length-3 = 2
of frequent patterns of length-4 = 2
of frequent patterns of length-5 = 2
of frequent patterns of length-6 = 2
of frequent patterns of length-7 = 2
of frequent patterns of length-8 = 2
of frequent patterns of length-9 = 2
of frequent patterns of length-10 = 2
of frequent patterns of length-11 = 2
of frequent patterns of length-12 = 2
of frequent patterns of length-13 = 2
of frequent patterns of length-14 = 2
of frequent patterns of length-15 = 2
of frequent patterns of length-16 = 2
of frequent patterns of length-17 = 2
of frequent patterns of length-18 = 2
of frequent patterns of length-19 = 2
of frequent patterns of length-20 = 2
of frequent patterns = 1584
total occurrence = 35810
feature extraction done!
=== Generate Labels ===
matched positives = 0
matched negatives = 19
selected positives = 0
selected negatives = 19
Loaded Truth = 19
Recognized Truth = 19
Feature Matrix = 1584 X 14
of threads = 15
Start Classifier Training...
[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree!
[ERROR] empty node in decision tree![ERROR] empty node in decision tree!cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory
Traceback (most recent call last):
  File "metapad.py", line 1236, in <module>
  File "metapad.py", line 1148, in SalientFast
IOError: [Errno 2] No such file or directory: 'output/top-token-phrase.txt'
Traceback (most recent call last):
  File "metapad.py", line 1236, in <module>
  File "metapad.py", line 1148, in SalientFast
IOError: [Errno 2] No such file or directory: 'output/bottom-token-phrase.txt'
Thank you