
Can MetaPAD work on a Chinese corpus? #1

Open
huangxiaohong opened this issue Dec 18, 2017 · 14 comments
@huangxiaohong

Hi,

I am very interested in how MetaPAD works, and in particular in whether it can work on a Chinese corpus. However, some output files seem not to be generated when I switch to a Chinese corpus, so I need your help: what should I do to run MetaPAD on a Chinese corpus? Thank you very much.

When I ran your test from GitHub, I got the expected result. But when I ran it again after changing the corpus, it produced the following output:

rm -rf bin

mkdir -p bin

g++ -std=c++11 -Wall -O3 -msse2 -fopenmp -I.. -pthread -lm -Wno-unused-result -Wno-sign-compare -Wno-unused-variable -Wno-parentheses -Wno-format -o bin/segphrase_train src/main.cpp

Traceback (most recent call last):

File "metapad.py", line 1166, in

Encrypt(file_output_encrypted,file_output_label,file_output_positive,file_output_key,file_input_entitylinking,file_input_goodpattern,file_input_stopwords,10,LEVEL)

File "metapad.py", line 17, in Encrypt

if sentence[n-1][0] == 'PERIOD': sentence = sentence[0:n-1]

IndexError: list index out of range

Traceback (most recent call last):

File "metapad.py", line 1166, in

Encrypt(file_output_encrypted,file_output_label,file_output_positive,file_output_key,file_input_entitylinking,file_input_goodpattern,file_input_stopwords,10,LEVEL)

File "metapad.py", line 17, in Encrypt

if sentence[n-1][0] == 'PERIOD': sentence = sentence[0:n-1]

IndexError: list index out of range

=== Current Settings ===

Iterations = 2

Minimum Support Threshold = 30

Maximum Length Threshold = 20

POS-Tagging Mode Disabled

Discard Ratio = 0.050000

Number of threads = 15

Auto labels from knowledge bases

    Labeling Method = ByLengthByPositive

    Max Positive Samples = 100

    Negative Sampling Ratio = 2

=======

# of total tokens = 6438

# of total word tokens = 6438

max word token id = 1564

# of documents = 810

# of POS tags = 0

The number of sentences = 1

unigrams inserted

# of frequent patterns of length-1 = 1566

# of frequent patterns of length-2 = 2

# of frequent patterns of length-3 = 2

# of frequent patterns of length-4 = 2

# of frequent patterns of length-5 = 2

# of frequent patterns of length-6 = 2

# of frequent patterns of length-7 = 2

# of frequent patterns of length-8 = 2

# of frequent patterns of length-9 = 2

# of frequent patterns of length-10 = 2

# of frequent patterns of length-11 = 2

# of frequent patterns of length-12 = 2

# of frequent patterns of length-13 = 2

# of frequent patterns of length-14 = 2

# of frequent patterns of length-15 = 2

# of frequent patterns of length-16 = 2

# of frequent patterns of length-17 = 2

# of frequent patterns of length-18 = 2

# of frequent patterns of length-19 = 2

# of frequent patterns of length-20 = 2

# of frequent patterns = 1584

total occurrence = 35810

feature extraction done!

=== Generate Labels ===

matched positives = 0

matched negatives = 19

selected positives = 0

selected negatives = 19

Loaded Truth = 19

Recognized Truth = 19

Feature Matrix = 1584 X 14

# of threads = 15

Start Classifier Training...

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory

=== Current Settings ===

Iterations = 2

Minimum Support Threshold = 30

Maximum Length Threshold = 20

POS-Tagging Mode Disabled

Discard Ratio = 0.050000

Number of threads = 15

Auto labels from knowledge bases

    Labeling Method = ByLengthByPositive

    Max Positive Samples = 100

    Negative Sampling Ratio = 2

=======

# of total tokens = 6438

# of total word tokens = 6438

max word token id = 1564

# of documents = 810

# of POS tags = 0

The number of sentences = 1

unigrams inserted

# of frequent patterns of length-1 = 1566

# of frequent patterns of length-2 = 2

# of frequent patterns of length-3 = 2

# of frequent patterns of length-4 = 2

# of frequent patterns of length-5 = 2

# of frequent patterns of length-6 = 2

# of frequent patterns of length-7 = 2

# of frequent patterns of length-8 = 2

# of frequent patterns of length-9 = 2

# of frequent patterns of length-10 = 2

# of frequent patterns of length-11 = 2

# of frequent patterns of length-12 = 2

# of frequent patterns of length-13 = 2

# of frequent patterns of length-14 = 2

# of frequent patterns of length-15 = 2

# of frequent patterns of length-16 = 2

# of frequent patterns of length-17 = 2

# of frequent patterns of length-18 = 2

# of frequent patterns of length-19 = 2

# of frequent patterns of length-20 = 2

# of frequent patterns = 1584

total occurrence = 35810

feature extraction done!

=== Generate Labels ===

matched positives = 0

matched negatives = 19

selected positives = 0

selected negatives = 19

Loaded Truth = 19

Recognized Truth = 19

Feature Matrix = 1584 X 14

# of threads = 15

Start Classifier Training...

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree!cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory

Traceback (most recent call last):

File "metapad.py", line 1236, in

SalientFast(file_output_salient,file_output_key,file_input_phrase,file_input_mapping)

File "metapad.py", line 1148, in SalientFast

fr = open(file_input_phrase,'rb')

IOError: [Errno 2] No such file or directory: 'output/top-token-phrase.txt'

Traceback (most recent call last):

File "metapad.py", line 1236, in

SalientFast(file_output_salient,file_output_key,file_input_phrase,file_input_mapping)

File "metapad.py", line 1148, in SalientFast

fr = open(file_input_phrase,'rb')

IOError: [Errno 2] No such file or directory: 'output/bottom-token-phrase.txt'
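For context on the two IndexError tracebacks above: `sentence[n-1]` fails when `sentence` is an empty list, which is what happens when the preprocessed corpus yields no tokens in the expected format. Here is a minimal sketch of the failing check, with a hypothetical length guard added; the token tuples are made up for illustration and are not from the real corpus:

```python
# Sketch of the period-stripping check from metapad.py's Encrypt step.
# The `n > 0` guard is a hypothetical fix: without it, an empty sentence
# (e.g. from a Chinese corpus not preprocessed into the expected format)
# raises IndexError on sentence[n - 1].
def strip_trailing_period(sentence):
    n = len(sentence)
    if n > 0 and sentence[n - 1][0] == 'PERIOD':
        sentence = sentence[0:n - 1]
    return sentence

print(strip_trailing_period([('Hello', 'NN'), ('PERIOD', '.')]))  # trailing period removed
print(strip_trailing_period([]))  # empty input no longer crashes
```

Note that the guard only avoids the crash; the real fix is producing a properly preprocessed corpus so that sentences are non-empty in the first place.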

Thank you

@mjiang89
Owner

Unfortunately, this MetaPAD package was developed only for English corpora, though the methodology (please look at the paper) works for any kind of corpus. If you look at the sample data input, you will see the English corpus has been processed by a few important tools: entity recognition, entity typing, and fine-grained typing. If you use a Chinese language toolkit to process your corpus into the same format, you can then run MetaPAD.

@huangxiaohong
Author

Thank you very much for your reply. I will continue to follow your paper and try your method.

@imnujf

imnujf commented Mar 26, 2018

I am also very interested in the work in this paper.
How can I get those important tools for entity recognition, entity typing, and fine-grained typing, please?
Are they open source, too?

@mjiang89
Owner

@imnujf Sure. You should be able to find tools for entity recognition, entity typing, and fine-grained typing; tons of papers have been published on these three tasks. I'd suggest finding your favorite work among them and either locating the open-source tool at the link the authors provide, or contacting the authors to see if they can share it with you. The tools I used were obtained from the authors of three papers (SIGMOD'15, KDD'16, and WWW'17, as cited in my paper).

@imnujf

imnujf commented Mar 27, 2018

I have found the tools from KDD'16 and WWW'17 as cited in your paper, but the SIGMOD'15 tool does not seem to be open source. Could you share a link with me, please? As you know, it would be too hard for me to reimplement that paper myself.

@mjiang89
Owner

https://shangjingbo1226.github.io/2018-03-04-autophrase/

Likely you will find a more powerful toolkit.

@imnujf

imnujf commented Mar 28, 2018

Thank you very, very much!

@imnujf

imnujf commented Apr 2, 2018

When I read your paper, I couldn't understand how you labeled the EAV tuples: manually, or by voting among competitive methods?

@mjiang89
Owner

mjiang89 commented Apr 2, 2018

We collected tuples from both our method's output and the competitive methods' output. Then five volunteers manually labeled each tuple as true or false, and we took the majority vote as the ground truth.

@imnujf

imnujf commented Apr 2, 2018

Thank you.

@imnujf

imnujf commented May 3, 2018

How did you implement PATTY's approach? Did you code it yourself, or is there open-source code? I read that paper and couldn't find any open-source code.

@DM2-ND

DM2-ND commented May 3, 2018

Email the authors :)

@imnujf

imnujf commented May 3, 2018

@dmsquare I can't reach them; my emails keep bouncing.

@imnujf

imnujf commented Nov 19, 2018

How did you implement the PATTY patterns, please? I have been unable to find the source code or the patterns for a long time.
