
Can MetaPAD work on a Chinese corpus? #1

Open
huangxiaohong opened this issue Dec 18, 2017 · 14 comments
@huangxiaohong

Hi,

I am very interested in how MetaPAD works, and in particular in whether it can work on a Chinese corpus. However, some output files seem not to be generated when I switch to a Chinese corpus, so I need your help: what should I do to run MetaPAD on a Chinese corpus? Thank you very much.

When I ran your test from GitHub, I got the expected result. But when I ran it again after changing the corpus, it produced the following output:

rm -rf bin

mkdir -p bin

g++ -std=c++11 -Wall -O3 -msse2 -fopenmp -I.. -pthread -lm -Wno-unused-result -Wno-sign-compare -Wno-unused-variable -Wno-parentheses -Wno-format -o bin/segphrase_train src/main.cpp

Traceback (most recent call last):

File "metapad.py", line 1166, in

Encrypt(file_output_encrypted,file_output_label,file_output_positive,file_output_key,file_input_entitylinking,file_input_goodpattern,file_input_stopwords,10,LEVEL)

File "metapad.py", line 17, in Encrypt

if sentence[n-1][0] == 'PERIOD': sentence = sentence[0:n-1]

IndexError: list index out of range

Traceback (most recent call last):

File "metapad.py", line 1166, in

Encrypt(file_output_encrypted,file_output_label,file_output_positive,file_output_key,file_input_entitylinking,file_input_goodpattern,file_input_stopwords,10,LEVEL)

File "metapad.py", line 17, in Encrypt

if sentence[n-1][0] == 'PERIOD': sentence = sentence[0:n-1]

IndexError: list index out of range

=== Current Settings ===

Iterations = 2

Minimum Support Threshold = 30

Maximum Length Threshold = 20

POS-Tagging Mode Disabled

Discard Ratio = 0.050000

Number of threads = 15

Auto labels from knowledge bases

    Labeling Method = ByLengthByPositive

    Max Positive Samples = 100

    Negative Sampling Ratio = 2

=======

# of total tokens = 6438

# of total word tokens = 6438

max word token id = 1564

# of documents = 810

# of POS tags = 0

The number of sentences = 1

unigrams inserted

# of frequent patterns of length-1 = 1566

# of frequent patterns of length-2 = 2

# of frequent patterns of length-3 = 2

# of frequent patterns of length-4 = 2

# of frequent patterns of length-5 = 2

# of frequent patterns of length-6 = 2

# of frequent patterns of length-7 = 2

# of frequent patterns of length-8 = 2

# of frequent patterns of length-9 = 2

# of frequent patterns of length-10 = 2

# of frequent patterns of length-11 = 2

# of frequent patterns of length-12 = 2

# of frequent patterns of length-13 = 2

# of frequent patterns of length-14 = 2

# of frequent patterns of length-15 = 2

# of frequent patterns of length-16 = 2

# of frequent patterns of length-17 = 2

# of frequent patterns of length-18 = 2

# of frequent patterns of length-19 = 2

# of frequent patterns of length-20 = 2

# of frequent patterns = 1584

total occurrence = 35810

feature extraction done!

=== Generate Labels ===

matched positives = 0

matched negatives = 19

selected positives = 0

selected negatives = 19

Loaded Truth = 19

Recognized Truth = 19

Feature Matrix = 1584 X 14

# of threads = 15

Start Classifier Training...

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory

=== Current Settings ===

Iterations = 2

Minimum Support Threshold = 30

Maximum Length Threshold = 20

POS-Tagging Mode Disabled

Discard Ratio = 0.050000

Number of threads = 15

Auto labels from knowledge bases

    Labeling Method = ByLengthByPositive

    Max Positive Samples = 100

    Negative Sampling Ratio = 2

=======

# of total tokens = 6438

# of total word tokens = 6438

max word token id = 1564

# of documents = 810

# of POS tags = 0

The number of sentences = 1

unigrams inserted

# of frequent patterns of length-1 = 1566

# of frequent patterns of length-2 = 2

# of frequent patterns of length-3 = 2

# of frequent patterns of length-4 = 2

# of frequent patterns of length-5 = 2

# of frequent patterns of length-6 = 2

# of frequent patterns of length-7 = 2

# of frequent patterns of length-8 = 2

# of frequent patterns of length-9 = 2

# of frequent patterns of length-10 = 2

# of frequent patterns of length-11 = 2

# of frequent patterns of length-12 = 2

# of frequent patterns of length-13 = 2

# of frequent patterns of length-14 = 2

# of frequent patterns of length-15 = 2

# of frequent patterns of length-16 = 2

# of frequent patterns of length-17 = 2

# of frequent patterns of length-18 = 2

# of frequent patterns of length-19 = 2

# of frequent patterns of length-20 = 2

# of frequent patterns = 1584

total occurrence = 35810

feature extraction done!

=== Generate Labels ===

matched positives = 0

matched negatives = 19

selected positives = 0

selected negatives = 19

Loaded Truth = 19

Recognized Truth = 19

Feature Matrix = 1584 X 14

# of threads = 15

Start Classifier Training...

[ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree![ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree!

[ERROR] empty node in decision tree![ERROR] empty node in decision tree!cp: cannot stat `cseg/tmp/quality_phrases.txt': No such file or directory

Traceback (most recent call last):

File "metapad.py", line 1236, in

SalientFast(file_output_salient,file_output_key,file_input_phrase,file_input_mapping)

File "metapad.py", line 1148, in SalientFast

fr = open(file_input_phrase,'rb')

IOError: [Errno 2] No such file or directory: 'output/top-token-phrase.txt'

Traceback (most recent call last):

File "metapad.py", line 1236, in

SalientFast(file_output_salient,file_output_key,file_input_phrase,file_input_mapping)

File "metapad.py", line 1148, in SalientFast

fr = open(file_input_phrase,'rb')

IOError: [Errno 2] No such file or directory: 'output/bottom-token-phrase.txt'
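For context on the two IndexError tracebacks above: `sentence[n-1]` fails when `sentence` is an empty list, which is what happens when the preprocessed corpus yields no tokens in the expected format. Here is a minimal sketch of the failing check, with a hypothetical length guard added; the token tuples are made up for illustration and are not from the real corpus:

```python
# Sketch of the period-stripping check from metapad.py's Encrypt step.
# The `n > 0` guard is a hypothetical fix: without it, an empty sentence
# (e.g. from a Chinese corpus not preprocessed into the expected format)
# raises IndexError on sentence[n - 1].
def strip_trailing_period(sentence):
    n = len(sentence)
    if n > 0 and sentence[n - 1][0] == 'PERIOD':
        sentence = sentence[0:n - 1]
    return sentence

print(strip_trailing_period([('Hello', 'NN'), ('PERIOD', '.')]))  # trailing period removed
print(strip_trailing_period([]))  # empty input no longer crashes
```

Note that the guard only avoids the crash; the real fix is producing a properly preprocessed corpus so that sentences are non-empty in the first place.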

Thank you

@mjiang89
Owner

Unfortunately, this MetaPAD package was developed only for English corpora, though the methodology (please look at the paper) works for any kind of corpus. If you look at the sample data input, you will see the English corpus has been processed by a few important tools: entity recognition, entity typing, and fine-grained typing. If you use a Chinese language toolkit to process your corpus into the same format, you can then run MetaPAD.

@huangxiaohong
Author

Thank you very much for your reply. I will continue to follow your paper and try your method.

@imnujf

imnujf commented Mar 26, 2018

I am also very interested in the work in this paper.
How can I get those important tools for entity recognition, entity typing, and fine-grained typing, please?
Are they open source, too?

@mjiang89
Owner

@imnujf Sure. You should be able to find tools for entity recognition, entity typing, and fine-grained typing; tons of papers have been published on these three tasks. I'd suggest finding your favorite work among them and either locating the open-source tool at the link the authors provide, or contacting the authors to see if they can share it with you. The tools I used were obtained from the authors of three papers (SIGMOD'15, KDD'16, and WWW'17, as cited in my paper).

@imnujf

imnujf commented Mar 27, 2018

I have found the tools from KDD'16 and WWW'17 as cited in your paper, but the SIGMOD'15 tool does not seem to be open source. Could you share a link with me, please? As you know, it would be too hard for me to reimplement that paper myself.

@mjiang89
Owner

https://shangjingbo1226.github.io/2018-03-04-autophrase/

Likely you will find a more powerful toolkit.

@imnujf

imnujf commented Mar 28, 2018

Thank you very, very much!

@imnujf

imnujf commented Apr 2, 2018

When I read your paper, I couldn't understand how you labeled the EAV tuples: manually, or by voting among competitive methods?

@mjiang89
Owner

mjiang89 commented Apr 2, 2018

We collected tuples from both our method's output and the competitive methods' output. Then five volunteers manually labeled each tuple as true or false, and we took the majority vote as the ground truth.

@imnujf

imnujf commented Apr 2, 2018

Thank you.

@imnujf

imnujf commented May 3, 2018

How did you implement PATTY's approach? Did you code it yourself, or is there open-source code? I read that paper and couldn't find any open-source code.

@DM2-ND

DM2-ND commented May 3, 2018

Email the authors :)

@imnujf

imnujf commented May 3, 2018

@dmsquare I can't reach them; my emails keep bouncing.

@imnujf

imnujf commented Nov 19, 2018

How did you implement the PATTY patterns, please? I have been unable to find the source code or the patterns for a long time.
