LangTagger is a language tagger that uses a probabilistic model for language classification. It comes in two variants: LangTag(S), the simple model, and LangTag(C), the combined model. LangTag(S) is trained only on the QALD 7 training dataset and therefore supports English, German, French, Spanish, Brazilian Portuguese, Dutch, Hindi, Romanian, and Persian. LangTag(C) is trained on all of the QALD 3 to QALD 9 training datasets and therefore supports two additional languages: Portuguese and Russian.
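The README does not spell out LangTagger's model internals. As background for how a probabilistic language classifier of this general kind works, here is a minimal character-trigram naive Bayes sketch; the class and method names are illustrative only and are not LangTagger's API:

```python
import math
from collections import Counter


class NgramLanguageClassifier:
    """Toy probabilistic language classifier over character trigrams.

    Illustrative only: LangTagger's actual features, smoothing, and
    training procedure are not specified in this README.
    """

    def __init__(self, n=3):
        self.n = n
        self.models = {}  # language -> trigram counts
        self.totals = {}  # language -> total trigram count

    def _ngrams(self, text):
        # Pad with spaces so word boundaries become features too.
        text = f" {text.lower()} "
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def train(self, language, texts):
        counts = self.models.setdefault(language, Counter())
        for text in texts:
            counts.update(self._ngrams(text))
        self.totals[language] = sum(counts.values())

    def classify(self, text):
        vocab = len(set().union(*self.models.values()))
        best_lang, best_score = None, float("-inf")
        for lang, counts in self.models.items():
            total = self.totals[lang]
            # Add-one smoothed log-likelihood of the text under this language.
            score = sum(
                math.log((counts[g] + 1) / (total + vocab + 1))
                for g in self._ngrams(text)
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

After training one trigram model per language, the classifier returns the language whose model assigns the highest (smoothed) likelihood to the input text.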
## Benchmarks
To assess the efficiency of the different models and frameworks, we designed three benchmarks that differ in text length and domain: (1) short texts (rdfs:labels), (2) QA questions, and (3) long texts (dbo:abstracts).
Short: The short-text benchmark uses the first 10,000 entity rdfs:labels of each language returned by the DBpedia SPARQL endpoint, where available, excluding resources containing digits. It measures how well the different approaches identify the language of a label. We used English, German, Russian, Italian, Spanish, French, and Portuguese for this test.
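The README does not publish the exact query used to collect these labels; a plausible sketch of such a query, built as a Python string so it can be sent to the public DBpedia endpoint (https://dbpedia.org/sparql), could look like this:

```python
def label_query(lang: str, limit: int = 10000) -> str:
    """Illustrative SPARQL query for entity rdfs:labels in one language,
    excluding labels that contain digits. This is an assumption about the
    benchmark query, not taken from the LangTagger sources."""
    return (
        "SELECT ?label WHERE {\n"
        "  ?entity rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        '  FILTER (!regex(str(?label), "[0-9]"))\n'
        "}\n"
        f"LIMIT {limit}"
    )
```

The `lang()` filter selects labels with the requested language tag, and the `regex` filter drops labels containing digits, matching the exclusion described above.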
QA: The QA benchmark uses all questions in the Question Answering over Linked Data (QALD) datasets in two forms, Keywords (K) and Full Questions (F). It evaluates the efficiency of the different approaches in the Question Answering (QA) domain: the models are assessed on detecting the language of a question containing a knowledge-base resource. The QALD test benchmark covers the following languages:

- QALD 1: English
- QALD 2: English
- QALD 3: English, German, French, Spanish, Italian, and Dutch
Long: The long-text benchmark uses the dbo:abstracts of the top 10,000 resources returned by the DBpedia SPARQL endpoint, where available. It evaluates the different language-identification approaches on long resource texts. We used English, German, Russian, Italian, Spanish, and French for this test.
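The numbers reported in the Evaluation section below are per-benchmark accuracies. A minimal sketch of how such a score can be computed for any detector (function name and interface are illustrative):

```python
def accuracy(detect, samples):
    """Share of (text, gold_language) pairs for which `detect` returns the
    gold language code; `detect` is any callable mapping text -> code."""
    if not samples:
        return 0.0
    return sum(detect(text) == gold for text, gold in samples) / len(samples)
```

Any of the compared approaches (LangTagger, langdetect, Tika, openNLP, langid) can be plugged in as `detect`, which is what makes the per-benchmark numbers directly comparable.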
## Evaluation
Results achieved by the different approaches on all languages of the QALD test benchmark, for Full (F) and Keyword (K) questions:

| Approach | 1 (F) | 2 (F) | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.70 | 0.99 | 0.77 | 0.99 | 0.77 | 1.00 | 0.76 | 0.99 | 0.67 | 0.98 | 0.48 | 1.00 | 0.70 | 0.97 |
| LangTag(C) | 1.00 | 1.00 | 0.86 | 0.99 | 0.90 | 0.99 | 0.92 | 1.00 | 0.81 | 0.99 | 0.93 | 1.00 | 0.70 | 1.00 | 0.84 | 0.97 |
| langdetect | 0.96 | 0.96 | 0.65 | 0.93 | 0.76 | 0.92 | 0.72 | 0.92 | 0.68 | 0.91 | 0.76 | 0.95 | 0.51 | 1.00 | 0.65 | 0.82 |
| Tika | 0.96 | 0.93 | 0.61 | 0.88 | 0.70 | 0.90 | 0.66 | 0.91 | 0.63 | 0.89 | 0.72 | 0.91 | 0.56 | 0.97 | 0.61 | 0.80 |
| openNLP | 0.96 | 0.97 | 0.48 | 0.89 | 0.62 | 0.89 | 0.61 | 0.85 | 0.48 | 0.75 | 0.62 | 0.90 | 0.39 | 0.95 | 0.41 | 0.73 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.96 | 0.76 | 0.95 | 0.76 | 0.94 | 0.75 | 0.93 | 0.83 | 0.97 | 0.56 | 1.00 | 0.81 | 0.95 |
| langdetect(12) | 0.96 | 0.93 | 0.67 | 0.90 | 0.76 | 0.91 | 0.72 | 0.91 | 0.69 | 0.89 | 0.75 | 0.92 | 0.58 | 1.00 | 0.66 | 0.82 |
| langid | 0.98 | 0.94 | 0.62 | 0.93 | 0.72 | 0.94 | 0.64 | 0.95 | 0.68 | 0.91 | 0.64 | 0.93 | 0.65 | 1.00 | 0.64 | 0.82 |
Results achieved by the different approaches on English questions of the QALD test benchmark:

| Approach | 1 (F) | 2 (F) | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.69 | 1.00 | 0.80 | 1.00 | 0.77 | 1.00 | 0.80 | 1.00 | 0.60 | 1.00 | 0.48 | 1.00 | 0.72 | 1.00 |
| LangTag(C) | 1.00 | 1.00 | 0.87 | 1.00 | 0.98 | 1.00 | 0.93 | 1.00 | 0.83 | 1.00 | 0.93 | 1.00 | 0.70 | 1.00 | 0.87 | 1.00 |
| langdetect | 0.96 | 0.96 | 0.53 | 0.96 | 0.68 | 0.94 | 0.67 | 0.94 | 0.70 | 0.95 | 0.65 | 0.93 | 0.51 | 1.00 | 0.68 | 0.92 |
| Tika | 0.96 | 0.93 | 0.51 | 0.97 | 0.68 | 0.92 | 0.61 | 0.91 | 0.65 | 0.94 | 0.67 | 0.96 | 0.56 | 0.95 | 0.64 | 0.93 |
| openNLP | 0.96 | 0.97 | 0.52 | 0.97 | 0.70 | 0.92 | 0.67 | 0.91 | 0.63 | 0.94 | 0.62 | 0.96 | 0.39 | 0.95 | 0.58 | 0.93 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.98 | 0.82 | 0.96 | 0.82 | 0.98 | 0.83 | 0.98 | 1.00 | 0.79 | 1.00 | 0.56 | 0.80 | 0.98 |
| langdetect(12) | 0.98 | 0.93 | 0.55 | 0.93 | 0.72 | 0.90 | 0.69 | 0.98 | 0.72 | 0.96 | 0.74 | 0.88 | 0.56 | 1.00 | 0.66 | 0.93 |
| langid | 0.98 | 0.94 | 0.52 | 0.94 | 0.60 | 0.96 | 0.61 | 0.96 | 0.67 | 0.94 | 0.55 | 0.95 | 0.65 | 1.00 | 0.59 | 0.94 |
Results achieved by the different approaches on German questions of the QALD test benchmark:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.87 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| LangTag(C) | 0.80 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| langdetect | 0.80 | 0.95 | 0.80 | 0.92 | 0.77 | 0.91 | 0.71 | 0.88 | 0.74 | 0.95 | 0.81 | 0.94 |
| Tika | 0.79 | 0.95 | 0.78 | 0.92 | 0.79 | 0.94 | 0.71 | 0.88 | 0.69 | 0.95 | 0.81 | 0.94 |
| openNLP | 0.42 | 0.88 | 0.54 | 0.80 | 0.59 | 0.81 | 0.39 | 0.74 | 0.48 | 0.79 | 0.48 | 0.82 |
| openNLP(12) | 0.68 | 0.92 | 0.70 | 0.84 | 0.77 | 0.85 | 0.54 | 0.80 | 0.76 | 0.93 | 0.72 | 0.92 |
| langdetect(12) | 0.80 | 0.95 | 0.78 | 0.90 | 0.83 | 0.93 | 0.75 | 0.85 | 0.76 | 0.90 | 0.81 | 0.94 |
| langid | 0.70 | 0.93 | 0.82 | 0.94 | 0.75 | 0.95 | 0.71 | 0.92 | 0.67 | 0.90 | 0.78 | 0.94 |
Results achieved by the different approaches on French questions of the QALD test benchmark:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.72 | 0.98 | 0.72 | 1.00 | 0.79 | 1.00 | 0.74 | 0.99 | 0.62 | 0.97 | 0.66 | 0.99 |
| LangTag(C) | 0.88 | 1.00 | 0.84 | 1.00 | 0.93 | 1.00 | 0.78 | 0.99 | 0.90 | 1.00 | 0.80 | 0.99 |
| langdetect | 0.61 | 0.90 | 0.84 | 0.98 | 0.86 | 0.96 | 0.69 | 0.92 | 0.88 | 1.00 | 0.77 | 0.94 |
| Tika | 0.61 | 0.89 | 0.76 | 0.98 | 0.79 | 0.93 | 0.65 | 0.93 | 0.79 | 1.00 | 0.73 | 0.96 |
| openNLP | 0.51 | 0.86 | 0.62 | 0.90 | 0.58 | 0.80 | 0.59 | 0.82 | 0.65 | 0.90 | 0.62 | 0.82 |
| openNLP(12) | 0.70 | 0.94 | 0.76 | 0.96 | 0.68 | 0.96 | 0.68 | 0.90 | 0.93 | 0.97 | 0.78 | 0.91 |
| langdetect(12) | 0.73 | 0.88 | 0.76 | 0.98 | 0.68 | 0.93 | 0.66 | 0.91 | 0.83 | 0.95 | 0.76 | 0.92 |
| langid | 0.75 | 0.96 | 0.88 | 0.96 | 0.75 | 1.00 | 0.78 | 0.94 | 0.79 | 0.97 | 0.84 | 0.92 |
Results achieved by the different approaches on entity rdfs:labels:

| Approach | EN | DE | RU | IT | ES | FR | PT | AVG accuracy | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 83 | 243 | 10,000 | 782 | 227 | | |
| LangTag(S) | 0.21 | 0.91 | - | 0.25 | 0.09 | 0.34 | 0.36 | 0.36 | 0.00162 |
| LangTag(C) | 0.26 | 0.88 | 0.12 | 0.35 | 0.15 | 0.36 | 0.44 | 0.34 | 0.00186 |
| langdetect | 0.40 | 0.43 | 0.57 | 0.63 | 0.31 | 0.59 | 0.43 | 0.48 | 0.01761 |
| Tika | 0.24 | 0.39 | 0.50 | 0.68 | 0.15 | 0.59 | 0.35 | 0.41 | 0.41428 |
| openNLP | 0.16 | 0.18 | 0.12 | 0.30 | 0.15 | 0.33 | 0.25 | 0.21 | 0.01125 |
| openNLP(12) | 0.75 | 0.37 | 0.98 | 0.80 | 0.37 | 0.59 | 0.52 | 0.62 | 0.05361 |
| langdetect(12) | 0.35 | 0.51 | 0.59 | 0.67 | 0.32 | 0.59 | 0.43 | 0.49 | 0.03611 |
| langid | 0.69 | 0.33 | 0.56 | 0.44 | 0.33 | 0.57 | 0.28 | 0.45 | 0.01651 |
Results achieved by the different approaches on abstracts (dbo:abstracts):

| Approach | EN | DE | RU | IT | ES | FR | AVG accuracy | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 285 | 10,000 | 10,000 | 10,000 | | |
| LangTag(S) | 0.96 | 0.99 | - | 0.99 | 0.99 | 0.99 | 0.98 | 0.00267 |
| LangTag(C) | 0.96 | 0.99 | 0.86 | 0.99 | 0.99 | 0.99 | 0.96 | 0.00287 |
| langdetect | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.01657 |
| Tika | 0.95 | 0.99 | 0.95 | 0.99 | 0.98 | 0.99 | 0.97 | 0.43918 |
| openNLP | 0.79 | 0.81 | 0.13 | 0.76 | 0.78 | 0.71 | 0.66 | 0.01427 |
| openNLP(12) | 0.95 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.18625 |
| langdetect(12) | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.02183 |
| langid | 0.96 | 0.97 | 0.94 | 0.99 | 0.98 | 0.99 | 0.97 | 0.03579 |
Average runtime in seconds (s) of the different approaches on the QALD test benchmarks:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.0003 | 0.0003 | 0.0006 | 0.0003 | 0.0006 | 0.0003 | 0.0002 | 0.0002 | 0.0019 | 0.0004 | 0.0041 | 0.0014 | 0.0001 | 0.0002 |
| LangTag(C) | 0.0018 | 0.0012 | 0.0026 | 0.0021 | 0.0029 | 0.0022 | 0.0017 | 0.0011 | 0.0036 | 0.0031 | 0.0131 | 0.0120 | 0.0017 | 0.0012 |
| langdetect | 0.0087 | 0.0063 | 0.0079 | 0.0042 | 0.0072 | 0.0057 | 0.0078 | 0.0054 | 0.0082 | 0.0041 | 0.0092 | 0.0021 | 0.0075 | 0.0116 |
| Tika | 1.5677 | 1.4068 | 1.4021 | 1.4009 | 1.6072 | 1.3928 | 1.5981 | 1.3978 | 1.4379 | 1.3955 | 1.4213 | 1.3778 | 1.9081 | 1.4836 |
| openNLP | 0.0027 | 0.0011 | 0.0036 | 0.0039 | 0.0035 | 0.0030 | 0.0023 | 0.0011 | 0.0058 | 0.0062 | 0.0032 | 0.0026 | 0.0012 | 0.0014 |
Model size in megabytes (MB) and kilobytes (KB) of the different approaches on the QALD test benchmarks: