LangTagger is a language tagger that uses a probabilistic model for language classification. It comes in two variants: LangTag(S), the simple model, and LangTag(C), the combined model. LangTag(S) is trained only on the QALD 7 training dataset and therefore supports English, German, French, Spanish, Brazilian Portuguese, Dutch, Hindi, Romanian, and Persian. LangTag(C) is trained on all of the QALD 3 to QALD 9 training datasets and therefore supports two additional languages: Portuguese and Russian.
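The README does not spell out LangTagger's model internals. As background for how a probabilistic language classifier of this general kind works, here is a minimal character-trigram naive Bayes sketch; the class and method names are illustrative only and are not LangTagger's API:

```python
import math
from collections import Counter


class NgramLanguageClassifier:
    """Toy probabilistic language classifier over character trigrams.

    Illustrative only: LangTagger's actual features, smoothing, and
    training procedure are not specified in this README.
    """

    def __init__(self, n=3):
        self.n = n
        self.models = {}  # language -> trigram counts
        self.totals = {}  # language -> total trigram count

    def _ngrams(self, text):
        # Pad with spaces so word boundaries become features too.
        text = f" {text.lower()} "
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def train(self, language, texts):
        counts = self.models.setdefault(language, Counter())
        for text in texts:
            counts.update(self._ngrams(text))
        self.totals[language] = sum(counts.values())

    def classify(self, text):
        vocab = len(set().union(*self.models.values()))
        best_lang, best_score = None, float("-inf")
        for lang, counts in self.models.items():
            total = self.totals[lang]
            # Add-one smoothed log-likelihood of the text under this language.
            score = sum(
                math.log((counts[g] + 1) / (total + vocab + 1))
                for g in self._ngrams(text)
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

After training one trigram model per language, the classifier returns the language whose model assigns the highest (smoothed) likelihood to the input text.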
## Benchmarks
To assess the efficiency of the different models and frameworks, we designed three benchmarks that differ in text length and domain: (1) short texts (rdfs:labels), (2) QA questions, and (3) long texts (dbo:abstracts).
Short: The short-text benchmark uses the first 10,000 entity rdfs:labels of each language returned by the DBpedia SPARQL endpoint, where available, excluding resources containing digits. It measures how well the different approaches identify the language of a label. We used English, German, Russian, Italian, Spanish, French, and Portuguese for this test.
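The README does not publish the exact query used to collect these labels; a plausible sketch of such a query, built as a Python string so it can be sent to the public DBpedia endpoint (https://dbpedia.org/sparql), could look like this:

```python
def label_query(lang: str, limit: int = 10000) -> str:
    """Illustrative SPARQL query for entity rdfs:labels in one language,
    excluding labels that contain digits. This is an assumption about the
    benchmark query, not taken from the LangTagger sources."""
    return (
        "SELECT ?label WHERE {\n"
        "  ?entity rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        '  FILTER (!regex(str(?label), "[0-9]"))\n'
        "}\n"
        f"LIMIT {limit}"
    )
```

The `lang()` filter selects labels with the requested language tag, and the `regex` filter drops labels containing digits, matching the exclusion described above.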
QA: The QA benchmark uses all questions in the Question Answering over Linked Data (QALD) datasets in two forms, Keywords (K) and Full Questions (F). It evaluates the efficiency of the different approaches in the Question Answering (QA) domain: the models are assessed on detecting the language of a question containing a knowledge-base resource. The QALD test benchmark covers the following languages:

- QALD 1: English
- QALD 2: English
- QALD 3: English, German, French, Spanish, Italian, and Dutch
Long: The long-text benchmark uses the dbo:abstracts of the top 10,000 resources returned by the DBpedia SPARQL endpoint, where available. It evaluates the different language-identification approaches on long resource texts. We used English, German, Russian, Italian, Spanish, and French for this test.
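The numbers reported in the Evaluation section below are per-benchmark accuracies. A minimal sketch of how such a score can be computed for any detector (function name and interface are illustrative):

```python
def accuracy(detect, samples):
    """Share of (text, gold_language) pairs for which `detect` returns the
    gold language code; `detect` is any callable mapping text -> code."""
    if not samples:
        return 0.0
    return sum(detect(text) == gold for text, gold in samples) / len(samples)
```

Any of the compared approaches (LangTagger, langdetect, Tika, openNLP, langid) can be plugged in as `detect`, which is what makes the per-benchmark numbers directly comparable.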
## Evaluation
Results achieved by the different approaches on all languages of the QALD test benchmark, for Full (F) and Keyword (K) questions:

| Approach | 1 (F) | 2 (F) | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.70 | 0.99 | 0.77 | 0.99 | 0.77 | 1.00 | 0.76 | 0.99 | 0.67 | 0.98 | 0.48 | 1.00 | 0.70 | 0.97 |
| LangTag(C) | 1.00 | 1.00 | 0.86 | 0.99 | 0.90 | 0.99 | 0.92 | 1.00 | 0.81 | 0.99 | 0.93 | 1.00 | 0.70 | 1.00 | 0.84 | 0.97 |
| langdetect | 0.96 | 0.96 | 0.65 | 0.93 | 0.76 | 0.92 | 0.72 | 0.92 | 0.68 | 0.91 | 0.76 | 0.95 | 0.51 | 1.00 | 0.65 | 0.82 |
| Tika | 0.96 | 0.93 | 0.61 | 0.88 | 0.70 | 0.90 | 0.66 | 0.91 | 0.63 | 0.89 | 0.72 | 0.91 | 0.56 | 0.97 | 0.61 | 0.80 |
| openNLP | 0.96 | 0.97 | 0.48 | 0.89 | 0.62 | 0.89 | 0.61 | 0.85 | 0.48 | 0.75 | 0.62 | 0.90 | 0.39 | 0.95 | 0.41 | 0.73 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.96 | 0.76 | 0.95 | 0.76 | 0.94 | 0.75 | 0.93 | 0.83 | 0.97 | 0.56 | 1.00 | 0.81 | 0.95 |
| langdetect(12) | 0.96 | 0.93 | 0.67 | 0.90 | 0.76 | 0.91 | 0.72 | 0.91 | 0.69 | 0.89 | 0.75 | 0.92 | 0.58 | 1.00 | 0.66 | 0.82 |
| langid | 0.98 | 0.94 | 0.62 | 0.93 | 0.72 | 0.94 | 0.64 | 0.95 | 0.68 | 0.91 | 0.64 | 0.93 | 0.65 | 1.00 | 0.64 | 0.82 |
Results achieved by the different approaches on English questions of the QALD test benchmark:

| Approach | 1 (F) | 2 (F) | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.69 | 1.00 | 0.80 | 1.00 | 0.77 | 1.00 | 0.80 | 1.00 | 0.60 | 1.00 | 0.48 | 1.00 | 0.72 | 1.00 |
| LangTag(C) | 1.00 | 1.00 | 0.87 | 1.00 | 0.98 | 1.00 | 0.93 | 1.00 | 0.83 | 1.00 | 0.93 | 1.00 | 0.70 | 1.00 | 0.87 | 1.00 |
| langdetect | 0.96 | 0.96 | 0.53 | 0.96 | 0.68 | 0.94 | 0.67 | 0.94 | 0.70 | 0.95 | 0.65 | 0.93 | 0.51 | 1.00 | 0.68 | 0.92 |
| Tika | 0.96 | 0.93 | 0.51 | 0.97 | 0.68 | 0.92 | 0.61 | 0.91 | 0.65 | 0.94 | 0.67 | 0.96 | 0.56 | 0.95 | 0.64 | 0.93 |
| openNLP | 0.96 | 0.97 | 0.52 | 0.97 | 0.70 | 0.92 | 0.67 | 0.91 | 0.63 | 0.94 | 0.62 | 0.96 | 0.39 | 0.95 | 0.58 | 0.93 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.98 | 0.82 | 0.96 | 0.82 | 0.98 | 0.83 | 0.98 | 1.00 | 0.79 | 1.00 | 0.56 | 0.80 | 0.98 |
| langdetect(12) | 0.98 | 0.93 | 0.55 | 0.93 | 0.72 | 0.90 | 0.69 | 0.98 | 0.72 | 0.96 | 0.74 | 0.88 | 0.56 | 1.00 | 0.66 | 0.93 |
| langid | 0.98 | 0.94 | 0.52 | 0.94 | 0.60 | 0.96 | 0.61 | 0.96 | 0.67 | 0.94 | 0.55 | 0.95 | 0.65 | 1.00 | 0.59 | 0.94 |
Results achieved by the different approaches on German questions of the QALD test benchmark:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.87 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| LangTag(C) | 0.80 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| langdetect | 0.80 | 0.95 | 0.80 | 0.92 | 0.77 | 0.91 | 0.71 | 0.88 | 0.74 | 0.95 | 0.81 | 0.94 |
| Tika | 0.79 | 0.95 | 0.78 | 0.92 | 0.79 | 0.94 | 0.71 | 0.88 | 0.69 | 0.95 | 0.81 | 0.94 |
| openNLP | 0.42 | 0.88 | 0.54 | 0.80 | 0.59 | 0.81 | 0.39 | 0.74 | 0.48 | 0.79 | 0.48 | 0.82 |
| openNLP(12) | 0.68 | 0.92 | 0.70 | 0.84 | 0.77 | 0.85 | 0.54 | 0.80 | 0.76 | 0.93 | 0.72 | 0.92 |
| langdetect(12) | 0.80 | 0.95 | 0.78 | 0.90 | 0.83 | 0.93 | 0.75 | 0.85 | 0.76 | 0.90 | 0.81 | 0.94 |
| langid | 0.70 | 0.93 | 0.82 | 0.94 | 0.75 | 0.95 | 0.71 | 0.92 | 0.67 | 0.90 | 0.78 | 0.94 |
Results achieved by the different approaches on French questions of the QALD test benchmark:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.72 | 0.98 | 0.72 | 1.00 | 0.79 | 1.00 | 0.74 | 0.99 | 0.62 | 0.97 | 0.66 | 0.99 |
| LangTag(C) | 0.88 | 1.00 | 0.84 | 1.00 | 0.93 | 1.00 | 0.78 | 0.99 | 0.90 | 1.00 | 0.80 | 0.99 |
| langdetect | 0.61 | 0.90 | 0.84 | 0.98 | 0.86 | 0.96 | 0.69 | 0.92 | 0.88 | 1.00 | 0.77 | 0.94 |
| Tika | 0.61 | 0.89 | 0.76 | 0.98 | 0.79 | 0.93 | 0.65 | 0.93 | 0.79 | 1.00 | 0.73 | 0.96 |
| openNLP | 0.51 | 0.86 | 0.62 | 0.90 | 0.58 | 0.80 | 0.59 | 0.82 | 0.65 | 0.90 | 0.62 | 0.82 |
| openNLP(12) | 0.70 | 0.94 | 0.76 | 0.96 | 0.68 | 0.96 | 0.68 | 0.90 | 0.93 | 0.97 | 0.78 | 0.91 |
| langdetect(12) | 0.73 | 0.88 | 0.76 | 0.98 | 0.68 | 0.93 | 0.66 | 0.91 | 0.83 | 0.95 | 0.76 | 0.92 |
| langid | 0.75 | 0.96 | 0.88 | 0.96 | 0.75 | 1.00 | 0.78 | 0.94 | 0.79 | 0.97 | 0.84 | 0.92 |
Results achieved by the different approaches on entity rdfs:labels:

| Approach | EN | DE | RU | IT | ES | FR | PT | AVG accuracy | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 83 | 243 | 10,000 | 782 | 227 | | |
| LangTag(S) | 0.21 | 0.91 | - | 0.25 | 0.09 | 0.34 | 0.36 | 0.36 | 0.00162 |
| LangTag(C) | 0.26 | 0.88 | 0.12 | 0.35 | 0.15 | 0.36 | 0.44 | 0.34 | 0.00186 |
| langdetect | 0.40 | 0.43 | 0.57 | 0.63 | 0.31 | 0.59 | 0.43 | 0.48 | 0.01761 |
| Tika | 0.24 | 0.39 | 0.50 | 0.68 | 0.15 | 0.59 | 0.35 | 0.41 | 0.41428 |
| openNLP | 0.16 | 0.18 | 0.12 | 0.30 | 0.15 | 0.33 | 0.25 | 0.21 | 0.01125 |
| openNLP(12) | 0.75 | 0.37 | 0.98 | 0.80 | 0.37 | 0.59 | 0.52 | 0.62 | 0.05361 |
| langdetect(12) | 0.35 | 0.51 | 0.59 | 0.67 | 0.32 | 0.59 | 0.43 | 0.49 | 0.03611 |
| langid | 0.69 | 0.33 | 0.56 | 0.44 | 0.33 | 0.57 | 0.28 | 0.45 | 0.01651 |
Results achieved by the different approaches on abstracts (dbo:abstracts):

| Approach | EN | DE | RU | IT | ES | FR | AVG accuracy | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 285 | 10,000 | 10,000 | 10,000 | | |
| LangTag(S) | 0.96 | 0.99 | - | 0.99 | 0.99 | 0.99 | 0.98 | 0.00267 |
| LangTag(C) | 0.96 | 0.99 | 0.86 | 0.99 | 0.99 | 0.99 | 0.96 | 0.00287 |
| langdetect | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.01657 |
| Tika | 0.95 | 0.99 | 0.95 | 0.99 | 0.98 | 0.99 | 0.97 | 0.43918 |
| openNLP | 0.79 | 0.81 | 0.13 | 0.76 | 0.78 | 0.71 | 0.66 | 0.01427 |
| openNLP(12) | 0.95 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.18625 |
| langdetect(12) | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.02183 |
| langid | 0.96 | 0.97 | 0.94 | 0.99 | 0.98 | 0.99 | 0.97 | 0.03579 |
Average runtime in seconds (s) of the different approaches on the QALD test benchmarks:

| Approach | 3 (K) | 3 (F) | 4 (K) | 4 (F) | 5 (K) | 5 (F) | 6 (K) | 6 (F) | 7 (K) | 7 (F) | 8 (K) | 8 (F) | 9 (K) | 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.0003 | 0.0003 | 0.0006 | 0.0003 | 0.0006 | 0.0003 | 0.0002 | 0.0002 | 0.0019 | 0.0004 | 0.0041 | 0.0014 | 0.0001 | 0.0002 |
| LangTag(C) | 0.0018 | 0.0012 | 0.0026 | 0.0021 | 0.0029 | 0.0022 | 0.0017 | 0.0011 | 0.0036 | 0.0031 | 0.0131 | 0.0120 | 0.0017 | 0.0012 |
| langdetect | 0.0087 | 0.0063 | 0.0079 | 0.0042 | 0.0072 | 0.0057 | 0.0078 | 0.0054 | 0.0082 | 0.0041 | 0.0092 | 0.0021 | 0.0075 | 0.0116 |
| Tika | 1.5677 | 1.4068 | 1.4021 | 1.4009 | 1.6072 | 1.3928 | 1.5981 | 1.3978 | 1.4379 | 1.3955 | 1.4213 | 1.3778 | 1.9081 | 1.4836 |
| openNLP | 0.0027 | 0.0011 | 0.0036 | 0.0039 | 0.0035 | 0.0030 | 0.0023 | 0.0011 | 0.0058 | 0.0062 | 0.0032 | 0.0026 | 0.0012 | 0.0014 |
Model size in megabytes (MB) and kilobytes (KB) of the different approaches on the QALD test benchmarks: