Commit d1d538b

PVQ-4317 Implement detect_language_txt_to_txt subtype action
- added option for desktop to access this feature
- added test for this feature
1 parent 31a5118 commit d1d538b

3 files changed: +66 -3 lines changed

config.json

Lines changed: 29 additions & 0 deletions
@@ -57,6 +57,35 @@
                     "value": ""
                 }
             ]
+        },
+        {
+            "title": "Detect TXT Language to Text (LangDetect)",
+            "name": "pdfix_detect_pdf_language_to_text_langdetect",
+            "desc": "Automatically detects the language of a TXT and saves the detected language code to a TXT file [Local]",
+            "version": "v0.0.0",
+            "icon": "language_txt",
+            "category": "Metadata",
+            "subtype": "detect_language_txt_to_txt",
+            "local": "True",
+            "program": "docker run -v ${working_directory}:/data -w /data --rm pdfix/detect-language:latest lang-detect --name \"${license_name}\" --key \"${license_key}\" -i \"/data/${input_txt}\" -o \"/data/${output_txt}\"",
+            "args": [
+                {
+                    "name": "input_txt",
+                    "desc": "Input text file",
+                    "flags": 2,
+                    "type": "file_path",
+                    "ext": "txt",
+                    "value": ""
+                },
+                {
+                    "name": "output_txt",
+                    "desc": "Output text file containing the detected language code",
+                    "flags": 4,
+                    "type": "file_path",
+                    "ext": "txt",
+                    "value": ""
+                }
+            ]
         }
     ]
 }
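
Note on the `program` template above: the following is a minimal sketch of how its `${...}` placeholders might expand into a concrete command. The function name `expand_program` and all sample values (`/tmp/work`, `EXAMPLE_NAME`, `EXAMPLE_KEY`, `input.txt`, `lang.txt`) are hypothetical, and the substitution logic of the desktop application is an assumption, not taken from this commit. The sketch only prints the command and does not run Docker.

```shell
#!/bin/sh
# Hypothetical helper: fills the placeholders of the "program" template.
# $1 = working_directory, $2 = license_name, $3 = license_key,
# $4 = input_txt, $5 = output_txt
expand_program() {
    printf 'docker run -v %s:/data -w /data --rm pdfix/detect-language:latest lang-detect --name "%s" --key "%s" -i "/data/%s" -o "/data/%s"\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Dry run with placeholder values; nothing is executed.
expanded=$(expand_program /tmp/work "EXAMPLE_NAME" "EXAMPLE_KEY" input.txt lang.txt)
echo "$expanded"
```

Because the working directory is mounted at `/data`, both file arguments are resolved relative to that mount, so the input and output files would need to live under the working directory on the host.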
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+Language identification
+---
+
+From Wikipedia, the free encyclopedia
+In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.
+
+Overview
+---
+There are several statistical approaches to language identification using different techniques to classify the data. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as mutual information based distance measure. The same technique can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods.[citation needed] Mutual information based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques.
+
+Another technique, as described by Cavnar and Trenkle (1994) and Dunning (1994) is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text needing to be identified, a similar model is made, and that model is compared to each stored language model. The most likely language is the one with the model that is most similar to the model from the text needing to be identified. This approach can be problematic when the input text is in a language for which there is no model. In that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text that are composed of several languages, as is common on the Web.
+
+For a more recent method, see Řehůřek and Kolkus (2009). This method can detect multiple languages in an unstructured piece of text and works robustly on short texts of only a few words: something that the n-gram approaches struggle with.
+
+An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English).
+
+A common non-statistical intuitive approach (though highly uncertain) is to look for common letter combinations, or distinctive diacritics or punctuation.[1][2]
+
+Identifying similar languages
+---
+One of the great bottlenecks of language identification systems is to distinguish between closely related languages. Similar languages like Bulgarian and Macedonian or Indonesian and Malay present significant lexical and structural overlap, making it challenging for systems to discriminate between them.
+
+In 2014 the DSL shared task[3] has been organized providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system reached performance of over 95% results (Goutte et al., 2014). Results of the DSL shared task are described in Zampieri et al. 2014.
+

test.sh

Lines changed: 13 additions & 3 deletions
@@ -63,7 +63,7 @@ else
     EXIT_STATUS=1
 fi
 
-info "Test #04: Run language detection to txt"
+info "Test #04: Run language detection pdf to txt"
 docker run --rm $PLATFORM -v $(pwd):/data -w /data $DOCKER_IMAGE lang-detect -i example/air_quality.pdf -o $TEMPORARY_DIRECTORY/air_quality.txt > /dev/null
 if [ -f "$(pwd)/$TEMPORARY_DIRECTORY/air_quality.txt" ]; then
     success "passed"
@@ -72,9 +72,18 @@ else
     EXIT_STATUS=1
 fi
 
+info "Test #05: Run language detection txt to txt"
+docker run --rm $PLATFORM -v $(pwd):/data -w /data $DOCKER_IMAGE lang-detect -i example/language_identification_wikipedia.txt -o $TEMPORARY_DIRECTORY/language_identification_wikipedia_lang.txt > /dev/null
+if [ -f "$(pwd)/$TEMPORARY_DIRECTORY/language_identification_wikipedia_lang.txt" ]; then
+    success "passed"
+else
+    error "language detection to txt failed on example/language_identification_wikipedia.txt"
+    EXIT_STATUS=1
+fi
+
 # Move these tests to functional tests
 
-# info "Test #05: Run lang-detect on pdf with empty page"
+# info "Test #06: Run lang-detect on pdf with empty page"
 # docker run --rm $PLATFORM -v $(pwd):/data -w /data $DOCKER_IMAGE lang-detect -i example/empty_page.pdf -o $TEMPORARY_DIRECTORY/empty_page.txt > /dev/null
 # if [ -f "$(pwd)/$TEMPORARY_DIRECTORY/empty_page.txt" ]; then
 #     success "passed"
@@ -83,7 +92,7 @@ fi
 #     EXIT_STATUS=1
 # fi
 
-# info "Test #06: Run lang-detect on pdf with numbers"
+# info "Test #07: Run lang-detect on pdf with numbers"
 # docker run --rm $PLATFORM -v $(pwd):/data -w /data $DOCKER_IMAGE lang-detect -i example/pdfix_6_0_0_0053.pdf -o $TEMPORARY_DIRECTORY/num.txt > /dev/null
 # if [ -f "$(pwd)/$TEMPORARY_DIRECTORY/empty_page.txt" ]; then
 #     success "passed"
@@ -96,6 +105,7 @@ info "Cleaning up temporary files from tests"
 rm -f $TEMPORARY_DIRECTORY/config.json
 rm -f $TEMPORARY_DIRECTORY/air_quality.pdf
 rm -f $TEMPORARY_DIRECTORY/air_quality.txt
+rm -f $TEMPORARY_DIRECTORY/language_identification_wikipedia_lang.txt
 rmdir $(pwd)/$TEMPORARY_DIRECTORY
 
 info "Removing testing docker image"
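
Note on Test #05: the check uses `[ -f ... ]`, which passes as soon as the output file exists, even if it is empty. A slightly stricter variant could use `[ -s ... ]`, which also requires a non-zero size. The sketch below is only an illustration: the helper name `output_is_nonempty` is hypothetical, the temporary file stands in for the real `lang-detect` output, and the `en` content is a placeholder rather than the tool's confirmed output format.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file exists and is not empty.
output_is_nonempty() {
    [ -s "$1" ]
}

# A temporary file stands in for the output written by lang-detect.
tmp_file=$(mktemp)

# Case 1: the file exists but is empty, so the stricter check fails.
if output_is_nonempty "$tmp_file"; then empty_result="passed"; else empty_result="failed"; fi

# Case 2: the file holds some content (placeholder value), so the check passes.
printf 'en\n' > "$tmp_file"
if output_is_nonempty "$tmp_file"; then filled_result="passed"; else filled_result="failed"; fi

rm -f "$tmp_file"
echo "empty file: $empty_result, filled file: $filled_result"
```

With `[ -f ... ]` both cases would count as passed, so the stricter form would catch a run that creates the output file but writes nothing to it.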
