Dealing with line wrapped text in the Text Content Locator.

The Text Content locator tokenizes natural text in a document to extract fields.

In the following example, a speeding ticket in German, we want the Text Content Locator to retrieve these fields from the text:

| Field | Value |
|---|---|
| Vehicle Type | PKW |
| License | 3AK8017 |
| Date | 12.10.2022 |
| Time | 15:18 |
| Location | A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel |
| Law | § 24 StVG |

Only Vehicle Type and License are single words. Date, Time, Location and Law are all multiword phrases, called tokens.
The Location contains 11 words that wrap onto a second line. It is very important that the Text Content Locator sees all 11 of these words, so that it knows that "in" is the first word before the field and "folgende" is the first word after the field.
The text is tokenized as
dem Führer des {Vehicle Type}, {License} wird vorgeworfen, am {Date}, um {Time} Uhr in {Location} folgende Ordnungswidrigkeit nach {Law} begangen zu haben

Note that ALL the field information is inside the {tokens}; the remaining words are just the text context and contain no field information. The Text Content Locator will now learn that "um" comes one token after {Date} and one token before {Time}. It will also learn that "um" is four tokens in front of {Location}. This is why you need to give the Text Content Locator many training samples: so that it can learn all of the possibilities and can also tokenize sentences that it has not seen before.
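The token distances described above can be checked with a small stand-alone sketch. It is plain Python, runs outside Kofax, and is only a simplified model of the idea, not how the Text Content Locator is implemented. The helpers `tokenize` and `distance` are hypothetical names introduced for this illustration, and punctuation is simply ignored.

```python
import re

# The sentence from the ticket, with the field values still in place.
SENTENCE = (
    "dem Führer des PKW, 3AK8017 wird vorgeworfen, am 12.10.2022, um 15:18 Uhr "
    "in A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel "
    "folgende Ordnungswidrigkeit nach § 24 StVG begangen zu haben"
)

# Field values taken from the field list of this example.
FIELDS = {
    "Vehicle Type": "PKW",
    "License": "3AK8017",
    "Date": "12.10.2022",
    "Time": "15:18",
    "Location": "A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel",
    "Law": "§ 24 StVG",
}


def tokenize(sentence, fields):
    """Hypothetical helper: swap each field value for one {Field} token,
    then return the tokens in order, ignoring punctuation."""
    for name, value in fields.items():
        sentence = sentence.replace(value, "{" + name + "}", 1)
    # A token is either a {Field} placeholder or a run of word characters.
    return re.findall(r"\{[^}]+\}|\w+", sentence)


def distance(tokens, start, end):
    """Signed number of tokens from `start` to `end`."""
    return tokens.index(end) - tokens.index(start)


tokens = tokenize(SENTENCE, FIELDS)
print(tokens)
print(distance(tokens, "{Date}", "um"))      # "um" relative to {Date}
print(distance(tokens, "um", "{Time}"))      # {Time} relative to "um"
print(distance(tokens, "um", "{Location}"))  # {Location} relative to "um"
```

If any of the 11 Location words were missing from the field, the leftover words would appear as extra context tokens between "in" and "folgende", and these distances would no longer match from one training sample to the next.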

In Kofax Transformation Validation it can be difficult to put the correct words in the field because of line wrapping.

It is important that the XDocument contains each of these words.

<text>A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel</text>
<words>21;22;23;24;25;26;27;28;29;30;31;</words>
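One way to spot a field that lost words to line wrapping is to compare the word list with the text. The sketch below is plain Python run outside Kofax; it assumes only the semicolon-separated index format shown in the snippet above, and the helper names `parse_word_indices` and `missing_indices` are hypothetical.

```python
def parse_word_indices(words_value):
    """Parse a semicolon-separated list such as '21;22;23;' into integers."""
    return [int(part) for part in words_value.split(";") if part.strip()]


def missing_indices(indices):
    """Return the word indices between the lowest and highest index that are absent."""
    if not indices:
        return []
    present = set(indices)
    return [i for i in range(min(indices), max(indices) + 1) if i not in present]


text = "A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel"
good = parse_word_indices("21;22;23;24;25;26;27;28;29;30;31;")
print(len(text.split()), len(good))  # word count of the text and of the index list
print(missing_indices(good))         # no gaps expected for a complete field

# A made-up incomplete field that skips part of the wrapped text.
bad = parse_word_indices("21;22;23;29;30;31;")
print(missing_indices(bad))
```

A non-empty result means the field skips words that sit between its first and last word, which is the situation the Text Content Locator cannot train from.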

Do the following in Kofax Transformation Validation:

  • Drag the mouse from the first word to the end of the first line.
  • CTRL-drag the mouse from the first word of the second line to the last word.
  • The field now contains the correct text, but the image viewer and the red box highlight too much text. Ignore this!
  • If you make a mistake, clear the field and try again.

If the field still does not contain the right words, you can run the following script after document validation (it does not work in KTA). It inserts into the field every word that lies between the field's first and last word.

Option Explicit

' Class script: TrafficFine
Private Sub Document_Validated(ByVal pXDoc As CASCADELib.CscXDocument)
   'This event is triggered when the Validation screen is finished with the document. It does not work in KTA.
   Dim F As Long
   For F=0 To pXDoc.Fields.Count-1
      Field_InterpolateWords(pXDoc, pXDoc.Fields(F))
   Next
End Sub

Public Sub Field_InterpolateWords(pXDoc As CscXDocument, Field As CscXDocField)
   'This checks whether a field contains at least two words. If there are words between the first and the last word,
   'the text is cleared, all words from the first to the last are inserted into the field and the text is reconstructed.
   'This is necessary so that the Text Content Locator can train from documents where fields line wrap.
   Dim WFirst As Long, WLast As Long, W As Long, Words As CscXDocWords
   Set Words=Field.Words
   If Words.Count<2 Then Exit Sub 'This field contains zero or one word
   WFirst= Words(0).IndexOnDocument 'index of first word in field
   WLast = Words(Words.Count-1).IndexOnDocument 'index of last word in field
   If WLast <= WFirst+1 Then Exit Sub ' exit if there are no words BETWEEN the first and the last word.
   'The first and last words of this field have at least one word between them
   'remove the text and words it has
   Field.Text=""
   While Words.Count>0
      Words.Remove(0)
   Wend
   'Add all the words, including the words between to the field. .Text will be filled automatically
   For W=WFirst To WLast
      Field.Words.Append(pXDoc.Words(W))
   Next
End Sub
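
The effect of the script can be tried outside Kofax with a simplified model. The sketch below is plain Python and uses ordinary lists in place of the XDocument objects. The function `interpolate_words` is a hypothetical stand-in for Field_InterpolateWords, and the word indices come from splitting the example sentence on spaces, so they differ from the indices in the real XDocument.

```python
SENTENCE = (
    "dem Führer des PKW, 3AK8017 wird vorgeworfen, am 12.10.2022, um 15:18 Uhr "
    "in A38, AD Drammetal, km 0,492, Rampe zur A7, in Rtg. Kassel "
    "folgende Ordnungswidrigkeit nach § 24 StVG begangen zu haben"
)

# Stand-in for pXDoc.Words: one entry per space-separated word.
doc_words = SENTENCE.split()


def interpolate_words(doc_words, field_indices):
    """Return (indices, text) of a field after filling in every word that lies
    between its first and its last word. Fields with fewer than two words, or
    with nothing between first and last, are returned unchanged."""
    indices = list(field_indices)
    if len(indices) >= 2:
        first, last = indices[0], indices[-1]
        if last > first + 1:
            indices = list(range(first, last + 1))
    text = " ".join(doc_words[i] for i in indices)
    return indices, text


# Simulate a field where only the first and the last word were selected.
first = doc_words.index("A38,")
last = doc_words.index("Kassel")
indices, text = interpolate_words(doc_words, [first, last])
print(indices)
print(text)
```

In this model the rebuilt text is the words joined by single spaces; in Kofax Transformation the field text is filled automatically when the words are appended, as the comment in the script notes.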