PR to merge all work from NCSU Senior Design project #165
Closed
Commits
64 commits
d884a2e
added changes for regex search
3ad86c7
missed some changes
19ab131
fixed java code for regex search
225d648
Intermediary change while we get 4 corners working
fffae84
added Tess4j dependency to pom.xml, added base class for ocr implemen…
1b490b0
OCR Full conversion complete, dumped into individual pdf's folder
0026ac7
Changed return to succes or failure
3241583
Add files via upload
c721649
Add files via upload
a130b12
fixed imports for full build dependencies
2148259
fix double iteration on backup
7440fcb
temp fix for out of position strings that begin with same char
b8e904b
batch test for GUI
57093cc
updated container class for regular expressions
1210982
fixed a couple errors. file needed to be deleted
c9f60ef
added basic coordinate search
2e174a4
fixed check for coordinates
1879eab
System output change in regex and batch coordinate change
592ba47
system output changes and fixed bug with bottom of string not being w…
2f40969
ocr selection for batch now passed in
409f902
Fixed issue where first element wasn't within the coordinates
4980374
attempt to merge ocr into batch, overlap might do something
f533d9d
Merge branch 'master' into dev
643eb73
allowing 3 corner search, new parsing method
mattrich37 eb3fe2e
changed how string arrays are handled, should be able to perform 3 co…
mattrich37 46daf5c
removed backend cause of need for renaming operations.
d17d044
cleaned up code
15b6d54
removed print statements
mattrich37 cd96c69
tried to format tables better, removed prints and delete ocr files
mattrich37 f5ec776
fixes for ocr and indexing errors
mattrich37 4dce3d6
fixed comments
mattrich37 bba10d1
fixed conflict of ocr output naming. added boolean to determine ocr e…
0a92f37
updated output for ocr in batch processing
ee0e4eb
incorrect parsing of coordinate search list from json fix
c259189
basic 1 string search, should have fixed bounding box issues
mattrich37 32c31b0
no longer needed
mattrich37 9b5259f
one string search now compares to autodetect
93e0e63
added boolean as well..
dd40bb2
thinned out comments
a418f59
Complete rename of any regex to string to match functionality
dbangera23 45e316e
Initial frramework for testing of ocr and string search
dbangera23 68d9474
Implemented test cases for string searching
e238cad
Implemented test for OCR conversion
ed31e18
Wrote tests for batch processing. Added resources.
54cfc6b
preliminary effort to address PR comments. all relevant tests passing.
857d2de
cleaned up a number of PR comments. tested changes in GUI and verifie…
677ecee
Added OCR as 'e' in command line
dbangera23 9296532
changed OCR to remove creation of new files in user system. all new f…
3e69d60
reorganized batch search to remove all catch(Exception) swallowing.
6692f71
generalized path creation cross platform
701d3f6
fixing potential null pointers
ad2c520
fixed (another) potential nullpointerexception
5f39398
fixing linux error when files read in different order
680d375
Merge branch 'master' into master
jazzido c18ff67
install tesseract in travis
jazzido e24fc19
we need sudo in travis
jazzido 5fd099b
sudo for travis
jazzido 4471cee
travis: ghostscript
jazzido b716766
Wrote batch using file writer example in spreadsheetextractor
dbangera23 5bde097
Updated java process for batch processing. more modular process. upda…
dbangera23 763f341
Changed expected files in expected of batch testing due to change in …
dbangera23 0878b17
Merge branch 'master' into master
jeremybmerrill dd9ecea
updated command line interface, -b now takes into consideration -e fo…
dbangera23 54b3ec7
Now pass a page list to OcrConvertor and only run OCR on specified pages
dan144
Does `tess4j` include binaries for all the platforms that we support (Windows, Mac and Linux)? Does it require Tesseract to be present on the machine where Tabula is used?
As far as I can tell from my research, only Windows binaries are included.
[1] [2]
I will look into this. I believe @jeremybmerrill once mentioned that Unix systems require a terminal install command for a `tesseract` package, but I'll investigate and find the fix.
@dbangera23
Looks like we might need to add a line to the instructions for Linux/Mac users. We used Tess4J in the project (http://tess4j.sourceforge.net/usage.html), which points to the original tesseract-ocr GitHub page (https://github.com/tesseract-ocr/tesseract/wiki), which in turn suggests running "sudo port install tesseract".
Although, to be honest, I'm not able to test this since I don't have access to a Linux device. Let me know if there is anything I can do and what you guys find out.
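To make the per-platform instruction concrete, here is a minimal Java sketch that maps an `os.name` value to an install hint. The class `TesseractInstallHint` and its method are hypothetical and not part of Tabula or Tess4J. The `port` command comes from the comment above, the `apt-get` command is the one quoted later in this thread, and the Windows note only reflects the statement above that Windows binaries are bundled.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Tabula or Tess4J. */
final class TesseractInstallHint {
    private TesseractInstallHint() {}

    /** Returns a suggested install step for the given os.name value. */
    static String forOs(String osName) {
        String os = osName == null ? "" : osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" contains "win" and would otherwise match Windows.
        if (os.contains("mac") || os.contains("darwin")) {
            return "sudo port install tesseract";
        }
        if (os.contains("linux")) {
            // Assumes a Debian/Ubuntu system; other distributions use other package managers.
            return "sudo apt-get install tesseract-ocr";
        }
        if (os.contains("win")) {
            return "No separate install expected: Tess4J reportedly bundles Windows binaries";
        }
        return "Install Tesseract with your platform's package manager";
    }

    public static void main(String[] args) {
        // Prints the hint for the machine running this sketch.
        System.out.println(forOs(System.getProperty("os.name")));
    }
}
```

A hint like this could be shown when OCR is requested but the native library cannot be loaded, rather than leaving users to work out the platform-specific step themselves.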
@jazzido it looks like we need ghostscript as well, which can be installed with `sudo apt-get install ghostscript`. I believe that should be added to the `.travis.yml` file the same way you added the tesseract dependency.
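Since both tesseract and ghostscript have to be installed outside of Maven, a quick runtime check can show what is missing. The sketch below looks for an executable on a `PATH`-style string. `ExecutableLocator` is a hypothetical helper, not existing Tabula code. Note that finding the `tesseract` or `gs` command does not guarantee that the shared library Tess4J loads is present or recent enough.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tabula. */
final class ExecutableLocator {
    private ExecutableLocator() {}

    /** Searches each directory in pathValue for an executable file called name. */
    static Optional<Path> find(String name, String pathValue) {
        if (name == null || pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, name);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries rather than failing the whole lookup.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        // "tesseract" and "gs" are the usual command names on Linux and Mac.
        for (String tool : new String[] {"tesseract", "gs"}) {
            System.out.println(tool + ": " + find(tool, path).map(Path::toString).orElse("not found"));
        }
    }
}
```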
So the latest Travis issue with Tesseract is `java.lang.UnsatisfiedLinkError: Error looking up function 'TessBaseAPICreate': /usr/lib/libtesseract.so.3.0.2: undefined symbol: TessBaseAPICreate`. When I run `apt-get install tesseract-ocr` on my local Ubuntu 16.04 system, it installs the same package, but with `libtesseract.so.3.0.4`, which contains the Tess4J API calls that we used; those are present in Tesseract 3.0.3+. It looks like the Travis system is running Ubuntu 12.04 LTS, where the packaged version stops at 3.0.2. How do we want to proceed with this knowledge?
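One possible way to keep an outdated native library from crashing the run is to guard OCR initialization and to compare the library's version suffix against the minimum. This is only a sketch: `NativeOcrGuard`, `tryLoad`, and `soVersionAtLeast` are hypothetical names, and the 3.0.3 minimum is taken from the analysis above rather than verified against Tess4J.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical guard for illustration only; not part of Tabula or Tess4J. */
final class NativeOcrGuard {
    private NativeOcrGuard() {}

    /** Runs init and returns empty if the native library lacks a required symbol. */
    static <T> Optional<T> tryLoad(Supplier<T> init) {
        try {
            return Optional.ofNullable(init.get());
        } catch (UnsatisfiedLinkError e) {
            // This is the error type reported in the Travis log above.
            System.err.println("OCR unavailable: " + e.getMessage());
            return Optional.empty();
        }
    }

    /** Compares the numeric suffix after ".so." with a minimum version. */
    static boolean soVersionAtLeast(String fileName, int... minimum) {
        int idx = fileName.indexOf(".so.");
        if (idx < 0) {
            return false; // no version suffix to inspect
        }
        String[] parts = fileName.substring(idx + 4).split("\\.");
        for (int i = 0; i < minimum.length; i++) {
            int have;
            try {
                // Missing components count as zero.
                have = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            } catch (NumberFormatException e) {
                return false; // unexpected suffix format
            }
            if (have != minimum[i]) {
                return have > minimum[i];
            }
        }
        return true; // equal to the minimum
    }

    public static void main(String[] args) {
        System.out.println(soVersionAtLeast("libtesseract.so.3.0.2", 3, 0, 3)); // false
        System.out.println(soVersionAtLeast("libtesseract.so.3.0.4", 3, 0, 3)); // true
    }
}
```

With a guard like this, the OCR tests could be skipped on a build machine with the older library instead of failing, although upgrading the CI image or the installed package would address the root cause.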
@jeremybmerrill this is the cause of the failures and my analysis of it. I don't know Travis super well, so I'm not sure how to fix it.
@dan144 @dbangera23 I see two failures when I run the tests myself (the other error may just be Travis being weird), both in testBatchExtractor, just with assertions that fail. Can you guys look into it and see what's going on, then get the tests to pass? Thanks in advance!
@jeremybmerrill I fixed the issue with testBatchExtractor. It looks like, in my changes, I forgot that we had formatting in place for batch. Now, to stay in line with the rest of Tabula, BatchWriter follows CSVWriter, which changed the output and broke the expected results in testBatchExtractor.