PR to merge all work from NCSU Senior Design project #165
files for alpha demo
files for alpha demo
…ithin coordinates
Merging all development by ECE485 team into master branch
…iles created in temporary directories. removed OCR renaming and simply overwrite all OCR files, since all are now temporary
// NOTES:
// need to remove tables from auto/spread list if they are used as a best guess
// or remove very similar tables from the list before extracting data
public class BatchSelectionExtractor {
It might be worthwhile for this to mimic https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/extractors/SpreadsheetExtractionAlgorithm.java#L85 and have the extract() method return a list of tables, rather than writing the output to disk directly.
I think it would also work for it to return a list of lists of tables (one list of tables per document in the batch job) or even a dictionary/hash mapping batch document filenames to a list of tables.
This will also, I believe, solve one of the problems that's causing the continuous integration job to fail. Rather than requiring that an output directory exist, this extract method can "write" the output to a stream object, rather than a real file. That way, the tests aren't side-effecty in terms of creating files and folders.
Incidentally, I think there needs to be a little bit of work to deal with Linux/Mac compatibility. Where the test for this method is supposed to write some files to the /src/test/resources/technology/tabula/batch/output/ directory, I end up with a file in java/src/test/resources/technology/tabula/batch/ called output\well_text_a.csv. Java has the ability to let you name files without using \ or / and letting Java figure out the right one to use.
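A minimal sketch of the shape suggested above, not the PR's actual code: extraction returns a map from document name to a list of tables, and writing is a separate step that targets any Appendable, so a test can pass a StringWriter instead of creating files. SimpleTable, extract, and writeCsv are hypothetical names used only for illustration; SimpleTable is a stand-in and is not tabula's Table class, and the "extraction" is a placeholder.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchExtractSketch {

    /** Hypothetical stand-in for an extracted table; not tabula's Table class. */
    static final class SimpleTable {
        final List<List<String>> rows;

        SimpleTable(List<List<String>> rows) {
            this.rows = rows;
        }
    }

    /** Returns tables grouped by document name instead of writing files. */
    static Map<String, List<SimpleTable>> extract(List<String> fileNames) {
        Map<String, List<SimpleTable>> result = new LinkedHashMap<>();
        for (String name : fileNames) {
            // Placeholder "extraction": one single-row table per document.
            List<SimpleTable> tables = new ArrayList<>();
            tables.add(new SimpleTable(Arrays.asList(Arrays.asList("file", name))));
            result.put(name, tables);
        }
        return result;
    }

    /** Writes tables as naive CSV (no quoting or escaping) to any Appendable. */
    static void writeCsv(Appendable out, List<SimpleTable> tables) throws IOException {
        for (SimpleTable table : tables) {
            for (List<String> row : table.rows) {
                out.append(String.join(",", row)).append('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<SimpleTable>> byFile = extract(Arrays.asList("a.pdf", "b.pdf"));
        // An in-memory sink: nothing is created on disk.
        StringWriter sink = new StringWriter();
        writeCsv(sink, byFile.get("a.pdf"));
        System.out.print(sink);
    }
}
```

With this split, a test can assert on the returned map or on the string captured by the writer, and no output directory has to exist.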
I believe the last section of this is caused by this section of the code. I'll switch this to use Paths.get. I'll look into making changes for the rest as well.
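As a rough illustration of the Paths.get approach mentioned here (a sketch only; the outputFile helper and the class name are hypothetical and not taken from the PR), the directory and file name are passed as separate arguments so that Java chooses the separator:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutputPathSketch {

    /** Joins a directory and a file name using the platform's own separator. */
    static Path outputFile(String outputDir, String fileName) {
        return Paths.get(outputDir, fileName);
    }

    public static void main(String[] args) {
        Path csv = outputFile("output", "well_text_a.csv");
        // The separator in the printed path depends on the operating system.
        System.out.println(csv);
    }
}
```

Because neither "\" nor "/" is hard-coded, the file ends up inside the output directory on Linux and Mac rather than becoming a single file whose name contains a backslash.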
Just to keep visibility, @dbangera23 is looking into this presently; it hasn't been forgotten.
Hey @jeremybmerrill, I changed the code to make the extract method work much like the one in SpreadsheetExtractionAlgorithm. It now returns a Map keyed by file name (a String), with a List as each value. I also changed some of the code to make it more modular: batch processing is now done separately from writing. Let me know if there is anything else.
Thanks.
Oh, yes. The type signature looks good. Let's figure out why CI is failing, then, since @jazzido approved these too, we can see about getting these merged.
…ted test to use new process
…formatting to keep in line with CSV writer and rest of tabula
Okeedoke. Thanks for fixing the tests ("good failures", the ones where the tests fail because extraction got improved, are common and kinda funny). I'm good to merge this. @jazzido?
Hi @dan and @dbangera23, Ran into another issue when playing with your branch. This command:
The document that I used for testing is here, although any document will show this behavior. Thanks!
Hey guys, is there a way to use String Search from the command line?
Hey @jazzido and @jeremybmerrill, Daniel and I will take a look this weekend at the Mac error that you guys are seeing and see what we can do. I also didn't add the functionality to do OCR on a per-page basis, only on the whole document, so we can look into that. Currently String Search isn't implemented in the command line. We can look into making that work.
@jazzido I looked into your error on Mac, and I think it's the same issue we're seeing on Linux. Since the included JAR file only includes Windows-compatible Tess4J functions by default, I think you'll have to manually install the Tesseract libraries. Instructions are given here.
@jeremybmerrill Hey Jeremy, I looked into the option of adding a string-based command line search and I don't think it's a good idea to do so at the moment. Maybe after the merge we can take a closer look. The problem is that CommandLineApp.whichArea(line); gets the rectangle that determines where to process, and extraction is done later at exactly those rectangles. The string search might return a rectangle that differs from one page to another. I could dynamically search for the rectangle in extractFile, but even then String Search can return multiple rectangles per page. I'm not sure how we want to handle this, since the above "solution" would break how modular the code is and would need a redesign of the page class. Might be better to handle this functionality at a later date.
I'm finishing up the OCR per page functionality. Should have it done soon.
Was this pull request ever merged into the master branch? I've thought about adding OCR (tess4j) functionality but it looks like it was already attempted.
Hey @rosenjcb, this PR has not been worked on in some time. We had trouble with the Ubuntu version used for automated testing in this repo. No dev work has been done on this project in about 16 months, so if you're interested in picking up OCR effort on this project, this certainly shouldn't stop you.
This pull request contains all of the work from the NC State University ECE Senior Design team. The major features added include string search, batch processing, and OCR.