Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR to merge all work from NCSU Senior Design project #165

Closed
wants to merge 64 commits into from

Conversation

dan144
Copy link

@dan144 dan144 commented Jun 21, 2017

This pull request contains all of the work from the NC State University ECE Senior Design team. The major features added include string search, batch processing, and OCR.

Dean and others added 30 commits April 13, 2017 22:32
files for alpha demo
files for alpha demo
Merging all development by ECE485 team into master branch
Daniel Gross added 2 commits June 29, 2017 20:15
…iles created in temporary directories. removed OCR renaming and simply overwrite all OCR files, since all are now temporary
// NOTES:
// need to remove tables from auto/spread list if they are used as a best guess
// or remove very similar tables from the list before extracting data
public class BatchSelectionExtractor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worthwhile for this to mimic https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/extractors/SpreadsheetExtractionAlgorithm.java#L85 and have the extract() method return a list of tables, rather than writing the output to disk directly.

I think it would also work for it to return a list of lists of tables (one list of tables per document in the batch job) or even a dictionary/hash mapping batch document filenames to a list of tables.

This will also, I believe, solve one of the problems that's causing the continuous integration job to fail. Rather than requiring that an output directory exist, this extract method can "write" the output to a stream object, rather than a real file. That way, the tests aren't side-effecty in terms of creating files and folders.

Incidentally, I think there needs to be a little bit of work to deal with Linux/Mac compatibility. Where the test for this method is supposed to write some files to the /src/test/resources/technology/tabula/batch/output/ directory, I end up with a file in java/src/test/resources/technology/tabula/batch/ called output\well_text_a.csv. Java has the ability to let you name files without using \ or / and letting Java figure out the right one to use.

Copy link
Author

@dan144 dan144 Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the last section of this is caused by this section of the code. I'll switch this to use Paths.get. I'll look into making changes for the rest as well.

Just to keep visibility, @dbangera23 is looking into this presently; it hasn't been forgotten.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jeremybmerrill, I changed the code to make the extract method close to how spreadsheetExtractionAlgorithm. It now returns a Map<String fileName, List

> format. I also changed some of the code to make it modular. Now batch processing is done separately than writing.

Let me know if there is anything else.
Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes. The type signature looks good. Let's figure out why CI is failing, then, since @jazzido approved these too, we can see about getting these merged.

@jeremybmerrill
Copy link
Member

Okeedoke. Thanks for fixing the tests ("good failures" the ones where the tests fail because extraction got improved are common and kinda funny). I'm good to merge this. @jazzido ?

@jazzido
Copy link
Contributor

jazzido commented Aug 16, 2017

Hi @dan and @dbangera23,

Ran into another issue when playing with your branch. This command:

java -jar target/tabula-1.0.2-SNAPSHOT-jar-with-dependencies.jar -e -p 10 -g ~/Downloads/170803_crosstabs_Politico_v1_TB-1.pdf
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Users/manuel/Work/code/tabula-java/target/tabula-1.0.2-SNAPSHOT-jar-with-dependencies.jar])
	at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
	at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
	at com.sun.jna.Library$Handler.<init>(Library.java:147)
	at com.sun.jna.Native.loadLibrary(Native.java:412)
	at com.sun.jna.Native.loadLibrary(Native.java:391)
	at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:75)
	at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
	at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:368)
	at net.sourceforge.tess4j.Tesseract.createDocuments(Tesseract.java:524)
	at net.sourceforge.tess4j.Tesseract.createDocuments(Tesseract.java:507)
	at technology.tabula.extractors.OcrConverter.extract(OcrConverter.java:45)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)
  • Even though I indicated that only page 10 needs to be processed, the app generated PNG images for all pages in the same folder that contains the input PDF. Of course, that should not happen. If you need to generate temporary files, they need to be generated in a temporary folder that is deleted afterwards.
  • As the trace shows, it failed. I'm on a Mac.

The document that I used for testing is here, although any document will show this behavior.

Thanks!

@jeremybmerrill
Copy link
Member

Hey guys, does String Search have a way to use it from teh command line?

@dbangera23
Copy link

Hey @jazzido and @jeremybmerrill, Me and daniel will take a look at the mac error that you guys are seeing this weekend and see what we can do. I also didn't add in the functionality to do ocr on a per page basis. Just the whole document so we can look into that.

Currently String search isn't implemented in the command line. We can look into making that work.

@dan144
Copy link
Author

dan144 commented Aug 20, 2017

@jazzido I looked into your error on Mac, and I think it's the same issue we're seeing on Linux. Since the included JAR file only includes Windows-compatible Tess4J functions by default, I think you'll have manually install the Tesseract libraries. Instructions are given here.

@dbangera23
Copy link

@jeremybmerrill Hey Jeremy, I looked into the option for adding a string based command line search and I don't think it's a good idea to do so at the moment. Maybe after the merge we can take a closer look.

The problem is that CommandLineApp.whichArea(line); get the rectangle to determine where to process. After which extraction is done later at exactly those rectangles. The string search might return a rectangle that is different from one page to another. I could dynamically search for the rectangle in extractFile but even then String search can return multiple rectangles per page.

I'm not sure how we want to handle this since the above "solution" would break how modular the code is and would need a redesign of the page class. Might be better to handle this functionality at a later date.

@dbangera23
Copy link

I'm finishing up the OCR per page functionality. Should have it done soon.

@rosenjcb
Copy link

rosenjcb commented Sep 5, 2018

Was this pull request ever merged into the master branch? I've thought about adding OCR (tess4j) functionality but it looks like it was already attempted.

@dan144
Copy link
Author

dan144 commented Sep 6, 2018

Hey @rosenjcb, this PR has not been worked on in some time. We had trouble with the Ubuntu version used for automated testing in this repo. No dev work has been done on this project in about 16 months, so if you're interested in picking up OCR effort on this project, this certainly shouldn't stop you.

@jazzido jazzido closed this Nov 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants