The goal is to extract tables from a set of scanned PDFs. We have addressed the problem in two ways. In the first approach, we convert the PDFs into images so that text can be extracted from them, and then count the horizontal lines in each image, because images that contain tables have more horizontal lines. If the count exceeds a certain threshold, we treat the image as containing a table and extract its text. The JSON files for this approach are stored in the line_count folder. In the second approach, we count the numerical values in an image, because images that contain tables tend to have more numerical values. The JSON files for the numeric value count approach are stored in the num_count folder.
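The two detection heuristics described above can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the threshold values, the binarized-image representation (a list of rows of 0/1 pixels), and the regular expression for numeric tokens are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import re

# Hypothetical thresholds; the values used in the project are not stated.
MIN_HORIZONTAL_LINES = 3
MIN_NUMERIC_TOKENS = 10

# Assumed definition of a "numerical value": optional sign, digits with
# optional thousands separators, optional decimal part, optional percent sign.
NUMERIC_TOKEN = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def count_horizontal_lines(binary_image, min_run_fraction=0.5):
    """Count horizontal lines in a binarized image (1 = dark, 0 = background).

    A row belongs to a line if its longest run of dark pixels covers at least
    min_run_fraction of the row width. Consecutive qualifying rows are merged
    so that a thick rule is counted once.
    """
    lines = 0
    in_line = False
    for row in binary_image:
        longest = current = 0
        for pixel in row:
            current = current + 1 if pixel else 0
            longest = max(longest, current)
        is_line_row = bool(row) and longest >= min_run_fraction * len(row)
        if is_line_row and not in_line:
            lines += 1
        in_line = is_line_row
    return lines


def count_numeric_tokens(text):
    """Count whitespace-separated tokens that look like numbers."""
    return sum(1 for token in text.split() if NUMERIC_TOKEN.match(token))


def looks_like_table_by_lines(binary_image, threshold=MIN_HORIZONTAL_LINES):
    """Line-count heuristic: enough horizontal lines suggests a table."""
    return count_horizontal_lines(binary_image) >= threshold


def looks_like_table_by_numbers(text, threshold=MIN_NUMERIC_TOKENS):
    """Numeric-count heuristic: enough numeric tokens suggests a table."""
    return count_numeric_tokens(text) >= threshold
```

In this sketch, the image is assumed to be already thresholded to black and white, and the text is assumed to come from whatever OCR step the project uses. Both thresholds would need tuning on real scans.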
We have also implemented a user interface through which a user can upload an image. The table is extracted as a JSON file and displayed as a table on an HTML page. We also post-process the extracted text to remove garbage words. To run the program, first install the required Python packages, then run flask run from the directory that contains app.py.
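The post-processing step mentioned above could look something like the following sketch. The rule used here (drop tokens that are mostly punctuation) and the names is_garbage, clean_text, and min_alnum_ratio are assumptions for illustration only; the project's actual definition of a garbage word is not described.

```python
def is_garbage(token, min_alnum_ratio=0.6):
    """Return True if a token looks like OCR noise.

    Assumed rule: a token is garbage when it has no letters or digits, or
    when letters and digits make up less than min_alnum_ratio of it.
    """
    alnum = sum(ch.isalnum() for ch in token)
    if alnum == 0:
        return True
    return alnum / len(token) < min_alnum_ratio


def clean_text(text):
    """Remove garbage tokens and rejoin the rest with single spaces."""
    return " ".join(token for token in text.split() if not is_garbage(token))
```

For example, clean_text("Revenue |~ 1,200 _-- ;; 45%") keeps "Revenue", "1,200", and "45%" and drops the punctuation-only tokens. A ratio-based rule like this is deliberately lenient toward numbers with separators or percent signs, since those are common in table cells.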