Commit d60fdcc
ELToulemonde authored and int3l committed
Adding pd.DataFrame as a possible output of image_to_data (madmaze#160)
* Adding pd.DataFrame as a possible output of image_to_data

**Motivation**
Since most data scientists use pandas.DataFrame on a daily basis, I find it a nice-to-have as a possible output for `image_to_data`.

**New dependencies**
No new dependencies. This feature is only available if pandas is already installed (to avoid requiring it), see `pandas_installed` (inspired by `numpy_installed`).

**Implementation**
Only the function `image_to_data` is changed, so the result is not perfectly clean and requires calling `StringIO`. I did it this way to avoid changing `run_and_get_output`.

*New code:*

Add a conditional import of pandas
```
from io import StringIO
pandas_installed = find_loader('pandas') is not None
if pandas_installed:
    import pandas as pd
```

Add DATAFRAME to the output formats
```
class Output:
    STRING = "string"
    BYTES = "bytes"
    DICT = "dict"
    DATAFRAME = "data.frame"
```

Add a call to pandas.read_csv, going through `StringIO`, to retrieve a data.frame in `image_to_data`
```
elif pandas_installed and output_type == Output.DATAFRAME:
    args.append(True)
    return pd.read_csv(StringIO(run_and_get_output(*args)), sep="\t")
```

* New line at the end

* Improve version according to comments

Added an explicit error in case pandas is not installed
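The optional-dependency pattern described in the message can be sketched independently of pytesseract. The example below is a minimal illustration, not the code from this commit: it uses `importlib.util.find_spec` from the standard library in place of `find_loader` (the `find_loader` helpers are deprecated in recent Python versions), and the names `OptionalDependencyMissing`, `is_installed`, and `require` are hypothetical.

```python
from importlib.util import find_spec


class OptionalDependencyMissing(EnvironmentError):
    """Hypothetical error type, similar in spirit to PandasNotSupported."""

    def __init__(self, package):
        super().__init__('Missing {} package'.format(package))
        self.package = package


def is_installed(package):
    """Return True if the top-level `package` can be located without importing it."""
    return find_spec(package) is not None


def require(package):
    """Raise a descriptive error when an optional package is unavailable."""
    if not is_installed(package):
        raise OptionalDependencyMissing(package)


# Evaluated once at import time, like the module-level flag in the commit.
pandas_installed = is_installed('pandas')
```

Checking for the package once at import time keeps pandas out of the hard requirements, while a call such as `require('pandas')` at the point of use turns a confusing `NameError` into an explicit message.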
1 parent 01edecc commit d60fdcc

src/pytesseract.py

Lines changed: 17 additions & 0 deletions
@@ -24,6 +24,11 @@
 if numpy_installed:
     from numpy import ndarray

+from io import StringIO
+pandas_installed = find_loader('pandas') is not None
+if pandas_installed:
+    import pandas as pd
+
 # CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
 tesseract_cmd = 'tesseract'
 RGB_MODE = 'RGB'
@@ -41,6 +46,12 @@ class Output:
     STRING = "string"
     BYTES = "bytes"
     DICT = "dict"
+    DATAFRAME = "data.frame"
+
+
+class PandasNotSupported(EnvironmentError):
+    def __init__(self):
+        super(PandasNotSupported, self).__init__('Missing pandas package')


 class TesseractError(RuntimeError):
@@ -348,6 +359,12 @@ def image_to_data(image,

     if output_type == Output.DICT:
         return file_to_dict(run_and_get_output(*args), '\t', -1)
+    elif output_type == Output.DATAFRAME:
+        if not pandas_installed:
+            raise PandasNotSupported()
+
+        args.append(True)
+        return pd.read_csv(StringIO(run_and_get_output(*args)), sep="\t")
     elif output_type == Output.BYTES:
         args.append(True)
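To illustrate the parsing step added in the last hunk, here is a small sketch of reading tab-separated text into a DataFrame through `StringIO`. It does not call Tesseract: `SAMPLE_TSV` is made-up data using a few column names in the style of Tesseract's TSV output, and `tsv_to_dataframe` is a hypothetical helper, not part of pytesseract.

```python
from io import StringIO

try:
    import pandas as pd
except ImportError:  # pandas stays optional, as in the commit
    pd = None

# Made-up TSV text for illustration only; real Tesseract output has more columns.
SAMPLE_TSV = "level\tconf\ttext\n5\t96\tHello\n5\t91\tworld\n"


def tsv_to_dataframe(tsv_text):
    """Parse tab-separated text into a DataFrame, or fail with a clear error."""
    if pd is None:
        raise EnvironmentError('Missing pandas package')
    # StringIO wraps the in-memory string so read_csv can treat it as a file.
    return pd.read_csv(StringIO(tsv_text), sep="\t")
```

With pandas installed, `tsv_to_dataframe(SAMPLE_TSV)` yields one row per TSV record with the header line as column names, which can then be filtered like any other DataFrame (for example, keeping only rows whose `conf` exceeds a threshold).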
