PyPDF2 failing to read unicode character #37

Closed
@SharmileeS

Description

I have a PDF whose text PdfFileReader is unable to read. Instead, this is the output:

u'\n˘ˇˆ˘ˇ˙˝˛˛˚˜ !!"#$%&"˝˛˝˘˛˘˛˚˙˘ˇ˝˛˘˛$\'(˘%˘ˇ˘ˆ˘)_)˛\'+,-)"˛./0"0!123˛"4˙"5)46)!6"˙˘˘˘,˘ˇˆ˙˙ˆ˝˛˚˜ !˘ˇˆ˙˝"" ˜#˝$˛˚˜ ˆ˙˝"" ˜ %˛˚˜ !˛˚ˇ!"#$%˘ˇ&ˆ˙˝˛˝ˆ˙&˚˝\'˛˚&\'()_ˇ+˙˝"" ˜#˝$˜#( ˛˚(ˇ+,˘˘˘ˇˆˆˆˇ,ˆ--ˆˇˇ˙˝˝% ˜)˜#_#˝$$˜  ˙ ˝_˛˚ˆ-&ˆ!ˆˇ&˘+$ˆ(˙˝+˚˜,!˛˚./&0ˆˆ+$ˆ(˙˝-˛-,&˘˝ˆ. ˚%˝% ˜)˜#\* ˜!˛˚&ˆˇ%ˆ!&(12+3ˇ˙˝,˜ˆ/˛˚%#"+3("ˆˇ.!ˆˇ43ˇ(˙-,&53ˇ6ˆˇ,˝˝% ˜)˜#\* ˜!˛˚(77777777777˜#( 0123& ˜"" ˜ %˛˚˜ 77777777777˜#( _ˆ_˛ ,4+#(56˝% ˜)˜#\* ˜!7  56 _˜ˆ(  %!_ˆ_˛ ˆ˙&˚˝\'586"ˇ+((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'()_&\'(_&\'()˘536((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'&\' &\'˜ ˜˙˚ˆ-",ˇˆˇ!ˆ-ˆ,ˆ&ˆ!ˆˇ&53ˇ6ˆˇ,(˙˚&ˆ!-ˇ!6ˆˇ,˘ 8-ˇˆ-˙˝˝% ˜)˜#_ ˜!7  ˛˚(˙˚9ˇˇˆ-6ˆˇ,:;ˇˇˆ-<ˆˆ-ˇ&\' ,,˘˘ˇˇˆ-(9ˆˇˆ-!˘ˇˆ9˘ˆˇ˘˘(\n\n'

This is the output of extractText(), and it does not throw any error message.
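
Because no exception is raised, a caller would have to detect the garbage itself. The following is a minimal sketch of one possible check, not part of PyPDF2: it flags text in which a large share of the non-whitespace characters fall in the Unicode "Spacing Modifier Letters" block (U+02B0 to U+02FF), which is where most of the symbols in the output above come from. The function name `looks_garbled` and the `threshold` value are hypothetical and only serve as illustration. The sketch assumes Python 3.

```python
# -*- coding: utf-8 -*-

# Bounds of the Unicode "Spacing Modifier Letters" block.
MODIFIER_BLOCK_START = 0x02B0
MODIFIER_BLOCK_END = 0x02FF


def looks_garbled(text, threshold=0.2):
    """Heuristically report whether extracted text looks mis-decoded.

    `threshold` is an assumed cutoff, not a value taken from any library:
    the text is flagged when more than this fraction of its non-whitespace
    characters are spacing modifier letters.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Nothing to judge, so do not flag empty or whitespace-only text.
        return False
    suspicious = sum(
        1 for c in chars
        if MODIFIER_BLOCK_START <= ord(c) <= MODIFIER_BLOCK_END
    )
    return suspicious / len(chars) > threshold


if __name__ == "__main__":
    sample = "\u02d8\u02c7\u02c6\u02d8\u02c7\u02d9\u02dd\u02db\u02db\u02da\u02dc !!"
    print(looks_garbled(sample))
    print(looks_garbled("Plain readable text"))
```

This is only a heuristic: text that legitimately uses many modifier letters (for example, phonetic transcriptions) could be flagged, and other kinds of mis-decoding would go unnoticed.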

A similar issue has been posted here:

http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python
I am using Windows, so the solution in the link is not helpful.

Metadata

    Labels

    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
