wanted: PDF analyzer (Bugzilla #17751) #492

Open
@vladak

status ACCEPTED severity enhancement in component analyzer for ---
Reported in version unspecified on platform ANY/Generic
Assigned to: Lubos Kosco

On 2011-01-20 10:12:05 +0000, Vladimir Kotal wrote:

A PDF analyzer would be beneficial, e.g. to search design documents together with source code (by selecting a project with the source code and a "project" with the design documents).

On 2011-02-15 13:54:44 +0000, Lubos Kosco wrote:

We could reuse PDFBox: http://pdfbox.apache.org/

After all, the old OpenGrok (arcs), which is still used for psarcs, had it like that...

Forward-port? :-D

On 2011-02-15 13:59:43 +0000, Lubos Kosco wrote:

An alternative is to use Tika, with PDFBox underneath, and gain a myriad of supported formats for Lucene:

http://tika.apache.org/0.8/formats.html

(PDF, (Open)Office, mbox, RTF, audio/video metadata, alternatively a Java class and JAR parser; it also has a compressed-file parser, which could be used to satisfy bug 343.)

I have a feeling this might be one of the major features for the next version! :)
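
Whichever library ends up doing the text extraction, the indexer first has to recognize that a file is a PDF before handing it to a PDF analyzer. A minimal sketch of that check, assuming a hypothetical helper class `PdfSniffer` (not part of OpenGrok, PDFBox, or Tika) and relying only on the fact that PDF files start with the ASCII header `%PDF-`:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not an OpenGrok, PDFBox, or Tika class. */
final class PdfSniffer {
    // PDF files begin with the ASCII bytes "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {
    }

    /** Returns true if the given leading bytes of a file start with the PDF header. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to the header length from the stream and checks it for the PDF header. */
    static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int off = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (off < head.length) {
            int n = in.read(head, off, head.length - off);
            if (n < 0) {
                break;
            }
            off += n;
        }
        return off == head.length && looksLikePdf(head);
    }
}
```

A file that passes this check could then be routed to a PDFBox- or Tika-based extractor, and everything else left to the existing analyzers. A header-only check like this is deliberately strict: it would reject PDFs that have junk bytes before the header.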

On 2011-03-15 07:29:16 +0000, Lubos Kosco wrote:

For ODF formats we also have:
http://odftoolkit.org/
