PDFExtract

PDFExtract is a Node.js application for extracting structured data from PDF documents. It provides a set of tools for processing text extracted from PDF files, including cleaning, formatting, and mapping to meaningful entities.

Installation

Clone the repository:

git clone https://github.com/RXGabriel/PDFExtract.git

Navigate to the project directory:

cd PDFExtract

Install dependencies:

npm install

Usage

To use PDFExtract, follow these steps:

Ensure you have Node.js installed on your system.
Place the PDF documents to be processed in the docs/ directory.
Run the application:

npm start

Features

Extract structured data from PDF documents.
Clean and format extracted text.
Map extracted data to meaningful entities (e.g., people, addresses).
Check regular expressions for safety (via safe-regex) before using them.
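
As an illustration of the regular expression safety feature, here is a minimal sketch of a guard that refuses to return a pattern unless a checker approves it. The names evaluateRegex, UnsafeRegexError, and naiveCheck are hypothetical and not taken from the project's util.js. The checker is passed in as an argument so the sketch runs without dependencies; naiveCheck is only a crude stand-in for what safe-regex does.

```javascript
// Hypothetical sketch, not the project's actual util.js.
class UnsafeRegexError extends Error {
  constructor(pattern) {
    super(`Unsafe regular expression: ${pattern}`);
    this.name = 'UnsafeRegexError';
  }
}

// Returns the pattern unchanged if the checker approves it, otherwise throws.
// `isSafe` is any function that takes a RegExp and returns a boolean.
function evaluateRegex(pattern, isSafe) {
  if (!isSafe(pattern)) throw new UnsafeRegexError(pattern);
  return pattern;
}

// Stand-in checker for demonstration only: flags a quantified group that is
// itself quantified, such as (a+)+. It is far cruder than safe-regex.
const naiveCheck = (re) => !/\([^)]*[+*][^)]*\)[+*]/.test(re.source);

// A simple line-matching pattern passes the check and is returned as-is.
const ok = evaluateRegex(/^.+,.+$/gm, naiveCheck);
console.log(ok.source);

// A nested-quantifier pattern is rejected.
try {
  evaluateRegex(/(a+)+$/, naiveCheck);
} catch (error) {
  console.log(error.name);
}
```

In the project itself, the function exported by safe-regex would presumably be passed as the checker in place of naiveCheck.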

Project Structure

src/: Contains the source code for the PDFExtract application.
- index.js: Main entry point for the application.
- textProcessorFacade.js: Facade for text processing operations.
- textProcessorFluentAPI.js: Fluent API for text processing.
- person.js: Class for representing a person entity.
- util.js: Utility functions for working with regular expressions.
test/: Contains unit tests for the application.
docs/: Directory for storing PDF documents to be processed.
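
To show how the fluent API and the person entity listed above might fit together, here is a minimal sketch. The method names (extractLines, splitFields, mapPerson, build), the Person fields, and the "name, city" line format are assumptions made for illustration; the actual classes in src/ may differ.

```javascript
// Hypothetical sketch: method names and the input format are assumptions,
// not the project's actual code.
class Person {
  constructor([name, city]) {
    this.name = name;
    this.city = city;
  }
}

class TextProcessorFluentAPI {
  #content;

  constructor(content) {
    this.#content = content;
  }

  // Keep only the substrings matched by the given global regex.
  extractLines(pattern) {
    this.#content = this.#content.match(pattern) || [];
    return this;
  }

  // Split each line into fields and trim surrounding whitespace.
  splitFields(separator = ',') {
    this.#content = this.#content.map((line) =>
      line.split(separator).map((field) => field.trim())
    );
    return this;
  }

  // Turn each array of fields into a Person instance.
  mapPerson() {
    this.#content = this.#content.map((fields) => new Person(fields));
    return this;
  }

  // Return the final result of the chain.
  build() {
    return this.#content;
  }
}

// Sample text standing in for what a PDF parser might return.
const text = 'Report\n  Ana Silva ,  Recife \nJoao Souza, Natal\n';

const people = new TextProcessorFluentAPI(text)
  .extractLines(/^.+,.+$/gm)
  .splitFields()
  .mapPerson()
  .build();

console.log(people);
```

Each step returns the same object, so the processing stages read as one chain, and a facade can hide that chain behind a single call.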

Scripts

npm start: Run the application.
npm test: Run unit tests.
npm run test:cov: Run tests with coverage report.

Dependencies

pdf-parse: Library for extracting text from PDF documents.
safe-regex: Library for checking the safety of regular expressions.
mocha, chai: Testing framework and assertion library.
nyc: Command-line client for the Istanbul code coverage tool.

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
