This project identifies the hierarchical layout of a document.
- Web Service: `app.py`
- Executors: `src/pipeline_executor.py`
- HTML Splitter and Reader: `src/html_splitter.py` and `src/html_line_parser.py`
- Classifiers: `src/classifiers_executor.py`
- Validators: `src/validators_executor.py`
- Response Generators: `src/json_gen_executor.py`
- `classifiers` package: contains the training, testing, and classification logic and algorithms.
- `datasets` directory: contains the training datasets for each model.
- `models` package: holds the trained and saved models.
- `feature_extractors` package: holds custom-engineered features for the respective classifiers.
- `lookup_dictionaries` package: holds keyword-based features.
- `bullet_templates`: implementation of the Strategy design pattern, used to find the bullet style of a text.
- `objects`: Python object classes used during JSON response generation.
- `tests`: unit tests for the document hierarchy identification code.
- `utils`: holds constants and other utility methods.
- `setup_log.py`: loader for the logging configuration file.
- `logConfig.yaml`: logging configuration file used throughout the code.
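As a rough illustration of how a Strategy-based bullet detector can be organized, here is a minimal sketch. It is not the code in `bullet_templates`: the class names, the `find_bullet_style` function, the style names, and the regular expressions are all hypothetical and chosen only for this example.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional, Sequence


class BulletTemplate(ABC):
    """One strategy: recognizes a single bullet style (hypothetical interface)."""

    name: str

    @abstractmethod
    def matches(self, text: str) -> bool:
        """Return True if the text starts with this bullet style."""


class RegexBulletTemplate(BulletTemplate):
    """Concrete strategy backed by a regular expression (an assumption)."""

    def __init__(self, name: str, pattern: str) -> None:
        self.name = name
        self._pattern = re.compile(pattern)

    def matches(self, text: str) -> bool:
        # Leading whitespace is ignored before testing the bullet prefix.
        return self._pattern.match(text.lstrip()) is not None


# Illustrative styles only; more specific patterns are tried first.
TEMPLATES = [
    RegexBulletTemplate("decimal_nested", r"\d+(\.\d+)+\.?\s"),  # "1.2 Term"
    RegexBulletTemplate("decimal", r"\d+[.)]\s"),                # "1. Term"
    RegexBulletTemplate("parenthesized_alpha", r"\([a-z]\)\s"),  # "(a) Term"
    RegexBulletTemplate("roman", r"[ivxlcdm]+[.)]\s"),           # "iv. Term"
]


def find_bullet_style(
    text: str, templates: Sequence[BulletTemplate] = TEMPLATES
) -> Optional[str]:
    """Return the name of the first matching template, or None if none match."""
    for template in templates:
        if template.matches(text):
            return template.name
    return None
```

With this layout, supporting a new bullet style means adding one more template object; the calling code does not change.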
- `/holmes4business/contract_intel/v2/sectionExtract`: external API for paragraph extraction.
- `/holmes4business/contract_intel/v2/flatResponse`: internal use only, to view the flat response.
- `/holmes4business/contract_intel/v2/lineidsResponse`: external API for paragraph extraction with sub_section-level line IDs.
- `/holmes4business/contract_intel/v2/htmlResponse`: external API that accepts input HTML and returns it with `<h1>` and `<h2>` tags added after identifying headings and subheadings.
The REST API takes the HTML content of the file as the request body and a `Client` key in the request headers, set to one of the values listed below.
Based on the `Client` passed, the pipeline dynamically loads the matching models and runs the client-specific features.
The following values are permitted for `Client`:
- `erm`: process erm files
- `t_mobile`: process t_mobile files
- `isda`: process isda files
- `bestbuy`: process bestbuy files
- `generic`: process any new file
Note: If no `Client` is passed in the headers, the generic models are loaded by default.
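A request to the paragraph extraction endpoint could be assembled roughly as follows. This is a sketch under stated assumptions: the host and port are hypothetical, and the POST method and `Content-Type` value are guesses, since the README only specifies an HTML body and a `Client` header.

```python
import urllib.request

# Hypothetical host and port; the README does not say where the service runs.
BASE_URL = "http://localhost:5000"
SECTION_EXTRACT_PATH = "/holmes4business/contract_intel/v2/sectionExtract"


def build_section_extract_request(
    html: str, client: str = "generic"
) -> urllib.request.Request:
    """Build (but do not send) a request with the HTML body and Client header."""
    return urllib.request.Request(
        url=BASE_URL + SECTION_EXTRACT_PATH,
        data=html.encode("utf-8"),
        headers={
            "Client": client,  # one of: erm, t_mobile, isda, bestbuy, generic
            "Content-Type": "text/html; charset=utf-8",  # assumed, not documented
        },
        method="POST",  # assumed, not documented
    )


# Sending the request needs a running instance of the service:
# request = build_section_extract_request("<html>...</html>", client="erm")
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```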
- `helpers/bulk_json_generator.py`: internal API to bulk-process HTML files for paragraph extraction. Needs the `folder_path` of the HTML files and a `Client`. Generates and writes a `JSON Response`.
- `helpers/bulk_excel_generator.py`: internal API to bulk-process HTML files for paragraph extraction. Needs the `folder_path` of the HTML files and a `Client`. Generates and writes an `Excel_Response`.
- `helpers/bulk_doc_title_extractor.py`: internal API to bulk-process HTML files for document title extraction. Needs the `folder_path` of the HTML files and a `Client`. Generates and writes an `Excel_Response`.
- `helpers/model_dataset_evaluator.py`: evaluates a custom model on a custom dataset. Generates precision, recall, and F1 scores.
- `helpers/bulk_doctype_runner.py`: internal API to bulk-process OCR HTML files and generate docType HTML files.
- `helpers/bulk_html_generator.py`: internal API to bulk-process input HTML files and convert them to HTML files with an `<h1>` tag for each heading and an `<h2>` tag for each subheading.
- `helpers/bulk_ocr_runner.py`: internal API to bulk-process PDFs through OCR and generate HTML files.
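For reference, the three scores reported by the evaluator are conventionally computed as in the sketch below. This shows the standard definitions for a single positive label, not the code in `helpers/model_dataset_evaluator.py`; the function name and the example labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one positive label."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    false_pos = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    false_neg = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    # Each score falls back to 0.0 when its denominator would be zero.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels for four lines of a document.
expected = ["heading", "heading", "other", "other"]
predicted = ["heading", "other", "heading", "other"]
scores = precision_recall_f1(expected, predicted, positive_label="heading")
```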
The API returns a JSON response with the following keys:
- `Paragraphs`: holds the content of the posted document in a structured format.
- `Document_Titles`: holds the identified document title.
- `others`: holds the texts that were filtered out while generating the structured format.
{
"Paragraphs": [{
"Clause": {
"Main-heading": "SITE LEASE AGREEMENT",
"Sub-heading": "1. Property Description .",
"Sub-section": [
" Landlord is the owner of the real property located at Error! Reference source not found. , Error! Reference source not found. as further described on Exhibit A (the Property ). The Property includes the premises which is comprised of approximately Error! Reference source not found. square feet plus any additional portions of the Property which Tenant may require for the use and operation of its facilities as generally described on Exhibit B (the Premises ). Tenant reserves the right to update the description of the Premises on Exhibit B to reflect any modifications or changes."
]
},
"Page_Style": "width:815.79596px;height:1055.736px;overflow:hidden;",
"Document_Type": "exhibit",
"Bottom_Right": "top:327.82468px;left:336.27863px",
"Top_Left": "top:230.2091px;left:90.08947px",
"Page_Number": "['page_0']",
"Tag_No": "['11', '12', '13', '14', '15', '16', '17', '18', '19', '20']",
"File_Name": "\n Standard_Lease_Template.pdf\n ",
"Priority": "5"
}]
}
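In the sample response, several values are strings and need a second parsing step: `Tag_No` and `Page_Number` hold stringified lists, `Priority` is a numeric string, and `File_Name` carries surrounding whitespace. The following sketch shows one way a consumer might normalize them. The function name and the output field names are hypothetical, and it assumes every response follows the shape of the sample.

```python
import ast
import json


def summarize_paragraphs(response_text):
    """Flatten each entry under "Paragraphs" into a dict with parsed fields."""
    payload = json.loads(response_text)
    summaries = []
    for entry in payload.get("Paragraphs", []):
        clause = entry.get("Clause", {})
        summaries.append({
            "main_heading": clause.get("Main-heading"),
            "sub_heading": clause.get("Sub-heading"),
            "sub_sections": [text.strip() for text in clause.get("Sub-section", [])],
            # "['11', '12']" is a string; literal_eval turns it into a real list.
            "tag_numbers": [int(tag) for tag in ast.literal_eval(entry.get("Tag_No", "[]"))],
            "pages": ast.literal_eval(entry.get("Page_Number", "[]")),
            "file_name": entry.get("File_Name", "").strip(),
            "priority": int(entry.get("Priority", "0")),
        })
    return summaries
```

`ast.literal_eval` is used rather than `eval` because it accepts only Python literals, which is safer for values that come from a service response.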