Experience the PDF Remediation solution developed at ASU AI Cloud Innovation Center. This innovative tool remediates PDF documents to meet WCAG 2.1 Level AA standards with tagging, metadata cleanup, and AI-powered alt-text generation, promoting digital accessibility for everyone.

Mathpix/PDF_Accessibility

 
 

PDF Accessibility Solutions

This repository provides two complementary solutions for PDF accessibility:

  1. PDF-to-PDF Remediation: Processes PDFs and maintains the PDF format while improving accessibility.
  2. PDF-to-HTML Remediation: Converts PDFs to accessible HTML format.

Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards.

Table of Contents

  • Architecture Overview: High-level overview illustrating component interactions
  • Automated One Click Deployment: How to deploy the project
  • Testing Your PDF Accessibility Solution: User guide for the working solution
  • PDF-to-PDF Remediation Solution: PDF format preservation solution details
  • PDF-to-HTML Remediation Solution: HTML conversion solution details
  • Monitoring: System monitoring and observability
  • Troubleshooting: Common issues and solutions
  • Contributing: How to contribute to the project

Architecture Overview

The following architecture diagram illustrates the various AWS components utilized to deliver the solution.

Architecture Diagram

Automated One Click Deployment

We provide a unified deployment script that deploys either solution, or both, with a single command. You choose the solution during deployment.

Prerequisites

Common Requirements:

  1. AWS Account with appropriate permissions to create and manage AWS resources
  2. AWS CloudShell access (AWS CLI is pre-installed and configured automatically)
    • Sign in to the AWS Management Console
    • In the top navigation bar, click the CloudShell icon (terminal symbol) next to the search bar
    • Wait for CloudShell to initialize (this may take a few moments on first use)
  3. Enable Amazon Bedrock model access in your AWS account: Nova-Pro for PDF-to-PDF remediation, Nova-Lite for PDF-to-HTML remediation
    • Request access to Amazon Bedrock through the AWS console if not already enabled
    • Navigate to the Amazon Bedrock console
    • Click "Model access" in the left navigation pane
    • Click "Manage model access"
    • Find Nova-Pro/Nova-Lite in the list and select the checkbox
    • Click "Save changes" and wait for access to be granted

Solution-Specific Requirements:

  • PDF-to-PDF:
    • Adobe API Access: An enterprise-level contract or a trial account (for testing) for Adobe's API is required.
  • PDF-to-HTML: AWS Bedrock Data Automation service access
    • Ensure you have access to create a Bedrock Data Automation project - usually present by default
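
The prerequisites above can be spot-checked from CloudShell before deploying. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the Bedrock call only lists Nova models offered in the current region, which does not by itself prove that model access has been granted.

```shell
#!/usr/bin/env bash
# Pre-flight sketch for CloudShell. Function names are illustrative only.

# Print the account ID and caller ARN for the active credentials.
show_identity() {
  aws sts get-caller-identity --query '[Account, Arn]' --output text
}

# List Amazon Nova model IDs offered in the current region.
# This shows availability only, not whether access has been granted.
list_nova_models() {
  aws bedrock list-foundation-models \
    --by-provider amazon \
    --query "modelSummaries[?contains(modelId, 'nova')].modelId" \
    --output text
}

# Usage:
#   show_identity
#   list_nova_models
```

If `list_nova_models` prints nothing, the region selected in CloudShell may not offer the Nova models, which is worth checking before running the deployment script.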

One-Click Deployment

Step 1: Open AWS CloudShell and Clone the Repository

git clone https://github.com/ASUCICREPO/PDF_Accessibility.git
cd PDF_Accessibility

Step 2: Run the Unified Deployment Script

chmod +x deploy.sh
./deploy.sh

Step 3: Follow the Interactive Prompts

The script will guide you through:

  1. Solution Selection: Choose PDF-to-PDF or PDF-to-HTML remediation
  2. Solution-Specific Setup:
    • PDF-to-PDF: Enter Adobe API credentials (stored securely in AWS Secrets Manager)
    • PDF-to-HTML: Automatic creation of Bedrock Data Automation project
  3. Automated Deployment: Real-time monitoring of the deployment progress
  4. Optional UI Deployment: After successful deployment of your chosen solution(s), you'll have the option to deploy a user interface as well

Step 4: Test Your Deployment

After successful deployment, the script provides specific testing instructions for your chosen solution.

Testing Your PDF Accessibility Solution

PDF-to-PDF Solution Testing

  1. Navigate to Your S3 Bucket

    • In the AWS S3 Console, find the bucket starting with pdfaccessibility-
    • This bucket was automatically created during deployment
  2. Create the Input Folder

    • Create a folder named pdf/ in the root of the bucket
    • This is where you'll upload PDFs for processing
  3. Upload Your PDF Files

    • Upload any PDF file(s) to the pdf/ folder
    • Bulk Processing: You can upload multiple PDFs to the pdf/ folder for batch remediation
    • The process automatically triggers when files are uploaded
  4. Monitor Processing

    • Temporary Files: A temp/ folder will be created containing intermediate processing files
    • Final Results: A result/ folder will be created with your accessibility-compliant PDF files
    • Use the CloudWatch dashboard to monitor processing progress
  5. Download Results

    • Navigate to the result/ folder to access your remediated PDFs
    • Remediated files keep their original names, with a "COMPLIANT" prefix added
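
The console steps above can also be run from CloudShell with the AWS CLI. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes exactly one bucket in the account starts with `pdfaccessibility-`.

```shell
#!/usr/bin/env bash
# CLI sketch of the PDF-to-PDF test flow. Function names are illustrative only.

# Return the first bucket whose name starts with the given prefix.
find_bucket() {
  aws s3api list-buckets \
    --query "Buckets[?starts_with(Name, '$1')].Name | [0]" \
    --output text
}

# Create the pdf/ input folder (a zero-byte object whose key ends in "/").
create_input_folder() {
  aws s3api put-object --bucket "$1" --key "pdf/"
}

# Upload every .pdf file from a local directory to the pdf/ folder.
upload_pdfs() {
  aws s3 cp "$2" "s3://$1/pdf/" --recursive --exclude "*" --include "*.pdf"
}

# List the remediated files written to result/.
list_results() {
  aws s3 ls "s3://$1/result/"
}

# Download everything under result/ to a local directory.
download_results() {
  aws s3 sync "s3://$1/result/" "$2"
}

# Usage:
#   bucket="$(find_bucket pdfaccessibility-)"
#   create_input_folder "$bucket"
#   upload_pdfs "$bucket" ./my-pdfs
#   list_results "$bucket"
#   download_results "$bucket" ./remediated
```

Processing is asynchronous, so `list_results` may return nothing until the workflow finishes.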

PDF-to-HTML Solution Testing

  1. Navigate to Your S3 Bucket

    • In the AWS S3 Console, find the bucket starting with pdf2html-bucket-
    • This bucket was automatically created during deployment
  2. Upload Your PDF Files

    • Navigate to the uploads/ folder (created automatically during deployment) and upload your PDF file(s) there
    • Bulk Processing: You can upload multiple PDFs to the uploads/ folder for batch remediation
    • The process automatically triggers when files are uploaded
  3. Monitor Processing

    • Two folders will be created automatically:
      • output/: Contains temporary processing data and intermediate files
      • remediated/: Contains the final remediated results
  4. Access Your Results

    • Navigate to the remediated/ folder
    • Download the zip file named final_{your-filename}.zip
  5. Explore the Remediated Content

    The downloaded zip file contains:

    • remediated.html: Final accessibility-compliant HTML version
    • result.html: Original HTML conversion (before remediation)
    • images/ folder: Extracted images with generated alt text
    • remediation_report.html: Detailed report of accessibility improvements made
    • usage_data.json: Processing metrics and usage statistics
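
The same flow can be sketched with the AWS CLI. The function names here are hypothetical, and the key helper assumes that `{your-filename}` in `final_{your-filename}.zip` means the uploaded file name without its `.pdf` extension, which the steps above do not confirm.

```shell
#!/usr/bin/env bash
# CLI sketch of the PDF-to-HTML test flow. Function names are illustrative only.

# Key of the result archive for an uploaded file.
# ASSUMPTION: {your-filename} is the file name without the .pdf extension.
remediated_zip_key() {
  name="$(basename "$1")"
  printf 'remediated/final_%s.zip\n' "${name%.pdf}"
}

# Upload one PDF to the uploads/ folder, which triggers processing.
upload_for_html() {
  aws s3 cp "$2" "s3://$1/uploads/"
}

# Download the result archive for a previously uploaded PDF.
fetch_remediated_zip() {
  aws s3 cp "s3://$1/$(remediated_zip_key "$2")" "$3"
}

# Usage (bucket name starts with pdf2html-bucket-):
#   upload_for_html "$bucket" ./report.pdf
#   fetch_remediated_zip "$bucket" report.pdf ./final_report.zip
#   unzip -l ./final_report.zip
```

If the download fails with a missing-key error, listing the folder with `aws s3 ls "s3://$bucket/remediated/"` shows the actual archive name.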

Advanced Usage

Redeployment

After initial deployment, you can redeploy using the CodeBuild project created during deployment:

aws codebuild start-build --project-name YOUR-PROJECT-NAME --source-version main

Or simply re-run the deployment script and choose the solution you want to redeploy.
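
If you do not remember the project name, it can be looked up first. The sketch below wraps the redeploy command and adds a status check; the function names are hypothetical, and the project name created by the deployment script is not stated here, so it is passed in as an argument.

```shell
#!/usr/bin/env bash
# Redeployment sketch. Function names are illustrative only.

# List the CodeBuild project names in the current region.
list_build_projects() {
  aws codebuild list-projects --query 'projects' --output text
}

# Start a build from the main branch and print the build ID.
start_redeploy() {
  aws codebuild start-build --project-name "$1" --source-version main \
    --query 'build.id' --output text
}

# Print the status of a build (for example IN_PROGRESS, SUCCEEDED, FAILED).
build_status() {
  aws codebuild batch-get-builds --ids "$1" \
    --query 'builds[0].buildStatus' --output text
}

# Usage:
#   list_build_projects
#   build_id="$(start_redeploy YOUR-PROJECT-NAME)"
#   build_status "$build_id"
```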

PDF-to-PDF Remediation Solution

Overview

This solution processes PDFs while maintaining the original PDF format. It uses AWS CDK to build infrastructure that splits PDFs into chunks, processes them via AWS Step Functions, and merges the results using ECS tasks.

Architecture

  • S3 Bucket: Stores input and processed PDFs
  • Lambda Functions: PDF splitting, merging, and accessibility checking
  • Step Functions: Orchestrates the processing workflow
  • ECS Fargate: Runs containerized processing tasks
  • CloudWatch Dashboard: Monitors progress and performance

Manual Deployment

For detailed manual deployment instructions, see our Manual Deployment Guide.

PDF-to-HTML Remediation Solution

Overview

This solution converts PDF documents to accessible HTML format while preserving layout and visual appearance. It leverages AWS Bedrock Data Automation for PDF parsing and uses a serverless Lambda architecture.

Architecture

  • S3 Bucket: Stores input PDFs and remediated HTML files
  • Lambda Function: Processes PDFs using a containerized accessibility utility
  • ECR Repository: Hosts the Docker image for Lambda
  • Bedrock Data Automation: Provides PDF parsing and extraction capabilities

Monitoring

PDF-to-PDF Solution

  • CloudWatch Dashboard: Automatically created during deployment
  • Step Functions Console: Monitor workflow executions
  • ECS Console: Track container task status

PDF-to-HTML Solution

  • Lambda Logs: /aws/lambda/Pdf2HtmlPipeline
  • S3 Events: Monitor file processing status
  • CloudWatch Metrics: Track function performance
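
The monitoring points above can also be queried from the command line. This is a sketch only: the function names are hypothetical, the log group name is the one listed above, `aws logs tail` requires AWS CLI v2, and the state machine ARN must be looked up because its name is not given here.

```shell
#!/usr/bin/env bash
# Monitoring sketch. Function names are illustrative only.

# Follow the PDF-to-HTML Lambda logs, starting from a relative time (default 15m).
tail_html_logs() {
  aws logs tail /aws/lambda/Pdf2HtmlPipeline --since "${1:-15m}" --follow
}

# Print log messages containing "ERROR" from the PDF-to-HTML Lambda.
recent_errors() {
  aws logs filter-log-events \
    --log-group-name /aws/lambda/Pdf2HtmlPipeline \
    --filter-pattern "ERROR" \
    --query 'events[].message' --output text
}

# List Step Functions state machines (name and ARN) for the PDF-to-PDF workflow.
list_state_machines() {
  aws stepfunctions list-state-machines \
    --query 'stateMachines[].[name, stateMachineArn]' --output text
}

# Show up to ten executions of a state machine with their status.
recent_executions() {
  aws stepfunctions list-executions --state-machine-arn "$1" --max-items 10 \
    --query 'executions[].[name, status, startDate]' --output text
}

# Usage:
#   tail_html_logs 1h
#   list_state_machines
#   recent_executions "<state-machine-arn>"
```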

Troubleshooting

Common Issues

AWS Credentials

  • Ensure AWS CLI is configured with appropriate permissions
  • Verify access to required AWS services (S3, Lambda, ECS, Bedrock)

Service Limits

  • Check AWS service quotas if deployment fails
  • Request additional Elastic IPs if needed through EC2 Service Quotas
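
Before requesting an increase, it can help to compare the current Elastic IP quota with what is already allocated. A minimal sketch, with hypothetical function names, assuming the relevant quota name contains the text "Elastic IP":

```shell
#!/usr/bin/env bash
# Quota check sketch. Function names are illustrative only.

# Print EC2 quotas whose name mentions Elastic IPs, with their current value.
# ASSUMPTION: the relevant quota name contains "Elastic IP".
elastic_ip_quotas() {
  aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Elastic IP')].[QuotaName, Value]" \
    --output text
}

# Print how many Elastic IP addresses are allocated in the current region.
allocated_elastic_ips() {
  aws ec2 describe-addresses --query 'length(Addresses)' --output text
}

# Usage:
#   elastic_ip_quotas
#   allocated_elastic_ips
```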

Build Failures

  • Check CodeBuild console for detailed error messages
  • Verify all prerequisites are met
  • Ensure Docker is available for PDF-to-HTML deployments

Solution-Specific Troubleshooting

PDF-to-PDF Issues

  • Verify Adobe API credentials are correct and active
  • Check CloudWatch logs for Lambda functions and ECS tasks
  • Ensure Nova-Pro Bedrock model access is granted

PDF-to-HTML Issues

  • Verify Bedrock Data Automation permissions
  • Check Lambda function logs in CloudWatch
  • Ensure Docker image was pushed to ECR successfully
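
To confirm that the Docker image reached ECR, the most recent push can be inspected. This is a sketch: the function names are hypothetical, and the repository name created by the deployment is not stated here, so it is listed first and then passed in.

```shell
#!/usr/bin/env bash
# ECR check sketch. Function names are illustrative only.

# List ECR repository names in the current region.
list_ecr_repositories() {
  aws ecr describe-repositories \
    --query 'repositories[].repositoryName' --output text
}

# Print the first tag and push time of the most recently pushed image.
latest_image_push() {
  aws ecr describe-images --repository-name "$1" \
    --query 'sort_by(imageDetails, &imagePushedAt)[-1].[imageTags[0], imagePushedAt]' \
    --output text
}

# Usage:
#   list_ecr_repositories
#   latest_image_push "<repository-name>"
```

A push time older than your last deployment suggests the image build or push step did not complete.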

Getting Help

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.

Acknowledgments

The PDF-to-HTML remediation functionality in this project is adapted from AWS Labs' Content Accessibility Utility on AWS. This version includes updates and enhancements tailored for integration within the PDF Accessibility backend.


Support

For questions, issues, or support:


Built by Arizona State University's AI Cloud Innovation Center (AI CIC)
Powered by AWS
