This serverless application extracts and processes power converter specifications from various manufacturers using Azure Functions.
The application uses Azure Durable Functions to orchestrate a multi-step extraction and processing pipeline:
1. **Scrape Series**: Extract product series information
2. **Scrape Products**: Extract detailed product information
3. **Download PDFs**: Retrieve and store product datasheets
4. **Extract PDF Data**: Extract text content from PDFs
5. **Extract Structured Data**: Parse structured data using AI
6. **Validate Data**: Validate and consolidate extracted data
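For orientation, a chained Durable Functions orchestrator for this pipeline might look like the sketch below. The activity names (`scrape_series`, `download_pdfs`, and so on) are illustrative assumptions, not necessarily the app's actual function names:

```python
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Input is the JSON config described below, e.g.
    # {"manufacturer": "recom", "product_types": [...]}
    config = context.get_input()

    # Each pipeline step runs as an activity function, chained in order.
    series = yield context.call_activity("scrape_series", config)
    products = yield context.call_activity("scrape_products", series)
    pdfs = yield context.call_activity("download_pdfs", products)
    text = yield context.call_activity("extract_pdf_data", pdfs)
    structured = yield context.call_activity("extract_structured_data", text)
    validated = yield context.call_activity("validate_data", structured)
    return validated


main = df.Orchestrator.create(orchestrator_function)
```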
The pipeline can be triggered in two ways:
**1. Blob trigger:** Upload a JSON configuration file under the `config/` path of the `power-converter-data` container to trigger the pipeline automatically:
```json
{
  "manufacturer": "recom",
  "product_types": ["dc-dc-converters", "ac-dc-power-supplies"]
}
```
This triggers the `blob_trigger` function, which starts the orchestrator.
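For example, the config blob can be uploaded with the `azure-storage-blob` SDK; the `config/recom.json` blob name here is just an illustrative choice:

```python
import json
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
config = {
    "manufacturer": "recom",
    "product_types": ["dc-dc-converters", "ac-dc-power-supplies"],
}
# Writing under config/ in the container is what fires blob_trigger.
blob = service.get_blob_client("power-converter-data", "config/recom.json")
blob.upload_blob(json.dumps(config), overwrite=True)
```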
**2. HTTP trigger:** Send a POST request to the `start_orchestrator` function with the following payload:
```json
{
  "manufacturer": "recom",
  "product_types": ["dc-dc-converters", "ac-dc-power-supplies"],
  "trigger_mode": "direct"
}
```
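A client call might look like this; the function-app URL and key are placeholders, and whether a key is required depends on the function's auth level:

```python
import requests

url = "https://<your-function-app>.azurewebsites.net/api/start_orchestrator"
payload = {
    "manufacturer": "recom",
    "product_types": ["dc-dc-converters", "ac-dc-power-supplies"],
    "trigger_mode": "direct",
}
# Durable Functions HTTP starters typically return a status-query payload.
response = requests.post(url, json=payload, params={"code": "<function-key>"})
print(response.status_code, response.text)
```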
Setting `trigger_mode` to `blob` will create a blob that triggers the pipeline instead of starting it directly:
```json
{
  "manufacturer": "recom",
  "product_types": ["dc-dc-converters", "ac-dc-power-supplies"],
  "trigger_mode": "blob"
}
```
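A plausible sketch of how `start_orchestrator` might branch on `trigger_mode` is shown below; the orchestrator name and the blob naming scheme are assumptions, not the app's actual code:

```python
import json
import os

import azure.durable_functions as df
import azure.functions as func
from azure.storage.blob import BlobServiceClient


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    payload = req.get_json()

    if payload.get("trigger_mode") == "blob":
        # Blob mode: write the config under config/ and let blob_trigger
        # start the orchestration, mirroring a manual upload.
        service = BlobServiceClient.from_connection_string(
            os.environ["AzureWebJobsStorage"]
        )
        container = os.environ.get("STORAGE_CONTAINER", "power-converter-data")
        blob = service.get_blob_client(
            container, f"config/{payload['manufacturer']}.json"
        )
        blob.upload_blob(json.dumps(payload), overwrite=True)
        return func.HttpResponse("Config blob written; blob_trigger starts the pipeline.")

    # Direct mode: start the orchestration immediately.
    client = df.DurableOrchestrationClient(starter)
    instance_id = await client.start_new("orchestrator", None, payload)
    return client.create_check_status_response(req, instance_id)
```

Blob mode is useful when you want every run, however initiated, to leave a config artifact in storage and flow through the same trigger path.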
The application uses the following environment variables:
- `AzureWebJobsStorage`: Azure Storage connection string
- `STORAGE_CONTAINER`: Storage container name (default: `power-converter-data`)
- `DOCUMENT_INTELLIGENCE_ENDPOINT`: Azure Document Intelligence API endpoint
- `DOCUMENT_INTELLIGENCE_KEY`: Azure Document Intelligence API key
- `OPENAI_API_KEY`: OpenAI API key
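For local development, these can go in a `local.settings.json`; the values below are placeholders, and `FUNCTIONS_WORKER_RUNTIME` is assumed to be `python`:

```json
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "STORAGE_CONTAINER": "power-converter-data",
    "DOCUMENT_INTELLIGENCE_ENDPOINT": "https://<your-resource>.cognitiveservices.azure.com/",
    "DOCUMENT_INTELLIGENCE_KEY": "<your-key>",
    "OPENAI_API_KEY": "<your-key>"
  }
}
```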
The pipeline supports the following manufacturers:
- RECOM Power
- Traco Power
- XP Power
The pipeline produces structured data in the following formats:
- CSV files for each step of the pipeline
- JSON files containing structured power converter specifications
- Content-addressable storage for PDFs and extracted text
All results are stored in the configured Azure Blob Storage container.
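"Content-addressable" here means blobs are keyed by a hash of their bytes, so the same PDF downloaded twice maps to the same blob and is stored only once. A minimal sketch of such a naming scheme, where the exact hash and path layout are assumptions:

```python
import hashlib


def content_address(data: bytes, extension: str = ".pdf") -> str:
    """Derive a deterministic blob path from file content so that
    identical files always map to the same blob name."""
    digest = hashlib.sha256(data).hexdigest()
    return f"pdfs/{digest}{extension}"


# Example: two downloads of the same datasheet yield the same path.
pdf_bytes = b"%PDF-1.4 example bytes"
print(content_address(pdf_bytes))  # pdfs/<sha256-hex>.pdf
```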