# DSS Capacity Plan — Recording

## Executive Summary
## Current Processing Capacity and Bottlenecks

The system was initially designed to handle 2,400 invoices daily but now processes approximately 3,000-3,100, leading to persistent backlogs (e.g., 10,000 invoices in data acquisition). This prevents meeting SLAs for both bill-paying customers and historical data. Key constraints include Microsoft's throttling thresholds (exceeding them triggers multi-day slowdowns) and unoptimized batch processing that risks overwhelming capacity during peak web-upload activity.

### Capacity Management Concerns

- **Threshold risks**: Increasing batch frequency (e.g., from every 5 minutes to every 3) could breach Microsoft's limits, causing system-wide slowdowns lasting days due to cascading timeouts and reprocessing.
- **Web-upload impact**: Direct uploads (e.g., via FDG Connect) bypass queues and consume reserved capacity, so a buffer is needed to avoid tripping thresholds during high-volume periods.

---

## Optimization Strategies for Increased Throughput

Two primary solutions were evaluated to address capacity limitations: reserved Microsoft capacity and leveraging underutilized non-production resources.

### Reserved Capacity Analysis

- **Cost-benefit trade-offs**: Transitioning to reserved TPUs (dedicated compute) requires a monthly commitment, but current volumes (~3,000/day) fall well below the breakeven point (~12,000/day), making it economically unviable for now.
- **Performance metrics**: Token utilization in Azure must be analyzed to determine safe scaling margins (e.g., operating at 70% of capacity leaves headroom for incremental frequency adjustments).

### Non-Production Resource Utilization

- **Dev/test environment use**: Non-production subscriptions offer identical, underused capacity. Redirecting a portion of processing (e.g., historical invoices) to these environments could increase daily throughput by ~50% (from 3,000 to 4,500 invoices).
- **Implementation complexity**: Requires adding a discriminator to route invoices and modifying tracking logic to fetch results from the correct environment; estimated as a moderate technical lift.

---

## Technical Improvements and System Refactoring

Optimizing requests and addressing processing failures emerged as critical priorities.

### Batch Processing Efficiency

- **Microsoft's feedback**: Sending single-row files increases overhead; batching multiple files per request could improve performance but would disrupt the current one-to-one error-tracking framework.
- **Feasibility assessment**: Refactoring for batched inputs is deprioritized due to high effort and marginal gains, as latency would simply shift from processing to queue wait times.

### Error Handling and Retry Mechanisms

- **Transient failures**: SQL timeouts (Error #3) affect ~342 invoices; implementing exponential-backoff retries at the database-transaction level would automate recovery without manual intervention.
- **Operational gaps**: 25,000 unaddressed failure emails highlight workflow breakdowns; fixes include linking DSS statuses to legacy-system completion flags so resolved cases can be filtered out.

---

## System Integration and Workflow Disconnects

Misalignments between DSS, DDIs, and UBM cause invoices to stall, with 15,000+ items stuck in "waiting for operator" status due to filtering inaccuracies and missing client/vendor data.

### Data Consistency Challenges

- **Upstream data gaps**: BDE files lack client/vendor codes, causing pre-audit failures. Legacy-system inconsistencies (e.g., duplicate client entries) further prevent automated matching.
- **Operational process fixes**: Manual intervention is required for unresolved failures, but email alerts are being ignored; reactivating notifications and clarifying ops workflows is urgent.

### Cross-Platform Synchronization

- **Status tracking**: DSS does not reflect UBM/legacy-system completions.
Adding a column to show legacy-system status would allow bulk-hiding processed invoices, clearing operational backlogs.
- **Filtering improvements**: Current filters display completed and failed invoices indiscriminately; refining them would provide accurate "action required" visibility.

---

## Long-Term Architectural Evolution

Decoupling from legacy systems via a rearchitected output service is proposed to accelerate processing and reduce dependencies.

### Output Service Modernization

- **Direct UBM integration**: Migrating output generation from the legacy system to DSS would use Cosmos DB data directly, skipping 7-10 legacy workflow steps and enabling near-real-time UBM feeds.
- **Fallback mechanism**: Failures would reroute to the legacy system, ensuring continuity while isolating DSS for most invoices.
- **Strategic impact**: This shift positions DSS as a standalone platform, critical for future scalability and operational resilience.

### Hybrid Transition Approach

- **Coexistence phase**: Legacy and DSS output pathways would operate in parallel, allowing incremental validation. Legacy dependencies would reduce to error handling only, minimizing bottlenecks.
- **Implementation clarity**: The output service code requires minimal adjustments to use DSS data instead of legacy sources, avoiding major redevelopment.
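The exponential-backoff retry proposed under Error Handling and Retry Mechanisms is a standard pattern; the sketch below shows roughly what it could look like at the database-transaction level. All names here (`TransientDbError`, `with_retries`, the delay parameters) are illustrative assumptions, not the actual DSS code.

```python
import random
import time


class TransientDbError(Exception):
    """Placeholder for a transient failure such as a SQL timeout (Error #3)."""


def with_retries(operation, max_attempts=5, base_delay=1.0):
    """Run `operation`, retrying transient failures with exponential backoff.

    The delay doubles on each attempt (1s, 2s, 4s, ...) with a little random
    jitter so that concurrent retries do not all hit the database at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDbError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure for manual review
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Wrapping each database transaction this way would let invoices that hit transient SQL timeouts recover automatically instead of generating failure emails for the ops team.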
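The fallback mechanism described under Output Service Modernization amounts to a try-primary-then-reroute pattern. A minimal sketch, assuming hypothetical `generate_output_dss` and `generate_output_legacy` callables standing in for the two pathways:

```python
def generate_invoice_output(invoice, generate_output_dss, generate_output_legacy):
    """Try the rearchitected DSS output path first; reroute to legacy on failure.

    Returns a (result, pathway) tuple so callers can see which path handled
    the invoice, which helps validate the two pathways while they coexist.
    """
    try:
        return generate_output_dss(invoice), "dss"
    except Exception:
        # Any DSS-side failure falls back to the legacy workflow, preserving
        # continuity while keeping most invoices off the legacy system.
        return generate_output_legacy(invoice), "legacy"
```

During the coexistence phase, the recorded pathway could be logged per invoice to measure how much traffic still depends on the legacy system.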
## Key Topics

## Decisions

No decisions recorded.

## Action Items (0/0 done)

No action items recorded.