Distance Learning Systems Indiana
Automated OCR Contract Processing Pipeline
Intelligent multi-source data enrichment pipeline for online higher-education enrollment management, processing 7,117 student records across Salesforce, SharePoint, and Excel with automated contract matching and verification.
Automated Data Processing Pipeline
Excel Master Data Input
7,117 student records with 56 fields requiring enrichment
Salesforce API Contact Lookup
Name-based matching with fuzzy logic (80%+ similarity threshold)
Multi-Object Data Enrichment
Fetch from Loan_Account__c, Payment_Transaction__c, and Amortization_Schedule__c
SharePoint Folder Verification
Case-insensitive matching across 24,711 folders with name variation handling
Google Document AI OCR Processing
Automated SSN redaction and contract detail extraction
Master Ledger Update
Audit-ready contract ledger with 56 enriched fields and payment history
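The SharePoint folder verification step can be illustrated with a small sketch. It assumes the folder names have already been listed (for example through Microsoft Graph) and that a folder is named after the student in either "First Last" or "Last, First" order. Both are assumptions, and the helper names `normalize`, `build_folder_index`, and `find_folders` are hypothetical rather than taken from the production code.

```python
import re

def normalize(name: str) -> str:
    """Casefold, replace punctuation with spaces, and collapse whitespace,
    so 'SMITH, JANE ' and 'smith jane' compare equal."""
    cleaned = re.sub(r"[^\w\s]", " ", name.casefold())
    return " ".join(cleaned.split())

def build_folder_index(folder_names):
    """Map each normalized name to the original folder names that share it."""
    index = {}
    for folder in folder_names:
        index.setdefault(normalize(folder), []).append(folder)
    return index

def find_folders(index, first, last):
    """Try 'First Last', then 'Last First' (assumed naming variations)."""
    for candidate in (f"{first} {last}", f"{last} {first}"):
        hits = index.get(normalize(candidate))
        if hits:
            return hits
    return []

# Illustrative data only: UPPERCASE spreadsheet names vs. mixed-case folders.
folders = ["Jane Smith", "Doe, John", "maria GARCIA"]
index = build_folder_index(folders)
print(find_folders(index, "JANE", "SMITH"))
print(find_folders(index, "JOHN", "DOE"))
```

Building the index once keeps each lookup constant-time, which matters when tens of thousands of folders are checked against thousands of records.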
The Challenge
DLSI faced a three-system data fragmentation problem across master Excel spreadsheets (static student data with gaps), Salesforce (live loan management with 158 available fields per account), and SharePoint (24,711+ unorganized PDF folders with inconsistent naming). The real complexity was data quality rather than volume. Analysis revealed that 16.1% of students had name variations between systems (marriage, legal name changes, spelling differences), and case-sensitivity mismatches caused further matching failures.
Key Problems to Solve:
- Data Fragmentation: Critical information scattered across 3 disconnected systems
- Name Matching Complexity: 16.1% name variations (marriage, spelling differences) causing lookup failures
- Case Sensitivity Issues: UPPERCASE Excel vs Title Case SharePoint preventing automated matching
- Document Recovery: Only 22% initial contract availability due to naming inconsistencies
- Encrypted SSN Fields: Salesforce SSN encryption requiring complex name-based matching strategies
- Scalability: Manual processing couldn't handle 7,117+ records efficiently for bad debt sales
Our Solution
We developed an intelligent multi-source data enrichment pipeline that processes 7,117 student records across Salesforce, SharePoint, and Excel. The system uses multi-pass intelligent matching (exact → fuzzy with 80%+ similarity → First Name + SSN fallback) to handle name variations, title case conversion for SharePoint folder patterns, and a 4-tier data enrichment pipeline pulling from Contact, Loan_Account__c, Payment_Transaction__c, and Amortization_Schedule__c Salesforce objects.
The breakthrough was improving document recovery from 22% to 70.8% by solving case sensitivity and name variation challenges, then preparing 5,038 verified contracts for Google Document AI OCR processing with automated SSN redaction.
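A minimal sketch of the multi-pass matching order (exact, then fuzzy at 80%+, then first name plus SSN) is shown below. The similarity metric is not named in this case study, so Python's standard `difflib.SequenceMatcher` is used as a stand-in. The record layout (`name` and `ssn` keys) and the way the fallback compares SSN values are assumptions made for illustration, and the sample SSNs are dummy values.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.80  # the 80%+ similarity threshold described above

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def first_name(name: str) -> str:
    parts = name.split()
    return parts[0].casefold() if parts else ""

def match_contact(student, contacts):
    """Return (contact, pass_name); contact is None when nothing matches."""
    target = student["name"].casefold().strip()

    # Pass 1: exact match, ignoring case.
    for contact in contacts:
        if contact["name"].casefold().strip() == target:
            return contact, "exact"

    # Pass 2: best fuzzy candidate at or above the threshold.
    best = max(contacts, default=None,
               key=lambda c: similarity(student["name"], c["name"]))
    if best is not None and similarity(student["name"], best["name"]) >= FUZZY_THRESHOLD:
        return best, "fuzzy"

    # Pass 3 (assumed form of the fallback): same first name and same SSN.
    for contact in contacts:
        if (contact.get("ssn") and contact.get("ssn") == student.get("ssn")
                and first_name(contact["name"]) == first_name(student["name"])):
            return contact, "first_name_ssn"

    return None, "unmatched"

# Illustrative records only.
contacts = [
    {"name": "Jane Smith", "ssn": "000-00-0001"},
    {"name": "Katherine Jones", "ssn": "000-00-0002"},
    {"name": "Maria Lopez", "ssn": "000-00-0003"},
]
for student in (
    {"name": "JANE SMITH", "ssn": "000-00-0001"},       # case difference
    {"name": "Katharine Jones", "ssn": "000-00-0002"},  # spelling variation
    {"name": "Maria Garcia", "ssn": "000-00-0003"},     # surname change
):
    contact, how = match_contact(student, contacts)
    print(student["name"], "->", contact["name"] if contact else None, f"({how})")
```

Ordering the passes from strictest to loosest means a looser rule is only consulted when a stricter one fails, and recording which pass produced each match gives reviewers a way to audit the riskier ones.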
Key Features Built:
Intelligent Name Matching Engine
Multi-pass fuzzy matching (80%+ threshold) with title case conversion, resolving 16.1% name variations.
4-Tier Salesforce Integration
Automated enrichment from Contact, Loan Account, Payment Transaction, and Amortization Schedule objects.
SharePoint Document Verification
Scanned 24,711 folders with case-insensitive matching, achieving 70.8% document recovery (up from 22%).
Payment History Generation
Automated 12-month payment history in bad debt sale format for 100% of records.
Google Document AI OCR
Automated SSN redaction and contract detail extraction for 5,038 verified PDFs.
Real-Time Quality Assurance
Progress tracking with backup snapshots, comprehensive error logging, and field-level statistics.
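The SSN redaction feature can be approximated with a regular expression applied to OCR output. The sketch below assumes the OCR text is already available as a plain string and that SSNs appear as nine digits with optional hyphens or spaces. The pattern and the `redact_ssns` helper are illustrative, not the production rules.

```python
import re

# Nine digits grouped 3-2-4, with optional hyphen or whitespace separators.
SSN_PATTERN = re.compile(r"\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b")
MASK = "XXX-XX-XXXX"

def redact_ssns(text):
    """Return (redacted_text, number_of_redactions)."""
    return SSN_PATTERN.subn(MASK, text)

# Illustrative OCR text with a dummy SSN.
sample = "Borrower SSN: 123-45-6789, loan amount $4,500.00, phone 317-555-0100"
redacted, count = redact_ssns(sample)
print(redacted)
print(count)
```

A pattern this broad will also mask any other bare nine-digit number, which is usually the safer failure mode for a redaction step. The redaction count can be logged for the audit trail.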
Technology Stack:
- OCR & Document Processing: Google Cloud Document AI, Python 3.x with regex pattern extraction
- Data Integration: Salesforce REST API (4 object types: Contact, Loan_Account__c, Payment_Transaction__c, Amortization_Schedule__c)
- SharePoint: Microsoft Graph API with OAuth 2.0 authentication, 24,711+ folder scanning capability
- Backend Processing: Python 3.x (pandas, openpyxl, requests), fuzzy name matching with 80%+ similarity threshold
- Data Storage: Excel/XLSX for master ledger (56 mapped fields), JSON for audit trails
- Performance: Batch processing with rate limiting (1.6-6.0 records/sec), automated backup snapshots
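Batch processing with rate limiting can be sketched as follows. The `max_per_sec` and `batch_size` parameters and the `handler` callback are hypothetical names. The default of 6.0 simply borrows the upper end of the throughput range quoted above, which is an assumption about how the limit was configured.

```python
import time

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_with_rate_limit(records, handler, max_per_sec=6.0, batch_size=50,
                            sleep=time.sleep, clock=time.monotonic):
    """Call handler(record) for every record, never faster than max_per_sec.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    min_interval = 1.0 / max_per_sec
    results = []
    for batch in batched(records, batch_size):
        for record in batch:
            started = clock()
            results.append(handler(record))
            elapsed = clock() - started
            if elapsed < min_interval:
                sleep(min_interval - elapsed)
        # A backup snapshot of `results` could be written here after each batch.
    return results

# Tiny demonstration with a stand-in handler instead of a real API call.
print(process_with_rate_limit(["a", "b", "c"], str.upper, max_per_sec=50.0))
```

Pacing each call individually keeps the request rate smooth, while the batch boundary gives a natural point for checkpoints.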
Results & Impact
The pipeline processed 7,117 records, matched 6,117 Salesforce contacts (86% success rate), and verified 5,038 contracts in SharePoint without manual intervention. Document availability rose from 22% to 70.8% because intelligent matching resolved the 16.1% of names that varied between systems (marriages, legal name changes). The system eliminated hours of daily manual verification work, ensured zero data leakage through proper authentication and SSN handling, and provided DLSI's operations team with an audit-ready contract ledger with 56 enriched fields.
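The headline percentages follow directly from the record counts. The check below assumes both rates use the full 7,117 records as the denominator, which is consistent with the figures quoted.

```python
total_records = 7117
matched_contacts = 6117
verified_contracts = 5038

match_rate = matched_contacts / total_records
recovery_rate = verified_contracts / total_records

print(f"Salesforce match rate: {match_rate:.1%}")    # 85.9%, reported as 86%
print(f"Document recovery:     {recovery_rate:.1%}")  # 70.8%
```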