Distance Learning Systems Indiana

Automated OCR Contract Processing Pipeline

Intelligent multi-source data enrichment pipeline for online higher-education enrollment management, processing 7,117 student records across Salesforce, SharePoint, and Excel with automated contract matching and verification.

7,117 Records Processed
86% Salesforce Match Rate
70.8% Document Recovery
24,711 Folders Analyzed

Automated Data Processing Pipeline

1. Excel Master Data Input

7,117 student records with 56 fields requiring enrichment

2. Salesforce API Contact Lookup

Name-based matching with fuzzy logic (80%+ similarity threshold)

✓ 6,117 contacts matched (86% success rate)

3. Multi-Object Data Enrichment

Fetch from Loan_Account__c, Payment_Transaction__c, and Amortization_Schedule__c

✓ 5,234 loan accounts enriched (73.5%)

4. SharePoint Folder Verification

Case-insensitive matching across 24,711 folders with name variation handling

✓ 5,038 contracts verified (70.8% availability, up from 22%)

5. Google Document AI OCR Processing

Automated SSN redaction and contract detail extraction

✓ Ready for 5,038 contract PDFs

6. Master Ledger Update

Audit-ready contract ledger with 56 enriched fields and payment history

✓ 100% payment history generation (7,117 records)
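
The case-insensitive folder lookup in the SharePoint verification step could work along the following lines. This is a minimal sketch, not DLSI's actual code: the "Last, First" folder naming, the helper names (normalize_name, build_folder_index, find_folder), and the hard-coded folder list standing in for names fetched from SharePoint are all assumptions made for illustration.

```python
def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and case-fold, so 'SMITH,  JANE' == 'Smith, Jane'."""
    return " ".join(name.split()).casefold()


def build_folder_index(folder_names):
    """Map each normalized folder name to the original name (first one wins)."""
    index = {}
    for folder in folder_names:
        index.setdefault(normalize_name(folder), folder)
    return index


def find_folder(student_name, index):
    """Return the matching folder name, or None when no folder matches."""
    return index.get(normalize_name(student_name))


# Hypothetical data: Title Case folders, UPPERCASE spreadsheet names.
folders = ["Smith, Jane", "Doe, John", "O'Neil, Mary Ann"]
index = build_folder_index(folders)

print(find_folder("SMITH, JANE", index))
print(find_folder("o'neil,  mary ann", index))
print(find_folder("BROWN, ALEX", index))
```

Building the index once keeps each lookup constant-time, which matters when every one of several thousand records is checked against tens of thousands of folders.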

The Challenge

DLSI's data was fragmented across three systems: master Excel spreadsheets (static student data with gaps), Salesforce (live loan management with 158 available fields per account), and SharePoint (24,711+ unorganized PDF folders with inconsistent naming). The real complexity was not volume but data quality. Analysis revealed that 16.1% of students had name changes (marriage, legal name updates), and that spelling variations between systems and case-sensitivity mismatches caused further matching failures.

Key Problems to Solve:

  • Data Fragmentation: Critical information scattered across 3 disconnected systems
  • Name Matching Complexity: Name variations in 16.1% of records (marriage, spelling differences) causing lookup failures
  • Case Sensitivity Issues: UPPERCASE Excel vs Title Case SharePoint preventing automated matching
  • Document Recovery: Only 22% initial contract availability due to naming inconsistencies
  • Encrypted SSN Fields: Salesforce encrypts SSNs, so matching had to rely on complex name-based strategies
  • Scalability: Manual processing couldn't handle 7,117+ records efficiently for bad debt sales

Our Solution

We developed a multi-source data enrichment pipeline that processes 7,117 student records across Salesforce, SharePoint, and Excel. The system handles name variations with multi-pass matching (exact → fuzzy with 80%+ similarity → First Name + SSN fallback), converts names to title case to fit SharePoint folder patterns, and runs a 4-tier enrichment pipeline that pulls from the Contact, Loan_Account__c, Payment_Transaction__c, and Amortization_Schedule__c Salesforce objects.

The breakthrough was improving document recovery from 22% to 70.8% by solving case sensitivity and name variation challenges, then preparing 5,038 verified contracts for Google Document AI OCR processing with automated SSN redaction.
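
The first two matching passes can be sketched as follows. This is an illustration under stated assumptions rather than the production matcher: it uses Python's standard difflib.SequenceMatcher as the similarity measure (the source does not name the algorithm), the function names and sample contacts are hypothetical, and the First Name + SSN fallback is left out because the source does not describe how it works.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.80  # the 80% similarity threshold quoted in the text


def normalize(name: str) -> str:
    """Collapse whitespace and case-fold so case differences never block a match."""
    return " ".join(name.split()).casefold()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_contact(name, contacts):
    """Return (contact, pass_name); contact is None when nothing qualifies."""
    target = normalize(name)

    # Pass 1: exact match after normalization.
    for contact in contacts:
        if normalize(contact) == target:
            return contact, "exact"

    # Pass 2: best fuzzy candidate, accepted only at or above the threshold.
    best = max(contacts, key=lambda c: similarity(name, c), default=None)
    if best is not None and similarity(name, best) >= THRESHOLD:
        return best, "fuzzy"

    return None, "unmatched"


# Hypothetical contact names standing in for Salesforce Contact records.
contacts = ["John Smith", "Jane Smith", "Maria Garcia"]
for candidate in ("JANE SMITH", "Jon Smyth", "Zed Quux"):
    print(candidate, "->", match_contact(candidate, contacts))
```

Running the cheap exact pass first means the fuzzy comparison only runs for names that failed it, and recording which pass produced each match makes low-confidence matches easy to review later.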

Key Features Built:

Intelligent Name Matching Engine

Multi-pass fuzzy matching (80%+ threshold) with title case conversion, resolving the name variations found in 16.1% of records.

4-Tier Salesforce Integration

Automated enrichment from Contact, Loan Account, Payment Transaction, and Amortization Schedule objects.

SharePoint Document Verification

Scanned 24,711 folders with case-insensitive matching, achieving 70.8% document recovery (up from 22%).

Payment History Generation

Automated generation of a 12-month payment history, in bad debt sale format, for 100% of records.

Google Document AI OCR

Automated SSN redaction and contract detail extraction for 5,038 verified PDFs.

Real-Time Quality Assurance

Progress tracking with backup snapshots, comprehensive error logging, and field-level statistics.
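
Of the features above, SSN redaction is the easiest to illustrate in isolation. The sketch below masks SSN-shaped strings in text that has already been extracted from a contract; it does not call Google Document AI, and the pattern, the mask, and the redact_ssns name are assumptions for illustration. It only catches SSNs written with dash or space separators, so a real pipeline would likely need further patterns.

```python
import re

# Matches 3-2-4 digit groups separated by a dash or a space, e.g. 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")
MASK = "XXX-XX-XXXX"


def redact_ssns(text: str):
    """Return (redacted_text, number_of_redactions)."""
    return SSN_PATTERN.subn(MASK, text)


sample = "Borrower SSN: 123-45-6789, phone 555-123-4567"
redacted, count = redact_ssns(sample)
print(redacted)
print(count)
```

Returning the redaction count alongside the text gives the audit trail a simple per-document figure to log.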

Technology Stack:

  • OCR & Document Processing: Google Cloud Document AI, Python 3.x with regex pattern extraction
  • Data Integration: Salesforce REST API (4 object types: Contact, Loan_Account__c, Payment_Transaction__c, Amortization_Schedule__c)
  • SharePoint: Microsoft Graph API with OAuth 2.0 authentication, 24,711+ folder scanning capability
  • Backend Processing: Python 3.x (pandas, openpyxl, requests), fuzzy name matching with 80%+ similarity threshold
  • Data Storage: Excel/XLSX for master ledger (56 mapped fields), JSON for audit trails
  • Performance: Batch processing with rate limiting (1.6-6.0 records/sec), automated backup snapshots
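
Batch processing with rate limiting and snapshots, as listed under Performance, might be structured like the sketch below. Everything here is hypothetical: the process_in_batches function, its parameters, and the default of 6.0 records per second (the upper end of the quoted range) are illustrative choices, and the snapshot callback simply receives a copy of the results so far rather than writing a real backup file.

```python
import time


def process_in_batches(records, handler, batch_size=100, max_per_sec=6.0,
                       on_batch_done=None, sleep=time.sleep, clock=time.monotonic):
    """Apply handler to every record, capped at max_per_sec, one batch at a time."""
    min_interval = 1.0 / max_per_sec
    results = []
    for start in range(0, len(records), batch_size):
        for record in records[start:start + batch_size]:
            began = clock()
            results.append(handler(record))
            # Sleep off whatever is left of this record's time slot.
            remaining = min_interval - (clock() - began)
            if remaining > 0:
                sleep(remaining)
        # Hand a copy of the results so far to the snapshot hook.
        if on_batch_done is not None:
            on_batch_done(list(results))
    return results


# Toy run: three records, batches of two, a high rate cap so it finishes quickly.
snapshots = []
doubled = process_in_batches([1, 2, 3], lambda r: r * 2, batch_size=2,
                             max_per_sec=100.0, on_batch_done=snapshots.append)
print(doubled)
print(len(snapshots))
```

Passing sleep and clock as parameters keeps the pacing logic testable without real delays.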

Results & Impact

The pipeline processed 7,117 records, matched 6,117 Salesforce contacts (86% success rate), and verified 5,038 contracts in SharePoint, all without manual intervention. The breakthrough was improving document availability from 22% to 70.8% through matching that resolved the name variations (marriages, legal name changes) found in 16.1% of records. The system eliminated hours of daily manual verification work, ensured zero data leakage through proper authentication and SSN handling, and gave DLSI's operations team an audit-ready contract ledger with 56 enriched fields.

Key Metrics & Achievements:

7,117: Total records processed
86%: Salesforce contact match rate (6,117 matched)
70.8%: Document recovery rate (up from 22%)
5,038: Contracts verified and ready for OCR
24,711: SharePoint folders analyzed
100%: Payment history generation rate
16.1%: Name variation cases resolved
56: Fields auto-mapped and enriched
6.0: Peak records/second processing rate
Zero: Data leakage incidents
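
The headline percentages follow directly from the counts reported on this page, as the short check below shows. The variable names are illustrative; the figures are the ones quoted above. The contact match rate works out to 85.9%, which the page rounds to 86%.

```python
TOTAL_RECORDS = 7117

counts = {
    "contacts matched": 6117,
    "loan accounts enriched": 5234,
    "contracts verified": 5038,
}

# Each rate is the count as a percentage of all records, to one decimal place.
rates = {label: round(100 * n / TOTAL_RECORDS, 1) for label, n in counts.items()}

# Document recovery improved from 22% to the verified-contract rate.
recovery_gain_points = round(rates["contracts verified"] - 22.0, 1)

for label, rate in rates.items():
    print(f"{label}: {rate}%")
print(f"recovery gain: {recovery_gain_points} percentage points")
```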

Ready to Automate Your Document Processing?

Let's build an intelligent automation solution for your business.