Why This Automation Was Developed
Managing large volumes of candidate resumes manually is time-consuming and error-prone. The challenge was extracting key information (names, emails, skills, companies, experience, and LinkedIn profiles) from hundreds of resumes in PDF, DOC, and DOCX formats. Combining the Gemini API with Google Apps Script automated the process, saving time, improving accuracy, and producing a searchable, filterable candidate database.
Technical Architecture: End-to-End Pipeline Design
The automation pipeline follows a four-stage architecture designed for reliability and scalability:
- Stage 1, File Discovery: The script recursively scans designated Google Drive folders and subfolders, building a queue of unprocessed resume files (PDF, DOC, DOCX) while skipping already-processed files tracked in a metadata sheet.
- Stage 2, Text Extraction: Each file is converted to a temporary Google Doc using Drive's built-in conversion engine, the raw text is extracted programmatically, and the temporary Doc is deleted to avoid Drive clutter.
- Stage 3, AI Processing: The extracted text is sent to Gemini AI via structured API calls with carefully engineered prompts that specify the exact JSON schema for the response.
- Stage 4, Data Storage: Parsed candidate data is validated, deduplicated against existing records by email address, and appended to the master Google Sheet with timestamps and source file references.
Because the stages are modular, a failure at any stage can be retried without reprocessing the entire batch.
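Stage 1 can be illustrated with a minimal sketch. It assumes the folder tree has already been read into plain objects; the `buildQueue` function and the `{ files, subfolders }` shape are hypothetical names chosen for illustration, not the original script. In Apps Script the same walk would run over the iterators returned by `folder.getFiles()` and `folder.getFolders()`.

```javascript
// MIME types for the three resume formats named in the article.
const RESUME_MIME_TYPES = new Set([
  "application/pdf",
  "application/msword",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

/**
 * Recursively collects resume files that have not been processed yet.
 * folder:       { files: [{ id, name, mimeType }], subfolders: [folder, ...] }
 *               (hypothetical stand-in for a Drive folder)
 * processedIds: Set of file IDs already recorded in the metadata sheet
 */
function buildQueue(folder, processedIds, queue = []) {
  for (const file of folder.files || []) {
    const isResume = RESUME_MIME_TYPES.has(file.mimeType);
    if (isResume && !processedIds.has(file.id)) {
      queue.push(file);
    }
  }
  for (const sub of folder.subfolders || []) {
    buildQueue(sub, processedIds, queue); // descend into subfolders
  }
  return queue;
}
```

Tracking processed files by ID rather than by name keeps the check stable when recruiters rename or move files between subfolders.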
Key Features of the Automation
- Resume Parsing: Handles PDF, DOC, and DOCX formats by converting each file to a Google Doc, extracting the text, and sending it to the Gemini API for structured analysis
- Duplicate Detection: Automatically checks for duplicate emails and removes redundant files from Google Drive
- Data Structuring: Extracted information is organized into Google Sheets for easy searching, filtering, and management
- Scalability: Processes large datasets and multiple subfolders, suitable for extensive recruitment campaigns
- Error Recovery: Failed extractions are logged to a separate error sheet with the file link and failure reason, enabling manual review without blocking the pipeline
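The duplicate-detection feature in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `normalizeEmail` and `splitDuplicates` are hypothetical helper names, and matching on a trimmed, lower-cased email is one reasonable choice rather than the confirmed rule used by the original script.

```javascript
// Normalizes an email for comparison; returns null for non-string input.
function normalizeEmail(email) {
  return typeof email === "string" ? email.trim().toLowerCase() : null;
}

/**
 * Splits parsed candidates into new records and duplicates.
 * existingEmails: emails already present in the master sheet
 * candidates:     parsed records, each with an `email` field (may be null)
 */
function splitDuplicates(existingEmails, candidates) {
  const seen = new Set(existingEmails.map(normalizeEmail).filter(Boolean));
  const fresh = [];
  const duplicates = [];
  for (const candidate of candidates) {
    const key = normalizeEmail(candidate.email);
    if (key && seen.has(key)) {
      duplicates.push(candidate);
    } else {
      // Records without an email cannot be matched, so they are kept.
      if (key) seen.add(key);
      fresh.push(candidate);
    }
  }
  return { fresh, duplicates };
}
```

Adding each new email to `seen` as it is accepted also catches duplicates within the same batch, not only against rows already in the sheet.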
Gemini AI Prompt Engineering for Structured Extraction
The quality of AI-extracted data depends heavily on prompt engineering, that is, the instructions sent to the Gemini API alongside the resume text. The automation uses a system prompt that defines the AI's role as a recruitment data extraction specialist, followed by a structured output schema that specifies every field the model must return: fullName, email, phone, currentCompany, totalExperience, skills (as an array), linkedInUrl, currentLocation, and summary. The prompt explicitly instructs the model to return null for missing fields rather than guessing, a critical design decision that keeps hallucinated data out of the candidate database. Few-shot examples are included in the prompt to demonstrate the expected output format, which significantly improves extraction consistency across diverse resume layouts. Temperature is set to 0.1 to minimize creative variation in structured extraction tasks. The Gemini 1.5 Pro model's 1-million-token context window means that even lengthy multi-page resumes with portfolio appendices can be processed without truncation.
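A request body along these lines can be sketched as shown here. The prompt wording and the `buildGeminiRequest` helper are illustrative assumptions, not the original prompt; the field list and the temperature come from the description above, and `responseMimeType` is an optional setting assumed here to request JSON output.

```javascript
// Field names taken from the schema described in the article.
const CANDIDATE_FIELDS = [
  "fullName", "email", "phone", "currentCompany", "totalExperience",
  "skills", "linkedInUrl", "currentLocation", "summary",
];

/**
 * Builds a generateContent request body for one resume.
 * The prompt text below is an illustrative assumption.
 */
function buildGeminiRequest(resumeText) {
  const systemPrompt = [
    "You are a recruitment data extraction specialist.",
    "Return one JSON object with exactly these fields: " +
      CANDIDATE_FIELDS.join(", ") + ".",
    "The skills field must be an array of strings.",
    "Use null for any field missing from the resume. Do not guess.",
    // One-shot example of the expected format (illustrative values only).
    'Example: {"fullName":"Jane Doe","email":"jane@example.com","phone":null,' +
      '"currentCompany":null,"totalExperience":"5 years","skills":["SQL"],' +
      '"linkedInUrl":null,"currentLocation":null,"summary":null}',
  ].join("\n");

  return {
    systemInstruction: { parts: [{ text: systemPrompt }] },
    contents: [{ role: "user", parts: [{ text: resumeText }] }],
    generationConfig: {
      temperature: 0.1, // low temperature for consistent structured output
      responseMimeType: "application/json",
    },
  };
}
```

In Apps Script, the body would typically be sent with `UrlFetchApp.fetch(url, { method: "post", contentType: "application/json", payload: JSON.stringify(body), muteHttpExceptions: true })`; the endpoint URL and model name should be taken from the current Gemini API reference.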
Google Apps Script: Zero-Infrastructure Automation
Google Apps Script is a cloud-based JavaScript platform that automates tasks across Google Workspace: Sheets, Drive, Gmail, Calendar, and Docs. It's free with a Google account, requires no server infrastructure, integrates natively with Google services, and can connect to external APIs like Gemini. Key advantages for this automation include: UrlFetchApp for making HTTP requests to the Gemini API with custom headers and JSON payloads; DriveApp for file management and folder traversal (format conversion to Google Docs typically goes through the Advanced Drive Service rather than DriveApp itself); SpreadsheetApp for writing structured data to Google Sheets with cell-level formatting; and Triggers for scheduling automated runs (e.g., process new resumes every hour). The 6-minute execution time limit per invocation is managed through continuation tokens: the script saves its progress state and schedules a trigger so that processing resumes in the next invocation.
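The save-and-resume pattern for the execution time limit can be sketched as a loop with a time budget. This is a simplified illustration: `processWithBudget`, the 5-minute budget, and the index-based progress marker are assumptions made for the example, and the clock is injectable only so the logic can be exercised outside Apps Script.

```javascript
// Stop well before the 6-minute limit so state can still be saved (assumed margin).
const TIME_BUDGET_MS = 5 * 60 * 1000;

/**
 * Processes queue items from startIndex until the queue is empty or the
 * time budget is spent, then reports where the next run should resume.
 * processOne: callback that handles a single queue item
 * now:        clock function returning milliseconds (defaults to Date.now)
 */
function processWithBudget(queue, startIndex, processOne, now = Date.now) {
  const startedAt = now();
  let index = startIndex;
  while (index < queue.length && now() - startedAt < TIME_BUDGET_MS) {
    processOne(queue[index]);
    index += 1;
  }
  return { nextIndex: index, done: index >= queue.length };
}
```

In Apps Script, `nextIndex` (or a Drive iterator's continuation token) could be persisted with `PropertiesService.getScriptProperties().setProperty(...)`, and a follow-up run scheduled with `ScriptApp.newTrigger("functionName").timeBased().after(ms).create()` whenever `done` is false.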
Error Handling, Rate Limiting, and Data Validation
Production-grade automation demands robust error handling across every pipeline stage:
- API rate limiting: The Gemini API enforces request-per-minute quotas. The script implements exponential backoff with jitter, automatically retrying failed requests after increasing delays (1s → 2s → 4s → 8s) to avoid quota exhaustion.
- File conversion failures: Some PDFs (image-only scans without OCR text) produce empty text after conversion. The script detects these and logs them for manual processing or OCR pre-processing.
- Schema validation: Every Gemini response is validated against the expected JSON schema before it is written to the spreadsheet. Malformed responses are caught and retried with a stricter prompt.
- Email validation: Extracted email addresses are checked against regex patterns to keep obviously invalid entries (missing @ signs, malformed domains) out of the database.
- Concurrency protection: Apps Script's LockService prevents multiple trigger instances from processing the same file simultaneously, avoiding duplicate entries.
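The backoff schedule and the response checks described above can be sketched as two small helpers. The 25% jitter ceiling, the email pattern, and the names `backoffDelayMs` and `validateCandidate` are assumptions for illustration; the original script's exact rules are not given in the text.

```javascript
/**
 * Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ...
 * plus up to 25% random jitter (the jitter share is an assumed value).
 */
function backoffDelayMs(attempt, baseMs = 1000, random = Math.random) {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = exponential * 0.25 * random();
  return Math.round(exponential + jitter);
}

// Deliberately loose pattern: rejects missing "@" and missing domain dot only.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

const REQUIRED_FIELDS = [
  "fullName", "email", "phone", "currentCompany", "totalExperience",
  "skills", "linkedInUrl", "currentLocation", "summary",
];

/** Parses a raw model response and checks it against the expected shape. */
function validateCandidate(rawJson) {
  let data;
  try {
    data = JSON.parse(rawJson);
  } catch (err) {
    return { ok: false, reason: "invalid JSON" };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, reason: "not a JSON object" };
  }
  for (const field of REQUIRED_FIELDS) {
    if (!(field in data)) return { ok: false, reason: "missing field: " + field };
  }
  if (data.skills !== null && !Array.isArray(data.skills)) {
    return { ok: false, reason: "skills must be an array or null" };
  }
  if (data.email !== null && !EMAIL_PATTERN.test(String(data.email))) {
    return { ok: false, reason: "invalid email" };
  }
  return { ok: true, data };
}
```

In Apps Script the delay would be applied with `Utilities.sleep(backoffDelayMs(attempt))` between retries, and a failed validation result could carry its `reason` into the error sheet.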
Scaling to Enterprise Recruitment Campaigns
While the base automation handles hundreds of resumes efficiently, enterprise recruitment campaigns processing thousands of applications require additional architectural considerations:
- Batch processing: Instead of processing one resume per API call, the system groups 3–5 shorter resumes into a single Gemini request (leveraging the large context window), reducing total API calls by 60–70%.
- Multi-sheet architecture: Campaign-specific Google Sheets prevent single-sheet performance degradation at high row counts, while a master index sheet provides cross-campaign search.
- Webhook notifications: Completed batches trigger Slack or email notifications to recruiters with summary statistics (X new candidates added, Y duplicates skipped, Z errors requiring review).
- Analytics dashboard: A separate Google Sheet uses QUERY functions and charts to visualize sourcing metrics: candidates per source folder, skill distribution heatmaps, and extraction accuracy rates over time.
These enhancements transform a basic automation into a scalable recruitment intelligence platform.
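The batch-processing idea can be sketched as a simple grouping function. The character budget, the default batch size, and the names `groupIntoBatches` and `callReduction` are assumptions for illustration, since the article does not state how "shorter" resumes are identified.

```javascript
/**
 * Groups resumes into batches of at most maxPerBatch items and at most
 * maxCharsPerBatch characters of text (both limits are assumed values).
 * A resume longer than the character budget gets a batch of its own.
 */
function groupIntoBatches(resumes, maxPerBatch = 5, maxCharsPerBatch = 20000) {
  const batches = [];
  let current = [];
  let currentChars = 0;
  for (const resume of resumes) {
    const size = resume.text.length;
    const wouldOverflow =
      current.length >= maxPerBatch || currentChars + size > maxCharsPerBatch;
    if (current.length > 0 && wouldOverflow) {
      batches.push(current); // close the current batch and start a new one
      current = [];
      currentChars = 0;
    }
    current.push(resume);
    currentChars += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

/** Fraction of API calls saved compared with one call per resume. */
function callReduction(resumeCount, batchCount) {
  return resumeCount === 0 ? 0 : 1 - batchCount / resumeCount;
}
```

As a worked example, ten short resumes grouped three per batch need four requests instead of ten, a 60% reduction, which sits at the lower end of the range quoted above.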
Results, ROI, and Business Impact
- Speed: Hundreds of resumes processed in minutes instead of hours — a 95% reduction in manual data entry time for recruitment coordinators
- Accuracy: The AI extracted critical candidate data with 90%+ field-level accuracy across diverse resume formats
- Scalability: Handles large applicant pools for recruitment campaigns — tested with 2,000+ resumes across 50+ subfolders without performance degradation
- Structured Output: Candidate database can be filtered by skills, experience, location, and other criteria for targeted outreach and pipeline management
- Cost Efficiency: Zero infrastructure cost (Google Apps Script is free) — only Gemini API usage costs, which average under $0.01 per resume for extraction



