AI & Machine Learning

Harnessing AI for Automated Candidate Data Extraction with the Gemini AI API and Google Apps Script

Amit Gupta
CEO & Founder
October 3, 2024
9 min read

Why This Automation Was Developed

Reviewing large volumes of candidate resumes by hand is time-consuming and error-prone. The challenge was to extract key information (names, emails, skills, companies, experience, and LinkedIn profiles) from hundreds of resumes in PDF, DOC, and DOCX formats. Combining the Gemini AI API with Google Apps Script automated the process, which saves time, improves accuracy, and produces a searchable, filterable candidate database.

Technical Architecture: End-to-End Pipeline Design

The automation pipeline follows a four-stage architecture designed for reliability and scalability.

  • Stage 1, File Discovery: The script recursively scans designated Google Drive folders and subfolders, building a queue of unprocessed resume files (PDF, DOC, DOCX) while skipping already-processed files tracked in a metadata sheet.
  • Stage 2, Text Extraction: Each file is converted to a temporary Google Doc using Drive's built-in conversion engine, the raw text is extracted programmatically, and the temporary Doc is deleted to avoid Drive clutter.
  • Stage 3, AI Processing: The extracted text is sent to Gemini AI via structured API calls with carefully engineered prompts that specify the exact JSON schema for the response.
  • Stage 4, Data Storage: Parsed candidate data is validated, deduplicated against existing records by email address, and appended to the master Google Sheet with timestamps and source file references.

Because the pipeline is modular, a failure at any stage can be retried without reprocessing the entire batch.
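
The file discovery stage could be sketched roughly as follows. This is a minimal illustration, not the article's actual script: the function name collectResumeQueue and the shape of the queue entries are assumptions, and the folder argument only needs to behave like an Apps Script Folder (getFiles() and getFolders() returning hasNext()/next() iterators).

```javascript
// MIME types for the three resume formats named in the article.
const RESUME_MIME_TYPES = new Set([
  "application/pdf",
  "application/msword",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

/**
 * Recursively collects unprocessed resume files under `folder`.
 * `processedIds` is a Set of file IDs already recorded in the metadata sheet.
 * (Hypothetical helper; names are illustrative.)
 */
function collectResumeQueue(folder, processedIds, queue = []) {
  const files = folder.getFiles();
  while (files.hasNext()) {
    const file = files.next();
    const isResume = RESUME_MIME_TYPES.has(file.getMimeType());
    if (isResume && !processedIds.has(file.getId())) {
      queue.push({ id: file.getId(), name: file.getName() });
    }
  }
  // Descend into subfolders, appending to the same queue.
  const subfolders = folder.getFolders();
  while (subfolders.hasNext()) {
    collectResumeQueue(subfolders.next(), processedIds, queue);
  }
  return queue;
}
```

Inside Apps Script this would be called with something like collectResumeQueue(DriveApp.getFolderById(rootFolderId), processedIds), where rootFolderId is a placeholder for the campaign's root folder.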

Key Features of the Automation

  • Resume Parsing: Handles PDF, DOC, and DOCX formats by converting to Google Docs, extracting text, and sending to Gemini AI API for structured analysis
  • Duplicate Detection: Automatically checks for duplicate emails and removes redundant files from Google Drive
  • Data Structuring: Extracted information is organized into Google Sheets for easy searching, filtering, and management
  • Scalability: Processes large datasets and multiple subfolders, suitable for extensive recruitment campaigns
  • Error Recovery: Failed extractions are logged to a separate error sheet with the file link and failure reason, enabling manual review without blocking the pipeline
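
Duplicate detection by email can be illustrated with a small pure function. This is a sketch under assumptions: the helper names are hypothetical, emails are compared case-insensitively, and records without an email are passed through rather than dropped (the article does not say how those are handled).

```javascript
/** Lower-cases and trims an email so that case variants compare equal. */
function normalizeEmail(email) {
  return typeof email === "string" ? email.trim().toLowerCase() : null;
}

/**
 * Splits freshly parsed candidates into rows to append and duplicates to skip.
 * `existingEmails` holds the emails already present in the master sheet.
 */
function partitionByEmail(candidates, existingEmails) {
  const seen = new Set(existingEmails.map(normalizeEmail).filter(Boolean));
  const fresh = [];
  const duplicates = [];
  for (const candidate of candidates) {
    const key = normalizeEmail(candidate.email);
    if (key && seen.has(key)) {
      duplicates.push(candidate);
    } else {
      if (key) seen.add(key); // also catches duplicates within the same batch
      fresh.push(candidate);
    }
  }
  return { fresh, duplicates };
}
```

The duplicates list is what a script would use to decide which source files to remove from Drive.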

Gemini AI Prompt Engineering for Structured Extraction

The quality of the extracted data depends heavily on prompt engineering, that is, the instructions sent to the Gemini API alongside the resume text. A system prompt defines the model's role as a recruitment data extraction specialist, followed by a structured output schema listing every field the model must return: fullName, email, phone, currentCompany, totalExperience, skills (as an array), linkedInUrl, currentLocation, and summary. The prompt explicitly instructs the model to return null for missing fields rather than guessing, a critical design decision that keeps hallucinated data out of the candidate database. Few-shot examples in the prompt demonstrate the expected output format, which significantly improves extraction consistency across diverse resume layouts. Temperature is set to 0.1 to minimize variation in this structured extraction task. The 1-million-token context window of Gemini 1.5 Pro means that even lengthy multi-page resumes with portfolio appendices are processed without truncation.
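
A request of the kind described above might look like the following sketch. The prompt wording is hypothetical (the article does not publish its actual prompt), few-shot examples are omitted for brevity, and the helper names are assumptions. The field list and the temperature of 0.1 come from the text; the request shape follows the Gemini generateContent REST API.

```javascript
// Field names taken from the schema described above.
const CANDIDATE_FIELDS = [
  "fullName", "email", "phone", "currentCompany", "totalExperience",
  "skills", "linkedInUrl", "currentLocation", "summary",
];

// Hypothetical wording; not the article's real prompt.
const SYSTEM_PROMPT =
  "You are a recruitment data extraction specialist. " +
  "Return one JSON object with exactly these keys: " + CANDIDATE_FIELDS.join(", ") + ". " +
  "skills must be an array of strings. " +
  "Use null for any field that is not present in the resume; never guess.";

/** Builds a generateContent request body for one resume's text. */
function buildExtractionRequest(resumeText) {
  return {
    systemInstruction: { parts: [{ text: SYSTEM_PROMPT }] },
    contents: [{ role: "user", parts: [{ text: resumeText }] }],
    generationConfig: { temperature: 0.1 },
  };
}

/** Pulls the JSON object out of a generateContent response body. */
function parseExtractionResponse(responseBody) {
  const text = responseBody.candidates[0].content.parts[0].text;
  // Models sometimes wrap JSON in a Markdown fence; strip it if present.
  const cleaned = text.replace(/^\s*`{3}(?:json)?\s*/i, "").replace(/\s*`{3}\s*$/, "");
  return JSON.parse(cleaned);
}

/** Apps Script only: sends the request with UrlFetchApp. */
function extractCandidate(resumeText, apiKey) {
  const url = "https://generativelanguage.googleapis.com/v1beta/models/" +
    "gemini-1.5-pro:generateContent?key=" + encodeURIComponent(apiKey);
  const response = UrlFetchApp.fetch(url, {
    method: "post",
    contentType: "application/json",
    payload: JSON.stringify(buildExtractionRequest(resumeText)),
    muteHttpExceptions: true,
  });
  if (response.getResponseCode() !== 200) {
    throw new Error("Gemini request failed: " + response.getResponseCode());
  }
  return parseExtractionResponse(JSON.parse(response.getContentText()));
}
```

Keeping the request builder and the response parser separate from the UrlFetchApp call makes both easy to check outside Apps Script.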

Google Apps Script: Zero-Infrastructure Automation

Google Apps Script is a cloud-hosted JavaScript platform that automates tasks across Google Workspace: Sheets, Drive, Gmail, Calendar, and Docs. It is available at no charge with a Google account (subject to usage quotas), requires no server infrastructure, integrates natively with Google services, and can call external APIs such as Gemini. Key advantages for this automation include: UrlFetchApp for making HTTP requests to the Gemini API with custom headers and JSON payloads; DriveApp for file management and folder traversal, with format conversion typically handled through the advanced Drive service; SpreadsheetApp for writing structured data to Google Sheets with cell-level formatting; and triggers for scheduling automated runs (for example, processing new resumes every hour). The 6-minute execution limit per invocation is managed through continuation tokens: the script saves its progress and schedules itself to resume in the next invocation.
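
The continuation pattern can be sketched as below. This is an assumption-laden illustration, not the article's code: the property keys RESUME_TOKEN and ROOT_FOLDER_ID, the 5-minute budget, and the function names are all hypothetical. The Apps Script calls used (PropertiesService, DriveApp.continueFileIterator, getContinuationToken, ScriptApp.newTrigger) are standard services.

```javascript
// Stop early, leaving headroom under the 6-minute limit (budget is an assumption).
const MAX_RUNTIME_MS = 5 * 60 * 1000;

/**
 * Processes files from a Drive-style iterator until it is exhausted or the
 * deadline passes. Returns a continuation token if work remains, else null.
 */
function processUntilDeadline(iterator, handleFile, deadlineMs, now = Date.now) {
  while (iterator.hasNext()) {
    if (now() >= deadlineMs) {
      return iterator.getContinuationToken();
    }
    handleFile(iterator.next());
  }
  return null;
}

/** Apps Script only: entry point for both the first run and resumed runs. */
function runResumeBatch() {
  const props = PropertiesService.getScriptProperties();
  const saved = props.getProperty("RESUME_TOKEN"); // hypothetical property key
  const iterator = saved
    ? DriveApp.continueFileIterator(saved)
    : DriveApp.getFolderById(props.getProperty("ROOT_FOLDER_ID")).getFiles();

  const token = processUntilDeadline(
    iterator,
    (file) => Logger.log("Processing " + file.getName()), // stand-in for the real work
    Date.now() + MAX_RUNTIME_MS
  );

  if (token) {
    // Save progress and schedule a follow-up run in about one minute.
    props.setProperty("RESUME_TOKEN", token);
    ScriptApp.newTrigger("runResumeBatch").timeBased().after(60 * 1000).create();
  } else {
    props.deleteProperty("RESUME_TOKEN");
  }
}
```

A production script would likely also clean up spent one-off triggers so they do not accumulate against the per-script trigger quota.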

Error Handling, Rate Limiting, and Data Validation

Production-grade automation demands robust error handling across every pipeline stage.

  • API rate limiting: The Gemini API enforces requests-per-minute quotas. The script implements exponential backoff with jitter, automatically retrying failed requests after increasing delays (1s → 2s → 4s → 8s) to avoid quota exhaustion.
  • File conversion failures: Some PDFs (image-only scans without OCR text) produce empty text after conversion. The script detects these and logs them for manual processing or OCR pre-processing.
  • Schema validation: Every Gemini response is validated against the expected JSON schema before it is written to the spreadsheet. Malformed responses are caught and retried with a stricter prompt.
  • Email validation: Extracted email addresses are validated against regex patterns to prevent obviously invalid entries (missing @ signs, malformed domains) from entering the database.
  • Concurrency protection: The script lock service prevents multiple trigger instances from processing the same file simultaneously, avoiding duplicate entries.
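
Two of these safeguards, retry with backoff and response validation, can be sketched as follows. The helper names, the 25% jitter ratio, the choice of required fields, and the email pattern are assumptions made for illustration; the 1s, 2s, 4s, 8s base delays follow the sequence given above.

```javascript
/** Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s plus jitter. */
function backoffDelayMs(attempt, baseMs = 1000, random = Math.random) {
  const jitter = Math.floor(random() * baseMs * 0.25); // jitter ratio is an assumption
  return baseMs * 2 ** attempt + jitter;
}

/**
 * Calls `fn` until it succeeds or `maxRetries` retries are used up.
 * `sleep` is injected; in Apps Script it would be Utilities.sleep.
 */
function withRetry(fn, sleep, maxRetries = 4, random = Math.random) {
  for (let attempt = 0; ; attempt++) {
    try {
      return fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      sleep(backoffDelayMs(attempt, 1000, random));
    }
  }
}

const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // deliberately loose sanity check

/** Returns a list of problems; an empty list means the record can be written. */
function validateCandidate(record) {
  if (record === null || typeof record !== "object" || Array.isArray(record)) {
    return ["response is not a JSON object"];
  }
  const problems = [];
  for (const field of ["fullName", "email", "skills"]) {
    if (!(field in record)) problems.push("missing field: " + field);
  }
  if (record.skills != null && !Array.isArray(record.skills)) {
    problems.push("skills must be an array or null");
  }
  if (record.email != null && !EMAIL_PATTERN.test(record.email)) {
    problems.push("invalid email: " + record.email);
  }
  return problems;
}
```

In Apps Script the retry wrapper would be used as withRetry(() => extractOne(file), Utilities.sleep), where extractOne is a placeholder for the function that calls Gemini, and the whole run could be guarded with LockService.getScriptLock().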

Scaling to Enterprise Recruitment Campaigns

While the base automation handles hundreds of resumes efficiently, enterprise recruitment campaigns processing thousands of applications require additional architectural considerations.

  • Batch processing: Instead of processing one resume per API call, the system groups 3–5 shorter resumes into a single Gemini request (leveraging the large context window), reducing total API calls by 60–70%.
  • Multi-sheet architecture: Campaign-specific Google Sheets prevent single-sheet performance degradation at high row counts, and a master index sheet provides cross-campaign search.
  • Webhook notifications: Completed batches trigger Slack or email notifications to recruiters with summary statistics (X new candidates added, Y duplicates skipped, Z errors requiring review).
  • Analytics dashboard: A separate Google Sheet uses QUERY functions and charts to visualize sourcing metrics: candidates per source folder, skill distribution heatmaps, and extraction accuracy rates over time.

These enhancements transform a basic automation into a scalable recruitment intelligence platform.
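
The batch-processing idea can be illustrated with a simple grouping function. This is a sketch only: the character budget of 20,000, the "### RESUME id=" delimiter, and the function names are assumptions, and the article does not describe how its batches are actually assembled.

```javascript
/**
 * Groups resumes into batches of at most `maxPerBatch` items whose combined
 * text length stays within `maxChars`. A resume longer than `maxChars`
 * gets a batch of its own.
 */
function groupResumes(resumes, maxPerBatch = 5, maxChars = 20000) {
  const batches = [];
  let current = [];
  let currentChars = 0;
  for (const resume of resumes) {
    const size = resume.text.length;
    const full = current.length >= maxPerBatch || currentChars + size > maxChars;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      currentChars = 0;
    }
    current.push(resume);
    currentChars += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

/** Labels each resume with its file ID so results can be matched back. */
function buildBatchPrompt(batch) {
  return batch
    .map((r) => "### RESUME id=" + r.id + "\n" + r.text)
    .join("\n\n");
}
```

With batching, the prompt would also need to ask for an array of objects keyed by the same id, so that each extracted record can be traced to its source file.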

Results, ROI, and Business Impact

  • Speed: Hundreds of resumes processed in minutes instead of hours — a 95% reduction in manual data entry time for recruitment coordinators
  • Accuracy: Gemini extracted critical candidate data with 90%+ field-level accuracy across diverse resume formats
  • Scalability: Handles large applicant pools for recruitment campaigns — tested with 2,000+ resumes across 50+ subfolders without performance degradation
  • Structured Output: Candidate database can be filtered by skills, experience, location, and other criteria for targeted outreach and pipeline management
  • Cost Efficiency: Zero infrastructure cost (Google Apps Script is free) — only Gemini API usage costs, which average under $0.01 per resume for extraction

Frequently Asked Questions

Common questions about this topic, answered by our engineering team.

Google Apps Script is a free, cloud-hosted scripting platform based on JavaScript that automates tasks across Google Workspace. It requires no server infrastructure, integrates natively with Drive, Sheets, and Gmail, and can call external APIs such as Gemini, which makes it well suited to lightweight, cost-effective automations.

The script extracts raw text from resume files and sends it to the Gemini AI API with carefully engineered prompts specifying the exact JSON schema (name, email, skills, experience, etc.). The AI returns structured fields with null values for missing data rather than guessing, which protects database accuracy.

Yes, the system handles PDF, DOC, and DOCX formats by converting them to Google Docs for text extraction. Gemini AI's large context window and few-shot prompt examples enable accurate extraction across diverse resume layouts, from simple text documents to multi-column formatted CVs.

The pipeline implements exponential backoff for API rate limiting, schema validation for AI responses, email regex validation, and a dedicated error logging sheet. Failed extractions are quarantined for manual review without blocking the remaining pipeline.

Google Apps Script is completely free. The only cost is Gemini API usage, which averages under $0.01 per resume for structured extraction. Processing 1,000 resumes typically costs less than $10 — a fraction of the manual labor cost it replaces.
