Leveraging AI in Large-Scale Government Security Audits

Manual security auditing doesn't scale. When your mandate is to assess security posture across hundreds — or thousands — of government web domains, the traditional approach of a human analyst checking each domain is simply not viable. This post describes the architecture and design decisions behind CyberDrishti (साइबर दृष्टि), an AI-assisted scanning platform built to automate the discovery of exposed credentials, PII, and critical misconfigurations at national scale.

CyberDrishti was built for an authorized government security audit mandate. All scanning operations are performed with appropriate legal authorization. No unauthorized access was performed or implied.

The Problem: Scale and Signal-to-Noise

Government web infrastructure spans a vast and heterogeneous landscape — central ministry portals, state government sites, PSU websites, e-governance platforms, and citizen service portals. Each may expose different types of sensitive information: hardcoded database credentials in JavaScript files, Aadhaar numbers in PDF reports, AWS access keys in public repositories, unprotected admin panels, or misconfigured S3-equivalent buckets.

A typical human-led audit might cover 20–50 domains per analyst per day with meaningful depth. With thousands of targets, you need to triage automatically and route only high-confidence findings to human reviewers. This is where AI pipelines add genuine value — not by replacing security judgement, but by applying consistent, tireless pattern recognition at scale.

Architecture Overview

CyberDrishti is structured as a multi-stage pipeline:

Discovery Layer
    └── Domain enumeration (subfinder, amass, certificate transparency)
    └── HTTP crawling (custom Go-based spider)
    └── JavaScript file extraction
    └── PDF / document harvesting

Analysis Layer
    └── Credential Detection (regex + heuristics)
    └── PII Detection (spaCy NER pipeline)
    └── OCR Engine (Tesseract for images and scanned PDFs that have no text layer)
    └── Secret Scanning (custom entropy analysis)
    └── Misconfiguration Checks (headers, TLS, CORS, exposed panels)

Reporting Layer
    └── Finding deduplication and severity scoring
    └── Human review queue (React dashboard)
    └── Export: JSON, PDF, CERT-IN report format
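
The deduplication step in the reporting layer can be pictured roughly as follows. This is only a sketch: the `Finding` record, its field names, and the choice of key (domain, finding type, value hash) are assumptions made for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical finding record; field names are illustrative only.
    domain: str
    finding_type: str
    value_hash: str
    url: str


def deduplicate(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding seen for each (domain, type, value hash) key."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[Finding] = []
    for finding in findings:
        key = (finding.domain, finding.finding_type, finding.value_hash)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```

Keying on the hash rather than the URL means the same leaked value served from several pages of one domain reaches the reviewer once.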

The NLP Pipeline — PII Detection with spaCy

Standard regex-based PII detection produces a very high false-positive rate in Indian government content. Generic patterns for "12-digit numbers" match phone numbers written with the 91 country code, reference IDs, and order numbers as well as Aadhaar numbers. We needed context-aware detection.
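
To make the idea of context-aware detection concrete, here is a small rule-based sketch that combines two signals: the Verhoeff check digit that Aadhaar numbers are generally understood to carry, and a keyword in the surrounding text. The function names and the keyword list are hypothetical, and this is a simple pre-filter for illustration, not the fine-tuned model used by the platform.

```python
import re

# Verhoeff multiplication (D) and permutation (P) tables.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# Illustrative context keywords (an assumption, not the model's features).
_CONTEXT = re.compile(r"\b(?:aadhaar|aadhar|uid|uidai)\b", re.IGNORECASE)


def verhoeff_valid(digits: str) -> bool:
    """Return True if an ASCII digit string passes the Verhoeff checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    checksum = 0
    for position, char in enumerate(reversed(digits)):
        checksum = _D[checksum][_P[position % 8][int(char)]]
    return checksum == 0


def likely_aadhaar(candidate: str, window: str) -> bool:
    """Heuristic: 12 digits, valid check digit, and a keyword in the window."""
    digits = re.sub(r"[\s-]", "", candidate)
    if len(digits) != 12 or not verhoeff_valid(digits):
        return False
    return _CONTEXT.search(window) is not None
```

The checksum alone rejects roughly nine in ten random 12-digit strings, and the keyword test removes most of the rest; neither signal replaces the trained model's judgement on ambiguous documents.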

spaCy's Named Entity Recognition (NER) provided the foundation. We fine-tuned a custom NER model on a labelled dataset of government document extracts to recognise Indian-specific PII entity types:

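import hashlib
import hmac
import os

import spacy
from spacy.tokens import Span

# Custom entity labels for Indian government PII
ENTITY_LABELS = [
    "AADHAAR",     # 12-digit Aadhaar number with contextual validation
    "PAN",         # Permanent Account Number (format: AAAAA0000A)
    "VOTER_ID",    # Voter ID card numbers
    "PERSON_NAME", # Indian name patterns in sensitive document contexts
    "PHONE_IN",    # Indian mobile/landline numbers
    "GOV_ID",      # Generic government ID references
]

# confidence_score is a custom span attribute, not a spaCy built-in.
# Register it with a default so the lookup below cannot fail; the
# fine-tuned pipeline is expected to populate it.
if not Span.has_extension("confidence_score"):
    Span.set_extension("confidence_score", default=None)

# Load fine-tuned model
nlp = spacy.load("./models/cyberdrishti_ner_v2")

# Illustrative hash_pii: a keyed HMAC-SHA256, because ID formats such as a
# 12-digit number are small enough to enumerate against an unkeyed hash.
# PII_HASH_KEY is an assumed environment variable name.
_HASH_KEY = os.environ["PII_HASH_KEY"].encode("utf-8")

def hash_pii(value: str) -> str:
    """Return a keyed one-way hash of a PII value."""
    return hmac.new(_HASH_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scan_text_for_pii(text: str) -> list[dict]:
    doc = nlp(text)
    findings = []
    for ent in doc.ents:
        if ent.label_ in ENTITY_LABELS:
            # Mask every detected entity in the sentence so the stored
            # context holds no raw PII. ent.sent needs sentence boundaries
            # (a parser or sentencizer component in the pipeline).
            context = ent.sent.text
            for other in ent.sent.ents:
                context = context.replace(other.text, f"[{other.label_}]")
            findings.append({
                "entity_type": ent.label_,
                "value_hash": hash_pii(ent.text),  # Never store raw PII
                "context": context[:200],
                "confidence": ent._.confidence_score,
            })
    return findings

Key design decision: we never store raw PII values in the finding database. Only the surrounding context, with the detected values masked, and a one-way hash of the sensitive value are stored. That is enough to confirm the finding during human review without the platform itself becoming a PII repository.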
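
One way a reviewer could confirm a stored finding without the database ever holding the value is to re-hash what they see in the source document and compare digests. The sketch below assumes the stored hash is a keyed HMAC-SHA256; `keyed_hash`, `confirm_finding`, and the key handling are hypothetical, since the post does not show how the hash is actually computed.

```python
import hashlib
import hmac


def keyed_hash(value: str, key: bytes) -> str:
    """Hash a value with HMAC-SHA256 after stripping all whitespace."""
    normalised = "".join(value.split())
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def confirm_finding(stored_hash: str, observed_value: str, key: bytes) -> bool:
    """Return True if the value seen at the source matches the stored hash."""
    # compare_digest avoids leaking where the two digests first differ.
    return hmac.compare_digest(stored_hash, keyed_hash(observed_value, key))
```

A key matters here because a 12-digit identifier has only 10^12 possible values, which is within reach of brute-force enumeration against a plain unkeyed hash.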

OCR for Embedded Sensitive Data

A surprising volume of exposed government PII lives not in HTML or JavaScript but in PDF reports, scanned images, and presentations published on official portals. These are invisible to text-based scanners.

We integrated Tesseract OCR with image preprocessing for both raster PDFs and standalone images:

import pytesseract
from PIL import Image
import pdf2image
import cv2
import numpy as np

def extract_text_from_pdf(pdf_path: str) -> str:
    """Convert PDF pages to images then OCR each page."""
    images = pdf2image.convert_from_path(pdf_path, dpi=200)
    full_text = []

    for img in images:
        # Preprocessing: grayscale, denoise, threshold
        cv_img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
        cv_img = cv2.fastNlMeansDenoising(cv_img, h=10)
        _, cv_img = cv2.threshold(cv_img, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # OCR with Indian language support
        text = pytesseract.image_to_string(
            Image.fromarray(cv_img),
            lang='eng+hin',  # English + Hindi
            config='--psm 6'  # Assume uniform block of text
        )
        full_text.append(text)

    return "\n".join(full_text)

The dual-language OCR (English + Hindi) was essential — many government documents mix English IDs and Hindi names within the same table, and single-language OCR produced garbled output for mixed-script content.
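
Since OCR is far slower than reading an existing text layer, a pipeline like this would typically OCR only the pages that need it. The sketch below shows just that decision. The function name and the `min_chars` threshold are assumptions, and the per-page text is expected to come from whatever PDF text extractor is already in use.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages whose text layer is too sparse.

    page_texts holds the text layer extracted from each page; an empty or
    near-empty string suggests the page is a scanned image.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars  # ignore whitespace
    ]
```

For example, `pages_needing_ocr(["", "A" * 50, "  \n "])` returns `[0, 2]`, so only the first and third pages would be rasterised and sent to Tesseract.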

Credential and Secret Detection

For JavaScript files and source code fragments, we implemented a layered detection approach combining entropy analysis with pattern matching. High-entropy strings in assignment contexts are strong signals for API keys, tokens, and passwords:

import math
import re
from collections import Counter

def shannon_entropy(data: str) -> float:
    """Calculate the Shannon entropy of a string, in bits per character."""
    if not data:
        return 0.0
    total = len(data)
    # Count every character once instead of calling data.count() per symbol
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(data).values())

# Pattern: a secret-like name assigned a long quoted value. Matches
# declarations such as const apiKey = "..." and quoted keys such as
# "api_key": "..." (the optional quote after the name allows the latter).
ASSIGNMENT_PATTERN = re.compile(
    r'(?:\b(?:const|let|var)\s+|["\'])'
    r'(?:key|token|secret|password|api_key|apikey|auth|credential)'
    r'["\']?\s*[=:]\s*["\']([A-Za-z0-9+/=_\-]{20,})["\']',
    re.IGNORECASE
)

def scan_js_for_secrets(js_content: str) -> list[dict]:
    findings = []

    for match in ASSIGNMENT_PATTERN.finditer(js_content):
        value = match.group(1)
        entropy = shannon_entropy(value)

        # High entropy threshold for secret-like strings. Note that a
        # hex-only value (16 symbols) peaks at exactly 4.0 bits and so
        # never passes this filter.
        if entropy > 4.0:
            findings.append({
                "type": "POTENTIAL_SECRET",
                "entropy": round(entropy, 2),
                "context": js_content[max(0, match.start()-50):match.end()+50],
                "severity": "HIGH" if entropy > 4.5 else "MEDIUM"
            })

    return findings
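
It helps to see what these thresholds imply. Per-character Shannon entropy cannot exceed log2 of the number of distinct characters in the string, so a value needs more than 16 distinct characters to score above 4.0 and at least 23 to score above 4.5. The standalone check below re-declares an entropy function so it runs on its own; the sample strings are made up and none is a real credential.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Shannon entropy in bits per character."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(data).values())


samples = {
    "repeated": "a" * 32,                                # one symbol
    "hex": "0123456789abcdef" * 2,                       # 16 symbols, evenly used
    "base64-like": "Zx8Qp2LmVt4RkYw7NbHs9JcDfGu3TaEo",   # 32 distinct symbols
}

for name, value in samples.items():
    score = shannon_entropy(value)
    if score > 4.5:
        verdict = "HIGH"
    elif score > 4.0:
        verdict = "MEDIUM"
    else:
        verdict = "ignored"
    print(f"{name}: {score:.2f} bits -> {verdict}")
```

The hex sample lands on exactly 4.0 bits and is ignored, which suggests that hex-encoded tokens may need a separate, lower threshold keyed to their character set. Likewise, a 20-character value can reach at most log2(20), about 4.32 bits, so it can never be rated HIGH by entropy alone.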

Human Review Dashboard

Automation produces findings at a rate no team can manually review in real time. The React dashboard implements a priority queue with three levels:

Critical (immediate): Active database credentials, live AWS keys, exposed admin panels with default passwords
High (same day): High-confidence PII in public documents, exposed .env files, CORS misconfigurations on sensitive APIs
Medium (weekly batch): Low-confidence PII, missing security headers, informational exposures

The dashboard shows the analyst the full context window around each finding, with a one-click action to mark as confirmed, false positive, or escalated. Confirmed findings automatically populate the reporting template in CERT-IN's required format.
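
The three-level queue described above could be backed by something as simple as a binary heap. The following is a minimal sketch, assuming findings are plain dictionaries; `Priority` and `ReviewQueue` are hypothetical names and the post does not describe how the real queue is implemented.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0  # immediate
    HIGH = 1      # same day
    MEDIUM = 2    # weekly batch


class ReviewQueue:
    """Priority queue that serves lower Priority values first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict]] = []
        # Monotonic counter: keeps first-in, first-out order within a level
        # and ensures the finding dicts themselves are never compared.
        self._counter = itertools.count()

    def push(self, priority: Priority, finding: dict) -> None:
        heapq.heappush(self._heap, (int(priority), next(self._counter), finding))

    def pop(self) -> dict:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this arrangement a newly pushed critical finding is always served before any pending high or medium item, while items of equal priority keep their arrival order.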

Results and Lessons Learned

Across the pilot phase of the audit, the pipeline surfaced findings that included exposed service account credentials, citizen data in publicly accessible PDF reports, and misconfigured cloud storage. The OCR component in particular found issues that had been entirely invisible to previous text-based scans.

Key lessons from building this system:

False positives are a feature, not a bug. At scale, you'd rather have human eyes on a false positive than miss a real credential exposure, so set detection thresholds to favour recall and let the review queue absorb the noise.
Context windows matter enormously for NER accuracy. A 12-digit number is likely to be an Aadhaar number only if it appears near "Aadhaar", "UID", or a name field, and pure pattern matching ignores that context entirely.
Rate limiting and politeness are non-negotiable for government infrastructure. Aggressive crawling can inadvertently degrade legitimate citizen services on under-resourced servers.
The reporting format is half the work. Findings are only valuable if they're communicated in a format that the responsible ministry can act on. Investing in good report templates paid dividends in remediation velocity.
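
The politeness lesson above comes down to spacing out requests per host. Here is a minimal sketch of a per-host throttle, written in Python for illustration. The class name, the default interval, and the injectable clock are assumptions, and the post does not describe how the Go-based spider actually limits its request rate.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # The next request to this host must wait a full interval from here.
        self._next_allowed[host] = now + delay + self._min_interval
        return delay
```

A crawler would call `throttle.wait(url)` immediately before each fetch. This version is single-threaded; a concurrent crawler would need a lock around the shared dictionary.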

AI-assisted scanning doesn't replace the skilled security analyst — it changes where their time is spent. Instead of crawling pages manually, analysts review findings, validate edge cases, and work directly with stakeholders on remediation. That's a meaningful improvement in how a security team's expertise is applied.