
Retro Phish — Research Methodology

Working Title: State of Phishing in the Rise of GenAI
Author: Scott Altiparmak
Status: Active collection — dataset frozen at v1


Core Research Question (LOCKED)

Which phishing techniques are humans most likely to miss when linguistic quality is no longer a reliable detection signal?

Design decisions that flow from this question

  • Dataset is entirely GenAI-generated. Linguistic quality is held constant as the baseline assumption — this is the post-GenAI world where grammar and fluency can no longer be relied upon as tells. We are not studying whether AI-generated emails fool people; we are studying which technique fools them when quality is constant.
  • Technique is the primary independent variable. Six techniques are studied, each represented with equal volume and controlled difficulty distribution.
  • Human detection rate (correct identification as phishing) is the primary outcome metric.
  • Behavioral signals are secondary findings: response time, confidence level, confidence selection time, scroll depth, answer method, session position.
  • Sample is self-selected game players. Limitation noted: the game's nature may attract security-interested individuals, meaning bypass rates are likely conservative estimates for the general population.

Why this is novel

Prior technique-level phishing detection research was built on datasets where linguistic quality varied across samples. This conflates technique with quality as variables. By holding linguistic quality constant at AI-generation level and using identical difficulty distributions across all techniques, this study isolates technique as the sole variable. No published baseline exists for technique-level human detection rates under this condition. This study establishes that baseline.

Publication target: Blog post on scottaltiparmak.com — "Which Phishing Techniques Fool You When AI Writes the Email"
Last updated: 2026-03-06


Dataset: Retro Phish v1

Overview

| Attribute | Value |
|---|---|
| Total cards | 1,000 |
| Phishing cards | 690 |
| Legitimate cards | 310 |
| Types | Email |
| Generation | Claude Haiku + Sonnet (documented prompt templates, 3 Haiku + 1 Sonnet per batch) |

All cards are AI-generated by design. There are no real-world phishing samples in the dataset. This is a deliberate research choice: it eliminates PII handling, sourcing/licensing complexity, and uncontrolled variation in linguistic quality across samples.

Phishing Cards (690 total)

Six techniques, 115 cards each. Each technique has a fixed difficulty split: 35 cards each at easy, medium, and hard, plus 10 at extreme:

| Technique | Easy | Medium | Hard | Extreme | Total |
|-----------|------|--------|------|---------|-------|
| urgency | 35 | 35 | 35 | 10 | 115 |
| authority-impersonation | 35 | 35 | 35 | 10 | 115 |
| credential-harvest | 35 | 35 | 35 | 10 | 115 |
| hyper-personalization | 35 | 35 | 35 | 10 | 115 |
| pretexting | 35 | 35 | 35 | 10 | 115 |
| fluent-prose | 35 | 35 | 35 | 10 | 115 |
| Total | 210 | 210 | 210 | 60 | 690 |
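The composition arithmetic can be sanity-checked in a few lines; a minimal sketch (constants mirror the table above, names are illustrative):

```typescript
// Difficulty allocation per technique, mirroring the composition table.
const TECHNIQUES = [
  "urgency", "authority-impersonation", "credential-harvest",
  "hyper-personalization", "pretexting", "fluent-prose",
];
const PER_TECHNIQUE = { easy: 35, medium: 35, hard: 35, extreme: 10 };

const perTechniqueTotal = Object.values(PER_TECHNIQUE).reduce((a, b) => a + b, 0); // 115
const phishingTotal = TECHNIQUES.length * perTechniqueTotal;                        // 690
const legitTotal = 110 + 100 + 100;                                                 // 310
const datasetTotal = phishingTotal + legitTotal;                                    // 1,000
```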

Technique definitions:

| Technique | Description |
|-----------|-------------|
| urgency | False time pressure or threat of account loss — the most traditional and recognisable phishing vector |
| authority-impersonation | Impersonates IT, management, government entity, or known brand with authority framing |
| credential-harvest | Explicit credential request or redirect to a login page — direct ask for passwords or access |
| hyper-personalization | Uses recipient name, role, company, or context convincingly — a primary GenAI differentiator |
| pretexting | Builds a false scenario before the ask — invoice dispute, ongoing thread, project context |
| fluent-prose | Polished natural language with no traditional tells — phish is embedded in entirely plausible correspondence |

Difficulty calibration:

  • Easy: One or more traditional tells present (urgency language, suspicious sender domain, implausible pretext, direct ask for credentials)
  • Medium: Surface quality is good; technique is present but subtle; one careful read should reveal it
  • Hard: High-quality phishing with technique well-concealed; requires close reading to detect
  • Extreme: Near-indistinguishable from legitimate correspondence; technique fully embedded; even a security-aware reader may miss it on a first or second pass. Expert Mode cards.

Legitimate Cards (310 total)

| Category | Count |
|----------|-------|
| Transactional | 110 |
| Marketing | 100 |
| Workplace | 100 |
| Total | 310 |

Category definitions:

  • Transactional: Order confirmations, shipping notifications, account statements, password reset confirmations — routine expected correspondence
  • Marketing: Promotional emails, newsletters, product updates — legitimate but often visually similar to phishing lures
  • Workplace: Internal communication, IT notifications, meeting invites, HR updates — the category most likely to trigger false positives

Legitimate cards are generated with the same linguistic quality standard as phishing cards. No intentional quality degradation.

Generation Standards

All cards are generated using documented prompt templates stored in docs/prompts/. Each prompt template specifies:

  • Target technique (phishing) or category (legitimate)
  • Difficulty tier and calibration criteria
  • Required output fields (from, subject, body, highlights, clues, explanation)
  • Quality constraints (no grammar errors, no traditional tells unless calibrated for easy difficulty)
  • Diversity requirements (varied sender domains, industries, contexts)

Each card records: ai_model, ai_prompt_version, generation_date.
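The template contract above can be written down as a type. This is an illustrative sketch only; the actual shape of the files in docs/prompts/ may differ:

```typescript
// Hypothetical prompt-template shape (illustrative; not the real file format).
interface PromptTemplate {
  target:
    | { kind: "phishing"; technique: string; difficulty: "easy" | "medium" | "hard" | "extreme" }
    | { kind: "legitimate"; category: string };
  outputFields: string[];          // from, subject, body, highlights, clues, explanation
  qualityConstraints: string[];    // e.g. "no grammar errors"
  diversityRequirements: string[]; // e.g. "varied sender domains"
  version: string;                 // recorded on each card as ai_prompt_version
}

const example: PromptTemplate = {
  target: { kind: "phishing", technique: "pretexting", difficulty: "hard" },
  outputFields: ["from", "subject", "body", "highlights", "clues", "explanation"],
  qualityConstraints: ["no grammar errors", "no traditional tells"],
  diversityRequirements: ["varied sender domains", "varied industries"],
  version: "v1",
};
```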


Curation Pipeline

Stage 1: Generation

Cards generated by scripts/generate-cards.ts using structured prompt templates per technique/category. Output written directly to cards_staging with status = 'pending' and source_corpus = 'generated'.

Generation runs in batches: 20 cards per call, one technique/category per batch. Batch tracked in import_batches.
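At 20 cards per call, covering one technique's 115-card allocation takes six calls, assuming the 15-card remainder is generated in a final short call (that last detail is an assumption, not documented behavior):

```typescript
const CARDS_PER_CALL = 20;
const CARDS_PER_TECHNIQUE = 115;

// Calls needed to cover one technique's allocation, and the size of the
// final partial call (if the pipeline generates remainders that way).
const callsPerTechnique = Math.ceil(CARDS_PER_TECHNIQUE / CARDS_PER_CALL); // 6
const remainderCallSize = CARDS_PER_TECHNIQUE % CARDS_PER_CALL;            // 15
```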

Stage 2: Admin Review

All generated cards reviewed via /admin review UI before approval. Reviewer (Scott) sees:

  • Card display as players will see it
  • Suggested fields from generation: technique, difficulty, highlights, clues, explanation
  • All fields editable inline

Reviewer actions:

  • Edit any field
  • Confirm or adjust technique/difficulty
  • Approve or reject
  • Rejection reason logged

No card enters cards_real without human review. This is the quality gate: cards that are low-quality, implausible, or do not cleanly represent their technique are rejected.

Stage 3: Dataset Freeze

Once cards_real reaches 1,000 approved cards (690 phishing + 310 legitimate, balanced across techniques), the dataset is frozen as v1. Freeze recorded in dataset_versions.


Database Schema

cards_staging

Holds generated cards awaiting review. Fields used for generated content:

| Field | Used | Notes |
|-------|------|-------|
| id | ✓ | UUID primary key |
| import_batch_id | ✓ | FK to import_batches (generation batch) |
| source_corpus | ✓ | Always 'generated' |
| raw_from | ✓ | Generated sender address |
| raw_subject | ✓ | Generated subject line |
| raw_body | ✓ | Generated email/SMS body |
| inferred_type | ✓ | email / sms |
| is_phishing | ✓ | Set at generation |
| suggested_technique | ✓ | From generation prompt |
| suggested_difficulty | ✓ | From generation prompt |
| suggested_highlights | ✓ | Phrases to highlight in feedback |
| suggested_clues | ✓ | Analyst clues for feedback |
| suggested_explanation | ✓ | Why this is/isn't phishing |
| ai_provider | ✓ | openai / anthropic ('anthropic' for all v1 cards) |
| ai_model | ✓ | Model identifier (Claude Haiku or Sonnet for v1) |
| ai_preprocessing_version | ✓ | Prompt template version |
| status | ✓ | pending / approved / rejected |
| raw_email_hash | — | N/A for generated cards (no dedup needed) |
| email_headers_json | — | N/A |
| genai_detector_score | — | N/A (all cards are known AI-generated) |
| is_genai_suspected | — | N/A |

cards_real

Approved, curated live dataset:

| Field | Type | Notes |
|-------|------|-------|
| id | UUID | Primary key |
| staging_id | UUID | FK to cards_staging |
| card_id | TEXT | Unique game ID, e.g. real-p-001 |
| type | TEXT | email / sms |
| is_phishing | BOOLEAN | |
| difficulty | TEXT | easy / medium / hard / extreme |
| secondary_technique | TEXT | Secondary phishing technique, if applicable (null for most cards) |
| from_address | TEXT | |
| subject | TEXT | |
| body | TEXT | |
| technique | TEXT | Primary technique |
| highlights | TEXT[] | Phrases to highlight in feedback |
| clues | TEXT[] | Analyst clues |
| explanation | TEXT | Why this is/isn't phishing |
| auth_status | TEXT | verified / unverified / fail — simulated SPF/DKIM/DMARC result |
| reply_to | TEXT | Mismatched reply-to address (hard/extreme phishing only) |
| attachment_name | TEXT | Filename shown in ATCH row (when card references an attachment) |
| sent_at | TEXT | RFC 2822 timestamp — odd hours for phishing, business hours for legit |
| ai_model | TEXT | Which model generated the card |
| ai_preprocessing_version | TEXT | Prompt template version |
| dataset_version | TEXT | v1 |
| approved_at | TIMESTAMPTZ | |

answers

Every answer event from research mode:

| Field | Type | Notes |
|-------|------|-------|
| id | UUID | Primary key |
| session_id | UUID | Groups answers from same game |
| card_id | TEXT | |
| is_phishing | BOOLEAN | Ground truth |
| technique | TEXT | Primary technique (null for legit cards) |
| difficulty | TEXT | |
| type | TEXT | email / sms |
| user_answer | TEXT | phishing / legit |
| correct | BOOLEAN | |
| confidence | TEXT | guessing / likely / certain |
| time_from_render_ms | INT | Card shown → answer submitted |
| time_from_confidence_ms | INT | Confidence selected → answer submitted |
| confidence_selection_time_ms | INT | Card shown → confidence selected |
| scroll_depth_pct | SMALLINT | 0–100 |
| answer_method | TEXT | swipe / button |
| answer_ordinal | SMALLINT | Position in session (1–10) |
| streak_at_answer_time | SMALLINT | |
| correct_count_at_time | SMALLINT | |
| game_mode | TEXT | freeplay / daily / research / expert / preview |
| is_daily_challenge | BOOLEAN | |
| card_source | TEXT | generated / real |
| dataset_version | TEXT | v1 |
| is_genai_suspected | BOOLEAN | Card flagged as likely GenAI-generated |
| genai_confidence | TEXT | low / medium / high (null if not suspected) |
| grammar_quality | SMALLINT | 0–5 rating from generation metadata |
| prose_fluency | SMALLINT | 0–5 rating from generation metadata |
| personalization_level | SMALLINT | 0–5 rating from generation metadata |
| contextual_coherence | SMALLINT | 0–5 rating from generation metadata |
| secondary_technique | TEXT | Secondary phishing technique if applicable |
| player_id | UUID | FK to players table (pseudonymous) |
| headers_opened | BOOLEAN | Player opened the [HEADERS] panel |
| url_inspected | BOOLEAN | Player tapped a URL to inspect it |
| auth_status | TEXT | Card's SPF/DKIM/DMARC result (verified / unverified / fail) |
| has_reply_to | BOOLEAN | Card had a mismatched Reply-To address |
| has_url | BOOLEAN | Card body contained at least one URL |
| has_attachment | BOOLEAN | Card had an attachment name set |
| has_sent_at | BOOLEAN | Card had a sent timestamp (odd-hours signal available) |
| created_at | TIMESTAMPTZ | |

sessions

One row per game played:

| Field | Type | Notes |
|-------|------|-------|
| session_id | UUID | Primary key |
| game_mode | TEXT | freeplay / daily / research / expert / preview |
| is_daily_challenge | BOOLEAN | |
| started_at | TIMESTAMPTZ | |
| completed_at | TIMESTAMPTZ | Null if abandoned |
| cards_answered | SMALLINT | |
| final_score | INT | |
| final_rank | TEXT | |
| device_type | TEXT | mobile / tablet / desktop |
| viewport_width | SMALLINT | |
| viewport_height | SMALLINT | |
| referrer | TEXT | |

import_batches

Tracks each generation batch:

| Field | Type | Notes |
|-------|------|-------|
| batch_id | UUID | Primary key |
| source_corpus | TEXT | Always 'generated' for v1 |
| import_date | TIMESTAMPTZ | |
| raw_count | INT | Cards generated |
| processed_count | INT | Cards reviewed |
| approved_count | INT | Cards approved |
| rejected_count | INT | |
| notes | TEXT | Technique, difficulty, model, prompt version |

dataset_versions

Version registry:

| Field | Type | Notes |
|-------|------|-------|
| version | TEXT | v1, v2, etc. |
| locked_at | TIMESTAMPTZ | Null until frozen |
| total_cards | INT | |
| phishing_count | INT | |
| legit_count | INT | |
| description | TEXT | |


Data Collection (Gameplay)

Research Mode requires a player account. Answers are linked to a pseudonymous player UUID via the player_id foreign key in the answers table. Email addresses are held only in Supabase Auth and are never stored in research tables — our own tables record only UUIDs, game mode, technique, correctness, confidence, and timing signals. A session UUID is generated at game start and persisted to the sessions table to group answers from the same round.

Per-player collection cap: Each player can contribute a maximum of 30 research answers (3 complete sessions of 10 cards each), enforced server-side. After reaching the cap, players are marked as research-graduated and gain access to Expert Mode. This cap prevents any single player from dominating the dataset and creates an incentive structure for completion. Answers beyond the cap are silently discarded at the API layer.
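The cap check at the API layer reduces to a small decision function; this is a sketch (helper name and shape are hypothetical, the real enforcement lives server-side):

```typescript
const RESEARCH_ANSWER_CAP = 30; // 3 sessions × 10 cards

// Decide whether an incoming research answer is recorded, and whether this
// answer graduates the player (unlocking Expert Mode). Answers at or beyond
// the cap are silently discarded.
function capDecision(existingAnswerCount: number): { record: boolean; graduates: boolean } {
  const record = existingAnswerCount < RESEARCH_ANSWER_CAP;
  const graduates = record && existingAnswerCount + 1 === RESEARCH_ANSWER_CAP;
  return { record, graduates };
}
```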

Timing measurements:

  • time_from_render_ms — card first render to answer submission
  • time_from_confidence_ms — confidence selection to answer submission (pure decision deliberation)
  • confidence_selection_time_ms — card render to confidence selection

Scroll depth: tracked via scrollTop ratio on the card body element. Records maximum scroll percentage reached before answering.
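A minimal sketch of the scrollTop-ratio computation (DOM event wiring omitted; treating a card that fits without scrolling as 100% read is an assumption of this sketch):

```typescript
// Scroll percentage: scrollTop over the scrollable range, clamped to 0–100.
function scrollDepthPct(scrollTop: number, scrollHeight: number, clientHeight: number): number {
  const scrollable = scrollHeight - clientHeight;
  if (scrollable <= 0) return 100; // card fits in the viewport; nothing to scroll
  return Math.min(100, Math.max(0, Math.round((scrollTop / scrollable) * 100)));
}
```

In the game this would be sampled on scroll events, with the running maximum persisted as scroll_depth_pct.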

No PII in research tables. No IP storage. No behavioural tracking outside the game session.

Consent: Players informed via the game UI that Research Mode answers contribute to anonymised security awareness research. Participation is voluntary and implicit in selecting Research Mode.

Collection target: ~1,000 research mode answers minimum before publishing. With random sampling from a 1,000-card dataset (115 phishing cards per technique), the expected number of answers per technique for N total answers is N × (115/1000). Reaching 100 expected answers per technique therefore requires approximately 870 total research answer events; a target of 1,000 provides a comfortable buffer and a statistically meaningful sample for the primary technique-level comparisons.
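The target arithmetic, as a sketch:

```typescript
const CARDS_PER_TECHNIQUE = 115;
const TOTAL_CARDS = 1000;

// Expected answers per technique under uniform random card sampling.
function expectedPerTechnique(totalAnswers: number): number {
  return totalAnswers * (CARDS_PER_TECHNIQUE / TOTAL_CARDS);
}

// Smallest N whose expectation reaches 100 answers per technique.
const requiredN = Math.ceil((100 * TOTAL_CARDS) / CARDS_PER_TECHNIQUE); // 870
```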


Analysis Plan

Primary Analysis — Bypass Rate by Technique

For each of the 6 techniques:

  • Bypass rate = the proportion of phishing-card answers where user_answer = 'legit' and correct = false (i.e., the phish was missed), out of all phishing-card answers
  • Compared across techniques using proportion comparisons
  • Reported with 95% confidence intervals
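A sketch of the interval computation (normal/Wald approximation; a Wilson interval may be preferable for small strata, so the choice here is illustrative rather than final):

```typescript
// Bypass rate with a 95% normal-approximation confidence interval.
function bypassRateCI(missed: number, totalPhishingAnswers: number) {
  const p = missed / totalPhishingAnswers;
  const se = Math.sqrt((p * (1 - p)) / totalPhishingAnswers);
  const z = 1.96; // 95% two-sided
  return { rate: p, lo: Math.max(0, p - z * se), hi: Math.min(1, p + z * se) };
}
```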

Expected finding direction: hyper-personalization and fluent-prose have higher bypass rates than urgency and credential-harvest (which have more traditional tells even at high quality).

Controlled Comparison

Difficulty is distributed across techniques (35 easy/35 medium/35 hard/10 extreme per technique). Extreme is intentionally under-represented as it represents near-indistinguishable attacks unlikely to be detected by most players. Primary analysis uses all difficulties combined. Difficulty-stratified breakdown reported separately to confirm the technique effect is not an artifact of difficulty distribution.

Secondary Analysis — Background Group Comparison

Players can optionally self-report their professional background: other (general users), technical (technical, non-security), or infosec (security/cybersecurity professionals). Background is set on the player profile and linked to answers via player UUID.

Analysis questions:

  • Do infosec professionals have lower bypass rates than technical non-security users? Than general users?
  • Does the technique ranking hold across background groups, or do certain techniques disproportionately bypass even security-trained individuals?
  • Is the security-aware sample assumption validated by the data (i.e., do infosec players perform measurably better)?

Background is optional and a significant portion of players may not disclose it. This analysis is reported as a supplementary finding, not a primary result.

Secondary Analysis — Behavioral Signals

  • Confidence calibration: CERTAIN vs. LIKELY vs. GUESSING accuracy rate
  • Time-to-decision: does longer deliberation improve accuracy?
  • Scroll depth: does reading the full card improve accuracy?
  • Answer ordinal: within-session learning curve (positions 1–10)
  • Streak effect: does a correct-answer streak correlate with accuracy on subsequent cards?
  • Answer method: swipe vs. button — any accuracy difference?
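The confidence-calibration tabulation can be sketched over records shaped like the answers table (the aggregation helper itself is illustrative):

```typescript
type AnswerRecord = { confidence: "guessing" | "likely" | "certain"; correct: boolean };

// Accuracy rate per confidence level.
function accuracyByConfidence(answers: AnswerRecord[]): Record<string, number> {
  const tally: Record<string, { correct: number; total: number }> = {};
  for (const a of answers) {
    if (!tally[a.confidence]) tally[a.confidence] = { correct: 0, total: 0 };
    tally[a.confidence].total += 1;
    if (a.correct) tally[a.confidence].correct += 1;
  }
  const out: Record<string, number> = {};
  for (const [level, t] of Object.entries(tally)) out[level] = t.correct / t.total;
  return out;
}
```

Well-calibrated players should show accuracy increasing monotonically from guessing to certain; inversions would themselves be a reportable finding.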

Secondary Analysis — Tool Usage

Six signals are available to players during gameplay. Three are passive (always visible), three are active (require deliberate interaction):

| Signal | Type | Behavioral field |
|--------|------|------------------|
| Sender domain (FROM vs body) | Passive | — |
| Send time (SENT row) | Passive | — |
| Attachment name (ATCH row) | Passive | has_attachment on card |
| Authentication headers ([HEADERS]) | Active | headers_opened |
| Reply-To mismatch ([HEADERS]) | Active | headers_opened |
| URL destinations (URL inspector) | Active | url_inspected |

Active tool interactions are logged per answer. Analysis questions:

  • What percentage of players open the headers panel? Inspect URLs?
  • Does opening headers improve accuracy on phishing cards? By technique?
  • Does URL inspection improve accuracy?
  • On cards where auth_status is verified (sophisticated attacker with valid domain), do players who open headers still detect the phishing?

Descriptive Statistics

  • Total answers collected, sessions played, completion rate
  • Bypass rate overall (phishing cards missed / total phishing cards answered)
  • False positive rate (legitimate cards flagged as phishing / total legit cards answered)
  • Technique distribution of answers
  • Difficulty distribution of answers
  • Device type breakdown
  • Confidence level distribution

Sample Characteristics and Limitations

Self-Selected Sample

Players who seek out a retro phishing awareness game are likely more security-aware than the general population. Results should be interpreted as reflecting a security-aware population, not general users. This is a limitation but also produces conservative bypass rates — if even security-aware individuals miss certain techniques at elevated rates, the finding is stronger for the general population.

Text-Based Presentation

The terminal interface strips all visual design cues (logos, branding, CSS styling). Results reflect text-based linguistic phishing recognition, not full email client simulation.

This is appropriate for the research question. GenAI's primary advantage over traditional phishing is text quality, not visual design. Testing linguistic cues in isolation directly matches the research question.

Training Effect — Immediate Feedback

The game provides immediate post-answer feedback including the technique label, difficulty, and forensic signal analysis. This creates a within-session and cross-session learning effect: players calibrate over a session and improve across sessions.

How this is controlled for: answer_ordinal is logged for every answer (position 1–10 within the session). This allows isolation of naive answers (early ordinals) from calibrated answers (later ordinals). Primary analysis uses all ordinals combined. A sensitivity analysis using only ordinals 1–3 (first three cards of each session) tests whether the technique ranking is robust to the learning effect. If the technique ordering holds across both cuts, the finding is valid.
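The sensitivity cut is a simple filter over answer_ordinal (field name from the answers table; the helper is illustrative):

```typescript
type ResearchAnswer = { answer_ordinal: number; correct: boolean };

// Keep only the naive, pre-calibration answers (ordinals 1–3 by default).
function naiveCut<T extends ResearchAnswer>(answers: T[], maxOrdinal = 3): T[] {
  return answers.filter((a) => a.answer_ordinal <= maxOrdinal);
}
```

Running the bypass-rate comparison on both the full set and naiveCut output, then comparing technique rankings, implements the robustness check described above.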

Secondary finding enabled: The learning structure is an opportunity, not only a limitation. Players who return for multiple sessions provide data on technique-specific trainability — which attacks remain hard even after repeated exposure to feedback. Techniques with persistently high bypass rates across return players represent harder-to-train threats. This is separately reportable and directly relevant to security awareness program design.

All-Generated Dataset

The dataset is entirely AI-generated by design. Two implications:

  1. Ecological validity: Real phishing is mixed in quality. The dataset presents only high-quality phishing, which is the direction phishing is heading but does not represent today's full landscape.
  2. Generation bias: All legitimate cards and all phishing cards are generated by Claude (Haiku and Sonnet). There may be stylistic similarities between phishing and legitimate cards that do not exist in the real world. This is partially mitigated by the varied prompt templates, explicit diversity requirements, and the model mix (3 Haiku + 1 Sonnet per batch).

Both limitations are disclosed in the publication.

Repeated Card Exposure

The same card can be served to multiple distinct sessions. There is no within-session repetition (each session draws a unique 10-card sample), but across sessions a given card may be seen by many different players. This is by design: repeated exposure across sessions provides the statistical volume needed for per-card and per-technique analysis. Cards are not removed from the pool after being seen. Answers are linked to a pseudonymous player_id, so returning players can be identified for cross-session analysis (e.g., trainability across sessions). The per-player cap of 30 answers (3 sessions) bounds the maximum contribution from any single player.

fluent-prose Confound

fluent-prose phishing cards are defined by polished natural language with no traditional tells. This technique partially overlaps with the GenAI baseline condition shared by all cards in this dataset — all cards are grammatically fluent by construction. As a result, fluent-prose cards may be harder to distinguish from legitimate cards not because of superior technique, but because the technique definition is closest to the baseline condition. This is a design confound that will be disclosed in the publication. The technique is retained in the dataset because it represents a real and distinct category of attack. Readers should interpret elevated bypass rates for fluent-prose as an upper bound that includes baseline noise.

Auth Header Shortcut

Players who open the [HEADERS] panel and observe SPF/DKIM/DMARC: FAIL have a near-deterministic signal on easy and medium phishing cards, where authentication always fails by design. A player using headers as their primary detection heuristic will produce correct answers that are not attributable to technique recognition — their accuracy reflects forensic hygiene, not response to the technique content.

How this is controlled for in analysis:

  • Primary analysis uses all answers combined.
  • A sensitivity analysis segments results by headers_opened = false (answers made without opening headers) as the "content-only" detection signal. If the technique ranking is consistent across both cuts, the finding is robust to this confound.
  • For hard/extreme phishing cards where auth_status may be verified (attacker registered their own domain with passing authentication), header inspection provides no correct-direction signal — the technique effect is cleanest in this difficulty stratum and should be reported separately.

Difficulty Distribution During Collection

The research deck is drawn by purely random sampling from all cards in cards_real — 10 cards per session with no stratification by technique or difficulty. Difficulty balance is guaranteed at the dataset level (35 cards each at easy, medium, and hard, plus 10 extreme, per technique), not enforced at the session level. Over a sufficient number of sessions, expected exposure across techniques and difficulty tiers converges to the dataset proportions. Difficulty-stratified analysis is a planned secondary analysis that will confirm the technique effect is not an artifact of difficulty imbalance.

Answer Pool Scope

The answers table records answers from all game modes (research, freeplay, daily). Primary analysis uses only game_mode = 'research' answers from cards_real sourced cards. Non-research mode answers are excluded from all findings. The multi-mode table structure is a product decision (unified schema) and does not contaminate research data.


Publication Plan

  1. Blog post (scottaltiparmak.com) — "Which Phishing Techniques Fool You When AI Writes the Email" — detailed write-up with methodology, technique breakdown, and implications. Published once ~1,000 research answers collected.
  2. Public analytics page (/intel) — live aggregate findings, always current. Methodology note links to this document.

Version History

| Version | Date | Notes |
|---------|------|-------|
| 0.1 | 2026-03-01 | Initial methodology draft (real-world corpus plan) |
| 0.2 | 2026-03-01 | Full schema, GenAI classification methodology |
| 1.0 | 2026-03-01 | Pivot to all-generated dataset. New research question locked. 550 cards, 6 techniques. Methodology rewritten. |
| 1.1 | 2026-03-02 | Added auth_status, reply_to, attachment_name, sent_at card fields. Added behavioral tracking (headers_opened, url_inspected, has_reply_to, has_url, has_attachment). Added tool usage secondary analysis. Added training effect / learning section. Signal count corrected to 6. |
| 1.2 | 2026-03-02 | Added disclosures: repeated card exposure, fluent-prose confound, difficulty distribution during collection, answer pool scope. Server-side correct/technique verification added to answer collection pipeline. |
| 1.3 | 2026-03-03 | Added auth header shortcut limitation with sensitivity analysis plan. Updated difficulty distribution section: research deck now stratifies by difficulty tier within technique per session (random tier selection), not pure random sampling. ResearchIntro updated with explicit data collection disclosure. |
| 1.4 | 2026-03-04 | Removed SMS from dataset scope (email only). Added Extreme difficulty tier (15/15/15/15 per technique). Switched research deck from stratified to purely random sampling. Updated collection target to ~1,000 (math updated for random sampling). Updated data collection section: pseudonymous UUID model, corrected anonymity claims. Added secondary_technique to cards_real schema and approve pipeline. |
| 1.5 | 2026-03-04 | Scaled dataset from 550 to 1,000 cards. Phishing: 690 (6 × 35 easy/medium/hard + 6 × 10 extreme). Legit: 310 (110/100/100). Extreme capped at 10 per technique — sufficient for expert mode coverage without over-representing near-indistinguishable attacks. Collection target math updated. |
| 1.6 | 2026-03-06 | Status updated to active collection (dataset frozen at v1). Corrected session UUID claim — sessions ARE persisted to the sessions table. Corrected returning-player claim — player_id FK enables cross-session analysis. Added per-player collection cap (30 answers / 3 sessions, server-side enforced). Added missing answers table fields: player_id, card_source, is_genai_suspected, genai_confidence, grammar_quality, prose_fluency, personalization_level, contextual_coherence, secondary_technique, has_sent_at. Corrected game_mode values across schema tables. |