Making machines act human

A simulation engineering problem — modelling human typing behaviour for browser automation

Browser Automation · Behavioural Simulation · Systems Design · Vibe Engineering · AI Integration

Product

Multi-tenant WhatsApp lead automation

Six services orchestrated into a single pipeline — monitoring group conversations, identifying leads, and delivering personalised responses.

Role

Solo developer — architecture to deployment

End-to-end ownership across Puppeteer, Airtable, Make.com, OpenAI, Twilio, and Softr.

Goal

Model real human typing behaviour, not just add randomness

Build a physics-based simulation using keyboard proximity, cognitive pauses, and correction rhythms.

Domain

Real estate SaaS — Southeast Asia

WhatsApp-first market with multiple concurrent agents across high-volume property groups.

Stack

Node.js, Puppeteer, Airtable, Make.com, Twilio, OpenAI

Six services orchestrated into a single automated pipeline with Softr powering the client dashboard.

I built a multi-tenant system that monitors WhatsApp group conversations in real time, identifies qualified real estate leads through keyword matching, and delivers AI-generated personalised responses via Twilio. Six services — Puppeteer, Airtable, Make.com, OpenAI, Twilio, and Softr — orchestrated into a single automated pipeline supporting multiple agents simultaneously.

The system worked. But early on, I hit a problem that turned out to be more interesting than I expected.

The Problem

When a browser automation tool interacts with a web application, it does so with machine precision. Puppeteer’s page.type() sends every keystroke at a perfectly uniform interval. There are no mistakes, no hesitation, no variance. It types like a machine because it is one.

This created a quality problem. The interactions my system was producing didn’t feel like a real person using WhatsApp Web — they felt synthetic. Every search query was typed with identical rhythm, zero errors, and instant submission.

I wanted to solve this properly: not by adding random delays, but by actually modelling how humans type.

“If another human were watching this browser session, they would immediately know something was off.”

David Quill

Breaking it down

I started by thinking about what makes human typing recognisably human. Three things stood out:

Timing is variable, not random. Humans don’t type at constant speed, but the variation isn’t purely random either — it follows patterns based on finger movement distance, familiarity with the word, and cognitive load.

Mistakes follow physical constraints. When a human mistypes, the wrong key is almost always physically adjacent to the intended key. You hit ‘s’ instead of ‘a’ because your finger drifted left, not because you randomly pressed ‘k’.

Corrections have their own rhythm. There’s a perceptible pause between making a mistake and pressing Backspace — the moment of noticing. Then another brief pause before continuing. These micro-pauses are part of the human signature.

The Solution

Keyboard Proximity Graph

The first thing I built was a spatial model of the QWERTY keyboard. Each key maps to its physically adjacent neighbours:

javascript
const nearbyKeys = {
  a: ['s', 'q', 'z'],  b: ['v', 'g', 'n'],
  c: ['x', 'v', 'f'],  d: ['s', 'f', 'e'],
  e: ['w', 'r', 'd'],  f: ['d', 'g', 'r'],
  g: ['f', 'h', 't'],  h: ['g', 'j', 'y'],
  i: ['u', 'o', 'k'],  l: ['k', 'o'],
  n: ['b', 'm', 'h'],  o: ['i', 'p', 'l'],
  p: ['o', 'l'],       q: ['w', 'a'],
  r: ['e', 't', 'f'],  s: ['a', 'd', 'w'],
  t: ['r', 'y', 'g'],  u: ['y', 'i', 'j'],
  v: ['c', 'b', 'g'],  w: ['q', 'e', 's'],
  x: ['z', 'c', 'd'],  y: ['t', 'u', 'h'],
  z: ['x', 'a']
};

function getNearbyKey(char) {
  const lower = char.toLowerCase();
  const options = nearbyKeys[lower];
  if (!options || options.length === 0) return char;
  return options[Math.floor(Math.random() * options.length)];
}

This isn’t a random character generator — it’s a physical model. When the system “mistypes,” it produces the same kind of error a human finger would: adjacent key drift. Pressing ‘d’ instead of ‘f’, not ‘d’ instead of ‘p’.

The Typing Sequence

With the proximity graph in place, I built the full typing simulation:

javascript
let typoMade = false;
let typoIndex = Math.floor(Math.random() * text.length);

for (let i = 0; i < text.length; i++) {
  const char = text[i];
  const delay = 120 + Math.floor(Math.random() * 100); // 120-220ms

  if (!typoMade && i === typoIndex) {
    // 1. Type the wrong (adjacent) key
    const wrongChar = getNearbyKey(char);
    await page.type('div[role="textbox"]', wrongChar, { delay });

    // 2. Pause — the "noticing" moment (300-600ms)
    await new Promise(res => setTimeout(res, 300 + Math.random() * 300));

    // 3. Correct the mistake
    await page.keyboard.press('Backspace');

    // 4. Brief recovery pause (100-300ms)
    await new Promise(res => setTimeout(res, 100 + Math.random() * 200));

    typoMade = true;
  }

  // Type the correct character with variable delay
  await page.type('div[role="textbox"]', char, { delay });
}

// Pre-submission pause — reviewing before hitting Enter
await new Promise(res => setTimeout(res, 800 + Math.random() * 400));
await page.keyboard.press('Enter');

What Each Layer Models

| Behaviour | Implementation | Why It Matters |
| --- | --- | --- |
| Variable typing speed | 120–220ms per keystroke | Constant intervals are the strongest machine signal |
| Adjacent-key typos | QWERTY proximity graph | Random typos don’t match real finger-drift patterns |
| Recognition pause | 300–600ms before Backspace | Humans don’t instantly notice mistakes |
| Recovery pause | 100–300ms after Backspace | Brief reset before resuming typing |
| Pre-submit review | 800–1200ms before Enter | The natural moment of reading before confirming |

Each delay range was calibrated through observation — watching myself and others type and noting the patterns. The ranges aren’t arbitrary; they represent the band of natural human variance for each micro-behaviour.

“Effective simulation isn’t about adding randomness — it’s about understanding the structure of the behaviour you’re modelling.”

David Quill

Challenges

Building the simulation was the most interesting problem, but the broader system surfaced several others worth noting.

Multi-tenant session isolation

Each agent needed a completely independent WhatsApp session, keyword set, contact blacklist, and credit balance. I solved this with process-level isolation — each user runs as a separate Node.js child process with USER_EMAIL threaded through every layer as the universal tenant identifier. The key lesson: multi-tenancy is an architectural decision you make on day one, not a feature you bolt on later.
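The process-level isolation described above can be sketched as follows — a minimal illustration assuming a hypothetical bot.js entry file and tenant list; the production wiring is more involved:

javascript
// Sketch: per-tenant process isolation. The './bot.js' entry file and
// the tenant emails are illustrative assumptions.
const { fork } = require('child_process');

function startTenant(userEmail) {
  // Each tenant runs as its own Node.js process; USER_EMAIL is the
  // universal tenant identifier threaded through every layer.
  const child = fork('./bot.js', [], {
    env: { ...process.env, USER_EMAIL: userEmail },
  });
  child.on('exit', (code) => {
    console.log(`[${userEmail}] exited with code ${code}`);
  });
  return child;
}

// ['agent-one@example.com', 'agent-two@example.com'].map(startTenant);

Because the identifier travels in the environment rather than in shared state, no tenant can see another's session, keywords, or credits.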

Unstable DOM extraction

WhatsApp Web’s DOM changes without notice. I built a multi-strategy phone number extraction system with cascading fallbacks — if the primary selector breaks, secondary and tertiary strategies catch what the primary missed. Defensive extraction with format validation at each stage prevented silent data corruption.
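A minimal sketch of that cascading pattern — the regex strategies and the isValidPhone check are illustrative stand-ins for the real selectors and format validation:

javascript
// Sketch of cascading extraction with validation at each stage.
// The strategy order and regexes are illustrative assumptions.
function isValidPhone(value) {
  if (typeof value !== 'string') return false;
  return /^\+?\d{8,15}$/.test(value.replace(/[\s-]/g, ''));
}

const strategies = [
  // Primary: international format with a country code
  (text) => (text.match(/\+\d{1,3}(?:[\s-]?\d){7,12}/) || [])[0],
  // Secondary: bare digit runs long enough to be a number
  (text) => (text.match(/\d{9,12}/) || [])[0],
];

function extractPhone(text) {
  for (const strategy of strategies) {
    const candidate = strategy(text);
    // Validate before trusting — a fallback that returns garbage is
    // worse than one that returns nothing
    if (candidate && isValidPhone(candidate)) return candidate;
  }
  return null; // fail visibly downstream rather than corrupt data silently
}

Each stage only wins if its candidate passes format validation, which is what prevents a broken selector from silently feeding junk into the pipeline.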

Transient API failures

Airtable rate-limits at 5 req/sec, and network timeouts were silently dropping matched leads. I added an Axios response interceptor that retries on transient errors (timeouts, connection resets) while letting permanent errors (validation, bad data) fail immediately. The distinction between retryable and non-retryable failures turned an unreliable integration into a near-zero-loss pipeline.
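The retryable/permanent split can be sketched like this — the status codes, retry cap, and backoff values are illustrative assumptions, not the production configuration:

javascript
// Sketch: a retry-on-transient-error response interceptor for an
// Axios instance. Thresholds and backoff are assumptions.
const RETRYABLE_CODES = new Set(['ECONNABORTED', 'ECONNRESET', 'ETIMEDOUT']);
const MAX_RETRIES = 3;

function isRetryable(error) {
  if (error.code && RETRYABLE_CODES.has(error.code)) return true; // network-level
  const status = error.response && error.response.status;
  return status === 429 || (status >= 500 && status < 600);       // rate limit / server
}

// Attach to any axios instance, e.g. attachRetry(axios.create({ timeout: 10000 }))
function attachRetry(client) {
  client.interceptors.response.use(null, async (error) => {
    const config = error.config || {};
    config.__retryCount = (config.__retryCount || 0) + 1;
    if (!isRetryable(error) || config.__retryCount > MAX_RETRIES) {
      throw error; // permanent failures (validation, bad data) surface immediately
    }
    // Linear backoff keeps the retry traffic under Airtable's 5 req/sec limit
    await new Promise((res) => setTimeout(res, 250 * config.__retryCount));
    return client(config);
  });
  return client;
}

The important design choice is that a 422 validation error never retries — retrying it would just hammer the API with a request that can never succeed.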

Silent disconnection

The most dangerous failure mode wasn’t a crash — it was WhatsApp Web sessions expiring silently. The bot would keep running with no errors, but receive no messages. I built a companion connection monitor that evaluates DOM state every 2 minutes and syncs status to Airtable, plus an Express API that serves the QR code as a base64 image for remote reconnection.
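The monitor's core decision — mapping observed DOM state to a session status — can be sketched as follows; the selectors and status labels are assumptions for illustration:

javascript
// Sketch of the connection monitor's decision logic. Selector names
// and status labels are illustrative assumptions.
function classifySession({ hasChatList, hasQrCanvas }) {
  if (hasChatList) return 'CONNECTED';
  if (hasQrCanvas) return 'AWAITING_QR_SCAN'; // session expired, QR shown
  return 'UNKNOWN'; // page still loading, or the DOM changed again
}

// Run against a live Puppeteer page every 2 minutes and sync the result
async function checkConnection(page) {
  const observed = await page.evaluate(() => ({
    hasChatList: !!document.querySelector('div[role="grid"]'),
    hasQrCanvas: !!document.querySelector('canvas'),
  }));
  return classifySession(observed);
}

// setInterval(() => checkConnection(page).then(syncStatusToAirtable), 2 * 60 * 1000);

Separating classification from observation also makes the dangerous case explicit: a session can be "running" at the process level while classified as anything but CONNECTED.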

Platform tier constraints

Make.com’s programmatic API is paywalled. I built a webhook-based workaround that uses the free tier’s webhook triggers as the entry point to a 12-module automation scenario. This constraint actually produced a cleaner integration boundary — the webhook URL became a stable contract between the monitoring layer and the automation layer.
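The webhook-as-contract idea reduces to something like this sketch — the payload shape and helper names are hypothetical, not the production schema:

javascript
// Sketch: posting a matched lead to a Make.com webhook URL. The
// payload fields and function names are illustrative assumptions.
const https = require('https');

function buildLeadPayload(lead) {
  // A minimal, stable shape — adding fields later is backwards-compatible
  return JSON.stringify({
    userEmail: lead.userEmail,         // tenant identifier
    phone: lead.phone,
    matchedKeyword: lead.matchedKeyword,
    message: lead.message,
  });
}

function postLead(webhookUrl, lead) {
  return new Promise((resolve, reject) => {
    const body = buildLeadPayload(lead);
    const req = https.request(webhookUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    }, (res) => resolve(res.statusCode));
    req.on('error', reject);
    req.end(body);
  });
}

As long as the payload shape holds, either side of the boundary can be rebuilt without touching the other — which is what made the free-tier constraint a net win.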

Reflection

On the surface, the typing simulation is a small feature — maybe 40 lines of code. But it sits at the intersection of several disciplines.

Signal processing. The core question is: what distinguishes a human-generated signal from a machine-generated one? The answer turns out to be variance structure — not just “add randomness,” but “add the right kind of randomness that matches physical and cognitive constraints.”

Behavioural modelling. The typo system isn’t random error injection. It’s a simplified physical model: fingers occupy space on a keyboard, drift happens in predictable directions, and correction has a cognitive pipeline (notice → decide → act) with measurable latency at each stage.

Practical constraints. The model had to be lightweight enough to run inline during a typing sequence without introducing noticeable computational delay. A more sophisticated model (Fitts’ law for inter-key timing, per-user typing profiles) would be interesting but unnecessary for the use case.

What I’d improve

The typing simulation models a single behavioural archetype. In practice, people have different typing profiles — speed, error rate, correction patterns. A more complete model would introduce per-session variation: some sessions type faster with fewer mistakes, others slower with more corrections. I’d also add digraph timing — the delay between specific key pairs. Typing “th” is faster than typing “zx” because of finger positioning and muscle memory.
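A rough sketch of what per-session profiles and digraph timing might look like — the ranges and the crude same-hand heuristic are assumptions, far short of real digraph frequency tables:

javascript
// Sketch: per-session typing profiles plus digraph-aware delays.
// The profile ranges and hand assignment are illustrative assumptions.
function makeProfile() {
  return {
    baseDelay: 100 + Math.random() * 60,    // this session's characteristic speed
    errorRate: 0.01 + Math.random() * 0.04, // chance of a typo per keystroke
  };
}

const LEFT_HAND = new Set('qwertasdfgzxcvb');

function digraphDelay(profile, prev, next) {
  // Alternating hands is faster than same-hand sequences ('th' vs 'zx')
  const sameHand = LEFT_HAND.has(prev) === LEFT_HAND.has(next);
  const factor = sameHand ? 1.3 : 0.85;
  return Math.round(profile.baseDelay * factor + Math.random() * 40);
}

Even this crude two-hand split recovers the 'th'-faster-than-'zx' effect; a fuller model would learn per-pair timings from real keystroke data.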

“A random delay is noise. A delay calibrated to finger drift on a QWERTY layout is simulation.”

David Quill