Detection Scoring

Overview

The confidence score

Every potential detection is assigned a confidence score from 0.0 to 1.0. Only detections that meet or exceed the minimum threshold of 0.6 are flagged, and actions (masking, tokenizing, blocking) fire only for flagged detections.

Score range: 0.0 to 1.0, with the threshold at 0.6

Below threshold: ignored

Score < 0.6. No action taken, no log entry produced.

At or above threshold: flagged

Score ≥ 0.6. Policy is evaluated and the configured action fires.

Step 1

Base confidence

When a pattern matches, the detection starts with the base confidence score for its entity type, which reflects how reliably that pattern identifies genuine sensitive data. Structural validators (the Luhn algorithm for credit cards, check-digit math for IBANs) raise base confidence to around 0.90 before any contextual analysis occurs.

Entity	Base Confidence	Validator
SSN	0.80	Pattern only
Credit Card	0.90	Luhn + BIN range
IBAN	0.88	Country check digit
Email	0.85	RFC pattern
Phone	0.72	Format matching
MRN	0.78	Pattern only

Step 2

Context scoring

The text surrounding a match is analyzed for keywords that signal whether the match is genuine sensitive data or an incidental pattern hit. Keywords that name a sensitive field (for example, "social security") boost the score; terms associated with sample, test, or non-sensitive contexts apply a penalty. As a result, the same raw pattern can land on opposite sides of the threshold depending on context.

Example A: Boosted

"Please update the patient's
social security number: 078-05-1120
in the enrollment form."

Base score: 0.80

Context boost: +0.15 ("social security number")

Final score: 0.95 ✓ FLAGGED

Example B: Penalized

"Use product code 078-05-1120
when referencing this item
in the sample catalog."

Base score: 0.80

Context penalty: −0.30 ("product code", "sample")

Final score: 0.50 ✗ SKIPPED

BOOST KEYWORDS

ssn, social security, taxpayer id, national id, patient id, member id, account number, card number, date of birth, dob

PENALTY KEYWORDS

sample, test, example, product code, order number, placeholder, dummy, lorem, reference, demo
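The arithmetic in Examples A and B can be sketched in a few lines of Python. This is an illustrative sketch rather than slim.io's implementation: the function name `context_score`, the flat +0.15 and -0.30 adjustments (taken from the two examples), whole-phrase matching, and clamping to the 0.0 to 1.0 range are all assumptions.

```python
import re

# Keyword lists copied from this page. How individual keywords are weighted
# is not documented, so one flat boost and one flat penalty are assumed.
BOOST_KEYWORDS = ("ssn", "social security", "taxpayer id", "national id",
                  "patient id", "member id", "account number", "card number",
                  "date of birth", "dob")
PENALTY_KEYWORDS = ("sample", "test", "example", "product code", "order number",
                    "placeholder", "dummy", "lorem", "reference", "demo")
BOOST = 0.15      # assumed flat boost, as in Example A
PENALTY = 0.30    # assumed flat penalty, as in Example B
THRESHOLD = 0.6


def _mentions(text: str, keywords) -> bool:
    """Return True if any keyword occurs as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords)


def context_score(base: float, context: str) -> float:
    """Apply the assumed boost and penalty to a base confidence score."""
    text = context.lower()
    score = base
    if _mentions(text, BOOST_KEYWORDS):
        score += BOOST
    if _mentions(text, PENALTY_KEYWORDS):
        score -= PENALTY
    # Clamp to the documented score range and round away float noise.
    return round(min(1.0, max(0.0, score)), 2)


example_a = ("Please update the patient's social security number: "
             "078-05-1120 in the enrollment form.")
example_b = ("Use product code 078-05-1120 when referencing this item "
             "in the sample catalog.")

print(context_score(0.80, example_a))  # 0.95, at or above the threshold
print(context_score(0.80, example_b))  # 0.5, below the threshold
```

Both calls start from the same SSN base score of 0.80; only the surrounding text decides which side of 0.6 the detection lands on.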

Step 3

Format validation

For entity types where the data format has mathematically verifiable properties, slim.io runs deterministic validators in addition to pattern matching. These validators confirm structural integrity and substantially reduce false positives from random digit sequences.

Luhn algorithm

Credit Cards & SINs

Validates the trailing check digit against a weighted sum of the preceding digits. Because only about one in ten random digit strings carries a valid check digit, this eliminates roughly 90% of false positives from random number sequences that happen to match a card number pattern.
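For reference, the Luhn check itself is short. The following is a minimal sketch of the standard algorithm; the function name `luhn_valid` and the decision to ignore spaces and hyphens are assumptions, not slim.io behavior.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (common test card number)
print(luhn_valid("4111 1111 1111 1112"))  # False (check digit altered)
print(luhn_valid("046 454 286"))          # True (9-digit SIN-style sample)
```

The same function covers both 16-digit card numbers and 9-digit Canadian SINs, since the check depends only on digit positions counted from the right.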

BIN range matching

Credit Cards

Cross-references the Bank Identification Number (first 6 digits) against known issuer ranges. For example, Visa numbers start with 4, Mastercard with 51–55, Amex with 34 or 37, and Discover with 6011 or 65.
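A prefix lookup along these lines could look as follows. This sketch covers only the prefixes named above; the function name `card_network` is hypothetical, and a production table would contain many more issuer ranges.

```python
from typing import Optional


def card_network(number: str) -> Optional[str]:
    """Map a card number to a network using only the prefixes listed above."""
    digits = number.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit()):
        return None
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    if digits[:2] in {"34", "37"}:
        return "Amex"
    if digits.startswith("6011") or digits.startswith("65"):
        return "Discover"
    return None  # prefix not in this (deliberately small) table


print(card_network("4111 1111 1111 1111"))  # Visa
print(card_network("3782 822463 10005"))    # Amex
print(card_network("1234 5678 9012 3456"))  # None
```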

IBAN check digits

IBANs

Validates the two-digit check code using mod-97 arithmetic as specified in ISO 13616. Country-specific length rules are also enforced: a DE IBAN is always 22 characters, a GB IBAN 22, and so on.
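The mod-97 check can be sketched as below. The function name `iban_valid` is hypothetical, and the length table holds only the two countries mentioned above; a real validator would carry the full country list.

```python
# Country lengths from the text above only; other countries are not
# length-checked in this sketch.
IBAN_LENGTHS = {"DE": 22, "GB": 22}


def iban_valid(iban: str) -> bool:
    """Return True if the IBAN passes the length and mod-97 checks."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    expected = IBAN_LENGTHS.get(s[:2])
    if expected is not None and len(s) != expected:
        return False
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and take the remainder mod 97.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(iban_valid("DE89 3704 0044 0532 0130 00"))  # True
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_valid("DE89 3704 0044 0532 0130 01"))  # False (digit altered)
```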

SIN Luhn

Canadian SINs

Canadian Social Insurance Numbers pass through the same Luhn check digit validation applied to credit cards, providing format-level verification independent of pattern matching alone.

Outcome

What happens at the threshold

Once a final score is computed, the outcome is binary: the entity is either flagged and routed through your policy, or it is silently dropped with no side effects.

✓ Flagged (score ≥ 0.6)

Entity is identified and its position is recorded
Configured policy rules are evaluated against the entity type
Action fires: mask, hash, tokenize, redact, or block
Detection is written to the audit log

✗ Skipped (score < 0.6)

Entity is discarded with no action taken
No policy evaluation occurs
No log entry is written
Original data passes through unchanged
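The binary outcome can be illustrated with a small routing function. This is a sketch under stated assumptions: the `Detection` record, the `route` function, and the log line format are hypothetical and only mirror the behavior listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

THRESHOLD = 0.6


@dataclass
class Detection:
    entity_type: str
    start: int      # position of the match in the scanned text
    end: int
    score: float    # final score after context scoring


def route(detections: List[Detection],
          threshold: float = THRESHOLD) -> Tuple[List[Detection], List[str]]:
    """Split detections at the threshold; only flagged ones are logged."""
    flagged = [d for d in detections if d.score >= threshold]
    # Skipped detections produce no log entry and no other side effect.
    audit_log = [
        f"{d.entity_type} at {d.start}-{d.end} score={d.score:.2f}"
        for d in flagged
    ]
    return flagged, audit_log


flagged, log = route([
    Detection("SSN", 52, 63, 0.95),
    Detection("SSN", 17, 28, 0.50),
    Detection("Phone", 80, 92, 0.60),
])
print(len(flagged))  # 2 (a score of exactly 0.6 is flagged)
print(log[0])        # SSN at 52-63 score=0.95
```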

Optional · Step 4

LLM Assist review

When enabled, slim.io runs a second pass after pattern-based scoring. An LLM reviews each flagged entity in context and returns a verdict of true_positive or false_positive. False positives are removed before any action fires. LLM Assist offers four independent scopes, each opt-in and each targeting a distinct class of hard-to-detect PII.

How it works

1. Pattern matching produces a scored candidate set. All entities at or above the 0.6 threshold are queued for review by any enabled scopes.

2. Each entity value is replaced with a type-tagged placeholder before the LLM sees it. The model receives surrounding context only, never the raw sensitive value.

3. The LLM returns a verdict for each entity. Detections marked false_positive are dropped. The remaining set proceeds to policy and action.

4. Fail-open: if the LLM endpoint errors or times out, all findings are preserved and forwarded unchanged.
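The review flow above can be sketched as follows. Everything named here is an assumption for illustration: the finding shape (`type`, `start`, `end`), the helpers `masked_context` and `review`, and the `ask_llm` callable standing in for whatever client talks to the model. The ±120-character window is the default context size mentioned in the scopes table.

```python
from typing import Any, Callable, Dict, List

Finding = Dict[str, Any]  # hypothetical shape: {"type", "start", "end"}


def masked_context(text: str, start: int, end: int, entity_type: str,
                   window: int = 120) -> str:
    """Replace the matched value with a type tag and keep nearby context."""
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    return f"{left}[{entity_type}]{right}"


def review(text: str, findings: List[Finding],
           ask_llm: Callable[[str], str]) -> List[Finding]:
    """Drop findings judged false_positive; fail open on any error."""
    try:
        verdicts = [
            ask_llm(masked_context(text, f["start"], f["end"], f["type"]))
            for f in findings
        ]
    except Exception:
        # Fail-open: an error or timeout preserves every finding unchanged.
        return list(findings)
    return [f for f, v in zip(findings, verdicts) if v != "false_positive"]


text = "Use product code 078-05-1120 in the sample catalog."
start = text.index("078-05-1120")
finding = {"type": "SSN", "start": start, "end": start + len("078-05-1120")}

print(masked_context(text, finding["start"], finding["end"], "SSN"))
# Use product code [SSN] in the sample catalog.
```

Note that the raw value never appears in the string passed to `ask_llm`; only the `[SSN]` tag and its surroundings do.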

Configuration

# enable per scan config
llm_assist_enabled: true

# scopes (pipe-separated, opt-in)
scope: ENTITY_RESOLUTION | MULTILANG | CODE_CONTEXT
tokenization_mode: type_tagged
batch_size: 50
timeout_per_finding_s: 5

# optional: bring your own LLM
byollm_endpoint: https://your-llm.internal

Privacy guarantee: Raw values are never sent to an LLM. The review operates on [SSN]-style tags and surrounding context only.

Four scopes

Scope	What it resolves	Default
ENTITY_RESOLUTION	Cross-record deduplication. Matches the same person appearing under different spellings or partial identifiers across records. Uses American Soundex + email-domain + SSN-area + CC-BIN blocking keys.	ON
MULTILANG	PII in non-English text. lingua-language-detector identifies the language first (75+ languages supported); the LLM then reviews entities in language-appropriate context.	ON
CODE_CONTEXT	Sensitive values in source code, config files, and infrastructure-as-code. Uses a ±2,000-char context window (vs the ±120-char default) and a code-aware system prompt.	ON
BORDERLINE	Edge-case detections that scored between 0.6 and 0.75 — above threshold but below high-confidence. The LLM makes the final call on whether to flag or drop.	OFF
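The ENTITY_RESOLUTION row names American Soundex as one of its blocking keys. The sketch below implements standard American Soundex and groups names by code so that only names in the same block need pairwise comparison. How slim.io combines this key with the email-domain, SSN-area, and CC-BIN keys is not described here, so `block_by_soundex` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Standard American Soundex letter groups.
_CODES = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = []
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != previous:
            encoded.append(code)
        previous = code  # vowels reset this, so repeats after a vowel count
    return (letters[0] + "".join(encoded) + "000")[:4]


def block_by_soundex(names: List[str]) -> Dict[str, List[str]]:
    """Group names that share a Soundex code into candidate blocks."""
    blocks = defaultdict(list)
    for n in names:
        blocks[soundex(n)].append(n)
    return dict(blocks)


print(block_by_soundex(["Robert", "Rupert", "Smith"]))
# {'R163': ['Robert', 'Rupert'], 'S530': ['Smith']}
```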

Bring Your Own LLM (BYOLLM)

Point slim.io at any OpenAI-compatible endpoint — a self-hosted model, a regional deployment, or a private cloud instance. Set byollm_endpoint in the scan config. slim.io falls back to its default model when the custom endpoint is unreachable.

Opt-in: LLM Assist is disabled by default. Enable it per scan config. Each scope is also individually opt-in; turn on only the ones that match your data profile.