Entity Resolution and Deduplication

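Entity resolution and deduplication are separate but closely related controls. Entity resolution decides what an observation is about; deduplication decides whether that observation has already been recorded.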

Entity resolution

Entity resolution maps the wording used in a source to a canonical subject, such as:

ticker;
company;
crypto asset;
protocol;
wallet;
sector;
person;
topic.

Resolution draws on explicit symbols, aliases, full names, source context, and curated registries. A bare word that could also be ordinary language is not automatically treated as a ticker, and person names are not blindly tagged as securities.
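The rule about bare words can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Entity type, the tiny TICKERS and ALIASES registries, the ORDINARY_WORDS set, and the cashtag convention are hypothetical and are not NataPulse's actual data or logic.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    kind: str       # e.g. "ticker" or "company"
    canonical: str  # canonical identifier for the subject


# Hypothetical curated registries; a real registry would be far larger.
TICKERS = {"AAPL": Entity("ticker", "AAPL"), "ALL": Entity("ticker", "ALL")}
ALIASES = {"apple inc": Entity("company", "Apple Inc.")}
# Symbols that are also ordinary words and need an explicit marker.
ORDINARY_WORDS = {"ALL"}

CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")      # explicit symbol, e.g. $ALL
BARE_SYMBOL = re.compile(r"\b[A-Z]{1,5}\b")    # uppercase word with no marker


def resolve(text: str) -> List[Entity]:
    """Return the entities mentioned in text, in order of discovery."""
    found: List[Entity] = []

    def add(entity: Entity) -> None:
        if entity not in found:
            found.append(entity)

    # 1. Explicit symbols are trusted when the registry knows them.
    for symbol in CASHTAG.findall(text):
        if symbol in TICKERS:
            add(TICKERS[symbol])

    # 2. Bare symbols count only when they cannot be ordinary language.
    for symbol in BARE_SYMBOL.findall(text):
        if symbol in TICKERS and symbol not in ORDINARY_WORDS:
            add(TICKERS[symbol])

    # 3. Full names and aliases are matched case-insensitively.
    lowered = text.lower()
    for alias, entity in ALIASES.items():
        if alias in lowered:
            add(entity)

    return found
```

With these illustrative registries, resolve("ALL eyes on AAPL today") yields only the AAPL entity, while resolve("$ALL beat estimates") yields the ALL ticker because the symbol is explicit.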

Better entity resolution improves search, clustering, watchlist matching, reports, and narrative detection.

Deduplication

Deduplication prevents the same observation from creating artificial event volume.

Deduplication signals can include:

provider source identifier;
raw-item identity;
canonical URL;
normalized content hash;
entity;
source;
publication time;
stable event or cluster key.
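As a rough sketch of how a few of these signals might combine into one key, the example below prefers the provider source identifier and falls back to a canonical URL plus a normalized content hash. The function names, the key layout, and the normalization rules (dropping the query string and fragment, collapsing whitespace, lowercasing) are assumptions for illustration, not a description of the actual implementation.

```python
import hashlib
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Lowercase the host and drop query string, fragment, and trailing slash.

    Dropping the whole query string is an assumption; some publishers keep
    the article identifier there.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_key(provider: str, source_id: Optional[str], url: str, text: str) -> str:
    """Build a stable key: provider identifier first, URL plus hash otherwise."""
    if source_id:
        return f"{provider}:{source_id}"
    return f"{provider}:{canonical_url(url)}:{content_hash(text)[:16]}"
```

Two ingestions of the same item then produce the same key even when the URL carries different tracking parameters or the text differs only in spacing and case.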

Exact and near duplicates

An exact duplicate is the same source item ingested again. A near duplicate can be a corrected headline, syndicated copy, repost, or provider variation.

Exact duplicates should be ignored idempotently, so re-ingesting the same item changes nothing. Near duplicates may remain as separate source references when they add independent information, but they should not be counted as independent confirmation.
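One simple way to separate the two cases is sketched below. It treats matching item identifiers as an exact duplicate and uses Jaccard similarity over word sets to flag near duplicates. The similarity measure, the 0.7 threshold, and the function names are assumptions chosen for illustration; the source does not specify how near duplicates are detected.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify_pair(a_id: str, a_text: str, b_id: str, b_text: str,
                  threshold: float = 0.7) -> str:
    """Return "exact", "near", or "distinct" for two ingested items."""
    if a_id == b_id:
        return "exact"      # same source item ingested again
    if jaccard(a_text, b_text) >= threshold:
        return "near"       # corrected headline, syndicated copy, repost
    return "distinct"
```

Items classified as "near" could be kept as separate source references under one cluster, so that the cluster counts once when confirmation is tallied.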

Replay safety

NataPulse can replay stored observations to re-run classification or clustering after logic improvements. Stable keys and database constraints allow replay without multiplying published events.
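The interaction between stable keys and a database constraint can be sketched with SQLite from the Python standard library. The table name, columns, keys, and classifier functions below are hypothetical, and SQLite stands in for whichever database is actually used. The upsert syntax assumes SQLite 3.24 or later.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def replay(conn: sqlite3.Connection,
           observations: Iterable[Tuple[str, str]],
           classify: Callable[[str], str]) -> None:
    """Re-run classification; the primary key keeps one row per event key."""
    for event_key, text in observations:
        conn.execute(
            "INSERT INTO events (event_key, label) VALUES (?, ?) "
            "ON CONFLICT(event_key) DO UPDATE SET label = excluded.label",
            (event_key, classify(text)),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_key TEXT PRIMARY KEY, label TEXT NOT NULL)")

# Hypothetical stored observations; evt-1 was ingested twice.
observations = [
    ("evt-1", "Fed holds rates"),
    ("evt-2", "Oil jumps"),
    ("evt-1", "Fed holds rates"),
]

# First pass with an early classifier, then a replay with improved logic.
replay(conn, observations, lambda text: "unclassified")
replay(conn, observations, lambda text: "macro" if "Fed" in text else "commodities")

rows = conn.execute("SELECT event_key, label FROM events ORDER BY event_key").fetchall()
```

After both passes the table still holds two rows, one per stable key, and each row carries the label from the latest classifier.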

Limits

Entity resolution can fail when a symbol is ambiguous, a company has multiple securities, a token shares its name with another asset, a wallet is unlabeled, or a source lacks context. Deduplication can also miss duplicates when publishers substantially rewrite the same underlying report.

These uncertainties affect confidence and should remain visible in the investigation.