Entity Resolution and Deduplication

Entity resolution and deduplication are separate but closely related controls.

Entity resolution maps the wording used in a source to a canonical subject, such as:

  • ticker;
  • company;
  • crypto asset;
  • protocol;
  • wallet;
  • sector;
  • person;
  • topic.

Resolution draws on explicit symbols, aliases, full names, source context, and curated registries. A bare word that can also be read as ordinary language is not automatically treated as a ticker, and a person's name is not tagged as a security without supporting context.
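
As an illustration only, the rule about bare words can be sketched in Python. The registries, the `Entity` type, and the `resolve_entities` function below are hypothetical names invented for this example, not NataPulse's actual implementation. The sketch assumes that an explicit `$` marker is what turns an ordinary-looking word into a ticker mention.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    kind: str          # e.g. "ticker", "company", "crypto_asset"
    canonical_id: str


# Hypothetical curated registries; real ones would be loaded from managed data.
TICKERS = {"AAPL": Entity("ticker", "AAPL"), "NOW": Entity("ticker", "NOW")}
ALIASES = {
    "apple inc": Entity("company", "AAPL"),
    "bitcoin": Entity("crypto_asset", "BTC"),
}
# Symbols that double as ordinary words: accepted only with an explicit "$".
ORDINARY_WORDS = {"NOW", "ALL", "IT"}

CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")
# Uppercase tokens that are not preceded by "$" or another word character.
BARE_UPPER = re.compile(r"(?<![$\w])[A-Z]{2,5}\b")


def resolve_entities(text: str) -> list[Entity]:
    found: list[Entity] = []

    def add(entity: Entity) -> None:
        if entity not in found:
            found.append(entity)

    # 1. Explicit symbols such as "$NOW" are the strongest signal.
    for symbol in CASHTAG.findall(text):
        if symbol in TICKERS:
            add(TICKERS[symbol])

    # 2. Bare uppercase tokens count only if they are not ordinary words.
    for token in BARE_UPPER.findall(text):
        if token in TICKERS and token not in ORDINARY_WORDS:
            add(TICKERS[token])

    # 3. Aliases and full names from the curated registry.
    lowered = text.lower()
    for alias, entity in ALIASES.items():
        if re.search(rf"\b{re.escape(alias)}\b", lowered):
            add(entity)

    return found
```

With these assumed registries, "Buy $NOW before earnings" resolves to the NOW ticker, while "BUY NOW, ends soon" resolves to nothing. Source context, which this sketch ignores, would be needed to handle harder cases.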

Better entity resolution improves search, clustering, watchlist matching, reports, and narrative detection.

Deduplication prevents the same observation from creating artificial event volume.

Deduplication signals can include:

  • provider source identifier;
  • raw-item identity;
  • canonical URL;
  • normalized content hash;
  • entity;
  • source;
  • publication time;
  • stable event or cluster key.

An exact duplicate is the same source item ingested again. A near duplicate can be a corrected headline, syndicated copy, repost, or provider variation.

Exact duplicates should be ignored idempotently: ingesting the same item again must leave stored state unchanged. Near duplicates may remain as separate source references when they add independent information, but they should not be counted as independent confirmation.
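
The distinction between exact and near duplicates can be sketched as follows. This is a minimal example under stated assumptions: the `Deduplicator` class, the `content_hash` function, and the normalization steps are hypothetical, and only two of the signals above are used (the provider source identifier and a normalized content hash). A hash match catches reposts and syndicated copies that differ only in case, punctuation, or spacing. It would not catch a corrected headline, which needs a fuzzier comparison.

```python
import hashlib
import re
import unicodedata


def content_hash(text: str) -> str:
    """Hash text after removing case, punctuation, and spacing differences."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = re.sub(r"[^\w\s]", "", normalized)   # drop punctuation
    normalized = " ".join(normalized.split())         # collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Hypothetical in-memory classifier; a real system would persist this state."""

    def __init__(self) -> None:
        self.seen_items: set[tuple[str, str]] = set()       # (provider, item id)
        self.seen_hashes: dict[str, tuple[str, str]] = {}   # hash -> first item

    def classify(self, provider: str, item_id: str, text: str) -> str:
        key = (provider, item_id)
        if key in self.seen_items:
            # Same source item ingested again: ignore, state is unchanged.
            return "exact_duplicate"
        self.seen_items.add(key)

        digest = content_hash(text)
        if digest in self.seen_hashes:
            # Different item, same normalized content: keep it as a source
            # reference, but do not count it as independent confirmation.
            return "near_duplicate"
        self.seen_hashes[digest] = key
        return "new"
```

Calling `classify` twice with the same provider and item identifier returns "new" and then "exact_duplicate". The second call changes nothing, which is the idempotent behavior described above.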

NataPulse can replay stored observations to re-run classification or clustering after logic improvements. Stable keys and database constraints allow replay without multiplying published events.
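
One way to picture replay without event multiplication is a stable key backed by a uniqueness constraint. The sketch below uses SQLite for illustration only. The table name, the key fields, and the `replay` function are assumptions made for this example and do not describe NataPulse's actual schema.

```python
import hashlib
import sqlite3

UPSERT = (
    "INSERT INTO published_events (event_key, label) VALUES (?, ?) "
    "ON CONFLICT(event_key) DO UPDATE SET label = excluded.label"
)


def stable_event_key(entity: str, source: str, published_at: str) -> str:
    """Derive the same key from the same observation on every run."""
    raw = "|".join((entity, source, published_at))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key is the database constraint that blocks duplicate events.
    conn.execute(
        "CREATE TABLE published_events ("
        " event_key TEXT PRIMARY KEY,"
        " label TEXT NOT NULL)"
    )
    return conn


def replay(conn: sqlite3.Connection, observations, classify) -> None:
    """Re-run classification; existing events are updated, never duplicated."""
    for obs in observations:
        key = stable_event_key(obs["entity"], obs["source"], obs["published_at"])
        conn.execute(UPSERT, (key, classify(obs)))
    conn.commit()


def event_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM published_events").fetchone()[0]
```

Replaying the same stored observations with an improved classifier updates each event's label in place, so the event count stays the same across runs.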

Entity resolution can fail when a symbol is ambiguous, a company has multiple securities, a token shares its name with another entity, a wallet is unlabeled, or a source lacks context. Deduplication can also be imperfect when publishers substantially rewrite the same underlying report.

These uncertainties affect confidence and should remain visible in the investigation.