Data Normalization

Normalization makes different source domains comparable without erasing their provenance.

A social post, a filing, a market candle, and a transaction each use different field names, identifiers, and structures. Without a common model, the system cannot reliably cluster, score, search, cite, or reason across them.

Depending on availability, a normalized observation can contain:

  • source family and provider;
  • external source identifier;
  • canonical URL;
  • title and cleaned body;
  • source occurrence time and ingestion time;
  • entity type and entity key;
  • source metadata;
  • structured domain payload;
  • content fingerprint;
  • provenance references.
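
As a rough illustration of the field list above, a normalized observation could be modeled as a record in which every field that depends on availability is optional. The class name, field names, and the fingerprint recipe below are assumptions made for this sketch, not NataPulse's actual schema.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass(frozen=True)
class NormalizedObservation:
    """Hypothetical record shape; all names here are illustrative."""
    source_family: str
    provider: str
    external_id: Optional[str] = None
    canonical_url: Optional[str] = None
    title: Optional[str] = None
    body: Optional[str] = None            # cleaned text
    occurred_at: Optional[datetime] = None
    ingested_at: Optional[datetime] = None
    entity_type: Optional[str] = None
    entity_key: Optional[str] = None
    source_metadata: dict[str, Any] = field(default_factory=dict)
    payload: dict[str, Any] = field(default_factory=dict)
    fingerprint: Optional[str] = None
    provenance_refs: tuple[str, ...] = ()


def content_fingerprint(title: Optional[str], body: Optional[str]) -> str:
    """Assumed recipe: SHA-256 over lowercased, whitespace-collapsed text."""
    text = " ".join(f"{title or ''} {body or ''}".split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


obs = NormalizedObservation(
    source_family="filing",
    provider="example-provider",
    title="Annual report",
    fingerprint=content_fingerprint("Annual report", None),
)
```

Fields that the source did not provide stay `None` instead of receiving a placeholder value.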

The pipeline decodes character entities, strips unsafe markup, preserves readable paragraph boundaries, and suppresses raw structured blobs when a dedicated field presentation exists.

Original provenance remains available internally even when the user-facing text is cleaned.
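
The text-cleaning steps described above can be sketched with the Python standard library. This is a minimal illustration that assumes the input is HTML-like text. The tag sets and the `clean_body` name are hypothetical, and a production pipeline would likely rely on a vetted sanitizer.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch only.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3"}
DROPPED_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")   # paragraph boundary
        elif tag == "br":
            self.parts.append("\n")     # soft break, later collapsed to a space

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if self._skip_depth == 0:       # ignore script and style content
            self.parts.append(data)


def clean_body(raw: str) -> str:
    """Return readable text; the raw input is left untouched for provenance."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = "".join(parser.parts)
    paragraphs = [re.sub(r"\s+", " ", p).strip()
                  for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)


raw = "<p>Tom &amp; Jerry</p><script>alert(1)</script><p>Second   line</p>"
cleaned = clean_body(raw)
```

Because `clean_body` returns a new string, the caller can store `raw` internally and show only `cleaned` to users.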

Source timestamps are standardized so events can be ordered and compared. NataPulse distinguishes when an event occurred from when it was collected or generated.
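
One way to standardize timestamps is to convert every source time to UTC and store it separately from the ingestion time. The sketch below assumes ISO 8601 input and treats timestamps without a UTC offset as unusable rather than guessing a zone. The function and variable names are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 string to an aware UTC datetime, or None."""
    if not isinstance(value, str):
        return None
    if value.endswith("Z"):
        # Older Python versions do not accept a trailing "Z".
        value = value[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        return None  # assumption: never guess a missing offset
    return parsed.astimezone(timezone.utc)


# Occurrence time comes from the source; ingestion time comes from the collector.
occurred_at = to_utc("2024-03-01T14:00:00+02:00")
ingested_at = datetime.now(timezone.utc)
```

Keeping both values makes it possible to order events by when they happened while still measuring collection delay.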

A common event model does not flatten every source into plain text. Important domain fields remain structured:

  • filing form and issuer;
  • market OHLCV;
  • transaction hash and addresses;
  • quantitative timeframe, strength, volatility, anomaly, and quality.
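
To illustrate how domain fields can stay structured inside a common model, the sketch below uses one small typed payload per source family. The class and field names are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FilingPayload:
    form: str
    issuer: str


@dataclass(frozen=True)
class CandlePayload:
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass(frozen=True)
class TransactionPayload:
    tx_hash: str
    from_address: Optional[str] = None
    to_address: Optional[str] = None


DomainPayload = Union[FilingPayload, CandlePayload, TransactionPayload]


def summarize(payload: DomainPayload) -> str:
    """Render a short label from structured fields, not from free text."""
    if isinstance(payload, FilingPayload):
        return f"{payload.form} filed by {payload.issuer}"
    if isinstance(payload, CandlePayload):
        return f"O {payload.open} H {payload.high} L {payload.low} C {payload.close}"
    return f"tx {payload.tx_hash}"


label = summarize(FilingPayload(form="10-K", issuer="Example Corp"))
```

Downstream code can then filter or score on `payload.form` or `payload.close` directly instead of re-parsing prose.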

Normalization rejects or marks malformed values rather than inventing replacements. Missing data should remain missing. Product pages must show an honest empty state or omit an unsupported field.
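
The reject-or-mark rule can be sketched as a validator that returns `None` for unusable values and records an issue, instead of substituting a default. The field list, the non-negative constraint, and the issue format below are assumptions for illustration.

```python
import math
from typing import Any, Optional

CANDLE_FIELDS = ("open", "high", "low", "close", "volume")


def parse_non_negative(raw: Any) -> Optional[float]:
    """Return a finite, non-negative float, or None if the value is unusable."""
    if raw is None or isinstance(raw, bool):
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value) or value < 0:
        return None
    return value


def normalize_candle(raw: dict) -> tuple[dict, list[str]]:
    """Normalize OHLCV fields; missing stays missing, malformed is marked."""
    out: dict = {}
    issues: list[str] = []
    for name in CANDLE_FIELDS:
        if raw.get(name) is None:
            out[name] = None          # missing data remains missing
            continue
        value = parse_non_negative(raw[name])
        if value is None:
            issues.append(f"malformed:{name}")
        out[name] = value
    return out, issues


candle, issues = normalize_candle(
    {"open": "1.5", "high": "abc", "low": 1, "close": None}
)
```

A product page can use the `None` values to render an empty state, and the `issues` list stays internal for diagnostics.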

The normalized internal record may contain operational metadata that is not public. Product serializers explicitly select safe fields rather than exposing a raw database row.
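
A serializer that selects safe fields explicitly can be as small as an allowlist. The field names below are hypothetical; the point of the sketch is that anything not listed, including operational metadata, never reaches the public output.

```python
from typing import Any

# Assumed allowlist for this sketch; not the product's actual public schema.
PUBLIC_FIELDS = ("title", "body", "canonical_url", "occurred_at", "source_family")


def to_public(record: dict[str, Any]) -> dict[str, Any]:
    """Copy only allowlisted fields that have a value; omit everything else."""
    return {
        name: record[name]
        for name in PUBLIC_FIELDS
        if record.get(name) is not None
    }


internal_row = {
    "title": "Annual report",
    "body": None,                      # missing: omitted, not invented
    "source_family": "filing",
    "retry_count": 3,                  # operational metadata, internal only
    "collector_host": "worker-7",
}
public = to_public(internal_row)
```

An allowlist fails closed: a column added to the internal record later stays private until someone adds it to `PUBLIC_FIELDS`.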