Data Normalization

Normalization makes different source domains comparable without erasing their provenance.

Why normalization is required

A social post, a filing, a market candle, and a transaction each use different field names, identifiers, and structures. Without a common model, the system cannot reliably cluster, score, search, cite, or reason across them.

Common normalized fields

Depending on availability, a normalized observation can contain:

source family and provider;
external source identifier;
canonical URL;
title and cleaned body;
source occurrence time and ingestion time;
entity type and entity key;
source metadata;
structured domain payload;
content fingerprint;
provenance references.
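
The field list above can be pictured as a single record type. The following is a minimal sketch, not the actual NataPulse schema: the class name, every attribute name, and the SHA-256 fingerprint scheme are assumptions chosen for illustration. Optional fields default to None so that unavailable data stays unavailable.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional
import hashlib


@dataclass(frozen=True)
class NormalizedObservation:
    # Hypothetical field names mirroring the list above.
    source_family: str
    provider: str
    external_id: Optional[str] = None
    canonical_url: Optional[str] = None
    title: Optional[str] = None
    body: Optional[str] = None
    occurred_at: Optional[datetime] = None   # when the event happened at the source
    ingested_at: Optional[datetime] = None   # when the pipeline collected it
    entity_type: Optional[str] = None
    entity_key: Optional[str] = None
    source_metadata: dict[str, Any] = field(default_factory=dict)
    domain_payload: dict[str, Any] = field(default_factory=dict)
    content_fingerprint: Optional[str] = None
    provenance_refs: tuple[str, ...] = ()


def fingerprint(title: Optional[str], body: Optional[str]) -> str:
    """Hash lowercased, whitespace-collapsed text (one possible scheme, assumed here)."""
    text = " ".join(f"{title or ''} {body or ''}".lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With a scheme like this, two observations whose text differs only in case or spacing share a fingerprint, which is one way to detect duplicate content across providers.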

Text handling

The pipeline decodes character entities, strips unsafe markup, preserves readable paragraph boundaries, and suppresses raw structured blobs when a dedicated field already presents the same data.

Original provenance remains available internally even when the user-facing text is cleaned.
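
As a rough illustration of the cleaning steps described in this section, the sketch below uses Python's standard html.parser module. It assumes the input is HTML-like text; the class name, the tag sets, and the clean_text function are hypothetical, and suppression of structured blobs is not covered here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    # Assumed tag sets: block tags become paragraph breaks, skip tags are dropped.
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "blockquote"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Keep text only when it is outside script/style content.
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_text(raw: str) -> str:
    """Return plain text with entities decoded and paragraphs separated by blank lines."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = "".join(parser.parts)
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```

The cleaned string would feed the user-facing body, while the raw input would be stored separately so the original remains traceable.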

Time handling

Source timestamps are standardized so events can be ordered and compared. NataPulse distinguishes when an event occurred from when it was collected or generated.
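
One way to standardize timestamps is to convert every source value to a timezone-aware UTC datetime. The sketch below assumes UTC as the target and accepts ISO 8601 strings or Unix epoch seconds; the function name to_utc and the treatment of naive timestamps as UTC are assumptions, not documented NataPulse behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc(value: object) -> Optional[datetime]:
    """Parse an ISO 8601 string or Unix epoch seconds into an aware UTC datetime.

    Returns None for missing or unparseable input instead of guessing a time.
    """
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        try:
            return datetime.fromtimestamp(value, tz=timezone.utc)
        except (ValueError, OverflowError, OSError):
            return None
    if not isinstance(value, str):
        return None
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are read as UTC; a stricter pipeline might reject them.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

A collector could then record the occurrence time as to_utc(source_value) and the ingestion time as datetime.now(timezone.utc), storing both rather than overwriting one with the other.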

Domain-specific preservation

A common event model does not flatten every source into plain text. Important domain fields remain structured:

filing form and issuer;
market OHLCV;
transaction hash and addresses;
quantitative timeframe, strength, volatility, anomaly, and quality.
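
To make the idea of structured preservation concrete, the sketch below wraps typed domain payloads in a shared envelope instead of rendering them to text. The payload classes, their attribute names, and the to_event helper are all hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class CandlePayload:
    # Market OHLCV values stay numeric.
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass(frozen=True)
class FilingPayload:
    form: str
    issuer: str


@dataclass(frozen=True)
class TransactionPayload:
    tx_hash: str
    from_address: Optional[str]
    to_address: Optional[str]


Payload = Union[CandlePayload, FilingPayload, TransactionPayload]


def to_event(source_family: str, title: str, payload: Payload) -> dict:
    """Wrap a typed payload in a common envelope without flattening it to text."""
    return {
        "source_family": source_family,
        "title": title,
        "payload": asdict(payload),
    }
```

Because the payload keeps its own shape, downstream code can still compare a candle's high and low numerically or look up a transaction by hash, while clustering and search operate on the shared envelope fields.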

Validation

Normalization rejects or marks malformed values rather than inventing replacements. Missing data should remain missing. Product pages must either show an honest empty state or omit the unsupported field.
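
The rule above can be sketched for a single numeric field. The function name parse_price, the "malformed" label, and the choice to treat negative or non-finite numbers as malformed are assumptions made for this example.

```python
import math
from typing import Any, Optional


def parse_price(raw: Any) -> tuple[Optional[float], Optional[str]]:
    """Return (value, issue).

    Missing input stays missing with no issue; malformed input is flagged
    and never replaced with a default such as 0.
    """
    if raw is None or raw == "":
        return None, None
    if isinstance(raw, bool):
        return None, "malformed"
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None, "malformed"
    if not math.isfinite(value) or value < 0:
        return None, "malformed"
    return value, None
```

Returning the issue alongside the value lets a product page tell "no data" apart from "bad data" and choose an empty state in either case.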

Public boundary

The normalized internal record may contain operational metadata that is not public. Product serializers explicitly select safe fields rather than exposing a raw database row.
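
Explicit field selection is often implemented as an allowlist. The sketch below is one such approach; PUBLIC_FIELDS, to_public, and the sample keys are hypothetical and do not describe the actual serializers.

```python
from typing import Any, Mapping

# Assumed allowlist: only these keys may ever leave the internal record.
PUBLIC_FIELDS = ("title", "body", "canonical_url", "occurred_at", "source_family")


def to_public(record: Mapping[str, Any]) -> dict[str, Any]:
    """Copy allowlisted, non-missing fields; everything else is dropped by default."""
    return {
        name: record[name]
        for name in PUBLIC_FIELDS
        if record.get(name) is not None
    }
```

With an allowlist, a new internal column stays private until someone deliberately adds it to the list, whereas a blocklist would expose it until someone remembers to hide it.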