Data Normalization
Normalization makes different source domains comparable without erasing their provenance.
Why normalization is required
A social post, filing, market candle, and transaction use different field names, identifiers, and structures. Without a common model, the system cannot reliably cluster, score, search, cite, or reason across them.
Common normalized fields
Depending on availability, a normalized observation can contain:
- source family and provider;
- external source identifier;
- canonical URL;
- title and cleaned body;
- source occurrence time and ingestion time;
- entity type and entity key;
- source metadata;
- structured domain payload;
- content fingerprint;
- provenance references.
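As a rough illustration, the fields listed above could be carried by a single record type. The sketch below is an assumption, not the actual NataPulse schema: the class name `NormalizedObservation`, every field name, and the SHA-256 fingerprint scheme are hypothetical.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass(frozen=True)
class NormalizedObservation:
    """Hypothetical record shape; all names here are illustrative."""

    source_family: str
    provider: str
    external_id: str
    ingested_at: datetime
    # Everything below is optional because availability varies by source.
    canonical_url: Optional[str] = None
    title: Optional[str] = None
    body: Optional[str] = None
    occurred_at: Optional[datetime] = None
    entity_type: Optional[str] = None
    entity_key: Optional[str] = None
    source_metadata: dict[str, Any] = field(default_factory=dict)
    payload: dict[str, Any] = field(default_factory=dict)
    provenance_refs: tuple[str, ...] = ()

    @property
    def fingerprint(self) -> str:
        # Assumed scheme: SHA-256 over the lowercased, whitespace-collapsed
        # title and body, so trivial formatting differences hash the same.
        text = " ".join(f"{self.title or ''} {self.body or ''}".lower().split())
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Optional fields default to `None` rather than to an empty string, so a consumer can tell "not provided" apart from "provided but empty".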
Text handling
The pipeline decodes entities, strips unsafe markup, preserves readable paragraph boundaries, and suppresses raw structured blobs when a dedicated field presentation exists.
Original provenance remains available internally even when the user-facing text is cleaned.
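The cleaning steps described above can be sketched with the Python standard library. This is a simplified illustration, not the pipeline's implementation: the function name `clean_text` and the regular expressions are assumptions, and regex-based stripping is not a substitute for a real HTML sanitizer.

```python
import html
import re

# Assumed patterns; a production pipeline would likely use an HTML parser.
_UNSAFE_BLOCKS = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
_PARAGRAPH_BREAKS = re.compile(r"(?i)<\s*br\s*/?\s*>|</\s*(p|div|li|h[1-6])\s*>")
_TAGS = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Return readable plain text with paragraph boundaries kept."""
    text = _UNSAFE_BLOCKS.sub("", raw)          # drop script/style with their contents
    text = _PARAGRAPH_BREAKS.sub("\n\n", text)  # turn block ends into blank lines
    text = _TAGS.sub("", text)                  # remove any remaining tags
    text = html.unescape(text)                  # decode entities such as &amp;
    # Collapse whitespace inside each paragraph and drop empty paragraphs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```

In this sketch entities are decoded after tags are removed, so escaped markup such as `&lt;b&gt;` survives as literal text. The result is plain text and would still need escaping wherever it is rendered as HTML.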
Time handling
Source timestamps are standardized so events can be ordered and compared. NataPulse distinguishes when an event occurred from when it was collected or generated.
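One way to standardize timestamps is to convert every source value to UTC and record the collection time separately. The sketch below assumes ISO 8601 input; the helper name `to_utc` and the decision to reject timestamps without a UTC offset are assumptions, not documented NataPulse behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp and convert it to UTC; None if unusable."""
    try:
        text = value.strip()
        if text.endswith(("Z", "z")):
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
    except (ValueError, AttributeError):
        return None
    if parsed.tzinfo is None:
        return None  # no offset given, so do not guess one
    return parsed.astimezone(timezone.utc)


# Two separate clocks: when the event happened and when it was collected.
occurred_at = to_utc("2024-03-01T12:00:00+02:00")
ingested_at = datetime.now(timezone.utc)
```

Keeping `occurred_at` and `ingested_at` as distinct values lets events be ordered by when they happened even if they were collected late.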
Domain-specific preservation
A common event model does not flatten every source into plain text. Important domain fields remain structured:
- filing form and issuer;
- market OHLCV;
- transaction hash and addresses;
- quantitative timeframe, strength, volatility, anomaly, and quality.
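To illustrate how domain fields can stay structured inside a common model, the sketch below uses one typed payload per domain. All class and field names are hypothetical, and the quantitative payload is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FilingPayload:
    form: str
    issuer: str


@dataclass(frozen=True)
class CandlePayload:
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass(frozen=True)
class TransactionPayload:
    tx_hash: str
    from_address: str
    to_address: str


# A normalized observation would carry exactly one of these payloads.
DomainPayload = Union[FilingPayload, CandlePayload, TransactionPayload]


def summarize(payload: DomainPayload) -> str:
    """Render a one-line summary; the structured fields stay on the payload."""
    if isinstance(payload, FilingPayload):
        return f"{payload.form} filed by {payload.issuer}"
    if isinstance(payload, CandlePayload):
        return (
            f"O {payload.open} H {payload.high} L {payload.low} "
            f"C {payload.close} V {payload.volume}"
        )
    return f"tx {payload.tx_hash}: {payload.from_address} -> {payload.to_address}"
```

The text summary is derived from the payload on demand, so downstream scoring or search can still read `high` or `tx_hash` as typed values.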
Validation
Normalization rejects or marks malformed values rather than inventing replacements. Missing data should remain missing. Product pages must show an honest empty state or omit an unsupported field.
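A minimal sketch of this rule for a numeric field: malformed input becomes `None` instead of a default such as `0`. The function name `parse_price` and the non-negative constraint are assumptions chosen for illustration.

```python
import math
from typing import Any, Optional


def parse_price(raw: Any) -> Optional[float]:
    """Return a finite, non-negative float, or None when the value is unusable."""
    if raw is None or isinstance(raw, bool):
        return None  # booleans would otherwise convert to 0.0 or 1.0
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value) or value < 0:
        return None  # reject NaN, infinities, and negative prices
    return value
```

A real zero is returned as `0.0`, which keeps it distinct from a missing value, so a page can show an empty state only when the data is actually absent.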
Public boundary
The normalized internal record may contain operational metadata that is not public. Product serializers explicitly select safe fields rather than exposing a raw database row.
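Explicit field selection can be sketched as an allowlist applied at serialization time. The field names in `PUBLIC_FIELDS` and the function `to_public` are hypothetical; the actual set of public fields is not specified here.

```python
from __future__ import annotations

from typing import Any, Mapping

# Hypothetical allowlist; anything not named here is never serialized.
PUBLIC_FIELDS = ("title", "body", "canonical_url", "occurred_at", "source_family")


def to_public(record: Mapping[str, Any]) -> dict[str, Any]:
    """Copy only allowlisted fields that are present and not None."""
    return {
        name: record[name]
        for name in PUBLIC_FIELDS
        if record.get(name) is not None
    }
```

Because the serializer names what it includes, a column added to the internal record later stays private until someone adds it to the allowlist.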