Data Pipeline Overview
- Provider observation
- Validation
- Normalization
- Entity resolution
- Deduplication
- Scoring — importance & confidence
- Clustering
- Emerging narratives
- Products — reports, research, watchlists, alerts
Every NataPulse source has domain-specific logic, but the product uses one common pipeline contract.
Provider → raw observation → validation → normalization → entity resolution → deduplication → source and quality assessment → importance and confidence → publication gate → event → cluster → emerging narrative → report, research, alert, or cited answer
Collection
Scheduled or event-driven workers request data from enabled providers. Each run records operational outcomes such as items read, inserted, skipped, or rejected. Provider budgets, rate limits, credentials, and availability can affect coverage.
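As an illustration of per-run outcome recording, the sketch below keeps one counter per outcome. The `RunOutcome` class, its field names, and the provider name are hypothetical and are not taken from NataPulse; the only idea borrowed from the text is that a run tracks items read, inserted, skipped, and rejected.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    """Counters for one collection run (hypothetical structure)."""

    provider: str
    read: int = 0
    inserted: int = 0
    skipped: int = 0
    rejected: int = 0

    def record(self, status: str) -> None:
        # Every handled item counts as read and lands in exactly one bucket.
        if status not in ("inserted", "skipped", "rejected"):
            raise ValueError(f"unknown status: {status}")
        self.read += 1
        setattr(self, status, getattr(self, status) + 1)

    def is_balanced(self) -> bool:
        # Sanity check: no item was read without a recorded outcome.
        return self.read == self.inserted + self.skipped + self.rejected


outcome = RunOutcome(provider="example-provider")
for status in ["inserted", "inserted", "skipped", "rejected"]:
    outcome.record(status)
```

Keeping the buckets mutually exclusive makes a simple balance check possible, which can help an operator notice items that were read but never accounted for.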
Raw observation
The raw layer preserves enough provenance to reproduce how a published event originated. It is not exposed directly as the public product model.
Derivation
Derivation transforms source-specific material into consistent financial intelligence:
- clean and map fields;
- standardize timestamps and identifiers;
- resolve entities;
- remove duplicate source items;
- enrich structured domains;
- estimate source reliability and data quality;
- classify relevance and materiality;
- create or update events;
- attach events to clusters;
- derive narrative trends.
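The first few steps in the list above can be pictured as small functions applied in a fixed order. This is a minimal sketch under stated assumptions: the field names (`headline`, `published_epoch`, `company`), the alias table, and the `derive` helper are all hypothetical, and a real entity resolver would be far richer than a dictionary lookup.

```python
from datetime import datetime, timezone
from typing import Callable, List

Record = dict
Step = Callable[[Record], Record]

# Hypothetical alias table used only for this illustration.
ENTITY_ALIASES = {"acme corp": "ACME", "acme corporation": "ACME"}


def clean_fields(rec: Record) -> Record:
    # Map a source-specific field onto a common name and trim whitespace.
    out = {k: v for k, v in rec.items() if k != "headline"}
    out["title"] = str(rec.get("headline", "")).strip()
    return out


def standardize_timestamp(rec: Record) -> Record:
    # Convert epoch seconds into an ISO 8601 string in UTC.
    ts = datetime.fromtimestamp(rec["published_epoch"], tz=timezone.utc)
    out = {k: v for k, v in rec.items() if k != "published_epoch"}
    out["published_at"] = ts.isoformat()
    return out


def resolve_entity(rec: Record) -> Record:
    # Unknown names resolve to None rather than to a guessed entity.
    key = rec["company"].strip().lower()
    return {**rec, "entity_id": ENTITY_ALIASES.get(key)}


def derive(rec: Record, steps: List[Step]) -> Record:
    # Apply each step in order; every step returns a new record.
    for step in steps:
        rec = step(rec)
    return rec


raw = {
    "headline": "  Acme Corp raises guidance ",
    "published_epoch": 0,
    "company": "Acme Corp",
}
event = derive(raw, [clean_fields, standardize_timestamp, resolve_entity])
```

Because each step returns a new record instead of mutating its input, the raw observation stays intact for provenance while the derived record moves forward.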
Curated read models
Product pages do not query raw provider payloads or private processing tables. They use curated endpoints that explicitly whitelist fields, enforce workspace and permission boundaries, and expose only published records.
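One way to picture a curated read model is a projection function that checks boundaries first and then copies only whitelisted fields. The field names, the `status` value, and the `to_public` function below are assumptions made for illustration; they do not describe the actual NataPulse endpoints.

```python
from typing import Optional

# Hypothetical whitelist: anything not named here never leaves the service.
PUBLIC_FIELDS = ("id", "title", "published_at", "source_name")


def to_public(record: dict, workspace_id: str) -> Optional[dict]:
    # Enforce the workspace boundary before looking at anything else.
    if record.get("workspace_id") != workspace_id:
        return None
    # Expose only published records.
    if record.get("status") != "published":
        return None
    # Copy whitelisted fields only; unknown fields are dropped by default.
    return {k: record[k] for k in PUBLIC_FIELDS if k in record}


internal = {
    "id": "evt-1",
    "title": "Example event",
    "published_at": "2024-01-01T00:00:00+00:00",
    "source_name": "example-provider",
    "workspace_id": "ws-1",
    "status": "published",
    "raw_payload": {"secret": "not for users"},
    "trace_id": "abc123",
}
public = to_public(internal, "ws-1")
other_workspace = to_public(internal, "ws-2")
draft = to_public({**internal, "status": "draft"}, "ws-1")
```

A whitelist fails closed: a field added to the internal record later stays private until someone deliberately adds it to `PUBLIC_FIELDS`.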
Idempotency
A pipeline should be safe to run again. Stable source identifiers, content hashes, unique constraints, and stable cluster keys prevent replay from creating artificial activity.
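The combination of a stable source identifier, a content hash, and a unique constraint can be sketched with the standard-library `sqlite3` module. The table layout, column names, and the `ingest` function are hypothetical; the point is only that replaying the same item leaves the table unchanged.

```python
import hashlib
import json
import sqlite3


def content_hash(payload: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        provider TEXT NOT NULL,
        source_id TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        UNIQUE (provider, source_id, content_hash)
    )
    """
)


def ingest(provider: str, source_id: str, payload: dict) -> bool:
    """Return True if a row was inserted, False if this was a replay."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO observations VALUES (?, ?, ?)",
        (provider, source_id, content_hash(payload)),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint caused the insert to be ignored.
    return cur.rowcount == 1


first = ingest("example-provider", "item-1", {"title": "A", "value": 1})
# Same item again, with keys in a different order: treated as a replay.
replay = ingest("example-provider", "item-1", {"value": 1, "title": "A"})
count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
```

Letting the database enforce uniqueness means two workers that race on the same item still produce a single row, which a check-then-insert in application code would not guarantee.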
Degraded operation
A source can be delayed or unavailable without making every product page unusable. The interface can continue to show previously published evidence, use polling when live channels are unavailable, and display honest empty or error states.
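A minimal sketch of this fallback behaviour follows, assuming a hypothetical `load_events` function and made-up state names (`live`, `stale`, `empty`, `error`). It covers only the choice between fresh data, previously published evidence, and an honest empty or error state; it does not model live channels or polling intervals.

```python
from typing import Callable


def load_events(fetch_live: Callable[[], list], cached: list) -> dict:
    try:
        events = fetch_live()
    except ConnectionError:
        # Source unavailable: fall back to previously published evidence.
        if cached:
            return {"state": "stale", "events": cached}
        # Nothing to fall back on: report the failure instead of hiding it.
        return {"state": "error", "events": []}
    # The source answered; an empty result is reported as empty, not as an error.
    return {"state": "live" if events else "empty", "events": events}


def unavailable() -> list:
    raise ConnectionError("provider down")


result_live = load_events(lambda: [{"id": 2}], [{"id": 1}])
result_stale = load_events(unavailable, [{"id": 1}])
result_error = load_events(unavailable, [])
result_empty = load_events(lambda: [], [{"id": 1}])
```

Returning an explicit state alongside the data lets the interface label stale evidence as such rather than presenting it as current.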
Traceability without leakage
Internal tracing helps operators diagnose a processing run. Public users receive source provenance and timestamps, but not internal trace data, secrets, raw credentials, or infrastructure information.