Data Pipeline Overview

  1. Provider observation
  2. Validation
  3. Normalization
  4. Entity resolution
  5. Deduplication
  6. Scoring — importance & confidence
  7. Clustering
  8. Emerging narratives
  9. Products — reports, research, watchlists, alerts

Every source flows through these stages, in order, from a raw observation to published intelligence.

Every NataPulse source has domain-specific logic, but all sources share one common pipeline contract:

Provider
→ raw observation
→ validation
→ normalization
→ entity resolution
→ deduplication
→ source and quality assessment
→ importance and confidence
→ publication gate
→ event
→ cluster
→ emerging narrative
→ report, research, alert, or cited answer
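
As a rough illustration of the contract above, the chain can be modeled as an ordered list of stage functions, where any stage may drop an item. This is a minimal sketch, not the NataPulse implementation: the stage bodies, the `source_id`, `title`, and `confidence` fields, and the 0.5 threshold are all assumptions, and only three of the stages are shown.

```python
from typing import Callable, Optional

# A stage takes a record and returns it (possibly changed), or None to drop it.
Stage = Callable[[dict], Optional[dict]]


def validation(item: dict) -> Optional[dict]:
    # Hypothetical rule: an observation needs a source id and a title.
    return item if item.get("source_id") and item.get("title") else None


def normalization(item: dict) -> Optional[dict]:
    # Return a cleaned copy instead of mutating the raw observation.
    return {**item, "title": item["title"].strip()}


def publication_gate(item: dict) -> Optional[dict]:
    # Hypothetical threshold; the real gate criteria are not described here.
    return item if item.get("confidence", 0.0) >= 0.5 else None


# Order matters: each stage relies on the output of the previous one.
STAGES: list[Stage] = [validation, normalization, publication_gate]


def run_pipeline(item: dict, stages: list[Stage] = STAGES):
    """Return (item, None) on success or (None, stage_name) when dropped."""
    for stage in stages:
        result = stage(item)
        if result is None:
            return None, stage.__name__
        item = result
    return item, None
```

The remaining stages (entity resolution, deduplication, scoring, and so on) would slot into `STAGES` in the same order as the chain above.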

Scheduled or event-driven workers request data from enabled providers. Each run records operational outcomes such as items read, inserted, skipped, or rejected. Provider budgets, rate limits, credentials, and availability can affect coverage.
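
The per-run outcomes mentioned above could be tracked with a small counter record. The sketch below assumes a hypothetical `RunOutcome` type and an `ingest` helper; the rules for rejecting and skipping items are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    """Operational counters for one provider run (hypothetical shape)."""
    provider: str
    read: int = 0
    inserted: int = 0
    skipped: int = 0
    rejected: int = 0


def ingest(provider: str, items: list, seen_ids: set) -> RunOutcome:
    outcome = RunOutcome(provider)
    for item in items:
        outcome.read += 1
        source_id = item.get("id")
        if not source_id:
            # Assumption: items without a source identifier are rejected.
            outcome.rejected += 1
        elif source_id in seen_ids:
            # Already ingested in this or an earlier run.
            outcome.skipped += 1
        else:
            seen_ids.add(source_id)
            outcome.inserted += 1
    return outcome
```

With this shape, `read` always equals `inserted + skipped + rejected`, which gives operators a quick consistency check on each run.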

The raw layer preserves enough provenance to reproduce how a published event originated. It is not exposed directly as the public product model.

Derivation transforms source-specific material into consistent financial intelligence:

  • clean and map fields;
  • standardize timestamps and identifiers;
  • resolve entities;
  • remove duplicate source items;
  • enrich structured domains;
  • estimate source reliability and data quality;
  • classify relevance and materiality;
  • create or update events;
  • attach events to clusters;
  • derive narrative trends.
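
The first two steps in the list above (mapping fields, standardizing timestamps and identifiers) might look like the following. This is a sketch under stated assumptions: `FIELD_MAP`, the field names, and the rule that naive timestamps are treated as UTC are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical mapping from one provider's field names to common names.
FIELD_MAP = {"headline": "title", "pub_date": "published_at", "symbol": "ticker"}


def normalize_timestamp(value: str) -> str:
    """Convert an ISO 8601 string to UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def normalize(raw: dict) -> dict:
    # Rename known fields; pass unknown fields through unchanged.
    item = {FIELD_MAP.get(key, key): value for key, value in raw.items()}
    if "published_at" in item:
        item["published_at"] = normalize_timestamp(item["published_at"])
    if "ticker" in item:
        item["ticker"] = item["ticker"].strip().upper()
    return item
```

For example, `normalize({"headline": "Q2 results", "pub_date": "2024-05-01T12:00:00+02:00", "symbol": " acme "})` yields a record with `published_at` set to `2024-05-01T10:00:00+00:00` and `ticker` set to `ACME`.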

Product pages do not query raw provider payloads or private processing tables. They use curated endpoints that explicitly whitelist fields, enforce workspace and permission boundaries, and expose only published records.
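
A curated endpoint of this kind can be sketched as a projection that copies only whitelisted fields from published records in the caller's workspace. The field names, the `status` value, and `workspace_id` below are assumptions for illustration.

```python
# Hypothetical whitelist: only these fields ever leave the service.
PUBLIC_FIELDS = ("id", "title", "published_at", "source")


def to_public(records: list, workspace_id: str) -> list:
    """Project published records in one workspace onto the public fields."""
    return [
        {key: record[key] for key in PUBLIC_FIELDS if key in record}
        for record in records
        if record.get("status") == "published"
        and record.get("workspace_id") == workspace_id
    ]
```

Because fields are copied by name from a whitelist, a new internal column stays private by default until someone deliberately adds it to `PUBLIC_FIELDS`.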

A pipeline run should be idempotent: running it again over the same input must not change the result. Stable source identifiers, content hashes, unique constraints, and stable cluster keys prevent a replay from creating duplicate records or artificial activity.
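
One way to get this replay safety is to combine a content hash with a unique constraint, so a repeated insert becomes a no-op. The sketch below uses an in-memory SQLite table; the table layout and the `insert_once` helper are hypothetical.

```python
import hashlib
import json
import sqlite3


def content_hash(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that equal payloads
    # always produce the same hash regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        " source_id TEXT NOT NULL,"
        " content_hash TEXT NOT NULL,"
        " UNIQUE (source_id, content_hash))"
    )
    return conn


def insert_once(conn: sqlite3.Connection, source_id: str, payload: dict) -> bool:
    """Insert the item unless an identical one exists; True if a row was added."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO items (source_id, content_hash) VALUES (?, ?)",
        (source_id, content_hash(payload)),
    )
    return cursor.rowcount == 1
```

Replaying the same observation calls `insert_once` again with the same identifier and hash, the constraint rejects the row, and the function returns `False` instead of creating a second record.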

A source can be delayed or unavailable without making every product page unusable. The interface can continue to show previously published evidence, use polling when live channels are unavailable, and display honest empty or error states.
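
The fallback behavior described above can be sketched as a function that reports an explicit state alongside the data. The state names (`live`, `empty`, `stale`, `error`) and the use of `ConnectionError` as the failure signal are assumptions, not the product's actual interface.

```python
from typing import Callable


def page_state(fetch_live: Callable[[], list], cached: list) -> dict:
    """Choose what a page shows when the live source may be unavailable."""
    try:
        events = fetch_live()
    except ConnectionError:
        if cached:
            # Keep showing previously published evidence, flagged as stale.
            return {"state": "stale", "events": cached}
        return {"state": "error", "events": []}
    if not events:
        # An honest empty state is distinct from an error.
        return {"state": "empty", "events": []}
    return {"state": "live", "events": events}
```

Returning the state explicitly lets the interface label stale or missing data instead of silently presenting it as current.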

Internal tracing helps operators diagnose a processing run. Public users receive source provenance and timestamps, but not internal trace data, secrets, raw credentials, or infrastructure information.