Evidence-backed forecasting
with a public track record.

Recent forecasts, resolved outcomes, scorecards, and methodology.

Probabilistic and timestamped
Every update cites its evidence
Outcomes scored against reality
Strengths and weaknesses both shown

Track record at a glance

We show both the full historical production record and a clearly labeled current-production cohort so you can see what the active forecasting loop is doing now. The continuous update loop launched recently and will get its own scorecard once the sample is mature.

Historical production — all versions
--
Resolved Questions
--
Avg Brier (First/Q)
--
vs. Market
--
Last Updated

Historical preserves the full public record across all production versions. Current production shows how the active loop is performing now. Both use first-per-question scoring.


Performance by domain

Domains are not equally predictable. We break out performance by domain so you can see where the system is strong and where it's still learning.

Loading domains…

Recent forecasts

Current questions with live probabilities and the evidence that would change our view. Click any card to see the full forecast run and agent reasoning.

Loading forecasts…

Resolved postmortems

We publish both hits and misses so the system can be evaluated honestly. In this first public version, each postmortem covers the initial forecast for its resolved question, matching the first-per-question scoring view. Click any card to see the underlying run detail.

Loading postmortems…

How the system works

01

Research

Multiple AI agents search the web, analyze data, and gather evidence on each question.

02

Forecast

Agents produce independent probability estimates. A critic challenges the reasoning before synthesis.

03

Update

When new evidence surfaces, forecasts are re-run. Every change is timestamped and explained.

04

Score

Resolved questions are scored with Brier scores and compared against market benchmarks.

05

Improve

Calibration data feeds back into the system. Known weak areas are documented and addressed.


How to read the numbers

What is a Brier score?

The Brier score measures how close probabilistic forecasts are to actual outcomes. It ranges from 0 (perfect) to 1 (worst possible). A score of 0.25 corresponds to always guessing 50% — the baseline for an uninformed forecaster. Lower is better.
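As a minimal sketch (not our production scoring code), the Brier score for a single binary forecast is just the squared error between the stated probability and the 0/1 outcome:

```python
def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (probability - outcome) ** 2

# A 70% forecast on an event that happened: (0.70 - 1)^2 = 0.09
print(brier_score(0.70, 1))  # 0.09 (to floating-point precision)

# Always guessing 50% scores 0.25 regardless of what happens:
print(brier_score(0.50, 1), brier_score(0.50, 0))  # 0.25 0.25
```

The 0.25 baseline quoted above falls out directly: an uninformed 50% guess is always 0.25 away from certainty, squared.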

What does “first-per-question” mean?

The first-per-question view scores only the initial forecast generated for each question. This gives the cleanest baseline: one score per question, no double-counting from repeated updates.
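A rough illustration of the selection rule, using hypothetical field names rather than our real schema: keep only the earliest forecast recorded for each question ID, so each question contributes exactly one score.

```python
def first_per_question(forecasts):
    """Keep the earliest (timestamp, probability) per question ID.

    `forecasts` is a list of (question_id, timestamp, probability) tuples;
    the field names are illustrative, not the production schema.
    """
    earliest = {}
    for qid, ts, prob in forecasts:
        if qid not in earliest or ts < earliest[qid][0]:
            earliest[qid] = (ts, prob)
    return {qid: prob for qid, (ts, prob) in earliest.items()}

runs = [
    ("q1", 1, 0.60),  # initial forecast -- this one is scored
    ("q1", 5, 0.80),  # later update -- excluded from this view
    ("q2", 2, 0.30),
]
print(first_per_question(runs))  # {'q1': 0.6, 'q2': 0.3}
```

Because updates are dropped before scoring, a question that was re-run ten times carries no more weight than one forecast once.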

Why is initial forecast quality the current headline?

The continuous update loop — where forecasts are re-run as new evidence surfaces — launched recently. Until the update loop has enough resolved questions to produce a meaningful sample, we headline initial forecast quality. Update-loop metrics will be published separately once the sample is mature.

Why are historical and current production shown separately?

The forecasting system evolves over time. Showing both cohorts lets you see the full public record (all production versions, preserving history) alongside how the current active production loop is performing right now. This avoids rewriting history when a new version deploys, while still making rollout-era results visible.

How are market comparisons defined?

Cenva forecasts on many questions that never appear on public prediction markets. Where a prediction market price does exist (e.g., from Polymarket), we use it as a calibration benchmark — comparing our initial forecast against the market price at the same point in time. The market’s Brier score is computed the same way ours is, using first-per-question methodology. Market comparisons are a tool for measuring accuracy, not the product itself.
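A sketch of the comparison under stated assumptions: both sides are scored with the same average first-per-question Brier, using the market price captured at the same timestamp as our initial forecast. The numbers here are made up for illustration.

```python
def avg_brier(probs: dict, outcomes: dict) -> float:
    """Mean Brier score over one probability per question."""
    return sum((probs[q] - outcomes[q]) ** 2 for q in probs) / len(probs)

# Hypothetical resolved sample: our initial forecasts, the market
# prices at the same points in time, and the 0/1 outcomes.
ours     = {"q1": 0.70, "q2": 0.20}
market   = {"q1": 0.60, "q2": 0.35}
outcomes = {"q1": 1,    "q2": 0}

print(avg_brier(ours, outcomes))    # 0.065
print(avg_brier(market, outcomes))  # 0.14125  (lower is better)
```

Scoring both sides with an identical formula and identical question set is what makes the comparison a calibration benchmark rather than a cherry-picked highlight reel.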

What is included and excluded?

The public scorecard includes all questions that have been resolved through the standard pipeline. Questions that were cancelled, withdrawn, or had ambiguous resolution criteria are excluded. We do not retroactively remove questions that scored poorly.

Sample-size caveats

Forecasting accuracy is best evaluated over large samples. Early-stage scorecards — particularly in domains with few resolved questions — should be interpreted with appropriate caution. We note the resolved count alongside every metric.

Domain variation

Performance varies across domains. Some areas (e.g., geopolitics) may have fewer resolved questions or inherently harder-to-forecast dynamics than others. The domain breakdown section shows these differences transparently rather than hiding behind aggregate numbers.


These are public examples. The full system tracks the questions that matter to your team.