Recent forecasts, resolved outcomes, scorecards, and methodology.
We show both the full historical production record and a clearly labeled current-production cohort so you can see what the active forecasting loop is doing now. The continuous update loop launched recently and will get its own scorecard once the sample is mature.
The historical view preserves the full public record across all production versions. The current-production view shows how the active loop is performing now. Both use first-per-question scoring.
Not every domain is equal. We break out performance so you can see where the system is strong and where it’s still learning.
Current questions with live probabilities and the evidence that would change our view. Click any card to see the full forecast run and agent reasoning.
We publish both hits and misses so the system can be evaluated honestly. In the first public version, postmortems default to the initial forecast for each resolved question. Click any card to see the underlying run detail.
Multiple AI agents search the web, analyze data, and gather evidence on each question.
Agents produce independent probability estimates. A critic challenges the reasoning before synthesis; one illustrative way to pool the estimates is sketched after these steps.
When new evidence surfaces, forecasts are re-run. Every change is timestamped and explained.
Resolved questions are scored with Brier scores and compared against market benchmarks.
Calibration data feeds back into the system. Known weak areas are documented and addressed.
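The synthesis step admits many implementations, and we do not spell out the math above. The sketch below is purely illustrative: it assumes a hypothetical aggregate_probabilities helper that pools independent agent estimates by averaging log-odds, a common ensemble technique, and it does not capture how the critic's feedback is actually weighted in the production pipeline.

```python
import math

def aggregate_probabilities(agent_probs: list[float]) -> float:
    """Pool independent agent estimates into one probability.

    Illustrative only -- not a description of the production synthesis step.
    Estimates are combined by averaging log-odds, then mapped back to a
    probability with the logistic function.
    """
    # Clamp to avoid infinite log-odds at exactly 0 or 1.
    clamped = [min(max(p, 0.01), 0.99) for p in agent_probs]
    mean_logit = sum(math.log(p / (1 - p)) for p in clamped) / len(clamped)
    return 1 / (1 + math.exp(-mean_logit))

# Example: three agents disagree; the pooled estimate lands between them.
print(round(aggregate_probabilities([0.60, 0.72, 0.55]), 3))  # ~0.626
```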
The Brier score measures how close probabilistic forecasts are to actual outcomes. It ranges from 0 (perfect) to 1 (worst possible). A score of 0.25 corresponds to always guessing 50% — the baseline for an uninformed forecaster. Lower is better.
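As a worked example (the brier_score helper below is for illustration only, not part of our pipeline), the score for a single binary forecast is just the squared gap between the stated probability and the 0/1 outcome:

```python
def brier_score(forecast_prob: float, outcome: int) -> float:
    """Brier score for one binary forecast: (p - o)^2.

    forecast_prob is the stated probability of the event; outcome is 1 if
    the event happened, 0 if it did not. Lower is better.
    """
    return (forecast_prob - outcome) ** 2

# An uninformed forecaster who always says 50% scores 0.25 either way:
print(brier_score(0.5, 1))  # 0.25
print(brier_score(0.5, 0))  # 0.25
# A confident, correct forecast scores much lower:
print(brier_score(0.9, 1))  # 0.01
```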
The first-per-question view scores only the initial forecast generated for each question. This gives the cleanest baseline: one score per question, no double-counting from repeated updates.
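A minimal sketch of that selection rule, using a made-up record layout (question id, timestamp, probability, resolved outcome) rather than our actual data model:

```python
from datetime import datetime

# Hypothetical records: (question_id, created_at, probability, outcome)
forecasts = [
    ("q1", datetime(2025, 1, 5), 0.70, 1),   # initial forecast for q1
    ("q1", datetime(2025, 2, 1), 0.85, 1),   # later update -- ignored here
    ("q2", datetime(2025, 1, 9), 0.30, 0),   # initial forecast for q2
]

# Keep only the earliest forecast per question, then score it.
first_per_question = {}
for qid, created_at, prob, outcome in sorted(forecasts, key=lambda r: r[1]):
    first_per_question.setdefault(qid, (prob, outcome))

scores = {qid: round((p - o) ** 2, 4) for qid, (p, o) in first_per_question.items()}
print(scores)                              # {'q1': 0.09, 'q2': 0.09}
print(sum(scores.values()) / len(scores))  # mean Brier: 0.09
```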
The continuous update loop — where forecasts are re-run as new evidence surfaces — launched recently. Until the update loop has enough resolved questions to produce a meaningful sample, we headline initial forecast quality. Update-loop metrics will be published separately once the sample is mature.
The forecasting system evolves over time. Showing both cohorts lets you see the full public record (all production versions, with history preserved) alongside how the active production loop is performing right now. This avoids rewriting history when a new version deploys, while still making rollout-era results visible.
Cenva forecasts on many questions that never appear on public prediction markets. Where a prediction market price does exist (e.g., from Polymarket), we use it as a calibration benchmark — comparing our initial forecast against the market price at the same point in time. The market’s Brier score is computed the same way ours is, using first-per-question methodology. Market comparisons are a tool for measuring accuracy, not the product itself.
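Under the same rule, the comparison reduces to one mean Brier score per source over the identical resolved set. The records, field names, and numbers below are invented for illustration and are not drawn from our scorecard:

```python
# Hypothetical resolved questions: our initial forecast vs. the market price
# captured at the same timestamp.
resolved = [
    {"question": "q1", "our_prob": 0.80, "market_prob": 0.65, "outcome": 1},
    {"question": "q2", "our_prob": 0.20, "market_prob": 0.35, "outcome": 0},
]

def mean_brier(records, prob_key):
    """Mean Brier score over the same resolved set for the chosen source."""
    return sum((r[prob_key] - r["outcome"]) ** 2 for r in records) / len(records)

print(round(mean_brier(resolved, "our_prob"), 4))     # 0.04
print(round(mean_brier(resolved, "market_prob"), 4))  # 0.1225
```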
The public scorecard includes all questions that have been resolved through the standard pipeline. Questions that were cancelled, withdrawn, or had ambiguous resolution criteria are excluded. We do not retroactively remove questions that scored poorly.
Forecasting accuracy is best evaluated over large samples. Early-stage scorecards — particularly in domains with few resolved questions — should be interpreted with appropriate caution. We note the resolved count alongside every metric.
Performance varies across domains. Some areas (e.g., geopolitics) may have fewer resolved questions or inherently harder-to-forecast dynamics than others. The domain breakdown section shows these differences transparently rather than hiding behind aggregate numbers.