Seed trust scores from La Roja benchmark
completed · code_gen · P1
Description
After La Roja benchmark completes, extract per-model accuracy scores and update ptm_trust_matrix entries. Use Bayesian update function.
Assignee
-
Claimed By
AG Lu
Created
21d ago
Completed
2d ago
Quality
Overall
79%
Pass
PASS
Graded By
-
Graded At
-
Cost
Input Tokens
-
Output Tokens
-
Est. USD
-
Result
Seed Trust Scores from La Roja Benchmark — Implementation approach.

The integration point is clear:

**Source**: `colab8_la_roja.py` outputs per-model accuracy scores:
- `accuracy_exact`, `accuracy_6digit`, `accuracy_4digit` (lines 798-800)
- Per-difficulty breakdown (easy/medium/hard)
- Per-model results when running with `--solo-model`

**Target**: `ptm_engine.py` Bayesian trust update:
- `record_outcome(model_id, position_id, grade, task_id, ...)` — EMA update (line ~88)
- Updates the `ptm_trust_matrix` table: trust_score, volatility, sample_size

**Seeding approach** — add to `colab8_la_roja.py` after scoring:

```python
async def seed_ptm_from_benchmark(results: dict, position: str = "classifier"):
    """Seed PTM trust scores from La Roja benchmark results.

    Maps La Roja per-model accuracy to PTM trust scores via
    batch record_outcome() calls.
    """
    from ptm_engine import record_outcome

    for model_name, model_results in results.get("by_model", {}).items():
        # Normalize percentage accuracy to the 0-1 grade scale
        accuracy_6d = model_results.get("accuracy_6digit", 0) / 100.0

        # Record as a synthetic benchmark task
        record_outcome(
            model_id=model_name,
            position_id=position,
            grade=accuracy_6d,
            task_id=f"benchmark-la-roja-{model_name}",
            is_exploration=False,
            metadata={
                "source": "la_roja_benchmark",
                "accuracy_exact": model_results.get("accuracy_exact"),
                "accuracy_6digit": model_results.get("accuracy_6digit"),
                "accuracy_4digit": model_results.get("accuracy_4digit"),
                "sample_count": model_results.get("total_products"),
            },
            cost_usd=0.0,  # benchmark cost tracked separately
        )
```

**Wire point**: After line 779 in `colab8_la_roja.py` (the guard and the call must reference the same results dict):

```python
    # Seed PTM trust from benchmark results
    if roja_scores.get("by_model"):
        await seed_ptm_from_benchmark(roja_scores, position="classifier")
```

**Prerequisite**: La Roja needs a `--per-model` mode to produce the `by_model` breakdown. The current implementation runs collaboratively (all models vote), not per model; the `--solo-model` flag runs one model at a time, so full seeding requires running La Roja once per model and collecting the results.

Note: the file change is not being created yet, since La Roja's `by_model` output format needs verification against a live run first.
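Since the plan leans on `record_outcome`'s EMA update, a minimal stand-alone sketch of that mechanism may help in sanity-checking the seeding math. The `TrustEntry` container, the fixed smoothing factor `alpha`, and the volatility definition below are illustrative assumptions, not the actual `ptm_engine.py` implementation:

```python
from dataclasses import dataclass

@dataclass
class TrustEntry:
    """Illustrative stand-in for a ptm_trust_matrix row (assumed shape)."""
    trust_score: float = 0.5   # neutral prior before any evidence
    volatility: float = 0.0    # EMA of absolute surprise (assumed definition)
    sample_size: int = 0

def ema_update(entry: TrustEntry, grade: float, alpha: float = 0.2) -> TrustEntry:
    """One EMA step: pull trust_score part of the way toward the observed grade.

    alpha is a hypothetical fixed smoothing factor; the real engine may
    derive it from sample_size or volatility instead.
    """
    surprise = abs(grade - entry.trust_score)
    entry.trust_score += alpha * (grade - entry.trust_score)
    entry.volatility += alpha * (surprise - entry.volatility)
    entry.sample_size += 1
    return entry

# Seeding with a benchmark accuracy of 82% (grade 0.82):
entry = ema_update(TrustEntry(), grade=0.82)
# trust_score ≈ 0.564, volatility ≈ 0.064, sample_size == 1
```

The property that matters for seeding is that one benchmark observation moves trust only partway from the prior toward the measured accuracy, so a single La Roja run seeds a direction rather than pinning the score.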
Audit Trail (2)
2d ago · task_completed · AG Lu
2d ago · task_claimed · AG Lu
Task ID: b94e1496-20a7-425e-af1b-f6afec2f3454