Amber Tanaka committed
Commit 7b52df4 · unverified · 1 parent: 0dd7833

Paper cuts (#64)

about.py CHANGED
@@ -3,76 +3,127 @@ import gradio as gr
 
 def build_page():
 with gr.Column(elem_id="about-page-content-wrapper"):
- gr.Markdown(
 """
- ## About AstaBench
- AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.
- """)
- gr.Markdown("---",elem_classes="divider-line")
 
- gr.Markdown(""" ## Why AstaBench?
- Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.
-
- AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
- """)
- gr.Markdown("---",elem_classes="divider-line")
 
- gr.Markdown("""
- ## What Does AstaBench Include?
- AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
- - Literature Understanding
- - Code & Execution
- - Data Analysis
- - End-to-End Discovery
- Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.
-
- 🔍 Learn more in the AstaBench technical blog post
- """)
- gr.Markdown("---",elem_classes="divider-line")
 
- gr.Markdown("""
- ## Understanding the Leaderboards
- The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
- - Overall score: A macro-average of the four category-level averages (equal weighting)
- - Overall cost: Average cost per task, aggregated only across benchmarks with reported cost
-
- Each category leaderboard provides:
- - Average score and cost for that category (macro-averaged across the benchmarks in the category)
- - A breakdown by individual benchmarks
- """)
- gr.Markdown("---",elem_classes="divider-line")
 
- gr.Markdown("""
- ## Scoring & Aggregation
- AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:
-
- **Scores**
- - Each benchmark returns an average score based on per-problem scores
- - All scores are aggregated upward using macro-averaging
- - Partial completions are included (even with poor performance)
-
- **Cost**
- - Costs are reported in USD per task.
- - Benchmarks without cost data are excluded from cost averages
- - In scatter plots, agents without cost are plotted to the far right and clearly marked.
-
- Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
-
- **Coverage**
- - Main leaderboard: category coverage (X/4)
- - Category view: benchmark coverage (X/Y)
- - Incomplete coverage is flagged visually
-
- These design choices ensure fair comparison while penalizing cherry-picking and omissions.
- """)
- gr.Markdown("---",elem_classes="divider-line")
 
- gr.Markdown("""
- ## Learn More
- - AstaBench technical blog post
- - FAQ and submission guide
- """
- )
 
 # Floating feedback button
 floating_feedback_button_html = """
 
 def build_page():
 with gr.Column(elem_id="about-page-content-wrapper"):
+ # --- Section 1: About AstaBench ---
+ gr.HTML(
 """
+ <h2>About AstaBench</h2>
+ <p>
+ AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.
+ </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
 
+ # --- Section 2: Why AstaBench? ---
+ gr.HTML(
+ """
+ <h2>Why AstaBench?</h2>
+ <p>
+ Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.
+ </p>
+ <br>
+ <p>
+ AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
+ </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
 
+ # --- Section 3: What Does AstaBench Include? ---
+ gr.HTML(
+ """
+ <h2>What Does AstaBench Include?</h2>
+ <p>
+ AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
+ </p>
+ <ul class="info-list">
+ <li>Literature Understanding</li>
+ <li>Code & Execution</li>
+ <li>Data Analysis</li>
+ <li>End-to-End Discovery</li>
+ </ul>
+ <p>
+ Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.
+ </p>
+ <p>
+ 🔍 Learn more in the AstaBench technical blog post
+ </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
 
+ # --- Section 4: Understanding the Leaderboards ---
+ gr.HTML(
+ """
+ <h2>Understanding the Leaderboards</h2>
+ <p>
+ The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
+ </p>
+ <ul class="info-list">
+ <li>Overall score: A macro-average of the four category-level averages (equal weighting)</li>
+ <li>Overall cost: Average cost per task, aggregated only across benchmarks with reported cost</li>
+ </ul>
+ <p>
+ Each category leaderboard provides:
+ </p>
+ <ul class="info-list">
+ <li>Average score and cost for that category (macro-averaged across the benchmarks in the category)</li>
+ <li>A breakdown by individual benchmarks</li>
+ </ul>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
 
+ # --- Section 5: Scoring & Aggregation ---
+ gr.HTML(
+ """
+ <h2>Scoring & Aggregation</h2>
+ <p>
+ AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:
+ </p>
+
+ <h3>Scores</h3>
+ <ul class="info-list">
+ <li>Each benchmark returns an average score based on per-problem scores</li>
+ <li>All scores are aggregated upward using macro-averaging</li>
+ <li>Partial completions are included (even with poor performance)</li>
+ </ul>
+
+ <h3>Cost</h3>
+ <ul class="info-list">
+ <li>Costs are reported in USD per task.</li>
+ <li>Benchmarks without cost data are excluded from cost averages</li>
+ <li>In scatter plots, agents without cost are plotted to the far right and clearly marked.</li>
+ </ul>
+
+ <p>
+ <em>Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.</em>
+ </p>
+
+ <h3>Coverage</h3>
+ <ul class="info-list">
+ <li>Main leaderboard: category coverage (X/4)</li>
+ <li>Category view: benchmark coverage (X/Y)</li>
+ <li>Incomplete coverage is flagged visually</li>
+ </ul>
+
+ <p>
+ These design choices ensure fair comparison while penalizing cherry-picking and omissions.
+ </p>
+ """
+ )
+ gr.Markdown("---", elem_classes="divider-line")
 
+ # --- Section 6: Learn More ---
+ gr.HTML(
+ """
+ <h2>Learn More</h2>
+ <ul class="info-list">
+ <li>AstaBench technical blog post</li>
+ <li>FAQ and submission guide</li>
+ </ul>
+ """
+ )
 
 # Floating feedback button
 floating_feedback_button_html = """
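
The "Scoring & Aggregation" copy above says that each benchmark reports an average score, that category averages are macro-averages over their benchmarks, and that the overall score is a macro-average of the four category averages. A minimal sketch of that aggregation, using made-up scores (the dictionary below is illustrative, not the leaderboard's real data or aggregation code):

```python
from statistics import mean

# Hypothetical per-benchmark average scores, grouped by category (illustrative only).
category_scores = {
    "Literature Understanding": [0.62, 0.55, 0.71],
    "Code & Execution": [0.48, 0.53, 0.57],
    "Data Analysis": [0.40, 0.37],
    "End-to-End Discovery": [0.33, 0.29, 0.21],
}

# Each category average is a macro-average over the benchmarks it contains...
category_averages = {cat: mean(scores) for cat, scores in category_scores.items()}

# ...and the overall score is a macro-average of the four category averages,
# so every category contributes equally regardless of how many benchmarks it has.
overall_score = mean(category_averages.values())

print(category_averages)
print(round(overall_score, 3))
```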
assets/c-custom.svg CHANGED
assets/c-equivalent.svg CHANGED
assets/c-legend.svg CHANGED
assets/c-standard.svg CHANGED
assets/trophy.svg ADDED
category_page_builder.py CHANGED
@@ -6,6 +6,7 @@ from ui_components import create_leaderboard_display, create_benchmark_details_d
 
 def build_category_page(CATEGORY_NAME, PAGE_DESCRIPTION):
 with gr.Column(elem_id="page-content-wrapper"):
 validation_df, validation_tag_map = get_full_leaderboard_data("validation")
 test_df, test_tag_map = get_full_leaderboard_data("test")
 with gr.Column(elem_id="validation_nav_container", visible=False) as validation_nav_container:
@@ -13,7 +14,6 @@ def build_category_page(CATEGORY_NAME, PAGE_DESCRIPTION):
 
 with gr.Column(elem_id="test_nav_container", visible=True) as test_nav_container:
 create_sub_navigation_bar(test_tag_map, CATEGORY_NAME)
- gr.HTML(f'<h2>AstaBench {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
 gr.Markdown(PAGE_DESCRIPTION, elem_id="category-intro")
 # --- This page now has two main sections: Validation and Test ---
 with gr.Tabs():
 
 def build_category_page(CATEGORY_NAME, PAGE_DESCRIPTION):
 with gr.Column(elem_id="page-content-wrapper"):
+ gr.HTML(f'<h2>AstaBench {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
 validation_df, validation_tag_map = get_full_leaderboard_data("validation")
 test_df, test_tag_map = get_full_leaderboard_data("test")
 with gr.Column(elem_id="validation_nav_container", visible=False) as validation_nav_container:
 
 with gr.Column(elem_id="test_nav_container", visible=True) as test_nav_container:
 create_sub_navigation_bar(test_tag_map, CATEGORY_NAME)
 gr.Markdown(PAGE_DESCRIPTION, elem_id="category-intro")
 # --- This page now has two main sections: Validation and Test ---
 with gr.Tabs():
content.py CHANGED
@@ -1,48 +1,45 @@
 TITLE = """<h1 align="left" id="space-title">AstaBench Leaderboard</h1>"""
 
 INTRO_PARAGRAPH = """
- Newer benchmarks may test agentic AI and isolated aspects of scientific reasoning, but none rigorously measure agentic AI or capture the full range of skills research demands. Agents can appear effective by simply retrying tasks—often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
 <br>
 <br>
- AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories: Literature Understanding, Code & Execution, Data Analysis, and End-to-End Discovery.
 <br>
 <br>
- The **AstaBench Leaderboard** below provides a high-level summary of agent performance and efficiency. It includes:
- <br>
- <br>
- - An **overall score**, computed as a macro average of the four category-level macro averages, ensuring each domain contributes equally—regardless of how many benchmarks each category includes. This provides a fair and balanced comparison across agents with varying capabilities.
- - An **overall average cost per task**, consistently aggregated across all categories, to reflect the real efficiency of each agent under comparable conditions.
- <br>
- To support domain-specific insight, AstaBench also provides per-category leaderboards:
- <br>
- <br>
- - Literature Understanding
- <br>
- - Code & Execution
- <br>
- - Data Analysis
- <br>
- - End-to-End Discovery
- <br>
- <br>
- Each category page includes a summary table (average score and cost per problem for that domain), as well as per-benchmark leaderboards for detailed comparisons on specific tasks.
- <br>
- <br>
- 🔍 Learn more in the AstaBench technical blog post
 """
 SCATTER_DISCLAIMER = """
- **Note:** Agents without cost data are displayed to the right of the vertical divider line. <span class="tooltip-icon" data-tooltip="Missing Cost Dashed Line: Max Cost + (MaxCost/10) Missing Cost Datapoints/No Cost Data = Max Cost + (MaxCost/5)">ⓘ</span>
- """
- scatter_disclaimer_html = """
- <div class="disclaimer-text">
- <b>Note:</b> Agents without cost data are displayed to the right of the vertical divider line.
- <span class="tooltip-icon" data-tooltip="Missing Cost Dashed Line:
- Max Cost + (MaxCost/10)
- Missing Cost Datapoints / No Cost Data:
- Max Cost + (MaxCost/5)">
-
- </span>
- </div>
 """
 PARETO_DISCLAIMER = """
 Agents names that are green are Pareto optimal, meaning they achieve the best performance for their cost.
@@ -246,7 +243,7 @@ nav.svelte-ti537g.svelte-ti537g {
 line-height: 1.2 !important;
 vertical-align: top !important;
 font-size: 12px !important;
-
 }
 .wrap-header-df th {
 height: auto !important;
@@ -539,11 +536,11 @@ span.wrap[tabindex="0"][role="button"][data-editable="false"] {
 margin-top: 40px;
 opacity: 85%;
 }
- .divider-line {
- opacity: 40%;
- }
 #leaderboard-accordion table {
 width: auto !important;
 margin-right: auto !important;
 }
 """
 
 TITLE = """<h1 align="left" id="space-title">AstaBench Leaderboard</h1>"""
 
 INTRO_PARAGRAPH = """
+ <p>
+ Newer benchmarks may test agentic AI and isolated aspects of scientific reasoning, but none rigorously measure agentic AI or capture the full range of skills research demands. Agents can appear effective by simply retrying tasks—often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
+ </p>
 <br>
+ <p>
+ AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories: Literature Understanding, Code & Execution, Data Analysis, and End-to-End Discovery.
+ </p>
 <br>
+ <p>
+ The <strong>AstaBench Leaderboard</strong> below provides a high-level summary of agent performance and efficiency. It includes:
+ </p>
+ <ul class="info-list">
+ <li>
+ An <strong>overall score</strong>, computed as a macro average of the four category-level macro averages, ensuring each domain contributes equally—regardless of how many benchmarks each category includes. This provides a fair and balanced comparison across agents with varying capabilities.
+ </li>
+ <li>
+ An <strong>overall average cost per task</strong>, consistently aggregated across all categories, to reflect the real efficiency of each agent under comparable conditions.
+ </li>
+ </ul>
 <br>
+ <p>
+ To support domain-specific insight, AstaBench also provides per-category leaderboards:
+ </p>
+ <ul class="info-list">
+ <li>Literature Understanding</li>
+ <li>Code & Execution</li>
+ <li>Data Analysis</li>
+ <li>End-to-End Discovery</li>
+ </ul>
 <br>
+ <p>
+ Each category page includes a summary table (average score and cost per problem for that domain), as well as per-benchmark leaderboards for detailed comparisons on specific tasks.
+ </p>
+ <p>
+ 🔍 Learn more in the AstaBench technical blog post
+ </p>
 """
 SCATTER_DISCLAIMER = """
+ **Note:** Agents without cost data are displayed to the right of the vertical divider line.
 """
 PARETO_DISCLAIMER = """
 Agents names that are green are Pareto optimal, meaning they achieve the best performance for their cost.
 
 line-height: 1.2 !important;
 vertical-align: top !important;
 font-size: 12px !important;
+ font-family: 'Manrope';
 }
 .wrap-header-df th {
 height: auto !important;
 
 margin-top: 40px;
 opacity: 85%;
 }
 #leaderboard-accordion table {
 width: auto !important;
 margin-right: auto !important;
 }
+ .info-list {
+ padding-left: 20px;
+ }
 """
leaderboard_transformer.py CHANGED
@@ -511,7 +511,7 @@ def _plot_scatter_plotly(
 marker=dict(
 color=color_map.get(category, 'black'),
 symbol=group['shape_symbol'],
- size=10,
 opacity=0.8,
 line=dict(width=1, color='deeppink')
 )
@@ -561,7 +561,7 @@ def _plot_scatter_plotly(
 
 fig.update_layout(
 template="plotly_white",
- title=f"Astabench {name} Leaderboard",
 xaxis=xaxis_config, # Use the updated config
 yaxis=dict(title="Average (mean) score", rangemode="tozero"),
 legend=dict(
 
 marker=dict(
 color=color_map.get(category, 'black'),
 symbol=group['shape_symbol'],
+ size=15,
 opacity=0.8,
 line=dict(width=1, color='deeppink')
 )
 
 fig.update_layout(
 template="plotly_white",
+ title=f"AstaBench {name} Leaderboard",
 xaxis=xaxis_config, # Use the updated config
 yaxis=dict(title="Average (mean) score", rangemode="tozero"),
 legend=dict(
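
The SCATTER_DISCLAIMER tooltip removed from content.py above encoded the placement rule for agents with no cost data: a dashed divider at Max Cost + (MaxCost/10) and the no-cost points at Max Cost + (MaxCost/5). The `_plot_scatter_plotly` body is not shown in this diff, so the sketch below only illustrates that rule under assumed names (`costs`, `divider_x`, `missing_cost_x`); it is not the function's actual code.

```python
def missing_cost_positions(costs: list[float]) -> tuple[float, float]:
    """Illustrative placement rule taken from the old tooltip text; names are assumptions."""
    max_cost = max(costs)
    divider_x = max_cost + max_cost / 10       # dashed divider line
    missing_cost_x = max_cost + max_cost / 5   # x position for agents with no cost data
    return divider_x, missing_cost_x

# Example: with a maximum reported cost of $1.20 per task,
# the divider sits at 1.32 and no-cost agents are drawn at 1.44.
print(missing_cost_positions([0.10, 0.45, 1.20]))
```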
main_page.py CHANGED
@@ -16,11 +16,11 @@ CACHED_VIEWERS = {}
 CACHED_TAG_MAPS = {}
 
 def build_page():
- gr.Markdown(INTRO_PARAGRAPH, elem_id="intro-paragraph")
 # --- Leaderboard Display Section ---
 gr.Markdown("---")
 CATEGORY_NAME = "Overall"
- gr.Markdown(f"## Astabench {CATEGORY_NAME} Leaderboard")
 
 with gr.Tabs() as tabs:
 with gr.Tab("Results: Test Set") as test_tab:
 
 CACHED_TAG_MAPS = {}
 
 def build_page():
+ gr.HTML(INTRO_PARAGRAPH, elem_id="intro-paragraph")
 # --- Leaderboard Display Section ---
 gr.Markdown("---")
 CATEGORY_NAME = "Overall"
+ gr.Markdown(f"## AstaBench {CATEGORY_NAME} Leaderboard")
 
 with gr.Tabs() as tabs:
 with gr.Tab("Results: Test Set") as test_tab:
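
main_page.py now renders INTRO_PARAGRAPH with gr.HTML rather than gr.Markdown because, after the content.py rework, the string is raw HTML (`<p>`, `<ul>`, `<strong>`) instead of Markdown. A minimal standalone Gradio sketch of that distinction (a separate demo, not part of the leaderboard app):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Markdown source is parsed and rendered by gr.Markdown...
    gr.Markdown("**Note:** this line is written in Markdown")
    # ...while gr.HTML injects an already-formed HTML string as-is.
    gr.HTML("<p><strong>Note:</strong> this line is written in HTML</p>")

if __name__ == "__main__":
    demo.launch()
```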
ui_components.py CHANGED
@@ -27,12 +27,12 @@ from config import (
 RESULTS_DATASET,
 )
 from content import (
- scatter_disclaimer_html,
 format_error,
 format_log,
 format_warning,
 hf_uri_to_web_url,
 hyperlink,
 )
 
 api = HfApi()
@@ -167,6 +167,8 @@ def build_openness_tooltip_content() -> str:
 
 def build_pareto_tooltip_content() -> str:
 """Generates the inner HTML for the Pareto tooltip card with final copy."""
 return f"""
 <h3>On Pareto Frontier</h3>
 <p class="tooltip-description">The Pareto frontier represents the best balance between score and cost.</p>
@@ -175,7 +177,10 @@ def build_pareto_tooltip_content() -> str:
 <li>Offer the lowest cost for a given performance, or</li>
 <li>Deliver the best performance at a given cost.</li>
 </ul>
- <p class="tooltip-description" style="margin-top: 12px;">These agents are marked with this icon: 🏆</p>
 """
 
 def build_tooling_tooltip_content() -> str:
@@ -295,6 +300,7 @@ def create_legend_markdown(which_table: str) -> str:
 This is used in the main leaderboard display.
 """
 descriptions_tooltip_content = build_descriptions_tooltip_content(which_table)
 legend_markdown = f"""
 <div style="display: flex; flex-wrap: wrap; align-items: flex-start; gap: 10px; font-size: 14px; padding-bottom: 8px;">
 
@@ -304,7 +310,10 @@ def create_legend_markdown(which_table: str) -> str:
 
 <span class="tooltip-card">{pareto_tooltip_content}</span>
 </span>
- <div style="margin-top: 8px;"><span>🏆 On frontier</span></div>
 </div>
 
 <div> <!-- Container for the Openness section -->
@@ -424,12 +433,14 @@ def create_leaderboard_display(
 df_view, plots_dict = transformer.view(tag=category_name, use_plotly=True)
 pareto_df = get_pareto_df(df_view)
 # Get the list of agents on the frontier. We'll use this list later.
 if not pareto_df.empty and 'id' in pareto_df.columns:
 pareto_agent_names = pareto_df['id'].tolist()
 else:
 pareto_agent_names = []
 df_view['Pareto'] = df_view.apply(
- lambda row: '🏆' if row['id'] in pareto_agent_names else '',
 axis=1
 )
 # Create mapping for Openness / tooling
@@ -472,7 +483,7 @@ def create_leaderboard_display(
 for col in df_headers:
 if col == "Logs" or "Cost" in col or "Score" in col:
 df_datatypes.append("markdown")
- elif col in ["Agent","Icon","LLM Base"]:
 df_datatypes.append("html")
 else:
 df_datatypes.append("str")
@@ -499,7 +510,7 @@ def create_leaderboard_display(
 value=scatter_plot,
 show_label=False
 )
- gr.HTML(value=scatter_disclaimer_html, elem_id="scatter-disclaimer")
 # Put table and key into an accordion
 with gr.Accordion("Show / Hide Table View", open=True, elem_id="leaderboard-accordion"):
 dataframe_component = gr.DataFrame(
@@ -539,8 +550,8 @@ def create_benchmark_details_display(
 gr.Markdown(f"No detailed benchmarks found for the category: {category_name}")
 return
 
 gr.Markdown("---")
- gr.Markdown("## Detailed Benchmark Results")
 # 2. Loop through each benchmark and create its UI components
 for benchmark_name in benchmark_names:
 with gr.Row(elem_classes=["benchmark-header"]):
@@ -573,12 +584,14 @@ def create_benchmark_details_display(
 benchmark_table_df = full_df[existing_table_cols].copy()
 pareto_df = get_pareto_df(benchmark_table_df)
 # Get the list of agents on the frontier. We'll use this list later.
 if not pareto_df.empty and 'id' in pareto_df.columns:
 pareto_agent_names = pareto_df['id'].tolist()
 else:
 pareto_agent_names = []
 benchmark_table_df['Pareto'] = benchmark_table_df.apply(
- lambda row: ' 🏆' if row['id'] in pareto_agent_names else '',
 axis=1
 )
 
@@ -643,7 +656,7 @@ def create_benchmark_details_display(
 for col in df_headers:
 if "Logs" in col or "Cost" in col or "Score" in col:
 df_datatypes.append("markdown")
- elif col in ["Agent","Icon", "LLM Base"]:
 df_datatypes.append("html")
 else:
 df_datatypes.append("str")
@@ -662,7 +675,7 @@ def create_benchmark_details_display(
 name=benchmark_name
 )
 gr.Plot(value=benchmark_plot, show_label=False)
- gr.HTML(value=scatter_disclaimer_html, elem_id="scatter-disclaimer")
 # Put table and key into an accordion
 with gr.Accordion("Show / Hide Table View", open=True, elem_id="leaderboard-accordion"):
 gr.DataFrame(
@@ -755,7 +768,7 @@ def create_sub_navigation_bar(tag_map: dict, category_name: str, validation: boo
 # This container will be our flexbox row.
 full_html = f"""
 <div class="sub-nav-bar-container">
- <span class="sub-nav-label">Benchmarks:</span>
 {''.join(html_buttons)}
 </div>
 """

 RESULTS_DATASET,
 )
 from content import (
 format_error,
 format_log,
 format_warning,
 hf_uri_to_web_url,
 hyperlink,
+ SCATTER_DISCLAIMER,
 )
 
 api = HfApi()
 
 
 def build_pareto_tooltip_content() -> str:
 """Generates the inner HTML for the Pareto tooltip card with final copy."""
+ trophy_uri = get_svg_as_data_uri("assets/trophy.svg")
+ trophy_icon_html = f'<img src="{trophy_uri}" style="width: 25px; height: 25px; vertical-align: middle;">'
 return f"""
 <h3>On Pareto Frontier</h3>
 <p class="tooltip-description">The Pareto frontier represents the best balance between score and cost.</p>
 
 <li>Offer the lowest cost for a given performance, or</li>
 <li>Deliver the best performance at a given cost.</li>
 </ul>
+ <div class="tooltip-description" style="margin-top: 12px; display: flex; align-items: center;">
+ <span>These agents are marked with this icon:</span>
+ <span>{trophy_icon_html}</span>
+ </div>
 """
 
 def build_tooling_tooltip_content() -> str:
 
 This is used in the main leaderboard display.
 """
 descriptions_tooltip_content = build_descriptions_tooltip_content(which_table)
+ trophy_uri = get_svg_as_data_uri("assets/trophy.svg")
 legend_markdown = f"""
 <div style="display: flex; flex-wrap: wrap; align-items: flex-start; gap: 10px; font-size: 14px; padding-bottom: 8px;">
 
 
 <span class="tooltip-card">{pareto_tooltip_content}</span>
 </span>
+ <div style="margin-top: 8px; display: flex; align-items: center; gap: 6px;">
+ <img src="{trophy_uri}" alt="On frontier" style="width: 25px; height: 25px;">
+ <span>On frontier</span>
+ </div>
 </div>
 
 <div> <!-- Container for the Openness section -->
 
 df_view, plots_dict = transformer.view(tag=category_name, use_plotly=True)
 pareto_df = get_pareto_df(df_view)
 # Get the list of agents on the frontier. We'll use this list later.
+ trophy_uri = get_svg_as_data_uri("assets/trophy.svg")
+ trophy_icon_html = f'<img src="{trophy_uri}" alt="On Pareto Frontier" title="On Pareto Frontier" style="width:25px; height:25px;">'
 if not pareto_df.empty and 'id' in pareto_df.columns:
 pareto_agent_names = pareto_df['id'].tolist()
 else:
 pareto_agent_names = []
 df_view['Pareto'] = df_view.apply(
+ lambda row: trophy_icon_html if row['id'] in pareto_agent_names else '',
 axis=1
 )
 # Create mapping for Openness / tooling
 
 for col in df_headers:
 if col == "Logs" or "Cost" in col or "Score" in col:
 df_datatypes.append("markdown")
+ elif col in ["Agent","Icon","LLM Base", "Pareto"]:
 df_datatypes.append("html")
 else:
 df_datatypes.append("str")
 
 value=scatter_plot,
 show_label=False
 )
+ gr.Markdown(value=SCATTER_DISCLAIMER, elem_id="scatter-disclaimer")
 # Put table and key into an accordion
 with gr.Accordion("Show / Hide Table View", open=True, elem_id="leaderboard-accordion"):
 dataframe_component = gr.DataFrame(
 
 gr.Markdown(f"No detailed benchmarks found for the category: {category_name}")
 return
 
+ gr.HTML(f'<h2 style="padding-top: 120px;">{category_name} Detailed Benchmark Results</h2>')
 gr.Markdown("---")
 # 2. Loop through each benchmark and create its UI components
 for benchmark_name in benchmark_names:
 with gr.Row(elem_classes=["benchmark-header"]):
 
 benchmark_table_df = full_df[existing_table_cols].copy()
 pareto_df = get_pareto_df(benchmark_table_df)
 # Get the list of agents on the frontier. We'll use this list later.
+ trophy_uri = get_svg_as_data_uri("assets/trophy.svg")
+ trophy_icon_html = f'<img src="{trophy_uri}" alt="On Pareto Frontier" title="On Pareto Frontier" style="width:25px; height:25px;">'
 if not pareto_df.empty and 'id' in pareto_df.columns:
 pareto_agent_names = pareto_df['id'].tolist()
 else:
 pareto_agent_names = []
 benchmark_table_df['Pareto'] = benchmark_table_df.apply(
+ lambda row: trophy_icon_html if row['id'] in pareto_agent_names else '',
 axis=1
 )
 
 
 for col in df_headers:
 if "Logs" in col or "Cost" in col or "Score" in col:
 df_datatypes.append("markdown")
+ elif col in ["Agent", "Icon", "LLM Base", "Pareto"]:
 df_datatypes.append("html")
 else:
 df_datatypes.append("str")
 
 name=benchmark_name
 )
 gr.Plot(value=benchmark_plot, show_label=False)
+ gr.Markdown(value=SCATTER_DISCLAIMER, elem_id="scatter-disclaimer")
 # Put table and key into an accordion
 with gr.Accordion("Show / Hide Table View", open=True, elem_id="leaderboard-accordion"):
 gr.DataFrame(
 
 # This container will be our flexbox row.
 full_html = f"""
 <div class="sub-nav-bar-container">
+ <span class="sub-nav-label">Benchmarks in this category:</span>
 {''.join(html_buttons)}
 </div>
 """