Amber Tanaka committed on
Commit b077021 · unverified · 1 Parent(s): 17162c9

Refactoring intro paragraph / layout (#67)

Files changed (2)
  1. content.py +32 -27
  2. main_page.py +8 -2
content.py CHANGED
@@ -17,40 +17,20 @@ TITLE = """<h1 align="left" id="space-title">AstaBench Leaderboard</h1>"""
 
 INTRO_PARAGRAPH = """
 <p>
-Newer benchmarks may test agentic AI and isolated aspects of scientific reasoning, but none rigorously measure agentic AI or capture the full range of skills research demands. Agents can appear effective by simply retrying tasks—often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
-</p>
-<br>
-<p>
-AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories: Literature Understanding, Code & Execution, Data Analysis, and End-to-End Discovery.
-</p>
-<br>
-<p>
-The <strong>AstaBench Leaderboard</strong> below provides a high-level summary of agent performance and efficiency. It includes:
+<strong>AstaBench</strong> provides an aggregated view of agent performance and efficiency across all benchmarks in all four categories. We report:
 </p>
+
 <ul class="info-list">
 <li>
-An <strong>overall score</strong>, computed as a macro average of the four category-level macro averages, ensuring each domain contributes equallyregardless of how many benchmarks each category includes. This provides a fair and balanced comparison across agents with varying capabilities.
+<strong>Overall score:</strong> A macro-average of the four category-level average scores. Each category contributes equally, regardless of how many benchmarks it includes. This ensures fair comparisons across agents with different domain strengths.
 </li>
 <li>
-An <strong>overall average cost per task</strong>, consistently aggregated across all categories, to reflect the real efficiency of each agent under comparable conditions.
+<strong>Overall cost:</strong> A macro-average of the agent’s cost per problem across all categories, in USD. Each category contributes equally.
 </li>
 </ul>
-<br>
-<p>
-To support domain-specific insight, AstaBench also provides per-category leaderboards:
-</p>
-<ul class="info-list">
-<li>Literature Understanding</li>
-<li>Code & Execution</li>
-<li>Data Analysis</li>
-<li>End-to-End Discovery</li>
-</ul>
-<br>
-<p>
-Each category page includes a summary table (average score and cost per problem for that domain), as well as per-benchmark leaderboards for detailed comparisons on specific tasks.
-</p>
+
 <p>
-🔍 Learn more in the AstaBench technical blog post
+This view is designed for quick comparison of general-purpose scientific agents. For more details on how we calculate scores and cost, please see the <a href="/about" style="color: #0FCB8C; text-decoration: underline;">About</a> Page.
 </p>
 """
 SCATTER_DISCLAIMER = """
@@ -661,7 +641,32 @@ span.wrap[tabindex="0"][role="button"][data-editable="false"] {
 html {
     scroll-behavior: smooth;
 }
-
+/* Home Page Styling */
+.diagram-placeholder {
+    width: 100%;
+    height: 100%;
+    min-height: 250px;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    background-color: #FAF2E9;
+    color: #F0529C;
+    border-radius: 8px;
+    font-size: 14px;
+    text-align: center;
+}
+/* 2. Responsive behavior for smaller screens */
+@media (max-width: 900px) {
+    #intro-row {
+        flex-direction: column;
+    }
+}
+#home-page-content-wrapper{
+    margin-top: 40px;
+}
+#intro-paragraph {
+    max-width: 90%;
+}
 /* Plot legend styles */
 .plot-legend-container {
     height: 572px;
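Note on the aggregation described in the new INTRO_PARAGRAPH: the following is a minimal Python sketch of a macro-average over category-level averages, not the leaderboard's actual implementation. The function names, the dictionary layout, and every number are hypothetical and chosen only for illustration; only the four category names come from the text above. It also assumes each category-level value is a plain mean over that category's benchmarks, which the diff does not confirm.

```python
from statistics import mean


def category_average(benchmark_values):
    """Plain mean over the benchmarks inside one category (assumed)."""
    return mean(benchmark_values)


def macro_average(per_category):
    """Mean of the per-category means, so every category weighs the same."""
    return mean([category_average(v) for v in per_category.values()])


# Hypothetical per-benchmark scores; the numbers are made up.
scores = {
    "Literature Understanding": [0.60, 0.40, 0.50, 0.50],
    "Code & Execution": [0.30],
    "Data Analysis": [0.80, 0.60],
    "End-to-End Discovery": [0.20, 0.40],
}

# Hypothetical per-benchmark cost per problem, in USD.
costs = {
    "Literature Understanding": [0.10, 0.30],
    "Code & Execution": [1.00],
    "Data Analysis": [0.50, 0.70],
    "End-to-End Discovery": [2.00, 4.00],
}

overall_score = macro_average(scores)  # approximately 0.45
overall_cost = macro_average(costs)    # approximately 1.20 USD

# For contrast: a flat mean over all benchmarks lets the largest
# category dominate, which the macro-average avoids.
flat_score = mean([s for values in scores.values() for s in values])
```

With these made-up numbers the flat mean is roughly 0.48 while the macro-average is roughly 0.45, because the four Literature Understanding benchmarks no longer outweigh the single Code & Execution benchmark.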
main_page.py CHANGED
@@ -16,11 +16,17 @@ CACHED_VIEWERS = {}
 CACHED_TAG_MAPS = {}
 
 def build_page():
-    gr.HTML(INTRO_PARAGRAPH, elem_id="intro-paragraph")
+    with gr.Column(elem_id="home-page-content-wrapper"):
+        with gr.Row(elem_id="intro-row"):
+            with gr.Column(scale=6):
+                gr.HTML(INTRO_PARAGRAPH, elem_id="intro-paragraph")
+            with gr.Column(scale=4):
+                gr.HTML('<div class="diagram-placeholder">Future Diagram</div>')
+
     # --- Leaderboard Display Section ---
     gr.Markdown("---")
     CATEGORY_NAME = "Overall"
-    gr.Markdown(f"## AstaBench {CATEGORY_NAME} Leaderboard")
+    gr.HTML(f'<h2>AstaBench {CATEGORY_NAME} Leaderboard <span style="font-weight: normal; color: inherit;">(Aggregate)</span></h2>', elem_id="main-header")
 
     with gr.Tabs() as tabs:
         with gr.Tab("Results: Test Set") as test_tab: