Show trad_scores when mode is fair (and docs)
Files changed:
- .idea/.gitignore  +3 -0
- .idea/FairEval.iml  +12 -0
- .idea/inspectionProfiles/Project_Default.xml  +13 -0
- .idea/inspectionProfiles/profiles_settings.xml  +6 -0
- .idea/modules.xml  +8 -0
- .idea/vcs.xml  +6 -0
- FairEval.py  +44 -13
- README.md  +27 -20
.idea/.gitignore (ADDED)

```diff
@@ -0,0 +1,3 @@
+# Default ignored files
+/shelf/
+/workspace.xml
```
.idea/FairEval.iml (ADDED)

```diff
@@ -0,0 +1,12 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<module type="PYTHON_MODULE" version="4">
+  <component name="NewModuleRootManager">
+    <content url="file://$MODULE_DIR$" />
+    <orderEntry type="inheritedJdk" />
+    <orderEntry type="sourceFolder" forTests="false" />
+  </component>
+  <component name="PyDocumentationSettings">
+    <option name="format" value="PLAIN" />
+    <option name="myDocStringFormat" value="Plain" />
+  </component>
+</module>
```
.idea/inspectionProfiles/Project_Default.xml (ADDED)

```diff
@@ -0,0 +1,13 @@
+<component name="InspectionProjectProfileManager">
+  <profile version="1.0">
+    <option name="myName" value="Project Default" />
+    <inspection_tool class="PyUnresolvedReferencesInspection" enabled="true" level="WARNING" enabled_by_default="true">
+      <option name="ignoredIdentifiers">
+        <list>
+          <option value="Version" />
+          <option value="Pipeline" />
+        </list>
+      </option>
+    </inspection_tool>
+  </profile>
+</component>
```
.idea/inspectionProfiles/profiles_settings.xml (ADDED)

```diff
@@ -0,0 +1,6 @@
+<component name="InspectionProjectProfileManager">
+  <settings>
+    <option name="USE_PROJECT_PROFILE" value="false" />
+    <version value="1.0" />
+  </settings>
+</component>
```
.idea/modules.xml (ADDED)

```diff
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectModuleManager">
+    <modules>
+      <module fileurl="file://$PROJECT_DIR$/.idea/FairEval.iml" filepath="$PROJECT_DIR$/.idea/FairEval.iml" />
+    </modules>
+  </component>
+</project>
```
.idea/vcs.xml (ADDED)

```diff
@@ -0,0 +1,6 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="VcsDirectoryMappings">
+    <mapping directory="$PROJECT_DIR$" vcs="Git" />
+  </component>
+</project>
```
FairEval.py (CHANGED)

```diff
@@ -59,8 +59,8 @@ Args:
     references: list of ground truth reference labels. Predicted sentences must have the same number of tokens as the references.
     mode: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
         - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
-        - 'fair': default fair score calculation.
-        - 'weighted': custom score calculation with the weights passed.
+        - 'fair': default fair score calculation. It will also show traditional scores for comparison.
+        - 'weighted': custom score calculation with the weights passed. It will also show traditional scores for comparison.
     weights: dictionary with the weight of each error for the custom score calculation.
         If none is passed and the mode is set to 'weighted', the following is used:
         {"TP": {"TP": 1},
@@ -90,17 +90,48 @@ Examples:
     >>> ref = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
     >>> results = faireval.compute(predictions=pred, references=ref, mode='fair', error_format='count')
     >>> print(results)
-    {
-    …
+    {
+        "MISC": {
+            "precision": 0.0,
+            "recall": 0.0,
+            "f1": 0.0,
+            "trad_prec": 0.0,
+            "trad_rec": 0.0,
+            "trad_f1": 0.0,
+            "TP": 0,
+            "FP": 0.0,
+            "FN": 0.0,
+            "LE": 0.0,
+            "BE": 1.0,
+            "LBE": 0.0
+        },
+        "PER": {
+            "precision": 1.0,
+            "recall": 1.0,
+            "f1": 1.0,
+            "trad_prec": 1.0,
+            "trad_rec": 1.0,
+            "trad_f1": 1.0,
+            "TP": 1,
+            "FP": 0.0,
+            "FN": 0.0,
+            "LE": 0.0,
+            "BE": 0.0,
+            "LBE": 0.0
+        },
+        "overall_precision": 0.6666666666666666,
+        "overall_recall": 0.6666666666666666,
+        "overall_f1": 0.6666666666666666,
+        "overall_trad_prec": 0.5,
+        "overall_trad_rec": 0.5,
+        "overall_trad_f1": 0.5,
+        "TP": 1,
+        "FP": 0.0,
+        "FN": 0.0,
+        "LE": 0.0,
+        "BE": 1.0,
+        "LBE": 0.0
+    }
     """
 
 
```
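To see the new `trad_*` fields end to end, here is a minimal usage sketch in Python. It assumes the metric loads from the Hub space linked in the README (`hpi-dhc/FairEval`) via the `evaluate` library, and `pred` is an assumed input (the docstring's own `pred` line is not visible in this diff) chosen to be consistent with the documented output: a boundary error (BE) on MISC and an exact match (TP) on PER.

```python
import evaluate  # Hugging Face evaluation library

# Load path assumed from the README's space URL.
faireval = evaluate.load("hpi-dhc/FairEval")

# Assumed prediction: the MISC span starts one token early (a boundary
# error, BE), while the PER span matches the reference exactly (a TP).
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O', 'B-PER', 'I-PER', 'O']]
ref  = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]

results = faireval.compute(predictions=pred, references=ref,
                           mode='fair', error_format='count')
print(results['PER']['trad_f1'])   # traditional F1, now shown alongside the fair F1
print(results['overall_trad_f1'])  # overall traditional F1
```
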
README.md (CHANGED)

````diff
@@ -44,8 +44,8 @@ Predicted sentences must have the same number of tokens as the references.
 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
   - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
-  - 'fair': default fair score calculation.
-  - 'weighted': custom score calculation with the weights passed.
+  - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
+  - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
 - **error_format** *(str)*: 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. Default value is 'count'.
   - 'count': absolute count of each parameter.
@@ -64,8 +64,6 @@ If mode is 'traditional', the error parameters shown are the classical TP, FP and FN
 TP remain the same, FP and FN are shown as per the fair definition and additional errors BE, LE and LBE are shown.
 
 ### Examples
-A comprehensive set of side-by-side examples is shown [here](https://huggingface.co/spaces/hpi-dhc/FairEval/blob/main/HFFE_use_cases.pdf).
-
 Considering the following input annotated sentences:
 ```python
 >>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
@@ -82,6 +80,31 @@ Considering the following input annotated sentences:
 ```
 
 The output for different modes and error_formats is:
+```python
+>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
+{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666,
+         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+         'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
+ 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0,
+         "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
+         'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
+ 'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666,
+         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+         'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
+ 'overall_precision': 0.5714,
+ 'overall_recall': 0.4444444444444444,
+ 'overall_f1': 0.5,
+ 'trad_prec': 0.5,
+ 'trad_rec': 0.5,
+ 'trad_f1': 0.5,
+ 'TP': 2,
+ 'FP': 0,
+ 'FN': 1,
+ 'LE': 1,
+ 'BE': 1,
+ 'LBE': 1}
+```
+
 ```python
 >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
 {'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
@@ -108,22 +131,6 @@ The output for different modes and error_formats is:
  'FN': 0.5714}
 ```
 
-```python
->>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
-{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
- 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
- 'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
- 'overall_precision': 0.5714,
- 'overall_recall': 0.4444444444444444,
- 'overall_f1': 0.5,
- 'TP': 2,
- 'FP': 0,
- 'FN': 1,
- 'LE': 1,
- 'BE': 1,
- 'LBE': 1}
-```
-
 #### Values from Popular Papers
 *Examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
 
````
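Since weighted mode now also reports traditional scores, a custom-weights call is the natural companion example. The sketch below is hedged: only the `"TP": {"TP": 1}` row appears in the docstring above, and every other row of `custom_weights` is an illustrative assumption about the weight schema (an error type mapped to fractional TP/FP/FN credit), not the library's documented default. `y_pred` and `y_true` are the annotated sentences from the README example.

```python
# Hypothetical weight matrix: outer keys are FairEval's outcome/error types,
# inner keys give the fraction each one contributes to the TP/FP/FN tallies.
# Only the "TP" row is taken from the docstring; the rest is illustrative.
custom_weights = {
    "TP":  {"TP": 1},
    "FP":  {"FP": 1},
    "FN":  {"FN": 1},
    "LE":  {"TP": 0.5, "FP": 0.5, "FN": 0.5},  # label errors count half
    "BE":  {"TP": 0.5, "FP": 0.5, "FN": 0.5},  # boundary errors count half
    "LBE": {"FP": 1, "FN": 1},                 # label+boundary errors count fully
}

results = faireval.compute(predictions=y_pred, references=y_true,
                           mode='weighted', weights=custom_weights,
                           error_format='count')
# Key names as in the fair-mode example above.
print(results['overall_f1'], results['trad_f1'])
```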