We first show individual participant results from the user study. Each row corresponds to a single participant. In the first four columns we plot participant labeling accuracy across categories and the different interfaces, where each bar represents an individual round of labeling. Within each bar, the lighter color encodes the number of entities mislabeled for that category/interface, the darker color encodes the number of correctly labeled entities, and their combined height encodes the total number of labels provided. Note that the scales for these plots are unique to each participant, to facilitate comparison within participants. The next column shows the difference in bootstrapping accuracy between the scatterplot and the list interface for each category; positive values indicate where the scatterplot performs better than the list. The last column plots overall accuracy across all categories.
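As a concrete illustration, the sketch below (in Python, and not the study's actual analysis code) shows how the plotted quantities could be derived; the variable names and data layout are assumptions for illustration only.

```python
# Minimal sketch of the quantities shown in the plots. `labels` is a
# hypothetical list of (predicted_label, true_label) pairs for one round,
# participant, category, and interface.
def round_counts(labels):
    """Return (num_correct, num_mislabeled, total) for one round of labeling."""
    correct = sum(1 for pred, true in labels if pred == true)
    return correct, len(labels) - correct, len(labels)

# Per-epoch difference in bootstrapping accuracy, assuming aligned lists of
# accuracies (one entry per epoch) for the scatterplot and list interfaces.
def accuracy_difference(scatter_acc, list_acc):
    return [s - l for s, l in zip(scatter_acc, list_acc)]
```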
To help with analysis, we have enabled linked highlighting. For a given participant, hovering the mouse over a bar in a labeling accuracy plot draws a dashed line in the corresponding Accuracy Difference and Overall Accuracy plots for that epoch, and also highlights the labeling accuracy plot of the other interface for that epoch. Likewise, hovering the mouse in either the Accuracy Difference or Overall Accuracy plot highlights that epoch in each category's labeling accuracy plot. This can be useful for assessing whether a user's interactions helped improve bootstrapping accuracy for specific epochs.
Because the number of annotated labels varies widely, we also allow dynamic rescaling of the axes by clicking on a bar. Clicking rescales all axes for that participant, including all other categories and interfaces, to the maximum value for the clicked interface, category, and epoch. This can be useful for comparing categories that received fewer annotations, e.g. AFF and GPE. Note, however, that the other plots' maxima are clamped to this new scale, so please keep this effect in mind. Moving the mouse cursor off the bar restores the default scaling.
To help understand the types of errors participants made in their annotations, we provide individual confusion matrices for each participant, across the different interfaces. The y-axis corresponds to human labels, and the x-axis corresponds to ground-truth labels. To better emphasize errors, we zero out the diagonal entries of each confusion matrix, i.e. we do not show accuracy, only error. Each confusion matrix is color-mapped with its own scale to facilitate comparisons within an individual matrix; this makes comparisons across matrices more difficult, so please consider the scales when comparing between matrices.
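For clarity, the following sketch illustrates how such an error-only confusion matrix could be computed; the function names and data layout are assumptions, not the code used to generate the figures.

```python
import numpy as np

# Minimal sketch, assuming human-assigned and ground-truth category indices
# are available as parallel sequences for one participant and interface;
# `n_categories` is the number of entity categories.
def error_confusion_matrix(human, truth, n_categories):
    cm = np.zeros((n_categories, n_categories), dtype=int)
    for h, t in zip(human, truth):
        cm[h, t] += 1          # rows (y-axis): human labels, columns (x-axis): ground truth
    np.fill_diagonal(cm, 0)    # zero the diagonal: show only errors, not accuracy
    return cm

# Each matrix is color-mapped on its own scale, i.e. normalized by its own maximum.
def normalize_per_matrix(cm):
    return cm / cm.max() if cm.max() > 0 else cm.astype(float)
```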
In general, we find that persons and organizations were categories that participants commonly mixed up. Post-experiment questionnaires corroborated this: when asked about label confidence, participants frequently reported confusion between these two categories. We also suspect that the predominance of persons and organizations may have predisposed participants to over-apply these categories when labeling entities.
Lastly, as a complement to our consensus analysis, we provide a plot of annotator agreement in assigning entities to the same category. We show this as a histogram over entities, where the x-axis groups entities by the number of annotators that labeled them, e.g. group "4" represents all entities that were labeled by 4 annotators. Within each group, we further break the counts down by the maximum number of annotators that assigned the same category to the entity. So for group "4" on the x-axis, the lighter shade of red corresponding to "3" in the upper-right legend represents the number of entities that were labeled by 4 annotators, of which at most 3 assigned the same category.
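To make the grouping concrete, the sketch below (using a hypothetical entity_labels mapping) illustrates how the histogram counts could be derived.

```python
from collections import Counter

# Minimal sketch of the agreement counts behind the histogram, assuming a
# hypothetical `entity_labels` dict mapping each entity to the list of
# category labels it received (one label per annotator who labeled it).
def agreement_counts(entity_labels):
    """Return {(num_annotators, max_same_category): num_entities}."""
    counts = Counter()
    for labels in entity_labels.values():
        num_annotators = len(labels)
        max_same_category = Counter(labels).most_common(1)[0][1]
        counts[(num_annotators, max_same_category)] += 1
    return counts

# Example: an entity labeled PER, PER, ORG, PER by 4 annotators falls in
# x-axis group "4" with a maximum agreement of 3.
```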
This plot suggests that there is considerably more overlap in the labels provided by users in the scatterplot interface compared to the list interface. Thus, when we combine the scatterplot labels via Majority Vote, as described in the paper, there is a better opportunity to mitigate noise compared to the list, on the basis of overlap alone. Moreover, the fact that the combined accuracy remains high for the scatterplot (92.5%) relative to the list (95.4%) suggests that the scatterplot interface enables users to label a much larger amount of data at comparable accuracy once labels are combined across users.
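For reference, a minimal majority-vote sketch is shown below; the tie-breaking behavior here is an assumption and may differ from the rule used in the paper.

```python
from collections import Counter

# Minimal sketch of combining labels across users by Majority Vote, assuming
# `labels` holds the category labels assigned to one entity by different
# annotators. Ties resolve to the first-encountered most common label, which
# is an assumption for illustration.
def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]
```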
As there is wide variability in these counts, we allow dynamic rescaling of the axes, similar to the above. Simply hover the mouse over a bar to rescale the full chart to its maximum; hovering off the bar resets the scale. The other bars are similarly clamped to that maximum.