### Datasets

Given that rodents are widely used in behavioral research, and mice are the most studied rodents^{35}, we chose two publicly-available datasets featuring mice engaging in a range of behaviors in our main analysis. The first dataset, referred to as the “home-cage dataset,” was collected by Jhuang et al.^{7} and features 12 videos (approximately 10.5 h and 1.13 million frames in total) of singly housed mice in their home cages, recorded from the side view. Video resolution is \(320\times 240\) pixels. The authors annotate each video in full and identify eight mutually exclusive behaviors (Fig. S1A) of varying incidence (Fig. 3A). This dataset allows us to benchmark our approach against existing methods, allows us to evaluate our method on a common use-case, and is relatively well-balanced in terms of the incidence of each behavior.

The second dataset used is the Caltech Resident-Intruder Mouse dataset (CRIM13), collected by Burgos-Artizzu et al.^{6}. It consists of 237 pairs of videos, recorded from synchronized top- and side-view cameras, at 25 frames per second and an 8-bit pixel depth. Videos are approximately 10 min long, and the authors label 13 mutually exclusive actions (Fig. S1B). Of these actions, 12 are social behaviors, and the remaining action is the category “other,” which denotes periods where no behavior of interest occurs^{6}. This dataset features a number of challenges absent from the Jhuang et al.^{7} dataset. In addition to including social behavior (in contrast to the home-cage dataset, which features singly-housed mice), it presents two algorithmic challenges. First, videos are recorded using a pair of synchronized cameras. This allows us to test multiple-camera integration functionality (see “Methods: Feature extraction” section), to evaluate classifier performance using features from multiple cameras. And second, it is highly unbalanced, with a slight majority of all annotations being the category “other” (periods during which no social behavior occurred; Fig. 3D).

We also include an exploratory dataset to demonstrate the applicability of our model to non-rodent species. It comprises seven unique videos of singly housed *Octopus bimaculoides*, recorded during a study of octopus habituation behaviors in the Dartmouth octopus lab. One video (approximately 62 min in length) was annotated by two different annotators, allowing us to assess inter-observer reliability by calculating the agreement between these two independent annotations. The videos span approximately 6.75 h in total, with 6.15 h annotated. Video was recorded at 10 frames per second with a resolution of \(640\times 436\) pixels. We define five behaviors of interest: crawling, fixed pattern (crawling in fixed formation along the tank wall), relaxation, jetting (quick acceleration away from stimuli), and expanding (tentacle spread in alarm reaction or aggressive display), and an indicator for when none of these behaviors occur (none). In the original dataset, there were three additional behaviors (inking/jetting, display of dominance, color change), accounting for a very small share of the total frames, which could co-occur with the other six behaviors (crawling, fixed pattern, relaxation, jetting, expanding, none). However, because our classification model can currently only predict mutually exclusive classes, we removed these three behaviors from our input annotations.

### Inter-observer reliability

Both mouse datasets include annotations produced by two groups of annotators. The primary set of annotations was produced by the first group of annotators and includes all video in the dataset. The secondary set of annotations was produced by a second, independent group of annotators on a subset of videos. We use the primary set of annotations to train and evaluate our method, and the secondary set to establish inter-observer reliability; that is, how much the annotations of two independent human annotators can be expected to differ. Given this, classifier-produced labels can be most precisely interpreted as the predicted behavior *if the video were annotated by the first group of annotators*. This distinction becomes important because we benchmark the accuracy of our method (i.e., the agreement between the classifier’s predictions and the primary set of annotations) relative to the inter-observer agreement (i.e., the agreement between the first and second group of annotators, on the subset of video labeled by both groups). So, for example, when we note that our model achieves accuracy “above human agreement,” we mean that our classifier predicts the labels from the first human annotator group better than the second human annotator group does. In the case of the home-cage dataset, the agreement between the primary and secondary sets was 78.3 percent, compared on a 1.6 h subset of all dataset video^{7}. For CRIM13, agreement was 69.7 percent, evaluated on a random selection of 12 videos^{6}.

### Simulating labeled data

To simulate our approach’s performance with varying amounts of training data, in our primary analyses we train the classifier using the following proportions of labeled data:

$${\mathrm{prop}}_{\mathrm{labeled}}=\left[0.02:0.02:0.2, 0.25:0.05:0.9\right].$$

That is, we use a proportion of all data, \({\mathrm{prop}}_{\mathrm{labeled}}\), to construct our training and validation sets (i.e., \({\mathcal{D}}^{\mathrm{labeled}}\)), and the remaining \(1-{\mathrm{prop}}_{\mathrm{labeled}}\) of the data to create our test set, \({\mathcal{D}}^{\mathrm{test}}\) (Fig. 1B,D). We use an increment of \(0.02\) for low training proportions (up to \(0.20\)), because that is where performance changes most for a small increase in training data (Fig. 2A,E). We increment values from \(0.25\) to \(0.90\) by \(0.05\). This gives us a set of 24 training proportions per analysis. Additionally, for each training proportion, unless otherwise noted, we evaluate the model on 10 random splits of the data. In our main analyses, we use a clip length of one minute for both datasets.
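As an illustration only, the sampling scheme described above can be sketched in Python. The function names, the use of clip identifiers, the rounding of the labeled-clip count, and the seeding scheme are assumptions made for this sketch and are not taken from the toolbox.

```python
import random

def training_proportions():
    """Return the 24 proportions: 0.02 to 0.20 by 0.02, then 0.25 to 0.90 by 0.05."""
    low = [round(0.02 * i, 2) for i in range(1, 11)]       # 0.02, 0.04, ..., 0.20
    high = [round(0.25 + 0.05 * i, 2) for i in range(14)]  # 0.25, 0.30, ..., 0.90
    return low + high

def random_split(clip_ids, prop_labeled, seed):
    """Shuffle clip IDs and split them into (labeled, test) sets.

    Rounding to a whole number of clips is an assumption of this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    n_labeled = max(1, round(prop_labeled * len(shuffled)))
    return shuffled[:n_labeled], shuffled[n_labeled:]

# Ten random splits of a hypothetical 600-clip dataset at a 10 percent labeled proportion.
proportions = training_proportions()
splits = [random_split(range(600), 0.10, seed) for seed in range(10)]
```

Each labeled set would then be divided further into training and validation data before the classifier is fit.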

### Comparison with existing methods

When comparing our model to existing methods, we employ *k*-fold cross-validation instead of evaluating on random splits of the data. In the case of the home-cage dataset, the existing methods cited employ a “leave one out” approach, using 11 of the 12 videos to train their methods and the remaining video to test them. In our approach, however, we rely on splitting the data into clips, so instead we use 12-fold cross-validation, where we randomly split the dataset clips into 12 folds and then employ cross-validation on the clips, rather than entire videos. In evaluating their approach’s performance on the CRIM13 dataset, Burgos-Artizzu et al.^{6} selected 104 videos for training and 133 for testing, meaning that they trained their program on 44 percent of the data, and tested it on 56 percent. Here, we evaluate our method relative to theirs using two-fold cross-validation (50 percent test and 50 percent train split) to retain similar levels of training data.
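A clip-level *k*-fold partition of the kind described above might look like the following sketch. The function name and the round-robin assignment of shuffled clips to folds are assumptions made for illustration.

```python
import random

def k_fold_clips(clip_ids, k, seed=0):
    """Randomly assign clips to k folds and return (train, test) pairs, one per fold."""
    rng = random.Random(seed)
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # round-robin assignment after shuffling
    pairs = []
    for i in range(k):
        train = [clip for j, fold in enumerate(folds) if j != i for clip in fold]
        pairs.append((train, folds[i]))
    return pairs
```

With `k=12` this mirrors the home-cage comparison, and with `k=2` the CRIM13 comparison.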

### Frame extraction

To generate spatial frames, we extract raw video frames from each video file. Rather than save each image as an image file in a directory, we save the entire sequence of images corresponding to a single video to a sequence file, using the implementation provided by Dollár^{36} with JPG compression. This has the advantage of making the video frames easier to transfer between file systems and readable on any operating system (which is useful for users running the toolbox on high performance computing clusters). To generate the temporal component, we use the TV-L1 algorithm^{19,37}, which shows superior performance to alternate optical flow algorithms^{15}, to calculate the dense optical flow between pairs of sequential video frames and represent it visually via the MATLAB implementation by Cun^{38}. In the visual representation of optical flow fields, hue and brightness of a pixel represent the orientation and magnitude of that pixel’s motion between sequential frames. By representing motion information of the video as a set of images, we can use a similar feature extraction method for both spatial and temporal frames. Just as the features derived from the spatial images represent the spatial information in the video, the features derived from the temporal images should provide a representation of the motion information in the video.
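To make the visual encoding concrete, the sketch below maps a single flow vector to a color, with hue set by direction and brightness by magnitude. The specific hue origin, the saturation of 1, and the `max_magnitude` normalization are assumptions of this sketch; the implementation cited above may use a different color convention.

```python
import colorsys
import math

def flow_to_rgb(dx, dy, max_magnitude):
    """Map one flow vector (dx, dy) to an RGB triple with components in [0, 1]."""
    angle = math.atan2(dy, dx)                             # direction of motion
    hue = (angle % (2 * math.pi)) / (2 * math.pi)          # rescaled to [0, 1]
    value = min(math.hypot(dx, dy) / max_magnitude, 1.0)   # brightness, clipped at 1
    return colorsys.hsv_to_rgb(hue, 1.0, value)
```

Under this convention a stationary pixel is black, and faster motion produces brighter colors.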

### Feature extraction

We utilize the pretrained ResNet18 convolutional neural network (CNN) to extract high-level features from the spatial and temporal video frames. Often used in image processing applications, CNNs consist of a series of layers that take an image as an input and generate an output based on the content of that image. Intuitively, classification CNNs can be broken down into two components: feature extraction and classification. In the feature extraction component, the network uses a series of layers to extract increasingly complex features from the image. In the classification component, the network uses the highest-level features to generate a final classification for the image (e.g., “dog” or “cat”). In the case of pretrained CNNs, the network learns to extract important features from the input image through training—by generating predictions for a set of images for which the ground truth is known, and then modifying the network based on the deviation of the predicted classification from the true classification, the network learns which features in the image are important in discriminating one object class from another. In pretrained CNNs, such as the ResNet18, which was trained to categorize millions of images from the ImageNet database into one thousand distinct classes^{39}, early layers detect generic features (e.g., edges, textures, and simple patterns) and later layers represent image data more abstractly^{40}.

Here, we leverage transfer learning—where a network trained for one context is used in another—to extract a low-dimensional representation of the data in the spatial and temporal video frames. The idea is that, since the ResNet18 is trained on a large, general object dataset, the generality of the network allows us to obtain an abstract representation of the salient visual features in the underlying video by extracting activations from the later layers of the network in response to a completely different set of images (in this case, laboratory video of animal behavior). To extract features from the ResNet18 network for a given image, we input the image into the network and record the response (“activations”) from a specified layer of the network. In this work, we chose to extract activations from the global average pooling layer (“pool5” in MATLAB) of the ResNet18, close to the end of the network (to obtain high-level feature representations). This generates a feature vector of length \(512\), representing high level CNN features for each image.

By default, the ResNet18 accepts input images of size \([224, 224, 3]\) (i.e., images with a width and height of 224 pixels and three color channels), so we preprocess frames by first resizing them to a width and height of 224 pixels. In the case of spatial frames, the resized images are input directly into the unmodified network. For temporal frames, however, rather than inputting frames into the network individually, we “stack” each input frame to the CNN with the five frames preceding it and the five frames following it, resulting in an input size of \([224, 224, 33]\). This approach allows the network to extract features with longer-term motion information and has been shown to improve discriminative performance^{14,18}. We select a stack size of 11 based on the findings from Simonyan and Zisserman^{18}. By default, the ResNet18 network only accepts inputs of size \([224, 224, 3]\), so to modify it so that it accepts inputs of size \([224, 224, 33]\) we replicate the weights of the first convolutional layer (normally three channels) 11 times. This allows the modified “flow ResNet18” to accept stacks of images as inputs, while retaining the pretrained weights to extract salient image features.
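The stacking and weight replication can be sketched as follows. Clamping the window at the start and end of a video is an assumption of this sketch (the text does not specify how edge frames are handled), and the weights are represented as a plain list of three per-channel kernels rather than a real network layer.

```python
def stack_indices(j, n_frames, half_window=5):
    """Indices of the 11 flow frames stacked around frame j.

    Indices falling outside the video are clamped to the first or last frame,
    which is an assumption of this sketch.
    """
    window = range(j - half_window, j + half_window + 1)
    return [min(max(i, 0), n_frames - 1) for i in window]

def replicate_first_layer(channel_kernels, n_stack=11):
    """Repeat the three per-channel kernels n_stack times: 3 channels become 33."""
    return list(channel_kernels) * n_stack
```

Each of the 11 stacked frames contributes three color channels, which is where the input depth of 33 comes from.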

After spatial features and temporal features have been separately extracted from the spatial and temporal frames, respectively, we combine them to produce the *spatiotemporal* features that will be used to train the classifier (Fig. 1E). To do so, we simply concatenate the spatial and temporal features for each frame. That is, for a given segment of video with \(n\) frames, the initial spatiotemporal features are a matrix of size \(\left[n, 512\times 2\right]=[n, 1024]\), where \(512\) represents the dimensionality of the features extracted from the ResNet18. If multiple synchronized cameras are used (as is the case in one of our benchmark datasets), we employ the same process, concatenating the spatial and temporal features for each frame *and each camera*. In the case of two cameras, for example, this implies the initial spatiotemporal features are a matrix of size \(\left[n,512\times 2\times 2\right]=\left[n,2048\right]\). To decrease training time and memory requirements and to improve performance^{41,42}, we utilize dimensionality reduction to decrease the size of the *initial* spatiotemporal features to generate *final* spatiotemporal features of size \(\left[n,512\right]\). We selected reconstruction independent component analysis^{43,44} as our dimensionality reduction method, which creates a linear transformation by minimizing an objective function that balances the independence of output features with the capacity to reconstruct input features from output features.
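The per-frame concatenation can be illustrated with a short sketch. The function name and the camera ordering are assumptions, and the dimensionality reduction step is not shown.

```python
def initial_frame_features(cameras):
    """Concatenate spatial and temporal feature vectors across cameras for one frame.

    `cameras` is a list of (spatial, temporal) pairs, one pair per camera.
    """
    combined = []
    for spatial, temporal in cameras:
        combined.extend(spatial)
        combined.extend(temporal)
    return combined

# One camera gives 512 x 2 = 1024 values per frame; two cameras give 2048.
one_camera = initial_frame_features([([0.0] * 512, [1.0] * 512)])
two_cameras = initial_frame_features([([0.0] * 512, [1.0] * 512)] * 2)
```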

### Classifier architecture

The labeled and unlabeled data consist of a set of clips, generated from project video, which the classifier uses to predict behavior. Clips in \({\mathcal{D}}^{\mathrm{labeled}}\) and \({\mathcal{D}}^{\mathrm{unlabeled}}\) each consist of a segment of video and a corresponding array of spatiotemporal features extracted from that video. Clips in \({\mathcal{D}}^{\mathrm{labeled}}\) also include an accompanying set of manual annotations (Fig. 1B). For a given clip in \({\mathcal{D}}^{\mathrm{labeled}}\) with \({n}_{\mathrm{labeled}}\) frames, the classifier takes a \([{n}_{\mathrm{labeled}},512]\) array of spatiotemporal features (Fig. 1E) and a one-dimensional array of \({n}_{\mathrm{labeled}}\) manually-produced labels (e.g., “eat,” “drink,” etc.) as inputs, and learns to predict the \({n}_{\mathrm{labeled}}\) labels from the features. After training, for a given clip in \({\mathcal{D}}^{\mathrm{unlabeled}}\) with \({n}_{\mathrm{unlabeled}}\) frames, the classifier takes as an input a \([{n}_{\mathrm{unlabeled}},512]\) array of spatiotemporal features and outputs a set of \({n}_{\mathrm{unlabeled}}\) behavioral labels, corresponding to the predicted behavior in each of the \({n}_{\mathrm{unlabeled}}\) frames. To implement this transformation from features to labels, we rely on recurrent neural networks (RNNs). Prior to inputting clips into the RNN, we further divide them into shorter “sequences,” each corresponding to 15 s of video, to reduce overfitting^{45} and sequence padding^{46}. Unlike traditional neural networks, recurrent neural networks contain cyclical connections, which allow information to persist over time, enabling them to learn dependencies in sequential data^{47}.
Given that predicting behavior accurately requires the integration of information over time (i.e., annotators generally must view more than one frame to classify most behavior, since behaviors are often distinguished by movement over time), this persistence is critical.
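As a minimal sketch of the clip-to-sequence step, the function below divides a clip into consecutive 15 s ranges of frames. The function name, the half-open `(start, stop)` convention, and the treatment of a shorter final sequence are assumptions made for illustration.

```python
def split_into_sequences(n_frames, fps, seconds=15):
    """Return half-open (start, stop) frame ranges covering a clip in 15 s pieces.

    The last range may be shorter if the clip length is not a multiple of
    the sequence length (an assumption of this sketch).
    """
    step = int(round(fps * seconds))
    return [(start, min(start + step, n_frames)) for start in range(0, n_frames, step)]
```

For example, a one-minute clip recorded at 25 frames per second (1500 frames) yields four sequences of 375 frames each.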

We opt for a long short-term memory (LSTM) network with bidirectional LSTM layers (BiLSTM) as the core of our classification model. LSTMs are better able to learn long-term dependencies in data than traditional RNNs in practice^{48,49}, and the use of bidirectional layers allows the network to process information in both temporal directions^{50} (i.e., forward and backward in time, rather than forward only in the case of a traditional LSTM layer). As shown in Figure S4, our network’s architecture begins with a sequence input layer, which accepts a two-dimensional array corresponding to spatiotemporal video features (with one row per frame and one column per feature). We then apply two BiLSTM layers, which increase model complexity and allow the model to learn more abstract relationships between input sequences and correct output labels^{51}. To reduce the likelihood of overfitting, we use a dropout layer after each BiLSTM layer. Each dropout layer randomly sets some proportion of input units (here, \(50\) percent) to \(0\), curbing the power of any individual neuron to generate the output^{52}. The second dropout layer is followed by a fully-connected layer with an output size of \([n,K]\), where \(K\) is the number of classes and \(n\) is the number of frames in the input clip. The softmax layer then normalizes the fully-connected layer’s output into a set of class probabilities with shape \([n,K]\), where the sum of each row is \(1\) and the softmax probability of class \(k\) in frame \(j\) is given by the entry \(jk\). Following the softmax layer, the sequence-to-sequence classification layer generates a one-dimensional categorical array of \(n\) labels corresponding to the behavior with the highest softmax probability in each frame. We select cross-entropy loss for \(K\) mutually exclusive classes^{53} as our loss function, since the behaviors in both datasets are mutually exclusive.
All classifiers were trained using a single Nvidia Tesla K80 GPU running on the Dartmouth College high performance computing cluster.
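The final three layers can be illustrated with a small sketch that operates on the fully-connected layer's output for each frame. It is not the network itself: the BiLSTM and dropout layers are omitted, the function names are hypothetical, and averaging the loss over frames is an assumption about normalization.

```python
import math

def softmax(logits):
    """Normalize one frame's K logits into K class probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(frame_logits):
    """Return ([n, K] probabilities, n predicted class indices) for n frames."""
    probs = [softmax(z) for z in frame_logits]
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, labels

def cross_entropy(probs, true_labels):
    """Mean cross-entropy loss over frames for mutually exclusive classes."""
    return -sum(math.log(p[y]) for p, y in zip(probs, true_labels)) / len(true_labels)
```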

### Classifier training

In this analysis, we use the hyperparameters specified in Table 2 when training the network. To avoid overfitting, we select 20 percent of \({\mathcal{D}}^{\mathrm{labeled}}\) to use in our validation set (i.e., \({\mathrm{prop}}_{\mathrm{train}}=0.20\); see Fig. 1C). We then evaluate the network on this validation set every epoch (where “epoch” is defined as a single pass of the entire training set through the network) and record its cross-entropy loss. If the loss on the validation set after a given epoch is larger than or equal to the smallest previous loss on the validation set more than twice, training terminates.
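The stopping rule can be sketched as follows. This is one reading of the rule as stated: occurrences are counted cumulatively and the counter is never reset when the loss improves, which is an assumption; the training software may count differently. The function name is hypothetical.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once the validation loss has been larger than or equal to
    the smallest previous validation loss more than `patience` times.
    """
    best = float("inf")
    strikes = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss >= best:
            strikes += 1
            if strikes > patience:
                return epoch
        else:
            best = loss
    return None
```

For validation losses of 1.0, 0.8, 0.9, 0.7, 0.7, 0.75, the third non-improving epoch is epoch 6, so training would terminate there.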

### Classifier evaluation

To evaluate the classifier, we consider its performance on the test set, \({\mathcal{D}}^{\mathrm{test}}\) (Fig. 1B,D). For each clip, the classifier outputs a set of predicted labels for each frame, corresponding to the predicted behavior in that frame. In evaluating the classifier, we are interested in how closely these predicted labels match the true ones. We first consider overall prediction accuracy. We let \(\mathrm{correct}\) denote the number of labels in which the network’s prediction is the same as the true label and \(\mathrm{incorrect}\) the number of labels in which the network’s prediction is not the same as the true label. Then accuracy can be quantified as the following proportion:

$$\mathrm{accuracy}=\frac{\mathrm{correct}}{\mathrm{correct}+\mathrm{incorrect}}.$$

Next, we consider the performance of the network by behavior. To do so, we let \({\mathrm{TP}}_{k}\) denote the number of true positives (predicted class \(k\) and true class \(k\)), \({\mathrm{FP}}_{k}\) the number of false positives (predicted class \(k\), but true label not class \(k\)), and \({\mathrm{FN}}_{k}\) the number of false negatives (true class \(k\), predicted not class \(k\)) for class \(k\) (where \(k\) is between \(1\) and the total number of classes, \(K\)).

We then calculate the precision, recall, and F1 score for each label^{11,55}, where the precision and recall for class \(k\) are defined as follows:

$${\mathrm{precision}}_{k}=\frac{{\mathrm{TP}}_{k}}{{\mathrm{TP}}_{k}+{\mathrm{FP}}_{k}},$$

$${\mathrm{recall}}_{k}=\frac{{\mathrm{TP}}_{k}}{{\mathrm{TP}}_{k}+{\mathrm{FN}}_{k}}.$$

Precision is the proportion of correct predictions out of all cases in which the *predicted class* is class \(k\). Recall, meanwhile, denotes the proportion of correct predictions out of all the cases in which the *true class* is class \(k\). From the precision and recall, we calculate the F1 score for class \(k\). The F1 score is the harmonic mean of precision and recall, where a high F1 score indicates both high precision and recall, and deficits in either decrease it:

$${\mathrm{F}1}_{k}=2\cdot \frac{{\mathrm{precision}}_{k}\cdot {\mathrm{recall}}_{k}}{{\mathrm{precision}}_{k}+{\mathrm{recall}}_{k}}$$

After calculating the F1 score for each class, we calculate the average F1 score, \({\mathrm{F}1}_{\mathrm{all}}\), as follows: \({\mathrm{F}1}_{\mathrm{all}}=\frac{1}{K}\sum_{k=1}^{K}{\mathrm{F}1}_{k}\).
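The metrics defined in this section can be computed from frame-wise labels as in the sketch below. Returning zero when a denominator is zero (for instance, a class that is never predicted) is an assumption of this sketch, as is the function name.

```python
def evaluate(true_labels, predicted_labels, classes):
    """Return (accuracy, per-class F1 scores, average F1) for frame-wise labels."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = {}
    for k in classes:
        tp = sum(t == k and p == k for t, p in pairs)  # predicted k, true k
        fp = sum(t != k and p == k for t, p in pairs)  # predicted k, true not k
        fn = sum(t == k and p != k for t, p in pairs)  # true k, predicted not k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores[k] = 2 * precision * recall / denom if denom else 0.0
    f1_all = sum(f1_scores.values()) / len(classes)
    return accuracy, f1_scores, f1_all
```

Because \({\mathrm{F}1}_{\mathrm{all}}\) weights every class equally, it is more sensitive than accuracy to poor performance on rare behaviors.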

### Confidence score definition

For each input clip, the classifier returns a set of predicted annotations corresponding to the predicted behavior (e.g., “walk,” “drink,” “rest,” etc.) occurring in each frame of that clip. We denote the set of classifier-predicted labels for clip number \(i\), \({\text{clip}}_{i}\), as \(\left\{{\widehat{y}}_{j} | j\in {\text{clip}}_{i}\right\}\). Each clip also has a set of “true” labels, corresponding to those that would be produced if the clip were manually annotated. In the case of the labeled data, the true labels are known (and used to train the classifier). In the case of unlabeled data, they are not known (prior to manual review). We denote the set of true labels for \({\mathrm{clip}}_{i}\) as \(\left\{{y}_{j} | j\in {\text{clip}}_{i}\right\}\). For each frame in a clip, in addition to outputting a prediction for the behavior occurring in that frame, we also generate an estimate of how likely it is that the frame’s classifier-assigned label is correct. That is, for each clip, we generate a set of predicted probabilities \(\left\{{\widehat{p}}_{j} | j\in {\text{clip}}_{i}\right\}\) such that \({\widehat{p}}_{j}\) denotes the estimated likelihood that \({\widehat{y}}_{j}\) is equal to \({y}_{j}\). In an optimal classifier, \({\mathbb{P}}\left({\widehat{y}}_{j}={y}_{j}\right)={\widehat{p}}_{j}\). That is, \({\widehat{p}}_{j}\) is an estimate of the probability the classification is correct; and, in an optimal confidence-scorer, the estimated probability the classification is correct will be the ground truth likelihood the classification is correct^{56}.

Now that we have established an estimated probability that a given *frame* in a clip is correct, we extend the confidence score to an entire clip. As in training data annotation, the review process is conducted at the level of an entire clip, not individual video frames. That is, even if there are a handful of frames in a clip that the classifier is relatively unconfident about, we assume that a human reviewer would need to see the entire clip to have enough context to accurately correct any misclassified frames. Since \({\widehat{p}}_{j}\) is the estimated probability a given frame \(j\) is correct, it follows that the average \({\widehat{p}}_{j}\) for \(j\in {\text{clip}}_{i}\) is the estimated probability a randomly selected frame in \({\mathrm{clip}}_{i}\) is correct. We define this quantity to be the clip confidence score; formally, \(\mathrm{conf}\left({\text{clip}}_{i}\right)=\frac{1}{\left|{\text{clip}}_{i}\right|}\sum_{j\in {\text{clip}}_{i}}{\widehat{p}}_{j}\), where \(\mathrm{conf}\left({\mathrm{clip}}_{i}\right)\) is the clip confidence score of \({\text{clip}}_{i}\) and \(\left|{\text{clip}}_{i}\right|\) is the number of frames in \({\text{clip}}_{i}\). We then consider that accuracy is the true probability a randomly selected frame in \({\text{clip}}_{i}\) is correct by definition. That is, \(\mathrm{acc}\left({\text{clip}}_{i}\right)=\frac{1}{\left|{\text{clip}}_{i}\right|}\sum_{j\in {\text{clip}}_{i}}\mathbf{I}({\widehat{y}}_{j}={y}_{j})\), where \(\mathrm{acc}\left({\text{clip}}_{i}\right)\) is the accuracy of \({\text{clip}}_{i}\) and \(\mathbf{I}\) is the indicator function. In the case of an optimal confidence score, we’ll have that \(\mathrm{conf}\left({\text{clip}}_{i}\right)=\mathrm{acc}\left({\text{clip}}_{i}\right)\). 
If we compare \(\mathrm{conf}\left({\text{clip}}_{i}\right)\) with \(\mathrm{acc}\left({\text{clip}}_{i}\right)\) on our test data, we can establish how well the confidence score can be expected to perform when the ground truth accuracy, \(\mathrm{acc}\left({\text{clip}}_{i}\right)\), is unknown. In Methods: Confidence score calculation, we discuss our approach for obtaining \({\widehat{p}}_{j}\), after which finding clip-wise confidence scores is trivial.
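The two clip-level quantities defined above reduce to simple averages, as the following sketch shows. The function names are hypothetical, and the frame-wise probabilities are taken as given.

```python
def clip_confidence(frame_probs):
    """conf(clip): mean of the frame-wise estimated probabilities of being correct."""
    return sum(frame_probs) / len(frame_probs)

def clip_accuracy(predicted_labels, true_labels):
    """acc(clip): fraction of frames whose predicted label equals the true label."""
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)
```

On test data both quantities are available, so the two can be compared clip by clip.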

### Confidence score calculation

Here, we first examine how to calculate the frame-wise confidence score \({\widehat{p}}_{j}\). To do so, we consider the classifier structure (Fig. S4) in more detail. In particular, we focus on the last three layers: the fully-connected layer, the softmax layer, and the classification layer. To generate a classification for a given frame, the softmax layer takes in a logits vector from the fully-connected layer. This logits vector represents the raw (unnormalized) predictions of the model. The softmax layer then normalizes these predictions into a set of probabilities, where each probability is proportional to the exponential of the input. That is, given \(K\) classes, the \(K\)-dimensional vector from the fully-connected layer is normalized to a set of probabilities, representing the probability of each class. The class with the highest probability is then returned as the network’s predicted label (e.g., “eat” or “walk”) for that frame. We can then interpret this probability as a confidence score derived from the softmax function^{56}. Formally, if we let logits vector \({{\varvec{z}}}_{j}\) represent the output from the fully-connected layer corresponding to frame \(j\), the softmax-estimated probability that the predicted label of frame \(j\) is correct is \({\widehat{p}}_{j}^{\mathrm{SM}}=\underset{k}{\mathrm{max}}{\sigma \left({{\varvec{z}}}_{j}\right)}^{(k)}\), where \(\sigma \) is the softmax function. We refer to this confidence score as the “max softmax score,” since it is derived from the maximum softmax probability.

One of the challenges with using the max softmax probability as a confidence score, however, is that it is often poorly scaled. Ideally, estimated accuracy for a prediction would closely match its actual expected accuracy, but in practice the softmax function tends to be “overconfident”^{56}. That is, \({\widehat{p}}_{j}^{\mathrm{SM}}\) tends to be larger than \({\mathbb{P}}({\widehat{y}}_{j}={y}_{j})\). To generate a better-calibrated confidence score (i.e., one in which \({\widehat{p}}_{j}\) is closer to \({\mathbb{P}}({\widehat{y}}_{j}={y}_{j})\)), we use an approach called temperature scaling. Temperature scaling uses a learned parameter \(T\) (where \(T>1\) indicates decreased confidence and \(T<1\) increased confidence) to rescale class probabilities so that the confidence score more closely matches the true accuracy of a prediction^{57}. We define the temperature scaling-based confidence for frame \(j\) as \({\widehat{p}}_{j}^{\mathrm{TS}}=\underset{k}{\mathrm{max}}{\sigma ({{\varvec{z}}}_{j}/T)}^{(k)}\), where \(T\) is selected to minimize the negative log likelihood on the validation set. Now that we have established the process for generating a frame-wise confidence score, we can generate the clip-wise confidence score that is used in the confidence-based review. As previously described, for \({\text{clip}}_{i}\) this is simply \(\mathrm{conf}\left({\text{clip}}_{i}\right)=\frac{1}{\left|{\text{clip}}_{i}\right|}\sum_{j\in {\text{clip}}_{i}}{\widehat{p}}_{j}\), where \({\widehat{p}}_{j}\) is either generated via the softmax function (\({\widehat{p}}_{j}={\widehat{p}}_{j}^{\mathrm{SM}}\)) or temperature scaling (\({\widehat{p}}_{j}={\widehat{p}}_{j}^{\mathrm{TS}}\)).
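A minimal sketch of both frame-wise scores follows. Selecting \(T\) by searching over a fixed list of candidate values is an assumption made for brevity; a numerical optimizer would normally be used, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T (T = 1 is the plain softmax)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def frame_confidence(logits, temperature=1.0):
    """Max softmax score (T = 1) or temperature-scaled score (T != 1) for one frame."""
    return max(softmax(logits, temperature))

def negative_log_likelihood(val_logits, val_labels, temperature):
    """Mean negative log likelihood of the true labels on validation frames."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(val_logits, val_labels)]
    return sum(losses) / len(losses)

def fit_temperature(val_logits, val_labels, candidates):
    """Pick the candidate T with the lowest validation negative log likelihood."""
    return min(candidates,
               key=lambda t: negative_log_likelihood(val_logits, val_labels, t))
```

When the validation predictions are overconfident, the selected \(T\) exceeds 1 and the resulting scores shrink toward the observed accuracy.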

### Confidence-based review

Now that we have generated a confidence score for a given clip, we use it in two ways. First, recall that one of the purposes of the confidence-based review is to estimate the accuracy of the unlabeled data, \({\mathcal{D}}^{\mathrm{unlabeled}}\). If, for example, a user decided that an accuracy of 80 percent was acceptable for their given behavior analysis application (i.e., \(\mathrm{acc}({\mathcal{D}}^{\mathrm{unlabeled}})\ge 0.8\)), then given an acceptably reliable confidence score, unlabeled data for which \(\mathrm{conf}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)\ge 0.8\) would be sufficient for export and use in their given analysis without manual review. Before obtaining an estimate for \(\mathrm{conf}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)\), we first consider that the true (unknown) accuracy of the annotations in \({\mathcal{D}}^{\mathrm{unlabeled}}\) is the weighted sum of the accuracies of the clips in \({\mathcal{D}}^{\mathrm{unlabeled}}\), where weight is determined by the number of frames in each clip. Formally, we can express the accuracy of \({\mathcal{D}}^{\mathrm{unlabeled}}\) as:

$$\mathrm{acc}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)={\sum }_{i\in {\mathcal{D}}^{\mathrm{unlabeled}}}(\mathrm{acc}\left({\text{clip}}_{i}\right)\times \frac{\left|{\text{clip}}_{i}\right|}{\sum_{j\in {\mathcal{D}}^{\mathrm{unlabeled}}}\left|{\text{clip}}_{j}\right|}),$$

where \(\frac{\left|{\text{clip}}_{i}\right|}{\sum_{j\in {\mathcal{D}}^{\mathrm{unlabeled}}}\left|{\mathrm{clip}}_{j}\right|}\) weights \(\mathrm{acc}\left({\text{clip}}_{i}\right)\) by the number of frames in \({\text{clip}}_{i}\) (i.e., \(\left|{\text{clip}}_{i}\right|\)) relative to the total number of frames across all clips (i.e., \(\sum_{j\in {\mathcal{D}}^{\mathrm{unlabeled}}}\left|{\mathrm{clip}}_{j}\right|\)). We then estimate the accuracy of the unlabeled data by substituting the known \(\mathrm{conf}({\text{clip}}_{i})\) for the unknown \(\mathrm{acc}({\text{clip}}_{i})\):

$$\mathrm{conf}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)={\sum }_{i\in {\mathcal{D}}^{\mathrm{unlabeled}}}(\mathrm{conf}\left({\text{clip}}_{i}\right)\times \frac{\left|{\text{clip}}_{i}\right|}{\sum_{j\in {\mathcal{D}}^{\mathrm{unlabeled}}}\left|{\mathrm{clip}}_{j}\right|}).$$

In this way, \(\mathrm{conf}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)\) represents the approximate accuracy of the classifier on unlabeled data. If the confidence score functions well, then \(\mathrm{conf}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)\) will closely match \(\mathrm{acc}\left({\mathcal{D}}^{\mathrm{unlabeled}}\right)\).
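The frame-weighted estimate can be computed as in this sketch, where each clip is represented by a hypothetical `(confidence, n_frames)` pair.

```python
def dataset_confidence(clips):
    """Frame-weighted mean of clip confidence scores.

    `clips` is a list of (clip_confidence, n_frames) pairs.
    """
    total_frames = sum(n_frames for _, n_frames in clips)
    return sum(conf * n_frames / total_frames for conf, n_frames in clips)
```

For example, a 1500-frame clip with confidence 0.9 and a 500-frame clip with confidence 0.6 give an estimated accuracy of 0.825, closer to the longer clip's score than the unweighted mean of 0.75.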

Next, we consider the confidence-based review. In this component of the workflow, users can review and correct labels automatically generated by the classifier for \({\mathcal{D}}^{\mathrm{unlabeled}}\). A naïve approach would be to review all the video clips contained in \({\mathcal{D}}^{\mathrm{unlabeled}}\). While this would indeed ensure all the labels produced by the classifier are correct, if \({\mathcal{D}}^{\mathrm{unlabeled}}\) is large it can prove quite time-consuming. So instead, we leverage confidence scores to allow users to review only the subset of clips with relatively low confidence scores (i.e., relatively low predicted accuracy), for which review is most productive, while omitting those with relatively high confidence scores.

If a user reviews only a portion of the clips, it should be the portion with the lowest accuracy, for which correction is the most important. To express this formally, consider an ordered sequence of the \(n\) clips in \({\mathcal{D}}^{\mathrm{unlabeled}}\), \(({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2}, \dots , {\mathrm{clip}}_{n})\), sorted in ascending order by accuracy (i.e., \(\mathrm{acc}\left({\mathrm{clip}}_{i}\right)\le \mathrm{acc}\left({\mathrm{clip}}_{j}\right)\), for \(i<j\) and all \(i,j\le n\)). If we review only \(k\) of the \(n\) clips, where \(k\le n\), we are best off reviewing clips \({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2}, \dots , {\mathrm{clip}}_{k}\) from the list since they have the lowest accuracy. For unlabeled data, however, recall that we can’t precisely sort clips by accuracy, since without ground truth annotations \(\mathrm{acc}\left({\text{clip}}_{i}\right)\) is unknown. However, since \(\mathrm{conf}\left({\text{clip}}_{i}\right)\) approximates \(\mathrm{acc}\left({\text{clip}}_{i}\right)\), we can instead sort unlabeled clips by their (known) confidence scores, and then select the clips with the lowest confidence scores to review first. This forms the basis of the confidence-based review. Given a set of clips in \({\mathcal{D}}^{\mathrm{unlabeled}}\), we simply create a sequence of clips \(({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2}, \dots , {\mathrm{clip}}_{n})\) sorted by confidence score (i.e., such that \(\mathrm{conf}\left({\text{clip}}_{i}\right)\le \mathrm{conf}({\mathrm{clip}}_{j})\), for all \(i<j\)) and then have users review clips in ascending order. If the confidence score is an effective estimate of the clip accuracies, sorting based on confidence score will approximate sorting by accuracy.
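The ordering step amounts to a sort on the clip confidence scores, as sketched below. The dictionary representation and the clip names are hypothetical.

```python
def review_order(clip_confidences, k):
    """Return the k clips with the lowest confidence scores, lowest first.

    `clip_confidences` maps a clip identifier to its confidence score.
    """
    ranked = sorted(clip_confidences, key=clip_confidences.get)
    return ranked[:k]

# Hypothetical scores for four unlabeled clips.
scores = {"clip_a": 0.91, "clip_b": 0.55, "clip_c": 0.78, "clip_d": 0.62}
to_review = review_order(scores, k=2)
```

Here the two clips selected for review are `clip_b` and `clip_d`, the ones with the lowest predicted accuracy.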

### Evaluating confidence score calibration

To examine the relationship between confidence scores and accuracy, we first consider the relationship between individual clips’ predicted accuracy (as derived from confidence scores) and actual accuracy. The prediction error (PE) for a given clip is defined as the signed difference between its predicted accuracy and its actual accuracy. For \({\mathrm{clip}}_{i}\), the PE is then \(\mathrm{PE}({\mathrm{clip}}_{i})=\mathrm{conf}\left({\mathrm{clip}}_{i}\right)-\mathrm{acc}({\mathrm{clip}}_{i})\). Positive values indicate an overconfident score, and negative values an underconfident one. The absolute error (AE) is the magnitude of the prediction error and is defined as \(\mathrm{AE}\left({\mathrm{clip}}_{i}\right)=\left|\mathrm{PE}({\mathrm{clip}}_{i})\right|\). The AE is always non-negative, with a higher \(\mathrm{AE}\left({\mathrm{clip}}_{i}\right)\) indicating a greater absolute deviation between \(\mathrm{conf}\left({\mathrm{clip}}_{i}\right)\) and \(\mathrm{acc}({\mathrm{clip}}_{i})\).

While PE and AE are defined for a single clip, we also consider the mean absolute error and mean signed difference across all the clips in \({\mathcal{D}}^{\mathrm{unlabeled}}\). Here, we let \({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2},\dots ,{\mathrm{clip}}_{n}\) denote a set of \(n\) clips. The mean absolute error (MAE) is defined as \(\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{AE}({\mathrm{clip}}_{i})\). MAE expresses the average magnitude of the difference between predicted accuracy and actual accuracy for a randomly selected clip in the set. So, for example, if \(\mathrm{MAE}=0.1\), then a randomly selected clip’s confidence score will differ from its accuracy by about 10 percentage points, in expectation. The mean signed difference (MSD), meanwhile, is defined as \(\mathrm{MSD}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{PE}({\mathrm{clip}}_{i})\). MSD expresses the signed difference between the mean predicted accuracy across clips and the mean actual accuracy. So, for example, if \(\mathrm{MSD}=-0.05\), then the estimated accuracy of annotations for the set \({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2},\dots ,{\mathrm{clip}}_{n}\) is, on average, five percentage points lower than the true accuracy.
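
The definitions above translate directly into a few lines of code. The following Python sketch (illustrative only; the function names and the example values are hypothetical and do not come from the toolbox) computes PE per clip and then the MAE and MSD over a set of clips with known accuracy:

```python
def prediction_errors(conf, acc):
    """Signed prediction error, PE = conf - acc, for each clip."""
    if len(conf) != len(acc):
        raise ValueError("conf and acc must have the same length")
    return [c - a for c, a in zip(conf, acc)]


def calibration_summary(conf, acc):
    """Return (MAE, MSD) over a set of clips with known accuracy."""
    pe = prediction_errors(conf, acc)
    n = len(pe)
    mae = sum(abs(e) for e in pe) / n  # mean of AE = |PE|
    msd = sum(pe) / n                  # mean of signed PE
    return mae, msd


# Hypothetical values: one overconfident clip, one underconfident clip,
# and one perfectly calibrated clip.
conf = [0.75, 0.50, 1.00]
acc = [0.50, 0.75, 1.00]
mae, msd = calibration_summary(conf, acc)
print(f"MAE = {mae:.3f}, MSD = {msd:.3f}")
```

In this example the over- and underconfident clips cancel in the MSD but not in the MAE, which is why the two measures are reported together.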

### Evaluating review efficiency

To develop a metric for the performance of the confidence-based review, we first consider a case where a user has generated predicted labels for \(n\) clips, which have not been manually labeled, and selects \(k\) of them to review, where \(k\le n\). The remaining \(n-k\) clips are not reviewed and are exported with unrevised classifier-generated labels. Then, for each of the \(k\) clips the user has selected, he or she reviews the clip and corrects any incorrect classifier-generated labels. In this formulation, after reviewing a given clip, that clip’s accuracy (defined as the agreement between a clip’s labels and the labels produced by manual annotation) is \(1\), since any incorrect classifier-produced labels would have been corrected.

Next, we assume that we have been provided with a *sequence* of \(n\) clips, \(\mathcal{D}=({\mathrm{clip}}_{1}, {\mathrm{clip}}_{2}, \dots , {\mathrm{clip}}_{n})\), from which we select the first \(k\) clips in the sequence to review. If we denote \({\mathrm{clip}}_{i}^{\mathrm{unrev}}\) as clip \(i\) prior to being reviewed, and \({\mathrm{clip}}_{i}^{\mathrm{rev}}\) as clip \(i\) after being reviewed, then we can express the sequence of the first \(k\) clips after they have been reviewed as \({\mathcal{D}}_{k}^{\mathrm{rev}}=({\mathrm{clip}}_{1}^{\mathrm{rev}}, {\mathrm{clip}}_{2}^{\mathrm{rev}}, \dots , {\mathrm{clip}}_{k}^{\mathrm{rev}})\). We then express the remaining \(n-k\) clips as the sequence \({\mathcal{D}}_{k}^{\mathrm{unrev}}=({\mathrm{clip}}_{k+1}^{\mathrm{unrev}}, {\mathrm{clip}}_{k+2}^{\mathrm{unrev}}, \dots , {\mathrm{clip}}_{n}^{\mathrm{unrev}})\). We then consider that the overall accuracy of the sequence of clips after reviewing the first \(k\) of them, \(\mathrm{acc}\left({\mathcal{D}}_{k}\right)\), is simply the weighted average of the accuracy of the reviewed clips, \({\mathcal{D}}_{k}^{\mathrm{rev}}\), and the unreviewed ones, \({\mathcal{D}}_{k}^{\mathrm{unrev}}\), where the weight is a function of the number of frames in each clip. Formally,

$$\mathrm{acc}\left({\mathcal{D}}_{k}\right)=\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{rev}}\right)\times \frac{\left|{\mathcal{D}}_{k}^{\mathrm{rev}}\right|}{\left|\mathcal{D}\right|}+\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{unrev}}\right)\times \frac{\left|{\mathcal{D}}_{k}^{\mathrm{unrev}}\right|}{\left|\mathcal{D}\right|},$$

where \(\left|\mathcal{D}\right|\) is the total number of video frames in the clips in set \(\mathcal{D}\) (i.e., \(\left|\mathcal{D}\right|={\sum }_{i\in \mathcal{D}}\left|{\mathrm{clip}}_{i}\right|\)). We then consider that, after reviewing and correcting the first \(k\) clips, the accuracy of each reviewed clip is now \(1\). That is, \(\mathrm{acc}\left({\mathrm{clip}}_{i}^{\mathrm{rev}}\right)=1\) for all \({\mathrm{clip}}_{i}^{\mathrm{rev}}\in {\mathcal{D}}_{k}^{\mathrm{rev}}\). Therefore, the total accuracy of sequence \(\mathcal{D}\), after reviewing the first \(k\) clips, is

$$\mathrm{acc}\left({\mathcal{D}}_{k}\right)=\frac{\left|{\mathcal{D}}_{k}^{\mathrm{rev}}\right|}{\left|\mathcal{D}\right|}+\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{unrev}}\right)\times \frac{\left|{\mathcal{D}}_{k}^{\mathrm{unrev}}\right|}{\left|\mathcal{D}\right|}.$$
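
The expression above can be sketched in Python as follows. This is an illustration rather than the toolbox's code: the function name `accuracy_after_review` is hypothetical, and we assume that \(\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{unrev}}\right)\) is itself the frame-weighted mean accuracy of the unreviewed clips, so that the second term reduces to the number of correctly labeled unreviewed frames divided by \(\left|\mathcal{D}\right|\).

```python
def accuracy_after_review(acc, frames, k):
    """Frame-weighted accuracy of an ordered clip sequence after the
    first k clips have been reviewed and corrected.

    acc    -- per-clip accuracy before review, in review order
    frames -- number of frames in each clip, in the same order
    k      -- number of clips reviewed (0 <= k <= n)
    """
    if len(acc) != len(frames):
        raise ValueError("acc and frames must have the same length")
    if not 0 <= k <= len(acc):
        raise ValueError("k must satisfy 0 <= k <= n")
    total = sum(frames)
    # Reviewed clips have accuracy 1, so every reviewed frame is correct.
    reviewed_frames = sum(frames[:k])
    # Unreviewed clips contribute acc_i * |clip_i| correct frames each.
    unreviewed_correct = sum(a * f for a, f in zip(acc[k:], frames[k:]))
    return (reviewed_frames + unreviewed_correct) / total


# Hypothetical example: three clips of unequal length.
acc = [0.5, 0.75, 1.0]
frames = [100, 100, 200]
print(accuracy_after_review(acc, frames, 0))  # 0.8125
print(accuracy_after_review(acc, frames, 1))  # 0.9375
print(accuracy_after_review(acc, frames, 3))  # 1.0
```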

This method for calculating the accuracy of dataset \(\mathcal{D}\) after reviewing the first \(k\) clips becomes useful for analyzing the performance of the confidence-based review. To see why, we first consider the lower bound on \(\mathrm{acc}\left({\mathcal{D}}_{k}\right)\). In the worst case, our confidence score will convey no information about the relative accuracies of the clips in \(\mathcal{D}\). Without a relationship between \(\mathrm{acc}\left({\mathrm{clip}}_{i}\right)\) and \(\mathrm{conf}\left({\mathrm{clip}}_{i}\right)\), sorting based on confidence score is effectively the same as randomly selecting clips. In this way, we can compare the accuracy after reviewing the first \(k\) clips sorted by confidence score with the accuracy that *would have been obtained if* \(k\) *randomly selected clips were reviewed*. We denote this improvement in accuracy using confidence metric \(\mathrm{conf}\) as the “improvement over random” and formalize it as \({\mathrm{IOR}}_{k}^{\mathrm{conf}}=\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{conf}}\right)-\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{rand}}\right)\), where \({\mathcal{D}}_{k}^{\mathrm{conf}}\) and \({\mathcal{D}}_{k}^{\mathrm{rand}}\) denote dataset \(\mathcal{D}\) sorted by confidence score and randomly, respectively.

Next, we place an upper bound on \({\mathrm{IOR}}_{k}\) by considering the maximum accuracy that \(\mathcal{D}\) could have after reviewing \(k\) clips. In the best case, the first \(k\) clips reviewed would be the \(k\) clips with the lowest accuracy. Here, since we’re evaluating on \({\mathcal{D}}^{\mathrm{test}}\), where accuracy is known, we can calculate this. If we let \({\mathcal{D}}^{\mathrm{acc}}\) denote the sequence of clips sorted in ascending order by their true accuracy, then the maximum accuracy of \(\mathcal{D}\) after reviewing \(k\) clips is \(\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{acc}}\right)\). Then, similar to the analysis above, we calculate the improvement of optimal review (i.e., review based on true accuracy) over random review as \({\mathrm{IOR}}_{k}^{\mathrm{opt}}=\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{acc}}\right)-\mathrm{acc}\left({\mathcal{D}}_{k}^{\mathrm{rand}}\right).\) Semantically, \({\mathrm{IOR}}_{k}^{\mathrm{opt}}\) expresses how much higher the accuracy of the test set is after reviewing \(k\) clips in the optimal order than it would be if clips had been reviewed randomly.
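
As a worked illustration with hypothetical values, consider \(n=4\) clips of equal length with true accuracies \((0.4, 0.6, 0.8, 1.0)\), so that the accuracy before any review is \(0.7\), and suppose \(k=2\) clips are reviewed. For the purposes of this example only, we assume the random baseline is taken in expectation over uniformly random selections, so that each clip is reviewed with probability \(k/n\). Then

$$\mathrm{acc}\left({\mathcal{D}}_{2}^{\mathrm{acc}}\right)=\frac{1+1+0.8+1.0}{4}=0.95,\qquad \mathrm{acc}\left({\mathcal{D}}_{2}^{\mathrm{rand}}\right)=\frac{2}{4}+\left(1-\frac{2}{4}\right)\times 0.7=0.85,$$

so \({\mathrm{IOR}}_{2}^{\mathrm{opt}}=0.95-0.85=0.10\). If sorting by confidence score instead places the clips with accuracies \(0.4\) and \(0.8\) first, then

$$\mathrm{acc}\left({\mathcal{D}}_{2}^{\mathrm{conf}}\right)=\frac{1+1+0.6+1.0}{4}=0.90,\qquad {\mathrm{IOR}}_{2}^{\mathrm{conf}}=0.90-0.85=0.05,$$

that is, half of the improvement achievable at \(k=2\).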

We can then derive a series of global measures for the confidence-based review. While \({\mathrm{IOR}}_{k}\) is defined for a single number of clips reviewed, \(k\), we look to generate a measure that expresses \({\mathrm{IOR}}_{k}\) across a range of \(k\) values. To do so, we calculate the average improvement over random across the number of clips reviewed, from \(0\) to the total number, \(n\), as follows:

$${\overline{\mathrm{IOR}} }_{n}^{\mathrm{method}}=\frac{1}{n}{\sum }_{k=0}^{n}{\mathrm{IOR}}_{k}^{\mathrm{method}}.$$

\({\overline{\mathrm{IOR}} }_{n}^{\mathrm{method}}\) expresses the mean improvement over random of method \(\mathrm{method}\) over \(n\) clips. After calculating \({\overline{\mathrm{IOR}} }_{n}^{\mathrm{conf}}\) and \({\overline{\mathrm{IOR}} }_{n}^{\mathrm{opt}}\) (i.e., \({\overline{\mathrm{IOR}} }_{n}\) for confidence-based and optimal sorting), we can generate a final measure for the review efficiency by expressing the average improvement of confidence score \(\mathrm{conf}\) over random relative to the maximum possible improvement over random (optimal review):

$${\mathrm{review}\_\mathrm{efficiency}}_{n}^{\mathrm{conf}}=\frac{ {\overline{\mathrm{IOR}} }_{n}^{\mathrm{conf}}}{{\overline{\mathrm{IOR}} }_{n}^{\mathrm{opt}}}.$$

This metric expresses how close review using metric \(\mathrm{conf}\) is to optimal. If sort order based on \(\mathrm{conf}\) exactly matches that of sorting by accuracy, \({\mathrm{review}\_\mathrm{efficiency}}_{n}^{\mathrm{conf}}=1\). If the sort order is no better than random, \({\mathrm{review}\_\mathrm{efficiency}}_{n}^{\mathrm{conf}}=0\).
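
The full calculation can be sketched end to end in Python. This is an illustrative sketch under stated assumptions, not the toolbox's MATLAB code: all function names are hypothetical, and the random baseline is computed analytically as the expected accuracy when \(k\) of the \(n\) clips are chosen uniformly at random (each clip is then reviewed with probability \(k/n\)), rather than by simulating random orderings.

```python
def acc_after_review(acc, frames, order, k):
    """Frame-weighted accuracy after reviewing the first k clips in `order`."""
    reviewed = set(order[:k])
    correct = sum(f if i in reviewed else a * f
                  for i, (a, f) in enumerate(zip(acc, frames)))
    return correct / sum(frames)


def expected_random_acc(acc, frames, k):
    """Expected accuracy when k clips are chosen uniformly at random.

    Assumption of this sketch: each clip is reviewed with probability k/n.
    """
    p = k / len(acc)
    return sum(f * (p + (1 - p) * a) for a, f in zip(acc, frames)) / sum(frames)


def mean_ior(acc, frames, order):
    """Mean improvement over random: (1/n) * sum of IOR_k for k = 0..n."""
    n = len(acc)
    iors = [acc_after_review(acc, frames, order, k)
            - expected_random_acc(acc, frames, k)
            for k in range(n + 1)]
    return sum(iors) / n


def review_efficiency(acc, conf, frames):
    """Mean IOR of confidence-based sorting relative to optimal sorting.

    Undefined (division by zero) if optimal sorting offers no improvement
    over random, e.g. when all clips have the same accuracy.
    """
    n = len(acc)
    conf_order = sorted(range(n), key=lambda i: conf[i])
    opt_order = sorted(range(n), key=lambda i: acc[i])
    return mean_ior(acc, frames, conf_order) / mean_ior(acc, frames, opt_order)


# Hypothetical test-set values: four equal-length clips.
acc = [0.4, 0.6, 0.8, 1.0]
frames = [100, 100, 100, 100]
conf = [0.50, 0.90, 0.70, 0.95]  # swaps the order of clips 1 and 2
print(round(review_efficiency(acc, conf, frames), 3))  # 0.8
```

Because the measure is a ratio of two means that share the same normalization, the \(1/n\) factor cancels, and only the summed improvements over random matter.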

### Implementation details and code availability

We implement the toolbox in MATLAB version 2020b. The GUI for annotation and confidence-based review is included in the toolbox as a MATLAB application. Figures are produced using Prism 9 and OmniGraffle. The entire toolbox, along with example scripts, documentation, and additional implementation details, is hosted via a public GitHub repository at: https://github.com/carlwharris/DeepAction. We provide the intermediary data generated for the home-cage dataset (e.g., spatial and temporal frames and features, annotations, etc.) as an example project linked in the GitHub repository. Data produced to generate results for the CRIM13 project is available upon request, but is not provided as an example project due to its large file size. Full data (i.e., results for each test split in both projects) needed to replicate the results is also available on request. Data for the exploratory dataset is proprietary.