title: SPEAR: Semi-supervised Data Programming in Python
authors: Abhishek, Guttu Sai; Ingole, Harshad; Laturia, Parth; Dorna, Vineeth; Maheshwari, Ayush; Ramakrishnan, Ganesh; Iyer, Rishabh
date: 2021-08-01

Abstract: We present SPEAR, an open-source Python library for data programming with semi-supervision. The package implements several recent data programming approaches, including facilities to programmatically label and build training data. SPEAR facilitates weak supervision in the form of heuristics (or rules) and the association of noisy labels with the training dataset. These noisy labels are aggregated to assign labels to the unlabeled data for downstream tasks. We have implemented several label aggregation approaches that aggregate the noisy labels and then train on the noisily labeled set in a cascaded manner. Our implementation also includes other approaches that jointly aggregate labels and train the model for text classification tasks. Thus, our Python package integrates several cascaded and joint data-programming approaches while also providing the facility of data programming by letting the user define labeling functions or rules. The code and tutorial notebooks are available at https://github.com/decile-team/spear. Further, extensive documentation can be found at https://spear-decile.readthedocs.io/. Video tutorials demonstrating the usage of our package are available at https://youtu.be/SN9YYK4FlU0, https://www.youtube.com/watch?v=qdukvO3B8YU, and https://drive.google.com/uc?export=download&confirm=KY-u&id=1P0iOGHrIR1Te0sSeCB3hwdPpjzB44K33. We also present some real-world use cases of SPEAR.

Supervised machine learning approaches require large amounts of labeled data to train robust models. For classification tasks such as spam detection, (movie) genre categorization, sequence labeling, and so on, modern machine learning systems rely heavily on human-annotated gold labels. Creating labeled data can be a time-consuming and expensive procedure that necessitates a significant amount of human effort. To reduce dependence on human-annotated labels, various techniques such as semi-supervision, distant supervision, and crowdsourcing have been proposed. In order to help reduce the subjectivity and drudgery of the labeling process, several recent data programming approaches (Bach et al., 2019; Chatterjee et al., 2020; Awasthi et al., 2020; Maheshwari et al., 2021a) have proposed the use of human-crafted labeling functions or automatic LFs (Maheshwari et al., 2021b) to weakly associate labels with the training data. Users encode supervision in the form of labeling functions (LFs), which assign noisy labels to unlabeled data, reducing dependence on human-labeled data. While most of the data-programming approaches cited above provide their source code in the public domain, a unified package providing access to all of them has been missing. In this work, we describe SPEAR, a Python package that implements several existing data programming approaches while also providing a platform for integrating and benchmarking newer ones. Inspired by frameworks such as Snorkel (Bach et al., 2019; Ratner et al., 2017) and algorithm-based labeling in MATLAB, we provide a facility for users to define LFs. Further, we develop and integrate several recent data programming models that use these LFs. We provide many easy-to-use Jupyter notebooks and video tutorials to help new users get started quickly. Users can get started by installing the package using the below command:

pip install decile-spear

SPEAR's workflow involves three steps: (i) designing LFs, (ii) applying the LFs on the data, and (iii) aggregating the resulting noisy labels with a label aggregator (LA). At the outset, the user is expected to declare an enum class listing all the class labels. The enum class associates each numeric class label with a readable class name.
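For instance, for a spam-classification task such as the SMS dataset used in our tutorials, the enum declaration might look as follows (the class names here are illustrative):

import enum

class ClassLabels(enum.Enum):
    HAM = 0   # readable class name mapped to numeric label 0
    SPAM = 1  # readable class name mapped to numeric label 1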
As part of (i), SPEAR provides the facility for manually creating LFs. LFs can also take the form of regex rules. Additionally, we provide a @preprocessor decorator to use an external library such as spacy (https://spacy.io), nltk, etc., which can be optionally invoked by the LFs. Thereafter, as part of (ii), the LFs can be applied on the unlabeled (and labeled) set using an apply function that returns a matrix of dimension #LFs × #instances. The matrix is then provided as input to the selected label aggregator (LA) in (iii), as shown in Figure 1. We integrate several LA options into SPEAR. Each LA aggregates multiple noisy labels (obtained from the LFs) to associate a single class label with an instance. Additionally, we have also implemented in SPEAR several joint learning approaches that employ semi-supervision and feature information.

The high-level flow of the SPEAR library is presented in Figure 1. The user interacts with the library by designing labeling functions. Similar to Ratner et al. (2017), labeling functions are Python functions that take a candidate as input and either associate a class label with it or abstain. Continuous LFs, however, return a continuous score in addition to the class label. These continuous LFs are more natural to program and lead to improved recall (Chatterjee et al., 2020).

SPEAR uses a @labeling_function() decorator to define a labeling function. Each LF, when applied on an instance, can either return a class label or not return anything, i.e., abstain. The LF decorator accepts an additional argument: a list of preprocessors. Each preprocessor can either be declared as a pre-defined function or employ external libraries. The preprocessor transforms the data point before the labeling function is applied.

@labeling_function(cont_scorer, resources, preprocessors, label)
def CLF1(x, **kwargs):
    return label if kwargs["continuous_score"] >= threshold else ABSTAIN

An LF can express pattern-matching rules in the form of heuristics, or distant supervision that uses external knowledge bases and other data resources to label data points. LFs for the SMS dataset can be seen in the example notebook here.

Continuous LFs: With discrete LFs, users construct heuristic patterns based on dictionary lookups or thresholded distances for the classification task. However, the keywords in hand-crafted dictionaries might be incomplete. Chatterjee et al. (2020) proposed a comprehensive alternative: continuous-valued LFs that return scores derived from a soft match between the words in the sentence and the dictionary. SPEAR provides the facility to declare continuous LFs, each of which returns the associated label along with a confidence score, using the @continuous_scorer decorator. The continuous score can then be accessed in the LF definition through the keyword argument continuous_score, as in the CLF1 example above.

[Table 1: Comparison with related packages, including Snorkel (Ratner et al., 2017) and Awasthi et al. (2020). Snorkel provides support for designing and applying LFs and semi-supervised LA approaches, but does not have facilities for continuous LFs, unsupervised LA, or labeled-data subset selection.]
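A minimal sketch of a continuous LF under these conventions is shown below, reusing the ClassLabels enum declared earlier. The scorer, the keyword dictionary, and the decorator keyword arguments are illustrative; the exact signatures may differ slightly from this sketch, so consult the tutorial notebooks for the package's actual usage.

from spear.labeling import labeling_function, continuous_scorer, ABSTAIN

@continuous_scorer()
def spam_keyword_score(x, **kwargs):
    # Illustrative soft match: fraction of spam-indicating keywords
    # from a small hand-crafted dictionary that appear in the sentence x.
    keywords = {"free", "win", "offer", "prize"}
    return len(set(x.lower().split()) & keywords) / len(keywords)

@labeling_function(cont_scorer=spam_keyword_score, label=ClassLabels.SPAM)
def CLF_spam(x, **kwargs):
    # The decorator injects the scorer's output as "continuous_score".
    return ClassLabels.SPAM if kwargs["continuous_score"] >= 0.25 else ABSTAIN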
Once LFs are defined, users can analyse them by calculating the coverage, overlap, conflicts, and empirical accuracy of each LF, which helps users iterate on the process by refining the LFs. The metrics can be visualised within the SPEAR tool, either as a table or as graphs, as shown in Figure 2.

PreLabels is the master class that encapsulates a set of LFs, the dataset to label, and the enum of class labels. PreLabels facilitates the process of applying the LFs on the dataset, and of analysing and refining the LF set. We provide functions to store the labels assigned by the LFs, along with associated metadata such as the mapping of class names to numeric class labels, on disk as JSON file(s). The pre-labeling performed using the LFs can be consolidated into labeling using the several consensus models described in Section 4.

sms_pre_labels = PreLabels(name="sms",
                           data=X_V,
                           gold_labels=Y_V,
                           data_feats=X_feats_V,
                           rules=rules,
                           labels_enum=ClassLabels,
                           num_classes=2)

We implement several data-programming approaches in this demonstration, including simple baselines such as fully-supervised, semi-supervised, and unsupervised approaches.

The joint learning (JL) module implements a semi-supervised data programming paradigm that learns a joint model over LFs and features. JL has two key components, viz., a feature model (fm) and a graphical model (gm), whose sum is used as the training objective. During training, JL requires labeled (L), validation (V), and test (T) sets containing true labels, and an unlabeled (U) set whose true labels are to be inferred. The model API closely follows that of scikit-learn (Pedregosa et al., 2011) to make the package easily accessible to the machine learning audience. The primary functions are: (1) fit_and_predict_proba, which trains using the pre-labels assigned by the LFs and the true labels of the L data, and predicts label probabilities for each instance of the U data; (2) fit_and_predict, similar to the previous function but predicting the labels of U using the maximum posterior probabilities; (3) predict_(fm/gm)_proba, which predicts probabilities using the feature model (fm) or graphical model (gm); and (4) predict_(fm/gm), which predicts labels using fm/gm based on the learned parameters. We also provide the functions save_params and load_params to save and load the trained parameters; a sketch of this workflow follows below.

As another unique feature (cf. Table 1), our library supports a subset-selection framework that makes the best use of human-annotation efforts. The L set can be chosen using submodular functions such as facility location, max cover, etc. We utilise the submodlib library (https://github.com/decile-team/submodlib) for the subset-selection algorithms. The function alternatives for subset selection are rand_subset, unsup_subset, sup_subset_indices, and sup_subset_save_files.
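The JL workflow described above can be sketched as follows. The constructor and fit arguments shown here are indicative only (the file paths, LF count, and network sizes are illustrative assumptions); see the documentation and tutorial notebooks for the exact signatures.

from spear.jl import JL

# Illustrative configuration: number of LFs, feature dimension,
# feature-model type, and hidden-layer size are assumptions.
jl = JL(path_json="sms_json.json", n_lfs=16, n_features=1024,
        feature_model="nn", n_hidden=512)

# Train using the LF pre-labels and the true labels of the L set,
# then return label probabilities for every instance in U.
probs_U = jl.fit_and_predict_proba(path_L="sms_pickle_L.pkl",
                                   path_U="sms_pickle_U.pkl",
                                   path_V="sms_pickle_V.pkl",
                                   path_T="sms_pickle_T.pkl")

jl.save_params("jl_params.pkl")  # persist the learned fm/gm parameters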
As a fully-supervised baseline, a classifier P(y|x) is trained only on the labeled data. Following Maheshwari et al. (2021a), we provide the facility to use either logistic regression or a 2-layered neural network. Our package is flexible enough to allow other architectures to be plugged in as well.

CAGE (Chatterjee et al., 2020): This LA accepts both continuous and discrete LFs. Further, each LF has an associated quality-guide component that refers to the fraction of times the LF predicts the correct label; this stabilises training in the absence of a V set. In our package, CAGE accepts the U and T sets during training. CAGE has member functions similar to those of the JL module (except that there are no fm or gm variants of the predict_proba and predict functions in CAGE), with different arguments serving the same purpose. It should be noted that this model needs neither labeled (L) nor validation (V) data.

Learning to Reweight (Ren et al., 2018): This method is an online meta-learning approach for reweighting training examples using a mix of U and L. It leverages the validation set to adaptively assign importance weights to examples based on the gradient direction. It does not employ additional parameters to weigh or denoise individual rules.

4.6 Posterior Regularization (PR) (Hu et al., 2016): This method enables learning simultaneously from L and logic rules by jointly training a rule network and a feature network in a teacher-student setup. The student network learns parameters θ using the L set, and the teacher network attempts to imitate the student network in a joint-learning manner. The teacher network encodes the logic rules as a regularization term in the overall loss objective.

Imply Loss (Awasthi et al., 2020): This approach uses additional information in the form of labeled rule exemplars and trains with a denoised rule-label loss. It leverages both rules and labeled data by mapping each rule to exemplars of correct firings (i.e., instantiations) of that rule. Its joint training algorithm denoises over-generalized rules and trains a classification model. It has two main components:
1. Rule Network: learns to predict whether a given rule has over-generalized on a given sample, using latent coverage variables.
2. Classification Network: trained on L and U to predict the output label and maximize accuracy on unseen test instances, using a soft implication loss.

This module contains the following primary classes (a usage sketch follows below):
1. DataFeeder: takes all the parameters as input and creates a data-feeder object with these parameters as its attributes.
2. HighLevelSupervisionNetwork (HLS): takes the two networks, the mode (i.e., the approach to be used to train the model), the required parameters, the directory storing model checkpoints, and the instances and labels from the labeled dataset (L), and creates an object named "hls". The HLS object has many member functions, of which the two most significant are:
(a) hls.train: when called with the required mode, trains the object's two network attributes.
(b) hls.test: supports three types of testing: (i) test_w, which tests the rule network and the related model of the object; (ii) test_f, which tests the classification network and the related model; and (iii) test_all, which tests both networks and models.
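The flow of this module can be sketched as below. This is a hypothetical illustration using only the classes and member functions named above; the import path, argument names, and testing interface are assumptions, not the module's exact signatures.

# Hypothetical sketch; import path and argument names are assumed.
from spear.Implyloss import HighLevelSupervisionNetwork

hls = HighLevelSupervisionNetwork(
    rule_network,            # predicts whether a rule over-generalizes
    classification_network,  # predicts the output label
    mode="implyloss",        # training approach to use (name assumed)
    params=params,           # required hyperparameters
    ckpt_dir="checkpoints/", # directory storing model checkpoints
    L_instances=X_L,         # instances from the labeled set L
    L_labels=Y_L)            # labels from the labeled set L

hls.train(mode="implyloss")   # trains both network attributes
hls.test(test_type="test_f")  # test the classification network;
# "test_w" tests the rule network and "test_all" tests both.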
We have prepared Jupyter tutorial notebooks for two standard text classification datasets, namely SMS and MIT-R. We took the LFs on these datasets from Awasthi et al. (2020) and trained models using the approaches implemented in this paper. Figure 3 shows the performance of the various approaches implemented using our package on the two datasets.

SPEAR is employed in project UDAAN (https://www.udaanproject.org/) for reducing post-editing efforts. UDAAN is a post-editing workbench for translating content into native languages. Based on the post-editor's patterns of changes to the target-language document, candidate labeling functions are generated by the UDAAN workbench, using a combination of heuristics and linguistic patterns (cf. Figure 4 for examples of LFs). Based on these LFs, SPEAR is invoked on a combination of the edited (i.e., labeled) data and the not-yet-edited (i.e., unlabeled) data to present consolidated edits to the post-editor. This use case is presented in the flow chart in Figure 4, wherein we show the incorporation of SPEAR into the post-editing environment of an ecosystem such as translation (UDAAN), Optical Character Recognition (https://www.cse.iitb.ac.in/~ocr/), or Automatic Speech Recognition (ASR).

As part of third-wave preparedness, SPEAR was used by the Municipal Corporation of Greater Mumbai (MCGM)'s Health Ward (https://colab.research.google.com/drive/1tNUObqSDypUos7YNvnqvemALlkrrsB0z) for predicting the COVID-19 status of patients to help in preliminary diagnosis.

For ease of use and system stability, we rely on well-known and established third-party packages. For instance, we build documentation using the Sphinx documentation generator. We also use standard open-source packages for development and visualisation, such as numpy, matplotlib, and pandas. The package is written in Python 3 and open-sourced with an MIT License.

SPEAR is a unified package for semi-supervised data programming that quickly annotates training data and trains machine learning models. It eases the development of LFs and label aggregation approaches. This allows for better reproducibility, benchmarking, and easier ML development in low-resource settings such as textual post-editing.

References:
Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, and Sunita Sarawagi. 2020. Learning from rules generalizing labeled exemplars. In 8th International Conference on Learning Representations (ICLR).
Stephen H. Bach, Daniel Rodriguez, Yintao Liu, et al. 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of SIGMOD.
Oishik Chatterjee, Ganesh Ramakrishnan, and Sunita Sarawagi. 2020. Robust data programming with precision-guided labeling functions. In Proceedings of AAAI.
Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In Proceedings of ACL.
Ayush Maheshwari, Oishik Chatterjee, Krishnateja Killamsetty, Ganesh Ramakrishnan, and Rishabh Iyer. 2021a. Data programming using semi-supervision and subset selection. In Findings of ACL.
Ayush Maheshwari, Krishnateja Killamsetty, Ganesh Ramakrishnan, Rishabh Iyer, Marina Danilevsky, and Lucian Popa. 2021b. Learning to robustly aggregate labeling functions for semi-supervised data programming.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.
Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, et al. 2017. Snorkel: Fast training set generation for information extraction. In Proceedings of SIGMOD.
Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In Proceedings of ICML.