Bootstrapping Sign Language Annotations with Sign Language Models

AI-powered sign language understanding is limited by the lack of high-quality annotation data. New datasets, including ASL STEM Wiki and FLEURS-ASL, feature professional interpreters and 100 hours of data, but they remain under-annotated and underutilized, in part due to the prohibitive cost of annotation at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a standardized set of candidate annotations, including time intervals, glosses, fingerspelled words, and separators. Our pipeline combines predictions from fingerspelling recognition and isolated sign recognition (ISR) models with few-shot LLM prompting to estimate these annotations. As part of this pipeline, we develop simple but effective fingerspelling and ISR models, achieving strong performance on the FSBoard (6.7% CER) and ASL Citizen (74% top-1 accuracy) datasets. To validate the pipeline and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sentence-level gloss labels containing glosses, separators, and fingerspelling. These human annotations, together with over 300 hours of pseudo-annotations, are released in the supplemental material.
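The abstract does not specify the released annotation format, but a minimal sketch of what one such record could look like may help make the pipeline's output concrete. All field names and the gloss-string conventions below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical schema for one pseudo-annotation record. Field names are
# illustrative only; the paper's released format may differ.
@dataclass
class SignAnnotation:
    start_s: float   # interval start within the video, in seconds
    end_s: float     # interval end, in seconds
    kind: str        # "gloss", "fingerspelling", or "separator"
    text: str        # the gloss or fingerspelled word ("" for separators)

def to_gloss_sequence(annotations):
    """Flatten time-ordered annotations into a sentence-level gloss
    string, prefixing fingerspelled words with '#' (a common ASL gloss
    convention) and rendering separators as '|'."""
    parts = []
    for a in sorted(annotations, key=lambda a: a.start_s):
        if a.kind == "separator":
            parts.append("|")
        elif a.kind == "fingerspelling":
            parts.append("#" + a.text.upper())
        else:
            parts.append(a.text.upper())
    return " ".join(parts)
```

For example, three records covering a gloss, a separator, and a fingerspelled word would flatten to a single gloss string such as `SCIENCE | #DNA`.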
