Scaling data annotation using vision language models to enable physical AI systems

A severe labor shortage is hampering growth across industries such as transportation, construction, and agriculture. The problem is especially acute in construction: nearly 500,000 positions remain unfilled in the United States, and roughly 40% of the current workforce is expected to retire within a decade. These labor constraints cause project delays, increase costs, and stall development plans. To address these challenges, organizations are developing autonomous systems that close labor gaps, increase operational efficiency, and deliver productivity around the clock.
Building autonomous systems requires large, annotated datasets to train AI models, and effective training determines whether these systems deliver business value. The bottleneck is the high cost of data preparation: labeling video data (identifying equipment, activities, and environmental conditions) is necessary to make the data useful for model training. This bottleneck can slow model deployment and delay the delivery of AI-powered products and services to customers. For construction companies managing millions of hours of video, manual data preparation and annotation is impractical. Vision language models (VLMs) help address this by interpreting images and video, answering natural language queries, and generating descriptions at a speed and scale that manual processes cannot match, providing a cost-effective alternative.
In this post, we explore how Bedrock Robotics is tackling this challenge. By joining the AWS Physical AI Fellowship, the startup has partnered with the AWS Generative AI Innovation Center to use vision language models that analyze construction video, extract operational information, and generate labeled training datasets at scale, streamlining data preparation for autonomous construction machines.
Bedrock Robotics: a case study for accelerating autonomous construction
Since 2024, Bedrock Robotics has been developing autonomous systems for construction machines. The company's product, Bedrock Operator, is a retrofit solution that combines hardware and AI models so that excavators and other machines can operate with minimal human intervention. These systems can perform tasks such as digging, grading, and material handling with centimeter-level accuracy. Training these models requires a large volume of annotated video covering diverse equipment, tasks, and environments, a resource-intensive process that limits scalability.
VLMs provide a solution by analyzing this image and video data and generating textual descriptions. This makes them well suited for annotation tasks, which are essential for teaching models how to relate observed visual patterns to human language. Bedrock Robotics used this technology to guide the preparation of training data for the AI models that enable autonomous machine operation. With careful model selection and prompt engineering, the company improved tool attachment identification accuracy from 34% to 70%, transforming a manual, time-consuming process into an automated, scalable data pipeline. This success accelerated the deployment of autonomous capabilities.
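To illustrate what such a pipeline can look like, the following sketch samples frames from a video clip and asks a VLM hosted on Amazon Bedrock to describe the tool attachment visible in each frame. The model ID, prompt wording, and frame-sampling rate are illustrative assumptions, not details published by Bedrock Robotics.

```python
import boto3
import cv2  # OpenCV, used here for frame extraction

# Hypothetical settings; the actual model and sampling rate used by
# Bedrock Robotics are not published in this post.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
FRAME_INTERVAL_SEC = 10

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def sample_frames(video_path: str, interval_sec: int = FRAME_INTERVAL_SEC):
    """Yield one JPEG-encoded frame every `interval_sec` seconds of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * interval_sec)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                yield buf.tobytes()
        idx += 1
    cap.release()

def annotate_frame(image_bytes: bytes) -> str:
    """Ask the VLM to describe the tool attachment visible in one frame."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Describe the excavator tool attachment visible in this frame."},
            ],
        }],
        inferenceConfig={"maxTokens": 200, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]

# Example: annotate a single clip
# labels = [annotate_frame(f) for f in sample_frames("site_clip.mp4")]
```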
This approach provides a replicable framework for organizations facing similar data challenges and demonstrates how strategic investments in foundation models (FMs) can deliver measurable performance gains and competitive advantage. FMs are models trained on large amounts of data using self-supervised learning techniques, learning general representations that can be adapted to many downstream tasks. VLMs apply this large-scale pre-training to combine visual and textual modalities, so they can understand, analyze, and generate content in both visual and linguistic form.
In the following sections, we look at the process Bedrock Robotics used to annotate millions of hours of video footage and accelerate innovation using a VLM-based solution.
From unstructured video data to strategic assets using VLMs
Enabling autonomous operation requires extracting useful information from millions of hours of raw operational footage. Specifically, Bedrock Robotics needed to identify tool attachments, tasks, and work site conditions across a variety of situations. The following images are sample video frames from this dataset.
Excavators work with many different tool attachments, each of which requires precise classification to train reliable AI models. In collaboration with the Innovation Center, Bedrock Robotics focused its annotation efforts on several key tool categories: lifting hooks for material handling, breakers for concrete demolition, grading beams for surface leveling, and digging buckets for excavation.
These labels allow Bedrock Robotics to select relevant video segments and compile a training dataset representing varied machine configurations and operating conditions, as sketched below.
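As a minimal illustration of how such labels can drive dataset curation, the following sketch filters annotated video segments by tool category and caps the selection per category to keep the dataset balanced. The segment schema, S3 paths, and per-category cap are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical annotated segments; in practice these would come from the
# VLM annotation pipeline described earlier.
segments = [
    {"clip": "s3://bucket/clip_001.mp4", "start": 0, "end": 30, "tool": "digging bucket"},
    {"clip": "s3://bucket/clip_002.mp4", "start": 45, "end": 90, "tool": "lifting hook"},
]

TARGET_TOOLS = {"lifting hook", "breaker", "grading beam", "digging bucket"}
MAX_PER_TOOL = 500  # illustrative cap per category

def build_training_set(segments, target_tools=TARGET_TOOLS, max_per_tool=MAX_PER_TOOL):
    """Select labeled segments for the target tool categories, capped per category."""
    by_tool = defaultdict(list)
    for seg in segments:
        tool = seg["tool"]
        if tool in target_tools and len(by_tool[tool]) < max_per_tool:
            by_tool[tool].append(seg)
    return [seg for segs in by_tool.values() for seg in segs]

training_set = build_training_set(segments)
```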
Accelerating AI deployment through strategic model selection and prompt engineering
Off-the-shelf VLMs (VLMs used without prompt engineering) struggle with construction video data because they are trained on web images, not operational footage captured from construction machines. They cannot handle unusual camera angles, machine-specific viewpoints, or visual degradation from dust and weather. They also lack the domain knowledge to distinguish visually similar tool attachments, such as different types of buckets.
Bedrock Robotics and the Innovation Center addressed this through targeted model selection and prompt optimization. The teams explored multiple VLMs, including open-source options and FMs available on Amazon Bedrock, and refined prompts with detailed visual descriptions of each tool, guidance for distinguishing attachments that are often confused, and step-by-step instructions for analyzing video frames.
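The sketch below shows what such a refined prompt might look like when classifying a single frame through the Amazon Bedrock Converse API. The tool descriptions, disambiguation hints, and model ID are illustrative assumptions rather than the actual prompts used in the engagement.

```python
import boto3

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed model choice

# Illustrative prompt structure: visual descriptions, disambiguation hints,
# and step-by-step instructions for the model to follow.
SYSTEM_PROMPT = """You are an expert annotator of construction equipment video.
Tool categories and visual cues:
- lifting hook: curved hook hanging from the stick, no bucket attached
- breaker: long cylindrical hydraulic hammer with a pointed chisel
- grading beam: wide flat horizontal blade used to level surfaces
- digging bucket: curved bucket with teeth along the cutting edge
Disambiguation: a breaker is narrow and pointed, while a digging bucket is
wide and hollow; do not confuse the two even when partially occluded by dust.
Steps: (1) locate the end of the excavator arm, (2) describe the attachment,
(3) answer with exactly one category name, or "unknown" if unclear."""

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def classify_attachment(image_bytes: bytes) -> str:
    """Classify the tool attachment in one JPEG frame using the refined prompt."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Which tool attachment is on the excavator in this frame?"},
            ],
        }],
        inferenceConfig={"maxTokens": 20, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip().lower()
```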
These changes improved classification accuracy from 34% to 70% on a test set of 130 videos, at $10 per hour of video processed. The results show how prompt engineering adapts VLMs to specialized tasks. For Bedrock Robotics, this approach delivered faster training cycles, reduced deployment time, and a cost-effective annotation pipeline that scales with operational needs.
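A simple way to measure this kind of improvement is to compare the VLM's per-video predictions against human-verified labels on a held-out test set. The sketch below computes overall and per-category accuracy; the data format and labels are assumptions for illustration.

```python
from collections import Counter, defaultdict

def evaluate(predictions: dict, ground_truth: dict):
    """Compute overall and per-category accuracy of predicted tool labels.

    Both arguments map a video ID to a tool category name.
    """
    correct = 0
    per_tool = defaultdict(Counter)
    for video_id, true_label in ground_truth.items():
        pred = predictions.get(video_id, "unknown")
        per_tool[true_label]["total"] += 1
        if pred == true_label:
            correct += 1
            per_tool[true_label]["correct"] += 1
    overall = correct / len(ground_truth)
    by_tool = {
        tool: counts["correct"] / counts["total"]
        for tool, counts in per_tool.items()
    }
    return overall, by_tool

# Toy example with two labeled videos
overall, by_tool = evaluate(
    predictions={"vid_001": "digging bucket", "vid_002": "breaker"},
    ground_truth={"vid_001": "digging bucket", "vid_002": "lifting hook"},
)
print(f"Overall accuracy: {overall:.0%}")  # 50% in this toy example
```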
The way forward: tackling the labor shortage with autonomy
The result is competitive advantage. For Bedrock Robotics, vision language models enabled rapid identification and extraction of key data, surfacing the necessary insights from large-scale construction video. With an overall accuracy of 70%, this cost-effective method provides a practical foundation for curating model training data at scale, and it shows how AI innovation can turn workforce constraints into an opportunity to accelerate industry transformation. Organizations that streamline data preparation can speed up autonomous system deployments, reduce operating costs, and explore new areas of growth in industries affected by labor shortages. Construction leaders and industry practitioners facing similar challenges can apply this replicable framework to drive competitive differentiation within their domains.
To learn more, visit Bedrock Robotics or explore physical AI resources on AWS.
About the authors



