FiftyOne Adds Local OCR Plugins for Datasets

FiftyOne introduces two OCR plugins, GLM-OCR and LightOnOCR-2-1B, enabling developers to extract and store text from images directly within their computer vision workflows.

What It Is

FiftyOne, the open-source computer vision dataset management platform, now supports two OCR plugins that extract text directly from images in datasets. GLM-OCR and LightOnOCR-2-1B both integrate with FiftyOne’s workflow, allowing developers to process entire image collections and store extracted text as structured fields alongside the visual data. Instead of manually annotating text or routing images through external OCR services, these plugins run locally and populate dataset fields automatically.

GLM-OCR connects to the GLM-4V-9B vision-language model, while LightOnOCR-2-1B uses a different architecture optimized for optical character recognition tasks. Both handle common OCR scenarios like document scanning, receipt processing, and scene text extraction, but they differ in speed and output formatting.

Why It Matters

Dataset preparation typically consumes more time than model training itself. When working with datasets containing receipts, forms, street signs, or any text-bearing images, extracting that text manually creates a bottleneck. External OCR APIs solve the problem but introduce costs, latency, and data privacy concerns when sending images to third-party services.

These FiftyOne plugins shift OCR into the dataset preparation phase where it belongs. Computer vision teams can now extract text during initial dataset ingestion, making it searchable and analyzable alongside bounding boxes, classifications, and other annotations. This matters particularly for multimodal projects where text content influences labeling decisions or training strategies.

GLM-OCR’s speed advantage becomes significant at scale. Processing thousands of images with faster extraction means shorter iteration cycles when exploring datasets or debugging annotation pipelines. The structured output capability also reduces post-processing work: instead of parsing messy text blobs, developers get clean field data ready for analysis or model training.

Getting Started

Installing GLM-OCR requires a single command:
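Assuming the plugin follows FiftyOne's standard plugin download mechanism (the exact command isn't reproduced here), it would look something like:

```shell
# Hypothetical: FiftyOne plugins are typically installed by pointing
# the CLI at the plugin's GitHub repository
fiftyone plugins download https://github.com/harpreetsahota204/glm_ocr
```

Check the plugin's README for the authoritative command and any model weights it needs to pull on first run.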

After installation, the plugin appears in FiftyOne’s interface and can process datasets through the GUI or programmatically. The workflow typically involves loading a dataset, selecting images for OCR processing, and running the plugin to populate text fields.
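The general shape of that load-extract-store workflow can be sketched in plain Python. The `run_ocr` function below is a stand-in for whatever the plugin's operator actually does, and the field names are illustrative, not the plugin's real API:

```python
# Sketch of the load -> extract -> store pattern, with a stubbed OCR step.
# In a real FiftyOne workflow the plugin operator populates the field;
# here a placeholder function stands in so the flow is runnable anywhere.

def run_ocr(image_path: str) -> str:
    """Stand-in for the plugin's OCR call."""
    return f"text extracted from {image_path}"

# A minimal dataset: each sample pairs an image path with its annotations
dataset = [
    {"filepath": "receipts/001.jpg", "label": "receipt"},
    {"filepath": "signs/042.jpg", "label": "street_sign"},
]

# Populate an extracted-text field alongside the existing annotations
for sample in dataset:
    sample["extracted_text"] = run_ocr(sample["filepath"])

# The text is now queryable next to the visual labels
receipts = [s for s in dataset if s["label"] == "receipt"]
print(receipts[0]["extracted_text"])
```

The point of the pattern is that extracted text lives on the same sample record as the visual annotations, so later filtering and analysis can use both together.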

For hands-on exploration, the quickstart notebook at https://github.com/harpreetsahota204/glm_ocr/blob/main/glm_ocr_fiftyone_example.ipynb demonstrates the complete process with example code. The notebook shows how to configure extraction parameters, handle batch processing, and access the resulting text data.

LightOnOCR-2-1B follows a similar installation pattern and shares the same FiftyOne plugin interface. Teams can test both plugins on sample datasets to compare performance characteristics before committing to one for production workflows.

Full documentation lives at https://docs.voxel51.com/plugins/plugins_ecosystem/glm_ocr.html for GLM-OCR and https://docs.voxel51.com/plugins/plugins_ecosystem/lightonocr_2.html for LightOnOCR-2-1B.

Context

Traditional OCR tools like Tesseract remain viable for simple text extraction, but they lack integration with modern dataset management workflows. Cloud services from Google, AWS, and Azure offer robust OCR but require API authentication, incur per-image costs, and send data outside local infrastructure.

GLM-OCR’s performance edge comes from its underlying vision-language model architecture, which handles varied text orientations and complex backgrounds more gracefully than older OCR engines. The structured output capability means it can return JSON-formatted results instead of raw text strings, reducing parsing overhead.
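To illustrate why structured output lowers parsing overhead, here is a sketch that consumes a hypothetical JSON-formatted OCR result. The field names are invented for illustration; the plugin's actual output schema is defined in its documentation:

```python
import json

# Hypothetical JSON payload of the kind a structured-output OCR model
# might return, versus a raw text blob that would need regex parsing
raw_result = '{"merchant": "ACME Hardware", "total": "19.99", "date": "2024-03-01"}'

fields = json.loads(raw_result)

# Each value is immediately addressable; no post-hoc string parsing needed
print(fields["merchant"])
print(float(fields["total"]))
```

With a raw text string, the same extraction would require brittle regular expressions or heuristic splitting; JSON makes the downstream code a dictionary lookup.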

LightOnOCR-2-1B serves as a fallback option when GLM-OCR’s dependencies conflict with existing environments or when specific accuracy requirements favor its architecture. Some datasets with particular text characteristics may perform better with one model over the other, making both options valuable.

The main limitation applies to both plugins: they run locally and require sufficient GPU memory for efficient processing. Large-scale batch jobs may need workflow adjustments to manage memory usage. Neither plugin currently supports handwriting recognition or heavily degraded text, scenarios where specialized commercial services still hold advantages.
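One common workflow adjustment is to process the dataset in fixed-size chunks so that only one batch's worth of images occupies GPU memory at a time. A minimal sketch of the idea, where `ocr_batch` is a placeholder for the plugin's actual batch call:

```python
from typing import Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def ocr_batch(paths: list) -> list:
    """Placeholder for the plugin's batch OCR call."""
    return [f"text for {p}" for p in paths]

image_paths = [f"images/{i:04d}.jpg" for i in range(10)]
results = []

# Process in chunks of 4 so memory use stays bounded regardless of
# dataset size; a real pipeline would also free GPU tensors between chunks
for batch in chunked(image_paths, 4):
    results.extend(ocr_batch(batch))

print(len(results))
```

The chunk size becomes a tuning knob: larger chunks amortize per-batch overhead, smaller chunks keep peak memory down.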