Auto-Rename Images with Vision Models & Live Reasoning
What It Is

Sorting-hat is an open-source utility that automatically renames image files by analyzing their content with vision-language models. Instead of manually renaming hundreds of photos or leaving them with cryptic camera-generated names like DSC_4821.jpg, the tool examines each image and generates descriptive filenames based on what it sees.

The implementation stands out because it displays the model’s reasoning process in real-time. When using reasoning-capable models like Qwen3.5, developers can watch the AI’s thought process unfold as it analyzes each image - considering composition, subjects, context, and appropriate naming conventions before settling on a final filename. This transforms what would typically be a passive waiting experience into an observable demonstration of how vision models interpret visual information.

The tool accepts any OpenAI-compatible API endpoint, making it compatible with both cloud services and locally-hosted models running through frameworks like llama.cpp. Initial testing focused on Qwen3.5 variants ranging from 0.8b to 27b parameters, though the architecture supports any vision model that follows the standard API format.
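Because the endpoint is OpenAI-compatible, each request is a standard chat-completions payload with the image embedded as a base64 data URL. A minimal stdlib-only sketch of building such a request (the model name, prompt, and endpoint are illustrative assumptions, not sorting-hat's actual internals):

```python
import base64
import json
from pathlib import Path

def build_rename_request(image_path: str, model: str = "qwen3.5") -> dict:
    """Build an OpenAI-compatible chat payload asking a vision model to
    propose a descriptive filename for the image (illustrative sketch)."""
    image_bytes = Path(image_path).read_bytes()
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Suggest a short, descriptive, lowercase filename "
                             "for this image. Reply with the name only."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
    }

# POST this as JSON to <base_url>/chat/completions - for a local
# llama.cpp server, typically http://localhost:8080/v1/chat/completions.
```

Any server that accepts this message shape, cloud or local, can back the tool.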

Why It Matters

Digital photography generates massive volumes of generically-named files that become increasingly difficult to organize over time. Professional photographers, content creators, and anyone managing large image libraries face the tedious task of manual organization. Vision models have reached a capability threshold where they can reliably identify image content, but practical tools bridging this capability to everyday file management tasks remain scarce.

The live reasoning display serves a dual purpose beyond entertainment value. For developers experimenting with different vision models, observing the reasoning trace provides immediate feedback about model behavior, accuracy, and decision-making patterns. This visibility helps when comparing model performance or debugging unexpected naming choices. Teams evaluating vision models for production use gain insight into how different parameter sizes affect reasoning quality and processing speed.

The OpenAI-compatible API approach removes vendor lock-in. Organizations already running local inference servers can integrate this functionality without additional API costs or privacy concerns about sending images to external services. Research teams working with custom vision models can test them against real-world file organization tasks using familiar tooling.

Getting Started

Clone the repository from https://github.com/marksverdhei/sorting-hat to begin.
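For example (standard git workflow; the dependency-install step is an assumption - check the repository README for the actual setup):

```shell
git clone https://github.com/marksverdhei/sorting-hat
cd sorting-hat
# Assumed: a Python project with a requirements file; see the README
pip install -r requirements.txt
```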

The tool requires an OpenAI-compatible endpoint. For local inference with Qwen3.5 models, llama.cpp provides a straightforward server option. Point the configuration to the API endpoint, specify the target directory containing images, and launch the renaming process.

Configuration typically involves setting the base URL for the vision model API, selecting which model variant to use, and defining any naming preferences or constraints. The repository documentation includes examples for common setups including local llama.cpp servers and cloud API providers.
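As a hypothetical example of what such a setup might look like (the variable names, flags, and model identifier below are illustrative; sorting-hat's actual option names may differ - consult the repository documentation):

```shell
# Point the tool at a local llama.cpp server exposing the OpenAI API
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed-for-local"  # local servers typically ignore this

# Illustrative invocation: choose a model variant and the directory to rename
sorting-hat --model qwen3.5-4b --dir ~/Pictures/unsorted
```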

During execution, the terminal displays each image filename alongside the model’s reasoning trace. For Qwen3.5 models, this shows the step-by-step analysis: identifying the main subject, noting relevant details, considering appropriate descriptive terms, and formulating the final filename. Processing speed varies based on model size and hardware - smaller 0.8b models process faster but may produce less nuanced descriptions compared to 27b variants.
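Whatever the model returns still has to become a valid, collision-free filename before the rename happens. A small stdlib sketch of the kind of post-processing this step involves (an assumed cleanup routine, not sorting-hat's actual code):

```python
import re
from pathlib import Path

def to_safe_filename(description: str, directory: Path, ext: str = ".jpg") -> Path:
    """Turn a model-generated description into a safe, unique path."""
    # Lowercase, keep alphanumerics, collapse everything else to hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-") or "image"
    candidate = directory / f"{slug}{ext}"
    # On collision, append a counter: golden-retriever-2.jpg, -3.jpg, ...
    counter = 2
    while candidate.exists():
        candidate = directory / f"{slug}-{counter}{ext}"
        counter += 1
    return candidate
```

The actual rename would then be a single `image_path.rename(to_safe_filename(...))` per file.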

Context

Traditional batch renaming tools rely on metadata, timestamps, or pattern matching. ExifTool extracts camera metadata for systematic renaming, while utilities like Bulk Rename Utility offer rule-based transformations. These approaches work well for standardization but cannot describe image content.

Cloud services like Google Photos and Apple Photos use vision models for search and organization but keep the functionality locked within their ecosystems. Sorting-hat brings similar capabilities to local workflows with full transparency into the decision process.

The main limitation involves processing time. Vision models require significantly more computation than metadata extraction, making this approach practical for occasional organization tasks rather than real-time workflows. Accuracy depends entirely on the chosen model’s vision capabilities - smaller models may misidentify complex scenes or specialized content.

Alternative approaches include training custom classifiers for specific image categories or using CLIP-based models for similarity-based organization. However, these require additional setup and don’t provide the natural language flexibility of generative vision models.