Open Source
DINOv3: Self-supervised learning for vision at unprecedented scale
August 14, 2025
Takeaways:
- We’re introducing DINOv3, which scales self-supervised learning for images to create universal vision backbones that achieve absolute state-of-the-art performance across diverse domains, including web and satellite imagery.
- DINOv3 backbones produce powerful, high-resolution image features that make it easy to train lightweight adapters. This leads to exceptional performance on a broad array of downstream vision tasks, including image classification, semantic segmentation, and object tracking in video.
- We’ve incorporated valuable community feedback, enhancing the versatility of DINOv3 by shipping smaller models that outperform comparable CLIP-based derivatives across a broad evaluation suite, as well as alternative ConvNeXt architectures for resource-constrained use cases.
- We’re releasing the DINOv3 training code and pre-trained backbones under a commercial license to help drive innovation and advancements in the computer vision and multimodal ecosystem.
Self-supervised learning (SSL)—the concept that AI models can learn independently without human supervision—has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.
Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks, including object detection and semantic segmentation.
DINOv3’s breakthrough performance is driven by innovative SSL techniques that eliminate the need for labeled data—drastically reducing the time and resources required for training and enabling us to scale training data to 1.7B images and model size to 7B parameters. This label-free approach enables applications where annotations are scarce, costly, or impossible.
For example, our research shows that DINOv3 backbones pre-trained on satellite imagery achieve exceptional performance on downstream tasks such as canopy height estimation.
We believe DINOv3 will help accelerate existing use cases and also unlock new ones, leading to advancements in industries such as healthcare, environmental monitoring, autonomous vehicles, retail, and manufacturing—enabling more accurate and efficient visual understanding at scale.
We’re releasing DINOv3 with a comprehensive suite of open sourced backbones under a commercial license, including a satellite backbone trained on MAXAR imagery. We’re also sharing a subset of our downstream evaluation heads, enabling the community to reproduce our results and build upon them. Additionally, we’re providing sample notebooks so the community has detailed documentation to help them start building with DINOv3 today.
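To make getting started concrete, here is a minimal sketch of loading one of the released backbones and computing an image-level embedding with PyTorch. The Hub repo path and the `dinov3_vitl16` entrypoint follow the DINOv2 naming convention and are assumptions here; check the official DINOv3 repository and sample notebooks for the exact identifiers.

```python
# Minimal sketch: load a released DINOv3 backbone and compute a global image
# embedding. The hub path and "dinov3_vitl16" entrypoint follow the DINOv2
# naming convention and are assumptions; see the DINOv3 repo for exact names.
import torch
from PIL import Image
from torchvision import transforms

backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")  # assumed entrypoint
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open("example.jpg").convert("RGB")
with torch.inference_mode():
    embedding = backbone(preprocess(image).unsqueeze(0))  # (1, embed_dim)
print(embedding.shape)
```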
Unlocking high-impact applications with self-supervised learning
DINOv3 achieves a new milestone by demonstrating, for the first time, that SSL models can outperform their weakly supervised counterparts across a wide range of tasks.
While previous DINO models set a significant lead in dense prediction tasks, such as segmentation and monocular depth estimation, DINOv3 surpasses these accomplishments.
Our models match or exceed the performance of the strongest recent models such as SigLIP 2 and Perception Encoder on many image classification benchmarks, and at the same time, they drastically widen the performance gap for dense prediction tasks.
DINOv3 builds on the breakthrough DINO algorithm, requiring no metadata input, consuming only a fraction of the training compute compared to prior methods, and still delivering exceptionally strong vision foundation models.
The novel refinements introduced in DINOv3 lead to state-of-the-art performance on competitive downstream tasks such as object detection under the severe constraint of frozen weights. This eliminates the need for researchers and developers to fine-tune the model for specific tasks, enabling broader and more efficient application.
Finally, because the DINO approach is not specifically tailored to any image modality, the same algorithm can be applied beyond web imagery to other domains where labeling is prohibitively difficult or expensive. DINOv2 already leverages vast amounts of unlabeled data to support diagnostic and research efforts in histology, endoscopy, and medical imaging. In satellite and aerial imagery, the overwhelming volume and complexity of data make manual labeling impractical.
With DINOv3, we make it possible for these rich datasets to be used to train a single backbone that can then be used across satellite types, enabling general applications in environmental monitoring, urban planning, and disaster response.
DINOv3 is already having real-world impact.
The World Resources Institute (WRI) is using our latest model to monitor deforestation and support restoration, helping local groups protect vulnerable ecosystems. WRI uses DINOv3 to analyze satellite images and detect tree loss and land-use changes in affected ecosystems. The accuracy gains from DINOv3 support automating climate finance payments by verifying restoration outcomes, reducing transaction costs, and accelerating funding to small, local groups.
For example, compared to DINOv2, DINOv3 trained on satellite and aerial imagery reduces the average error in measuring tree canopy height in a region of Kenya from 4.1 meters to 1.2 meters. WRI is now able to scale support for thousands of farmers and conservation projects more efficiently.
Scalable and efficient visual modeling without fine-tuning
We built DINOv3 by training a 7x larger model on a 12x larger dataset than its predecessor, DINOv2. To showcase the model’s versatility, we evaluate it across 15 diverse visual tasks and more than 60 benchmarks. The DINOv3 backbone particularly shines on all dense prediction tasks, showing an exceptional understanding of the scene layout and underlying physics.
The rich, dense features capture measurable attributes or characteristics of each pixel in an image and are represented as vectors of floating-point numbers. These features are capable of parsing objects into finer parts, even generalizing across instances and categories. This dense representation power makes it easy to train lightweight adapters with minimal annotations on top of DINOv3, meaning a few annotations and a linear model are sufficient to obtain robust dense predictions.
Pushing things further and using a more sophisticated decoder, we show that it’s possible to achieve state-of-the-art performance on long-standing core computer vision tasks without fine-tuning the backbone.
We show such results on object detection, semantic segmentation, and relative depth estimation.
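As an illustration of the "few annotations plus a linear model" recipe, the sketch below trains a per-patch linear probe for semantic segmentation on top of frozen features. The feature-extraction call (a DINOv2-style `forward_features` returning normalized patch tokens) and the dimensions are assumptions, not the released evaluation heads.

```python
# Sketch: per-patch linear probe for segmentation on frozen DINOv3 features.
# Assumes a DINOv2-style API exposing normalized patch tokens of shape
# (B, (H/P)*(W/P), D); adapt the extraction call to the actual DINOv3 API.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, NUM_CLASSES, PATCH = 1024, 21, 16  # illustrative values

probe = nn.Conv2d(EMBED_DIM, NUM_CLASSES, kernel_size=1)  # the "linear model"
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def train_step(backbone, images, masks):
    """images: (B, 3, H, W); masks: (B, H, W) integer class labels."""
    with torch.no_grad():  # the backbone stays frozen throughout
        out = backbone.forward_features(images)   # assumed API
        tokens = out["x_norm_patchtokens"]        # assumed key, (B, N, D)
    b, n, d = tokens.shape
    h, w = images.shape[-2] // PATCH, images.shape[-1] // PATCH
    grid = tokens.transpose(1, 2).reshape(b, d, h, w)
    logits = F.interpolate(probe(grid), size=masks.shape[-2:], mode="bilinear")
    loss = F.cross_entropy(logits, masks)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Only the 1x1 convolution (a per-patch linear classifier) is trained, which is why a handful of annotated images can be enough.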
Because state-of-the-art results can be achieved without fine-tuning the backbone, a single forward pass can serve multiple applications simultaneously.
This enables the inference cost of the backbone to be shared across tasks, which is especially critical for edge applications that often require running many predictions at once.
DINOv3’s versatility and efficiency make it the perfect candidate for such deployment scenarios, as demonstrated by NASA’s Jet Propulsion Laboratory (JPL), which is already using DINOv2 to build exploration robots for Mars, enabling multiple vision tasks with minimal compute.
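A minimal sketch of this cost-sharing pattern: run the frozen backbone once, then feed the same features to several lightweight heads. The feature-extraction helper below stands in for the real DINOv3 API (assumed to be DINOv2-style), and the head shapes are illustrative.

```python
# Sketch: one frozen-backbone forward pass shared by several lightweight heads,
# so the expensive computation is amortized across tasks.
import torch
import torch.nn as nn

def extract_features(backbone, images):
    # Stand-in for the real DINOv3 feature call (assumed DINOv2-style API).
    out = backbone.forward_features(images)
    cls_token = out["x_norm_clstoken"]       # (B, D) global token
    tokens = out["x_norm_patchtokens"]       # (B, N, D) dense patch tokens
    b, n, d = tokens.shape
    h = w = int(n ** 0.5)                    # assumes a square patch grid
    return cls_token, tokens.transpose(1, 2).reshape(b, d, h, w)

class MultiTaskHeads(nn.Module):
    def __init__(self, dim, num_classes, num_seg_classes):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)        # image-level label
        self.segmenter = nn.Conv2d(dim, num_seg_classes, 1)  # per-patch classes
        self.depth = nn.Conv2d(dim, 1, 1)                    # per-patch relative depth

    def forward(self, cls_token, patch_grid):
        return {
            "class_logits": self.classifier(cls_token),
            "seg_logits": self.segmenter(patch_grid),
            "depth": self.depth(patch_grid),
        }

def infer(backbone, heads, images):
    with torch.inference_mode():
        cls_token, grid = extract_features(backbone, images)  # one backbone pass
        return heads(cls_token, grid)                         # many cheap heads
```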
A family of deployment-friendly models
Scaling DINOv3 to 7B parameters shows SSL’s full potential. However, a 7B model is impractical for many downstream applications. Following feedback from the community, we built a family of models spanning a large range of inference compute requirements to empower researchers and developers across diverse use cases.
By distilling the ViT-7B model into smaller, high-performing variants like ViT-B and ViT-L, DINOv3 outperforms comparable CLIP-based models across a broad evaluation suite.
Additionally, we introduce alternative ConvNeXt architectures (T, S, B, L) distilled from ViT-7B that can accommodate varying compute constraints. We’re also releasing our distillation pipeline to enable the community to build upon this foundation.
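For intuition, here is a generic feature-matching distillation step in the spirit described above: a small student (plus a projection layer to align dimensions) is trained to reproduce the frozen teacher’s embeddings. This is a hedged sketch under assumed dimensions, not the released pipeline; the cosine objective is a common generic choice.

```python
# Generic feature-matching distillation step (a sketch, not the released
# pipeline): the student + projector are trained so their embedding matches
# the frozen teacher's embedding under a cosine objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(teacher, student, projector, images, opt):
    with torch.no_grad():
        t_feat = teacher(images)             # frozen large-teacher embedding
    s_feat = projector(student(images))      # project student dim -> teacher dim
    loss = (1 - F.cosine_similarity(s_feat, t_feat, dim=-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Illustrative wiring (dimensions are assumptions, e.g. a 768-d student
# distilled against a 4096-d teacher):
#   projector = nn.Linear(768, 4096)
#   opt = torch.optim.AdamW(
#       list(student.parameters()) + list(projector.parameters()), lr=1e-4)
```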