Paraphrasing Evades Detectors of AI-Generated Text, but Retrieval Is an Effective Defense

https://arxiv.org/pdf/2303.13408.pdf

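Abstract

To detect malicious uses of large language models (for example, fake content creation or academic plagiarism), several approaches have recently been proposed that identify AI-generated text through watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B-parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally using surrounding text (for example, user-written prompts) as context. DIPPER also exposes scalar knobs that control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%) without appreciably modifying the input semantics.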

Selected figures

Figure 1: An overview of paraphrasing attacks with DIPPER on watermarked text (Kirchenbauer et al., 2023). The original model generation (top) contains several “green” watermarked tokens that are counted by a detector to judge whether the text was written by an AI. After paraphrasing, several of these green tokens are replaced with approximately semantically-equivalent red tokens, thereby fooling the detector (actual outputs from a watermarked version of GPT2-XL and our paraphraser DIPPER shown).

Figure 2: An illustration of the method used to train DIPPER on English translations of the French novel The Nun. We first align sentences between the two translations to create parallel data. Next, a subset of the alignments are chosen; in this example, we use (p2, q2) and (p3, q3q4). We shuffle sentences, compute control codes, and finally fine-tune a T5-XXL LM to generate p2p3 given q3q4q2 and the context p1 and p4.
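The green-token counting described in the Figure 1 caption can be sketched as a one-proportion z-test. The snippet below is a minimal illustration, not the detector of Kirchenbauer et al. (2023): the hash-based `is_green` rule, the green-list fraction `GAMMA = 0.5`, and the use of pre-split string tokens are all assumptions made for the example.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list


def is_green(prev_token: str, token: str, gamma: float = GAMMA) -> bool:
    """Toy green-list rule (an assumption): hash the (previous, current) pair to [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def green_count(tokens: list[str], gamma: float = GAMMA) -> int:
    """Count tokens that fall on the green list seeded by their predecessor."""
    return sum(is_green(prev, tok, gamma) for prev, tok in zip(tokens, tokens[1:]))


def z_score(num_green: int, num_scored: int, gamma: float = GAMMA) -> float:
    """z-statistic against the null hypothesis that each token is green with probability gamma."""
    return (num_green - gamma * num_scored) / math.sqrt(num_scored * gamma * (1 - gamma))


# A text whose 200 scored tokens are all green gives a large z-score; once a
# paraphrase swaps many of them for red tokens, the statistic falls toward zero.
print(z_score(200, 200))
print(z_score(104, 200))
```

A detector built this way would flag text whose z-score exceeds a chosen threshold, which is why replacing green tokens with red ones lowers the chance of detection.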

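Conclusion

We present DIPPER, a paraphrase generation model that can rewrite multiple sentences of text and optionally use surrounding context. We use DIPPER to stress test current detectors of AI-generated text and find that DIPPER paraphrases easily evade them while approximately preserving the input semantics. To defend against such paraphrase attacks, we propose a simple retrieval-based mechanism that searches a corpus of sequences previously generated by the LLM API for content semantically similar to a given query. Our experiments show that this retrieval defense significantly outperforms baseline detectors on paraphrased text and remains effective at scale. We also discuss possible limitations of our defense, and we open-source our pretrained models, code, and data so that the research community can build on these ideas.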
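The retrieval defense described above can be sketched as a store of past generations plus a similarity lookup. This is a minimal sketch under stated assumptions: the class name `GenerationStore`, the 0.6 threshold, and the bag-of-words cosine similarity are hypothetical stand-ins. A real deployment would need a retriever that captures semantic similarity, which simple word overlap does only crudely.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a crude stand-in for a semantic encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class GenerationStore:
    """Hypothetical database of every sequence an LLM API has produced."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold  # assumed decision threshold
        self._vectors: list[Counter] = []

    def add(self, generated_text: str) -> None:
        self._vectors.append(bow(generated_text))

    def best_match(self, query: str) -> float:
        q = bow(query)
        return max((cosine(q, v) for v in self._vectors), default=0.0)

    def is_probably_generated(self, query: str) -> bool:
        return self.best_match(query) >= self.threshold


store = GenerationStore()
store.add("The river flooded the valley after three days of heavy rain.")

paraphrase = "After three days of heavy rain, the river flooded the whole valley."
unrelated = "Quarterly revenue grew because of strong demand for laptops."
print(store.is_probably_generated(paraphrase), store.is_probably_generated(unrelated))
```

The design point is that detection no longer depends on token-level statistics of the query, only on whether something close to it was generated before.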

Make-It-3D: High-Fidelity 3D Creation from a Single Image with Diffusion Prior

https://arxiv.org/pdf/2303.14184.pdf

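Abstract

In this work, we investigate the problem of creating high-fidelity 3D content from only a single image. This is inherently challenging: it essentially involves estimating the underlying 3D geometry while simultaneously hallucinating unseen textures. To address this challenge, we leverage prior knowledge from a well-trained 2D diffusion model to act as 3D-aware supervision for 3D creation. Our approach, Make-It-3D, employs a two-stage optimization pipeline: the first stage optimizes a neural radiance field by incorporating constraints from the reference image at the frontal view and the diffusion prior at novel views; the second stage transforms the coarse model into textured point clouds and further elevates realism with the diffusion prior while leveraging the high-quality textures from the reference image. Extensive experiments demonstrate that our method outperforms prior work by a large margin, yielding impressive visual quality. Our method presents the first attempt to achieve high-quality 3D creation from a single image for general objects and enables various applications such as text-to-3D creation and texture editing.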

Selected figures

Figure 1: Make-It-3D can create high-fidelity 3D content from only a single image. We show the normal map and novel-view renderings of created 3D content, showcasing fine geometry and faithful textures with stunning quality at novel views.

Figure 2: Overview architecture. We propose a two-stage framework for creating a high-quality 3D model from a reference image with diffusion prior (Sec. 3.1). At the coarse stage, we optimize a NeRF for reconstructing the geometry of the reference image (Sec. 3.2). We further build textured point clouds from NeRF and the reference image, and jointly optimize the texture of invisible points and a learnable deferred renderer to generate realistic and view-consistent textures (Sec. 3.3).

Figure 4: 360° object reconstruction from real images

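Conclusion

We introduce Make-It-3D, a novel two-stage method for creating high-fidelity 3D content from a single image. Using the diffusion prior as 3D-aware supervision, together with a diffusion CLIP loss and textured point cloud enhancement, the generated 3D models exhibit high-fidelity geometry and realistic textures. Make-It-3D works for general objects and enables a variety of interesting applications. We believe our method takes a big step toward extending the success of 2D content creation to 3D, giving users a brand-new 3D creation experience.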

ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks

https://arxiv.org/pdf/2303.15056.pdf

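Abstract

Many NLP applications require manual data annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on their size and complexity, these tasks may be carried out by crowd workers on platforms such as MTurk or by trained annotators such as research assistants. Using a sample of 2,382 tweets, we demonstrate that ChatGPT outperforms crowd workers on several annotation tasks, including relevance, stance, topic, and frame detection. Specifically, the zero-shot accuracy of ChatGPT exceeds that of crowd workers on 80% of the tasks (four of the five), while ChatGPT's intercoder agreement exceeds that of both crowd workers and trained annotators on all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003, about twenty times cheaper than MTurk. These results show the potential of large language models to drastically increase the efficiency of text classification.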

Selected figures

Figure 1: ChatGPT zero-shot text annotation performance, compared to MTurk and trained annotators. ChatGPT’s accuracy outperforms that of MTurk for four of the five tasks. ChatGPT’s intercoder agreement outperforms that of both MTurk and trained annotators in all tasks. Accuracy means agreement with the trained annotators.
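The two metrics in the caption can both be computed as a share of matching labels. The sketch below assumes that intercoder agreement is simple percent agreement between two coders labeling the same items (the summary does not say which agreement statistic is used), and all labels shown are made-up toy data.

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of items on which two label sequences agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy data (hypothetical): reference labels from trained annotators and two
# independent ChatGPT annotation runs over the same five tweets.
trained = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
chatgpt_run_1 = ["relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
chatgpt_run_2 = ["relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]

# Accuracy, as defined in the caption: agreement with the trained annotators.
accuracy = percent_agreement(chatgpt_run_1, trained)
# Intercoder agreement (assumed definition): agreement between two coders of the same kind.
intercoder = percent_agreement(chatgpt_run_1, chatgpt_run_2)
print(accuracy, intercoder)
```

Note that the two numbers measure different things: an annotator can be perfectly self-consistent and still disagree with the reference labels.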

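Conclusion

This paper demonstrates the potential of LLMs to transform text-annotation procedures for a variety of tasks common to many research projects. Although the evidence comes from a single dataset and a relatively limited number of tests, it suggests that LLMs may already be a superior approach compared with crowd annotations on platforms such as MTurk. At the very least, the findings demonstrate the importance of studying the text-annotation properties and capabilities of LLMs in more depth. The following questions and steps seem particularly promising: (i) the performance of ChatGPT across multiple languages; (ii) its performance across multiple types of text (social media, news media, legislation, speeches, etc.); (iii) implementing few-shot learning on ChatGPT, compared with fine-tuned models such as BERT and RoBERTa; (iv) constructing semi-automated data-labeling systems in which a model first learns by observing human annotations and is then used to recommend or even automate the labeling (Desmond et al., 2021); (v) using chain-of-thought prompting and other strategies to increase the performance of zero-shot reasoning (Kojima et al., 2022); and (vi) implementing the annotation tasks with GPT-4 as soon as availability permits.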

EVA-CLIP: Improved Training Techniques for CLIP at Scale

https://arxiv.org/pdf/2303.15389.pdf

Abstract

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4% zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community.

Selected figures

Figure 1: Summary of CLIP models’ ImageNet-1K zero-shot classification performance. The diameter of each circle corresponds to forward GFLOPs x the number of training samples.

Table 6: Training time and GPU memory. Training on 16 NVIDIA 40G-A100 GPUs with the DeepSpeed [43] ZeRO stage-1 optimizer [40] and gradient checkpointing [10]. The batch size is 32k.

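Conclusion

Table 6 (above) shows the memory and time cost of our implementation. As shown, masking 50% of the image tokens speeds up training by 2x, and using flash attention reduces training time by an additional 15%. With all these techniques, we can train EVA-CLIP on a lower budget than other counterpart CLIP models. For example, EVA-CLIP-B/16 can be trained with a batch size of 32k on 16 NVIDIA 40GB-A100 GPUs and converges within 300 hours. Similarly, the billion-scale EVA-CLIP g/14 can be trained with a batch size of 65k and takes less than 25 days to train on 12B samples using 64 NVIDIA 40GB-A100 GPUs. These results demonstrate the scalability and effectiveness of our method in achieving state-of-the-art results while maintaining an optimal balance between training time and GPU memory utilization.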
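The figures quoted above can be turned into back-of-the-envelope numbers. The sketch below assumes that the 15% flash-attention saving applies multiplicatively on top of the 2x masking speedup (one reading of "additional"), and it treats "less than 25 days" as an upper bound, so the throughput it prints is a lower bound. The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def relative_training_time(mask_speedup: float = 2.0, flash_reduction: float = 0.15) -> float:
    """Training time as a fraction of the baseline, assuming the two savings compose."""
    return (1.0 / mask_speedup) * (1.0 - flash_reduction)


def min_throughput(total_samples: float, max_days: float, num_gpus: int = 1) -> float:
    """Lower bound on samples per second (per GPU when num_gpus > 1)."""
    return total_samples / (max_days * SECONDS_PER_DAY) / num_gpus


print(relative_training_time())       # fraction of the baseline training time
print(min_throughput(12e9, 25))       # aggregate samples per second
print(min_throughput(12e9, 25, 64))   # samples per second per GPU
```

Under these assumptions the combined techniques cut training time to roughly 0.425 of the baseline, and the g/14 run implies at least about 5,600 samples per second overall, or about 87 per GPU.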

Text-to-Image Diffusion Models are Zero-Shot Classifiers

https://arxiv.org/pdf/2303.15233.pdf

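Abstract

The excellent generative capabilities of text-to-image diffusion models suggest that they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image, given a text description of a label, as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding, while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.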
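The key idea can be sketched with a toy model: score each label by how badly a label-conditioned denoiser explains a noised image, aggregate the scores over several noise levels, and pick the label with the lowest total. Everything model-specific here is an assumption made for illustration: `toy_predict_noise` is a hypothetical stand-in for a real diffusion model, each label is represented by a made-up prototype vector, and the squared noise-prediction error is one plausible choice of score.

```python
import math
import random


def add_noise(image, noise, t):
    """Forward process at noise level t in (0, 1): x_t = sqrt(1 - t) * x + sqrt(t) * eps."""
    return [math.sqrt(1 - t) * x + math.sqrt(t) * e for x, e in zip(image, noise)]


def toy_predict_noise(noisy, prototype, t):
    """Hypothetical label-conditioned denoiser: assumes the clean image equals the
    label's prototype and solves the forward process for the implied noise."""
    return [(n - math.sqrt(1 - t) * p) / math.sqrt(t) for n, p in zip(noisy, prototype)]


def score_matrix(image, prototypes, timesteps, seed=0):
    """Rows are labels, columns are timesteps; entries are squared noise-prediction errors."""
    rng = random.Random(seed)
    # One noise sample per timestep, shared across all labels.
    noises = [[rng.gauss(0.0, 1.0) for _ in image] for _ in timesteps]
    matrix = {}
    for label, proto in prototypes.items():
        row = []
        for t, eps in zip(timesteps, noises):
            pred = toy_predict_noise(add_noise(image, eps, t), proto, t)
            row.append(sum((p - e) ** 2 for p, e in zip(pred, eps)))
        matrix[label] = row
    return matrix


def classify(image, prototypes, timesteps, weights, seed=0):
    """Weight each label's scores over the timesteps and return the minimum-score label."""
    matrix = score_matrix(image, prototypes, timesteps, seed)
    totals = {label: sum(w * s for w, s in zip(weights, row)) for label, row in matrix.items()}
    return min(totals, key=totals.get), totals


prototypes = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
label, totals = classify([0.9, 0.2, 0.1], prototypes, timesteps=[0.2, 0.5, 0.8], weights=[1.0, 1.0, 1.0])
print(label)
```

Because every label is scored against the same noise samples, differences between rows of the matrix reflect the labels rather than the random draw.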

Selected figures

Figure 1. Zero-Shot Classification using Imagen. We first calculate scores for each label prompt across multiple time-steps to generate a scores matrix. We then classify an image by aggregating the scores for each class using a weighting function over the time-steps. The image is assigned the class with the minimum aggregate score. In Section 3.1, we discuss how efficiency can be improved by computing only a subset of the full scores matrix.

Figure 2. Comparison of efficiency improvements on CIFAR-100. Shared noise improves sample efficiency by roughly 100x and pruning by an additional 8-10x.

Figure 3. Examples of the synthetic-data attribute binding tasks. We explored more sophisticated prompts than in the figure (e.g., “A blender rendering of two objects, one of which is a yellow sphere.”), but they didn’t substantially change results.

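Conclusion

We have proposed a method that enables diffusion models to be used as zero-shot classifiers and developed ways of greatly improving its efficiency to make it usable. Our experiments with Imagen demonstrate strong results on image classification. Furthermore, we show that Imagen is remarkably robust to misleading textures, achieving state-of-the-art results on Stylized ImageNet. While existing analyses of diffusion models usually study generated images qualitatively, our framework provides a way to evaluate text-to-image generative models quantitatively, by testing them on controlled classification tasks. We showcase this through our study of attribute binding: we find that Imagen is sometimes able to bind attributes, while CLIP does not appear to have this ability. We hope our findings will inspire future work on using text-to-image diffusion models as foundation models for tasks other than generation. One direction is fine-tuning diffusion models on downstream tasks; given Imagen's strong zero-shot performance, a natural next step is to evaluate it after further supervised training.

Indeed, Brempong et al. (2022) have already explored a related idea, finding that denoising pre-training can improve semantic segmentation models. We note that our main comparison with CLIP in this work is not direct, because the model architectures, parameter counts, and training data differ. As models become larger, a key question is how the scaling laws (Hestness et al., 2017; Kaplan et al., 2020) of contrastive and generative pre-training compare, which we leave to future work. We are also interested in applying our analysis to other diffusion models to show that our results are not specific to Imagen. To this end, we are currently working to apply our method to Stable Diffusion (Rombach et al., 2022). We are additionally interested in applying our analysis to other generative models, and in studying to what extent our results are a consequence of generative pre-training in general rather than diffusion pre-training specifically. Ultimately, our method does not produce a practical classifier, because it requires substantial compute when scoring many classes. Instead, we see the main value of this work as revealing more about the capabilities of large pre-trained diffusion models. Our results suggest that generative pre-training may be a useful alternative to contrastive pre-training for text-image self-supervised learning.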

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

https://arxiv.org/pdf/2303.15649.pdf

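Abstract

A significant research effort is currently focused on exploiting the remarkable capabilities of pretrained diffusion models for image editing. These works either fine-tune the model or invert the image into the latent space of the pretrained model. However, they suffer from two problems: (1) unsatisfying results in selected regions and unexpected changes in non-selected regions; (2) they require careful text-prompt editing, where the prompt should include all visual objects in the input image. To address this, we propose two improvements: (1) optimizing only the input of the value linear network in the cross-attention layers is sufficiently powerful to reconstruct a real image; (2) we propose attention regularization to preserve the object-like attention maps after editing, enabling us to obtain accurate style editing without invoking significant structural changes. We further improve the editing technique used for the unconditional branch of classifier-free guidance, as well as the conditional branch used by P2P [15]. Extensive experimental prompt-editing results on a variety of images demonstrate, qualitatively and quantitatively, that our method has superior editing capabilities compared with existing and concurrent works.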

Selected figures

Figure 1: Our method takes as input a real image (leftmost column) and an associated caption. We have more accurate editing capability than Null-text inversion [26]. We manipulate the inverted image using the editing technique P2P [15].

Figure 4: Overview of the proposed method. (a) DDIM inversion: the diffusion process is performed to generate the latent representations (ẑ_t, â_t), t = 1, ..., T, where ẑ_0 is set to be the encoding of the input real image x_0. c_0 is the textual embedding extracted by a CLIP text encoder with a given prompt. (b) The proposed method: we take the input image x_0 as input, and extract the textual embedding c_{t−1} = M_{t−1}(x_0), which is used to generate the value matrix v with the linear network Ψ_V. We freeze the input of the linear network Ψ_K with the given textual embedding c_0.
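The split described in the Figure 4 caption, with keys computed from the frozen prompt embedding and values computed from a separate learned embedding, can be illustrated with a small cross-attention function. This is a generic scaled dot-product attention sketch in plain Python, not the authors' implementation; the matrices, shapes, and the identity projections used for Ψ_K and Ψ_V are toy assumptions.

```python
import math


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, key_input, value_input, w_k, w_v):
    """Attention whose keys and values are projected from different embeddings."""
    keys = matmul(key_input, w_k)        # Psi_K applied to the frozen prompt embedding c_0
    values = matmul(value_input, w_v)    # Psi_V applied to the learned embedding
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    logits = matmul(queries, keys_t)
    attn = [softmax([v / scale for v in row]) for row in logits]
    return attn, matmul(attn, values)


identity = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]               # toy image features
c0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # toy prompt embedding (3 tokens)
learned = [[0.5, 0.5], [2.0, -1.0], [0.0, 1.0]]  # toy learned value-side embedding

attn_a, out_a = cross_attention(queries, c0, c0, identity, identity)
attn_b, out_b = cross_attention(queries, c0, learned, identity, identity)
print(attn_a == attn_b, out_a == out_b)
```

Changing only the value-side input leaves the attention map untouched while changing the output, which is the property that makes the value input a natural place to store image-specific information.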

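Conclusion

We propose a new method for real-image editing. We invert the real image into the input of the value linear mapping network in the cross-attention layers, and freeze the input of the key linear mapping network with the user-provided text embedding. This allows learning the initial attention maps and an approximate trajectory for reconstructing the real image. We introduce a new attention regularization to preserve the attention maps after editing, enabling more accurate editing. In addition, we propose attention injection in the unconditional branch of the classifier-free diffusion model, which further improves editing capability, especially when the source and target prompts have a large domain shift. Although StyleDiffusion succeeds in modifying real images, it still has limitations. When the object in the real image has a rare pose (Figure 8, left), or when the source and target prompts have a large semantic shift (Figure 8, right), our method fails to generate satisfying images.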

Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

https://arxiv.org/pdf/2303.08767.pdf

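Abstract

Diffusion models have shown superior performance in image generation and manipulation, but their inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches such as DreamBooth [16] and Textual Inversion [3] have proposed model or latent-representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding, obtained by decomposing the CLIP embedding space for personalization and content manipulation. Our method does not require model fine-tuning or identifiers, yet it can still manipulate background, texture, and motion with just a single image and a target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.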

Selected figures

Figure 1: Image manipulation results with highly personalized (HiPer) text embeddings. In the upper row, the identities of the rabbit and the dog are well preserved while adequately manipulating the images to align with target texts. In the bottom row, not only motion and background, but also texture of the source image is transformed towards corresponding target text.

Figure 2: The proposed method. (Training) First, the source text prompt, which has the meaning of the source image, is converted to a text embedding. Some parts of the text embedding, which have no information, are removed. The informative target embedding part and the personalized embedding are concatenated, and they are the input of the pre-trained U-net. In training, only the personalized embedding is optimized. Although this figure depicts it as learning in image space, the embedding is actually optimized in latent space. (Inference) The target embedding is also cropped and concatenated with the personalized embedding. The pre-trained text-to-image model, conditioned on that embedding, generates an image which has the meaning of the target text and the subject of the source image.

Figure 3: Cross Attention maps in the final timestep of text-to-image diffusion models. The source text is “a standing dog” and the target text is “a sitting dog”. Cross Attention maps (a) conditioned with esrc, (b) conditioned with [e0src, ehper], (c) conditioned with [e0tgt, ehper]. (d) Cross attention maps by Imagic [9].

Figure 6: Text-driven image manipulation results featuring a female doctor.
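The crop-and-concatenate step described in the Figure 2 caption can be sketched with plain lists. This is an illustration under stated assumptions, not the authors' code: it assumes the informative token vectors sit at the front of the prompt embedding, that the overall sequence length stays fixed, and it uses made-up toy sizes (6 tokens of dimension 2, with 2 personalized tokens). The function name `build_conditioning` is hypothetical.

```python
def build_conditioning(text_embedding, personalized, keep):
    """Keep the first `keep` token vectors of a prompt embedding and append the
    personalized token vectors, so the sequence length stays unchanged."""
    if keep + len(personalized) != len(text_embedding):
        raise ValueError("keep + len(personalized) must equal the sequence length")
    return [list(v) for v in text_embedding[:keep]] + [list(v) for v in personalized]


# Toy embeddings (hypothetical values): 6 token vectors of dimension 2.
source_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.0, 0.0], [0.0, 0.0]]
target_embedding = [[0.1, 0.2], [0.9, 0.1], [0.5, 0.6], [0.7, 0.8], [0.0, 0.0], [0.0, 0.0]]
# The only trainable part; in training its values would be updated by the optimizer.
personalized = [[0.11, -0.05], [0.02, 0.30]]

training_input = build_conditioning(source_embedding, personalized, keep=4)
inference_input = build_conditioning(target_embedding, personalized, keep=4)
print(len(training_input), len(inference_input))
```

Training would condition on the source prompt and update only `personalized`; inference reuses the same learned vectors with the target prompt, which is how the subject carries over to the edited image.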

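Conclusion

We propose a simple yet powerful method for highly personalized text-to-image generation with Stable Diffusion. Given only a single image, our method generates highly personalized text tokens that are remarkably good at preserving the characteristics of the subject. Moreover, our method does not require model fine-tuning or complex loss functions. These properties let us manipulate images quickly and easily with a simple optimization process that takes just three minutes. We also demonstrate the capabilities of our method by showing image-editing results along three dimensions: motion, background, and texture.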

Summaries of Last Week's Important Papers, 2023-04-03
