Concept Graph & Summary using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Summary:
1.- Watermarking for language models: Embedding signals into generated text that are invisible to humans but algorithmically detectable.
2.- Green/red list: Randomly partitioning vocabulary into "green" (allowed) and "red" (discouraged) tokens for each generation step.
3.- Soft watermarking: Adding a constant δ to logits of green list tokens, adaptively enforcing watermark based on text entropy.
4.- Detection via z-statistic: Using proportion of green tokens to detect watermark with interpretable p-values.
5.- Spike entropy: Measure of distribution spread, useful for analyzing watermark strength.
6.- Watermark strength vs text quality tradeoff: Stronger watermarks may distort generated text.
7.- Beam search synergy: Using beam search amplifies watermark while maintaining text quality.
8.- Public/private watermarking: Allows transparency and independent verification while maintaining stronger private detection.
9.- Watermark robustness: Difficult to remove without significantly modifying text or degrading quality.
10.- Low entropy challenges: Watermark less effective on highly deterministic text sequences.
11.- Multiple watermarks: Applying several watermarks simultaneously for flexibility and stronger detection.
12.- Selective watermarking: Activating watermark in response to suspicious API usage.
13.- Paraphrasing attacks: Attempts to remove watermark through manual or automated rephrasing.
14.- Tokenization attacks: Modifying text to change sub-word tokenization and impact hash computation.
15.- Homoglyph attacks: Substituting Unicode characters that look identical to standard ones in order to alter tokenization.
16.- Generative attacks: Prompting model to change output in predictable, reversible ways (e.g. emoji insertion).
17.- Canonicalization: Normalizing text before watermark testing to defend against certain attacks.
18.- Impact on factuality: Soft watermarking has minimal effect on model's factual accuracy.
19.- Watermark discovery: Difficulty of detecting watermark presence solely through text analysis.
20.- Perplexity impact: Theoretical bound on how watermarking affects model perplexity.
21.- Private mode: Using secret random key for watermarking, hosted behind secure API.
22.- False positive/negative tradeoffs: Balancing watermark detection accuracy and error rates.
23.- Watermark parameters: Effects of green list size (γ) and logit boost (δ) on watermark strength.
24.- Sequence length impact: Longer sequences allow for stronger watermark detection.
25.- Entropy-based adaptation: Watermark strength varies based on text predictability.
26.- API cost considerations: Some attacks increase token usage, raising costs for attackers.
27.- Negative example training: Potential defense against certain attacks through model fine-tuning.
28.- Repeated n-gram handling: Ignoring repeated phrases to improve watermark sensitivity.
29.- Oracle model evaluation: Using larger model to assess perplexity of watermarked text.
30.- Theoretical analysis: Mathematical framework for understanding watermark behavior and detection sensitivity.
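The green/red partition and soft watermark (items 2-3 above) can be sketched as follows. This is a minimal illustration, not the published implementation: the hash scheme, function names, and the choice of seeding on only the single previous token are assumptions made for clarity.

```python
import hashlib
import random

def green_list(prev_token_id, vocab_size, gamma=0.5):
    """Seed an RNG on the previous token and partition the vocabulary.

    The first gamma fraction of the shuffled vocabulary forms the 'green'
    (favored) list; the remainder is 'red'. Illustrative sketch only:
    the real scheme's hash and context window may differ.
    """
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def soft_watermark_logits(logits, prev_token_id, delta=2.0, gamma=0.5):
    """Soft watermark: add a constant delta to the logits of green tokens.

    High-entropy steps are nudged toward green tokens, while a strongly
    peaked (low-entropy) distribution still wins, which is what makes the
    rule adaptive (items 3 and 25)."""
    greens = green_list(prev_token_id, len(logits), gamma)
    return [l + delta if i in greens else l for i, l in enumerate(logits)]
```

Because the green list is recomputed from the preceding token at every step, a detector that knows the seeding rule can replay the partition without access to the model.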
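Detection via the z-statistic (item 4) is a one-proportion z-test: under the null hypothesis of unwatermarked text, each token lands in the green list with probability γ. A minimal sketch, assuming T scored tokens:

```python
import math

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """z = (green_count - gamma*T) / sqrt(T * gamma * (1 - gamma)).

    Under H0 (no watermark) green_count ~ Binomial(T, gamma), so large z
    means far more green tokens than chance predicts."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z):
    """P(Z >= z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

A threshold like z > 4 corresponds to a one-sided false-positive rate of roughly 3e-5, which is what makes the p-values interpretable; longer sequences (item 24) accumulate more green tokens and push z higher.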
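Spike entropy (item 5) measures how spread out the next-token distribution is. A sketch of the quantity used in the paper's analysis, S(p, z) = Σ_k p_k / (1 + z·p_k), where the scalar z here is a modulus parameter (an overloaded symbol, unrelated to the detection z-statistic):

```python
def spike_entropy(probs, modulus):
    """Spike entropy: sum over k of p_k / (1 + modulus * p_k).

    Maximal for a uniform distribution and minimal for a one-hot (spiky)
    one, so low values flag the deterministic, low-entropy text where the
    soft watermark is weakest (item 10)."""
    return sum(p / (1 + modulus * p) for p in probs)

# A uniform distribution over N tokens scores N / (N + modulus), while a
# fully deterministic (one-hot) distribution scores 1 / (1 + modulus).
```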
Knowledge Vault built by David Vivancos 2024