Risk Concerns and Safety Alignment Research
- On the Opportunities and Risks of Foundation Models, Bommasani et al, 2021
- Sociodemographic Bias in Language Models: A Survey and Forward Path, Gupta et al, 2024
- This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models, Goldfarb-Tarrant et al, 2023
- The Woman Worked as a Babysitter: On Biases in Language Generation, Sheng et al, 2019
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, Gehman et al, 2020
- Scalable Extraction of Training Data from (Production) Language Models, Nasr et al, 2023
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To, Qi et al, 2023
Technical Solutions
- Constitutional AI: Harmlessness from AI Feedback, Bai et al, 2022b
- Collective Constitutional AI: Aligning a Language Model with Public Input, Huang et al, 2024b
- Scalable Agent Alignment via Reward Modeling: A Research Direction, Leike et al, 2018
- Safe and Responsible Large Language Model Development, Raza et al, 2024
- AI Alignment: A Comprehensive Survey, Ji et al, 2024
Ethical Frameworks
- Aligning AI with Shared Human Values, Hendrycks et al, 2023
- Deconstructing the Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas, Deng et al, 2024a
Jailbreaking
- Low-Resource Languages Jailbreak GPT-4, Yong et al, 2024
- Jailbreak Paradox: The Achilles’ Heel of LLMs, Rao et al, 2024
Bias and Fairness
- Bias and Fairness in Large Language Models: A Survey, Gallegos et al, 2024
- Evaluating Metrics for Bias in Word Embeddings, Schröder et al, 2022
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, Han et al, 2024
Miscellaneous
- Information Theory: A Tutorial Introduction, Stone, 2019