Department of Computational Linguistics

Increasing Information Accessibility Through Text-to-Image Alignments with LLMs and Diffusion Models

Introduction

Automatic Text Simplification (ATS) is a process that transforms linguistically complex text into a simpler version while preserving its original meaning. This transformation is crucial for making textual content accessible to specific populations, such as individuals with cognitive impairments. The core of ATS involves using a transformation function 𝑓, which maps an original text 𝑆 to its simplified version, 𝑆′. This function aims to maximize a user-specific utility function, predominantly focusing on enhancing information accessibility.
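
One way to write down the objective described above is sketched here. The symbols 𝑓, 𝑆 and 𝑆′ come from the text; the utility function U_u for a user u and the family of candidate transformations F are notation assumed for illustration only.

```latex
S' = f(S), \qquad
f^{*} = \arg\max_{f \in \mathcal{F}} \; U_{u}\bigl(S, f(S)\bigr)
```

Here U_u takes both the original and the simplified text as arguments so that it can reward accessibility while penalizing loss of the original meaning.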

Problem Statement

Despite significant progress driven by advances in Large Language Models (LLMs), the potential of ATS to increase information accessibility has not been fully explored. In human simplification practice, images are often added alongside the simplified text (see examples here). Recent developments in diffusion models, which excel at generating detailed images from textual prompts, make it possible to include images in ATS.

Challenges

A key challenge in implementing ATS with any language model is the risk of hallucination, i.e. the simplification may introduce content that is factually incorrect or unverifiable relative to the original text. Such inaccuracies are detrimental: they can cause semantic drift and misinterpretation of the intended information, and they complicate the evaluation of ATS models. Maintaining faithfulness and factual accuracy is thus crucial, especially when supporting populations with reading disabilities.

Research Questions

The primary objective of this project is to develop a multimodal ATS system that enhances information accessibility through the synergistic integration of textual and visual information. This project aims to address several key research questions:

  • Integration Feasibility: Can diffusion models be effectively combined with Large Language Model (LLM)-based ATS systems to augment text with corresponding visual content?
  • Metrics Development: How can we accurately measure information accessibility within a multimodal framework, ensuring that the combined visual and textual outputs effectively enhance comprehension?
  • Multi-Modal Alignment: What methods are most effective for generating visual representations from textual counterparts? Additionally, how can we align text with images to detect hallucinations in text simplifications?
  • Visual Augmentation: What density of visual information is appropriate (e.g. how many images maximize the utility of the visual information and thus lead to the best information accessibility)?
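
The multi-modal alignment question above could be prototyped roughly as follows. This is a minimal sketch, not the project's method: it assumes that sentences and images have already been embedded into a shared vector space (for example by a CLIP-style encoder, which is not shown), and the function names, the toy vectors and the 0.5 threshold are all hypothetical. A sentence whose best-matching image falls below the threshold is flagged for closer inspection.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences_to_images(
    sentence_vecs: Sequence[Sequence[float]],
    image_vecs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, would need tuning on real data
) -> List[Tuple[int, float, bool]]:
    """For each sentence, return (best image index, similarity, flagged)."""
    if not image_vecs:
        raise ValueError("at least one image embedding is required")
    results = []
    for sentence in sentence_vecs:
        sims = [cosine(sentence, image) for image in image_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        # Flag sentences that no image supports well enough.
        results.append((best, sims[best], sims[best] < threshold))
    return results


# Toy 3-dimensional embeddings standing in for real encoder outputs.
sentence_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
alignment = align_sentences_to_images(sentence_vecs, image_vecs)
```

In this toy input the first two sentences each match one image closely, while the third is orthogonal to both images and is therefore flagged. Whether such a flag actually indicates a hallucination is exactly what the research question leaves open.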

Project Objectives

In addition, the project pursues the following objectives:

  • Implement and test LLM-based text simplification systems on existing datasets. 
  • Deploy diffusion models (text-to-image) to generate visual counterparts. 
  • Study the alignments between the text simplifications and the generated visual counterparts, i.e. at what level of granularity are the images aligned to the texts? 
  • Study the relation between the simplified text and the visual counterparts.
  • Propose new metrics for evaluating text simplification quality, especially in the presence of hallucinations.
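
The first three objectives suggest a pipeline of the following shape. The sketch below only fixes the data flow, assuming sentence-level granularity (one of several options the objectives ask about). The three callables are placeholders for an LLM-based simplifier, a text-to-image diffusion model and an alignment scorer; the stubs used here are hypothetical stand-ins so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AlignedUnit:
    """One source sentence with its simplification, image and alignment score."""
    source: str
    simplified: str
    image: bytes
    score: float


def run_pipeline(
    sentences: Sequence[str],
    simplify: Callable[[str], str],
    generate_image: Callable[[str], bytes],
    score_alignment: Callable[[str, bytes], float],
) -> List[AlignedUnit]:
    """Simplify each sentence, generate an image for it, and score the pair."""
    units = []
    for sentence in sentences:
        simplified = simplify(sentence)
        image = generate_image(simplified)
        score = score_alignment(simplified, image)
        units.append(AlignedUnit(sentence, simplified, image, score))
    return units


# Hypothetical stubs: a real system would call an LLM and a diffusion model here.
def stub_simplify(text: str) -> str:
    return text.lower()


def stub_generate_image(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # pretend the prompt bytes are image data


def stub_score(text: str, image: bytes) -> float:
    return 1.0 if image.decode("utf-8") == text else 0.0


units = run_pipeline(
    ["The Committee Postponed The Vote."],
    stub_simplify,
    stub_generate_image,
    stub_score,
)
```

Passing the components in as callables keeps the loop independent of any particular model, so different simplifiers, image generators or scoring functions can be swapped in when comparing granularities or metrics.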

Requirements

This project requires prior experience or interest in acquiring expertise in:

  • Generative models for NLP; knowledge of diffusion models is a plus.
  • Evaluating machine learning models.
  • Deep learning with PyTorch and Hugging Face libraries (experience with multithreaded parallel computing is a plus).

Contact

We are looking for highly motivated students majoring in computer science, data science, mathematics, or electrical engineering. Please send your CV and transcript to Yingqiang Gao (yingqiang.gao@uzh.ch) and cc Prof. Dr. Sarah Ebling (ebling@cl.uzh.ch).
