Publications
A snapshot of publications I have contributed to.
2025
- [ICCV Highlight Paper] TikZero: Zero-Shot Text-Guided Graphics Program Synthesis. Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, and Simone Paolo Ponzetto. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, Hawaii, Oct 2025.
Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
@inproceedings{belouadi2025tikzero,
  author    = {Belouadi, Jonas and Ilg, Eddy and Keuper, Margret and Tanaka, Hideki and Utiyama, Masao and Dabre, Raj and Eger, Steffen and Ponzetto, Simone Paolo},
  title     = {{TikZero}: Zero-Shot Text-Guided Graphics Program Synthesis},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  location  = {Honolulu, Hawaii},
  month     = oct,
  year      = {2025}
}
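To make the decoupling concrete, here is a minimal PyTorch sketch of the bridging idea from the abstract. It is illustrative only, not the released TikZero code; the class name, dimensions, and training phases are assumptions. The idea: a program generator trained to condition on image embeddings can, at inference time, be driven by a caption embedding from a shared text-image space (e.g., CLIP) passed through a small adapter.

```python
# Illustrative sketch only: an adapter maps embeddings from a shared
# text-image space into the conditioning space of an image-conditioned
# graphics-program generator. All names and sizes are hypothetical.
import torch
import torch.nn as nn

class CaptionToProgramBridge(nn.Module):
    def __init__(self, clip_dim: int = 512, cond_dim: int = 1024):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(clip_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, shared_embedding: torch.Tensor) -> torch.Tensor:
        # Phase 1 trains p(program | image embedding) on unaligned programs;
        # phase 2 aligns captions with images; at inference a caption
        # embedding stands in for the image embedding via this adapter.
        return self.adapter(shared_embedding)

bridge = CaptionToProgramBridge()
caption_emb = torch.randn(1, 512)   # stand-in for a CLIP text embedding
conditioning = bridge(caption_emb)  # would be fed to the program generator
print(conditioning.shape)           # torch.Size([1, 1024])
```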
- MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models. Jonas Belouadi, Tamy Boubekeur, and Adrien Kaiser. Preprint, Sep 2025.
Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
@misc{belouadi2025multimat,
  author        = {Belouadi, Jonas and Boubekeur, Tamy and Kaiser, Adrien},
  title         = {{MultiMat}: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models},
  url           = {https://arxiv.org/abs/2509.22151},
  eprint        = {2509.22151},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CV},
  month         = sep,
  year          = {2025}
}
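The constrained tree search mentioned in the abstract can be pictured with a generic best-first skeleton. The sketch below is a toy under stated assumptions, not the paper's algorithm: `propose_steps`, `is_statically_valid`, and `score` are hypothetical callbacks standing in for the model's proposals, the static checker, and the scoring function.

```python
# Toy sketch of constrained search: model-proposed expansions of a partial
# program are kept only if a static validity check passes, so the search
# never leaves the space of well-formed node graphs.
import heapq

def constrained_tree_search(root, propose_steps, is_statically_valid, score, budget=100):
    counter = 0                                 # unique tie-breaker for the heap
    frontier = [(-score(root), counter, root)]  # max-heap via negated scores
    best = root
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        budget -= 1
        if score(state) > score(best):
            best = state
        for step in propose_steps(state):
            child = state + (step,)             # extend the partial program
            if not is_statically_valid(child):  # static pruning
                continue
            counter += 1
            heapq.heappush(frontier, (-score(child), counter, child))
    return best

# Toy usage: "programs" are tuples of ints; a program stays valid while
# its sum is below 4, and longer valid programs score higher.
best = constrained_tree_search(
    root=(),
    propose_steps=lambda s: [0, 1],
    is_statically_valid=lambda s: sum(s) < 4,
    score=len,
    budget=50,
)
```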
- ScImage: How Good are Multimodal Large Language Models at Scientific Text-to-Image Generation? Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Fahimeh Moafian, and Zhixue Zhao. In The Thirteenth International Conference on Learning Representations, Apr 2025.
Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images, a critical application for accelerating scientific progress, remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate seven models (GPT-4o, Llama, AutomaTikZ, DALL-E, Stable Diffusion, GPT-o1, and Qwen2.5-Coder-Instruct) using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.
@inproceedings{zhang2025scimage,
  author    = {Zhang, Leixin and Eger, Steffen and Cheng, Yinjie and Zhai, Weihe and Belouadi, Jonas and Moafian, Fahimeh and Zhao, Zhixue},
  title     = {{ScImage}: How Good are Multimodal Large Language Models at Scientific Text-to-Image Generation?},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  url       = {https://openreview.net/forum?id=ugyqNEOjoU},
  month     = apr,
  year      = {2025}
}
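For a rough picture of how ratings from such a study could be aggregated (per-model means over the three criteria), here is a small self-contained sketch; the function, data layout, and values are all made up for illustration and do not reflect the paper's actual data.

```python
# Hypothetical aggregation of human ratings: (model, criterion, score)
# triples are averaged per model and criterion. Values are illustrative.
from collections import defaultdict
from statistics import mean

CRITERIA = ("correctness", "relevance", "scientific_accuracy")

def aggregate(ratings):
    buckets = defaultdict(list)
    for model, criterion, score in ratings:
        assert criterion in CRITERIA
        buckets[(model, criterion)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

print(aggregate([
    ("model-a", "correctness", 4),
    ("model-a", "correctness", 5),
    ("model-a", "relevance", 3),
]))
```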
2024
- [NeurIPS Spotlight Paper] DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ. Jonas Belouadi, Simone Paolo Ponzetto, and Steffen Eger. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Vancouver, Canada, Dec 2024.
Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex. To tackle this problem, we introduce DeTikZify, a novel multimodal language model that automatically synthesizes scientific figures as semantics-preserving TikZ graphics programs based on sketches and existing figures. To achieve this, we create three new datasets: DaTikZv2, the largest TikZ dataset to date, containing over 360k human-created TikZ graphics; SketchFig, a dataset that pairs hand-drawn sketches with their corresponding scientific figures; and MetaFig, a collection of diverse scientific figures and associated metadata. We train DeTikZify on MetaFig and DaTikZv2, along with synthetically generated sketches learned from SketchFig. We also introduce an MCTS-based inference algorithm that enables DeTikZify to iteratively refine its outputs without the need for additional training. Through both automatic and human evaluation, we demonstrate that DeTikZify outperforms commercial Claude 3 and GPT-4V in synthesizing TikZ programs, with the MCTS algorithm effectively boosting its performance. We make our code, models, and datasets publicly available.
@inproceedings{belouadi2024detikzify,
  author    = {Belouadi, Jonas and Ponzetto, Simone Paolo and Eger, Steffen},
  title     = {{DeTikZify}: Synthesizing Graphics Programs for Scientific Figures and Sketches with {TikZ}},
  booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  location  = {Vancouver, Canada},
  url       = {https://openreview.net/forum?id=bcVLFQCOjc},
  month     = dec,
  year      = {2024}
}
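The MCTS-based inference loop can be sketched in a few lines. This is a minimal, generic MCTS skeleton in the spirit of the abstract, not the released DeTikZify code; `sample_continuation` and `similarity_to_target` are hypothetical hooks for the language model and the figure/sketch-similarity reward.

```python
# Minimal MCTS-style refinement: sample program continuations, score the
# result against the target figure or sketch, and reinforce promising
# branches. Hooks are hypothetical stand-ins.
import math

class Node:
    def __init__(self, program="", parent=None):
        self.program, self.parent = program, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts_refine(root, sample_continuation, similarity_to_target, iters=64):
    for _ in range(iters):
        node = root
        while node.children:                          # selection
            node = max(node.children, key=ucb)
        child = Node(sample_continuation(node.program), parent=node)
        node.children.append(child)                   # expansion
        reward = similarity_to_target(child.program)  # evaluation
        while child is not None:                      # backpropagation
            child.visits += 1
            child.value += reward
            child = child.parent
    return max(root.children, key=lambda n: n.value / n.visits)
```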
- AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ. Jonas Belouadi, Anne Lauscher, and Steffen Eger. In The Twelfth International Conference on Learning Representations, Vienna, Austria, May 2024.
Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
@inproceedings{belouadi2024automatikz,
  author    = {Belouadi, Jonas and Lauscher, Anne and Eger, Steffen},
  title     = {{AutomaTikZ}: Text-Guided Synthesis of Scientific Vector Graphics with {TikZ}},
  booktitle = {The Twelfth International Conference on Learning Representations},
  location  = {Vienna, Austria},
  url       = {https://openreview.net/forum?id=v3K5TVP8kZ},
  month     = may,
  year      = {2024}
}
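The CLIP augmentation in CLiMA can be illustrated with a prefix-style adapter: the caption's CLIP embedding is projected into the language model's embedding space and prepended as a short soft prefix. This is a sketch under assumptions (the class name, dimensions, and prefix length are invented), not the released code.

```python
# Illustrative soft-prefix adapter: a CLIP embedding is projected into
# prefix_len pseudo-token embeddings that are prepended to the LM input.
import torch
import torch.nn as nn

class ClipPrefix(nn.Module):
    def __init__(self, clip_dim: int = 512, lm_dim: int = 4096, prefix_len: int = 4):
        super().__init__()
        self.lm_dim, self.prefix_len = lm_dim, prefix_len
        self.proj = nn.Linear(clip_dim, lm_dim * prefix_len)

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_len, lm_dim)
        return self.proj(clip_emb).view(-1, self.prefix_len, self.lm_dim)

prefix = ClipPrefix()(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 4, 4096])
```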
- ChatGPT: A Meta-Analysis after 2.5 Months. Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, and Steffen Eger. In Machine Learning with Applications, Jun 2024.
ChatGPT, a chatbot developed by OpenAI, has gained widespread popularity and media attention since its release in November 2022. However, little hard evidence is available regarding its perception in various sources. In this paper, we analyze over 300,000 tweets and more than 150 scientific papers to investigate how ChatGPT is perceived and discussed. Our findings show that ChatGPT is generally viewed as of high quality, with positive sentiment and emotions of joy dominating social media. Its perception has slightly decreased since its debut, however, with joy decreasing and (negative) surprise on the rise, and it is perceived more negatively in languages other than English. In recent scientific papers, ChatGPT is characterized as a great opportunity across various fields including the medical domain, but also as a threat concerning ethics, and receives mixed assessments for education. Our comprehensive meta-analysis of ChatGPT’s perception 2.5 months after its release can contribute to shaping the public debate and informing its future development. We make our data available.
@article{leiter2023chatgpt,
  author       = {Leiter, Christoph and Zhang, Ran and Chen, Yanran and Belouadi, Jonas and Larionov, Daniil and Fresen, Vivian and Eger, Steffen},
  title        = {{ChatGPT}: A Meta-Analysis after 2.5 Months},
  journaltitle = {Machine Learning with Applications},
  publisher    = {Elsevier},
  url          = {https://www.sciencedirect.com/science/article/pii/S2666827024000173},
  doi          = {10.1016/j.mlwa.2024.100541},
  issn         = {2666-8270},
  keywords     = {ChatGPT, Sentiment analysis, Emotion analysis, Science, Large language models},
  pages        = {100541},
  month        = jun,
  year         = {2024}
}
2023
- [ACL Honorable Mention at the Best Paper Awards] ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. Jonas Belouadi and Steffen Eger. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, Jul 2023.
State-of-the-art poetry generation systems are often complex. They either consist of task-specific model pipelines, incorporate prior knowledge in the form of manually created constraints, or both. In contrast, end-to-end models would not suffer from the overhead of having to model prior knowledge and could learn the nuances of poetry from data alone, reducing the degree of human supervision required. In this work, we investigate end-to-end poetry generation conditioned on styles such as rhyme, meter, and alliteration. We identify and address a lack of training data and mismatched tokenization algorithms as possible limitations of past attempts. In particular, we successfully pre-train ByGPT5, a new token-free decoder-only language model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2, and ChatGPT, while also being more parameter-efficient and performing favorably compared to humans. In addition, we analyze its runtime performance and demonstrate that it is not prone to memorization. We make our code, models, and datasets publicly available.
@inproceedings{belouadi2023bygpt5,
  author    = {Belouadi, Jonas and Eger, Steffen},
  title     = {{ByGPT5}: End-to-End Style-conditioned Poetry Generation with Token-free Language Models},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
  location  = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.acl-long.406},
  doi       = {10.18653/v1/2023.acl-long.406},
  pages     = {7364--7381},
  month     = jul,
  year      = {2023}
}
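The token-free angle is easy to demonstrate: byte-level models such as ByT5 (on which ByGPT5 builds) operate on raw UTF-8 bytes, so the word endings that matter for rhyme and alliteration are always visible as character-level units, whereas subword tokenizers may merge them away. A minimal illustration:

```python
# Byte-level "tokenization": one id per UTF-8 byte, lossless round-trip.
# (Real byte-level models typically add a small id offset for special
# tokens; omitted here for clarity.)
text = "roses are red"
ids = list(text.encode("utf-8"))   # [114, 111, 115, 101, 115, ...]
assert bytes(ids).decode("utf-8") == text
print(ids[:5])
```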
- [EACL Outstanding Paper Award] UScore: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation. Jonas Belouadi and Steffen Eger. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, May 2023.
The vast majority of evaluation metrics for machine translation are supervised, i.e., (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we use to remap deficient underlying vector spaces (in an iterative manner) and to induce an unsupervised MT system, which then provides pseudo-references as an additional component in the metric. Finally, we also induce unsupervised multilingual sentence embeddings from pseudo-parallel data. We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets. We make our code publicly available.
@inproceedings{belouadi2023uscore,
  author    = {Belouadi, Jonas and Eger, Steffen},
  title     = {{UScore}: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation},
  booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
  location  = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.eacl-main.27},
  doi       = {10.18653/v1/2023.eacl-main.27},
  pages     = {358--374},
  month     = may,
  year      = {2023}
}
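The pseudo-parallel mining step at the heart of this pipeline can be sketched as mutual-nearest-neighbor matching in a cross-lingual embedding space. This toy version (hypothetical function, random placeholder data) keeps a sentence pair only if each side is the other's best match:

```python
# Toy pseudo-parallel mining: normalize embeddings, compute cosine
# similarities, and keep mutual nearest neighbors as pseudo-parallel
# pairs, which can then be used to remap the spaces or train an
# unsupervised MT system that provides pseudo-references.
import numpy as np

def mine_pseudo_parallel(src_embs: np.ndarray, tgt_embs: np.ndarray):
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T          # cosine similarity matrix
    s2t = sims.argmax(axis=1)   # best target for each source
    t2s = sims.argmax(axis=0)   # best source for each target
    return [(i, int(j)) for i, j in enumerate(s2t) if t2s[j] == i]

pairs = mine_pseudo_parallel(np.random.randn(100, 64), np.random.randn(120, 64))
```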
2022
- Reproducibility Issues for BERT-based Evaluation Metrics. Yanran Chen, Jonas Belouadi, and Steffen Eger. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, Dec 2022.
Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out problems with the reproducibility of BLEU, the dominant metric at the time. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code, and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not with human scores but with the wrong column in the CSV file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover's Distance).
@inproceedings{chen2022reproducibility,
  author    = {Chen, Yanran and Belouadi, Jonas and Eger, Steffen},
  title     = {Reproducibility Issues for {BERT}-based Evaluation Metrics},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  location  = {Abu Dhabi, United Arab Emirates},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.emnlp-main.192},
  doi       = {10.18653/v1/2022.emnlp-main.192},
  pages     = {2965--2989},
  month     = dec,
  year      = {2022}
}