Professor Choi Jun-seok's research team at Sogang University has paper accepted at ICLR 2026, a top international conference in artificial intelligence
A research team led by Professor Choi Jun-seok at Sogang University has had a paper accepted at the International Conference on Learning Representations (ICLR) 2026. The paper addresses cross-image information leakage in large vision-language models.
The Sogang University research team led by Professor Choi Jun-seok has had its paper accepted for presentation at the International Conference on Learning Representations (ICLR) 2026, one of the most respected conferences in artificial intelligence and machine learning. The team, which includes doctoral and master's students, analyzed in depth the role of delimiter tokens in large vision-language models (LVLMs) in multi-image input scenarios and proposed an effective technique to improve performance.
Existing LVLMs perform well on single-image tasks but struggle when handling multiple images at once because of cross-image information leakage, which degrades inference accuracy. Current models use delimiter tokens to separate images, but the team found that these tokens often fail to block leakage between images. As a remedy, the team proposes a simple yet effective technique: scaling the hidden states of the delimiter tokens. This strengthens interactions within each image while suppressing unnecessary interactions across images, so the model distinguishes the images more clearly and reasons more accurately over multi-image inputs.
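As a rough illustration of what scaling the hidden states of delimiter tokens could look like, the following is a minimal NumPy sketch and not the team's implementation. The function name `scale_delimiter_states`, the toy token layout, and the scale value of 3.0 are all assumptions made for illustration; the article does not state which layers are affected or what scale the paper uses.

```python
import numpy as np

def scale_delimiter_states(hidden, delimiter_positions, scale):
    """Return a copy of `hidden` with the rows at delimiter positions multiplied by `scale`.

    hidden: array of shape (seq_len, hidden_dim), one row per token.
    delimiter_positions: indices of the delimiter tokens (hypothetical layout).
    scale: scaling factor (the value used here is an assumption, not from the paper).
    """
    out = np.array(hidden, dtype=float, copy=True)  # leave the input untouched
    out[list(delimiter_positions)] *= scale         # scale only the delimiter rows
    return out

# Toy sequence of 6 tokens: [img1, img1, <delim>, img2, img2, <delim>]
hidden = np.ones((6, 4))
scaled = scale_delimiter_states(hidden, [2, 5], scale=3.0)
```

In an actual LVLM, a step like this would presumably be applied to the model's intermediate representations at inference time, which would be consistent with the article's statement that no structural changes or additional training are required.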
The significance of this research lies in its re-examination of the role of delimiter tokens, an aspect of LVLMs that is often overlooked. It offers a practical way to reliably improve multi-image reasoning performance without structural changes to the model or additional training. The work is expected to contribute to the reliability and accuracy of future AI systems that handle multiple images and documents.