Bulletpapers - Understand complex papers in seconds

May 2024

Powerful Chinese image generation

This paper introduces Hunyuan-DiT, a text-to-image model that can generate detailed, high-quality images from both English and Chinese text prompts. Key innovations include a tailored transformer architecture, a data pipeline for iterative optimization, refined image captions, and multi-turn dialog for prompt refinement.

May 2024

Evaluating LLMs for data annotation

This paper provides an overview of 12 recent studies exploring the use of large language models (LLMs) for data annotation tasks. It summarizes key benefits like lower costs and faster speeds, but also limitations around bias, sensitivity to prompts, and English language preference. An empirical analysis examines how well the LLM GPT's predicted opinion distributions al...

April 2024

Benchmark for evaluating language models on generation tasks in Indian languages

The authors release IndicGenBench, a new benchmark to measure the ability of language models to perform generation tasks like summarization, translation, and question answering across 29 languages native to India. It extends existing datasets, providing multi-way parallel test data in many under-resourced Indic languages for the first time.

April 2024

SQuAD2.0 dataset translated into Basque

This paper presents EuSQuAD, an automatically translated and aligned version of the SQuAD2.0 question answering dataset for Basque. Over 142,000 question-answer pairs were generated. Experiments show EuSQuAD helps train better QA models compared to using the original English SQuAD2.0. A new manually annotated Basque QA test set of 490 questions is also introduced.

March 2024

Multilingual news recommendation dataset

The xMIND dataset enables research on multilingual and cross-lingual news recommendation. It covers 14 diverse languages and benchmarks neural recommenders, revealing performance drops in cross-lingual transfer.

December 2023

Large language model speech translation

This paper introduces LLM-ST, a novel speech translation model built on a large language model. It combines the language model with a speech encoder and uses multi-task instruction tuning to generate accurate, timestamped transcriptions and translations from long audio inputs. Experiments on English and Chinese datasets demonstrate exceptional performance, setting a new...

November 2023

Assessing translation of English and Indian languages with large language models

This paper explores using large language models for machine translation between English and 22 Indian languages. The authors evaluate raw LLMs, in-context learning, and fine-tuning approaches. They find that fine-tuned LLaMA models achieve the best performance, producing reasonable translations even for low-resource Indian languages. Their results highlight the potentia...

November 2023

Multilingual text generation in images

This paper introduces AnyText, a diffusion model framework for generating realistic, readable text in images. It can render text in multiple languages at specified positions, including irregularly shaped regions. A novel text embedding module fuses semantic and glyph information. AnyText significantly outperforms prior methods.

November 2023

Unsupervised word substitution using language models

This paper proposes a new unsupervised approach to lexical simplification. It generates substitute words for a target word in context by sampling additional contexts containing the target word from a large text corpus. The model scores substitute candidates based on how frequently they are predicted for the target word across the original and sampled contexts. Experimen...
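
The scoring idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a language model has already produced a list of predicted substitutes for each context, and the function name `rank_substitutes`, the toy predictions, and the "count once per context" rule are all hypothetical choices made for the example.

```python
from collections import Counter


def rank_substitutes(predictions_per_context, target):
    """Rank candidates by the number of contexts in which they are predicted."""
    counts = Counter()
    for predicted in predictions_per_context:
        # Count each candidate once per context; skip the target word itself.
        for word in set(predicted):
            if word != target:
                counts[word] += 1
    # Sort by descending count, then alphabetically for a deterministic order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy predictions (made up for illustration): the first list stands for the
# original context, the others for contexts sampled from a corpus.
predictions = [
    ["hard", "tough", "tricky"],
    ["hard", "complex"],
    ["tough", "hard", "difficult"],
]
ranking = rank_substitutes(predictions, target="difficult")
print(ranking)
```

Candidates predicted in more contexts rank higher, which is the frequency signal the summary describes; the paper's actual scoring function may weight contexts differently.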

October 2023

Simplifying few-shot text classification

This paper proposes a technique called Label-Aware Automatic Verbalizer (LAAV) to improve few-shot text classification using prompt-based learning. LAAV leverages class labels to help language models generate better representative words for each class. Experiments on 5 datasets across 5 languages show LAAV outperforms prior verbalization methods, especially in low-resou...

October 2023

Self-training speech segmentation with a pretrained model

This paper proposes a method to improve unsupervised speech segmentation systems by iteratively self-training a pretrained speech model. It fine-tunes XLS-R, a pretrained speech model, on noisy word boundaries from existing systems. This consistently boosts performance, especially for the DP-Parse system. On 5 languages, it increased average token F1 by 130% to 40.7. It...

June 2023

Large legal text corpus in 24 languages

This paper introduces a new 689GB corpus called MultiLegalPile, comprising legal texts in 24 languages from 17 jurisdictions. The corpus combines legislative texts, court rulings, contracts, and other sources. Models pretrained on it achieve state-of-the-art results on legal benchmarks.

May 2023

Unlocking the Secrets of Complex Entities: A New Era of Multilingual NER

This paper presents the findings from SemEval-2023 Task 2, a shared task focused on identifying complex named entities across 12 languages. The task used a new large-scale dataset called MultiCoNER v2 containing over 2 million instances with fine-grained entity types like ATHLETE, DISEASE, and VISUALWORK. Results showed the continued challenges of processing complex ent...

February 2023

Demystifying Transformer and HowNet for Text Matching

This paper proposes a novel approach to text matching that fuses Transformer encoding and external knowledge from HowNet to handle synonyms and polysemy. The model encodes sentence pairs with Transformer, incorporates HowNet semantic knowledge, and fuses this via an attention mechanism. Experiments show accuracy improvements on financial and paraphrase datasets compared...

September 2020

Bridging Music Genres Across Languages

This paper presents a method to learn multilingual music genre embeddings which enable effective cross-lingual music genre translation. Music genres annotated in different languages are aligned into a common vector space. This allows translating music genres from multiple sources in various languages to a target music genre vocabulary.
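
Once genres from different languages share one vector space, translating a genre can be done by a nearest-neighbor lookup in the target vocabulary. The sketch below assumes cosine similarity as the distance measure and uses made-up three-dimensional vectors; the function names and the example tags are hypothetical, and the paper's actual embeddings and retrieval step may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def translate_genre(source_vec, target_vocab):
    """Return the target genre whose embedding is closest to source_vec."""
    return max(target_vocab, key=lambda name: cosine(source_vec, target_vocab[name]))


# Made-up embeddings in a shared space (illustration only).
source = {"rock alternatif": [0.9, 0.1, 0.0]}
target = {
    "alternative rock": [0.8, 0.2, 0.1],
    "jazz": [0.0, 0.1, 0.9],
    "folk": [0.1, 0.9, 0.1],
}

best = translate_genre(source["rock alternatif"], target)
print(best)
```

The lookup works the same way regardless of the source language, since every tag is compared in the common space.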

The history of multilingual