Published on 9 May 2024
Analysis of growth rates for surrogate loss bounds in classification
This paper analyzes the growth rates of bounds relating the estimation error of surrogate losses to that of the 0-1 loss, known as H-consistency bounds, in both binary and multi-class classification. It shows that these bounds exhibit a universal square-root rate near 0 for common smooth losses like cross-entropy, under mild assumptions. This directly implies that reducing the surrogate estimation error yields a square-root decay in the target 0-1 estimation error. The paper also thoroughly compares losses based on minimizability gaps, which become ...
Published on 9 May 2024
Accelerating diffusion models for fast image generation
This paper proposes a method to distill a complex, multi-step diffusion model into a fast, single-step conditional GAN that generates images of nearly the same quality but much more quickly. The key ideas are to interpret diffusion distillation as an image-to-image translation task using noise and image pairs from the diffusion model, and to create losses that work directly in the model's latent space, avoiding decoding to pixels. The resulting one-step model outperforms other state-of-the-art diffusion distillation techniques.
Published on 9 May 2024
Evaluating Language Models for Driving
This paper provides a comprehensive analysis of state-of-the-art multimodal language models in simulated driving environments. The models demonstrate significant limitations in reasoning logically about basic vehicle dynamics, interactions between vehicles, trajectory planning, and unexpected events across sequences of images. To enable analysis, the authors introduce DriveSim, a specialized driving simulator generating diverse scenarios. They also release full code and a dataset for continued research. Results reveal critical gaps in curren...
Published on 9 May 2024
Transforming Text into Images, Videos, 3D Objects and Audio via Flow-based Diffusion Transformers
This paper introduces Lumina-T2X, a family of flow-based large diffusion transformers designed to transform noise into images, videos, 3D objects and audio conditioned on text. Key techniques like tokenized representations, learnable placeholders, RoPE, RMSNorm and flow matching enable unified training and flexible generation across modalities and resolutions. Models scale up to 7B parameters with 35% of the training costs of a 600M model, achieving ultra-high-res images and long 720p videos.
Published on 9 May 2024
Efficient ranking of text options through selective pairwise comparisons
This paper proposes a framework to efficiently rank a set of text options by quality. It uses selective pairwise comparisons judged by a language model, combining these decisions probabilistically. With just a small subset of all possible comparisons, it can predict quality scores that correlate well with human judgment, while greatly reducing computational costs.
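To illustrate how a sparse set of pairwise judgments can be turned into per-item quality scores, here is a minimal sketch that fits a Bradley-Terry style logistic model by gradient ascent. This is an assumed stand-in and not necessarily the paper's probabilistic model. The function name `fit_scores`, the `(i, j, p_ij)` comparison format, and all hyperparameters are hypothetical.

```python
import math


def fit_scores(n_items, comparisons, steps=2000, lr=0.5, l2=0.01):
    """Fit one score per item from soft pairwise judgments.

    comparisons: list of (i, j, p_ij), where p_ij is the judge's
    probability that item i is better than item j (hypothetical format).
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        # A small L2 penalty keeps scores finite and centered near zero.
        grad = [-l2 * s for s in scores]
        for i, j, p_ij in comparisons:
            predicted = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
            grad[i] += p_ij - predicted
            grad[j] -= p_ij - predicted
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores


# Only 3 of the 6 possible pairs among 4 options are compared.
judgments = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.7)]
scores = fit_scores(4, judgments)
ranking = sorted(range(4), key=lambda k: -scores[k])
print(ranking)
```

Even though options 0 and 3 are never compared directly, the fitted scores place them relative to each other through the chain of judgments, which is the sense in which a subset of comparisons can suffice.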
Published on 9 May 2024
Safe Reinforcement Learning Using Uncertainty-Aware Models
The paper proposes a new reinforcement learning method called CERL that maintains safety while learning policies, using Bayesian neural networks to model uncertainty and suggest policies robust to inaccuracies. CERL outperformed other methods on constrained MDP tasks from image inputs.
Published on 9 May 2024
Self-supervised modeling for text recognition
This paper proposes Symmetric Superimposition Modeling (SSM), a self-supervised approach for text recognition that captures both character shapes and linguistic context by reconstructing original and inverted images from their superimposition. SSM operates at both pixel and feature levels. At the pixel level, it reconstructs original and inverted images to capture shapes and texture-level context. At the feature level, it reconstructs features of the original and inverted images under different augmentations to model semantic-level context a...
Published on 8 May 2024
Event-based open-vocabulary scene parsing
This paper introduces OpenESS, a method to perform event-based semantic segmentation with open-ended textual queries instead of fixed labels. It transfers knowledge from image and text models to event data, allowing segmentation of new categories without retraining. Key techniques include contrastive learning between events and image regions, and optimizing event embeddings to match text meanings.
Published on 8 May 2024
Data-efficient 3D scene understanding for autonomous vehicles
This paper proposes a semi-supervised framework called LaserMix++ that leverages both LiDAR point clouds and camera images to improve 3D scene understanding for autonomous driving with far less labeled data. Key innovations include multi-modal data mixing, transferring knowledge from images to point clouds, and generating auxiliary labels from language models, which enhance regularization and feature learning.
Published on 8 May 2024
Evaluating and Reducing Hallucinations in Vision-Language Models
The paper proposes THRONE, a new benchmark to evaluate 'Type I' hallucinations (in open-ended responses) in large vision-language models (LVLMs). It utilizes language models to identify hallucinations and introduces metrics to quantify them. The paper demonstrates that reducing 'Type II' hallucinations (in responses to specific questions) does not reduce Type I hallucinations, and that existing methods for evaluating Type I hallucinations are limited. Finally, a simple data augmentation method is introduced that reduces both Type I and Type ...
Published on 8 May 2024
Decoder-decoder architecture for efficient language models
The paper proposes YOCO, a decoder-decoder architecture for large language models. It consists of a self-decoder that encodes global key-value caches, and a cross-decoder that reuses those caches. This design reduces GPU memory usage and speeds up inference compared to regular Transformer decoders. Experiments show YOCO scales well in terms of model size, training data, and context length. At 1 million tokens it achieves high accuracy on retrieval tasks. Profiling shows orders of magnitude less memory usage and faster prefilling.
Published on 8 May 2024
Efficient GNN training on disk
This paper introduces DiskGNN, a system to efficiently train graph neural networks on disk when graphs exceed CPU memory. DiskGNN achieves high efficiency and model accuracy through offline sampling to optimize data layout, four-level caching, batched packing, and pipelined training. Experiments show DiskGNN is over 8x faster than state-of-the-art systems while matching their accuracy.
Published on 8 May 2024
Language-guided robot control for surgical tasks
This paper presents SuFIA, a framework that uses large language models (LLMs) and perception modules to plan and execute robotic control for surgical sub-tasks. This allows for a learning-free approach to surgical automation without needing motion primitives or examples. SuFIA incorporates re-planning and human oversight to mitigate errors. Experiments in simulation and on a physical robot platform demonstrate SuFIA's ability to autonomously perform common surgical tasks under challenging conditions.
Published on 8 May 2024
Online Platform Content Moderation Policy Study
This paper analyzes content moderation policies from 43 major online platforms to understand their approaches to moderating copyright infringement, harmful speech, and misleading content. Using a custom web scraper and unified annotation scheme, the authors find significant variation across platforms and topics attributable to differing legal regimes. The paper lays groundwork for studying evolving moderation policies and their impacts.
Published on 8 May 2024
Accelerating diffusion models with distillation for fast high-quality image generation
This paper proposes a distillation framework to accelerate diffusion models, enabling high-quality and diverse image generation using only 1-3 sampling steps. Key innovations include Backward Distillation to reduce train-test discrepancy, Shifted Reconstruction Loss to transfer both structure and detail knowledge, and Noise Correction to enhance initial sample quality.
Published on 8 May 2024
Efficient attention computation for transformers
This paper develops a convolution-based method to efficiently approximate attention in transformers, reducing the quadratic complexity to nearly linear. It shows that any attention matrix can be decomposed into convolution matrices, which allows the fast Fourier transform (FFT) to be used to speed up computation without changing model parameters.
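The building block behind this speed-up can be sketched in isolation: multiplying a lower-triangular convolution (Toeplitz) matrix by a vector takes O(n^2) time directly, but O(n log n) with an FFT. The code below shows only that single step under simplifying assumptions. It is not the paper's decomposition of attention, and the function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_conv_direct(kernel, x):
    """y[i] = sum over j <= i of kernel[i - j] * x[j], in O(n^2)."""
    return [sum(kernel[i - j] * x[j] for j in range(i + 1))
            for i in range(len(x))]


def causal_conv_fft(kernel, x):
    """Same product as causal_conv_direct, computed with zero-padded FFTs."""
    n = len(x)
    size = 1
    while size < 2 * n:
        size *= 2
    fk = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    product = fft([a * b for a, b in zip(fk, fx)], inverse=True)
    # The inverse transform above is unnormalized, so divide by size.
    return [v.real / size for v in product[:n]]


kernel = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 0.0, -1.0, 2.0]
print(causal_conv_direct(kernel, x))
print(causal_conv_fft(kernel, x))
```

The two functions agree up to floating-point rounding. The FFT version never materializes the n-by-n matrix.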
Published on 8 May 2024
Text-driven 3D human pose estimation
This paper proposes FinePOSE, a new diffusion model-based approach for estimating 3D human poses from 2D keypoints. It introduces a novel fine-grained part-aware prompt learning mechanism to provide precise guidance for each human body part's movement. FinePOSE also establishes communications between the learned prompts and poses to enhance the diffusion model's denoising capability. Experiments show state-of-the-art performance on public benchmarks. An extension to multi-human scenarios also demonstrates potential.
Published on 8 May 2024
SPIDER: Fast rank and select queries
This paper introduces SPIDER, a new succinct data structure for answering rank and select queries on bit vectors. SPIDER uses only 3.82% extra space, yet outperforms prior methods. For rank queries, SPIDER is the fastest known method on large inputs. For select queries, it narrows the performance gap between space-efficient and less space-efficient techniques. Key ideas include interleaving metadata with the bit vector to improve cache performance, and using predictions to accelerate select queries.
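For readers unfamiliar with the two queries, the following is a minimal textbook-style sketch: `rank1(i)` counts the 1 bits before position `i`, and `select1(k)` returns the position of the k-th 1 bit. It uses a plain per-block prefix-count table and is not SPIDER's interleaved layout or its prediction scheme. The class name and block size are assumptions.

```python
from bisect import bisect_left


class RankSelectBitVector:
    BLOCK = 64  # hypothetical block size

    def __init__(self, bits):
        self.bits = list(bits)
        # block_ranks[b] = number of 1 bits before block b.
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1 bits in bits[0:i]."""
        block, offset = divmod(i, self.BLOCK)
        start = block * self.BLOCK
        return self.block_ranks[block] + sum(self.bits[start:start + offset])

    def select1(self, k):
        """Position of the k-th 1 bit, counting from k = 1."""
        if k < 1 or k > self.block_ranks[-1]:
            raise ValueError("k out of range")
        # Find the block that contains the k-th 1, then scan inside it.
        block = bisect_left(self.block_ranks, k) - 1
        seen = self.block_ranks[block]
        for pos in range(block * self.BLOCK, len(self.bits)):
            seen += self.bits[pos]
            if seen == k:
                return pos


bv = RankSelectBitVector([1, 0, 1, 1, 0] * 40)
print(bv.rank1(70), bv.select1(4))
```

The extra space here is one counter per block. Succinct structures such as the one described above aim to keep that overhead to a few percent of the bit vector while keeping the lookups cache-friendly.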
Published on 8 May 2024
Synthesizing broadcast channels using shared randomness
This paper studies the problem of synthesizing a two-user broadcast channel using a common message, when the input terminal shares independent randomness with each output terminal. The authors provide inner and outer bounds on the rate tradeoff between communication and shared randomness. These bounds are tight for some special cases like point-to-point channels and channels without inputs studied in prior work.
Published on 8 May 2024
Team of robots tracks moving people by sharing object observations
A team of mobile robots can track moving people in their environment more accurately by sharing observations of people's locations with each other in real time. However, robots accumulate error in their position estimates, so they must repeatedly estimate the change in coordinate frames between themselves and their neighbors. This paper presents a full system for robots to build maps of generic objects seen recently, align the maps to estimate relative positions, and use those estimates to share observations of people for collaborative tracking.
Published on 8 May 2024
Using game techniques to engage software engineering students
This paper investigates applying gamification in software engineering education through a tertiary study, finding it can increase student engagement and motivation. However, improper implementation may negatively impact performance. Key areas for gamification are testing and quality, with competition and cooperation the most used game elements.
Published on 8 May 2024
Detecting unusual certificates
The authors propose using the Isolation Forest algorithm to detect anomalous X.509 certificates in Certificate Transparency logs. This unsupervised machine learning method builds random trees to isolate outliers. It identifies certificates significantly different from typical ones based on quantitative attributes like subject name length or public key type, without needing to pre-define anomalies. When standards compliance checks are insufficient, it can reveal potential issues needing investigation. The technique seems promising when traine...
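The isolation idea can be sketched in a few lines: points that random axis-aligned splits separate from the rest after only a few cuts are flagged as unusual. The code below is a simplified pure-Python illustration, not the authors' pipeline. The two features (subject name length, validity period in days) and all values are made-up assumptions.

```python
import random


def isolation_depth(point, rows, rng, depth=0, max_depth=12):
    """Number of random splits needed to separate `point` from `rows`."""
    if len(rows) <= 1 or depth >= max_depth:
        return depth
    # Only split on features that still vary among the remaining rows.
    varying = [f for f in range(len(point))
               if min(r[f] for r in rows) < max(r[f] for r in rows)]
    if not varying:
        return depth
    feature = rng.choice(varying)
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    split = rng.uniform(low, high)
    # Keep only the rows on the same side of the split as the point.
    same_side = [r for r in rows
                 if (r[feature] < split) == (point[feature] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def average_depth(point, rows, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth over random subsamples; lower is more unusual."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        total += isolation_depth(point, sample, rng)
    return total / n_trees


# Hypothetical data: (subject name length, validity period in days).
certs = [(20 + i % 15, 80 + i % 20) for i in range(200)] + [(250, 825)]
typical, unusual = certs[7], certs[-1]
print(average_depth(typical, certs), average_depth(unusual, certs))
```

No definition of "anomalous" is supplied anywhere. The unusual certificate stands out only because it takes fewer random cuts to isolate it.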
Published on 8 May 2024
Encoder-decoder model for interactive free verse generation with controllable high-quality rhyming
The paper proposes a novel fine-tuning approach to generate lyrics and free verse poems with controllable, high-quality rhyming. By prepending the rhyming word to the start of each line, the model makes the critical rhyming decision first while still generating the verse left-to-right. Extensive experiments show this approach produces more readable text and better rhyming compared to prior state-of-the-art methods. A high-quality multilingual dataset is also introduced to demonstrate wide applicability.
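The rhyme-first formatting can be pictured as a simple preprocessing step on training lines. The sketch below is an assumption about what such a step might look like: the separator, the choice to keep the original line unchanged after it, and the function name `rhyme_first` are hypothetical and not taken from the paper.

```python
import string


def rhyme_first(line, sep=" | "):
    """Prepend the line's final word so a left-to-right model emits it first."""
    words = line.split()
    if not words:
        return line
    # Drop trailing punctuation so "town." and "town" are the same target.
    rhyme_word = words[-1].strip(string.punctuation).lower()
    return f"{rhyme_word}{sep}{line}"


verse = ["The sun goes down", "Over the sleepy town."]
for line in verse:
    print(rhyme_first(line))
```

With lines in this form, the model commits to the rhyming word before writing the rest of the line, while generation itself remains ordinary left-to-right decoding.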
Published on 8 May 2024
Protecting privacy in conversational agents
This paper introduces a new threat model where adversarial third parties manipulate context to trick conversational agents into leaking private user data. The authors propose AirGapAgent, an architecture that restricts agent access to only necessary user data for a task. Experiments show AirGapAgent protects up to 97% of user data from context manipulation attacks while maintaining utility.
Published on 8 May 2024
3D perception of vehicle surroundings
This paper surveys recent research on 3D occupancy perception, which seeks to capture detailed 3D structures around vehicles to enable autonomous driving systems to precisely understand complex environments. It highlights that occupancy perception combines inputs from multiple sensors and fuses information across data sources. Key challenges include converting 2D images to 3D representations, integrating multi-camera and multi-frame observations, and training networks. The paper analyzes performance on datasets and discusses future opportuni...
Published on 8 May 2024
Quantized neural network training equivalence
This paper proves that many proposed complex gradient estimators for quantized neural networks are equivalent to simpler estimators like the straight-through estimator. After adjustments to the learning rate and weight initialization, models using complex estimators train almost identically to those using the straight-through estimator.
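For context, the straight-through estimator handles the fact that rounding has zero gradient almost everywhere by treating the quantizer as the identity in the backward pass. The sketch below trains a single quantized weight this way. It only illustrates the estimator itself, not the paper's equivalence result, and the toy loss and function names are assumptions.

```python
def quantize(w):
    """Forward pass: round the latent weight to the nearest integer."""
    return float(round(w))


def ste_grad(w, x, y):
    """Gradient of (quantize(w) * x - y)^2 with respect to w.

    The true derivative of quantize is zero almost everywhere, so the
    straight-through estimator replaces it with 1.
    """
    q = quantize(w)
    return 2.0 * (q * x - y) * x


def train(w, x, y, lr=0.1, steps=20):
    """Plain gradient descent on the latent full-precision weight."""
    for _ in range(steps):
        w -= lr * ste_grad(w, x, y)
    return w


w_final = train(0.2, x=1.0, y=3.0)
print(w_final, quantize(w_final))
```

The latent weight keeps moving even while its rounded value is unchanged, which is what lets training cross from one quantization level to the next.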
Published on 8 May 2024
Recovering lost watermarks using image denoising
This paper proposes a robust image watermarking model that introduces a denoising module between the noise layer and decoder in the typical encoder-decoder architecture. The denoising module reduces noise and recovers watermark information lost during attacks, improving robustness. Additionally, an SE module is added to the encoder to fuse watermarking information pixel-wise and channel-wise, enhancing efficiency. Experiments show the model matches or exceeds state-of-the-art methods under high noise levels. Ablations demonstrate the value of each...
Published on 8 May 2024
Radar-based human pose estimation through multi-format feature fusion
This paper introduces ProbRadarM3F, a novel radar-based model for indoor human pose estimation. It fuses traditional heatmap features from radar signals with new positional encoding features guided by generated probability maps. This allows it to capture more of the latent spatial information in radar data. Experiments show ProbRadarM3F outperforms prior state-of-the-art methods on the HuPR dataset for 14 keypoint detection, demonstrating the value of multi-format radar feature fusion.