View All Papers

Alfie Ranstead Matt Falconer Cláudio Lemos

New Bulletpapers

Sorted by the 'paper publish date' field, newest first

of 308

Published on 9 May 2024

Analysis of growth rates for surrogate loss bounds in classification

This paper analyzes the growth rates of bounds relating the estimation error of surrogate losses to that of the 0-1 loss, known as H-consistency bounds, in both binary and multi-class classification. It shows these bounds exhibit a universal square-root rate near 0 for common smooth losses like cross-entropy, under mild assumptions. This implies directly reducing the surrogate estimation error leads to a square-root decay in the target 0-1 estimation error. The paper also thoroughly compares losses based on minimizability gaps, which become ...

Published on 9 May 2024

Accelerating diffusion models for fast image generation

This paper proposes a method to distill a complex, multi-step diffusion model into a fast, single-step conditional GAN that generates images of nearly the same quality much more quickly. The key ideas are to treat diffusion distillation as an image-to-image translation task using noise and image pairs from the diffusion model, and to design losses that work directly in the model's latent space, avoiding decoding to pixels. The one-step model outperforms other state-of-the-art diffusion distillation techniques.

Published on 9 May 2024

Evaluating Language Models for Driving

This paper provides a comprehensive analysis of state-of-the-art multimodal language models in simulated driving environments. The models demonstrate significant limitations in reasoning logically about basic vehicle dynamics, interactions between vehicles, trajectory planning, and unexpected events across sequences of images. To enable analysis, the authors introduce DriveSim, a specialized driving simulator generating diverse scenarios. They also release full code and a dataset for continued research. Results reveal critical gaps in curren...

Published on 9 May 2024

Transforming Text into Images, Videos, 3D Objects and Audio via Flow-based Diffusion Transformers

This paper introduces Lumina-T2X, a family of flow-based large diffusion transformers designed to transform noise into images, videos, 3D objects and audio conditioned on text. Key techniques like tokenized representations, learnable placeholders, RoPE, RMSNorm and flow matching enable unified training and flexible generation across modalities and resolutions. Models scale up to 7B parameters with 35% of the training costs of a 600M model, achieving ultra-high-res images and long 720p videos.

Published on 9 May 2024

Efficient ranking of text options through selective pairwise comparisons

This paper proposes a framework to efficiently rank a set of text options by quality. It uses selective pairwise comparisons judged by a language model, combining these decisions probabilistically. With just a small subset of all possible comparisons, it can predict quality scores that correlate well with human judgment, while greatly reducing computational costs.

Published on 9 May 2024

Safe Reinforcement Learning Using Uncertainty-Aware Models

The paper proposes a new reinforcement learning method called CERL that maintains safety while learning policies, using Bayesian neural networks to model uncertainty and suggest policies robust to inaccuracies. CERL outperformed other methods on constrained MDP tasks from image inputs.

Published on 9 May 2024

Self-supervised modeling for text recognition

This paper proposes Symmetric Superimposition Modeling (SSM), a self-supervised approach for text recognition that captures both character shapes and linguistic context by reconstructing original and inverted images from their superimposition. SSM operates at both pixel and feature levels. At the pixel level, it reconstructs original and inverted images to capture shapes and texture-level context. At the feature level, it reconstructs features of the original and inverted images under different augmentations to model semantic-level context a...

Published on 8 May 2024

Event-based open-vocabulary scene parsing

This paper introduces OpenESS, a method to perform event-based semantic segmentation with open-ended textual queries instead of fixed labels. It transfers knowledge from image and text models to event data, allowing segmentation of new categories without retraining. Key techniques include contrastive learning between events and image regions, and optimizing event embeddings to match text meanings.

Published on 8 May 2024

Data-efficient 3D scene understanding for autonomous vehicles

This paper proposes a semi-supervised framework called LaserMix++ that leverages both LiDAR point clouds and camera images to improve 3D scene understanding for autonomous driving with far less labeled data. Key innovations include multi-modal data mixing, transferring knowledge from images to point clouds, and generating auxiliary labels from language models, which enhance regularization and feature learning.

Published on 8 May 2024

Evaluating and Reducing Hallucinations in Vision-Language Models

The paper proposes THRONE, a new benchmark to evaluate 'Type I' hallucinations (in open-ended responses) in large vision-language models (LVLMs). It utilizes language models to identify hallucinations and introduces metrics to quantify them. The paper demonstrates that reducing 'Type II' hallucinations (in responses to specific questions) does not reduce Type I hallucinations, and that existing methods for evaluating Type I hallucinations are limited. Finally, a simple data augmentation method is introduced that reduces both Type I and Type ...

Published on 8 May 2024

Decoder-decoder architecture for efficient language models

The paper proposes YOCO, a decoder-decoder architecture for large language models. It consists of a self-decoder that encodes global key-value caches, and a cross-decoder that reuses those caches. This design reduces GPU memory usage and speeds up inference compared to regular Transformer decoders. Experiments show YOCO scales well in terms of model size, training data, and context length. At 1 million tokens it achieves high accuracy on retrieval tasks. Profiling shows orders of magnitude less memory usage and faster prefilling.

Published on 8 May 2024

Efficient GNN training on disk

This paper introduces DiskGNN, a system to efficiently train graph neural networks on disk when graphs exceed CPU memory. DiskGNN achieves high efficiency and model accuracy through offline sampling to optimize data layout, four-level caching, batched packing, and pipelined training. Experiments show DiskGNN speeds up state-of-the-art systems by over 8x while matching accuracy.

Published on 8 May 2024

Language-guided robot control for surgical tasks

This paper presents SuFIA, a framework that uses large language models (LLMs) and perception modules to plan and execute robotic control for surgical sub-tasks. This allows for a learning-free approach to surgical automation without needing motion primitives or examples. SuFIA incorporates re-planning and human oversight to mitigate errors. Experiments in simulation and on a physical robot platform demonstrate SuFIA's ability to autonomously perform common surgical tasks under challenging conditions.

Published on 8 May 2024

Online Platform Content Moderation Policy Study

This paper analyzes content moderation policies from 43 major online platforms to understand their approaches to moderating copyright infringement, harmful speech, and misleading content. Using a custom web scraper and unified annotation scheme, the authors find significant variation across platforms and topics attributable to differing legal regimes. The paper lays groundwork for studying evolving moderation policies and their impacts.

Published on 8 May 2024

Accelerating diffusion models with distillation for fast high-quality image generation

This paper proposes a distillation framework to accelerate diffusion models, enabling high-quality and diverse image generation using only 1-3 sampling steps. Key innovations include Backward Distillation to reduce train-test discrepancy, Shifted Reconstruction Loss to transfer both structure and detail knowledge, and Noise Correction to enhance initial sample quality.

Published on 8 May 2024

Efficient attention computation for transformers

This paper develops a convolution-based method to efficiently approximate attention in transformers, reducing the quadratic complexity to nearly linear. It shows that any attention matrix can be decomposed into convolution matrices, which allows the fast Fourier transform to be used for faster computation without changing model parameters.

Published on 8 May 2024

Text-driven 3D human pose estimation

This paper proposes FinePOSE, a new diffusion model-based approach for estimating 3D human poses from 2D keypoints. It introduces a novel fine-grained part-aware prompt learning mechanism to provide precise guidance for each human body part's movement. FinePOSE also establishes communications between the learned prompts and poses to enhance the diffusion model's denoising capability. Experiments show state-of-the-art performance on public benchmarks. An extension to multi-human scenarios also demonstrates potential.

Published on 8 May 2024

SPIDER: Fast rank and select queries

This paper introduces SPIDER, a new succinct data structure for answering rank and select queries on bit vectors. SPIDER uses only 3.82% extra space, yet outperforms prior methods. For rank queries, SPIDER is the fastest known method on large inputs. For select queries, it narrows the performance gap between space-efficient and less space-efficient techniques. Key ideas include interleaving metadata with the bit vector to improve cache performance, and using predictions to accelerate select queries.
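For readers unfamiliar with the two queries, here is a minimal sketch of rank and select on a bit vector. It is a generic textbook baseline with sampled prefix counts, not SPIDER's interleaved layout or its prediction scheme; the class name `RankBitVector` and the `block_size` parameter are illustrative assumptions.

```python
from bisect import bisect_left
from typing import List


class RankBitVector:
    """Bit vector with per-block prefix counts (generic baseline, not SPIDER)."""

    def __init__(self, bits: List[int], block_size: int = 64) -> None:
        self.bits = bits
        self.block_size = block_size
        # block_ranks[b] = number of 1 bits before block b; last entry is the total.
        self.block_ranks = [0]
        for start in range(0, len(bits), block_size):
            ones = sum(bits[start:start + block_size])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i: int) -> int:
        """Return the number of 1 bits in bits[0:i]."""
        if not 0 <= i <= len(self.bits):
            raise IndexError("rank position out of range")
        block = i // self.block_size
        start = block * self.block_size
        # Stored count up to the block start, plus a short scan inside the block.
        return self.block_ranks[block] + sum(self.bits[start:i])

    def select1(self, k: int) -> int:
        """Return the position of the k-th 1 bit (k starts at 1)."""
        if not 1 <= k <= self.block_ranks[-1]:
            raise ValueError("k out of range")
        # Binary search for the block that contains the k-th 1 bit.
        block = bisect_left(self.block_ranks, k) - 1
        remaining = k - self.block_ranks[block]
        start = block * self.block_size
        for pos in range(start, min(start + self.block_size, len(self.bits))):
            if self.bits[pos]:
                remaining -= 1
                if remaining == 0:
                    return pos
        raise AssertionError("unreachable if block_ranks is consistent")


bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], block_size=4)
print(bv.rank1(4), bv.select1(4))
```

The stored counts are the "extra space" that succinct structures try to keep small; a real implementation would pack bits into machine words and use popcount instead of Python lists.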

Published on 8 May 2024

Synthesizing broadcast channels using shared randomness

This paper studies the problem of synthesizing a two-user broadcast channel using a common message, when the input terminal shares independent randomness with each output terminal. The authors provide inner and lower bounds on the rate tradeoff between communication and shared randomness. These bounds are tight for some special cases like point-to-point channels and channels without inputs studied in prior work.

Published on 8 May 2024

Team of robots track moving people by sharing object observations

A team of mobile robots can more accurately track moving people in their environment by sharing observations of people's locations with each other in real-time. But robots accumulate error in their position estimates, so they must repeatedly estimate the change in coordinate frames between themselves and neighbors. This paper presents a full system for robots to build maps of generic objects seen recently, align the maps to estimate relative positions, and use those to share observations of people for collaborative tracking.

Published on 8 May 2024

Using game techniques to engage software engineering students

This paper investigates applying gamification in software engineering education through a tertiary study, finding it can increase student engagement and motivation. However, improper implementation may negatively impact performance. Key areas for gamification are testing and quality, with competition and cooperation the most used game elements.

Published on 8 May 2024

Detecting unusual certificates

The authors propose using the Isolation Forest algorithm to detect anomalous X.509 certificates in Certificate Transparency logs. This unsupervised machine learning method builds random trees to isolate outliers. It identifies certificates significantly different from typical ones based on quantitative attributes like subject name length or public key type, without needing to pre-define anomalies. When standards compliance checks are insufficient, it can reveal potential issues needing investigation. The technique seems promising when traine...

Published on 8 May 2024

Encoder-decoder model for interactive free verse generation with controllable high-quality rhyming

The paper proposes a novel fine-tuning approach to generate lyrics and free verse poems with controllable, high-quality rhyming. By prepending the rhyming word to the start of each line, the model makes the critical rhyming decision first while still generating the verse left-to-right. Extensive experiments show this approach produces more readable text and better rhyming compared to prior state-of-the-art methods. A high-quality multilingual dataset is also introduced to demonstrate wide applicability.
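The data formatting described above could be sketched as follows. This is only an illustration of the idea of moving the rhyming word to the front of each line; the separator string, the function names, and the choice of the last word as the rhyming word are assumptions, not the paper's actual format.

```python
import string

# Hypothetical separator between the prepended rhyming word and the verse line.
SEPARATOR = " | "


def last_word(line: str) -> str:
    """Return the final word of a line, lowercased, without ASCII punctuation."""
    words = line.split()
    if not words:
        return ""
    return words[-1].strip(string.punctuation).lower()


def prepend_rhyme_word(line: str) -> str:
    """Format a line so the rhyming word comes first, followed by the full line."""
    word = last_word(line)
    return f"{word}{SEPARATOR}{line}" if word else line


def strip_rhyme_word(formatted: str) -> str:
    """Recover the plain verse line from a formatted (or generated) line."""
    _, sep, rest = formatted.partition(SEPARATOR)
    return rest if sep else formatted


verse = ["The night is cold and bright,", "I walk without a light."]
for line in verse:
    print(prepend_rhyme_word(line))
```

A model fine-tuned on lines in this shape emits the rhyming word first and then the verse left to right; `strip_rhyme_word` removes the prefix before the text is shown to a user.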

Published on 8 May 2024

Protecting privacy in conversational agents

This paper introduces a new threat model where adversarial third parties manipulate context to trick conversational agents into leaking private user data. The authors propose AirGapAgent, an architecture that restricts agent access to only necessary user data for a task. Experiments show AirGapAgent protects up to 97% of user data from context manipulation attacks while maintaining utility.

Published on 8 May 2024

3D perception of vehicle surroundings

This paper surveys recent research on 3D occupancy perception, which seeks to capture detailed 3D structures around vehicles to enable autonomous driving systems to precisely understand complex environments. It highlights that occupancy perception combines inputs from multiple sensors and fuses information across data sources. Key challenges include converting 2D images to 3D representations, integrating multi-camera and multi-frame observations, and training networks. The paper analyzes performance on datasets and discusses future opportuni...

Published on 8 May 2024

Quantized neural network training equivalence

This paper proves that many proposed complex gradient estimators for quantized neural networks are equivalent to simpler estimators like the straight-through estimator. After adjustments to the learning rate and weight initialization, models using complex estimators train almost identically to those using the straight-through estimator.

Published on 8 May 2024

Recovering lost watermarks using image denoising

This paper proposes a robust image watermarking model that introduces a denoising module between the noise layer and decoder in the typical encoder-decoder architecture. The denoising module reduces noise and recovers watermark information lost during attacks, improving robustness. Additionally, an SE module is added to the encoder to fuse watermarking information pixel-wise and channel-wise, enhancing efficiency. Experiments show the model matches or exceeds state-of-the-art methods under high noise levels. Ablations demonstrate the value of each...

Published on 8 May 2024

Radar-based human pose estimation through multi-format feature fusion

This paper introduces ProbRadarM3F, a novel radar-based model for indoor human pose estimation. It fuses traditional heatmap features from radar signals with new positional encoding features guided by generated probability maps. This allows it to capture more of the latent spatial information in radar data. Experiments show ProbRadarM3F outperforms prior state-of-the-art methods on the HuPR dataset for 14 keypoint detection, demonstrating the value of multi-format radar feature fusion.
