Title: AIs Playing Games with Language: Balancing Uncertainty, Truth, and Usefulness in Large Language Models
Abstract: In this talk, I'll discuss three silly games that show the same pattern: computers are very good, humans are still better at some things, and we can create much stronger human--computer teams. We begin with games that test memory: the recall of obscure facts. While AI has been viewed as superhuman at question answering, it isn't universally so. After building a new human-in-the-loop authoring system for drafting challenging examples, we show that an item response theory measure of adversarial datasets (which captures the gap between human and computer skill) reveals a gap that is shrinking but not yet closed, with computers still struggling at abstract reasoning and at knowing when they know the correct answer. Given these disparate skill sets, we then analyze how to best build human--computer teams to learn new facts and detect false statements: computers can help humans identify false statements---but only when the computer is not confidently incorrect. I then present a similar line of results for another silly language game, Diplomacy, where computers have still not reached dominance but can help human players think strategically and detect lies, which we capture using an analysis of grounded statements with abstract meaning representation and value functions. I'll close with how these results inform the construction of better human--computer interactions in "the real world".
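To make the item response theory reference above concrete, here is a minimal sketch of the two-parameter logistic (2PL) model that such an analysis builds on; the ability and item parameters below are made-up illustrations, not values from the study.

import numpy as np

def p_correct(ability, difficulty, discrimination):
    # 2PL IRT: probability that a subject with this ability answers the item correctly.
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

# Hypothetical ability estimates (higher = more skilled); illustrative only.
human_ability, computer_ability = 1.2, 0.4

# A hard, highly discriminative adversarial question (illustrative parameters).
difficulty, discrimination = 1.0, 2.0

print("P(human correct)    =", p_correct(human_ability, difficulty, discrimination))
print("P(computer correct) =", p_correct(computer_ability, difficulty, discrimination))
# The human-computer gap on such items shrinks as computer ability estimates rise.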
Title: AI's Certain Uncertainty Problem
Modern AI tools will gladly give the answer to most questions. And
while most of those answers will be correct, some of them are not.
But neither the users nor the systems can reliably detect when a
system is confidently incorrect. We first show that this has
real-world consequences in human-centered tasks: users are more likely
to make mistakes in a fact-checking task when presented with
convincingly wrong outputs from large language models. We then discuss
how to improve our estimation of uncertainty and why doing so requires
deeper access than many popular LLMs currently provide.
Next, we discuss ongoing research to evaluate topic models and label
induction that depends on better uncertainty estimation. Finally, I
end with a proposal about how to improve our evaluation of LLM
uncertainty in the spirit of previous AI "grand challenges".
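As a rough illustration of the kind of "deeper access" referred to above, the sketch below scores a sequence by its average token log-probability using an open model (GPT-2 via Hugging Face transformers). This is only a crude confidence proxy that assumes white-box access to token probabilities; it is not one of the estimators from the talk.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_confidence(text: str) -> float:
    # Average log-probability of each token given its prefix: a crude confidence proxy.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

print(sequence_confidence("The capital of France is Paris."))
print(sequence_confidence("The capital of France is Berlin."))
# Scores like these require access to token probabilities, which many hosted LLMs
# expose only partially or not at all.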
Title: If We Want AI to be Interpretable, We Need to Measure Interpretability
Abstract: AI tools are ubiquitous, but most users treat them as black
boxes: handy tools that suggest purchases, flag spam, or autocomplete
text. While researchers have presented explanations for making AI less
of a black box, a lack of metrics makes it hard to optimize explicitly
for interpretability. Thus, I propose two metrics for interpretability
suitable for unsupervised and supervised AI methods.
For unsupervised topic models, I discuss our proposed "intruder"
interpretability metric, how it contradicts the previous evaluation
metric for topic models (perplexity), and its uptake in the
community over the last decade. For supervised question answering
approaches, I show how human-computer cooperation can be measured and
directly optimized by a multi-armed bandit approach to learn what
kinds of explanations help specific users. I will then briefly discuss
how similar setups can help users navigate information-rich domains
like fact checking, translation, and web search.
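For readers unfamiliar with the "intruder" metric mentioned above, the sketch below shows the general shape of a word-intrusion item: take a topic's top words, add a word prominent in a different topic, shuffle, and ask annotators to spot the intruder. The word lists and parameters are illustrative assumptions, not the evaluation's actual setup.

import random

def make_intruder_item(topic_top_words, other_topic_top_words, n_shown=5, seed=0):
    # Show a topic's top words plus one "intruder" word prominent in another topic.
    rng = random.Random(seed)
    shown = list(topic_top_words[:n_shown])
    candidates = [w for w in other_topic_top_words if w not in topic_top_words]
    intruder = rng.choice(candidates)
    items = shown + [intruder]
    rng.shuffle(items)
    return items, intruder

words, intruder = make_intruder_item(
    ["game", "team", "season", "player", "coach", "league"],   # sports-like topic
    ["court", "judge", "law", "trial", "attorney", "jury"],    # legal-like topic
)
print(words, "-> intruder:", intruder)
# A topic model's interpretability score is then the fraction of such items on
# which annotators pick the true intruder.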
Bio: Jordan Boyd-Graber is an associate professor in the University of
Maryland's Computer Science Department, iSchool, UMIACS, and Language
Science Center. He generally works on how humans can interact with AI
tools, starting first with topic models, then translation, then
negotiation, and most recently question answering. He and his students
have won "best of" awards at NIPS (2009, 2015), NAACL (2016), CoNLL
(2015), and EMNLP (2024). Jordan also won the British Computer
Society's 2015 Karen Spärck Jones Award and a 2017 NSF CAREER award.
Researchers who work in Natural Language Processing work on
"language". But where does language data come from? Rather than
using found data from a "natural" distribution, this talk will make
the case for using adversarial data. After putting this in historical
context, I’ll discuss two examples of adversarial authoring with a
human-in-the-loop system: question writing and fact-checking. For
question answering, we develop a retrieval-based adversarial authoring
platform and create incentives to get people to use our system in the
first place, write interesting questions humans can answer but that
challenge a QA system. We then turn to fact checking, creating a new
game (Fool Me Twice) to solicit difficult-to-verify claims—that can be
either true or false—and to test how difficult the claims are both for
humans and computers. We argue that the focus on retrieval is
important for knowledge-based adversarial examples because it
highlights diverse information, prevents frustration in authors, and
takes advantage of users’ expertise. I'll finally turn to detecting
deception in the board game Diplomacy, which further highlights the
abilities of humans vs. computers.
==========================
You've probably heard of ChatGPT and other large language models.
They can do some neat stuff! But they still have some weaknesses.
After briefly reviewing how transformer language models work, we will
go into the weaknesses of models in 2023: recognizing unconventional
language, numerical and logical reasoning, multihop questions,
cross-cultural understanding, and incorporating new information.
Next, we'll describe how to elicit questions that challenge today's
computer models using adversarial human-in-the-loop authoring and how
computers and humans compare at their ability to answer those
questions.
I’ll discuss two examples of our work putting experienced writers in front of a retrieval-driven
adversarial authoring system: question writing and fact-checking. For question answering, we develop
a retrieval-based adversarial authoring platform and create incentives to get people to use our system
in the first place, write interesting questions humans can answer, and challenge a QA system. While
the best humans lose to computer QA systems on normal questions, computers struggle to answer our
adversarial questions. We then turn to fact checking, creating a new game (Fool Me Twice) to solicit
difficult-to-verify claims—that can be either true or false—and to test how difficult the claims are both
for humans and computers. We argue that the focus on retrieval is important for knowledge-based adversarial examples because it highlights diverse information, prevents frustration in authors, and takes
advantage of users’ expertise.
Title: Manchester vs. Cranfield: Why do we do question answering and
how can we do it better?
In this talk, I will examine why researchers are interested in
question answering. I introduce the distinction between the
Manchester (exploration of intelligence) vs. Cranfield (serving the
user) paradigms. After outlining this distinction, I will then argue
for why the Manchester paradigm can reveal the shortfalls of existing
QA systems from how we find answers to how we sort systems on
leaderboards, drawing on the best practices of the human question
answering (i.e., trivia) community. Finally, I'll outline how
Cranfield-style systems can better measure the interpretability of
machine learning systems through explicitly optimizing a user-centric
metric, which allows the creation of user-specific explanations of
artificial intelligence.
Title: Comparing and Augmenting Human and Computer Intelligence through Natural
Language Question Answering
Humans' and computers' language abilities are complementary: computers
can better memorize, while humans grasp nuance. Building the
future of language-mediated interactions with computers requires
balancing these abilities. We use these differences to make machine
learning more interpretable for users (by measuring how useful
explanations are) and to make machine learning examples more
challenging for computers (by creating expert-authored adversarial
training examples). After reviewing this completed work in the
domain of question answering, I discuss ongoing work to improve
models that predict whether a human will answer a question correctly
(for educational applications) and to model computer question
answering systems and datasets to better understand the state of the
art in natural language processing research.
In this talk, I will
discuss how to measure the interpretability of machine learning
explanations using human-machine cooperation.
Artificial Intelligence isn't a game show (but it should be)
Artificial intelligence is viewed as a goal in science (let's build
intelligent machines) and in education (let's train software engineers
to build smart assistants).
Despite the serious implications for policy and for education, the
most widely-accepted view of the end goal of Artificial Intelligence
is a parlor game: a trivial "imitation game" (known today as
the Turing Test).
Likewise, many of the watersheds in the public understanding of AI
progress have been in games like chess or go. Sometimes, they're a
literal game show like Jeopardy!
After discussing why existing game show exhibitions have given an
inaccurate impression of how well we're doing with question
answering, I'll discuss how we can use the skills and strategies of
high school trivia competitions to improve the science of AI,
communicate the limitations of AI, and broaden participation in
computer science and artificial intelligence.
Title:
Big Data Analysis with Topic Models: Evaluation, Interaction, and Multilingual Extensions
Abstract:
A common information need is to understand large, unstructured
datasets: millions of e-mails during e-discovery, a decade's worth of
science correspondence, or a day's tweets. In the last decade, topic
models have become a common tool for navigating such datasets even
across languages. This talk investigates the foundational research
that allows successful tools for these data exploration tasks: how to
know when you have an effective model of the dataset; how to correct
bad models; how to measure topic model effectiveness; and how to
detect framing and spin using these techniques. After introducing
topic models, I argue why traditional measures of topic model
quality---borrowed from machine learning---are inconsistent with how
topic models are actually used. In response, I describe interactive
topic modeling, a technique that enables users to impart their
insights and preferences to models in a principled, interactive way.
I will then address measuring topic model effectiveness in real-world
tasks.
Jordan Boyd-Graber (UMD) will be presenting work entitled "Cooperative and Competitive Human-Machine Learning through Question Answering":
Abstract:
My research goal is to create machine learning algorithms that are
interpretable to humans, that can understand human strengths and
weaknesses, and can help humans improve themselves. In this talk,
I'll discuss how we accomplish this through a trivia game called quiz
bowl. These questions are written so that they can be interrupted by
someone who knows more about the answer; that is, harder clues are at
the start of the question and easier clues are at the end of the
question: a player must decide when they have enough information to "buzz
in". After arguing in favor of this task compared to other question
answering formats, we describe our question answering system:
how and when to buzz using deep learning and reinforcement learning.
More importantly, however, this setting also helps us build systems to
adapt in cooperation and competition with humans. In competition, we
are also able to understand the skill sets of our competitors to
adjust our strategy to optimize our performance against players using
a deep mixture of experts opponent model. The game of quiz bowl also
allows opportunities to better understand interpretability in deep
learning models to *help* human players perform better with machine
cooperation. This cooperation helps us with a related task,
simultaneous machine translation.
Finally, I'll discuss opportunities for broader participation through
open human-computer competitions:
http://qanta.org/
Jordan Boyd-Graber is an associate professor in the University of
Maryland's Computer Science Department, iSchool, UMIACS, and Language
Science Center. Jordan's research focus is in applying machine
learning and Bayesian probabilistic models to problems that help us
better understand social interaction or the human cognitive
process. He and his students have won "best of" awards at NIPS (2009,
2015), NAACL (2016), and CoNLL (2015), and Jordan won the British
Computer Society's 2015 Karen Spärck Jones Award and a 2017 NSF CAREER
award. His research has been funded by DARPA, IARPA, NSF, NCSES, ARL,
NIH, and Lockheed Martin and has been featured by CNN, Huffington
Post, New York Magazine, and the Wall Street Journal.
====================
Jordan Boyd-Graber (UMD) will be presenting work entitled "Cooperative and Competitive Machine Learning through Question Answering":
Abstract: My research goal is to create machine learning algorithms
that are interpretable to humans, that can understand human strengths
and weaknesses, and can help humans improve themselves. In this talk,
I'll discuss how we accomplish this through a trivia game called quiz
bowl. These questions are written so that they can be interrupted by
someone who knows more about the answer (unlike, say, Jeopardy!); that
is, harder clues are at the start of the question and easier clues are
at the end of the question: a player must decide when they have enough
information to "buzz in". Our system determines the right answer, uses
deep reinforcement learning to decide when to buzz, and can beat the
best humans at the game.
More importantly, this setting also helps us build systems to adapt in
cooperation and competition with humans. If we give humans access to
the predictions of our question answering system, we can craft
adversarial questions that remain natural for humans to answer but
very challenging for computers. The game of quiz bowl also allows
opportunities to better understand interpretability in deep learning
models to *help* human players perform better with machine
cooperation. This cooperation helps us with a related task,
simultaneous machine translation.
Finally, I'll discuss opportunities for broader participation through
open human-computer competitions:
http://qanta.org/
Jordan Boyd-Graber is an associate professor in the University of
Maryland's Computer Science Department, iSchool, UMIACS, and Language
Science Center. Jordan's research focus is in applying machine
learning and Bayesian probabilistic models to problems that help us
better understand social interaction or the human cognitive
process. He and his students have won "best of" awards at UAI (2018),
NIPS (2009, 2015), NAACL (2016), and CoNLL (2015), and Jordan won the
British Computer Society's 2015 Karen Spärck Jones Award and a 2017
NSF CAREER award. His research has been funded by DARPA, IARPA, NSF,
NCSES, ARL, NIH, and Lockheed Martin and has been featured by CNN,
Huffington Post, New York Magazine, and the Wall Street Journal.
===========================================
Computational Modeling of Relationships, Spin, and Betrayal
Computational approaches allow us to understand how text encodes
latent social factors. My talk focuses on recent work that
attempts to use unannotated or found annotations to better understand
relationships in the real world, in fiction, and in games. First,
I'll discuss a probabilistic model that attempts to predict the
political polarization of a speaker based on how they frame policy
issues. Based on data from the US Congress, we describe the Tea Party
movement and its relationship to mainstream Republicans. Next, I'll
discuss a deep dictionary learning approach to categorize the
relationships between fictional characters over time. We'll show that
this approach improves over topic model baselines in both prediction
and human interpretability. Finally, I'll discuss work to detect
deception in the online game Diplomacy and to identify which clues
predict betrayal.
================================================================
Title: Thinking on your Feet: Reinforcement Learning for Incremental Language Tasks
Abstract:
In this talk, I'll discuss two real-world language applications that require "thinking on your feet": synchronous machine translation (or "machine simultaneous interpretation") and question answering (when questions are revealed one piece at a time). In both cases, effective algorithms for these tasks must interrupt the input stream and decide when to provide output.
In synchronous machine translation, a sentence is produced one word at a time in a foreign language and we want to produce an English translation simultaneously (i.e., with as little delay as possible between a foreign language word and its English translation). This is particularly difficult in verb-final languages like German or Japanese, where an English translation can barely begin until the verb is seen. Effective translation thus requires predicting unseen elements of the sentence (e.g., the main verb in German and Japanese, relative clauses in Japanese, or post-positions in Japanese). We use reinforcement learning to decide when to trust our verb predictions: the system must learn to balance incorrect translations against timely ones and must use those predictions to translate the sentence.
For question answering, we use a specially designed dataset that challenges humans: a trivia game called quiz bowl. These questions are written so that they can be interrupted by someone who knows more about the answer; that is, harder clues are at the start of the question and easier clues are at the end. We create a novel neural network system to predict answers from incomplete questions and use reinforcement learning to decide when to guess. We are able to answer correctly earlier in the question than most college trivia contestants.
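To make the incremental decision problem concrete, here is a minimal sketch of a buzzer loop: clues are revealed one at a time, a guesser returns an answer with a confidence, and a simple threshold decides when to buzz. The actual system learns this policy with reinforcement learning; the toy guesser and threshold below are stand-ins.

from typing import Callable, List, Tuple

def play_question(clues: List[str],
                  guesser: Callable[[str], Tuple[str, float]],
                  threshold: float = 0.8) -> Tuple[str, int]:
    # Reveal clues one at a time; buzz once confidence clears the threshold.
    seen, answer = "", ""
    for i, clue in enumerate(clues):
        seen = (seen + " " + clue).strip()
        answer, confidence = guesser(seen)
        if confidence >= threshold:
            return answer, i + 1      # buzzed after i + 1 clues
    return answer, len(clues)         # forced to answer at the end

def toy_guesser(partial_question: str) -> Tuple[str, float]:
    # Stand-in guesser whose confidence grows as more of the question is revealed.
    confidence = min(1.0, len(partial_question.split()) / 20.0)
    return "Herman Melville", confidence

print(play_question(["This author wrote about a white whale.",
                     "He also wrote Billy Budd.",
                     "Name this author of Moby-Dick."], toy_guesser))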
================================================================
Title:
Big Data Analysis with Topic Models: Human Interaction and Social
Science Applications
Abstract:
A common information need is to understand large, unstructured
datasets: millions of e-mails during e-discovery, a decade's worth of
science correspondence, or a day's tweets. In the last decade, topic
models have become a common tool for navigating such datasets. This
talk investigates the foundational research that allows successful
tools for these data exploration tasks: how to know when you have an
effective model of the dataset; how to correct bad models; how to
measure topic model effectiveness; and how to detect framing and spin
using these techniques. After introducing topic models, I argue why
traditional measures of topic model quality---borrowed from machine
learning---are inconsistent with how topic models are actually used.
In response, I describe interactive topic modeling, a technique that
enables users to impart their insights and preferences to models in a
principled, interactive way. I will then address measuring topic
model effectiveness in real-world tasks. Finally, I'll discuss
ongoing collaborations with political scientists to use these
techniques to detect spin and framing in political and online
interactions.
================================================================
Opening the Black Box of Machine Learning: Interactive, Interpretable Interfaces for Exploring Linguistic Tasks
Abstract: Machine learning is ubiquitous, but most users treat it as a
black box: a handy tool that suggests purchases, flags spam, or
autocompletes text. I present qualities that ubiquitous machine
learning should have to allow for a future filled with fruitful,
natural interactions with humans: interpretability, interactivity, and
an understanding of human qualities. After introducing these
properties, I present machine learning applications that begin to
fulfill these desirable properties. I begin with a traditional
information processing task---making sense of and categorizing large
document collections---and show that machine learning methods can
provide interpretable, efficient techniques to categorize large
document collections with a human in the loop. From there, I turn to
language-based games that require machines and humans to compete and
cooperate and discuss how this can improve and measure
interpretability in machine learning.