Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 937–947,
Uppsala, Sweden, 11-16 July 2010.
© 2010 Association for Computational Linguistics
Bayesian Synchronous Tree-Substitution Grammar Induction
and its Application to Sentence Compression
Elif Yamangil and Stuart M. Shieber
Harvard University
Cambridge, Massachusetts, USA
{elif, shieber}@seas.harvard.edu
Abstract
We describe our experiments with training
algorithms for tree-to-tree synchronous
tree-substitution grammar (STSG) for
monolingual translation tasks such as
sentence compression and paraphrasing.
These translation tasks are characterized
by the relative ability to commit to parallel
parse trees and availability of word align-
ments, yet the unavailability of large-scale
data, calling for a Bayesian tree-to-tree
formalism. We formalize nonparametric
Bayesian STSG with epsilon alignment in
full generality, and provide a Gibbs sam-
pling algorithm for posterior inference tai-
lored to the task of extractive sentence
compression. We achieve improvements
against a number of baselines, including
expectation maximization and variational
Bayes training, illustrating the merits of
nonparametric inference over the space of
grammars as opposed to sparse parametric
inference with a fixed grammar.
1 Introduction
Given an aligned corpus of tree pairs, we might
want to learn a mapping between the paired trees.
Such induction of tree mappings has application
in a variety of natural-language-processing tasks
including machine translation, paraphrase, and
sentence compression. The induced tree map-
pings can be expressed by synchronous grammars.
Where the tree pairs are isomorphic, synchronous
context-free grammars (SCFG) may suffice, but in
general, non-isomorphism can make the problem
of rule extraction difficult (Galley and McKeown,
2007). More expressive formalisms such as syn-
chronous tree-substitution (Eisner, 2003) or tree-
adjoining grammars may better capture the pair-
ings.
In this work, we explore techniques for inducing
synchronous tree-substitution grammars (STSG)
using as a testbed application extractive sentence
compression. Learning an STSG from aligned
trees is tantamount to determining a segmentation
of the trees into elementary trees of the grammar
along with an alignment of the elementary trees
(see Figure 1 for an example of such a segmenta-
tion), followed by estimation of the weights for the
extracted tree pairs.¹ These elementary tree pairs
serve as the rules of the extracted grammar. For
SCFG, segmentation is trivial — each parent with
its immediate children is an elementary tree — but
the formalism then restricts us to deriving isomor-
phic tree pairs. STSG is much more expressive,
especially if we allow some elementary trees on
the source or target side to be unsynchronized, so
that insertions and deletions can be modeled, but
the segmentation and alignment problems become
nontrivial.
Previous approaches to this problem have
treated the two steps — grammar extraction and
weight estimation — with a variety of methods.
One approach is to use word alignments (where
these can be reliably estimated, as in our testbed
application) to align subtrees and extract rules
(Och and Ney, 2004; Galley et al., 2004) but
this leaves open the question of finding the right
level of generality of the rules — how deep the
rules should be and how much lexicalization they
should involve — necessitating resorting to heuris-
tics such as minimality of rules, and leading to
large grammars. Once a given set of rules is ex-
tracted, weights can be imputed using a discrimi-
native approach to maximize the (joint or condi-

tional) likelihood or the classification margin in
the training data (taking or not taking into account
the derivational ambiguity). This option leverages
a large amount of manual domain knowledge en-
gineering and is not in general amenable to latent
variable problems.

¹ Throughout the paper we will use the word STSG to refer to the tree-to-tree version of the formalism, although the string-to-tree version is also commonly used.
A simpler alternative to this two step approach
is to use a generative model of synchronous
derivation and simultaneously segment and weight
the elementary tree pairs to maximize the prob-
ability of the training data under that model; the
simplest exemplar of this approach uses expecta-
tion maximization (EM) (Dempster et al., 1977).
This approach has two frailties. First, EM search
over the space of all possible rules is computation-
ally impractical. Second, even if such a search
were practical, the method is degenerate, pushing
the probability mass towards larger rules in order
to better approximate the empirical distribution of
the data (Goldwater et al., 2006; DeNero et al.,
2006). Indeed, the optimal grammar would be one
in which each tree pair in the training data is its
own rule. Therefore, proposals for using EM for
this task start with a precomputed subset of rules,
and with EM used just to assign weights within
this grammar. In summary, previous methods suf-
fer from problems of narrowness of search, having
to restrict the space of possible rules, and overfit-
ting in preferring overly specific grammars.
We pursue the use of hierarchical probabilistic
models incorporating sparse priors to simultane-
ously solve both the narrowness and overfitting
problems. Such models have been used as gener-
ative solutions to several other segmentation prob-
lems, ranging from word segmentation (Goldwa-
ter et al., 2006), to parsing (Cohn et al., 2009; Post
and Gildea, 2009) and machine translation (DeN-
ero et al., 2008; Cohn and Blunsom, 2009; Liu
and Gildea, 2009). Segmentation is achieved by
introducing a prior bias towards grammars that are
compact representations of the data, namely by en-
forcing simplicity and sparsity: preferring simple
rules (smaller segments) unless the use of a com-
plex rule is evidenced by the data (through repeti-
tion), and thus mitigating the overfitting problem.
A Dirichlet process (DP) prior is typically used
to achieve this interplay. Interestingly, sampling-
based nonparametric inference further allows the
possibility of searching over the infinite space of
grammars (and, in machine translation, possible
word alignments), thus side-stepping the narrow-
ness problem outlined above as well.
In this work, we use an extension of the afore-
mentioned models of generative segmentation for
STSG induction, and describe an algorithm for
posterior inference under this model that is tai-
lored to the task of extractive sentence compres-
sion. This task is characterized by the availabil-
ity of word alignments, providing a clean testbed
for investigating the effects of grammar extraction.

We achieve substantial improvements against a
number of baselines including EM, support vector
machine (SVM) based discriminative training, and
variational Bayes (VB). By comparing our method
to a range of other methods that are subject dif-
ferentially to the two problems, we can show that
both play an important role in performance limi-
tations, and that our method helps address both as
well. Our results are thus not only encouraging for
grammar estimation using sparse priors but also il-
lustrate the merits of nonparametric inference over
the space of grammars as opposed to sparse para-
metric inference with a fixed grammar.
In the following, we define the task of extrac-
tive sentence compression and the Bayesian STSG
model, and algorithms we used for inference and
prediction. We then describe the experiments in
extractive sentence compression and present our
results in contrast with alternative algorithms. We
conclude by giving examples of compression pat-
terns learned by the Bayesian method.
2 Sentence compression
Sentence compression is the task of summarizing a
sentence while retaining most of the informational
content and remaining grammatical (Jing, 2000).
In extractive sentence compression, which we fo-
cus on in this paper, an order-preserving subset of
the words in the sentence are selected to form the
summary, that is, we summarize by deleting words
(Knight and Marcu, 2002). An example sentence
pair, which we use as a running example, is the
following:
• _Like FaceLift, much of_ ATM’s screen perfor-
mance depends on the underlying applica-
tion.
• ATM’s screen performance depends on the
underlying application.
Figure 1: A portion of an STSG derivation of the example sentence and its extractive compression.
where the underlined words were deleted. In su-
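The deletion-only setting can be stated operationally: a valid output is any order-preserving subsequence of the input words. The following small Python check is our own illustration, not part of the original work, and uses a simplified tokenization of the running example above.

```python
# Illustrative sketch: extractive compression keeps an order-preserving
# subsequence of the source words, i.e. it only deletes words.
def is_extractive(source_words, target_words):
    """True iff target_words can be obtained from source_words by deletion."""
    it = iter(source_words)
    return all(word in it for word in target_words)

source = ("Like FaceLift , much of ATM 's screen performance depends on "
          "the underlying application .").split()
target = "ATM 's screen performance depends on the underlying application .".split()
print(is_extractive(source, target))   # True
```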
pervised sentence compression, the goal is to gen-
eralize from a parallel training corpus of sentences
(source) and their compressions (target) to unseen
sentences in a test set to predict their compres-
sions. An unsupervised setup also exists; meth-
ods for the unsupervised problem typically rely
on language models and linguistic/discourse con-
straints (Clarke and Lapata, 2006a; Turner and
Charniak, 2005). Because these methods rely on
dynamic programming to efficiently consider hy-
potheses over the space of all possible compres-
sions of a sentence, they may be harder to extend
to general paraphrasing.
3 The STSG Model
Synchronous tree-substitution grammar is a for-
malism for synchronously generating a pair of
non-isomorphic source and target trees (Eisner,
2003). Every grammar rule is a pair of elemen-
tary trees aligned at the leaf level at their frontier
nodes, which we will denote using the form

    c_s / c_t → e_s / e_t , γ

(indices s for source, t for target) where c_s, c_t are
root nonterminals of the elementary trees e_s, e_t re-
spectively and γ is a 1-to-1 correspondence be-
tween the frontier nodes in e_s and e_t. For example,
the rule

    S / S → (S (PP (IN Like) NP[]) NP[1] VP[2]) / (S NP[1] VP[2])

can be used to delete a subtree rooted at PP. We
use square bracketed indices to represent the align-
ment γ of frontier nodes — NP[1] aligns with
NP[1], VP[2] aligns with VP[2], NP[] aligns with
the special symbol ε denoting a deletion from the
source tree. Symmetrically ε-aligned target nodes
are used to represent insertions into the target tree.
Similarly, the rule

    NP / ε → (NP (NN FaceLift)) / ε

can be used to continue deriving the deleted sub-
tree. See Figure 1 for an example of how an STSG
with these rules would operate in synchronously
generating our example sentence pair.
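To make the rule notation concrete, the sketch below shows one way such a rule could be represented in code. This is our own illustration, not the authors' implementation; the class names, the EPS placeholder, and the bracketed-string encoding of elementary trees are assumptions made for readability.

```python
# A minimal, illustrative encoding of an STSG rule c_s/c_t -> e_s/e_t, gamma.
EPS = "<eps>"   # stands in for the special epsilon symbol

class ElementaryTree:
    def __init__(self, root, skeleton, frontier):
        self.root = root          # root nonterminal, e.g. "S"
        self.skeleton = skeleton  # bracketed form with frontier nodes unexpanded
        self.frontier = frontier  # frontier nonterminals, left to right

class STSGRule:
    def __init__(self, source, target, alignment):
        self.source = source
        self.target = target
        # gamma: source frontier index -> target frontier index, or EPS
        self.alignment = alignment

# The deletion rule from the running example:
# S / S -> (S (PP (IN Like) NP[]) NP[1] VP[2]) / (S NP[1] VP[2])
rule = STSGRule(
    source=ElementaryTree("S", "(S (PP (IN Like) NP) NP VP)", ["NP", "NP", "VP"]),
    target=ElementaryTree("S", "(S NP VP)", ["NP", "VP"]),
    alignment={0: EPS, 1: 0, 2: 1},   # first source NP is deleted (epsilon-aligned)
)
```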
STSG is a convenient choice of formalism for
a number of reasons. First, it eliminates the iso-
morphism and strong independence assumptions
of SCFGs. Second, the ability to have rules deeper
than one level provides a principled way of model-
ing lexicalization, whose importance has been em-
phasized (Galley and McKeown, 2007; Yamangil
and Nelken, 2008). Third, we may have our STSG
operate on trees instead of sentences, which allows
for efficient parsing algorithms, as well as provid-
ing syntactic analyses for our predictions, which is
desirable for automatic evaluation purposes.
A straightforward extension of the popular EM
algorithm for probabilistic context free grammars
(PCFG), the inside-outside algorithm (Lari and
Young, 1990), can be used to estimate the rule
weights of a given unweighted STSG based on a
corpus of parallel parse trees t = t_1, . . . , t_N where
t_n = t_{n,s}/t_{n,t} for n = 1, . . . , N. Similarly, an
extension of the Viterbi algorithm is available for
finding the maximum probability derivation, use-
ful for predicting the target analysis t_{N+1,t} for a
test instance t_{N+1,s} (Eisner, 2003). However, as
noted earlier, EM is subject to the narrowness and
overfitting problems.

Figure 2: Gibbs sampling updates. We illustrate a sampler move to align/unalign a source node with a target node (top row in blue), and split/merge a deletion rule via aligning with ε (bottom row in red).
3.1 The Bayesian generative process
Both of these issues can be addressed by taking
a nonparametric Bayesian approach, namely, as-
suming that the elementary tree pairs are sampled
from an independent collection of Dirichlet pro-
cess (DP) priors. We describe such a process for
sampling a corpus of tree pairs t.
For all pairs of root labels c = c_s/c_t that we
consider, where up to one of c_s or c_t can be ε (e.g.,
S / S, NP / ε), we sample a sparse discrete distribu-
tion G_c over infinitely many elementary tree pairs
e = e_s/e_t sharing the common root c from a DP

    G_c ∼ DP(α_c, P_0(· | c))    (1)

where the DP has the concentration parameter α_c
controlling the sparsity of G_c, and the base dis-
tribution P_0(· | c) is a distribution over novel el-
ementary tree pairs that we describe more fully
shortly.
We then sample a sequence of elementary tree
pairs to serve as a derivation for each observed de-
rived tree pair. For each n = 1, . . . , N, we sam-
ple elementary tree pairs e_n = e_{n,1}, . . . , e_{n,d_n} in
a derivation sequence (where d_n is the number of
rules used in the derivation), consulting G_c when-
ever an elementary tree pair with root c is to be
sampled.

    e ∼ G_c (iid), for all e whose root label is c

Given the derivation sequence e_n, a tree pair t_n is
determined, that is,

    p(t_n | e_n) = { 1   if e_{n,1}, . . . , e_{n,d_n} derives t_n
                   { 0   otherwise.                              (2)
The hyperparameters α_c can be incorporated
into the generative model as random variables;
however, we opt to fix these at various constants
to investigate different levels of sparsity.
For the base distribution P_0(· | c) there are a
variety of choices; we used the following simple
scenario. (We take c = c_s/c_t.)
Synchronous rules  For the case where neither c_s
nor c_t is the special symbol ε, the base dis-
tribution first generates e_s and e_t indepen-
dently, and then samples an alignment be-
tween the frontier nodes. Given a nontermi-
nal, an elementary tree is generated by first
making a decision to expand the nontermi-
nal (with probability β_c) or to leave it as a
frontier node (1 − β_c). If the decision to ex-
pand was made, we sample an appropriate
rule from a PCFG which we estimate ahead
of time from the training corpus. We expand
the nonterminal using this rule, and then re-
peat the same procedure for every child gen-
erated that is a nonterminal until there are no
generated nonterminal children left. This is
done independently for both e_s and e_t. Fi-
nally, we sample an alignment between the
frontier nodes uniformly at random out of all
possible alignments.
Deletion/insertion rules  If c_t = ε, that is, we
have a deletion rule, we need to generate
e = e_s/ε. (The insertion rule case is symmet-
ric.) The base distribution generates e_s using
the same process described for synchronous
rules above. Then with probability 1 we align
all frontier nodes in e_s with ε. In essence,
this process generates TSG rules, rather than
STSG rules, which are used to cover deleted
(or inserted) subtrees.
This simple base distribution does nothing to
enforce an alignment between the internal nodes
of e_s and e_t. One may come up with more sophis-
ticated base distributions. However the main point
of the base distribution is to encode a controllable
preference towards simpler rules; we therefore
make the simplest possible assumption.
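The tree-growing component of this base distribution can be sketched as follows. The toy PCFG, the handling of symbols with no rules, and the function name are our own simplifying assumptions; the point is only to make the β_c expand-or-stop decision concrete.

```python
# Sketch of P0's elementary-tree component: expand each nonterminal with
# probability beta using a PCFG rule, otherwise leave it as a frontier node.
import random

PCFG = {  # hypothetical toy PCFG estimated from a treebank
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.6), (("NNP",), 0.4)],
    "VP": [(("VBZ", "NP"), 1.0)],
}

def sample_elementary_tree(symbol, beta, rng):
    """Return (bracketed tree, frontier nodes) for one elementary tree."""
    if symbol not in PCFG or rng.random() > beta:
        return symbol, [symbol]                  # leave as a frontier node
    rules, weights = zip(*PCFG[symbol])
    children = rng.choices(rules, weights=weights)[0]
    parts, frontier = [], []
    for child in children:
        sub, f = sample_elementary_tree(child, beta, rng)
        parts.append(sub)
        frontier.extend(f)
    return "(%s %s)" % (symbol, " ".join(parts)), frontier

tree, frontier = sample_elementary_tree("S", beta=0.5, rng=random.Random(0))
print(tree, frontier)
# For a synchronous rule, e_s and e_t are each drawn this way and a frontier
# alignment is sampled uniformly; for a deletion rule, all frontier nodes of
# e_s are aligned with epsilon.
```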
3.2 Posterior inference via Gibbs sampling
Assuming fixed hyperparameters α = {α_c} and
β = {β_c}, our inference problem is to find the
posterior distribution of the derivation sequences
e = e_1, . . . , e_N given the observations t =
t_1, . . . , t_N. Applying Bayes’ rule, we have

    p(e | t) ∝ p(t | e) p(e)    (3)

where p(t | e) is a 0/1 distribution (2) which does
not depend on G_c, and p(e) can be obtained by
collapsing G_c for all c.
Consider repeatedly generating elementary tree
pairs e_1, . . . , e_i, all with the same root c, iid from
G_c. Integrating over G_c, the e_i become depen-
dent. The conditional prior of the i-th elementary
tree pair given previously generated ones e_{<i} =
e_1, . . . , e_{i−1} is given by

    p(e_i | e_{<i}) = (n_{e_i} + α_c P_0(e_i | c)) / (i − 1 + α_c)    (4)

where n_{e_i} denotes the number of times e_i occurs
in e_{<i}. Since the collapsed model is exchangeable
in the e_i, this formula forms the backbone of the
inference procedure that we describe next. It also
makes clear DP’s inductive bias to reuse elemen-
tary tree pairs.
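Equation (4) is the usual collapsed-DP (Chinese restaurant) predictive rule: reuse a cached elementary tree pair in proportion to its count, or back off to the base distribution. A small sketch of the computation (ours; the cached rules and numbers are invented) follows.

```python
# Collapsed DP predictive probability of Eq. (4).
from collections import Counter

def dp_predictive(e, cache, alpha_c, p0_e_given_c):
    """p(e_i | e_<i) = (n_e + alpha_c * P0(e | c)) / (i - 1 + alpha_c)."""
    n_prev = sum(cache.values())          # i - 1: pairs with root c seen so far
    return (cache[e] + alpha_c * p0_e_given_c) / (n_prev + alpha_c)

cache = Counter({"S/S -> (S NP[1] VP[2])/(S NP[1] VP[2])": 3,
                 "S/S -> (S PP[] NP[1] VP[2])/(S NP[1] VP[2])": 1})
print(dp_predictive("S/S -> (S NP[1] VP[2])/(S NP[1] VP[2])",
                    cache, alpha_c=1.0, p0_e_given_c=1e-4))
```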
We use Gibbs sampling (Geman and Geman,
1984), a Markov chain Monte Carlo (MCMC)
method, to sample from the posterior (3). A
derivation e of the corpus t is completely specified
by an alignment between the source nodes and the
corresponding target nodes (as well as  on either
side), which we take to be the state of the sampler.

We start at a random derivation of the corpus, and
at every iteration resample a derivation by amend-
ing the current one through local changes made
at the node level, in the style of Goldwater et al.
(2006).
Our sampling updates are extensions of those
used by Cohn and Blunsom (2009) in MT, but are
tailored to our task of extractive sentence compres-
sion. In our task, no target node can align with
ε (which would indicate a subtree insertion), and
barring unary branches no source node i can align
with two different target nodes j and j′ at the same
time (indicating a tree expansion). Rather, the
configurations of interest are those in which only
source nodes i can align with ε, and two source
nodes i and i′ can align with the same target node
j. Thus, the alignments of interest are not arbitrary
relations, but (partial) functions from nodes in e_s
to nodes in e_t or ε. We therefore sample in the
direction from source to target. In particular, we
visit every tree pair and each of its source nodes i,
and update its alignment by selecting between and
within two choices: (a) unaligned, (b) aligned with
some target node j or ε. The number of possibil-
ities j in (b) is significantly limited, firstly by the
word alignment (for instance, a source node dom-
inating a deleted subspan cannot be aligned with
a target node), and secondly by the current align-
ment of other nearby aligned source nodes. (See
Cohn and Blunsom (2009) for details of matching
spans under tree constraints.)²

² One reviewer was concerned that since we explicitly disallow insertion rules in our sampling procedure, our model that generates such rules wastes probability mass and is therefore “deficient”. However, we regard sampling as a separate step from the data generation process, in which we can formulate more effective algorithms by using our domain knowledge that our data set was created by annotators who were instructed to delete words only. Also, disallowing insertion rules in the base distribution unnecessarily complicates the definition of the model, whereas it is straightforward to define the joint distribution of all (potentially useful) rules and then use domain knowledge to constrain the support of that distribution during inference, as we do here. In fact, it is possible to prove that our approach is equivalent up to a rescaling of the concentration parameters. Since we fit these parameters to the data, our approach is equivalent.
More formally, let e_M be the elementary tree
pair rooted at the closest aligned ancestor i′ of
node i when it is unaligned; and let e_A and e_B
be the elementary tree pairs rooted at i′ and i re-
spectively when i is aligned with some target node
j or ε. Then, by exchangeability of the elementary
trees sharing the same root label, and using (4), we
have

    p(unalign) = (n_{e_M} + α_{c_M} P_0(e_M | c_M)) / (n_{c_M} + α_{c_M})        (5)

    p(align with j) = (n_{e_A} + α_{c_A} P_0(e_A | c_A)) / (n_{c_A} + α_{c_A})    (6)
                      × (n_{e_B} + α_{c_B} P_0(e_B | c_B)) / (n_{c_B} + α_{c_B})  (7)

where the counts n_{e_·}, n_{c_·} are with respect to the
current derivation of the rest of the corpus; except
for n_{e_B}, n_{c_B} we also make sure to account for hav-
ing generated e_A. See Figure 2 for an illustration
of the sampling updates.
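The choice at a single source node can be rendered schematically as follows. This is our own sketch, not the released system: the encoding of options and the count bookkeeping are assumptions, but the scores follow Eqs. (5)-(7), including the adjustment that accounts for e_A when scoring e_B.

```python
import random

def dp_score(rule, root, counts, root_totals, alpha, p0):
    """CRP-style score (n_e + alpha_c * P0(e | c)) / (n_c + alpha_c)."""
    return (counts.get(rule, 0) + alpha[root] * p0(rule, root)) / \
           (root_totals.get(root, 0) + alpha[root])

def sample_alignment(options, counts, root_totals, alpha, p0, rng):
    """options: ('unalign', (e_M, c_M)) or ('align', j, (e_A, c_A), (e_B, c_B))."""
    weights = []
    for opt in options:
        if opt[0] == "unalign":                    # Eq. (5): one merged rule e_M
            _, (e_M, c_M) = opt
            weights.append(dp_score(e_M, c_M, counts, root_totals, alpha, p0))
        else:                                      # Eqs. (6)-(7): split into e_A, e_B
            _, _j, (e_A, c_A), (e_B, c_B) = opt
            w = dp_score(e_A, c_A, counts, root_totals, alpha, p0)
            counts2 = dict(counts); counts2[e_A] = counts2.get(e_A, 0) + 1
            totals2 = dict(root_totals); totals2[c_A] = totals2.get(c_A, 0) + 1
            w *= dp_score(e_B, c_B, counts2, totals2, alpha, p0)
            weights.append(w)
    return rng.choices(options, weights=weights)[0]

# Hypothetical usage with invented rules and counts:
counts = {"S/S merged": 2, "S/S upper": 1, "NP/eps lower": 4}
totals = {"S/S": 3, "NP/eps": 4}
alpha = {"S/S": 1.0, "NP/eps": 1.0}
p0 = lambda rule, root: 1e-3
options = [("unalign", ("S/S merged", "S/S")),
           ("align", 7, ("S/S upper", "S/S"), ("NP/eps lower", "NP/eps"))]
print(sample_alignment(options, counts, totals, alpha, p0, random.Random(0)))
```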
It is important to note that the sampler described
can move from any derivation to any other deriva-
tion with positive probability (if only, for example,
by virtue of fully merging and then resegment-
ing), which guarantees convergence to the poste-
rior (3). However some of these transition prob-
abilities can be extremely small due to passing
through low probability states with large elemen-
tary trees; in turn, the sampling procedure is prone
to local modes. In order to counteract this and to
improve mixing we used simulated annealing. The
probability mass function (5-7) was raised to the
power 1/T with T dropping linearly from T = 5
to T = 0. Furthermore, using a final tempera-
ture of zero, we recover a maximum a posteriori
(MAP) estimate which we denote e_MAP.
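The annealing schedule amounts to sharpening the sampling distribution before drawing from it. The sketch below is ours; the three-way weight vector stands in for the unnormalized scores of Eqs. (5)-(7), and dividing by the maximum weight is a numerical safeguard we add.

```python
import random

def anneal(weights, T, rng):
    """Sample an index from unnormalized weights raised to the power 1/T."""
    if T <= 0:                                   # zero temperature: greedy choice
        return max(range(len(weights)), key=lambda i: weights[i])
    m = max(weights)
    sharpened = [(w / m) ** (1.0 / T) for w in weights]
    return rng.choices(range(len(weights)), weights=sharpened)[0]

rng = random.Random(0)
n_iters = 5000
for it in range(n_iters):
    T = 5.0 * (1.0 - it / (n_iters - 1))         # linear schedule from 5 down to 0
    _ = anneal([0.2, 0.5, 0.3], T, rng)          # stand-in for the Gibbs weights
```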
3.3 Prediction
We discuss the problem of predicting a target tree
t_{N+1,t} that corresponds to a source tree t_{N+1,s}
unseen in the observed corpus t. The maximum
probability tree (MPT) can be found by consid-
ering all possible ways to derive it. However a
much simpler alternative is to choose the target
tree implied by the maximum probability deriva-
tion (MPD), which we define as

    e* = argmax_e p(e | t_s, t)
       = argmax_e Σ_{e′} p(e | t_s, e′) p(e′ | t)

where e denotes a derivation for the test pair
t = t_s/t_t and e′ ranges over derivations of the
training corpus. (We suppress the N + 1 subscripts
for brevity.) We approximate this objective first by
substituting δ_{e_MAP}(e′) for p(e′ | t) and secondly
using a finite STSG model for the infinite
p(e | t_s, e_MAP), which we obtain simply by
normalizing the rule counts in e_MAP. We use
dynamic programming for parsing under this finite
model (Eisner, 2003).³
Unfortunately, this approach does not ensure
that the test instances are parsable, since t_s may
include unseen structure or novel words. A work-
around is to include all zero-count context-free
copy rules such as

    NP / NP → (NP NP[1] PP[2]) / (NP NP[1] PP[2])
    NP / ε → (NP NP[] PP[]) / ε

in order to smooth our finite model. We used
Laplace smoothing (adding 1 to all counts) as it
gave us interpretable results.
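One way the smoothed finite grammar could be assembled is sketched below; the function name and the (root, rule) encoding are hypothetical, and only the pieces named above come from the text: counting rules in e_MAP, adding zero-count copy rules, and add-one smoothing before normalizing per root pair.

```python
from collections import Counter, defaultdict

def finite_stsg(e_map_rules, copy_rules):
    """e_map_rules, copy_rules: lists of (root_pair, rule) identifiers."""
    counts = Counter(e_map_rules)
    for r in copy_rules:
        counts.setdefault(r, 0)                       # zero-count copy rules
    smoothed = {r: n + 1 for r, n in counts.items()}  # Laplace (add-one) smoothing
    totals = defaultdict(int)
    for (root, _), n in smoothed.items():
        totals[root] += n
    return {r: n / totals[r[0]] for r, n in smoothed.items()}

grammar = finite_stsg(
    e_map_rules=[("S/S", "(S NP[1] VP[2])/(S NP[1] VP[2])")] * 3,
    copy_rules=[("NP/NP", "(NP NP[1] PP[2])/(NP NP[1] PP[2])")],
)
print(grammar)
```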
4 Evaluation
We compared the Gibbs sampling compressor
(GS) against a version of maximum a posteriori
EM (with Dirichlet parameter greater than 1) and
a discriminative STSG based on SVM training
(Cohn and Lapata, 2008) (SVM). EM is a natural
benchmark, while SVM is also appropriate since
it can be taken as the state of the art for our task.⁴
We used a publicly available extractive sen-
tence compression corpus: the Broadcast News
compressions corpus (BNC) of Clarke and Lap-
ata (2006a). This corpus consists of 1370 sentence
pairs that were manually created from transcribed
Broadcast News stories. We split the pairs into
training, development, and testing sets of 1000,
170, and 200 pairs, respectively. The corpus was
parsed using the Stanford parser (Klein and Man-
ning, 2003).

³ We experimented with MPT using Monte Carlo integration over possible derivations; the results were not significantly different from those using MPD.

⁴ The comparison system described by Cohn and Lapata (2008) attempts to solve a more general problem than ours, abstractive sentence compression. However, given the nature of the data that we provided, it can only learn to compress by deleting words. Since the system is less specialized to the task, their model requires additional heuristics in decoding not needed for extractive compression, which might cause a reduction in performance. Nonetheless, because the comparison system is a generalization of the extractive SVM compressor of Cohn and Lapata (2007), we do not expect that the results would differ qualitatively.

                  SVM     EM      GS
Precision         55.60   58.80   58.94
Recall            53.37   56.58   64.59
Relational F1     54.46   57.67   61.64
Compression rate  59.72   64.11   65.52

Table 1: Precision, recall, relational F1 and compression rate (%) for various systems on the 200-sentence BNC test set. The compression rate for the gold standard was 65.67%.

              SVM     EM      GS      Gold
Grammar       2.75    2.85    3.69    4.25
Importance    2.85    2.67    3.41    3.82
Comp. rate    68.18   64.07   67.97   62.34

Table 2: Average grammar and importance scores for various systems on the 20-sentence subsample. Scores marked with ∗ are significantly different than the corresponding GS score at α < .05 and with † at α < .01 according to post-hoc Tukey tests. ANOVA was significant at p < .01 both for grammar and importance.
In our experiments with the publicly available
SVM system we used all except paraphrasal rules
extracted from bilingual corpora (Cohn and Lap-
ata, 2008). The model chosen for testing had pa-
rameter for trade-off between training error and
margin set to C = 0.001, used margin rescaling,
and Hamming distance over bags of tokens with
brevity penalty for loss function. EM used a sub-
set of the rules extracted by SVM, namely all rules
except non-head deleting compression rules, and
was initialized uniformly. Each EM instance was
characterized by two parameters: α, the smooth-
ing parameter for MAP-EM, and δ, the smooth-
ing parameter for augmenting the learned gram-
mar with rules extracted from unseen data (add-
(δ − 1) smoothing was used), both of which were
fit to the development set using grid-search over
(1, 2]. The model chosen for testing was (α, δ) =
(1.0001, 1.01).
GS was initialized at a random derivation. We
sampled the alignments of the source nodes in ran-
dom order. The sampler was run for 5000 itera-
tions with annealing. All hyperparameters α_c, β_c
were held constant at α, β for simplicity and were
fit using grid-search over α ∈ [10^{-6}, 10^6], β ∈
[10^{-3}, 0.5]. The model chosen for testing was
(α, β) = (100, 0.1).
As an automated metric of quality, we compute
F-score based on grammatical relations (relational
F1, or RelF1) (Riezler et al., 2003), by which the
consistency between the set of predicted grammat-
ical relations and those from the gold standard is
measured, which has been shown by Clarke and
Lapata (2006b) to correlate reliably with human
judgments. We also conducted a small human sub-
jective evaluation of the grammaticality and infor-
mativeness of the compressions generated by the
various methods.
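For concreteness, a minimal relational F1 computation over sets of grammatical-relation triples might look as follows; this is our own sketch based on the description above, not any particular scorer, and the example triples are invented.

```python
def relational_f1(predicted, gold):
    """Precision, recall, F1 over sets of (relation, head, dependent) triples."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    correct = len(predicted & gold)
    p = correct / len(predicted)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

pred = {("nsubj", "depends", "performance"), ("prep_on", "depends", "application")}
gold = {("nsubj", "depends", "performance"), ("prep_on", "depends", "application"),
        ("amod", "application", "underlying")}
print(relational_f1(pred, gold))   # (1.0, 0.666..., 0.8)
```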
4.1 Automated evaluation
For all three systems we obtained predictions for
the test set and used the Stanford parser to extract
grammatical relations from predicted trees and the
gold standard. We computed precision, recall,
RelF1 (all based on grammatical relations), and
compression rate (percentage of the words that are
retained), which we report in Table 1. The results
for GS are averages over five independent runs.
EM gives a strong baseline since it already uses
rules that are limited in depth and number of fron-
tier nodes by stipulation, helping with the overfit-
ting we have mentioned, surprisingly outperform-
ing its discriminative counterpart in both precision
and recall (and consequently RelF1). GS however
maintains the same level of precision as EM while
improving recall, bringing an overall improvement
in RelF1.
4.2 Human evaluation
We randomly subsampled our 200-sentence test
set for 20 sentences to be evaluated by human
judges through Amazon Mechanical Turk. We
asked 15 self-reported native English speakers for
their judgments of GS, EM, and SVM output sen-
tences and the gold standard in terms of grammat-
icality (how fluent the compression is) and impor-
tance (how much of the meaning of and impor-
tant information from the original sentence is re-
tained) on a scale of 1 (worst) to 5 (best). We re-
port in Table 2 the average scores. EM and SVM
perform at very similar levels, which we attribute
to using the same set of rules, while GS performs
at a level substantially better than both, and much
closer to human performance in both criteria.

Figure 3: RelF1, precision, recall plotted against compression rate for GS, EM, VB.

The human evaluation indicates that the superiority of
the Bayesian nonparametric method is underap-
preciated by the automated evaluation metric.
4.3 Discussion
The fact that GS performs better than EM can be
attributed to two reasons: (1) GS uses a sparse
prior and selects a compact representation of the
data (grammar sizes ranged from 4K-7K for GS
compared to a grammar of about 35K rules for
EM). (2) GS does not commit to a precomputed
grammar and searches over the space of all gram-
mars to find one that best represents the corpus.
It is possible to introduce DP-like sparsity in EM
using variational Bayes (VB) training. We exper-
iment with this next in order to understand how
dominant the two factors are. The VB algorithm
requires a simple update to the M-step formulas
for EM where the expected rule counts are normal-
ized, such that instead of updating the rule weight
in the t-th iteration as in the following

    θ^{t+1}_{c,e} = (n_{c,e} + α − 1) / (n_{c,·} + Kα − K)

where n_{c,e} represents the expected count of rule
c → e, and K is the total number of ways
to rewrite c, we now take into account our
DP(α_c, P_0(· | c)) prior in (1), which, when
truncated to a finite grammar, reduces to a
K-dimensional Dirichlet prior with parameter
α_c P_0(· | c). Thus in VB we perform a variational
E-step with the subprobabilities given by

    θ^{t+1}_{c,e} = exp(Ψ(n_{c,e} + α_c P_0(e | c))) / exp(Ψ(n_{c,·} + α_c))

where Ψ denotes the digamma function (Liu and
Gildea, 2009). (See MacKay (1997) for details.)
Hyperparameters were handled the same way as
for GS.
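The VB update itself is a one-line change from the MAP-EM M-step. A sketch (ours) is given below; it assumes SciPy's digamma implementation is available.

```python
from math import exp
from scipy.special import digamma   # assumes SciPy is installed

def vb_subprobability(n_ce, n_c_total, alpha_c, p0_e_given_c):
    """theta_{c,e} = exp(psi(n_ce + alpha_c*P0(e|c))) / exp(psi(n_c + alpha_c))."""
    return exp(digamma(n_ce + alpha_c * p0_e_given_c)) / \
           exp(digamma(n_c_total + alpha_c))

print(vb_subprobability(n_ce=2.5, n_c_total=10.0, alpha_c=1.0, p0_e_given_c=1e-3))
```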
Instead of selecting a single model on the devel-
opment set, here we provide the whole spectrum of
models and their performances in order to better
understand their comparative behavior. In Figure
3 we plot RelF1 on the test set versus compres-
sion rate and compare GS, EM, and VB (β = 0.1
fixed, (α, δ) ranging in [10
−6
, 10
6
]×(1, 2]). Over-
all, we see that GS maintains roughly the same
level of precision as EM (despite its larger com-
pression rates) while achieving an improvement in
recall, consequently performing at a higher RelF1
level. We note that VB somewhat bridges the gap
between GS and EM, without quite reaching GS
performance. We conclude that the mitigation of
the two factors (narrowness and overfitting) both
contribute to the performance gain of GS.

5
4.4 Example rules learned
In order to provide some insight into the grammar
extracted by GS, we list in Tables 3 and 4 high
probability subtree-deletion rules expanding cate-
gories ROOT / ROOT and S / S, respectively. Of
especial interest are deep lexicalized rules such as
a pattern of compression used many times in the
BNC in sentence pairs such as “NPR’s Anne Gar-
rels reports” / “Anne Garrels reports”. Such an
informative rule with nontrivial collocation (be-
tween the possessive marker and the word “re-
ports”) would be hard to extract heuristically and
can only be extracted by reasoning across the
training examples.

(ROOT (S CC[] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S NP[1] ADVP[] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S ADVP[] (, ,) NP[1] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S PP[] (, ,) NP[1] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S PP[] ,[] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S NP[] (VP VBP[] (SBAR (S NP[1] VP[2]))) .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S ADVP[] NP[1] (VP MD[2] VP[3]) .[4])) / (ROOT (S NP[1] (VP MD[2] VP[3]) .[4]))
(ROOT (S (SBAR (IN as) S[]) ,[] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S S[] (, ,) CC[] (S NP[1] VP[2]) .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S PP[] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S S[1] (, ,) CC[] S[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S S[] ,[] NP[1] ADVP[2] VP[3] .[4])) / (ROOT (S NP[1] ADVP[2] VP[3] .[4]))
(ROOT (S (NP (NP NNP[] (POS ’s)) NNP[1] NNP[2]) (VP (VBZ reports)) .[3])) / (ROOT (S (NP NNP[1] NNP[2]) (VP (VBZ reports)) .[3]))

Table 3: High probability ROOT / ROOT compression rules from the final state of the sampler.

(S NP[1] ADVP[] VP[2]) / (S NP[1] VP[2])
(S INTJ[] (, ,) NP[1] VP[2] (. .)) / (S NP[1] VP[2] (. .))
(S (INTJ (UH Well)) ,[] NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S PP[] (, ,) NP[1] VP[2]) / (S NP[1] VP[2])
(S ADVP[] (, ,) S[1] (, ,) (CC but) S[2] .[3]) / (S S[1] (, ,) (CC but) S[2] .[3])
(S ADVP[] NP[1] VP[2]) / (S NP[1] VP[2])
(S NP[] (VP VBP[] (SBAR (IN that) (S NP[1] VP[2]))) (. .)) / (S NP[1] VP[2] (. .))
(S NP[] (VP VBZ[] ADJP[] SBAR[1])) / S[1]
(S CC[] PP[] (, ,) NP[1] VP[2] (. .)) / (S NP[1] VP[2] (. .))
(S NP[] (, ,) NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S NP[1] (, ,) ADVP[] (, ,) VP[2]) / (S NP[1] VP[2])
(S CC[] (NP PRP[1]) VP[2]) / (S (NP PRP[1]) VP[2])
(S ADVP[] ,[] PP[] ,[] NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S ADVP[] (, ,) NP[1] VP[2]) / (S NP[1] VP[2])

Table 4: High probability S / S compression rules from the final state of the sampler.
5 Conclusion
We explored nonparametric Bayesian learning
of non-isomorphic tree mappings using Dirich-
let process priors. We used the task of extrac-
tive sentence compression as a testbed to investi-
gate the effects of sparse priors and nonparamet-
ric inference over the space of grammars. We
showed that, despite its degeneracy, expectation
maximization is a strong baseline when given a
reasonable grammar. However, Gibbs-sampling–
based nonparametric inference achieves improve-
ments against this baseline. Our investigation with
variational Bayes showed that the improvement is
due both to finding sparse grammars (mitigating
overfitting) and to searching over the space of all
grammars (mitigating narrowness). Overall, we
take these results as being encouraging for STSG
induction via Bayesian nonparametrics for mono-
lingual translation tasks. The future for this work
would involve natural extensions such as mixing
over the space of word alignments; this would al-
low application to MT-like tasks where flexible
word reordering is allowed, such as abstractive
sentence compression and paraphrasing.
References
James Clarke and Mirella Lapata. 2006a. Constraint-
based sentence compression: An integer program-
ming approach. In Proceedings of the 21st Interna-
tional Conference on Computational Linguistics and
44th Annual Meeting of the Association for Compu-
tational Linguistics, pages 144–151, Sydney, Aus-
tralia, July. Association for Computational Linguis-
tics.
James Clarke and Mirella Lapata. 2006b. Models
for sentence compression: A comparison across do-
mains, training requirements and evaluation mea-
sures. In Proceedings of the 21st International Con-
ference on Computational Linguistics and 44th An-
nual Meeting of the Association for Computational
Linguistics, pages 377–384, Sydney, Australia, July.
Association for Computational Linguistics.
Trevor Cohn and Phil Blunsom. 2009. A Bayesian
model of syntax-directed tree to string grammar in-
duction. In EMNLP ’09: Proceedings of the 2009
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 352–361, Morristown, NJ,
USA. Association for Computational Linguistics.
Trevor Cohn and Mirella Lapata. 2007. Large mar-
gin synchronous generation and its application to
sentence compression. In Proceedings of the Con-
ference on Empirical Methods in Natural Language
Processing and on Computational Natural Lan-
guage Learning, pages 73–82, Prague. Association
for Computational Linguistics.
Trevor Cohn and Mirella Lapata. 2008. Sentence
compression beyond word deletion. In COLING
’08: Proceedings of the 22nd International Confer-
ence on Computational Linguistics, pages 137–144,
Manchester, United Kingdom. Association for Com-
putational Linguistics.
Trevor Cohn, Sharon Goldwater, and Phil Blun-
som. 2009. Inducing compact but accurate tree-
substitution grammars. In NAACL ’09: Proceed-
ings of Human Language Technologies: The 2009
Annual Conference of the North American Chap-
ter of the Association for Computational Linguistics,
pages 548–556, Morristown, NJ, USA. Association
for Computational Linguistics.
A. Dempster, N. Laird, and D. Rubin. 1977. Max-
imum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society,
39 (Series B):1–38.
John DeNero, Dan Gillick, James Zhang, and Dan
Klein. 2006. Why generative phrase models under-
perform surface heuristics. In StatMT ’06: Proceed-
ings of the Workshop on Statistical Machine Trans-
lation, pages 31–38, Morristown, NJ, USA. Associ-
ation for Computational Linguistics.
John DeNero, Alexandre Bouchard-Côté, and Dan
Klein. 2008. Sampling alignment structure under
a Bayesian translation model. In EMNLP ’08: Pro-
ceedings of the Conference on Empirical Methods in
Natural Language Processing, pages 314–323, Mor-
ristown, NJ, USA. Association for Computational
Linguistics.
Jason Eisner. 2003. Learning non-isomorphic tree
mappings for machine translation. In ACL ’03: Pro-
ceedings of the 41st Annual Meeting on Associa-
tion for Computational Linguistics, pages 205–208,
Morristown, NJ, USA. Association for Computa-
tional Linguistics.
Michel Galley and Kathleen McKeown. 2007. Lex-
icalized Markov grammars for sentence compres-
sion. In Human Language Technologies 2007:
The Conference of the North American Chapter of
the Association for Computational Linguistics; Pro-
ceedings of the Main Conference, pages 180–187,
Rochester, New York, April. Association for Com-
putational Linguistics.
Michel Galley, Mark Hopkins, Kevin Knight, and
Daniel Marcu. 2004. What’s in a translation
rule? In Daniel Marcu, Susan Dumais, and Salim
Roukos, editors, HLT-NAACL 2004: Main Proceed-
ings, pages 273–280, Boston, Massachusetts, USA,
May 2 - May 7. Association for Computational Lin-
guistics.
S. Geman and D. Geman. 1984. Stochastic Relaxation,
Gibbs Distributions and the Bayesian Restoration of
Images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741.
Sharon Goldwater, Thomas L. Griffiths, and Mark
Johnson. 2006. Contextual dependencies in un-
supervised word segmentation. In Proceedings of
the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 673–680,
Sydney, Australia, July. Association for Computa-
tional Linguistics.
Hongyan Jing. 2000. Sentence reduction for auto-
matic text summarization. In Proceedings of the
sixth conference on Applied natural language pro-
cessing, pages 310–315, Morristown, NJ, USA. As-
sociation for Computational Linguistics.
Dan Klein and Christopher D. Manning. 2003. Fast
exact inference with a factored model for natural
language parsing. In Advances in Neural Informa-
tion Processing Systems 15 (NIPS), pages 3–10. MIT
Press.
Kevin Knight and Daniel Marcu. 2002. Summa-
rization beyond sentence extraction: a probabilis-
tic approach to sentence compression. Artif. Intell.,
139(1):91–107.
K. Lari and S. J. Young. 1990. The estimation of
stochastic context-free grammars using the Inside-
Outside algorithm. Computer Speech and Lan-
guage, 4:35–56.

Ding Liu and Daniel Gildea. 2009. Bayesian learn-
ing of phrasal tree-to-string templates. In Proceed-
ings of the 2009 Conference on Empirical Methods
in Natural Language Processing, pages 1308–1317,
Singapore, August. Association for Computational
Linguistics.
David J.C. MacKay. 1997. Ensemble learning for hid-
den Markov models. Technical report, Cavendish
Laboratory, Cambridge, UK.
Franz Josef Och and Hermann Ney. 2004. The align-
ment template approach to statistical machine trans-
lation. Comput. Linguist., 30(4):417–449.
Matt Post and Daniel Gildea. 2009. Bayesian learning
of a tree substitution grammar. In Proceedings of the
ACL-IJCNLP 2009 Conference Short Papers, pages
45–48, Suntec, Singapore, August. Association for
Computational Linguistics.
Stefan Riezler, Tracy H. King, Richard Crouch, and
Annie Zaenen. 2003. Statistical sentence condensa-
tion using ambiguity packing and stochastic disam-
biguation methods for lexical-functional grammar.
In NAACL ’03: Proceedings of the 2003 Conference
of the North American Chapter of the Association
for Computational Linguistics on Human Language
Technology, pages 118–125, Morristown, NJ, USA.
Association for Computational Linguistics.
Jenine Turner and Eugene Charniak. 2005. Super-
vised and unsupervised learning for sentence com-
pression. In ACL ’05: Proceedings of the 43rd An-
nual Meeting on Association for Computational Lin-
guistics, pages 290–297, Morristown, NJ, USA. As-
sociation for Computational Linguistics.
Elif Yamangil and Rani Nelken. 2008. Mining
wikipedia revision histories for improving sentence
compression. In Proceedings of ACL-08: HLT,
Short Papers, pages 137–140, Columbus, Ohio,
June. Association for Computational Linguistics.
