NLP in Scala with Breeze and Epic
David Hall
UC Berkeley
ScalaNLP Ecosystem
• Breeze: linear algebra, scientific computing, optimization (≈ NumPy/SciPy)
• Epic: natural language processing, structured prediction (≈ PyStruct/NLTK)
• Puck: super-fast GPU parser for English (≈ { })
Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.
[Figure: parse tree of the sentence above, with nodes labeled S, NP, VP, PP, and V]
It certainly won’t get there on looks.
Natural Language Processing
Epic for NLP
Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.
[Figure: parse tree of the sentence above, with nodes labeled S, NP, VP, PP, and V]
Named Entity Recognition
Entity types: Person, Organization, Location, Misc
NER with Epic
> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
> println(nerModel.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away
from `` [LOC:Calvin & Hobbes] . ''
"Calvin & Hobbes" is not a location!
Annotate a bunch of data?
Building an NER system
> import epic.sequences.{Segmentation, SemiCRF}
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away
from `` [MISC:Calvin & Hobbes] . ''
Gazetteers
Using your own gazetteer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)
Gazetteer
• Careful with gazetteers!
• If the gazetteer is built from the training data, the system will rely on it, and only it, to make predictions.
• So only known forms will be detected.
• Still, gazetteers can be very useful…
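The "only known forms" caveat can be seen in a toy lookup. This is a minimal sketch in plain Scala, assuming a gazetteer is nothing more than a set of known entity strings; `GazetteerLookup` and `inGazetteer` are hypothetical names, not Epic's Gazetteer API.

```scala
object GazetteerLookup {
  // Hypothetical stand-in: a gazetteer is just a set of known entity strings.
  // Returns true only if the span [begin, end) matches an entry exactly.
  def inGazetteer(gaz: Set[String], words: IndexedSeq[String], begin: Int, end: Int): Boolean =
    gaz.contains(words.slice(begin, end).mkString(" "))

  def main(args: Array[String]): Unit = {
    val gaz = Set("Bill Watterson")
    // Known form: found.
    println(inGazetteer(gaz, Vector("Bill", "Watterson", "walked"), 0, 2))
    // Unseen variant of the same name: not found.
    println(inGazetteer(gaz, Vector("William", "Watterson", "walked"), 0, 2))
  }
}
```

A model that leans entirely on this one signal would miss the second span, which is why other features still matter.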
Semi-CR-What?
• Semi-Markov Conditional Random Field
• Don’t worry about the name.
Semi-CRFs
score(Chez Panisse)
+ score(Berkeley, CA)
+ score(- A bowl of )
+ score(Churchill-Brenneis Orchards)
+ score(Page mandarins and medjool dates)
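The sum above says that a whole segmentation is scored by adding up the scores of its segments. A minimal sketch of that idea in plain Scala follows; `Segment`, `SegmentationScore`, the labels, and the toy scorer are all hypothetical and not part of Epic.

```scala
object SegmentationScore {
  // A labeled span over token positions [begin, end). Hypothetical type.
  final case class Segment(label: String, begin: Int, end: Int)

  // Score of a segmentation = sum of the scores of its segments.
  def score(segments: Seq[Segment], segmentScore: Segment => Double): Double =
    segments.map(segmentScore).sum

  def main(args: Array[String]): Unit = {
    // Labels and positions are illustrative only.
    val segments = Seq(
      Segment("ORG", 0, 2), // e.g. "Chez Panisse"
      Segment("LOC", 2, 5), // e.g. "Berkeley , CA"
      Segment("O", 5, 9)    // e.g. "- A bowl of"
    )
    // Toy scorer: a segment's score is its length in tokens.
    val toyScore: Segment => Double = s => (s.end - s.begin).toDouble
    println(score(segments, toyScore))
  }
}
```

In a real semi-CRF the per-segment scorer comes from learned feature weights rather than a fixed rule like this one.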
Features
score(Chez Panisse) =
  w(starts-with-Chez)
+ w(starts-with-C…)
+ w(ends-with-P…)
+ w(starts-sentence)
+ w(shape:Xxx Xxx)
+ w(two-words)
+ w(in-gazetteer)
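A span's score is the sum of the weights of the features that fire on it. The sketch below shows this in plain Scala for a small subset of the features listed above; `SpanScore`, the feature extraction logic, and the weight values used in any call are assumptions for illustration, not Epic's featurizers.

```scala
object SpanScore {
  // Names of the indicator features that fire for the span [begin, end).
  // Only an illustrative subset of the features on the slide.
  def features(words: IndexedSeq[String], begin: Int, end: Int,
               gazetteer: Set[String]): Set[String] = {
    require(begin < end, "span must be non-empty")
    val span = words.slice(begin, end)
    Seq(
      Some(s"starts-with-${span.head}"),
      if (begin == 0) Some("starts-sentence") else None,
      if (span.length == 2) Some("two-words") else None,
      if (gazetteer.contains(span.mkString(" "))) Some("in-gazetteer") else None
    ).flatten.toSet
  }

  // Sum the weight of every active feature; unseen features weigh 0.
  def score(active: Set[String], weights: Map[String, Double]): Double =
    active.iterator.map(f => weights.getOrElse(f, 0.0)).sum
}
```

For example, `SpanScore.features(Vector("Chez", "Panisse", "is", "great"), 0, 2, Set("Chez Panisse"))` fires all four features, and `score` then adds whatever weights a trained model has assigned to them.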
Building your own features
val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._
// Parentheses keep the lines that start with `+` inside one expression.
(
  word(begin)             // word at the beginning of the span
  + word(end - 1)         // last word of the span
  + word(begin - 1)       // word before the span (gets things like Mr.)
  + word(end)             // one past the end
  + prefixes(begin)       // prefixes up to some length
  + suffixes(begin)
  + length(begin, end)    // names tend to be 1-3 words
  + gazetteer(begin, end)
)
Using your own featurizer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)
Features
• So far, we’ve been able to do everything with (nearly) no math.
• To understand more, we need to do some math.
Machine Learning Primer
• Training example (x, y)
  – x: sentence of some sort
  – y: labeled version
• Goal:
  – want score(x, y) > score(x, y’) for all y’ ≠ y.
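The goal above can be written as two small functions: prediction picks the highest-scoring labeling, and training wants the gold labeling to beat every other one. This is a sketch in plain Scala over an explicit candidate list; `Prediction`, `predict`, and `goldWins` are hypothetical names, and a real system searches over labelings with dynamic programming instead of enumerating them.

```scala
object Prediction {
  // Pick the highest-scoring candidate labeling y for input x.
  // Assumes `candidates` is non-empty.
  def predict[X, Y](x: X, candidates: Seq[Y], score: (X, Y) => Double): Y =
    candidates.maxBy(y => score(x, y))

  // Training goal for one example: the gold y strictly outscores
  // every other candidate y'.
  def goldWins[X, Y](x: X, gold: Y, candidates: Seq[Y], score: (X, Y) => Double): Boolean =
    candidates.filter(_ != gold).forall(y => score(x, gold) > score(x, y))
}
```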
score(x, y) = w^T f(x, y)
In Breeze: score(x, y) = w.t * f(x, y)
Or equivalently: score(x, y) = w dot f(x, y)
Goal: score(x, y) >= score(x, y’)
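The linear score can be tried directly with Breeze vectors. The sketch below assumes Breeze is on the classpath and that f(x, y) has already been computed as a dense vector; `LinearScore` and the toy values are made up for illustration.

```scala
import breeze.linalg.DenseVector

object LinearScore {
  // score(x, y) = w dot f(x, y), with f(x, y) given as a dense vector.
  def score(w: DenseVector[Double], f: DenseVector[Double]): Double = w dot f

  def main(args: Array[String]): Unit = {
    val w = DenseVector(0.5, -1.0, 2.0) // weights (toy values)
    val f = DenseVector(1.0, 0.0, 1.0)  // feature vector f(x, y) (toy values)
    println(score(w, f)) // inner product via `dot`
    println(w.t * f)     // transpose form, same value
  }
}
```

Real feature vectors are usually sparse, so a sparse vector type is the more likely choice in practice; dense vectors just keep the sketch short.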