NLP in Scala with Breeze and Epic
David Hall
UC Berkeley
ScalaNLP Ecosystem
Breeze: linear algebra, scientific computing, optimization (compare Numpy/Scipy)
Epic: natural language processing, structured prediction (compare PyStruct/NLTK)
Puck: super-fast GPU parser for English
Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.
[Figure: parse tree over the sentence, with node labels S, NP, VP, PP, V]
It certainly won’t get there on looks.
Natural Language Processing
Epic for NLP
Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.
[Figure: parse tree over the sentence, with node labels S, NP, VP, PP, V]
Named Entity Recognition
Person Organization Location Misc
NER with Epic
> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
> println(nerModel.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [LOC:Calvin & Hobbes] . ''
Not a location! (Calvin & Hobbes is a comic strip.)
Annotate a bunch of data?
Building an NER system
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] . ''
Gazetteers

Using your own gazetteer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)
Gazetteer

Careful with gazetteers!

If built from the training data, the system will use it and only it to make predictions!

So only forms already in the gazetteer will be detected.

Still, can be very useful…
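
To make the caution concrete, here is a minimal sketch in plain Scala (not Epic's gazetteer API). The names `GazetteerSketch`, `trainingNames`, and `knownSpans` are hypothetical, and the entries are made up for illustration. It shows that an exact-match lookup built only from names seen in training fires for those names and for nothing else.

```scala
object GazetteerSketch {
  // Hypothetical gazetteer built from names that appeared in training data.
  val trainingNames: Set[Seq[String]] = Set(
    Seq("Bill", "Watterson"),
    Seq("Chez", "Panisse")
  )

  // Returns every (begin, end) span whose tokens exactly match a gazetteer entry.
  // maxLen bounds the span length so the search stays small.
  def knownSpans(
      tokens: IndexedSeq[String],
      gazetteer: Set[Seq[String]],
      maxLen: Int = 4
  ): Seq[(Int, Int)] =
    for {
      begin <- tokens.indices
      end <- (begin + 1) to math.min(tokens.length, begin + maxLen)
      if gazetteer.contains(tokens.slice(begin, end))
    } yield (begin, end)
}
```

With the tokens "Bill Watterson met Berkeley Breathed", only the span covering "Bill Watterson" is returned. "Berkeley Breathed" is also a person, but the lookup has never seen that form, so a system that leans only on this signal would miss it.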
Semi-CR-What?

Semi-Markov Conditional Random Field

Don’t worry about the name.
Semi-CRFs
score(Chez Panisse)
+ score(Berkeley, CA)
+ score(- A bowl of )
+ score(Churchill-Brenneis Orchards)
+ score(Page mandarins and medjool dates)
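
The sum on this slide can be sketched in a few lines of plain Scala. This is an illustration of the idea only, not Epic's SemiCRF implementation: `Segment`, `scoreSegmentation`, and `toyScore` are hypothetical names, and the toy scorer is made up. The point is that a segmentation is scored one whole segment at a time, and the segment scores are added.

```scala
object SegmentationScoreSketch {
  // A labeled span over token positions [begin, end).
  final case class Segment(label: String, begin: Int, end: Int)

  // Score of a whole segmentation = sum of the scores of its segments.
  def scoreSegmentation(segments: Seq[Segment], scoreSegment: Segment => Double): Double =
    segments.map(scoreSegment).sum

  // Made-up scorer for illustration: outside ("O") spans score 0,
  // any other label scores one point per token covered.
  val toyScore: Segment => Double =
    s => if (s.label == "O") 0.0 else (s.end - s.begin).toDouble
}
```

Because each term sees an entire span, a segment scorer can use span-level evidence (length, first and last word, gazetteer membership) that a per-token tagger cannot express directly.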
Features
score(Chez Panisse)
= w(starts-with-Chez)
+ w(starts-with-C…)
+ w(ends-with-P…)
+ w(starts-sentence)
+ w(shape:Xxx Xxx)
+ w(two-words)
+ w(in-gazetteer)
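
As a rough sketch of the same decomposition in plain Scala: the score of a span is the sum of the weights of the features that fire on it. The weight values below are invented for illustration, and `FeatureScoreSketch`, `weights`, and `score` are hypothetical names rather than anything from Epic.

```scala
object FeatureScoreSketch {
  // Hypothetical learned weights; the numbers are made up for illustration.
  val weights: Map[String, Double] = Map(
    "starts-with-Chez" -> 1.5,
    "starts-sentence" -> 0.25,
    "shape:Xxx Xxx" -> 0.5,
    "two-words" -> 0.25,
    "in-gazetteer" -> 2.0
  )

  // Sum the weights of the features that fire on a span.
  // Features with no learned weight contribute 0.
  def score(activeFeatures: Seq[String], w: Map[String, Double]): Double =
    activeFeatures.map(f => w.getOrElse(f, 0.0)).sum
}
```

Training then amounts to choosing the numbers in the weight map; the feature names themselves come from the featurizer.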
Building your own features
val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._
word(begin) +            // word at the beginning of the span
  word(end - 1) +        // word at the end of the span
  word(begin - 1) +      // word before the span (gets things like Mr.)
  word(end) +            // word one past the end of the span
  prefixes(begin) +      // prefixes up to some length
  suffixes(begin) +      // suffixes up to some length
  length(begin, end) +   // names tend to be 1-3 words
  gazetteer(begin, end)  // gazetteer lookup for the span
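
To show what features of this kind look like once extracted, here is a plain-Scala sketch that does not use Epic's DSL. Everything in it is an assumption made for illustration: the object and method names, the string format of the features, the `<s>` and `</s>` boundary markers, and the affix length of 3.

```scala
object SpanFeatureSketch {
  // Word at position i, with made-up boundary markers outside the sentence.
  private def wordAt(tokens: IndexedSeq[String], i: Int): String =
    if (i < 0) "<s>" else if (i >= tokens.length) "</s>" else tokens(i)

  // String features for the span [begin, end), loosely mirroring the list above.
  def spanFeatures(
      tokens: IndexedSeq[String],
      begin: Int,
      end: Int,
      maxAffix: Int = 3
  ): Seq[String] = {
    val first = wordAt(tokens, begin)
    val base = Seq(
      "first=" + first,                      // word at the beginning of the span
      "last=" + wordAt(tokens, end - 1),     // word at the end of the span
      "before=" + wordAt(tokens, begin - 1), // word before the span
      "after=" + wordAt(tokens, end),        // word one past the end
      "length=" + (end - begin)              // span length in tokens
    )
    // Prefixes and suffixes of the first word, up to maxAffix characters.
    val n = math.min(maxAffix, first.length)
    val prefixes = (1 to n).map(k => "prefix=" + first.take(k))
    val suffixes = (1 to n).map(k => "suffix=" + first.takeRight(k))
    base ++ prefixes ++ suffixes
  }
}
```

For the tokens "Mr. Bill Watterson walked" and the span covering "Bill Watterson", the `before=Mr.` feature is the kind of cue the slide's comment refers to.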

Using your own featurizer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)
Features

So far, we’ve been able to do everything with (nearly) no math.

To understand more, we need to do some math.
Machine Learning Primer

Training example (x, y)

x: sentence of some sort

y: labeled version

Goal:

want score(x, y) > score(x, y’) for all y’ ≠ y.
score(x, y) = w^T f(x, y)
score(x, y) = w.t * f(x, y)
score(x, y) = w dot f(x, y)
score(x, y) >= score(x, y’)
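
The linear score and the comparison between labelings can be sketched as follows. This is a plain-Scala illustration under stated assumptions: arrays stand in for Breeze vectors, the candidate labelings and their feature vectors are supplied by hand, and `LinearScoreSketch`, `dot`, and `best` are hypothetical names.

```scala
object LinearScoreSketch {
  // score(x, y) = w dot f(x, y), with plain arrays standing in for Breeze vectors.
  def dot(w: Array[Double], f: Array[Double]): Double = {
    require(w.length == f.length, "weight and feature vectors must have the same length")
    w.zip(f).map { case (wi, fi) => wi * fi }.sum
  }

  // Pick the candidate labeling y whose feature vector f(x, y) scores highest.
  // Ties keep the earlier candidate. Candidates must be non-empty.
  def best[Y](w: Array[Double], candidates: Seq[(Y, Array[Double])]): Y =
    candidates
      .map { case (y, f) => (y, dot(w, f)) }
      .reduceLeft((a, b) => if (b._2 > a._2) b else a)
      ._1
}
```

If the weights are good, the labeling returned by `best` is the correct y, which is exactly the condition score(x, y) >= score(x, y’) for every alternative y’.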
