THE MAXIMUM LIKELIHOOD METHOD IN PARAMETER ESTIMATION
Nguyễn Thị Hương
Hanoi University
Summary - Maximum likelihood estimation is a method for choosing an estimator of a
parameter without having to specify prior distributions. It can be viewed as finding the
value at which a function attains its maximum. In other words, MLE is a method that
determines the parameter values of a statistical model. The parameter values are chosen so
that they maximize the likelihood that the process described by the model produced the
data that were actually observed.
In this paper, I explain the content and role of the MLE method in parameter estimation
through typical examples. Some of the content uses basic knowledge of probability, as well
as the definitions and theorems of conditional probability and independent events. MLE is
a fairly effective and simple method that can be applied in most parameter estimation
problems. Moreover, for problems with large samples, MLE is an estimation method that
achieves high efficiency and reliability. For these reasons, MLE is widely used in
statistical problems.
Keywords - Maximum likelihood estimation (MLE), Bernoulli distribution, Poisson
distribution, likelihood function.
Abstract: Maximum likelihood estimation is a method for choosing estimators of
parameters that avoids using prior distributions and loss functions. It chooses as the estimate of
𝜃 the value of 𝜃 that provides the largest value of the likelihood function. In other words,
maximum likelihood estimation is a method that determines values for the parameters of a
model. The parameter values are found such that they maximize the likelihood that the process
described by the model produced the data that were actually observed.
Keywords: Maximum likelihood estimation (MLE), Bernoulli distribution, Poisson
distribution, likelihood function.

MAXIMUM LIKELIHOOD METHOD FOR
PARAMETER ESTIMATION
I. INTRODUCTION
In this paper I explain what the maximum likelihood method for parameter
estimation is and work through simple examples to demonstrate the method. Some of the
content requires knowledge of fundamental probability concepts such as the definition
of joint probability and independence of events. Maximum likelihood estimation (MLE) is
a simple method of constructing an estimator without having to specify a loss function
and a prior distribution; it was introduced by R.A. Fisher in 1912. Maximum likelihood
estimation can be applied in most problems, it has a strong intuitive appeal, and it will
often yield a reasonable estimator of θ. Furthermore, if the sample is large, the method
will typically yield an excellent estimator of θ. For these reasons, the method of
maximum likelihood is probably the most widely used method of estimation in
statistics.
II. MAXIMUM LIKELIHOOD METHOD
1) Definition

Let the random variables 𝑋1 , 𝑋2 , 𝑋3 , … , 𝑋𝑛 have joint density denoted
𝑓𝜃 (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) = 𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 |𝜃)
Given observed values 𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 , … , 𝑋𝑛 = 𝑥𝑛 , the likelihood of 𝜃 is the function
𝑙𝑖𝑘 (𝜃) = 𝑓 (𝑥1 , 𝑥2 , … , 𝑥𝑛 |𝜃 ) = 𝑓𝑛 (𝑥|𝜃)
considered as a function of 𝜃.
If the distribution is discrete, 𝑓 is the joint probability mass function (frequency function).
In words:
𝑙𝑖𝑘(𝜃) = probability of observing the given data, as a function of 𝜃.

The maximum likelihood estimate (MLE) of 𝜃 is the value of 𝜃 that maximizes
𝑙𝑖𝑘(𝜃): it is the value that makes the observed data the “most probable”.

Likelihood Function: When the joint probability density function (p.d.f) or the
joint probability mass function (p.m.f) 𝑓𝑛 (𝑥|𝜃) of the observations in a random sample
is regarded as a function of 𝜃 for given values of 𝑥1 , 𝑥2 , … , 𝑥𝑛 , it is called the likelihood
function.
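The definition can be made concrete with a minimal Python sketch. It assumes, purely for illustration, a sample of independent Bernoulli observations (the data values and the helper name `bernoulli_likelihood` are hypothetical, not taken from the paper). The data are held fixed and the joint p.m.f. is evaluated at many candidate values of 𝜃.

```python
import math

def bernoulli_likelihood(theta, data):
    """Joint p.m.f. of independent Bernoulli(theta) observations,
    viewed as a function of theta with the data held fixed."""
    return math.prod(theta ** x * (1 - theta) ** (1 - x) for x in data)

# Hypothetical sample: 7 successes in 10 trials.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Evaluate the likelihood on a grid of candidate theta values.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: bernoulli_likelihood(t, data))

print(theta_hat)
```

On this grid the likelihood peaks at the observed proportion of successes, which is what one would expect for Bernoulli data.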
What are parameters?
Often in machine learning we use a model to describe the process that results in the
data that are observed. For example, we may use a random forest model to classify
whether customers may cancel a subscription to a service (known as churn
modelling) or we may use a linear model to predict the revenue that will be generated
for a company depending on how much it spends on advertising (this would be an
example of linear regression). Each model contains its own set of parameters that
ultimately defines what the model looks like.
For a linear model we can write this as y = mx + c. In this example x could
represent the advertising spend and y might be the revenue generated. m and c are
parameters for this model. Different values for these parameters will give different lines
(see figure below).



So parameters define a blueprint for the model. It is only when specific values are
chosen for the parameters that we get an instantiation for the model that describes a
given phenomenon.
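A minimal Python sketch of this idea follows. The function name `linear_model` and the parameter values are hypothetical choices made only for illustration: the same blueprint y = mx + c yields a different line for each choice of m and c.

```python
def linear_model(m, c):
    """Return the line y = m*x + c for one specific choice of parameters."""
    return lambda x: m * x + c

# Two hypothetical parameter settings: same blueprint, different lines.
line_a = linear_model(m=2.0, c=1.0)
line_b = linear_model(m=0.5, c=4.0)

for x in (0.0, 2.0, 4.0):
    print(x, line_a(x), line_b(x))
```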
Calculating the Maximum Likelihood Estimates
Having seen what maximum likelihood estimation is, we can move on to calculating
the parameter values. The values that we find are called the maximum likelihood
estimates (MLE).
We'll demonstrate this with an example. Suppose we have three data points and we
assume that they have been generated from a process that is adequately described by a
Gaussian distribution. These points are 9, 9.5 and 11. How do we calculate the maximum
likelihood estimates of the parameters μ and σ of the Gaussian distribution?
What we want to calculate is the total probability of observing all of the data, i.e.
the joint probability distribution of all observed data points. In general this requires
conditional probabilities, which can get very difficult, so here we make our first
assumption: each data point is generated independently of the others. This assumption
makes the maths much easier. If the events (i.e. the processes that generate the data) are
independent, then the total probability of observing all of the data is the product of the
probabilities of observing each data point individually (i.e. the product of the marginal
probabilities).
The probability density of observing a single data point x, that is generated from a
Gaussian distribution is given by:
\[
P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]

The semicolon used in the notation 𝑃(𝑥; 𝜇, 𝜎) is there to emphasize that the
symbols that appear after it are parameters of the probability distribution. It should not
be confused with a conditional probability, which is typically written with a vertical
line, e.g. P(A | B).
In this example the total (joint) probability density of observing the three data
points is given by:
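\[
\begin{aligned}
P(9, 9.5, 11; \mu, \sigma) ={}& \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9-\mu)^2}{2\sigma^2}\right)
\cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9.5-\mu)^2}{2\sigma^2}\right) \\
&\cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(11-\mu)^2}{2\sigma^2}\right)
\end{aligned}
\]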
We just have to figure out the values of μ and σ that give the maximum
value of the above expression.
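To make the joint density tangible, the sketch below evaluates it in Python for the three data points at two candidate parameter settings. The candidate values of μ and σ and the helper names `gaussian_pdf` and `joint_density` are illustrative assumptions, not part of the original derivation.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a single observation x under a Gaussian with mean mu and s.d. sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def joint_density(data, mu, sigma):
    """Joint density of independent observations: the product of the marginals."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in data)

data = [9.0, 9.5, 11.0]

# Two hypothetical candidates: one near the data, one far from it.
near = joint_density(data, mu=10.0, sigma=1.0)
far = joint_density(data, mu=5.0, sigma=1.0)
print(near, far)
```

The candidate whose mean sits close to the data gives a much larger joint density, which is the quantity maximum likelihood estimation seeks to maximize.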
The log-likelihood
The above expression for the total probability is cumbersome to differentiate, so it is
almost always simplified by taking the natural logarithm of the expression. This is
legitimate because the natural logarithm is a monotonically increasing function: if the
value on the x-axis increases, the value on the y-axis also increases (see figure below).
This is important because it ensures that the maximum value of the log of the probability
occurs at the same point as the maximum of the original probability function. Therefore
we can work with the simpler log-likelihood instead of the original likelihood.

Figure: Monotonic behaviour of the function y = x (left) and the natural logarithm
function y = ln(x) (right). Both functions are monotonic because, going from left to
right on the x-axis, the y value always increases.

Figure: Example of a non-monotonic function: going from left to right on the graph,
the value of f(x) goes up, then goes down and then goes back up again.
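The claim that the logarithm does not move the maximum can be checked numerically. The following Python sketch assumes σ is held fixed at 1 and searches a grid of candidate means; both choices are made only for illustration.

```python
import math

data = [9.0, 9.5, 11.0]
sigma = 1.0  # held fixed for this illustration (an assumption)

def likelihood(mu):
    """Joint Gaussian density of the data as a function of mu."""
    return math.prod(
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for x in data
    )

def log_likelihood(mu):
    """Natural logarithm of the likelihood."""
    return math.log(likelihood(mu))

# Candidate means from 8.00 to 12.00 in steps of 0.01.
grid = [8.0 + i * 0.01 for i in range(401)]

best_by_likelihood = max(grid, key=likelihood)
best_by_log = max(grid, key=log_likelihood)
print(best_by_likelihood, best_by_log)
```

Both searches select the same grid point, as the monotonicity argument predicts.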
Taking logs of the original expression gives us:
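\[
\begin{aligned}
\ln(P(x; \mu, \sigma)) ={}& \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9-\mu)^2}{2\sigma^2}
+ \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9.5-\mu)^2}{2\sigma^2} \\
&+ \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(11-\mu)^2}{2\sigma^2}
\end{aligned}
\]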
This expression can be simplified again using logarithms to obtain:
\[
\ln(P(x; \mu, \sigma)) = -3\ln(\sigma) - \frac{3}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\left[(9-\mu)^2 + (9.5-\mu)^2 + (11-\mu)^2\right]
\]
This expression can be differentiated to find the maximum. In this example we’ll find the
MLE of the mean, μ. To do this we take the partial derivative of the function with respect
to 𝜇, giving
\[
\frac{\partial \ln(P(x; \mu, \sigma))}{\partial \mu} = \frac{1}{\sigma^2}\left[9 + 9.5 + 11 - 3\mu\right].
\]
Finally, setting the left hand side of the equation to zero and then rearranging for μ gives:
\[
\mu = \frac{9 + 9.5 + 11}{3} \approx 9.833
\]

And there we have our maximum likelihood estimate for 𝜇.
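The result can be checked with a short Python sketch. The estimate of μ uses the closed form derived above. The text does not derive the estimate of σ, so the sketch uses the standard result that it equals the square root of the average squared deviation from the sample mean. The perturbation size 0.1 is an arbitrary illustrative choice.

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

def log_likelihood(mu, sigma):
    """Gaussian log-likelihood of the data, in the simplified form used in the text."""
    return (-n * math.log(sigma)
            - n / 2 * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

# Closed form from the derivation: the sample mean.
mu_hat = sum(data) / n

# Standard result (not derived in the text): root of the mean squared deviation.
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

# Nudging either parameter away from its estimate should lower the log-likelihood.
best = log_likelihood(mu_hat, sigma_hat)
is_local_max = all(
    best > log_likelihood(mu_hat + d, sigma_hat) and best > log_likelihood(mu_hat, sigma_hat + d)
    for d in (-0.1, 0.1)
)

print(round(mu_hat, 3), round(sigma_hat, 3), is_local_max)
```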
2) Examples of Maximum Likelihood Estimators
• Test for a Disease
Suppose that you are walking down the street and notice that the Department of
Public Health is giving a free medical test for a certain disease. The test is 90
percent reliable in the following sense: if a person has the disease, there is a
probability of 0.9 that the test will give a positive response; whereas, if a person
does not have the disease, there is a probability of only 0.1 that the test will give
a positive response. We shall let X stand for the result of the test, where 𝑋 = 1
means that the test is positive and 𝑋 = 0 means that the test is negative. Let the
parameter space be Ω = {0.1, 0.9}, where 𝜃 = 0.1 means that the person tested
does not have the disease, and 𝜃 = 0.9 means that the person has the disease.
This parameter space was chosen so that, given 𝜃, 𝑋 has the Bernoulli
distribution with parameter 𝜃. The likelihood function is
\[
f(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}.
\]
If 𝑥 = 0 is observed, then
\[
f(0 \mid \theta) =
\begin{cases}
0.9 & \text{if } \theta = 0.1, \\
0.1 & \text{if } \theta = 0.9.
\end{cases}
\]
Clearly, 𝜃 = 0.1 maximizes the likelihood when 𝑥 = 0 is observed. If 𝑥 = 1
is observed, then
\[
f(1 \mid \theta) =
\begin{cases}
0.1 & \text{if } \theta = 0.1, \\
0.9 & \text{if } \theta = 0.9.
\end{cases}
\]
Clearly, 𝜃 = 0.9 maximizes the likelihood when 𝑥 = 1 is observed. Hence, we
have that the M.L.E. is
\[
\hat{\theta} =
\begin{cases}
0.1 & \text{if } X = 0, \\
0.9 & \text{if } X = 1.
\end{cases}
\]
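Because the parameter space here contains only two values, the M.L.E. can be found by direct comparison. A minimal Python sketch of that comparison follows; the names `bernoulli_pmf`, `PARAMETER_SPACE` and `mle` are illustrative.

```python
def bernoulli_pmf(x, theta):
    """f(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

# The two-point parameter space from the example.
PARAMETER_SPACE = (0.1, 0.9)

def mle(x):
    """Return the theta in the parameter space that maximizes f(x | theta)."""
    return max(PARAMETER_SPACE, key=lambda theta: bernoulli_pmf(x, theta))

print(mle(0), mle(1))
```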


• Poisson Distribution

Consider a Poisson distribution with probability mass function
\[
f(x \mid \mu) = \frac{e^{-\mu} \mu^{x}}{x!}, \qquad x = 0, 1, 2, \ldots
\]

Suppose that a random sample 𝑥1 , 𝑥2 , … , 𝑥𝑛 is taken from the distribution. What is the
maximum likelihood estimate of 𝜇?
Solution: The likelihood function is
\[
L(x_1, x_2, \ldots, x_n; \mu) = \prod_{i=1}^{n} f(x_i \mid \mu) = \frac{e^{-n\mu} \mu^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}
\]

Now consider

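\[
\ln L(x_1, x_2, \ldots, x_n; \mu) = -n\mu + \sum_{i=1}^{n} x_i \ln \mu - \ln \prod_{i=1}^{n} x_i!
\]
Differentiating with respect to 𝜇 gives
\[
\frac{\partial \ln L(x_1, x_2, \ldots, x_n; \mu)}{\partial \mu} = -n + \sum_{i=1}^{n} \frac{x_i}{\mu}
\]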

Solving for 𝜇̂ , the maximum likelihood estimator, involves setting the derivative to zero
and solving for the parameter. Thus,
\[
\hat{\mu} = \sum_{i=1}^{n} \frac{x_i}{n} = \bar{x}
\]

The second derivative of the log-likelihood function is negative, which implies
that the solution above indeed is a maximum. Since 𝜇 is the mean of the Poisson
distribution, the sample average would certainly seem like a reasonable estimator.
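The closed-form result can be compared with a brute-force search over the log-likelihood. In the Python sketch below, the count data are hypothetical and the grid resolution of 0.01 is an arbitrary choice. The term ln(x!) is computed with `math.lgamma(x + 1)`.

```python
import math

def poisson_log_likelihood(mu, data):
    """ln L = -n*mu + sum(x_i) * ln(mu) - sum(ln(x_i!)), computed term by term."""
    return sum(-mu + x * math.log(mu) - math.lgamma(x + 1) for x in data)

# Hypothetical sample of counts.
data = [2, 0, 3, 1, 4, 2]

# Closed form from the derivation: the sample mean.
mu_hat = sum(data) / len(data)

# Brute-force check over candidate values 0.01, 0.02, ..., 10.00.
grid = [0.01 * i for i in range(1, 1001)]
mu_grid = max(grid, key=lambda mu: poisson_log_likelihood(mu, data))

print(mu_hat, mu_grid)
```

The grid maximizer agrees with the sample mean up to the grid resolution.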
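3) Limitations of Maximum Likelihood Estimation
Despite its intuitive appeal, the method of maximum likelihood is not necessarily
appropriate in all problems. For example, suppose that 𝑋1 , … , 𝑋𝑛 form a random sample
from the uniform distribution on the interval [0, 𝜃], for which the M.L.E. is
𝜃̂ = max{𝑋1 , … , 𝑋𝑛 }. Since max{𝑋1 , … , 𝑋𝑛 } < 𝜃 with probability 1, it follows that
𝜃̂ surely underestimates the value of 𝜃. Indeed, if any prior distribution is assigned to 𝜃,
then the Bayes estimator of 𝜃 will surely be greater than 𝜃̂. The amount by which the
Bayes estimator exceeds 𝜃̂ will, of course, depend on the particular prior
distribution that is used and on the observed values of 𝑋1 , … , 𝑋𝑛 .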
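A small simulation illustrates the underestimation. The Python sketch below assumes samples drawn from a uniform distribution on [0, 𝜃], for which the sample maximum is the M.L.E. The true value 𝜃 = 5, the sample size and the number of repetitions are hypothetical choices.

```python
import random

random.seed(0)   # fixed seed so that repeated runs give the same numbers

theta = 5.0      # hypothetical true parameter
n = 10           # observations per sample
repetitions = 1000

estimates = []
for _ in range(repetitions):
    sample = [random.uniform(0.0, theta) for _ in range(n)]
    estimates.append(max(sample))  # M.L.E. of theta under the uniform model

average_estimate = sum(estimates) / len(estimates)
never_above_theta = all(e <= theta for e in estimates)

print(average_estimate, never_above_theta)
```

No estimate exceeds the true value, so the average of the estimates falls below it as well.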
Finally, we shall mention one point concerning the interpretation of the M.L.E.
The M.L.E. is the value of 𝜃 that maximizes the conditional p.f. or p.d.f. of the data 𝑿
given 𝜃. Therefore, the maximum likelihood estimate is the value of 𝜃 that assigns the
highest probability to seeing the observed data. It is not necessarily the value of the
parameter that appears to be most likely given the data. To say how likely different
values of the parameter are, one would need a probability distribution for the parameter. Of
course, the posterior distribution of the parameter would serve this purpose, but no
posterior distribution is involved in the calculation of the M.L.E. Hence, it is not
legitimate to interpret the M.L.E. as the most likely value of the parameter after having
seen the data.
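The distinction can be illustrated with the disease-test example. The Python sketch below assumes a hypothetical prior in which 1 percent of the people tested have the disease; this prior is not part of the original example and is introduced only to show that the M.L.E. and the most probable parameter value under a posterior distribution can differ.

```python
def bernoulli_pmf(x, theta):
    """f(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

# Hypothetical prior: theta = 0.9 (disease) has probability 0.01.
prior = {0.1: 0.99, 0.9: 0.01}

x = 1  # a positive test result is observed

# M.L.E.: maximize the likelihood alone, ignoring the prior.
mle = max(prior, key=lambda theta: bernoulli_pmf(x, theta))

# Posterior: likelihood times prior, normalized to sum to one.
weights = {theta: bernoulli_pmf(x, theta) * p for theta, p in prior.items()}
total = sum(weights.values())
posterior = {theta: w / total for theta, w in weights.items()}
posterior_mode = max(posterior, key=posterior.get)

print(mle, posterior_mode, posterior)
```

Under this assumed prior the M.L.E. is 0.9 while the posterior puts most of its mass on 0.1, so the two answers address different questions.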
III. SUMMARY
The maximum likelihood estimate of a parameter θ is that value of θ that provides
the largest value of the likelihood function 𝑓𝑛 (𝑥|𝜃) for fixed data x. If 𝛿(x) denotes the
maximum likelihood estimate for each possible observed x, then 𝜃̂ = 𝛿(X) is the maximum
likelihood estimator (M.L.E.). The method of maximum likelihood allows the analyst to make use of
knowledge of the distribution in determining an appropriate estimator. The method of
maximum likelihood cannot be applied without knowledge of the underlying
distribution.




