COMPUTER BASED NUMERICAL AND STATISTICAL TECHNIQUES
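Percentage error in r:
$$\frac{\delta r}{r}\times 100 = \frac{100}{r}\times\frac{\delta R}{2\,\dfrac{\partial R}{\partial r}}, \qquad \text{because } \delta r = \frac{\delta R}{2\,\dfrac{\partial R}{\partial r}},$$
$$= \frac{100}{r}\times\frac{\delta R}{2\left(\dfrac{r}{h}\right)} = \frac{100\,\delta R\,h}{2r^{2}}.$$
On substituting r = 4.5 and the value of δR from (1),
$$= \frac{100\times 0.002\times\dfrac{50.50}{11}\times 5.5}{2(4.5)^{2}} = \frac{0.1\times 50.50\times 5.5}{11\times 20.25} = 0.12.$$
Percentage error in h:
$$\frac{\delta h}{h}\times 100 = \frac{100}{h}\times\frac{\delta R}{2\,\dfrac{\partial R}{\partial h}} = \frac{100}{h}\times\frac{\delta R}{2\left(-\dfrac{r^{2}}{2h^{2}}+\dfrac{1}{2}\right)}$$
$$= \frac{100}{5.5}\times\frac{0.002\times\dfrac{50.5}{11}}{2\times\dfrac{20}{11\times 11}} = 0.505.$$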
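Example 29. Two sides and the included angle of a triangle are 9.6 cm, 7.8 cm and 45° respectively.
Find the possible error in the area of the triangle if the sides are measured correct to a millimeter and the
angle is measured correct to one degree.
Sol. Let the area of the triangle ABC be
$$X = \frac{1}{2}\,bc\sin A, \qquad b = 9.6 \text{ cm},\ c = 7.8 \text{ cm},\ A = 45^\circ.$$
The errors in the measurement of the sides and the angle are
$$\delta b = 0.05 \text{ cm}, \quad \delta c = 0.05 \text{ cm}, \quad \delta A = \frac{1}{2}\times 0.01745 = 0.008725 \text{ radians}.$$
The partial derivatives are
$$\frac{\partial X}{\partial b} = \frac{1}{2}\,c\sin A, \quad \frac{\partial X}{\partial c} = \frac{1}{2}\,b\sin A, \quad \frac{\partial X}{\partial A} = \frac{1}{2}\,bc\cos A,$$
so that
$$\delta X < \frac{\partial X}{\partial b}\,\delta b + \frac{\partial X}{\partial c}\,\delta c + \frac{\partial X}{\partial A}\,\delta A$$
$$< 0.05\times\frac{1}{2}\times 7.8\times\frac{1}{\sqrt{2}} + 0.05\times\frac{1}{2}\times 9.6\times\frac{1}{\sqrt{2}} + 0.008725\times\frac{1}{2}\times 9.6\times 7.8\times\frac{1}{\sqrt{2}}$$
$$< \frac{1}{\sqrt{2}}\,[\,0.05\times 3.9 + 0.05\times 4.8 + 0.008725\times 4.8\times 7.8\,]$$
$$< \frac{0.761664}{1.4142135} = 0.5385778 \approx 0.539 \text{ sq. cm.}$$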
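The bound in Example 29 can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `area_error_bound` and its argument names are illustrative assumptions.

```python
import math

def area_error_bound(b, c, angle_deg, db, dc, dangle_rad):
    """First-order bound on the error in X = (1/2) * b * c * sin(A)."""
    A = math.radians(angle_deg)
    dX_db = 0.5 * c * math.sin(A)      # partial derivative with respect to b
    dX_dc = 0.5 * b * math.sin(A)      # partial derivative with respect to c
    dX_dA = 0.5 * b * c * math.cos(A)  # partial derivative with respect to A
    return abs(dX_db) * db + abs(dX_dc) * dc + abs(dX_dA) * dangle_rad

# Data of Example 29: sides to the nearest mm, angle to the nearest degree.
bound = area_error_bound(9.6, 7.8, 45.0, 0.05, 0.05, 0.008725)
print(round(bound, 3))  # 0.539
```

Taking absolute values of the partial derivatives keeps the bound valid even when a derivative is negative, which does not happen for this particular triangle.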
ERRORS AND FLOATING POINT
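Example 30. The error in the measurement of the area of a circle is not allowed to exceed 0.5%. How
accurately should the radius be measured?
Sol. Area of the circle $= \pi r^{2} = A$ (say), so that
$$\frac{\partial A}{\partial r} = 2\pi r.$$
Percentage error in A $= \dfrac{\delta A}{A}\times 100 = 0.5$.
Therefore $\delta A = \dfrac{0.5}{100}\times A = \dfrac{1}{200}\,\pi r^{2}$.
Percentage error in r $= \dfrac{\delta r}{r}\times 100$
$$= \frac{100}{r}\cdot\frac{\delta A}{\dfrac{\partial A}{\partial r}} = \frac{100}{r}\cdot\frac{\dfrac{1}{200}\,\pi r^{2}}{2\pi r} = \frac{1}{4} = 0.25.$$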
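Example 31. The error in the measurement of the area of a circle is not allowed to exceed 0.1%. How
accurately should the diameter be measured?
Sol. Let d be the diameter of the circle; then its area is given by $A = \dfrac{\pi d^{2}}{4}$. Therefore,
$$\frac{\partial A}{\partial d} = \frac{\pi d}{2}.$$
Since $\delta A = \dfrac{\partial A}{\partial d}\,\delta d$, therefore $\delta d = \dfrac{\delta A}{\partial A/\partial d}$.
Now, percentage error in A $= \dfrac{\delta A}{A}\times 100 = 0.1$.
Therefore, $\delta A = \dfrac{0.1}{100}\times A = 0.001\,A = 0.001\times\dfrac{\pi d^{2}}{4}$.
Similarly, percentage error in d $= \dfrac{\delta d}{d}\times 100$
$$= \frac{100}{d}\cdot\frac{\delta A}{\dfrac{\partial A}{\partial d}} = \frac{100}{d}\times 0.001\times\frac{\pi d^{2}}{4}\times\frac{2}{\pi d} = \frac{0.1\times 2}{4} = \frac{0.1}{2} = 0.05.$$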
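Examples 30 and 31 both rest on the first-order relation that, for a quantity proportional to the square of a length, the percentage error in the quantity is twice the percentage error in the length. A small Python sketch of that relation follows; the helper name `allowed_percent_error` and the sample radius 3.0 are assumptions made only for illustration.

```python
import math

def allowed_percent_error(percent_error_in_area, exponent=2):
    """If A is proportional to r**exponent, then dA/A = exponent * dr/r to first order."""
    return percent_error_in_area / exponent

radius_tol = allowed_percent_error(0.5)    # Example 30: 0.25 %
diameter_tol = allowed_percent_error(0.1)  # Example 31: 0.05 %

# Direct check of Example 30 with an arbitrary radius (3.0 is only a sample value).
r = 3.0
dr = r * radius_tol / 100
area_change = (math.pi * (r + dr) ** 2 - math.pi * r ** 2) / (math.pi * r ** 2) * 100
print(radius_tol, diameter_tol, area_change)
```

The directly computed change in area is slightly above 0.5% because the first-order analysis drops the term in (δr)².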
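Example 32. In a ∆ABC, b = 9.5 cm, c = 8.5 cm and A = 45°; find the allowable errors in b, c, and
A such that the area of ∆ABC may be determined to the nearest square centimeter.
Sol. Let the area of the ∆ABC be given by
$$X = \frac{1}{2}\,bc\sin A.$$
For the area to be correct to the nearest square centimeter, δX = 0.5 sq. cm. Sharing this error equally among the three measured quantities gives:
(1)
$$\delta b = \frac{\delta X}{3\,\dfrac{\partial X}{\partial b}} = \frac{0.5}{3\times\dfrac{1}{2}\,c\sin A} = \frac{0.5}{3\times\dfrac{1}{2}\times 8.5\times\dfrac{1}{\sqrt{2}}} = 0.055 \text{ cm}.$$
(2)
$$\delta c = \frac{\delta X}{3\,\dfrac{\partial X}{\partial c}} = \frac{0.5}{3\times\dfrac{1}{2}\,b\sin A} = \frac{0.5}{3\times\dfrac{1}{2}\times 9.5\times\dfrac{1}{\sqrt{2}}} = 0.049 \text{ cm}.$$
(3)
$$\delta A = \frac{\delta X}{3\,\dfrac{\partial X}{\partial A}} = \frac{0.5}{3\times\dfrac{1}{2}\,bc\cos A} = \frac{0.5}{3\times\dfrac{1}{2}\times 9.5\times 8.5\times\dfrac{1}{\sqrt{2}}} = 0.006 \text{ radians}.$$
Here the values of δb and δc are truncated to three decimal places (the quotients are 0.0555 and 0.0496), while δA is rounded from 0.0058.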
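The allowable errors in Example 32 come from dividing the permitted error in X equally among the three measured quantities. A minimal Python sketch of that splitting rule is given below; the function name `equal_effects_tolerances` and the dictionary layout are assumptions for illustration only.

```python
import math

def equal_effects_tolerances(b, c, angle_deg, dX):
    """Split the allowed error dX in X = (1/2) b c sin A equally among b, c and A."""
    A = math.radians(angle_deg)
    partials = {
        "b": 0.5 * c * math.sin(A),
        "c": 0.5 * b * math.sin(A),
        "A": 0.5 * b * c * math.cos(A),
    }
    n = len(partials)  # three measured quantities share the error
    return {name: dX / (n * abs(p)) for name, p in partials.items()}

# Data of Example 32: area wanted to the nearest sq. cm, so dX = 0.5.
tol = equal_effects_tolerances(9.5, 8.5, 45.0, 0.5)
for name, value in tol.items():
    print(name, round(value, 4))
```

The printed tolerances are in centimeters for b and c and in radians for A.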
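Example 33. In a triangle ∆ABC, a = 2.3 cm, b = 5.7 cm and ∠B = 90°. If the possible errors in the
computed values of b and a are 2 mm and 1 mm respectively, find the possible error in the measurement
of angle A.
Sol. Given δb = 2 mm = 0.2 cm
δa = 1 mm = 0.1 cm
$$\sin A = \frac{a}{b} \;\Rightarrow\; A = \sin^{-1}\frac{a}{b}$$
$$\frac{\partial A}{\partial a} = \frac{1}{\sqrt{1-\dfrac{a^{2}}{b^{2}}}}\cdot\frac{1}{b} = \frac{1}{\sqrt{b^{2}-a^{2}}}$$
$$\frac{\partial A}{\partial b} = \frac{1}{\sqrt{1-\dfrac{a^{2}}{b^{2}}}}\cdot\left(-\frac{a}{b^{2}}\right) = -\frac{a}{b\sqrt{b^{2}-a^{2}}}$$
$$\delta A < \left|\frac{\partial A}{\partial a}\right|\delta a + \left|\frac{\partial A}{\partial b}\right|\delta b$$
$$< \frac{1}{\sqrt{(5.7)^{2}-(2.3)^{2}}}\times 0.1 + \frac{2.3}{5.7\sqrt{(5.7)^{2}-(2.3)^{2}}}\times 0.2$$
$$< \frac{0.1}{5.2154} + \frac{0.46}{29.7276} = 0.0346 \text{ radians}.$$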
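Example 34. In a triangle ∆ABC, a = 30 cm, b = 80 cm and ∠B = 90°. Find the maximum error
in the computed value of A, if the possible errors in a and b are $\frac{1}{3}\%$ and $\frac{1}{4}\%$ respectively.
Sol. Since $\sin A = \dfrac{a}{b} \Rightarrow A = \sin^{-1}\dfrac{a}{b}$,
$$\delta A < \left|\frac{\partial A}{\partial a}\right|\delta a + \left|\frac{\partial A}{\partial b}\right|\delta b \qquad (1)$$
Given that $\dfrac{\delta a}{a}\times 100 = \dfrac{1}{3} \Rightarrow \delta a = 0.1$.
Also $\dfrac{\delta b}{b}\times 100 = \dfrac{1}{4} \Rightarrow \delta b = 0.2$.
We have
$$\frac{\partial A}{\partial a} = \frac{1}{\sqrt{b^{2}-a^{2}}} \quad\text{and}\quad \frac{\partial A}{\partial b} = \frac{-a}{b\sqrt{b^{2}-a^{2}}}.$$
Substituting these values in equation (1), with $\sqrt{b^{2}-a^{2}} = \sqrt{5500} \approx 74.162$, we have
δA < 0.00135 + 0.00101 = 0.00236 radians
or δA < 8′7″ (about 8 minutes of arc, since 0.00236 radians ≈ 0.135°).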
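The bounds of Examples 33 and 34 use the same two partial derivatives of A = sin⁻¹(a/b). The sketch below evaluates them in Python; `angle_error_bound` is a hypothetical helper name, and the conversion to minutes of arc is included only to check the unit of the final answer.

```python
import math

def angle_error_bound(a, b, da, db):
    """First-order bound on the error in A = asin(a / b) for a right angle at B."""
    root = math.sqrt(b * b - a * a)
    dA_da = 1.0 / root          # partial derivative with respect to a
    dA_db = -a / (b * root)     # partial derivative with respect to b
    return abs(dA_da) * da + abs(dA_db) * db

# Example 33: a = 2.3 cm, b = 5.7 cm, errors 1 mm and 2 mm.
ex33 = angle_error_bound(2.3, 5.7, 0.1, 0.2)

# Example 34: a = 30 cm, b = 80 cm, errors of 1/3 % and 1/4 %.
ex34 = angle_error_bound(30.0, 80.0, 30.0 / 300.0, 80.0 / 400.0)
ex34_arcmin = math.degrees(ex34) * 60  # radians -> minutes of arc

print(round(ex33, 4), round(ex34, 5), round(ex34_arcmin, 2))
```

A bound of a few thousandths of a radian corresponds to minutes of arc rather than degrees, which is a quick sanity check on the final conversion.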
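Example 35. Find the smaller root of the equation $x^{2} - 30x + 1 = 0$ correct to three places of decimal.
State the different algorithms; which algorithm is better and why?
Sol. The roots of the given equation $x^{2} - 30x + 1 = 0$ are
$$x = \frac{30 \pm \sqrt{900 - 4}}{2} = \frac{30 \pm \sqrt{896}}{2}$$
and the smaller root is $\dfrac{30 - \sqrt{896}}{2} = 15 - \sqrt{224}$.
(1) First method: $15 - \sqrt{224} = 0.0333704$
(2) Second method:
$$\frac{(15 - \sqrt{224})(15 + \sqrt{224})}{15 + \sqrt{224}} = \frac{225 - 224}{15 + \sqrt{224}} = \frac{1}{15 + \sqrt{224}} = \frac{1}{15 + 14.966629} = \frac{1}{29.966629} = 0.0333704$$
The second algorithm is the better one: it replaces the subtraction of the nearly equal numbers 15 and
$\sqrt{224} \approx 14.966629$ by an addition, so no significant figures are lost to cancellation when $\sqrt{224}$ is
known only to a limited number of figures. To three decimal places the smaller root is 0.033.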
Example 36. Find the smaller root of the equation $x^{2} - 400x + 1 = 0$ using four-digit arithmetic.
Sol. We know that the smaller root of the equation $ax^{2} - bx + c = 0$, b > 0, is given by
$$x = \frac{b - \sqrt{b^{2} - 4ac}}{2a}.$$
Here a = 1 = 0.1000 × 10¹
b = 400 = 0.4000 × 10³
c = 1 = 0.1000 × 10¹
$b^{2} - 4ac$ = 0.1600 × 10⁶ – 0.4000 × 10¹ = 0.1600 × 10⁶ (to four-digit accuracy)
$\sqrt{b^{2} - 4ac}$ = 0.4000 × 10³
On substituting these values in the above formula we obtain x = 0.0000.
However, this formula can also be written as
$$x = \frac{2c}{b + \sqrt{b^{2} - 4ac}}$$
or
$$x = \frac{0.2000\times 10^{1}}{0.4000\times 10^{3} + 0.4000\times 10^{3}} = \frac{0.2000\times 10^{1}}{0.8000\times 10^{3}} = 0.0025.$$
This agrees with the true root 0.0025000156… to four significant digits, whereas the first formula lost every significant digit.
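The four-digit computation of Example 36 can be imitated with Python's `decimal` module by setting the working precision to four significant digits. This is a sketch under that assumption (decimal rounding to four digits after every operation); the function name `smaller_root` is illustrative, and the equation is taken in the form x² − bx + c = 0 with a = 1.

```python
from decimal import Decimal, localcontext

def smaller_root(b, c, digits):
    """Smaller root of x**2 - b*x + c = 0 computed two ways in `digits`-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits                  # every operation rounds to this many digits
        b, c = Decimal(b), Decimal(c)
        disc = (b * b - 4 * c).sqrt()      # sqrt(b^2 - 4c)
        naive = (b - disc) / 2             # subtracts two nearly equal numbers
        stable = (2 * c) / (b + disc)      # rationalized form, no cancellation
    return naive, stable

naive, stable = smaller_root(400, 1, 4)
print(naive, stable)   # the naive form collapses to zero, the stable form gives 0.0025

# The same comparison for Example 35 with six working digits.
print(smaller_root(30, 1, 6))
```

The rationalized form is preferable whenever b² is much larger than 4ac, because the subtraction b − √(b² − 4ac) is then the only place where significant digits can be lost.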
Remark: When two nearly equal numbers are subtracted, there is a loss of significant
figures.
e.g., 43.206 – 42.995 = 0.211
Here the given numbers are correct to five figures while the result 0.211 is correct to three
figures only. Similarly, the numbers 12450 and 12360 are correct to four figures and their difference
90 is correct to one figure only. The error due to loss of significant figures sometimes renders the
result of a computation worthless. Such error can be minimized by replacing:
(1) $\sqrt{a} - \sqrt{b}$ by $\dfrac{a - b}{\sqrt{a} + \sqrt{b}}$
(2) $\sin a - \sin b$ by $2\cos\dfrac{a+b}{2}\,\sin\dfrac{a-b}{2}$
(3) $1 - \cos a$ by $2\sin^{2}\dfrac{a}{2}$ or $\dfrac{a^{2}}{2!} - \dfrac{a^{4}}{4!} + \cdots$
(4) $\log a - \log b$ by $\log\dfrac{a}{b}$, etc.
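The effect of rewriting (3) is easy to observe in double-precision arithmetic. The following Python sketch compares the two forms of 1 − cos a for a small argument; the function names and the test value a = 10⁻⁹ are assumptions chosen for illustration.

```python
import math

def one_minus_cos_naive(a):
    # Direct form: cos(a) is so close to 1 that the subtraction cancels completely.
    return 1.0 - math.cos(a)

def one_minus_cos_stable(a):
    # Rewritten form 2*sin^2(a/2): no subtraction of nearly equal numbers.
    return 2.0 * math.sin(a / 2.0) ** 2

a = 1e-9
naive = one_minus_cos_naive(a)
stable = one_minus_cos_stable(a)
print(naive, stable)
```

For this argument the exact value is close to a²/2 = 5 × 10⁻¹⁹; the direct form returns zero because cos a rounds to exactly 1 in double precision.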
1.5 ERROR IN SERIES APPROXIMATION
The error committed in a series approximation can be evaluated by using the remainder after n
terms. Taylor's series for f(x) at x = a is given by
$$f(x) = f(a) + (x-a)f'(a) + \frac{(x-a)^{2}}{2!}f''(a) + \cdots + \frac{(x-a)^{n-1}}{(n-1)!}f^{(n-1)}(a) + R_n(x),$$
where
$$R_n(x) = \frac{(x-a)^{n}}{n!}\,f^{(n)}(\zeta); \qquad a < \zeta < x.$$
The term $R_n(x)$ is called the remainder term, and for a convergent series it tends to zero as
$n \to \infty$. Thus, if we approximate f(x) by the first n terms of the series, the maximum error committed
in this approximation is given by $R_n(x)$; and if the required accuracy is specified in advance, it is
possible to find the number of terms n such that the finite series yields that accuracy.
Example 37. The Maclaurin expansion for $e^{x}$ is given by
$$e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots + \frac{x^{n-1}}{(n-1)!} + \frac{x^{n}}{n!}\,e^{\xi}, \qquad 0 < \xi < x.$$
Find the number of terms such that their sum yields the value of $e^{x}$ correct to 8 decimal places at
x = 1.
Sol. From the given expansion, the remainder term is
$$R_n(x) = \frac{x^{n}}{n!}\,e^{\xi},$$
so that ξ = x gives the maximum absolute error
$$E_{a(\max)} = \frac{x^{n}}{n!}\,e^{x} \quad\text{and}\quad E_{r(\max)} = \frac{x^{n}}{n!}.$$
For an 8 decimal accuracy at x = 1,
$$\frac{1}{n!} < \frac{1}{2}\times 10^{-8} \;\Rightarrow\; n = 12.$$
Hence we need 12 terms of the expansion in order that its sum is correct to 8 decimal places.
Example 38. Find the number of terms of the exponential series such that their sum gives the value
of $e^{x}$ correct to six decimal places at x = 1.
Sol. The exponential series is given by
$$e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots + \frac{x^{n-1}}{(n-1)!} + R_n(x) \qquad (1)$$
where $R_n(x) = \dfrac{x^{n}}{n!}\,e^{\theta}$, $0 < \theta < x$.
The maximum absolute error, at θ = x, is $E_{a(\max)} = \dfrac{x^{n}}{n!}\,e^{x}$
and the maximum relative error is $E_{r(\max)} = \dfrac{x^{n}}{n!}$.
Hence $E_{r(\max)}$ at x = 1 is $\dfrac{1}{n!}$.
For a six decimal accuracy at x = 1, we get
$$\frac{1}{n!} < \frac{1}{2}\times 10^{-6} \;\Rightarrow\; n! > 2\times 10^{6} \;\Rightarrow\; n = 10.$$
Hence we need 10 terms of series (1) to obtain the sum correct to 6 decimal places.
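The term counts in Examples 37 and 38 follow from the condition n! > 2 × 10^d for d decimal places. A short Python sketch of that search is given below; the function name `terms_needed` is an assumption for illustration.

```python
import math

def terms_needed(decimal_places):
    """Smallest n with 1/n! < 0.5 * 10**(-decimal_places), i.e. n! > 2 * 10**decimal_places."""
    n = 1
    while math.factorial(n) <= 2 * 10 ** decimal_places:
        n += 1
    return n

print(terms_needed(8))  # 12, as in Example 37
print(terms_needed(6))  # 10, as in Example 38
```

Integer arithmetic is used throughout, so the comparison is exact and free of rounding error.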
Example 39. Obtain a second-degree polynomial approximation to $f(x) = (1 + x)^{1/2}$, x ∈ [0, 0.1], using
the Taylor series expansion about x = 0. Use the expansion to approximate f(0.05) and find a bound on the
truncation error.
Sol. Given that $f(x) = (1 + x)^{1/2}$, f(0) = 1,
$$f'(x) = \frac{1}{2}(1+x)^{-1/2}, \qquad f'(0) = \frac{1}{2}$$
$$f''(x) = -\frac{1}{4}(1+x)^{-3/2}, \qquad f''(0) = -\frac{1}{4}$$
$$f'''(x) = \frac{3}{8}(1+x)^{-5/2}, \qquad f'''(0) = \frac{3}{8}$$
Thus, the Taylor series expansion with remainder term is given by
$$(1+x)^{1/2} = 1 + \frac{x}{2} - \frac{x^{2}}{8} + \frac{1}{16}\,\frac{x^{3}}{\left[(1+\xi)^{1/2}\right]^{5}}, \qquad 0 < \xi < 0.1.$$
The truncation term is
$$T = (1+x)^{1/2} - \left(1 + \frac{x}{2} - \frac{x^{2}}{8}\right) = \frac{1}{16}\,\frac{x^{3}}{\left[(1+\xi)^{1/2}\right]^{5}}.$$
Also
$$f(0.05) \approx 1 + \frac{0.05}{2} - \frac{(0.05)^{2}}{8} = 0.10246875\times 10^{1}.$$
The bound on the truncation error, for x ∈ [0, 0.1], is
$$|T| \le \max_{0\le x\le 0.1}\frac{(0.1)^{3}}{16\left[(1+x)^{1/2}\right]^{5}} \le \frac{(0.1)^{3}}{16} = 0.625\times 10^{-4}.$$
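The approximation and the bound of Example 39 can be compared with the library square root. This is a minimal Python sketch; the function name `sqrt1p_quadratic` is an illustrative assumption.

```python
import math

def sqrt1p_quadratic(x):
    """Second-degree Taylor polynomial of (1 + x)**0.5 about x = 0."""
    return 1.0 + x / 2.0 - x * x / 8.0

x = 0.05
approx = sqrt1p_quadratic(x)
actual_error = abs(math.sqrt(1.0 + x) - approx)
bound = 0.1 ** 3 / 16.0   # truncation bound valid on [0, 0.1]

print(approx, actual_error, bound)
```

The actual error at x = 0.05 is well below the bound, as expected, since the bound is taken at the worst point of the interval [0, 0.1].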
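Example 40. The function $f(x) = \tan^{-1}x$ can be expanded as
$$\tan^{-1}x = x - \frac{x^{3}}{3} + \frac{x^{5}}{5} - \cdots + (-1)^{n-1}\frac{x^{2n-1}}{2n-1} + \cdots$$
Find n such that the series determines $\tan^{-1}(1)$ correct to eight significant digits.
Sol. If we retain n terms, then the (n + 1)th term $= (-1)^{n}\,\dfrac{x^{2n+1}}{2n+1}$.
For x = 1, the (n + 1)th term $= \dfrac{(-1)^{n}}{2n+1}$.
To determine $\tan^{-1}(1)$ correct up to eight significant digits,
$$\left|\frac{(-1)^{n}}{2n+1}\right| < \frac{1}{2}\times 10^{-8} \;\Rightarrow\; 2n + 1 > 2\times 10^{8}.$$
The value $n = 10^{8} + 1$ satisfies this inequality (the smallest integer that does so is $n = 10^{8}$).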
Example 41. Use the series
$$\log_e\frac{1+x}{1-x} = 2\left(x + \frac{x^{3}}{3} + \frac{x^{5}}{5} + \cdots\right)$$
to compute the value of log (1.2)
correct to seven decimal places and find the number of terms retained.
Sol. Assume
$$\frac{1+x}{1-x} = 1.2 \;\Rightarrow\; x = \frac{1}{11}.$$
If we retain n terms, then the (n + 1)th term
$$= \frac{2\,x^{2n+1}}{2n+1} = \frac{2\,(1/11)^{2n+1}}{2n+1}.$$
For seven decimal accuracy,
$$\frac{2}{2n+1}\left(\frac{1}{11}\right)^{2n+1} < \frac{1}{2}\times 10^{-7}$$
$$(2n+1)(11)^{2n+1} > 4\times 10^{7}.$$
This gives $n \ge 3$.
After retaining the first three terms of the series, we get
$$\log_e(1.2) = 2\left(x + \frac{x^{3}}{3} + \frac{x^{5}}{5}\right) \text{ at } x = \frac{1}{11}$$
= 0.1823215.
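The term count and the value found in Example 41 can be reproduced directly. The Python sketch below is illustrative only; the function names `log_ratio_series` and `terms_for_accuracy` are assumptions, and `math.log` is used purely as a reference value.

```python
import math

def log_ratio_series(x, n_terms):
    """Partial sum 2 * (x + x**3/3 + x**5/5 + ...) with n_terms terms."""
    return 2.0 * sum(x ** (2 * k + 1) / (2 * k + 1) for k in range(n_terms))

def terms_for_accuracy(x, decimal_places):
    """Smallest n whose (n + 1)th term is below 0.5 * 10**(-decimal_places)."""
    n = 1
    while 2.0 * x ** (2 * n + 1) / (2 * n + 1) >= 0.5 * 10 ** (-decimal_places):
        n += 1
    return n

x = 1.0 / 11.0                     # from (1 + x) / (1 - x) = 1.2
n = terms_for_accuracy(x, 7)
value = log_ratio_series(x, n)
print(n, round(value, 7), math.log(1.2))
```

Because x = 1/11 is small, the terms shrink by a factor of roughly 121 each time, which is why three terms already suffice.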
PROBLEM SET 1.1
1. Prove that the relative error of a product of three non-zero numbers does not exceed the
sum of the relative errors of the given numbers.
2. If $y = (0.31x + 2.73)/(x + 0.35)$, where the coefficients are rounded off, find the absolute and
relative errors in y when $x = 0.5 \pm 0.1$. [Ans. $e_a$ = 2.9047, 4.6604; $e_r$ = 0.9464, 1.225]
3. Find the quotient q = x/y, where x = 4.536 and y = 1.32, both x and y being correct to the
digits given. Find also the relative error in the result. [Ans. q = 3.44, $e_r$ = 0.0039]
4. If $S = 4x^{2}y^{3}z^{-4}$, find the maximum absolute error and the maximum relative error in S when
the errors in x = 1, y = 2, z = 3 are equal to 0.001, 0.002, 0.003 respectively.
[Ans. 0.0035, 0.0089]
5. Obtain the range of values within which the exact value of $\dfrac{1.265\,(10.21 - 7.54)}{47}$ lies, if all the
numerical quantities are rounded off. [Hint: take $e_a$ < 1%] [Ans. 0.06186 < x < 0.08186]
6. Find the number of terms of the exponential series such that their sum yields the value of
$e^{x}$ correct to 8 decimal places at x = 1. [Ans. n = 12]
7. Estimate the error in evaluating f(x) =
2
log
cos
xe
around x = 2 if the absolute error in x is $10^{-6}$. [Ans. $0.0053\times 10^{-3}$]
8. (a)
∞
−
=
∑
0
6
k
S
, calculate the actual sum by using the infinite series. [Ans. 12]
(b) Assume three-digit arithmetic. Find the sum (up to 11 terms) by adding largest to
smallest. Also find the absolute, relative and percentage errors.
(c) Find the sum up to 11 terms by adding smallest to largest and also find the absolute,
relative and percentage errors.
9. Find the absolute, relative and percentage errors of the approximations
(a) $\dfrac{1}{11} \approx 0.1$  (b) $\dfrac{1}{11} \approx 0.00$  (c) $\dfrac{5}{9} \approx 0.56$  (d) $\dfrac{4}{9} \approx 0.44$
[Ans. (a) $e_a$ = 0.009, $e_r$ = 0.0999, $e_p$ = 9.9]
[Ans. (b) $e_a$ = 0.009, $e_r$ = 0.01, $e_p$ = 1.0]
[Ans. (c) $e_a$ = 0.004, $e_r$ = 0.008, $e_p$ = 0.8]
[Ans. (d) $e_a$ = 0.0044, $e_r$ = 0.0099, $e_p$ = 0.9]
10. Describe the possible causes of serious error in calculating $A = \dfrac{\sin x}{1 + \cos x}$ for cos x ≈ –1.
11. Find the percentage error if the number 5007932 is approximated to four significant figures.
[Ans. 0.018%]
12. Compute the relative maximum error in the function u =
2
7
y
x
x
, when x = y = z = 1 and
errors in x, y, z be 0.001. [Ans. 0.006]
13. Obtain a second-degree polynomial approximation to the function f(x) =
∈
+
2
1
,[1,2]
1
x
x
using
Taylor’s series expansion about x = 1. Find a bound on the truncation error. [Ans. 0.25]
14. Find the number of terms in the series expansion of the function f(x) = cos x, such that
their sum gives the value of cos x correct to five decimal places for all values of x in the
range $-\dfrac{\pi}{2} \le x \le +\dfrac{\pi}{2}$. Find also the truncation error. [Ans. n = 6, Trun. error = 0.020]
1.6 SOME IMPORTANT MATHEMATICAL PRELIMINARIES
There are some important mathematical preliminaries given below which would be useful in
numerical computation.
(a) If f(x) is continuous in a ≤ x ≤ b, and if f(a) and f(b) are of opposite sign, then f(d) = 0
for at least one number d such that a < d < b.
(b) Intermediate Value Theorem: Let f(x) be continuous in a ≤ x ≤ b and let l be any number
between f(a) and f(b); then there exists a number d in a < x < b such that f(d) = l.
(c) Mean Value Theorem for Derivatives: If f(x) is continuous in [a, b] and f′(x) exists in
(a, b), then there exists at least one value of x, say d, between a and b such that
$$f'(d) = \frac{f(b) - f(a)}{b - a}, \qquad a < d < b.$$
(d) Rolle's Theorem: If f(x) is continuous in a ≤ x ≤ b, f′(x) exists in a < x < b and
f(a) = f(b) = 0, then there exists at least one value of x, say d, such that f′(d) = 0, a < d < b.
(e) Generalized Form of Rolle's Theorem: If f(x) is n times differentiable on a ≤ x ≤ b and
f(x) vanishes at the (n + 1) distinct points $x_0, x_1, x_2, \ldots, x_n$ in (a, b), then there exists
a number d in a < x < b such that $f^{(n)}(d) = 0$.
(f) Taylor's Series for a Function of One Variable: If f(x) is continuous and possesses
continuous derivatives of order n in an interval that includes x = a, then in that interval
$$f(x) = f(a) + (x-a)f'(a) + \frac{(x-a)^{2}}{2!}f''(a) + \cdots + \frac{(x-a)^{n-1}}{(n-1)!}f^{(n-1)}(a) + R_n(x),$$
where $R_n(x)$, the remainder term, can be expressed in the form
$$R_n(x) = \frac{(x-a)^{n}}{n!}\,f^{(n)}(d), \qquad a < d < x.$$
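(g) Maclaurin's Expansion:
$$f(x) = f(0) + x f'(0) + \frac{x^{2}}{2!}f''(0) + \cdots + \frac{x^{n}}{n!}f^{(n)}(0) + \cdots$$
(h) Taylor's Series for a Function of Two Variables:
$$f(x_1 + \Delta x_1,\, x_2 + \Delta x_2) = f(x_1, x_2) + \frac{\partial f}{\partial x_1}\Delta x_1 + \frac{\partial f}{\partial x_2}\Delta x_2 + \frac{1}{2}\left[\frac{\partial^{2} f}{\partial x_1^{2}}(\Delta x_1)^{2} + 2\,\frac{\partial^{2} f}{\partial x_1\,\partial x_2}\Delta x_1\,\Delta x_2 + \frac{\partial^{2} f}{\partial x_2^{2}}(\Delta x_2)^{2}\right] + \cdots$$
This form can easily be generalized to a function of several variables. Therefore
$$f(x_1 + \Delta x_1,\, x_2 + \Delta x_2,\, \ldots,\, x_n + \Delta x_n) = f(x_1, x_2, x_3, \ldots, x_n) + \frac{\partial f}{\partial x_1}\Delta x_1 + \frac{\partial f}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}\Delta x_n$$
$$+ \frac{1}{2}\left[\frac{\partial^{2} f}{\partial x_1^{2}}(\Delta x_1)^{2} + \cdots + \frac{\partial^{2} f}{\partial x_n^{2}}(\Delta x_n)^{2} + 2\,\frac{\partial^{2} f}{\partial x_1\,\partial x_2}\Delta x_1\cdot\Delta x_2 + \cdots + 2\,\frac{\partial^{2} f}{\partial x_{n-1}\,\partial x_n}\Delta x_{n-1}\cdot\Delta x_n\right] + \cdots$$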
1.7 FLOATING POINT
The IEEE floating-point standards prescribe precisely how floating-point numbers should be
represented, and the results of all operations on floating-point numbers. There are two standards:
IEEE 754 is for binary arithmetic, and IEEE 854 covers decimal arithmetic as well. Only IEEE
754 has been adopted almost universally by computer manufacturers. Unfortunately, not all manufacturers
implement every detail of IEEE arithmetic the same way. Everyone does indeed represent numbers
with the same bit patterns and round results correctly (or tries to). But exceptional conditions
(like 1/0, sqrt(–1), etc.), whose semantics are also specified in detail by the IEEE standards, are not
always handled the same way. It turns out that many manufacturers believe (sometimes rightly
and sometimes wrongly) that conforming to every detail of IEEE arithmetic would make their
machines either a bit slower or a bit more expensive, enough to make them less attractive in the
marketplace. The IEEE 754 standard for binary arithmetic specifies 4 floating-point formats:
single, single extended, double and double extended.
Floating-point numbers are represented in the form ± significand × 2^(exponent), where
the significand is a non-negative number. A normalized significand lies in the half-open interval
[1, 2), i.e., it has no leading zero bits.
Macheps is short for "machine epsilon", and is used below for round-off error analysis.
The distance between 1 and the next larger floating-point number is 2*macheps. When the exponent
has neither its largest possible value (a string of all ones) nor its smallest value (a string of all
zeros), then the significand is necessarily normalized, and lies in [1, 2). When the exponent has
its largest possible value (all ones), the floating-point number is +infinity, −infinity, or NaN
(not-a-number). The largest finite floating-point number is called the overflow threshold.
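These quantities can be inspected in Python, whose `float` is an IEEE 754 double on virtually all current platforms (an assumption of this sketch). `sys.float_info.epsilon` is the distance from 1 to the next larger double, so under the convention used here macheps is half of it; the variable names below are illustrative.

```python
import math
import sys

eps = sys.float_info.epsilon   # distance between 1.0 and the next larger double
macheps = eps / 2              # so that the distance equals 2 * macheps, as in the text

next_after_one = 1.0 + eps     # exactly representable: the next double after 1.0
rounded_back = 1.0 + macheps   # halfway case: rounds back to 1.0 (round to nearest even)

overflow_threshold = sys.float_info.max   # largest finite double
overflowed = overflow_threshold * 2       # exceeds the threshold: +infinity
not_a_number = overflowed - overflowed    # infinity - infinity gives NaN

print(eps, macheps, overflow_threshold, overflowed, not_a_number)
```

Note that Python raises an exception for `1.0 / 0.0` instead of returning infinity, which is one example of exceptional conditions being handled differently from the IEEE default.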
When the exponent has its smallest possible value (all zeros), the significand may have
leading zeros, and the number is called either subnormal or denormal (unless it is exactly zero). The subnormal
floating-point numbers represent very tiny numbers between the smallest nonzero normalized
floating-point number (the underflow threshold) and zero. An operation that underflows yields a
subnormal number or possibly zero; this is called gradual underflow. The alternative, simply
returning a zero, is called flush to zero. When the significand is zero, the floating-point number
is ±0.
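Gradual underflow and signed zero can be observed in the same way. The sketch below again assumes that Python's `float` is an IEEE 754 double with subnormals enabled; the variable names are illustrative.

```python
import math
import sys

underflow_threshold = sys.float_info.min   # smallest positive normalized double
subnormal = underflow_threshold / 4        # below the threshold but still nonzero
smallest_subnormal = 5e-324                # 2**-1074, the smallest positive double
below_smallest = smallest_subnormal / 2    # no double this small exists: rounds to 0.0

neg_zero = -0.0
sign_of_neg_zero = math.copysign(1.0, neg_zero)   # -1.0: the sign bit is kept

print(underflow_threshold, subnormal, below_smallest, neg_zero == 0.0, sign_of_neg_zero)
```

The two zeros compare equal, so the sign of a zero only becomes visible through operations such as `math.copysign`.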