Calculus I for Machine Learning
Some Applications of Concepts of Sequence and Series
Mohammed Nasser, Professor, Dept. of Statistics, RU, Bangladesh
Email: [email protected]
P.C. Mahalanobis (1893-1972), the pioneer of statistics in Asia: "A good mathematician may not be a good statistician, but a good statistician must be a good mathematician"
Andrey Nikolaevich Kolmogorov (Russian, 25 April 1903 – 20 October 1987). In 1933, Kolmogorov published the book Foundations of the Theory of Probability, laying the modern axiomatic foundations of probability theory.
Statistics + Machine Learning
Vladimir Vapnik
Jerome H. Friedman
Learning and Inference
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.
What is Learning?
• 'The action of receiving instruction or acquiring knowledge'
• 'A process which leads to the modification of behaviour or the acquisition of new abilities or responses, and which is additional to natural development by growth or maturation'
Machine Learning
• Negnevitsky: 'In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy' (2005:165)
• Callan: 'A machine or software tool would not be viewed as intelligent if it could not adapt to changes in its environment' (2003:225)
• Luger: 'Intelligent agents must be able to change through the course of their interactions with the world' (2002:351)
The Sub-Fields of ML
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Density estimation
• Reinforcement Learning
Classical Problem
What is the weight of an elephant?
What is the weight/distance of the sun?
Classical Problem
What is the weight/size of a baby in the womb?
What is the weight of a DNA molecule?
Solution of the Classical Problem
Let us suppose that somehow we have n measurements x₁, x₂, …, xₙ.
The million-dollar question: how can we choose the optimum one among the infinitely many possible ways of combining these n observations to estimate the target μ? And what is the optimum n?
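The "million-dollar question" can be probed by simulation. The sketch below is a toy illustration, not from the slides: it assumes a normal population with μ = 5, σ = 2 and sample size n = 25, and compares two ways of combining the n observations, the sample mean and the sample median, by estimating the variance of each over B = 20000 replicated samples.

```python
import random
import statistics

random.seed(0)

MU, SIGMA, N, B = 5.0, 2.0, 25, 20_000  # assumed toy population and sample size

mean_est, median_est = [], []
for _ in range(B):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean_est.append(statistics.fmean(sample))     # one way to combine the n observations
    median_est.append(statistics.median(sample))  # another way

var_mean = statistics.pvariance(mean_est)
var_median = statistics.pvariance(median_est)
print(f"var(sample mean)   = {var_mean:.4f}  (theory: sigma^2/n = {SIGMA**2 / N:.4f})")
print(f"var(sample median) = {var_median:.4f}")
```

Under normality the sample mean has the smaller variance (the median's is larger by a factor of about π/2), which is one precise sense in which one combination can be "better" than another.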
We need the concepts:
• the i-th observation Xᵢ
• probability distributions / probability measures
• the target μ that we want to estimate
• the model: X₁, …, Xₙ i.i.d. ~ F(x | μ)
Our Targets
We want to choose a statistic T such that T(X₁, …, Xₙ) is always very near μ. How do we quantify this problem?
Let us elaborate this issue through examples.
Inference with a Single Observation
[Diagram: Population (unknown parameter μ) → Sampling → Observation Xᵢ → Inference]
• Each observation Xᵢ in a random sample is a representative of the unobserved variables in the population
• Each observation is an estimator of μ, but its variance is as large as the population variance
Normal Distribution
• In this problem the normal distribution is the most popular model for our overall population
• It lets us calculate the probability of getting observations greater or less than any value
• Usually we don't have a single observation, but instead the mean of a set of observations
Inference with Sample Mean
[Diagram: Population (unknown parameter μ) → Sampling → Sample → Estimation → Statistic x̄ → Inference]
• The sample mean is our estimate of the population mean
• How much would the sample mean change if we took a different sample?
• Key to this question: the sampling distribution of x̄
Sampling Distribution of Sample Mean
• The distribution of the values taken by the statistic in all possible samples of size n from the same population
• Model assumption: our observations xᵢ are sampled from a population with mean μ and variance σ²
[Diagram: from a population with unknown parameter μ, draw Sample 1, Sample 2, Sample 3, … each of size n; each yields a sample mean x̄. What is the distribution of these values?]
Points to Be Remembered
• If the population is finite, the number of possible sample means is finite
• If the population is countably infinite, the number of possible sample means is countably infinite
• If the population is uncountably infinite, the number of possible sample means is uncountably infinite
Meaning of Sampling Distribution
[Simulation with B = 10000 replications]
• Comparing the sampling distribution of the sample mean when n = 1 vs. n = 10
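This comparison can be reproduced with a short simulation. The sketch below is an illustration that assumes a normal population with the head-length-like values μ = 178.99 and σ² = 37.13 (borrowed from the real-data slides later in the deck); it draws B = 10000 sample means for n = 1 and n = 10 and shows the variance of the sampling distribution shrinking like σ²/n.

```python
import random
import statistics

random.seed(1)

MU, VAR, B = 178.99, 37.13, 10_000  # population values borrowed from the head-length data
SIGMA = VAR ** 0.5

def sample_means(n, b=B):
    """Draw b samples of size n and return the b sample means."""
    return [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
            for _ in range(b)]

results = {}
for n in (1, 10):
    means = sample_means(n)
    results[n] = statistics.pvariance(means)
    print(f"n={n:2d}: variance of sample means = {results[n]:6.2f} (theory {VAR / n:6.2f})")
```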
Examination on a Real Data Set
We also consider, as our population, a real set of health data on 1491 Japanese adult male students from various districts of Japan: four head measurements (head length, head breadth, head height and head circumference) and two physical measurements (stature and weight). The data were taken by one observer, Funmio Ohtsuki (Hossain et al. 2005), using the technique of Martin and Saller (1957).
Histogram and Density of Head Length (Truncated at the Left)
Basic Information about Two Populations

Type       Mean    Variance  b1    b2    Size
Original   178.99  37.13     0.08  2.98  1491
Truncated  181.85  19.63     0.80  3.45  1063
Sampling Distributions
• X̄ₙ, for n = 10, 20, 100 and 500
• √n(X̄ₙ − μ), for n = 10, 20, 100 and 500
Replications = 10000
Boxplots of Means for Original Population: X̄ₙ and √n(X̄ₙ − μ)
Replications = 10000
Descriptive Statistics of Sampling Distribution of Means for Original Population

  n    biassim   varsim   varasim
 10    0.0221    3.5084   35.0836
 20   -0.0230    1.8560   37.1210
100    0.0022    0.3634   36.3167
500    0.0041    0.0715   35.7484

(varasim = n × varsim; population variance = 37.13)
Density of Means for Original Population
Histograms of Means for Truncated Population
Boxplots of Means for Truncated Population
Descriptive Statistics of Sampling Distribution of Means for Truncated Population

  n    biassim   varsim   varasim
 10   -0.0105    2.0025   20.0249
 20   -0.0002    0.9810   19.6209
100   -0.0014    0.1958   19.5790
500   -0.0029    0.0395   19.7419

(varasim = n × varsim; population variance = 19.63)
Chi-square with Two D.F.
Boxplots of Means for χ²₂
Histogram of Means for χ²₂
Central Limit Theorem
• If the sample size is large enough, then the sample mean x̄ has an approximately normal distribution
• This is true no matter what the shape of the distribution of the original data!
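A quick numerical check of this claim (a sketch assuming an Exponential(1) population, which is strongly right-skewed with mean and variance both 1): the skewness of the standardized sample mean √n(X̄ₙ − μ) shrinks like 2/√n, so the sampling distribution looks more and more normal as n grows.

```python
import math
import random
import statistics

random.seed(2)

B = 20_000  # replications

def skewness(vals):
    """Sample skewness: mean of standardized cubes."""
    m, s = statistics.fmean(vals), statistics.pstdev(vals)
    return statistics.fmean([((v - m) / s) ** 3 for v in vals])

skews = {}
for n in (2, 30):
    z = [math.sqrt(n) * (statistics.fmean(random.expovariate(1.0) for _ in range(n)) - 1.0)
         for _ in range(B)]
    skews[n] = skewness(z)
    print(f"n={n:2d}: skewness of sqrt(n)(xbar - mu) = {skews[n]:.2f}"
          f" (theory 2/sqrt(n) = {2 / math.sqrt(n):.2f})")
```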
Histogram of 100000 Obs from Standard Cauchy
[Figures: X̄ₙ and √n(X̄ₙ − μ), N = 500 replications]
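The figures illustrate why the Cauchy breaks the CLT: it has no mean or variance, and the mean X̄ₙ of i.i.d. standard Cauchy variables is again standard Cauchy, so averaging does not help at all. A minimal sketch of this contrast (an illustration assuming B = 2000 replicated samples of size N = 1000):

```python
import math
import random

random.seed(3)

B, N, EPS = 2_000, 1_000, 0.5

def frac_far(draw):
    """Fraction of B sample means (each of size N) lying farther than EPS from 0."""
    far = 0
    for _ in range(B):
        xbar = sum(draw() for _ in range(N)) / N
        if abs(xbar) > EPS:
            far += 1
    return far / B

normal = lambda: random.gauss(0.0, 1.0)
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))  # inverse-cdf draw of a standard Cauchy

p_normal = frac_far(normal)  # essentially 0: the law of large numbers applies
p_cauchy = frac_far(cauchy)  # stays near 0.70: X̄_n is itself standard Cauchy
print("normal:", p_normal, " cauchy:", p_cauchy)
```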
Central Limit Theorem
This is a special case of convergence in distribution:

Fₙ(x) = Pr( √n(X̄ₙ − μ) ≤ x ) → Φ_σ(x) for every x, as n → ∞,

where Φ_σ(x) = (1/(σ√(2π))) ∫₋∞ˣ e^(−t²/(2σ²)) dt is the N(0, σ²) cdf,

subject to the existence of the mean and variance. Research is going on to relax the i.i.d. condition.
How Many Sequences in CLT?
• The basic random functional sequence Xₙ(ω)
• The derived random functional sequence X̄ₙ(ω)
• A real sequence, such as 1/√n, used to compare the convergence of X̄ₙ(ω) to 0
• Another real, nonnegative functional sequence F_{X̄ₙ}(x)
Significance of CLT
From mathematics we know that we can approximate aₙ by a as accurately as we wish when aₙ → a. Likewise, the sampling distribution of means can be approximated by a normal distribution when the CLT holds and the sample size is fairly large.
It justifies building confidence intervals for μ using the sample mean and the normal table in non-normal cases.
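This justification can be checked empirically. The sketch below is an illustration under an assumed skewed, non-normal population (a shifted Exponential with mean μ = 10, sample size n = 50): it builds the usual CLT-based interval x̄ ± 1.96·s/√n in each of B = 5000 replications and reports how often the interval covers μ.

```python
import random
import statistics

random.seed(4)

MU, N, B = 10.0, 50, 5_000  # assumed toy setting

covered = 0
for _ in range(B):
    # shifted Exponential(1): mean MU, but clearly non-normal (skewed)
    sample = [MU - 1.0 + random.expovariate(1.0) for _ in range(N)]
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    if xbar - 1.96 * se <= MU <= xbar + 1.96 * se:
        covered += 1

coverage = covered / B
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

The empirical coverage comes out close to (though, for skewed data at this n, slightly below) the nominal 95%.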
More Topics Worth Studying
• Error bounds of the form sup_x |Fₙ(x) − Φ(x)| ≤ g(n): the Berry–Esseen theorem (1941, 1945)
• Characterizing extreme fluctuations using sequences like log n, log log n, etc.: the law of the iterated logarithm (Hartman and Wintner, 1941)
Check uniformity of convergence: uniform convergence is better than simple pointwise convergence. Pólya's theorem guarantees that, since the normal cdf is everywhere continuous, the convergence in the CLT is uniform.
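The uniform (sup-norm) distance just described can itself be estimated by Monte Carlo. The sketch below is an illustration assuming an Exponential(1) population and B = 20000 replications per n; the sup is taken over the jump points of the empirical cdf of the standardized means, and it visibly shrinks as n grows.

```python
import math
import random
import statistics

random.seed(5)

B = 20_000

def phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(n):
    """Monte-Carlo estimate of sup_x |F_n(x) - Phi(x)| for standardized Exponential(1) means."""
    z = sorted(math.sqrt(n) * (statistics.fmean(random.expovariate(1.0) for _ in range(n)) - 1.0)
               for _ in range(B))
    # the empirical cdf jumps at each sorted z; check both sides of every jump
    return max(max(abs((i + 1) / B - phi(v)), abs(i / B - phi(v)))
               for i, v in enumerate(z))

dists = {n: sup_distance(n) for n in (5, 50)}
for n, d in dists.items():
    print(f"n={n:3d}: sup |F_n - Phi| ≈ {d:.3f}")
```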
Why do we use x̄ to estimate μ?
P1. E(X̄) = μ. (What is the meaning of "E"?)
P2. V(X̄) = E[(X̄ − μ)²] = V(X)/n. (What is its significance?)
P3. X̄ converges to μ in probability:
aₙ(ε) = Pr(|X̄ₙ − μ| ≤ ε) → 1 as n → ∞, for every ε > 0,
subject to (condition 1): t[1 − F(t) + F(−t)] → 0 as t → ∞.
Why do we use x̄ to estimate μ? (contd.)
P4. X̄ converges to μ almost surely:
Pr( ω : lim_{n→∞} X̄ₙ(ω) = μ ) = 1,
subject to (condition 2): E(|X₁|) < ∞.
Condition 2 implies condition 1.
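The convergence-in-probability statement above can be made concrete by simulation. The sketch below is a toy setting (assuming a normal population with μ = 3, σ = 2 and ε = 0.25): it estimates aₙ(ε) = Pr(|X̄ₙ − μ| > ε) by Monte Carlo with B = 4000 replications and watches it fall toward 0 as n grows.

```python
import random
import statistics

random.seed(6)

MU, SIGMA, EPS, B = 3.0, 2.0, 0.25, 4_000  # assumed toy setting

def prob_far(n):
    """Monte-Carlo estimate of Pr(|xbar_n - mu| > eps)."""
    far = sum(abs(statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n)) - MU) > EPS
              for _ in range(B))
    return far / B

probs = {n: prob_far(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n={n:4d}: Pr(|xbar - mu| > {EPS}) ≈ {p:.3f}")
```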
Difference between Two Limits
• lim_{n→∞} aₙ(ε) = lim_{n→∞} Pr(|X̄ₙ − μ| ≤ ε) = 1 for every ε > 0:
  the probability is calculated first, then the limit is taken (convergence in probability).
• Pr( ω : lim_{n→∞} X̄ₙ(ω) = μ ) = 1:
  the limit is calculated first, then the probability is calculated (almost sure convergence).
Why do we use x̄?
P5. Let X ~ N(μ, σ²) ↔ ε ~ N(0, σ²). Then X̄ ~ N(μ, σ²/n), so we can make statements like Pr[a(X̄ₙ)