Calculus I for Machine Learning
Some Applications of Concepts of Sequence and Series
Mohammed Nasser, Professor, Dept. of Statistics, RU, Bangladesh
Email: [email protected]
P.C. Mahalanobis (1893-1972), the pioneer of statistics in Asia: "A good mathematician may not be a good statistician, but a good statistician must be a good mathematician"
Andrey Nikolaevich Kolmogorov (Russian, 25 April 1903 – 20 October 1987). In 1933, Kolmogorov published the book Foundations of the Theory of Probability, laying the modern axiomatic foundations of probability theory.
Statistics + Machine Learning
Vladimir Vapnik
Jerome H. Friedman
Learning and Inference
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.
What is Learning?
• 'The action of receiving instruction or acquiring knowledge'
• 'A process which leads to the modification of behaviour or the acquisition of new abilities or responses, and which is additional to natural development by growth or maturation'
Machine Learning
• Negnevitsky: 'In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy' (2005:165)
• Callan: 'A machine or software tool would not be viewed as intelligent if it could not adapt to changes in its environment' (2003:225)
• Luger: 'Intelligent agents must be able to change through the course of their interactions with the world' (2002:351)
The Sub-Fields of ML
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Density estimation
• Reinforcement Learning
Classical Problem
What is the weight of an elephant?
What is the weight/distance of the sun?
Classical Problem
What is the weight/size of a baby in the womb?
What is the weight of a DNA molecule?
Solution of the Classical Problem
Let us suppose that somehow we have n measurements x₁, x₂, …, xₙ.
The million-dollar question: how can we choose the optimum one among the infinitely many possible ways of combining these n observations to estimate the target μ? And what is the optimum n?
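The "million-dollar question" can be probed by simulation. The sketch below is a toy illustration, not from the slides: it assumes a normal population with μ = 5, σ = 2 and sample size n = 25, and compares two ways of combining the n observations, the sample mean and the sample median, by estimating the variance of each over B = 20000 replicated samples.

```python
import random
import statistics

random.seed(0)

MU, SIGMA, N, B = 5.0, 2.0, 25, 20_000  # assumed toy population and sample size

mean_est, median_est = [], []
for _ in range(B):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean_est.append(statistics.fmean(sample))     # one way to combine the n observations
    median_est.append(statistics.median(sample))  # another way

var_mean = statistics.pvariance(mean_est)
var_median = statistics.pvariance(median_est)
print(f"var(sample mean)   = {var_mean:.4f}  (theory: sigma^2/n = {SIGMA**2 / N:.4f})")
print(f"var(sample median) = {var_median:.4f}")
```

Under normality the sample mean has the smaller variance (the median's is larger by a factor of about π/2), which is one precise sense in which one combination can be "better" than another.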
We need the concepts:
• the i-th observation Xᵢ
• probability distributions / probability measures
• the target μ that we want to estimate
• the model: X₁, …, Xₙ i.i.d. ~ F(x | μ)
Our Targets
We want to choose a statistic T such that T(X₁, …, Xₙ) is always very near μ. How do we quantify this problem?
Let us elaborate this issue through examples.
Inference with a Single Observation
[Diagram: Population (unknown parameter μ) → Sampling → Observation Xᵢ → Inference]
• Each observation Xᵢ in a random sample is a representative of the unobserved variables in the population
• Each observation is an estimator of μ, but its variance is as large as the population variance
Normal Distribution
• In this problem the normal distribution is the most popular model for our overall population
• It lets us calculate the probability of getting observations greater or less than any value
• Usually we don't have a single observation, but instead the mean of a set of observations
Inference with Sample Mean
[Diagram: Population (unknown parameter μ) → Sampling → Sample → Estimation → Statistic x̄ → Inference]
• The sample mean is our estimate of the population mean
• How much would the sample mean change if we took a different sample?
• Key to this question: the sampling distribution of x̄
Sampling Distribution of Sample Mean
• The distribution of the values taken by the statistic in all possible samples of size n from the same population
• Model assumption: our observations xᵢ are sampled from a population with mean μ and variance σ²
[Diagram: from a population with unknown parameter μ, draw Sample 1, Sample 2, Sample 3, … each of size n; each yields a sample mean x̄. What is the distribution of these values?]
Points to Be Remembered
• If the population is finite, the number of possible sample means is finite
• If the population is countably infinite, the number of possible sample means is countably infinite
• If the population is uncountably infinite, the number of possible sample means is uncountably infinite
Meaning of Sampling Distribution
[Simulation with B = 10000 replications]
• Comparing the sampling distribution of the sample mean when n = 1 vs. n = 10
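This comparison can be reproduced with a short simulation. The sketch below is an illustration that assumes a normal population with the head-length-like values μ = 178.99 and σ² = 37.13 (borrowed from the real-data slides later in the deck); it draws B = 10000 sample means for n = 1 and n = 10 and shows the variance of the sampling distribution shrinking like σ²/n.

```python
import random
import statistics

random.seed(1)

MU, VAR, B = 178.99, 37.13, 10_000  # population values borrowed from the head-length data
SIGMA = VAR ** 0.5

def sample_means(n, b=B):
    """Draw b samples of size n and return the b sample means."""
    return [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
            for _ in range(b)]

results = {}
for n in (1, 10):
    means = sample_means(n)
    results[n] = statistics.pvariance(means)
    print(f"n={n:2d}: variance of sample means = {results[n]:6.2f} (theory {VAR / n:6.2f})")
```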
Examination on a Real Data Set
We also consider, as our population, a real set of health data on 1491 Japanese adult male students from various districts of Japan: four head measurements (head length, head breadth, head height and head circumference) and two physical measurements (stature and weight). The data were taken by one observer, Funmio Ohtsuki (Hossain et al. 2005), using the technique of Martin and Saller (1957).
Histogram and Density of Head Length (Truncated at the Left)
Basic Information about Two Populations

Type       Mean    Variance  b1    b2    Size
Original   178.99  37.13     0.08  2.98  1491
Truncated  181.85  19.63     0.80  3.45  1063
Sampling Distributions
• X̄ₙ, for n = 10, 20, 100 and 500
• √n(X̄ₙ − μ), for n = 10, 20, 100 and 500
Replications = 10000
Boxplots of Means for Original Population: X̄ₙ and √n(X̄ₙ − μ)
Replications = 10000
Descriptive Statistics of Sampling Distribution of Means for Original Population

  n    biassim   varsim   varasim
 10    0.0221    3.5084   35.0836
 20   -0.0230    1.8560   37.1210
100    0.0022    0.3634   36.3167
500    0.0041    0.0715   35.7484

(varasim = n × varsim; population variance = 37.13)
Density of Means for Original Population
Histograms of Means for Truncated Population
Boxplots of Means for Truncated Population
Descriptive Statistics of Sampling Distribution of Means for Truncated Population

  n    biassim   varsim   varasim
 10   -0.0105    2.0025   20.0249
 20   -0.0002    0.9810   19.6209
100   -0.0014    0.1958   19.5790
500   -0.0029    0.0395   19.7419

(varasim = n × varsim; population variance = 19.63)
Chi-square with Two D.F.
Boxplots of Means for χ²₂
Histogram of Means for χ²₂
Central Limit Theorem
• If the sample size is large enough, then the sample mean x̄ has an approximately normal distribution
• This is true no matter what the shape of the distribution of the original data!
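A quick numerical check of this claim (a sketch assuming an Exponential(1) population, which is strongly right-skewed with mean and variance both 1): the skewness of the standardized sample mean √n(X̄ₙ − μ) shrinks like 2/√n, so the sampling distribution looks more and more normal as n grows.

```python
import math
import random
import statistics

random.seed(2)

B = 20_000  # replications

def skewness(vals):
    """Sample skewness: mean of standardized cubes."""
    m, s = statistics.fmean(vals), statistics.pstdev(vals)
    return statistics.fmean([((v - m) / s) ** 3 for v in vals])

skews = {}
for n in (2, 30):
    z = [math.sqrt(n) * (statistics.fmean(random.expovariate(1.0) for _ in range(n)) - 1.0)
         for _ in range(B)]
    skews[n] = skewness(z)
    print(f"n={n:2d}: skewness of sqrt(n)(xbar - mu) = {skews[n]:.2f}"
          f" (theory 2/sqrt(n) = {2 / math.sqrt(n):.2f})")
```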
Histogram of 100000 Obs from Standard Cauchy
[Figures: X̄ₙ and √n(X̄ₙ − μ), N = 500 replications]
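The figures illustrate why the Cauchy breaks the CLT: it has no mean or variance, and the mean X̄ₙ of i.i.d. standard Cauchy variables is again standard Cauchy, so averaging does not help at all. A minimal sketch of this contrast (an illustration assuming B = 2000 replicated samples of size N = 1000):

```python
import math
import random

random.seed(3)

B, N, EPS = 2_000, 1_000, 0.5

def frac_far(draw):
    """Fraction of B sample means (each of size N) lying farther than EPS from 0."""
    far = 0
    for _ in range(B):
        xbar = sum(draw() for _ in range(N)) / N
        if abs(xbar) > EPS:
            far += 1
    return far / B

normal = lambda: random.gauss(0.0, 1.0)
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))  # inverse-cdf draw of a standard Cauchy

p_normal = frac_far(normal)  # essentially 0: the law of large numbers applies
p_cauchy = frac_far(cauchy)  # stays near 0.70: X̄_n is itself standard Cauchy
print("normal:", p_normal, " cauchy:", p_cauchy)
```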
Central Limit Theorem
This is a special case of convergence in distribution:

Fₙ(x) = Pr( √n(X̄ₙ − μ) ≤ x ) → Φ_σ(x) for every x, as n → ∞,

where Φ_σ(x) = (1/(σ√(2π))) ∫₋∞ˣ e^(−t²/(2σ²)) dt is the N(0, σ²) cdf,

subject to the existence of the mean and variance. Research is going on to relax the i.i.d. condition.
How Many Sequences in CLT?
• The basic random functional sequence Xₙ(ω)
• The derived random functional sequence X̄ₙ(ω)
• A real sequence, such as 1/√n, used to compare the convergence of X̄ₙ(ω) to 0
• Another real, nonnegative functional sequence F_{X̄ₙ}(x)
Significance of CLT
From mathematics we know that we can approximate aₙ by a as accurately as we wish when aₙ → a. Likewise, the sampling distribution of means can be approximated by a normal distribution when the CLT holds and the sample size is fairly large.
It justifies building confidence intervals for μ using the sample mean and the normal table in non-normal cases.
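This justification can be checked empirically. The sketch below is an illustration under an assumed skewed, non-normal population (a shifted Exponential with mean μ = 10, sample size n = 50): it builds the usual CLT-based interval x̄ ± 1.96·s/√n in each of B = 5000 replications and reports how often the interval covers μ.

```python
import random
import statistics

random.seed(4)

MU, N, B = 10.0, 50, 5_000  # assumed toy setting

covered = 0
for _ in range(B):
    # shifted Exponential(1): mean MU, but clearly non-normal (skewed)
    sample = [MU - 1.0 + random.expovariate(1.0) for _ in range(N)]
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    if xbar - 1.96 * se <= MU <= xbar + 1.96 * se:
        covered += 1

coverage = covered / B
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

The empirical coverage comes out close to (though, for skewed data at this n, slightly below) the nominal 95%.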
More Topics Worth Studying
• Error bounds of the form sup_x |Fₙ(x) − Φ(x)| ≤ g(n): the Berry–Esseen theorem (1941, 1945)
• Characterizing extreme fluctuations using sequences like log n, log log n, etc.: the law of the iterated logarithm (Hartman and Wintner, 1941)
Check uniformity of convergence: uniform convergence is better than simple pointwise convergence. Pólya's theorem guarantees that, since the normal cdf is everywhere continuous, the convergence in the CLT is uniform.
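The uniform (sup-norm) distance just described can itself be estimated by Monte Carlo. The sketch below is an illustration assuming an Exponential(1) population and B = 20000 replications per n; the sup is taken over the jump points of the empirical cdf of the standardized means, and it visibly shrinks as n grows.

```python
import math
import random
import statistics

random.seed(5)

B = 20_000

def phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(n):
    """Monte-Carlo estimate of sup_x |F_n(x) - Phi(x)| for standardized Exponential(1) means."""
    z = sorted(math.sqrt(n) * (statistics.fmean(random.expovariate(1.0) for _ in range(n)) - 1.0)
               for _ in range(B))
    # the empirical cdf jumps at each sorted z; check both sides of every jump
    return max(max(abs((i + 1) / B - phi(v)), abs(i / B - phi(v)))
               for i, v in enumerate(z))

dists = {n: sup_distance(n) for n in (5, 50)}
for n, d in dists.items():
    print(f"n={n:3d}: sup |F_n - Phi| ≈ {d:.3f}")
```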
Why do we use x̄ to estimate μ?
P1. E(X̄) = μ. (What is the meaning of "E"?)
P2. V(X̄) = E[(X̄ − μ)²] = V(X)/n. (What is its significance?)
P3. X̄ converges to μ in probability:
aₙ(ε) = Pr(|X̄ₙ − μ| ≤ ε) → 1 as n → ∞, for every ε > 0,
subject to (condition 1): t[1 − F(t) + F(−t)] → 0 as t → ∞.
Why do we use x̄ to estimate μ? (contd.)
P4. X̄ converges to μ almost surely:
Pr( ω : lim_{n→∞} X̄ₙ(ω) = μ ) = 1,
subject to (condition 2): E(|X₁|) < ∞.
Condition 2 implies condition 1.
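The convergence-in-probability statement above can be made concrete by simulation. The sketch below is a toy setting (assuming a normal population with μ = 3, σ = 2 and ε = 0.25): it estimates aₙ(ε) = Pr(|X̄ₙ − μ| > ε) by Monte Carlo with B = 4000 replications and watches it fall toward 0 as n grows.

```python
import random
import statistics

random.seed(6)

MU, SIGMA, EPS, B = 3.0, 2.0, 0.25, 4_000  # assumed toy setting

def prob_far(n):
    """Monte-Carlo estimate of Pr(|xbar_n - mu| > eps)."""
    far = sum(abs(statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n)) - MU) > EPS
              for _ in range(B))
    return far / B

probs = {n: prob_far(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n={n:4d}: Pr(|xbar - mu| > {EPS}) ≈ {p:.3f}")
```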
Difference between Two Limits
• lim_{n→∞} aₙ(ε) = lim_{n→∞} Pr(|X̄ₙ − μ| ≤ ε) = 1 for every ε > 0:
  the probability is calculated first, then the limit is taken (convergence in probability).
• Pr( ω : lim_{n→∞} X̄ₙ(ω) = μ ) = 1:
  the limit is calculated first, then the probability is calculated (almost sure convergence).
Why do we use x̄?
P5. Let X ~ N(μ, σ²) ↔ ε ~ N(0, σ²). Then X̄ ~ N(μ, σ²/n), so we can make statements like Pr[a(X̄ₙ)