Slides - CVIT - IIIT Hyderabad

January 9, 2018 | Author: Anonymous | Category: Science, Health Science, Pediatrics

Content Level Access to Digital Library of India Pages

Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
CVIT, IIIT Hyderabad

IIIT Hyderabad

Digital Library of India (DLI)

http://www.dli.iiit.ac.in/

Vision: To enhance access to information and knowledge to masses.

• Partner to Million Book Universal Digital Library Programme.
• Information for people
• Dataset for researchers

Content statistics
• #Books: 4 Lakhs (400,000)
• #Pages: 134 Million
• #Words: 26 Billion

Languages
• 41 different languages
• Includes Hindi, Telugu, Marathi, ...
• Also English, French, Greek, ...

Source: http://www.new1.dli.ernet.in/

Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL, 2007.

Digital Library of India (DLI): Metadata Search

• Supports metadata-based search.
• No content-level access.
• Example search: "Indian freedom struggle and independence"

What is needed
• Content-level access.
• Search over content + metadata.
• Open question: a reliable text representation?

Goal: Digital Library of India Search

• Build a search engine with support for Indian languages.
• Word spotting

Goal: Indian Language Document Search Engine

Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire"); the search button reads खोज ("Search").

• Text query support
• Multi-keyword support
• Ranks based on number of occurrences
• Semantically related words
• Seamless scaling to billions of word images
• Sub-second retrieval

Text from OCR

Sample pages
- Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624
- Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960

Problems seen on the Hindi and Telugu pages
• Cuts
• Merges
• Variations in script, font and typesetting.

[Charts: Char %, Word % and Search % for Hindi and Telugu [1]; values not recoverable.]

[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.

BoVW for Image Retrieval

[Figure: text retrieval alongside image recognition; a query image returns ranked retrieved results.]

• Fixed-length representation
• Invariant to popular deformations

Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003

BoVW for Document Image Retrieval

• Each word image is represented by a histogram of visual words.

[Figures: a word image and its histogram of visual words, repeated for a word image with cuts and a word image with merges.]

R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
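The histogram representation named on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes descriptors have already been extracted and that a codebook of cluster centres is available, and it uses toy 2-D vectors in place of 128-D SIFT descriptors. The names `nearest_word` and `bovw_histogram` are hypothetical.

```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Return the index of the codebook centre closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))


def bovw_histogram(descriptors, codebook):
    """Quantize each descriptor to a visual word and return an
    L1-normalised histogram with one bin per codebook entry."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(codebook)
    return [counts[i] / total for i in range(len(codebook))]


# Toy data (assumption): 2-D points stand in for SIFT descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]
hist = bovw_histogram(descriptors, codebook)
print(hist)
```

The histogram length depends only on the codebook size, which is what gives the fixed-length representation regardless of how many interest points a word image has.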

BoVW for Document Image Retrieval

• Robust against degradation
• Lost geometry
• Use spatial verification
  – SIFT based.
  – Longest subsequence alignment.

[Figure: visual words (V1, V2, V6, V4, V8, V9, ...) plotted by x and y position for clean, cut and merged word images.]

R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
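One way to picture the longest subsequence alignment is to order each image's visual words by x position and compare the two sequences with a longest common subsequence (LCS). The sketch below uses the textbook LCS recurrence; the normalisation by query length and the names `lcs_length` and `spatial_score` are assumptions made for illustration, not details taken from the cited papers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def spatial_score(query_points, candidate_points):
    """Each point is (x_position, visual_word_id). Words are ordered left to
    right and the LCS length is divided by the query length (an assumption)."""
    q = [w for _, w in sorted(query_points, key=lambda p: p[0])]
    c = [w for _, w in sorted(candidate_points, key=lambda p: p[0])]
    if not q:
        return 0.0
    return lcs_length(q, c) / len(q)


# Hypothetical word images: the candidate lost word 6 because of a cut.
query = [(0.2, 1), (0.6, 2), (1.0, 6), (1.5, 4), (2.2, 8), (2.9, 9)]
with_cut = [(0.2, 1), (0.6, 2), (1.5, 4), (2.2, 8), (2.9, 9)]
print(spatial_score(query, with_cut))
```

A candidate that contains the same visual words in a scrambled left-to-right order would have an identical histogram but a much lower alignment score, which is the geometry that the plain histogram loses.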

Query Expansion

[Figure: the query image is converted to a histogram and used to query the database, returning results at Rank 1 to Rank 6. A refined histogram is then built and used as the new query histogram, which gives better results.]
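A common form of query expansion averages the query histogram with the histograms of its top-ranked results and searches again. The slides do not say how the refined histogram is computed, so the averaging rule, the cosine similarity, the `top_k` value and all function names below are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query_hist, database):
    """database maps image id -> histogram; returns ids, best match first."""
    return sorted(database,
                  key=lambda i: cosine(query_hist, database[i]),
                  reverse=True)


def expand_query(query_hist, database, top_k=2):
    """Average the query with the histograms of its top_k results."""
    top = rank(query_hist, database)[:top_k]
    hists = [query_hist] + [database[i] for i in top]
    return [sum(column) / len(hists) for column in zip(*hists)]


# Hypothetical three-bin histograms.
database = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.8, 0.2, 0.0],
    "c": [0.0, 0.0, 1.0],
    "d": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]
refined = expand_query(query, database, top_k=2)
print(rank(refined, database))
```

Averaging only helps when the top results are mostly correct; a wrong top result pulls the refined query away from the target, so `top_k` is usually kept small.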

Text Query Support

• Originally formulated in a “query by example” setting: input query image → histogram.
• Need text queries: input text query → text query histogram.

Observations

• Are the results of OCR and BoVW complementary?
  [Figure: OCR and BoVW results compared.]
• mAP v/s word length
  [Plot: mAP against number of characters.]
• “OCR system has a high precision while BoVW approach has a high recall.”
• Example with #GT = 5:
  – OCR output list: Precision = 1; Recall = 0.4
  – BoVW output list: Precision = 0.8; Recall = 1
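The example figures can be checked with a few lines of arithmetic. The result lists below are hypothetical, chosen only to be consistent with the slide: two OCR results that are both correct, and six BoVW results that contain all five ground-truth items. With five relevant items, a recall of 1 gives a precision of 5/6 ≈ 0.83 for six results, which matches the slide's 0.8 only after rounding.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against ground truth."""
    relevant = set(relevant)
    true_positives = sum(1 for item in retrieved if item in relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical lists: five ground-truth word images, as in the slide (#GT = 5).
ground_truth = ["w1", "w2", "w3", "w4", "w5"]
ocr_list = ["w1", "w2"]                           # few results, all correct
bovw_list = ["w1", "w2", "w3", "x1", "w4", "w5"]  # everything found, one error

print("OCR :", precision_recall(ocr_list, ground_truth))
print("BoVW:", precision_recall(bovw_list, ground_truth))
```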

Fusion

Fusion techniques:
• Naïve fusion
  – Concatenating OCR results with BoVW results.

[mAP chart: OCR, BoVW and naïve fusion.]
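Concatenation can be sketched as below. Placing the OCR results first (on the grounds that OCR was observed to have high precision) and dropping repeated items are assumptions; the slides only say that the two lists are concatenated. The name `naive_fusion` is hypothetical.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate two ranked lists, keeping the first occurrence of each item.
    OCR results are placed first (an assumption)."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused


# Hypothetical ranked lists of word-image ids.
ocr_results = ["w1", "w2"]
bovw_results = ["w2", "w3", "w1", "w4"]
fused = naive_fusion(ocr_results, bovw_results)
print(fused)
```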

Fusion

Fusion techniques:
• Edit distance based fusion
  – Reordering BoVW results
  – BoVW score
  – Modified edit distance cost

[mAP chart: OCR, BoVW and edit distance based fusion.]
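A rough sketch of reordering BoVW results with an edit distance term follows. The slides mention a modified edit distance cost without giving it, so this uses the standard Levenshtein distance; the linear combination, the weight `alpha` and the function names are assumptions, not the method from the talk.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def rerank(query_text, bovw_results, ocr_text, alpha=0.5):
    """bovw_results: list of (image_id, bovw_score in [0, 1]).
    ocr_text: image_id -> OCR output for that word image.
    Combines the BoVW score with a text similarity derived from the edit
    distance between the query and the OCR output (weighting is assumed)."""
    def combined(item):
        image_id, score = item
        text = ocr_text.get(image_id, "")
        longest = max(len(query_text), len(text), 1)
        text_similarity = 1.0 - edit_distance(query_text, text) / longest
        return alpha * score + (1.0 - alpha) * text_similarity

    ordered = sorted(bovw_results, key=combined, reverse=True)
    return [image_id for image_id, _ in ordered]


# Hypothetical scores and OCR outputs.
bovw_results = [("img1", 0.9), ("img2", 0.8), ("img3", 0.7)]
ocr_text = {"img1": "sitting", "img2": "kitten", "img3": "kitte"}
print(rerank("kitten", bovw_results, ocr_text))
```

Here the exact OCR match moves to the top even though its BoVW score is not the highest, which is the intended effect of letting a high-precision signal reorder a high-recall list.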

Fusion

Fusion techniques:
• Hybrid fusion
  – Re-querying BoVW using OCR retrieved results.
  – Using rank aggregation techniques.

[mAP chart: OCR, BoVW and hybrid fusion.]
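The slides do not name the rank aggregation technique. As one well-known example, a Borda count gives each item points according to its position in every ranked list and sorts by the total. The sketch below is an illustration of that generic technique under that assumption; `borda_fuse` is a hypothetical name.

```python
def borda_fuse(*rankings):
    """Aggregate ranked lists with a Borda count. An item at position p
    (0-based) in a list earns n - p points, where n is the number of distinct
    items overall; items missing from a list earn nothing from it."""
    universe = []
    for ranking in rankings:
        for item in ranking:
            if item not in universe:
                universe.append(item)
    n = len(universe)
    scores = {item: 0 for item in universe}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # sorted() is stable, so ties keep first-seen order.
    return sorted(universe, key=lambda item: scores[item], reverse=True)


# Hypothetical ranked lists from the two systems.
ocr_ranking = ["a", "b"]
bovw_ranking = ["b", "c", "a", "d"]
print(borda_fuse(ocr_ranking, bovw_ranking))
```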

Experimental Results

Experimental Details
• OCR [1]
• Feature detector – Harris interest point detection [2]
• Feature descriptor – SIFT [2]
• Indexing – Lucene [3]

[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
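Text search engines such as Lucene are built around an inverted index, and visual words can be indexed the same way as terms. The toy class below shows the idea in plain Python; it does not use or imitate the Lucene API, and the histogram-intersection scoring is an assumption chosen for simplicity.

```python
from collections import Counter, defaultdict


class VisualWordIndex:
    """Toy inverted index: visual word id -> {image id: occurrence count}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, image_id, visual_words):
        """Index one word image given its list of visual word ids."""
        for word, count in Counter(visual_words).items():
            self.postings[word][image_id] = count

    def search(self, query_words):
        """Score images by histogram intersection with the query, touching
        only the posting lists of words that occur in the query."""
        scores = Counter()
        for word, query_count in Counter(query_words).items():
            for image_id, count in self.postings.get(word, {}).items():
                scores[image_id] += min(query_count, count)
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [image_id for image_id, _ in ranked]


# Hypothetical word images described by visual word ids.
index = VisualWordIndex()
index.add("img1", [1, 2, 2, 7])
index.add("img2", [2, 3])
index.add("img3", [9])
print(index.search([1, 2, 2]))
```

Because a query only visits posting lists for its own visual words, images that share nothing with the query are never scored, which is what keeps lookups fast as the collection grows.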

Test Bed: DLI Corpus

Language       #Books   #Pages   #Words      Annotation
Hindi (HS1)    11       1000     362,593     Yes
Hindi (HS2)    52       10,196   4,290,864   No
Telugu (TS1)   11       1000     161,276     Yes
Telugu (TS2)   69       13,871   2,531,069   No

[Figure: sample word images.]

• In addition, we used the fully annotated HP1 & TP1 datasets.

Evaluation Measures

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)

  TP = True Positive, FP = False Positive, FN = False Negative

• mAP (Mean Average Precision): mean of the area under the precision-recall curve for all the queries.
• Precision @ 10: shows how accurate the top 10 retrieved results are.

[Figure: precision-recall curve.]
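These measures can be computed as in the sketch below. Average precision is calculated here with the common non-interpolated formula (mean of the precision values at each rank where a relevant item appears), which approximates the area under the precision-recall curve; whether the authors used this exact variant is not stated. The function names and sample data are illustrative.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top k retrieved items that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def average_precision(retrieved, relevant):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (retrieved_list, relevant_items) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


# Two hypothetical queries.
runs = [
    (["a", "x", "b"], {"a", "b"}),
    (["x", "c"], {"c"}),
]
print(mean_average_precision(runs))
```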

                         BoVW Search        BoVW + Query Expansion
Language       #Query    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      66.09   83.86
Telugu (TP1)   100       71.13   78         73.08   79.89

Comparison of naïve BoVW with BoVW + Query Expansion

                         BoVW Search        BoVW using Text Queries
Language       #Query    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      56.32   73.89
Telugu (TP1)   100       71.13   78         69.06   78.83

Comparison of naïve BoVW with BoVW + Text Query Support

                         Naïve              Edit Distance      Hybrid
Language       #Query    mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       75.66   90.7       79.58   90.8       80.37   91.4
Telugu (TP1)   100       76.02   81.2       78.01   81.4       80.23   83.7

Comparative performance of different fusion techniques on HP1 & TP1

                         OCR                BoVW               Fusion
Language       #Query    mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HS1)    100       14.95   62.60      60.55   95.5       68.81   95.6
Telugu (TS1)   100       27.03   62.10      74.38   90.6       78.41   91.9

Performance statistics on DLI Annotated Corpus

Language       #Query   Precision @ N   OCR     BoVW    Fusion
Hindi (HS2)    50       Prec@10         82.03   96.94   97.11
                        Prec@20         75.16   94.83   95.42
                        Prec@30         71.12   92.82   93.16
Telugu (TS2)   50       Prec@10         90.85   99.14   99.14
                        Prec@20         85.42   98.00   98.85
                        Prec@30         80.76   96.38   96.57

Performance statistics on DLI Un-Annotated Corpus

Retrieved Results

[Figures: retrieved results.]

Failure Cases

• The word images shown in the figure fail in both OCR and BoVW.
• Reasons:
  – (a) A short word image containing a character that is no longer in use.
  – (b) A highly degraded word image.

Implementation Details

• Search engine development
  – An elegant web-based search and retrieval interface.

[Figure: sample retrieved page.]
[Plots: Lucene scalability; axes labelled time in milliseconds, number of images and number of visual words.]

Search Architecture (Ongoing)

Components shown in the diagram:
• Search query → Delegator (web service) → Ranked results
• OCR index web service
• BoVW index web service
• Partial scores, fusion, query expansion and ranking
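The flow suggested by the diagram can be sketched as a delegator that sends the query to both index services and fuses their partial scores. Everything concrete here is an assumption: the services are plain Python callables standing in for web services, the fusion is a simple score sum, and the query expansion step is left out.

```python
def sum_fusion(ocr_scores, bovw_scores):
    """Combine two {page_id: score} maps by adding scores (an assumption)."""
    pages = set(ocr_scores) | set(bovw_scores)
    return {p: ocr_scores.get(p, 0.0) + bovw_scores.get(p, 0.0) for p in pages}


def delegate(query, ocr_service, bovw_service, fuse=sum_fusion, top_n=10):
    """Query both services, fuse their partial scores and return the
    top_n page ids, best first."""
    fused = fuse(ocr_service(query), bovw_service(query))
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:top_n]]


# Stub services (hypothetical) returning partial scores per page.
def ocr_stub(query):
    return {"p1": 0.9, "p2": 0.4}


def bovw_stub(query):
    return {"p2": 0.7, "p3": 0.3}


print(delegate("example query", ocr_stub, bovw_stub))
```

Keeping the two indexes behind separate services means either one can be rebuilt or scaled independently, with the delegator as the only component that knows about both.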

Ongoing Work

• Learn to improve from annotated datasets
  – Use of a visual confusion matrix to improve BoVW results from annotated datasets.
• Necessity of costly features for re-ranking
  – The images shown in the failure cases would require costly features to show up.
  – Use of machine learning algorithms.
• Exploration of features better than SIFT.

Thank You
