Slides - CVIT - IIIT Hyderabad
Content Level Access to Digital Library of India Pages
Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
CVIT, IIIT Hyderabad
Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: To enhance access to information and knowledge for the masses.
• Partner to the Million Book Universal Digital Library Programme.
• Information for people
• Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture. ICDL, 2007.
Digital Library of India (DLI): Content
Statistics
• #Books: 4 lakh (400,000)
• #Pages: 134 million
• #Words: 26 billion
Languages
• 41 different languages
• Includes Hindi, Telugu, Marathi, ... and English, French, Greek, ...
Source: http://www.new1.dli.ernet.in/
Digital Library of India (DLI): Metadata Search
• Supports metadata-based search.
• No content-level access.
• Example query: "Indian freedom struggle and independence"
• Need content-level access: content + metadata.
• Open question: a reliable text representation?
Goal: Digital Library of India Search
• Build a search engine with support for Indian languages.
• Word spotting
Goal: Indian Language Document Search Engine
Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire"), submitted with the खोज ("Search") button.
• Text query support
• Multi-keyword support
• Ranking based on number of occurrences
• Semantically related words
• Seamless scaling to billions of word images
• Sub-second retrieval
Text from OCR
[Figure: sample Hindi page and sample Telugu page]
- Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624
- Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960
Issues highlighted on the sample pages:
• Cuts
• Merges
• Variations in script, font and typesetting
Text from OCR
[Charts: Char %, Word % and Search % for Hindi and Telugu, from [1]]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
BoVW for Image Retrieval
[Figure: analogy between text retrieval and image recognition; a query image returns ranked retrieved results]
• Fixed-length representation
• Invariant to common deformations
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
BoVW for Document Image Retrieval
[Figure: histogram of visual words for a clean word image, a word image with cuts, and a word image with merges]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
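A histogram of visual words can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes local descriptors (for example SIFT) have already been extracted from a word image and that a codebook of visual words was learned beforehand (for instance with k-means, which the slides do not specify). The function name and the toy 2-D vectors are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def bovw_histogram(descriptors, codebook):
    """Quantize each local descriptor to its nearest visual word and count.

    descriptors: list of feature vectors from one word image (e.g. SIFT).
    codebook:    list of visual-word centres (assumed to be learned beforehand).
    Returns an L1-normalized histogram with one bin per visual word.
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda i: dist(d, codebook[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy 2-D example: three visual words, four descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = bovw_histogram(descriptors, codebook)
print(hist)  # [0.25, 0.5, 0.25]
```

Because the histogram has one bin per visual word, every word image gets a representation of the same length regardless of its width, and a cut or merge only perturbs a few bins.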
BoVW for Document Image Retrieval
• Robust against degradation
• Lost geometry
• Use spatial verification
  – SIFT based.
  – Longest subsequence alignment.
[Figure: visual words (V1, V2, V4, V6, V8, V9) plotted on x-y axes for clean, cut and merged word images]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
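One way to picture spatial verification by longest subsequence alignment: order the visual words of the query and of a candidate by their x-coordinate and measure how long a common subsequence they share. The sketch below is an assumption-laden illustration, not the method from the cited papers; the scoring (LCS length divided by query length) and the toy coordinates are hypothetical, and the word labels merely reuse those visible in the figure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def spatial_score(query_points, candidate_points):
    """Score a candidate by aligning visual words in left-to-right order.

    Each point is (x, visual_word_id). Normalizing by the query length is
    an illustrative choice, not taken from the slides.
    """
    q = [w for _, w in sorted(query_points)]
    c = [w for _, w in sorted(candidate_points)]
    return lcs_length(q, c) / len(q) if q else 0.0


query = [(0.5, "V1"), (1.0, "V2"), (1.5, "V6"), (2.0, "V4"), (2.5, "V8"), (3.0, "V9")]
cut = [(0.5, "V1"), (1.0, "V2"), (2.0, "V4"), (3.0, "V9")]        # some words lost
shuffled = [(0.5, "V9"), (1.0, "V8"), (2.0, "V2"), (3.0, "V1")]    # same words, wrong order

print(spatial_score(query, cut))       # 4 of 6 words align in order
print(spatial_score(query, shuffled))  # only 1 of 6 aligns
```

A degraded image that loses a few visual words still aligns well, while a candidate that merely shares the same words in a different order scores low, which is the geometry a plain histogram discards.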
Query Expansion
[Figure: the query image's histogram is used to query the database, returning ranked results (rank 1 to rank 6); a refined histogram is then used to re-query the database, giving better results]
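The query expansion loop can be sketched as below. The slides do not say how the refined histogram is built; this sketch assumes the common "average query expansion" variant, in which the query histogram is averaged with the histograms of its top-ranked results, and it assumes cosine similarity for ranking. All names and the toy database are hypothetical.

```python
def cosine(u, v):
    """Cosine similarity between two histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def search(query_hist, database, top_k=None):
    """Rank database entries (id -> histogram) by cosine similarity."""
    ranked = sorted(database, key=lambda k: cosine(query_hist, database[k]), reverse=True)
    return ranked[:top_k] if top_k else ranked


def expand_query(query_hist, database, top_k=2):
    """Average the query histogram with those of its top_k results."""
    top = search(query_hist, database, top_k)
    hists = [query_hist] + [database[k] for k in top]
    return [sum(col) / len(hists) for col in zip(*hists)]


database = {
    "w1": [4, 1, 0, 0],
    "w2": [3, 0, 1, 0],
    "w3": [1, 0, 3, 0],   # degraded instance of the same word
    "w4": [0, 0, 0, 5],   # unrelated word
}
query = [4, 1, 0, 0]
refined = expand_query(query, database, top_k=2)
results = search(refined, database)
```

In this toy case the refined histogram picks up the third visual word from "w2", so the degraded instance "w3" becomes more similar to the refined query than it was to the original one.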
Text Query Support
• Originally formulated in a “query by example” setting: input query image → histogram.
• Need text queries: input text query → text query histogram.
Observations
• Are the results of OCR and BoVW complementary?
[Figure: example results retrieved by OCR and by BoVW]
• mAP vs. word length
[Chart: mAP against number of characters]
• “OCR system has a high precision while BoVW approach has a high recall.”
• Example: #GT = 5
  – OCR output list: Precision = 1; Recall = 0.4
  – BoVW output list: Precision = 0.8; Recall = 1
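The precision/recall contrast can be checked with a few lines of code. The lists below are hypothetical: the OCR list reproduces the slide's figures exactly (two results, both correct, out of five ground-truth items), while the BoVW list only approximates them, since with five ground-truth items and full recall a single false positive gives a precision of 5/6 rather than 0.8.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the ground truth."""
    relevant = set(relevant)
    tp = sum(1 for r in retrieved if r in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


ground_truth = ["g1", "g2", "g3", "g4", "g5"]      # #GT = 5

ocr_out = ["g1", "g2"]                              # few results, all correct
bovw_out = ["g1", "g2", "x1", "g3", "g4", "g5"]     # everything found, one false positive

print(precision_recall(ocr_out, ground_truth))      # (1.0, 0.4)
p, r = precision_recall(bovw_out, ground_truth)     # precision 5/6, recall 1.0
```

The two lists err in opposite directions, which is the motivation for fusing them.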
Fusion
• Fusion techniques: naïve fusion
  – Concatenating OCR results with BoVW results.
[Chart: mAP of OCR and BoVW]
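Naïve fusion by concatenation might look like the sketch below. Placing the OCR results first (because OCR has the higher precision) and skipping BoVW results that OCR already returned are both assumptions; the slide only states that the two result lists are concatenated. The page identifiers are made up.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate OCR results with BoVW results.

    OCR results go first; BoVW results already returned by OCR are skipped.
    Both the ordering and the de-duplication are assumptions.
    """
    seen = set(ocr_results)
    fused = list(ocr_results)
    for r in bovw_results:
        if r not in seen:
            seen.add(r)
            fused.append(r)
    return fused


fused = naive_fusion(["p3", "p7"], ["p7", "p1", "p3", "p9"])
print(fused)  # ['p3', 'p7', 'p1', 'p9']
```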
Fusion
• Fusion techniques: edit distance based fusion
  – Reordering BoVW results
  – BoVW score
  – Modified edit distance cost
[Chart: mAP of OCR and BoVW]
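A rough sketch of edit distance based re-ordering follows. The slides do not define the modified edit distance cost or how it is combined with the BoVW score, so this example substitutes the standard unit-cost Levenshtein distance and a hypothetical mixing weight `alpha`; the tuple layout and the Latin-script sample strings are also illustrative only.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def rerank(query_text, bovw_results, alpha=0.5):
    """Reorder BoVW results using their score and the OCR edit distance.

    bovw_results: list of (word_id, bovw_score in [0, 1], ocr_text).
    alpha is a hypothetical mixing weight, not a value from the slides.
    """
    def fused(item):
        _, score, ocr_text = item
        norm = max(len(query_text), len(ocr_text)) or 1
        text_sim = 1.0 - edit_distance(query_text, ocr_text) / norm
        return alpha * score + (1 - alpha) * text_sim
    return sorted(bovw_results, key=fused, reverse=True)


results = [("w1", 0.90, "hvdcrabad"), ("w2", 0.85, "hyderabad"), ("w3", 0.40, "hyderabed")]
reranked = [w for w, _, _ in rerank("hyderabad", results)]
print(reranked)  # ['w2', 'w1', 'w3']
```

Here "w2" overtakes "w1" because its OCR text matches the query exactly, even though its BoVW score is slightly lower.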
Fusion
• Fusion techniques: hybrid fusion
  – Re-querying BoVW using OCR retrieved results.
  – Using rank aggregation techniques.
[Chart: mAP of OCR and BoVW]
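The slides mention rank aggregation without naming a technique. As one possible instance, the sketch below uses a Borda count to merge the OCR ranking, the BoVW ranking, and a ranking obtained by re-querying BoVW with OCR hits. The choice of Borda count, the function name, and the sample rankings are all assumptions for illustration.

```python
def borda_aggregate(rankings):
    """Fuse several ranked lists with a Borda count.

    An item at position i in a list of length n earns n - i points;
    items missing from a list earn nothing from it.
    """
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for i, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - i)
    # Sort by points (descending), then by item id for a deterministic order.
    return sorted(points, key=lambda item: (-points[item], item))


ocr_ranking = ["w2", "w5"]
bovw_ranking = ["w1", "w2", "w3", "w5"]
requery_ranking = ["w2", "w3", "w1"]   # BoVW re-queried with OCR hits (hypothetical)

fused = borda_aggregate([ocr_ranking, bovw_ranking, requery_ranking])
print(fused)  # ['w2', 'w1', 'w3', 'w5']
```

Items supported by several rankings rise to the top, so a word image confirmed by both OCR and BoVW outranks one found by a single system.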
Experimental Results
Experimental Details
• OCR [1]
• Feature detector – Harris interest point detection [2]
• Feature descriptor – SIFT [2]
• Indexing – Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
Test Bed: DLI Corpus
[Figure: sample word images]
Language       #Books   #Pages   #Words      Annotation
Hindi (HS1)    11       1,000    362,593     Yes
Hindi (HS2)    52       10,196   4,290,864   No
Telugu (TS1)   11       1,000    161,276     Yes
Telugu (TS2)   69       13,871   2,531,069   No
• In addition, we used the fully annotated HP1 and TP1 datasets.
Evaluation Measures
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
  TP = true positives, FP = false positives, FN = false negatives
• mAP (mean average precision): mean, over all queries, of the area under the precision-recall curve.
• Precision @ 10: shows how accurate the top 10 retrieved results are.
[Figure: precision-recall curve]
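These measures can be computed as in the sketch below. It uses the usual non-interpolated average precision (the mean of the precision values at each rank where a relevant item appears) as the estimate of the area under the precision-recall curve; whether the reported numbers were computed in exactly this way is an assumption. The function names and sample runs are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved items that are relevant."""
    top = ranked[:k]
    return sum(1 for r in top if r in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "x", "b", "y"], {"a", "b"}),   # AP = (1/1 + 2/3) / 2
    (["x", "c"], {"c", "d"}),             # AP = (1/2) / 2, "d" is never retrieved
]
map_score = mean_average_precision(runs)
```

Dividing by the number of relevant items, rather than by the number of hits, penalizes queries whose relevant items are never retrieved, as in the second run.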
                        BoVW Search         BoVW + Query Expansion
Language       #Query   mAP     Prec@10     mAP     Prec@10
Hindi (HP1)    100      62.54   81.30       66.09   83.86
Telugu (TP1)   100      71.13   78          73.08   79.89
Comparison of naïve BoVW with BoVW + Query Expansion
                        BoVW Search         BoVW using Text Queries
Language       #Query   mAP     Prec@10     mAP     Prec@10
Hindi (HP1)    100      62.54   81.30       56.32   73.89
Telugu (TP1)   100      71.13   78          69.06   78.83
Comparison of naïve BoVW with BoVW + Text Query Support
                        Naïve               Edit Distance       Hybrid
Language       #Query   mAP     Prec@10     mAP     Prec@10     mAP     Prec@10
Hindi (HP1)    100      75.66   90.7        79.58   90.8        80.37   91.4
Telugu (TP1)   100      76.02   81.2        78.01   81.4        80.23   83.7
Comparative performance of different fusion techniques on HP1 & TP1
                        OCR                 BoVW                Fusion
Language       #Query   mAP     Prec@10     mAP     Prec@10     mAP     Prec@10
Hindi (HS1)    100      14.95   62.60       60.55   95.5        68.81   95.6
Telugu (TS1)   100      27.03   62.10       74.38   90.6        78.41   91.9
Performance statistics on DLI Annotated Corpus
Language       #Query   Precision @ N   OCR     BoVW    Fusion
Hindi (HS2)    50       Prec@10         82.03   96.94   97.11
                        Prec@20         75.16   94.83   95.42
                        Prec@30         71.12   92.82   93.16
Telugu (TS2)   50       Prec@10         90.85   99.14   99.14
                        Prec@20         85.42   98.00   98.85
                        Prec@30         80.76   96.38   96.57
Performance statistics on DLI Un-Annotated Corpus
Retrieved Results
Failure Cases
• The word images shown in the figure fail in both OCR and BoVW.
• Reasons:
  – (a) A short word image containing a character that is no longer in use.
  – (b) A highly degraded word image.
Implementation Details
• Search engine development
  – An elegant web-based search and retrieval interface.
[Figure: sample retrieved page]
[Charts: Lucene scalability, with axes No. of images, No. of visual words and time in milliseconds]
Search Architecture (Ongoing)
[Diagram components: search query, delegator, OCR index web service, BoVW index web service, partial scores from OCR and BoVW, fusion, query expansion, ranking, ranked results]
Ongoing Work
• Learn to improve from annotated datasets
  – Use of a visual confusion matrix to improve BoVW results from annotated datasets.
• Necessity of costly features for re-ranking
  – The images shown in the failure cases would require costly features to be retrieved.
  – Use of machine learning algorithms.
• Exploration of features better than SIFT.
Thank You