Calbug: A case study of digitization challenges for entomology

January 24, 2018 | Author: Anonymous | Category: Science, Biology, Zoology, Entomology

Calbug: a case study of digitization challenges for Entomology collections

Joan Ball, Joyce Gross, Traci Grzymala, Gordon Nishida, Peter Oboyski, Rosemary Gillespie, George Roderick, Kipling Will

Photo by: Marek Jakubowski

Background Workflow & Challenges Progress Future Direction


What is CalBug?

• Essig Museum of Entomology
• California Academy of Sciences
• California State Collection of Arthropods
• Bohart Museum, UC Davis
• Entomology Research Museum, UC Riverside
• San Diego Natural History Museum
• LA County Museum
• Santa Barbara Museum of Natural History

Goals

1.) Digitize and geo-reference 1.2 million specimens from eight California institutions spanning 110 years of specimen collecting
2.) Analyze spatial and temporal changes in distributions due to land use change, invasive species, habitat fragmentation, and climate change


Stratified data capture:
• All specimens of selected taxa
• All specimens of species found in field stations (UC Natural Reserve System)

Digital Data:
• Images and Field Notes
• Species Checklists
• Historical Climate Records
• Climate Sensor Networks


Workflow

1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses

Imaging

Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels

Challenges:
• Labels are small and stacked beneath specimen
• Specimen handling is inefficient, process extremely time consuming

Current Imaging Rate: 60 specimens per hour per person

Data Entry

Workflow
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry

Crowd Sourcing:
- Interactive website
- Volunteers enter data 3X
- Evaluate multiple entries for consistency
- Museum staff – focus on imaging, QAQC, public relations
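Entering each label three times and evaluating the entries for consistency could be sketched roughly as below. This is only an illustration, not the project's actual code: the field names, the normalization rule, the two-out-of-three agreement threshold, and the sample values are all assumptions made for the example.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial differences do not count as disagreement."""
    return " ".join(value.split()).lower()


def consensus(entries, min_agree=2):
    """Return (value, True) if at least `min_agree` entries match, else (None, False)."""
    counts = Counter(normalize(e) for e in entries)
    value, n = counts.most_common(1)[0]
    return (value, True) if n >= min_agree else (None, False)


def reconcile(transcriptions):
    """Compare several volunteer transcriptions of one label, field by field.

    Returns the accepted values and the list of fields that need staff review.
    """
    fields = sorted({key for t in transcriptions for key in t})
    accepted, flagged = {}, []
    for field in fields:
        value, ok = consensus([t.get(field, "") for t in transcriptions])
        if ok:
            accepted[field] = value
        else:
            flagged.append(field)
    return accepted, flagged


# Made-up example: three volunteers transcribe the same (hypothetical) label.
entries = [
    {"locality": "Berkeley, Alameda Co.", "date": "1934-05-12", "collector": "J. Doe"},
    {"locality": "berkeley,  Alameda Co.", "date": "1934-05-12", "collector": "J. Dow"},
    {"locality": "Berkeley, Alameda Co.", "date": "1984-05-12", "collector": "I. Doe"},
]
accepted, flagged = reconcile(entries)
print(accepted)
print(flagged)
```

Fields where no two volunteers agree are flagged rather than guessed, which fits the idea of museum staff concentrating on QAQC.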

5c. Optical Character Recognition & data parsing

OCR:
- Develop dictionaries of common abbreviations and California localities; pick lists and controlled fields to reduce error
- "Smart" parsing program: assign data elements to database fields based on context and dictionary terms
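A toy version of dictionary-driven parsing is sketched below to make the idea concrete. It is not the "smart" parsing program itself: the tiny dictionaries, the regular expressions, the output field names, and the sample label are all assumptions chosen for illustration, and real OCR output would be far messier.

```python
import re

# Small illustrative dictionaries (assumed, not the project's real lists).
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
CA_COUNTIES = {"Alameda", "Contra Costa", "Marin"}
ABBREVIATIONS = {"Co.": "County", "Calif.": "California", "Mts.": "Mountains"}

# Dates written as day-month-year with a Roman-numeral month, e.g. 12-V-1934.
DATE_RE = re.compile(r"\b(\d{1,2})[-.\s]([IVX]+)[-.\s](\d{4})\b")
# Collector written as initials plus surname followed by "coll."
COLLECTOR_RE = re.compile(r"([A-Z]\.(?:\s?[A-Z]\.)?\s?[A-Z][a-z]+),?\s+coll\.")


def expand_abbreviations(text):
    """Replace known abbreviations with their full forms."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def parse_label(ocr_text):
    """Assign pieces of a label string to fields using context and dictionaries."""
    record = {"verbatim": ocr_text}
    date = DATE_RE.search(ocr_text)
    if date and date.group(2) in ROMAN_MONTHS:
        day, year = int(date.group(1)), int(date.group(3))
        month = ROMAN_MONTHS[date.group(2)]
        record["date"] = f"{year:04d}-{month:02d}-{day:02d}"
        # Treat everything before the date as the locality string.
        locality = ocr_text[:date.start()].strip(" ,;")
        record["locality"] = expand_abbreviations(locality)
    collector = COLLECTOR_RE.search(ocr_text)
    if collector:
        record["collector"] = collector.group(1)
    for county in CA_COUNTIES:
        if f"{county} Co." in ocr_text:
            record["county"] = county
    return record


# Hypothetical label text.
parsed = parse_label("Berkeley, Alameda Co., Calif. 12-V-1934 J. Doe coll.")
print(parsed)
```

Keeping the verbatim string alongside the parsed fields lets later error checking compare the interpretation against what was actually read from the label.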

Data quality, access & analysis

Workflow
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses

Error Checking:
- Sort by locality and date to identify typographic errors, and by record number to find carry-over errors
- Compare 10% of records with label images

Georeferencing & Mapping:
- Biogeomancer
- Estimate coordinates and error radius based on standardized protocols

Data Cache

Example: Analyzing data
- Dragonfly specimens throughout CA over 100 years
- Combine with: observation data, 1914 survey, current field studies
- Changes in biodiversity, species composition, and distribution
- Metrics of climate and land use change
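The error-checking steps (sorting by locality and date, sorting by record number to find carry-over errors, and comparing 10% of records with label images) might look like the following sketch. The record layout and the carry-over heuristic are assumptions: here a record is treated as suspect when it repeats the previous record's date even though the locality changed, which is only one possible reading of "carry-over".

```python
import random


def sort_for_typo_review(records):
    """Order records by locality, then date, so near-duplicate spellings sit together."""
    return sorted(records, key=lambda r: (r["locality"], r["date"]))


def carry_over_suspects(records, field="date", context="locality"):
    """Flag records whose `field` repeats the previous record's value although `context` changed.

    Records are walked in record-number order, the order in which they were entered.
    """
    ordered = sorted(records, key=lambda r: r["record_number"])
    suspects = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[field] == prev[field] and cur[context] != prev[context]:
            suspects.append(cur["record_number"])
    return suspects


def sample_for_image_check(records, fraction=0.10, seed=0):
    """Pick a reproducible random subset of records to compare against label images."""
    if not records:
        return []
    k = min(len(records), max(1, round(len(records) * fraction)))
    return random.Random(seed).sample(records, k)


# Made-up records for illustration.
records = [
    {"record_number": 3, "locality": "Berkeley", "date": "1934-05-12"},
    {"record_number": 1, "locality": "Albany", "date": "1934-05-12"},
    {"record_number": 2, "locality": "Albany", "date": "1934-05-12"},
]
print(carry_over_suspects(records))
print([r["record_number"] for r in sort_for_typo_review(records)])
```

A flagged record is only a candidate for review; a collector can legitimately visit two localities on the same day.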

Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission

Publicly available data layers
• Temperature (max, min, mean)
• Precipitation
• Species Distributions
• Land Use
• Land Cover
• Soils
• Topography
• Hydrology

Time frames shown: Past, Present, Projected Future
Land categories shown: Private, Public, and Protected

Ongoing Research Projects • In support of taxonomy and undergraduate research

~23,000 georeferenced specimens in the EMEC database from the California Floristic Province.

[Figure: number of specimens (#specimens) by year (Years); annotations: WW2, J. Powell, reconfiguration]

Background Our Database Workflow & Challenges Progress & Future Direction


Progress Made – Essig museum

Data Entered: EMEC total 122,000
- 42,000 since 1 Sept 2010
- 55,289 CA specimens
- 65,000 georeferenced

Images Taken: 44,200 images

Progress Made – Collaborators
• CDFA: 14,000 Sphecidae, pests
• Bohart: 25,000 Sphecidae
• SBMNH: 140,000 Coleoptera (museum records and literature)
• Riverside: 26,500 bees
• CAS: 15,000 Neuroptera

Photo credits: CA Beetle Project site; Texas A&M University; Robin Coville

Timeline
• Start of Calbug
• Year 1: 240,000 Specimens Digitized
• Year 2: Image and Digitize 320,000 Specimens; QAQC
• Year 3: Image and Digitize 320,000 Specimens; QAQC
• Year 4: Image and Digitize 320,000 Specimens; QAQC
• Year 5: Georeferencing
• Finish

Analysis of data: Arthropod response to global change

Imaging Goal – Next 3 years: 320,000 images per year, 6,500 images per week (48 weeks)
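The imaging goal can be checked against the stated rate of 60 specimens per hour per person with a little arithmetic. The sketch below assumes one image per specimen and a 40-hour working week, neither of which is stated in the slides. Note that 320,000 images over 48 weeks works out to roughly 6,667 per week, slightly above the 6,500 quoted.

```python
import math

SPECIMENS_PER_HOUR = 60     # current imaging rate per person (from the slides)
YEARLY_GOAL = 320_000       # images per year for years 2-4 (from the slides)
WORK_WEEKS = 48             # working weeks per year (from the slides)
HOURS_PER_WEEK = 40         # assumed full-time week, not from the slides

# Weekly target implied by the yearly goal.
weekly_images = YEARLY_GOAL / WORK_WEEKS

# Person-hours of imaging needed each week at the current rate,
# assuming one image per specimen.
person_hours = weekly_images / SPECIMENS_PER_HOUR

# Full-time imagers needed under the assumed 40-hour week.
imagers = math.ceil(person_hours / HOURS_PER_WEEK)

# Year 1 plus three years at the yearly goal.
total_specimens = 240_000 + 3 * YEARLY_GOAL

print(f"{weekly_images:.0f} images/week, {person_hours:.1f} person-hours/week")
print(f"{imagers} full-time imagers, {total_specimens:,} specimens in total")
```

Under these assumptions the four imaging years add up to the 1.2 million specimens named in the project goals.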

Future Directions – Simplify and disperse the workflow

1. Select taxa for databasing
3. Arrange labels to view all text, add catalog # label
4 (Modified). Run sheets of specimens through imaging station
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses

• Remove sorting step (2), and museum staff data entry (5a)
• Speed up image capture through assembly line process (4)
  - Set up stations for specific handling tasks
  - Automate file naming and saving
• Develop dictionaries of localities, and common abbreviations to reduce error and speed data entry
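Automating file naming and saving could be as simple as renaming each batch of captured images to consecutive catalog numbers. The sketch below is hypothetical: the `EMEC` prefix plus six-digit number, the `.jpg` extension, and the assumption that sorted file names reflect capture order are illustrative choices, not the project's documented scheme.

```python
from pathlib import Path


def catalog_filename(prefix, number, suffix=".jpg", width=6):
    """Build a file name such as EMEC000123.jpg from a catalog number (assumed scheme)."""
    return f"{prefix}{number:0{width}d}{suffix}"


def rename_batch(folder, prefix, first_number):
    """Rename the .jpg files in `folder`, in sorted order, to consecutive catalog numbers.

    Refuses to overwrite an existing file so a mis-set starting number cannot
    silently destroy earlier images.
    """
    images = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".jpg")
    renamed = []
    for offset, path in enumerate(images):
        target = path.with_name(catalog_filename(prefix, first_number + offset))
        if target.exists():
            raise FileExistsError(target)
        renamed.append(path.rename(target))
    return renamed


print(catalog_filename("EMEC", 123))
```

Calling `rename_batch("incoming", "EMEC", 1000)` on a folder of camera files would then give the first image catalog number 1000, the next 1001, and so on.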

Looking ahead
• Data from many millions of additional specimens will remain to be captured
• "Brute force" entry needs to be coupled with any technological advances that we can harness
• Intermediate products are necessary

Acknowledgements
• All participating organizations
• National Science Foundation
• John Wieczorek, Michelle Koo, Carol Spencer
• Berkeley Natural History Museums Consortium
• Biodiversity Sciences Technology (BSCIT)
• Citizen Science Alliance
• >20 Undergraduates
