Calbug: A case study of digitization challenges for entomology
Calbug: a case study of digitization challenges for Entomology collections
Joan Ball, Joyce Gross, Traci Grzymala, Gordon Nishida, Peter Oboyski, Rosemary Gillespie, George Roderick, Kipling Will
Photo by: Marek Jakubowski
Background Workflow & Challenges Progress Future Direction
What is CalBug?
• Essig Museum of Entomology
• California Academy of Sciences
• California State Collection of Arthropods
• Bohart Museum, UC Davis
• Entomology Research Museum, UC Riverside
• San Diego Natural History Museum
• LA County Museum
• Santa Barbara Museum of Natural History
Goals
1.) Digitize and geo-reference 1.2 million specimens from eight California institutions, spanning 110 years of specimen collecting
2.) Analyze spatial and temporal changes in distributions due to land use change, invasive species, habitat fragmentation, and climate change
Stratified data capture: All specimens of selected taxa
Stratified data capture: All specimens of species found in field stations
UC Natural Reserve System
Digital Data:
• Images and Field Notes
• Species Checklists
• Historical Climate Records
• Climate Sensor Networks
Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses
Imaging
Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
Challenges:
- Labels are small and stacked beneath specimen
- Specimen handling is inefficient, process extremely time consuming
Current Imaging Rate: 60 specimens per hour per person
Data Entry
Workflow
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
Crowd Sourcing:
- Interactive website
- Volunteers enter data 3X
- Evaluate multiple entries for consistency
- Museum staff: focus on imaging, QAQC, public relations
5c. Optical Character Recognition & data parsing
OCR:
- Develop dictionaries of common abbreviations and California localities; pick lists and controlled fields to reduce error
- "Smart" parsing program: assign data elements to database fields based on context and dictionary terms
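The "enter data 3X, then evaluate for consistency" step can be sketched as a per-field majority vote. This is only an illustration, not the project's actual procedure: the field names, the two-of-three threshold, the normalization rule, and the sample label data are all assumptions made up for the example.

```python
from collections import Counter

def normalize(value):
    """Collapse whitespace and case so trivial differences do not block agreement."""
    return " ".join(value.split()).lower()

def consensus(entries, min_agree=2):
    """Return (fields with an agreed value, fields that need staff review)."""
    agreed, review = {}, []
    fields = sorted({field for entry in entries for field in entry})
    for field in fields:
        raw = [entry.get(field, "") for entry in entries]
        votes = Counter(normalize(v) for v in raw)
        key, count = votes.most_common(1)[0]
        if key and count >= min_agree:
            # Keep the first transcription that matches the winning value.
            agreed[field] = next(v for v in raw if normalize(v) == key)
        else:
            review.append(field)
    return agreed, review

# Hypothetical transcriptions of one label by three volunteers.
entries = [
    {"locality": "Berkeley, Alameda Co.", "date": "1932-05-14", "collector": "J. Doe"},
    {"locality": "berkeley,  Alameda Co.", "date": "1932-05-14", "collector": "J. Dow"},
    {"locality": "Berkeley, Alameda Co.", "date": "1932-05-11", "collector": "I. Doe"},
]
agreed, review = consensus(entries)
print(agreed)   # fields where at least two volunteers agree
print(review)   # fields to route to museum staff for QAQC
```

Fields that fail the vote go to a review list rather than being guessed, which matches the slide's division of labor: volunteers transcribe, staff handle QAQC.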
Data quality, access & analysis
Workflow
6. Error Checking
7. Georeference locality
8. Upload data to cache
Error Checking:
- Sort by locality and date to identify typographic errors, and by record number to find carry-over errors.
- Compare 10% of records with label images.
Georeferencing & Mapping:
- Biogeomancer
- Estimate coordinates and error radius based on standardized protocols
- Data Cache
Example: Analyzing data
- Dragonfly specimens throughout CA over 100 years
- Combine with: observation data, 1914 survey, current field studies
- Changes in biodiversity, species composition, and distribution
- Metrics of climate and land use change
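Two of the error checks above lend themselves to a short sketch: spotting likely typographic variants among locality strings, and drawing the 10% sample of records to compare against label images. The similarity cutoff, the fixed random seed, and the sample locality names are assumptions for illustration; the slides do not say how these checks are implemented.

```python
import difflib
import random

def near_duplicate_localities(localities, cutoff=0.85):
    """Pair up distinct locality strings that are suspiciously similar."""
    unique = sorted(set(localities))
    pairs = []
    for i, name in enumerate(unique):
        for other in unique[i + 1:]:
            ratio = difflib.SequenceMatcher(None, name.lower(), other.lower()).ratio()
            if ratio >= cutoff:
                pairs.append((name, other))
    return pairs

def sample_for_image_check(record_ids, fraction=0.10, seed=0):
    """Draw a reproducible random sample of records to compare with label images."""
    ids = list(record_ids)
    k = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, k))

# Hypothetical locality values as they might appear after data entry.
localities = ["Berkeley", "Berkley", "Davis", "Riverside", "Riversde", "Santa Barbara"]
pairs = near_duplicate_localities(localities)
sample = sample_for_image_check(range(1, 201))
print(pairs)        # candidate typos for a person to confirm
print(len(sample))  # number of records pulled for image comparison
```

The pairwise comparison is quadratic in the number of distinct localities, so for a large table it would make sense to compare only neighbors in the sorted list, as the slide's "sort by locality" suggests.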
9. Temporospatial analyses
Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission
Publicly available data layers
Temperature (max, min, mean)
Species Distributions
Past, Present, Projected Future
Past, Present, Projected Future
Precipitation
Land Use
Past, Present, Projected Future
Private, Public, and Protected
Land Cover
Soils
Topography
Hydrology
Ongoing Research Projects • In support of taxonomy and undergraduate research
[Figure: number of specimens by year for ~23,000 georeferenced specimens in the EMEC database from the California Floristic Province. Axes: # specimens vs. years. Annotations: WW2, reconfiguration, J. Powell.]
Progress Made – Essig Museum
Data Entered: EMEC total 122,000
- 42,000 since 1 Sept 2010
- 55,289 CA specimens
- 65,000 georeferenced
Images Taken: 44,200 images
Progress Made – Collaborators
CDFA: 14,000 Sphecidae, pests
Bohart: 25,000 Sphecidae
SBMNH: 140,000 Coleoptera (museum records and literature)
Riverside: 26,500 bees
CAS: 15,000 Neuroptera
Photos: CA Beetle Project site; Texas A&M University; Robin Coville
Timeline
Start of Calbug
Year 1: 240,000 Specimens Digitized
Year 2: Image and Digitize 320,000 Specimens; QAQC
Year 3: Image and Digitize 320,000 Specimens; QAQC
Year 4: Image and Digitize 320,000 Specimens; QAQC
Year 5: Georeferencing
Finish
Analysis of data: Arthropod response to global change
Imaging Goal – Next 3 years: 320,000 images per year; 6,500 images per week (48 weeks)
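The arithmetic behind the imaging goal can be checked in a few lines. The yearly target, the 48 working weeks, and the 60 specimens per hour per person rate come from the slides; the 10 hours per imager per week is purely an assumption for illustration. The exact weekly quotient is about 6,667, so the 6,500 on the slide is presumably a rounded figure.

```python
import math

IMAGES_PER_YEAR = 320_000     # imaging goal for each of years 2-4
WEEKS_PER_YEAR = 48           # working weeks, as stated in the imaging goal
RATE_PER_PERSON_HOUR = 60     # current imaging rate reported earlier

# Assumption for illustration only: each imager works 10 hours per week.
HOURS_PER_IMAGER_WEEK = 10

images_per_week = IMAGES_PER_YEAR / WEEKS_PER_YEAR
person_hours_per_week = images_per_week / RATE_PER_PERSON_HOUR
person_hours_per_year = IMAGES_PER_YEAR / RATE_PER_PERSON_HOUR
imagers_needed = math.ceil(person_hours_per_week / HOURS_PER_IMAGER_WEEK)

# Year 1 plus three imaging years should add up to the project goal.
project_total = 240_000 + 3 * IMAGES_PER_YEAR

print(f"images per week:        {images_per_week:,.0f}")
print(f"person-hours per week:  {person_hours_per_week:,.1f}")
print(f"person-hours per year:  {person_hours_per_year:,.0f}")
print(f"imagers at {HOURS_PER_IMAGER_WEEK} h/week:    {imagers_needed}")
print(f"project total:          {project_total:,}")
```

At the current rate the goal works out to roughly 111 person-hours of imaging every week, which is the motivation for the assembly-line changes proposed under Future Directions.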
Future Directions – Simplify and disperse the workflow
1. Select taxa for databasing
3. Arrange labels to view all text, add catalog # label
4 (Modified). Run sheets of specimens through imaging station
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses
Changes:
- Remove sorting step (2) and museum staff data entry (5a)
- Speed up image capture through assembly line process (4)
  - Set up stations for specific handling tasks
  - Automate file naming and saving
- Develop dictionaries of localities and common abbreviations to reduce error and speed data entry
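A dictionary of abbreviations and a locality pick list might be applied to transcribed label text roughly as follows. This is a minimal sketch: the abbreviation table, the pick list, and the sample label are hypothetical stand-ins, not the project's actual dictionaries.

```python
import re

# Hypothetical abbreviation dictionary (lowercase key -> expansion).
ABBREVIATIONS = {"co.": "County", "mts.": "Mountains", "cyn.": "Canyon", "nr.": "near"}

# Hypothetical pick list of controlled locality names.
KNOWN_LOCALITIES = {"Berkeley", "Davis", "Riverside"}

def expand_abbreviations(text):
    """Replace whole-word abbreviations with their dictionary expansions."""
    words = [ABBREVIATIONS.get(word.lower(), word) for word in text.split()]
    return " ".join(words)

def match_locality(text):
    """Return the first pick-list locality found in the text, or None."""
    for name in sorted(KNOWN_LOCALITIES):
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            return name
    return None

label = "nr. Berkeley, Alameda Co."
expanded = expand_abbreviations(label)
locality = match_locality(expanded)
print(expanded)   # label text with abbreviations spelled out
print(locality)   # controlled locality name, if any
```

Matching against a controlled list means a transcription either resolves to a known name or is flagged as unmatched, which is how pick lists reduce free-text error.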
Looking ahead
• Data from many millions of additional specimens will remain to be captured
• "Brute force" entry needs to be coupled with any technological advances that we can harness
• Intermediate products are necessary
Acknowledgements
• All participating organizations
• National Science Foundation
• John Wieczorek, Michelle Koo, Carol Spencer
• Berkeley Natural History Museums Consortium
• Biodiversity Sciences Technology (BSCIT)
• Citizen Science Alliance
• >20 Undergraduates