Search Issues Associated with Uncorrected OCR

Newspapers and OCR

The newspapers in this collection have uncorrected OCR (Optical Character Recognition) text, which may render keyword searching ineffective. To see why, take a look at this sample from p. 205 of the Connecticut War Record, May 1864:


The original, on the left, seems fairly legible, but someone forgot to tell the computer -- the same lines rendered by a respected OCR application appear on the right. If you were looking for the term "Col. Wooster" in this issue, using keyword searching, you would never find it. We strongly recommend that you take advantage of our browse filtering to find the issues that are likely to have what you are seeking, and then read through them instead of relying on keyword searching of the transcription field.

Why is the OCR so awful?

Even the most sophisticated Optical Character Recognition software has limitations in its ability to recognize printed text. Old newspapers are particularly challenging because the pages are large, the typeface is small, and the layout is often cluttered with non-text items. Add to this the poor legibility of many originals and the fact that the digitized version was often made from microfilm rather than from the original, and you end up with a "perfect storm" of ambiguity and misinterpretation. Sadly, this will only improve when the technology gets much better, or when a vast army of volunteers manually corrects the transcription fields of millions of already-digitized pages.

Advantages of Filtered Browsing

The tabbed browse themes to the left of the Search Box allow browsing by title, place, topic, and date range. Because the filters reference non-transcription metadata fields, text legibility is not an issue. While filtered browsing can only narrow your search, and requires a little more participation on your part, at least it won't give "false negative" results caused by garbled OCR. If a keyword search returns nothing, do give Filtered Browsing a try.
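
For the technically curious, the difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not how the collection's search is actually implemented: the record layout, the field names (title, date, transcription), the day-of-month values, and the garbled OCR strings are all made up for the example.

```python
from datetime import date

# Hypothetical issue records. The field names, the exact dates, and the
# garbled transcription text are invented for illustration only.
issues = [
    {
        "title": "Connecticut War Record",
        "date": date(1864, 5, 1),
        "transcription": "Gol. Woo8ter toek c0mmand ef the regirnent",
    },
    {
        "title": "Connecticut War Record",
        "date": date(1864, 6, 1),
        "transcription": "Tbe rcgiment rnarched at dawn",
    },
]


def keyword_search(records, term):
    """Exact (case-insensitive) substring match against the OCR text."""
    needle = term.lower()
    return [r for r in records if needle in r["transcription"].lower()]


def filtered_browse(records, title, start, end):
    """Match on metadata only; the OCR text is never consulted."""
    return [
        r for r in records
        if r["title"] == title and start <= r["date"] <= end
    ]


# The keyword search misses the issue because the OCR text is garbled.
missed = keyword_search(issues, "Col. Wooster")

# Browsing by title and date range still finds the May 1864 issue.
found = filtered_browse(
    issues, "Connecticut War Record", date(1864, 5, 1), date(1864, 5, 31)
)
```

Here the keyword search comes back empty even though the issue mentions the name, while the metadata filter returns the May 1864 issue, which you can then read through yourself.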