Have you ever come across a great data source that is unusable because it is contained in an image-based file (e.g. a non-searchable PDF)? Or even worse, data in tables printed on physical paper? This is often a problem for researchers who would like to use data from government or archival sources in statistical analyses. Many solutions for converting image-based tabular data are imperfect, either because the resulting spreadsheet files are poorly formatted or because the text recognition performs poorly (or both).

I have spent considerable time working through this problem and am writing this post to guide other researchers and data scientists through my preferred method: using Amazon Textract. This method requires investing some time at the beginning to get it up and running, and there is a per-page charge (except for moderate use during the first year). It also requires some very basic use of Python. But if you have more than a few pages to convert, I think the investment is worth it compared to manually entering all the data (or using a mediocre solution that requires spending time fixing its output).

This post is organized into three main sections. The first discusses getting hard copy data into digitized image files. The second briefly discusses some options for optical character recognition (OCR). Finally, the third details the process for setting up and using Textract.

If you have physical pieces of paper that contain tables of data, the first step is turning them into image files on your computer. Despite the ease of using a cell phone camera to rapidly capture pages of data, I do not recommend doing so when it is avoidable. Although the quality of these photos can be quite good, the resolution of a photo of a document is often lower than it would be if a scanner were used. Perhaps more importantly, taking photos of documents often introduces spatial distortion (3D page warping and 2D page skew/rotation) that can be difficult to correct post hoc. So, if possible, use a document scanner to create image files of the hard copies you have acquired. Set the scanner to the highest resolution possible, especially if the print/text is small. If using a scanner is impractical due to time or other logistical constraints, you can still successfully digitize hard copy data using a cell phone.
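To put rough numbers on the resolution gap between a phone photo and a scan, here is a minimal Python sketch. It is only back-of-the-envelope arithmetic, not part of the Textract workflow. The function name `effective_dpi`, the 4032 x 3024 pixel photo size (a common 12-megapixel sensor), the US Letter page size, and the 600 dpi scanner setting are all assumptions chosen for illustration.

```python
def effective_dpi(width_px, height_px, page_w_in=8.5, page_h_in=11.0, fill=1.0):
    """Approximate dots per inch of a page captured in an image.

    `fill` is the fraction of the frame the page spans along its limiting
    dimension (1.0 means the page fills the frame edge to edge).
    All defaults are illustrative assumptions (US Letter paper).
    """
    # Match the long side of the image to the long side of the page.
    long_px, short_px = max(width_px, height_px), min(width_px, height_px)
    long_in, short_in = max(page_w_in, page_h_in), min(page_w_in, page_h_in)
    # The page has to fit in both dimensions, so the smaller ratio is the limit.
    return fill * min(long_px / long_in, short_px / short_in)


# Assumed 12-megapixel phone photo, page filling the whole frame (best case).
phone_best = effective_dpi(4032, 3024)
# Same photo, but the page spans only 85% of the frame (hand-held framing).
phone_typical = effective_dpi(4032, 3024, fill=0.85)
# A 600 dpi scan of a US Letter page is 5100 x 6600 pixels.
scanner = effective_dpi(5100, 6600)

print(f"phone, best case: {phone_best:.0f} dpi")
print(f"phone, 85% fill:  {phone_typical:.0f} dpi")
print(f"scanner:          {scanner:.0f} dpi")
```

Under these assumptions, even a perfectly framed phone photo lands well below a 600 dpi scan, and that is before accounting for warping or skew.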