OK computer whizzes, now's the time to show us


I have my jazz history book in published form, and due to a lot of stupid things the original disc that it was on may not be accessible -

can the book pages be scanned in such a way that they can be put into a Word file (or some such thing) and edited? or do I have to re-type the damn thing?

Link to comment
Share on other sites

OCR software is pretty good these days. I have used a trial version of ABBYY FineReader that works well - it has lots of output functions.

http://www.scanstore.com/Scanning_Software...p?ITEM_ID=18288

As with any software there are price points.

Here's a list of the popular ones. There's probably other free stuff out there, but it may not be as accurate.

http://www.simpleocr.com/OCR_Software_Guide.asp

Link to comment
Share on other sites

groovy; thanks guys. As I hit my old age it occurs to me that I need to take a little more control of my work; I have three books written, a fourth I am trying to finish. I've never found a decent publisher who will handle my stuff, and I've made more money selling it myself anyway (basically I occupy a middle position, in publishing limbo; I sell enough to make it worthwhile to ME, but not enough for a major trade press).

thanks again -

Link to comment
Share on other sites

I used to really like OmniPage (I think I had version 8 or 9). I've heard relatively good things of the software through OmniPage 12, but then the company was bought out and the customer support/service went completely to hell and none of it worked well with Vista (no big surprise there). OmniPage 16 supposedly bites.

I'm totally bummed because my home computer finally died, and I can't find the installation CDs, so I have to get new software. I'm leaning towards the ABBYY FineReader. I'm pretty sure that if you buy it (not just use the trial version) you can save out multiple pages. Any insight into this? Second, does it offer a "straighten page" option? For some of my scanned material, I just don't have an option but to try to utilize this feature. Thanks for any thoughts.

Link to comment
Share on other sites

I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.

Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).

You're on Windows, right? Try http://jocr.sourceforge.net/ which is released under the GNU General Public License: no trial or shareware teaser - simply download and use the software.

Link to comment
Share on other sites

Scansoft PDF Converter 4 takes PDF files and converts them to Word Doc, Publisher, and other formats. For my scanner, when scanning printed pages, the default output option is PDF.

Link to comment
Share on other sites

Not impossible. In fact, it's fairly common for scanners, especially document imaging units (what we used to call copiers), to output to PDF format in addition to JPG, GIF, TIFF, BMP, etc. Both my office Ricoh copier and Canon desktop scanner not only provide output to PDF but also the option to perform in-line OCR.

Current OCR and crawling technology is truly amazing. In addition to full-text search on Office documents and PDFs, our enterprise search crawler also performs full-text OCR on image files. You might expect a CAD drawing to be indexed, but it will also OCR a JPG or GIF, "sense" text, and index the content accordingly.

Can you tell I'm an IT guy? Jeez...

Link to comment
Share on other sites

do re-type the damn thing

Link to comment
Share on other sites

What's that scanner model you use?

Link to comment
Share on other sites

Alright, I didn't think of in-line OCR.

Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.

I generally set my scanner to output TIFs, then after the editing and processing I usually just save to Word documents, but some people prefer saving to PDFs. I think the PDF-direct output is for people who just need a permanent record but don't intend to edit (receipts, accounting stuff and contracts tend to fall in this category).

Link to comment
Share on other sites

Yeah, I'd still like to know what the "PDF-direct" looks like in Dan's case: embedded image or text.
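For anyone who wants to check this on their own scanner output, here is a rough Python sketch. It is only a heuristic and rests on assumptions: it looks for the `/Font` and `/Image` markers in the raw bytes of the file, on the theory that an image-only scan embeds a bitmap but declares no fonts. The function name is made up for illustration, and PDFs that store their object dictionaries in compressed object streams (or are encrypted) can hide these markers, so treat a result as a hint, not proof.

```python
import sys
from pathlib import Path


def looks_like_image_only_pdf(pdf_bytes: bytes) -> bool:
    """Crude heuristic: True if the raw PDF bytes mention an image but no font.

    Assumption: a PDF with a real (or OCR-generated) text layer declares at
    least one /Font resource, while a plain scan only embeds an /Image object.
    Compressed object streams can hide both markers, so this can be wrong.
    """
    has_image = b"/Image" in pdf_bytes
    has_font = b"/Font" in pdf_bytes
    return has_image and not has_font


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_pdf.py some_scan.pdf  (script and file names are hypothetical)
    data = Path(sys.argv[1]).read_bytes()
    if looks_like_image_only_pdf(data):
        print("Probably an embedded image with no text layer.")
    else:
        print("Contains font resources, or could not tell.")
```

A simpler manual test is to open the PDF and try to select a word with the mouse: if only the whole page can be selected, it is an embedded image.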

Link to comment
Share on other sites

alright, now you've all lost me - I'm on Windows so I was thinking of using Rockefeller Center's link, the JOCR thing - am I missing anything here?

(and thanks, Serioza, but retyping 100,000 + words is not my favorite option) -

hey, here's an idea - anyone here want to hire out (for a reasonable fee, I hope) to do this?

Edited by AllenLowe
Link to comment
Share on other sites

Microtek ScanWizard 5. We got it from my wife's aunt, who got it as part of a free "bundle" when she bought her PC and had no use for it.

Like I said, if I scan text like a birth certificate, the default setting for the output is PDF. I've never tried to output a PDF when I scan a photograph or photo + text.

And if there was any question, the Scansoft software did a perfect job converting PDF to an editable file, keeping images intact and text boxes editable.

Link to comment
Share on other sites

well, let's figure based on time - how long would it take to scan a 350-page book?

You've got a few choices. Do you still have the page proofs? If so, I would use those.

If not, and if your scanner supports multiple pages in a scan, then I would probably scan 10 pages at a time to avoid losing data due to crashes, etc. I'm in an unfortunate situation, since my scan settings reset for every single job, so I actually make a photocopy of the thing I am scanning, then feed the copies through the automatic feeder in one go (this counts as a single job). Make sure the resolution is at least 300 dpi; 400 dpi is better if the output files aren't too large for your available storage.
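To get a feel for the storage side of the 300 vs. 400 dpi question, here is a back-of-the-envelope calculation in Python. The assumptions are mine, not measured from any scanner: letter-size pages (8.5 x 11 inches), 8-bit grayscale, and no compression. Real TIFs saved with compression, or scanned as 1-bit black and white, come out much smaller.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Raw (uncompressed) size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


PAGES = 350  # page count of the book discussed in this thread

for dpi in (300, 400):
    # Assumed page size: 8.5 x 11 inches, 8-bit grayscale.
    per_page = uncompressed_scan_bytes(8.5, 11, dpi)
    total_gb = per_page * PAGES / 1e9
    print(f"{dpi} dpi: {per_page / 1e6:.1f} MB per page, "
          f"{total_gb:.2f} GB for {PAGES} pages")
```

Under those assumptions a page is roughly 8.4 MB at 300 dpi and about 15 MB at 400 dpi, so going from 300 to 400 dpi costs a bit under twice the space.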

If you have page proofs, or have already copied the book, I think converting to TIF files will take about 30 minutes, running batches through the scanner (this assumes you have a copier/scanner). Again, do this in installments (no more than 35-40 pages at a time). If you are scanning the book page by page, it might take 2-3 hours (or more if you have the software attempt OCR on the spot -- better to just save the files out and process later).
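If it helps to plan the installments, the splitting can be sketched in a few lines of Python. The 350-page total and the 40-page batch size come from this thread; the function name and the batch labels are just illustrative.

```python
def plan_batches(total_pages, batch_size):
    """Return (first_page, last_page) pairs that cover pages 1..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


# Example: a 350-page book scanned at most 40 pages at a time.
for number, (first, last) in enumerate(plan_batches(350, 40), start=1):
    print(f"batch_{number:02d}: pages {first}-{last}")
```

Naming each saved file after its batch with a zero-padded number keeps the files in page order when they are sorted by name later.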

Then you will run the OCR software. If you mostly have text and few or no footnotes and pictures, then maybe 1 day of going through and cleaning up. It could be more. That's a lot of pages. If the scan isn't clean or you have tables, footnotes, etc., then I'd say 2-3 days of intense work. That's my general experience.

It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75 pages, etc.). Scan duplex and you're on your way!

Yeah, pretty much my advice. This does assume Allen has a multiple page scanner and not a flatbed!

Link to comment
Share on other sites

Well - He could always go to Kinko's or a local mom & pop copy shop to see if they can do it for him. Kinko's may ask him for proof that he's the copyright holder though.

Link to comment
Share on other sites

With GOCR I'm not sure if there's some sort of wrapper that lets you "batch OCR" multiple images (for example, all TIFFs in the folder my_book: /images/my_book/*.TIF) and stream them into ONE text file. On my platform I've been using Tesseract in combination with ocube, which lets you do exactly that. There should be a Tesseract binary for Windows, but I don't know if there's a wrapper equivalent to ocube. Maybe try a web search for "windows batch ocr" or something like that.
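If no ready-made wrapper turns up, a small script can do the batching. The following is a minimal Python sketch, not a tested tool: it assumes the `tesseract` command-line program is installed and on the PATH, and that your version accepts `stdout` as the output name (I believe reasonably recent releases do, but check yours). The function names are made up for illustration.

```python
import subprocess
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes `tesseract <image> stdout` prints the text to standard output.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def batch_ocr(image_dir, output_file, ocr=ocr_with_tesseract):
    """OCR every TIFF in image_dir, in file-name order, into ONE text file."""
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in TIFF_SUFFIXES
    )
    with open(output_file, "w", encoding="utf-8") as out:
        for image in images:
            out.write(ocr(image))
            out.write("\n")  # keep a line break between pages
    return len(images)
```

Calling `batch_ocr("/images/my_book", "my_book.txt")` would walk the folder from the example above. Files are processed in name order, so zero-padded names (page_001.tif, page_002.tif, ...) are needed to keep the pages in sequence.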

Link to comment
Share on other sites
