AllenLowe

OK computer whizzes, now's the time to show us

35 posts in this topic

I have my jazz history book in published form, and due to a lot of stupid things the original disc that it was on may not be accessible -

can the book pages be scanned in such a way as the pages can be put into a Word file (or some such thing) and edited? or do I have to re-type the damn thing?


OCR software is pretty good these days. I have used a trial version of ABBYY FineReader that works well - it has lots of output functions.

http://www.scanstore.com/Scanning_Software...p?ITEM_ID=18288

As with any software there are price points.

Here's a list of the popular ones. There are probably other free options out there, but they may not be as accurate.

http://www.simpleocr.com/OCR_Software_Guide.asp


I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.


groovy; thanks guys. As I hit my old age it occurs to me that I need to take a little more control of my work; I have three books written, a fourth I am trying to finish. I've never found a decent publisher who will handle my stuff, and I've made more money selling it myself anyway (basically I occupy a middle position, in publishing limbo; I sell enough to make it worthwhile to ME, but not enough for a major trade press).

thanks again -


I used to really like OmniPage (I think I had version 8 or 9). I've heard relatively good things about the software through OmniPage 12, but then the company was bought out, the customer support/service went completely to hell, and none of it worked well with Vista (no big surprise there). OmniPage 16 supposedly bites.

I'm totally bummed because my home computer finally died, and I can't find the installation CDs, so I have to get new software. I'm leaning towards the ABBYY FineReader. I'm pretty sure that if you buy it (not just use the trial version) you can save out multiple pages. Any insight into this? Second, does it offer a "straighten page" option? For some of my scanned material, I just don't have an option but to try to utilize this feature. Thanks for any thoughts.


I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.

Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).

You're on Windows, right? Try http://jocr.sourceforge.net/ which is released under the GNU General Public License: no trial or shareware teaser - simply download and use the software.


Scansoft PDF Converter 4 takes PDF files and converts them to Word documents, Publisher files, and other formats. For my scanner, when scanning printed pages, the default output option is PDF.


all of the above.


Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).

Not impossible. In fact, it's fairly common for scanners, especially document imaging units (what we used to call copiers), to output to PDF format in addition to JPG, GIF, TIFF, BMP, etc. Both my office Ricoh copier and my Canon desktop scanner not only provide output to PDF but also offer the option to perform in-line OCR.

Current OCR and crawling technology is truly amazing. Our enterprise search crawler, in addition to full-text search on Office documents and PDFs, also performs full-text OCR on image files. You might expect a CAD drawing to be indexed, but it will also OCR a JPG or GIF, "sense" text, and index the content accordingly.

Can you tell I'm an IT guy? Jeez...


I have my jazz history book in published form, and due to a lot of stupid things the original disc that it was on may not be accessible -

can the book pages be scanned in such a way as the pages can be put into a Word file (or some such thing) and edited? or do I have to re-type the damn thing?

do re-type the damn thing


Alright, I didn't think of in-line OCR.

Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.

Edited by rockefeller center


I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.

Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).

Scansoft PDF Converter 4 takes PDF files and converts them to Word documents, Publisher files, and other formats. For my scanner, when scanning printed pages, the default output option is PDF.

What's that scanner model you use?


Alright, I didn't think of in-line OCR.

Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.

I generally set my scanner to output TIFs, then after the editing and processing I usually just save to Word documents, but some people prefer saving to PDFs. I think the PDF-direct output is for people who just need a permanent record but don't intend to edit (receipts, accounting stuff and contracts tend to fall in this category).


Yeah, I'd still like to know what the "PDF-direct" looks like in Dan's case: embedded image or text.


All right, now you all lost me - I'm on Windows, so I was thinking of using Rockefeller Center's link, the JOCR thing - am I missing anything here?

(and thanks, Serioza, but retyping 100,000+ words is not my favorite option) -

hey, here's an idea - anyone here want to hire out (for a reasonable fee, I hope) to do this?

Edited by AllenLowe


Define "reasonable" and I'm your man.


Yeah, I'd still like to know what the "PDF-direct" looks like in Dan's case: embedded image or text.

Microtek ScanWizard 5. We got it from my wife's aunt, who got it as part of a free "bundle" when she bought her PC and had no use for it.

Like I said, if I scan text like a birth certificate, the default setting for the output is PDF. I've never tried to output a PDF when I scan a photograph or photo + text.

And if there was any question, the Scansoft software did a perfect job converting PDF to an editable file, keeping images intact and text boxes editable.


well, let's figure based on time - how long would it take to scan a 350-page book?


It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75 pages, etc.). Scan duplex and you're on your way!


well, let's figure based on time - how long would it take to scan a 350-page book?

You've got a few choices. Do you still have the page proofs? If so, I would use those.

If not, if your scanner supports multiple pages in a scan, then I would probably scan 10 pages at a time -- to avoid losing data due to crashes, etc. I'm in an unfortunate situation, since the scan settings reset for every single job, so in my case, I actually make a copy of the thing I am scanning, then automatically feed them through (this counts as a job). Make sure the settings are at least 300 dpi - 400 dpi is better if the output files aren't too large for your available storage.
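To get a feel for the storage side of the 300 vs. 400 dpi choice, here is a rough sketch. The 6 x 9 inch page size and 8-bit grayscale are assumptions made for illustration (nobody in the thread gave the book's trim size), and real TIFs are usually compressed, so treat the numbers as an upper bound rather than a prediction.

```python
def uncompressed_scan_bytes(dpi, width_in=6.0, height_in=9.0, bits_per_pixel=8):
    """Rough uncompressed size in bytes of one scanned page.

    width_in, height_in and bits_per_pixel are illustrative assumptions
    (a 6 x 9 inch page scanned as 8-bit grayscale).
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


PAGES = 350  # page count mentioned in the thread

for dpi in (300, 400):
    per_page = uncompressed_scan_bytes(dpi)
    total_mb = per_page * PAGES / 1_000_000
    print(f"{dpi} dpi: {per_page / 1_000_000:.1f} MB per page, "
          f"about {total_mb:.0f} MB for {PAGES} pages")
```

Whatever the page size, going from 300 to 400 dpi multiplies the pixel count by (400/300)^2, or roughly 1.78.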

If you have page proofs, or have already copied the book, I think converting to TIF files will take 30 minutes, running batches through the scanner (this assumes you have a copier/scanner). Again, do this in installments (no more than 35-40 pages at a time). If you are scanning the book page by page, it might take 2-3 hours (or more if you have the software attempt OCR on the spot -- better to just save the files out and process later).

Then you will run the OCR software. If you mostly have text and few or no footnotes and pictures, then maybe 1 day of going through and cleaning up. It could be more. That's a lot of pages. If the scan isn't clean or you have tables, footnotes, etc., then I'd say 2-3 days of intense work. That's my general experience.
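The "do it in installments" advice above can be planned out ahead of time. This is only a sketch: the batch size of 35 comes from the post, while the `book_001-035.tif` style file names are a made-up convention, not something any scanner software produces by default.

```python
def plan_batches(total_pages, batch_size):
    """Return inclusive (first_page, last_page) tuples covering 1..total_pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


batches = plan_batches(350, 35)
print(len(batches), "scan jobs")
for job, (first, last) in enumerate(batches, start=1):
    # Hypothetical naming scheme; zero-padding keeps the files in page order.
    print(f"job {job:02d}: pages {first}-{last} -> book_{first:03d}-{last:03d}.tif")
```

Saving each job under a zero-padded name means a crash costs you one batch at most, and the files sort in page order when it is time to run the OCR.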

It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75 pages, etc.). Scan duplex and you're on your way!

Yeah, pretty much my advice. This does assume Allen has a multiple page scanner and not a flatbed!


Well - He could always go to Kinko's or a local mom & pop copy shop to see if they can do it for him. Kinko's may ask him for proof that he's the copyright holder though.


Since my scanner is flatbed and pretty damn slow, I'd probably be faster if I re-typed it.

And I'm a fast typist.


thanks for the continued advice - I am thinking I should approach Kinko's first and see what they say - the copyright is clearly marked as mine on the title page - and I have enough books to take one apart -


With GOCR, I'm not sure if there's some sort of wrapper that lets you "batch OCR" multiple images (for example, all TIFFs in the folder my_book: /images/my_book/*.TIF) and stream them into ONE text file. On my platform I've been using tesseract in combination with ocube, which lets you do exactly that. There should be a tesseract binary for Windows, but I don't know if there's a wrapper equivalent to ocube. Maybe try a web search for "windows batch ocr" or something like that.
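For what it's worth, a batch wrapper like that can be a few lines of Python. The sketch below is not ocube and makes no claim to match it; it assumes the `tesseract` command-line program is installed and on the PATH, and that the scans are named so that alphabetical order equals page order (page_001.tif, page_002.tif, ...). The function names and file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_ocr(image_path):
    """OCR one image with the tesseract command-line tool.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing its own .txt file.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_folder(folder, output_file, ocr=tesseract_ocr):
    """OCR every .tif/.tiff file in folder, in name order, into ONE text file.

    Returns the number of images processed. The ocr argument can be
    swapped for any function that maps an image path to text.
    """
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".tif", ".tiff"}
    )
    with open(output_file, "w", encoding="utf-8") as out:
        for image in images:
            out.write(ocr(image))
            out.write("\n")  # keep a line break between pages
    return len(images)


# Example call (paths are placeholders):
# ocr_folder("/images/my_book", "my_book.txt")
```

The OCR step is passed in as a parameter so the same loop would work with a different engine; whether GOCR can be driven the same way is something I have not checked.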

