Why isn't text extracted for my document?

People have asked us why text wasn't extracted for a particular document on an SM. Here are the reasons why TechDoc cannot or will not extract text for some documents:

  1. The file type of the document may not be supported. TechDoc needs specific support for a file type in order to properly extract text from files of that type. As of TechDoc 8, the following file types are supported for text extraction: DGN, DOC, DOCM, DOCX, DOT, DOTM, DOTX, DWG, DWT, FLO, HTM, HTML, IGX, MCW, PDF, POTM, POTX, PPSM, PPSX, PPTM, PPTX, RTF, TIF, TIFF, VDX, VSD, VSDM, VSDX, VSSM, VSSX, VSTM, VSTX, VSX, VTX, XDP, XLS, XLSM, XLSX, XLTM, XLTX, and XML. For all other files, TechDoc falls back to a default text extractor that attempts to find ASCII or Unicode text strings.
  2. The file may have protection that prevents TechDoc from accessing the contents to extract text. Microsoft Office and PDF file formats support encryption that requires a password to access the contents of a file. TechDoc cannot bypass this security, so no text will be extracted from files protected this way.
  3. Text will not be extracted and sent to an SM unless the DM is configured to index text on that SM. To check this, an Admin can go to the Admin screen, click on "SM Hosts" under "Show...", and confirm that the "Index Text" column is set to "Yes" for the SM in question.
  4. Text will not be extracted if a document has not been released yet.
  5. Text will not be extracted if a document does not permit anonymous read access at the same level as the SM that the search is being performed on. For example, if the document is submitted to a Community SM, the document must have Community read assigned to it. Otherwise, someone could use text searches to discover the content of a document that they do not have permission to read.
  6. Text will not be extracted if the latest released generation of the document is set to restricted fetch access. Restricted fetch access on a generation means that no anonymous read is allowed on that particular generation of the document, so no text will be extracted.
  7. Text will not be extracted if the document is set to a Doc Category that does not permit text extraction.
  8. Text will not be extracted if the latest released generation is set to a Mime Type that does not permit text extraction.
  9. Text will not be extracted if the text extraction library for the file type encounters an error while trying to extract the text. This can be caused by a problem in the file itself, by a file version that is too old or too new for the text extraction library to handle, and so on. If this happens, a text extraction exception will be written to the TechDoc log on the DM at the time the document metadata is being generated and sent to the SM. The rest of the metadata will still be sent to the SM, but it will not include any text.
  10. Text will not be extracted if the document does not contain any text. While this sounds extremely obvious, it can be very hard to tell. For example, a PDF that was created by scanning pages on a scanner may not contain any actual text. If the PDF only contains images of the original pages, TechDoc will not extract any text. Make sure your scanning software performs OCR (Optical Character Recognition) and saves the associated text in the PDF as part of the scanning process. Various PDF editing tools also allow you to OCR an image-only PDF after the fact. However, if you are not familiar with OCR, you need to understand that most OCR solutions are still not very accurate. Keep in mind that an OCR solution with 80% accuracy gets one in every five characters wrong, and that may even be with a clean, unwrinkled, typewritten sheet of paper. If you search on the word "guide", TechDoc won't find it if the OCR process actually saved it as "8uide"!
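
To put the OCR accuracy point in item 10 into numbers, here is a small Python sketch. It assumes, as a simplification, that every character is recognized independently with the same accuracy; real OCR errors tend to cluster, so treat the results as rough estimates only. The function name word_survival_probability is made up for this example and is not part of TechDoc.

```python
def word_survival_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognized correctly.

    Assumes each character is an independent trial with the same
    per-character accuracy (a simplification, not a property of any
    particular OCR product).
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 0:
        raise ValueError("word_length must not be negative")
    return char_accuracy ** word_length


if __name__ == "__main__":
    word = "guide"
    for accuracy in (0.80, 0.95, 0.99):
        p = word_survival_probability(accuracy, len(word))
        print(f"{accuracy:.0%} per character: "
              f"{p:.1%} chance that '{word}' is stored correctly")
```

Under that assumption, 80% per-character accuracy leaves a five-letter word such as "guide" intact only about a third of the time (0.8 to the fifth power is 0.32768), which is why a search can miss words that are clearly visible on the scanned page.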
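
Item 1 mentions a default extractor that looks for ASCII or Unicode text strings in files of unsupported types. How TechDoc implements this is not described here, so the Python sketch below only illustrates the general technique, similar in spirit to the Unix "strings" utility. The function name extract_strings, the minimum run length of 4 characters, and the choice of UTF-16LE as the Unicode encoding are all assumptions made for illustration.

```python
import re
from typing import List

# Assumed threshold: ignore runs shorter than this many characters.
MIN_LEN = 4

# Runs of printable ASCII bytes (space through tilde).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)

# The same characters stored as UTF-16LE: each one followed by a zero byte.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def extract_strings(data: bytes) -> List[str]:
    """Return printable text runs found in arbitrary binary data.

    ASCII runs are listed first, followed by UTF-16LE runs.
    """
    found = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return found


if __name__ == "__main__":
    # A made-up binary blob with one ASCII run and one UTF-16LE run.
    sample = (b"\x00\x01Hello world\xff\x02"
              + "guide".encode("utf-16-le")
              + b"\x00\x03ab\x04")
    print(extract_strings(sample))
```

A scan like this recovers only text that is stored uncompressed and unencrypted, so it would find nothing useful in a compressed or password-protected file.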