Pdf

This module can be used to extract text from PDF files.

Note: It is not possible to extract text from scanned PDF files.

Usage

Extract text from PDF file

You can use the textBlocks function to extract text blocks from a PDF file.

```js
// Load and instantiate the module; replace vX.Y.Z with the version you want to use
var pdf = Module.load("Pdf", {version: "vX.Y.Z"});
// Extract the text blocks on page 2
var result = pdf.textBlocks("/path/to/file.pdf", {page: 2});
// result is a `TextBlocks` object whose `blocks` property contains the extracted text blocks
var blocks = result.blocks;
```

The options argument (the 2nd argument) can contain the following properties:

  • page: the page number to extract text from. Defaults to all pages.
  • password: the password used to decrypt the PDF file.
  • betweenLineMultiplier: the spacing threshold used to separate blocks, computed as the average spacing multiplied by this value. Defaults to 1.3.
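
To make the effect of betweenLineMultiplier more concrete, here is a minimal sketch of one plausible reading of the description above: a new block starts whenever the gap between two consecutive lines exceeds the average gap multiplied by the multiplier. This is not the module's actual implementation. The function name `groupLines` and the `{text, top, bottom}` line shape are assumptions made for illustration only.

```js
// Illustrative only: NOT the Pdf module's implementation.
// `lines` is a hypothetical array of {text, top, bottom} objects,
// sorted top to bottom, with y coordinates growing downwards.
function groupLines(lines, betweenLineMultiplier) {
  if (lines.length === 0) {
    return [];
  }
  // Fall back to the documented default of 1.3
  var multiplier = betweenLineMultiplier === undefined ? 1.3 : betweenLineMultiplier;

  // Vertical gap between each line and the one before it
  var gaps = [];
  for (var i = 1; i < lines.length; i++) {
    gaps.push(lines[i].top - lines[i - 1].bottom);
  }

  // Threshold = average gap * multiplier
  var sum = gaps.reduce(function (a, b) { return a + b; }, 0);
  var threshold = gaps.length > 0 ? (sum / gaps.length) * multiplier : 0;

  // Start a new block whenever a gap exceeds the threshold
  var blocks = [[lines[0].text]];
  for (var j = 1; j < lines.length; j++) {
    if (gaps[j - 1] > threshold) {
      blocks.push([lines[j].text]);
    } else {
      blocks[blocks.length - 1].push(lines[j].text);
    }
  }
  return blocks;
}
```

Under this reading, a larger multiplier merges more lines into the same block, while a smaller one splits the page into more, shorter blocks.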

The TextBlocks object has a find method which can be used to find text blocks in the PDF file. You can use it like:

```js
// Find a block of text below the headline "Some headline"
result.find("below", "Some headline");
// To exclude rotated text, set the `allowRotatedText` option to false
var block1 = result.find("below", "Some headline", {allowRotatedText: false});
// You can also pass a boundingBox (rectangle) to find another block relative to it
var block2 = result.find("below", block1.boundingBox);
```

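The first argument is the search direction. You can use one of the following constants or the equivalent string:

  • pdf.Below aka "Below"
  • pdf.Above aka "Above"
  • pdf.LeftOf aka "LeftOf"
  • pdf.RightOf aka "RightOf"
  • pdf.Nearest aka "Nearest" to find the nearest block of text

The second argument is a regular expression matching the text to search relative to or, as in the example above, a bounding box.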
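
A common use of find is reading the value printed below a label. The helper below is a small sketch of that pattern. `textBelow` is a hypothetical name, and the check for a falsy return value assumes that find returns nothing when no block matches, which this page does not confirm.

```js
// Hypothetical helper: returns the text of the block below a label, or null.
// `result` is a TextBlocks object as returned by pdf.textBlocks.
// Assumption: find returns a falsy value when no block matches.
function textBelow(result, labelPattern) {
  var block = result.find("below", labelPattern, {allowRotatedText: false});
  return block ? block.text : null;
}
```

It could then be called as `textBelow(result, /Invoice\s+No/)`, where the pattern is only an example.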

TextBlocks also has a blocks property which contains the extracted text blocks. Each TextBlock object has the following properties:

  • text (string) the combined text of all lines in the block
  • lines (array of strings) the individual lines in the block
  • separator (string) the separator used to separate the lines in the block
  • boundingBox (object) the bounding box of the block
    • topLeft (number) the coordinate of the top left corner of the bounding box
    • topRight (number) the coordinate of the top right corner of the bounding box
    • bottomLeft (number) the coordinate of the bottom left corner of the bounding box
    • bottomRight (number) the coordinate of the bottom right corner of the bounding box
    • width (number) the width of the bounding box
    • height (number) the height of the bounding box
  • readingOrder (number) the reading order of the block
  • textOrientation (number) the text orientation of the block
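
Since blocks is a plain array of TextBlock objects, it can be processed with ordinary array methods. The sketch below relies only on the text and readingOrder properties listed above. The helper names are hypothetical, and it assumes that a lower readingOrder means the block is read earlier.

```js
// Hypothetical helpers operating on an array shaped like `result.blocks`.

// Joins the text of all blocks, sorted by readingOrder.
// Assumption: a lower readingOrder means the block comes first.
function textInReadingOrder(blocks) {
  return blocks
    .slice() // copy, so the original array order is left untouched
    .sort(function (a, b) { return a.readingOrder - b.readingOrder; })
    .map(function (block) { return block.text; })
    .join("\n\n");
}

// Returns the blocks whose text matches the given regular expression.
// String.prototype.search is used so a global flag on the pattern has no side effects.
function blocksMatching(blocks, pattern) {
  return blocks.filter(function (block) {
    return block.text.search(pattern) !== -1;
  });
}
```

For example, `blocksMatching(result.blocks, /total/i)` would return every block whose text mentions "total" in any letter case.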

Releases

v1.0.3 (2022-02-17)

  • Feature: Added support for password-protected PDF files

v1.0.2 (2022-02-16)

Initial release.