Pdf

This module can be used to extract text from PDF files.

Note: It is not possible to extract text from scanned PDF files.

Manatee compatibility

This module is compatible with Manatee v2.0.

Usage

Extract text from PDF file

You can use the textBlocks function to extract text blocks from a PDF file.

js
// Load and instantiate the module; replace vX.Y.Z with the version you want to use
var pdf = Module.load("Pdf", {version: "vX.Y.Z"});
var result = pdf.textBlocks("/path/to/file.pdf", {page: 2});
// result is a `TextBlocks` object whose blocks property contains the extracted text blocks
var blocks = result.blocks;

The options argument (the 2nd argument) can contain the following properties:

  • page the page number to extract text from. Defaults to all pages.
  • password the password to decrypt the PDF file.
  • betweenLineMultiplier the space allowed between blocks, calculated as the average space multiplied by this value. Defaults to 1.3.
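
As a minimal sketch of how these options might be combined, the helper below builds the options object from whatever was supplied and returns the text of each extracted block. The helper name extractText and its opts parameter are hypothetical and not part of the module; only textBlocks, blocks, text and the three option names are taken from the description above.

```js
// Hypothetical helper, not part of the Pdf module.
// `pdf` is the module instance returned by Module.load("Pdf", ...).
function extractText(pdf, path, opts) {
  opts = opts || {};
  var options = {};
  // Only pass on the properties that were supplied, so the module defaults apply otherwise
  if (opts.page !== undefined) options.page = opts.page;
  if (opts.password !== undefined) options.password = opts.password;
  if (opts.betweenLineMultiplier !== undefined) {
    options.betweenLineMultiplier = opts.betweenLineMultiplier;
  }
  var result = pdf.textBlocks(path, options);
  // Collect the combined text of each block
  return result.blocks.map(function (block) {
    return block.text;
  });
}
```

It could be called as extractText(pdf, "/path/to/file.pdf", {page: 1, password: "secret"}), where the password value is a placeholder.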

The TextBlocks object has a find method which can be used to find text blocks in the PDF file. You can use it like this:

js
// Find a block of text below the headline "Some headline"
result.find("below", "Some headline");
// We don't want rotated text, so we set the `allowRotatedText` option to false
var block1 = result.find("below", "Some headline", {allowRotatedText: false});
// You can also use a boundingBox (rectangle) as an argument to find another block relative to it
var block2 = result.find("below", block1.boundingBox);

The first argument can be any of the following, and the second argument can be a regular expression:

  • pdf.Below aka "Below"
  • pdf.Above aka "Above"
  • pdf.LeftOf aka "LeftOf"
  • pdf.RightOf aka "RightOf"
  • pdf.Nearest aka "Nearest" to find the nearest block of text
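
A direction constant and a regular expression might be combined as in the sketch below. The helper name findTextBelow is hypothetical. It also rests on two assumptions that the text above does not confirm: that find accepts a JavaScript RegExp object as its second argument, and that it returns a falsy value when nothing matches.

```js
// Hypothetical helper, not part of the Pdf module.
// `pdf` is the loaded module and `result` is the TextBlocks object from pdf.textBlocks(...).
function findTextBelow(pdf, result, labelPattern) {
  // Assumption: a RegExp object is accepted as the second argument
  var block = result.find(pdf.Below, labelPattern, { allowRotatedText: false });
  // Assumption: find returns a falsy value when no block matches
  return block ? block.text : null;
}
```

For example, findTextBelow(pdf, result, /Invoice\s+no/) would look for a block below any text matching that pattern.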

TextBlocks also has a blocks property which contains the extracted text blocks. Each TextBlock object has the following properties:

  • text (string) the combined text of all lines in the block
  • lines (array of strings) the individual lines in the block
  • separator (string) the separator used to separate the lines in the block
  • boundingBox (object) the bounding box of the block
    • topLeft (number) the coordinate of the top left corner of the bounding box
    • topRight (number) the coordinate of the top right corner of the bounding box
    • bottomLeft (number) the coordinate of the bottom left corner of the bounding box
    • bottomRight (number) the coordinate of the bottom right corner of the bounding box
    • width (number) the width of the bounding box
    • height (number) the height of the bounding box
  • readingOrder (number) the reading order of the block
  • textOrientation (number) the text orientation of the block
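
To illustrate how these properties might be used, the sketch below joins the text of all blocks in reading order. The function name textInReadingOrder is hypothetical, and the sketch assumes that a lower readingOrder value means the block is read earlier, which the description above does not state.

```js
// Hypothetical helper, not part of the Pdf module.
// `blocks` is the array from the blocks property of a TextBlocks object.
function textInReadingOrder(blocks) {
  return blocks
    .slice() // copy first so the original array is not reordered
    .sort(function (a, b) {
      // Assumption: lower readingOrder values come first
      return a.readingOrder - b.readingOrder;
    })
    .map(function (block) {
      return block.text;
    })
    .join("\n");
}
```

It could be called as textInReadingOrder(result.blocks) on the result of pdf.textBlocks.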

Releases

v3.0.0 (2023-12-19)

  • Release for Manatee v2

v1.0.3 (2022-02-17)

  • Feature: Added support for password protected PDF files

v1.0.2 (2022-02-16)

  • Initial release