Pdf
This module can be used to extract text from PDF files.
Note: It is not possible to extract text from scanned PDF files.
Usage
Extract text from PDF file
You can use the textBlocks function to extract text blocks from a PDF file.
js
// Load and instantiate the module; replace vX.Y.Z with a proper version
var pdf = Module.load("Pdf", {version: "vX.Y.Z"});
var result = pdf.textBlocks("/path/to/file.pdf", {page: 2});
// result is a `TextBlocks` object which has a blocks property containing the extracted text blocks
var blocks = result.blocks;
The options argument (the 2nd argument) can contain the following properties:
page
the page number to extract text from. Defaults to all pages.
password
the password to decrypt the PDF file.
betweenLineMultiplier
the space allowed between blocks (the average space multiplied by this value). Defaults to 1.3.
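The options can be combined in a single call. The sketch below wraps that in a small helper. `extractBlocks` is a hypothetical name and not part of the module, and its `pdf` argument is assumed to be the object returned by Module.load. The `fakePdf` stand-in exists only so the sketch can run outside the host environment.

```javascript
// Hypothetical helper, not part of the Pdf module: builds the options
// object from the documented properties and returns the extracted blocks.
function extractBlocks(pdf, path, page, password) {
  var options = {};
  // Leaving `page` out extracts text from all pages (the documented default).
  if (page !== undefined) options.page = page;
  // Only needed for password-protected files.
  if (password !== undefined) options.password = password;
  return pdf.textBlocks(path, options).blocks;
}

// Stand-in for the real module so the sketch runs on its own;
// it records the options it receives and returns no blocks.
var fakePdf = {
  lastOptions: null,
  textBlocks: function (path, options) {
    this.lastOptions = options;
    return {blocks: []};
  }
};

var blocks = extractBlocks(fakePdf, "/path/to/file.pdf", 2, "secret");
```

With the real module, pass the loaded `pdf` object instead of `fakePdf`. According to the release notes, password support requires v1.0.3 or later.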
The TextBlocks object has a find method which can be used to find text blocks in the PDF file. You can use it like this:
js
// Find a block of text below the headline "Some headline"
result.find("below", "Some headline");
// We don't want rotated text, so we set the `allowRotatedText` option to false
var block1 = result.find("below", "Some headline", {allowRotatedText: false});
// You can also use a boundingBox (rectangle) as an argument to find another block relative to it
var block2 = result.find("below", block1.boundingBox);
As the first argument, you can use one of the following constants or its string equivalent:
pdf.Below
aka "Below"
pdf.Above
aka "Above"
pdf.LeftOf
aka "LeftOf"
pdf.RightOf
aka "RightOf"
pdf.Nearest
aka "Nearest", to find the nearest block of text
The second argument can also be a regular expression.
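A regular expression helps when the label text varies slightly between documents. The following is a minimal sketch, not the module's own code. `textRightOf` is a hypothetical helper, and it assumes that find returns a falsy value when nothing matches, which this README does not confirm. The `fakeResult` object only mimics find so the sketch can run outside the host environment.

```javascript
// Hypothetical helper: returns the text of the block to the right of a label.
// Assumption: find returns a falsy value when no block matches.
function textRightOf(result, labelPattern) {
  var block = result.find("RightOf", labelPattern, {allowRotatedText: false});
  return block ? block.text : null;
}

// Stand-in for a real TextBlocks object, for illustration only.
var fakeResult = {
  find: function (direction, pattern, options) {
    if (direction === "RightOf" && pattern.test("Invoice no.")) {
      return {text: "12345"};
    }
    return null;
  }
};

var invoiceNumber = textRightOf(fakeResult, /^Invoice\s+(no|number)\.?$/i);
var missing = textRightOf(fakeResult, /^Order\s+no\.?$/i);
```

With the real module, pass the object returned by pdf.textBlocks instead of `fakeResult`.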
TextBlocks also has a blocks property which contains the extracted text blocks. Each TextBlock object has the following properties:
text
(string) the combined text of all lines in the block
lines
(array of strings) the individual lines in the block
separator
(string) the separator used to separate the lines in the block
boundingBox
(object) the bounding box of the block, with the following properties:
  topLeft
  (number) the coordinate of the top left corner of the bounding box
  topRight
  (number) the coordinate of the top right corner of the bounding box
  bottomLeft
  (number) the coordinate of the bottom left corner of the bounding box
  bottomRight
  (number) the coordinate of the bottom right corner of the bounding box
  width
  (number) the width of the bounding box
  height
  (number) the height of the bounding box
readingOrder
(number) the reading order of the block
textOrientation
(number) the text orientation of the block
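These properties make it straightforward to post-process the result. As a minimal sketch, the following joins all block texts in reading order. `textInReadingOrder` is a hypothetical helper, and it assumes that blocks is a regular array and that a lower readingOrder value means the block is read earlier. The sample blocks are made-up data in the documented shape.

```javascript
// Hypothetical helper: concatenates block texts in reading order.
// Assumption: a lower readingOrder value means the block is read earlier.
function textInReadingOrder(blocks) {
  return blocks
    .slice() // copy, so the caller's array is not reordered
    .sort(function (a, b) { return a.readingOrder - b.readingOrder; })
    .map(function (block) { return block.text; })
    .join("\n\n");
}

// Made-up sample data using the documented text and readingOrder properties.
var sampleBlocks = [
  {text: "Second paragraph", readingOrder: 2},
  {text: "Headline", readingOrder: 1}
];

var fullText = textInReadingOrder(sampleBlocks);
```

With the real module, the same helper could be called as textInReadingOrder(result.blocks).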
Releases
v1.0.3 (2022-02-17)
- Feature: Added support for password protected PDF files
v1.0.2 (2022-02-16)
Initial release.