PDF files

This example shows how to load data from PDF files. By default, one document is created for each page in the PDF file. To create a single document for the whole file instead, set the splitPages option to false.

Setup

npm install pdf-parse

Usage, one document per page

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf");

const docs = await loader.load();

Usage, one document per file

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  splitPages: false,
});

const docs = await loader.load();
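
To make the difference between the two modes concrete, here is a minimal sketch that does not use the library at all. The `PageDoc` shape, the `toDocuments` helper, and the `"\n\n"` page separator are assumptions made for illustration, not the loader's actual implementation.

```typescript
// Hypothetical, simplified shape of a loaded document (not the library's class).
interface PageDoc {
  pageContent: string;
  metadata: { source: string; pageNumber?: number };
}

// Sketch: turn per-page text into either one document per page or one
// document for the whole file. The "\n\n" separator is an assumption.
function toDocuments(
  source: string,
  pages: string[],
  splitPages = true
): PageDoc[] {
  if (splitPages) {
    // One document per page, numbered from 1.
    return pages.map((text, i) => ({
      pageContent: text,
      metadata: { source, pageNumber: i + 1 },
    }));
  }
  // One document for the whole file.
  return [{ pageContent: pages.join("\n\n"), metadata: { source } }];
}

const samplePages = ["Page one text", "Page two text"];
console.log(toDocuments("example.pdf", samplePages).length); // 2
console.log(toDocuments("example.pdf", samplePages, false).length); // 1
```

Per-page documents tend to be convenient when you want to cite a page number later, while a single document keeps text that spans page breaks together.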

Usage, custom pdfjs build

By default, the loader uses the pdfjs build bundled with pdf-parse, which is compatible with most environments, including Node.js and modern browsers. To use a more recent version of pdfjs-dist or a custom build of it, provide a custom pdfjs function that returns a promise resolving to the PDFJS object.

In the following example we use the "legacy" (see pdfjs docs) build of pdfjs-dist, which includes several polyfills not included in the default build.

npm install pdfjs-dist

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  // you may need to add `.then(m => m.default)` to the end of the import
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
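
The comment about `.then(m => m.default)` refers to the fact that, depending on your bundler and module settings, a dynamic import may resolve either to the module itself or to an object that wraps it under `default`. The sketch below shows one way to handle both cases. The `unwrapDefault` helper and the `PdfjsLike` interface are hypothetical names introduced here, and the objects are stand-ins rather than real pdfjs modules.

```typescript
// Hypothetical minimal stand-in for the part of the PDFJS object we care about.
interface PdfjsLike {
  getDocument: (src: unknown) => unknown;
}

// Hypothetical helper: return `mod.default` when it exists, otherwise `mod`.
function unwrapDefault<T>(mod: T | { default: T }): T {
  const maybeWrapped = mod as unknown as { default?: T };
  return maybeWrapped.default ?? (mod as T);
}

// Stand-ins for the two shapes a dynamic import might resolve to.
const plainModule: PdfjsLike = { getDocument: () => "doc" };
const wrappedModule = { default: plainModule };

console.log(unwrapDefault<PdfjsLike>(plainModule) === plainModule); // true
console.log(unwrapDefault<PdfjsLike>(wrappedModule) === plainModule); // true
```

With such a helper, the option could be written as `pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js").then(unwrapDefault)`, assuming the helper's return type matches what the loader expects.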

Eliminating extra spaces

PDFs come in many varieties, which makes reading them a challenge. By default, the loader parses individual text elements and joins them with a space. If the output contains excessive spaces, this may not be the behavior you want. In that case, you can override the separator with an empty string like this:

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  parsedItemSeparator: "",
});

const docs = await loader.load();
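
To see why the separator matters, consider a PDF that stores a line of text as several small fragments. The following is a minimal sketch of the joining step only. `joinItems` and the sample fragments are hypothetical, and the real loader deals with layout details that this ignores.

```typescript
// Sketch: join parsed text fragments with a configurable separator.
// Hypothetical helper, not part of the library.
function joinItems(items: string[], separator = " "): string {
  return items.join(separator);
}

// Hypothetical fragments from a PDF that splits words across text elements
// and stores the space between words as its own element.
const fragments = ["Hel", "lo", " ", "wor", "ld"];

console.log(JSON.stringify(joinItems(fragments))); // "Hel lo   wor ld"
console.log(JSON.stringify(joinItems(fragments, ""))); // "Hello world"
```

When the PDF already encodes its own spaces, as in this example, an empty separator reproduces the text faithfully. When it does not, an empty separator runs words together, so it is worth checking the output of both settings on a sample file.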

Loading directories

import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

/* Load all PDFs within the specified directory */
const directoryLoader = new DirectoryLoader(
  "src/document_loaders/example_data/",
  {
    ".pdf": (path: string) => new PDFLoader(path),
  }
);

const docs = await directoryLoader.load();

console.log({ docs });

/* Additional step: split the text into chunks with any TextSplitter. You can then use the chunks as context or save them to memory. */
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const splitDocs = await textSplitter.splitDocuments(docs);
console.log({ splitDocs });
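
The chunkSize and chunkOverlap options are easier to reason about with a small example. The sketch below is a simplified fixed-window chunker written for illustration. It is not the algorithm RecursiveCharacterTextSplitter uses (which also tries to split on natural separators), and `chunkText` is a hypothetical helper.

```typescript
// Simplified fixed-window chunker to illustrate chunkSize and chunkOverlap.
// NOT the RecursiveCharacterTextSplitter algorithm; hypothetical helper.
function chunkText(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 2));
// [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last chunkOverlap characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.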
