Reader

Read Documents from In Memory PDF

See ReadDocumentsFromInMemoryPdfChainTest

Read the in-memory PDF into a single document:

// Load the PDF bytes from the test classpath and give them a display name
InMemoryPdf inMemoryPdf = new InMemoryPdf(
    IOUtils.toByteArray(ReadDocumentsFromInMemoryPdfChainTest.class.getResourceAsStream("/pdf/qa/book-of-john-3.pdf")),
    "my-in-memory.pdf");

Stream<Map<String, String>> readDocuments = new ReadDocumentsFromInMemoryPdfChain().run(inMemoryPdf);

// readDocuments contains a single (content, source) pair: the full PDF text and "my-in-memory.pdf"
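
The returned stream can be consumed like any other Java stream of maps. The following is a minimal sketch using only the JDK; it assumes the map keys are named "content" and "source" (the library may expose these names as constants instead), and it uses a hand-built stream as a stand-in for the chain's output. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DocumentStreamSketch {

    // Assumed key names; check the library for the actual constants.
    static final String CONTENT_KEY = "content";
    static final String SOURCE_KEY = "source";

    /** Formats each document as "<source>: <n> characters". */
    static List<String> summarize(Stream<Map<String, String>> documents) {
        return documents
                .map(doc -> doc.get(SOURCE_KEY) + ": "
                        + doc.get(CONTENT_KEY).length() + " characters")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the stream returned by the chain's run(...) call.
        Stream<Map<String, String>> readDocuments = Stream.of(
                Map.of(CONTENT_KEY, "sample text", SOURCE_KEY, "my-in-memory.pdf"));

        summarize(readDocuments).forEach(System.out::println);
    }
}
```

A Java stream can only be traversed once, so collect the documents into a list if they are needed more than once.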

Read each page of the in-memory PDF into its own document:

// Load the PDF bytes from the test classpath and give them a display name
InMemoryPdf inMemoryPdf = new InMemoryPdf(
    IOUtils.toByteArray(ReadDocumentsFromInMemoryPdfChainTest.class.getResourceAsStream("/pdf/qa/book-of-john-3.pdf")),
    "my-in-memory.pdf");

// PdfReadMode.PAGES yields one document per page instead of one per PDF
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromInMemoryPdfChain(PdfReadMode.PAGES).run(inMemoryPdf);

// readDocuments contains one (content, source) pair per PDF page (source is "my-in-memory.pdf" + the page number)

Read Documents from PDF

See ReadDocumentsFromPdfChainTest

Read each PDF in the given directory into one document per file:

Stream<Map<String, String>> readDocuments = new ReadDocumentsFromPdfChain()
    .run(Paths.get("path/to/my/pdf/folder"));

// readDocuments contains one (content, source) pair per PDF (source is the PDF filename)
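
Because `Stream` is `AutoCloseable`, a stream backed by files can be wrapped in try-with-resources so that any close handlers run once the documents have been consumed. Whether the stream returned by `ReadDocumentsFromPdfChain` actually holds open file handles is an assumption here, not something this page states. The sketch below uses only the JDK, a hand-built stand-in stream, an assumed "source" key, and hypothetical class and method names.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CloseableStreamSketch {

    /** Collects the "source" of every document, then closes the stream. */
    static List<String> sources(Stream<Map<String, String>> documents) {
        try (Stream<Map<String, String>> docs = documents) {
            return docs
                    .map(doc -> doc.get("source")) // "source" key name is assumed
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);

        // Stand-in for the chain's output; onClose simulates releasing file handles.
        Stream<Map<String, String>> readDocuments = Stream.of(
                Map.of("content", "first pdf text", "source", "a.pdf"),
                Map.of("content", "second pdf text", "source", "b.pdf"))
                .onClose(() -> closed.set(true));

        System.out.println(sources(readDocuments));
        System.out.println("closed: " + closed.get());
    }
}
```

Closing a stream that has no close handlers is harmless, so the pattern is safe to apply even if the assumption turns out not to hold.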

Read each page of each PDF in the given directory into its own document:

// PdfReadMode.PAGES yields one document per page instead of one per PDF
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromPdfChain(PdfReadMode.PAGES)
    .run(Paths.get("path/to/my/pdf/folder"));

// readDocuments contains one (content, source) pair per PDF page (source is the PDF filename + the page number)
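
With one document per page, the source value can point back to the page where a passage occurs. The following is a minimal JDK-only sketch of that idea: it filters page documents by a search term and returns the matching sources. The "content" and "source" key names, the class and method names, and the page labels in the sample data are all illustrative assumptions; the exact source format the library produces is not shown on this page.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PageSearchSketch {

    /** Returns the sources of all page documents whose content mentions the term (case-insensitive). */
    static List<String> sourcesMentioning(Stream<Map<String, String>> pages, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        return pages
                .filter(page -> page.getOrDefault("content", "")
                        .toLowerCase(Locale.ROOT)
                        .contains(needle))
                .map(page -> page.get("source"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the per-page stream; the source labels are made up.
        Stream<Map<String, String>> readDocuments = Stream.of(
                Map.of("content", "text of the first page", "source", "my.pdf page 1"),
                Map.of("content", "text of the second page", "source", "my.pdf page 2"));

        System.out.println(sourcesMentioning(readDocuments, "second"));
    }
}
```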