Reader
Read Documents from In-Memory PDF
See ReadDocumentsFromInMemoryPdfChainTest
Read the in-memory PDF into a single document
InMemoryPdf inMemoryPdf = new InMemoryPdf(
    IOUtils.toByteArray(ReadDocumentsFromInMemoryPdfChainTest.class.getResourceAsStream("/pdf/qa/book-of-john-3.pdf")),
    "my-in-memory.pdf");
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromInMemoryPdfChain().run(inMemoryPdf);
// readDocuments contains a single (content, source) pair: (pdfContent, "my-in-memory.pdf")
Read each page of the in-memory PDF into its own document
InMemoryPdf inMemoryPdf = new InMemoryPdf(
    IOUtils.toByteArray(ReadDocumentsFromInMemoryPdfChainTest.class.getResourceAsStream("/pdf/qa/book-of-john-3.pdf")),
    "my-in-memory.pdf");
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromInMemoryPdfChain(PdfReadMode.PAGES).run(inMemoryPdf);
// readDocuments contains one (content, source) pair per PDF page (source is "my-in-memory.pdf" + the PDF page number)
Read Documents from PDF
See ReadDocumentsFromPdfChainTest
Read each PDF in the given directory into its own document
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromPdfChain()
    .run(Paths.get("path/to/my/pdf/folder"));
// readDocuments contains one (content, source) pair per PDF (source is the PDF filename)
Read each page of each PDF in the given directory into its own document
Stream<Map<String, String>> readDocuments = new ReadDocumentsFromPdfChain(PdfReadMode.PAGES)
    .run(Paths.get("path/to/my/pdf/folder"));
// readDocuments contains one (content, source) pair per PDF page (source is the PDF filename + the PDF page number)
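Consuming the returned stream

All of the reader chains above return a Stream<Map<String, String>>, so the result can be processed with the standard Java stream API. The following is a minimal sketch of that, not the library's own code. It assumes each document map stores its text under the key "content" and its origin under the key "source"; these key names are inferred from the comments above and may differ in the library, which may expose them as constants instead. The class name, the summarize helper, and the hard-coded stand-in stream are hypothetical and exist only so the example runs without the library.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReadDocumentsUsageSketch {

    // Assumed key names (not confirmed by the library documentation).
    public static final String CONTENT_KEY = "content";
    public static final String SOURCE_KEY = "source";

    /** Builds one "<source>: <n> chars" line per document, in stream order. */
    public static List<String> summarize(Stream<Map<String, String>> documents) {
        return documents
                .map(doc -> doc.get(SOURCE_KEY) + ": " + doc.get(CONTENT_KEY).length() + " chars")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the stream returned by a reader chain's run(...) call.
        Stream<Map<String, String>> readDocuments = Stream.of(
                Map.of(CONTENT_KEY, "text of the first document", SOURCE_KEY, "first.pdf"),
                Map.of(CONTENT_KEY, "text of the second document", SOURCE_KEY, "second.pdf"));

        summarize(readDocuments).forEach(System.out::println);
    }
}
```

A Java stream can be traversed only once, so collect the documents into a List first if they are needed more than once.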