Java Code Examples for org.apache.pdfbox.text.PDFTextStripper#getText()

The following examples show how to use org.apache.pdfbox.text.PDFTextStripper#getText().
Example 1
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/45895768/pdfbox-2-0-7-extracttext-not-working-but-1-8-13-does-and-pdfreader-as-well">
 * PDFBox 2.0.7 ExtractText not working but 1.8.13 does and PDFReader as well
 * </a>
 * <br/>
 * <a href="https://wetransfer.com/downloads/214674449c23713ee481c5a8f529418320170827201941/b2bea6">
 * test-2.pdf
 * </a>
 * <p>
 * Due to the broken <b>ToUnicode</b> maps the output of immediate text
 * extraction from this document is unsatisfying, cf. {@link #testTest2()}.
 * It can be improved by removing these <b>ToUnicode</b> maps as this test
 * shows.
 * </p>
 */
@Test
public void testNoToUnicodeTest2() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("test-2.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
        {
            PDPage page = document.getPage(pageNr);
            PDResources resources = page.getResources();
            removeToUnicodeMaps(resources);
        }

        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);

        System.out.printf("\n*\n* test-2.pdf without ToUnicode\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "test-2_NoToUnicode.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 2
Source File: ExtractTextExample.java    From blog-codes with Apache License 2.0
public static void main(String[] args) throws InvalidPasswordException, IOException {
    try (PDDocument document = PDDocument.load(new File("/home/lili/data/test.pdf"))) {
        if (!document.isEncrypted()) {
            PDFTextStripper tStripper = new PDFTextStripper();
            // To extract only one page or a few pages, restrict the range with the setters below.
            // Currently all pages are extracted (page numbers are 1-based).
            tStripper.setStartPage(1);
            tStripper.setEndPage(document.getNumberOfPages());
            String pdfFileInText = tStripper.getText(document);
            String[] lines = pdfFileInText.split("\\r?\\n");
            for (String line : lines) {
                System.out.println(line);
            }
        }
    }
}
 
Example 3
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/a/56580253/1729265">
 * wen li's answer to "PDFBox extracting paragraphs"
 * </a>
 * <br/>
 * <a href="https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf">
 * PDF32000_2008.pdf
 * </a>
 * <p>
 * Here it looks the other way around compared to what the OP claims:
 * there is a space at the end of all but the last paragraph line.
 * </p>
 */
@Test
public void testPDF32000pageii() throws IOException
{
    try (   InputStream resource = new URL("https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf").openStream();
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(2);
        stripper.setEndPage(2);
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* PDF32000_2008.pdf Page ii\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "PDF32000_2008-page-ii.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 4
Source File: ExtractColorText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/59031734/get-text-color-in-pdfbox">
 * Get text color in PDFBox
 * </a>
 * <p>
 * This test has already been executed for the original color text stripper class from my answer to
 * <a href="https://stackoverflow.com/questions/21430341/identifying-the-text-based-on-the-output-in-pdf-using-pdfbox">
 * Identifying the text based on the output in PDF using PDFBOX
 * </a>
 * </p>
 * 
 * @throws IOException
 */
@Test
public void testExtractFromFurzoSample() throws IOException {
    try (   InputStream resource = getClass().getResourceAsStream("furzo Sample.pdf");
            PDDocument document = Loader.loadPDF(resource) ) {
        PDFTextStripper stripper = new ColorTextStripper();
        String text = stripper.getText(document);

        Files.write(new File(RESULT_FOLDER, "furzo Sample.txt").toPath(), text.getBytes("UTF-8"));

        System.out.println("/// furzo Sample.pdf ///");
        System.out.println("Stripped text with color:");
        System.out.println(">>>");
        System.out.println(text);
        System.out.println("<<<");
    }
}
 
Example 5
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/a/56580253/1729265">
 * wen li's answer to "PDFBox extracting paragraphs"
 * </a>
 * <br/>
 * <a href="https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf">
 * PDF32000_2008.pdf
 * </a>
 * <p>
 * Here one sees that there is not always a space at the end of all
 * the non-last paragraph lines, "PDF/X" is split as "PDF/" and "X"
 * between lines, and there is no space in-between.
 * </p>
 */
@Test
public void testPDF32000pagevii() throws IOException
{
    try (   InputStream resource = new URL("https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf").openStream();
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(7);
        stripper.setEndPage(7);
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* PDF32000_2008.pdf Page vii\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "PDF32000_2008-page-vii.txt").toPath(), Collections.singleton(text));
    }
}
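
The Javadoc comments of Examples 3 and 5 discuss whether the extracted lines of a paragraph end with a space. The following is a minimal sketch, using only the Java standard library, of how one might check this on a string returned by getText(). The class and method names are hypothetical and not part of PDFBox or the cited project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or the cited test projects.
class TrailingSpaceSketch {

    // For each line of the extracted text, report whether it ends with a space.
    static List<Boolean> trailingSpaceFlags(String text) {
        List<Boolean> flags = new ArrayList<>();
        // Split on LF or CRLF line endings.
        for (String line : text.split("\\r?\\n")) {
            flags.add(line.endsWith(" "));
        }
        return flags;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper#getText(document).
        String text = "first line \nsecond line \nlast line\n";
        System.out.println(trailingSpaceFlags(text));
    }
}
```

As Example 5 notes, a missing trailing space does not reliably mark the end of a paragraph, so such flags are at best a hint.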
 
Example 6
Source File: ExtractTextHelper.java    From o2oa with GNU Affero General Public License v3.0
public static String pdf(byte[] bytes) {
	try {
		PDFParser parser = new PDFParser(new RandomAccessBuffer(bytes));
		parser.parse();
		try (COSDocument cos = parser.getDocument(); PDDocument pd = new PDDocument(cos)) {
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setStartPage(1);
			stripper.setEndPage(pd.getNumberOfPages());
			return stripper.getText(pd);
		}
	} catch (Exception e) {
		logger.error(e);
	}
	return null;
}
 
Example 7
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="http://stackoverflow.com/questions/38975091/pdfbox-gettext-not-returning-all-of-the-visible-text">
 * PDFBox getText not returning all of the visible text
 * </a>
 * <br>
 * <a href="https://dl.dropboxusercontent.com/u/14898138/03%20WP%20Enterprise%20BlackBerry%20Compete%20Datasheet_041612%20FINAL%20DRAFT.pdf">
 * 03 WP Enterprise BlackBerry Compete Datasheet_041612 FINAL DRAFT.pdf
 * </a>
 * <p>
 * There is some 'writing' actually done using vector graphics, not text,
 * but aside from that all is accounted for.
 * </p>
 */
@Test
public void test03WpEnterpriseBlackBerryCompeteDatasheet_041612FinalDraft() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("03 WP Enterprise BlackBerry Compete Datasheet_041612 FINAL DRAFT.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* 03 WP Enterprise BlackBerry Compete Datasheet_041612 FINAL DRAFT.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "03 WP Enterprise BlackBerry Compete Datasheet_041612 FINAL DRAFT.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 8
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="http://stackoverflow.com/questions/37862159/pdf-reading-via-pdfbox-in-java">
 * pdf reading via pdfbox in java 
 * </a>
 * <br/>
 * <a href="https://drive.google.com/file/d/0B_Ke2amBgdpebm96U05FcWFsSXM/view?usp=sharing">
 * Bal_532935_0314.pdf
 * </a>
 * <p>
 * The issue here is caused by PDFBox guessing an encoding. The underlying method
 * `PDFTextStreamEngine.showGlyph` does this for all unmappable glyphs from simple
 * fonts.
 * </p>
 */
@Test
public void testBal_532935_0314() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Bal_532935_0314.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* Bal_532935_0314.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "Bal_532935_0314.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 9
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/51672080/pdfdomtree-does-not-detecting-white-spaces-while-converting-a-pdf-file-to-html">
 * PDFDomTree does not detecting white spaces while converting a pdf file to html
 * </a>
 * <br/>
 * <a href="https://drive.google.com/file/d/1SZNFCvGVbQzCxJiRr8HlW99ravC_Cm71/view?usp=sharing">
 * demo.pdf
 * </a>
 * <p>
 * PDFBox shows no issue extracting the text from the given file.
 * </p>
 */
@Test
public void testDemo() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("demo.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* demo.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "demo.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 10
Source File: SearchSubword.java    From testarea-pdfbox2 with Apache License 2.0
List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            System.out.printf("  -- %s\n", text);

            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
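
The core of Example 10 is the indexOf loop that collects every occurrence of the search term, including overlapping ones. Below is a minimal stand-alone sketch of that loop on plain strings, without PDFBox. The class and method names are hypothetical. The guard against an empty search term is an addition of this sketch: indexOf with an empty string never returns -1, so the loop would otherwise not terminate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone version of the search loop; not part of the cited project.
class FindAllSketch {

    // Return the start index of every (possibly overlapping) occurrence of searchTerm.
    static List<Integer> findAll(String string, String searchTerm) {
        List<Integer> starts = new ArrayList<>();
        if (searchTerm.isEmpty()) {
            // Added guard: an empty term matches at every position.
            return starts;
        }
        int fromIndex = 0;
        int index;
        while ((index = string.indexOf(searchTerm, fromIndex)) > -1) {
            starts.add(index);
            // Advance by one character only, so overlapping hits are found too.
            fromIndex = index + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("banana", "ana")); // prints [1, 3]
    }
}
```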
 
Example 11
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/47515609/invalid-block-type-while-using-pdfbox-2-0-8">
 * Invalid block type while using pdfbox 2.0.8
 * </a>
 * <br>
 * <a href="https://www.dropbox.com/s/xjeksj0cay4x3vo/NoTemplateInError.pdf?dl=0">
 * NoTemplateInError.pdf
 * </a>
 * <p>
 * The issue cannot be reproduced.
 * </p>
 */
@Test
public void testNoTemplateInError() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("NoTemplateInError.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* NoTemplateInError.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "NoTemplateInError.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 12
Source File: ExtractTextTools.java    From o2oa with GNU Affero General Public License v3.0
public static String pdf(byte[] bytes) {
	try {
		PDFParser parser = new PDFParser(new RandomAccessBuffer(bytes));
		parser.parse();
		try (COSDocument cos = parser.getDocument(); PDDocument pd = new PDDocument(cos)) {
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setStartPage(1);
			stripper.setEndPage(pd.getNumberOfPages());
			return stripper.getText(pd);
		}
	} catch (Exception e) {
		logger.error(e);
	}
	return null;
}
 
Example 13
Source File: PDFExtractionExample.java    From Java-for-Data-Science with MIT License
public static void main(String[] args) {
    // try-with-resources closes the document once the text has been extracted
    try (PDDocument document = PDDocument.load(new File("PDF File.pdf"))) {
        PDFTextStripper stripper = new PDFTextStripper();
        String documentText = stripper.getText(document);
        System.out.println(documentText);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
 
Example 14
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/53551335/java-does-pdfbox-have-an-option-to-open-file-instead-of-loading-it">
 * Java- Does pdfBox have an option to open file instead of loading it?
 * </a>
 * <br/>
 * <a href="https://www.dropbox.com/s/osyk2ieoq6od2p8/10-million-password-list-top-1000000.pdf?dl=0">
 * 10-million-password-list-top-1000000.pdf
 * </a>
 * <p>
 * In contrast to the OP I did not need to fiddle with the memory
 * settings at all for a plain extraction. Furthermore, I got 999999
 * lines with words and 3 empty lines from the file, not 10000000
 * passwords.
 * </p>
 */
@Test
public void test10MillionPasswordListTop1000000() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("10-million-password-list-top-1000000.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* 10-million-password-list-top-1000000.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "10-million-password-list-top-1000000.txt").toPath(), Collections.singleton(text));
    }
}
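
The Javadoc of Example 14 reports counts of lines with words and of empty lines in the extracted text. The following is a minimal sketch of how such counts could be obtained from the returned string, using only the Java standard library (Java 11 or later for String#lines and String#isBlank). The class and method names are hypothetical, and treating whitespace-only lines as empty is an assumption of this sketch.

```java
// Hypothetical helper, not part of PDFBox or the cited test project.
class LineCountSketch {

    // Returns a two-element array: { lines with content, empty lines }.
    static long[] countLines(String text) {
        // String#lines splits on \n, \r, or \r\n and ignores a final line terminator.
        long withContent = text.lines().filter(line -> !line.isBlank()).count();
        // Assumption: whitespace-only lines count as empty.
        long empty = text.lines().filter(String::isBlank).count();
        return new long[] { withContent, empty };
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper#getText(document).
        long[] counts = countLines("123456\n\npassword\n");
        System.out.printf("%d lines with content, %d empty lines%n", counts[0], counts[1]);
    }
}
```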
 
Example 15
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue">
 * PDFBox - Removing invisible text (by clip/filling paths issue)
 * </a>
 * <br/>
 * <a href="https://drive.google.com/open?id=1xcZOusx3cEdZX4AT8QAVDqZe33YWla0H">
 * test.pdf
 * </a> as testDmitryK.pdf
 * <p>
 * Indeed, using the original {@link PDFVisibleTextStripper} implementation
 * a lot of visible characters were dropped. This was due to the incorrect
 * calculation of the <code>end</code> of the character baseline in the methods
 * {@link PDFVisibleTextStripper#processTextPosition(org.apache.pdfbox.text.TextPosition)}
 * and {@link PDFVisibleTextStripper#deleteCharsInPath()}.
 * </p>
 * <p>
 * After patching those {@link PDFVisibleTextStripper} methods to make use of
 * <code>end</code> only optionally, running the test with that option results
 * in a decent extraction of visible text.
 * </p>
 */
@Test
public void testTestDmitryK() throws IOException {
    try (   InputStream resource = getClass().getResourceAsStream("testDmitryK.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* testDmitryK.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "testDmitryK.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 16
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/47358127/remove-invisible-text-from-pdf-using-pdfbox">
 * remove invisible text from pdf using pdfbox
 * </a>
 * <br/>
 * <a href="https://drive.google.com/file/d/1F8vrzcABwxVGdN5W-7etQggY5xKtGplU/view">
 * RevTeaser09072016.pdf
 * </a>
 * <p>
 * This class tests the {@link PDFVisibleTextStripper} to ignore text hidden
 * by clipping or by covering with a filled path in the OP's sample document.
 * </p>
 */
@Test
public void testExtractFromRevTeaser09072016() throws IOException {
    try (   InputStream resource = getClass().getResourceAsStream("RevTeaser09072016.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper(true);
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* RevTeaser09072016.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "RevTeaser09072016.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 17
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/49746202/read-pdf-file-using-pdfbox-in-utf-8-in-java-scala">
 * Read pdf file using pdfbox in UTF-8 in java/scala
 * </a>
 * <br/>
 * <a href="https://1drv.ms/b/s!AmHcFaD-gMGyhg6eyqSy2gu9sLWl">
 * test.pdf
 * </a> as testKabirManandhar.pdf
 * <p>
 * The issue can be reproduced. The cause is incomplete ToUnicode
 * maps. There is an option, though: The embedded font programs
 * appear to include more complete mappings, so repairing the
 * ToUnicode table seems feasible.
 * </p>
 */
@Test
public void testTestKabirManandhar() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("testKabirManandhar.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* testKabirManandhar.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "testKabirManandhar.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 18
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://github.com/mkl-public/testarea-pdfbox2/issues/3">
 * One case fails to remove invisible texts or symbols
 * </a>
 * <br/>
 * <a href="https://github.com/mkl-public/testarea-pdfbox2/files/2481423/00000000000005fw6q.pdf">
 * 00000000000005fw6q.pdf
 * </a>
 * <p>
 * The "hidden text" recognized by Adobe here is only "hidden"
 * because it uses a glyph (page 1, Font F9, code 0000) for which
 * the embedded font draws nothing but which ToUnicode maps to
 * U+DBD0, a High Private Use Surrogate which by itself in general
 * makes no sense.
 * </p>
 */
@Test
public void test00000000000005fw6q() throws IOException {
    try (   InputStream resource = getClass().getResourceAsStream("00000000000005fw6q.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* 00000000000005fw6q.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "00000000000005fw6q.txt").toPath(), Collections.singleton(text));
    }
}
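
The Javadoc of Example 18 points out that the ToUnicode map yields U+DBD0, a high surrogate that is meaningless on its own. The following is a minimal sketch, using only the Java standard library, of how unpaired surrogates could be detected and dropped from an extracted string. The class and method names are hypothetical, and whether dropping such characters is appropriate depends on the use case.

```java
// Hypothetical post-processing helper, not part of PDFBox or the cited project.
class LoneSurrogateSketch {

    // Copy the text, keeping valid surrogate pairs and dropping unpaired surrogates.
    static String dropLoneSurrogates(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Valid pair: keep both code units and skip the low surrogate.
                result.append(c).append(text.charAt(++i));
            } else if (!Character.isSurrogate(c)) {
                // Ordinary character from the Basic Multilingual Plane.
                result.append(c);
            }
            // Otherwise: an unpaired surrogate such as U+DBD0, which is dropped.
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Stand-in for extracted text that contains the lone surrogate U+DBD0.
        String text = "visible\uDBD0 text";
        System.out.println(dropLoneSurrogates(text)); // prints "visible text"
    }
}
```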
 
Example 19
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/45895768/pdfbox-2-0-7-extracttext-not-working-but-1-8-13-does-and-pdfreader-as-well">
 * PDFBox 2.0.7 ExtractText not working but 1.8.13 does and PDFReader as well
 * </a>
 * <br/>
 * <a href="https://wetransfer.com/downloads/214674449c23713ee481c5a8f529418320170827201941/b2bea6">
 * test-2.pdf
 * </a>
 * <p>
 * Due to the broken <b>ToUnicode</b> maps the output of this test is
 * unsatisfying. It can be improved by removing these <b>ToUnicode</b>
 * maps, cf. {@link #testNoToUnicodeTest2()}.
 * </p>
 */
@Test
public void testTest2() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("test-2.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);

        System.out.printf("\n*\n* test-2.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "test-2.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 20
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/54644435/error-when-extracting-text-from-pdf-using-pdfbox">
 * Error when extracting text from pdf using pdfbox
 * </a>
 * <br/>
 * <a href="http://ishouhuo.cn/cannotExtract.pdf">
 * cannotExtract.pdf
 * </a>
 * <p>
 * Indeed, all information required for text extraction is missing from the font
 * PingFangSC in all its variants. Thus, text extraction results are inevitably
 * lacking.
 * </p>
 */
@Test
public void testCannotExtract() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("cannotExtract.pdf");
            PDDocument document = Loader.loadPDF(resource)    )
    {
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);

        System.out.printf("\n*\n* cannotExtract.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "cannotExtract.txt").toPath(), Collections.singleton(text));
    }
}