Java Code Examples for org.apache.pdfbox.text.PDFTextStripper#setSortByPosition()

The following examples show how to use org.apache.pdfbox.text.PDFTextStripper#setSortByPosition().
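As background, the order in which text is written into a PDF content stream need not match the order in which it appears on the page, and sorting by position is meant to address that. The following is a minimal plain-Java sketch of the idea only, not PDFBox's implementation: the Fragment class and both method names are hypothetical, and PDFBox's own ordering is more involved (it also has to deal with text direction, for instance).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox type. */
    static final class Fragment {
        final String text;
        final float x; // distance from the left edge (assumed convention)
        final float y; // distance from the top edge (assumed convention)

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates fragments in the order they were given (the "stream order"). */
    static String readInStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Concatenates fragments top to bottom, then left to right. */
    static String readInVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments); // leave the input untouched
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return readInStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // A writer might emit the second line first, then the first line right to left.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("second line", 0f, 30f),
                new Fragment("world. ", 40f, 10f),
                new Fragment("Hello ", 0f, 10f));
        System.out.println(readInStreamOrder(fragments));
        System.out.println(readInVisualOrder(fragments));
    }
}
```

The sketch compares exact coordinates, so two fragments on the same visual line with slightly different y values would be treated as separate lines; a real implementation needs some tolerance there.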
Example 1
Source File: PrintTextLocations.java    From blog-codes with Apache License 2.0
public static void main(String[] args) throws IOException {
	PDDocument document = null;
	try {
		document = PDDocument.load(new File("/home/lili/data/test.pdf"));

		PDFTextStripper stripper = new PrintTextLocations();
		stripper.setSortByPosition(true);
		stripper.setStartPage(1); // page numbers are 1-based
		stripper.setEndPage(document.getNumberOfPages());

		Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
		stripper.writeText(document, dummy);
	} finally {
		if (document != null) {
			document.close();
		}
	}
}
 
Example 2
Source File: GetLinesFromPDF.java    From blog-codes with Apache License 2.0
/**
 * @throws IOException If there is an error parsing the document.
 */
public static void main( String[] args ) throws IOException {
    PDDocument document = null;
    String fileName = "/home/lili/data/test.pdf";
    try {
        document = PDDocument.load( new File(fileName) );
        PDFTextStripper stripper = new GetLinesFromPDF();
        stripper.setSortByPosition( true );
        stripper.setStartPage( 1 ); // page numbers are 1-based
        stripper.setEndPage( document.getNumberOfPages() );
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        stripper.writeText(document, dummy);
        
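        // print lines; "lines" is not declared in this snippet and is presumably a
        // field of GetLinesFromPDF that an overridden writeString fills (not shown)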
        for(String line:lines){
            System.out.println(line); 
        }
    }
    finally {
        if( document != null ) {
            document.close();
        }
    }
}
 
Example 3
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
     * <a href="https://stackoverflow.com/questions/53773479/java-rotated-file-extraction">
     * java- rotated file extraction?
     * </a>
     * <br/>
     * <a href="https://www.dropbox.com/s/g1pe8zb9m5kajif/lol.pdf?dl=0">
     * lol.pdf
     * </a>
     * <p>
     * Indeed, regular text extraction results in many lines, essentially
     * one for each text chunk. One can improve this in two ways: either
     * activate sorting or remove the Rotate entries from the page
     * dictionaries.
     * </p>
     */
    @Test
    public void testLol() throws IOException
    {
        // open the document in the try-with-resources header so that it is closed, too
        try (   InputStream resource = getClass().getResourceAsStream("lol.pdf");
                PDDocument document = Loader.loadPDF(resource)    )
        {
// Option 1: Remove Rotate entries
//            for (PDPage page : document.getPages()) {
//                page.setRotation(0);
//            }

            PDFTextStripper stripper = new PDFTextStripper();
// Option 2: Sort by position
            stripper.setSortByPosition(true);
            String text = stripper.getText(document);

            System.out.printf("\n*\n* lol.pdf\n*\n%s\n", text);
            Files.write(new File(RESULT_FOLDER, "lol.txt").toPath(), Collections.singleton(text));
        }
    }
 
Example 4
Source File: SearchSubword.java    From testarea-pdfbox2 with Apache License 2.0
List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            System.out.printf("  -- %s\n", text);

            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
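The inner loop of this example advances fromIndex by one after each hit, so it also reports overlapping occurrences of the search term. A minimal plain-Java sketch of just that loop is shown here, under the assumption that only start offsets are of interest; the class and method names are hypothetical. The guard for an empty search term is an addition: with an empty needle, indexOf keeps returning a match, so the loop would not terminate.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingMatchesSketch {

    /** Returns the start offsets of all, possibly overlapping, occurrences of needle. */
    static List<Integer> findAll(String haystack, String needle) {
        List<Integer> starts = new ArrayList<>();
        if (needle.isEmpty()) {
            return starts; // an empty needle would match forever, see lead-in
        }
        int fromIndex = 0;
        int index;
        while ((index = haystack.indexOf(needle, fromIndex)) > -1) {
            starts.add(index);
            fromIndex = index + 1; // step by one, not by needle.length(), to keep overlaps
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("banana", "ana")); // prints [1, 3]
    }
}
```

Advancing by needle.length() instead would return only non-overlapping hits; which behaviour is wanted depends on the use case.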
 
Example 5
Source File: ExtractWordCoordinates.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/50330484/could-someone-give-me-an-example-of-how-to-extract-coordinates-for-a-word-usin">
 * Could someone give me an example of how to extract coordinates for a 'word' using PDFBox
 * </a>
 * <br/>
 * <a href="https://www.tutorialkart.com/pdfbox/how-to-get-location-and-size-of-images-in-pdf/attachment/apache-pdf/">
 * apache.pdf
 * </a>
 * <p>
 * This test shows how to extract word coordinates combining the ideas of
 * the two tutorials referenced by the OP.
 * </p>
 */
@Test
public void testExtractWordsForGoodJuJu() throws IOException {
    // open the document in the try-with-resources header so that it is closed, too
    try (   InputStream resource = getClass().getResourceAsStream("apache.pdf");
            PDDocument document = Loader.loadPDF(resource)) {
        PDFTextStripper stripper = new GetWordLocationAndSize();
        stripper.setSortByPosition( true );
        stripper.setStartPage( 1 ); // page numbers are 1-based
        stripper.setEndPage( document.getNumberOfPages() );
 
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        stripper.writeText(document, dummy);
    }
}
 
Example 6
Source File: ExtractText.java    From testarea-pdfbox2 with Apache License 2.0
/**
     * <a href="https://stackoverflow.com/questions/54822124/pdftextstripperbyarea-and-pdftextstripper-parsing-different-text-output-for-tabl">
     * PDFTextStripperByArea and PDFTextStripper parsing different Text Output for Table with Merged Cell or Table cell with multi-line text content
     * </a>
     * <br/>
     * <a href="https://www4.esc13.net/uploads/webccat/docs/PDFTables_12142005.pdf">
     * PDFTables_12142005.pdf
     * </a>
     * <p>
     * Cannot reproduce the problem, and the OP does not react to clarification requests.
     * </p>
     */
    @Test
    public void testPDFTables_12142005() throws IOException {
        // open the document in the try-with-resources header so that it is closed, too
        try (   InputStream resource = getClass().getResourceAsStream("PDFTables_12142005.pdf");
                PDDocument document = Loader.loadPDF(resource)    )
        {

            PDFTextStripper textStripper = new PDFTextStripper();
            textStripper.setSortByPosition(true);
            textStripper.setAddMoreFormatting(false);
            // textStripper.setSpacingTolerance(1.5F);
            //textStripper.setAverageCharTolerance(averageCharToleranceValue);

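            // stripper page numbers are 1-based, so this selects the second page,
            // the same page that document.getPage(1) (0-based index) refers to below
            textStripper.setStartPage(2);
            textStripper.setEndPage(2);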

            textStripper.getCurrentPage();
            String text = textStripper.getText(document).trim();
            System.out.println("PDF text is: " + "\n" + text.trim());

            System.out.println("----------------------------------------------------------------");

            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSortByPosition(true);
            stripper.setAddMoreFormatting(false);
            // stripper.setSpacingTolerance(1.5F);

            Dimension dimension = new Dimension();
            dimension.setSize(document.getPage(1).getMediaBox().getWidth(),
                    document.getPage(1).getMediaBox().getHeight());
//            Rectangle2D rect = toJavaRect(document.getBleedBox(), dimension);
//            Rectangle2D rect1 = toJavaRect(document.getArtBox(), dimension);
            PDRectangle mediaBox = document.getPage(1).getMediaBox();
            Rectangle2D rect = new Rectangle2D.Float(mediaBox.getLowerLeftX(), mediaBox.getLowerLeftY(), mediaBox.getWidth(), mediaBox.getHeight());
            Rectangle2D rect1 = rect;

            /*
             * Rectangle2D rect = new
             * Rectangle2D.Float(document.getBleedBox().getLowerLeftX(),
             * document.getBleedBox().getLowerLeftY(), document.getBleedBox().getWidth(),
             * document.getBleedBox().getHeight());
             */

            /*
             * Rectangle2D rect1 = new
             * Rectangle2D.Float(document.getArtBox().getLowerLeftX(),
             * document.getArtBox().getLowerLeftY(), document.getArtBox().getWidth(),
             * document.getArtBox().getHeight());
             */

            /*
             * Rectangle2D rect = new
             * Rectangle2D.Float(document.getBleedBox().getLowerLeftX(),
             * document.getBleedBox().getUpperRightY(), document.getBleedBox().getWidth(),
             * document.getBleedBox().getHeight());
             */

            System.out.println("Rectangle bleedBox Content : " + "\n" + rect);
            System.out.println("----------------------------------------------------------------");
            System.out.println("Rectangle artBox Content : " + "\n" + rect1);
            System.out.println("----------------------------------------------------------------");
            stripper.addRegion("Test1", rect);
            stripper.addRegion("Test2", rect1);
            stripper.extractRegions(document.getPage(1));

            System.out.println("Text in the area-BleedBox : " + "\n" + stripper.getTextForRegion("Test1").trim());
            System.out.println("----------------------------------------------------------------");
            System.out.println("Text in the area1-ArtBox : " + "\n" + stripper.getTextForRegion("Test2").trim());
            System.out.println("----------------------------------------------------------------");

            StringBuilder artPlusBleedBox = new StringBuilder();
            artPlusBleedBox.append(stripper.getTextForRegion("Test2").trim());
            artPlusBleedBox.append("\r\n");
            artPlusBleedBox.append(stripper.getTextForRegion("Test1").trim());

            System.out.println("Whole Page Text : " + artPlusBleedBox);
            System.out.println("----------------------------------------------------------------");
            text = new String(text.trim().getBytes(), "UTF-8");
            String text2 = new String(artPlusBleedBox.toString().trim().getBytes(), "UTF-8");
            System.out.println(" Matches equals with Both Content : " + text.equals(artPlusBleedBox.toString()));
            System.out.println(" String Matches equals with Both Content : " + text.equalsIgnoreCase(text2));
        }
    }
 
Example 7
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue">
 * PDFBox - Removing invisible text (by clip/filling paths issue)
 * </a>
 * <br/>
 * <a href="https://drive.google.com/open?id=1xcZOusx3cEdZX4AT8QAVDqZe33YWla0H">
 * test.pdf
 * </a> as testDmitryK.pdf
 * <p>
 * Indeed, using the original {@link PDFVisibleTextStripper} implementation
 * a lot of visible characters were dropped. This was due to the incorrect
 * calculation of the <code>end</code> of the character baseline in the methods
 * {@link PDFVisibleTextStripper#processTextPosition(org.apache.pdfbox.text.TextPosition)}
 * and {@link PDFVisibleTextStripper#deleteCharsInPath()}.
 * </p>
 * <p>
 * After patching those {@link PDFVisibleTextStripper} methods to make use of
 * <code>end</code> only optionally, running the test with that option results
 * in a decent extraction of visible text.
 * </p>
 */
@Test
public void testTestDmitryK() throws IOException {
    // open the document in the try-with-resources header so that it is closed, too
    try (   InputStream resource = getClass().getResourceAsStream("testDmitryK.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* testDmitryK.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "testDmitryK.txt").toPath(), Collections.singleton(text));
    }
}
 
Example 8
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue">
 * PDFBox - Removing invisible text (by clip/filling paths issue)
 * </a>
 * <br/>
 * <a href="https://drive.google.com/open?id=1l0Yt9BJXs09bXcBD7pDbxFiZQQqnuaan">
 * test2.pdf
 * </a> as test2DmitryK.pdf
 * <p>
 * Indeed, even the {@link PDFVisibleTextStripper} implementation as originally
 * improved for {@link #testTestDmitryK()} failed for this document. The cause
 * is another normalization by PDFBox text stripping moving the origin into the
 * lower left corner of the crop box.
 * </p>
 * <p>
 * Patching the {@link PDFVisibleTextStripper} methods to add the lower left
 * crop box coordinate values again results in a decent extraction of visible
 * text.
 * </p>
 */
@Test
public void testTest2DmitryK() throws IOException {
    // open the document in the try-with-resources header so that it is closed, too
    try (   InputStream resource = getClass().getResourceAsStream("test2DmitryK.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* test2DmitryK.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "test2DmitryK.txt").toPath(), Collections.singleton(text));
    }
}
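The Javadoc of this example describes the fix as adding the lower left crop box coordinate values back onto the normalized text coordinates. As a worked illustration of that arithmetic only, here is a minimal plain-Java sketch; the class, the method, and the sample numbers are hypothetical and not taken from PDFVisibleTextStripper.

```java
class CropBoxOffsetSketch {

    /**
     * Converts coordinates measured from the lower left corner of the crop box
     * back into page coordinates by adding the crop box origin again.
     * Returns a two-element array {x, y}.
     */
    static float[] toPageCoordinates(float x, float y,
                                     float cropLowerLeftX, float cropLowerLeftY) {
        return new float[] { x + cropLowerLeftX, y + cropLowerLeftY };
    }

    public static void main(String[] args) {
        // Assumed sample values: a crop box whose lower left corner is at (36, 72).
        float[] p = toPageCoordinates(10f, 20f, 36f, 72f);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

For a crop box whose lower left corner sits at the origin the offset is zero, which would explain why such a normalization can go unnoticed with many documents.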
 
Example 9
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://github.com/mkl-public/testarea-pdfbox2/issues/3">
 * One case fails to remove invisible texts or symbols
 * </a>
 * <br/>
 * <a href="https://github.com/mkl-public/testarea-pdfbox2/files/2481423/00000000000005fw6q.pdf">
 * 00000000000005fw6q.pdf
 * </a>
 * <p>
 * The "hidden text" recognized by Adobe here is only "hidden"
 * because it uses a glyph (page 1, Font F9, code 0000) for which
 * the embedded font draws nothing but which ToUnicode maps to
 * U+DBD0, a High Private Use Surrogate which by itself in general
 * makes no sense.
 * </p>
 */
@Test
public void test00000000000005fw6q() throws IOException {
    // open the document in the try-with-resources header so that it is closed, too
    try (   InputStream resource = getClass().getResourceAsStream("00000000000005fw6q.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* 00000000000005fw6q.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "00000000000005fw6q.txt").toPath(), Collections.singleton(text));
    }
}
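The Javadoc of this example notes that the glyph is mapped to U+DBD0, a high surrogate that makes no sense on its own. If such characters are unwanted in extracted text, one conceivable post-processing step is to drop surrogates that are not part of a valid pair. The following plain-Java sketch shows that idea; it is an assumption for illustration, not something the test or PDFVisibleTextStripper is stated to do, and the class and method names are hypothetical.

```java
class LoneSurrogateSketch {

    /** Removes surrogate code units that do not form a valid high/low pair. */
    static String stripLoneSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean validPair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (validPair) {
                out.append(c).append(s.charAt(i + 1)); // keep the pair intact
                i++;                                   // skip the low surrogate
            } else if (!Character.isSurrogate(c)) {
                out.append(c);                         // ordinary character
            }
            // otherwise: a lone high or low surrogate, silently dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "a" + (char) 0xDBD0 + "b"; // lone high surrogate in the middle
        System.out.println(stripLoneSurrogates(extracted)); // prints ab
    }
}
```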
 
Example 10
Source File: ExtractVisibleText.java    From testarea-pdfbox2 with Apache License 2.0
/**
 * <a href="https://stackoverflow.com/questions/59920280/pdfbox-2-0-invisible-lines-on-rotated-page-clip-path-issue">
 * PDFBox 2.0: invisible lines on rotated page - clip path issue
 * </a>
 * <br/>
 * <a href="https://drive.google.com/open?id=1Ex03HhDz17xQlsiTIY1cxaT_cb3nyTf3">
 * 1.pdf
 * </a>
 * <p>
 * Indeed, a number of lines get dropped. Analysis shows that a glyph whose
 * origin is positioned right on the clip path border may be dropped. This is
 * because the glyph origin and the clip path are processed differently and
 * therefore carry different numerical errors.
 * </p>
 * @see #testFat1()
 */
@Test
public void test1() throws IOException {
    // open the document in the try-with-resources header so that it is closed, too
    try (   InputStream resource = getClass().getResourceAsStream("1.pdf");
            PDDocument document = Loader.loadPDF(resource)  ) {
        PDFTextStripper stripper = new PDFVisibleTextStripper(false, new PrintStream(new File(RESULT_FOLDER, "1-drops.txt")));
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);

        System.out.printf("\n*\n* 1.pdf\n*\n%s\n", text);
        Files.write(new File(RESULT_FOLDER, "1.txt").toPath(), Collections.singleton(text));
    }
}
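The Javadoc of this last example attributes the dropped lines to glyph origins lying right on the clip path border. The effect can be reproduced with plain java.awt.geom types: a strict containment test rejects a point on the right or bottom edge of a rectangle, while testing a small square around the point tolerates small deviations. This is only a sketch of one conceivable mitigation, with a rectangular clip area assumed for simplicity; the class, the method names, and the tolerance value are hypothetical, and it is not claimed to be what PDFVisibleTextStripper does.

```java
import java.awt.geom.Rectangle2D;

class ClipToleranceSketch {

    /** Strict test: the point itself must lie inside the clip rectangle. */
    static boolean strictlyInside(Rectangle2D clip, double x, double y) {
        return clip.contains(x, y);
    }

    /** Tolerant test: a small square of half-width eps around the point must touch the clip interior. */
    static boolean insideWithTolerance(Rectangle2D clip, double x, double y, double eps) {
        return clip.intersects(x - eps, y - eps, 2 * eps, 2 * eps);
    }

    public static void main(String[] args) {
        Rectangle2D clip = new Rectangle2D.Double(0, 0, 100, 100);
        // A glyph origin exactly on the right border of the clip area.
        System.out.println(strictlyInside(clip, 100, 50));            // prints false
        System.out.println(insideWithTolerance(clip, 100, 50, 0.01)); // prints true
        // A point clearly outside stays outside.
        System.out.println(insideWithTolerance(clip, 101, 50, 0.01)); // prints false
    }
}
```

The tolerance trades false drops for false keeps: text just outside the clip area by less than eps would now be treated as visible.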