
Reading Text, Images, and Lines with Their Coordinates from a PDF in Java

Date: 2020-11-01 23:17:38


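Everything in a PDF document is positioned by coordinates. The content consists mainly of text, images, and lines, and a table can be detected by examining where the lines are. The PDFBox API does not make it very convenient to read this content together with its coordinates.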
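To illustrate how line positions can reveal a table, here is a minimal sketch that derives row and column boundaries from ruling lines. The class TableGrid, its methods, and the 1.0 tolerance are hypothetical names and values chosen for illustration; the coordinates are copied from the line output of the sample run in this article. It assumes the table is drawn with straight horizontal and vertical rules, which will not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: derives table row/column boundaries from ruling lines. */
class TableGrid {

    /** A straight segment in page coordinates. */
    static final class Line {
        final float x1, y1, x2, y2;

        Line(float x1, float y1, float x2, float y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Sorts the values and merges neighbours that lie within tol of each other. */
    static List<Float> cluster(List<Float> values, float tol) {
        List<Float> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Float> merged = new ArrayList<>();
        for (float v : sorted) {
            if (merged.isEmpty() || v - merged.get(merged.size() - 1) > tol) {
                merged.add(v);
            }
        }
        return merged;
    }

    /** Y positions of the horizontal lines, i.e. the row boundaries. */
    static List<Float> rowBoundaries(List<Line> lines, float tol) {
        List<Float> ys = new ArrayList<>();
        for (Line l : lines) {
            if (Math.abs(l.y1 - l.y2) <= tol && Math.abs(l.x1 - l.x2) > tol) {
                ys.add(l.y1);
            }
        }
        return cluster(ys, tol);
    }

    /** X positions of the vertical lines, i.e. the column boundaries. */
    static List<Float> columnBoundaries(List<Line> lines, float tol) {
        List<Float> xs = new ArrayList<>();
        for (Line l : lines) {
            if (Math.abs(l.x1 - l.x2) <= tol && Math.abs(l.y1 - l.y2) > tol) {
                xs.add(l.x1);
            }
        }
        return cluster(xs, tol);
    }

    /** Line coordinates copied from the sample run in this article. */
    static List<Line> sampleLines() {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line(84.36f, 321.83997f, 510.94f, 321.83997f));
        lines.add(new Line(84.36f, 353.54f, 510.94f, 353.54f));
        lines.add(new Line(84.36f, 385.24f, 510.94f, 385.24f));
        lines.add(new Line(84.6f, 321.59998f, 84.6f, 385.0f));
        lines.add(new Line(191.1f, 321.59998f, 191.1f, 385.0f));
        lines.add(new Line(297.6f, 321.59998f, 297.6f, 385.0f));
        lines.add(new Line(404.15f, 321.59998f, 404.15f, 385.0f));
        lines.add(new Line(510.7f, 321.59998f, 510.7f, 385.0f));
        return lines;
    }

    public static void main(String[] args) {
        List<Line> lines = sampleLines();
        List<Float> rows = rowBoundaries(lines, 1.0f);
        List<Float> cols = columnBoundaries(lines, 1.0f);
        // n boundaries enclose n - 1 rows or columns; prints "rows: 2, columns: 4"
        System.out.println("rows: " + (rows.size() - 1) + ", columns: " + (cols.size() - 1));
    }
}
```

Three horizontal and five vertical rules enclose a grid of 2 rows by 4 columns, which matches the header row and the data row of the sample table.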

Pdf2Dom, which is built on the Apache PDFBox library, converts a PDF to HTML that is rendered with absolute coordinates.
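To make "HTML rendered with absolute coordinates" concrete, the sketch below turns one positioned text fragment into an absolutely positioned element. This is only an illustration of the idea: AbsoluteHtml and textDiv are hypothetical names, the point unit is an assumption, and the markup is not claimed to be what Pdf2Dom actually emits.

```java
import java.util.Locale;

/** Hypothetical illustration of absolute positioning; not Pdf2Dom's real markup. */
class AbsoluteHtml {

    /** Escapes the characters that would otherwise break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a div placed at (x, top) with the given size, all in points (assumed unit). */
    static String textDiv(String text, float x, float top, float width, float height) {
        return String.format(Locale.ROOT,
                "<div style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                        + "width:%.1fpt;height:%.1fpt\">%s</div>",
                x, top, width, height, escape(text));
    }

    public static void main(String[] args) {
        // Values taken from the "HTML" text fragment in this article's sample run.
        System.out.println(textDiv("HTML", 133f, 112f, 29f, 13f));
    }
}
```

Because every fragment carries its own position, the page layout is preserved without any flow layout, which is also why the coordinates are easy to read back out.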

Contents of the PDF document to be parsed:

Two dependencies are required: pdfbox and pdf2dom.

MyPdf.java, the code that parses the PDF:

package com.penngo.pdf;

// Import paths assume PDFBox 2.x and Pdf2Dom's org.fit.pdfdom packages.
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.fit.pdfdom.PathSegment;
import org.fit.pdfdom.TextMetrics;
import org.fit.pdfdom.resource.ImageResource;

// Extends PDFDomTree and logs every text run, line, and image with its coordinates.
public class MyPdf extends PDFDomTree {

    public MyPdf() throws IOException {
        super();
    }

    @Override
    protected void startNewPage() {
        System.out.println("====page:" + pagecnt);
        super.startNewPage();
    }

    @Override
    protected void renderText(String data, TextMetrics metrics) {
        System.out.println("====text:" + data
                + ",x:" + (int) metrics.getX()
                + ",top:" + (int) metrics.getTop()
                + ",width:" + (int) metrics.getWidth()
                + ",height:" + (int) metrics.getHeight());
        curpage.appendChild(createTextElement(data, metrics.getWidth()));
    }

    @Override
    protected void renderPath(List<PathSegment> path, boolean stroke, boolean fill) throws IOException {
        // Log every segment of the path, not only the first one.
        System.out.println("====line size:" + path.size());
        for (int i = 0; i < path.size(); i++) {
            PathSegment seg = path.get(i);
            System.out.println("====line" + i
                    + ",x1:" + seg.getX1() + ",y1:" + seg.getY1()
                    + ",x2:" + seg.getX2() + ",y2:" + seg.getY2()
                    + ",stroke:" + stroke + ",fill:" + fill);
        }
        super.renderPath(path, stroke, fill);
    }

    @Override
    protected void renderImage(float x, float y, float width, float height, ImageResource resource) throws IOException {
        System.out.println("====image:x:" + x + ",y:" + y + ",width:" + width + ",height:" + height);
        curpage.appendChild(createImageElement(x, y, width, height, resource));
    }

    public void parsePdf(PDDocument doc) {
        try {
            // Building the DOM walks the document and triggers the render* callbacks above.
            createDOM(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        File pdfFile = new File("F:\\dev\\test\\test.pdf");
        // try-with-resources closes the document once parsing is finished
        try (PDDocument document = PDDocument.load(pdfFile)) {
            MyPdf pdfDomTree = new MyPdf();
            pdfDomTree.parsePdf(document);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output:

====page:0
====text:文章:/penngo/article/details/125436956,x:90,top:81,width:312,height:13
====text:1.2,x:90,top:112,width:14,height:13
====text:安装,x:110,top:112,width:20,height:13
====text:HTML,x:133,top:112,width:29,height:13
====text:Publisher,x:166,top:112,width:46,height:13
====image:x:90.0,y:139.2,width:414.48,height:177.36
====text:插件,x:215,top:112,width:20,height:13
====text:表头,x:90,top:331,width:20,height:13
====text:1,x:113,top:331,width:6,height:13
====text:表头,x:196,top:331,width:20,height:13
====text:2,x:220,top:331,width:6,height:13
====text:表头,x:303,top:331,width:20,height:13
====text:3,x:326,top:331,width:6,height:13
====text:表头,x:409,top:331,width:20,height:13
====text:4,x:433,top:331,width:6,height:13
====text:列,x:90,top:362,width:10,height:13
====text:1,x:103,top:362,width:6,height:13
====text:列,x:196,top:362,width:10,height:13
====text:2,x:209,top:362,width:6,height:13
====text:列,x:303,top:362,width:10,height:13
====text:3,x:316,top:362,width:6,height:13
====text:列,x:409,top:362,width:10,height:13
====line size:1
====line0,x1:84.36,y1:321.83997,x2:510.94,y2:321.83997,stroke:true,fill:false
====line size:1
====line0,x1:84.36,y1:353.54,x2:510.94,y2:353.54,stroke:true,fill:false
====line size:1
====line0,x1:84.36,y1:385.24,x2:510.94,y2:385.24,stroke:true,fill:false
====line size:1
====line0,x1:84.6,y1:321.59998,x2:84.6,y2:385.0,stroke:true,fill:false
====line size:1
====line0,x1:191.1,y1:321.59998,x2:191.1,y2:385.0,stroke:true,fill:false
====line size:1
====line0,x1:297.6,y1:321.59998,x2:297.6,y2:385.0,stroke:true,fill:false
====line size:1
====line0,x1:404.15,y1:321.59998,x2:404.15,y2:385.0,stroke:true,fill:false
====line size:1
====line0,x1:510.7,y1:321.59998,x2:510.7,y2:385.0,stroke:true,fill:false
====text:4,x:422,top:362,width:6,height:13
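With both the text positions and the line positions available, each text fragment can be assigned to a table cell. The sketch below is one possible approach, not part of Pdf2Dom: CellLocator, bandIndex, and locate are hypothetical names, and the boundary values are rounded from the line coordinates in the output above. It assumes that the text's x and top values share a coordinate system with the line coordinates, which appears to hold in this run but is not guaranteed for every document.

```java
/** Hypothetical helper: maps a text position to a (row, column) cell index. */
class CellLocator {

    // Row and column boundaries rounded from the horizontal and vertical lines in the sample run.
    static final float[] ROWS = {321.84f, 353.54f, 385.24f};
    static final float[] COLS = {84.6f, 191.1f, 297.6f, 404.15f, 510.7f};

    /** Returns i such that bounds[i] <= v < bounds[i + 1], or -1 if v lies outside. */
    static int bandIndex(float[] bounds, float v) {
        for (int i = 0; i + 1 < bounds.length; i++) {
            if (v >= bounds[i] && v < bounds[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    /** Returns {row, column} for a fragment whose left edge is x and whose top edge is top. */
    static int[] locate(float[] rowBounds, float[] colBounds, float x, float top) {
        return new int[] {bandIndex(rowBounds, top), bandIndex(colBounds, x)};
    }

    public static void main(String[] args) {
        // First header fragment of the sample run: x:90, top:331
        int[] header = locate(ROWS, COLS, 90f, 331f);
        System.out.println("row " + header[0] + ", col " + header[1]); // row 0, col 0

        // Last fragment of the sample run: x:422, top:362
        int[] last = locate(ROWS, COLS, 422f, 362f);
        System.out.println("row " + last[0] + ", col " + last[1]);     // row 1, col 3
    }
}
```

Fragments that land in the same cell, such as a label and the digit that follows it, can then be concatenated in the order they were emitted to rebuild the cell text. A fragment above the table, such as one with top:112, yields a row index of -1 and can be skipped.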

