Scraping Web Page Data with Jsoup in Java

Date: 2022-09-06 07:02:26

1. Add the dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

2. Basic usage

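public static void main(String[] args) throws IOException {
    // 1. URL to visit
    String url = "";
    // 2. Request parameters; omit if there are none
    Map<String, String> params = new HashMap<>();
    // params.put("word", "壁纸");
    // 3. Request the document. Both get and post are available; which one to use
    //    depends on the data being scraped. For HTML pages, get is usually enough.
    Document document = Jsoup.connect(url).data(params).get();
    // 4. Selectors, used to pick elements. The main options are:
    /*
     * 1. getElementsByTag: select by tag name, typically for elements such as h1 or img
     * 2. getElementsByClass: select by the element's class name
     * 3. getElementById: select by element ID
     * 4. select: compound selector, the most commonly used method. For example:
     *      #nav        selects the element whose id is nav
     *      .container  selects elements whose class is container
     *      p > img     selects img tags directly under a p tag; an img nested as
     *                  p > a > img is not matched. The hierarchy must correspond.
     * For the other methods, see the official API: /cookbook/extracting-data/attributes-text-html
     */
    Elements elements = document.select("div.post_body > p > img");
    /*
     * At this point all the images have been selected. Use attr to read a tag's
     * attribute and get the image URL, then write the image to a local folder
     * with the utility class.
     */
    String FilePath = "D://imgDown/";
    for (int i = 0; i < elements.size(); i++) {
        // This method takes only three parameters: the URL, the file name
        // (full name, including the path), and the timeout
        FileUtil.downFile(elements.get(i).attr("src"), FilePath + i + ".jpg", 5000);
    }
}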
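The loop above passes the raw `attr("src")` value to `FileUtil.downFile`, which only works when the page uses absolute image URLs. Jsoup's `absUrl("src")` can return an absolute URL for this case. The sketch below shows the same idea using only the JDK. The class name `ImageUrlResolver` and the example.com URLs are hypothetical and used only for illustration.

```java
import java.net.URI;

class ImageUrlResolver {

    // Resolves an img "src" value against the URL of the page it was found on.
    // Absolute values come back unchanged; relative ones are made absolute.
    // URI.create throws IllegalArgumentException for values that are not valid
    // URIs (for example, ones containing unescaped spaces).
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/posts/42.html"; // hypothetical page URL
        System.out.println(resolve(page, "/static/a.jpg"));
        System.out.println(resolve(page, "img/b.jpg"));
        System.out.println(resolve(page, "https://cdn.example.com/c.jpg"));
    }
}
```

The resolved string can then be passed to `FileUtil.downFile` in place of the raw attribute value.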

3. Utility class code. Lombok is optional: if you don't want it, just remove the log-related statements.

import lombok.extern.slf4j.Slf4j;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

@Slf4j
public class FileUtil {

    public static boolean downFile(String urlString, String fileName, Integer timeout) {
        boolean ret = false;
        File file = new File(fileName);
        try {
            // Remove any previous copy of the file
            if (file.exists()) {
                file.delete();
            }
            log.info("Starting file download");
            URL url = new URL(urlString);
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            if (timeout != null) {
                con.setConnectTimeout(timeout);
                con.setReadTimeout(timeout);
            }
            con.connect();
            int contentLength = con.getContentLength();
            // Make sure the parent directory exists
            File parent = file.getParentFile();
            if (parent != null) {
                parent.mkdirs();
            }
            // Read buffer; the size is an arbitrary choice
            byte[] bs = new byte[1024];
            int len;
            // try-with-resources closes both streams even if the copy fails;
            // FileOutputStream creates the file if it does not exist
            try (InputStream is = con.getInputStream();
                 OutputStream os = new FileOutputStream(file)) {
                while ((len = is.read(bs)) != -1) {
                    os.write(bs, 0, len);
                }
            }
            // getContentLength() returns -1 when the length is unknown,
            // so only compare sizes when the server reported one
            if (contentLength >= 0 && contentLength != file.length()) {
                file.delete();
                ret = false;
            } else {
                ret = true;
            }
        } catch (IOException e) {
            // Discard a partially written file
            file.delete();
            ret = false;
        }
        return ret;
    }
}
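As an alternative to the manual read/write loop, the copy step could be written with `java.nio.file.Files.copy`. This is only a sketch of that one step, not a replacement for the whole utility. The class name `StreamSaver` is hypothetical, and an in-memory stream stands in for `con.getInputStream()` so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamSaver {

    // Copies the whole stream to target, creating parent directories as needed
    // and overwriting an existing file. Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // try-with-resources closes the input stream when the copy finishes or fails
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for con.getInputStream()
        byte[] fakeImage = {1, 2, 3, 4, 5};
        Path target = Files.createTempDirectory("imgDown").resolve("sub/0.jpg");
        long written = save(new ByteArrayInputStream(fakeImage), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

The returned byte count can be compared against the reported content length in the same way the utility class compares `file.length()`.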

4. Additional notes

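When scraping pages, the title is usually in an h1 tag. You can create a folder named after the title, which makes the downloads easier to manage.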

String title = document.select("h1").text();
String FilePath = "D://imgDown/" + title + "/";
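A page title may contain characters that are not allowed in Windows folder names, such as `/`, `:` or `?`, in which case creating the directory fails. A minimal sketch of cleaning the title first is shown below. The `FolderNames` class, the `sanitize` method, and the `"untitled"` fallback are assumptions for illustration.

```java
class FolderNames {

    // Replaces characters that Windows does not allow in file names
    // (\ / : * ? " < > |) and control characters with an underscore,
    // then strips surrounding whitespace and trailing dots or spaces.
    // Falls back to "untitled" when nothing usable is left.
    static String sanitize(String title) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        cleaned = cleaned.replaceAll("[. ]+$", "");
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    public static void main(String[] args) {
        String title = "Top 10: wallpapers/backgrounds?"; // hypothetical h1 text
        String filePath = "D://imgDown/" + sanitize(title) + "/";
        System.out.println(filePath);
    }
}
```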

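Some sites validate request headers. When creating the connection, pass them with headers() (use header() to add a single one). The values below are fairly common. Some sites also check Token or Authorization headers, and a very small number check cookies. Fill in the actual values from what the browser's network capture shows.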

Map<String, String> headers = new HashMap<String, String>();
// Set Host according to the address being accessed
headers.put("Host", "");
headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/0101 Firefox/5.0");
headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
headers.put("Accept-Language", "zh-cn,zh;q=0.5");
headers.put("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
headers.put("Connection", "keep-alive");
Document document = Jsoup.connect(url).headers(headers).data(params).get();
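Headers passed to Jsoup.connect apply only to the page request. The images are fetched separately through the HttpURLConnection in FileUtil, so a site that checks headers on image requests would not see them. One possible approach, sketched here under that assumption, is to set the same map on the connection with setRequestProperty. The `HeaderedConnection` class, its `open` method, and the example.com URL are hypothetical. No network request is made until the connection is actually used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderedConnection {

    // Creates an HTTP connection object with the given request headers set.
    // openConnection() does not contact the server; that happens on connect()
    // or when the response is read.
    static HttpURLConnection open(String urlString, Map<String, String> headers) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(urlString).openConnection();
        for (Map.Entry<String, String> h : headers.entrySet()) {
            con.setRequestProperty(h.getKey(), h.getValue());
        }
        return con;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0"); // placeholder value
        headers.put("Accept-Language", "zh-cn,zh;q=0.5");
        HttpURLConnection con = open("https://example.com/img/0.jpg", headers); // hypothetical URL
        System.out.println(con.getRequestProperty("User-Agent"));
        System.out.println(con.getRequestProperty("Accept-Language"));
    }
}
```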
