
Fetching Web Page Data with Java's HttpClient

Date: 2021-03-06 08:27:24


A web crawler is a program that accesses resources on the network on our behalf. We have always used the HTTP protocol to view pages on the Internet, and a crawler is simply a program we write that accesses those pages over the same HTTP protocol.

1. pom dependencies

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.25</version>
</dependency>

2. log4j configuration file

log4j.properties

log4j.rootLogger=DEBUG,A1
log4j.logger.yfy=DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

3. GET request

public class HttpGetTest {
    public static void main(String[] args) throws URISyntaxException {
        // 1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // Target address: /top250?start=25
        // Create a URIBuilder and add the query parameter
        URIBuilder uriBuilder = new URIBuilder("/top250");
        uriBuilder.setParameter("start", "25");

        // 2. Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet(uriBuilder.build());

        // Configure the request.
        // Sometimes, because of the network or the target server, a request needs
        // more time to complete, so we customize the relevant timeouts.
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // max time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500) // max time to obtain a connection from the manager
                .setSocketTimeout(10 * 1000)      // max time for data transfer
                .build();
        httpGet.setConfig(config);

        System.out.println("Request: " + httpGet);

        // 3. Execute the request with HttpClient and obtain the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            // 4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content);
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
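The URIBuilder above assembles the query string (here `/top250?start=25`) and percent-encodes the parameter values for you. The encoding step can be sketched with only the JDK; the helper name `withParam` below is hypothetical, not part of HttpClient:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QuerySketch {
    // Hypothetical helper: append one key=value pair to a path,
    // URL-encoding the value the way a query string requires.
    static String withParam(String path, String key, String value) {
        String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
        return path + "?" + key + "=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(withParam("/top250", "start", "25"));    // /top250?start=25
        System.out.println(withParam("/search", "keys", "Java 8")); // space becomes +
    }
}
```

This is only the single-parameter case; URIBuilder additionally handles multiple parameters, fragments, and the rest of the URI syntax.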

4. POST request

public class HttpParamTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("/search");

        System.out.println("Request: " + httpPost);

        // Declare a List to hold the form parameters
        List<NameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("keys", "Java"));

        // Create the form Entity object
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
        httpPost.setEntity(formEntity);

        // 3. Execute the request with HttpClient and obtain the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpPost);
            // 4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
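UrlEncodedFormEntity serializes the name/value pairs as `application/x-www-form-urlencoded`, the standard HTML form wire format. A minimal JDK-only sketch of that encoding (`formBody` is a hypothetical helper, not an HttpClient API):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormBodySketch {
    // Hypothetical helper: encode pairs as application/x-www-form-urlencoded,
    // i.e. key=value pairs joined with '&', both sides URL-encoded.
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("keys", "Java");
        params.put("page", "1");
        System.out.println(formBody(params)); // keys=Java&page=1
    }
}
```

The real entity also sets the `Content-Type` header and the charset on the request, which this sketch leaves out.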

5. Connection pool

If we create a new HttpClient for every request, we pay the cost of frequent creation and destruction; a connection pool solves this problem.

public class HttpClientPoolTest {
    public static void main(String[] args) {
        // Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        cm.setMaxTotal(100);
        // Set the maximum number of connections per host
        cm.setDefaultMaxPerRoute(10);
        // Issue requests through the pool manager
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // Instead of creating a new HttpClient each time, obtain one backed by the pool
        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .build();

        HttpGet httpGet = new HttpGet("/top250");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Do not close the HttpClient here; the connection pool manages it
        }
    }
}
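The two limits above (`setMaxTotal` and `setDefaultMaxPerRoute`) can be pictured as a pair of counting semaphores: a lease succeeds only if both the per-host cap and the global cap have permits left. The toy model below illustrates that idea only; it is not the actual internals of PoolingHttpClientConnectionManager:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Toy model of the two pool limits: a global cap plus a per-route (per-host) cap.
public class ToyPoolLimits {
    private final Semaphore total;
    private final int perRoute;
    private final Map<String, Semaphore> routes = new ConcurrentHashMap<>();

    public ToyPoolLimits(int maxTotal, int maxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.perRoute = maxPerRoute;
    }

    // Try to lease a connection for the given host; false if either cap is hit.
    public boolean tryLease(String host) {
        Semaphore route = routes.computeIfAbsent(host, h -> new Semaphore(perRoute));
        if (!route.tryAcquire()) return false;      // per-host cap reached
        if (!total.tryAcquire()) {                  // global cap reached
            route.release();
            return false;
        }
        return true;
    }

    // Return a leased connection to the pool.
    public void release(String host) {
        routes.get(host).release();
        total.release();
    }

    public static void main(String[] args) {
        ToyPoolLimits pool = new ToyPoolLimits(2, 1); // maxTotal=2, maxPerRoute=1
        System.out.println(pool.tryLease("a")); // true
        System.out.println(pool.tryLease("a")); // false: per-route cap for "a"
        System.out.println(pool.tryLease("b")); // true
        System.out.println(pool.tryLease("c")); // false: global cap of 2
    }
}
```

The real pool additionally reuses idle connections, validates them, and evicts expired ones; the sketch only models the admission limits.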
