
java - Parsing HTML data with Java (DOM parsing)

Reposted. Author: 太空宇宙. Updated: 2023-11-04 06:35:21

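I have been working on this for a while and have not found anything relevant on Stack Overflow. I am using a parser that is meant to capture snippets of HTML. With the code further below, the output file grows exponentially in size. It does capture the fields I need (li), but it is also very repetitive, because it captures the same data over and over.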

Here is the file I am reading (the full file actually has more than 100 lines; only a short excerpt is included in this post):


Name: J0719
Description:
  1. Hop Counts: 2
  2. State: 3

Name: J0716
Description:
  1. Hop Counts: 3
  2. State: 2

Name: J0718
Description:
  1. Hop Counts: 1
  2. State: 5

Name: J0726
Description:
  1. Hop Counts: 8
  2. State: 4


My full code is here:

package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;


import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ReadXMLFile_part2 {

    public static void main(String[] args) throws Exception {

        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/results2.xml"));
        System.setOut(out);

        System.out.println("*** JSOUP ***");

        File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/output2_TEST.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.w3.org/1999/xhtml");
        } catch (IOException ex) {
            Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        //For loops to capture the <li> fields in the file
        Element bracket = doc.getElementsByTag("bracket").first();
        Elements trs = bracket.getElementsByTag("description");
        for (Element description : trs) {
            for (Element li : description.getAllElements()) {
                System.out.println(li.text());
            }
        }
        System.out.println();

        //read a line from the console
        String lineFromInput = in.readLine();

        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

}
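My question is how to parse the fields marked by "li" in the input file so that my output file has a new line for each "li" tag. The ideal output would look like this (and would also avoid the endless loop):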

    Name: J0719
    Hop Counts: 2
    State: 3
    Name: J0716
    Hop Counts: 3
    State: 2
    Name: J0718
    Hop Counts: 1
    State: 5
    Name: J0726
    Hop Counts: 8
    State: 4

Thanks, and I appreciate any help with this!

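Update, September 2: previousElementSibling is useful on its own, but when I tried to pull out the "Description" field I needed another nested loop of some kind (otherwise previousElementSibling keeps pulling the first previous element every time). I found that a faster workaround was simply to change the tags in the original file, so that it now looks like the following:

Updated XML file: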


  • Name: J0719

  • Description:
    1. Hop Counts 2
    2. State: 3

  • Name: J0716

  • Description:
    1. Hop Counts 3
    2. State: 2

  • Name: J0718

  • Description:
    1. Hop Counts 1
    2. State: 5

  • Name: J0719

  • Description:
    1. Hop Counts 8
    2. State: 4


Everything else in the original code stays the same, except for the following "for" loop:

//Updated Code:
//For loops to capture the (li) fields in the file
Elements brackets = doc.getElementsByTag("bracket");

for (Element bracket : brackets) {
    Elements lis = bracket.select("li");

    for (Element li : lis) {
        System.out.println(li.text());
    }
    break;
}
System.out.println();

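The only other thing is that I have to press the "Stop" run button manually some time after execution, once I see that the file size has stopped growing. Even so, the output file contains the desired results.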
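One likely reason the run never finishes on its own: the original main calls in.readLine() on System.in after the loops, and that call blocks until a line is typed on the console. Below is a minimal sketch of writing the lines to a file and closing it without any console read. The class name WriteResults, the helper writeLines, and the temp-file target are hypothetical names used for illustration, not part of the original program.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class WriteResults {

    // Hypothetical helper: writes each line to the target file, then closes it.
    // There is no read from System.in, so nothing waits for console input.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (PrintStream out = new PrintStream(Files.newOutputStream(target), true, "UTF-8")) {
            for (String line : lines) {
                out.println(line);
            }
        } // try-with-resources closes the stream here
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real results path (assumption).
        Path target = Files.createTempFile("results", ".txt");
        writeLines(target, Arrays.asList("Name: J0719", "Hop Counts: 2", "State: 3"));
        System.out.println(Files.readAllLines(target));
    }
}
```

If the console read is actually wanted, typing a line and pressing Enter in the IDE's input pane should also let the original program finish.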

1 Answer

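If I understand your question correctly, you are struggling with the fact that the name node in your XML is not a child of the bracket node but its preceding sibling. I think the way to get the correct name element once you have the bracket element is to use JSOUP's DOM navigation methods, namely previousElementSibling().

Your loop could look like this: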

Elements brackets = doc.getElementsByTag("bracket");
for (Element bracket : brackets) {
    Elements lis = bracket.select("li");
    Element name = bracket.previousElementSibling();
    System.out.println(name.text());
    for (Element li : lis) {
        System.out.println(li.text());
    }
}
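To make this loop easy to try, here is a self-contained sketch that runs it against an inline string. The markup in SAMPLE is an assumption: the tags were lost from the post, and only the names name, bracket, description, and li are mentioned in the thread. The class name BracketFields and the helper extractLines are made up for illustration. It needs the jsoup library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BracketFields {

    // Assumed layout: each <name> is immediately followed by its <bracket>.
    static final String SAMPLE =
          "<name>Name: J0719</name>"
        + "<bracket><description>Description:"
        + "<ol><li>Hop Counts: 2</li><li>State: 3</li></ol>"
        + "</description></bracket>"
        + "<name>Name: J0716</name>"
        + "<bracket><description>Description:"
        + "<ol><li>Hop Counts: 3</li><li>State: 2</li></ol>"
        + "</description></bracket>";

    // Collects one line per name and one line per <li>, in document order.
    static List<String> extractLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element bracket : doc.getElementsByTag("bracket")) {
            // The name is a sibling of the bracket, not a child of it.
            Element name = bracket.previousElementSibling();
            if (name != null && name.tagName().equals("name")) {
                lines.add(name.text());
            }
            // select("li") matches each <li> once, so nothing is repeated.
            for (Element li : bracket.select("li")) {
                lines.add(li.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : extractLines(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The null check covers a bracket with no preceding element, where previousElementSibling() returns null. Selecting only li also avoids the repetition in the question's code, where getAllElements() returns the description element itself along with every descendant, so the same text is printed at several nesting levels.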

Regarding java - Parsing HTML data with Java (DOM parsing), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25491424/
