Problem description
I'm currently trying to use a SAX parser, but about three-quarters of the way through the file it completely freezes up. I have tried allocating more memory, among other things, but have seen no improvement.
Is there any way to speed this up, or a better method?
I stripped it down to the bare bones, so I now have the following code. When run from the command line, it still isn't as fast as I would like.
Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around article 700000.
Main:
import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Holds every parsed page in memory at once
        ArrayList<Page> pages = XMLManager.getPages();
    }
}
XMLManager:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}
PageHandler:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler() {
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        stringBuilder = new StringBuilder();
        if (qName.equals("page")) {
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")) {
            if (page != null) {
                page.setRedirecting(true);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                page.setTitle(stringBuilder.toString());
            } else if (qName.equals("id")) {
                if (!idSet) {
                    page.setId(Integer.parseInt(stringBuilder.toString()));
                    idSet = true;
                }
            } else if (qName.equals("text")) {
                String articleText = stringBuilder.toString();
                articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                articleText = articleText.replaceAll("\\|", " "); //separate multiple links
                articleText = articleText.replaceAll("\n", " "); //remove new lines
                articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
                Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                Matcher matcher = pattern.matcher(articleText);
                matcher.find();
                try {
                    page.setSummaryText(matcher.group());
                } catch (IllegalStateException se) {
                    page.setSummaryText("None");
                }
                page.setText(articleText);
            } else if (qName.equals("page")) {
                pages.add(page);
                page = null;
            }
        } else {
            page = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stringBuilder.append(ch, start, length);
    }

    public ArrayList<Page> getPages() {
        return pages;
    }
}
Recommended answer
Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.
You need some sort of pipeline that passes the data on to its actual destination without ever storing it all in memory at once.
What I've sometimes done in this sort of situation is similar to the following.
Create an interface for processing a single element:
public interface PageProcessor {
    void process(Page page);
}
Supply an implementation of this to the PageHandler through a constructor:
public class Read {
    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Send data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Elide code not needing change
            } else if (qName.equals("page")) {
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}
Of course, you can make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically send the list off for processing, and then clear it.
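As a rough illustration of that chunked variant, here is a minimal sketch in which the handler itself keeps a small local list and flushes it whenever it fills up, plus once more at the end of the document. To keep it short, each page is reduced to its title string. `PageBatchProcessor`, `BatchingTitleHandler`, and `BatchingDemo` are hypothetical names that do not appear in the original code, and the tiny in-memory XML string stands in for the real dump.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class BatchingDemo {
    public static void main(String[] args) {
        String xml = "<pages>"
                + "<page><title>A</title></page>"
                + "<page><title>B</title></page>"
                + "<page><title>C</title></page>"
                + "</pages>";
        System.out.println(batchSizes(xml, 2)); // prints [2, 1]
    }

    // Parses the XML and records how many pages each delivered batch contained.
    static List<Integer> batchSizes(String xml, int batchSize) {
        List<Integer> sizes = new ArrayList<>();
        BatchingTitleHandler handler =
                new BatchingTitleHandler(batchSize, batch -> sizes.add(batch.size()));
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sizes;
    }
}

// Hypothetical chunk-level counterpart of PageProcessor (assumption, not in the original).
interface PageBatchProcessor {
    void processBatch(List<String> titles);
}

class BatchingTitleHandler extends DefaultHandler {
    private final int batchSize;
    private final PageBatchProcessor processor;
    private final List<String> batch = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentTitle;

    BatchingTitleHandler(int batchSize, PageBatchProcessor processor) {
        this.batchSize = batchSize;
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            currentTitle = text.toString();
        } else if (qName.equals("page")) {
            batch.add(currentTitle);
            if (batch.size() >= batchSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // deliver the final, possibly partial, batch
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        processor.processBatch(new ArrayList<>(batch)); // copy so the receiver owns its chunk
        batch.clear();
    }
}
```

With this shape, at most `batchSize` pages are held by the handler at any time. The flush in `endDocument` matters: without it, the last partial batch would be silently dropped.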
Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
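A minimal sketch of that buffering approach might look like the following. It leaves the PageHandler untouched and puts the chunking into a PageProcessor implementation. `BufferingPageProcessor`, its `flush` method, and `BufferingDemo` are hypothetical names, the `Page` class here is a stripped-down stand-in for the real one, and the downstream consumer is an assumption about where the chunks would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BufferingDemo {
    public static void main(String[] args) {
        BufferingPageProcessor processor = new BufferingPageProcessor(2,
                chunk -> System.out.println("Handling chunk of " + chunk.size() + " pages"));
        for (int i = 1; i <= 5; i++) {
            processor.process(new Page(i, "Title " + i));
        }
        processor.flush(); // hand over the final partial chunk
    }
}

// Stand-in for the question's Page class (assumption: only id and title are kept here).
class Page {
    final int id;
    final String title;

    Page(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Same single-record interface as defined in the answer.
interface PageProcessor {
    void process(Page page);
}

// Hypothetical helper: buffers pages and passes them downstream in fixed-size chunks.
class BufferingPageProcessor implements PageProcessor {
    private final int chunkSize;
    private final Consumer<List<Page>> downstream;
    private final List<Page> buffer;

    BufferingPageProcessor(int chunkSize, Consumer<List<Page>> downstream) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.downstream = downstream;
        this.buffer = new ArrayList<>(chunkSize);
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Call once after parsing finishes so the last partial chunk is not lost.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(buffer)); // copy so the consumer owns the chunk
        buffer.clear();
    }
}
```

Run as written, the demo reports chunks of 2, 2, and 1 pages. In real use, the caller would invoke `flush()` after `parser.parse(...)` returns, and the consumer would do the actual work, such as a batched database insert or a write to an index.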