It looks like this in XML. I want to get the image src value...
<description><![CDATA[<div class="images"><img src="http://www.voicetv.co.th/cache/images/8a1a6f2aeb7b0e9c1d6bb3eae314165f.jpg" /></div>]]></description>
What I am doing is
if ((theElement.getElementsByTagName("description")).getLength() > 0) {
allChildern = theElement.getElementsByTagName("description").item(0).getChildNodes();
for (int index = 0; index < allChildern.getLength(); index++) {
description += allChildern.item(index).getNodeValue();
NodeList chNodes = allChildern.item(index).getChildNodes();
for (int i = 0; i < chNodes.getLength(); i++) {
String name = chNodes.item(i).getNodeName();
if(name.equals("div")) {
String clas = allChildern.item(index).getAttributes().getNamedItem("class").getNodeValue();
if(clas.equals("images")){
String nName = allChildern.item(index).getChildNodes().item(0).getNodeName();
if(nName.equals("img")) {
String nValue = allChildern.item(index).getChildNodes().item(0).getAttributes().getNamedItem("src").getNodeValue();
}
}
}
}
}
currentStory.setDescription(description);
}
But it is not working.
The description element contains a CDATA node. This means that the <img> "element" you are trying to access is really just a piece of text (and not an element at all).
You'll need to parse the text as a new XML document in order to access it via DOM methods.
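A minimal sketch of that two-step parse, assuming the text inside the CDATA section is itself well-formed XML with a single root element (as in the sample above). The class and method names are made up for illustration; real feed HTML is often not well-formed, in which case the second parse will throw.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question's code.
public class CdataImageExtractor {

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the src of the first img inside the description's CDATA text, or null. */
    public static String extractImgSrc(String feedXml) throws Exception {
        Document outer = parse(feedXml);
        NodeList descriptions = outer.getElementsByTagName("description");
        if (descriptions.getLength() == 0) {
            return null;
        }
        // The CDATA content comes back as plain text, markup included.
        String innerXml = descriptions.item(0).getTextContent();

        // Second parse: this only works if the text is well-formed XML.
        Document inner = parse(innerXml);
        NodeList images = inner.getElementsByTagName("img");
        if (images.getLength() == 0) {
            return null;
        }
        return ((Element) images.item(0)).getAttribute("src");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<description><![CDATA[<div class=\"images\">"
                + "<img src=\"http://www.voicetv.co.th/cache/images/8a1a6f2aeb7b0e9c1d6bb3eae314165f.jpg\" />"
                + "</div>]]></description>";
        System.out.println("img url: " + extractImgSrc(xml));
    }
}
```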
Warning: This might be a bit dirty, and it can be fragile if the XML can contain comments that contain something that looks like image tags.
An alternative to XML parsing for that short snippet with a CDATA section is to get the image URL with a regular expression. Here's an example:
String xml = "<description><![CDATA[<div class=\"images\"><img src=\"http://www.voicetv.co.th/cache/images/8a1a6f2aeb7b0e9c1d6bb3eae314165f.jpg\"/></div>]]></description>";
// Capture everything between the opening quote of src and the next quote
Matcher matcher = Pattern.compile("<img src=\"([^\"]+)").matcher(xml);
while (matcher.find()) {
    System.out.println("img url: " + matcher.group(1));
}
I am trying to build an app that displays the feed of a given Twitter account without using Twitter's OAuth. Instead, I converted the account's feed page to XML using Twitrss.me. The image URL is buried inside the img tag, in the src attribute:
<img xmlns="http://www.w3.org/1999/xhtml" width="250"
src="https://pbs.twimg.com/media/CmEqtEuUoAISIC3.jpg"
xml:base="http://twitrss.me/twitter_user_to_rss/?user=capcomfighters" />
I was able to get the text content of all the other tags, but I have no idea how to access a tag's attributes and get the value of a specific attribute.
Also, if there is a better way to do this than parsing XML, please let me know.
I am using this function to get the content of the tags:
private void ProcessXml(Document data) {
if (data != null) {
feedItems=new ArrayList<>();
Element root = data.getDocumentElement();
Node channel = root.getChildNodes().item(1);
NodeList items = channel.getChildNodes();
for (int i = 0; i < items.getLength(); i++) {
Node currentchild = items.item(i);
if (currentchild.getNodeName().equalsIgnoreCase("item")) {
FeedItem item=new FeedItem();
NodeList itemchilds = currentchild.getChildNodes();
for (int j = 0; j < itemchilds.getLength(); j++) {
Node current = itemchilds.item(j);
if (current.getNodeName().equalsIgnoreCase("title")){
item.setTitle(current.getTextContent());
item.setDescription(current.getTextContent());
}else if (current.getNodeName().equalsIgnoreCase("description")){
// item.setDescription(current.getTextContent());
}else if (current.getNodeName().equalsIgnoreCase("pubDate")){
item.setPubDate(current.getTextContent());
}else if (current.getNodeName().equalsIgnoreCase("img")){
item.setThumbnailUrl(current.getAttributes().getNamedItem("src").getNodeValue());
Log.d("PIC", "ProcessXml: "+current.getAttributes().getNamedItem("src").getNodeValue());
}
}
feedItems.add(item);
}
}
}
}
I'm trying to parse an XML file. The format of the XML file is:
<Testcase>
<Title>Low Load</Title>
<MO_Call>DIAL</MO_Call>
<Delay>10</Delay>
<MO_Call>HANGUP</MO_Call>
<Delay>10</Delay>
<MO_SMS>SEND</MO_SMS>
</Testcase>
I need to get the key-value pairs under the node "Testcase" and store them in a data structure. Ordering is important, hence I'm considering a LinkedHashMap.
Please suggest the right way to get the key-value pairs from the XML file.
In the above XML snippet, the corresponding key-value pairs are:
Key: MO_Call, Value: DIAL
Key: Delay, Value: 10
I have written the below code for parsing the XML:
try {
builder = factory.newDocumentBuilder();
Document doc = builder.parse(new FileInputStream(CONFIG_PATH));
doc.getDocumentElement().normalize();
NodeList nodeList = doc.getElementsByTagName(TAG_TEST_CASE);
if (nodeList != null && nodeList.getLength() > 0) {
for (int i=0; i < nodeList.getLength(); i++) {
Element el = (Element) nodeList.item(i);
// If title doesn't match, check the next title under next 'Testcase' NodeList
if(el.getFirstChild().getNodeName().equals(TAG_TITLE) &&
!el.getFirstChild().getNodeValue().equals(title)) {
continue;
}
// else, title matches. So parse the child nodes.
// Start from index 1, since index 0 is title always
NodeList childNodeList = el.getChildNodes();
for(int j=1; j < childNodeList.getLength(); j++) {
Node childNode = childNodeList.item(j);
//Element childElement = (Element) childNodeList.item(j);
Log.d("Tool", "key=" +childNode.getNodeName()+ ", value=" +childNode.getTextContent());
}
}
}
}
I'm getting the output:
key=Title, value=Low Load
key=#text, value=
key=MO_Call, value=DIAL
key=#text, value=
key=Delay, value=10
key=#text, value=
key=MO_Call, value=HANGUP
key=#text, value=
key=Delay, value=10
key=#text, value=
key=MO_SMS, value=SEND
key=#text, value=
key=Title, value=Medium Load
key=#text, value=
key=Title, value=High Load
key=#text, value=
For some reason, the key "#text" keeps coming up with an empty value. Please help me avoid this.
Please refer here for the #text. If you can alter the XML, remove ALL unintentional whitespace, then re-test your code.
Example XML:
<?xml version="1.0" encoding="UTF-8"?><Testcase><Title>Low Load</Title><MO_Call>DIAL</MO_Call><Delay>10</Delay><MO_Call>HANGUP</MO_Call><Delay>10</Delay><MO_SMS>SEND</MO_SMS></Testcase>
Also, please show the encoding used for the XML file; there could be typos.
This is also a good suggestion on how to parse
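The #text entries are the line breaks and indentation between elements, which the DOM exposes as text nodes. If altering the XML is not an option, one way around them is to skip every child that is not an element. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the root of the parsed string is a single Testcase element. It collects the pairs in a list rather than a LinkedHashMap, because keys such as MO_Call and Delay repeat and a map would keep only the last value for each key.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration only.
public class TestcaseReader {

    /** Returns the child elements of the root as ordered key-value pairs, skipping Title. */
    public static List<Map.Entry<String, String>> readPairs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Whitespace between tags shows up as TEXT_NODE ("#text"); ignore it.
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            // Assumption: Title identifies the test case and is not one of the pairs.
            if ("Title".equals(child.getNodeName())) {
                continue;
            }
            pairs.add(new SimpleEntry<>(child.getNodeName(), child.getTextContent().trim()));
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Testcase>\n  <Title>Low Load</Title>\n  <MO_Call>DIAL</MO_Call>\n"
                + "  <Delay>10</Delay>\n</Testcase>";
        for (Map.Entry<String, String> pair : readPairs(xml)) {
            System.out.println("key=" + pair.getKey() + ", value=" + pair.getValue());
        }
    }
}
```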
What I want to do...
I have a WebView in my Android app. I get a huge chunk of HTML from the server as a string, and a search string from the application user (the Android phone user). I break up the search string and build a regex from it. I want all the HTML content that matches my regex to be highlighted when I display it in my WebView.
What I tried...
Since it is HTML, I just want to wrap the regex-matched words in a pair of tags with a yellow background.
A simple regex and replaceAll on the HTML content that I get. Very wrong, because it messes up and replaces even what is inside the '<' and '>'.
I tried using the Matcher and Pattern combo. It is difficult to skip what is inside the tags.
I used the jsoup parser and it worked!
I traverse the HTML using the NodeTraversor class. I used the Matcher and Pattern classes to find the matched words and wrap them in tags as I wanted.
But it is very slow. I basically want to use it on Android, and the library is about 284 kB. I removed some unwanted classes and it is now 201 kB, but that is still too much for an Android device. Additionally, the HTML content can be really large. I looked into the jsoup source as well. It seems to iterate over every single character when it parses. I do not know whether all parsers do the same, but it is definitely slow for large HTML documents.
Here is my code -
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Highlighter {
private String regex;
private String htmlContent;
Pattern pat;
Matcher mat;
public Highlighter(String searchString, String htmlString) {
regex = buildRegexFromQuery(searchString);
htmlContent = htmlString;
pat = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
}
public String getHighlightedHtml() {
Document doc = Jsoup.parse(htmlContent);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor() {
@Override
public void tail(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.getWholeText();
mat = pat.matcher(text);
if(mat.find()) {
nodesToChange.add(textNode);
}
}
}
@Override
public void head(Node node, int depth) {
}
});
nd.traverse(doc.body());
for (TextNode textNode : nodesToChange) {
Node newNode = buildElementForText(textNode);
textNode.replaceWith(newNode);
}
return doc.toString();
}
private static String buildRegexFromQuery(String queryString) {
String regex = "";
String queryToConvert = queryString;
/* Clean up query */
queryToConvert = queryToConvert.replaceAll("\\p{Punct}+", " ");
queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();
// Split the cleaned-up query, not the raw input
String[] regexArray = queryToConvert.split(" ");
regex = "(";
for(int i = 0; i < regexArray.length - 1; i++) {
String item = regexArray[i];
regex += "(\\b)" + item + "(\\b)|";
}
regex += "(\\b)" + regexArray[regexArray.length - 1] + "[a-zA-Z0-9]*?(\\b))";
return regex;
}
private Node buildElementForText(TextNode textNode) {
String text = textNode.getWholeText().trim();
ArrayList<MatchedWord> matchedWordSet = new ArrayList<MatchedWord>();
mat = pat.matcher(text);
while(mat.find()) {
matchedWordSet.add(new MatchedWord(mat.start(), mat.end()));
}
StringBuffer newText = new StringBuffer(text);
for(int i = matchedWordSet.size() - 1; i >= 0; i-- ) {
String wordToReplace = newText.substring(matchedWordSet.get(i).start, matchedWordSet.get(i).end);
wordToReplace = "<b>" + wordToReplace+ "</b>";
newText = newText.replace(matchedWordSet.get(i).start, matchedWordSet.get(i).end, wordToReplace);
}
return new DataNode(newText.toString(), textNode.baseUri());
}
class MatchedWord {
public int start;
public int end;
public MatchedWord(int start, int end) {
this.start = start;
this.end = end;
}
}
}
Here is how I call it -
htmlString = getHtmlFromServer();
Highlighter hl = new Highlighter("Hello World!", htmlString);
String newHtmlString = hl.getHighlightedHtml();
I am sure what I'm doing is not the optimal way, but I can't seem to think of anything else.
I want to
- reduce the time it takes to highlight it.
- reduce the size of library
Any suggestions?
How about highlighting them using JavaScript?
Everybody loves JavaScript, and you can find an example in this blog.
JTidy and HtmlCleaner are also among the best Java HTML parsers.
See
Comparison between different Java HTML Parser
and
What are the pros and cons of the leading Java HTML parsers?
I have an XML file that I am parsing with a DOM parser. The XML looks roughly like this:
<root>
<item1> abc </item1>
<item2> def </item2>
<item3> ghi </item3>
<item4>
<subItem4>
<name> xyz </name>
<id> 1 </id>
</subItem4>
<subItem4>
<name> asd </name>
<id> 2 </id>
</subItem4>
</item4>
</root>
With this dummy XML I can reach subItem 4 but not its children. This is what I am trying in order to get the innermost items:
NodeList slide = theElement.getElementsByTagName("item4").item(0).getChildNodes();
for(int i = 0; i<slide.getLength(); i++)
{
NodeList subSlides = theElement.getElementsByTagName("subItem4").item(0).getChildNodes();
for (int j=0; j<subSlides.getLength(); j++)
{
String subSlide_title = subSlides.item(i).getFirstChild().getNodeValue();
}
}
It's not working. Can someone please identify where I am making the mistake in parsing? Any help is appreciated.
You are not using valid XML - you can't have spaces in tag names.
XML Names cannot contain white space, see here for valid values.
Update (following comment that the posted sample is representative of the actual XML):
Your access via the indexer of the node list is incorrect:
String subSlide_title = subSlides.item(i).getFirstChild().getNodeValue();
Try this instead (using j instead of i, as the inner loop variable is called):
String subSlide_title = subSlides.item(j).getFirstChild().getNodeValue();
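Even with the index fixed, the posted loop always reads item(0) of the subItem4 list, and getChildNodes() also returns the whitespace text nodes between the tags, whose getFirstChild() is null. A rough sketch that sidesteps both, assuming the element names from the sample XML (the class and method names are hypothetical):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
public class SubItemReader {

    /** Returns the trimmed text of the name element of every subItem4, in document order. */
    public static List<String> readNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> names = new ArrayList<>();
        NodeList subItems = doc.getElementsByTagName("subItem4");
        for (int i = 0; i < subItems.getLength(); i++) {
            // Use the i-th subItem4, not always the first one.
            Element subItem = (Element) subItems.item(i);
            // Looking up by tag name avoids the whitespace text nodes entirely.
            NodeList nameNodes = subItem.getElementsByTagName("name");
            if (nameNodes.getLength() > 0) {
                names.add(nameNodes.item(0).getTextContent().trim());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item4>"
                + "<subItem4><name> xyz </name><id> 1 </id></subItem4>"
                + "<subItem4><name> asd </name><id> 2 </id></subItem4>"
                + "</item4></root>";
        System.out.println(readNames(xml));
    }
}
```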
NodeList nodes = doc.getElementsByTagName("item");
for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    NodeList nodesimg = element.getElementsByTagName("name");
    for (int j = 0; j < nodesimg.getLength(); j++) {
        Element line = (Element) nodesimg.item(j);
        String value = getCharacterDataFromElement(line);
    }
}

// CharacterData here is org.w3c.dom.CharacterData (text and CDATA nodes)
public static String getCharacterDataFromElement(Element e) {
    Node child = e.getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "?";
}
I think the above code will help you parse the XML file.
The XML elements are all messed up.
There are literally 2 lines that don't have mistakes in them.
For instance
<subItem 4>
is syntactically wrong and I don't see what logical sense you could make out of it.
Do you mean
<subItem4>
as in the fourth sub item or
<subItem someAttribute="4">
I'd recommend learning XML, it's very simple... http://www.w3schools.com/xml/default.asp
If I parse the tag that contains <p>Some Text</p> tag, I get a null pointer exception.
My RSS feed is as follows:
<quaddeals_conditions><p>Limit one QuadDeal</p></quaddeals_conditions>
My code is:
if (name.equalsIgnoreCase("quaddeals_conditions")) {
property.normalize();
conditions = property.getFirstChild().getNodeValue();
}
You have an element inside an element.
Therefore, retrieve all quaddeals_conditions elements, then iterate over each one and retrieve the p element from it:
DocumentBuilder builder = factory.newDocumentBuilder();
Document dom = builder.parse(this.inputStream);
Element root = dom.getDocumentElement();
// snip
NodeList items = root.getElementsByTagName("quaddeals_conditions");
for (int i = 0; i < items.getLength(); i++) {
    Node item = items.item(i);
    NodeList properties = item.getChildNodes();
    for (int j = 0; j < properties.getLength(); j++) {
        Node property = properties.item(j);
        String name = property.getNodeName();
        if (name.equalsIgnoreCase("p")) {
            property.getFirstChild().getNodeValue(); // Your paragraph data
        }
    }
}
Hope this helps.
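If only the text is needed and not the p element itself, Node.getTextContent() may be a shorter route, since it concatenates the text of all descendants, whereas getNodeValue() on an element node returns null. A small sketch under that assumption (the class and method names are hypothetical):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
public class ConditionsReader {

    /** Returns the text inside the first quaddeals_conditions element, or null if absent. */
    public static String readConditions(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList items = doc.getElementsByTagName("quaddeals_conditions");
        if (items.getLength() == 0) {
            return null;
        }
        // getTextContent() gathers text from nested elements such as <p>.
        return items.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<quaddeals_conditions><p>Limit one QuadDeal</p></quaddeals_conditions>";
        System.out.println(readConditions(xml));
    }
}
```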
Is "name" not null? I don't see you check for that.
It's good coding practice to put the constant first in the comparison where possible:
if ("quaddeals_conditions".equalsIgnoreCase(name))...
So even if "name" is null, you don't get a NullPointerException.
Always check for null before accessing an object's members.