Regular Expression in Android - android

I am making an Android application that can fetch the new announcements from the website of my university.
This is the HTML code in the website:
sample_html_code http://img690.imageshack.us/img690/1079/88210050.png
Text version:
<table border="1" width="90%" class="duyuru">
<tbody>
<tr>
<td>
<h3 class="duyuru">Additional Quotas for the Technical Electives</h3>
"19/09/2012"
<h4 class="duyuru">"Additional Quotas for Technical Electives offered in...</h4>
<span class="duyuru"></span>
<br>
Download
</td>
</tr>
</tbody>
</table>
I can get the first and third lines "Additional Quotas for Technical Electives" and "Additional Quotas for ..." by using the piece of code below. However, I cannot get the date information (19/09/2012) located between h3 and h4 lines.
String patternStr ="\\<h3 class=\"duyuru\".*?\\>(.*?)\\</h3\\>";
patternStr+="(.*?)"; // This line is problematic
patternStr+=".*?\\<h4 class=\"duyuru\".*?\\>(.*?)\\</h4\\>";
Pattern pattern = Pattern.compile(patternStr, Pattern.DOTALL);
Matcher matcher = pattern.matcher(content);
String name = "";
String date = "";
String details = "";
while (matcher.find()){
name = matcher.group(1);
date = matcher.group(2);
details = matcher.group(3);
Announcement announcement = new Announcement();
announcement.setName(name);
announcement.setDate(date);
announcement.setDetails(details);
announcements.add(announcement);
}
I tried using
.*?\"(.*?)\"
but it didn't work. When I do this, it gets the string "duyuru" from the line starting with h4 tag instead of the date information.
Does anyone have an idea how I can grab the date information?
Thanks in advance.

Your regular expression misses the newlines and whitespace in the input.
The simplest possible match I could come up with is:
"\\<h3 class=\"duyuru\".*?\\>\\n?\\s*(.*?)\\n?\\s*\\</h3\\>"
But keep in mind that such a regular expression is highly specific to your HTML.
My advice would be to have a look at a real HTML parser for Java, such as TagSoup. Once you start using one of those, parsing this type of HTML document becomes a breeze.
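If you do stay with a regular expression, one way to pick up the date is to match it explicitly rather than with a lazy (.*?), which is free to match nothing at all. The following is only a sketch: the class name AnnouncementRegex, the method firstAnnouncement, and the assumption that the date always looks like dd/mm/yyyy (optionally wrapped in quotes) are mine, not something the question confirms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnnouncementRegex {
    // Assumption: the date is dd/mm/yyyy, optionally quoted, between </h3> and <h4>.
    private static final Pattern ANNOUNCEMENT = Pattern.compile(
            "<h3 class=\"duyuru\"[^>]*>(.*?)</h3>"
            + "\\s*\"?(\\d{2}/\\d{2}/\\d{4})\"?\\s*"
            + "<h4 class=\"duyuru\"[^>]*>(.*?)</h4>",
            Pattern.DOTALL);

    // Returns {name, date, details} for the first announcement, or null if none matches.
    static String[] firstAnnouncement(String content) {
        Matcher m = ANNOUNCEMENT.matcher(content);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2), m.group(3).trim() };
    }

    public static void main(String[] args) {
        String content = "<td>\n"
                + "<h3 class=\"duyuru\">Additional Quotas for the Technical Electives</h3>\n"
                + "\"19/09/2012\"\n"
                + "<h4 class=\"duyuru\">\"Additional Quotas for Technical Electives offered in...</h4>\n"
                + "</td>";
        String[] fields = firstAnnouncement(content);
        System.out.println(fields[1]); // prints 19/09/2012
    }
}
```

Because the date group only accepts digits and slashes, it cannot drift onto the "duyuru" attribute value the way .*?\"(.*?)\" did.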

Related

JSoup not showing all the html in Java (td and tr tags missing)

I'm having trouble getting all the html code under the tags. Here is my current code:
Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").get();
Elements desc = document.select("tr");
System.out.println(desc.toString());
It's for that question, and I'm trying to get the text from the question's description. But I'm not getting certain tr or td tags, like the ones for the question. Here is the td tag I'm trying to get:
<td class="postcell">
Under that tag is the actual post. Now when I print out what I'm actually getting, I'm getting a ton of empty td tags and some comments, but not the actual post.
<tr id="comment-37956942" class="comment ">
<td>
<table>
<tbody>
<tr>
<td class=" comment-score"> </td>
<td> </td>
</tr>
</tbody>
</table> </td>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy">You shouldn't parse HTML with regexes: blog.codinghorror.com/parsing-html-the-cthulhu-way</span> –
﹕ motobói
And it keeps on going with empty td and tr tags. I can't find the actual question. Anyone know why this is happening?
Essentially, I just want the text from the question's post, and I don't know how to get it, so it would be nice if someone could show me how to get the text.
Jsoup is a parser, which means it cannot execute any JavaScript that might generate HTML. When you run into this problem, the only way to retrieve that content is through a headless browser that includes a JavaScript engine. A popular library is Selenium WebDriver.
In order to determine if the content you are trying to parse is generated in the server (static content) or in the client (dynamic content-javascript generated) you can do the following:
Visit the page you want to parse
Press Ctrl + U
The steps above will open a new tab that contains the content that jsoup receives. If the content you need is not there, then it's generated by javascript.
Follow the steps and search for the content. If it's there, but jsoup still has problems, then most probably the case is that the site considers you a bot or a mobile device. Try setting the userAgent of a desktop browser and see what happens.
Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").userAgent("USER_AGENT_HERE").get();
Most importantly, when the site exposes an API for users to extract information programmatically, it's better to just use that.
Stack Overflow has an API available.

Print the values of matcher for Regular expression to logcat

Is the code below correct? I get an error when I print matcher.group(0). Am I printing the values the correct way? Please help. The Logcat error is below.
String GetCddata=<p><a href="http://myimagefactorycollection.files.wordpress.com/2014/09/2db83fcf95c5fc036a00abfb412f50e4.jpg">
<img class="alignnone size-full wp-image-12" src="http://myimagefactorycollection.files.wordpress.com/2014/09/2db83fcf95c5fc036a00abfb412f50e4.jpg?w=529" alt="2db83fcf95c5fc036a00abfb412f50e4" />
</a><a href="https://myimagefactorycollection.files.wordpress.com/2014/09/0e397a47f88e18f8fb91d17db18c7edd-copy.jpg"><img class="alignnone size-full wp-image-4" src="http://myimagefactorycollection.files.wordpress.com/2014/09/0e397a47f88e18f8fb91d17db18c7edd-copy.jpg?w=529" alt="0e397a47f88e18f8fb91d17db18c7edd - Copy" />
</a></p><br /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/myimagefactorycollection.wordpress.com/3/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/myimagefactorycollection.wordpress.com/3/" />
</a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=myimagefactorycollection.wordpress.com&blog=75018866&post=3&subd=myimagefactorycollection&ref=&feed=1" width="1" height="1" />
]];
Pattern pattern = Pattern.compile("(?<=\\<a href=)(.*?)\\>");
Matcher matcher = pattern.matcher(GetCddata);
Log.v("dd",matcher.group(0));
http://i.stack.imgur.com/GKpOD.png
group:
public String group()
Returns the input subsequence matched by the previous match. For a
matcher m with input sequence s, the expressions m.group() and
s.substring(m.start(), m.end()) are equivalent.
Note that some patterns, for example a*, match the empty string. This
method will return the empty string when the pattern successfully
matches the empty string in the input.
Specified by: group in interface MatchResult Returns: The (possibly
empty) subsequence matched by the previous match, in string form
Throws: IllegalStateException - If no match has yet been attempted, or if the previous match operation failed
Source :docs.oracle.com
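Suggestion: call matcher.find() before matcher.group(0), since group() throws IllegalStateException when no match has been attempted yet. Also check your regex pattern and enclose your Log.v() call inside a try...catch block.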
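To make the quoted exception condition concrete: in the posted code find() is never called, so no match has been attempted by the time group(0) runs. Here is a minimal sketch of the usual find()-then-group() loop. The class name HrefExtractor, the method hrefs, and the shortened input are illustrative assumptions, and System.out stands in for Log.v so the example runs off-device.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Captures the value of each double-quoted href attribute on an <a> tag.
    private static final Pattern HREF = Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"");

    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        // group() is only valid after find() (or matches()) has returned true.
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a.jpg\"><img src=\"x\" /></a>"
                + "<a rel=\"nofollow\" href=\"http://example.com/b\">b</a></p>";
        for (String href : hrefs(html)) {
            System.out.println(href);
        }
    }
}
```

On Android the loop body would call Log.v("dd", matcher.group(1)) instead of collecting into a list.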

document.evaluate does not returns proper TextNodes XPath

I am creating "Highlighter" for Android in WebView.
I am getting XPath expression for the selected Range in HTML through a function as follows
/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5]
Now I am evaluating the above XPath expression with this JavaScript code:
var resNode = document.evaluate('/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5]',document,null,XPathResult.FIRST_ORDERED_NODE_TYPE ,null);
var startNode = resNode.singleNodeValue;
but I am getting the startNode 'null'.
But, here is the interesting point:
if I evaluate this '/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]' XPath expression using the same function, it gives the proper node i.e. a 'div'.
The difference between the two XPaths is that the former ends in a text node and the latter ends in a div.
But the same thing is working fine on Desktop browsers.
Edited
Sample HTML
<html>
<head>
<script></script>
</head>
<body>
<div id="mainpage" class="highlighter-context">
<div> Some text here also....... </div>
<div> Some text here also.........</div>
<div>
<h1 class="heading"></h1>
<div class="left_side">
<ol></ol>
<h1></h1>
<div class="text_bio">
In human beings, height, colour of eyes, complexion, chin, etc. are
some recognisable features. A feature that can be recognised is known as
character or trait. Human beings reproduce through sexual reproduction. In this
process, two individuals one male and another female are involved. Male produces
male gamete or sperm and female produces female gamete or ovum. These gametes fuse
to form zygote which develops into a new young one which resembles to their parent.
During the process of sexual reproduction
</div>
</div>
<div class="righ_side">
Some text here also.........
</div>
<div class="clr">
Some text here also.......
</div>
</div>
</div>
</body>
</html>
getting XPath:
var selection = window.getSelection();
var range = selection.getRangeAt(0);
var xpJson = '{startXPath :"'+makeXPath(range.startContainer)+
'",startOffset:"'+range.startOffset+
'",endXPath:"'+makeXPath(range.endContainer)+
'",endOffset:"'+range.endOffset+'"}';
function to make XPath:
function makeXPath(node, currentPath) {
currentPath = currentPath || '';
switch (node.nodeType) {
case 3:
case 4:return makeXPath(node.parentNode, 'text()[' + (document.evaluate('preceding-sibling::text()', node, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null).snapshotLength + 1) + ']');
case 1:return makeXPath(node.parentNode, node.nodeName + '[' + (document.evaluate('preceding-sibling::' + node.nodeName, node, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null).snapshotLength + 1) + ']' + (currentPath ? '/' + currentPath : ''));
case 9:return '/' + currentPath;default:return '';
}
}
I am not working with XML but with HTML in webview.
I tried using Rangy's serialize and deserialize. Serialize works properly, but deserialize does not.
Any ideas what's going wrong?
UPDATE
Finally got the root cause of the problem (not solution yet :( )
What exactly is happening in the Android WebView: somehow, the WebView is changing the DOM structure of the loaded HTML page. Even though the DIV doesn't contain any TEXTNODES, when I select text from the DIV I get a TEXTNODE for every single line in that DIV. For example, for the same HTML page and the same text selection, the XPath I get from the WebView is entirely different from the one given by a desktop browser.
XPath from Desktop Browser:
startXPath /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[1]
startOffset: 184
endXPath: /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[1]
endOffset: 342
Xpath from webview:
startXPath :/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[3]
startOffset:0
endXPath:/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[4]
endOffset:151
Well in your sample the path /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5] selects the fifth text child node of the div element
<div class="text_bio">
In human beings, height, colour of eyes, complexion, chin, etc. are
some recognisable features. A feature that can be recognised is known as
character or trait. Human beings reproduce through sexual reproduction. In this
process, two individuals one male and another female are involved. Male produces
male gamete or sperm and female produces female gamete or ovum. These gametes fuse
to form zygote which develops into a new young one which resembles to their parent.
During the process of sexual reproduction
</div>
That div has a single text child node so I don't see why text()[5] should select anything.
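The same counting rule can be checked outside the browser. The sketch below uses Java's built-in javax.xml.xpath on a cut-down, well-formed version of the markup. The class name TextNodeCount, its helper methods, and the shortened document are assumptions for illustration, and a WebView's DOM may well differ from what this parser builds. The point is only that a div whose content is one run of text, line breaks included, has a single text() child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextNodeCount {
    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Number of nodes selected by the path.
    static int count(Document doc, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double n = (Double) xpath.evaluate("count(" + path + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // First node selected by the path, or null when nothing matches.
    static Node select(Document doc, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Node) xpath.evaluate(path, doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div>line one\nline two\nline three</div></body></html>");
        System.out.println(count(doc, "/html/body/div/text()"));
        System.out.println(select(doc, "/html/body/div/text()[1]") != null);
        System.out.println(select(doc, "/html/body/div/text()[5]") == null);
    }
}
```

If the WebView really does hand back one text node per line, running count(...) style checks in the page itself (for example via snapshotLength) would show a number greater than one for the same div.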

Android html parsing application htmlcleaner

Hi, this is my first post here. I am writing because I have gone through every example Google knows about on HtmlCleaner and I still can't get my project running. I'm trying to make an Android app that fetches and displays data from a Flash-rich web page. The idea is to get only the most important data, so that users wouldn't waste time, money, processing power, and nerves attempting to browse those pages on their smartphones. It's a country-specific web page, and therefore a country-specific app. The page I want to parse contains this part:
<li class="genre-3 genre-7 genre-9 mi-37 ">
<img src="picture.jpg" alt="altTitle">
<div class="superClass">
<a> aaa </a>
bbb
ccc
ddd
eee
</div>
<h2>title_of_super_product</h2>
<ul class="icons tooltip-enabled">
<li class="before"></li>
<li><img src="15_2.png" alt="15_2"></li>
</ul>
<div> </div>
<span class="material">some_material</span>
<span class="price">0.1USD</span>
<p class="text"> Some description </p>
<a class="button-more" href="http://link_to_more_info"></a>
</li>
The above is a list item; there are other similar ones on the web page. I have a Java class ready to be filled with data from the li elements, one object per li element. I need to extract the description, price, material, image links, and the contents of superClass (aaa, bbb, ccc, ddd, etc.). The big question is how to do that. I thought that if I started by building an array of the li elements, I would be able to search each of them further for the subelements I need, but it doesn't work:
TagNode[] liElements = rootNode.getElementsByName("li", true);
for (int i=0; liElements != null && i < liElements.length; i++) {
if(liElements.getAttributeByName("class").contains("genre"))
Log.d("li",liElements.getAttributeByName("class")); }
This gives only the first li element, then it spams NullPointerExceptions in the console. Please help, I'm out of ideas.
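// liElements is an array, so call the method on the current element, liElements[i]
String classType = liElements[i].getAttributeByName("class");
// an li without a class attribute yields null here, so check before using it
if (classType != null && classType.equals("genre........"))
Log.d("li", classType);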

Jsoup - Android - Parse info from Form data / input

I am using Jsoup in my application and I am attempting to parse information from a few input tags in order to add them to a URL and post data automatically.
The portion of HTML I am attempting to parse is as follows:
<div class='theDivClass'>
<form method="post" id="handlePurchase" name="makePurchase" action="/shop.php">
<input type="hidden" name="ProductCode" value="A1223MN" />
<input type="hidden" name="SystemVersion" value="3" >
<input type="hidden" name="ProductClass" value="BOOK" />
</form>
</div>
The desired output would be
x = A1223MN
y = 3
z = BOOK
I am halfway familiar with JSOUP in the sense that I am able to parse out text, images, and urls but for some reason this is not clicking for me.
Any help would be greatly appreciated.
You should be able to use this:
Elements hidden = doc.select("input[type=hidden]");
And then just pull the attr values from each element in hidden. I've just tried it and it seems to work as expected.
For completeness:
Map<String,String> hiddenList = new HashMap<String, String>();
Elements hidden = doc.select("input[type=hidden]");
for (Element el1 : hidden){
hiddenList.put(el1.attr("name"), el1.attr("value"));
}
Will give you a Map of all hidden input fields in the document.
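Since the stated goal is to append these values to a URL or post them, here is a rough sketch of turning such a Map into a URL-encoded string using only the standard library. The class name FormEncoder and the method toQueryString are hypothetical helpers, not part of Jsoup, and a LinkedHashMap is assumed so the fields keep their document order.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncoder {
    // Joins name=value pairs with '&', URL-encoding both sides as UTF-8.
    static String toQueryString(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is a required charset on the JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> hiddenList = new LinkedHashMap<>();
        hiddenList.put("ProductCode", "A1223MN");
        hiddenList.put("SystemVersion", "3");
        hiddenList.put("ProductClass", "BOOK");
        System.out.println(toQueryString(hiddenList));
        // prints ProductCode=A1223MN&SystemVersion=3&ProductClass=BOOK
    }
}
```

The resulting string can be appended after a "?" for a GET request or written as the body of a form POST.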
doc.select("input[name=ProductCode]").attr("value");
doc.select("input[name=SystemVersion]").attr("value");
doc.select("input[name=ProductClass]").attr("value");
There's another way I found:
FormElement f = (FormElement) doc.select("form#handlePurchase").first();
System.out.println(f.formData());
Result:
[ProductCode=A1223MN, SystemVersion=3, ProductClass=BOOK]
Closing this question as it appears from all of the research I have done, you cannot pull data from "hidden" input types.
