scraping html data and parsing into list - android

I am writing an Android app using Python for Android (SL4A). I want it to search a joke website, extract a joke, and then read that joke aloud to wake me up. So far it saves the raw HTML source to a list, but I need it to build a new list from the data between the HTML tags and then read that data to me. It's the parser I can't get to work. Here's the code:
import android
droid = android.Android()
import urllib
current = 0
newlist = []
sock = urllib.urlopen("http://m.funtweets.com/random")
htmlSource = sock.read()
sock.close()
rawhtml = []
rawhtml.append (htmlSource)
while current < len(rawhtml):
    while current != "<div class=":
        if [current] == "</b></a>":
            newlist.append (current)
        current += 1
print newlist

Use this library for parsing HTML on Android: http://jsoup.org/. It is a rich library that is widely accepted among developers. Note that jsoup itself is a Java library, so from Python you would need a comparable HTML parser instead.

This is how to do this:
[code]
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst = [match.group(1) for match in re.finditer(user, page)]
text_lst = [match.group(1) for match in re.finditer(text, page)]
for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user, _text)
[/code]
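For readers on Python 3, the same regular-expression idea can be sketched as below. This is only an illustration, not the answerer's code: the markup in SAMPLE_HTML is an assumption about what the page looked like (inferred from the two patterns above), and extract_tweets is a hypothetical helper name. Fetching the page is left out so the sketch runs offline.

```python
import re

# Hypothetical sample markup, inferred from the patterns in the answer above;
# the real page layout may differ.
SAMPLE_HTML = """
<div class="tweet"><a href="/u/alice"><b><span>#</span>alice</b></a> I told my cat a joke. No reaction.</div>
<div class="tweet"><a href="/u/bob"><b><span>#</span>bob</b></a> My alarm clock and I are no longer on speaking terms.</div>
"""

USER_RE = re.compile(r"<span>#</span>(\w+)")
# Capture the text that follows the closing </b></a>, up to the next tag.
TEXT_RE = re.compile(r"</b></a>\s*([^<]+)")


def extract_tweets(html):
    """Return a list of (user, text) pairs found in html."""
    users = USER_RE.findall(html)
    texts = [t.strip() for t in TEXT_RE.findall(html)]
    return list(zip(users, texts))


if __name__ == "__main__":
    for user, text in extract_tweets(SAMPLE_HTML):
        print("#{0}\n{1}\n".format(user, text))
```

The resulting list of pairs is the "new list" the question asks for; each text entry could then be passed to whatever text-to-speech call the app uses.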

Related

How do I serialize an NLP classification PyTorch model

I am attempting to use a new NLP model within the PyTorch Android demo app (Demo App Git); however, I am struggling to serialize the model so that it works with Android.
The demonstration given by PyTorch is as follows for a Resnet model:
import torch
import torchvision
model = torchvision.models.resnet18(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("app/src/main/assets/model.pt")
However I am not sure what to use for the 'example' input with my NLP model.
The model I am using comes from a fastai tutorial; the Python is linked here: model
Here is the Python used to create my model (using the fastai library). It is the same as in the model link above, but in a simplified form.
from fastai.text import *
path = untar_data('http://files.fast.ai/data/examples/imdb_sample')
path.ls()
#: [PosixPath('/storage/imdb_sample/texts.csv')]
data_lm = TextDataBunch.from_csv(path, 'texts.csv')
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_from_df(col=2)
        .label_from_df(cols=0)
        .databunch())
bs=48
path = untar_data('https://s3.amazonaws.com/fast-ai-nlp/imdb')
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'test', 'unsup'])
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))
learn.save_encoder('fine_tuned_enc')
path = untar_data('https://s3.amazonaws.com/fast-ai-nlp/imdb')
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             .split_by_folder(valid='test')
             .label_from_folder(classes=['neg', 'pos'])
             .databunch(bs=bs))
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

I worked out how to do this after a while. The issue was that the Fastai model wasn't tracing correctly no matter what shape of input I was using.
In the end, I used another text classification model and got it to work. I wrote a tutorial about how I did it, in case it can help anyone else.
NLP PyTorch Tracing Tutorial
Begin by opening a new Jupyter Python Notebook using your preferred cloud machine provider (I use Paperspace).
Next, copy and run the code in the PyTorch Text Classification tutorial. But replace the line…
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
With…
device = torch.device("cpu")
NOTE: Tracing caused issues when the device was set to CUDA, so I forced it onto the CPU. (This will slow training, but inference on the mobile device runs on the CPU anyway, so its speed is unaffected.)
Lastly, run the code below to correctly trace the model to allow it to be run on Android:
data = DataLoader(test_dataset, batch_size=1, collate_fn=generate_batch)
for text, offsets, cls in data:
    text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
example = text, offsets
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("model.pt")
In addition, if you would like a CSV copy of the vocab list for use on Android when you are making predictions, run the following code afterwards:
import pandas as pd
vocab = train_dataset.get_vocab()
df = pd.DataFrame.from_dict(vocab.stoi, orient='index', columns=['token'])
df[:30]
df.to_csv('out.csv')
This model should work fine on Android using the PyTorch API.
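To make the role of the exported vocabulary concrete, here is a rough sketch of the lookup a client would have to perform before calling the traced model: turning text into token ids plus an offsets list. It is written in Python only for illustration. The CSV contents in SAMPLE_CSV are made up, the whitespace tokenization is a simplifying assumption (the tutorial's real tokenizer and n-gram handling may differ), and load_vocab and encode are hypothetical helper names. Note that the column named 'token' in the export above actually holds the integer id, while the token string sits in the unnamed index column.

```python
import csv
import io

# Hypothetical contents of out.csv: pandas writes the index (the token string)
# in the first, unnamed column and the 'token' column (the integer id) second.
SAMPLE_CSV = ",token\n<unk>,0\nthe,1\nmovie,2\ngreat,3\n"


def load_vocab(csv_file):
    """Read token -> id pairs from a file-like object, skipping the header."""
    reader = csv.reader(csv_file)
    next(reader)  # header row: ",token"
    return {row[0]: int(row[1]) for row in reader}


def encode(text, vocab, unk_id=0):
    """Map lower-cased, whitespace-separated tokens to ids (assumed tokenization)."""
    return [vocab.get(tok, unk_id) for tok in text.lower().split()]


if __name__ == "__main__":
    vocab = load_vocab(io.StringIO(SAMPLE_CSV))
    ids = encode("The movie was great", vocab)
    offsets = [0]  # a single sentence that starts at position 0
    print(ids, offsets)
```

The ids and offsets play the same role as the (text, offsets) pair used as the tracing example above.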

Android Extract url with specific domain name from String

I am developing a JSON application. I am able to download all of the data but I'm running into an interesting issue. I am trying to grab a string with the domain name:
http://www.prindlepost.org/
When grabbing all of the JSON, I get an extremely large string which I am unable to paste in there. The part I am trying to parse out is:
<p>The road through Belgrade was quiet at 4 A.M. Besides the occasional whir of another car speeding by, my taxi was largely alone on the road. Through the windshield I could see the last traces of apartment blocks pass by as we left the outskirts of the city. Somewhere beyond the limits of my vision, I knew the airport waited, its converging neon runway lines already lighting up the pre-dawn darkness.</p>
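<div class="more-link-wrap wpb_button"><a href="http://www.prindlepost.org/2015/06/this-is-a-self-portrait/" class="more-link">Read more</a></div>
where I am focusing on:
<a href="http://www.prindlepost.org/2015/06/this-is-a-self-portrait/" class="more-link">Read more</a></div>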
I'm unfamiliar with extracting strings like this. In the end, I want to be able to save the URL as its own string. For example, the above would be converted into:
String url = "http://www.prindlepost.org/2015/06/this-is-a-self-portrait/";
One thing to note: there are A LOT of URLs, so narrowing down by class name may help me a bunch.
My initial guess was:
// <READ MORE>
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(content);
String urlTemp = null;
if (m.find()) {
    urlTemp = m.group(1); // this variable should contain the link URL
}
Log.d("LINK WITHIN TEXT", ""+urlTemp);
// </READ MORE>
Any help is appreciated!

It may be worth trying to use something like jsoup: http://jsoup.org/
If you check out their example for parsing out links:
String html = "<p>The road through Belgrade was quiet at 4 A.M. Besides the occasional whir of another car speeding by, my taxi was largely alone on the road. Through the windshield I could see the last traces of apartment blocks pass by as we left the outskirts of the city. Somewhere beyond the limits of my vision, I knew the airport waited, its converging neon runway lines already lighting up the pre-dawn darkness.</p>"
+ "<div class=\"more-link-wrap wpb_button\">"
+ "<a href=\"http://www.prindlepost.org/2015/06/this-is-a-self-portrait/\" class=\"more-link\">"
+ "Read more</a></div>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String relHref = link.attr("href"); // "http://www.prindlepost.org/2015/06/this-is-a-self-portrait/" (the href in this HTML is already absolute)
String absHref = link.attr("abs:href"); // "http://www.prindlepost.org/2015/06/this-is-a-self-portrait/"
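Since the question mentions narrowing the links down by class name, here is a small sketch of that filtering step. It is written in Python with the standard library's html.parser rather than jsoup, purely to illustrate the idea; LinkCollector and extract_links are hypothetical names, and the class name "more-link" is taken from the HTML above. In jsoup the corresponding narrowing would normally be expressed with a CSS selector such as a.more-link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_links(html, css_class="more-link"):
    """Return the hrefs of all <a class="..."> tags matching css_class."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    html = (
        '<p>The road through Belgrade was quiet at 4 A.M.</p>'
        '<div class="more-link-wrap wpb_button">'
        '<a href="http://www.prindlepost.org/2015/06/this-is-a-self-portrait/" class="more-link">'
        'Read more</a></div>'
    )
    print(extract_links(html))
```

Filtering on the class keeps unrelated links in the large JSON payload out of the result, which a bare href pattern would not do.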

Receiving hexadecimal values from socket instead of strings on SocketServer

I opened a socket between an Android app and a Python server. The setup is that the server listens and the Android app connects to it.
Here is the server code. The problematic part is in the definition of handle:
import SocketServer
from time import sleep
import sys
HOST = '192.168.56.1'
PORT = 2000
class SingleTCPHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        try:
            while(1):
                sleep(0.03)
                data = self.rfile.readline().strip()
                print data
        except KeyboardInterrupt:
            sys.exit(0)
class SimpleServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    allow_reuse_address = True
    def __init__(self, server_address, RequestHandlerClass):
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
server = SimpleServer((HOST, PORT), SingleTCPHandler)
try:
    server.serve_forever()
except KeyboardInterrupt:
    sys.exit(0)
The connection is established normally, and the Android app sends the following data to the socket:
'0:0'
But the data is received on the Server as:
'\x000\x00:\x000\x00'
The variable that receives the data is:
data = self.rfile.readline().strip()
and printing gives the regular format:
In [2]: print data
0:0
I didn't manage to step into the print function with pdb to see what it does.
I'm looking for a way to convert the '\x000\x00:\x000\x00' to '0:0'.
Please advise on a way to convert the variable. You are welcome to comment on or criticize the whole implementation. This is my first project dealing with sockets, so I don't know the pitfalls.
Update
This was the original Android code:
String podaci = "0:0";
public void Socketic() throws IOException {
    Socket mojSocket = new Socket(urlServer, port);
    DataOutputStream izlazdata = new DataOutputStream(
            mojSocket.getOutputStream());
    while (podaci != "end") {
        try {
            Thread.sleep(60);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        izlazdata.writeChars(podaci);
        izlazdata.flush();
    }
    izlazdata.close();
    mojSocket.close();
};
And the problem was, as you suspected in:
izlazdata.writeChars(podaci);
writeChars uses the method writeChar. The API documentation for writeChar states:
Writes a char to the underlying output stream as a 2-byte value, high byte first...
The two bytes represent the 16bits which UTF-16 uses for encoding.
When we changed it to the following, everything started working:
izlazdata.writeBytes(podaci);
Update
Based on the answers given, here is how the unwanted string is to be interpreted in terms of characters.
This solves my concrete problem; however, I would appreciate it if someone could give a more general explanation of what happened here, so that a larger lesson can be learned.
If not, I will accept Esailija's answer in a few days.

You need to show the code on the Android side, but it strongly seems like it's sending data in UTF-16BE. You should specify the encoding on the Android end. The characters are not literally hexadecimal; because the NUL character is unprintable, Python shows \x00 instead.
Another option is to decode it:
self.rfile.readline().decode("utf_16_be").strip()
Note that the result of this is a unicode string.
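The byte pattern in the question can be reproduced without a socket. The following is a minimal Python 3 sketch (the server code above is Python 2); show_encodings is a hypothetical helper name. It shows that encoding "0:0" as UTF-16BE puts a zero high byte in front of every ASCII character, while an ASCII encoding sends one byte per character.

```python
def show_encodings(message="0:0"):
    """Return the UTF-16BE and ASCII byte encodings of message."""
    utf16 = message.encode("utf-16-be")   # two bytes per character, high byte first
    ascii_bytes = message.encode("ascii")  # one byte per character
    return utf16, ascii_bytes


if __name__ == "__main__":
    utf16, ascii_bytes = show_encodings()
    print(repr(utf16))
    print(repr(ascii_bytes))
    # Decoding with the matching codec recovers the original text.
    print(utf16.decode("utf-16-be"))
```

Either side of the connection can fix the mismatch, as long as both ends agree on one encoding: the sender can write single-byte characters, or the receiver can decode the bytes as UTF-16BE.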

Proper way to handle dumping and reloading of JSON data containing special characters on Android?

Not sure if this has been answered already, but a quick search didn't turn up a satisfying result.
I'm stuck with the following scenario:
a web service with a REST API and JSON-formatted data blobs
an Android client app talking to this service and locally caching / processing the data
The web service is run by a German company, so some of the strings in the result data contain special characters like German umlauts:
// example response
[
    {
        "title" : "reward 1",
        "description" : "Ein gro\u00dfer Kaffee f\u00fcr dich!"
    },
    {
        "title" : "reward 2",
        "description" : "Eine Pizza f\u00fcr dich!"
    },
    ...
]
Locally, the app parses the data using a set of classes that mirror the response objects (e.g. Reward and RewardResponse classes for the example above). Each of these classes can read itself from and dump itself to JSON; however, this is where things get ugly.
Taking the example above org.json will correctly parse the data and the resulting strings will contain proper Unicode versions of the special characters 'ß' (\u00df) and 'ü' (\u00fc).
final RewardResponse response = new RewardResponse(jsonData);
final Reward reward = response.get(0);
// this will print "Ein großer Kaffee für dich!"
Log.d("dump server data", reward.getDescription());
final Reward reward2 = new Reward(reward.toJSON());
// this will print "Ein gro�er Kaffee f�r dich!"
Log.d("dump reloaded data", reward2.getDescription());
As you can see there is a problem with loading the data generated by JSONObject.toString().
Mainly, what's happening is that JSONObject will parse escapes in the form of "\uXXXX" but will dump them as plain UTF-8 text.
In turn, when parsing that text it won't properly read the Unicode and will instead insert a replacement character in the result string (the � above, \uffff as code point).
My current workaround consists of a look-up table containing the Unicode Latin-1 Supplement characters and their respective escaped versions (\u00a0 up to \u00ff). But this also means I have to go over every dumped JSON text and replace the characters with their escaped versions each time I dump something.
Please tell me there is a better way for this!
(Note: there is this question; however, its author had problems with local file encoding on disk.
My problem above, as you can see, is reproducible without ever writing to disk.)
EDIT: As requested in the comments here's the toJSON() method:
public final String toJSON() {
    JSONObject obj = new JSONObject();
    // mTitle and mDescription contain the unmodified
    // strings received from parsing.
    obj.put("title", mTitle);
    obj.put("description", mDescription);
    return obj.toString();
}
As a side note it makes no difference if I use JSONObject.toString() or a JSONStringer.
(The documentation advises to use .toString())
EDIT: just to remove Reward from the equation, this reproduces the problem:
final JSONObject inputData = new JSONObject("{\"description\":\"Ein gro\\u00dfer Kaffee\"}");
final JSONObject parsedData = new JSONObject(inputData.toString());
Log.d("inputData", inputData.getString("description"));
Log.d("parsedData", parsedData.getString("description"));

[Note: posted as an answer for better formatting]
I just tried the example
final JSONObject inputData = new JSONObject("{\"description\":\"Ein gro\\u00dfer Kaffee\"}");
final JSONObject parsedData = new JSONObject(inputData.toString());
Log.d("inputData", inputData.getString("description"));
Log.d("parsedData", parsedData.getString("description"));
on my Nexus 7 running Android 4.2.1, and on Nexus S running 4.1.2, and it works as intended:
D/inputData(17281): Ein großer Kaffee
D/parsedData(17281): Ein großer Kaffee
In which Android version did you see the problem?
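As a cross-check of what a JSON round trip should do at the format level, here is a small sketch using Python's json module. It is not org.json and says nothing about any particular Android version; round_trip and decode_with_wrong_charset are hypothetical helper names. It illustrates two points: an escaped \u00df and a raw ß parse to the same string, and a replacement character typically appears when bytes are decoded with the wrong charset somewhere along the way, not from the JSON syntax itself.

```python
import json


def round_trip(raw_json, ensure_ascii):
    """Parse raw_json, dump it again, and parse the dumped text once more."""
    parsed = json.loads(raw_json)
    dumped = json.dumps(parsed, ensure_ascii=ensure_ascii)
    return dumped, json.loads(dumped)


def decode_with_wrong_charset(text):
    """Encode as Latin-1 but decode as UTF-8, replacing undecodable bytes."""
    return text.encode("latin-1").decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = '{"description": "Ein gro\\u00dfer Kaffee"}'
    for flag in (True, False):
        dumped, reparsed = round_trip(raw, flag)
        print(dumped, "->", reparsed["description"])
    # A charset mismatch, not the JSON escape handling, yields U+FFFD.
    print(decode_with_wrong_charset("großer"))
```

If the reloaded string contains the replacement character, it is therefore worth checking every place where the dumped text is converted to or from bytes.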

Loading Twitter XML in Flash CS5.5 Actionscript 3 for Android /iOS

This is probably a stupid question, but I have spent the last 5 days searching the net and trying different ways to load a Twitter feed into my AIR application for Android (I would like to port it over to iOS, but it needs building first).
The best result I had was loading the Twitter XML feed:
var xmlData:XML = new XML();
var theURL_ur:URLRequest = new URLRequest("http://twitter.com/statuses/user_timeline/weliebeneath.xml");
var loader_ul:URLLoader = new URLLoader(theURL_ur);
loader_ul.addEventListener("complete", fileLoaded);
function fileLoaded(e:Event):void
{
    xmlData = XML(loader_ul.data);
    txt.text = xmlData.text;
}
But it will not load into my dynamic text box. Any ideas?

I'm not going to comment on what you did wrong in your code, but I would point you to some resources about XML in ActionScript, because it is a broad and useful topic.
Here is what the code should look like:
var xmlData:XML;
var loader_ul:URLLoader = new URLLoader();
// Register the listener before starting the load.
loader_ul.addEventListener(Event.COMPLETE, fileLoaded);
loader_ul.load(new URLRequest("http://twitter.com/statuses/user_timeline/weliebeneath.xml"));
function fileLoaded(e:Event):void {
    xmlData = new XML(e.target.data);
    // A text field expects a String, so serialize the XML explicitly.
    txt.text = xmlData.toXMLString();
}
