How can I extract table with JSOUP

How can I extract table with JSOUP - android

I'm writing an Android app and trying to figure out how I should construct my call to get the table data from this webpage: http://uk.soccerway.com/teams/scotland/saint-mirren-fc/1916/squad/
I've read the cookbook on the JSOUP website, but because I haven't used this library before I am a bit stuck. I came up with something like this:
doc = Jsoup.connect("http://uk.soccerway.com/teams/scotland/saint-mirrenfc/1916/squad/").get();
Element squad = doc.select("div.squad-container").first();
Elements table = squad.select("table squad sortable");
As you can see, I'm nowhere near getting the players' statistics yet. I think the next step should be to point a new Element object to the "tbody" tag inside the "table squad sortable"?
I know I will have to use a for loop once I manage to read the table, and then read each row inside the loop.
Unfortunately the table structure is a bit complex for someone with no experience, so I would really appreciate some advice!

Basically each row has the following selector -
#page_team_1_block_team_squad_3-table > tbody:nth-child(2) > tr:nth-child(X) where X is the row's number (starting at 1).
One way is to iterate over the rows and extract the info:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://uk.soccerway.com/teams/scotland/saint-mirren-fc/1916/squad/";
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";
Document doc = null;
try {
    doc = Jsoup.connect(url)
            .userAgent(userAgent)
            .get();
} catch (IOException e1) {
    e1.printStackTrace();
    return; // nothing to parse if the fetch failed (assumes this runs inside a void method)
}
int i = 1;
Elements row;
do {
    // select() never returns null; it returns an empty list when nothing matches
    row = doc.select("#page_team_1_block_team_squad_3-table > tbody:nth-child(2) > tr:nth-child(" + i + ")");
    for (Element el : row) {
        System.out.print(el.select(".shirtnumber").text() + " ");
        System.out.println(el.select(".name").text());
    }
    i++; // move on to the next row
} while (!row.isEmpty());
This will print the number and name of each player. Since I don't want to count the number of rows (and want to keep the program flexible for changes), I prefer to use a do...while loop: it iterates as long as the row exists (that is, the selection is not empty).
The output I get:
1 J. Langfield
21 B. O'Brien
28 R. Willison
2 S. Demetriou
3 G. Irvine
4 A. Webster
...
Use your browser's developer tools to find the class names of the other columns and use them to get all the info you need.

Related

I need to separate the text from a string based on column names

I am working on an OCR-based Android app, getting this text as a string from the attached image dynamically (the text is read from the image in the horizontal direction).
Text from Image:
"Part Name Part Cost Engine Oil and Oil Filter Replacement Rs 10K Alf Filter Rs 4500 Cabin AC Micro Filter Rs 4000 Pollen Filter Rs 1200 - 1500 AC Disinfectant Rs 3000 Fuel Filter Rs 6000 - 8000 Spark Plug Set Replacement (Applicable in TFSI / Petrol Car Range) Rs 10K Body Wash, Basic Clean 8. Engine Degrease Rs 3000 Body Wax Polish Detailed Rs 7000 - 8000 Car interior Dry Clean with Genn Clean Rs 8000 - 10000 Wheel Alignment & Balancing Rs 6000 - 7000 Brake Pads Replacernent (Pair) Rs 30K - 32K Brake Disc Replacernent (Pair) Rs 30K - 35K ..........".
I need to separate the Part Name and Part Cost (just 2 columns, i.e. Part Name and Part Cost), ignoring all extra text from the column headings. The values should be separated from the string and stored in an SQLite database on Android. I am stuck on how to get the values and separate them.

The text returned from the OCR isn't ideal. The first thing you should do is check whether the OCR solution you are using can be configured to provide better output. Ideally, you want the lines to be separated by newline characters and the space between the columns to be interpreted as something more useful, such as a tab character.
If you have no way of changing the text you get, you'll have to find some way of parsing it. You may want to look into using a parser generator, such as ANTLR, to make this easier.
The following observations may help you to come up with a parsing strategy:
Column 2 items all start with "Rs" or "Upto Rs".
Column 2 items end with either a number (where a number is a string of digits [0-9.], optionally followed by a "K") or "Lakh".
Column 1 items don't begin with a number or "Lakh".
So a basic algorithm could be:
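List<String> column1 = new ArrayList<String>();
List<String> column2 = new ArrayList<String>();
String[] tokens = ocrString.split(" ");
List<String> column = column1;
String item = "";
String amount = "[0-9.]+K?|Lakh"; // a number with an optional "K", or "Lakh"
for (int i = 0; i < tokens.length; i++) {
    String token = tokens[i];
    String nextToken = i == tokens.length - 1 ? "" : tokens[i+1];
    if (column == column1) {
        // compare strings with equals(); == compares object references in Java
        if (token.equals("Rs") || (token.equals("Upto") && nextToken.equals("Rs"))) {
            column.add(item.trim()); item = ""; // store the column 1 item before switching
            column = column2;
            i--; continue; // reprocess this token as the start of the column 2 item
        }
        item += " " + token;
    } else {
        item += " " + token;
        // token is a number or "Lakh" and nextToken does not continue the cost
        // ("-" is treated as continuing a range such as "1200 - 1500")
        if (token.matches(amount) && !nextToken.matches(amount + "|-")) {
            column.add(item.trim()); item = "";
            column = column1;
        }
    }
}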
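For reference, the same idea can be packaged as a small self-contained class that also drops the heading text and returns name/cost pairs ready to be written to a database. This is only a sketch under a few assumptions: the class name OcrColumns, the methods isAmount and parse, and the header argument are hypothetical, the heading is assumed to appear verbatim at the start of the text, and a "-" after an amount is assumed to continue a price range.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrColumns {

    // An amount token: digits and dots with an optional trailing K, or the word "Lakh".
    public static boolean isAmount(String token) {
        return token.matches("[0-9.]+K?") || token.equals("Lakh");
    }

    // Splits the OCR text into rows of {partName, partCost}.
    // "header" is the heading text to strip from the start (assumed to match exactly).
    public static List<String[]> parse(String ocrText, String header) {
        String text = ocrText.trim();
        if (text.startsWith(header)) {
            text = text.substring(header.length()).trim();
        }
        List<String[]> rows = new ArrayList<>();
        String[] tokens = text.split("\\s+");
        StringBuilder name = new StringBuilder();
        StringBuilder cost = new StringBuilder();
        boolean inCost = false;
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            String next = i + 1 < tokens.length ? tokens[i + 1] : "";
            // A cost starts at "Rs" or "Upto Rs".
            if (!inCost && (token.equals("Rs") || (token.equals("Upto") && next.equals("Rs")))) {
                inCost = true;
            }
            if (!inCost) {
                name.append(name.length() > 0 ? " " : "").append(token);
                continue;
            }
            cost.append(cost.length() > 0 ? " " : "").append(token);
            // The cost ends at an amount that is not followed by "-" or another amount.
            boolean costContinues = next.equals("-") || isAmount(next);
            if (isAmount(token) && !costContinues) {
                rows.add(new String[] { name.toString(), cost.toString() });
                name.setLength(0);
                cost.setLength(0);
                inCost = false;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String ocr = "Part Name Part Cost Engine Oil and Oil Filter Replacement Rs 10K "
                + "Pollen Filter Rs 1200 - 1500 AC Disinfectant Rs 3000";
        for (String[] row : parse(ocr, "Part Name Part Cost")) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Each returned pair could then be inserted as one row of an SQLite table. A trailing item with no recognisable cost is silently dropped in this sketch, which may or may not be what you want.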

Google Sheets API v4 - How to get the last row with value?

How do I get the last row with a value in the new Google Sheets API v4?
I use this to get a range from a sheet:
mService.spreadsheets().values().get("ID_SHEET", "Sheet1!A2:B50").execute();
How can I detect the last row with a value in the sheet?

You can set the range to "A2:D" and this would fetch as far as the last data row in your sheet.

I managed to get it by counting the total rows in the current sheet, and then appending the data to the next row.
rowcount = this.mService.spreadsheets().values().get(spreadsheetId, range).execute().getValues().size()

Rather than retrieving all the rows into memory just to get the values in the last row, you can use the append API to append an empty table to the end of the sheet, and then parse the range that comes back in the response. You can then use the index of the last row to request just the data you want.
This example is in Python:
import re

# empty table
table = {
    'majorDimension': 'ROWS',
    'values': []
}
# append the empty table
request = service.spreadsheets().values().append(
    spreadsheetId=SPREADSHEET_ID,
    range=RANGE_NAME,
    valueInputOption='USER_ENTERED',
    insertDataOption='INSERT_ROWS',
    body=table)
result = request.execute()
# get last row index (raw string so that \d is not treated as a string escape)
p = re.compile(r'^.*![A-Z]+\d+:[A-Z]+(\d+)$')
match = p.match(result['tableRange'])
lastrow = match.group(1)
# look up the data on the last row
result = service.spreadsheets().values().get(
    spreadsheetId=SPREADSHEET_ID,
    range=f'Sheetname!A{lastrow}:ZZ{lastrow}'
).execute()
print(result)
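Since the question uses the Java client, here is a rough Java sketch of just the range-parsing step, using the same regular expression as the Python example. It assumes the tableRange string from the append response looks like "Sheet1!A1:D13"; the class name LastRowFromRange and the method lastRow are hypothetical, and no Sheets API call is made here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastRowFromRange {

    // Matches A1 ranges such as "Sheet1!A1:D13"; group 1 is the final row number.
    private static final Pattern RANGE =
            Pattern.compile("^.*![A-Z]+\\d+:[A-Z]+(\\d+)$");

    // Returns the last row number of the range, or -1 if the string has another shape
    // (for example a single cell such as "Sheet1!A1", or null).
    public static int lastRow(String tableRange) {
        if (tableRange == null) {
            return -1;
        }
        Matcher m = RANGE.matcher(tableRange);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        int last = lastRow("Sheet1!A1:D13"); // sample value, assumed format
        // Build the range that covers only the last row, as in the Python example.
        System.out.println("Sheet1!A" + last + ":ZZ" + last);
    }
}
```

Returning -1 instead of throwing keeps the caller in control when the sheet is empty or the range comes back in a form the pattern does not cover.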

😢 Google Sheets API v4 does not have a response that helps you get the index of the last written row in a sheet (the row below which all cells are empty). Sadly, you'll have to work around this and fetch all of the sheet's rows into memory (I urge you to comment if I'm mistaken).
Example:
spreadsheet_id = '1TfWKWaWypbq7wc4gbe2eavRBjzuOcpAD028CH4esgKw'
range = 'Sheet1!A:Z'
rows = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=range).execute().get('values', [])
last_row = rows[-1] if rows else None
last_row_id = len(rows)
print(last_row_id, last_row)
Output:
13 ['this', 'is ', 'my', 'last', 'row']
💡 If you wish to append more rows to the last row, see this

You don't need to. Set a huge range (for example A2:D5000) to guarantee that all your rows fall inside it. I don't know whether it has any further impact, maybe increased memory consumption or something, but for now it's OK.
private List<String> getDataFromApi() throws IOException {
String spreadsheetId = "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms";
String range = "A2:D5000";
List<String> results = new ArrayList<String>();
ValueRange response = this.mService.spreadsheets().values()
.get(spreadsheetId, range)
.execute();
List<List<Object>> values = response.getValues();
if (values != null) {
results.add("Name, Major");
for (List row : values) {
results.add(row.get(0) + ", " + row.get(3));
}
}
return results;
}
Look at the loop for (List row : values). If you have two rows in your table, you will get two elements in the values list.

Put a cell with the formula =COUNTA(A:A) somewhere that doesn't interfere with your data range, and read that value.
In your case, perhaps:
=MAX(COUNTA(A:A50),COUNTA(B:B50))
If there could be empty cells in between, the formula would be a little more tricky, but I believe it saves you some memory.

2022 Update
I don't know if this will be relevant for anyone in 2022, but now you can do it differently.
You can just set an open-ended range:
const column = "A"
const startIndex = 2
const range = column + startIndex + ":" + column
In the response you get all the data in the column, along with a range that ends at the last index.
I tested it in JS and PHP.

Following Mark B's answer, I created a function that performs a dummy append and then extracts the last row info from the dummy append's response.
def get_last_row_with_data(service, value_input_option="USER_ENTERED"):
    last_row_with_data = '1'
    try:
        dummy_request_append = service.spreadsheets().values().append(
            spreadsheetId='<spreadsheet id>',
            range="{0}!A:{1}".format('Tab Name', 'ZZZ'),
            valueInputOption='USER_ENTERED',
            includeValuesInResponse=True,
            responseValueRenderOption='UNFORMATTED_VALUE',
            body={
                "values": [['']]
            }
        ).execute()
        a1_range = dummy_request_append.get('updates', {}).get('updatedRange', 'dummy_tab!a1')
        bottom_right_range = a1_range.split('!')[1]
        number_chars = [i for i in list(bottom_right_range) if i.isdigit()]
        last_row_with_data = ''.join(number_chars)
    except Exception as e:
        last_row_with_data = '1'
    return last_row_with_data

How to insert a table in to group in Corona SDK (.Lua)?

I get an error message when I try to insert a table into a group.
My table contains images.
Here is the code I am using for the table:
local myJoints = {}
for i = 1,5 do
local link = {}
for j = 1,17 do
link[j] = display.newImage( "link.png" )
link[j].x = 121 + (i*34)
link[j].y = 55 + (j*17)
physics.addBody( link[j], { density=2.0, friction=0, bounce=0 } )
-- Create joints between links
if (j > 1) then
prevLink = link[j-1] -- each link is joined with the one above it
else
prevLink = wall -- top link is joined to overhanging beam
end
myJoints[#myJoints + 1] = physics.newJoint( "pivot", prevLink, link[j], 121 + (i*34), 46 + (j*17) )
end
end
And here is the code for the group:
GUI:insert(myJoints);
I have my background image in the GUI group and it is covering the table.
I don't know if it is actually possible to insert a table into a group.
Any help, please?
Thanks in advance!

You can't insert a table into a group using the "insert" method, because that method expects a display object. Try assigning GUI.myJoints = myJoints instead. Also keep in mind that your table just references your display objects, which is different from having them in a group.

Android Jsoup in service - get text of span

I'm pretty new to jsoup. For days I have been trying to read a simple number from a span, without any success.
I hope to find help here. My HTML:
<div class="navi">
<div class="tab mail">
<a href="/comm.php/indexNew/" accesskey="8" title="Messages">
<span class="tabCount">1 </span>
<img src="/b2/message.png" alt="Messages" class="moIcon i24" />
</a>
</div>
The class tabCount exists 3 times in the whole document, though, and I am interested in the first span with this class.
Now I am trying to create a thread in onCreate() of a service with:
Thread downloadThread = new Thread() {
public void run() {
Document doc;
try {
doc = Jsoup.connect("https://www.bla.com").get();
String count = doc.select("div.navi").select("div.tab.mail").select("a[href]").first().select("tabCount").text();
Log.d("SOMETHING", "test"+(count));
} catch (IOException e) {
e.printStackTrace();
}
}
};
downloadThread.start();
This makes my app crash. The same happens if I change text() to ownText(). If I remove text(), the app starts but it gives me null.
What am I doing wrong? By the way, besides the service, a WebView is loading the same URL. Might that be a problem?

You only need to select the element you're interested in; you don't need to select every outer element first. In your example you could try
String count = doc.select("span.tabCount").text();
where "span" is the element type and ".tabCount" is the class name.
For an example that might help you, look at this link
Edit:
Try this code instead, this will get the value of the first span.
Elements elements = doc.select("span.tabCount");
String count = elements.first().text();
And if you want to print all the elements, you could do it like this.
Elements elements = doc.select("span.tabCount");
for (Element e : elements) {
    Log.d("Something", e.text());
}

Didn't you mean .select(".tabCount")?
BTW, on Android AsyncTasks are more convenient than Threads. Also, empty catch blocks are bad practice.

Your select statement is wrong. You can put the whole selector string in one call. Furthermore, you have to prefix "tabCount" with a dot, as it is a class.
String count = doc.select("div.navi div.tab.mail a").first().select(".tabCount").text();

Scraping site with jsoup issue

When I scrape a site using jsoup I get extra values that I do not want to receive.
I only want to receive the player's name, not his team and position. Currently the team and position are scraped as well.
Page Source:
<td class="playertableData">5</td><td class="playertablePlayerName" id="playername_515" style="">Derrick Rose, Chi PG<a href="" class="flexpop" content="tabs#ppc"
My Code:
while (tdIter.hasNext()) {
int tdCount = 1;
Element tdEl = tdIter.next();
name = tdEl.getElementsByClass("playertablePlayerName")
.text();
Elements tdsEls = tdEl.select("td.playertableData");
Iterator<Element> columnIt = tdsEls.iterator();
namelist.add(name);
OUTPUT:
name: Derrick Rose, Chi PG

You are doing it wrong. With the line
name = tdEl.getElementsByClass("playertablePlayerName").text();
you get the complete text of the td with class="playertablePlayerName", which includes an anchor tag and plain text outside any tag. That means you will get
Derrick Rose, Chi PG
which is your output. To solve this, you must select the anchor tag too. Try using the lines below as a replacement.
doc = Jsoup_Connect.doHttpGet();
Elements tdsEls = doc.getElementsByClass("playertablePlayerName");
name = tdsEls.get(0).child(0).text();
You can traverse the children of the td you have already got. When you reach the correct tag, call text() on it.
Feel free to ask if you have any doubts.

You can probably hack up this code to get what you want:
Document doc = Jsoup.connect("http://games.espn.go.com/fba/playerrater?&slotCategoryId=0").get();
for (Element e : doc.select(".playertablePlayerName")) {
// this assumes the name is in the first anchor tag,
// which it seems to be according to the URL in your pastebin
System.out.println(e.select("a").first().text());
}
To translate to your code, I think this will work...
name = tdEl.select("a").first().text();
Let me know if this works for you.

Other solutions:
1.- First Name
String url = "http://games.espn.go.com/fba/playerrater?&slotCategoryId=0";
//First Name
try {
Document doc = Jsoup.connect(url).get();
Element e = doc.select("td.playertablePlayerName > a").first();
String name = e.text();
System.out.println(name);
}
catch (IOException e) {
}
2.- All the names
//All Names
try {
Document doc = Jsoup.connect(url).get();
Elements names = doc.select("td.playertablePlayerName > a");
for( Element e : names ) {
String name = e.text();
System.out.println(name);
}
}
catch (IOException e) {
}
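If only the combined cell text is at hand (as in the question's output, "Derrick Rose, Chi PG"), a plain-string fallback is to cut the text at the first comma. This is a different technique from the jsoup selectors above, shown only as a sketch: it assumes the name itself never contains a comma, and the class name PlayerName and the method nameOnly are hypothetical.

```java
public class PlayerName {

    // Returns the part of the cell text before the first comma, trimmed.
    // If there is no comma, the whole text is returned.
    public static String nameOnly(String cellText) {
        int comma = cellText.indexOf(',');
        return (comma < 0 ? cellText : cellText.substring(0, comma)).trim();
    }

    public static void main(String[] args) {
        System.out.println(nameOnly("Derrick Rose, Chi PG"));
    }
}
```

Selecting the anchor directly is usually the more robust choice when the markup allows it; the string split only helps when the text has already been flattened.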
