Best way to compare IP addresses quickly - android

I'm parsing two CSV files which contain IP addresses.
The first is a source CSV, and the second is a "Blacklist".
Because of the size of the source file, I'm trying to optimize the speed at which I find IP addresses that match the blacklist.
EDIT: The blacklist consists of IP address "blocks". This means that each record in the blacklist has two IP addresses: a start block (e.g. 216.254.128.0) and an end block (e.g. 216.254.223.255).
This means that direct lookups, etc. will NOT work.
I'm wondering what's the best way to approach this. The brute force method would be:
String[] parts = sourceIP.split("\\."); // String array, each element is the text between dots
int hi = 255;
int lo = 0;
int mid = lo + (hi - lo) / 2;
if (Integer.parseInt(parts[0]) > mid) {
    lo = mid; // the value lies in the upper half, so raise the lower bound
}
I could then repeat this for each part to decide whether or not the IP address is in the black list.
This seems pretty aggressive and with 4k+ records, this could take a very, very long time.
It could take 10+ iterations to decide each part and that would then have to be repeated to check the "High" part of the IP blocks in the blacklist. That's 80+ iterations per record.
I'm hoping to get some input here to see the best method for comparing IP addresses.
What are your thoughts?
Would it be possible to use a quick bitwise mask to compare values rapidly by serializing InetAddress?
FILE STRUCTURE CLARIFICATION:
Source IP File:
Contains a list of records from a database (approx. 4k). Each record contains names, addresses, emails, and an IP address.
Blacklist:
Contains 4.2k records. Each record is an IP Address "Block". This consists of two IP Addresses. 1. Start and 2. End.
If the record in the Source list has an IP address that's found in the blacklist, I need to save that record and add it to a new file.

I assume you're talking about IPv4 addresses of the form xxx.xxx.xxx.xxx.
You can easily convert an IP address into an integer. Each segment (i.e. xxx) is 8 bits (i.e. one byte). So four of them together makes a 32-bit integer. So, given an IP address like "192.168.100.12", you can split it into its four parts, parse each one to a byte and create an integer. Say, for example, that you created a byte array of the segments:
ipBytes[0] = 192;
ipBytes[1] = 168;
ipBytes[2] = 100;
ipBytes[3] = 12;
You can turn that into an integer:
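int ipAddress = ipBytes[0] & 0xFF; // mask each byte: where bytes are signed (as in Java), 192 would otherwise sign-extend
ipAddress = (ipAddress << 8) | (ipBytes[1] & 0xFF);
ipAddress = (ipAddress << 8) | (ipBytes[2] & 0xFF);
ipAddress = (ipAddress << 8) | (ipBytes[3] & 0xFF);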
There are more efficient ways to do that, but you get the idea. Your language's runtime library might already have something that'll parse an IP address and give you the bytes to make it an integer.
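As a rough Java sketch of the conversion described above: the class name `Ipv4` and the method `toLong` are made up for illustration, and the input is assumed to be a plain dotted-quad IPv4 string. It returns a `long` rather than an `int` because Java's `int` is signed, so any address from 128.0.0.0 upward would otherwise come out negative.

```java
final class Ipv4 {
    /** Packs a dotted-quad IPv4 string into the low 32 bits of a long. */
    static long toLong(String dottedQuad) {
        String[] parts = dottedQuad.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the previous octets up and append this one.
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

With this, `Ipv4.toLong("192.168.100.12")` yields 3232261132, which can be compared with ordinary `<` and `>`.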
You have a set of IP address ranges that you want to check your source addresses against. Load each of the ranges into a structure like this:
class IPRange
{
public int startIp;
public int stopIp;
}
And store those in an array or list. Then sort the list by starting IP address.
For each source IP address, convert it to an integer and do a binary search of the list on the starting IP address. The source address itself might not be (probably won't be) found as a starting address, but the search can still report the index of the last range whose starting IP address is less than or equal to the source address. You then just have to check the source address against that item's ending IP address to see if it's in the range. Note that in a language with only signed 32-bit integers (Java, for instance), addresses from 128.0.0.0 upward come out negative, so either compare them as unsigned values or store them in a 64-bit integer.
Binary search is O(log n). If you're searching a list of 4,300 ranges, it's going to take at most 13 probes to find an address in the array. That should be plenty fast enough, even when doing 4,000 different searches. You're only talking on the order of 50,000 total probes of the range array.
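The lookup described above could be sketched in Java roughly as follows. This is only an illustration under the stated assumptions (IPv4, non-overlapping ranges, inclusive ends); the names `RangeLookup`, `IpRange`, `sortByStart` and `contains` are invented here, and addresses are held in `long` to sidestep signed 32-bit comparisons.

```java
import java.util.Arrays;
import java.util.Comparator;

final class RangeLookup {
    /** One blacklist block; both ends inclusive, stored as unsigned 32-bit values in a long. */
    static final class IpRange {
        final long start;
        final long end;
        IpRange(long start, long end) { this.start = start; this.end = end; }
    }

    /** Packs a dotted-quad IPv4 string into a long (no validation, for brevity). */
    static long toLong(String dottedQuad) {
        long value = 0;
        for (String part : dottedQuad.trim().split("\\.")) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    /** Sorts by start address; call once after loading the blacklist. */
    static void sortByStart(IpRange[] ranges) {
        Arrays.sort(ranges, Comparator.comparingLong((IpRange r) -> r.start));
    }

    /** True if ip lies inside some range. Assumes sorted, non-overlapping ranges. */
    static boolean contains(IpRange[] sorted, long ip) {
        int lo = 0;
        int hi = sorted.length - 1;
        int candidate = -1; // index of the last range whose start <= ip
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].start <= ip) {
                candidate = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return candidate >= 0 && ip <= sorted[candidate].end;
    }
}
```

Each source record then costs one `toLong` call plus one `contains` call, which matches the roughly 13 probes per lookup estimated above.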
A couple of notes:
First, as I said above, I assume you're talking about IPv4 addresses. If you're talking about IPv6 addresses, the same concepts still apply, but those addresses are 128 bits long, so you'll need a 128-bit value (for example, two 64-bit integers). I don't know enough about IPv6 to say how you'd convert the address to that form. Probably you should depend on your runtime library to get the address bytes.
Second: I assume that ranges don't overlap. That is, you won't have something like:
start range      end range
192.168.1.1      192.168.2.255
192.168.2.1      192.168.3.255
If you have that, then an IP address could fall within either of those ranges. You could potentially construct overlapping ranges that would allow addresses to fall through the cracks. If you have overlapping ranges, the problem becomes a little bit more complicated.
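One common way to deal with overlaps, not spelled out in this answer, is to merge overlapping (or directly adjacent) ranges once after loading, so that the simple binary search stays valid. A minimal Java sketch, assuming each range is a two-element `long[]` of `{start, end}` with inclusive ends; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RangeMerger {
    /** Returns a new list, sorted by start, with overlapping or adjacent ranges merged. */
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1] + 1) {
                // Overlaps or touches the previous range: extend it.
                last[1] = Math.max(last[1], r[1]);
            } else {
                // Copy so the caller's arrays are never modified.
                merged.add(new long[] {r[0], r[1]});
            }
        }
        return merged;
    }
}
```

For the two example rows above, this would produce the single range 192.168.1.1 to 192.168.3.255.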

Read each file into a String. Use split(",") to split the IPs in the first string, then loop through the resulting array of IPs. For every IP, search for it in the second string with blacklist.indexOf("," + ip + ","), after first adding a "," to the start and end of the blacklist string.

Brute force it.
Load everything into ram, no reason not to.
Split the ips into a 2d array.
{0:123,123,123,123}
Blacklist into a 3d array.
Now you can start searching for integers.
When you have a match compare the next section.
If source value higher then compare to the END block same section.
When you have a match push to a new array and write it to a file at the end.
If this takes more time to run than it took me to type this, then close the porn you have open, because your RAM is full and it's using your page file.

You could use a data structure called a Bloom filter, which is rather efficient both performance-wise and storage-wise. For an example, there's a question here, "Most Efficient way of implementing a BlackList", that has an answer recommending this.
As far as I know, Google Chrome also uses this technique, as explained rather nicely in Matthias Vallentin's blog post "A Garden Variety of Bloom Filters".
A more succinct explanation can be found at the Adobe leaked credentials checker. Some excerpts:
The original leak is about 9.3GB uncompressed, of which 3.3GB is email
addresses [...] This means the data can fit into 512MB (i.e. 2^32 bits)
of memory and allows us to perform lookups in constant time [...] An
optimal bloom filter which is allowed to occupy 840MB would have
practically no false positives at all.
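For readers unfamiliar with the structure, here is a deliberately tiny Bloom filter sketch in Java. Everything about it is an assumption made for illustration (the class name, the sizing, and the double-hashing scheme built on `String.hashCode`); a real deployment would use a tuned library implementation. Note that a Bloom filter answers exact-membership questions only, so it fits a list of individual addresses rather than start/end blocks.

```java
import java.util.BitSet;

final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 * 0x9E3779B1) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An item that was added always reports `true`; an item that was never added usually reports `false`, but can occasionally report `true` (a false positive).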

It seems like the most direct solution would be to use an interval tree to store the blacklist, and then check whether the IP intersects with any of the intervals.
You might also want to consider using a trie or hash table to get fast lookups where the leading part of the interval is the same. For example, 216.254.128.0 to 216.254.223.255 can be merged to 216.254.(128.0, 223.255), where the () is the interval. Thus you'd end up with two hash-table lookups (one for 216 and one for 254), then a search in an interval tree, which is likely to contain only a small number of elements.
You can also merge overlapping intervals into a single interval, which can probably be done as you build the interval tree. In that case the tree ends up being more like a binary search tree.

Related

android, how to generate shorter version of uuid (13 chars) an app side

An Android app needs to generate a UUID with 13 chars, but shortening it may increase the chance of clashes.
I came up with this function. The idea was to add the UUID's most and least significant bits, get a string from the resulting Long, and then take the 13-character part of the result. A test run showed no clashes on a single machine (100,000+ UUIDs).
But I'm not sure about the possibility of clashes across machines.
Is there a better way to generate a 13-char UUID with a reasonably low clash rate?
import java.util.Random
import java.util.UUID

val random = Random()
fun generateUUID(): String {
val uuid: UUID = UUID.randomUUID()
val theLong = if (random.nextBoolean()) {
uuid.mostSignificantBits + uuid.leastSignificantBits
} else {
uuid.leastSignificantBits + uuid.mostSignificantBits
}
return java.lang.Long.toString(theLong, Character.MAX_RADIX)
}
It won't be a UUID in the strict sense anymore; UUID describes a very specific data structure. Using the low bits of a proper UUID is generally a bad idea; those were never meant to be unique. Single-machine tests will be inconclusive.
EDIT: now that I think of it, what exactly is "char" in the question? A decimal digit? A hex digit? A byte? An ASCII character? A Unicode character? If the latter, you can stuff a full proper UUID there. Just represent it as binary, not as a hexadecimal string. A UUID is 128 bits long. A Unicode codepoint is 20 bits, ergo 13 of those would cover 260 bits, that's well enough.
The Java char datatype is, effectively, slightly less than 16 bits. If by "13 chars" you mean a Java string of length 13 (or an array of 13 chars), you can still stuff a UUID there, with some trickery to avoid reserved UTF-16 surrogate pair values.
All that said, for globally unique ID generation, they usually use a combination of current time, a random number, and some kind of device specific identifier, hashed together. That's how canonical UUIDs work. Depending on the exact nature of the size limit (which is vague in the question), a different hash algorithm would be advisable.
EDIT: about using the whole range of Unicode. First things first: you do realize that both "du3d2t5fdaib4" and "8efc9756-70ff-4a9f-bf45-4c693bde61a4" draw on a very small alphabet, right? The second is a hex string that uses only 16 characters, 0-9 and a-f, and the first uses only 0-9 and a-z (it looks like base 36). The dashes in the second one can be safely omitted; they're there just for readability. Meanwhile, a single Java char can have one of 63488 possible values: any codepoint from 0 to 0xFFFF, except for the subrange 0xD800..0xDFFF, would do. The string with all those crazy characters won't be nice looking or even printable; it could look something like "芦№Π║ثЯ"; some of the characters might not display in Android because they're not in the system font, but it will be unique all right.
Is it a requirement that the unique string displays nicely?
If no, let's see. A UUID is two 64-bit Java longs. It's a signed datatype in Java; would've been easier if it was unsigned, but there's no such thing. We can, however, treat two longs as 4 ints, and make sure the ints are positive.
Now we have 4 positive ints to stuff into 13 characters. We also don't want to mess with arithmetic that straddles variable boundaries, so let's convert each integer into a 3 character chunk with no overlap. This wastes some bits, but oh well, we have some bits to spare. An int is 4 bytes long, while 3 Java characters are 6 bytes long.
When composing the chars, we would like to avoid the area between D800 and DFFF. Also, we would want to avoid the codepoints from 0 to 1F - those are control characters, unprintable by design. Also, let's avoid character 0x20 - that's space. Now, I don't know exactly how will the string be used; whether or not it will be used in a text format that doesn't allow for escaping and therefore if certain other characters should be avoided to make things simpler downstream.
A contiguous character range is easier to work with, so let's completely throw away the range upwards from 0xD800, too. That leaves us with 0xD7DF distinct codepoints, starting from 0x21. Three of those is plenty enough to cover a 32-bit int. The rule for converting an int into a character triple is straightforward: divide the int by 0xD7DF twice, take the remainders, add the remainders to the base codepoint (which is 0x21). This algorithm is your vanilla "convert an int to a string in base N", with the knowledge that there can be no more than three digits.
All things considered, here goes Java:
// Requires: import java.util.UUID;
public static String uuidToWeirdString(UUID uuid)
{
    // Description of our alphabet: from 0x21 to 0xD7FF
    // (declared as longs so that ALPHA_SIZE * ALPHA_SIZE does not overflow an int)
    final long ALPHA_SIZE = 0xD7DF, ALPHA_BASE = 0x21;
    // Convert the UUID to a pair of signed, potentially negative longs
    long low = uuid.getLeastSignificantBits(),
         high = uuid.getMostSignificantBits();
    // Convert to positive 32-bit ints, represented as signed longs
    // (the mask needs the L suffix: the int literal 0xffffffff is -1)
    long[] parts = {
        (high >>> 32) & 0xffffffffL,
        high & 0xffffffffL,
        (low >>> 32) & 0xffffffffL,
        low & 0xffffffffL
    };
    // Convert ints to char triples
    int nPart, pos = 0;
    char[] c = new char[12];
    for (nPart = 0; nPart < 4; nPart++)
    {
        long part = parts[nPart];
        c[pos++] = (char)(ALPHA_BASE + part / (ALPHA_SIZE * ALPHA_SIZE));
        c[pos++] = (char)(ALPHA_BASE + (part / ALPHA_SIZE) % ALPHA_SIZE);
        c[pos++] = (char)(ALPHA_BASE + part % ALPHA_SIZE);
    }
    return new String(c);
}
Feast your eyes on the beauty of the Unicode.
A UUID is a 128-bit data type, commonly shown in a 36-character hexadecimal representation, or about 4 bits per character.
Your example is "du3d2t5fdaib4". That only uses lower case Latin letters and Arabic numerals, which gives you about 5 bits per character, or 13×5=65 bits. If you also allow upper case Latin letters, that gives you about 6 bits per character, or 13×6=78 bits.
You cannot fit a 128-bit value into a 65- or 78-bit data type without throwing away nearly half of the bits, which will radically increase the odds of collision—perhaps even guarantee it depending on how the UUIDs were generated and which bits you throw away.
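The arithmetic above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are invented, and the collision estimate is the usual birthday-bound approximation, which assumes the IDs are uniformly random over the whole space.

```java
final class IdCapacity {
    /** Bits of information in an ID of the given length over an alphabet of the given size. */
    static double bits(int length, int alphabetSize) {
        return length * (Math.log(alphabetSize) / Math.log(2));
    }

    /**
     * Birthday-bound approximation: how many uniformly random IDs it takes
     * before the chance of at least one collision reaches about 50%.
     */
    static double idsForEvenCollisionOdds(double bits) {
        return Math.sqrt(2 * Math.log(2)) * Math.pow(2, bits / 2);
    }
}
```

`IdCapacity.bits(13, 36)` comes out a little above 67 and `IdCapacity.bits(13, 62)` a little above 77, slightly more than the rounded 65 and 78 quoted above, and in both cases still far short of 128.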

Arduino --> Android bluetooth communication (receive text with App Inventor)

I'm creating an Arduino based drone that can be controlled through an Android application.
In order to improve the user experience, I'd like to show the accelerometer/compass sensor values in the application, so I need to send them from Arduino to Android via Bluetooth. The values are simple integers between 0 and 180.
The best solution I could think of is to concatenate all the values (separated by commas) into one string and send it to the app, which will split out the individual values. The string is sent only when the app requests it, in this case when Arduino receives a 'z' byte.
if (Serial.available() > 0) {
if (Serial.read()=='z'){
Serial.println(String((int)sensor1) + ',' + String((int)sensor2) + ',' + String((int)sensor3));
}
}
Here are the App Inventor blocks:
It seems that the values are being received quite well, but there is a critical issue: sometimes the string is not received correctly, and that causes a lot of errors. Sometimes the received string is (for example) 10,10,10, but sometimes it is 10,10,1010 or just 10,10, etc.
I also tried to send the values one by one, but the result was nearly the same.
I even tried to set 'numberOfBytes' to -1, using a delimiter byte, but unfortunately this was not successful either.
I'm getting quite mad, so I hope there is another way to send those integers to Android, or to fix the system I'm already using.
I used Serial.print to send each result and then used Serial.write('>'); as the end marker.
In the App Inventor Designer window, set the DelimiterByte of the Bluetooth client to 62 (the ASCII value of the > character).
In the Blocks window, use BluetoothClient1.ReceiveText and set numberOfBytes to -1.
App Inventor will then read until a delimiter is found.
However, it will cause the app to hang if it doesn't find one.
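To see why an end marker helps, here is a conceptual Java sketch of delimiter-based framing. It is not App Inventor code and does not use any App Inventor API; the class `DelimitedFrames` and its `feed` method are hypothetical. The point is that serial bytes can arrive in arbitrary chunks, so the receiver buffers them and only hands out text once the delimiter has been seen.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class DelimitedFrames {
    private final StringBuilder pending = new StringBuilder();
    private final String delimiter;

    DelimitedFrames(char delimiter) {
        this.delimiter = String.valueOf(delimiter);
    }

    /** Appends newly received bytes and returns every message completed so far. */
    List<String> feed(byte[] chunk) {
        pending.append(new String(chunk, StandardCharsets.US_ASCII));
        List<String> complete = new ArrayList<>();
        int end;
        while ((end = pending.indexOf(delimiter)) >= 0) {
            complete.add(pending.substring(0, end).trim());
            pending.delete(0, end + 1); // drop the message and its delimiter
        }
        return complete;
    }
}
```

If "10,10" arrives first and ",10>" arrives later, nothing is returned for the first chunk and "10,10,10" is returned after the second, instead of a truncated "10,10" followed by a merged "10,10,1010".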
The problem is that you are not signaling the end of the string.
I used his example in a project, and it was something like this:
while (Serial.available() > 0) {
Serial.read(); // consume the request byte; otherwise available() never drops back to 0
Serial.println(String((int)Sensor1) + ',' + String((int)Sensor2) + ',');
}
If you compare the two snippets, the difference is the extra ',' at the very end of the printed string, and that is what solved the problem.

How to differentiate between cellular operators in an Android application?

I have an Android application which uses an .so file. The .so changes its behavior according to the network the phone is connected to, i.e. if you are connected to AT&T you need to do XYZ, if you are on Verizon you do ABC, and otherwise you do XY.
Is there any good way to differentiate between mobile networks?
I thought to use PLMN somehow, Is there any robust way of doing
that? (I want it to work while roaming too etc.).
I had seen this, but I need to do it only in the C code with no wrappers or Java engagement, meaning the following can't be used:
TelephonyManager telephonyManager =((TelephonyManager) Context.getSystemService(Context.TELEPHONY_SERVICE));
String operatorName = telephonyManager.getNetworkOperatorName();
You can get the currently used PLMN with the AT+COPS? command. From 27.007:
+COPS? +COPS: <mode>[,<format>,<oper>[,<AcT>]]
...
Read command returns the current mode, the currently selected operator and the
current Access Technology. If no operator is selected, <format>, <oper> and <AcT>
are omitted.
....
<oper>: string type; <format> indicates if the format is alphanumeric or numeric;
long alphanumeric format can be upto 16 characters long and short format up to 8
characters (refer GSM MoU SE.13 [9]); numeric format is the GSM Location Area
Identification number (refer 3GPP TS 24.008 [8] subclause 10.5.1.3) which
consists of a three BCD digit country code coded as in ITU-T E.212 Annex A
[10], plus a two BCD digit network code, which is administration specific;
returned <oper> shall not be in BCD format, but in IRA characters converted from
BCD; hence the number has structure: (country code digit 3)(country code digit 2)
(country code digit 1) (network code digit 3)(network code digit 2)(network code
digit 1)
Use the following two AT commands:
AT+COPN 7.21 - Read operator names
AT+COPS 7.3 - PLMN selection

Android string processing from TCP stream

I have a very basic TCP socket connection to a remote device that I can poll for status.
Aside from the socket programming, which I have mostly figured out through asynctask, I'm trying to come up with a way to parse out the returning string.
I query the device with something like "VOL?"
The device responds with the Volume of 12 different audio outputs with this:
"VOL:33,0,21,12,0,43,0,0,0,0,20,0"
The ":" character always and only comes back after the echo of the initial command, so I can use whatever comes before the colon to flag what sort of answer is coming in. (VOL, BAS, MUT, TRE, BAL, etc)
In the case of VOL, I simply want to chunk out everything that comes between the commas, so I can chop up and place into an array the volumes of all zones.
The only thing I can think of is to grab the length of the string, then run a for loop through it searching for commas one by one, but it seems ridiculously messy:
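String strIncoming = "VOL:33,0,21,12,0,43,0,0,0,0,20,0"; // the incoming TCP string (sample response from above)
String[] volzoneVal = new String[12]; // one slot per zone
int oldPos = strIncoming.indexOf(':') + 1; // marks where the current value starts (just past the last delimiter found)
int y = 0; // used to hold the resulting value's array position
for (int x = oldPos; x <= strIncoming.length(); x++) {
    // a comma, or the end of the string, closes the current value
    if (x == strIncoming.length() || strIncoming.charAt(x) == ',') {
        volzoneVal[y] = strIncoming.substring(oldPos, x);
        oldPos = x + 1;
        y++;
    }
}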
There has GOT to be a better way. (I'm not even sure this is going to work; I'm typing it here for the first time as I brainstorm this problem, so it has not been run or compiled.)
Is there a better way to scan through a string looking for hits?
strIncoming.split(":")[0] will give you what was before first colon
strIncoming.split(":")[1].split(",") will give you array of individual strings
First, split the string on the colon, and then split[0] is your type. Then take split[1] and split it on the comma, and you'll have all your 12 different outputs ready to go (just convert them to integers).
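Put together, the split-based approach might look like the following Java sketch. The class and method names are made up for illustration, and it assumes every response has the `NAME:v1,v2,...` shape shown in the question, with integer values.

```java
final class StatusParser {
    /** Returns the command echo before the first colon, e.g. "VOL". */
    static String commandOf(String response) {
        return response.split(":", 2)[0].trim();
    }

    /** Returns the comma-separated values after the first colon as ints. */
    static int[] valuesOf(String response) {
        String[] halves = response.split(":", 2);
        if (halves.length < 2 || halves[1].trim().isEmpty()) {
            return new int[0]; // no payload after the colon
        }
        String[] fields = halves[1].trim().split(",");
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }
}
```

For the sample response, `commandOf` gives "VOL" and `valuesOf` gives a 12-element array starting with 33. The limit of 2 in `split(":", 2)` keeps any later colon inside the payload rather than splitting on it.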
Use Java's string split function and split on the comma as the delimiter. You will then have an array of your parameters. If you append some kind of "end string" character to each response, you will know the start and end based on the colon for the start and your end character for the end.

Doing order by using the Jaro-Winkler distance algorithm?

I am wondering how would I be able to run a SQLite order by in this manner
select * from contacts order by jarowinkler(contacts.name,'john smith');
I know Android has a bottleneck with user-defined functions. Do I have an alternative?
Step #1: Do the query minus the ORDER BY portion
Step #2: Create a CursorWrapper that wraps your Cursor, calculates the Jaro-Winkler distance for each position, sorts the positions, then uses the sorted positions when overriding all methods that require a position (e.g., moveToPosition(), moveToNext()).
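The scoring and sorting part of Step #2 can be sketched in plain Java as below. This leaves out the CursorWrapper wiring entirely and works on a list of names already read from the cursor; the class and method names are hypothetical, and the Jaro-Winkler code uses the standard formulation (prefix scale 0.1, prefix capped at 4 characters).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class JaroWinklerSort {
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0;
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    static double jaroWinkler(String a, String b) {
        double score = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return score + prefix * 0.1 * (1.0 - score);
    }

    /** Returns a copy of names ordered from most to least similar to target (case-insensitive). */
    static List<String> sortBySimilarity(List<String> names, String target) {
        String key = target.toLowerCase();
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingDouble(
                (String n) -> -jaroWinkler(n.toLowerCase(), key)));
        return sorted;
    }
}
```

In a CursorWrapper, the same idea would apply to row positions instead of strings: compute a score per position, sort the positions, and map each requested position through the sorted array.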
Precalculate string lengths and add them in a separate column. Then sort the entire table by that length. Add indexes (if you can). Then add extra filters; for example, you don't want to compare "Srivastava Brahmaputra" to "John Smith". The lengths are out of whack by way too much, so exclude these kinds of comparisons by length difference as a percentage of the total length. So if your word is 10 characters, compare it only to words with 10±2 or 10±3 characters.
This way you will significantly reduce the number of times this algorithm needs to run.
Typically, in a vocabulary of 100,000 entries, such filters reduce the number of comparisons to about 300. That is, unless you are doing full-blown record linkage, and then I would wonder why you'd use Android for that. You would still need to apply probabilistic methods for that and calculate scores, and this is not a job for Android (at least not for now).
Also, in MS SQL Server, a Jaro-Winkler string distance wrapped in a CLR function performs much better, since SQL Server doesn't support arrays natively and much of the processing is around arrays. So an implementation in T-SQL adds too much overhead, but SQL-CLR works extremely fast.
