Protobuf streaming (lazy serialization) API - android

We have an Android app that uses Protocol Buffers to store application data. The data format (roughly) is a single protobuf ("container") that contains a list of protobufs ("items") as a repeated field:
message Container {
    repeated Item item = 1;
}
When we want to save a change to an item, we must recreate the protobuf container, add all the items to it, then serialize it and write it to a file.
The problem with this approach is that it potentially triples the memory used when saving: the data is first copied from the model classes into the protobuf builder, then into a byte array when the protobuf is serialized, all before it is written out to the file stream.
What we would like is a way to create our protobuf container and lazily serialize it to a stream, then simply add each protobuf item (created from our model data) to the container which serializes and writes it to the stream, rather than keeping all the items in memory until we've created the entire container in memory.
Is there a way to build a protobuf and serialize it lazily to a stream?
If there's not a way to do this officially, are there any libraries that can help? Does anyone have any suggestions or ideas how to solve this in other ways? Alternative data formats or technologies (e.g. JSON or XML containing protobufs) that would make this possible?

For serialization:
protobuf is an appendable format, with individual items being merged, and repeated items being appended
Therefore, to write a sequence as a lazy stream, all you need to do is repeatedly write the same structure with only one item in the list: serializing a sequence of 200 x "Container with 1 Item" is 100% identical to serializing 1 x "Container with 200 Items".
So: just do that!
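The append trick above can be demonstrated without any protobuf library at all, by hand-rolling the wire format for the Container in the question. This is a sketch: the item payloads are placeholder bytes standing in for serialized Items.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class AppendDemo {

    // Encode one "repeated Item item = 1" entry: key 0x0A, varint length, payload.
    static void writeItemField(ByteArrayOutputStream out, byte[] itemPayload) {
        out.write(0x0A); // (field 1 << 3) | wire-type 2
        writeVarint(out, itemPayload.length);
        out.write(itemPayload, 0, itemPayload.length);
    }

    // Standard base-128 varint encoding.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // A "Container" with any number of items is just the item fields appended.
    public static byte[] containerWithItems(byte[]... items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] item : items) {
            writeItemField(out, item);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {4, 5};
        // One container with two items...
        byte[] whole = containerWithItems(a, b);
        // ...is byte-for-byte the same as two one-item containers written back to back.
        ByteArrayOutputStream streamed = new ByteArrayOutputStream();
        streamed.write(containerWithItems(a), 0, containerWithItems(a).length);
        streamed.write(containerWithItems(b), 0, containerWithItems(b).length);
        System.out.println(Arrays.equals(whole, streamed.toByteArray())); // true
    }
}
```

In real code each placeholder payload would be `item.toByteArray()` from your generated Item class, written straight to the file stream so only one item is ever in memory at a time.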
For deserialization:
That is technically very easy to read as a stream - it all, however, comes down to which library you are using. For example, protobuf-net (a .NET / C# implementation) exposes this as Serializer.DeserializeItems<T>, which lazily reads a sequence of messages of type T in the form you describe in the question - so Serializer.DeserializeItems<Item> is the streaming replacement for Serializer.Deserialize<Container> (the outermost object doesn't really exist on the wire in protobuf).
If this isn't available, but you have access to a raw reader API, what you need to do is:
read one varint for the header - this will be the value 10 (0x0A), i.e. "(1 << 3) | 2" for field-number 1 and wire-type 2 respectively - so this could also be phrased: "read a single byte from the stream, and check that the value is 10"
read one varint for the length of the following item
now:
if the reader API allows you to restrict the maximum number of bytes to process, use this length to specify the length that follows
or wrap the stream API with a length-limiting stream, limited to that length
or just manually read that many bytes, and construct an in-memory stream from the payload
rinse, repeat
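The raw-reader steps above can be sketched in plain Java with no protobuf library (ItemStreamReader is a hypothetical name; real code would finish each iteration with Item.parseFrom(payload)):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ItemStreamReader {

    // Read a stream of "field 1, length-delimited" records until EOF.
    public static List<byte[]> readItems(InputStream in) throws IOException {
        List<byte[]> items = new ArrayList<>();
        int key;
        while ((key = in.read()) != -1) {
            if (key != 0x0A) {
                throw new IOException("unexpected field key: " + key);
            }
            int len = readVarint(in);
            byte[] payload = new byte[len];
            int off = 0;
            while (off < len) {
                int n = in.read(payload, off, len - off);
                if (n == -1) throw new IOException("truncated payload");
                off += n;
            }
            items.add(payload); // real code: Item.parseFrom(payload)
        }
        return items;
    }

    // Standard base-128 varint decoding.
    static int readVarint(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        do {
            b = in.read();
            if (b == -1) throw new IOException("truncated varint");
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] stream = {0x0A, 3, 1, 2, 3, 0x0A, 2, 9, 9};
        List<byte[]> items = readItems(new ByteArrayInputStream(stream));
        System.out.println(items.size()); // 2
    }
}
```

Only one item payload is ever held in memory at a time, which is exactly the streaming behaviour the question asks for.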

There is no such thing. A protobuf is a packed structure; to serialize it effectively, the library needs all the data. You will have to add the "streaming protocol" yourself - maybe send a protobuf message every N items.

In the normal Java version of Protocol Buffers there are delimited messages, where you write Protocol Buffers one at a time. I am not sure if it is in the Android version:
    aLocation.writeDelimitedTo(out);
As Marc has indicated, it is easily implemented: just write a length followed by the serialised bytes. In the normal (non-Android) Java version of Protocol Buffers you can also do the following (you have to serialise to a byte array or something similar):
private CodedOutputStream codedStream = null;

public void write(byte[] bytes) throws IOException {
    if (bytes != ConstClass.EMPTY_BYTE_ARRAY) {
        codedStream.writeRawVarint32(bytes.length);
        codedStream.writeRawBytes(bytes);
        codedStream.flush();
    }
}
and
private CodedInputStream coded;

public byte[] read() throws IOException {
    if (coded == null) {
        throw new IOException("Reader has not been opened !!!");
    }
    if (coded.isAtEnd()) {
        return null;
    }
    return coded.readBytes().toByteArray();
}
Something may be possible in other Protocol-Buffers versions

Related

Programming object for processing large lists of data

I recently had the task of performing a cross-selection operation on some collections, to find an output collection matching my criteria. (I will omit the custom logic because it is not needed.)
What I did was create a class that takes lists of elements as parameters, and then call a function inside that class that is responsible for processing those lists of data and returning a value.
Point is, I'm convinced I'm not doing the right thing, because writing a class that holds hundreds of elements, takes lists as parameters, and returns another collection looks unconventional and awkward.
Is there a specific programming object or paradigm that allows you to process large numbers of large collections, maybe with a quite heavy custom selection/mapping logic?
I'm building for Android using Kotlin
First of all, when we talk about performance there is only one right answer - write a benchmark and test.
About memory: a list of 1,000,000 unique Strings with an average size of 30 chars will take about 120 MB (i.e. 10^6 * 30 * 4, where the last factor is the "size of a char", assuming a 4-byte Unicode character), plus 1-3% overhead for things like object references. Therefore: if you only have hundreds of Strings, just load the whole data set into memory and use a list, because this is the fastest solution (synchronous, immutable, etc.).
If you can do streaming-like operations, you can use sequences. They are lazy, similar to Java Streams and .NET LINQ. Please check the example below; it requires only a small amount of memory.
fun countOfEqualLinesOnTheSamePositions(path1: String, path2: String): Int {
    return File(path1).useLines { lines1 ->
        File(path2).useLines { lines2 ->
            lines1.zip(lines2)
                .count { (line1, line2) -> line1 == line2 }
        }
    }
}
If you can't store the whole data set in memory and you can't work with a stream-like schema, you may:
Rework the algorithm from single-pass to multiple-pass, where each pass is stream-like. For example, Huffman coding is a two-pass algorithm, so it can compress 1 TB of data using a small amount of memory.
Store intermediate data on disk (this is too complex for this short answer).
For additional optimizations:
To cover the case of merging many parallel streams, also consider Kotlin Flow. It lets you work asynchronously and avoid IO blocking. For example, this can be useful for merging ~100 network streams.
To keep a lot of non-unique items in memory, consider caching logic. It can save memory (but please benchmark first).
Try operating with ByteBuffers instead of Strings. You can get far fewer allocations (because you control each buffer's lifetime explicitly), but the code will be more complex.

Securing tensorflow-lite model

I'm developing an Android app that will hold a tensorflow-lite model for offline inference.
I know that it is impossible to completely prevent someone from stealing my model, but I would like to make it hard for anyone trying.
I thought to keep my .tflite model inside the .apk but without the weights of the last layer. Then, at execution time, I could download the weights of the last layer and load them in memory.
So, if someone tries to steal my model, they would get a useless model, because it couldn't be used due to the missing weights of the last layer.
Is it possible to generate a tflite model without the weights of the last layer?
Is it possible to load those weights into an already-loaded model in memory?
This is how I load my .tflite model:
tflite = new Interpreter(loadModelFile(), tfliteOptions);

// loads the tflite graph from file
private MappedByteBuffer loadModelFile() throws IOException {
    AssetFileDescriptor fileDescriptor = mAssetManager.openFd(chosen);
    FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor());
    FileChannel fileChannel = inputStream.getChannel();
    long startOffset = fileDescriptor.getStartOffset();
    long declaredLength = fileDescriptor.getDeclaredLength();
    return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength);
}
Are there other approaches to make my model safer? I really need to make inference locally.
If we are talking about Keras models (or any other model in TF), we can easily remove the last layer and then convert the rest to a TF Lite model with tf.lite.TFLiteConverter. That should not be a problem.
Now, in Python, get the last layer's weights and write them to a JSON file. This JSON file can be hosted in the cloud (for example, Firebase Cloud Storage) and downloaded by the app.
The weights can be parsed into an array. The activations from the TF Lite model can then be dot-multiplied with the weights parsed from the JSON. Lastly, we apply an activation function to produce the predictions we actually need.
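A minimal sketch of that last-layer computation in plain Java, with hypothetical shapes and a sigmoid picked as the example activation (substitute whatever activation your model actually ends with):

```java
public class FinalLayer {

    // out[j] = activation(bias[j] + sum_i activations[i] * weights[i][j])
    public static double[] predict(double[] activations, double[][] weights, double[] bias) {
        double[] out = new double[bias.length];
        for (int j = 0; j < bias.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < activations.length; i++) {
                sum += activations[i] * weights[i][j];
            }
            out[j] = 1.0 / (1.0 + Math.exp(-sum)); // sigmoid as the example activation
        }
        return out;
    }

    public static void main(String[] args) {
        double[] activations = {1.0, 0.0};             // from the headless TF Lite model
        double[][] weights = {{2.0, 0.0}, {0.0, 2.0}}; // parsed from the downloaded JSON
        double[] bias = {0.0, 0.0};
        double[] probs = predict(activations, weights, bias);
        System.out.println(probs[0] + " " + probs[1]); // sigmoid(2) and sigmoid(0)
    }
}
```

The activation vector would come from `Interpreter.run` on the truncated model; the weights and bias from the JSON downloaded at runtime.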
The model is trained so precisely for your task that it could rarely be used for any other use case, so I don't think we need to worry about that.
Also, it would be better to use a cloud hosting platform that serves predictions through requests and an API instead of shipping the raw model directly.

how to send protobuf as part of XML

I have created protobuf sample code in Android as follows:
Person john =
    Person.newBuilder()
        .setId(1234)
        .setName("John Doe")
        .setEmail("jdoe@example.com")
        .addPhone(
            Person.PhoneNumber.newBuilder()
                .setNumber("555-4321")
                .setType(Person.PhoneType.HOME))
        .build();
Now I want to send this john object as part of an XML building block over the network.
So far I have seen the following methods for getting bytes to send over the network:
john.toByteArray() and john.toByteString()
But I think that if I embed it into an XML tag as follows, it will be the string representation only, and I will not be able to get the data back:
"<data>" + john.toByteArray() + "</data>"
So how can I pass the protobuf message within XML?
Note: I don't want to use base64 encoding, as it will eventually increase the size of a block.
The fundamental problem here is that protobuf encoding is binary while XML is text. You can't embed binary data directly into text; you need to encode it as text somehow.
Unfortunately, there is simply no way to do that without increasing the data size. If size is your concern, base64 is likely your best option -- this is exactly what it was designed to do.
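For illustration, the base64 route can be sketched like this (plain Java; in the question's code the bytes would come from john.toByteArray(), and the receiver would finish with Person.parseFrom on the decoded bytes):

```java
import java.util.Arrays;
import java.util.Base64;

public class XmlEmbed {

    // Encode message bytes as base64 so they survive inside an XML text node.
    public static String wrap(byte[] messageBytes) {
        return "<data>" + Base64.getEncoder().encodeToString(messageBytes) + "</data>";
    }

    // Strip the tags and decode; the result is ready for Person.parseFrom(...).
    public static byte[] unwrap(String element) {
        String b64 = element.substring("<data>".length(), element.length() - "</data>".length());
        return Base64.getDecoder().decode(b64);
    }

    public static void main(String[] args) {
        byte[] msg = {10, 4, 74, 111, 104, 110}; // placeholder for john.toByteArray()
        String xml = wrap(msg);
        System.out.println(xml);
        System.out.println(Arrays.equals(unwrap(xml), msg)); // true
    }
}
```

Base64 costs roughly a 4/3 size increase, which is the overhead the question wants to avoid, but the round trip is lossless.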
Another possibility would be to encode the message in protobuf text format using .toString() and then parse it using com.google.protobuf.TextFormat. This will produce a human-readable encoding, but it will be much larger than the binary encoding.
Yet another option would be to write a custom translator which uses the protobuf reflection interfaces (e.g. com.google.protobuf.Message#getField()) to read individual fields and convert them to nested XML elements. However, this is complicated and will probably end up taking even more space than protobuf text format.
These are the options that I'm aware of:
Using this 3rd party library. With this method you can generate XML that you can embed in your outer XML.
Using the TextFormat API, though parsing seems to be possible only in C++.
From protobuf to string format:
    TextFormat.shortDebugString(myProtobufMessage);
From string format to protobuf, in C++ code:
    TextFormat::ParseFromString(dataString, &myProtobufMessage);
(I didn't try this myself, but I saw this reference.)
With this method you generate a String that you can embed in your XML, and on the receiving end you take that String and convert it back to a protobuf Message object.
Using protobuf > binary > base64 - instead of XML altogether. You can probably send all the data in the wrapping XML inside the protobuf message. This is what I do.
With this method you can forget about XML and use protobuf for everything. This is the original purpose of protobuf, and this is what it is best for.

Android: does short take really 2 bytes?

I am trying to decide how to design my app.
I have about 300 instances of a class just like this:
public class ParamValue {
    protected String sValue = null;
    protected short shValue = 0;
    protected short mode = PARAM_VALUE_MODE_UNKNOWN;
    /*
     * ...
     */
}
I have an array of these instances. I can't find out whether these shorts really take 2 bytes each, or whether they take 4 bytes anyway.
And I need to pass the list of these objects via AIDL as a List<Parcelable>. Parcel has no readShort() or writeShort(); it can only work with int. So, to use short here too, I would have to manually pack my two shorts into one int, parcel it, and then unpack it again. Looks too obtrusive.
Could you please tell me how many bytes a short takes, and whether it makes sense to use short instead of int here?
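The two-shorts-in-one-int packing mentioned in the question can be sketched as follows (masking with 0xFFFF avoids sign-extension when the shorts are negative):

```java
public class ShortPacker {

    // Pack two shorts into one int suitable for Parcel.writeInt().
    public static int pack(short first, short second) {
        return ((first & 0xFFFF) << 16) | (second & 0xFFFF);
    }

    public static short unpackFirst(int packed) {
        return (short) (packed >>> 16);
    }

    public static short unpackSecond(int packed) {
        return (short) packed;
    }

    public static void main(String[] args) {
        int packed = pack((short) -5, (short) 300);
        System.out.println(unpackFirst(packed) + " " + unpackSecond(packed)); // -5 300
    }
}
```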
UPDATE:
I updated my question for future readers.
So, I wrote a test app and figured out that in my case there's absolutely no reason to use short, because a single short field takes the same space as an int. But if I define an array of shorts like this:
protected short[] myValues = new short[2];
then it takes less space than an array of ints:
protected int[] myValues = new int[2];
Technically, in the Java language, a short is 2 bytes. Inside the JVM, though, short is a storage type, not a full-fledged primitive data type like int, float, or double. JVM registers always hold 4 bytes at a time; there are no half-word or byte registers. Whether the JVM ever actually stores a short in two bytes inside an object, or whether it is always stored as 4 bytes, is really up to the implementation.
This all holds for a "real" JVM. Does Dalvik do things differently? Dunno.
According to the Java Virtual Machine Specification, Sec. 2.4.1, a short is always exactly two bytes.
The Java Native Interface allows direct access from native code to arrays of primitives stored in the VM. A similar thing can happen in Android's JNI. This pretty much guarantees that a Java short[] will be an array of 2-byte values in either a JVM-compliant environment or in a Dalvik virtual machine.

Reading contents of DataOutputStream

Can anyone tell me if there is a way to read/output the contents of a DataOutputStream? I am obtaining one using
new DataOutputStream( httpUrlConnection.getOutputStream() );
and am writing some POST data to it. I would like to be able to see what data I am posting after writing to this output stream, but cannot find a simple way.
Thanks for any help!
Sure you can 'see' the 'contents' of a DataOutputStream - I imagine it wouldn't be what you expected, though! What you want is to examine the data passing through the stream, which is quite impossible with the regular class - indeed, the very definition of a stream is that it doesn't hold all the data it manages at any one time.
If you really need to be able to audit the data that you've supplied to any output stream then you could do something like this:
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WatcherOutputStream extends DataOutputStream {
    private byte[] data = null;

    public WatcherOutputStream(OutputStream delegateStream) {
        super(delegateStream);
    }

    @Override
    public void write(byte[] b) throws IOException {
        // Store the bytes internally
        // Pass off to delegate
        super.write(b);
    }
}
The data-saving code and remaining write methods are left to you as an exercise. Of course, this is a heavyweight way to track the data you are writing out to the stream - it adds an extra layer of overhead in both memory and speed. Another option would be to use AOP to examine the data as you write it. That method has the advantage of being less intrusive to your code and easily maintainable, in that you can add and remove pointcuts without modifying your main program. I suspect AOP may be a more complicated solution than you are looking for right now, but I am including this link to more reading, just in case it is helpful.
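One concrete way to fill in the data-saving part is to tee every write into an in-memory buffer (a sketch; TeeOutputStream is a hypothetical name):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class TeeOutputStream extends OutputStream {

    private final OutputStream delegate;
    private final ByteArrayOutputStream copy = new ByteArrayOutputStream();

    public TeeOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        copy.write(b);     // keep an in-memory record
        delegate.write(b); // forward to the real stream
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        copy.write(b, off, len);
        delegate.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    // Everything written so far, for inspection/logging.
    public byte[] captured() {
        return copy.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        TeeOutputStream tee = new TeeOutputStream(target);
        DataOutputStream out = new DataOutputStream(tee);
        out.writeInt(42);
        out.flush();
        System.out.println(tee.captured().length); // 4
    }
}
```

Wrapping it as new DataOutputStream(new TeeOutputStream(connection.getOutputStream())) lets you call captured() after writing to see exactly the bytes that were posted. The obvious trade-off is that the copy buffer grows with everything written, so this is for debugging, not production.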
