efficient way to store large amount of html content - android

I have a database, which has a table with fields like "title, album, artist..." and it also has many fields with html content for every record (up to 30).
The problem is that this database has tens of thousands of records and is hundreds of megabytes in size because of the html content. Because of the size of the sqlite file, search is very slow (inserting new elements in a transaction is also very slow: ~10-30 seconds for 200 new rows). The very first LIKE query can take 10-15 seconds; subsequent searches are fast enough (indices are created and work ok). When I removed the html content from the database, the search was always instant.
So the question is: what is the best way to store that additional html content? Right now I am experimenting with storing it in separate files, but that can generate up to 600k files (and more in the future), which is quite slow to create. Storing the files in a zip archive will probably hit its file number limit. Other options are to zip the files per table row, store the html in a separate table in the same database, or create a separate database file for the html content.
What will give me the best performance? Or are there other, better options? I need quick insert, update and search.

There are a couple different things you could consider doing:
Split the data into separate tables. You could then have 1:1 mappings between the tables and only join them when necessary, speeding up the queries that don't need the html.
Check your indexes. Just because you have them and you think they're working doesn't mean they are. If I recall correctly, sqlite will use at most one index per table in a query, so you need to make sure the best possible index is available for the queries you're running. The ANALYZE command can help with that by collecting statistics for the query planner, and EXPLAIN QUERY PLAN shows which index a query actually uses.
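To make the two suggestions concrete, here is a minimal sketch using Python's built-in sqlite3 module (the same SQL applies on Android). The table and index names (tracks, track_html, idx_tracks_title) and the sample rows are made up for illustration; they are not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small, frequently searched columns live in their own table.
    CREATE TABLE tracks (
        id     INTEGER PRIMARY KEY,
        title  TEXT,
        album  TEXT,
        artist TEXT
    );
    -- 1:1 companion table (hypothetical name); joined only when the HTML is needed.
    CREATE TABLE track_html (
        track_id INTEGER PRIMARY KEY REFERENCES tracks(id),
        html     TEXT
    );
    CREATE INDEX idx_tracks_title ON tracks(title);
""")
conn.executemany(
    "INSERT INTO tracks (id, title, album, artist) VALUES (?, ?, ?, ?)",
    [
        (1, "Blue Train", "Blue Train", "John Coltrane"),
        (2, "So What", "Kind of Blue", "Miles Davis"),
        (3, "Naima", "Giant Steps", "John Coltrane"),
    ],
)
conn.execute("INSERT INTO track_html VALUES (1, '<p>liner notes</p>')")
conn.execute("ANALYZE")  # refresh the planner statistics

def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

# Inspect the plan to see whether the search touches the index or scans the table.
search_plan = plan("SELECT id, title FROM tracks WHERE title = ?", ("Blue Train",))
print(search_plan)

# The HTML is pulled in by key only for the row the user actually opens.
html = conn.execute(
    "SELECT h.html FROM tracks t JOIN track_html h ON h.track_id = t.id WHERE t.id = ?",
    (1,),
).fetchone()[0]
print(html)
```

The exact wording of the plan differs between SQLite versions, so read it rather than match it against a fixed string.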

After some days of experimenting I came to this conclusion:
One database file with one table was the slowest (up to 10 seconds).
One database with two tables was twice as fast as one table in the worst-case scenario.
Fastest was two separate database files: one with the data needed for search and the other for the huge html data. This takes ~300ms in the worst case and is instant in normal usage.
So I recommend using two separate database files in this scenario. If no one comes up with a faster/better solution, I will accept this as the answer.
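For anyone who wants to try the two-file layout, here is a rough sketch with Python's sqlite3 module. It uses ATTACH DATABASE so one connection can see both files; that is only one way to do it (two independent connections would also work), and the file, schema and table names are assumptions for the example.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
search_path = os.path.join(workdir, "search.db")  # small file: searchable columns
html_path = os.path.join(workdir, "html.db")      # large file: HTML payloads

conn = sqlite3.connect(search_path)
# Attach the second file under the (hypothetical) schema name "content".
conn.execute("ATTACH DATABASE ? AS content", (html_path,))
conn.execute("CREATE TABLE main.tracks (id INTEGER PRIMARY KEY, title TEXT, artist TEXT)")
conn.execute("CREATE TABLE content.track_html (track_id INTEGER PRIMARY KEY, html TEXT)")

with conn:  # one transaction for both inserts
    conn.execute("INSERT INTO tracks VALUES (1, 'So What', 'Miles Davis')")
    conn.execute("INSERT INTO content.track_html VALUES (1, '<h1>So What</h1>')")

# The search only reads pages from the small file.
hits = conn.execute("SELECT id FROM tracks WHERE title LIKE ?", ("So%",)).fetchall()

# The HTML is fetched by primary key from the other file, only for the chosen row.
html = conn.execute(
    "SELECT html FROM content.track_html WHERE track_id = ?", (hits[0][0],)
).fetchone()[0]
print(hits, html)
```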

Related

What's better? Several smaller databases or one large

I am writing an application for learning words in a foreign language, so I have these words stored in my database. The words are separated into, for example, 3 levels of difficulty. Every level is made up of groups of words, and these groups are represented as TABLES in the SQLite db. I am using SQLiteOpenHelper for communication between the application and the databases.
Now my question. What is better?
Make 3 smaller databases, one for each level, each with its own SQLiteOpenHelper, so 3 dbs with 3 open helpers altogether.
Make 1 large database containing all 3 levels, which means many TABLES but only 1 SQLiteOpenHelper.
Thanks for any advice or opinion.

I suggest 1 large database (DB).
You should not be worried about making large DBs; DBs were invented to store large amounts of data (and even very many tables). It is much easier to create and maintain one DB than several, and your code will be much clearer using one DB.
I don't know your program, but I would go even further: I would store all words in the same table if you store the same information about each of them, and add one column for the level and another for the group they belong to.
The main idea of SQL is that you don't really need to care how much space your DB will require or how long it will take to find the result of a query, because Database Management Systems (in your case SQLiteOpenHelper and SQLite) are very efficient in both space and time. Instead you should concentrate on designing a system that has a clear structure and can be expanded easily (for example, if you want to add another column to the tables containing words to store new information about them, or want to add new levels or groups at a later stage of development). You might lose a few milliseconds separating groups and levels via the SELECT command of SQL, but your DB will be much more flexible - you can add levels and groups and add more information about words with ease. The key to designing a good DB: store different kinds of data in different tables and the same kind of data in the same table.
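A small sketch of the single-table design described above, written with Python's sqlite3 module so it can be run anywhere; the SQL carries over to SQLiteOpenHelper unchanged. The column names (level, group_name) and the sample words are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE words (
        id          INTEGER PRIMARY KEY,
        word        TEXT NOT NULL,
        translation TEXT NOT NULL,
        level       INTEGER NOT NULL,  -- difficulty level, e.g. 1..3
        group_name  TEXT NOT NULL      -- group within the level
    )
""")
# One composite index serves "all words of level X in group Y" lookups.
conn.execute("CREATE INDEX idx_words_level_group ON words(level, group_name)")
conn.executemany(
    "INSERT INTO words (word, translation, level, group_name) VALUES (?, ?, ?, ?)",
    [
        ("Hund", "dog", 1, "animals"),
        ("Katze", "cat", 1, "animals"),
        ("Brot", "bread", 1, "food"),
        ("Gerechtigkeit", "justice", 3, "abstract"),
    ],
)

def words_in(level, group_name):
    """Return the words of one group in one level, sorted alphabetically."""
    rows = conn.execute(
        "SELECT word FROM words WHERE level = ? AND group_name = ? ORDER BY word",
        (level, group_name),
    )
    return [r[0] for r in rows]

print(words_in(1, "animals"))
```

Adding a fourth level or a new group is then just a matter of inserting rows; no new table or database is needed.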

The error that you mention in your comment is almost certainly a bug in your application code. There is no reason that an application with multiple databases should encounter that sort of error.
That said, my answer to your original question is that it is objectively "better" to use a single database.
It is better because you will have less code to maintain, no possibility of attempting to access the wrong database in a given situation, and the code will be more idiomatic - i.e. there's no benefit to using multiple databases, so if you were to use multiple databases, anyone reading your code would spend a lot of time trying to figure out why you did it.

Store large amounts of texts for Android app

I'm developing an Android app in which users will be able to write/save/modify potentially large pieces of text. I believe the amount of words will range from 10-1000. In the worst-case scenario, users will write a new piece of text everyday.
What is the best way to store these kinds of text data, holding in account the ability to easily modify saved pieces of text?

Store the data either as a file or in an sqlite database, and if possible segment those pieces into separate records/files. For the loading part, there won't be any trouble handling 1000 words in RAM, for example if you load them into a TextView. The limit on the size of the text you assign to your TextView is basically the amount of memory that you have.
I suggest testing your text editing view with ridiculously long texts at the end, and if you see any issue (sluggishness, running out of memory etc.), then you would have to take care of segmenting the document on your own. Hope this helps.

Best option: use an sqlite database if you want to store data by day or by time, so you can easily manage all your data. (If you are concerned about storage capacity, you can also keep your database on the SD card: take a backup to the SD card, create the DB on the SD card, etc.)
2nd option: store the data directly on the SD card (external storage). Note that it is readable by the user, who can delete your data.
You can use encryption and decryption with both approaches.

Android Performance : Flat file vs SQLite

There are a few questions related to this topic on stackoverflow, but I didn't get a proper answer. I have some doubts about the performance of flat files: is it better to use flat files instead of SQLite? Does anybody have performance statistics, or an example of the proper way to code flat files in android?

Aside from performance benefits, here's a simple list of advantages of using SQLite rather than flat file:
You can query items as you wish -- don't need to load all of them and select which ones you need.
Deleting records is a much less painful process. No rewriting of whole files into wherever.
Updating a record is as easy as removing or creating one.
Have you ever tried doing cross-referencing lookups on a flat file? Not worth it.
To summarize, it's every advantage a Database has over a text file.

It depends on your requirement.
If the data you store is structured and bulky, then I suggest SQLite. On the other hand, if the data is just a single line or a few lines, then a flat file is the best option.
What makes the difference between them is that SQLite stores data in a structured format, so it is easier to find one record among many, which is a very tedious process with a flat file.
However, if you are storing blob-like data, a combination of SQLite and the file system is suggested, i.e. store the image/sound/video data as files and store their paths in SQLite.
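A minimal sketch of that combination, using Python's sqlite3 module: the bytes go to a file and only the path is stored in the database. The media table, the save_media/load_media helpers and the file naming scheme are all hypothetical.

```python
import os
import sqlite3
import tempfile

media_dir = tempfile.mkdtemp()  # stand-in for the app's files directory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, title TEXT, path TEXT NOT NULL)")

def save_media(title, data):
    """Write the bytes to a file and record only its path in SQLite."""
    with conn:
        cur = conn.execute("INSERT INTO media (title, path) VALUES (?, '')", (title,))
        media_id = cur.lastrowid
        path = os.path.join(media_dir, f"{media_id}.bin")
        with open(path, "wb") as f:
            f.write(data)
        conn.execute("UPDATE media SET path = ? WHERE id = ?", (path, media_id))
    return media_id

def load_media(media_id):
    """Look up the path in SQLite, then read the bytes from the file system."""
    (path,) = conn.execute("SELECT path FROM media WHERE id = ?", (media_id,)).fetchone()
    with open(path, "rb") as f:
        return f.read()

cover_id = save_media("cover", b"\x89PNG fake image bytes")
print(len(load_media(cover_id)))
```

The database stays small and quick to search, while the large payloads are read only when a specific row is opened.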

SQLite is definitely better in terms of performance, and this becomes even more important as the size of your data increases.
I've been working on a Flutter app where I needed to display a filtered list of items dynamically, based on typed text. I initially used a JSON file to store the data and would read the relevant values into a list, then filter this list as the user typed.
This worked just fine with a few items, so I thought I was fine until I tested with a real dataset which contained over 150,000 items. Trying to filter a list this large as the user typed crashed the app. I moved to a database solution and all my problems were solved: instant filtering and no more crashes.

Which is better? Database or xmlfile? [duplicate]

I really like Xml for saving data, but when does sqlite/database become the better option? eg, when the xml has more than x items or is greater than y MB?
I am coding an RSS reader and I believe I made the wrong choice in using XML over a SQLite database to store a cache of all the feeds' items. Some feeds have an XML file of ~1mb after a month, another has over 700 items, while most only have ~30 items and are ~50kb in size after several months.
I currently have no plans to implement a cap because I like to be able to search through everything.
So, my questions are:
When is the overhead of sqlite/databases justified over using xml?
Are the few large xml files justification enough for the database when there are a lot of small ones, though even the small ones will grow over time? (a long long time)
updated (more info)
Every time a feed is selected in the GUI, I reload all the items from that feed's XML file.
I also need to modify the read/unread status, which seems really hacky: I loop through all the nodes in the XML to find the item and then set it to read/unread.

Man do I have experience with this. I work on a project where we originally stored all of our data using XML, then moved to SQLite. There are many pros and cons to each technology, but it was performance that caused the switchover. Here is what we observed.
For small databases (a few meg or smaller), XML was much faster, and easier to deal with. Our data was naturally in a tree format, which made XML much more attractive, and XPath allowed us to do many queries in one simple line rather than having to walk down an ancestry tree.
We were programming in a Win32 environment, and used the standard Microsoft DOM library. We would load all the data into memory, parse it into a DOM tree and search, add, modify on the in memory copy. We would periodically save the data, and needed to rotate copies in case the machine crashed in the middle of a write.
We also needed to build up some "indexes" by hand using C++ tree maps. This, of course would be trivial to do with SQL.
Note that the size of the data on the filesystem was a factor of 2-4 smaller than the "in memory" DOM tree.
By the time the data got to 10M-100M in size, we started to have real problems. Interestingly enough, at all data sizes, XML processing was much faster than SQLite turned out to be (because it was in memory, not on the hard drive)! The problem was actually twofold. First, load time really started to get long. We would need to wait a minute or so before the data was in memory and the maps were built. Of course, once loaded, the program was very fast. The second problem was that all of this memory was tied up all the time. Systems with only a few hundred meg would be unresponsive in other apps even though we ran very fast.
We actually looked into using a filesystem-based XML database. There are a couple of open source XML databases, and we tried them. I have never tried to use a commercial XML database, so I can't comment on those. Unfortunately, we could never get the XML databases to work well at all. Even the act of populating the database with hundreds of meg of XML took hours... Perhaps we were using it incorrectly. Another problem was that these databases were pretty heavyweight. They required Java and had a full client-server architecture. We gave up on this idea.
We found SQLite then. It solved our problems, but at a price. When we initially plugged SQLite in, the memory and load time problems were gone. Unfortunately, since all processing was now done on the harddrive, the background processing load went way up. While earlier we never even noticed the CPU load, now the processor usage was way up. We needed to optimize the code, and still needed to keep some data in memory. We also needed to rewrite many simple XPath queries as complicated multiquery algorithms.
So here is a summary of what we learned.
For tree data, XML is much easier to query and modify using XPath.
For small datasets (less than 10M), XML blew away SQLite in performance.
For large datasets (greater than 10M-100M), XML load time and memory usage became a big problem, to the point that some computers become unusable.
We couldn't get any opensource XML database to fix the problems associated with large datasets.
SQLite doesn't have the memory problems of XML DOM, but it is generally slower in processing the data (it is on the hard drive, not in memory). (note- SQLite tables can be stored in memory, perhaps this would make it as fast.... We didn't try this because we wanted to get the data out of memory.)
Storing and querying tree data in a table is not enjoyable. However, managing transactions and indexing partially makes up for it.
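To illustrate the first and last points of the summary, here is a small side-by-side sketch in Python: the same "all items anywhere below folder a" question asked once with an XPath-style expression (via the standard xml.etree.ElementTree, which supports only a subset of XPath) and once against a table using a recursive common table expression. The folder/item data and the nodes table are invented for the example, and this is not the code from the project described above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tree data as XML: one path expression finds every item below folder "a".
doc = ET.fromstring("""
<root>
  <folder name="a">
    <folder name="b"><item name="x"/></folder>
    <item name="y"/>
  </folder>
</root>
""")
xml_names = sorted(e.get("name") for e in doc.findall("./folder[@name='a']//item"))

# The same tree flattened into a table with a parent pointer (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, kind TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (1, None, "folder", "a"),
        (2, 1, "folder", "b"),
        (3, 2, "item", "x"),
        (4, 1, "item", "y"),
    ],
)
# Walking the ancestry needs a recursive query instead of a one-line path.
sql_names = [
    row[0]
    for row in conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM nodes WHERE parent_id IS NULL AND name = 'a'
            UNION ALL
            SELECT n.id FROM nodes n JOIN subtree s ON n.parent_id = s.id
        )
        SELECT name FROM nodes
        WHERE id IN (SELECT id FROM subtree) AND kind = 'item'
        ORDER BY name
    """)
]
print(xml_names, sql_names)
```

Both queries return the same two items; the SQL version is simply more verbose, which matches the experience described above.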

I basically agree with Mitchel, that this can be highly specific depending on what are you going to do with XML and SQLite. For your case (cache), it seems to me that using SQLite (or other embedded databases) makes more sense.
First I don't really think that SQLite will need more overhead than XML. And I mean both development time overhead and runtime overhead. Only problem is that you have a dependence on SQLite library. But since you would need some library for XML anyway it doesn't matter (I assume project is in C/C++).
Advantages of SQLite over XML:
everything in one file,
performance loss is lower than XML as cache gets bigger,
you can keep feed metadata separate from cache itself (other table), but accessible in the same way,
SQL is probably easier to work with than XPath for most people.
Disadvantages of SQLite:
can be problematic with multiple processes accessing same database (probably not your case),
you should know at least basic SQL. Unless there will be hundreds of thousands of items in cache, I don't think you will need to optimize it much,
maybe in some way it can be more dangerous from security standpoint (SQL injection). On the other hand, you are not coding web app, so this should not happen.
Other things are on par for both solutions probably.
To sum it up, answers to your questions respectively:
You will not know, unless you test your specific application with both back ends. Otherwise it's always just a guess. Basic support for both caches should not be a problem to code. Then benchmark and compare.
Because of the way XML files are organized, SQLite searches should always be faster (barring some corner cases where it doesn't matter anyway because it's blazingly fast). Speeding up searches in XML would require index database anyway, in your case that would mean having cache for cache, not a particularly good idea. But with SQLite you can have indexing as part of database.

Don't forget that you have a great database at your fingertips: the filesystem!
Lots of programmers forget that a decent directory-file structure is/has:
It's fast as hell
It's portable
It has a tiny runtime footprint
People are talking about splitting up XML files into multiple XML files... I would consider splitting your XML into multiple directories and multiple plaintext files.
Give it a go. It's refreshingly fast.

Use XML for data that the application should know - configuration, logging and what not.
Use databases (oracle, SQL server etc) for data that the user interacts with directly or indirectly - real data.
Use SQLite if the user data is more of a serialized collection - like a huge list of files and their content, or a collection of email items etc. SQLite is good at that.
Depends on the kind and the size of the data.

I wouldn't use XML for storing RSS items. A feed reader makes constant updates as it receives data.
With XML, you need to load the data from file first, parse it, then store it for easy search/retrieval/update. Sounds like a database...
Also, what happens if your application crashes? If you use XML, what state is the data in the XML file in, compared with the data in memory? At least with SQLite you get atomicity, so you are assured that your application will start with the same state as when the last database write was made.
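A quick sketch of what that atomicity looks like in practice, using Python's sqlite3 module. The items table and the store_batch helper are hypothetical, and the "crash" is simulated with an exception rather than a real process kill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (guid TEXT PRIMARY KEY, title TEXT, is_read INTEGER DEFAULT 0)")

def store_batch(items, fail_midway=False):
    """Insert a batch of feed items as one transaction: all of them or none."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for i, (guid, title) in enumerate(items):
                if fail_midway and i == 1:
                    raise RuntimeError("simulated crash in the middle of an update")
                conn.execute("INSERT INTO items (guid, title) VALUES (?, ?)", (guid, title))
    except RuntimeError:
        pass  # the partially written batch has already been rolled back

store_batch([("a", "First"), ("b", "Second")])
store_batch([("c", "Third"), ("d", "Fourth")], fail_midway=True)

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)  # 2: the half-finished second batch left no trace
```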

XML is best used as an interchange format when you need to move data from your application to somewhere else or share information between applications. A database should be the preferred method of storage for almost any size application.

When should XML be used for data persistence instead of a database? Almost never. XML is a data transport language. It is slow to parse and awkward to query. Parse the XML (don't shred it!) and convert the resulting data into domain objects. Then persist the domain objects. A major advantage of a database for persistence is SQL which means unstructured queries and access to common tools and optimization techniques.

I have made the switch to SQLite and I feel much better knowing it's in a database.
There are a lot of other benefits from this:
Adding new items is really simple
Sorting by multiple columns
Removing duplicates with a unique index
I've created 2 views, one for unread items and one for all items, not sure if this is the best use of views, but I really wanted to try using them.
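For readers who want to see roughly what this looks like, here is a sketch in Python's sqlite3 module. It is not my actual schema; the table, index and view names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id      INTEGER PRIMARY KEY,
        feed    TEXT NOT NULL,
        guid    TEXT NOT NULL,
        title   TEXT,
        is_read INTEGER NOT NULL DEFAULT 0
    );
    -- The unique index makes duplicate feed entries impossible.
    CREATE UNIQUE INDEX idx_items_feed_guid ON items(feed, guid);
    CREATE VIEW all_items AS SELECT * FROM items;
    CREATE VIEW unread_items AS SELECT * FROM items WHERE is_read = 0;
""")

def add_item(feed, guid, title):
    """Insert an item; a repeated (feed, guid) pair is silently skipped."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO items (feed, guid, title) VALUES (?, ?, ?)",
            (feed, guid, title),
        )

def mark_read(feed, guid):
    """Flip the read flag with one UPDATE instead of walking XML nodes."""
    with conn:
        conn.execute("UPDATE items SET is_read = 1 WHERE feed = ? AND guid = ?", (feed, guid))

add_item("news", "1", "Hello")
add_item("news", "1", "Hello")  # duplicate, ignored
add_item("news", "2", "World")
mark_read("news", "1")

total = conn.execute("SELECT COUNT(*) FROM all_items").fetchone()[0]
unread = [r[0] for r in conn.execute("SELECT title FROM unread_items ORDER BY title")]
print(total, unread)
```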
I also benchmarked the xml vs sqlite using the StopWatch class, and the sqlite is faster, although it could just be that my way of parsing xml files wasn't the fastest method.
Small # items and size (25 items, 30kb)
~1.5 ms sqlite
~8.0 ms xml
Large # of items (700 items, 350kb)
~20 ms sqlite
~25 ms xml
Large file size (850 items, 1024kb)
~45 ms sqlite
~60 ms xml

To me it really depends on what you are doing with them, how many users/processes need access to them at the same time etc.
I work with large XML files all the time, but they are single-process, import-style items for which multi-user access and performance are not really needs.
So really it is a balance.

If you will ever need to scale, use a database.

XML is good for storing data which is not completely structured and which you typically want to exchange with another application. I prefer to use a SQL database for data. XML is error-prone, as you can cause subtle errors through typos or omissions in the data itself. Some open source application frameworks use too many XML files for configuration, data, etc. I prefer to have it in SQL.
Since you ask for a rule of thumb, I would say: use XML for application data, configuration, etc. if you are going to set it up once and not access/search it much. For active searches and updates, it's best to go with SQL.
For example, a web server stores application data in an XML file and you don't really need to perform complex searches or update the file. The web server starts, reads the XML file and that's that. So XML is perfect here. Suppose you use a framework like Struts. You need to use XML, and the action configurations don't change much once the application is developed and deployed. So again, the XML file is a good way. Now if your Struts application allows extensive searches, updates and deletions, then SQL is the optimal way.
Of course, you will surely meet one or two developers in your organisation who will chant XML or SQL only and proclaim XML or SQL as the only way to go. Beware of such folks and do what 'feels' right for your application. Don't just follow a 'technology religion'.
Think of things like how often you need to update the data, how often you need to search the data. Then you will have your answer on what to use - XML or SQL.

I agree with @Bradley.
XML is very slow and not particularly useful as a storage format. Why bother? Will you be editing the data by hand using a text editor? If so, XML still isn't a very convenient format compared to something like YAML. With something like SQlite, queries are easier to write, and there's a well defined API for getting your data in and out.
XML is fine if you need to send data around between programs. But in the name of efficiency, you should probably produce the XML at sending time, and parse it into "real data" at receive time.
All the above means that your question about "when the overhead of a database is justified" is kind of moot. XML has a way higher overhead, all the time, than SQlite does. (Full-on databases like MSSQL are heavier, especially in administrative overhead, but that's a totally different question.)

XML can be stored as text and as a binary file format.
If your primary goal is to let a computer read / write a file format efficiently, you should work with a binary file format.
Databases are an easy-to-use way of storing and maintaining data.
They are not the fastest way to store data; that would be a binary file format.
What can speed things up is using an in memory database / database type. Sqlite has this option.
And this sounds like the best way to do it for you.
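A minimal sketch of the in-memory option, using Python's sqlite3 module (Connection.backup needs Python 3.7 or newer). The items table and the snapshot file name are assumptions; the idea is only to show working in RAM and copying to or from disk when needed.

```python
import os
import sqlite3
import tempfile

# Work entirely in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
mem.executemany("INSERT INTO items (title) VALUES (?)", [("a",), ("b",)])
mem.commit()

# Persist a snapshot of the in-memory database to a file.
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.db")
disk = sqlite3.connect(snapshot_path)
mem.backup(disk)
disk.close()

# On the next start, load the file back into a fresh in-memory database.
disk = sqlite3.connect(snapshot_path)
restored = sqlite3.connect(":memory:")
disk.backup(restored)
disk.close()

titles = [r[0] for r in restored.execute("SELECT title FROM items ORDER BY id")]
print(titles)
```

Keep in mind that anything written to the in-memory copy after the last snapshot is lost if the process dies, so this trades durability for speed.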

My opinion is that you should use SQLite (or another appropriate embedded database) anytime you don't need a pure-text file format. Note, this is a pretty big exception. There are a lot of scenarios that require, or are benefited by, pure-text file formats.
As far as overhead goes, SQLite compiles to something like 250 k with normal flags. Many XML parsing libraries are larger than SQLite. You get no concurrency gains using XML. The SQLite binary file format is going to support much more efficient writes (largely because you can't append to the end of a well-formatted XML file). And even reading data, most of which I assume is fairly random access, is going to be faster using SQLite.
And to top it all off, you get access to the benefits of SQL like transactions and indexes.
Edit: Forgot to mention. One benefit of SQLite (as opposed to many databases) is that it allows any type in any row in any column. Basically, with SQLite you get the same freedom you have with XML in terms of datatypes. This also means that you don't have to worry about putting limits on text columns.

You should note that many large Relational DBs (Oracle and SQLServer) have XML datatypes to store data within a database and use XPath within the SQL statement to gain access to that data.
Also, there are native XML databases which work very much like SQLite in the sense that they are one binary file holding a collection of documents (which could roughly be a table); you can then run XPath/XQuery on a single document or on the whole collection. So with an XML database you can do things like store the day's data as a separate XML document in the collection, so you only need that one document when you're dealing with the data for today, but write an XQuery over the collection of documents to figure out historical data for that person. Slick.
I've used Berkeley XMLDB (now backed by Oracle). There are others if you search google for "Native XML Database". I've not seen a performance problem with storing/retrieving data in this manner.
XQuery is a different beast (but well worth learning), however you may be able to just use the XPaths you currently use with slight modifications.

A database is great as part of your program, if querying the data is part of your business logic.
XML is best as a file format, especially if your data is:
1, Hierarchical
2, Likely to change in the future in ways you can't guess
3, Going to live longer than the program

I say it's not a matter of data size, but of data type. If your data is structured, use a relational database. If your data is semi-structured, use XML or - if the data amounts really grow too large - an XML database.

If you're searching, go with a db. You could split the xml files up into directories to ease seeking, but the managerial overhead easily gets quite heavy. You also get a lot more than just performance with a sql db...

Best practice for keeping data in memory and database at same time on Android

We're designing an Android app that has a lot of data ("customers", "products", "orders"...), and we don't want to query SQLite every time we need some record. We want to avoid querying the database as much as we can, so we decided to keep certain data always in memory.
Our initial idea is to create two simple classes:
"MemoryRecord": a class that will contain basically an array of objects (string, int, double, datetime, etc...), that are the data from a table record, and all methods to get those data in/out from this array.
"MemoryTable": a class that will contain basically a Map of [Key,MemoryRecord] and all methods to manipulate this Map and insert/update/delete record into/from database.
Those classes will be derived to every kind of table we have in the database. Of course there are other useful methods not listed above, but they are not important at this point.
So, when starting the app, we will load those tables from an SQLite database to memory using those classes, and every time we need to change some data, we will change in memory and post it into the database right after.
But, we want some help/advice from you. Can you suggest something more simple or efficient to implement such a thing? Or maybe some existing classes that already do it for us?
I understand what you guys are trying to show me, and I thank you for that.
But let's say we have a table with 2000 records, and I need to list those records. For each one, I have to query 30 other tables (some of them with 1000 records, others with 10 records) to add additional information to the list, and this happens while the list is "flying" (and as you know, we must be very fast at that moment).
Now you're going to say: "just build your main query with all those 'joins', and bring in all you need in one step. SQLite can be very fast, if your database is well designed, etc...".
OK, but this query will become very complicated and, even though SQLite is very fast, it will be "too" slow (2 to 4 seconds, as I confirmed, and this isn't an acceptable time for us).
Another complication is that, depending on user interaction, we need to "re-query" all records, because the tables involved are not the same, and we have to "re-join" with another set of tables.
So, an alternative is to bring in only the main records (this will never change, no matter what the user does or wants) with no join (this is very fast!) and query the other tables every time we want some data. Note that for the table with only 10 records, we will fetch the same records many, many times. In this case, it is a waste of time, because no matter how fast SQLite is, it will always be more expensive to query, open a cursor, fetch, etc... than to just grab the record from some kind of "memory cache". I want to make clear that we don't plan to keep all data in memory always, just some tables we query very often.
And we came to the original question: What is the best way to "cache" those records? I really like to focus the discussion on that and not "why do you need to cache data?"

The vast majority of the apps on the platform (contacts, Email, Gmail, calendar, etc.) do not do this. Some of these have extremely complicated database schemas with potentially a large amount of data and do not need to do this. What you are proposing to do is going to cause huge pain for you, with no clear gain.
You should first focus on designing your database and schema to be able to do efficient queries. There are two main reasons I can think of for database access to be slow:
You have really complicated data schemas.
You have a very large amount of data.
If you are going to have a lot of data, you can't afford to keep it all in memory anyway, so this is a dead end. If you have complicated structures, you would benefit in either case with optimizing them to improve performance. In both cases, your database schema is going to be key to good performance.
Actually, optimizing the schema can be a bit of a black art (and I am no expert on it), but some things to look out for are correctly creating indices on the columns you will query, designing joins so they will take efficient paths, etc. I am sure there are lots of people who can help you with this area.
You could also try looking at the source of some of the platform's databases to get some ideas of how to design for good performance. For example the Contacts database (especially starting with 2.0) is extremely complicated and has a lot of optimizations to provide good performance on relatively large data and extensible data sets with lots of different kinds of queries.
Update:
Here's a good illustration of how important database optimization is. In Android's media provider database, a newer version of the platform changed the schema significantly to add some new features. The upgrade code to modify an existing media database to the new schema could take 8 minutes or more to execute.
An engineer made an optimization that reduced the upgrade time of a real test database from 8 minutes to 8 seconds. A 60x performance improvement.
What was this optimization?
It was to create a temporary index, at the point of upgrade, on an important column used in the upgrade operations. (And then delete it when done.) So this 60x performance improvement comes even though it also includes the time needed to build an index on one of the columns used during upgrading.
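The pattern can be sketched as follows with Python's sqlite3 module. This is not the media provider's actual schema or upgrade code; the audio and albums tables, the album_key column and the index name are invented purely to show the shape of "build an index, run the heavy upgrade statement, drop the index".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, album_key TEXT);
    CREATE TABLE audio  (id INTEGER PRIMARY KEY, album_key TEXT, album_id INTEGER);
""")
conn.executemany("INSERT INTO albums (album_key) VALUES (?)", [(f"k{i}",) for i in range(100)])
conn.executemany("INSERT INTO audio (album_key) VALUES (?)", [(f"k{i % 100}",) for i in range(1000)])
conn.commit()

def upgrade():
    """Fill the new album_id column, using an index that exists only during the upgrade."""
    with conn:
        # Without this index, every row of audio triggers a full scan of albums.
        conn.execute("CREATE INDEX tmp_albums_key ON albums(album_key)")
        conn.execute("""
            UPDATE audio
            SET album_id = (SELECT id FROM albums WHERE albums.album_key = audio.album_key)
        """)
        # The index is not needed for normal operation, so remove it again.
        conn.execute("DROP INDEX tmp_albums_key")

upgrade()
missing = conn.execute("SELECT COUNT(*) FROM audio WHERE album_id IS NULL").fetchone()[0]
print(missing)
```

On a toy data set like this the difference is not measurable; the gain shows up when both tables are large and the correlated lookup would otherwise scan the whole table for every row.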
SQLite is one of those things where if you know what you are doing it can be remarkably efficient. And if you don't take care in how you use it, you can end up with wretched performance. It is a safe bet, though, if you are having performance issues with it that you can fix them by improving how you are using SQLite.

The problem with a memory cache is of course that you need to keep it in sync with the database. I've found that querying the database is actually quite fast, and you may be pre-optimizing here. I've done a lot of tests on queries with different data sets and they never take more than 10-20 ms.
It all depends on how you're using the data, of course. ListViews are quite well optimized to handle large numbers of rows (I've tested into the 5000 range with no real issues).
If you are going to stay with the memory cache, you may want to have the database notify the cache when its contents change, and then you can update the cache. That way anyone can update the database without knowing about the caching. Also, if you build a ContentProvider over your database, you can use the ContentResolver to notify you of changes if you register using registerContentObserver.
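As a language-neutral illustration of the idea (written in Python rather than Android Java, and without any ContentProvider machinery), a cache for a small lookup table might look like the sketch below. The CachedTable class, its invalidate method and the units table are hypothetical; on Android the invalidate call would be triggered from a ContentObserver callback instead of being called by hand.

```python
import sqlite3

class CachedTable:
    """Keeps a small, frequently read table in a dict keyed by one column."""

    def __init__(self, conn, table, key_column):
        self._conn = conn
        self._table = table  # trusted identifier, never user input
        self._key = key_column
        self._rows = None    # loaded lazily on first access

    def _load(self):
        cur = self._conn.execute(f"SELECT * FROM {self._table}")
        columns = [d[0] for d in cur.description]
        self._rows = {}
        for row in cur:
            record = dict(zip(columns, row))
            self._rows[record[self._key]] = record

    def get(self, key):
        if self._rows is None:
            self._load()
        return self._rows.get(key)

    def invalidate(self):
        """Call this from whatever notifies you that the table changed."""
        self._rows = None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE units (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("INSERT INTO units VALUES (1, 'box')")

cache = CachedTable(conn, "units", "id")
first = cache.get(1)["label"]   # read from the database once, then cached

conn.execute("UPDATE units SET label = 'crate' WHERE id = 1")
stale = cache.get(1)["label"]   # still the old value: nobody told the cache

cache.invalidate()              # the change notification arrives
fresh = cache.get(1)["label"]   # reloaded from the database
print(first, stale, fresh)
```

The stale read in the middle is exactly the synchronization problem mentioned above, which is why the notification step matters.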
