I currently have a large and growing database of geo-social media. I'm using Postgres with PostGIS, and so far it's been quite good.
My use case so far has been to query the database for all posts within specific areas, such as every tweet inside each postal/zip code in the US and Canada (or other boundaries). The posts are not geocoded to a postal/zip code; selection is done by intersecting each post's coordinates with the boundary geometry.
In the past this was relatively quick and easy to do. However, now that my data table has grown to over 350 million posts, the queries take forever to complete: a single postal/zip code can take up to a day to retrieve the tweets within its geometry.
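For reference, the retrieval is essentially the pattern below (table and column names are placeholders rather than my actual schema), with the intersection test backed by a GiST index on the point geometry:

```sql
-- Placeholder schema: posts(id, geom) and zip_boundaries(zip, geom).
-- The spatial index is what lets the intersection avoid scanning all 350M points.
CREATE INDEX posts_geom_gist ON posts USING GIST (geom);

-- All posts falling inside one postal/zip-code polygon.
SELECT p.id
FROM posts p
JOIN zip_boundaries z ON ST_Intersects(z.geom, p.geom)
WHERE z.zip = '90210';
```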
I am currently debating what to do next. I know of a few options, but I'm curious what you think of them and whether I'm missing anything:
1. Throw more memory at Postgres and reduce disk seeking. I'm currently working on this (a config sketch follows the list), but I wonder whether it is a good long-term solution.
2. Find a better way to partition the data across more tables. Currently all of my posts are in a single table, with every column indexed. Given that I need to scan all of the posts to find the ones inside a given area, is this the wrong way to be doing it? (See the partitioning sketch below.)
3. Create a Cassandra-PostGIS hybrid: bulk-store the data in Cassandra, pull out roughly the data I need (say, one state/province at a time) into a staging table in PostGIS, then query individual postal/zip codes with PostGIS (a rough sketch of the staging step is below). Is this overkill or horribly inefficient? I have looked into Postgres-XL, but it has been kiboshed by others on this project over cost concerns.
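For option 1, this is the kind of tuning I mean (values are purely illustrative for a dedicated box and would need adjusting to the actual hardware; `ALTER SYSTEM` assumes Postgres 9.4+):

```sql
-- Illustrative values only; tune to the machine's RAM and workload.
ALTER SYSTEM SET shared_buffers = '16GB';         -- cache more table/index pages in Postgres (restart required)
ALTER SYSTEM SET effective_cache_size = '48GB';   -- tell the planner how much the OS can cache
ALTER SYSTEM SET work_mem = '256MB';              -- per-sort/hash memory; watch concurrent queries
ALTER SYSTEM SET maintenance_work_mem = '2GB';    -- faster index builds and VACUUM
ALTER SYSTEM SET random_page_cost = '1.1';        -- closer to 1.0 if the data lives on SSDs
SELECT pg_reload_conf();                          -- applies the settings that don't need a restart
```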
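For option 2, what I picture is partitioning by a coarse region key assigned at load time, so a per-state/province query only touches one partition. This sketch assumes declarative partitioning (Postgres 10+) and hypothetical names; older versions would do the same thing with inheritance:

```sql
-- Posts partitioned by a coarse region code derived once on insert,
-- so a per-state/province pull never touches the other partitions.
CREATE TABLE posts (
    id     bigint,
    region text,                       -- e.g. state/province code
    geom   geometry(Point, 4326)
) PARTITION BY LIST (region);

CREATE TABLE posts_on PARTITION OF posts FOR VALUES IN ('ON');
CREATE TABLE posts_ny PARTITION OF posts FOR VALUES IN ('NY');
-- ... one partition per state/province ...

-- Each partition keeps its own spatial index for the zip-level intersections.
CREATE INDEX ON posts_on USING GIST (geom);
CREATE INDEX ON posts_ny USING GIST (geom);
```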
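And for option 3, the PostGIS half of the hybrid would look roughly like this, independent of whether the bulk store ends up being Cassandra (again, table names are placeholders):

```sql
-- Stage roughly one state/province worth of posts pulled from the bulk store,
-- then run the fine-grained zip-code intersections against the much smaller table.
CREATE TABLE staging_posts (
    id   bigint,
    geom geometry(Point, 4326)
);
-- ... bulk-load the state's rows here (e.g. COPY from an export out of Cassandra) ...

CREATE INDEX ON staging_posts USING GIST (geom);
ANALYZE staging_posts;

-- Tag every staged post with the postal/zip code whose polygon it falls in.
SELECT z.zip, p.id
FROM staging_posts p
JOIN zip_boundaries z ON ST_Intersects(z.geom, p.geom);
```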