[postgis-users] Determining clusters of points

Tue Dec 7 17:15:19 PST 2010

Hi Sébastien,

Sounds like an interesting situation :-)

You might look at the newish (v8.4+ ?) RECURSIVE/WITH capabilities of Postgres.

Given Postgis allows you to create arbitrary buffer zones around your points, the RECURSIVE query capability allows you to write an SQL that will recursively track points which can interact with each other, so you should be able to 

select points which interact with points which interact with points which.... 
(ie: creates an "interaction cluster")

by using a Postgis buffer or a distance expression to represent "interact with". Or a similar effect to a predetermined level with subqueries in an SQL & cut-and-paste.

However I'm not sure this will give the performance you need, and an in-memory tool like R may be more appropriate than a disk based RDBMS solution.

A bit left field, but, depending on your use case, you could generate polygons for an "immediate neighbour" interaction as the distinct geomunion of all the point buffers which intersect with each other, then for each, select the count() of points contained in each to "rank" the clusters, or the count/area to estimate density within clusters, or assign points to clusters or vice versa.

There are lots of things you can do with Postgis in this arena, you need to really clarify the actual use case to see if it can be done in SQL, how it could be done and if it should be done in SQL :-)

Cheers,

  Brent Wood

--- On Wed, 12/8/10, Sébastien Lorion <sl at thestrangefactory.com> wrote:

From: Sébastien Lorion <sl at thestrangefactory.com>
Subject: Re: [postgis-users] Determining clusters of points
To: pcreso at pcreso.com
Cc: "PostGIS Users Discussion" <postgis-users at postgis.refractions.net>
Date: Wednesday, December 8, 2010, 10:53 AM

Hello Brent,
Thanks for your help also. I saw this method, but I cannot use it in my scenario as it is too imprecise.
Consider all points as agents that interact with each other, each up to a certain range. I need to find interaction clusters and cannot afford to have those split in many parts, though it is ok to miss including some points.

Sébastien
On Tue, Dec 7, 2010 at 15:06,  <pcreso at pcreso.com> wrote:

Hi Sébastien ,

One way I "clustered" points in the (distant) past for a zoom-layer mapping solution was to create multiple geometries, by reducing the number of decimal points in the X & Y coords and grouping by these locations. Simplistic & not especially statistical, but given a random point distribution you get an order of magnitude reduction in the number of positions of points at each step. With a non-random distribution, the reduction in number of these clusters is often greater. 

I have also done a similar thing by creating grids of square cells of appropriate sizes, then assigning points to the cells using Postgis queries,  which allows a regular clustered sampling of the points by grid cell.

If you are after a more statistical approach, I'd look at using R, and once you have your R clustering capability coded, perhaps
 look at PL/R to embed the clustering capability within your database.

What isn't clear from your question is whether the clustering is something done often, perhaps at various resolutions, or is done once, to assign points to clusters, then used as these groups from then on. This would have an impact on performance & the approach you might take.

Also note that with both Postgis & R, any use of multiple cores or hardware clusters will require some work, as these are not applications that parallel process particularly well, but you could perhaps develop code that assigns individual analyses of data clusters to a load balanced hardware cluster. 

If you have a clusterin algorithm, it may also be feaqsibl to implement it as a native user defined Postgis/Postgres function, but I think PL/R is an easier approach for embedded SQL clustering capability.

HTH,

  Brent Wood

--- On Wed, 12/8/10,
 Sébastien Lorion <sl at thestrangefactory.com> wrote:

From: Sébastien Lorion <sl at thestrangefactory.com>
Subject: Re: [postgis-users] Determining clusters of points
To: "PostGIS Users Discussion" <postgis-users at postgis.refractions.net>

Date: Wednesday, December 8, 2010, 8:03 AM

I like your idea very much, especially since something I did not say is that the points have a range attribute which determines how far they can interact with other points. So the circle buffer you talk about would have a diameter equals to the range of each point.

What would be fastest as the last step : bounding box, minimum bounding circle or minimum convex hull ? I am guessing the BB, but is the difference significant enough that one should be chosen over another ?

Sébastien

On Tue, Dec 7, 2010 at 13:42, Emilie Laffray <emilie.laffray at gmail.com> wrote:

On 7 December 2010 17:01, Sébastien Lorion <sl at thestrangefactory.com> wrote:

Hello,
I am trying to find an efficient way to find clusters of points as shown in the attached image. The only clustering criteria is the distance between the points. The dataset can be very large (millions of points) and point distribution is mostly clustered with some sparse points in the gaps.

I searched the net and this mailing list and found two promising solution paths: 
- use a statistical tools such as R with a density function (http://www.r-project.org)

- use a clustering algorithm like those explained here http://www.med.nyu.edu/biostatistics/people/Ilana%20Belitskaya-Levy/Courses/MAS/Handouts/clustering.pdf (agnes seems the most promising for my purposes)

I would like your advice to help me find which approach would be best suited with PostGIS (maybe there is even something already made that I can use?). Whatever solution I pick, it must be efficient and the workload must be able to be distributed on a cluster of commodity hardware.

I am new to GIS and this mailing list, so please excuse me if I am not using the right vocabulary.
Thank you very much!

Hello,

Some time ago I have worked on something similar, except that I was using circles instead of boxes which should not be a problem. I am just giving the logic as I don't have access to the code right now.

You can start by creating a buffer around each of your points of the distance you want. 
The next step is to create an UNION of all the buffers that intersect.
You get the list of points included in each of the resulting polygons and then you create either a bounding box around them or use a minimum bounding circle (Postgis 1.5 and above).

Emily Laffray

_______________________________________________

postgis-users mailing list

postgis-users at postgis.refractions.net

http://postgis.refractions.net/mailman/listinfo/postgis-users

-----Inline Attachment Follows-----

_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net

http://postgis.refractions.net/mailman/listinfo/postgis-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20101207/b430ece2/attachment.html>