6

I'm having a memory leak issue. On further investigation, it seems the file-based geodatabase I write to in a loop is growing very large, and as it grows it significantly degrades the performance of the scripts I am running.

Any ideas how to optimize the configuration of the fgdb, or how to speed it all up? I am not writing to 'in_memory'; I am using AggregatePoints to create a temporary feature class (which I delete), and I buffer this FC, which I keep.

However, it seems to get slower, and slower, and slower...

import time
import arcpy

def createGeom1(geom, scratchDB):
    # build a unique output name from the current timestamp
    filetime = str(time.time()).split(".")
    outfile = "fc" + filetime[0] + filetime[1]
    # the "Polygon" feature dataset inside the scratch file geodatabase
    outpath = scratchDB + "tmp1.gdb/Polygon/"
    outFeatureAggClass = outpath + outfile + "_Agg"
    arcpy.AggregatePoints_cartography(geom, outFeatureAggClass, "124000 meters")

geom is a collection of points; scratchDB is the scratch area (the local gdb I am using).

I am just looping through a list of files; for each one I call a procedure that creates a list of geoms (which doesn't degrade) and then call this. Doing this, you will see this function, createGeom, degrade significantly, while the previous one doesn't slow down at all.
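For completeness, roughly what the follow-up step I described looks like - the buffer distance and output names here are just placeholders, not my real values:

def bufferAndCleanUp(outFeatureAggClass, outpath, outfile):
    # buffer the aggregated polygons and keep the result
    outFeatureBufClass = outpath + outfile + "_Buf"
    arcpy.Buffer_analysis(outFeatureAggClass, outFeatureBufClass, "1000 meters")  # placeholder distance
    # delete the temporary aggregate feature class
    arcpy.Delete_management(outFeatureAggClass)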

Mike T
  • 42,095
  • 10
  • 126
  • 187
Hairy
  • 4,843
  • 1
  • 26
  • 45
  • Can't you flush (to the FC) at regular intervals, instead of doing it all at once at the end? Just a thought... – ujjwalesri Aug 09 '11 at 07:53
  • @ujjwalesri - Sorry, I don't get what you mean? – Hairy Aug 09 '11 at 08:32
  • Local or network disk? – Chad Cooper Aug 09 '11 at 12:08
  • Local disk, keep well away from the network... – Hairy Aug 09 '11 at 13:24
  • Post your machine specs including hard drive model/speed/size, RAM speed/size/timings, graphics card model/memory and processor. – MLowry Aug 09 '11 at 13:59
  • XP SP3 - Intel Core 2 Duo E6750 2.66GHz with 2.95GB RAM - Video Card not really important, as it is done through Eclipse/PyDev - Python. – Hairy Aug 09 '11 at 14:12
  • The increase in memory usage hints that you might be using the wrong cursor, or are leaking geometries. Can you post a sample of what you are doing? – Ragi Yaser Burhum Aug 10 '11 at 04:34
  • Can you remove the file geodatabase from the equation, so you can see whether the problem is caused by something else (eg a logic flaw in your code)? For example, run the loop outside of an edit session, or comment out the section of the code which writes to the GDB. The fact the process starts fast then slows down implies it may not be a hardware issue, no? – Stephen Lead Aug 10 '11 at 04:45
  • @Stephen, it isn't hardware, it is a memory issue, or something to do with accessing the fgdb. I am using an UpdateCursor, in Arcpy, to update one row, for new data I have created. But even without that, there is an issue simply creating a feature class from agg points – Hairy Aug 10 '11 at 06:54

4 Answers

4

Depending on how big your file geodatabase is, you might be able to use a RAM disk (e.g. ImDisk: http://www.ltr-data.se/opencode.html/#ImDisk). It might speed things up if your hard disks are slow.
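If you do try it, one low-effort way to test is to point your scratch workspace at the RAM disk drive letter. A sketch, assuming R: is whatever drive letter ImDisk mounts (the letter and gdb name are placeholders):

import arcpy

# R: is assumed to be the mounted RAM disk; create a scratch fgdb on it
arcpy.CreateFileGDB_management("R:/", "scratch.gdb")
arcpy.env.scratchWorkspace = r"R:\scratch.gdb"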

Here's a question (mine) about RAM Disks: Does using RAM Disk improve ArcGIS Desktop performance appreciably?

The RAM Disk option is low-hanging fruit. Unfortunately, I can't link to whuber's comment on that question, but I think it highlights where the most performance gains can be made (so maybe you should post your code in your question).

whuber: Even a high-end (ordinary) disk drive can make an appreciable difference. However, by far the most dramatic improvement in long-running GIS processes is made by improving the algorithm. In many, many cases, if your computation is taking noticeably longer than the time required to read all inputs and write all outputs, chances are you are using an inefficient algorithm.

Jay Cummins
  • 14,642
  • 7
  • 66
  • 141
  • 1
    Yeah, I have been through the code a few times and even had it code reviewed by ESRI, who said it couldn't, really, be improved. I think a hardware review is out of the question, but I might push for a solid state drive lol. I saw those posts before, but thanks for adding them here for completeness. Thanks for the help. – Hairy Aug 09 '11 at 11:46
  • 1
    yes, if the inefficient algorithms are in Esri's geoprocessor, it doesn't matter how perfect your python code is. – Jay Cummins Aug 09 '11 at 11:59
  • It's actually a fair point! I put a test harness together for ESRI, which simply looped through a list of text files containing collections of points, each of which was passed to one procedure to aggregate the points. One call to arcpy caused the memory to get all beat up. – Hairy Aug 09 '11 at 12:15
  • When the process starts, it uses very limited memory and performs incredibly quickly, ~2 seconds a polygon. However, this slowly creeps up as data is added to the fgdb until it's almost 30 seconds a polygon for more or less the same size unit. – Hairy Aug 09 '11 at 13:39
  • Solid state drives are faster, but the ArcGIS software is the bottleneck in performance (a re-write would be required). If there is no need to write to the geodatabase, then compressed (rather than compacted) data is a direct-access format, so it is faster. http://webhelp.esri.com/arcgisserver/9.3/java/index.htm#geodatabases/file_g-1798439258.htm – Mapperz Aug 09 '11 at 14:36
2

Check out item #4 of my answer to this related question: Performance of ArcGISScripting and large spatial data sets. I bet you are loading inside an EditSession and that you are not using recycling cursors :)

Update based on comments:

Although ArcPy and ArcObjects are syntactically different, semantically they are the same. You asked whether the performance of a FileGDB degrades as it fills up. The answer is no, it does not. What degrades performance is when the cursors you are using, whether explicit (i.e. declared by you) or implicit (created for you by a canned arcpy function), are not released properly.

There is a price for using the canned functions: a trade-off of simplicity for resource control (you barely have any). This is not inherent to FileGDB, but to the ArcGIS architecture as a whole. The moment you choose to use ArcPy, you are making the statement "I'd rather give up some performance than deal with the complexity of a lower-level API." The higher you go, the more you trade. ArcPy is pretty much at the top of it all... one function and you are done!

Nevertheless, that one function still went through the ArcObjects initialization process, the associated workspace factory allocation, and the associated file handle opened for every scratch workspace (a full gdb!), and until the internal memory-freeing routines kick in (in .NET you can explicitly call the garbage collector, and in C++ it is immediate; honestly I am not sure about the Python-COM memory model) you will have these gdbs and resources staying resident. So you probably have several gdbs and fc handles in memory every time you loop. That is the price of ArcPy, and it is up to you to decide whether it is worth it.
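To make the cursor point concrete, here is a minimal sketch of the explicit-release pattern from the classic cursor examples (the path and field name are purely illustrative):

import arcpy

rows = arcpy.UpdateCursor(r"C:\data\scratch.gdb\myPoints")  # placeholder feature class
for row in rows:
    row.setValue("MYFIELD", 1)   # placeholder field
    rows.updateRow(row)
# explicitly drop the references so the underlying COM/geodatabase handles are released
del row
del rows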

Ragi Yaser Burhum
  • 15,339
  • 2
  • 59
  • 76
  • That's for ArcObjects, not ArcPy, Ragi – Hairy Aug 10 '11 at 06:52
  • ArcPy are just python wrappers on top of ArcObjects, so the same concepts apply, except that you have less explicit control of triggering the right behaviors. – Ragi Yaser Burhum Aug 10 '11 at 08:41
  • http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//002z0000001q000000.htm – Ragi Yaser Burhum Aug 10 '11 at 08:47
  • I know they are, which introduces a lot of issues. However, the use of cursors is 'syntactically' different, so I don't have the same options. It also isn't the issue, as I get this by simply calling the AggPoints function inside the call, without a cursor. – Hairy Aug 10 '11 at 08:57
  • Updated my answer to include comments to your comments :) – Ragi Yaser Burhum Aug 10 '11 at 15:28
  • +1 I wonder though, if internally there might be some sort of equivalent to "INCREASEBY" which is used in other DBMS's, like DB2. With these DBMS's if you have a tablespace with AUTORESIZE ON, won't you see the same behavior? Internally, is the file gdb hardwired with an AUTORESIZE? – Kirk Kuykendall Aug 10 '11 at 17:26
  • FileGDB has configuration keywords, too. Some shared with ArcSDE's DBTune keywords http://edndoc.esri.com/arcobjects/9.2/ComponentHelp/esriGeoDatabase/IConfigurationKeyword2.htm For example, you can choose for how the geometries are stored (optimized for size vs speed). I guarantee to you that the defaults are just fine. This sounds more like a different issue related to resources that stay resident in memory – Ragi Yaser Burhum Aug 10 '11 at 20:36
  • @Ragi - This happens when I don't use cursors. I keep adding this fact, it is not just when I use cursors. I understand there is an issue with the Python-COM interface, but my client has chosen Python as their architecture and I have to work within that framework. You keep linking me to ArcObjects FileGDB, but I have no interface to it. I appreciate the advice, I really do, but it's of no help to me, I know all of this. What I think is happening, is that even deleting the objects in the fgdb doesn't remove their references, so you have to compact, but there is a leak between Py/COM – Hairy Aug 11 '11 at 06:28
  • ...and I have to deal with that, given my clients choice. – Hairy Aug 11 '11 at 06:29
  • The reason why I give your samples and mention cursors, is that when you see a complete example of cursor usage in ArcPy, as in some of the examples here http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//002z0000001q000000.htm you will see the usage of the keyword del http://docs.python.org/reference/simple_stmts.html#the-del-statement . My advice to you is to use the crap out of it and hope that the correct COM object goes out of scope and thus releases the zombie COM handles that are thrashing your process. – Ragi Yaser Burhum Aug 11 '11 at 18:27
  • Also, when my clients pick a tech instead of me (the consultant), I warn them of the choice they have made and then become the smart-ass consultant that reminds them of their choice. But that is me. – Ragi Yaser Burhum Aug 11 '11 at 18:28
  • @Ragi - Unless the tech is better ;) – Hairy Oct 11 '11 at 09:34
  • @RagiYaserBurhum - I also get multiple call backs from clients, so it pays to listen, and not be the smart ass ;) – Hairy May 26 '12 at 17:40
  • I also get multiple callbacks, and they appreciate it when I am honest :) – Ragi Yaser Burhum May 28 '12 at 05:16
2

I know an answer has been accepted already, but I thought I'd add my experience and some code I use to get around it. I have a project that generates a large volume of intermediate data. The size is small, but the number of features/rasters is significant. After benchmarking my program to see where the performance loss was occurring, it was clear that it was happening on the FGDB writes. At the start of the program, an FGDB write would take ~2 seconds. After adding about 100-150 features, it would take about 6 seconds, and increase approximately linearly from there.

I solved this by creating a simple gdb class that tracks the number of features and creates a new gdb once we hit the threshold. NOTE that it requires a bit of adaptation because it takes a variable name in another module as a parameter and sets that variable's value to the path to the gdb. In this case the module is "config", as seen in the method switch() below. If you point it at your own module (or just have it set a name in the current module), this can be adapted. It also generates names by splitting off the FGDB name from the input variable, so you'll need to seed that first. Otherwise, you can adjust the code in init to behave differently.

import os
import arcpy
import config  # your own settings module that holds the gdb path variable (as described above)

db_size_threshold = 150

class gdb:
    def __init__(self, tconfig=None):
        self.name = None
        self.config_var = tconfig      # name of the variable in the config module
        self.parent_folder = None
        self.db_count = 1
        self.features_count = 0

        # seed the base name and parent folder from the gdb path currently in config
        name_parts = os.path.split(getattr(config, self.config_var))
        self.name = os.path.splitext(name_parts[1])[0]
        self.parent_folder = name_parts[0]

        self.switch()  # make a new db to start

    def check(self):
        '''checks if we've hit the threshold - if not, increments the count'''

        if self.features_count > db_size_threshold:
            self.switch()
        else:
            self.features_count += 1

    def switch(self):
        '''creates a new gdb for this object and points the config variable at it'''

        new_name = self.next_db()

        arcpy.CreateFileGDB_management(self.parent_folder, os.path.basename(new_name))
        self.features_count = 0

        setattr(config, self.config_var, new_name)

    def next_db(self):
        '''returns the name of the next db after checking for existence - this could possibly use arcpy.CreateUniqueName instead'''
        new_name = os.path.join(self.parent_folder, "%s_%s.gdb" % (self.name, self.db_count))
        while arcpy.Exists(new_name):
            self.db_count += 1
            new_name = os.path.join(self.parent_folder, "%s_%s.gdb" % (self.name, self.db_count))

        return new_name

I've found 150 to be a good tradeoff between complexity and performance for my project. Write times tend to fluctuate between about 3 and 7 seconds with that threshold. Set db_size_threshold to whatever balance you are looking to strike on your own hardware.
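Roughly how it gets wired into a write loop - the write call and variable names below are placeholders for your own code:

# config.scratch_gdb is assumed to already hold the path to a seed .gdb
tracker = gdb(tconfig="scratch_gdb")

for feature in features_to_process:              # placeholder iterable
    out_gdb = getattr(config, "scratch_gdb")     # always read the current gdb path
    write_feature(out_gdb, feature)              # placeholder for your own write routine
    tracker.check()                              # rolls over to a new gdb past the threshold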

nicksan
  • 1,716
  • 1
  • 12
  • 22
0

There's a memory leak in ArcGIS 10 which is apparently being fixed in SP3.

Also, I decided I would delete the 'in_memory' data and compact the database on each loop, which actually sped the application up. Then, when I run the script again, I delete the fgdb and recreate it. That has sped it all up by 30%. However, once the memory leak has been fixed, we expect much better gains in performance. Arcpy is a pig in loops...
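The per-loop cleanup is nothing fancy, roughly this (the gdb path is a placeholder):

import arcpy

# inside the loop: drop any in_memory data, then compact the scratch fgdb
arcpy.Delete_management("in_memory")
arcpy.Compact_management(r"C:\scratch\tmp1.gdb")  # placeholder path to the scratch fgdb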

Hairy
  • 4,843
  • 1
  • 26
  • 45