6

I am working on a big vector data set with more than 2.4 million features containing 10x10m squares. I wrote a small script in Python to delete uninteresting fields and perform calculations on the remaining fields of interest, but this process comes to no end.

I am wondering how I can speed up processing.

I thought about the implementation of parallel processing using all four cores, but I didn't find any helpful information.

What I want to do is split the whole area in four parts, than for every tile I want remove to all fields where I'm not interested in and make some calculation on the remaining fields followed by a conversion to a raster. This is my code:

from qgis.core import QgsTrackedVectorLayerTools

layer = iface.addVectorLayer("[mypath]name.gpgk]","name","ogr")

##################################
###Filter Fields by selecting columns of interest

idx_keep = [] idx_all = [] idx_delete = []

fields = ["basalareap","basalareas","basalaread"]

for i in fields: idx_keep.append(layer.fields().indexFromName(i))

field_names = [field.name() for field in layer.fields()]

for i in field_names: idx_all.append(layer.fields().indexFromName(i))

idx_delete = [x for x in idx_all if x not in idx_keep]

layer.startEditing()

layer.deleteAttributes(idx_delete)

##################################
##Create new fields

pv = layer.dataProvider() pv.addAttributes([QgsField('Share_SW',QVariant.Double),
QgsField('Share_HW', QVariant.Double)])

layer.updateFields()

expression_SW = QgsExpression('("basalareap"+"basalareas")/("basalareap"+"basalareas"+"basalaread")') expression_HW = QgsExpression('("basalaread")/("basalareap"+"basalareas"+"basalaread")')

context = QgsExpressionContext() context.appendScopes(QgsExpressionContextUtils.globalProjectLayerScopes(layer))

#with edit(layer): layer.startEditing()

for f in layer.getFeatures(): context.setFeature(f) f['Share_SW'] = expression_SW.evaluate(context) f['Share_HW'] = expression_HW.evaluate(context) layer.updateFeature(f)

QgsVectorFileWriter.writeAsVectorFormat(layer,'[mypath]','utf-8',layer.crs(),'ESRI Shapefile') iface.mainWindow().findChild(QAction, 'mActionToggleEditing').trigger() #Here convert to raster

Can anybody give me a hint how to realize this?

Perhaps there is a tutorial.

I'm using Windows 10 with QGIS 3.10.

Update

I think the geopandas approach from olivier is faster and cleaner. I asked an seperate question for the geopandas package: Parallel processing in Pandas and already get a good answer.

DrSnuggles
  • 399
  • 4
  • 13
  • related, but quite old https://gis.stackexchange.com/questions/98335/enabling-multithreaded-processing-in-qgis? – nmtoken Jan 27 '20 at 15:44
  • check out the threading module https://realpython.com/intro-to-python-threading/ https://docs.python.org/3/library/threading.html#module-threading – jbalk Jan 28 '20 at 00:09

1 Answers1

2

I would first switch to using geopandas (you can install it using osgeo4w installer using the advanced install option). It should make the processing faster, avoiding the loops. Check the speed on a subset of your data before attempting parallelization.

import geopandas as gpd

layer = gpd.read_file("[mypath]name.gpgk]", layer="name")

###Filter Fields by selecting columns of interest
fields = ["basalareap","basalareas","basalaread"]
layer = layer[fields+['geometry']]

#Create new columns
layer['denominator'] = layer[["basalareap","basalareas","basalaread"]].sum(axis=1)

layer['Share_SW'] = layer[['basalareap','basalareas']].sum(axis=1)/layer['denominator']
layer['Share_HW'] = layer['basalaread']/layer['denominator']

layer.drop(labels='denominator', axis=1, inplace=True)

For parallelization, you would most likely use the function apply and can start from that blog post: http://blog.adeel.io/2016/11/06/parallelize-pandas-map-or-apply/

Olivier
  • 419
  • 2
  • 11