Joining a vector layer with a CSV file that contains duplicates in QGIS

Question

I'm currently working on a project, and I received a CSV file which has additional information about each feature. I intend to join the CSV file with an existing layer. Picture it like this:

However, as in the picture above, the field I intend to join has some duplicates. This means that not all information will be transferred during the join if I don't edit it beforehand.

It would be helpful if I could get the data to look like this:

Sometimes when I join different layers, this happens automatically even though I don't want it to. However, since the CSV layer has no geometry, I can't use that approach here.

The "Aggregate" tool (and similar) is not really helpful, since I need the fields to remain separate rather than merged into one.

I could create a dummy geometry for each feature. However, in the actual project I'm working with hundreds of different features, and this would take longer than editing the data manually. Editing it manually is also not ideal, as many more features will be added to the project.

TL;DR: Is there any tool in QGIS that can help me achieve what I need?

I have, actually! But maybe I don't have the proper vocabulary to find what I was looking for. — eafwnrg, Oct 27 '23 at 08:09
This is data from a single CSV file, like I said initially. There is no geometry, just the dataset. The fid does not have to match, as I'm trying to join them by the attribute name. — eafwnrg, Oct 27 '23 at 09:44
Not the one from the project, as it contains sensitive data. The dummy dataset I used for the pictures can be recreated quite easily though. I just need the additional data from the duplicates to go into their own column. — eafwnrg, Oct 27 '23 at 09:54
I think Python would be good enough. I was trying to aggregate the values and then split them into different columns, but it is beyond my abilities. — eafwnrg, Oct 27 '23 at 10:29

Taras · Answer 1 · 2023-11-01T08:42:51.670

Python + PyQGIS solution

For now, it is hard to say which fields from the CSV file you are joining to your original vector data. My assumption is that the join is done via the "name" attribute.

So, let's consider the following example.

There are two objects: a point layer called 'points' and a CSV-file named 'test' with some dummy data, see the image below.

The task is the same as yours: I will process the CSV file, create a temporary layer from its data, and join that layer to the original vector layer.

Proceed with Plugins > Python Console > Show Editor and paste the script below:

# imports
import csv
from os.path import realpath
from itertools import groupby
from qgis.core import QgsProject, QgsFeature, QgsVectorLayer, QgsField, QgsVectorLayerJoinInfo
from PyQt5.QtCore import QVariant
######################################################
# PART 1 : PROCESSING THE ORIGINAL CSV-FILE AND
# PROVIDING THE RESULT STORED AS A LIST WITH DICTS
######################################################

# referring to the original CSV-file
path_to_csv = realpath("D:/qgis_test/test.csv")

# referring to the grouping and working fields in the CSV-file
grouping_field = "city"
working_field = "month"

# opening the csv-file
with open(path_to_csv, 'r', newline='', encoding='utf-8') as csv_file:
    # creating a csv reader object
    csv_reader = csv.reader(csv_file, delimiter=',')
    # getting original column names
    columns = next(csv_reader) # e.g. 'id', 'city', 'month'
    # getting index of the grouping and working fields
    ind_group = columns.index(grouping_field)
    ind_work = columns.index(working_field)

    # getting original data as a list with lists (must happen while the file is open)
    original_data = list(csv_reader)

# sorting by the grouping field, because groupby() only merges adjacent rows
original_data.sort(key=lambda column: column[ind_group])
# grouping data by the second column i.e. "city"
data_grouped = {key: list(group) for key, group in groupby(original_data, lambda column: column[ind_group])}

# finding the longest grouped list of data
n = max(list(len(set(column[ind_work] for column in value)) for value in data_grouped.values()))
# creating new columns
new_columns = [columns[ind_work]] + [columns[ind_work] + str(i + 1) for i in range(n) if i > 0]  # 'month', 'month2', 'month3'

# initiating a temporal storage for processed data
new_data = []

# looping over grouped data
for key, value in data_grouped.items():
    # making storage for processed data
    feature_new_data = {}
    # processing new data for each record
    feature_new_data[grouping_field] = key
    dummy_fill = [None] * n
    #print(value)
    unique_values = set(x[ind_work] for x in value)
    new_values = list(unique_values)[:n] + dummy_fill[len(unique_values):]
    # works for Python >= 3.9 https://peps.python.org/pep-0584/
    feature_new_data = feature_new_data | dict(zip(new_columns, new_values))
    new_data.append(feature_new_data)


######################################################
# PART 2 : CREATING A TEMPORARY LAYER WITH ATTRIBUTE
# TABLE WHERE THE PROCESSED DATA WILL BE STORED
######################################################

# creating the temporary layer
attr_layer = QgsVectorLayer("None", "attr_layer", "memory")
# accessing its provider
provider = attr_layer.dataProvider()
attr_layer.startEditing()
# setting columns that have to be created
relevant_columns = [columns[ind_group]] + new_columns
# creating columns in the temporary layer
for column in relevant_columns:
    provider.addAttributes([QgsField(column, QVariant.String)])
attr_layer.updateFields()
# nesting new data into the temporary layer
for new_data_set in new_data:
    feat = QgsFeature()
    feat.setAttributes([*new_data_set.values()])
    provider.addFeature(feat)
    attr_layer.updateExtents()
attr_layer.commitChanges()
# If necessary the 'attr_layer' layer can be added to the Project
QgsProject.instance().addMapLayer(attr_layer)

######################################################
# PART 3 : JOINING ORIGINAL VECTOR LAYER WITH
# TEMPORARY LAYER THAT CONTAINS PROCESSED DATA
######################################################

# referring to the original Vector layer
point_layer = QgsProject.instance().mapLayersByName("points")[0]
# target layer to join
target_layer_id = attr_layer.id()
# parameters for the join
joining_field = grouping_field
prefix = ""
# performing join
joinObject = QgsVectorLayerJoinInfo()
joinObject.setJoinFieldName(joining_field)
joinObject.setTargetFieldName(joining_field)
joinObject.setJoinLayerId(target_layer_id)
joinObject.setJoinFieldNamesSubset(new_columns)
joinObject.setPrefix(prefix)
joinObject.setUsingMemoryCache(True)
joinObject.setJoinLayer(attr_layer)
point_layer.addJoin(joinObject)

Change the names of the input elements to match your data, press Run script, and the output will look like this:

Note: Sorting was not a part of this script.
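
The grouping step from Part 1 can also be tried outside QGIS. Below is a minimal sketch, assuming the same "city" and "month" columns as the dummy CSV above. The function name pivot_duplicates and the sample rows are hypothetical. It collects values in a plain dict instead of using groupby, so the input does not need to be sorted and the values keep their order of first appearance:

```python
import csv
import io


def pivot_duplicates(rows, group_field, value_field):
    """Spread repeated values of value_field over numbered columns, one row per group."""
    grouped = {}
    for row in rows:
        values = grouped.setdefault(row[group_field], [])
        value = row[value_field]
        # skip empty cells and values already collected for this group
        if value and value not in values:
            values.append(value)

    # the widest group decides how many columns are needed
    width = max((len(v) for v in grouped.values()), default=0)
    # 'month', 'month2', 'month3', ... as in the script above
    columns = [value_field] + [f"{value_field}{i}" for i in range(2, width + 1)]

    result = []
    for key, values in grouped.items():
        padded = values + [None] * (width - len(values))
        result.append({group_field: key, **dict(zip(columns, padded))})
    return result


# hypothetical sample data standing in for test.csv
sample = io.StringIO(
    "id,city,month\n1,Berlin,May\n2,Paris,June\n3,Berlin,July\n4,Berlin,May\n"
)
wide_rows = pivot_duplicates(csv.DictReader(sample), "city", "month")
for row in wide_rows:
    print(row)
```

The resulting list of dicts has the same shape as new_data in the script, so it could presumably be fed into Part 2 unchanged.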

Babel · Answer 2 · 2023-10-27T15:37:52.630

You can in fact use Aggregate for that and create the different new fields during this process.

Simply aggregate all responsibles for each name into an array and then get the first element of the array (index 0: [0]) for responsible1, the second element (index 1: [1]) for responsible2, etc. In the Aggregate dialog, replace the field name in Source Expression with this expression:

array_agg (responsible,group_by:=name)[0]

Then add additional fields and proceed accordingly.
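
To make the logic of that expression explicit, here is a rough Python analogue (plain Python, not QGIS code). It assumes rows are dicts with the "name" and "responsible" fields. The helper name nth_in_group and the sample values are hypothetical. Returning None for an index beyond the end of a group's array is a choice made for this sketch, not a statement about how QGIS behaves:

```python
from collections import defaultdict


def nth_in_group(rows, group_field, value_field, index):
    """For each row, return the index-th value collected for that row's group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])

    result = []
    for row in rows:
        values = groups[row[group_field]]
        # assumption of this sketch: a missing element becomes None
        result.append(values[index] if index < len(values) else None)
    return result


# hypothetical sample rows
rows = [
    {"name": "Schwerin", "responsible": "A"},
    {"name": "Kiel", "responsible": "B"},
    {"name": "Schwerin", "responsible": "C"},
]
responsible1 = nth_in_group(rows, "name", "responsible", 0)
responsible2 = nth_in_group(rows, "name", "responsible", 1)
```

Every row of a group receives the same value, which is why aggregating by name afterwards yields one record per name with the values spread across responsible1, responsible2 and so on.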

If you have a large dataset and you do not know how many new fields you have to create (maximum number of responsibles for the same city/name), you can calculate this number in the field calculator with this expression:

case 
when    $id = minimum ($id)
then    maximum( 
            array_length(
                array_filter ( 
                    array_agg ("responsible", group_by:="name"), 
                    @element is not NULL
                )
            )
        )
end

In this case, you have to create 4 new fields, as Schwerin appears 5 times but only 4 times with a responsible; all other names appear fewer than 4 times:
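
The same count can be cross-checked outside QGIS. This is a small sketch under the assumption that the table is available as a list of dicts with "name" and "responsible" keys, with None standing in for NULL. The function name max_values_per_group and the sample rows are hypothetical:

```python
from collections import Counter


def max_values_per_group(rows, group_field, value_field):
    """Return the largest number of non-null values found within a single group."""
    counts = Counter(
        row[group_field] for row in rows if row[value_field] is not None
    )
    return max(counts.values(), default=0)


# hypothetical rows: Schwerin appears 5 times, 4 of them with a responsible
rows = [
    {"name": "Schwerin", "responsible": "A"},
    {"name": "Schwerin", "responsible": "B"},
    {"name": "Schwerin", "responsible": None},
    {"name": "Schwerin", "responsible": "C"},
    {"name": "Schwerin", "responsible": "D"},
    {"name": "Kiel", "responsible": "E"},
    {"name": "Kiel", "responsible": "F"},
]
fields_needed = max_values_per_group(rows, "name", "responsible")
print(fields_needed)
```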
