0

I have a table in mysql as mentioned below :

create table test (
id int,
page_id varchar(2),
name varchar(255));

insert into test values 
(1,'P1','a,b,c,d,e'),
(1,'P2','f,h,z,a,c'),
(1,'P3','x,yz,g,c'),
(2,'P1','a,z'),
(2,'P2','a,c,v,b');

I need to count the occurrence of the values available in different pages. For e.g.

  • count of "a" is 2 for id 1 because it is available in page P1 and P2.
  • count of "b" is 1 for id 1 because it is only available in
    page P1.
  • count of "c" is 3 for id 1 because it is only available in page P1,P2 and P3

The number of pages are approximately 2,00,000 for id 1 and 56,000 for id 2

Can anyone help me how to count the occurrence of "name" in id for the page.

Ergest Basha
  • 4,865
  • 4
  • 5
  • 26
Ankur Sharma
  • 125
  • 1
  • 11

1 Answers1

1

First suggestion from me is fixing database design and never store comma separated values.

One way of doing this is first you need to transform the name column into rows. To do so, create a number table with the maximum character length of name column.

create table numbers (
nr int );

insert into numbers values (1),(2), (3),(4), (5),(6);

Your data example:

create table test (
id int,
page_id varchar(2),
name varchar(255));

insert into test values 
(1,'P1','a,b,c,d,e'),
(1,'P2','f,h,z,a,c'),
(1,'P3','x,y,z,g,c'),
(2,'P1','a,z'),
(2,'P2','a,c,v,b');

Below query will return name column into rows

select id,
       page_id,
       SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1) as name
from numbers 
inner join test on CHAR_LENGTH(test.name)-CHAR_LENGTH(REPLACE(test.name, ',', ''))>=numbers.nr-1
group by id,page_id,SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1)
order by id asc;

Finally for your expected result, apply an outer query counting:

select id,name,count(distinct page_id,name) as page_count
from (
       select id,
       page_id,
       SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1) as name
from numbers 
inner join test on CHAR_LENGTH(test.name)-CHAR_LENGTH(REPLACE(test.name, ',', ''))>=numbers.nr-1
group by id,page_id,SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1)
     ) as t1
group by id,name
order by id asc;

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=a5dc594b23eacc3298ca2c5a103e0fae

Note. For count(distinct page_id,name) both columns should have the same data types

Refrences:

SQL split values to multiple rows

Count distinct value pairs in multiple columns in SQL

Ergest Basha
  • 4,865
  • 4
  • 5
  • 26
  • 1
    Nice fiddle! Yes, this is exactly the right approach, as Hadley describes in https://r4ds.had.co.nz/tidy-data.html : "Each observation must have its own row." – J_H Jun 04 '22 at 20:46