From a large-scale patent analysis, I have some results that show up clearly in a graphic, but I would like to capture them in a model so that I can quantify the fit.

The scatterplot below shows (as I read it) that, up to a certain optimum, increased region diversity enables high radicality. Because a huge number of data points lie close to the x axis, I am unable to capture this with regression (linear or polynomial): the R^2 values are below 0.01.

Scatter plot of region diversity versus radicality

If I take the average radicality over each region, a (linear) regression is possible, yet I still feel it is not the most appropriate model to fit to the data:

Plot of average radicality per region with a linear fit
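The averaging step described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the actual analysis: the row layout `(region_id, region_diversity, patent_radicality)` and the helper names `region_means` and `ols_fit` are assumptions introduced here. It also shows why the region-level fit looks so much better: averaging removes the within-region spread of radicality, so the R^2 of the averaged data is not comparable to the patent-level R^2.

```python
from collections import defaultdict


def region_means(rows):
    """Collapse patent-level rows to one (diversity, mean radicality) point per region.

    rows: iterable of (region_id, region_diversity, patent_radicality).
    Assumes region_diversity is a region-level value, identical for all
    patents in the same region.
    """
    acc = defaultdict(lambda: [0.0, 0.0, 0])  # diversity, radicality sum, count
    for region, diversity, radicality in rows:
        entry = acc[region]
        entry[0] = diversity
        entry[1] += radicality
        entry[2] += 1
    return [(d, total / n) for d, total, n in acc.values()]


def ols_fit(points):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2


# Hypothetical toy data: three regions, two patents each.
toy_rows = [
    ("A", 0.1, 0.0), ("A", 0.1, 2.0),
    ("B", 0.2, 0.0), ("B", 0.2, 4.0),
    ("C", 0.3, 1.0), ("C", 0.3, 5.0),
]

# Patent-level fit: every patent is one point.
_, _, r2_patent = ols_fit([(d, r) for _, d, r in toy_rows])

# Region-level fit: one averaged point per region.
slope, intercept, r2_region = ols_fit(region_means(toy_rows))

print(f"patent-level R^2: {r2_patent:.3f}")
print(f"region-level R^2: {r2_region:.3f}")
```

In this toy example the region means fall exactly on a line, while the same slope explains only a small share of the variance at the patent level.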

Based on these images, would there be a statistic/model/measure/method to capture the enabling capacity of region diversity on radicality?

I am sorry to say I lack experience and knowledge with statistics so apologies for my probable abuse of terminology.

Any help is highly appreciated!

[Edit] My variables Patent Radicality and Region Diversity are derived measures:

  • All patents have one or more technology classes assigned. Based on the co-occurrence of these classes, relative distances between the classes were calculated.
  • The patent radicality is the average distance between the classes assigned to it.
  • The region diversity is the (Rao-Stirling) diversity of technology classes occurring in it, accounting for variety, balance and disparity.
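The two derived measures above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual computation: the class labels and the `distance` table are hypothetical, a patent with a single class is assigned radicality 0 (the description does not say how that case is handled), and the Rao-Stirling sum is taken over unordered class pairs (summing over ordered pairs instead would double the value).

```python
from itertools import combinations


def patent_radicality(classes, distance):
    """Average pairwise distance between the distinct classes of one patent.

    Assumption: a patent with fewer than two distinct classes gets 0.0.
    """
    pairs = list(combinations(sorted(set(classes)), 2))
    if not pairs:
        return 0.0
    return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)


def rao_stirling(class_counts, distance):
    """Rao-Stirling diversity: sum over class pairs of p_i * p_j * d_ij.

    class_counts maps each technology class to its number of occurrences
    in the region; p_i is the share of class i. The sum runs over
    unordered pairs (i < j).
    """
    total = sum(class_counts.values())
    shares = {c: n / total for c, n in class_counts.items()}
    return sum(
        shares[a] * shares[b] * distance[frozenset((a, b))]
        for a, b in combinations(sorted(shares), 2)
    )


# Hypothetical symmetric distances between three technology classes.
distance = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.6,
}

radicality = patent_radicality(["A", "B", "C"], distance)
diversity = rao_stirling({"A": 2, "B": 1, "C": 1}, distance)

print(f"patent radicality: {radicality:.4f}")
print(f"region diversity:  {diversity:.4f}")
```

Because the diversity is built from shares rather than raw counts, scaling every class count by the same factor leaves it unchanged.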

My hypothesis is that increased region diversity should foster the radical innovation potential of that region.

I have about 5 million patents and two region levels (TL2 and TL3 as defined by the OECD) with about 300 and 2000 regions with sufficient data respectively. Both region levels yield similar results.

  • Can you say more about your situation & your data? What are your variables? Are these counts? – gung - Reinstate Monica Sep 08 '15 at 15:43
  • I added some additional information about my variables. I could write a full thesis about this (and in fact I am), but that will not fit on the page. If you are missing any specific information, please let me know and I'll add more details. – user3527743 Sep 08 '15 at 16:30
  • Looks like your "radicality" score has to be non-negative and has some very high values (>700,000) even though average radicality is only about 100. You thus might try using the log of the radicality (or the log of (1 + radicality), or something similar, if you have some radicality values of 0) for your analyses. – EdM Sep 08 '15 at 16:39
  • It also would help for those trying to provide an answer to know how many patents and how many regions are involved. Hard to tell from the posted plots. – EdM Sep 08 '15 at 16:49
  • @EdM Thanks, I tried the log, but unfortunately this does not yield much improvement. The scatterplot is not very insightful: there is too much data to make anything of it. – user3527743 Sep 08 '15 at 17:05
  • Also, the regions will probably be of different sizes (population count?), and a linear regression will not take that into account. Maybe a GLM with a log link? See my answer here: http://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 – kjetil b halvorsen Sep 08 '15 at 20:02
  • @kjetilbhalvorsen, I read your answer, which is very insightful. However, while it is true that my regions are of different sizes (in population, surface area, and number of patents), all my variates are intensive (or at least I think so). The radicality is an average of scores, while the regional knowledge diversity is calculated with a diversity measure that controls for size (Rao-Stirling). I then attempted to use a log link function, but so far binomial, Gaussian and gamma GLMs return AIC values between 3400 and Inf, which intuitively seem quite high. Could you suggest a log-link function? – user3527743 Sep 10 '15 at 08:59
