
Commute time in a connected graph $G=(V,E)$ is defined as the expected number of steps in a random walk that starts at node $i$, visits node $j$, and then returns to $i$; that is, it is the sum of the two hitting times $H(i,j)$ and $H(j,i)$.
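As a concrete illustration of the definition (a sketch of my own, not part of the original question; the example graph, node labels, and trial counts are arbitrary choices), commute time can be estimated by direct simulation of the two hitting times:

```python
import random

# Hypothetical example: the 4-cycle 0-1-2-3-0 as an adjacency list.
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

def hitting_time(graph, i, j, trials=20000, seed=0):
    """Monte Carlo estimate of H(i, j): the expected number of steps
    for a walk started at i to first reach j."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        node, steps = i, 0
        while node != j:
            node = rng.choice(graph[node])  # uniform step to a neighbor
            steps += 1
        total += steps
    return total / trials

def commute_time(graph, i, j, trials=20000):
    """Commute time C(i, j) = H(i, j) + H(j, i)."""
    return hitting_time(graph, i, j, trials) + hitting_time(graph, j, i, trials, seed=1)
```

For the 4-cycle, the known formula $H(i,j) = d(n-d)$ for nodes at distance $d$ gives $C(0,1) = 3 + 3 = 6$, so the estimate should land close to 6.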

Is there anything similar to the commute time (not exactly the same) but defined in terms of nodes rather than steps? In other words, what is the expected number of distinct nodes that a random walk starting at $i$ and returning to $i$ will visit?
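To pin the quantity down, here is a simulation sketch of my own (the graph and trial count are illustrative, not from the question): it estimates the expected number of distinct nodes visited on an excursion from $i$ back to $i$, counting the starting node itself.

```python
import random

graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}  # a 4-cycle, for illustration

def expected_distinct_nodes(graph, start, trials=20000, seed=0):
    """Monte Carlo estimate of the expected number of distinct nodes a
    walk from `start` visits before first returning to `start`
    (the starting node itself is included in the count)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        visited = {start}
        node = rng.choice(graph[start])   # take the first step away
        while node != start:
            visited.add(node)
            node = rng.choice(graph[node])
        total += len(visited)
    return total / trials
```

For this 4-cycle a short case analysis (first step to a neighbor, then gambler's ruin on the remaining path) gives the excursion 2, 3, or 4 distinct nodes with probabilities $1/2$, $1/6$, $1/3$, so the exact value is $17/6 \approx 2.83$.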

Update (Sept 30, 2012): There is a body of related work on the number of distinct sites visited by a random walker on a lattice (i.e., $\mathbb{Z}^n$). For instance, see: http://jmp.aip.org/resource/1/jmapaq/v4/i9/p1191_s1?isAuthorized=no

Has anybody read anything on this?

  • What is the problem with the following argument? A random walk on a graph can be described by a Markov chain where the states are the nodes. Similarly, one can represent the same walk by a Markov chain where the states can be the edges. (Each edge also holds the current visited node info.) Once a Markov chain is obtained, you can use any definition/result of Markov chains. – Abuzer Yakaryilmaz Sep 25 '12 at 19:36
  • Thanks for the comment. I actually forgot to say distinct nodes. I'm going to modify the question right now. – Fabrizio Silvestri Sep 25 '12 at 20:14
  • Maybe I missed it (sorry if so), but what is the URL to the SE post? –  Sep 26 '12 at 11:46
  • I have removed the SE post... It is forbidden to post the same question in two different places. – Fabrizio Silvestri Sep 26 '12 at 17:09
  • it depends on the particular graph, right? can you sketch out anything known about similar problems? – vzn Sep 27 '12 at 23:02
  • Right... To be honest, I've not found anything on the topic. Indeed, I thought somebody in the online algorithms community had done something because one of the possible applications of that result would be to find the optimal (in expectation) size of an LRU cache. – Fabrizio Silvestri Sep 28 '12 at 11:58
  • it would be helpful if you sketch out how size optimization of an LRU cache reduces to this problem. and could it be you are more interested in results relating to optimization of LRU caches? also, maybe this is obvious, but have you thought of monte carlo simulations? – vzn Sep 28 '12 at 16:17
  • Just to give an idea. Suppose requests are nodes of a graph and edges link subsequent requests. E.g., r1 --> r2 means r2 is requested after r1. Knowing the maximum of the expected number of distinct vertexes for a given graph would result in zero faults in expectation. Am I wrong? – Fabrizio Silvestri Sep 28 '12 at 16:35
  • interesting idea. somewhat reasonable. however I'm sure there is literature on LRU cache optimization that might be more directly relevant. ... looking over random graph walk theory, it seems like mixing time might be relevant or related.... what about your condition that you return to a starting vertex? what does that have to do with the LRU optimization? – vzn Sep 28 '12 at 16:56
  • just looked up a paper on optimizing LRU caches & it considers the case where the objects/"documents" are different size & how to optimize in that case.... you are assuming each request object is a constant/uniform size (eg "pages").... is that as intended? – vzn Sep 28 '12 at 17:03
  • yes... my main motivation, as you can guess from my research background, is caching of query results. In this case you have a graph (a.k.a. query-flow graph) describing how queries are requested. Indeed, this can be generalized to any request sequence. – Fabrizio Silvestri Sep 28 '12 at 17:16
  • not related to the question in the title, but to analyzing LRU, taking locality into consideration: http://www2.informatik.hu-berlin.de/~albers/papers/stoc021.pdf – Sasho Nikolov Sep 30 '12 at 19:48
  • I am interested to know about this question too. In fact my question is the following: Assume that we have an undirected connected graph and a simple lazy random walk on it. Now since the random walk is lazy and the graph is connected, we converge to a stationary distribution. And we can ask about the mixing time, i.e., the time to get close enough to stationarity. The mixing time is well studied and in fact we know it is bounded by $\Phi^{-2}$ where $\Phi$ is the graph's conductance. Now the question is the following: What if we care about the number of distinct vertices traversed by the random walk? – shahrzad haddadan Mar 27 '18 at 14:28
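The LRU connection sketched in the comments above can be made concrete with a short simulation (a sketch of mine; the request trace and capacities are made-up examples): an LRU cache misses exactly on first references and on references separated from their previous occurrence by too many distinct requests, so a cache at least as large as that maximum gap leaves only the compulsory misses.

```python
from collections import OrderedDict

def lru_faults(trace, capacity):
    """Count misses for an LRU cache of the given capacity over a
    sequence of requests."""
    cache = OrderedDict()   # keys ordered from least to most recently used
    faults = 0
    for req in trace:
        if req in cache:
            cache.move_to_end(req)         # hit: becomes most recently used
        else:
            faults += 1                    # miss
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[req] = None
    return faults
```

For example, on the trace a, b, a, c, b a capacity-2 cache faults 4 times, while capacity 3 leaves only the 3 compulsory first-reference misses.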

3 Answers


A comment: I recently attended a talk by Bruce Reed titled Catching a Drunk Miscreant, which was joint work with Natasha Komarov and Peter Winkler. If you can get hold of the results from this work, maybe they can help you in some direction.

In general, they prove an upper bound on the number of steps a cop needs in a general graph to catch a robber, when the robber is known to move at random along the edges.

Pål GD

from the Q&A with you in the comments you seem to be interested in studying something defined as the stack distance in this set of slides, On the mathematical modelling of caches:

define the stack distance of a reference to be the number of unique block addresses between the current reference and the previous reference to the same block number.
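that definition translates directly into code; a naive quadratic sketch (function and variable names are mine, not from the slides):

```python
def stack_distances(trace):
    """for each reference in the trace, the number of unique addresses
    referenced strictly between it and the previous reference to the
    same address (None for a first reference)."""
    last_seen = {}   # address -> index of its most recent reference
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # the slice cannot contain addr itself, since last_seen[addr]
            # is its most recent earlier occurrence
            distances.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances
```

the link to the cache-sizing question in the comments: an LRU cache of capacity k hits a reference exactly when its stack distance is below k.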

it includes some empirical analysis via benchmarks. it says that in general there is "no known measurement of locality" of cache requests and then proposes stack distance as such a measure. it does not relate stack distance to random graph theory, although you sketch out such a connection in your comments. (it seems like stack distance could be related to markov chain mixing?)

it appears that you are interested in modelling cache performance or optimization algorithms by considering cache requests as nodes of a graph and the edges as transitions between adjacent requests. I have not seen papers that study the structure of this graph. it does appear provably not to be a purely random graph in real applications, due to the success of caches in practice and what is referred to as spatial and temporal locality in the above slides, i.e., some kind of "clustering" as Joe sketches out in his answer.

(maybe it has a small-world structure, which is quite ubiquitous in real-world data?)

vzn
  • Nice catch. Indeed, it has small world structure. In fact, in the application I have in mind degree distribution follows a power-law. Now, this can help... Still, we've not found a good way to go :) – Fabrizio Silvestri Sep 28 '12 at 18:04
  • thx. what caching parameter are you trying to optimize? seems its likely to correlate directly with the power law exponent somehow....? suspect that simple monte carlo approaches could show that stack distance is related to the power law exponent etc – vzn Sep 28 '12 at 18:16
  • well... In the beginning, I was thinking of correlating k with $\alpha$ in the power law. Obviously, different values of $\alpha$, i.e., $=1, <1, >1$, will have to be treated separately.

    I was just trying to see if there's anything beyond power-law graphs. Something more general, so to say.

    Anyway, I wanna check into the stack distance concept. Didn't know about that.

    – Fabrizio Silvestri Sep 28 '12 at 18:52
  • seems like stack distance hasn't been studied directly in graph theory but it's a vast field. note the watts/strogatz model is good for monte-carlo approaches generating small-world graphs. also Random Walks on Graphs: A Survey by lovász is a good survey of the theory of random walks on graphs. – vzn Sep 28 '12 at 21:41

This isn't really a proper answer to your question, but is a bit too long for a comment.

The quantity you are after will vary from graph to graph and will depend on the initial site of the walker. The expected number of distinct intermediate nodes will depend strongly on clustering within the graph, and I would expect it to be correlated with the clustering coefficient.

A cluster is basically a subset of vertices which share a large number of edges, so that each vertex is connected to a large fraction of the other vertices within the cluster. When a walker enters a cluster it is likely to stay in that region for a large number of hops, possibly revisiting each node many times. Indeed, using random walks in this way is one of the computational techniques used for identifying clusters in large graphs. Thus for a walker starting in a cluster, the expected number of distinct intermediate vertices will likely scale with the size of the cluster and the average probability of leaving the cluster.

While the above applies to both weighted and unweighted graphs, let's take the complete (maximally connected) unweighted graph on $N$ vertices as an example. In this case, at each hop after the first, the walker has probability $\frac{1}{N-1}$ of returning to the initial vertex, so the expected number of hops to return to the initial site is $N$ (this also follows from Kac's lemma: the walk's stationary distribution is uniform, so the expected return time is $1/\pi_i = N$). Even if we connect some vertices in this graph to a vertex in some other graph (i.e., outside of this clique), the probability of each hop leaving the cluster before returning to the initial site can be very low. Thus we expect clustering to reduce the number of distinct intermediate vertices by confining the walker within the cluster.
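The complete-graph figure is easy to check numerically (a sketch with arbitrary trial counts, added for illustration): by Kac's lemma the expected first-return time on $K_N$ is $1/\pi_i = N$, and the simulation below estimates that value.

```python
import random

def mean_return_time_complete_graph(n, trials=20000, seed=0):
    """Simulate a simple random walk on the complete graph K_n and
    estimate the expected number of hops until first return to the start."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        node, steps = 0, 0
        while True:
            r = rng.randrange(n - 1)         # pick one of the n-1 other vertices
            node = r if r < node else r + 1  # skip the current vertex
            steps += 1
            if node == 0:
                break
        total += steps
    return total / trials
```

For instance, mean_return_time_complete_graph(5) should come out near 5.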

The average degree of vertices within the graph will also play an important role, though this is linked to the clustering. The reason is that when the walker jumps onto a vertex of degree 1, it must hop back to the previous vertex on the next hop. Even when the degree is 2, there is only one path that can be followed through the graph, although it can be traversed in either direction at each hop. On the other hand, for graphs with degree higher than 2, the number of paths can explode, making it extremely unlikely to return to the initial site even if the shortest path between them is short.

Thus you would expect the number of distinct intermediate vertices to be high for graphs which both have an average degree substantially above 2 and also have no significant clustering (trees, for example, have no clustering at all, though their average degree is below 2).

Of course these comments no longer hold in the case of quantum random walks, but I guess you only care about the classical case.

Joe Fitzsimons