
I installed Cytoscape 3.7.1 with Java 1.8.0_191 on Windows Server 2012. I have a 7.4 GB CSV file (about 1,500,000 records), and when I try to load it into Cytoscape it throws the error java.lang.OutOfMemoryError: Java heap space.

My system has 500 GB of RAM, so I edited the Cytoscape.vmoptions file to set -Xms102400M. But when I load my file, memory usage in Task Manager climbs to 50 GB and then Cytoscape crashes, throwing the same error again: java.lang.OutOfMemoryError: Java heap space.

My data is like this sample:

45677888,test1,3453453453,test2,3534523453,235412352,3453452345,test3,235423452,test4,1980/09/02,14:13,23523525,test5
45234288,test11,34234553453,test12,353434534553,23453452352,342422345,test13,23456543452,test14,1980/04/01,14:12,2323423425,test15
4243424888,test12,3235253453,test22,3533456343453,233534532352,2343452345,test33,23345342,test44,1980/03/4,14:11,23674575,test55

My file is in CSV format. col2 is the source node; col1 and col3 are attributes of the source node. col4 is the destination node; col5, col6 and col7 are attributes of the destination node. col8 to the end are attributes of the edge. I have about 200,000 nodes and 230,000 edges.

How can I read this file in Cytoscape?

terdon
user4219
  • How much RAM does your computer have? And how big is this network? 7.4G csv? Can you show us a few lines? Does it have a lot of extra data or is it just a list of edges? – terdon Mar 06 '19 at 09:00
  • Thank you for your response. I have added the answers to your questions to my post. – user4219 Mar 06 '19 at 10:40
  • Please don't post images of text! Just copy the text and paste it directly into your question. Use the formatting tools to format it as code. And could you explain what format this is? Where's the node? Where's the edge? How many edges and nodes do you have? And, more importantly, why do you want to load such an enormous network into a visualization tool like Cytoscape? You will probably not be able to do anything useful with such a large file anyway. Can't you limit the size somehow? – terdon Mar 06 '19 at 10:45
  • This data is a connected network, and it's important to me to see as much of it as possible on one page. – user4219 Mar 06 '19 at 11:10
  • You won't be able to see anything useful in a GUI with a network of this size. Even if you do manage to load it. – terdon Mar 06 '19 at 11:12
  • What if I first detect the communities in my network and represent each community as a single node, so that clicking on a community shows all of its nodes? What solution do you recommend for this? – user4219 Mar 06 '19 at 11:47
  • On a network of 200,000 nodes? Even if Cytoscape manages to somehow deal with something that huge, I really doubt you will be able to visualize it in any useful way. If you're looking for communities, why don't you first extract the communities into subgraphs and load those into Cytoscape? Alternatively, can you isolate only the connected component of the network? That should also reduce the size. 23e4 edges for 20e4 nodes is a very sparse network. You should be able to separate that into smaller, connected sub networks. There's no point in looking at the whole thing if it isn't connected. – terdon Mar 06 '19 at 11:51
  • Thanks for your time. My goal is to review a connected network with 200,000 nodes in Cytoscape. To reach this goal, given the size of the network, I decided to first detect communities and show each community as a node, with the edges between these nodes being the edges between communities. Then, by clicking on a node, I can see the relations between the nodes in that community. Can I do this in Cytoscape? – user4219 Mar 06 '19 at 12:37
  • I really don't think you will be able to do anything useful in Cytoscape with such a huge network. As you can see, it can't even read it. Imagine how hard it will be for Cytoscape to manipulate it! Why don't you just use the communities as nodes? Is that an option for you? – terdon Mar 06 '19 at 12:40
  • What tool do you recommend? I cannot just use the communities as nodes; the relations between nodes within the communities are important to me. – user4219 Mar 06 '19 at 12:48
  • I don't think any graphical tool will be able to handle this amount of data. I would instead try two things: i) make a different network for each community and use that to analyze the relationships of the nodes within the community ii) make another network where the communities are the nodes and use that to analyze the relationship between the communities. – terdon Mar 06 '19 at 12:52

3 Answers


I work a lot with large networks and I agree with what most other posters have said: it is hardly useful to explore a network like this visually, and Cytoscape - while it is one of the best GUIs in general - is pretty bad with large graphs. Here are some alternatives you could try if you want to go down that road:

  • Gephi: it also probably won't handle that number of nodes/edges well, but it may not crash. Development on it has stagnated, though.
  • Graphia: I haven't tried this, but it is supposed to be better at handling large graphs. The project is in its infancy, so don't expect too much.

If you care about the analysis of the graph itself, I'd go with NetworKit or SNAP which are much better suited to large graphs.

If you want to visualize the networks, something that tends to perform well here is:

  1. Graph embeddings with VERSE
  2. Dimensionality reduction of the embeddings with UMAP
  3. Plot with Datashader

This can handle absolutely massive graphs if you have the compute - which you seem to have. Here's one I am working on right now (albeit small compared to yours, with 30K nodes and 800K edges), produced with that exact pipeline. It took ~10 minutes from read to render on my desktop: [image: graph example]

Finally, as others have mentioned, graph partitioning might be the way to go. I really like InfoMap. It partitions the graph into a hierarchical structure that you can explore top down.
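Whichever partitioning tool is used, its result can be turned into the two kinds of network suggested in the comments: the edges inside each community, plus a coarse network whose nodes are the communities. A minimal sketch in awk, assuming a hypothetical two-column file communities.txt ("node community") exported from the partitioning step and a whitespace-separated edge list. Neither file is InfoMap's native output format, and all file names and sample values here are placeholders:

```shell
tmp=$(mktemp -d)

# Placeholder inputs: an edge list and a node -> community assignment.
printf '%s\n' 'test1 test2' 'test11 test12' 'test12 test22' 'test2 test11' > "$tmp/network.csv"
printf '%s\n' 'test1 c1' 'test2 c1' 'test11 c2' 'test12 c2' 'test22 c2' > "$tmp/communities.txt"

awk -v intra="$tmp/intra.txt" -v inter="$tmp/inter.txt" '
NR == FNR { comm[$1] = $2; next }             # first file: node -> community
!($1 in comm) || !($2 in comm) { next }       # skip edges with unlabeled nodes
{
    a = comm[$1] ""; b = comm[$2] ""          # force string comparison
    if (a == b) { print a, $1, $2 > intra; next }   # edge inside one community
    # Treat the graph as undirected: store each community pair in a fixed order.
    key = (a < b) ? (a " " b) : (b " " a)
    weight[key]++
}
END { for (k in weight) print k, weight[k] > inter }   # "commA commB edge_count"
' "$tmp/communities.txt" "$tmp/network.csv"

# Edge list of a single community, ready to load on its own:
awk '$1 == "c2" { print $2, $3 }' "$tmp/intra.txt" > "$tmp/c2.edges"
```

Here inter.txt is the small community-level network to look at first, and a file like c2.edges is what you would open when drilling into one community.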


The first thing you can try is to separate the network from the meta information. Assuming you have access to a *nix machine, you can run:

awk -F"," '{print $2,$4}' file.csv > network.csv

That will produce a file like this (based on your example data):

test1 test2
test11 test12
test12 test22

That should radically decrease the file size and will give you the best chance of loading it in Cytoscape. You can then similarly extract the attributes:

awk -F"," '{print $2,$1}' file.csv > source.column1.attributes
awk -F"," '{print $2,$3}' file.csv > source.column3.attributes
awk -F"," '{print $4,$5}' file.csv > target.column5.attributes
awk -F"," '{print $4,$6}' file.csv > target.column6.attributes
awk -F"," '{print $4,$7}' file.csv > target.column7.attributes
awk -F',' '{print $2,$4,$8}' file.csv > edge.column8.attributes
awk -F',' '{print $2,$4,$9}' file.csv > edge.column9.attributes
awk -F',' '{print $2,$4,$10}' file.csv > edge.column10.attributes
awk -F',' '{print $2,$4,$11}' file.csv > edge.column11.attributes
awk -F',' '{print $2,$4,$12}' file.csv > edge.column12.attributes
awk -F',' '{print $2,$4,$13}' file.csv > edge.column13.attributes
awk -F',' '{print $2,$4,$14}' file.csv > edge.column14.attributes

Now, try loading network.csv into Cytoscape and then load each of the node and edge attributes separately.
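Two things to keep in mind with the commands above: awk separates output fields with a space by default, so network.csv is space-delimited despite its name, and a source/target pair that occurs in many records is written once per record. The question mentions ~1,500,000 records but only ~230,000 edges, so repeated pairs seem likely. A minimal sketch, assuming repeated pairs can be collapsed for a first look (this ignores the per-record edge attributes); sample.csv and network.dedup.csv are placeholder names, and the sample rows are the ones from the question with the first row repeated:

```shell
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
45677888,test1,3453453453,test2,3534523453,235412352,3453452345,test3,235423452,test4,1980/09/02,14:13,23523525,test5
45234288,test11,34234553453,test12,353434534553,23453452352,342422345,test13,23456543452,test14,1980/04/01,14:12,2323423425,test15
4243424888,test12,3235253453,test22,3533456343453,233534532352,2343452345,test33,23345342,test44,1980/03/4,14:11,23674575,test55
45677888,test1,3453453453,test2,3534523453,235412352,3453452345,test3,235423452,test4,1980/09/02,14:13,23523525,test5
EOF

# -v OFS=',' keeps the output comma-separated.
# seen[...]++ evaluates to 0 (false) only the first time a source,target pair appears,
# so each pair is printed once.
awk -F',' -v OFS=',' '!seen[$2 FS $4]++ { print $2, $4 }' \
    "$tmp/sample.csv" > "$tmp/network.dedup.csv"

# Compare the number of records with the number of distinct edges.
wc -l < "$tmp/sample.csv"
wc -l < "$tmp/network.dedup.csv"
```

If the two counts differ a lot on the real file, the deduplicated edge list is the one most likely to fit in memory.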


However, a network with 200,000 nodes and 230,000 edges is just too large to be able to manipulate in any useful way in a GUI. It will just look like a horrible hairball. I really urge you to do some preprocessing and pare the network down to something more manageable. I used to work with a human PPI network of ~13,000 nodes and ~15,000 edges. While I could load that into Cytoscape, that was already far too big to be able to do anything useful with. I could use the data analysis aspects of Cytoscape, but not really the visualization.
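One way to pare the network down is to split it into connected components first, as suggested in the comments, and load only the component of interest. A rough sketch using a union-find in awk, assuming a whitespace-separated edge list like the network.csv produced earlier; the file names and the sample edges are placeholders:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'test1 test2' 'test11 test12' 'test12 test22' > "$tmp/network.csv"

# Label every node with the representative ("root") of its connected component.
awk '
function find(x,    r, up) {
    r = x
    while (parent[r] != r) r = parent[r]      # walk up to the root
    while (parent[x] != r) {                  # path compression
        up = parent[x]; parent[x] = r; x = up
    }
    return r
}
{
    s = $1 ""; t = $2 ""                      # force node names to be strings
    if (!(s in parent)) parent[s] = s
    if (!(t in parent)) parent[t] = t
    a = find(s); b = find(t)
    if (a != b) parent[a] = b                 # merge the two components
}
END { for (v in parent) print find(v), v }    # one "root node" pair per line
' "$tmp/network.csv" > "$tmp/components.txt"

# Component sizes, largest first, as "size root".
awk '{ n[$1]++ } END { for (r in n) print n[r], r }' "$tmp/components.txt" \
    | sort -rn > "$tmp/sizes.txt"

# Keep only the edges of the largest component. Both endpoints of an edge are
# always in the same component, so checking the source node is enough.
big=$(head -n 1 "$tmp/sizes.txt" | cut -d' ' -f2)
awk -v big="$big" 'NR == FNR { if ($1 == big) keep[$2]; next } $1 in keep' \
    "$tmp/components.txt" "$tmp/network.csv" > "$tmp/largest.csv"
```

sizes.txt shows at a glance whether the graph is one giant component or many small ones, which determines whether this kind of splitting helps at all.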

terdon
  • I've done some testing and ~50,000 nodes with ~50,000 edges is the upper limit of what my 2015 MacBook Pro can handle. Even then it is painfully slow though. Anything you can do to reduce the graph will help. – conchoecia Mar 06 '19 at 17:00
  • @conchoecia well, the OP is using a machine with 500GB, presumably more than your laptop :). So I am guessing their limit will be considerably higher. But even so, even if they do manage to load this behemoth of a network onto Cytoscape, you can't really do anything useful with a GUI on something so large. – terdon Mar 06 '19 at 17:03
  • Thanks all, that is right. Even if I can load such an enormous network into Cytoscape, its GUI is not useful. Unfortunately, Cytoscape displays the data as soon as it is loaded. It would be better if it loaded the data without displaying it, so that we could filter the data first and then display the network. – user4219 Mar 07 '19 at 07:56

That's certainly a lot of nodes and edges, however, there is a preference called viewThreshold that should prevent the creation of a view if the number of nodes+edges is larger than the set value. I've just tested it on a large network I have (9000 nodes and 4,000,000 edges) and it doesn't appear to be working properly. I'll mark this as a bug. It does, however, render, but it's not really a useful network on Cytoscape 3.7. It will actually be usable on Cytoscape 3.8 based on early testing, however. If you want to make your network available somewhere, I'm happy to try to see why it's not loading.

-- scooter