Multilevel caches are primarily a compromise between capacity and access cost (both latency and energy).
It might help to compare it to buying a tool. Going to the local hardware store (comparable to L1 cache) would be fast and use little energy, but the local hardware store is small and less likely to have the specific tool one seeks. Going to the big box hardware store (comparable to L2 cache) will take more time and energy (it is farther away and looking for the tool takes longer), but the tool is more likely to be in stock. If even the big box hardware store does not have the tool, one might go to the manufacturer's warehouse (comparable to main memory), which is almost certain to have the tool. If even the warehouse does not have the tool, then an even longer wait is expected until the manufacturer's factory (comparable to disk) produces more of the tool.
Living next to a big box hardware store (having a very large L1 cache) would save time if the diversity of hardware supplies sought were typically great (some PA-RISC processors targeting commercial workloads did this), but typically only a small variety of supplies is used, so a small local store is very likely to have the item in stock (a high probability of a cache hit), and finding a commonly used item is faster in a smaller store.
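To put rough numbers on the trade-off, one can compute the average memory access time (AMAT) from per-level hit latencies and miss rates. The cycle counts and miss rates below are illustrative assumptions, not measurements of any particular processor:

```python
# Average memory access time (AMAT) for a two-level cache hierarchy.
# All latencies (in cycles) and miss rates are illustrative assumptions.

l1_latency = 4        # small, close cache: fast hit
l2_latency = 12       # larger, farther cache: slower hit
mem_latency = 200     # main memory: much slower

l1_miss_rate = 0.10   # 10% of accesses miss the small L1
l2_miss_rate = 0.30   # 30% of L1 misses also miss the L2

# Each level's cost is paid only by the fraction of accesses that reach it.
amat = l1_latency + l1_miss_rate * (l2_latency + l2_miss_rate * mem_latency)
print(f"AMAT with an L2:    {amat:.1f} cycles")      # 4 + 0.10*(12 + 0.30*200) = 11.2

# Without an L2, every L1 miss goes straight to memory.
amat_no_l2 = l1_latency + l1_miss_rate * mem_latency
print(f"AMAT without an L2: {amat_no_l2:.1f} cycles")  # 4 + 0.10*200 = 24.0
```

Even with a modest L2 hit rate, the intermediate level roughly halves the average access time in this sketch (the same role the local store plays in the analogy); a similar weighted sum applies to energy per access.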
As jcrawfordor mentioned, there are some advantages to sharing a level of cache among multiple cores, since it can: avoid repeated storage of the same memory contents, allow imbalanced use of storage capacity (e.g., one core could use all of a shared L2's storage, while with per-core L2 caches each core would be constrained to its own L2 cache), and simplify and speed communication between cores (the same L2 would be accessed anyway on an L1 miss, and there would be no need to check whether other L2 caches had the data).
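As a crude illustration of the imbalanced-capacity point, suppose one core has a large working set and the other a small one; the sizes below are made up for the example:

```python
# Crude capacity check: does each core's working set fit in its cache?
# Sizes in KiB; all numbers are made-up assumptions for illustration.

working_sets = {"core A": 768, "core B": 128}

private_l2 = 512   # two private 512 KiB L2 caches ...
shared_l2 = 1024   # ... versus one shared 1024 KiB L2 (same total capacity)

for core, ws in working_sets.items():
    fits = "fits" if ws <= private_l2 else "thrashes"
    print(f"{core}: {ws} KiB working set in a private {private_l2} KiB L2 -> {fits}")

total = sum(working_sets.values())
fits = "fits" if total <= shared_l2 else "thrashes"
print(f"both cores: {total} KiB combined in a shared {shared_l2} KiB L2 -> {fits}")
```

With the same total capacity, the shared cache lets the heavier core borrow the space the lighter core is not using, while the private split leaves the heavy core thrashing and part of the light core's cache idle.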
(Similar sharing advantages can apply to an L2 backing separate L1 instruction and data caches, but such content sharing is usually avoided (i.e., a cache line usually holds only code or only data), and, excluding less common cases like self-modifying code and JIT compilation, there is rarely communication between an instruction cache and a data cache.)
Sharing does have overhead, however. One might compare it to shopping at a department store. The more shoppers using the store, the more likely there will be a line at any given checkout station (comparable to banks in an L2 cache). In addition, the shared entrance/exit introduces delays (comparable to arbitration delays for cache access): providing multiple doors can support higher throughput, but it increases the time required to choose a door. The choice overhead may be extremely small (but not non-existent) when no one else is entering or exiting, but when the store is busy the choice of door becomes more complex. If one assumes that the store will be busy, some of the decision delay can be avoided; but simply using the most convenient door would be faster when the store is not busy. Similarly, a cache might, for example, take the extra time to allocate a buffer to hold the memory request information even when such a buffer would not be needed if the cache is not busy. Without such an optimization, when the cache is busy, the two steps of determining whether the cache is busy and allocating a buffer entry occur in sequence, so the total time is the sum of the two; when the cache is not busy, the buffer allocation step is skipped.
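A toy latency model of that "assume it is busy" choice might look like the following; the cycle counts and the probability of the cache being busy are assumptions for illustration, not taken from any real design:

```python
# Toy latency model for handling an incoming cache request.
# Cycle counts and the busy probability are illustrative assumptions.

t_check = 1   # cycles to determine whether the cache is busy
t_alloc = 2   # cycles to allocate a request-buffer entry

def latency_check_first(busy: bool) -> int:
    # Without the optimization: check first, allocate only if busy,
    # so the busy case pays the two steps in sequence.
    return t_check + t_alloc if busy else t_check

def latency_always_allocate(busy: bool) -> int:
    # With the optimization: allocate a buffer unconditionally, overlapped
    # with the busy check, so the busy case avoids the serialized sum but
    # the idle case pays for an allocation it did not need.
    return max(t_check, t_alloc)

for p_busy in (0.1, 0.9):
    for name, policy in (("check first", latency_check_first),
                         ("always allocate", latency_always_allocate)):
        avg = p_busy * policy(True) + (1 - p_busy) * policy(False)
        print(f"P(busy)={p_busy}: {name:15s} -> {avg:.2f} cycles average")
```

The always-allocate policy wins when the cache is usually busy and loses when it is usually idle, which is exactly the door-choosing trade-off in the analogy.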
Sharing can also increase the frequency of conflict misses, given the limited associativity of a cache, and can cause poor cache replacement choices (e.g., one core using a streaming access pattern with little reuse of data would tend to take capacity from which another core with frequent reuse of data would benefit more). There are techniques to reduce such disadvantages, but they add complexity and have other costs.
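To make the replacement-pollution point concrete, here is a tiny sketch (not a model of any real cache: fully associative, LRU, line granularity, with made-up sizes and traces) in which a streaming core sharing a cache wipes out a reuse-heavy core's hits, while an even private split protects them:

```python
from collections import OrderedDict

# Minimal fully associative LRU cache model (capacity counted in lines).
# Capacities and access patterns below are made-up assumptions.
class LruCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)          # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)       # evict the least recently used line
        self.lines[tag] = True
        return False

# Core A loops over a small 32-line working set (lots of reuse);
# core B streams through memory with no reuse, three accesses per A access.
trace = []
stream_addr = 0
for _ in range(100):
    for addr in range(32):
        trace.append(("A", addr))
        for _ in range(3):
            trace.append(("B", 100000 + stream_addr))
            stream_addr += 1

def hit_rates(caches):
    """Run the trace; `caches` maps a core name to the cache that core uses."""
    hits = {"A": 0, "B": 0}
    counts = {"A": 0, "B": 0}
    for core, addr in trace:
        counts[core] += 1
        hits[core] += caches[core].access((core, addr))
    return {c: round(hits[c] / counts[c], 2) for c in counts}

shared = LruCache(64)                             # one 64-line cache for both cores
print("shared :", hit_rates({"A": shared, "B": shared}))

private = {"A": LruCache(32), "B": LruCache(32)}  # 32 lines per core, same total
print("private:", hit_rates(private))
```

In the shared case, core B's stream evicts core A's lines before they are reused, so core A's hit rate collapses even though its working set would fit in half the cache; the private split restores it. Mitigations such as partitioning the cache between cores or detecting streaming in the replacement policy recover much of this, but, as noted above, they add complexity and have other costs.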