64

I'd like to use Youtube as an example: they use IDs in the form of PEckzwggd78.

Why don't they use simple integers?

Or imgur.com - they also use IDs such as 9b6tMZS for images and galleries. Not sequential integers.

  • Why don't they use integers (particularly sequential ones)?

  • In what cases is it a wise decision to use such string IDs instead of integers?

Rakori
  • 787

10 Answers10

106

Youtube can't use sequentional IDs for two reasons:

  1. Its databases are almost certainly distributed, making sequential numbering complicated.

  2. It has a privacy option "Unlisted videos": those that don't show up in the search results, but are available if you know the ID.

Therefore, the video IDs should be reasonably random and unpredictable. Whether the ID is represented by digits only, or by a combination of letters and digits, is irrelevant: there is a trivial mapping from one representation to another.

IMil
  • 849
  • 13
    Numeric ids dont have to be sequential – Sopel Nov 28 '17 at 10:02
  • 28
    @Sopel I think IMil's point is that Youtube needs to generate IDs that are sparse. In other words, if it is estimated that you will only ever need to store 2^40 items, in some architectures there are legitimate reasons for choosing a space of 2^80 or 2^120 bits. Examples of reasons are: reducing collision without technically checking for collision; using the sparseness of keys as part of making secrets hard to find (the "unlisted video"), etc. – rwong Nov 28 '17 at 10:33
  • 13
    @Sopel the question was "Why don't they use integers (particularly sequential ones)?" I explain that: 1) sequential IDs are undesired; 2) integers and strings are basically the same thing – IMil Nov 28 '17 at 11:01
  • 3
    The "therefore" clause doesn't logically follow but the two numbered points are correct. As an example of why randomness is not a necessary consequent: sequential numbering with uniform gaps will work to provide unique ids in multiple independent databases such that the results can be combined in a datawarehouse - this is a form of sharding. That is, suppose you anticipate no more than 10000 regional databases (perhaps you have only 10 right now so 10000 is sufficient). Then each db can have an identity column counting by 10000 with unique last 4 digits, there will be no collision on merge. – davidbak Nov 29 '17 at 02:09
  • 2
    @davidbak the requirement for randomness follows from (2). Uniqueness may indeed be obtained by assigning non-overlapping ranges to different database instances, but this would leave the IDs predictable. – IMil Nov 29 '17 at 12:24
  • 1
    @IMil - there are other ways to achieve that as well, e.g., do not use an identity column but instead assign from a given pseudorandom number generator where each db uses a different sequence/seed. They're deterministic (not random) from YouTube's POV but unpredictable from attacker's POV. And if you use specific sharding values at the end (my example above) for the "private" dbs then they're easy to exclude from search results as well. – davidbak Nov 29 '17 at 18:34
  • 2
    @davidbak I didn't mean random numbers in a strong cryptographical meaning. They may be deterministic from internal point of view, but should be unpredictable to outsiders, that's the whole point. I might try to edit the answer to make this clearer. – IMil Nov 29 '17 at 18:56
81
  • On the form of the IDs: They're using Base64 (using the characters a-z, A-Z, 0-9, -, and _). This allows them to have 6 bits of information per character. YouTube uses 11-character video IDs, which means they can generate 26*11, or more than 7*1019 IDs. As Tom Scott put it, that's "enough for every single human on planet Earth to upload a video every minute for around 18,000 years." Base64 is also easy to work with, because 64 is a power of 2, which means every character represents an exact number of bits. We use hexadecimal (base 16) for the same reason.

  • On the non-sequential nature of the IDs: it means they don't need a synchronized counter between all the servers that assign IDs to videos. They can just generate a random number, check if it's already in use, and go from there. They could even assign each server a block of IDs to pick from and eliminate the duplication checking. I don't know if they're doing that, but they could.

  • Another reason for the non-sequential IDs is that it is what makes "unlisted" videos work. These are videos that won't show up in search results or as suggestions, but that are accessible if you've got the link. If you're using sequential counting, you can just go to a video, increase the ID by one, and the idea of unlisted videos is now broken.

  • Non-sequential IDs also help hide information from competitors, such as the total amount of videos, or the number of videos uploaded per timeframe.

I can highly recommend Tom Scott's video. His information is almost always both interesting and accurate.

  • 6
    Let's also point out that 11 characters of a base64 encoding store 66 bits of information, which means that they can easily map a 64bit integer into such a string. I.e. internally, they could use a 64bit int anyway (but need not do so). – Bernhard Hiller Nov 28 '17 at 11:29
  • 1
    For comparison, the conventional decimal representation could require as many as 20 characters, “wasting” up to 9 characters compared with Base64. – dan04 Nov 28 '17 at 22:30
  • The Tom Scott video explains this perfectly. – AGB Nov 29 '17 at 10:46
15
  • Integers do not scale that well, a "normal" 32-bit unsigned integer will max out just over 4 billion.

  • They may not want you to know how many items they have on line or keep track of the rate they are growing. If your order numbers just count up it will be easy for a stranger to deduct how many orders you got in a certain period.

  • Letters can hold more information than digits, you need fewer letters to express the same "number". Not in binary format but in the presented format on a screen or on paper. For instance, you could have the year or an entire date as part of your id, or the type of customer.

Martin Maat
  • 18,435
  • 4
  • why? ........... they're all public anyway. those that aren't public -- aren't accessible. that's it
  • – Rakori Nov 28 '17 at 07:14
  • 4
  • can you elaborate? express what information ?
  • – Rakori Nov 28 '17 at 07:14