If you're talking about ArcMap, you could try something like this in the field calculator's Python code block (untested):
lookup = dict()

def get_id(value):
    global lookup
    try:
        uid = lookup[value]
    except KeyError:
        uid = lookup[value] = len(lookup)
    return uid

get_id(!some_field!)
Edit: annotated to explain the code:
# create a dictionary with nothing in it
lookup = dict()

# define the function that will be called for every row in the table
def get_id(value):
    # tell the function to look for the dictionary defined outside of the function itself.
    # we can't define it inside of the function, otherwise it would get emptied on every row.
    global lookup
    # attempt to look up the ID associated with this value, store it in the uid variable.
    # it will only be present if we've already seen this value, otherwise we'll get a KeyError.
    try:
        uid = lookup[value]
    # in the case that we haven't yet seen the value, then set the ID to the current length
    # of the dictionary, which is the number of unique values we've seen so far.
    # also store it in the dictionary for later.
    except KeyError:
        uid = lookup[value] = len(lookup)
    # give the ID back to whatever called the function
    return uid
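To see which IDs come out, here's a rough sketch you can run outside ArcMap. The `rows` list is made up and just stands in for the field values the calculator would pass in, one per row. I've left out the `global` statement here because the function only mutates the dictionary and never rebinds the name, so it isn't strictly required.

```python
lookup = dict()

def get_id(value):
    # Return the ID already stored for this value, or assign the next free one.
    try:
        uid = lookup[value]
    except KeyError:
        uid = lookup[value] = len(lookup)
    return uid

# Hypothetical field values, in table order.
rows = ["oak", "pine", "oak", "birch", "pine"]
ids = [get_id(v) for v in rows]
print(ids)  # [0, 1, 0, 2, 1]
```

IDs start at 0 and follow the order in which each distinct value is first seen, so sorting the table differently will produce a different numbering.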
Edit 2: A lot of similar approaches are being floated, which are all great! But what's actually fastest? Let's see. I've added a bonus one here as well that hasn't been proposed yet.
import timeit
import random

random.seed(42)
field_values = [chr(random.randint(32, 127)) for _ in range(10**6)]

# EAFP: try the lookup, handle the miss
def test_eafp():
    lookup = dict()
    def get_id(value):
        try:
            uid = lookup[value]
        except KeyError:
            uid = lookup[value] = len(lookup)
        return uid
    result = [get_id(v) for v in field_values]

# LBYL: check for the key before looking it up
def test_lbyl():
    lookup = dict()
    def get_id(value):
        if value in lookup:
            uid = lookup[value]
        else:
            uid = lookup[value] = len(lookup)
        return uid
    result = [get_id(v) for v in field_values]

# dict.get() with the next ID as the default (writes to the dict on every call)
def test_get_method():
    lookup = dict()
    def get_id(value):
        uid = lookup[value] = lookup.get(value, len(lookup))
        return uid
    result = [get_id(v) for v in field_values]

# bonus: dict subclass that assigns the ID in __missing__
def test_missing_overload():
    class LookupDict(dict):
        def __missing__(self, key):
            value = self[key] = len(self)
            return value
    lookup = LookupDict()
    result = [lookup[v] for v in field_values]

if __name__ == '__main__':
    print 'Timings with {} random characters (best of 10)'.format(len(field_values))
    timer = timeit.Timer(test_eafp)
    print 'EAFP', min(timer.repeat(repeat=10, number=1)), 'seconds'
    timer = timeit.Timer(test_lbyl)
    print 'LBYL', min(timer.repeat(repeat=10, number=1)), 'seconds'
    timer = timeit.Timer(test_get_method)
    print 'dict.get()', min(timer.repeat(repeat=10, number=1)), 'seconds'
    timer = timeit.Timer(test_missing_overload)
    print 'overload __missing__', min(timer.repeat(repeat=10, number=1)), 'seconds'

Timings with 10000000 random characters (best of 10)
EAFP 0.762635946274 seconds
LBYL 0.871690034866 seconds
dict.get() 1.51433086395 seconds
overload __missing__ 0.721973896027 seconds
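The timings only mean something if all four approaches hand out the same IDs, so here's a small sanity check along those lines. Treat it as a sketch: the `ids_*` function names and the sample string are mine, not part of the benchmark.

```python
def ids_eafp(values):
    lookup = {}
    out = []
    for v in values:
        try:
            uid = lookup[v]
        except KeyError:
            uid = lookup[v] = len(lookup)
        out.append(uid)
    return out

def ids_lbyl(values):
    lookup = {}
    out = []
    for v in values:
        if v in lookup:
            uid = lookup[v]
        else:
            uid = lookup[v] = len(lookup)
        out.append(uid)
    return out

def ids_get(values):
    lookup = {}
    out = []
    for v in values:
        # len(lookup) is evaluated and the key is rewritten on every call,
        # even when the value has already been seen.
        uid = lookup[v] = lookup.get(v, len(lookup))
        out.append(uid)
    return out

class LookupDict(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        value = self[key] = len(self)
        return value

def ids_missing(values):
    lookup = LookupDict()
    return [lookup[v] for v in values]

sample = list("abracadabra")
expected = ids_eafp(sample)
all_match = all(f(sample) == expected for f in (ids_lbyl, ids_get, ids_missing))
print(all_match)
```

The comment in `ids_get` is probably also why `dict.get()` comes out slowest above: it does the extra work on hits as well as misses.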
try block to its minimum. There's also nothing wrong with the LBYL approach you propose, though, just preference. Good point re: the case-sensitive comparison, @Ayayayayaoh might want to think about changing that. I might leave the answer as-is though, since the function should work for any hashable field type, not just strings. – mikewatt Feb 18 '21 at 01:08
uid = lookup[value] = lookup.get(value, len(lookup)) – user2856 Feb 20 '21 at 02:04
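On the case-sensitivity point, one possible tweak (a sketch only, not part of the answer above) is to fold case for anything string-like and leave every other hashable type alone. The name `get_id_ci` is hypothetical, and the `hasattr` check is just a cheap way to cover both `str` and Python 2 `unicode` values.

```python
lookup = dict()

def get_id_ci(value):
    # Fold case for text so "Oak" and "oak" share an ID;
    # numbers, None and other hashable values are used as-is.
    key = value.lower() if hasattr(value, "lower") else value
    try:
        uid = lookup[key]
    except KeyError:
        uid = lookup[key] = len(lookup)
    return uid

# Hypothetical mixed field values.
rows = ["Oak", "oak", "OAK", 7, "Pine", 7]
ids = [get_id_ci(v) for v in rows]
print(ids)  # [0, 0, 0, 1, 2, 1]
```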