I think regressional Goodharting, and therefore goal misgeneralisation, is the same problem as the winner's curse in auction theory, and the winner's curse is a solved problem [epistemic status - very likely this is wrong but I think it's worth formalising]
[current largest uncertainty is whether goal misgeneralisation is regressional Goodharting, and insofar as it's not, whether the core problem of goal misgeneralisation is regressional Goodharting] [I also don't know if this is novel]
regressional Goodharting is the problem that when defining a metric for an optimisation process to aim for, unless that metric is exactly what you actually want, you're selecting a proxy.
That proxy is likely to be wrong because there'll be some noise in the metric you've chosen, and you'll select for the noise as well as for the metric - i.e. metrics that look really good are likely to be those that both approximate the goal well and have a high noise value
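a minimal sketch of this selection effect (my illustrative numbers, not anything from the formal Goodhart literature): every candidate has a true goal value and a proxy score that adds noise, and picking the candidate with the best proxy systematically picks for noise too

```python
import random

random.seed(0)

# Regressional Goodharting sketch: each candidate has a true goal value
# and a measured proxy = goal + noise. Optimising the proxy hard (taking
# the max over many candidates) over-selects for the noise term.
n = 10_000
candidates = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]  # (goal, noise)
goal, noise = max(candidates, key=lambda c: c[0] + c[1])  # optimise the proxy
# The winner tends to score high on BOTH goal and noise, so its proxy
# overstates its true goal value.
print(f"proxy = {goal + noise:.2f}, true goal = {goal:.2f}, noise = {noise:.2f}")
```

the point is that the argmax's noise component is almost certainly well above zero, so the proxy value of the selected candidate overstates its goal value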
goal misgeneralisation is an example of this problem in action. when an RL system is learning what state-action pairs correspond to high reward* in the particular dataset it is trained on, it'll be picking up on noise as well as signal
the metric is what the model learns is being rewarded. the goal is what is actually being rewarded. the noise is things that happen to also be present in high quantities when reward is high and happen to be low when reward is low.
if a mesa-optimiser is created to target the learnt metric, it will maximise the metric, which means maximising the noise as well as the actual reward
in auction theory there is a problem that i claim is the same as regressional Goodharting, which i claim is the same as goal misgeneralisation
when bidders are bidding for an item that has a common value to the bidders (e.g. an oil well, which is valuable for the revenue it generates, which is equally valuable to each firm), firms predictably overpay
this is because each of them gets a private signal of the value of the oil well, and the winning firm will be the firm whose signal happened to contain the highest value of noise
I think this is the same problem as regressional Goodharting. Firms take the optimal action given the metric of the value of the oil well, which is comprised of both signal (how much oil the oil well has) and noise (like, how good your surveyors were)
the harder this metric is optimised by the market - i.e. the more firms bid - the more is overpaid, because it becomes more likely that there's a firm whose noise means that they think the oil well has loads of oil
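a quick simulation of this (illustrative numbers: true common value fixed at 0, unit-variance signal noise, and bidders who naively bid their own signal) shows the overpayment growing with the number of bidders

```python
import random

random.seed(1)

# Winner's curse sketch: an item has common value 0; each of k bidders
# receives a private signal = value + noise and naively bids that signal.
# The winning bid is the maximum signal, so expected overpayment grows
# with the number of bidders - optimising the metric harder selects
# harder for noise.
def expected_overpayment(k, trials=5_000):
    total = 0.0
    for _ in range(trials):
        signals = [random.gauss(0, 1) for _ in range(k)]  # true value is 0
        total += max(signals)  # naive winner pays their own signal
    return total / trials

few, many = expected_overpayment(2), expected_overpayment(20)
print(f"avg overpayment with 2 bidders: {few:.2f}, with 20 bidders: {many:.2f}")
```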
in auction theory the solution to the winner's curse is to bid based on your value conditional on winning the auction
in the Goodhart's Law paper, states are allowed if the metric used to measure whether they're good states or not is above some threshold.
bidding conditional on winning in this situation corresponds to adjusting the threshold that states have to pass to be allowed, conditional on the value the metric assigns to them (i think - haven't got this cleanly worked out)
i think it's less clean to apply this to goal misgeneralisation
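to see why conditioning on winning is the fix, here's a sketch (my toy setup, not from the auction-theory literature directly): value v ~ N(0,1), each bidder sees signal = v + noise, and a bidder's naive estimate of v from their own signal alone is unbiased overall but badly biased conditional on having the highest signal - which is exactly the bias the conditional bid corrects for

```python
import random

random.seed(2)

# Conditional-on-winning sketch: common value v ~ N(0,1), each of k
# bidders sees signal = v + noise with noise ~ N(0,1). The naive
# posterior-mean estimate from one signal is E[v | s] = s / 2 (standard
# normal-normal updating). It is unbiased on average, but conditional on
# that signal being the *highest*, it overestimates v.
k, trials = 10, 20_000
naive_err, cond_count, cond_err = 0.0, 0, 0.0
for _ in range(trials):
    v = random.gauss(0, 1)
    signals = [v + random.gauss(0, 1) for _ in range(k)]
    naive = signals[0] / 2  # bidder 0's naive estimate of v
    naive_err += naive - v
    if signals[0] == max(signals):  # bidder 0 wins
        cond_count += 1
        cond_err += naive - v
print(f"avg bias of naive estimate overall: {naive_err / trials:+.3f}")
print(f"avg bias conditional on winning: {cond_err / cond_count:+.3f}")
```

so a bidder who wants an unbiased bid has to shade it down by the conditional-on-winning bias, which is the analogue of raising the threshold a state has to pass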
but i'm going to give it a go
the way that an RL system should estimate which state-action pairs correspond to reward is to estimate distributions over elements in the state set (just consider states, not actions, for simplicity)
and when a state corresponds to high reward, update less on elements of the state that have high variance; the higher the estimate of the reward, the smaller the update should be on high-variance elements
an example will make this clearer
elements in states are apples, oranges and pears.
apples ~N(0,1), oranges ~N(0,2) and pears ~N(0,3). The actual goal is to max apples. In the dataset the model is trained on, apples and pears happen to be correlated.
State S_1 has reward 1 and state S_2 has reward 2. Because in the dataset apples and pears are correlated, when the system sees reward = 1 there's 1 apple and 2 pears.
When it sees reward = 2 there's 2 apples and 4 pears. So it seems like having more pears is more important than having more apples.
However, when the system updates on the higher variance of pears, apples and pears look equally important. So long as apples and pears aren't perfectly correlated, the system should learn that apples are what are actually being rewarded.
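a sketch of that last claim (illustrative numbers, not the ones from the example above): reward actually depends only on apples, pears are correlated with apples but imperfectly and with more variance, and fitting reward against both elements recovers all the credit on apples

```python
import random

random.seed(3)

# Apples/pears sketch: reward depends only on apples; pears are
# correlated with apples but imperfectly, and carry extra variance.
# A least-squares fit of reward against both elements then puts
# weight ~1 on apples and ~0 on pears - the imperfect correlation is
# enough to separate the true goal from the correlate.
n = 1_000
apples = [random.gauss(0, 1) for _ in range(n)]
pears = [2 * a + random.gauss(0, 3) for a in apples]  # correlated, higher variance
reward = apples[:]  # the true goal: apples

# least squares for reward ~ w_a * apples + w_p * pears (normal equations)
saa = sum(a * a for a in apples)
spp = sum(p * p for p in pears)
sap = sum(a * p for a, p in zip(apples, pears))
sar = sum(a * r for a, r in zip(apples, reward))
spr = sum(p * r for p, r in zip(pears, reward))
det = saa * spp - sap * sap
w_apples = (spp * sar - sap * spr) / det
w_pears = (saa * spr - sap * sar) / det
print(f"weight on apples: {w_apples:.3f}, weight on pears: {w_pears:.3f}")
```

if apples and pears were perfectly correlated, det would go to zero and the credit assignment would be underdetermined, which is the failure case flagged above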