Computer Science Department
University of Virginia,
Thornton Hall
Charlottesville, VA 22901
Phone: (804) 982-2225
FAX (804) 982-2214
The term ``marker'' as used in the active vision and reactive planning literature is poorly defined. The underspecification of the marker concept limits the usefulness of this important tool for building perception and action systems for autonomous robots. This paper refines the concept of markers, both at the level of their role in perception-action system, and the the level of the representations used in their implementation. Our definition encompasses markers in dynamic 3D environments, which are retained even when not in the current perceptual input, either due to occlusion, a limited field of view, or sensing errors. We develop methods for maintaining the representation in the face of these difficulties. The marker concept is further elaborated by a categorization of marker types we have found useful in our research. Relationships among markers are also discussed: relationships which potentially form a link between classical and reactive planning. Finally, we describe a marker-based implementation of an autonomous agent performing search and avoidance tasks in a three-dimensional virtual environment.
This paper elaborates the concept of markers, and extends the marker-based approach to representation of local space in dynamic 3D environments. Occlusion and a limited field of view in 3-space prevent straightforward application of a 2D system. Important problems addressed in this paper include allocation of markers based on images of a 3D environment, determination of the coordinate system in which the markers are located, maintenance of the markers in a dynamic environment, maintenance of markers on objects that are occluded or otherwise outside of the field of view, and mediation among multiple goals to make decisions about what action to take based on the state of markers in three-space.
After discussing related work in section 2, section 3 defines and elaborates the marker concept. Section 4 is a categorization of markers, with special attention to types that arise in three dimensional contexts. Although all markers have the function of retaining what and where information for an object, markers can be further classified into categories based on their more specific functions. Section 5 deals with the problem of maintaining the markers in the face of complications such as movement of the marked object, movement of the agent, markers passing outside the field of view, occlusion, etc. Section 6 describes an implemented autonomous agent that operates in a three-dimensional virtual environment based purely on visual input, and uses many of the concepts described in the previous sections of the paper. This agent is modelled after a small creature such as a rabbit that searches for food and must also avoid predators. Section 7 finishes with some conclusions drawn from our work.
Maja Mataric developed the use of maps in a reactive planning context [10]. Sonar sensors and a low-resolution digital compass were used to identify landmarks and construct a topological map of the environment. The map enables the robot to navigate to locations in the environment as directed by a human. These maps are useful in navigating large-scale space, but are fundamentally different from the local-space representations of our research. The local-space markers are metric, in that they identify the locations of objects relative to the agent in a low-resolution coordinate system, whereas the Mataric maps are primarily topological, e.g., they record relationships like ``to the left of.'' Topological maps are useful for navigating the large-scale space, while local representations are used for coping with the immediate surroundings. In this way, the two representations are complimentary.
Andrew McCallum conducted experiments with deictic representations in a virtual environment [11]. He constructed a ``go-cart'' simulator, for virtual driving to investigate the role of deictic visual behaviors. His work is primarily concerned with the use of the fovea as a marker, and the learning of the visual behaviors. Spatial memory is limited to simple left-right distinctions, whereas we are concerned with higher resolution maintenance of multiple markers.
3.1. Definition
A marker is a data structure that contains two primary pieces of
information: what an object is in terms of its role in the current
task, and where the object is in an egocentric coordinate system. By
what, we do not mean the marker contains a complete, objective,
geometrical description of the object. Rather, the marker identifies
an object relative to the current task being performed by the
agent. One of the most important aspects of markers is their task
dependence. Accomplishing a task in the world requires certain items
be used: markers provide the place holders for these items. For
example, the task of pouring a bowl of cereal requires two items: a
bowl and a box of cereal. There may be many bowls and boxes of cereal
in the cupboard, but we need only one of each. By placing a marker on
one bowl and one box, the of pouring task need only refer to the bowl
and the box. The marker need not contain descriptive information such
as the color or the bowl or the brand of the cereal, since they are
not relevant to the current task.
Similarly, by where, we do not mean the object is located in some objective external coordinate system. The purpose of marker-based representations is to enable to agent to interact with its immediate surroundings, so the appropriate coordinate system is egocentric. We will now flesh out our definition with analogies we have found to be useful in our understanding of markers and their function. These analogies will also provide motivation for our use of markers.
In multiprocessor systems, cache coherence is handled by having writes to main memory be accompanied by broadcast messages advising all processors to invalidate their cache. No analog for such a broadcast exists in the physical world, so an agent acting on the basis of internal state always runs the risk of acting on stale data, which is part of the original motivation for the reactive approach (i.e., no internal state reduces the risk). The Agre and Chapman environment updated the state of the markers automatically once they were placed; thus their environment had the equivalent of the broadcast message that not only invalidated the old data, but updated them, thereby solving the cache coherence problem.
The real world does not have such a broadcast mechanism, so agents must rely on other strategies to synchronize their internal state with the environment. A realistic treatment of the marker maintenance problem is a significant contribution of our work. Fortunately, stabilities and regularities in the physical world mitigate the problem somewhat, so by limiting the amount of internal state and monitoring the environment periodically to update that state, we sharply decrease the risk of acting on stale information.
The locations of some markers can be monitored visually, making marker maintenance for the visible markers amount to visual tracking with multiple targets. Maintenance of markers outside the field of view is rather more difficult. Use of an egocentric coordinate system requires that the marker locations be updated as the agent moves. One of the unique features of our system, as compared to other reactive agents, is use of spatial memory for objects not currently visible, either due to occlusion or limited field of view.
4.1. Instantiation source (top-down vs. bottom-up)
One distinction that can be made among markers is based on how they
are instantiated. Some markers may be created because the perceptual
system has detected something in the sensor input that requires a
marker. Other markers are placed due to the direct actions of an
active task-agency. This is the bottom-up/top-down distinction
familiar to the perception community. The significance of this
distinction among markers in our work is how these markers relate to
tasks.
Recall that our agent architecture is a collection of task-agencies, some of which may be active at any given time. When relevant percepts are detected, a bottom-up marker is instantiated. This will cause one or more task-agencies, which may not have previously been active, to become active, with the new bottom-up marker as one of the agency's inputs. Once an agency is active, it may require other inputs in order to carry out its task. The agency directs the perceptual system to carry out activities that will instantiate markers, this time top-down markers, that serve as the agency's other inputs.
Consider the sample domain of a small creature that must avoid predators. This creature has a run-away-and-hide agency to coordinate the predator avoidance task. Whether this agency is active or not, a ``predator-detector'' is always scanning for predators over the entire field of view. If one is found, a bottom-up marker is instantiated and placed on the predator's location. The instantiation of the predator marker activates the run-way-and-hide agency, which directs the creature to start looking for a place to hide. This might be a rock or a tree, but the object's role in the current task is as a ``hiding-place,'' so when a suitable object is found, a top-down marker is instantiated to serve as the ``hiding-place'' input to the ``run-away-and-hide'' agency.
Note that both the top-down and bottom-up markers are task-oriented, even though the bottom-up marker wasn't instantiated directly by a task-agency. The creature would not need a predator detector or the resulting marker if there were not a run-away-and-hide task-agency that needed a predator location on which to base its actions. Another thing to note is that the hiding-place marker is put on a rock or tree, objects that would not normally be marked if it were not for the task-agency. Put another way, exactly the same stimulus (say, a rock at some relative location) may or may not be marked, depending on the internal state of the agent, i.e., the activation of its task-agencies. To summarize the distinction: the placement of bottom-up markers is stimulus driven, whereas the placement of top-down markers is task driven.
4.2. Primary goal markers vs. dependent markers
The markers discussed thus far are placed on explicit primary goal
objects in the environment. For example, a rabbit's goals are to find
food and avoid predators, so berries and predators are marked. A
navigation task-agency is activated by instantiating an appropriate
destination marker. However, it may not be possible to proceed
directly to the primary destination, due the presence of an
obstacle. In general, complex domains necessitate establishing goals
subsidiary to the primary goals.
The navigation task-agency, rather than blindly moving towards the destination, launches perceptual machinery to find any obstacles to its given destination. If an obstacle is found, it is marked with an obstacle marker. Obstacle markers are always top-down markers, since an object is only an obstacle to the extent that it is in the path to a destination. We say that obstacle markers are dependent on the associated primary destination, so that if the destination is ever dropped or changed, the obstacle marker is dropped as well.
If an obstacle marker is instantiated, the navigation agency looks for a path around the obstacle. When an appropriate location is found, the navigation agency instantiates an intermediate-destination marker which is dependent on the obstacle marker. The navigation agency guides the agent towards the intermediate destination just as it would any other destination. Reaching the intermediate destination triggers a reevaluation of the ``obstacleness'' of the associated object with respect to the primary destination.
As the domain becomes increasingly complex, this pattern can be extended to hierarchies of markers. These hierarchies bear resemblance to the classic concept of a partially ordered plan [7], with dependent markers analogous to steps in the plan. Further research is needed to determine the full relationship of marker hierarchies to partially ordered plans. Marker hierarchies may potentially bridge the gap between reactive and classical planning, at least for navigation tasks.
The entire process--searching for an obstacle, creating intermediate destination markers, and moving to the intermediate destination--repeats until there is no longer an obstacle in the path to the original destination marker, and the destination location is reached. The combination of destination, obstacle, and intermediate destination markers is sufficient for most simple navigation tasks.
4.3. Coordinate system (image vs. egocentric 3D)
Another distinction among markers can be made based on the coordinate
system in which they are located. In previous work with markers this
was not an issue, since there was only a single coordinate system:
image coordinates. However, in a realistic visual environment, the
natural coordinate system for visual perception (image coordinates) is
not the natural coordinate system for acting in the environment
(egocentric 3D coordinates). In a 2D video game environment, the
perception and action coordinates can be confounded, whereas in a 3D
environment, they cannot. This gives rise to the two coordinate
systems, and hence, two types of markers.
Image coordinate markers are placed on objects in the current image, and are used to facilitate visual routines. They mark objects such as edges, lines, terminations, etc. They may either be bottom-up or top-down. Bottom-up image-coordinate markers make up a set of items similar to Marr's raw primal sketch [9]. Top-down image-coordinate markers correspond to the markers in Ullman's visual routines [12].
Egocentric 3D markers are the focus of our research; all markers discussed in the remainder of this paper will be in egocentric 3D coordinates. When the visual system has processed the input enough to determine what and where a task-critical object is in the environment, a marker is instantiated. The natural coordinate system for acting in the world is egocentric, and furthermore, is centered on the effector taking the action. This idea is well known in the active vision community [3]; e.g., if an effector is a manipulator grasping an object, the correct direction to move the effector is given by the signs of the coordinates of the object in the manipulator coordinate system. We are using markers for navigation in a robot that accepts velocity commands in the form of a forward speed and a turn speed. Clearly, the natural coordinate system for such a robot is polar, centered on the robot, with the zero angle directly ahead. When this is the case, the direction to turn to get to a destination at coordinate (r, q) is given by the sign of q, the amount to turn is given by the magnitude of q, and the distance to move is given by r.
4.4. Tentative markers
A single image may contain evidence for the existence of an object;
however, that ``evidence'' may simply be an artifact of the noise in
the imaging process. If we immediately mark an ``object'' without
retaining some measure of our confidence of the object's existence,
the control system will act upon the marker as if there is no doubt as
to the existence of the object, even if the evidence is scant. We
attach a confidence measure to markers indicating estimated
probability of the object's existence, based on the evidence thus
far. While there may be a continuum of ``probabilities,'' we adopt the
simplification of categorizing markers into two classes based on their
confidence measure: those we intend to act upon, and those we do
not. Tentative markers are placed on objects for which there is some
evidence, but which we do not intend to act upon until more evidence
for their existence is obtained.
Our intent with the tentative markers is to filter out noise in the perceptual input, so one useful measure of confidence is whether the percept is stable over a sequence of images. One implementation of a confidence measure can therefore be a count of the number of times the object has been found in the image stream at the expected location. When first seen, an object is given a tentative marker, but when the number of frames in which it has been seen crosses some threshold, the tentative marker is replaced with a regular marker. This implementation is especially effective coupled with an ego motion induced change in viewpoint. If the viewpoint changes between consecutive frames, the location of the potential object's projection in the image will change. If the ego motion is known, the new location of the projection in the next image can be predicted. It is highly unlikely that noise will move in the image in a way that is consistent with the ego motion, so the tentative marker on noise will be dropped.
4.5. Hypothesized object markers
In some cases, especially in a 3D environment with occlusion, it is
necessary to reason about objects that might exist, even though they
aren't currently visible, or possibly haven't even been seen yet. For
example, consider a rabbit in the presence of rocks and
predators. Since a predator might be behind any given rock, the rabbit
could consider the possibility that such a predator exists. In fact,
the rabbit might behave as if there were a predator behind every
rock. We can induce this behavior by hypothesizing the existence and
location of a predator, and placing a hypothesized object marker
labelled ``predator'' at that location. In some cases, the desired
action to take depends on whether the object is actual or
hypothesized, whereas in others it isn't necessary to make a
distinction. Our rabbit, once it hypothesizes the existence of a
predator behind a rock, may behave exactly as if there were an actual
predator. The marker update procedure will treat real and hypothesized
object markers identically.
5.1. Compensating for ego-motion
When the agent moves, updating the markers requires estimating the
coordinate transformation between the agent's previous and current
locations. It is then a simple matter of applying the transformation
to all the marker coordinates. We distinguish short term and long term
approaches to estimating the transformation. Short term methods
estimate the transformation continuously (or as near to continuous as
possible), but have a tendency to drift from the actual transformation
over time due to accumulated error. Long term methods are used
periodically to correct the drift.
There are three main short-term methods: dead-reckoning, acceleration measurement, and optical-flow based methods. Dead-reckoning assumes that all motion of the autonomous agent is due purely to the motor commands given by the agent itself. Since the agent knows the motor commands issued, the expected motion can be calculated directly. However, motor commands are often not executed perfectly, and forces external to the agent can influence the motion. If the motor commands are accurate and well characterized, this is the easiest option to implement. Another method is to measure the acceleration continuously, and then integrate twice to determine change in position. Measuring the acceleration enables the marker locations to be updated even if the motion of the agent is not purely the result of the agent's motor commands. Optical flow can also estimate the transformation under these conditions. An ideal system will use a combination of all three methods.
The primary long term method is to locate several points in the environment, and then solve a structure from motion problem. However, this option requires a correspondence problem be solved. (Another possibility might be to use a ``global positioning system,'' if it's an option in the domain.)
5.2. Correspondence for visible markers
Marked objects are not featureless points; they have perceptual
properties that allow them to be identified. Those same properties can
be used to re-identify them later. Clearly, to do this we must retain
some of the perceptual information associated with the marker from
frame to frame, i.e., primary visual cues, e.g., the color and size of
a berry. The obvious place to keep this information is in the
marker.
Our definition of ``marker'' required only two elements: a role in a task (what), and a location (where). This does not mean that these are the only two pieces of information allowed in the marker. In order to maintain them, markers must also retain some information about the perceptual aspects of the objects they mark. We might say that markers must have the ability to ``find themselves'' in the perceptual input, and will therefore retain any additional information necessary to do so.
Given markers with this additional information, the general marker maintenance procedure at each time frame has two steps. First, all markers ``find themselves'' in the input. Second, any objects in the new input that aren't already marked can potentially instantiate a new bottom-up marker.
5.3. Perception overrides memory
As we have discussed, memory for the location of objects runs the risk
of being incorrect, and acting on incorrect information is potentially
hazardous. Information derived from current perceptual input is more
reliable than that stored in memory, and therefore overrides the
information in memory. In a first cut at an implementation of this
policy, whenever a marker cannot find itself in the current input, the
marker is dropped. But this policy would drop any marker outside of
the current field of view, so we amend it to dropping any marker we
expect to find in the input but don't. For visual input, the two main
cases in which we don't expect to find the marked object are when the
object is outside the field of view, and when the object is
occluded. The field of view case can be detected by knowing the field
of view of the sensor and comparing it with the location stored in the
marker. The second case requires additional visual processing; one
must determine that there is some object along on the azimuth towards
the marker, but is nearer than the marked object. The visual routine
that performs this occlusion computation is only executed when a
marker indicates that an object should be in the field of view, but no
visual cue for the object is found. Note that this policy enables the
maintenance and deletion of hypothesized object markers without any
special machinery.
The agent's control process decides upon an action based on the images, and sends velocity commands to the simulation system, which updates the display based on the new location of the viewpoint. The agent and simulation are completely asynchronous, and run in real time at 10-15 frames per second.
Object locations are determined based on azimuth and elevation in the visual field. The system attempts to keep the four closest berries and any nearby predator marked at all times. The markers are maintained by using dead-reckoning based on the most recent velocity commands to estimate the expected new position of the object. If the expected position is in the image, then the correspondent is found in the image if possible.
The strategy for the agent is to head for the nearest berry, which may not be currently visible, due to occlusion or the limited field of view. We have to carefully follow the ``perception overrides memory'' doctrine with this agent. Due to noise in the image caused by severe color quantization some erroneous markers are created. Markers also occasionally drifted due to imperfections in the dead-reckoning procedure.
The problems with erroneous markers were overcome by having the agent drop any markers that were supposed to be in the field of view, but didn't have any perception corresponding to them. So for example, if a berry marker is maintained to the left of the agent, the agent turns to the left to look at it. If no berry is seen to the left, the marker is dropped. This behavior serves to allow the agent to use the markers that are accurate, without being led too far astray by incorrect markers.
The strategy described above must be augmented to notice the occlusion of marked objects and behave appropriately. The technique we used is similar to Horswill's obstacle avoidance routine [8], in that it relies only on being able to segment the ground from non-ground (i.e., obstacles). To determine if a marked object is occluded, points on the ground/non-ground boundary are converted to egocentric 3D coordinates and compared with coordinates of the marker. If the ground boundary points in the direction of the marked object is strictly nearer than the marked object, then the object is assumed to be occluded. Occluded markers are not dropped, but now we need a strategy for navigating around the obstacle. The original behavior, which was to go directly towards the nearest marker, obviously will not work for occluded markers. We have implemented the navigation algorithm described in section 4.2 using the target berry's location as the destination marker.
To determine the location of the obstacle marker, the ground/non-ground boundary in the region of the goal marker is analyzed for sharp jumps that indicate the edges of the obstacle. Simple geometric reasoning is used to determine the shortest distance around the obstacle, and the obstacle's edge is marked. An intermediate destination marker is then instantiated, and the agent is directed to move towards the intermediate destination, thereby circumventing the obstacle.
This paper represents the first work to demonstrate that marker-based representations can improve performance in 3D domains. The 3D virtual environment used has several real-world properties, each of which are interesting research problems:
All of these issues are important topics of research, and all are present in the simulation that forms the platform for this research. A fully implemented system is described that details algorithms for instantiating, maintaining, and deleting markers in pursuit of a given goal. The marker-based system measurably improves the performance of an autonomous agent in an environment with these complicating properties.
The paper develops and elaborates a vocabulary for marker-based systems. Several techniques using various ``types'' and hierarchies of markers are described. A classification of markers is constructed to facilitate this description.
2. Why should this contribution be considered important?
Classical and reactive planning have complementary strengths and weaknesses. Autonomous agents operating in real physical environments using visual perception must combine the strengths of both approaches to successfully perform useful tasks. Marker-based representations have the potential for bridging the gap between classical and reactive planning. The work described in this paper demonstrates the viability of the marker-based approach in three-dimensional environments.
3. What is the most closely related work by others and how does this work differ?
This work is related to the original work in markers conducted by Agre and Chapman in 2D environments. While Agre and Chapman only address points d and e above, this paper represents the first marker-based system that addresses points a, b, and c above.
4. How can other researchers make use of the results of this work?
This system architecture and the techniques it uses are suitable for incorporation into mobile robotics systems. Certainly, the low-level perception and manipulation components must be replaced by appropriate real-world analogs, but the core architecture is fully transferrable to physical robots.