We present a methodology for generating optimized architectures
for data-bandwidth-constrained extensible processors.
We describe a scalable Integer Linear Programming
(ILP) formulation that extracts the most profitable set
of instruction-set extensions given the available data bandwidth
and transfer latency. Unlike previous approaches,
we differentiate between the number of inputs and outputs for
instruction-set extensions and the number of register file
ports. This differentiation makes our approach applicable
to architectures that include architecturally visible state
registers and dedicated data transfer channels. We support
a comprehensive design-space exploration to characterize
the area/performance trade-offs for various applications.
We evaluate our approach using actual ASIC implementations
to demonstrate that our automatically customized processors
meet timing within the target silicon area. For an
embedded processor with only two register read ports and
one register write port, we obtain up to 4.3x speed-up with
extensions incurring only a 35% area overhead.