Maybe it helps if we examine what the blitter really does:

[Attachment: Blitter.png]
For each blit cycle, depending on the intended operation, the blitter in the worst case needs to read the source word, read the destination word, do the logic op and write the destination word. It can't do memory accesses any faster than the CPU, as it is bound to the same bus protocol and speed.
Now, why is it faster than the CPU doing the same?
- It does not need to read instructions from memory in between.
- Skew requires a shift on the source data. The 68000 is very slow at shifting, while the blitter's barrel shifter can shift by any width in a single cycle.
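To put a rough number on the shifting point: a minimal sketch, assuming the commonly published 68000 timing for register shifts and rotates (6 + 2n clocks for byte/word operands, 8 + 2n for long operands, where n is the shift count). The function name is made up for illustration, and the figures cover only the shift instruction itself, not the surrounding loads and stores.

```c
/* Assumed 68000 timing for LSL/LSR/ASL/ASR/ROL/ROR on a data register:
   6 + 2n clocks for byte/word, 8 + 2n clocks for long (n = shift count).
   m68k_shift_cycles is a hypothetical helper, not part of any library. */
static unsigned m68k_shift_cycles(unsigned n, int is_long)
{
    return (is_long ? 8u : 6u) + 2u * n;
}
```

Under these assumptions, skewing by 8 bits on a long (two adjacent source words combined) costs 24 clocks for the shift alone, on every word, where a barrel shifter needs one.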
If you want that blit cycle at maximum speed, you need to pack the memory accesses as closely together as possible and fit the internal operations (source skew shifting, halftone op, logic op and endmask masking) in between.
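To make those internal operations concrete, here is a rough C model of what has to happen between the reads and the write for one word. It is only a sketch: the struct, the function names and the halftone/logic-op encodings are my assumptions for illustration, not taken from your implementation, and it ignores all bus timing.

```c
#include <stdint.h>

/* Hypothetical per-blit state; field names are illustrative only. */
typedef struct {
    uint32_t src_buf; /* previous source word in the high half, current in the low half */
    unsigned skew;    /* right shift applied to the source, 0..15 */
    unsigned hop;     /* assumed: 0 = all ones, 1 = halftone, 2 = source, 3 = source AND halftone */
    unsigned op;      /* 4-bit truth table over (source, destination) */
} blit_state;

/* Logic op as a truth table: bit 0 selects S&D, bit 1 S&~D, bit 2 ~S&D, bit 3 ~S&~D. */
static uint16_t logic_op(unsigned op, uint16_t s, uint16_t d)
{
    uint16_t r = 0;
    if (op & 1) r |= (uint16_t)(s & d);
    if (op & 2) r |= (uint16_t)(s & ~d);
    if (op & 4) r |= (uint16_t)(~s & d);
    if (op & 8) r |= (uint16_t)(~s & ~d);
    return r;
}

/* One word: skew, halftone op, logic op, endmask. Returns the word to write back. */
static uint16_t blit_word(blit_state *st, uint16_t src, uint16_t halftone,
                          uint16_t dst, uint16_t endmask)
{
    /* Shift the new source word into the 32-bit buffer, then skew. */
    st->src_buf = (st->src_buf << 16) | src;
    uint16_t s = (uint16_t)(st->src_buf >> (st->skew & 15));

    /* Halftone op selects what is fed into the logic op. */
    switch (st->hop & 3) {
    case 0:  s = 0xFFFF;    break;
    case 1:  s = halftone;  break;
    case 2:                 break;
    default: s &= halftone; break;
    }

    uint16_t r = logic_op(st->op & 15, s, dst);

    /* Endmask: set bits take the new result, clear bits keep the old destination. */
    return (uint16_t)((r & endmask) | (dst & (uint16_t)~endmask));
}
```

In hardware all of this is combinational and should fit between the last read and the write; the question for an FPGA implementation is whether it does so without costing an extra bus slot.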
I would assume this is not optimal in your implementation. You probably need to clock the blitter internally at a higher speed to gain additional clock cycles for these internal operations. That is not a real problem, since your FPGA probably has unused PLLs that can provide that clock.
And remember: Beethoven wrote his first symphony in C.