Blitter test programs?

Badwolf · Post by **Badwolf** » 30 Jul 2024 10:47

Badwolf wrote: 29 Jul 2024 22:29 And grr. My debug board insert is enough to crash blitter access before I've even wired up the logic probe.

It's more than that -- I think all the jimmying in and out has damaged something on my board. Blitter accesses now just crash the system hard.

May have to do a bit of inspection and reflow before picking it up again :(

BW

ijor · Post by **ijor** » 30 Jul 2024 15:10

Badwolf wrote: 30 Jul 2024 10:46 I may be able to follow the logical steps between various things happening and the resultant output, but I'd not know how to estimate the logic delay for each of them (I don't know if it's safe to just say '20ns!' for each internal logic step [it's a 10ns chip, but what if the signal has to cross function blocks -- I don't even know how to tell]).

Ah, no, I was talking about the board level logic. The internal timing analysis must be performed by the Xilinx tools. Sorry if I wasn't clear enough.

Normally the FPGA tools can also analyze the external interface timing as well. But in this case you have three (or more) separate chips, connected with a not fully synchronous interface. You need to perform some "manual work".

I'm also not sure what I'm measuring from yet. Like you say, my initial assumption that DS meant data would be ready by the time my (slow) state machine got there doesn't look to be valid for the blitter.

If your state machine is "slow" enough, then data might be already available. But then again, you must perform a timing analysis (in this case, a functional analysis) to know how many cycles your state machine takes since you detect DS until the SDRAM chip latches the data. Some simulation might help here.

Btw, I missed the fact that if you add a delay things get worse. Maybe my initial suspicion, that it is actually a hold time issue, makes more sense then.

But there's no doubt this is still a little experimental in my mind. If everything were perfect you'd expect the skew to vary from around +8 to -8ns.

I will elaborate about this later.

ijor · Post by **ijor** » 31 Jul 2024 03:42

Badwolf wrote: 30 Jul 2024 10:46 But there's no doubt this is still a little experimental in my mind. If everything were perfect you'd expect the skew to vary from around +8 to -8ns. But across different CPLDs, with different initial clock phases, or as things heat up I could imagine that shifting to being non-symmetrical quite easily.

I can't be sure without studying your code in more detail, but it seems to me that your clock skew would be much worse than that. But let's assume that, indeed, skew is +-8ns. That might be bad enough to violate timing and create synchronization problems. I'm not sure if you are working on the opposite edge of the clock or not. If you are working on the same edge, this is not good for the chipset. If you are working on the opposite edge, it might be safe for the chipset, but still not good for devices working on the fast clock, in this case mainly the CPLD itself.

Your CPLD code is not ready for asynchronous operation, nor for any significant skew. You would need a synchronizer chain on every signal coming from the system. You actually even need to synchronize the 8 MHz clock itself, because you are reading it as data. You should synchronize external reset as well.

Now, the Blitter problem you are describing is most likely caused by something else. I suspect that either you are latching too early, or there is a hold timing issue with the Blitter output enable.

The clock skew and synchronization problem is an additional issue. But it is likely to trigger an error rather infrequently.

Code: Select all

reg CLKOSC_2 = 1'b1;
always @(posedge CLKOSC ) begin
	CLKOSC_2 <= ~CLKOSC_2;
end
reg CLKOSC_4 = 1'b1;
always @(posedge CLKOSC_2 ) begin
	CLKOSC_4 <= ~CLKOSC_4;
end
reg CLKOSC_8 = 1'b1;
always @(posedge CLKOSC_4 ) begin
	CLKOSC_8 <= ~CLKOSC_8;
end

What is the purpose of this logic? Are you creating clock skew intentionally? Or is this just a quick way to divide the clock? If so, you are creating unnecessary clock skew, and it also makes timing analysis very difficult.

Aim to use a single clock in the whole system. You can mux two clocks in different ways. But, ideally, all the system should use the same muxed clock. Otherwise, they should be treated as asynchronous systems and you would need to synchronize accordingly.

(I wasn't planning to support blitter access to SDRAM, but got on a roll and chucked my hat into the ring -- I knew it'd be a mistake!)

Nah, why would it be a mistake? Not at all! :)

Badwolf · Post by **Badwolf** » 31 Jul 2024 17:25

ijor wrote: 31 Jul 2024 03:42 I can't be sure without studying your code in more detail, but it seems to me that your clock skew would be much worse than that. But let's assume that, indeed, skew is +-8ns. That might be bad enough to violate timing and create synchronization problems. I'm not sure if you are working on the opposite edge of the clock or not. If you are working on the same edge, this is not good for the chipset. If you are working on the opposite edge, it might be safe for the chipset, but still not good for devices working on the fast clock, in this case mainly the CPLD itself.

What do you mean by 'not good for' in this case?

Your CPLD code is not ready for asynchronous operation, nor for any significant skew. You would need a synchronizer chain on every signal coming from the system. You actually even need to synchronize the 8 MHz clock itself, because you are reading it as data. You should synchronize external reset as well.

These are terms I'm not familiar with, let alone the actual concept of what they're trying to convey. I found this online looking for 'verilog synchroniser chain'. Is this a fair introduction to what you mean? https://forum.digikey.com/t/implementin ... ilog/35809

Now, the Blitter problem you are describing is most likely caused by something else. I suspect that either you are latching too early, or there is a hold timing issue with the Blitter output enable.

Mmm. I'm going to investigate what the onboard DRAM does in relation to the blitter as soon as I've fixed my test harness. Thanks for that.

Code: Select all
reg CLKOSC_2 = 1'b1;
always @(posedge CLKOSC ) begin
	CLKOSC_2 <= ~CLKOSC_2;
end
reg CLKOSC_4 = 1'b1;
always @(posedge CLKOSC_2 ) begin
	CLKOSC_4 <= ~CLKOSC_4;
end
reg CLKOSC_8 = 1'b1;
always @(posedge CLKOSC_4 ) begin
	CLKOSC_8 <= ~CLKOSC_8;
end
What is the purpose of this logic? Are you creating clock skew intentionally? Or is this just a quick way to divide the clock? If so, you are creating unnecessary clock skew, and it also makes timing analysis very difficult.

The latter. It's just a multi-stage divider. I use CLKOSC for the SDRAM, CLKOSC/4 for the 'fast' mode and the domains are allowed to be switched on the CLKOSC_2. I think the last stage is unused and likely optimised out.

I hadn't thought about skew being an issue here, or this being an overly skew-inducing derivation technique.

Would a single always() block and non-dependent derivations be less of a problem?

(I wasn't planning to support blitter access to SDRAM, but got on a roll and chucked my hat into the ring -- I knew it'd be a mistake!)

Nah, why would it be a mistake? Not at all! :)

:dizzy:

Thanks, Ijor.

I don't understand half of it, but slowly the different tips will make sense to me, I'm sure.

BW

ijor · Post by **ijor** » 01 Aug 2024 03:11

Badwolf wrote: 31 Jul 2024 17:25 What do you mean by 'not good for' in this case?

Not good, in this context, means there is a real chance that you will not meet hold or setup requirements (setup would probably be OK, hold is more problematic).

Any flip flop requires the input data to be stable from a determined amount of time before the active clock edge (setup time) until a determined amount of time (hold time) after the edge. If these requirements are not met, the behavior is not predictable. The output might take the old value, or the new value, it is unknown. Furthermore, the output might become metastable (although this is very unlikely).

Setup time is usually easy to meet at slow frequencies, even with some clock skew. But hold time doesn't depend on the frequency, and can be violated even with very small skew if data input changes too close to the clock edge. This is why it is very common to work on the opposite edge of the clock on external interfaces. You sacrifice setup time (essentially as if you would double the frequency), but you gain a lot of slack on the hold time.

Meeting timing is especially problematic when you interface modern and old school technology. Old school slow technology typically has a very large hold time requirement. This is no problem as long as the transmitter is also slow, because Tco (clock-to-output time) would usually be even larger. But if the transmitter is too fast, chances are that it will change the output too soon and violate hold timing at the receiver. This could happen even without any skew at all. This is another reason why it is common to work on a negated clock.
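As a rough sketch, the setup and hold argument above can be written as inequalities. The symbols are assumptions introduced here, not taken from any datasheet: T is the clock period, t_co the transmitter's clock-to-output time, t_path the board or logic delay, t_su and t_h the receiver's setup and hold requirements, and t_skew the arrival time of the capturing clock minus that of the launching clock.

```latex
\begin{align*}
\text{same edge, setup:}     \quad & t_{co} + t_{path} + t_{su} \le T + t_{skew} \\
\text{same edge, hold:}      \quad & t_{co} + t_{path} \ge t_{h} + t_{skew} \\
\text{opposite edge, setup:} \quad & t_{co} + t_{path} + t_{su} \le \tfrac{T}{2} + t_{skew} \\
\text{opposite edge, hold:}  \quad & t_{co} + t_{path} + \tfrac{T}{2} \ge t_{h} + t_{skew}
\end{align*}
```

The period T does not appear in the same-edge hold condition, which is why slowing the clock cannot fix a hold violation. Launching on the opposite edge halves the setup budget but adds T/2 of margin on the hold side.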

Now, if the output is not 100% predictable, it might not be a problem by itself. Depending on the case, you might not care. If it kept the old value, it will change at the next cycle as long as the data signal is stable. And in many cases this is good enough.

The big problem is when you read the external signals at multiple registers at the same time. Let's consider something like this:

Code: Select all

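	// clk added as a port so the example compiles
	module noSyncr( input clk, input extInput);
	reg r1, r2;
	always @( posedge clk) begin
		r1 <= extInput;	// both registers sample the raw external signal
		r2 <= extInput;
	end
	endmodule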

You might expect that at any given cycle, extInput would have the same value everywhere you read it, and then as a consequence, r1 and r2 would always be the same, but they might not be. As described above, if timing is not met, r1 might take the old value and r2 the new one. This is, of course, a trivial case just for illustration. In many cases it might be very difficult to see the problem at all.

If we add a synchronizer, then we avoid the problem:

Code: Select all

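	// clk added as a port so the example compiles
	module withSyncr( input clk, input extInput);
	reg r1, r2, extSynced;
	always @( posedge clk) begin
		extSynced <= extInput;	// the only register that samples the raw signal
		r1 <= extSynced;
		r2 <= extSynced;
	end
	endmodule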

Now you are guaranteed that r1 and r2 would always be the same.

A single synchronizer still doesn't address the metastability problem. For that you need a synchronizer chain. I won't elaborate here as it would be too long and there are many online references. Metastability is very unlikely, especially on modern tech. In some cases you might not care. Say, if we assume that you might get a wrong blitted pixel once per month, who cares. But it is important when you are synchronizing a clock itself. You already seem to have a synchronizer chain on CLK_D. Not sure if this was by design, or it was only for delay purposes. But as long as you don't use the first two bits in the shift register, you should be safe. I would add a synchronizer chain for DTACK and AS as well. But again, it might not be critical.
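For readers who have not met the term, a synchronizer chain can be sketched as two flip-flops in series. This is a minimal illustration only: the module name `sync2` and its port names are hypothetical and do not come from the project's source.

```verilog
// Two-flop synchronizer sketch (hypothetical names, not the project's code).
module sync2 (
    input  wire clk,       // destination (fast) clock
    input  wire async_in,  // signal from outside the clk domain, e.g. AS or DTACK
    output wire sync_out   // version that is safe to use in clk-domain logic
);
    // chain[0] samples the raw input and may occasionally go metastable;
    // chain[1] gives it one full clk period to settle before anything uses it.
    // Reset to 1 on the assumption that the bus strobes idle high (active low).
    reg [1:0] chain = 2'b11;

    always @(posedge clk) begin
        chain <= {chain[0], async_in};
    end

    assign sync_out = chain[1];
endmodule
```

The cost is two clock cycles of latency on every synchronized signal, which has to be counted in any cycle budget between a strobe edge and the point where data is latched.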

These are terms I'm not familiar with, let alone the actual concept of what they're trying to convey. I found this online looking for 'verilog synchroniser chain'. Is this a fair introduction to what you mean? https://forum.digikey.com/t/implementin ... ilog/35809

Doesn't seem to be the best and most intuitive elaboration that I've seen on the subject, but yes, that is the topic.

Mmm. I'm going to investigate what the onboard DRAM does in relation to the blitter as soon as I've fixed my test harness. Thanks for that.

I was trying to check your SDRAM controller timing, but access from Blitter seems to be disabled. Are you sure you pointed me to the right version?:

Code: Select all

wire altram_access_int =  AS_INT | ( altram_access & rom_access );
...
nouveau_sdram sdram(
	.CLK(RAMCLK),
//	.RST(RST_IN),
	.RST(RST),
	.AS( altram_access_int ),
...
	ACTIVE <= AS | ( UDS & LDS );

ACTIVE is asserted only when AS_INT (CPU AS) is asserted???

The latter. It's just a multi-stage divider. I use CLKOSC for the SDRAM, CLKOSC/4 for the 'fast' mode and the domains are allowed to be switched on the CLKOSC_2. I think the last stage is unused and likely optimised out.

I hadn't thought about skew being an issue here, or this being an overly skew-inducing derivation technique.
Would a single always() block and non-dependent derivations be less of a problem?

That would be better because it would avoid the skew between CLKOSC_2 & CLKOSC_4, but you would still have skew in relation with CLKOSC. Ideally you should not divide the clock at all, just use clock enables. I realize that here you must output a clock externally, so you can't avoid some skew (not without a PLL).
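A clock-enable version of the divider might look like the sketch below. Only `CLKOSC` is taken from the thread; the module name and the enable names are hypothetical. Every register is clocked by `CLKOSC`, and slower logic simply advances only when its enable is high.

```verilog
// Clock-enable sketch: one clock, no derived clocks (hypothetical names).
module clk_enable_div (
    input  wire CLKOSC,
    output wire en_div2,   // high for one CLKOSC cycle in every 2
    output wire en_div4    // high for one CLKOSC cycle in every 4
);
    reg [1:0] count = 2'b00;

    // Free-running 2-bit counter on the single system clock
    always @(posedge CLKOSC) begin
        count <= count + 2'b01;
    end

    assign en_div2 = count[0];
    assign en_div4 = (count == 2'b11);
endmodule
```

Logic that should run at a quarter of the rate would then be written as `always @(posedge CLKOSC) if (en_div4) begin ... end`, so the timing tools see a single clock domain. As noted above, this does not help for a clock that has to leave the chip.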

Badwolf · Post by **Badwolf** » 01 Aug 2024 10:52

ijor wrote: 01 Aug 2024 03:11
Badwolf wrote: 31 Jul 2024 17:25 What do you mean by 'not good for' in this case?
Not good, in this context, means there is a real chance that you will not meet hold or setup requirements (setup would probably be OK, hold is more problematic).

Any flip flop requires the input data to be stable from a determined amount of time before the active clock edge (setup time) until a determined amount of time (hold time) after the edge. If these requirements are not met, the behavior is not predictable. The output might take the old value, or the new value, it is unknown. Furthermore, the output might become metastable (although this is very unlikely).

...snip good stuff...

Oho, I see what you mean. Yes, it's not something I've thought too much about except in the SDRAM controller part, but not to the proper extent even then.

And thanks for the quick primer on synchronisation chains.

You already seem to have a synchronizer chain on CLK_D. Not sure if this was by design, or it was only for delay purposes. But as long as you don't use the first two bits in the shift register, you should be safe. I would add a synchronizer chain for DTACK and AS as well. But again, it might not be critical.

That is indeed by accident. It's purely the 8MHz clock derivation logic based on what the early TF boards did. Sample and invert the input clock, delay it a few cycles and then take that delayed, inverted sample as your approximation of the system clock. It presupposes a 50% duty cycle (a longer delay without inverting could avoid that, though) and will have that significant jitter discussed earlier (as it's only sampling every 15ns, which is not a divisor of 125), but means from that point forward everything we do is now derived from our oscillator instead.

I was trying to check your SDRAM controller timing, but access from Blitter seems to be disabled. Are you sure you pointed me to the right version?:

No, damn. I'm sorry. I haven't. I thought I had "git push"ed my blitter work, but had only committed it locally. I then went down a dead end and decided to revert with a "git reset" and only at that point realised myself.

I'm afraid I need to redo the blitter support logic.

The fundamental was that AS_COMBINED [ something like BGK_IN ? AS_INT : AS ] is used instead of AS_INT in the sdram_access wire.

The SDRAM state machine, for simplicity, only springs into life when both AS and ?DS are asserted taking at least two 15ns clock cycles to assert CMD_ACCESS, and one further to assert CMD_WRITE. Latching should occur one clock cycle later still. So I make that a minimum of 60ns from UDS/LDS going low to latching.

I know this is slow on the write side -- I can't max out the 16MHz CPU on writes. But it was an intentional trade off.

I initially wrote the controller to assert CMD_ACCESS shortly after AS goes low, and then the state machine would have a wait step for ?DS to assert. I later went through a process of trading speed for reduced complexity in an effort to improve reliability, however.

60ns could very well be too long. My initial timing measurements of blitter access to STRAM suggest data is pretty much valid at assertion of DS and that CAS gets asserted around 20-30ns later. I couldn't find the exact data sheet for the chips on the SIMMs in my STE, but a similar model states the chips then latch within 25ns of this edge.

Which says to me the data must be stable (set up) within about 20ns of DS going low. If I'm sampling *at least* 60ns later (could be delayed by an in-progress refresh, for example), perhaps the blitter doesn't have a particularly long hold time and I miss it.

Mind you, whilst that could explain general on-screen mess, I'm not sure it explains why pixels from the right most word are appearing at the 9th word position in that test I ran. I'd need to think about that some more.

But at the moment I don't have a working blitter-supporting AltRAM set up. Embarrassingly.

Once again, many thanks for your time, Ijor. Especially the introduction to these synchronisation chains.

I think I need to get back to where I was and convince myself I'm not introducing any instability in the ways you describe.

Cheers,

BW

ijor · Post by **ijor** » 01 Aug 2024 17:06

Badwolf wrote: 01 Aug 2024 10:52 It presupposes a 50% duty cycle (a longer delay without inverting could avoid that, though) and will have that significant jitter discussed earlier (as it's only sampling every 15ns, which is not a divisor of 125), but means from that point forward everything we do is now derived from our oscillator instead.

I can see that the 8MHz clock is captured by your main oscillator. But unless you changed it, in the version I am seeing CLKOUT is clocked by CLKOSC_2, which would make the jitter twice as bad:

Code: Select all

reg CLK_OUT_INT;
always @( negedge CLKOSC_2 ) begin
	if( SLOW )
		CLK_OUT_INT <= ~CLKOSC_4;
	else
		CLK_OUT_INT <= ~CLK_D[shift-1];
end
...
assign CLKOUT = ~CLK_OUT_INT;

The SDRAM state machine, for simplicity, only springs into life when both AS and ?DS are asserted taking at least two 15ns clock cycles to assert CMD_ACCESS, and one further to assert CMD_WRITE. Latching should occur one clock cycle later still. So I make that a minimum of 60ns from UDS/LDS going low to latching. 60ns could very well be too long.

Hmm, no, why would 60ns after the DS edge be too long? If you are latching before S5, it could actually be too early.

... perhaps the blitter doesn't have a particularly long hold time and I miss it.

Blitter certainly has less hold time than the CPU. It should be easy to confirm if that's the problem by delaying DTACK a few cycles. But ....

I was checking your SDRAM controller logic, and it seems to me that the logic that starts an SDRAM access is level triggered:

Code: Select all

always @(posedge CLK) begin
	ACTIVE <= AS | ( UDS & LDS );

This is a bit dangerous; it would be better if it were edge triggered. Otherwise, if you are too fast, you might start a second access before Blitter gets to deassert the control signals. I can't be sure if this is happening or not without performing a simulation, but you should make sure it is not. This might explain the behaviour.
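One way to make the start condition edge triggered is sketched below. It is an illustration under assumptions, not the controller's actual code: `AS`, `UDS`, `LDS` and `CLK` are taken from the excerpt above (all strobes active low), while the module name, `inactive`, `inactive_d` and `start` are hypothetical.

```verilog
// Edge-triggered access start sketch (hypothetical names, active-low strobes).
module access_start (
    input  wire CLK,
    input  wire AS,
    input  wire UDS,
    input  wire LDS,
    output wire start    // active-high pulse, exactly one CLK cycle per bus cycle
);
    // Same sense as ACTIVE in the excerpt: 0 while AS and at least one DS are low
    reg inactive   = 1'b1;
    reg inactive_d = 1'b1;

    always @(posedge CLK) begin
        inactive   <= AS | (UDS & LDS);
        inactive_d <= inactive;          // value from the previous CLK cycle
    end

    // Fires only on the 1 -> 0 transition, so strobes that stay asserted
    // after the controller returns to idle cannot start a second access.
    assign start = inactive_d & ~inactive;
endmodule
```

Compared with the level-triggered version, this adds one cycle before the access begins, so the write-data window would need to be rechecked in simulation before relying on it.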

Mind you, whilst that could explain general on-screen mess, I'm not sure it explains why pixels from the right most word are appearing at the 9th word position in that test I ran. I'd need to think about that some more.

As I said, the timing on the last word is slightly different, and only at the last word Blitter's Output Enable timing is involved. Also, if you are not word aligned, Blitter needs to perform a read from the same location before actually performing the write. This also changes the timing.

One last comment about the clock aligning technique. Personally, I don't like it. Yes, a PLL would be much better. Yes, using something like a MAX10 that has a PLL would make the design slightly more expensive. It would also make the design more complicated because you would need voltage level shifters, as the MAX 10 is not 5V tolerant. But you can't help it: the XC95XL family is not manufactured anymore, and shortly it will be very difficult to find reliable parts at a reasonable cost.

Badwolf · Post by **Badwolf** » 02 Aug 2024 17:07

Thanks Ijor.

I've not had time to digest all of this yet and unfortunately since reverting my code I've broken something such that any access to the blitter crashes the machine at the moment (even when my board is not responding to it), so I need to get this back into a working state.

NB. I need to think through that clock switching code again as perhaps I've made a mess there.

Cheers,

BW

Badwolf · Post by **Badwolf** » 16 Aug 2024 10:39

OK, so I don't want to leave this up in the air. Here are the conclusions I've come to for now.

I don't think my current implementation is going to reliably support blitter access into AltRAM so I'm going to BERR it out for the time being and look for a software solution (reworking Anders' BLITFIX).

It seems that the blitter drives the data quite early during a write and doesn't hold it very long. If I have a fairly asynchronous interface to my SDRAM controller I can just about meet its timing requirements for writing (specifically latching the write data as quickly as possible). But I've taken on board the tips about tying as much as possible to clock edges and adding a few synchronisers here and there, and with these delays I just can't make it.

That also ignores read. Read is a lot more reliable. I can present the data in time for the blitter to latch in the overwhelming majority of cases but every now and again -- perhaps when the refresh happens at a particularly sensitive time -- it'll miss a word.

Even worse, one of my boards will induce various errors in the processor during attempts to service the blitter (Line-F errors, for example), which suggests more timing violations with data being on the bus when it shouldn't. That one board does it and not the other suggests something is just too tight to be reliable.

Therefore, I'm going to try to choose reliability over feature creep and not permit blitter access in rev 2.

BUT, it's an open source project and anyone's welcome to have a go at making it work, of course :)

I hope to have a rev 2 design in the next month or so. And a rev 1.5 (rev 1 with hacks to make it a rev 2) on my channel soon.

Many thanks for all the help everyone, but specifically @ijor for the time he's spent.

BW

ijor · Post by **ijor** » 16 Aug 2024 18:41

Badwolf wrote: 16 Aug 2024 10:39 It seems that the blitter drives the data quite early during a write and doesn't hold it very long.

Hi Dave,

As I said in my previous message, I'm not sure that is the problem. Unless you changed the code, it seems the problem is the opposite, your SDRAM controller is too fast.

As I mentioned in my previous message, your "ACTIVE" logic at the DRAM controller is level triggered. And since you are too fast for Blitter, you start a second SDRAM write access. I performed a quick simulation. This is the simulation waveform:

Dtsb-Blitter1-Waveform.jpg

At the first yellow time mark at the left, Blitter is in S4 and asserts UDS/LDS. At the next 66MHz rising edge you set your "ACTIVE" flag, and you start a full SDRAM write operation. At the second time mark on the right, the SDRAM controller goes back to the idle state. But at this point, Blitter is "only" at the S6 state and all the control signals are still asserted. So the controller starts another write access. But this second write might be too late: by the time the actual write is performed, Blitter data might not be there anymore.

Note that if you delay one cycle, as you did when you tried using the registered DS signals, it is even worse. In this case you are still too fast, you would still start a second write cycle. But in this case (almost) for sure you are too late at the point that the DRAM chip actually needs the data.

This is all assuming you are using the code you pointed me to in a previous message. And this needs to be verified in real hardware, of course.
