Expansive expansions

SpacedCowboy
Posts: 35
Joined: Sat Oct 14, 2023 5:43 am


Post by SpacedCowboy »

It was suggested I write up the project I'm currently spending my time on (thanks Icky :)) so this is an attempt to document it and not forget too many of the constituent parts. It's fairly ambitious (at least by my standards) so you may have to bear with me as I transcribe it all.

The basic idea is to use bus-snooping to reconstruct the graphics output of any computer that exposes its data lines. Since the CPU (or another chip) fetches data from RAM in order to display the screen, parsing that data out ought to let me display the screen myself, without having to solder chips into, and cut tracks on, ancient and hard-to-replace hardware...

In my case, the target was the 8-bit Atari line (XL/XE, since they have the ports on the back). I am planning to do this with an FPGA, and since there's an FPGA, all it takes is a bitstream swap to support different computer types: XL, or XE (with the cartridge + ECI instead of the parallel port), and possibly even the ST (via the cartridge port, though this would only be a VDI-compatible output because of the limited address space, so no memory-mapped graphics).

The block diagram at the project page looks like:

[Image: project block diagram (Screenshot 2023-10-14 at 11.08.28 PM.png)]

Starting at the host computer and moving up to the actual display, the plan is:
  • Have a small PCB that mates with the host port and carries a mini-SAS (SFF-8087) cable socket. That connector can carry 36 signals, just about enough for both the ST and the 8-bits, and the cable is thin, flexible and even looks reasonably nice. See below for the XE version.
  • Have a main PCB that converts the voltages down to 3.3V, then passes them to an FPGA so the bus protocol (whatever it is) can be parsed into a standard, packetised format and passed on to the next stage, which is
  • An FPGA implementation of the slave side of the Raspberry Pi's secondary memory interface (SMI). All Pis have this interface (yes, even the new 5, and going back to the very first Pi). It's criminally under-used in the Pi community IMHO: it's a low-latency, high-bandwidth (50-100 MB/s) interface with DMA direct to user space. You lose pretty much all the GPIO pins, but hey, you have an FPGA for that...
  • A Raspberry Pi. By default this will boot into a captive application (nothing requires this, it's just that I'm dedicating the Pi to the task). I'm currently using a Pi 4 but am glancing over at the Pi 5 as well :) This application is a client of the display service (Gemd), and is responsible for interpreting the packets and either sending responses or performing actions (e.g. "expand this line of host-memory video into something the Pi can display").
  • The client app talks to Gemd, which listens on a socket (currently a local Unix socket for speed, but there's no reason it couldn't be TCP, which would allow applications to run over the network). In fact, any number of client apps can connect to the socket, send commands over it, and listen for events/replies coming back. The back-end Gemd service is written in C++ using Qt, but the front-end API (which looks remarkably like the AES/VDI :)) is written in C. All the front-end API does is marshal the arguments into a serialised form and send them over the pipe to Gemd.
  • Gemd will provide a VDI physical workstation of 1920x1080 at 32 bits per pixel, and offer the standard 256 "pens" to GEM for when you want to pretend it's indexed.
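As a rough illustration of the "256 pens" idea, here's a minimal sketch of how a 32-bit server might keep an indexed pen table. It assumes the usual GEM convention that vs_color() takes red/green/blue components in the range 0-1000; the PenTable type and pen_* functions are hypothetical, not Gemd's actual code. Colours are packed as 0xAARRGGBB, the same layout Qt uses for QRgb.

```c
#include <stdint.h>

/* Hypothetical pen table: 256 "pens", each a packed 0xAARRGGBB colour. */
typedef struct {
    uint32_t rgba[256];
} PenTable;

/* Scale a GEM colour component (0..1000) to 0..255, rounding to nearest. */
static uint8_t gem_to_8bit(int16_t v)
{
    if (v < 0)
        v = 0;
    if (v > 1000)
        v = 1000;
    return (uint8_t)((v * 255 + 500) / 1000);
}

/* Roughly what a vs_color()-style request would do: set one pen. */
static void pen_set(PenTable *t, uint8_t index, const int16_t rgb[3])
{
    t->rgba[index] = 0xFF000000u                      /* opaque alpha */
                   | ((uint32_t)gem_to_8bit(rgb[0]) << 16)
                   | ((uint32_t)gem_to_8bit(rgb[1]) << 8)
                   |  (uint32_t)gem_to_8bit(rgb[2]);
}

/* Drawing with a pen just resolves it to a real 32-bit colour. */
static uint32_t pen_lookup(const PenTable *t, uint8_t index)
{
    return t->rgba[index];
}
```

Because the framebuffer itself stays 32-bit in this sketch, changing a pen after drawing wouldn't recolour pixels already on screen the way a real palette change would; that's one trade-off of emulating indexed colour this way.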
Memory apertures
The FPGA has direct access to 8MB of PSRAM (rather than DDR, mainly because a PSRAM interface is easy to write and the routing is easier too :)) and it can provide access to that RAM to the host via what I'm calling "memory apertures". At least on the XL/XE, any RAM location can be supplied by an external module by pulling a signal low on the bus. I intend to do that as a matter of course, so all RAM will be serviced by the FPGA. That means I can define a different base address for (say) 8 sets of pages (low page to high page) on the fly, allowing access to the entire 8MB within the 64KB of 8-bit address space in a really flexible manner.

Memory apertures would use some defined addresses in peripheral space to specify how memory in a range of pages would be accessed, for example:


$+0000 [4]: Base address in PSRAM for start of memory aperture
$+0004 [1]: Page address in host-memory for start of memory aperture
$+0005 [1]: Number of pages to map
$+0006 [1]: 'stride', or pages per horizontal line
$+0007 [2]: width in bytes of each virtual line
The stride and width allow us to specify a longer virtual horizontal line than the physical linear map would allow, yet still map it into a linear space in host RAM. For video, moving everything left or right by a byte is a matter of changing the base address by 1, and moving everything up or down by a line is a matter of changing the base address by 'stride * 256' bytes.
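The register layout above leaves some room for interpretation, so here's a minimal C model of the host-to-PSRAM translation. It assumes that 'stride' is the pitch of each line in PSRAM (in 256-byte pages) and 'width' is how many bytes of each line appear in the host's linear window; that is the reading consistent with "up or down a line = change the base by stride * 256". The struct and function names are illustrative, and the real logic would live in the FPGA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the aperture registers above (names are illustrative). */
typedef struct {
    uint32_t base;       /* $+0000: base address in PSRAM              */
    uint8_t  first_page; /* $+0004: first host page covered            */
    uint8_t  num_pages;  /* $+0005: number of host pages mapped        */
    uint8_t  stride;     /* $+0006: PSRAM line pitch, in 256-byte pages */
    uint16_t width;      /* $+0007: bytes of each line visible to host  */
} Aperture;

/* Translate a 16-bit host address into a PSRAM address.
 * Returns false if the address lies outside this aperture. */
static bool aperture_translate(const Aperture *ap, uint16_t host, uint32_t *psram)
{
    assert(ap->width > 0);
    uint8_t page = (uint8_t)(host >> 8);
    if (page < ap->first_page || page >= ap->first_page + ap->num_pages)
        return false;

    uint32_t offset = (uint32_t)host - ((uint32_t)ap->first_page << 8);
    uint32_t line   = offset / ap->width;   /* which virtual line       */
    uint32_t column = offset % ap->width;   /* byte within that line    */
    *psram = ap->base + line * ((uint32_t)ap->stride << 8) + column;
    return true;
}
```

With this model, scrolling right by one byte is `ap.base += 1` and scrolling down by one line is `ap.base += ap.stride * 256`, with no data copied on either side.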

Peripheral expansion slots
I'm intending to provide "slots" on the main PCB using PCIe x1 connectors so you can just plug in boards with "gold-finger" edge connectors. Not sure how many yet, but (say) 4 to 6. These will communicate using serial ports (at various baud rates, configured by pulling lines low on the edge connector) or SPI interfaces, talking to an RP2040, which in turn will feed commands/data to the FPGA over a dedicated bus. The FPGA can route that bus to the host or to the Pi's SMI bus, allowing data to be sent to either. It'll be bidirectional of course, so you can send data *to* the peripheral slots too.

On that note, and since the slots are defined by an interface, it ought to be possible to write software "peripherals" on the Pi which interact with the host computer as if they were hardware. A simple buffering scheme ought to make this work pretty well.
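One possible shape for that "simple buffering scheme" is a single-producer/single-consumer ring buffer, sketched below with C11 atomics. It assumes exactly one thread produces bytes (say, a software peripheral on the Pi) and exactly one consumes them (the code feeding the SMI link); the ByteRing name and its size are arbitrary choices for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256u              /* must be a power of two */

typedef struct {
    uint8_t data[RING_SIZE];
    _Atomic size_t head;            /* total bytes written (producer) */
    _Atomic size_t tail;            /* total bytes read (consumer)    */
} ByteRing;

static void ring_init(ByteRing *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the buffer is full. */
static bool ring_put(ByteRing *r, uint8_t byte)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->data[head & (RING_SIZE - 1)] = byte;
    /* Release: the byte is visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the buffer is empty. */
static bool ring_get(ByteRing *r, uint8_t *byte)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *byte = r->data[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Keeping head and tail as free-running counters and masking on access means "full" and "empty" can be told apart without sacrificing a slot.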

Networking
There's a Pi there, and it's running Linux - networking ought to be a cinch, just a matter of routing bytes around, since all the hard work is already done on the Pi.

"Hard disks" via SD or USB Pen drive
Similarly, plugging a USB drive into the Pi could make it available to the host computer.

Firmware updates
One of the really nice things about the RP2040 is that you can just hold the BOOTSEL button while plugging it in, and it appears as a USB flash drive itself. That makes it trivial to upgrade in the field. I intend to append the FPGA bitstream to the RP2040 firmware image, and the RP2040 will then reprogram the FPGA's configuration flash, holding the FPGA in reset while it does so. Upgrading the firmware, then, is just "plug the device into a computer while pressing the button, and copy the firmware to the newly-appearing drive".
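The post doesn't say how the bitstream is appended, so the following is only a sketch under an invented layout: assume the last 16 bytes of the firmware image are a little-endian footer of {magic "FPGA", offset, length, CRC-32} describing the bitstream, and that the RP2040 checks it before copying the bitstream into the FPGA's flash. FOOTER_MAGIC, find_bitstream() and the footer layout are all hypothetical; only the CRC-32 is the standard reflected one (polynomial 0xEDB88320) used by zlib and Ethernet.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOOTER_MAGIC 0x41475046u   /* "FPGA" read little-endian (hypothetical) */
#define FOOTER_SIZE  16u

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Standard reflected CRC-32, computed bit by bit. */
static uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Locate and validate a bitstream appended to a firmware image.
 * On success, *bits and *bits_len describe what would be written to the
 * FPGA's configuration flash (with the FPGA held in reset meanwhile). */
static bool find_bitstream(const uint8_t *image, size_t image_len,
                           const uint8_t **bits, size_t *bits_len)
{
    if (image_len < FOOTER_SIZE)
        return false;
    const uint8_t *footer = image + image_len - FOOTER_SIZE;
    if (read_le32(footer) != FOOTER_MAGIC)
        return false;
    uint32_t offset = read_le32(footer + 4);
    uint32_t length = read_le32(footer + 8);
    uint32_t crc    = read_le32(footer + 12);
    /* The bitstream must lie entirely before the footer. */
    if (offset > image_len - FOOTER_SIZE ||
        length > image_len - FOOTER_SIZE - offset)
        return false;
    if (crc32(image + offset, length) != crc)
        return false;
    *bits = image + offset;
    *bits_len = length;
    return true;
}
```

Checking the CRC before erasing anything means a truncated or corrupted copy would leave the existing FPGA image in place.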


Approach
I recognised there were a fair number of options and possible configurations, and I wanted to make it easy to build on and use, so I decided I needed a GUI environment to run this in. One of the things I want to implement is a desktop-like environment for the host computer: if you move the mouse, the pointer appears, and if you move to the top-left of the screen, you can flip to a desktop which allows you to "download" applications to the host, as well as do any configuration... To make sure that using the Pi was feasible for all this, I decided to write the software side first this time, and so QGem was born. If I was going to implement a GUI, then GEM seemed pretty appropriate for an Atari computer system :) So Gemd is a lot like an X11 server, except that its clients make VDI/AES calls rather than Xlib ones.

It then occurred to me that if I wanted a high-res desktop for the ST/TT, the cartridge port could be a nice interface to bind the FPGA to. You can read from it easily, and you can fake writes by encoding the data in the address of a read from the second bank. That's not a world away from a socket interface...

Another thought is that I can implement a sort-of direct-rendering interface by using shared memory between the Gemd server and the client application. It's not too hard to have the client app expand (say) an 8KB video screen from the XL/XE into a 320x200 32-bit RGBA memory space, which just happens to be where Gemd reads a QPixmap from every time it refreshes the screen. Since it's client/server, you get one CPU writing data and one CPU reading data, with the synchronisation happening within the shared memory pool.
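To give a feel for the expansion step, here's a minimal sketch that turns a 1-bit-per-pixel screen (40 bytes per line, as in the Atari's 320-pixel hi-res mode) into 32-bit pixels. It assumes a monochrome mode with the leftmost pixel in bit 7; colour modes would need their own decoders, and setting up the shared memory itself (for example with shm_open() and mmap()) isn't shown. The function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define BYTES_PER_LINE 40          /* 320 pixels at 1 bit per pixel */

/* Expand one line of 1bpp host video into 32-bit pixels.
 * Bit 7 of each byte is the leftmost pixel. */
static void expand_line_1bpp(const uint8_t *src, uint32_t *dst,
                             uint32_t fg, uint32_t bg)
{
    for (size_t i = 0; i < BYTES_PER_LINE; i++) {
        uint8_t bits = src[i];
        for (int b = 7; b >= 0; b--)
            *dst++ = ((bits >> b) & 1u) ? fg : bg;
    }
}

/* Expand a whole screen into a buffer the server could read directly,
 * e.g. one living in a shared-memory segment. */
static void expand_screen_1bpp(const uint8_t *screen, size_t lines,
                               uint32_t *pixels, uint32_t fg, uint32_t bg)
{
    for (size_t y = 0; y < lines; y++)
        expand_line_1bpp(screen + y * BYTES_PER_LINE,
                         pixels + y * BYTES_PER_LINE * 8, fg, bg);
}
```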

To give some idea of where I am, the simple XE interface card looks like:

[Photo: the XE interface card (IMG_0557)]

The VDI code is functionally complete (i.e. I'm assuming there's a whole boatload of bugs, but all the calls I need at least have an implementation). I've been testing it along the way, and although the screenshot below isn't exactly stunning (it's just test code), it does show user-defined fill patterns and line styles, vro_cpyfm/vrt_cpyfm, writing modes, flood fill, text alignment, markers and outline text. In my implementation the line styles etc. also work as you increase the width of a line, and you get arbitrary text rotation :) It seems a lot faster than my TT as well, and the screenshot is actually at 1080p... :)

The C code that generates the output below is (at least currently) here. If you jump to the main() function, it looks remarkably like a bunch of VDI calls... :)

[Screenshot: VDI test output at 1080p (Screenshot2023-10-05at9_11_37PM.png)]

Anyway, that's where it is. I currently have an implementation (at least at the C function-call level) of the VDI, and I've just finished a first pass at getting resource files loaded in the AES code. Next up is to finish the rest of the rsrc library, and then to start drawing OBJECTs before tackling some of the higher-level constructs like windows and getting the event library working.

Pie-in-the-sky ideas
  • Implement a host-computer-side API which sends commands to a client-app, which in turn sends commands to QGem. You end up with GEM on your host computer. Not a huge advantage for an ST, but significant for an XL...
  • Add a 68K emulator, intercept TRAPs and redirect them to the C API... and see if we can run native 68k programs on the Pi via emulation. That's no use for games that write directly to memory, but good for "serious" apps. Hmm... unless that direct-rendering idea pans out, in which case emulating games might be feasible too (hey, this is the pie-in-the-sky section).
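For the TRAP-interception idea: GEM programs on the ST call the VDI with TRAP #2, d0.w = 115 and d1 pointing at a parameter block of five pointers (contrl, intin, ptsin, intout, ptsout); the AES uses the same trap with d0.w = 200. Below is a sketch of just the decode step, assuming a hypothetical emulator hook that passes in d0, d1 and a pointer to emulated (big-endian) RAM. For example, v_ellipse arrives as opcode 11 (v_gdp) with sub-opcode 5, which appears to match the "11.5" label on the v_ellipse implementation later in the thread.

```c
#include <stdint.h>

#define GEM_TRAP_VDI 115u          /* d0.w for VDI calls via TRAP #2 */
#define GEM_TRAP_AES 200u          /* d0.w for AES calls via TRAP #2 */

typedef struct {
    uint16_t opcode;   /* contrl[0]                                */
    uint16_t sub;      /* contrl[5]: sub-opcode, e.g. GDP number   */
    int16_t  handle;   /* contrl[6]: workstation handle            */
    int16_t  pts[4];   /* first two ptsin vertices (x, y pairs)    */
} VdiCall;

/* Emulated 68k RAM is big-endian; read it portably on any host.
 * No bounds checking here: a real emulator would validate addresses. */
static uint16_t rd16(const uint8_t *mem, uint32_t a)
{
    return (uint16_t)((mem[a] << 8) | mem[a + 1]);
}

static uint32_t rd32(const uint8_t *mem, uint32_t a)
{
    return ((uint32_t)rd16(mem, a) << 16) | rd16(mem, a + 2);
}

/* Hypothetical TRAP #2 hook: decode a VDI call, return 0 for anything
 * else (e.g. AES) so the trap can be handled elsewhere. */
static int decode_vdi_trap(const uint8_t *mem, uint16_t d0, uint32_t d1,
                           VdiCall *out)
{
    if (d0 != GEM_TRAP_VDI)
        return 0;
    uint32_t contrl = rd32(mem, d1);        /* VDIPB[0] */
    uint32_t ptsin  = rd32(mem, d1 + 8);    /* VDIPB[2] */
    out->opcode = rd16(mem, contrl);
    out->sub    = rd16(mem, contrl + 10);
    out->handle = (int16_t)rd16(mem, contrl + 12);
    unsigned n = rd16(mem, contrl + 2) * 2u; /* contrl[1] = vertex count */
    if (n > 4)
        n = 4;
    for (unsigned i = 0; i < 4; i++)
        out->pts[i] = (int16_t)(i < n ? rd16(mem, ptsin + 2 * i) : 0);
    return 1;
}
```

A real hook would then forward the decoded call to the C API, e.g. `v_ellipse(c.handle, c.pts[0], c.pts[1], c.pts[2], c.pts[3])` when `c.opcode == 11 && c.sub == 5`.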
rubber_jonnie
Site Admin
Posts: 9568
Joined: Thu Aug 17, 2017 7:40 pm
Location: Essex

Re: Expansive expansions

Post by rubber_jonnie »

Not sure if you've seen this, but it may be of interest: Making a 3D Graphics Card for the Atari 800 XL using the Raspberry Pi

Good luck with your project, and I do have a 130XE, 2x 800XL and a 600XL if you need a tester further down the line :)
Collector of many retro things!
800XL and 65XE both with Ultimate1MB,VBXL/XE & PokeyMax, SIDE3, SDrive Max, 2x 1010 cassette, 2x 1050 one with Happy mod, 3x 2600 Jr, 7800 and Lynx II
Approx 20 STs, including a 520 STM, 520 STFMs, 3x Mega ST, MSTE & 2x 32 Mhz boosted STEs
Plus the rest, totalling around 50 machines including a QL, 3x BBC Model B, Electron, Spectrums, ZX81 etc...
SpacedCowboy
Posts: 35
Joined: Sat Oct 14, 2023 5:43 am

Re: Expansive expansions

Post by SpacedCowboy »

rubber_jonnie wrote: Sun Oct 15, 2023 11:05 am Not sure if you've seen this, but it may be of interest: Making a 3D Graphics Card for the Atari 800 XL using the Raspberry Pi
I had considered using Circle to get the performance up (and faster boot times), but then you lose a lot of the benefits of Linux support. I'm going to mitigate the boot time by having it self-powered and in its own box at the end of the cable (a Pi doesn't draw much power, so that's not a huge drain) so it can actually be left on: instaboot :)

As for performance, I spent some time checking out SDL, SFML, raw OpenGL and Qt, and I can get 60-odd fps at 1080p in Qt on a Pi 4, which also gives me by far the best development environment, with widgets, accelerated 2D drawing etc. So Qt won out.

Their project also takes a different approach: instead of a secondary 3D display, my intention is to boot up into the exact same display as the native XL/XE/host, just on an HDMI screen. I'm not sure yet whether to implement ANTIC in the FPGA or in software; a Pi is a *lot* faster than a 6502 (or a 68k, for that matter). The new Pi 5 is a quad-core 2.4GHz device...

Similarly, if we end up with an ST version, the cartridge will boot, enable the new VDI routines, and the computer will just talk natively to that new display (@ 1080p :)). Given the cartridge port's addressing limitations, games (or programs that access screen RAM directly) are unlikely to work, but "serious" stuff ought to be good to go. It might be possible to do more with a VME option and by playing with how the ST/TT report their memory, but that has its own issues.
rubber_jonnie wrote: Sun Oct 15, 2023 11:05 am Good luck with your project, and I do have a 130XE, 2x 800XL and a 600XL if you need a tester further down the line :)
Thanks :)

Fair warning that this is probably going to be a slow-burner. There's a lot to do, and time is a limited commodity...
Steve
Posts: 2532
Joined: Fri Sep 15, 2017 11:49 am

Re: Expansive expansions

Post by Steve »

@SpacedCowboy Have you seen this by @Badwolf ? : https://exxosforum.co.uk/forum/viewtopi ... =29&t=6581

Is that in any way similar to what you may be planning for ST? Perhaps you two gents could collaborate :)
SpacedCowboy
Posts: 35
Joined: Sat Oct 14, 2023 5:43 am

Re: Expansive expansions

Post by SpacedCowboy »

If I'm reading that correctly, it's similar inasmuch as both would use the cartridge port and both would provide an external display. I'm trying to leverage the Pi to get a really *big* display in full 32-bit RGBA. I *think* @Badwolf is integrating the cartridge port into the standard memory map of the ST, whereas I plan to treat the cartridge port as a pipe to send commands down and receive data back over.

The current design does just that with the VDI, so the code to (for example) draw an ellipse (v_ellipse in the VDI) looks like...


/*****************************************************************************\
|*  11.5: Fill an ellipse			[type=5] [pxy=x,y,rx,ry]
\*****************************************************************************/
void v_ellipse(int16_t handle, int16_t x, int16_t y, int16_t rx, int16_t ry)
	{
	/*************************************************************************\
	|* Check to see if we're connected
	\*************************************************************************/
	if (!_gemIoIsConnected())
		if (!_gemIoConnect())
			return;
	
	/*************************************************************************\
	|* Construct and send the message
	\*************************************************************************/
	int16_t info[] = {4, x, y, rx, ry};
	GemMsg msg;
	_gemMsgInit(&msg, MSG_V_ELLIPSE);
	_gemMsgAppend(&msg, info, sizeof(info)/sizeof(int16_t));
	_gemIoWrite(&msg);
			
	/*************************************************************************\
	|* Clear the message allocations
	\*************************************************************************/
	_gemMsgDestroy(&msg);
	}
... from the client point of view. All this does is marshal up the parameters and send them down a pipe to the daemon that does all the work of drawing (which is Gemd, running on the Pi). In the current implementation, that "send it down a pipe" is quite literally serialising it and writing to a socket, but you could easily see it becoming "use the address-as-data trick to 'write' to bank 2 of the cartridge port". Reading is very similar.
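Here's a sketch of what the marshalling in a call like v_ellipse() might boil down to. The wire format (a 16-bit message type, a 16-bit word count, then the payload words, all big-endian so the same bytes make sense coming from a 68k or a Pi) is an assumption for illustration, not Gemd's actual format, and gem_marshal() is a hypothetical stand-in for the _gemMsgInit/_gemMsgAppend pair; the message-type value used in the test is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: [type:16][count:16][count x int16 payload],
 * big-endian throughout. Returns bytes written, or 0 if `out` is too small. */
static size_t gem_marshal(uint16_t type, const int16_t *words, uint16_t count,
                          uint8_t *out, size_t out_size)
{
    size_t needed = 4 + 2 * (size_t)count;
    if (out_size < needed)
        return 0;
    out[0] = (uint8_t)(type >> 8);
    out[1] = (uint8_t)type;
    out[2] = (uint8_t)(count >> 8);
    out[3] = (uint8_t)count;
    for (uint16_t i = 0; i < count; i++) {
        uint16_t w = (uint16_t)words[i];   /* two's-complement bit pattern */
        out[4 + 2 * i]     = (uint8_t)(w >> 8);
        out[4 + 2 * i + 1] = (uint8_t)w;
    }
    return needed;
}
```

The resulting buffer could go straight to write() on the Unix socket or, on an ST, be pushed through the cartridge-port write trick one word at a time; fixing the byte order in the format is what lets both transports share one encoder.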

What all this means is that I don't have a memory-mapped display, from the perspective of the ST. I instead have an interface to VDI functions that communicate over a pipe to the Pi which does the actual drawing, whether that communication is via the cartridge port or via a unix pipe.

Also, I should point out that the ST was an afterthought for this project, and I was really targeting the XL/XE. I do think using the ST cartridge port is a viable technical approach, but I'm not sure just how much software on the ST used the API calls exclusively to output its display, rather than relying on hacks to write to screen memory. The ST wasn't actually that fast once you took the 68k bus access protocol into consideration... I suspect "cheating" to get performance was widespread.

I will say that the test app display is instant :) There's no sense of "drawing" going on, even for large text blits. It "feels" significantly faster than my TT does, so assuming the software works with this approach, it ought to be pretty performant.
ijor
Posts: 411
Joined: Fri Nov 30, 2018 8:45 pm

Re: Expansive expansions

Post by ijor »

SpacedCowboy wrote: Sun Oct 15, 2023 7:13 am The basic idea is to use bus-snooping to reconstruct the graphics output of any computer that exposes the data-lines. Since the CPU (or other chip) goes and fetches data from RAM in order to display the screen, parsing that data out ought to let me display the screen myself, without having to solder chips into, and cut tracks on ancient and hard-to-replace hardware...
...
Their project is also a different approach - instead of a secondary 3D display, my intention is to boot up into the exact same display as the native XL/XE/host, it'll just be on an HDMI screen.
Very interesting project.

I assume you will want to output the computer audio through HDMI as well. Note that you can't reproduce the SIO audio accurately just by snooping the bus and without access to the SIO port. You might not care, and probably the most common scenario will be booting from the PBI, not from SIO, anyway. But it is something to be aware of, nevertheless, if you care about backwards compatibility.
http://github.com/ijor/fx68k 68000 cycle exact FPGA core
FX CAST Cycle Accurate Atari ST core
http://pasti.fxatari.com
SpacedCowboy
Posts: 35
Joined: Sat Oct 14, 2023 5:43 am

Re: Expansive expansions

Post by SpacedCowboy »

ijor wrote: Mon Oct 16, 2023 12:24 am Very interesting project.
Thanks :)
ijor wrote: Mon Oct 16, 2023 12:24 am I assume you will want to output the computer audio through HDMI as well. Note that you can't reproduce the SIO audio accurately just by snooping the bus and without access to the SIO port. You might not care, and probably the most common scenario will be booting from the PBI, not from SIO, anyway. But it is something to be aware of, nevertheless, if you care about backwards compatibility.
I had thought of the SIO audio, and because I don't need all 36 signal lines on the XL/XE, one of them is used for audio. A circuit converts the voltage levels to ones appropriate for the RP2040's ADC, and the RP2040 injects audio packets along with the bus traffic coming from the host computer. It can do the same for a stereo jack input as well, and mix them all together before sending them on to the Pi via the FPGA.
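A minimal sketch of the mixing step, assuming the RP2040's 12-bit ADC samples both the host audio line and a stereo input, with mid-scale (2048) taken as silence. The scaling to 16-bit PCM and the saturating sum are illustrative choices, and the function names are hypothetical; a real design might also filter or attenuate before mixing.

```c
#include <stdint.h>

/* Convert a 12-bit unsigned ADC reading (0..4095, 2048 = silence)
 * into a signed 16-bit PCM sample. */
static int16_t adc12_to_pcm16(uint16_t raw)
{
    return (int16_t)(((int32_t)(raw & 0x0FFFu) - 2048) * 16);
}

/* Add two samples, clamping instead of wrapping on overflow. */
static int16_t mix_sat(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX)
        s = INT16_MAX;
    if (s < INT16_MIN)
        s = INT16_MIN;
    return (int16_t)s;
}

/* Mix the host's mono channel into both sides of the stereo input. */
static void mix_frame(uint16_t host_raw, uint16_t left_raw, uint16_t right_raw,
                      int16_t *left, int16_t *right)
{
    int16_t host = adc12_to_pcm16(host_raw);
    *left  = mix_sat(host, adc12_to_pcm16(left_raw));
    *right = mix_sat(host, adc12_to_pcm16(right_raw));
}
```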

What I need to figure out now that the ST might be part of the picture, is how that gels with the ST actually needing all 36 signals... If it's a common signal interface over the mini-SAS cable, I might need some jumpers to route signals...
SpacedCowboy
Posts: 35
Joined: Sat Oct 14, 2023 5:43 am

Re: Expansive expansions

Post by SpacedCowboy »

Thinking a little more about how one might go about using the cartridge port for at least sort-of memory-mapped (i.e. paged) access (because the 128K of memory space is pitiful when it comes to screen memory), I think the scheme below might work. The cartridge port has memory-mapped space at $FA0000 through $FBFFFF:

$FA0000 -> $FAFFFF : /ROM4 asserted
$FB0000 -> $FBFFFF : /ROM3 asserted

One huge assumption is that it's OK to have only word-level access to the memory. This is a useful limitation to set because (apart from the address/data lines) there are precisely 4 signals coming out of the Atari cartridge port: the two selects above, based solely on address, and /UDS and /LDS, the 68000's upper and lower data strobes, which between them tell you whether it's a byte (move.b) or a word (move.w) access. Restricting "normal" access to words only means that byte access can be used to signal control logic, rather than reading memory.

If we make an access to $FB0000 -> $FBFFFF (i.e. /ROM3 is asserted), we consider it a write-to-memory operation. We then further use the byte/word access type to carry bit 0 of the address or data to be written.

If we make an access to $FA0000 -> $FAFFFF (i.e. /ROM4 is asserted) and it is a move.w read, we consider it a read-from-memory operation. Since we only support word reads, if we get a move.b read in the /ROM4 space, we treat it as a read or write (see below for details) of the control registers. Speaking of which...
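The decode rules above can be modelled in a few lines of C. This is a behavioural sketch only (the real thing would be HDL in the cartridge FPGA): the booleans mean "asserted" rather than pin levels, and a word access is taken to be one where both data strobes are asserted, which is how the 68000 signals a 16-bit transfer. The type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum {
    CART_IGNORE,      /* neither select asserted                        */
    CART_MEM_WRITE,   /* /ROM3: value carried in A15..A1 + access size  */
    CART_MEM_READ,    /* /ROM4, word access: read from PSRAM            */
    CART_REG_ACCESS   /* /ROM4, byte access: control-register read/write */
} CartOp;

/* true = signal asserted (the pins themselves are active-low). */
typedef struct {
    bool rom3, rom4;  /* selects for $FB0000.. and $FA0000.. */
    bool uds, lds;    /* upper/lower data strobes            */
} CartSignals;

static CartOp classify(CartSignals s)
{
    bool word = s.uds && s.lds;   /* both strobes => 16-bit access */
    if (s.rom3)
        return CART_MEM_WRITE;
    if (s.rom4)
        return word ? CART_MEM_READ : CART_REG_ACCESS;
    return CART_IGNORE;
}
```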

Let's set up some memory-mapped registers. These need to be in the same memory space as the cartridge because both the FPGA on the cartridge and the ST need access to the current settings. Since we want to access as much of the limited RAM at a time as we can, the registers shadow the same addresses as the RAM, and are read/written by accessing them with move.b. So these "hidden" registers are:


	$FA0000,1	: read-access page-id -> 4GB accessible in banks of 64K
	$FA0002,3	: write-access page-id -> 4GB accessible in banks of 64K
	$FA0004,5	: current write address offset (within the 64K)
	$FA0006		: flags as below
				bit 0:	if 1, auto-address pre-increment on write access
				bit 1:	if 1, read-page-id will roll over on access to $FFFF
				bit 2:	if 1, write-page-id will roll over on access to $FFFF
Some additional explanation of what these mean:
  • $FA0000,1 and $FA0002,3 are the bank-registers, so when a read/write is made, the value in the appropriate register is multiplied by 65536 and added to the address specified by the instruction, and that is the actual address read/written in the attached PSRAM. The page registers for read and write are separate so that efficient memory->memory copies can be done without constantly changing a single page-id register.
  • $FA0004,5 is the "current" address pointer for writing within the 64K banks.
  • $FA0006 is the flags register, which says how the "current" address will update as accesses happen, and how the page registers will update as an access rolls over the page boundary.
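Here's a sketch of the bank-register arithmetic and the read roll-over behaviour, modelled in C for clarity. The CartRegs struct mirrors the register table above; the last_read field, and treating "roll-over" as a read of $0000 immediately after a read of $FFFE, are assumptions about how the FPGA might implement it.

```c
#include <stdint.h>

#define FLAG_AUTO_INC    0x01u   /* bit 0: auto-increment write address */
#define FLAG_READ_ROLL   0x02u   /* bit 1: read page-id rolls over      */
#define FLAG_WRITE_ROLL  0x04u   /* bit 2: write page-id rolls over     */

typedef struct {
    uint16_t read_page;   /* $FA0000,1 */
    uint16_t write_page;  /* $FA0002,3 */
    uint16_t write_addr;  /* $FA0004,5 */
    uint8_t  flags;       /* $FA0006   */
    uint16_t last_read;   /* internal: offset of the previous word read */
} CartRegs;

/* PSRAM address = page-id * 65536 + offset within the 64K window. */
static uint32_t psram_addr(uint16_t page, uint16_t offset)
{
    return ((uint32_t)page << 16) | offset;
}

/* Word read at `offset` in the /ROM4 window. With roll-over enabled,
 * reading $FFFE then $0000 moves on to the next page. */
static uint32_t read_word_addr(CartRegs *r, uint16_t offset)
{
    if ((r->flags & FLAG_READ_ROLL) && r->last_read == 0xFFFE && offset == 0x0000)
        r->read_page++;
    r->last_read = offset;
    return psram_addr(r->read_page, offset);
}
```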

Read access to memory

So to read memory at a random offset in the 4GB of page-space, you
  • Set up the read-page register (probably just the low byte, $FA0001, since that alone covers 16MB) using move.b
  • Do a move.w $FAxxxx,dx to read the 16 bits at the address ($xxxx + page offset).
To set up the read-page register, you "write" to the register by doing a move.b $FAxxyy, d0, where xx has its high bit set to mean "write" and the remaining 7 bits are the register byte address (starting at 0), and yy is the value to put into that byte. So for example, to write the value $aa to register $FA0000 one might do:


move.b $FA80AA, d0      ; "write" $AA into register byte 0 ($FA0000)
and to read the 16-bit value at $FA0000,1 into d0 one might do:


move.b $FA0000, d0      ; high byte
lsl.w #8, d0
move.b $FA0001, d0      ; d0.w now contains the value in $FA0000,1
Once the page is set up, for reading you can just access random (word-aligned) addresses within the page segment, and if you have bit 1 set in the flags register, then sequentially reading $FFFE, $0000 will "roll over" the page-id in $FA0000,1, and you'll effectively be reading linear memory, so you don't have to worry about page boundaries.
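Seen from the ST side, register access is all about computing the right address to read from. The helpers below (hypothetical names) only compute those addresses so the encoding can be checked; on real hardware you'd perform the move.b itself, e.g. by reading through a volatile uint8_t pointer and discarding the value for writes. They assume, as in the examples above, that a register is read at its own address ($FA0000 + register) and written via an address of the form $FA(80 | register)(value).

```c
#include <stdint.h>

#define CART_ROM4     0xFA0000u   /* base of the /ROM4 window           */
#define REG_WRITE_BIT 0x80u       /* high bit of the middle byte: write */

/* Address whose move.b read stores `value` into register byte `reg`. */
static uint32_t reg_write_addr(uint8_t reg, uint8_t value)
{
    return CART_ROM4 | ((uint32_t)(REG_WRITE_BIT | (reg & 0x7Fu)) << 8) | value;
}

/* Address whose move.b read returns register byte `reg`. */
static uint32_t reg_read_addr(uint8_t reg)
{
    return CART_ROM4 | (reg & 0x7Fu);
}

/* The two move.b "reads" that load a 16-bit page-id into the register
 * pair (reg, reg + 1), high byte first since the 68000 is big-endian. */
static void page_write_addrs(uint8_t reg, uint16_t page, uint32_t out[2])
{
    out[0] = reg_write_addr(reg, (uint8_t)(page >> 8));
    out[1] = reg_write_addr((uint8_t)(reg + 1), (uint8_t)page);
}
```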


Write access to memory

Writing is a little more involved. You still have to set up the page-id register at $FA0002,3 just as above, but then there are two ways to write to memory, the better one depending on how many bytes you're going to write. Because we're using the address lines as data for writing, we have to send both address and data over the same (address) signals, so we're multiplexing {address, data, address, data, ...}. This is obviously inefficient for sequential access, but bit 0 of the flags comes to our rescue.

First, here's the multiplexed method described above, writing the 16-bit value $yyyy to memory address $xxxx in the 64K page. This is the case when bit 0 in the flags register is clear:
  • Set up the write-page register if necessary (probably just the low byte, $FA0003, since that alone covers 16MB) using move.b
  • Issue a move.w $FBxxxx, dx to send the address. A0 can't be 1 because we're accessing words, so only the top 15 bits matter.
  • If the data to write has bit 0 set, issue a move.b $FByyyy, dx (with bit 0 of $yyyy cleared in the address)
    otherwise issue a move.w $FByyyy, dx
However, if bit 0 in the flags register is set, there is no need to send the address when the values are being written to sequential memory locations. The process then becomes:
  • Set up the write-page register if necessary (probably just the low byte, $FA0003, since that alone covers 16MB) using move.b
  • If the data to write has bit 0 set, issue a move.b $FByyyy, dx (with bit 0 of $yyyy cleared in the address)
    otherwise issue a move.w $FByyyy, dx
In all cases, the FPGA constructs the 16-bit value from the address and access size as {A15..A1, /LDS}, thus getting 16 bits of data. A byte access at an even address asserts only /UDS, leaving /LDS high (bit 0 = 1); a word access asserts both strobes, driving /LDS low (bit 0 = 0).

The two approaches can of course be alternated, using a register write to change bit 0 (a move.b read in the $FAxxxx space, as above), so you can set the address with the first type of write, change the flag, then just keep issuing move.{w,b} and the address will increment automatically as you issue the "writes".
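The write encoding can be sketched as an encode/decode pair. encode_write() is what host code would compute before issuing the read; decode_write() models what the FPGA would rebuild from A15..A1 plus the lower data strobe. It assumes, as above, that the byte accesses are made at even addresses so that only /UDS is asserted; the names are hypothetical, and in auto-increment mode the FPGA would simply advance its own write pointer after each decoded value.

```c
#include <stdbool.h>
#include <stdint.h>

#define CART_ROM3 0xFB0000u        /* base of the /ROM3 window */

/* Host side: how to "write" a 16-bit value through the /ROM3 window.
 * Bits 15..1 travel on A15..A1; bit 0 is carried by the access size. */
typedef struct {
    uint32_t addr;      /* address to read from (always even)     */
    bool     use_byte;  /* true => move.b (bit 0 = 1), else move.w */
} CartWrite;

static CartWrite encode_write(uint16_t v)
{
    CartWrite w = { CART_ROM3 | (uint32_t)(v & 0xFFFEu), (v & 1u) != 0 };
    return w;
}

/* FPGA side (behavioural model): lds_n is the /LDS pin level, true = high.
 * An even-address byte access leaves /LDS high; a word access drives it low. */
static uint16_t decode_write(uint32_t addr, bool lds_n)
{
    return (uint16_t)((addr & 0xFFFEu) | (lds_n ? 1u : 0u));
}
```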

Summary

None of this is good enough for emulation or generic frame-buffer code, because you still have to be aware of how to "write" to the paged cartridge memory (or at least call into the supplied routines to do so). It could be useful for a custom frame buffer that lives on an FPGA (or link to a Pi :)), but I still think it's probably easier to just have the pipe interface. It could certainly be used for something like a RAM disk, where the code understands the limitations. Still, it was a fun academic exercise :)
mrbombermillzy
Posts: 1389
Joined: Sun Jun 03, 2018 7:37 pm

Re: Expansive expansions

Post by mrbombermillzy »

Interesting project.

I agree that the best solution for a bandwidth-limited computer system is sending command codes to an off-board system which can use them to execute rendering primitives of some kind. That said, I applaud your efforts on the paged, memory-mapped display buffer solution too.

Bus snooping as a way of working out what was happening inside the 'system' was brought up as a solution in a discussion at our Cyberlegends event recently, but kudos to you for (heading towards) implementing the reconstructed GEM desktop rendering.

I will keep an eye on how this develops.

Well done so far and good luck! :)
Badwolf
Posts: 2199
Joined: Tue Nov 19, 2019 12:09 pm

Re: Expansive expansions

Post by Badwolf »

Very interesting and the remote protocol is definitely the way to the largest resolutions.

I suspect, though, that there are many (if not the majority of) programs that, for the sake of speed, infer (or simply assume) the pixel format in use and bash their graphics straight into what they consider screen RAM.

It'll be fascinating to see if this is a problem down the line -- it may be that a smart raster sync protocol is needed later on. :)

I'll be following this. Do keep us updated, please. 8-)

BW
DFB1 Open source 50MHz 030 and TT-RAM accelerator for the Falcon
DSTB1 Open source 16Mhz 68k and AltRAM accelerator for the ST
Smalliermouse ST-optimised USB mouse adapter based on SmallyMouse2
FrontBench The Frontier: Elite 2 intro as a benchmark
