Tuesday, February 26, 2013

Architecture simulators, SystemC and ArchC

I've been looking into existing literature and simulators for the heterogeneous simulator I'll be developing, and after discussing some of the options with my supervisor Magnus, the conclusion was to use SystemC together with ArchC for the simulation infrastructure. Here's a brief overview of what they are:

  • SystemC is a set of C++ libraries and a simulation kernel, with plenty of useful functionality for creating models of complex hardware systems at different levels. Combining the object oriented paradigm with extra modelling capabilities for concurrency, timing and communication results in a flexible and powerful tool, and you get to decide the level of detail you would like for your models - anything from RTL (a hardware-synthesizable subset exists!) to expressing a whole processor instruction execution cycle with a switch statement. 
  • ArchC is an open-source architecture description language that was built to allow researchers or companies to quickly prototype new computer architecture ideas. It can create SystemC simulations of the proposed architecture or create compiled versions for more speed, and it can even generate a GNU binutils suite targeting the architecture you specified!
So what's the first thing you'd like to see when someone starts talking about some new/esoteric language? You'd probably want to see examples. Here's a SystemC example from Wikipedia:

#include "systemc.h"
 
SC_MODULE(adder)          // module (class) declaration
{
  sc_in<int> a, b;        // ports
  sc_out<int> sum;
 
  void do_add()           // process
  {
    sum.write(a.read() + b.read()); //or just sum = a + b
  }
 
  SC_CTOR(adder)          // constructor
  {
    SC_METHOD(do_add);    // register do_add to kernel
    sensitive << a << b;  // sensitivity list of do_add
  }
};

ArchC syntax looks pretty similar (it's inspired by SystemC in any case) with specific language constructs to specify the instruction set architecture (ISA) and the microarchitecture. They have ArchC models for a number of different cores at the ArchC website, check it out!

So the microarchitecture description (well, it's not a complete microarchitecture description, but you can always customize the connections and the components if you want to) looks like this in ArchC:

AC_ARCH(mips1){

  ac_mem   DM:5M; // 5 megs of direct access memory
  ac_regbank RB:32; // register bank
  ac_reg npc;
  ac_reg hi, lo;

  ac_wordsize 32; // 32-bit words

  ARCH_CTOR(mips1) {

    ac_isa("mips1_isa.ac"); // set ISA
    set_endian("big"); // big endian

  };
};

And here is an excerpt from the ISA description and an instruction behaviour description:

AC_ISA(mips1){

  // declare the format of a group of instructions for decoding
  // this can be thought of as parameter type/count declaration
  ac_format Type_R  = "%op:6 %rs:5 %rt:5 %rd:5 %shamt:5 %func:6";
  
  // ... insert more instruction formats here
  
  // which instructions belong to which format?
  ac_instr<Type_R> add, addu, sub, subu, slt, sltu;
  
  // ... insert more instruction-format matchings here
  
  // assembly equivalent of instructions, for binutils generation
  // (addi uses an I-type format, which is elided above)
  ISA_CTOR(mips1){
    addi.set_asm("addi %reg, %reg, %exp", rt, rs, imm);
    addi.set_asm("add %reg, %reg, %exp", rt, rs, imm);
    addi.set_decoder(op=0x08);
    // ...
  };
};
...

// Instruction addi behavior method.
void ac_behavior( addi )
{
  dbg_printf("addi r%d, r%d, %d\n", rt, rs, imm & 0xFFFF);
  RB[rt] = RB[rs] + imm;
  dbg_printf("Result = %#x\n", RB[rt]);
  //Test overflow
  if ( ((RB[rs] & 0x80000000) == (imm & 0x80000000)) &&
       ((imm & 0x80000000) != (RB[rt] & 0x80000000)) ) {
    fprintf(stderr, "EXCEPTION(addi): integer overflow.\n"); exit(EXIT_FAILURE);
  }
};

Tuesday, February 19, 2013

I'm back - with heterogeneous multicore computing!

Following the (bad) blogger tradition of long periods of silence followed by an "I'm back!" post, here we go :)

In the past 2.5 years I did a lot of things that I really felt I should have blogged about here, mostly as part of the fantastic Erasmus Mundus in Embedded Computing Systems (EMECS) master programme, but ah well. I intend to share the rest of my embedded systems and computer architecture adventures here, and if I manage to get back into the writing mode I might even become retrospective and write about some cool stuff I've seen during the Long Period of Silence :P

To summarize the current situation, I'm in the last semester of the EMECS programme writing my Master's thesis, and recently got a PhD position offer from the Norwegian University of Science and Technology (NTNU) Computer Architecture and Design Group. My thesis and the doctoral research for the following 4 years will be in the fascinating world of heterogeneous multicore architectures, and at the heart of the reason for doing all this still lies the same craving which fueled my Google Summer of Code 2010 project at BeagleBoard with dsp-rpc-posix: to make it easier for more developers to use this amazing hardware to its full or near-full potential.

On a more specific level of detail, my MSc thesis is going to be about developing a simulator for a tile-based heterogeneous architecture. I've done a bit of literature research and review of existing simulator work (and most naturally, a brain muddled due to trying to absorb all that information in 1.5 weeks) and for the moment it feels like I'll be basing my work on ArchC / SystemC for a variety of reasons which I'll hopefully go more into soon.

Expect to see blog posts about practical SystemC/ArchC issues, random ramblings about heterogeneous multicore architectures and maybe some cool embedded systems projects soon!

Monday, August 16, 2010

Final Weekly Report, no #13

So here comes the final weekly report... I just want to say once again what a wonderful experience it has been to work with the BeagleBoard for GSoC! As I stated in my previous week's report, I do believe I've managed to complete most of what I (and my mentors) had in mind for the end of the GSoC period. There's (as always!) more to do, of course... and I intend to continue development for C6Run in particular and BeagleBoard in general (though with the start of the academic year, it's unlikely that I'll be able to devote as much time). Right now, though, I could really use a little break, and will be taking one :)


Weekly Report #13 
Submitted on 2010-08-16
Covers 2010-08-09 to 2010-08-16

Status and Accomplishments
• documentation moved to its own eLinux wiki page (at http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation), divided into two sections (Usage, Architecture) and expanded
• build system modifications: support auto-copying of the generated stub files by the stub generator utility, make a folder for user sources
• code cleanup: move GPP&DSP common definitions to a single header file, work on GPP-side pointer alignment problems, partition GPP-side code into modules

Wednesday, August 11, 2010

Documentation's moved

Just writing to let my readers know that the C6Run documentation has moved to the elinux wiki and will be residing there from now on:

http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation

I've been working on expanding and improving it yesterday, and there's more to come.

Monday, August 9, 2010

Weekly Report #12

Weekly Report #12
Submitted on 2010-08-09
Covers 2010-08-02 to 2010-08-09

Reflections
As we're past the soft deadline (aka "the suggested pencils down date" in GSoC speak) I wanted to write a bit about how DSP-RPC-POSIX has progressed so far. I'd say that the project is now in a state that can do most of the things that I (and my mentors) had imagined it would be doing by now. C6Run's own goals of easy DSP app prototyping with console/file system access are pretty much accomplished, as is being easy to get started with (although there are debatable issues surrounding this particular goal, see below). DSP-RPC-POSIX's twofold goals, which were basically being able to call GPP-side functions from the DSP, and providing a prebuilt set of POSIX (actually, "standard C library" would be a much more appropriate term) remote accessor functions, are also met. Right now, a sample DSP development cycle for a beginner developer using C6Run could look like:

• obtaining the sources from SVN
• setting up (which includes a script that downloads and sets up the C6Run dependencies) and building
• using C6RunApp and DSP-RPC-POSIX to construct, test and debug the wanted DSP-side algorithm
• using C6RunLib to wrap up the finalized DSP-side algorithm with an ARM library and use this library to build DSP-aided GPP apps

The specific capabilities and properties of DSP-RPC-POSIX include:

• ability to call most C standard library functions on the GPP side without any extra effort (stubs are already generated)
• ability to call any GPP-side function, with support for all basic parameter types and buffers and obtain the return value, provided that stubs for the target function exist
• a stub generation tool to easily generate stubs for GPP-side functions (given a number of C source files, it will generate the stubs necessary for RPC access to all the functions contained)
• function signature system allows addition of new data types for parameters/return types with special treatment
• architecturally, the RPC layer is mostly separate from C6Run's other functionality so it could be taken apart and used in other projects

There are, of course, certain shortcomings:

• C6Run still does not build as part of OE. The build system has been modified to skip rebuilding existing dependencies, but there are still problems building for the Beagle. Once these are fixed it won't take long to have an OE recipe in place. To my knowledge, there are folks at TI who are already working on this.
• pointers/buffers issues: double/triple/etc. pointers, pointers inside structs, pointers hidden inside buffers (i.e., any pointer that is not explicitly declared as a parameter) can't be handled automatically and need to be manually address-translated by the user.
• structs can't be passed as parameters since a different signature character would be needed for each, they need to be passed as pointers to structs instead
• pointers/buffers to be passed via RPC need to be either less than a fixed number of bytes, or be allocated using rpc_malloc instead of from the DSP heap or stack

Status and Accomplishments
• DSP-side caches are re-enabled for all platforms, stubs that use pointer/buffer parameters call the writeback/invalidate functions where needed to keep the caches coherent
• ARM-side cache coherence code and cache coherency testing example in place
• stubs for string.h functions and POSIX low level I/O functions

Plans and Tasks
• what Google suggests after the soft pencils down date - code scrubbing (mainly fixing possible memory alignment issues, moving commonly used things to GPP/DSP shared header files etc.), improving documentation and writing more tests/examples

Risks, issues, blockers
• for some reason (probably due to the absence of a BCACHE_wbAll function on DSP-side?) enabling ARM caches still results in cache coherency problems.
• the C6Run trunk still produces problematic executables for the BeagleBoard, so a merge with the dsp-rpc-posix branch doesn't make sense at this point, which means OE integration won't make it to the GSoC deadline

Wednesday, August 4, 2010

Cache Coherency

The presence and usage of ARM and DSP side caches poses a cache coherency problem when we want to access a shared area by both processors. Let's consider the following scenario:

1. A CMEM buffer is allocated for shared usage by the GPP side and its physical pointer is passed to the DSP.
2. The DSP wants to read and then write some data into this buffer. Let's say that there are free entry slots in the DSP L2 cache - so the data actually gets written to the DSP cache, instead of making it to the DDR shared region. The DSP then signals the GPP that it's done with the buffer for the time being.
3. The GPP attempts to read the buffer, but what it reads is just the garbage values present in the buffer after initialization since the DSP-written data is in the DSP cache. This buffer also gets cached on the ARM side now, so when the GPP tries to write some new data into it, it stays in the ARM cache and doesn't make it to the main memory either.
4. If the DSP tries to read the buffer now, it won't get what the ARM has written into it most recently since it'll be reading from its own cache, and vice-versa for the ARM side.

We can see that the "same" buffer actually exists in three different locations (main memory, DSP cache, ARM cache), all of which can contain totally different data - in this case it is said that they are not coherent, and that we have a cache coherency problem.

In most cached systems, there are cache coherency protocols which prevent these situations from occurring. The TMS320C64x+ DSP Cache User's Guide states:

In the following cases, it is your responsibility to maintain cache coherence:
• DMA or other external entity writes data or code to external memory that is then read by the CPU
• CPU writes data to external memory that is then read by DMA or another external entity

Thus we have to manually maintain cache coherence for mutual access to CMEM regions by the DSP and the GPP. Studying the scenario above, we can observe that there are two underlying problems:

1. If the memory block to be read already exists in the local cache, there's a risk that the local cache is outdated: we need to discard the local cache entries and fill them up with information from the main memory. This process is called cache invalidation.
2. When the memory block is to be written into, there's a risk that the info remains in the local cache and doesn't make it to the main memory: we have to make sure that the new info gets written to the main memory as well. This process is called cache writeback.

Therefore, from an RPC perspective, for a call that involves transferring buffers, the steps we have to take are as follows:

1. Before passing the marshalled info via DSP/Link, the DSP must do a cache writeback
2. Before passing the params to the GPP side stub, the GPP must do a cache invalidate
3. After the GPP side stub is finished, the GPP must do a cache writeback
4. The DSP side stub must do a cache invalidate before terminating

This is assuming that both processor caches will be active - in case the DSP cache is disabled, the steps 1 and 4 will not be necessary, and likewise with steps 2 and 3 for a disabled GPP cache.

Monday, August 2, 2010

Weekly Report #11

Weekly Report #11
Submitted on 2010-08-02
Covers 2010-07-26 to 2010-08-02

Status and Accomplishments
• an elementary version of the mechanism to be able to pass DSP-allocated (stack or heap) buffers to RPC functions is now in place. what essentially happens is that when the GPP side detects a non-CMEM-allocated buffer is passed as a parameter, it uses PROC_read to read from the DSP memory and copy this into a regular GPP buffer which the functions can access.
• bugfixes for RPC functions returning structures - but pointer parameters inside structs still remain an issue and have to be translated manually
• dsp-rpc-posix branch re-synced with c6run trunk, but to no avail (see blockers section)

Plans and Tasks
• the ARM and DSP caches are both disabled for dsp-rpc-posix to deal with cache coherency, but this impacts the memory access performance quite negatively - enable them again and use writeback/invalidate functions to keep caches in sync
• some POSIX functions are still missing from the RPC library since their corresponding header files (with type definitions and all) don't exist for the DSP side. this includes useful things like ioctls - find a way to expose these through RPC

Risks, issues, blockers
• despite all the time I spent on it, I couldn't uncover the cause of the "bus error" that occurs when I compile even the simplest things (such as hello world) with the latest C6Run trunk (so it's not just my synced branch that's troublesome). in some cases it's just "bus error" and the program stopping, in others the PROC_setup fails. the code responsible for these parts hasn't really changed so I'm suspecting it to be a build/configuration issue. due to this and the fact that my primary computer's hard drive crashed (I'm stuck with a silly little netbook!) I haven't been able to try and build C6Run inside OE. I'm not sure if this is just happening for me or the c6run trunk is broken (there are indeed some errors that prevent it from building but they are small and easy to fix)