
Conv_weights method? Vivado/Vitis version #9

WinstonTNguyen commented Dec 2, 2022

As I am still learning, I have run into problems implementing the FracBNN model on an Ultra96v2 FPGA with a custom dataset.

Specifically, what technique did you use to encode the convolutional weights in conv_weights.h? Is it the same thermometer encoding used for the input? Could you go into the details of, or share sample code for, your 2-bit encoding process with hex representation? Since the array shape is [45][16][3][3], I am also confused about which layers' weights ended up in conv_weights.h.

Second question: which Vivado version did you use for implementation, 2019.1 or 2019.2? I tried version 2022.2 but ran into a lot of Tcl problems.

Thanks in advance!
Winston

chhzh123 commented Dec 2, 2022

Thanks for your interest in our project! We use the following process to encode the convolutional weights:

  1. Binarize the weights with a threshold of 0: if an element is greater than 0, quantize it as 1; otherwise, quantize it as 0.
  2. Pack the bits along the channel dimension into integers, e.g. uint64, which is why the data type is uint64 in https://github.com/cornell-zhang/FracBNN/blob/main/xcel-cifar10/source/conv_weights.h#L3.
  3. Generate a single packed array storing the weights of all the conv layers. That is what you see in conv_weights.h. This way we can use an index pointer to access the weights without extra memory storage overhead (see https://github.com/cornell-zhang/FracBNN/blob/main/xcel-cifar10/source/bnn_tiled.cc#L81).
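The three steps above can be sketched roughly like this. This is an illustrative sketch, not the actual export script from the repo; the function name, shapes, and bit order (channel i into bit i) are assumptions:

```python
import numpy as np

def pack_conv_weights(w):
    """w: float array of shape (out_ch, in_ch, kh, kw) with in_ch <= 64."""
    bits = (w > 0).astype(np.uint64)                       # step 1: threshold at 0
    coef = np.uint64(1) << np.arange(w.shape[1], dtype=np.uint64)
    # step 2: a weighted sum over the channel axis packs the bits into one uint64
    return np.einsum('oikl,i->okl', bits, coef).astype(np.uint64)

# step 3 would concatenate the packed arrays of all conv layers into one table
w = np.random.randn(16, 64, 3, 3)
packed = pack_conv_weights(w)
print(packed.shape)  # (16, 3, 3): one uint64 per (out_ch, kh, kw) position
```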

We used Vivado 2019.2 to synthesize the design. Vitis HLS has made many changes to its API and compilation flow, and we have not tested our design on the latest version. To reproduce our results, we suggest using Vivado HLS rather than Vitis HLS.

@WinstonTNguyen (Author)

Thank you very much for the response. I have managed to pack the convolutional weights and reproduce your CIFAR-10 acceleration model. Could you also share how you packed the validation images and labels into the .bin file? What is the structure of the packing process?

Thanks a bunch.
Winston

chhzh123 commented Dec 4, 2022

The validation images also use thermometer encoding; the process is exactly the same as for the training images.

We don't need to pack the labels. Since labels are just the integers 0-9, it is easy to compare the predicted results with the golden results in Python. Hope this answers your question :)
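For example, the comparison can be as simple as the following (a minimal sketch; the array contents are made up):

```python
import numpy as np

golden = np.array([3, 8, 8, 0, 6])      # labels from the CIFAR-10 test set
predicted = np.array([3, 8, 1, 0, 6])   # results read back from the accelerator
accuracy = float(np.mean(predicted == golden))
print(f"accuracy: {accuracy:.2%}")  # accuracy: 80.00%
```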

WinstonTNguyen commented Dec 12, 2022

Thanks a lot for your answers. With your help I have been able to generate the IP in Vivado HLS and build a bitstream in Vivado using the CIFAR-10 acceleration approach. However, I have run into problems when loading the bitstream onto the Ultra96v2 board: it doesn't work properly.

Upon inspection, the generated bitstream is around 5 MB, roughly 100x larger than the 0.03 MB reported in the paper. I also tried rebuilding your CIFAR-10 design in Vivado HLS without modifying the source code and noticed that the IP's LUT and BRAM usage exceeds the resources available on the board by a large margin. I followed your instructions closely, and this is the result; I have attached a picture of the exported IP summary. As I am still learning, do you know how this could happen? Have I made a mistake somewhere?

Note: one thing I did differently is using the Faketime library to work around a bug in Vivado 2019.2 when exporting the IP from Vivado HLS. Could this have caused the issue?

Thanks for your assistance in advance.
(Attached screenshot, 2022-12-12: exported IP utilization summary)

@chhzh123 (Member)

This result is expected. Below is a report I generated with Vivado 2019.2; it matches what you obtained. The resource usage in the HLS report is only an estimate and can differ greatly from the actual usage.

+-----------------+---------+-------+--------+--------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | URAM|
+-----------------+---------+-------+--------+--------+-----+
|DSP              |        -|      -|       -|       -|    -|
|Expression       |        -|      -|      40|    1741|    -|
|FIFO             |        -|      -|       -|       -|    -|
|Instance         |      314|    108|   87924|  148628|    0|
|Memory           |      216|      -|      64|       5|    0|
|Multiplexer      |        -|      -|       -|   43006|    -|
|Register         |        -|      -|    1023|       -|    -|
+-----------------+---------+-------+--------+--------+-----+
|Total            |      530|    108|   89051|  193380|    0|
+-----------------+---------+-------+--------+--------+-----+
|Available        |      432|    360|  141120|   70560|    0|
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%)  |      122|     30|      63|     274|    0|
+-----------------+---------+-------+--------+--------+-----+

Instead, you can check the impl/ip/FracNet-CIFAR10/FracNet-CIFAR10.runs/impl_1/design_1_wrapper_utilization_placed.rpt file, which shows the resource usage after placement and routing. There you can see that the actual LUT and BRAM utilization is below the available resources.

1. CLB Logic
------------

+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| CLB LUTs                   | 51475 |     0 |     70560 | 72.95 |
|   LUT as Logic             | 49911 |     0 |     70560 | 70.74 |
|   LUT as Memory            |  1564 |     0 |     28800 |  5.43 |
|     LUT as Distributed RAM |   521 |     0 |           |       |
|     LUT as Shift Register  |  1043 |     0 |           |       |
| CLB Registers              | 39618 |     0 |    141120 | 28.07 |
|   Register as Flip Flop    | 39618 |     0 |    141120 | 28.07 |
|   Register as Latch        |     0 |     0 |    141120 |  0.00 |
| CARRY8                     |  5021 |     0 |      8820 | 56.93 |
| F7 Muxes                   |   596 |     0 |     35280 |  1.69 |
| F8 Muxes                   |   288 |     0 |     17640 |  1.63 |
| F9 Muxes                   |     0 |     0 |      8820 |  0.00 |
+----------------------------+-------+-------+-----------+-------+

2. CLB Logic Distribution
-------------------------

+--------------------------------------------+-------+-------+-----------+-------+
|                  Site Type                 |  Used | Fixed | Available | Util% |
+--------------------------------------------+-------+-------+-----------+-------+
| CLB                                        |  8557 |     0 |      8820 | 97.02 |
|   CLBL                                     |  5047 |     0 |           |       |
|   CLBM                                     |  3510 |     0 |           |       |
| LUT as Logic                               | 49911 |     0 |     70560 | 70.74 |
|   using O5 output only                     |   479 |       |           |       |
|   using O6 output only                     | 38961 |       |           |       |
|   using O5 and O6                          | 10471 |       |           |       |
| LUT as Memory                              |  1564 |     0 |     28800 |  5.43 |
|   LUT as Distributed RAM                   |   521 |     0 |           |       |
|     using O5 output only                   |     0 |       |           |       |
|     using O6 output only                   |     1 |       |           |       |
|     using O5 and O6                        |   520 |       |           |       |
|   LUT as Shift Register                    |  1043 |     0 |           |       |
|     using O5 output only                   |     0 |       |           |       |
|     using O6 output only                   |   754 |       |           |       |
|     using O5 and O6                        |   289 |       |           |       |
| CLB Registers                              | 39618 |     0 |    141120 | 28.07 |
|   Register driven from within the CLB      | 24666 |       |           |       |
|   Register driven from outside the CLB     | 14952 |       |           |       |
|     LUT in front of the register is unused |  5646 |       |           |       |
|     LUT in front of the register is used   |  9306 |       |           |       |
| Unique Control Sets                        |   911 |       |     17640 |  5.16 |
+--------------------------------------------+-------+-------+-----------+-------+

3. BLOCKRAM
-----------

+-------------------+------+-------+-----------+--------+
|     Site Type     | Used | Fixed | Available |  Util% |
+-------------------+------+-------+-----------+--------+
| Block RAM Tile    |  216 |     0 |       216 | 100.00 |
|   RAMB36/FIFO*    |  109 |     0 |       216 |  50.46 |
|     RAMB36E2 only |  109 |       |           |        |
|   RAMB18          |  214 |     0 |       432 |  49.54 |
|     RAMB18E2 only |  214 |       |           |        |
+-------------------+------+-------+-----------+--------+

Note that the size of the bitstream does NOT mean anything here. Even if you synthesize a simple adder, the bitstream for the Ultra96v2 is around 5 MB, since it contains not only your synthesized design but also other hardware logic. The "0.03 MB" reported in our paper is the model size, i.e. the size of the parameters.
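As a back-of-the-envelope check (the parameter counts below are purely illustrative, not taken from the paper): binarized weights cost one bit each, so a few hundred thousand of them fit in roughly 0.03 MB.

```python
# Hypothetical counts, only to show the arithmetic behind "model size"
binary_params = 230_000   # 1-bit binarized conv weights
fp_params = 1_000         # remaining higher-precision params, say 16-bit
size_mb = (binary_params * 1 + fp_params * 16) / 8 / 1e6  # bits -> bytes -> MB
print(f"~{size_mb:.3f} MB")
```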

Also, we did not use any third-party libraries. The bitstream should be generated with pure Vivado and execute properly on the target device.

WinstonTNguyen commented Dec 17, 2022

Hello, sorry to bother you again.

Thank you for your previous replies; they have helped me tremendously.

I have another question, just for confirmation: was the thermo-encoded "conv_input_uint64.npy" used for evaluation on the Ultra96v2 encoded with resolution=4? The input images have shape (N, 3, 32, 32) with 64-bit elements. However, your paper states that resolution=8 is used for all experiments, which would produce 32-bit elements in the encoded images. Or did you use another method with resolution=8?

If so, would you kindly check my code? I have been stuck for a few days and haven't been able to reproduce your results on the CIFAR-10 dataset. Below is my thermometer-encoding code, which borrows many ideas from your implementation. My goal is to encode images sequentially, e.g. for a video feed.
Input: an image of shape (3, 32, 32), each element a float.
Output: an encoded image of shape (3, 32, 32), each element a uint64.
I ran this with a resolution of 4.

import numpy as np
import torch

# Parameters implied by the description above: (3, 32, 32) input, resolution 4
color, img_height, img_width = 3, 32, 32
resolution = 4
b = 256 // resolution  # 64 thermometer bits per pixel

placeholder = torch.ones(color, b, img_height, img_width, dtype=torch.float32)
placeholder *= torch.arange(b).view(1, -1, 1, 1)

encoded_str = np.empty((color, img_height, img_width), dtype=object)
encoded_uint64 = np.zeros((color, img_height, img_width), dtype=np.uint64)

def thermometerEncode(image):
    image_255 = (image * 255).view(color, 1, img_height, img_width).float()
    image_encoded = (placeholder < torch.round(image_255 / resolution)).float()
    bits = image_encoded.cpu().numpy().astype('float')
    coef = np.array([2**k for k in range(b)])
    thermo = np.einsum('ijkl,j', bits, coef).astype('float')
    for i in range(color):
        for j in range(img_height):
            for k in range(img_width):
                # note: np.int64 overflows once all 64 thermometer bits are set
                encoded_str[i, j, k] = np.binary_repr(np.int64(thermo[i, j, k])).ljust(b, '0')
                encoded_uint64[i, j, k] = int(encoded_str[i, j, k], 2)
    return encoded_uint64
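For reference, here is a vectorized variant (my own sketch, not from the FracBNN repo) that reproduces the MSB-first layout the binary_repr(...).ljust(...) loop above yields for contiguous thermometer codes, but uses integer shifts so values near 2**63 are never rounded through a float:

```python
import numpy as np
import torch

def thermometer_encode_uint64(image, resolution=4):
    """image: (3, 32, 32) float tensor in [0, 1]; returns (3, 32, 32) uint64."""
    b = 256 // resolution                     # 64 thermometer levels at resolution 4
    levels = torch.arange(b).view(1, -1, 1, 1).float()
    image_255 = (image * 255.0).unsqueeze(1)  # (3, 1, 32, 32)
    bits = (levels < torch.round(image_255 / resolution)).numpy().astype(np.uint64)
    # MSB-first packing: thermometer level k lands in bit (b - 1 - k), matching
    # the ljust-based layout above for contiguous codes
    coef = (np.uint64(1) << np.arange(b, dtype=np.uint64)[::-1]).reshape(1, b, 1, 1)
    return (bits * coef).sum(axis=1, dtype=np.uint64)
```

A fully saturated pixel (value 1.0) then packs to 2**64 - 1 exactly, which the float round-trip in the loop version cannot represent.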

Thank you so much in advance; you have been a lifesaver!
