



# Online-Cluster-Analysis on FPGAs for recovery of slow pions

Steffen Bähr



#### **D\* Decay in VXD**



- Particles with low momentum experience high stopping power
- D\* decays produce pions with low transverse momentum
  - Below 60 MeV, the majority of the pions are attributed to this decay



- PXD hits of slow pions will get lost since no RoI is built
- $p_t < 60 \text{ MeV}$ may already be insufficient to reach all SVD layers
- Recovery mechanism necessary

[Figure: PXD Layers 1 and 2 and SVD Layers 1 to 3, x-axis $\cos(\Theta)$]

## **Momentum Distribution of Slow Pions**

[Figure: momentum distribution of slow pions, x-axis $p_t$ / MeV]

Institute for Information Processing Technologies (ITIV)



#### **Charge of slow Pions with PXD**



Use the charge deposited in the PXD by tracks of different momenta to separate slow pions from minimum ionizing particles (MIPs)

[Figure: charge deposit for different momenta]


Institute for Experimental Nuclear Physics (IEKP)

#### Algorithm for Recovery of slow pions

- Usage of the NeuroBayes algorithm to predict slow pions from cluster features
- Training
  - Usage of simulated clusters
  - Conducted offline
- Prediction
  - Usage of detector data





#### **Cluster Data used for Recovery**



- Charge deposition of Clusters can be supplemented by additional Information
  - Additional information about the charge in a cluster
  - Spreading of clusters in the PXD
  - Features have different impacts on the result

| Cluster Feature      | Importance |
|----------------------|------------|
| Total Charge         | 1st        |
| Standard Deviation   | 8th        |
| Maximum Pixel Charge | 5th        |
| Minimum Pixel Charge | 3rd        |
| Length in Z          | 4th        |
| Length in Phi        | 6th        |
| Total Length         | 2nd        |
| Number of Pixels     | 9th        |
| PXD Layer            | 7th        |
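As an illustration of how such features could be derived from the pixels of one cluster, here is a minimal Python sketch. The `Pixel` type, its field names and the pixel charges are hypothetical, and the standard deviation is assumed to be the population standard deviation of the pixel charges. "Total Length" is left out because its definition is not given on the slides.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Pixel:
    phi: int     # pixel index along phi (hypothetical field name)
    z: int       # pixel index along z (hypothetical field name)
    charge: int  # pixel charge in ADC units


def cluster_features(pixels, layer):
    """Compute per-cluster features from a non-empty list of pixels."""
    charges = [p.charge for p in pixels]
    n = len(charges)
    mean = sum(charges) / n
    # Assumption: population standard deviation of the pixel charges.
    std = sqrt(sum((c - mean) ** 2 for c in charges) / n)
    z_values = [p.z for p in pixels]
    phi_values = [p.phi for p in pixels]
    return {
        "total_charge": sum(charges),
        "std_dev": std,
        "max_pixel_charge": max(charges),
        "min_pixel_charge": min(charges),
        # Extent of the cluster in pixels along each direction.
        "length_z": max(z_values) - min(z_values) + 1,
        "length_phi": max(phi_values) - min(phi_values) + 1,
        "n_pixels": n,
        "pxd_layer": layer,
    }


# Hypothetical 2x2 cluster on PXD layer 2 (charges invented for illustration).
cluster = [Pixel(10, 20, 135), Pixel(10, 21, 134),
           Pixel(11, 20, 133), Pixel(11, 21, 23)]
features = cluster_features(cluster, layer=2)
print(features)
```

All features except the standard deviation need only running sums, minima and maxima, which is what makes them attractive for a streaming implementation.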

#### Sample used for NeuroBayes Training



- The pion sample covers the transverse momentum range of pions not reaching the outer layers
- Background includes QED, Touschek and Coulomb




#### **NeuroBayes Performance**



- NeuroBayes achieves good signal efficiency and background rejection ratios for pions with $p_t < 65$ MeV




#### Using the NeuroBayes at the PXD



- Algorithm needs to handle the data rates
  - Usage of FPGAs with guaranteed throughput
- Needs to be placed close to the readout of the PXD
  - FPGAs of the DHH have free resources





### **Challenges for the Implementation on FPGA**

#### Performance

- Interface is clocked with 200 MHz
- A throughput of 1 cluster per clock cycle must be achievable

#### Resources

- 30 % of the CLBs available
- 50 % of the DSPs available

#### Quality

- Output of the FPGA implementation must not differ too much from the software



#### **Cluster Processing Pipeline**

- Incoming clusters pass through a processing pipeline
  - Protocol handling for the interface to the DHH's clustering
  - Computation of cluster features for NeuroBayes
  - NeuroBayes Expert makes the decision and passes it to the output interface



#### **Overview of NeuroBayes on FPGA**



- The NeuroBayes Expert consists of 4 major components
  - Binning of the cluster features
  - Preprocessing using the CDF
  - Zero-iteration performed as a vector multiplication
  - Cut on the output
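The four components can be sketched in Python as follows. This is only an illustration of the data flow, not the NeuroBayes implementation: the bin edges, per-bin lookup values (standing in for the CDF-based preprocessing), weights and cut value are all hypothetical, and only two features are used to keep it short.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Binning: index of the bin that `value` falls into (0 .. len(edges))."""
    return bisect_right(edges, value)


def expert(features, bin_edges, bin_tables, weights, cut):
    """Return (score, decision) for one cluster.

    bin_tables[i][b] is the preprocessed value for feature i in bin b,
    i.e. a lookup table that replaces evaluating the CDF at run time.
    """
    transformed = [
        table[bin_index(value, edges)]
        for value, edges, table in zip(features, bin_edges, bin_tables)
    ]
    # Zero-iteration: a single vector multiplication (dot product).
    score = sum(w * x for w, x in zip(weights, transformed))
    # Cut on the output decides whether the cluster is kept.
    return score, score > cut


# Hypothetical parameters for two features (total charge, number of pixels).
bin_edges = [[100, 200, 400], [2, 4]]
bin_tables = [[-0.75, -0.25, 0.25, 0.75], [-0.5, 0.0, 0.5]]
weights = [0.8, -0.4]

score, is_slow_pion = expert([425, 4], bin_edges, bin_tables, weights, cut=0.2)
print(score, is_slow_pion)
```

Replacing the preprocessing by a per-bin lookup table is consistent with the later remark that binning and preprocessing mostly consume look-up tables.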



#### **Conversion to Fixed Point**



- Original NeuroBayes algorithm uses floating point
  - Floating point on an FPGA requires more resources and increases latency
  - Conversion to fixed point is more efficient, but impacts the quality
- Quality of a fixed-point implementation's output depends on the bit width used
  - Higher bit width means higher resource consumption, but also higher quality
  - DSPs have 25\*18 bit inputs
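The trade-off between bit width and quality can be illustrated with a small Python sketch that quantizes one value to signed fixed point at several widths. The split into 1 sign bit, 3 integer bits and the remaining fractional bits is an assumption made for this example, as is the test value.

```python
def to_fixed(x, total_bits, frac_bits):
    """Quantize x to a signed two's-complement integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


x = 0.2274  # arbitrary example value
errors = {}
for total_bits in (8, 18, 25):
    frac_bits = total_bits - 4  # assumed: 1 sign bit + 3 integer bits
    q = to_fixed(x, total_bits, frac_bits)
    errors[total_bits] = abs(to_float(q, frac_bits) - x)
    print(f"{total_bits:2d} bits: error = {errors[total_bits]:.3e}")
```

Each additional fractional bit halves the worst-case rounding error, which is half of the least significant bit.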

#### **Evaluation of Fixed-Point Implementation**



- A bit width of 25\*25 results in no difference in output compared to the floating-point implementation
- Difference is small for 25\*18
  - Most resource-efficient configuration for multiplication
  - Difference is $1.5 \times 10^{-5}$
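A comparison of this kind can be sketched in Python by running the same dot product once in floating point and once with 25-bit and 18-bit integer operands. The weights, inputs and binary point positions below are hypothetical, so the printed difference is not expected to reproduce the number quoted above.

```python
def quantize(x, total_bits, frac_bits):
    """Round x to a signed fixed-point integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def fixed_dot(weights, inputs, w_bits=25, x_bits=18, w_frac=21, x_frac=14):
    """Dot product with 25-bit weights and 18-bit inputs (assumed formats)."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += quantize(w, w_bits, w_frac) * quantize(x, x_bits, x_frac)
    # The sum must fit a 48-bit signed accumulator.
    if not -(1 << 47) <= acc < (1 << 47):
        raise OverflowError("accumulator exceeds 48 bits")
    return acc / (1 << (w_frac + x_frac))


# Hypothetical weights and preprocessed inputs for nine features.
weights = [0.8, -0.4, 0.3, -0.1, 0.25, 0.05, -0.6, 0.15, 0.2]
inputs = [0.75, 0.5, -0.25, 0.1, -0.9, 0.33, 0.0, 0.6, -0.45]

float_result = sum(w * x for w, x in zip(weights, inputs))
fixed_result = fixed_dot(weights, inputs)
difference = abs(fixed_result - float_result)
print(f"float = {float_result:.8f}, fixed = {fixed_result:.8f}, "
      f"difference = {difference:.3e}")
```

Keeping the products as plain integers and scaling only once at the end mirrors how a multiply-accumulate chain avoids intermediate rounding.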



#### **Evaluation of Performance**



- Computation of the algorithm takes several clock cycles
  - Without further measures, the performance requirements cannot be met
- Usage of pipelining throughout the implementation
- Fixed-point vector multiplication is the critical path
  - High frequency through usage of cascaded DSPs

| Component               | Latency   | Frequency |
|-------------------------|-----------|-----------|
| Cluster Feature         | 1 cycle   | 446 MHz   |
| Preprocessing & Binning | 1 cycle   | 446 MHz   |
| Vector multiplication   | 9 cycles  | 350 MHz   |
| Cut                     | 1 cycle   | 446 MHz   |
| Total                   | 12 cycles | 350 MHz   |
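The totals in the table follow from simple arithmetic, shown here as a short Python calculation. It assumes a fully pipelined design that accepts one cluster per clock cycle and uses the 200 MHz interface clock from the requirements as the reference.

```python
# Latency (cycles) and maximum clock frequency (Hz) per stage, from the table.
stages = {
    "Cluster Feature": (1, 446e6),
    "Preprocessing & Binning": (1, 446e6),
    "Vector multiplication": (9, 350e6),
    "Cut": (1, 446e6),
}

total_cycles = sum(cycles for cycles, _ in stages.values())
# The slowest stage limits the clock of the whole pipeline.
pipeline_clock = min(freq for _, freq in stages.values())

latency_ns = total_cycles / pipeline_clock * 1e9
# Assumption: fully pipelined, so one cluster completes per clock cycle.
throughput = pipeline_clock
required = 200e6  # interface clock, 1 cluster per cycle

print(f"total latency: {total_cycles} cycles = {latency_ns:.1f} ns")
print(f"throughput: {throughput / 1e6:.0f} M clusters/s "
      f"(required: {required / 1e6:.0f} M clusters/s)")
```

Pipelining leaves the latency of a single cluster at 12 cycles but lets a new cluster enter every cycle, which is why the throughput is set by the clock frequency rather than by the latency.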

#### **Resource Consumption**



- Binning and preprocessing dominate the CLB usage of the implementation
  - Mostly usage of look-up tables
- Usage of DSPs for the vector multiplication makes life easier
  - 50 % of DSPs are still available
  - Each multiplication needs one DSP

| Resource | Demand  | Constraint |
|----------|---------|------------|
| DSP      | 3.125 % | < 50 %     |
| CLB      | ~ 2 %   | < 30 %     |

02.10.2014, Online-Cluster-Analysis


### Simulation

Simulation waveform at 130.100 ns (time axis 0 ns to 150 ns); signal names that did not survive extraction are marked as such:

| Name                 | Value                        |
|----------------------|------------------------------|
| clk                  | 1                            |
| rst_n                | 1                            |
| (name not recovered) | 0001101                      |
| (name not recovered) | 0000000000100000000000000000 |
| (name not recovered) | 00000000000000011101000111   |
| res[17:0]            | 000000111010001111           |
| douta[56:0]          | 0000000001000000000000000000 |
| (name not recovered) | 000100000111101011,00000110  |
| cluster_data[0:8     | 111000110000111111,0001010   |
| result[47:0]         | 00000000000000011101000111   |
| casc[7:0][47:0]      | 111111111111111010001011111  |
| (name not recovered) | 111111111111111010001011111  |
| (name not recovered) | 11111111111000111011011111   |
| (name not recovered) | 00000000000000010001100001   |
| (name not recovered) | 1111111111100001100011101    |
| (name not recovered) | 111111111111110001111110000  |
| [2][47:0]            | 11111111111111011000000011   |
| [1][47:0]            | 11111111111111100110110000   |
| (name not recovered) | 11111111111111110100000101   |
| data_o[0:8][17:0     | 111000110000111111,0001010   |

Result of software: 000000111010001110010001 -> slow pion cluster

Format: (S)DDD,DDDDDDDDDDDDDD
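To read such a bit string as a number, the following Python sketch decodes the 18-bit `res[17:0]` value from the simulation. It assumes that the format (S)DDD,DDDDDDDDDDDDDD means one sign bit, three integer bits and 14 fractional bits in two's complement; that reading is an interpretation of the slide, not something it states. The 24-bit software result is not decoded here because the position of its binary point is not given.

```python
def decode_fixed(bits, frac_bits=14):
    """Decode a two's-complement bit string with `frac_bits` fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":
        # Sign bit set: subtract 2**width to get the negative value.
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)


fpga_result = decode_fixed("000000111010001111")  # res[17:0] from the waveform
print(f"FPGA output: {fpga_result:.6f}")
```

Under these assumptions the FPGA output corresponds to 3727 / 2^14, roughly 0.2275.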

| Cluster Feature      | Value |
|----------------------|-------|
| Total Charge         | 425   |
| Standard Deviation   | 49    |
| Maximum Pixel Charge | 135   |
| Minimum Pixel Charge | 23    |
| Length in Z          | 2     |
| Length in Phi        | 2     |
| Total Length         | 1     |
| Number of Pixels     | 4     |
| PXD Layer            | 2     |



#### Summary



- Recover slow pions using FPGAs near the PXD
- FPGA implementation is sufficient
  - Quality: difference < $1.5 \times 10^{-5}$
  - Throughput: 350 million clusters per second
  - Resources: ~2 % CLBs, ~3 % DSPs
- Integration into DHH



# Demo
