CANBUS Troubleshooting
This document provides information on troubleshooting communication issues when using Kalico with CAN bus.
Verify CAN bus wiring
The first step in troubleshooting communication issues is to verify the CAN bus wiring.
Be sure there are exactly two 120 Ohm terminating resistors on the CAN bus. If the resistors are not properly installed then messages may not be able to be sent at all or the connection may have sporadic instability.
The CANH and CANL bus wiring should be twisted around each other. At a minimum, the wiring should have a twist every few centimeters. Avoid twisting the CANH and CANL wiring around power wires, and ensure that power wires that run parallel to the CANH and CANL wires do not have the same number of twists.
Verify that all plugs and wire crimps on the CAN bus wiring are fully secured. Movement of the printer toolhead may jostle the CAN bus wiring causing a bad wire crimp or unsecured plug to result in intermittent communication errors.
Check for incrementing bytes_invalid counter
The Kalico log file will report a Stats line once a second when the printer is active. These "Stats" lines will have a bytes_invalid counter for each micro-controller. This counter should not increment during normal printer operation (it is normal for the counter to be non-zero after a RESTART and it is not a concern if the counter increments once a month or so). If this counter increments on a CAN bus micro-controller during normal printing (it increments every few hours or more frequently) then it is an indication of a severe problem.
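One way to watch for this is to scan the log and report every rise in bytes_invalid per micro-controller. The following is a minimal sketch and is not part of Kalico. It assumes that each Stats line begins with the word Stats followed by a timestamp, that the fields for each micro-controller follow a `name:` token, and that the counter appears as `bytes_invalid=N`. The function names are hypothetical.

```python
def bytes_invalid_by_mcu(stats_line):
    """Return {mcu_name: bytes_invalid} for one log line (assumed layout)."""
    tokens = stats_line.split()
    if len(tokens) < 2 or tokens[0] != "Stats":
        return {}
    section = None
    counters = {}
    # tokens[1] is the timestamp; the remaining tokens are fields
    for token in tokens[2:]:
        if token.endswith(":") and "=" not in token:
            # A bare "name:" token starts a new micro-controller section
            section = token[:-1]
        elif token.startswith("bytes_invalid=") and section is not None:
            counters[section] = int(token.split("=", 1)[1])
    return counters


def find_increments(lines):
    """Return (mcu, previous, current) each time a counter rises."""
    last = {}
    events = []
    for line in lines:
        for mcu, value in bytes_invalid_by_mcu(line).items():
            if mcu in last and value > last[mcu]:
                events.append((mcu, last[mcu], value))
            last[mcu] = value
    return events
```

A rise that spans a RESTART is reported as well, so each reported event should be checked against the surrounding log before it is treated as a problem.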
Incrementing bytes_invalid on a CAN bus connection is a symptom of reordered messages on the CAN bus. There are two known causes of reordered messages:
- Old versions of the popular candlight_firmware for USB CAN adapters had a bug that could cause reordered messages. If using a USB CAN adapter running this firmware then make sure to update to the latest firmware if incrementing bytes_invalid is observed.
- Some Linux kernel builds for embedded devices have been known to reorder CAN bus messages. It may be necessary to use an alternative Linux kernel or to use alternative hardware that supports mainstream Linux kernels that do not exhibit this problem.
Reordered messages are a severe problem that must be fixed. They result in unstable behavior and can lead to confusing errors at any point in a print.
Use an appropriate txqueuelen setting
The Kalico code uses the Linux kernel to manage CAN bus traffic. By default, the kernel will only queue 10 CAN transmit packets. It is recommended to configure the can0 device with a txqueuelen 128 to increase that size.
If Kalico transmits a packet and Linux has filled all of its transmit queue space then Linux will drop that packet and messages like the following will appear in the Kalico log:
Got error -1 in can write: (105)No buffer space available
Kalico will automatically retransmit the lost messages as part of its normal application level message retransmit system. Thus, this log message is a warning and it does not indicate an unrecoverable error.
If a complete CAN bus failure occurs (such as a CAN wire break) then Linux will not be able to transmit any messages on the CAN bus and it is common to find the above message in the Kalico log. In this case, the log message is a symptom of a larger problem (the inability to transmit any messages) and is not directly related to Linux txqueuelen.
One may check the current queue size by running the Linux command ip link show can0. It should report a bunch of text including the snippet qlen 128. If one sees something like qlen 10 then it indicates the CAN device has not been properly configured.
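The same check can be scripted. This is a minimal sketch that assumes the `ip` command is on the PATH and that its output contains the text `qlen` followed by a number, as described above. The function names are hypothetical.

```python
import re
import subprocess


def parse_qlen(ip_link_output):
    """Return the qlen value found in `ip link show` output, or None."""
    match = re.search(r"\bqlen\s+(\d+)", ip_link_output)
    return int(match.group(1)) if match else None


def current_qlen(interface="can0"):
    """Run `ip link show <interface>` and return the reported qlen."""
    result = subprocess.run(
        ["ip", "link", "show", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_qlen(result.stdout)
```

Calling `current_qlen("can0")` on the host and comparing the result with 128 gives the same answer as reading the command output by eye.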
It is not recommended to use a txqueuelen significantly larger than 128. A CAN bus running at a frequency of 1000000 will typically take around 120us to transmit a CAN packet. Thus a queue of 128 packets is likely to take around 15-20ms to drain. A substantially larger queue could cause excessive spikes in message round-trip-time which could lead to unrecoverable errors. Said another way, Kalico's application retransmit system is more robust if it does not have to wait for Linux to drain an excessively large queue of possibly stale data. This is analogous to the problem of bufferbloat on internet routers.
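The drain-time estimate is simple arithmetic: queue length multiplied by the time to transmit one packet. The sketch below uses the 120us per-packet figure quoted above as its default. The function name is hypothetical, and real drain times also depend on other traffic on the bus.

```python
def drain_time_ms(queue_len, packet_time_us=120.0):
    """Estimate how long a full transmit queue takes to drain, in ms."""
    return queue_len * packet_time_us / 1000.0


for qlen in (10, 128, 1024):
    print(f"txqueuelen {qlen}: about {drain_time_ms(qlen):.1f} ms to drain")
```

With these inputs a queue of 128 packets works out to roughly 15 ms, the lower end of the 15-20ms range, while a queue of 1024 packets would take well over 100 ms.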
Under normal circumstances Kalico may utilize ~25 queue slots per MCU - typically only utilizing more slots during retransmits. (Specifically, the Kalico host may transmit up to 192 bytes to each Kalico MCU before receiving an acknowledgment from that MCU.) If a single CAN bus has 5 or more Kalico MCUs on it, then it might be necessary to increase the txqueuelen above the recommended value of 128. However, as above, care should be taken when selecting a new value to avoid excessive round-trip-time latency.
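The per-MCU slot count follows from the 192 byte window. The sketch below assumes classic CAN frames carrying at most 8 data bytes each, and a worst case in which every MCU has its full window queued at the same time. The names are hypothetical.

```python
import math

# Assumption: classic CAN frames, which carry at most 8 data bytes
CAN_PAYLOAD_BYTES = 8


def queue_slots_needed(mcu_count, window_bytes=192):
    """Worst-case transmit queue slots with a full window per MCU."""
    packets_per_mcu = math.ceil(window_bytes / CAN_PAYLOAD_BYTES)
    return mcu_count * packets_per_mcu


for count in (1, 5, 6):
    print(f"{count} MCU(s): up to {queue_slots_needed(count)} queue slots")
```

Under these assumptions one MCU needs 24 slots, five MCUs need 120 (close to the 128 limit), and six MCUs need 144, which is why a bus with many MCUs may call for a larger txqueuelen.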
Obtaining candump logs
The CAN bus messages sent to and from the micro-controller are handled by the Linux kernel. It is possible to capture these messages from the kernel for debugging purposes. A log of these messages may be of use in diagnostics.
The Linux can-utils tool provides the capture software. It is typically installed on a machine by running:
sudo apt-get update && sudo apt-get install can-utils
Once installed, one may obtain a capture of all CAN bus messages on an interface with the following command:
candump -tz -Ddex can0,#FFFFFFFF > mycanlog
One can view the resulting log file (mycanlog in the example above) to see each raw CAN bus message that was sent and received by Kalico. Understanding the content of these messages will likely require low-level knowledge of Kalico's CANBUS protocol and Kalico's MCU commands.
Parsing Kalico messages in a candump log
One may use the parsecandump.py tool to parse the low-level Kalico micro-controller messages contained in a candump log. Using this tool is an advanced topic that requires knowledge of Kalico MCU commands. For example:
./scripts/parsecandump.py mycanlog 108 ./out/klipper.dict
This tool produces output similar to the parsedump tool. See the documentation for that tool for information on generating the Kalico micro-controller data dictionary.
In the above example, 108 is the CAN bus id. It is a hexadecimal number. The id 108 is assigned by Kalico to the first micro-controller. If the CAN bus has multiple micro-controllers on it, then the second micro-controller would be 10a, the third would be 10c, and so on.
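The id pattern can be computed rather than counted by hand. This sketch simply extrapolates the sequence given above (108, 10a, 10c), assuming ids continue in steps of two in the order the micro-controllers were assigned. The names are hypothetical.

```python
FIRST_MCU_CAN_ID = 0x108  # id of the first micro-controller, per the text


def can_id_for_mcu(index):
    """Return the hexadecimal CAN bus id string for the MCU at `index`.

    index 0 is the first micro-controller, 1 the second, and so on.
    """
    if index < 0:
        raise ValueError("index must be zero or greater")
    return format(FIRST_MCU_CAN_ID + 2 * index, "x")


for index in range(4):
    print(f"micro-controller {index + 1}: {can_id_for_mcu(index)}")
```

The returned string is in the form expected as the second argument in the parsecandump.py example above.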
The candump log must be produced using the -tz -Ddex command-line arguments (for example: candump -tz -Ddex can0,#FFFFFFFF) in order to use the parsecandump.py tool.
Using a logic analyzer on the canbus wiring
The Sigrok Pulseview software along with a low-cost logic analyzer can be useful for diagnosing CAN bus signaling. This is an advanced topic likely only of interest to experts.
One can often find "USB logic analyzers" for under $15 (US pricing as of 2023). These devices are often listed as "Saleae logic clones" or as "24MHz 8 channel USB logic analyzers".
The above picture was taken while using Pulseview with a "Saleae clone" logic analyzer. The Sigrok and Pulseview software was installed on a desktop machine (also install the "fx2lafw" firmware if that is packaged separately). The CH0 pin on the logic analyzer was routed to the CAN Rx line, the CH1 pin was wired to the CAN Tx pin, and GND was wired to GND. Pulseview was configured to only display the D0 and D1 lines (red "probe" icon center top toolbar). The number of samples was set to 5 million (top toolbar) and the sample rate was set to 24MHz (top toolbar). The CAN decoder was added (yellow and green "bubble icon" right top toolbar). The D0 channel was labeled as RX and set to trigger on a falling edge (click on black D0 label at left). The D1 channel was labeled as TX (click on brown D1 label at left). The CAN decoder was configured for 1Mbit rate (click on green CAN label at left). The CAN decoder was moved to the top of the display (click and drag green CAN label). Finally, the capture was started (click "Run" at top left) and a packet was transmitted on the CAN bus (cansend can0 123#121212121212).
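The capture settings above imply a particular timing resolution and capture length. The sketch below works both out from the sample rate, bit rate, and sample count given in the text. The function names are hypothetical.

```python
def samples_per_bit(sample_rate_hz, bit_rate):
    """Number of logic analyzer samples taken during one CAN bit."""
    return sample_rate_hz / bit_rate


def capture_duration_ms(sample_count, sample_rate_hz):
    """Length of a capture in milliseconds."""
    return sample_count / sample_rate_hz * 1000.0


print(samples_per_bit(24_000_000, 1_000_000))
print(round(capture_duration_ms(5_000_000, 24_000_000), 1))
```

At 24MHz each bit of a 1Mbit bus is covered by 24 samples, and 5 million samples span a little over 200 ms, so the test packet needs to be sent shortly after the capture is started.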
The logic analyzer provides an independent tool for capturing packets and verifying bit timing.