Neural Style Transfer with Latte Art, Part 1: Layer Depth

Reading Time: 5 minutes


Neural style transfer (hereafter NST) describes the use of convolutional neural networks to re-render the content of one image in the style of another image.

A few years ago, neural style transfer had a fun fad moment when folks uploaded their pictures into apps like Prisma and Pikazo to create their own “paintings.” If you want to try it with some of your own pictures, I recommend trying out this online tool.

In an effort to understand how NST worked, I built an NST network from scratch so I could make incremental changes and see how my results changed.

This post doesn’t explain how or why these changes happen the way they do, so I don’t recommend starting here for a primer on neural style transfer. Instead, I recommend the original paper on neural style transfer or this excellent retelling I found on Medium for a lay audience. Both explain why some of the modifications I made produce the results they produce. They run light on the visual examples, though, so you can consider this series a collection of supplementary visual examples.

I experimented in three areas:

  1. Depth of the layers used to train the style portion of the network (this post)
  2. Method by which correlation is calculated for the style matrix (future post)
  3. The style cost function (future post)

1. Depth of layers for style training

Here are the images we’ll use for the depth-of-layers experiments. For style, we have latte art. For content, we have the logo of the University of Chicago. It’s supposed to be a phoenix. It’s not a “maroon chicken.” I found this out the hard, embarrassing way.


Earlier layers of convolutional neural networks contain activations for lower-level, simpler features of the images: color fields, for example, or very simple patterns. Later layers consolidate these earlier results to activate for more complex features—first stripes or dots, for example, all the way up to tessellations, faces, or distinctive object shapes.

The NST network constructs a representation of the style image from a few of these layers. The original paper describes a few iterations of choosing layers from different depths for this process and concludes that the style results look best when using some shallower and some deeper layers for the style representation. I tried a few sets of layers in my network to see how the results would differ.
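At each chosen layer, the style representation is typically the Gram matrix of that layer’s activations: a channel-by-channel correlation that captures which features co-occur while discarding where they occur. Here’s a minimal NumPy sketch of that computation (the activation shape and values are illustrative, not taken from my network):

```python
import numpy as np

def gram_matrix(activations):
    """Channel correlations of a conv layer's output.
    activations: array of shape (height, width, channels).
    Returns a (channels, channels) Gram matrix."""
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c)  # one row per spatial position
    return flat.T @ flat                  # correlate channels across positions

# Example: a tiny fake activation map with 3 channels
acts = np.random.rand(4, 4, 3)
G = gram_matrix(acts)
print(G.shape)  # (3, 3)
```

Because spatial positions are summed out, two images with similar textures but different layouts produce similar Gram matrices, which is exactly what we want from a style representation.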

A. Style layers evenly weighted from early to late

[('conv1_1', 0.2),
 ('conv2_1', 0.2),
 ('conv3_1', 0.2),
 ('conv4_1', 0.2),
 ('conv5_1', 0.2)]

[Image: phoenix rendered with evenly weighted style layers]

I don’t think I’d be convinced from this picture that someone managed to render the UChicago phoenix in latte art, but it’s a start. We can see a few different colors in there, and possibly some striation. Let’s see what happens if we weight the early style layers more heavily.
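The weight lists in these experiments plug into the style cost as a weighted sum of per-layer Gram-matrix mismatches. Here’s a hedged sketch of that idea; the normalization constant and layer names follow common VGG-based implementations, and the activation dictionaries below are stand-ins for real network outputs:

```python
import numpy as np

def gram(a):
    """Normalized Gram matrix of an (h, w, c) activation map."""
    h, w, c = a.shape
    f = a.reshape(h * w, c)
    return f.T @ f / (h * w * c)  # normalization varies by implementation

def style_cost(style_acts, generated_acts, layer_weights):
    """Weighted sum of squared Gram-matrix differences.
    style_acts / generated_acts: dicts mapping layer name -> activations.
    layer_weights: list of (layer_name, weight) pairs."""
    total = 0.0
    for name, w in layer_weights:
        if w == 0.0:
            continue  # zero-weight layers contribute nothing
        diff = gram(style_acts[name]) - gram(generated_acts[name])
        total += w * np.sum(diff ** 2)
    return total

# Illustrative data: random "activations" standing in for VGG outputs
weights = [('conv1_1', 0.2), ('conv2_1', 0.2), ('conv3_1', 0.2),
           ('conv4_1', 0.2), ('conv5_1', 0.2)]
rng = np.random.default_rng(0)
style = {name: rng.random((8, 8, 4)) for name, _ in weights}
gen = {name: rng.random((8, 8, 4)) for name, _ in weights}
print(style_cost(style, gen, weights))
```

Changing the experiments below amounts to changing `layer_weights` while holding everything else fixed.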

B. Upweighting early style layers

[('conv1_1', 0.8),
 ('conv2_1', 0.2),
 ('conv3_1', 0.0),
 ('conv4_1', 0.0),
 ('conv5_1', 0.0)]

[Image: phoenix with early style layers upweighted]

The difference from the first representation is clear. The image is flatter: it incorporates three colors and much less texture than the previous image, because early layers don’t yet contain that kind of information. It’s also worth noting that this version is a bit more true to the content image than the one above.

C. Upweighting later style layers

[('conv1_1', 0.0),
 ('conv2_1', 0.0),
 ('conv3_1', 0.0),
 ('conv4_1', 0.2),
 ('conv5_1', 0.8)]

[Image: phoenix with later style layers upweighted]

This one includes more of the texture we would expect to see from later style layers. We see more variation in the colors and textures, and we also see a figure that’s a bit less true to the content image than what we had before.

Now that I’ve done this, I don’t think latte art was the best image to use to demonstrate style transfer. Something with more color and texture would better capture the process of applying a style over several layers.


We have a pictorial representation of what earlier and later style layers look like in the implementation of a neural style transfer algorithm. Although I think a more varied style image might make this clearer, I still think that these images offer a visual aid for understanding the difference.

In the next post of the series, we’ll talk about correlation calculation methods: namely, what does the Gram matrix do? Why do we use it? What happens if we use something else?

If you liked this piece, you might also like:

Our case study on fitting classification models for physician data (It will have six parts, and I hope you love all of them)

Tracing our way through the scipy CSR sparse matrix (It’s like a detective novel! But shorter, and with more code)

The time we used numbers to determine if millennials’ spending habits will hurt their investment returns (spoiler alert: doesn’t look like it!)
