– Hello? Okay, it’s after 12, so

I want to get started. So today, lecture eight,

we’re going to talk about deep learning software. This is a super exciting

topic because it changes a lot every year. But also means it’s a lot

of work to give this lecture ’cause it changes a lot every year. But as usual, a couple

administrative notes before we dive into the material. So as a reminder the

project proposals for your course projects were due on Tuesday. So hopefully you all turned that in, and hopefully you all

have a somewhat good idea of what kind of projects

you want to work on for the class. So we’re in the process of

assigning TAs to projects based on the project area and the expertise of the TAs. So we’ll have some more

information about that in the next couple days I think. We’re also in the process

of grading assignment one, so stay tuned and we’ll get

those grades back to you as soon as we can. Another reminder is that

assignment two has been out for a while. That’s going to be due next week,

a week from today, Thursday. And again, when working on assignment two, remember to stop your

Google Cloud instances when you’re not working to

try to preserve your credits. And another bit of

confusion, I just wanted to re-emphasize is that for

assignment two you really only need to use GPU instances

for the last notebook. For all of the other

notebooks it’s just Python and Numpy, so you don’t need

any GPUs for those questions. So again, conserve your credits, only use GPUs when you need them. And the final reminder is

that the midterm is coming up. It’s kind of hard to

believe we’re there already, but the midterm will be in

class on Tuesday, May 9. So the midterm will be more theoretical. It’ll be sort of pen and paper

working through different kinds of, slightly more

theoretical questions to check your understanding

of the material that we’ve covered so far. And I think we’ll probably

post at least a short sort of sample of the types of

questions to expect. Question? [student’s words obscured

due to lack of microphone] Oh yeah, question is

whether it’s open-book, so we’re going to say

closed note, closed book. Yeah, that’s what

we’ve done in the past, just closed note,

closed book. We really just want to check

that you understand the intuition behind most of

the stuff we’ve presented. So, a quick recap as a reminder

of what we were talking about last time. Last time we talked about

fancier optimization algorithms for deep learning models

including SGD Momentum, Nesterov, RMSProp and Adam. And we saw that these

relatively small tweaks on top of vanilla SGD, are

relatively easy to implement but can make your networks

converge a bit faster. We also talked about regularization, especially dropout. So remember dropout, you’re

kind of randomly setting parts of the network to zero

during the forward pass, and then you kind of

marginalize out over that noise at test time. And we saw that this was

kind of a general pattern across many different

types of regularization in deep learning, where

you might add some kind of noise during training,

but then marginalize out that noise at test time

so it’s not stochastic at test time. We also talked about

transfer learning where you can maybe download big

networks that were pre-trained on some dataset and then

fine tune them for your own problem. And this is one way that you

can attack a lot of problems in deep learning, even

if you don’t have a huge dataset of your own. So today we’re going to

shift gears a little bit and talk about some of the nuts and bolts about writing software and

how the hardware works. And a little bit, diving

into a lot of details about what the software

looks like that you actually use to train these things in practice. So we’ll talk a little

bit about CPUs and GPUs and then we’ll talk about

several of the major deep learning frameworks

that are out there in use these days. So first, we’ve sort of

mentioned this off hand a bunch of different times, that computers have CPUs,

computers have GPUs. Deep learning uses GPUs,

but we weren’t really too explicit up to this

point about what exactly these things are and

why one might be better than another for different tasks. So, who’s built a computer before? Just kind of show of hands. So, maybe about a third

of you, half of you, somewhere around that ballpark. So this is a shot of my computer at home that I built. And you can see that there’s

a lot of stuff going on inside the computer,

maybe, hopefully you know what most of these parts are. And the CPU is the

Central Processing Unit. That’s this little chip

hidden under this cooling fan right here near the top of the case. And the CPU is actually

a relatively small piece. It’s a relatively small

thing inside the case. It’s not taking up a lot of space. And the GPUs are these

two big monster things that are taking up a

gigantic amount of space in the case. They have their own cooling, they draw a lot of power. They’re quite large. So, just in terms of how

much power they’re using, in terms of how big they

are, the GPUs are kind of physically imposing and

taking up a lot of space in the case. So the question is what are these things and why are they so

important for deep learning? Well, the GPU is called a graphics card, or Graphics Processing Unit. And these were really developed,

originally for rendering computer graphics, and

especially around games and that sort of thing. So another show of hands,

who plays video games at home sometimes, from time to

time on their computer? Yeah, so again, maybe

about half, good fraction. So for those of you who’ve

played video games before and who’ve built your own computers, you probably have your own

opinions on this debate. [laughs] So this is one of those big

debates in computer science. You know, there’s like Intel versus AMD, NVIDIA versus AMD for graphics cards. It’s up there with Vim

versus Emacs for text editors. And pretty much any gamer

has their own opinions on which of these two sides they prefer for their own cards. And in deep learning we

kind of have mostly picked one side of this fight, and that’s NVIDIA. So if you guys have AMD cards, you might be in a little

bit more trouble if you want to use those for deep learning. And really, NVIDIA’s been

pushing a lot for deep learning in the last several years. It’s been kind of a large focus

of some of their strategy. And they put in a lot

effort into engineering sort of good solutions

to make their hardware better suited for deep learning. So most people in deep learning

when we talk about GPUs, we’re pretty much exclusively

talking about NVIDIA GPUs. Maybe in the future this’ll

change a little bit, and there might be new players coming up, but at least for now

NVIDIA is pretty dominant. So to give you an idea of

like what is the difference between a CPU and a GPU,

I’ve kind of made a little spreadsheet here. On the top we have two of

the kind of top end Intel consumer CPUs, and on

the bottom we have two of NVIDIA’s sort of current

top end consumer GPUs. And there’s a couple general

trends to notice here. Both GPUs and CPUs are

kind of a general purpose computing machine where

they can execute programs and do sort of arbitrary instructions, but they’re qualitatively

pretty different. So CPUs tend to have just a few cores, for consumer desktop CPUs these days, they might have something like four or six or maybe up to 10 cores. With hyperthreading technology

that means they can run, the hardware can physically

run, like maybe eight or up to 20 threads concurrently. So the CPU can maybe do 20

things in parallel at once. So that’s just not a gigantic number, but those threads for a

CPU are pretty powerful. They can actually do a lot of things, they’re very fast. Every CPU instruction can

actually do quite a lot of stuff. And they can all work

pretty independently. For GPUs it’s a little bit different. So for GPUs we see that

these sort of common top end consumer GPUs have thousands of cores. So the NVIDIA Titan XP

which is the current top of the line consumer

GPU has 3840 cores. So that’s a crazy number. That’s like way more than

the 10 cores that you’ll get for a similarly priced CPU. The downside of a GPU is

that each of those cores, one, it runs at a much slower clock speed. And two they really

can’t do quite as much. You can’t really compare

CPU cores and GPU cores apples to apples. The GPU cores can’t really

operate very independently. They all kind of need to work together and sort of parallelize one

task across many cores rather than each core

totally doing its own thing. So you can’t really compare

these numbers directly. But it should give you the sense that due to the large number of

cores, GPUs are really good for

parallel things where you need to do a lot of things

all at the same time, but those things are all

pretty much the same flavor. Another thing to point

out between CPUs and GPUs is this idea of memory. Right, so CPUs have some cache on the CPU, but that’s relatively

small and the majority of the memory for your

CPU is pulling from your system memory, the RAM,

which will maybe be like eight, 12, 16, 32 gigabytes

of RAM on a typical consumer desktop these days. Whereas GPUs actually

have their own RAM built into the chip. There’s a pretty large

bottleneck communicating between the RAM in your

system and the GPU, so the GPUs typically have their own relatively large block of

memory within the card itself. And for the Titan XP, which

again is maybe the current top of the line consumer card, this thing has 12 gigabytes

of memory local to the GPU. GPUs also have their own caching system where there are sort of

multiple hierarchies of caching between the 12 gigabytes of GPU memory and the actual GPU cores. And that’s somewhat similar

to the caching hierarchy that you might see in a CPU. So, CPUs are kind of good for

general purpose processing. They can do a lot of different things. And GPUs are maybe more

specialized for these highly parallelizable algorithms. So the prototypical algorithm

of something that works really really well and

is like perfectly suited to a GPU is matrix multiplication. So remember in matrix

multiplication on the left we’ve got like a matrix

composed of a bunch of rows. We multiply that on the right

by another matrix composed of a bunch of columns

and then this produces another, a final matrix

where each element in the output matrix is a dot product

between one of the rows and one of the columns of

the two input matrices. And these dot products

are all independent. Like you could imagine,

for this output matrix you could split it up completely and have each of those different elements of the output matrix all

being computed in parallel and they all sort of are

running the same computation which is taking a dot

product of these two vectors. But exactly where they’re

reading that data from is from different places

in the two input matrices. So you could imagine that

for a GPU you can just like blast this out and

have all of this elements of the output matrix

all computed in parallel and that could make this thing

compute super, super fast on a GPU.
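
[Editor’s sketch — a tiny Numpy illustration of that independence; this is a reconstruction, not code from the slides:]

```python
import numpy as np

A = np.random.randn(4, 3)   # left matrix: a bunch of rows
B = np.random.randn(3, 5)   # right matrix: a bunch of columns
C = np.empty((4, 5))

# Each output element is an independent dot product of one row of A
# with one column of B, so every iteration here could run in parallel.
for i in range(4):
    for j in range(5):
        C[i, j] = A[i, :].dot(B[:, j])

assert np.allclose(C, A.dot(B))   # matches the library matmul
```

So that’s kind of the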

prototypical type of problem where a GPU

is really well suited, where a CPU might have

to go in and step through sequentially and compute

each of these elements one by one. That picture is a little

bit of a caricature because CPUs these days have multiple cores, they can do vectorized

instructions as well, but still, for these like

massively parallel problems GPUs tend to have much better throughput. Especially when these matrices

get really really big. And by the way, convolution

is kind of the same kind of story. Where you know in convolution

we have this input tensor, we have this weight tensor

and then every point in the output tensor after a

convolution is again some inner product between some part of the weights and some part of the input. And you can imagine that a

GPU could really parallelize this computation, split it

all up across the many cores and compute it very quickly. So that’s kind of the

general flavor of the types of problems where GPUs give

you a huge speed advantage over CPUs. So you can actually write

programs that run directly on GPUs. So NVIDIA has this CUDA

abstraction that lets you write code that kind of looks like C, but executes directly on the GPUs. But CUDA code is really really tricky. It’s actually really tough

to write CUDA code that’s performant and actually

squeezes all the juice out of these GPUs. You have to be very careful

managing the memory hierarchy and making sure you

don’t have cache misses and branch mispredictions

and all that sort of stuff. So it’s actually really really

hard to write performant CUDA code on your own. So as a result NVIDIA has

released a lot of libraries that implement common

computational primitives that are very very highly

optimized for GPUs. So for example NVIDIA has a

cuBLAS library that implements different kinds of matrix multiplications and different matrix operations

that are super optimized, run really well on GPU,

get very close to sort of theoretical peak hardware utilization. Similarly they have a cuDNN

library which implements things like convolution,

forward and backward passes, batch normalization, recurrent networks, all these kinds of

computational primitives that we need in deep learning. NVIDIA has gone in there and

released their own binaries that compute these

primitives very efficiently on NVIDIA hardware. So in practice, you tend not

to end up writing your own CUDA code for deep learning. You typically are just

mostly calling into existing code that other people have written,

much of which has been heavily optimized by NVIDIA already.
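
[Editor’s sketch — in a framework like PyTorch the cuDNN calls happen under the hood; this illustration is not from the slides:]

```python
import torch

# Convolutions dispatch to cuDNN automatically on NVIDIA GPUs.
print(torch.backends.cudnn.enabled)   # True when cuDNN is available

# Optionally let cuDNN benchmark its convolution algorithms and
# pick the fastest one for your particular layer sizes.
torch.backends.cudnn.benchmark = True
```

There’s another sort of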

language called OpenCL which is a bit more general. Runs on more than just NVIDIA GPUs, can run on AMD hardware, can run on CPUs, but OpenCL, nobody’s really

spent a really large amount of effort and energy trying

to get optimized deep learning primitives for OpenCL, so

it tends to be a lot less performant than the super

optimized versions in CUDA. So maybe in the future we

might see a bit of a more open standard and we might see

this across many different more types of platforms,

but at least for now, NVIDIA’s kind of the main game

in town for deep learning. So you can check, there’s a

lot of different resources for learning about how you can

do GPU programming yourself. It’s kind of fun. It’s sort of a different

paradigm of writing code because it’s this massively

parallel architecture, but that’s a bit beyond

the scope of this course. And again, you don’t really

need to write your own CUDA code much in practice

for deep learning. And in fact, I’ve never

written my own CUDA code for any research project. But it is kind of useful

to know like how it works and what are the basic

ideas even if you’re not writing it yourself. So if you want to look at

kind of CPU GPU performance in practice, I did some

benchmarks last summer comparing a decent Intel CPU against a bunch of different

GPUs that were sort of near top of the line at that time. And these were my own

benchmarks that you can find more details on GitHub,

but my findings were that for things like VGG 16 and

19, ResNets, various ResNets, then you typically see

something like a 65 to 75 times speed up when running the

exact same computation on a top of the line GPU, in

this case a Pascal Titan X, versus a top of the line,

well, not quite top of the line CPU, which in this case

was an Intel E5 processor. Although, I’d like to make

one sort of caveat here is that you always need

to be super careful whenever you’re reading

any kind of benchmarks about deep learning, because

it’s super easy to be unfair between different things. And you kind of need to know

a lot of the details about what exactly is being

benchmarked in order to know whether or not the comparison is fair. So in this case I’ll come

right out and tell you that probably this comparison

is a little bit unfair to CPU because I didn’t

spend a lot of effort trying to squeeze the maximal performance out of CPUs. I probably could have tuned

the BLAS libraries better for the CPU performance. And I probably could

have gotten these numbers a bit better. This was sort of out

of the box performance between just installing

Torch and running it on a CPU, versus just installing Torch and running it on a GPU. So this is kind of out

of the box performance, but it’s not really like

peak, possible, theoretical throughput on the CPU. But that being said, I

think there are still pretty substantial speed ups to be had here. Another kind of interesting

outcome from this benchmarking was comparing these

optimized cuDNN libraries from NVIDIA for convolution

and whatnot versus sort of more naive CUDA

that had been hand written out in the open source community. And you can see that if you

compare the same networks on the same hardware with

the same deep learning framework and the only

difference is swapping out these cuDNN versus sort of

hand written, less optimized CUDA you can see something

like nearly a three X speed up across the board when you

switch from the relatively simple CUDA to these like

super optimized cuDNN implementations. So in general, whenever

you’re writing code on GPU, you should probably almost

always like just make sure you’re using cuDNN because

you’re leaving probably a three X performance boost

on the table if you’re not calling into cuDNN for your stuff. So another problem that

comes up in practice, when you’re training these things is that you know, your model is

maybe sitting on the GPU, the weights of the model

are in that 12 gigabytes of local storage on the

GPU, but your big dataset is sitting over on the

right on a hard drive or an SSD or something like that. So if you’re not careful

you can actually bottleneck your training by just

trying to read the data off the disk. ‘Cause the GPU is super

fast, it can compute forward and backward quite

fast, but if you’re reading sequentially off a spinning

disk, you can actually bottleneck your training quite badly, and that can be really

bad and slow you down. So some solutions here

are that like you know if your dataset’s really

small, sometimes you might just read the whole dataset into RAM. Or even if your dataset isn’t so small, but you have a giant

server with a ton of RAM, you might do that anyway. You can also make sure

you’re using an SSD instead of a hard drive, that can help

a lot with read throughput. Another common strategy

is to use multiple threads on the CPU that are

pre-fetching data off disk, buffering it

in RAM, so that then you can continue

feeding that buffered data down to the GPU with good performance. This is a little bit painful to set up, but again like, these

GPU’s are so fast that if you’re not really

careful with trying to feed them data as quickly as possible, just reading the data

can sometimes bottleneck the whole training process. So that’s something to be aware of. So that’s kind of the

brief introduction to like sort of GPU CPU hardware

in practice when it comes to deep learning. And then I wanted to

switch gears a little bit and talk about the

software side of things. The various deep learning

frameworks that people are using in practice. But I guess before I move on, is there any sort of

questions about CPU GPU? Yeah, question? [student’s words obscured

due to lack of microphone] Yeah, so the question

is what can you sort of, what can you do mechanically

when you’re coding to avoid these problems? Probably the biggest thing

you can do in software is set up sort of pre-fetching on the CPU. Like,

sort of a naive thing would be you have this

sequential process where you first read data off

disk, wait for the data, wait for the minibatch to be read, then feed the minibatch to the GPU, then go forward and backward on the GPU, then read another minibatch

and sort of do this all in sequence. Instead, you might have multiple

CPU threads running in the background that are

fetching data off the disk, so that you can sort of interleave

all of these things. Like the GPU is computing, the CPU background threads

are feeding data off disk, and your main thread is just

doing a bit of synchronization

between these things so they’re all happening in parallel. And thankfully if you’re using

some of these deep learning frameworks that we’re about to talk about, then some of this work has

already been done for you ’cause it’s a little bit painful.
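
[Editor’s sketch — one way to get this prefetching for free, using PyTorch’s DataLoader; the dataset here is made up for illustration:]

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A made-up in-memory dataset, just for illustration.
dataset = TensorDataset(torch.randn(10000, 3, 32, 32),
                        torch.randint(0, 10, (10000,)))

# num_workers > 0 spawns background worker processes that read and
# prepare the next minibatches while the GPU is busy computing.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    # each minibatch arrives already buffered in RAM;
    # from here you would ship it to the GPU with x.cuda(), y.cuda()
    pass
```

So the landscape of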

deep learning frameworks is super fast moving. So last year when I gave

this lecture I talked mostly about Caffe, Torch, Theano and TensorFlow. And when I last gave this talk,

again more than a year ago, TensorFlow was relatively new. It had not seen super widespread

adoption yet at that time. But now I think in the

last year TensorFlow has gotten much more popular. It’s probably the main framework

of choice for many people. So that’s a big change. We’ve also seen a ton of new frameworks sort of popping up like

mushrooms in the last year. So in particular Caffe2 and

PyTorch are new frameworks from Facebook that I think

are pretty interesting. There’s also a ton of other frameworks. Paddle, Baidu has Paddle,

Microsoft has CNTK, Amazon is mostly using

MXNet and there’s a ton of other frameworks as well,

but I’m less familiar with, and really don’t have time to get into. But one interesting thing to

point out from this picture is that kind of the first

generation of deep learning frameworks that really saw wide adoption were built in academia. So Caffe was from Berkeley,

Torch was developed originally at NYU and also in

collaboration with Facebook. And Theano was mostly built

at the University of Montreal. But these kind of next

generation deep learning frameworks all originated in industry. So Caffe2 is from Facebook,

PyTorch is from Facebook. TensorFlow is from Google. So it’s kind of an interesting

shift that we’ve seen in the landscape over

the last couple of years is that these ideas

have really moved a lot from academia into industry. And now industry is kind of

giving us these big powerful nice frameworks to work with. So today I wanted to

mostly talk about PyTorch and TensorFlow ’cause I

personally think that those are probably the ones you

should be focusing on for a lot of research type

problems these days. I’ll also talk a bit

about Caffe and Caffe2. But probably a little bit

less emphasis on those. And before we move any farther,

I thought I should make my own biases a little bit more explicit. So I have mostly, I’ve

worked with Torch mostly for the last several years. And I’ve used it quite

a lot, I like it a lot. And then in the last year I’ve

mostly switched to PyTorch as my main research framework. So I have a little bit

less experience with some of these others, especially TensorFlow, but I’ll still try to do

my best to give you a fair picture and a decent

overview of these things. So, remember that in the

last several lectures we’ve hammered this idea

of computational graphs in sort of over and over. That whenever you’re doing deep learning, you want to think about building

some computational graph that computes whatever function

that you want to compute. So in the case of a linear

classifier you’ll combine your data X and your weights

W with a matrix multiply. You’ll do some kind of

hinge loss to maybe have, compute your loss. You’ll have some regularization term and you imagine stitching

together all these different operations into some graph structure. Remember that these graph

structures can get pretty complex in the case of a big neural net, now there’s many different layers, many different activations. Many different weights

spread all around in a pretty complex graph. And as you move to things

like neural turing machines then you can get these really

crazy computational graphs that you can’t even really

draw because they’re so big and messy. So the point of deep learning

frameworks is really, there’s really kind of three

main reasons why you might want to use one of these

deep learning frameworks rather than just writing your own code. So the first would be that

these frameworks enable you to easily build and

work with these big hairy computational graphs

without kind of worrying about a lot of those

bookkeeping details yourself. Another major idea is that, whenever we’re working in deep learning we always need to compute gradients. We’re always computing some loss, we’re always computing the

gradient of the loss with respect to our weights. And we’d like the framework to

compute these gradients automatically; you don’t want to have to

write that code yourself. You want that framework to

handle all these back propagation details for you so you

can just think about writing down the forward

pass of your network and have the backward pass

sort of come out for free without any additional work. And finally you want all

this stuff to run efficiently on GPUs so you don’t have to

worry too much about these low level hardware details

about cuBLAS and cuDNN and CUDA and moving data

between the CPU and GPU memory. You kind of want all those messy

details to be taken care of for you. So those are kind of

some of the major reasons why you might choose to

use frameworks rather than writing your own stuff from scratch. So as kind of a concrete

example of a computational graph we can maybe write down

this super simple thing. Where we have three inputs, X, Y, and Z. We’re going to combine

X and Y to produce A. Then we’re going to combine

A and Z to produce B and then finally we’re going

to do some maybe summing out operation on B to give

some scaler final result C. So you’ve probably written

enough Numpy code at this point to realize that it’s

super easy to implement this computational graph, or rather this

bit of computation in Numpy, right? You can just kind of write

down in Numpy that you want to generate some random data, you

want to multiply two things, you want to add two things, you

want to sum out a couple things. And it’s really easy to do this in Numpy.
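
[Editor’s sketch — roughly the Numpy code being described, including the hand-written backward pass that comes up next; a reconstruction, not the exact slide:]

```python
import numpy as np

np.random.seed(0)
N, D = 3, 4

# Generate some random data.
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward pass: multiply two things, add two things, sum it out.
a = x * y
b = a + z
c = np.sum(b)

# Backward pass, written out by hand.
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x
```

But then the question is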

like suppose that we want to compute the gradient of C

with respect to X, Y, and Z. So, if you’re working in Numpy,

you kind of need to write out this backward pass yourself. And you’ve gotten a lot of

practice with this on the homeworks, but it can be kind of a pain and a little bit annoying

and messy once you get to really big complicated things. The other problem with

Numpy is that it doesn’t run on the GPU. So Numpy is definitely CPU only. And you’re never going

to be able to experience or take advantage of these

GPU accelerated speedups if you’re stuck working in Numpy. And it’s, again, it’s a

pain to have to compute your own gradients in

all these situations. So, kind of the goal of most

deep learning frameworks these days is to let you

write code in the forward pass that looks very similar to Numpy, but lets you run it on the GPU and lets you automatically

compute gradients. And that’s kind of the big

picture goal of most of these frameworks. So if we look

at an example in TensorFlow of the exact

same computational graph, we now see that in this forward pass, you write this code that ends

up looking very very similar to the Numpy forward pass

where you’re kind of doing these multiplication and

these addition operations. But now TensorFlow has

this magic line that just computes all the gradients for you. So now you don’t have go in and

write your own backward pass and that’s much more convenient. The other nice thing about

TensorFlow is you can really just, like with one line you

can switch all this computation between CPU and GPU. So here, if you just

add this with statement before you’re doing this forward pass, you just can explicitly

tell the framework, hey I want to run this code on the CPU. But now if we just change that

with statement a little bit, with just a one

character change in this case, changing that C to a G,

now the code runs on GPU. And now in this little code snippet, we’ve solved these two problems. We’re running our code on the GPU and we’re having the framework

compute all the gradients for us, so that’s really nice.
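
[Editor’s sketch — the same graph in TensorFlow 1.x-era code; a reconstruction, not the exact slide:]

```python
import numpy as np
import tensorflow as tf

N, D = 3, 4

# Build the graph: placeholders are entry points for data.
with tf.device('/cpu:0'):          # change to '/gpu:0' to run on the GPU
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    z = tf.placeholder(tf.float32)

    a = x * y
    b = a + z
    c = tf.reduce_sum(b)

# The magic line: ask TensorFlow to add gradient nodes to the graph.
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

# Run the graph with concrete Numpy values.
with tf.Session() as sess:
    values = {x: np.random.randn(N, D),
              y: np.random.randn(N, D),
              z: np.random.randn(N, D)}
    out = sess.run([c, grad_x, grad_y, grad_z], feed_dict=values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out
```

And PyTorch looks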

almost exactly the same. So again, in PyTorch

you kind of write down, you define some variables, you have some forward pass

and the forward pass again looks very similar to like,

in this case identical to the Numpy code. And then again, you can

just use PyTorch to compute gradients, all your

gradients with just one line. And now in PyTorch again,

it’s really easy to switch to GPU, you just need to

cast all your stuff to the CUDA data type before

you run your computation, and now everything runs

transparently on the GPU for you.
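
[Editor’s sketch — the same computation in lecture-era PyTorch; a reconstruction, not the exact slide:]

```python
import torch
from torch.autograd import Variable

N, D = 3, 4

# Wrap tensors in Variables to build a graph; requires_grad asks
# PyTorch to compute gradients with respect to them. To run on the
# GPU, wrap torch.cuda.FloatTensor data instead.
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

# The forward pass looks just like the Numpy version.
a = x * y
b = a + z
c = torch.sum(b)

c.backward()          # one line computes all the gradients
print(x.grad.data)    # the gradients live in .grad on each input
```

So if you just look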

at these three examples, these three snippets of code side by side, the Numpy, the TensorFlow and the PyTorch you see that the TensorFlow

and the PyTorch code in the forward pass looks

almost exactly like Numpy which is great ’cause

Numpy has a beautiful API, it’s really easy to work with. But we can compute gradients automatically and we can run the GPU automatically. So after that kind of introduction, I wanted to dive in and

talk in a little bit more detail about kind of

what’s going on inside this TensorFlow example. So as a running example throughout

the rest of the lecture, I’m going to use training

a two-layer fully connected ReLU network on random data

as the example. And we’re going to train this

thing with an L2 Euclidean loss on random data. So this is kind of a silly

network, it’s not really doing anything useful, but it’s relatively small and self-contained, the code fits on the slide

without being too small, and it lets you demonstrate

kind of a lot of the useful ideas inside these frameworks. And one other note:

I’m assuming

that Numpy and TensorFlow have already been imported

in all these code snippets. So in TensorFlow you would

typically divide your computation into two major stages. First, we’re going to write

some code that defines our computational graph,

and that’s this red code up in the top half. And then after you define your graph, you’re going to run the

graph over and over again and actually feed data into the graph to perform whatever computation

you want it to perform. So this is the really,

this is kind of the big common pattern in TensorFlow. You’ll first have a bunch of

code that builds the graph and then you’ll go and

run the graph and reuse it many many times. So if you kind of dive

into the code of building the graph in this case. Up at the top you see that

we’re defining this X, Y, w1 and w2, and we’re creating

these tf.placeholder objects. So these are going to be

input nodes to the graph. These are going to be sort

of entry points to the graph where when we run the graph,

we’re going to feed in data and put them in through

these input slots in our computational graph. So this is not actually

like allocating any memory right now. We’re just sort of setting

up these input slots to the graph. Then we’re going to use those

input slots which are now kind of like these symbolic variables and we’re going to perform

different TensorFlow operations on these symbolic variables

in order to set up what computation we want

to run on those variables. So in this case we’re doing

a matrix multiplication between X and w1, we’re

doing some tf.maximum to do a ReLU nonlinearity and

then we’re doing another matrix multiplication to

compute our output predictions. And then we’re again using

a sort of basic Tensor operations to compute

our Euclidean distance, our L2 loss between our

prediction and the target Y. Another thing to point out here is that these lines of code are not

actually computing anything. There’s no data in the system right now. We’re just building up this

computational graph data structure telling

TensorFlow which operations we want to eventually run

once we put in real data. So this is just building the graph, this is not actually doing anything. Then we have this magical line

where after we’ve computed our loss with these symbolic operations, then we can just ask TensorFlow to compute the gradient of the loss

with respect to w1 and w2 in this one magical, beautiful line. And this avoids you writing

all your own backprop code that you had to do in the assignments. But again there’s no actual

computation happening here. This is just sort of

adding extra operations to the computational graph

where now the computational graph has these additional

operations which will end up computing these gradients for you. So now at this point we’ve

computed our computational graph, we have this big graph

in this graph data structure in memory that knows what

operations we want to perform to compute the loss in gradients. And now we enter a TensorFlow

session to actually run this graph and feed it with data. So then, once we’ve entered the session, then we actually need to

construct some concrete values that will be fed to the graph. So TensorFlow just expects

to receive data from Numpy arrays in most cases. So here we’re just creating

concrete actual values for X, Y, w1 and w2 using

Numpy and then storing these in some dictionary. And now here is where we’re

actually running the graph. So you can see that we’re

calling a session.run to actually execute

some part of the graph. The first argument tells

TensorFlow which parts of the graph we actually want as output. In this case we need to

tell it that we actually want to compute loss, grad w1 and grad w2, and we need to pass in, with

this feed_dict parameter, the actual concrete values

that will be fed to the graph. And then after, in this one line, it’s going and running the

graph, computing those values for loss, grad w1 and grad w2, and then returning the

actual concrete values for those in Numpy arrays again. So now after you unpack this

output in the second line, you

get Numpy arrays with the loss and the gradients. So then you can go and

do whatever you want with these values. So then, this has only run sort

of one forward and backward pass through our graph, and it only takes a couple

extra lines if we actually want to train the network. So now we’re

running the graph many times in a for loop, and in each iteration of the loop we’re calling session.run,

asking it to compute the loss and the gradients. And now we’re doing a

manual gradient descent step using those computed gradients

to now update our current values of the weights. So if you actually run this

code and plot the losses, then you’ll see that the loss goes down and the network is training and

this is working pretty well. So this is kind of like a

super bare bones example of training a fully connected

network in TensorFlow.
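
[Editor’s sketch — the bare-bones two-layer network in TensorFlow 1.x-era code; a reconstruction, not the exact slide:]

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Define the graph: placeholders for data, labels, and (for now) weights.
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0.0)          # ReLU nonlinearity
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff * diff, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

with tf.Session() as sess:
    # Concrete Numpy values to feed into the graph.
    values = {x: np.random.randn(N, D),
              y: np.random.randn(N, D),
              w1: np.random.randn(D, H),
              w2: np.random.randn(H, D)}
    learning_rate = 1e-5
    for t in range(50):
        out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
        loss_val, grad_w1_val, grad_w2_val = out
        # Manual gradient descent step, outside the graph.
        values[w1] -= learning_rate * grad_w1_val
        values[w2] -= learning_rate * grad_w2_val
```

But there’s a problem here. So here, remember that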

on the forward pass, every time we execute this graph, we’re actually feeding in the weights. We have the weights as Numpy arrays and we’re explicitly

feeding them into the graph. And now when the graph finishes executing it’s going to give us these gradients. And remember the gradients

are the same size as the weights. So this means that every time

we’re running the graph here, we’re copying the weights

from Numpy arrays into TensorFlow then getting the gradients and then copying the

gradients from TensorFlow back out to Numpy arrays. So if you’re just running on CPU, this is maybe not a huge deal, but remember we talked

about CPU GPU bottleneck and how it’s very expensive

actually to copy data between CPU memory and GPU memory. So if your network is very

large and your weights and gradients were very big, then doing something like

this would be super expensive and super slow because we’d

be copying all kinds of data back and forth between the

CPU and the GPU at every time step. So that’s bad, we don’t want to do that. We need to fix that. So, obviously TensorFlow

has some solution to this. And the idea is that

now we want our weights, w1 and w2, rather than being

placeholders, where we expect to

feed them into the network on every forward pass, instead

we define them as variables. So a variable

is a value that lives inside the computational graph

and it’s going to persist inside the computational

graph across different times when you run the same graph. So now instead of declaring

these w1 and w2 as placeholders, instead we just construct

them as variables. But now since they live inside the graph, we also need to tell

TensorFlow how they should be initialized, right? Because in the previous

case we were feeding in their values from outside the graph, so we initialized them in Numpy, but now because these things

live inside the graph, TensorFlow is responsible

for initializing them. So we need to pass in a

tf.random_normal operation, which again is not

actually initializing them when we run this line, this

is just telling TensorFlow how we want them to be initialized. So it’s a little bit of

confusing indirection going on here. And now, remember in the previous example

the weights outside of the computational graph. We, in the previous example,

we were computing the gradients and then using them to update

the weights as Numpy arrays and then feeding in the

updated weights at the next time step. But now because we want

these weights to live inside the graph, this operation

of updating the weights needs to also be an operation inside the computational graph. So now we used this assign

function which mutates these variables inside

the computational graph and now the mutated value will

persist across multiple runs of the same graph. So now when we run this graph and when we train the network, now we need to run the graph

once with a little bit of special incantation to tell

TensorFlow to set up these variables that are going

to live inside the graph. And then once we’ve done

that initialization, now we can run the graph

over and over again. And here, we’re now only

feeding in the data and labels X and Y and the weights are

living inside the graph. And here we’ve asked TensorFlow to

compute the loss for us. And then you might think that

this would train the network, but there’s actually a bug here. So, if you actually run this code, and you plot the loss, it doesn’t train. So that’s bad, it’s confusing,

like what’s going on? We wrote this assign

code, we ran the thing, like we computed the

loss and the gradients and our loss is flat, what’s going on? Any ideas? [student’s words obscured

due to lack of microphone] Yeah so one hypothesis is

that maybe we’re accidentally re-initializing the w’s

every time we call the graph. That’s a good hypothesis,

that’s actually not the problem in this case. [student’s words obscured

due to lack of microphone] Yeah, so the answer is that

we actually need to explicitly tell TensorFlow that we

want to run these new w1 and new w2 operations. So we’ve built up this big

computational graph data structure in memory and

now when we call run, we only told TensorFlow that

we wanted to compute loss. And if you look at the

dependencies among these different operations inside the graph, you see that in order to compute loss we don’t actually need to

perform this update operation. So TensorFlow is smart and

it only computes the parts of the graph that are necessary

for computing the output that you asked it to compute. So that’s kind of a nice thing

because it means it’s only doing as much work as it needs to, but in situations like this it

can be a little bit confusing and lead to behavior

that you didn’t expect. So the solution in this case

is that we actually need to explicitly tell TensorFlow

to perform those update operations. So one thing we could do,

which is what was suggested is we could add new w1

and new w2 as outputs and just tell TensorFlow

that we want to produce these values as outputs. But that’s a problem

too because the values, those new w1, new w2 values

are again these big tensors. So now if we tell TensorFlow

we want those as output, we’re going to again get

this copying behavior between CPU and GPU at every iteration. So that’s bad, we don’t want that. So there’s a little

trick you can do instead. Which is that we add kind of

a dummy node to the graph. With these fake data dependencies and we just say that

this dummy node updates, has these data dependencies

of new w1 and new w2. And now when we actually run the graph, we tell it to compute both

the loss and this dummy node. And this dummy node

doesn’t actually return any value it just returns

none, but because of this dependency that we’ve put

into the node it ensures that when we run the updates value, we actually also run

these update operations.
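
[Editor’s sketch — the same network with the weights living inside the graph; a reconstruction, not the exact slide:]

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# Weights are now Variables that persist inside the graph; we hand
# TensorFlow an initializer instead of feeding in values.
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))

h = tf.maximum(tf.matmul(x, w1), 0.0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff * diff, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# The update steps are operations inside the graph too.
learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
updates = tf.group(new_w1, new_w2)   # dummy node depending on both assigns

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # set up the Variables
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        # Asking for `updates` forces the assign ops to actually run.
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```

So, question? [student’s words obscured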

due to lack of microphone] Is there a reason why we didn’t

put X and Y into the graph? And that it stayed as Numpy. So in this example we’re

reusing X and Y on every, we’re reusing the same X

and Y on every iteration. So you’re right, we could

have just also stuck those in the graph, but in a

more realistic scenario, X and Y will be minibatches

of data so those will actually change at every iteration

and we will want to feed different values for

those at every iteration. So in this case, they could

have stayed in the graph, but in most cases they will change, so we don’t want them

to live in the graph. Oh, another question? [student’s words obscured

due to lack of microphone] Yeah, so we’ve told it,

we had put into TensorFlow that the outputs we want

are loss and updates. Updates is not actually a real value. So when updates evaluates

it just returns none. But because of this dependency

we’ve told it that updates depends on these assign operations. But these assign operations live inside the computational graph and

all live inside GPU memory. So then we’re doing

these update operations entirely on the GPU and

we’re no longer copying the updated values back out of the graph. [student’s words obscured

due to lack of microphone] So the question is does

tf.group return none? So this gets into the

trickiness of TensorFlow. So tf.group returns some

crazy TensorFlow value. It sort of returns some like

internal TensorFlow node operation that we can use to

continue building the graph. But when you execute the graph, inside the session.run, when we tell it we want it

to compute the concrete value from updates, then that returns none. So whenever you’re working with TensorFlow you have this funny indirection

between building the graph and running it: the output value

while building the graph is some funny weird object,

and then you actually get a concrete value when you run the graph. So here after you run updates,

then the output is none. Does that clear it up a little bit? [student’s words obscured

due to lack of microphone] So the question is why is loss a value and why is updates none? That’s just the way that updates works. So loss is a tensor, and when we tell TensorFlow

we want to run a tensor, then we get the concrete value. Updates is this kind of

special other data type that does not return a value,

it instead returns none. So it’s kind of some TensorFlow

magic that’s going on there. Maybe we can talk offline

if you’re still confused. [student’s words obscured

due to lack of microphone] Yeah, yeah, that behavior is

coming from the group method. So now, we kind of have

this weird pattern where we wanted to do these

different assign operations, we have to use this funny tf.group thing. That’s kind of a pain, so

thankfully TensorFlow gives you some convenience

operations that kind of do that kind of stuff for you. And that’s called an optimizer. So here we’re using a

tf.train.GradientDescentOptimizer and we’re telling it what

learning rate we want to use. And you can imagine that

there’s, there’s RMSprop, there’s all kinds of different

optimization algorithms here. And now we call optimizer.minimize(loss), and this is a pretty magical thing, because now this call is

aware that these variables w1 and w2 are marked as

trainable by default, so then internally, inside

this optimizer.minimize it’s going in and adding

nodes to the graph which will compute gradient

of loss with respect to w1 and w2 and then it’s

also performing that update operation for you and it’s

doing the grouping operation for you and it’s doing the assigns. It’s like doing a lot of

magical stuff inside there. But then it ends up giving

you this magical updates value which, if you dig through the

code they’re actually using tf.group so it looks very

similar internally to what we saw before. And now when we run the

graph inside our loop we do the same pattern of

telling it to compute loss and updates. And every time we tell the

graph to compute updates, then it’ll actually go

and update the weights.
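
[Editor’s sketch — continuing the previous sketch, with the optimizer doing the gradient and update bookkeeping; a reconstruction, not the exact slide:]

```python
# x, y, w1, w2, and loss defined exactly as in the previous sketch.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)
updates = optimizer.minimize(loss)   # adds gradient, assign, and group nodes

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```

Question? [student’s words obscured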

due to lack of microphone] Yeah, so what is the

tf.global_variables_initializer? So that’s initializing w1

and w2 because these are variables which live inside the graph. So we need to, when we

saw this, when we create the tf.variable we have

this tf.random_normal which is its initialization, so the tf.global_variables_initializer

is causing the tf.random_normal to actually run

and generate concrete values to initialize those variables. [student’s words obscured

due to lack of microphone] Sorry, what was the question? [student’s words obscured

due to lack of microphone] So it knows that a

placeholder is going to be fed outside of the graph and a

variable is something that lives inside the graph. So I don’t know all the

details about how it decides, what exactly it decides

to run with that call. I think you’d need to dig

through the code to figure that out, or maybe it’s

documented somewhere. So now again we’ve got this full

example of training a network in TensorFlow

and we’re kind of adding bells and whistles to make it

a little bit more convenient. So we can also here,

in the previous example we were computing the loss

explicitly using our own tensor operations. In TensorFlow

you can always do that, you can use basic tensor

operations to compute just about anything you want. But TensorFlow also gives

you a bunch of convenience functions that compute these

common neural network things for you. So in this case we can use

tf.losses.mean_squared_error and it just does the L2

loss for us so we don’t have to compute it ourselves in terms

of basic tensor operations. So another kind of weirdness

here is that it was kind of annoying that we had to

explicitly define our inputs and define our weights and

then like chain them together in the forward pass

using a matrix multiply. And in this example we’ve

actually not put biases in the layer because that

would be extra work: we’d have to initialize biases,

biases against the output of the matrix multiply

and you can see that that would kind of be a lot of code. It would be kind of annoying write. And once you get to like convolutions and batch normalizations

and other types of layers this kind of basic way of working, of having these variables,

having these inputs and outputs and combining them all together with basic computational graph operations

could be a little bit unwieldy and it could

be really annoying to make sure you initialize

the weights with the right shapes and all that sort of stuff. So as a result, there’s a

bunch of sort of higher level libraries that wrap around TensorFlow and handle some of these details for you. So one example that ships with TensorFlow, is this tf.layers inside. So now in this code example

you can see that our code is only explicitly

declaring the X and the Y which are the placeholders

for the data and the labels. And now we say that H=tf.layers.dense, we give it the input X

and we tell it units=H. This is again kind of a magical line because inside this line,

it’s kind of setting up w1 and b1, the bias, it’s

setting up variables for those with the right shapes that

are kind of inside the graph but a little bit hidden from us. And it’s using this

xavier initializer object to set up an initialization

strategy for those. So before we were doing

that explicitly ourselves with the tf.randomnormal business, but now here it’s kind of

handling some of those details for us and it’s just spitting out an H, which is again the same

sort of H that we saw in the previous layer, it’s

just doing some of those details for us. And you can see here,

we’re also passing an activation=tf.nn.relu so it’s

even doing the activation, the relu activation function

inside this layer for us. So it’s taking care of a

lot of these architectural details for us.
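
[Editor’s sketch — the same network with tf.layers managing the weights and biases; a reconstruction, not the exact slide:]

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Only the data and labels are declared explicitly now.
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# tf.layers sets up the weight and bias Variables, with the right
# shapes and a Xavier initialization strategy, behind the scenes.
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu,
                    kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

# Convenience function instead of hand-written tensor ops for the loss.
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e0)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```

Question? [student’s words obscured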

due to lack of microphone] Question is does the

xavier initializer default to particular distribution? I’m sure it has some default,

I’m not sure what it is. I think you’ll have to

look at the documentation. But it seems to be a

reasonable strategy, I guess. And in fact if you run this code, it converges much faster

than the previous one because the initialization is better. And you can see that

we’re using two calls to tf.layers and this lets us build our model without doing all these

explicit bookkeeping details ourself. So this is maybe a little

bit more convenient. But tf.layers is really

not the only game in town. There’s like a lot of different

higher level libraries that people build on top of TensorFlow. And it’s kind of due to this

basic impedance mismatch where the computational graph

is a relatively low level thing, but when we’re working

with neural networks we have this concept of layers and weights and some layers have weights

associated with them, and we typically think at

a slightly higher level of abstraction than this

raw computational graph. So that’s what these various

packages are trying to help you out and let you

work at this higher layer of abstraction. So another very popular

package that you may have seen before is Keras. Keras is a very beautiful,

nice API that sits on top of TensorFlow and handles

sort of building up these computational graph for

you up in the back end. By the way, Keras also

supports Theano as a back end, so that’s also kind of nice. And in this example you

can see we build the model as a sequence of layers. We build some optimizer object and we call model.compile

and this does a lot of magic in the back end to build the graph. And now we can call model.fit

and that does the whole training procedure for us magically.
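
[Editor’s sketch — the same model in Keras, written against the Keras 2-era API; a reconstruction, not the exact slide:]

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Build the model as a sequence of layers.
model = Sequential()
model.add(Dense(H, input_dim=D))
model.add(Activation('relu'))
model.add(Dense(D))

# compile builds the underlying graph; fit runs the whole training loop.
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e0))

x = np.random.randn(N, D)
y = np.random.randn(N, D)
history = model.fit(x, y, epochs=50, batch_size=N, verbose=0)
```

So I don’t know all the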

details of how this works, but I know Keras is very popular, so you might consider using

it if you’re working with TensorFlow. Question? [student’s words obscured

due to lack of microphone] Yeah, so the question is

like why there’s no explicit CPU, GPU going on here. So I’ve kind of left that

out to keep the code clean. But you saw in the earlier examples it was pretty easy to

flip all these things between CPU and GPU, and there

was either some global flag or some different data type or some with statement and

it’s usually relatively simple and just about one line

to swap in each case. But exactly what that line looks like differs a bit depending on the situation. So there’s actually like

this whole large set of higher level TensorFlow

wrappers that you might see out there in the wild. And it seems that like

even people within Google can’t really agree on which

one is the right one to use. So Keras and TFLearn are

third party libraries that are out there on the

internet by other people. But there’s these three different ones, tf.layers, TF-Slim and tf.contrib.learn that all ship with TensorFlow,

that are all kind of doing a slightly different version of this higher level wrapper thing. There’s another framework

also from Google, but not shipping with

TensorFlow called Pretty Tensor that does the same sort of thing. And I guess none of these

were good enough for DeepMind, because they went ahead a couple weeks ago and wrote and released

their very own high level TensorFlow wrapper called Sonnet. So I wouldn’t begrudge you

if you were kind of confused by all these things. There’s a lot of different choices. They don’t always play

nicely with each other. But you have a lot of

options, so that’s good. TensorFlow has pretrained models. There’s some examples in

TF-Slim, and in Keras. ’Cause remember pretrained

models are super important when you’re training your own things. There’s also this idea of Tensorboard. I don’t want to get into details, but with Tensorboard you can

add sort of instrumentation to your code and then

plot losses and things as you go through the training process. TensorFlow also lets you run distributed, where you can break up

a computational graph and run it on different machines.

think probably not anyone outside of Google is really

using that to great success these days, but if you do

want to run distributed stuff probably TensorFlow is the

main game in town for that. A side note is that a lot

of the design of TensorFlow is kind of spiritually inspired

by this earlier framework called Theano from Montreal. I don’t want to go

through the details here, just if you go through

these slides on your own, you can see that the code

for Theano ends up looking very similar to TensorFlow. Where we define some variables, we do some forward pass,

we compute some gradients, and we compile some function,

then we run the function over and over to train the network. So it kind of looks a lot like TensorFlow. So we still have a lot to get through, so I’m going to move on to PyTorch and maybe take questions at the end. So, PyTorch from Facebook

is kind of different from TensorFlow in that we have

sort of three explicit different layers of

abstraction inside PyTorch. So PyTorch has this tensor

object which is just like a Numpy array. It’s just an imperative array,

it doesn’t know anything about deep learning,

but it can run on the GPU. We have this variable

object which is a node in a computational graph which

builds up computational graphs, lets you compute gradients,

that sort of thing. And we have a module object

which is a neural network layer that you can compose

together these modules to build big networks. So if you kind of want to

think about rough equivalents between PyTorch and TensorFlow

you can think of the PyTorch tensor as fulfilling the same role as the Numpy array in TensorFlow. The PyTorch variable is similar

to the TensorFlow tensor or variable or placeholder,

which are all sort of nodes in a computational graph. And now the PyTorch module

is kind of equivalent to these higher level things

from tf.slim or tf.layers or sonnet or these other

higher level frameworks. So right away one thing

to notice about PyTorch is that because it ships with

this high level abstraction and like one really nice

higher level abstraction called modules on its own,

there’s sort of less choice involved. Just stick with nn modules

and you’ll be good to go. You don’t need to worry about

which higher level wrapper to use. So PyTorch tensors, as I said,

are just like Numpy arrays so here on the right we’ve done

an entire two layer network using entirely PyTorch tensors. One thing to note is that

we’re not importing Numpy here at all anymore. We’re just doing all these

operations using PyTorch tensors. And this code looks exactly

like the two layer net code that you wrote in Numpy

on the first homework. So you set up some random

data, you use some operations to compute the forward pass. And then we’re explicitly computing the backward pass ourselves, just sort of backpropping through the network, through the operations, just as you did on homework one. And now we’re doing a

manual update of the weights using a learning rate and

using our computed gradients. But the major difference

between PyTorch tensors and Numpy arrays is that PyTorch tensors can run on the GPU, so all you have to do to make this code run on the GPU is use a different data type. Rather than using torch.FloatTensor, you do torch.cuda.FloatTensor,

cast all of your tensors to this new datatype and

everything runs magically on the GPU. You should think of PyTorch tensors as just Numpy plus GPU. That’s exactly what it is, nothing specific to deep learning.
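So a tensor-only two layer net, along the lines of the slide, looks roughly like this (a sketch in the pre-0.4 PyTorch API the lecture is using; the sizes are made up):

```python
import torch

dtype = torch.FloatTensor  # swap for torch.cuda.FloatTensor to run on GPU

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # manual backward pass, just like the Numpy version
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # manual gradient descent step
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```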

So the next layer of abstraction in PyTorch is the variable. Once we move from tensors to variables, now we’re building computational graphs, and we’re able to take gradients automatically and everything like that. So here, if x is a variable,

then x.data is a tensor and x.grad is another variable

containing the gradients of the loss with respect to that tensor. So x.grad.data is an

actual tensor containing those gradients. And PyTorch tensors and variables

have the exact same API. So any code that worked on

PyTorch tensors you can just make them variables instead

and run the same code, except now you’re building

up a computational graph rather than just doing

these imperative operations. So here when we create these variables each call to the variable

constructor wraps a PyTorch tensor and then also takes

a flag whether or not we want to compute gradients

with respect to this variable. And now the forward pass looks exactly like it did before in the case with tensors, because variables and tensors have the same API. So now we’re computing our predictions, we’re computing our loss

in this kind of imperative way. And then we call loss.backward, and now all these gradients come out for us. And then we can make

a gradient update step on our weights using the

gradients that are now present in w1.grad.data. So this ends up looking quite like the Numpy case, except all the gradients come for free.
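Concretely, the variable version looks something like this sketch (same made-up sizes as before):

```python
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)
y = Variable(torch.randn(N, D_out), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
w2 = Variable(torch.randn(H, D_out), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # forward pass: same ops as the tensor version, same API
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # clear stale gradients, then autograd fills in .grad for us
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()
    loss.backward()

    # gradient step on the raw tensors inside the variables
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
```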

One thing to note that’s kind of different between PyTorch and TensorFlow is that in the TensorFlow case we were building up this explicit graph, then running the graph many times. Here in PyTorch, instead

we’re building up a new graph every time we do a forward pass. And this makes the code

look a bit cleaner. And it has some other

implications that we’ll get to in a bit. So in PyTorch you can define

your own new autograd functions by defining the forward and

backward in terms of tensors. This ends up looking kind

of like the module layers code that you write for homework two. Where you can implement

forward and backward using tensor operations and then stick these things inside a computational graph. So here we’re defining our own relu, and then we can actually go in and use our own relu operation, stick it inside our computational graph, and define our own operations this way.
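Here’s roughly what that custom relu looks like in the old-style autograd.Function API of this PyTorch era (a sketch):

```python
import torch
from torch.autograd import Variable

class MyReLU(torch.autograd.Function):
    def forward(self, x):
        # stash the input so backward can see where x < 0
        self.save_for_backward(x)
        return x.clamp(min=0)

    def backward(self, grad_output):
        x, = self.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0     # kill gradient where the relu was off
        return grad_x

x = Variable(torch.randn(5), requires_grad=True)
y = MyReLU()(x)               # instantiate, then call, to put it in the graph
y.sum().backward()
```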

But most of the time you will probably not need to define your own autograd operations; the operations you need will mostly already be implemented for you. So in TensorFlow we saw that we can move to something

like Keras or TF.Learn and this gives us a higher

level API to work with, rather than these raw computational graphs. The equivalent in PyTorch

is the nn package. Where it provides these high

level wrappers for working with these things. But unlike TensorFlow

there’s only one of them. And it works pretty well,

so just use that if you’re using PyTorch. So here, this ends up

kind of looking like Keras where we define our model

as some sequence of layers, our linear and relu operations. And we use some loss function

defined in the nn package that’s our mean squared error loss. And now inside each iteration of our loop we can run data forward

through the model to get our predictions. We can run the predictions

forward through the loss function to get our scalar loss, then we can call loss.backward,

get all our gradients for free and then loop over

the parameters of the models and do our explicit gradient

descent step to update the model. And again we see that we’re sort of building up this new computational graph every time we do a forward pass.
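Put together, that nn version looks roughly like this (again a sketch with made-up sizes, in the API of this era):

```python
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# define the model as a sequence of layers, Keras-style
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out))
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)            # forward through the model
    loss = loss_fn(y_pred, y)    # forward through the loss

    model.zero_grad()
    loss.backward()              # gradients for free

    # explicit gradient descent step over all parameters
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data
```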

And just like we saw in TensorFlow, PyTorch provides these optimizer operations that kind of abstract

away this updating logic and implement fancier

update rules like Adam and whatnot. So here we’re constructing

an optimizer object telling it that we want

it to optimize over the parameters of the model. Giving it some learning rate

under the hyper parameters. And now after we compute our gradients we can just call

optimizer.step and it updates all the parameters of the

model for us right here.
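Concretely, swapping in an optimizer looks something like this (a self-contained sketch; Adam and the learning rate are just example choices):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(64, 1000))
y = Variable(torch.randn(64, 10), requires_grad=False)
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10))
loss_fn = torch.nn.MSELoss(size_average=False)

# the optimizer owns the update rule and the parameters it applies to
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for t in range(500):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()   # clear gradients from the last iteration
    loss.backward()         # fill in fresh gradients
    optimizer.step()        # update all the parameters for us
```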

So another common thing you’ll do in PyTorch a lot is define your own nn modules. Typically you’ll write your own class which defines your entire model as a single new nn module subclass. And a module is just kind

of a neural network layer that can contain either other modules, or trainable weights, or other kinds of state. So in this case we can redo

the two layer net example by defining our own nn module class. So now here in the

initializer of the class we’re assigning this linear1 and linear2. We’re constructing

these new module objects and then store them

inside of our own class. And now in the forward pass

we can use both our own internal modules as well as

arbitrary autograd operations on variables to compute

the output of our network. So here, inside this forward method, we receive the input x as a variable, then we pass the variable to our self.linear1 for the first layer. We use the autograd op clamp to compute the relu, we pass the output of

that to the second linear and then that gives us our output. And now the rest of this

code for training this thing looks pretty much the same, where we build an optimizer, and on every iteration feed data to the model, compute the gradients with loss.backward, and call optimizer.step. So this is relatively characteristic of what you might see

in a lot of PyTorch type training scenarios. Where you define your own class, defining your own model

that contains other modules and whatnot and then you

have some explicit training loop like this that runs it and updates it.
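As a sketch, that pattern of a custom module class plus an explicit training loop might look like this (the class name and sizes are made up):

```python
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # child modules assigned here get registered automatically,
        # so model.parameters() will find their weights
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # mix our own modules with arbitrary autograd ops like clamp
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

model = TwoLayerNet(1000, 100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss(size_average=False)
x = Variable(torch.randn(64, 1000))
y = Variable(torch.randn(64, 10))

for t in range(500):
    loss = loss_fn(model(x), y)   # feed data to the model
    optimizer.zero_grad()
    loss.backward()               # compute gradients
    optimizer.step()              # update the model
```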

One kind of nice quality of life thing that you have in PyTorch is a dataloader. So a dataloader can handle

building minibatches for you. It can handle some of the

multi-threading that we talked about for you, where it can actually use multiple threads in the background to build minibatches for you and stream data off disk. So here a dataloader wraps

a dataset and provides some of these abstractions for you. And in practice when you

want to train on your own data, you typically will write

your own dataset class which knows how to read

your particular type of data off whatever source you

want and then wrap it in a data loader and train with that. So, here we can see that

now we’re iterating over the dataloader object

and at every iteration this is yielding minibatches of data. And it’s internally handling the shuffling of the data and multithreaded dataloading and all this sort of stuff for you.
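A minimal sketch of that pattern, using the built-in TensorDataset just for illustration (normally you’d write your own dataset class):

```python
import torch
from torch.autograd import Variable
from torch.utils.data import TensorDataset, DataLoader

# wrap tensors in a Dataset; in practice you'd usually write your
# own Dataset subclass that knows how to read your data off disk
x = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=64,
                    shuffle=True, num_workers=2)

for epoch in range(10):
    for x_batch, y_batch in loader:
        # each iteration yields a shuffled minibatch,
        # built by background workers
        x_var, y_var = Variable(x_batch), Variable(y_batch)
        # ... forward / backward / optimizer.step() as before
```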

So this is kind of a complete PyTorch example, and a lot of PyTorch training code ends up looking something like this. PyTorch provides pretrained models. And this is probably the

slickest pretrained model experience I’ve ever seen. You just say torchvision.models.alexnet(pretrained=True). That’ll go off in the background,

download the pretrained weights for you if you

don’t already have them, and then it’s right

there, you’re good to go. So this is super easy to use.
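And that pretrained model story really is just a couple of lines:

```python
import torchvision

# downloads and caches the weights the first time you call it
alexnet = torchvision.models.alexnet(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)
```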

There’s also a package called Visdom that lets you visualize some

of these loss statistics somewhat similar to Tensorboard. So that’s kind of nice,

I haven’t actually gotten a chance to play around with

this myself so I can’t really speak to how useful it is, but one of the major

differences between Tensorboard and Visdom is that Tensorboard

actually lets you visualize the structure of the computational graph. Which is really cool, a really

useful debugging strategy. And Visdom does not have

that functionality yet. But I’ve never really used

this myself so I can’t really speak to its utility. As a bit of an aside, PyTorch

is kind of an evolution, a newer updated version, of an older framework called Torch which I worked

with a lot in the last couple of years. And I don’t want to go

through the details here, but PyTorch is pretty much

better in a lot of ways than the old Lua Torch, but

they actually share a lot of the same back end C code

for computing with tensors and GPU operations on tensors and whatnot. So if you look through this Torch example, some of it ends up looking

kind of similar to PyTorch, some of it’s a bit different. Maybe you can step through this offline. But kind of the high

level differences between Torch and PyTorch are that

Torch is actually in Lua, not Python, unlike these other things. So learning Lua is a bit of

a turn off for some people. Torch doesn’t have autograd. Torch is also older, so it’s more stable, less susceptible to bugs,

there’s maybe more example code for Torch. They’re about the same speeds,

that’s not really a concern. But in PyTorch it’s in

Python which is great, you’ve got autograd which

makes it a lot simpler to write complex models. In Lua Torch you end up

writing a lot of your own back prop code sometimes, so

that’s a little bit annoying. But PyTorch is newer,

there’s less existing code, it’s still subject to change. So it’s a little bit more of an adventure. But at least for me, I don’t really see much reason to use Torch over PyTorch

anymore at this time. So I’m pretty much using

PyTorch exclusively for all my work these days. We talked a little bit before about this idea of static versus dynamic graphs.

distinguishing features between PyTorch and TensorFlow. So we saw in TensorFlow

you have these two stages of operation where first you build up this computational graph, then you

run the computational graph over and over again many

many times reusing that same graph. That’s called a static

computational graph ’cause there’s only one of them. And we saw PyTorch is quite

different where we’re actually building up this new computational graph, this new fresh thing

on every forward pass. That’s called a dynamic

computational graph. For kind of simple cases,

with kind of feed forward neural networks, it doesn’t

really make a huge difference; the code ends up looking kind of similar and they work kind of similarly, but I do want to talk a bit

about some of the implications of static versus dynamic, and what the tradeoffs of those two are. So one kind of nice

idea with static graphs is that because we’re

kind of building up one computational graph once, and

then reusing it many times, the framework might have

the opportunity to go in and do optimizations on that graph. And kind of fuse some operations,

reorder some operations, figure out the most

efficient way to operate that graph so it can be really efficient. And because we’re going

to reuse that graph many times, maybe that

optimization process is expensive up front, but we can amortize that

cost with the speedups that we’ve gotten when we run

the graph many many times. So as kind of a concrete example, maybe if you write some

graph which has convolution and relu operations kind

of one after another, you might imagine that

some fancy graph optimizer could go in and actually

output, like emit custom code which has fused operations,

fusing the convolution and the relu so now it’s

computing the same thing as the code you wrote, but

now might be able to be executed more efficiently. So I’m not too sure on exactly

what the state in practice of TensorFlow graph

optimization is right now, but at least in principle,

this is one place where static graphs really have the potential for optimizations that maybe would not be so tractable for dynamic graphs. Another kind of subtle point

about static versus dynamic is this idea of serialization. So with a static graph you

can imagine that you write this code that builds up the graph and then once you’ve built the graph, you have this data structure

in memory that represents the entire structure of your network. And now you could take that data structure and just serialize it to disk. And now you’ve got the whole

structure of your network saved in some file. And then you could later

reload that thing and then run that computational

graph without access to the original code that built it. So this would be kind of nice

in a deployment scenario. You might imagine that you

might want to train your network in Python because it’s

maybe easier to work with, but then afterwards you serialize that network, and then you could deploy

it now in maybe a C++ environment where you don’t

need to use the original code that built the graph. So that’s kind of a nice

advantage of static graphs. Whereas with a dynamic graph,

because we’re interleaving these processes of graph

building and graph execution, you kind of need the

original code at all times if you want to reuse

that model in the future. On the other hand, some

advantages for dynamic graphs are that it just makes your code a lot cleaner and a lot

easier in a lot of scenarios. So for example, suppose

that we want to do some conditional operation where

depending on the value of some variable Z, we want

to do different operations to compute Y. Where if Z is positive, we

want to use one weight matrix, if Z is negative we want to

use a different weight matrix. And we just want to switch off

between these two alternatives. In PyTorch because we’re

using dynamic graphs, it’s super simple. Your code kind of looks

exactly like you would expect, exactly what you would do in Numpy. You can just use normal

Python control flow to handle this thing. And now because we’re building

up the graph each time, each time we perform this

operation will take one of the two paths and build

up maybe a different graph on each forward pass, but

for any graph that we do end up building up, we can

back propagate through it just fine. And the code is very

clean, easy to work with.
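As a sketch, with made-up shapes and condition, the PyTorch version is literally just an if statement:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(1, 10))
w1 = Variable(torch.randn(10, 10), requires_grad=True)
w2 = Variable(torch.randn(10, 10), requires_grad=True)
z = torch.randn(1)

# plain Python control flow: a fresh graph is built on this forward
# pass, and whichever branch we take, we can backprop through it
if z[0] > 0:
    y = x.mm(w1)
else:
    y = x.mm(w2)
y.sum().backward()
```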

Now in TensorFlow the situation is a little bit more complicated, because we

build the graph once, this control flow operator

kind of needs to be an explicit operator in

the TensorFlow graph. So then you can see that we have this tf.cond call, which is kind of like a TensorFlow version of an if statement, but now it’s baked into the computational graph rather than using sort of Python control flow.
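Roughly, the TF1-style version looks like this sketch (shapes made up):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(1, 10))
z = tf.placeholder(tf.float32, shape=())
w1 = tf.Variable(tf.random_normal((10, 10)))
w2 = tf.Variable(tf.random_normal((10, 10)))

# both branches get baked into the graph up front;
# tf.cond decides which one executes at run time
y = tf.cond(z > 0,
            lambda: tf.matmul(x, w1),
            lambda: tf.matmul(x, w2))
```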

And the problem is that because we only build the graph once, all the potential

paths of control flow that our program might flow

through need to be baked into the graph at the time we

construct it before we ever run it. So that means that any control flow operators you want to have can’t be Python control flow operators; you need to use special, somewhat magic TensorFlow operations to do control flow, in this case tf.cond. Another kind of similar

situation happens if you want to have loops. So suppose that we want to compute some kind of recurrence relation, where maybe y_t = y_{t-1} + x_t * w for some weight matrix w; and every time we compute this, we might have a different sized sequence of data. And no matter the length

of our sequence of data, we just want to compute this

same recurrence relation no matter the size of the input sequence. So in PyTorch this is super easy. We can just kind of use a

normal for loop in Python to just loop over the number

of times that we want to unroll and now depending on

the size of the input data, our computational graph will

end up as different sizes, but that’s fine, we can just back propagate through each one, one at a time.
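A sketch of that in PyTorch, assuming the y_t = y_{t-1} + x_t * w recurrence from above:

```python
import torch
from torch.autograd import Variable

w = Variable(torch.randn(10, 10), requires_grad=True)
x = Variable(torch.randn(5, 10))   # a sequence of T=5 inputs

# ordinary Python loop: the graph grows with the sequence length,
# and backprop just works whatever that length turns out to be
y = Variable(torch.zeros(1, 10))
for t in range(x.size(0)):
    y = y + x[t:t+1].mm(w)         # y_t = y_{t-1} + x_t * w
y.sum().backward()
```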

Now in TensorFlow this becomes a little bit uglier. And again, because we need

to construct the graph all at once up front, this

control flow looping construct again needs to be an explicit

node in the TensorFlow graph. So I hope you remember

your functional programming because you’ll have to use

those kinds of operators to implement looping

constructs in TensorFlow. So in this case, for this particular recurrence relation, you can use a foldl operation and sort of implement this particular loop in terms of a foldl.
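Roughly like this sketch (the reshape is just to keep the shapes consistent with the recurrence above):

```python
import tensorflow as tf

w = tf.Variable(tf.random_normal((10, 10)))
xs = tf.placeholder(tf.float32, shape=(None, 10))  # variable-length sequence

# foldl threads an accumulator through the sequence, so
# y_t = y_{t-1} + x_t * w is expressed functionally
y = tf.foldl(
    lambda y_prev, x_t: y_prev + tf.matmul(tf.reshape(x_t, (1, 10)), w),
    xs,
    initializer=tf.zeros((1, 10)))
```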

But what this basically means is that you have this sense that TensorFlow is almost

building its own entire programming language,

using the language of computational graphs. And any kind of control flow operator, or any kind of data

structure needs to be rolled into the computational graph

so you can’t really utilize all your favorite paradigms

for working imperatively in Python. You kind of need to relearn a whole separate set of control flow operators if you want to do any kind of control flow inside your computational graph in TensorFlow. So at least for me, I find

that kind of confusing, a little bit hard to wrap

my head around sometimes, and I kind of like that

using PyTorch dynamic graphs, you can just use your favorite

imperative programming constructs and it all works just fine. By the way, there actually

is some very new library called TensorFlow Fold which

is another one of these layers on top of TensorFlow

that lets you implement dynamic graphs, you kind

of write your own code using TensorFlow Fold that

looks kind of like a dynamic graph operation and then

TensorFlow Fold does some magic for you and somehow implements

that in terms of the static TensorFlow graphs. This is a super new paper

that’s being presented at ICLR this week in France. So I haven’t had the chance

to like dive in and play with this yet. But my initial impression

was that it does add some amount of dynamic graphs to

TensorFlow but it is still a bit more awkward to work

with than the sort of native dynamic graphs you have in PyTorch. So then, I thought it

might be nice to motivate like why would we care about

dynamic graphs in general? So one option is recurrent networks. So you can see that for

something like image captioning we use a recurrent network

which operates over sequences of different lengths. In this case, the sentence

that we want to generate as a caption is a sequence

and that sequence can vary depending on our input data. So now you can see that we

have this dynamism in the thing where depending on the

size of the sentence, our computational graph

might need to have more or fewer elements. So that’s one kind of common

application of dynamic graphs. For those of you who

took CS224N last quarter, you saw this idea of recursive networks where sometimes in natural

language processing you might, for example,

compute a parse tree of a sentence and then

you want to have a neural network kind of operate

recursively up this parse tree. So here the neural network is not just a sequential stack of layers; instead it’s kind of working over some graph or tree structure, where each data point might have a different graph or tree structure. So the structure of the computational graph kind of mirrors the structure of the input data, and it could vary from data point to data point. So this type of thing seems

kind of complicated and hairy to implement using TensorFlow, but in PyTorch you can just kind of use like normal Python control

flow and it’ll work out just fine. Another bit of more researchy

application is this really cool idea that I like

called neural module networks for visual question answering. So here the idea is that we

want to ask some questions about images where we

maybe input this image of cats and dogs, there’s some question, what color is the cat, and

then internally the system can read the question and

it has these different specialized neural network

modules for performing operations like asking for

colors and finding cats. And then depending on

the text of the question, it can compile this custom

architecture for answering the question. And now if we asked a different question, like are there more cats than dogs? Now we have maybe the

same basic set of modules for doing things like finding

cats and dogs and counting, but they’re arranged in a different order. So we get this dynamism again

where different data points might give rise to different

computational graphs. But this is a bit more

of a researchy thing and maybe not so main stream right now. But as kind of a bigger

point, I think that there’s a lot of cool, creative

applications that people could do with dynamic computational graphs and maybe there aren’t so many right now, just because it’s been so

painful to work with them. So I think that there’s

a lot of opportunity for doing cool, creative things with dynamic computational graphs. And maybe if you come up with cool ideas, we’ll feature it in lecture next year. So I wanted to talk

very briefly about Caffe which is this framework from Berkeley. Caffe is somewhat different from the other deep learning frameworks in that, in many cases, you can actually train networks without writing any code yourself. You kind of just call into

these pre-existing binaries, set up some configuration

files and in many cases you can train on data without

writing any of your own code. So maybe first, you convert your data into some format like HDF5

or LMDB and there exists some scripts inside Caffe

that can just convert like folders of images and text files

into these formats for you. Now, instead of writing code to define the structure of your computational graph, you edit some text

file called a prototxt which sets up the structure

of the computational graph. Here the structure is that we read from some input HDF5 file, we perform some inner product, we compute some loss, and the whole structure of the graph is set up in this text file.
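A tiny prototxt along those lines looks roughly like this (the layer names and data source are made up for illustration):

```
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param { source: "train.txt" batch_size: 64 }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "data"
  top: "fc1"
  inner_product_param { num_output: 10 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc1"
  bottom: "label"
  top: "loss"
}
```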

One kind of downside here is that these files can get really ugly for

very large networks. So for something like the

152 layer ResNet model, which by the way was

trained in Caffe originally, then this prototxt file ends

up almost 7000 lines long. So people are not writing these by hand. People will sometimes write Python scripts to generate these prototxt files. [laughter] Then you’re kind of in the

realm of rolling your own computational graph abstraction. That’s probably not a good

idea, but I’ve seen that before. Then, rather than having some optimizer object, there’s a solver; you define the solver inside another prototxt file. This defines your learning rate, your optimization algorithm and whatnot. And then once you do all these things, you can just run the Caffe

binary with the train command and it all happens magically. Caffe has a model zoo with a

bunch of pretrained models, that’s pretty useful. Caffe has a Python

interface but it’s not super well documented. You kind of need to read the

source code of the Python interface to see what it can do, so that’s kind of annoying. But it does work. So, kind of my general take

about Caffe is that it’s maybe good for feed forward models, it’s maybe good for production scenarios, because it doesn’t depend on Python. But probably for research

these days, I’ve seen Caffe being used maybe a little bit less. Although I think it is

still pretty commonly used in industry, again for production. I promise, just one or two slides on Caffe 2. So Caffe 2 is the successor to

Caffe which is from Facebook. It’s super new, it was

only released a week ago. [laughter] So I really haven’t had

the time to form a super educated opinion about Caffe 2 yet, but it uses static graphs

kind of similar to TensorFlow. Kind of like Caffe 1, the core is written in C++ and they have some Python interface. The difference is that

now you no longer need to write your own Python scripts

to generate prototxt files. You can kind of define your computational graph structure all in Python, with an API that looks kind of like TensorFlow. But then you can serialize this computational graph

structure to a prototxt file. And then once your model

is trained and whatnot, we get this benefit that we talked about of static graphs, where you don’t need the original training code in order to deploy a trained model. So one interesting thing

is that you’ve seen Google has maybe one major deep learning framework, which is TensorFlow, whereas Facebook has these two, PyTorch and Caffe 2. So these are kind of

different philosophies. Google’s kind of trying to

build one framework to rule them all that maybe works

for every possible scenario for deep learning. This is kind of nice because

it consolidates all efforts onto one framework. It means you only need to learn one thing and it’ll work across

many different scenarios including like distributed

systems, production, deployment, mobile, research, everything. Only need to learn one framework

to do all these things. Whereas Facebook is taking a

bit of a different approach. Where PyTorch is really more specialized, more geared towards research

so in terms of writing research code and quickly

iterating on your ideas, that’s super easy in

PyTorch, but for things like running in production,

running on mobile devices, PyTorch doesn’t have a

lot of great support. Instead, Caffe 2 is kind

of geared toward those more production oriented use cases. So my general, overall advice about which framework to use for which problems is: I think TensorFlow is a pretty safe bet for just about any new project you want to start, right? Because it is sort of one

framework to rule them all, it can be used for just

about any circumstance. However, you probably

need to pair it with a higher level wrapper and

if you want dynamic graphs, you’re maybe out of luck. Some of the code ends up

looking a little bit uglier in my opinion, but maybe that’s

kind of a cosmetic detail and it doesn’t really matter that much. I personally think PyTorch

is really great for research. If you’re focused on just

writing research code, I think PyTorch is a great choice. But it’s a bit newer, has

less community support, less code out there, so it

could be a bit of an adventure. If you want more of a well

trodden path, TensorFlow might be a better choice. If you’re interested in

production deployment, you should probably look at

Caffe, Caffe 2 or TensorFlow. And if you’re really focused

on mobile deployment, I think TensorFlow and Caffe

2 both have some built in support for that. So, kind of unfortunately, there’s not just one global best framework; it kind of depends on what you’re actually trying to do and what applications you anticipate, but this is kind of my general advice on those things. So next time we’ll talk

about some case studies about various CNN architectures.